Monday 12 June 2017

What are Slowly Changing Dimensions?

Slowly Changing Dimensions (SCD) - dimensions that change slowly over time, rather than changing on regular schedule, time-base. In Data Warehouse there is a need to track changes in dimension attributes in order to report historical data. In other words, implementing one of the SCD types should enable users assigning proper dimension's attribute value for given date. Example of such dimensions could be: customer, geography, employee.
  • Type 0 - The passive method
  • Type 1 - Overwriting the old value
  • Type 2 - Creating a new additional record
  • Type 3 - Adding a new column
  • Type 4 - Using historical table
  • Type 6 - Combine approaches of types 1,2,3 (1+2+3=6)
There are many approaches how to deal with SCD. The most popular are:

accel-DS Shell Script Engine V2.0 is Now Available

accel-DS Shell Script Engine can data transformation using sql,files with most of the popular database and file formats used in Hadoop.

In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.

In our example, recall we originally have the following table:

Customer Key Name State
5001 Peter Washington

After Peter moved from Washington to Maine, we add the new information as a new row into the table:

Customer Key Name State
5001 Peter Washington
5005 Peter Maine

Advantages:

This allows us to accurately keep all historical information.

Disadvantages:

This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
This necessarily complicates the ETL process.

Usage:

About 50% of the time.

When to use Type 2:

Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.

No Key Name Description
1 Target_Table Table where data should be loaded
2 Join_Columns Enter source sql column and target table column, join will be created between source and target using these columns.
3 Mapped_Columns Enter source sql column and target table column, source column data will be loaded in the respective target table column.
4 Source_SQL Enter Source SQL query, which has new and updated records..
5 Target_Columns Enter all the target table column names seperated by comma in the insersion order.
6 Eff_Date_Column Enter target table column name where record created date will be loaded. Enter 'NA' if you don't want to track created date.
7 End_Date_Column Enter target table column name where record last updated date will be loaded. Enter 'NA' if you don't want to updated track date.
8 Latest_Rec_Column Enter target table column name where flag will specify the record is latest or old. Enter 'NA' if you don't want to track latest record.
9 Data_Base_Name Hadoop Database Name where the Target Table is located.
10 Transpose_Flag Flag for Transpose. Enter 'Y' if you have Transpose column to load.
11 Transpose_Track_Column Enter the source sql transpose column name, target table transpose column name and source column names to track transpose column (#This option is not applicable if (Arg) Transpose_Flag is 'N')
12 Audit_Columns Audit columns delimited by ',' FORMAT : "creation_ts-CURRENT_TIMESTAMP,created_by-DEFAULT-VISHNU,process_control_id-UNIQUE_VALUE"
13 SQL_File_Location Enter the path in which downloaded Type_2.sql file present.
14 Log_File Log File path and name to log the script execution details.

Syntax

Target_Table="$Target_Table$" Source_SQL="$Source_SQL$" Mapped_Columns="$Mapped_Columns$" Join_Columns="$Join_Columns$" Target_Columns="$Target_Columns$" Data_Base_Name="$Data_Base_Name$" SQL_File_Location="$SQL_File_Location$" Log_File="$Log_File$" ./Type_1.sh

Example

Target_Table="EDS_Request_Prod_KV" Join_Columns="ACCOUNT_ID,ACCOUNT_ID,SSN,SSN,SRC_SYSTEM,SRC_SYSTEM,ROW_TYPE,ROW_TYPE" Mapped_Columns="ACCOUNT_ID,ACCOUNT_ID,SSN,SSN,SRC_SYSTEM,SRC_SYSTEM,KV_COLUMN,KV_COLUMN,ROW_TYPE,ROW_TYPE,ADDRESS1,ADDRESS1,ADDRESS2,ADDRESS2,ADDRESS3,ADDRESS3,ADDRESS4,ADDRESS4,CITY,CITY,COUNTRY,COUNTRY,STATE,STATE,POSTAL,POSTAL" Source_SQL="SELECT * FROM cygnus.EDS_Request_STG_KV" Target_Columns="ACCOUNT_ID,SSN,SRC_SYSTEM,KV_COLUMN,ROW_TYPE,ADDRESS1,ADDRESS2,ADDRESS3,ADDRESS4,CITY,COUNTRY,STATE,POSTAL,as_of_date,creation_ts, created_by,process_control_id,effective_date,latestcol" Eff_Date_Column="effective_date" End_Date_Column="as_of_date" Latest_Rec_Column="latestcol" Data_Base_Name="cygnus" Transpose_Flag="Y" Transpose_Track_Column="kv_column,kv_column,ADDRESS1+ADDRESS2+ADDRESS3+ADDRESS4+CITY+COUNTRY+STATE+POSTAL,ADDRESS1+ADDRESS2+ADDRESS3+ADDRESS4+CITY+COUNTRY+STATE+POSTAL" Audit_Columns="creation_ts-CURRENT_TIMESTAMP,created_by-DEFAULT-VISHNU,process_control_id-UNIQUE_VALUE" SQL_File_Location="/home/cloudera/Desktop/Type_2" Log_File="/home/cloudera/Desktop/Type_2/type2_EDS_log.txt" ./Type_2.sh

Disclaimer

Please ensure you read and understand the following general risk disclaimer:

IMPORTANT:

THIS SOFTWARE END USER LICENSE AGREEMENT (“EULA”) IS A LEGAL AGREEMENT BETWEEN YOU AND PRODEN TECHNOLOGIES, INC. READ IT CAREFULLY BEFORE COMPLETING THE INSTALLATION PROCESS AND USING THE SOFTWARE. IT PROVIDES A LICENSE TO USE THE SOFTWARE AND CONTAINS WARRANTY INFORMATION AND LIABILITY DISCLAIMERS. BY INSTALLING AND USING THE SOFTWARE, YOU ARE CONFIRMING YOUR ACCEPTANCE OF THE SOFTWARE AND AGREEING TO BECOME BOUND BY THE TERMS OF THIS AGREEMENT. IF YOU DO NOT AGREE TO BE BOUND BY THESE TERMS, THEN SELECT THE "CANCEL" BUTTON. DO NOT PROCEED TO REGISTER & INSTALL THE SOFTWARE. LIABILITY DISCLAIMER•THE accel<>DS PROGRAM IS DISTRIBUTED "AS IS". NO WARRANTY OF ANY KIND IS EXPRESSED OR IMPLIED. YOU USE IT AT YOUR OWN RISK. NEITHER THE AUTHORS NOR PRODEN TECHNOLOGIES, INC. WILL BE LIABLE FOR DATA LOSS, DAMAGES AND LOSS OF PROFITS OR ANY OTHER KIND OF LOSS WHILE USING OR MISUSING THIS SOFTWARE.

RESTRICTIONS:

You may not use, copy, emulate, clone, rent, lease, sell, modify, decompile, disassemble, otherwise reverse engineer, or transfer any version of the Software, or any subset of it, except as provided for in this agreement. Any such unauthorized use shall result in immediate and automatic termination of this license and may result in criminal and/or civil prosecution.

TERMS:

This license is effective until terminated. You may terminate it by destroying the program, the documentation and copies thereof. This license will also terminate if you fail to comply with any terms or conditions of this agreement. You agree upon such termination to destroy all copies of the program and of the documentation, or return them to the author.
Download Data Ingestion Framework (Free)
SCD Type 2 Data Ingestion Scripts for Hadoop

No comments:

Post a Comment