Today, we will learn about implementing Slowly Changing Dimensions (SCD) in Oracle Cloud Infrastructure (OCI) Data Integration. A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.
OCI Data Integration can be used to define, deploy, and load most types of SCDs. The three broadly used SCD types are:
In this blog, we are going to walk through a complete end-to-end SCD Type 2 implementation.
There are several ways to implement SCD Type-2. Here, with the help of full-outer and left-outer joins we will identify the new and old records and use conditional statements such as Case/When to implement an SCD Type-2 process. In this example, using a single data flow, we are going to cater to the full and incremental data load to populate the SCD Type -2 table. Below is the list of tables used in the data flow with the structure of the target table and the corresponding mapping with SCD behavior.
List of source and target tables used -
The tables used in this example are from the HR schema and can be referred from Oracle livesql
Target Table "Emp_Tgt" Mapping -
DDL for EMP_TGT -
CREATE TABLE EMP_TGT ( ENAME VARCHAR2(100), EMPNO NUMBER(10,0), JOB VARCHAR2(50), MGR NUMBER(4,0), HIREDATE DATE, SAL NUMBER(10,2), COMM NUMBER(7,2), DEPTNO NUMBER(10,0), DEPARTMENT_NAME VARCHAR2(200), STATE_PROVINCE VARCHAR2(200), COUNTRY_NAME VARCHAR2(200), REGION_NAME VARCHAR2(200), SEQ_NO VARCHAR2(500), START_DATE DATE, END_DATE DATE, ACTIVE_FLAG VARCHAR2(1), PRIMARY KEY (SEQ_NO) );
Below is the data flow created for building a Type 2 slowly changing dimension -
With the help of the left outer join and full outer join, we have identified the updated, inserted, and changed records based on the primary key, SCD Type 2 column. Here, the left outer join is used to get only the target data matching with the source along with additional records from the source instead of bringing the complete target data for reprocessing.
With the help of full outer join, we will be comparing the incoming records with the target records on basis of SCD type 2 columns. Here, COMMISSION_PCT and SALARY are the SCD type-2 columns in the target. We have used NVL conditions to handle NULL values coming from the source.
Join Condition -
NVL(JOIN_7_1.EMPLOYEES.EMPLOYEE_ID,-99) = NVL(JOIN_7_2.EMP_TGT.EMPNO,-99)
AND NVL(JOIN_7_1.EMPLOYEES.COMMISSION_PCT,-999) = NVL(JOIN_7_2.EMP_TGT.COMM,-999)
AND NVL(JOIN_7_1.EMPLOYEES.SALARY,-222) = NVL(JOIN_7_2.EMP_TGT.SAL,-222)
In the expression operator i.e. EXPRESSION_1, we have derived the unique ID (generating unique identifier using UUID()), active flag, effective start date, and effective end date.
SEQ_NO:
CASE WHEN EXPRESSION_1.EMPLOYEES.EMPLOYEE_ID IS NOT NULL AND EXPRESSION_1.EMP_TGT.EMPNO IS NULL THEN UUID()
ELSE EXPRESSION_1.EMP_TGT.SEQ_NO
END
ACTIVE_FLAG:
CASE WHEN EXPRESSION_1.EMPLOYEES.EMPLOYEE_ID IS NOT NULL AND EXPRESSION_1.EMP_TGT.EMPNO IS NULL THEN 'Y'
WHEN EXPRESSION_1.EMPLOYEES.EMPLOYEE_ID IS NOT NULL AND EXPRESSION_1.EMP_TGT.EMPNO IS NOT NULL THEN 'Y'
WHEN EXPRESSION_1.EMPLOYEES.EMPLOYEE_ID IS NULL AND EXPRESSION_1.EMP_TGT.EMPNO IS NOT NULL THEN 'N'
ELSE 'N'
END
START_DATE:
CASE WHEN EXPRESSION_1.EMPLOYEES.EMPLOYEE_ID IS NOT NULL AND EXPRESSION_1.EMP_TGT.EMPNO IS NULL
THEN CURRENT_DATE
ELSE EXPRESSION_1.EMP_TGT.START_DATE
END
END_DATE:
CASE WHEN EXPRESSION_1.EMPLOYEES.EMPLOYEE_ID IS NOT NULL AND EXPRESSION_1.EMP_TGT.EMPNO IS NULL
THEN TO_DATE('31-12-9999','dd-MM-yyyy')
WHEN EXPRESSION_1.EMPLOYEES.EMPLOYEE_ID IS NULL AND EXPRESSION_1.EMP_TGT.EMPNO IS NOT NULL
THEN CURRENT_DATE
ELSE EXPRESSION_1.EMP_TGT.END_DATE
END
Once the flow is completed, save the data flow, create an Integration Task, publish to an application, and run the corresponding task for implementing SCD type-2.
The 1st run inserts all the records from the source table into the target assuming that there are no records in the target table. In the 1st run, for all the records, ACTIVE_FLAG = 'Y', START_DATE = <<current date>>, and END_DATE = '31-DEC-9999'.
In the 2nd run, changed records coming from the source are identified with the target and SCD type-2 logic is applied to those records. Below are the screenshots which showcase the changes that happened in the target during 2nd run based on the changes in the source.
In the 3rd run - further changes were performed in the source table Employee where salary and commision_pct of certain employees were changed. After a successful run of the Integration Task, new rows got inserted for the changed records.
This concludes an overview of implementing SCD Type-2 with Oracle Cloud Infrastructure Data Integration.
Have you seen some of our other blogs? Check out all the blogs related to Oracle Cloud Infrastructure Data Integration. To learn more, check out some Oracle Cloud Infrastructure Data Integration Tutorials and Documentation.