Monday Apr 21, 2014

How Does TYPE2_FLG Work in ETL

Author: Vivian (Weibei) Li

Introduction

TYPE2_FLG is commonly used with slowly changing dimensions in BI Applications. The flag indicates whether a dimension is Type 2, and it determines how the ETL stores data. This blog gives you a better understanding of how TYPE2_FLG works in ETL.

Background

Slowly Changing Dimensions

There are many fundamental dimensions, such as Customer, Product, Location and Employee, in BI Applications. The attributes in these dimensions are revised from time to time. Sometimes a revised attribute merely corrects an error in the data, but often it represents a true change at a point in time. These changes arrive unexpectedly, sporadically and far less frequently than fact table measurements, so we call this topic slowly changing dimensions (SCDs).

For SCD entities like Employee, Customer, Product and others, you must decide how historical changes in the dimension tables are handled and how to respond to each change. Three different kinds of responses are used: SCD Types 1, 2 and 3.

Type 1: Overwrite the attributes with new changes

Type 2: Add a New Dimension Record

Type 3: Add a New Field

This blog is about Type 2. In the Type 2 SCD model, the whole history is stored in the database: an additional dimension record is created for each change, the segmentation between the old record values and the new (current) value is easy to extract, and the history is clear. A minimum of three additional columns should be added to the dimension row for Type 2 changes: 1) row effective date or date/time stamp (EFFECTIVE_FROM_DT); 2) row expiration date or date/time stamp (EFFECTIVE_TO_DT); and 3) current row indicator (CURRENT_FLG).

SRC_EFF_FROM_DT and EFFECTIVE_FROM_DT

Although the two columns have similar names, they represent different concepts, and many customers confuse them.

SRC_EFF_FROM_DT is extracted from the effective start date of the source (mainly from the main driving source) if the source keeps history. If the source does not store history, or the history is not extracted, it is hard-coded to #LOW_DATE.

EFFECTIVE_FROM_DT is a system column in the dimension table used to track history. Remember that we use knowledge modules (KMs) for repeatable logic that can be reused across ETL tasks. Updating the SCD-related columns, such as EFFECTIVE_FROM_DT, is usually handled by the KM. EFFECTIVE_FROM_DT is modified when a new Type 2 record is inserted in an incremental run, and it is usually set to the changed-on date from the source. EFFECTIVE_FROM_DT does not always map to the source effective dates.

In the Type 2 SCD model, EFFECTIVE_FROM_DT is the date used to track history.

TYPE2_FLG in BI Application

TYPE2_FLG is a flag that indicates whether a dimension is Type 2. It is used in many dimensions in BI Applications, such as employee, user and position. The flag is very important because it determines the history-storing behavior.

TYPE2_FLG has two values: ‘Y’ and ‘N’. ‘Y’ means the dimension is Type 2, and ‘N’ means it is Type 1. Type 2 dimensions store history, while Type 1 dimensions store only the current record.

For example, if an employee’s supervisor changes from Peter to Susan on 01/02/2012:

Type 1

| EMPLOYEE_ID | SUPERVISOR_NAME | CURRENT_FLG |
| ----------- | --------------- | ----------- |
| 123         | Susan           | Y           |

Type 2

| EMPLOYEE_ID | EFFECTIVE_FROM_DT | EFFECTIVE_TO_DT | SUPERVISOR_NAME | CURRENT_FLG |
| ----------- | ----------------- | --------------- | --------------- | ----------- |
| 123         | 01/02/2012        | Future          | Susan           | Y           |
| 123         | 01/01/1999        | 01/02/2012      | Peter           | N           |

As shown above, the Type 1 dimension overwrites the supervisor with the new supervisor and stores only the current record. The Type 2 dimension inserts a new record with the new supervisor name and keeps the old record as history. EFFECTIVE_FROM_DT, EFFECTIVE_TO_DT and CURRENT_FLG are modified accordingly: for the old record, EFFECTIVE_TO_DT is changed to 01/02/2012 and CURRENT_FLG is set to ‘N’; for the new record, CURRENT_FLG is set to ‘Y’ with the new EFFECTIVE_FROM_DT.
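To make the two behaviors concrete, here is a minimal Python sketch of the logic described above. It is an illustration only, not the actual KM code; the FUTURE constant, the in-memory row list and the apply_change function are assumptions that mirror the example tables.

```python
from datetime import date

FUTURE = date(4712, 12, 31)  # illustrative stand-in for the open-ended "Future" end date

def apply_change(rows, employee_id, new_supervisor, change_dt, type2_flg):
    """Apply one supervisor change to the dimension rows."""
    # Locate the current record for this employee.
    cur = next(r for r in rows
               if r["EMPLOYEE_ID"] == employee_id and r["CURRENT_FLG"] == "Y")
    if cur["SUPERVISOR_NAME"] == new_supervisor:
        return  # nothing changed, nothing to store
    if type2_flg == "Y":
        # Type 2: expire the old record, then insert a new current record.
        cur["EFFECTIVE_TO_DT"] = change_dt
        cur["CURRENT_FLG"] = "N"
        rows.append({"EMPLOYEE_ID": employee_id,
                     "EFFECTIVE_FROM_DT": change_dt,
                     "EFFECTIVE_TO_DT": FUTURE,
                     "SUPERVISOR_NAME": new_supervisor,
                     "CURRENT_FLG": "Y"})
    else:
        # Type 1: overwrite in place; no history is kept.
        cur["SUPERVISOR_NAME"] = new_supervisor

rows = [{"EMPLOYEE_ID": "123", "EFFECTIVE_FROM_DT": date(1999, 1, 1),
         "EFFECTIVE_TO_DT": FUTURE, "SUPERVISOR_NAME": "Peter", "CURRENT_FLG": "Y"}]
apply_change(rows, "123", "Susan", date(2012, 1, 2), type2_flg="Y")  # change on 01/02/2012
# rows now holds the expired Peter record plus a current Susan record,
# matching the Type 2 table above.
```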

How to Set Up TYPE2_FLG

The out-of-the-box code in BI Applications sets the default values. For Type 2 dimensions, the flag is usually set to ‘Y’.

TYPE2_FLG can be configured in BIACM, where the variable is set per dimension group. You can override the value of the flag for a given dimension group by clicking the parameter value and overwriting it with a different value.

Note: You can only configure TYPE2_FLG for the dimension groups that appear in the BIACM list; dimension groups that are not in the list cannot be configured.

Set the value of TYPE2_FLG carefully. If you override TYPE2_FLG to ‘N’ for a Type 2 dimension, you may run into issues, described in more detail in the next section.

Possible Issues Related to TYPE2_FLG

As mentioned earlier, the value of TYPE2_FLG may for some reason be set to ‘N’ for a Type 2 dimension. This can cause issues.

In BI Applications, the SDE mappings bring history from the source in the initial full load for some adapters, such as EBS. TYPE2_FLG affects how these historical records are stored. The following cases compare the behavior when TYPE2_FLG is set to ‘Y’ versus ‘N’ for a Type 2 dimension.

Case 1 - TYPE2_FLG = ‘Y’

Let’s take the employee dimension (a Type 2 dimension) as an example.

Source

| EMPLOYEE_ID | SRC_EFF_FROM_DT | SUPERVISOR_NAME | ROW # |
| ----------- | --------------- | --------------- | ----- |
| 123         | 01/01/1999      | Peter           | 1     |
| 123         | 01/02/2012      | Susan           | 2     |

When the data is loaded into the data warehouse in the initial full run, both rows (including historical record #1) are loaded. Because TYPE2_FLG is ‘Y’ in this case, the KM, which handles the loading behavior, uses the flag to determine that the employee dimension is Type 2 and stores the records accordingly.

Since TYPE2_FLG = ‘Y’, the KM sets EFFECTIVE_TO_DT and CURRENT_FLG for the two records as follows:

| EMPLOYEE_ID | EFFECTIVE_FROM_DT | EFFECTIVE_TO_DT | SUPERVISOR_NAME | CURRENT_FLG |
| ----------- | ----------------- | --------------- | --------------- | ----------- |
| 123         | 01/02/2012        | Future          | Susan           | Y           |
| 123         | 01/01/1999        | 01/02/2012      | Peter           | N           |
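The Python sketch below shows, under the same illustrative assumptions as before, how the effective dates can be derived from the source history during a full load: the rows are ordered by SRC_EFF_FROM_DT, each record’s EFFECTIVE_TO_DT is taken from the start date of the next record, and the latest record stays open-ended and current. The function name and in-memory representation are hypothetical, not the KM internals.

```python
from datetime import date

FUTURE = date(4712, 12, 31)  # illustrative open-ended end date

def build_type2_history(src_rows):
    """Derive EFFECTIVE_FROM_DT / EFFECTIVE_TO_DT / CURRENT_FLG from source history."""
    rows = sorted(src_rows, key=lambda r: (r["EMPLOYEE_ID"], r["SRC_EFF_FROM_DT"]))
    out = []
    for i, r in enumerate(rows):
        nxt = rows[i + 1] if i + 1 < len(rows) else None
        same_emp = nxt is not None and nxt["EMPLOYEE_ID"] == r["EMPLOYEE_ID"]
        out.append({"EMPLOYEE_ID": r["EMPLOYEE_ID"],
                    "EFFECTIVE_FROM_DT": r["SRC_EFF_FROM_DT"],
                    # Close each version at the next version's start date.
                    "EFFECTIVE_TO_DT": nxt["SRC_EFF_FROM_DT"] if same_emp else FUTURE,
                    "SUPERVISOR_NAME": r["SUPERVISOR_NAME"],
                    "CURRENT_FLG": "N" if same_emp else "Y"})
    return out

src = [{"EMPLOYEE_ID": "123", "SRC_EFF_FROM_DT": date(1999, 1, 1), "SUPERVISOR_NAME": "Peter"},
       {"EMPLOYEE_ID": "123", "SRC_EFF_FROM_DT": date(2012, 1, 2), "SUPERVISOR_NAME": "Susan"}]
print(build_type2_history(src))  # reproduces the Case 1 result table above
```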

Case 2 - TYPE2_FLG = ‘N’

This time TYPE2_FLG is set to ‘N’ for the employee dimension (a Type 2 dimension), which is incorrect: the KM treats it as Type 1 rather than Type 2.

Source

| EMPLOYEE_ID | SRC_EFF_FROM_DT | SUPERVISOR_NAME | ROW # |
| ----------- | --------------- | --------------- | ----- |
| 123         | 01/01/1999      | Peter           | 1     |
| 123         | 01/02/2012      | Susan           | 2     |

When the data is loaded into the data warehouse, both rows are still loaded because the history from the source is extracted. However, because TYPE2_FLG is ‘N’, the KM does not maintain EFFECTIVE_TO_DT and CURRENT_FLG accordingly, and this causes issues.

Employee Table in the Data Warehouse

| EMPLOYEE_ID | EFFECTIVE_FROM_DT | EFFECTIVE_TO_DT | SUPERVISOR_NAME | CURRENT_FLG |
| ----------- | ----------------- | --------------- | --------------- | ----------- |
| 123         | 01/02/2012        | Future          | Susan           | Y           |
| 123         | 01/01/1999        | Future          | Peter           | Y           |

As shown above, the two records have overlapping time ranges and both have CURRENT_FLG = ‘Y’. This can produce duplicates when resolving the employee from the facts. For example, a transaction dated 02/04/2013 falls into the time range of both records, so both are extracted, causing duplicates in the fact.
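A quick way to spot this situation is to count current rows per dimension member; more than one is a red flag. The sketch below is a hypothetical Python check over extracted dimension rows (the equivalent check can of course be run as a query against the warehouse table).

```python
from collections import Counter

def members_with_multiple_current_rows(dim_rows, key="EMPLOYEE_ID"):
    """Return dimension members with more than one CURRENT_FLG = 'Y' row,
    a symptom of TYPE2_FLG = 'N' on a Type 2 dimension."""
    counts = Counter(r[key] for r in dim_rows if r["CURRENT_FLG"] == "Y")
    return [member for member, n in counts.items() if n > 1]
```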

How to Debug TYPE2_FLG Issues

As discussed in the previous section, to avoid this kind of issue you should set the value of TYPE2_FLG carefully, and keep it ‘Y’ for out-of-the-box Type 2 dimensions.

In addition, when you get duplicates in a fact, run the following checks.

  • Check where the duplicates in the fact come from, and identify the problematic dimension if they come from a dimension.
  • Check the data in that dimension to see whether it shows the same loading behavior as Case 2 in the previous section. A simple first check is whether multiple records have CURRENT_FLG = ‘Y’, as in the sketch above.
  • Check the value of TYPE2_FLG in the ODI repository:

1. Open the session log of the task

2. Open ‘Definition’

3. Expand ‘Variable and Sequence Values’

4. Find TYPE2_FLG and check the value

5. If the value is ‘N’ but the dimension is Type 2, you may be hitting the issue described in the previous section.

Here are also some tips for finding out the type of a dimension in the ODI repository.

  • For a dimension such as the employee dimension, first find the dimension table name, for example W_EMPLOYEE_D
  • Go to the ODI repository -> ‘Designer’ -> ‘Models’
  • Find the dimension table and open it by double-clicking it
  • Go to ‘Definition’ and check the OLAP type. An OLAP type of ‘Slowly Changing Dimension’ tells you that the dimension is Type 2

  • You can also find out which attributes are Type 2 by checking the column attributes:

1. Expand the dimension table, for example W_EMPLOYEE_D, and then expand Columns

2. Open a column’s attributes by double-clicking the column

3. Go to ‘Description’ and check ‘Slowly Changing Dimension Behavior’

The ‘Add Rows on Change’ option for a column tells you that the attribute is Type 2.

Conclusion

This blog helps you understand how TYPE2_FLG works in ETL and why the flag matters. It also gives you a way to debug possible TYPE2_FLG issues.

Thursday Apr 17, 2014

3 Modes for Moving Data to the BI Applications DW from a Source Application Database

In BI Applications 11.1.1.7.1 the adaptors for the following product lines use the LKM BIAPPS SQL to Oracle (Multi Transport) to move the data from the source Application database to the target BI Applications DW database:

  • E-Business Suite
  • Siebel
  • PeopleSoft
  • JDE

A key feature of this LKM, developed specifically for BI Applications, is that the data from the source system may be transported in 3 different ways, and using a parameter set in Configuration Manager the mode can be selected to suit how the system has been set up, thereby optimizing ETL performance. This blog post details those 3 modes and how to configure BI Applications to use the mode that best suits the deployment.


Friday Apr 11, 2014

Snapshot Facts in OBIA (3)

Authors: Anbing Xue, Zhi Lin

Delta Snapshot History Fact

Besides trend lines of snapshots, there is a requirement to plot trend lines of the delta changes on a transaction along the time line. The changes can be either quantitative (amounts, units, etc.) or qualitative (names, descriptions, etc.). Hence we invented a new kind of snapshot fact specifically to address this.

Typical Data Model

The typical attributes of a delta snapshot history fact are:

  • SNAPSHOT_DT
  • IMAGE_TYPE

Many attributes of the original transactions are kept and inherited, like:

  • Primary key
  • Foreign keys to the dimensions

Implementation

The delta snapshot history fact captures and stores a new pair of images whenever we detect a change on the original transaction. The image pair is essential, especially for qualitative changes. Usually it consists of one pre image (or “negative” image) and one post image (or “positive” image). For example:

| IMAGE_NUM | ORDER_ID | SHIP_TO_LOCATION | AMOUNT | SNAPSHOT_DT | IMAGE_TYPE |
| --------- | -------- | ---------------- | ------ | ----------- | ---------- |
| 1         | 001      | New York City    | $100   | Mar 1, 2014 | POST       |
| 2         | 001      | New York City    | -$100  | Mar 2, 2014 | PRE        |
| 3         | 001      | New York City    | $120   | Mar 2, 2014 | POST       |
| 4         | 001      | New York City    | -$120  | Mar 3, 2014 | PRE        |
| 5         | 001      | Boston           | $120   | Mar 3, 2014 | POST       |

Basically, this delta snapshot history fact stores a regular snapshot per change and adds the “negative” snapshot of the state before the change. Clubbing both kinds of snapshots together therefore yields the delta trend line directly.
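The following Python sketch illustrates the image-pair idea; the delta_images function and column names are assumptions based on the example table, not an actual OBIA routine.

```python
def delta_images(old_row, new_row, snapshot_dt, measures=("AMOUNT",)):
    """Emit the PRE/POST image pair for one detected change."""
    # The PRE image repeats the old values with negated measures,
    # so summing all images by SNAPSHOT_DT gives the delta directly.
    pre = dict(old_row, SNAPSHOT_DT=snapshot_dt, IMAGE_TYPE="PRE")
    for m in measures:
        pre[m] = -old_row[m]
    post = dict(new_row, SNAPSHOT_DT=snapshot_dt, IMAGE_TYPE="POST")
    return [pre, post]

# Images 2 and 3 of the table above: the amount changes from $100 to $120 on Mar 2.
old = {"ORDER_ID": "001", "SHIP_TO_LOCATION": "New York City", "AMOUNT": 100}
new = {"ORDER_ID": "001", "SHIP_TO_LOCATION": "New York City", "AMOUNT": 120}
images = delta_images(old, new, "Mar 2, 2014")
```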

We also introduced more flexibility for users to configure the subset of columns they want to track. Based on the configuration, the ETL creates new snapshots only for changes on the tracked columns; changes on the other columns trigger an update to the existing snapshots instead, to stay in sync with the original transactions.

Extra ETL complexity has to be introduced to handle pre and post images separately, plus the flexibility to track a subset of the changes. In return, the number of records is much smaller than in a regular daily snapshot fact: the data size of this fact is proportional to the number of changes.

Snapshot Facts in OBIA (2)

Authors: Divya Nandakumar, Zhi Lin

Range Snapshot History Fact

In a data warehouse implementation, the volume of OLTP transactions can already be very big, and consequently the volume of the snapshot fact can be humongous, depending on the snapshot frequency. The dilemma is that better accuracy of change history is achieved with more frequent captures, which makes the data size overwhelming and badly impacts performance.

A solution is to create a variation of the snapshot history fact, which we call a range (or snapshot period) history fact. The idea is simple: concatenate consecutive snapshots of a transaction into a single new snapshot whenever they share identical images. The new snapshot is labeled with a starting and an end timestamp, indicating the time period the image lasted. Merging duplicate snapshots this way reduces the resulting data size significantly.

Typical Data Model

The typical attributes of a range snapshot history fact are:

  • EFFECTIVE_FROM_DT
  • EFFECTIVE_TO_DT

Many attributes of the original transactions are kept and inherited, like:

  • Primary key
  • Foreign keys to the dimensions

Implementations

Concatenate daily snapshots into a range

This is the conventional way to build range snapshots. Two consecutive daily snapshots sharing an identical status can be merged into one snapshot spanning the two days; two with different statuses are stored as two separate snapshots, one for each day.

Concatenation of these daily snapshots is achieved by gathering related daily snapshot records together. The degree of condensation can be remarkable, because a range may span an arbitrary period, unlike a fixed week or month. If the triggering event occurs very often, for example 20 times a day, this approach is not advisable. On the other hand, every detail is preserved, since no daily snapshot is dropped from the concatenation.

The ETL flow requires daily snapshots to start with, and groups by the “interesting” statuses to merge identical rows. A status change in these related daily snapshots triggers a new snapshot record in the data warehouse. The dependency on accumulating daily snapshots means an extra task and large storage. Incremental load can be a challenge, especially for back-dated snapshots. Also, this method assumes there are no gaps between daily snapshots; a gap leads to an exception that is difficult to handle in ETL.
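As a sketch of this concatenation step, the Python below merges consecutive daily snapshots with an identical status into date ranges. It assumes gap-free, sorted daily snapshots for a single transaction, and the function and status names are illustrative.

```python
from datetime import date, timedelta

def concat_daily_snapshots(daily_rows, status_col="STATUS"):
    """Merge consecutive daily snapshots sharing a status into range snapshots."""
    ranges = []
    for row in sorted(daily_rows, key=lambda r: r["SNAPSHOT_DT"]):
        last = ranges[-1] if ranges else None
        if (last is not None and last[status_col] == row[status_col]
                and row["SNAPSHOT_DT"] - last["EFFECTIVE_TO_DT"] == timedelta(days=1)):
            last["EFFECTIVE_TO_DT"] = row["SNAPSHOT_DT"]  # extend the open range
        else:
            ranges.append({status_col: row[status_col],
                           "EFFECTIVE_FROM_DT": row["SNAPSHOT_DT"],
                           "EFFECTIVE_TO_DT": row["SNAPSHOT_DT"]})
    return ranges

daily = [{"SNAPSHOT_DT": date(2014, 3, d), "STATUS": s}
         for d, s in [(1, "Open"), (2, "Open"), (3, "Shipped")]]
print(concat_daily_snapshots(daily))
# -> "Open" from Mar 1 to Mar 2, "Shipped" from Mar 3 to Mar 3
```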

Range snapshots directly from transactions

Here we invented a new way to build range snapshots that overcomes the shortcomings of the conventional method above. We removed the dependency on daily snapshots and build range snapshots directly by scanning through all transaction footprints.

We introduced a few key points to achieve this.

1) Create multiple range snapshots trailing each footprint (transaction). For example, one order placed on Mar 25, 2012 by Adam derives the trailing range snapshots below. The period duration of each snapshot is one year here, which is configurable.

| Customer | Status  | Status Start Date | Status End Date |
| -------- | ------- | ----------------- | --------------- |
| Adam     | Active  | Mar 25, 2012      | Mar 25, 2013    |
| Adam     | Dormant | Mar 25, 2013      | Mar 25, 2014    |
| Adam     | Lost    | Mar 25, 2014      | Mar 25, 2015    |

2) Collapse all the trailing series generated in (1) so that only one status remains at any point in time, using priority rules. In this example, the priority rule is Active > Dormant > Lost.

3) On top of the collapsed results, concatenate the snapshots having identical statuses.

As before, each resulting snapshot is labeled with a starting and an end timestamp indicating how long the image lasted.
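Here is a rough Python sketch of steps (1) and (2) for the same example; the one-year trailing statuses, the priority map and the function names are illustrative assumptions (leap-day handling is ignored). Step (3) can reuse the same concatenation shown in the previous section.

```python
from datetime import date

PRIORITY = {"Active": 0, "Dormant": 1, "Lost": 2}  # lower value wins: Active > Dormant > Lost

def trailing_snapshots(order_dt):
    """Step 1: derive one-year trailing status ranges from a single order."""
    statuses = ["Active", "Dormant", "Lost"]
    return [{"STATUS": s,
             "FROM_DT": order_dt.replace(year=order_dt.year + i),
             "TO_DT": order_dt.replace(year=order_dt.year + i + 1)}
            for i, s in enumerate(statuses)]

def status_on(day, trails):
    """Step 2: collapse overlapping trails to one status per day via the priority rule."""
    covering = [t["STATUS"] for t in trails if t["FROM_DT"] <= day < t["TO_DT"]]
    return min(covering, key=PRIORITY.get) if covering else None

# Adam places orders on Mar 25, 2012 and Sep 1, 2012; the later order keeps him Active longer.
trails = trailing_snapshots(date(2012, 3, 25)) + trailing_snapshots(date(2012, 9, 1))
print(status_on(date(2013, 6, 1), trails))  # -> "Active" (overrides the first order's "Dormant")
```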

The challenge of incremental load, especially back-dated records, is solved relatively easily here, because the source information, the transaction footprints, is usually persisted anyway. In the same example, the ETL can be as simple as deleting the records for a particular customer from the target table and recreating them from scratch every time that customer places an order.

We still achieve a great amount of data compression and robust ETL processing. The incremental load, however, is not yet precise to the most granular level: an incremental load involving one transaction per customer would end up truncating and rebuilding the entire target table.

Snapshot Facts in OBIA (1)

Authors: Divya Nandakumar, Zhi Lin

 

Snapshot History Fact

A snapshot captures an image of the source data at a certain point in time, and preserves that image plus a (snapshot) time label. A regular, transactional fact stores data in data warehouse format and reflects OLTP transactions with near-real-time latency. In this context, a regular fact captures the near-current snapshot of the OLTP transactions and can support status quo analysis.

A snapshot history fact (or snapshot fact, for short) accumulates a series of snapshots and preserves them all, each with a different time label. In this way, the snapshot series preserves the change history of each transaction. This fact is very useful for trend analysis over time.

Typical Data Model

The typical attribute of a snapshot history fact is:

  • SNAPSHOT_PERIOD_DT

Many attributes of the original transactions are kept and inherited, like:

  • Primary key
  • Foreign keys to the dimensions

Implementation

Rolling period of daily snapshots

Consider a business need to analyze daily quantity-on-hand inventory levels by product and store for a business process like retail store inventory. The granularity is daily inventory by product at each store; the dimensions are date, product and store; the fact is quantity on hand.

Storing the source system’s daily snapshots indefinitely would have a serious impact on data warehouse storage. The first solution is to keep daily snapshots for a limited rolling period, for example the last 90 days. The daily snapshot table accumulates daily snapshots from 1 day, 2 days, and so on up to 90 days. After that, it always drops the oldest daily snapshot before adding a new one. Hence the ETL should always delete the snapshots older than 90 days first, and then append the new snapshot.
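A minimal Python sketch of that delete-then-append run is below; the fact table is modeled as an in-memory list, and the names and the 90-day window are illustrative.

```python
from datetime import date, timedelta

ROLLING_DAYS = 90  # retention window

def load_daily_snapshot(fact_rows, todays_rows, run_dt):
    """One ETL run: purge snapshots older than the window, then append today's image."""
    cutoff = run_dt - timedelta(days=ROLLING_DAYS)
    kept = [r for r in fact_rows if r["SNAPSHOT_DT"] >= cutoff]    # delete step
    kept += [dict(r, SNAPSHOT_DT=run_dt) for r in todays_rows]     # append step
    return kept

fact = [{"PRODUCT": "P1", "STORE": "S1", "QTY_ON_HAND": 40, "SNAPSHOT_DT": date(2014, 1, 1)}]
today = [{"PRODUCT": "P1", "STORE": "S1", "QTY_ON_HAND": 42}]
fact = load_daily_snapshot(fact, today, date(2014, 4, 17))  # the Jan 1 row falls out of the window
```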

This method keeps snapshots at a fairly fine, daily grain. However, older snapshots are not kept, so it is not good for long-term historical trending.

 

Monthly snapshots

At the end of the month, all accounts have their month-end balance captured. The event is the end of the month, and the month is stored as part of the data warehouse record. The selection program reads through the operational data and, upon encountering a record that meets the qualifications, moves it to the data warehouse: at month end, each account is queried and its closing balance is transferred to the data warehouse environment. One account may have had no activity during the month and another may have had 200 activities; both show up as exactly one record in the data warehouse. No continuity of activity is assumed with this technique.

The passage of time - day end, week end, month end, and so on - is a common way of triggering a snapshot, but the periodic passage of time is hardly the only trigger.

The monthly snapshot table stores the snapshot history of all previous months. The ETL design has a preload mapping that deletes the data already loaded for the current month, based on the current month-end date, and then loads the latest data for the current month.
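A sketch of that preload step, under the same illustrative in-memory modeling as above:

```python
from datetime import date

def preload_monthly(fact_rows, month_end_dt, latest_rows):
    """Delete the rows already loaded for the current month, then reload
    the latest current-month data under the month-end snapshot date."""
    kept = [r for r in fact_rows if r["SNAPSHOT_PERIOD_DT"] != month_end_dt]  # delete step
    kept += [dict(r, SNAPSHOT_PERIOD_DT=month_end_dt) for r in latest_rows]   # reload step
    return kept

fact = [{"ACCOUNT": "A1", "BALANCE": 500, "SNAPSHOT_PERIOD_DT": date(2014, 3, 31)},
        {"ACCOUNT": "A1", "BALANCE": 510, "SNAPSHOT_PERIOD_DT": date(2014, 4, 30)}]
fact = preload_monthly(fact, date(2014, 4, 30), [{"ACCOUNT": "A1", "BALANCE": 525}])
# The stale April row is replaced with the latest April balance.
```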

 

In this way, we “aggregate” up from daily snapshots and achieve great compaction of the data size. Longer-term history trending can be stored and reported. However, we lose the detail in between every two month ends.

 
