By Saurabh Verma-Oracle on May 21, 2013
Authored by : Sushanta Chakraborty
The purpose of this document is to provide enough information one might need while attempting an implementation of one or more offerings and subject areas using the Oracle BI Applications Universal Adaptors. There are several myths around what needs to be done while implementing Universal Adaptors, where can things go wrong if not configured correctly, what columns are to be populated as a ‘must’, how to provide ‘delta’ data set while shooting for incremental ETL runs and so on. All of these topics are discussed in this document.
Understanding the ‘entry points’ required to implement a subject area is the key to a successful implementation of Universal Adaptors.
Oracle BI Applications provide packaged ETL mappings against source OLTP systems like Oracle Fusion Applications, Oracle E-Business Suites, PeopleSoft, JD Edwards and Siebel across various business areas such as Human Resources, Supply Chain & Procurements, Order Management, Financials, Service and so on. However, Oracle BI Applications does acknowledge that there can be quite a few other source systems, including home-grown ones, typically used by SMB customers. And to that extent, some of the enterprise customers may also be using SAP as their source. Until it gets to a point where Oracle BI Applications can deliver pre-built ETL adaptors against each of these source systems, the Universal Adaptor becomes a viable choice.
A mixed OLTP system where one of them has pre-built adaptor for and the other doesn’t – is also a scenario that calls for the usage of Universal Adaptors. For instance, the core portion of Human Resources may be in PeopleSoft systems, but the Talent Management portion may be maintained in a non-supported (yet) application such as “Taleo”.
In order for customers to enable pulling in data from non-supported source systems into the Data Warehouse, Oracle BI Applications have created a so called “Universal Adaptor”. The reason this was doable in the first place was the fact that the ETL architecture of Oracle BI Applications had the evident support for this. Oracle BI Applications’ Data Warehouse consists of a huge set of facts, dimensions and aggregate tables. The portion of the ETL that loads to these ‘end’ tables are typically Source Independent (loaded using the folder SILOS, acronym of “Source Independent Load Objects”). These ETL maps/interfaces start from a staging table and load data incrementally into the corresponding end table. Aggregates are created upstream, and have no dependencies to the source system where the data came from. The ETL logic in SDE folder (acronym of “Source Dependent Extract”) that extracts into these staging tables (also called Universal Stage Tables) are the ones that go against a given source system, like EBS or PSFT FUSION and so on. For Universal, they go against a similarly structured CSV file. Take any Adaptor – the universal stage tables are exactly the same, structurally. The grain expectation is also exactly the same for all adaptors. And no wonder, while all these conditions are met, the SILOS logic will load the data (extracted from Universal) from the universal stage tables seamlessly.
Why did Oracle BI Applications decide to source from CSV files? In short, the answer to this is “to complete the end-to-end extract-transform-load story”. We will cover this in a bit more details and what the options are, in the next section.
One myth that implementers have while implementing Universal Adaptors is “Data for the universal staging tables should always be presented to Oracle BI Applications in the required CSV file format”.
If your source data is already present in a relational database, why dump it to CSV files and give it to Oracle BI Applications? You will anyway have to write brand new ETL mappings that read from those relational tables to get to the right grain and right columns. Then why target those to CSV files and then use the Oracle BI Applications Universal Adaptor to read from them and write to the universal staging tables? Why not directly target those custom ETL maps to the universal staging tables? In fact, when your source data is in relational tables, this is the preferred approach.
However, if your source data comes from 3rd party sources which you have outsourced, and probably have agreements with them to send you data files/reports/feeds once in a while, and if that 3rd party source doesn’t allow you to access their relational schema, then probably CSV files is the only alternative. A typical example would be Payroll data. A lot of organizations typically outsource their Payroll to 3rd party companies like ADP systems and so on. In those cases, ask for the data in the same manner that you expect in the Oracle BI Applications CSV files. Another valid example might be SaaS applications which usually do not allow direct database accesses. File is the best option for them.
Also, if your source data lies in IBM mainframe systems, where it is typically easier to write COBOL programs or whatever to extract the data in flat files, presenting CSV files to Oracle BI Applications Universal Adaptor is probably easier. Irrespective of how to populate the universal staging tables (relational sources or CSV sources) five very important points should always be kept in mind:
- Grain of the universal staging tables are met properly.
- The uniqueness of records do exists in the (typically) INTEGRATION_ID columns.
- The mandatory columns are populated the way they should be.
- The relational constraints are met well while populating data for facts. In other words, the natural keys that you provide in the fact universal staging table, must exist in the dimensions. This is with respect to the FK resolution (dimension keys into the end fact table) topic.
- Incremental extraction policy is well set up. Some overlap of data is OK, but populating the entire dataset to the universal staging tables will prove to be non-performing.
Note: For the rest of the document, we will assume that you are going the CSV file approach, although re-iterating, it is recommended that if your source data is stored in a relational database you should write your own extract mappings.
There are several entry points while implementing a subject area using Universal Adaptors. The base dimension tables and base fact tables have their corresponding CSV files where you should configure the data at the right grain and expectations. Other kinds of tables include ‘Exchange Rate’ and ‘Codes’ (Domains). Exchange Rate (W_EXCH_RATE_G) has its own single CSV file, whereas the Codes table (W_CODE_D) has a CSV file, one per each code category. To get to see all code-names well enough in the dashboards/reports, you should configure all the ‘required’ code CSV files for the subject area in question.
Note: The Codes table has been replaced with Domains architecture in newer releases of BI Applications.
- Start with populating the offering specific common dimension CSV files, like Job, Pay Grade, HR Position, GL Account, etc., depending on your offering (such as HR, or FINANCE etc). It is not necessary that all offerings would have “their set of common dimensions used in all facts within the offering”.
- Then configure subject area specific dimensions, like Job Requisitions, Recruitment Source etc (when implementing Recruitment) or Learning Grade, Course (when implementing Learning) or Pay Type etc (when implementing Payroll) or Absence Event, Absence Type Reason (when implementing Absence) and so on. These are examples from HR offering. Similarly, it applies for Financials subject areas such as AR or AP etc., where you should consider area specific dimension needs at this time.
- Then configure related COMMON class dimensions applicable for all, like Internal Organizations (logical/applicable partitions being “Department”, “Legal Entity” , “Business Unit” etc), or Business Locations (logical/applicable partitions being “Employee Locations”, “Inventory Locations” etc.).
- Consider other shared dimensions and helper dimensions like Status, Transaction Types, and Exchange Rate etc. Special handling of time dimension should be addressed here (such as Fiscal or enterprise calendars etc).
- Then consider the code dimensions. By this time you are already aware of what all dimensions you are considering to implement, and hence, can save time by configuring the CSVs for only the corresponding code categories. For newer releases, you would configure the domain set up at this point.
- For facts, start with configuring the primary fact for your offering. Since the dimensions are already configured, the natural key of the dimensions are already known to you and hence should be easy to configure them in the fact. For some offerings, there isn’t a concept of a “primary fact”. If so, go to the next step.
- Primary fact should be followed by subject area specific facts, like Payroll, Job Requisition Event, Applicant Event, etc (under HR offering) or AR, AP etc (under Financial offering) and so on.
- Now that all the CSV files for facts, dimensions, and helper tables are populated, you should move your focus towards Domains. For E-Business Suite & PeopleSoft Adaptors, we do mid-stream lookups against preconfigured lookup CSV files. The map between source values/codes to their corresponding domain values/codes come pre-configured in these lookup files. However, for Universal Adaptor, no such lookup files exist. This is because of the fact that we expect that the accurate domain values/codes will be configured “along-with” configuring the base dimension tables where they apply. Since everything is from a CSV file, there is no need to have the overhead of an additional lookup file acting in the middle. Domain value columns begin with “W_” [excluding the system columns like W_INSERT_DT and W_UPDATE_DT] and normally they are mandatory, cannot be nulls, and the value-set cannot be changed or extended. We do relax the extension part on a case by case basis, but in no way, the values can be changed. The recommendation at this stage is that you go to the DMR guide (Data Model Reference Guide), get the list of table-wise domain values, understand the relationships clearly in cases there exists any hierarchical or orthogonal relations, identify the tables where they apply and then their corresponding CSV files, look at the source data and configure the domain values in the same CSV files. Note that if your source data is in a relational database and you have chosen to go the ‘alternate’ route of creating all extract mappings by yourself, the recommendation is to follow what we have done for E-Business Suite Adaptors and PeopleSoft Adaptors and create separate domain value lookup CSV files, and do a mid-stream lookup.
- Note that the above discussion on Domains has been revamped in newer releases on BI Applications. Now, we only have the CODE columns in our data model and the NAME and DESC columns have been taken away. We still do a mid-stream lookup in EBS/PSFT adaptors to resolve the target domain code against the source domain code, but no longer use lookups to resolve the target domain names and descriptions based on source/target domain codes. Rather, these name/description lookups happen on the fly at the RPD level, where the name or description is brought back to the report depending on the code and language preferred in reports. The domain mapping happens in Configuration Manager now, and no longer in CSV files. You will be taken to the step of configuring or mapping them in Configuration Manager.
- Last, but not the least, configure the ETL parameters. Read up the documentation for these parameters, understand their expectations, study your own business requirements and then set the values accordingly.
Domain values constitute a very important foundation for Oracle Business Intelligence Applications. We use this concept heavily all across the board to ‘equalize’ similar aspects from a variety of source systems. The Oracle Business Intelligence Applications provide packaged data warehouse solutions for various source systems such as E-Business Suite, PeopleSoft, Siebel, JD Edwards and so on. We attempt to provide a “source dependent extract” type of a mapping that leads to a “source independent load” type of a mapping, followed by a “post load” (also source independent) type of mapping. With data possibly coming in from a variety of source systems, this equalization is necessary. Moreover, the reporting metadata (OBIEE RPD) is also source independent. The metric calculations are obviously source independent.
The following diagram shows how a worker status code/value is mapped onto a warehouse domain to conform to a single target set of values. The domain is then re-used by any measures that are based on worker status.
Domain values help us to equalize similar aspects or attributes as they come from different source systems. We use these values in our ETL logic, sometimes even as hard-coded filters. We use these values in defining our reporting layer metrics. And hence, not configuring, incorrectly configuring, or changing the values of these domain value columns from what we expect, will lead to unpredictable results. You may have a single source system to implement, but still you have to go through all the steps and configure the domain values based on your source data. Unfortunately, this is small price you pay for going the “buy” approach VS the traditional “build” approach for your data warehouse.
One of the very frequently asked question is “what is the difference between domain value code/name pairs VS the regular code/name pairs that are stored in W_CODE_D (or W_DOMAIN_MEMBER_G and W_DOMAIN_MEMBER_MAP_G along with their corresponding Translation tables in newer versions of BI Applications)”.
If you look at the structure of W_CODE_D table, it appears to be also capable of standardizing code/name pairs to something common. This is correct. However, we wanted to give an extensive freedom to users to be able to do that standardization (not necessarily equalization) of their code/names and possibly use that for cleansing as well. For example, if the source supplied code/name are possibly CA/CALIF or CA/California, you can choose the W_CODE_D approach (using Master Code and Master Map tables – see configuration guide for details) to standardize on CA/CALIFORNIA. The equivalent of the above is the W_DOMAIN_MEMBER_MAP_G table in newer versions of BI Applications.
Now, to explain the difference of domain value code/name pairs Vs the regular source code/name pairs, it is enough if you understand the significance of the domain value concept. To keep it simple, wherever we (Oracle Business Intelligence Applications) felt that we should equalize two similar topics that give us analytic values, metric calculation possibilities etc, we have “promoted” a regular code/name pair to a domain value code/name pair.
If we have a requirement to provide a metric called “Male
Headcount”, we can’t do that accurately unless we know which of the headcount
is “Male” and which is “Female”. This metric therefore has easy calculation
logic: Sum of headcount where sex = Male. Since PeopleSoft can call it “M” and
EBS can have “male”, we decided to call it a domain value code/name pair,
W_SEX_MF_CODE (available in the employee dimension table). Needless to say, if
you didn’t configure your domain value for this column accurately, you won’t
get this metric right.
The Oracle BI Applications mostly use Name and Description columns in the out-of-the-box dashboards and reports. We use Codes only during calculations, wherever required. Therefore, it is obvious that if the names and descriptions didn’t resolve against their codes during the ETL, you will see blank values of attributes (or in some cases, depending on the parameter setting, you might see strings like <Source_Code_Not_Supplied> or <Source_Name_Not_Supplied> and so on). In newer versions of BI Applications where names and descriptions are resolved using on-the-fly RPD lookups and not in the ETL, the reports might show values such as ‘__UNSASSIGNED__’, if not mapped well enough in Configuration Manager.
Another point to keep in mind is that all codes should have distinct name values. If two or more codes have the same name value, at the report level you will see them merged. The metric values may sometimes appear in different lines of the report, because OBIEE Server typically throws in a GROUP BY clause on the lowest attribute (code).
Once implemented, you are free to promote the source code columns from the logical layer to the presentation layer. You might do this when you know your business users are more acquainted to the source code values rather than the name values. But that is a separate business decision. The general behavior is not like that.
Although you can choose to supply the entire dataset during incremental runs, for all practical reasons, this is not recommended. Firstly because then the ETL has to process all the records and determine what needs to be applied and what can be rejected. Secondly, the decision ETL takes may not be accurate. ETL decisions are based on the values of the system date columns like CHANGED_ON_DT, AUX1_CHANGED_ON_DT, AUX2_CHANGED_ON_DT, AUX3_CHANGED_ON_DT and AUX4_CHANGED_ON_DT columns only. We do not explicitly compare column-by-column and determine whether an update is required. We believe that if something has changed, probably one of the four date columns must have changed. And in that case, we simply update. If all 5 date columns are same, we pretty much tend to reject. The base of this decision is the correctness of the date columns. If your source system does not track the last updated date column on a record well enough, it becomes your responsibility to force an update, no matter what. An easy way to do this is to set SESSSTARTTIME in one of these columns during extract. This will force to detect a change, and we will end up updating.
No wonder, this is not be the best idea. By all means, you should provide the true “delta” data set during every incremental run. A small amount of overlap is acceptable, especially when you deal with flat files. Our generally accepted rules for facts or large dimensions are either:
· Customer does their own version of “persisted staging” so they can determine changes at the earliest opportunity and only load changes into universal staging tables
· If absolutely impossible to determine the “delta” or to go the “persistent staging” route, Customer only does full load. Otherwise doing a full extract every time and processing incrementally will take longer.
Follow the principles below to decide on your incremental strategy:
- (Applies to relational table sources) Does your source system capture last update date/time accurately in the source record that change? If so, extracting based on this column would be the best idea. Now, your extract mapping may have used 2 or 3 different source tables. Decide which one is primary and which ones are secondary. The last update date on the primary table goes to the CHANGED_ON_DT column in the stage table. The same from the other two tables go to one of the auxiliary changed on date column in the stage table. If you design your extract mapping this way, you are almost done. Just make sure you add the filter criteria “primary_table.last_update_date >= LAST_EXTRACT_DATE parameter”. The value of this parameter is usually maintained within the ETL orchestration layer.
- (Applies to CSV file sources) Assuming there is a mechanism which you can trust that gives you the delta file during each incremental load; does the ‘delta’ file come with a changed value of system dates? If yes, you’re OK. But if not, then you should add an extra piece of logic in the out of the box SDE_Universal** mappings that sets SESSSTARTTIME to one of the system date columns. This will force an update (when possible) no matter what.
- (Applies to CSV file sources) If there are no mechanisms to easily give your delta file during incremental, and it seems easier to get a complete ‘dump’ every time, you have actually couple of choices:
a. Pass the whole file in every load, but run true incremental loads. Note that this is not an option for large dimensions or facts.
b. Pass the whole file each time and run full load always.
c. Do something at the back-office to process the files and produce the ‘delta’ file yourselves.
The choices (a) and (b) may sound a bad idea, but we’ve seen it to be a worthwhile solution compared to (c), if the source data volume is not very high. For an HR Analytics implementation (as an example), this could be OK as long as your employee strength is no more than 5000 and you have no more than 5 years of data. The benchmark might be different for a Financial Analytics implementation.
The choice (c) is more involved but produces best results. The idea is simple. You store the last image of the full file that your source gave you [call ‘A’]. You get your new full file today [call ‘B’]. Compare A & B. There are quite a few data-diff software available in the market, or better if you could write a Perl or python script on your own. The result of this script should be a new delta file [call ‘C’] that has the lines copied from B that has changed as compared to A. Use this new file C as your delta data for your incremental runs. Also, discard A and rename B as A, thereby getting ready for the next batch.
Having said that, it is worthwhile to re-iterate that the “Persisted Staging” is a better approach as it is simpler and uses the ETL logic to do the comparison. Oracle BI Applications have used this technique in HR adaptors for E-Business Suite and PeopleSoft, in case you wanted to refer to them.
If there are other options not considered here, by all means, try them out. This list is not comprehensive, it is rather indicative.