Generating Sample Data with ODI: A Case Study For Knowledge Modules and User Functions
By Christophe Dupupet on Sep 09, 2009
The posts in this series assume that you have some level of familiarity with ODI. The concepts of Interface, Model, Knowledge Module and User Function are used here assuming that you understand them in the context of ODI. If you need more details on these elements, please refer to the ODI Tutorial for a quick introduction, or to the complete ODI documentation for detailed information.
We've all been there: we start coding while waiting for a set of sample data to become available. We move along with the code... and the data is still not available. Or we need to build a small (or not so small) data set quickly. Sure, we all have sample databases left and right for that purpose. But recently I was looking for a decent-sized data set for some tests (more than the traditional 30 sample records) and could not get my hands on what I needed. What the heck: why not have ODI build this for me?
The techniques that we will leverage for this are the following:
- Creation of a temporary interface to create the sample table (See this previous post for details on how to create a temporary interface)
- Creation of a new knowledge module to generate enough records in the new table
- Creation of ODI User Functions to simplify the generation of random values
All the objects mentioned in this article can be downloaded, either as a project that already contains all the objects (IKM and user functions), ready to import into your repository, or as a file that lets you import the different objects individually. In the latter case, you will have to unzip the file before importing the objects.
The samples provided here have all been designed for an Oracle database, but can be modified and adapted for other technologies.
Today we will discuss the different elements that allow us to generate the sample data set. In future posts, we will dissect the Knowledge Modules and User Functions to see what technological choices were made based on the different challenges that had to be solved.
1. THE INTERFACE
For more details on how to create a temporary interface, you can refer to this post. For our example, we will create a new table in an existing schema. When you create your temporary interface, remember to set the following elements:
- Select your staging area (in the Definition tab of the interface)
- Name your target table
- Select the location of your target table (work schema / data schema)
- Name the Columns, and set their individual data type and length
For our example, we will use a fairly simple table structure:
2. USER FUNCTIONS
The Oracle database comes with a package called DBMS_RANDOM. Other random generators can be used (DBMS_CRYPTO, for instance, has random generation functions as well). These functions do not all take the same parameters, and if we realize after creating dozens of mappings that using the "other" package would have been better... we would be in a lot of trouble. Creating user functions will allow us to:
- Have a naming convention that is simplified
- Limit the number of parameters
- Limit the complexity of the code
- Later maintain the code independently of our interfaces, in a centralized location: if we decide to change the code entirely, we will make modifications in one single place - no matter how often we use that function.
For our example, we will use 5 ODI user functions (again, these can be downloaded):
- RandomDecimal(Min, Max): generates a random value (with decimals) between the Min and Max values
- RandomNumber(Min, Max): generates a random value (without decimals) between the Min and Max values
- RandomBool(): generates a 0 or a 1
- RandomDate(MinDate, MaxDate): returns a date between MinDate and MaxDate (make sure MinDate and MaxDate are valid dates for Oracle)
- RandomString(Format, Min, Max): generates a random string with a minimum of Min characters and a maximum of Max characters. Valid formats are:
- 'u', 'U' - returning string in uppercase alpha characters
- 'l', 'L' - returning string in lowercase alpha characters
- 'a', 'A' - returning string in mixed case alpha characters
- 'x', 'X' - returning string in uppercase alpha-numeric characters
- 'p', 'P' - returning string in any printable characters.
We can use these functions either as is or as part of more complex logic, such as a case...when expression.
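To make the idea concrete, here is a minimal sketch of the kind of Oracle expression each user function could expand to, written as a plain query against DUAL. These are assumptions built on the DBMS_RANDOM.VALUE and DBMS_RANDOM.STRING calls, not necessarily the implementations shipped in the download; the bounds used (1 and 100, the 2009 dates, 1 to 30 characters) are illustrative only.

```sql
-- Hypothetical Oracle expressions behind the five user functions.
-- The downloadable functions may be implemented differently.
SELECT
  -- RandomDecimal(1, 100): DBMS_RANDOM.VALUE(low, high) returns low <= x < high
  DBMS_RANDOM.VALUE(1, 100)                                    AS random_decimal,
  -- RandomNumber(1, 100): widen the upper bound by 1, then truncate to an integer
  TRUNC(DBMS_RANDOM.VALUE(1, 100 + 1))                         AS random_number,
  -- RandomBool(): DBMS_RANDOM.VALUE without arguments is in [0, 1), so ROUND gives 0 or 1
  ROUND(DBMS_RANDOM.VALUE)                                     AS random_bool,
  -- RandomDate(MinDate, MaxDate): add a random whole number of days to MinDate
  DATE '2009-01-01'
    + TRUNC(DBMS_RANDOM.VALUE(0, DATE '2009-12-31' - DATE '2009-01-01' + 1))
                                                               AS random_date,
  -- RandomString('A', 1, 30): random length first, then a random mixed-case string
  DBMS_RANDOM.STRING('A', TRUNC(DBMS_RANDOM.VALUE(1, 30 + 1))) AS random_string
FROM dual;
```

In an ODI user function, the literals would be replaced by the function's parameters, for instance DBMS_RANDOM.VALUE($(Min), $(Max)) as one possible implementation of RandomDecimal($(Min), $(Max)).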
For our example, we will build the following mappings:
| SAMPLER_NAME | RandomString('A', 1, 30) |
| SAMPLER_PROMOTION | case when RandomBool()=0 then 'FALSE' ... |
In ODI, the mappings will look like this:
3. THE KNOWLEDGE MODULE
Since we do not have any source table in this interface, we only need an IKM. The IKM provided with this example needs to be imported into your project.
Because the purpose of this KM is to generate sample data, it will have a few options where the default values will be different from the usual KMs:
- TRUNCATE defaults to 'YES': we assume here that if you re-run the interface, you want to create a new sample. If you only want to add more records to an existing table, simply set this option to 'NO' in your interface.
- CREATE_TABLE defaults to 'YES': we assume that the table to be loaded does not exist yet. You can turn that option to 'NO' if there is no need to create the table.
- THOUSANDS_OF_RECORDS: set this to any value between 1 and 1,000 to generate between 1,000 and 1,000,000 records
Once you have set the values for your KM, you can run the interface and let it generate the random data set.
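The internals of the IKM are left for a later post, so the following is only a sketch of one common Oracle technique for producing N thousand rows in a single statement: a CONNECT BY LEVEL row generator on DUAL. The table name SAMPLER_DEMO, the SAMPLER_ID column, and the literal 10 standing in for THOUSANDS_OF_RECORDS are hypothetical; the actual KM may work differently.

```sql
-- Hypothetical target table; with CREATE_TABLE = 'YES' the KM would create the real one.
CREATE TABLE sampler_demo (
  sampler_id   NUMBER(10),
  sampler_name VARCHAR2(30)
);

-- Rough equivalent of TRUNCATE = 'YES': start from an empty table on each run.
TRUNCATE TABLE sampler_demo;

-- Row generator: CONNECT BY LEVEL on DUAL yields one row per level,
-- here 1000 * 10 = 10,000 rows (10 stands in for THOUSANDS_OF_RECORDS).
INSERT INTO sampler_demo (sampler_id, sampler_name)
SELECT LEVEL,
       DBMS_RANDOM.STRING('A', TRUNC(DBMS_RANDOM.VALUE(1, 31)))
FROM   dual
CONNECT BY LEVEL <= 1000 * 10;

COMMIT;

-- Check how many rows were generated.
SELECT COUNT(*) FROM sampler_demo;
```

Generating the rows inside the database in one set-based statement avoids row-by-row round trips.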
With the above configuration, and using a standard laptop (dual-core 1.86 GHz processor and 2 GB of RAM) running Oracle XE, my statistics were as follows:
10,000 records generated in 5 seconds
100,000 records generated in 24 to 35 seconds (about 30 seconds on average)
1,000,000 records generated in 211 to 235 seconds (about 4 minutes on average)
Note that the machine was not dedicated to this process and was running other processes.
Statistics are available in the Operator interface.
To review the data loaded by ODI in your target table, simply reverse-engineer this table in a model, then right-click on the table and select View Data to see what was generated!
4. EXPANDING TO OTHER TECHNOLOGIES
One question remains: why did I stop here rather than make this work for other technologies? Well, it turns out that ODI is really meant to move and transform data. As long as I have at least ONE table with random data in any one of my databases, it is now faster to just create a regular ODI interface and move the data across... The design will take less than a minute. The data transfer should not take much time either. Who would spend more time coding when the solution is that simple?
But if you want to make this work for other databases, here are your entry points:
- Duplicate the KM and modify it to use SQL that would work on these other databases
- Update the user functions to make sure that they use the appropriate functions for the given databases
- Use the same logic to create your interface
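As an illustration of the first two entry points, here is a rough sketch of what the same ideas could look like on PostgreSQL. PostgreSQL is chosen purely as an example and is not covered by the downloadable objects. The sketch relies on PostgreSQL's built-in random() and generate_series(); the table sampler_demo and its columns are hypothetical.

```sql
-- PostgreSQL sketch (assumption: not part of the article's downloads).
CREATE TABLE sampler_demo (
  sampler_id    integer,
  sampler_score integer,
  sampler_flag  integer
);

-- generate_series(1, 10000) plays the role of the row generator;
-- random() returns a double precision value in [0, 1).
INSERT INTO sampler_demo (sampler_id, sampler_score, sampler_flag)
SELECT g,
       floor(random() * 100)::integer + 1,  -- counterpart of RandomNumber(1, 100)
       floor(random() * 2)::integer         -- counterpart of RandomBool()
FROM   generate_series(1, 10000) AS g;
```

Because the interface mappings only call the user functions, swapping the expressions behind them is what keeps the interface design itself unchanged.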
All screenshots were taken using version 10.1.3.5 of ODI. Actual icons and graphical representations may vary with other versions of ODI.