Friday Feb 14, 2014

ODI 12c - Table Functions, Parallel Unload to File and More

ODI 12c includes a new component for integrating and transforming data programmatically. There have been plenty of examples of such implementations through the years; recent ones include SQL access to R from Mark Hornick (see an example blog here). As well as being a great integration technique, table functions have fantastic performance and scalability options - hence you see posts and talks from Kuassi Mensah on in-database map-reduce, all about leveraging the Oracle database's parallel query engine and the skills you already have (SQL and PL/SQL/Java).

The table function component in ODI 12c lets you integrate an existing table function implementation into a flow. The parameters for the table function can be scalar or a ref cursor. You can see how the examples from the AMIS posting here are defined within the mapping designer below; there are multiple table functions chained together, used as both a data source and a transformation.

In the above image you can see that the table function name defined in the database is specified in the component's general properties (the property is Function Name). The signature for the function must be manually defined by adding input/output connector points and attributes. Check the AMIS blog and reflect on the design above.
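To make the chaining concrete, the SQL generated for such a design nests one table function inside the cursor input of the next. Here is a minimal sketch, using hypothetical functions fn_source and fn_transform rather than the actual AMIS examples:

  select t.*
  from table(fn_transform(cursor(
         select s.* from table(fn_source()) s
       ))) t;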

Regarding performance, one of the examples I blogged (OWB here and ODI here) was parallel unload to file. The table function examples from those previous blogs were fairly rudimentary; in this blog we will see what happens when we tweak the implementation of such functions - we can get much better performance. Here is the table function implementation I will use within the ODI examples (the type definitions used come from the OWB blog post above).

  create or replace function ParallelUnloadX (r SYS_REFCURSOR) return NumSet
     PIPELINED PARALLEL_ENABLE (PARTITION r BY ANY) is
     TYPE row_ntt IS TABLE OF VARCHAR2(32767);
     v_rows row_ntt;
     v_buffer VARCHAR2(32767);
     v_lines pls_integer := 0;
     c_eol CONSTANT VARCHAR2(1) := CHR(10);
     c_eollen CONSTANT PLS_INTEGER := LENGTH(c_eol);
     c_maxline CONSTANT PLS_INTEGER := 32767;
     out utl_file.file_type;
     filename varchar2(256) := 'dbunload';
     directoryname varchar2(256) := 'MY_DIR';
     vsid varchar2(120);
  begin
     -- each parallel slave writes its own file, named after its session id
     select sid into vsid from v$mystat where rownum = 1;
     filename := filename || vsid || '.dat';
     out := utl_file.fopen(directoryname, filename, 'w');

     loop
       fetch r BULK COLLECT INTO v_rows;
       for i in 1..v_rows.COUNT loop
         -- buffer rows in memory and only flush near-full 32K lines to disk
         if LENGTH(v_buffer) + c_eollen + LENGTH(v_rows(i)) <= c_maxline then
           v_buffer := v_buffer || c_eol || v_rows(i);
         else
           if v_buffer IS NOT NULL then
             utl_file.put_line(out, v_buffer);
           end if;
           v_buffer := v_rows(i);
         end if;
       end loop;
       v_lines := v_lines + v_rows.COUNT;
       exit when r%notfound;
     end loop;
     close r;
     utl_file.put_line(out, v_buffer);

     utl_file.fclose(out);
     PIPE ROW(v_lines);  -- return the number of rows written by this slave
     return;
  end;
  /

The function uses the PARALLEL_ENABLE and PARTITION BY keywords - these two are critical to performance and scalability. In addition, this function is further optimized: it uses the PL/SQL BULK COLLECT capability and also buffers data in PL/SQL variables before writing to file, which avoids unnecessary I/O calls. This was not rocket science to tune (there are plenty of posts on PL/SQL I/O tuning such as this), yet you can see the impact it has on performance further below.

My mapping using the table function as a target is shown below, 

In the physical design I define the parallel hints; this will then perform parallel unloads to file, and you can easily leverage the hardware and power of the Oracle database. Using hints to tweak the physical design lets the designer very easily compare and tune performance - you do not have to design the parallelism in your own flows.
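As a rough sketch of what this means at the SQL level, the hint is applied to the cursor feeding the table function, so each parallel slave gets a slice of the rows and writes its own file (SRC_TABLE and its column are placeholders, not the actual mapping):

  select *
  from table(ParallelUnloadX(cursor(
         select /*+ PARALLEL(s, 4) */ to_char(s.some_column)
         from SRC_TABLE s
       )));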

In the table below you can see the performance difference when I use the PARALLEL(4) hint on a 4 CPU machine;

                    No Hint    PARALLEL(4)
  5 million rows    16s        6s
  32 million rows   200s       47s

If I execute the agent-based SQL to file LKM on the 32 million row example, the time taken out of the box is 398 seconds (versus 47s above when a hint is used); the only divide-and-conquer option with that LKM is to build a custom workflow yourself. With the table function approach, if your database is on a bigger, more powerful host you can easily take advantage of the system by tweaking the hints.

As you see, the ODI table function component provides another custom exit point in a flow which lets you not only add some useful integration capabilities but also do so in a very efficient manner - leveraging and exploiting the database you are running on. Hopefully this gives you a little insight and an interesting use case or two.

Wednesday Jan 16, 2013

ODI - Java Table Function for PDF capture

You can leverage the Java table function approach to easily integrate PDF data using an API like iText. I blogged earlier about the approach of writing a specific LKM for this; although that is perfectly reasonable, the writing and subsequent maintenance of the KM - especially if you have many such APIs - can be simplified, and the code is also much cleaner in my opinion. What do you think? It's simplified by using the LKM for a Java table function I posted here, then writing specific table functions for adapters.

All I did was write the table function wrapper and the makeRow method - this can even support the convention for naming columns from the earlier PDF blog post by overloading the findColumn method (so the query will use COLUMN_n, such as COLUMN_2 - under the hood, we will get the number n and return the appropriate column).

The makeRow method produces the columns from a Java Object; in this case the object is a File representing the PDF. Here is a snippet of the code:

  public String[] makeRow(Object obj) throws SQLException
  {
      String[] row = null;
      int i = 0;
      try {
        PdfReader reader = new PdfReader(((File)obj).getPath());
        Iterator it = reader.getAcroFields().getFields().entrySet().iterator();
        row = new String[ reader.getAcroFields().getFields().size() ];
        while (it.hasNext())
          row[i++] = reader.getAcroFields().getField(((Map.Entry)it.next()).getKey().toString()).toString();
        reader.close();
      } catch (Exception e) { e.printStackTrace(); }
      return row;
  }

The code is very simple. For the table function itself, I simply created a Java Iterator (line 19 below) over the array of files in the directory; the class I extended from conveniently takes care of the iteration.

  1. public static ResultSet readCollection(String dirName, String endstr)
  2.         throws SQLException, UnknownHostException, IOException
  3.     {
  4.       final String suffix = endstr;
  5.       File folder = new File( dirName );
  6.       File[] listOfFiles = folder.listFiles(new FilenameFilter() {
  7.         public boolean accept(File dir, String name) {
  8.           return name.toLowerCase().endsWith(suffix);
  9.         }
  10.       } ); 
  11.       ArrayList<File> al = new ArrayList<File>();
  12.       for (int i = 0; i < listOfFiles.length; i++)
  13.         al.add( listOfFiles[i] );
  14.       PdfReader reader = new PdfReader(((File)listOfFiles[0]).getPath());
  15.       int sz = reader.getAcroFields().getFields().entrySet().size();
  16.       String[] cols = new String[sz];
  17.       for (int j = 0; j < sz; j++)
  18.         cols[j] = new String(new Integer(j+1).toString());
  19.       return new pdf_table( cols, al.iterator() );
  20.     }

The entire Java source for the PDF table function can be found here.

We can assign the LKM to the source set and set the table function name to pdf_table.readCollection, define the directory to use and the file extension to filter.

For this case using the table function I set the model type for the PDF datastore model to be my Derby/JavaDB technology (and not file). This generated the SQL SELECT........from table(PDF_W4('d:\temp\pdfs', 'pdf' )) PDF where (1=1) statement to load into the work table. This was using the exact LKM that I used to extract from MongoDB, and I can write any java table function to extract data and load.
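For illustration, with the COLUMN_n naming convention the generated extract looks something along these lines (the number of columns and their aliases here are hypothetical, driven by the datastore definition):

  select PDF.COLUMN_1 C1_COLUMN_1, PDF.COLUMN_2 C2_COLUMN_2
  from table(PDF_W4('d:\temp\pdfs', 'pdf')) PDF
  where (1=1)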

Friday Jan 11, 2013

ODI - Java Table Function for MongoDB

Behind the scenes of the MongoDB posting was a very simple JavaDB/Derby table function. The function implemented a couple of methods - the table function readCollection and the function makeRow, which creates an array from a Java Object. It can't get much simpler. The iteration through the collection is handled by the class I extended, EnumeratorTableFunction, which came from the posting by Rick Hillegas, and it fits nicely into ODI's source/target generic integration task in the KM framework. Here is a viewlet I have created showing you everything very briefly but end to end.

The makeRow function uses the MongoDB Java SDK and produces a row for each BasicDBObject; each value in the document is output as a column in the table. Nested/complex values are serialized as Java Strings - so you will get a JSON string for anything complex.

  public String[] makeRow(Object obj) throws SQLException
  {
    int idx = 0;
    BasicDBObject dbo = (BasicDBObject) obj;
    Iterator it = dbo.entrySet().iterator();
    String[] row = new String[ getColumnCount() ];
    it.next(); // skip the 'id' column
    while (it.hasNext()) {
      Map.Entry pairs = (Map.Entry)it.next();
      row[ idx++ ] = pairs.getValue().toString();
    }
    return row;
  }

The readCollection table function is a static method and has a couple of parameters (for demonstration) - one is the MongoDB database name and the other is the collection name. The function initializes the object instance with the column names which are defined to be the key names for the objects in the collection (the first object is taken and its keys used as the column names);

  public static ResultSet readCollection(String dbName, String collectionName)
    throws SQLException, UnknownHostException
  {
    int idx = 0;
    MongoClient mongoClient = new MongoClient();
    DB db = mongoClient.getDB(dbName);
    DBCollection coll = db.getCollection(collectionName);
    DBCursor cursor = coll.find();
    BasicDBObject dbo = (BasicDBObject) coll.findOne();
    Set<String> keys = dbo.keySet();
    String[] skeys = new String[keys.size()];
    Iterator it = keys.iterator();
    it.next(); // skip the id
    while (it.hasNext()) {
      skeys[idx++] = it.next().toString();
    }
    return new mongo_table( skeys, cursor );
  }

The mongo_table constructor just initializes itself and sets the enumeration to iterate over. The class I extend from is very useful - it can iterate over Java Enumeration, Iterator, Iterable, or array objects. The super class initializes the column names, and setEnumeration defines the collection/iterator, which in this case is a MongoDB DBCursor that happens to be a Java Iterator<DBObject>.

  public mongo_table(String[] column_names, DBCursor cursor)
    throws SQLException
  {
    super( column_names );
    setEnumeration( cursor );
  }

This approach can be used for sourcing pretty much anything, which is great for integration needs. The ODI Knowledge Module is an LKM and stages the result of the table function into a work table; from there everything else is as normal. The KM creates the work table and also registers the table function with JavaDB/Derby. My code for the function registration is as follows;

  create function <%=odiRef.getSrcTablesList("","[TABLE_NAME]", "","")%>( dbName varchar( 330), collName varchar( 30))
  returns table
  (
  <%=odiRef.getSrcColList("","[COL_NAME] [SOURCE_CRE_DT]","[COL_NAME] [SOURCE_CRE_DT]",",\n","")%> )
  language java
  parameter style DERBY_JDBC_RESULT_SET
  no sql
  external name '<%=odiRef.getOption("TABLE_FUNCTION_NAME")%>'
This creates the table function with the same name as the datastore in the interface, and the resultant table of the function has the columns (and types) from that datastore. The external JavaDB function name is taken from the KM option TABLE_FUNCTION_NAME. As I mentioned, I have hard-wired 2 parameters just now. The Java code implementing this should be created and put in a JAR in the normal userlib directory used for adding custom code, including JDBC drivers. The other JARs needed are the MongoDB Java SDK jar, derby.jar and vtis-example.jar (from the zip here). You can get the Java source for mongo_table.java here; it is compiled with the MongoDB Java SDK on the classpath as follows (on Windows).

  javac -classpath mongo-2.10.1.jar;vtis-example.jar mongo_table.java
  jar cvf mongo_table.jar mongo_table.class

The LKM is here; it needs to be imported into your project.

Anyway... this wasn't all about MongoDB per se, it was also about the JavaDB table function capability. Do any other examples spring to mind of integration capabilities using this route? I am going to post about loading into MongoDB and how an IKM is built for this. Interested to hear any ideas/feedback from you on this... so don't be shy!

Thursday Jan 10, 2013

ODI - MongoDB and Java Table Functions Anyone?

Let's see how we could integrate MongoDB using ODI, but first take a step back. Derby/JavaDB introduced table functions a few years ago. Table functions are really useful - they are in the Oracle database and just as much fun in JavaDB! ODI is a great platform for integration, and JavaDB table functions provide a really nice way to integrate arbitrary Java APIs into your designs. What I have done here is;

  • built a very simple java table function to project a table to represent the documents in a MongoDB collection. The collection is passed as a parameter to the KM and the column names are the keys for the MongoDB document. The data comes from the values.
  • built a very simple LKM from a Java table function to SQL

All of this will use JavaDB in-memory, so no admin, simple to use. Rick Hillegas wrote a nice article with some handy classes that I have used. The mongo_table class I have written uses the EnumeratorTableFunction class included in Rick's examples. The MongoDB DBCursor class is a Java Iterator, which makes it really nice to pass to the EnumeratorTableFunction class and let it do all of the work.

The LKM I constructed declares the table function to JavaDB/Derby. For example, below the function is declared based on the source datastore name (MONGO_EMPS) and columns (my source datastore has EMPNO, ENAME and SAL; note the table function will actually project the types defined in ODI's datastore), and the function has the MongoDB database name and collection name as parameters.

  create function MONGO_EMPS( dbName varchar( 330), collectionName varchar( 30))
  returns table
  (
    EMPNO VARCHAR(20),
    ENAME VARCHAR(30),
    SAL NUMERIC(10)
  )
  language java
  parameter style DERBY_JDBC_RESULT_SET
  no sql
  external name 'mongo_table.readCollection'

Then the actual code to use the function as a source is executed from a source task (the target is SQL as I mentioned earlier for the LKM). Below you can see my execution using the test MongoDB and the myStuff collection;

  select
    MON.EMPNO   C1_EMPNO,
    MON.ENAME   C2_ENAME,
    MON.SAL     C6_SAL
  from table(MONGO_EMPS('test', 'myStuff' )) MON
  where (1=1)
  And (MON.SAL > 4000)

Note I can also perform some filtering as an example; here it is being done in JavaDB, and in my case it's in-memory. No setup, no persistence, just on-the-fly Java. Ideally I would push the filter down to MongoDB rather than reading and filtering in the driver - more on that later.

I had defined my documents in MongoDB using the following basic commands in the mongo shell;

  use test
  a1 = { EMPNO: "1", ENAME : "Fred", SAL : 10000 }
  a2 = { EMPNO: "2", ENAME : "John", SAL : 2000 }
  db.myStuff.insert( a1 )
  db.myStuff.insert( a2 )

In ODI, I can simply then use the datastore representing the MongoDB collection of documents in an interface and map it to my target;

The physical design uses the LKM JavaTableFunction to SQL KM and sets the MongoDB database, collection and the Java table function name.

That's it. Pretty straightforward, and we are reading and consuming MongoDB documents. So what about complex document types like this? These are transported too - more to come. The other point here is that this is a generic LKM into which you can plug other arbitrary table functions - so you can consume from any API, it's very simple. For the LKM I created, I just defined 2 parameters (because my table function only had two), but really we need a better way to handle this and ensure they are ignored if not defined, etc. That's all polishing though; anyway, fun stuff you can see all with a small piece of code leveraging JavaDB!

Wednesday Jan 02, 2013

ODI - Hive and NoSQL, the code

This post includes the Java client demonstration code used in the Hive and NoSQL post illustrated here. The BasicBigData.java code is a NoSQL client which populates a key-value store that is queryable using the Hive external table from that post. It didn't take long to code - just a few peeks at the NoSQL javadoc to get it going. You can take this Java code and compile and run it (instructions for compiling are similar to the verification demo here - it is very easy).

The Java code uses the NoSQL major/minor path constructor to describe the Key; below is a snippet to define the birthdate for Bob Smith;

  ArrayList<String> mjc1 = new ArrayList<String>();
  mjc1.add("Smith");
  mjc1.add("Bob");
  ...
  ArrayList<String> mnrb = new ArrayList<String>();
  mnrb.add("birthdate");
  ...
  store.put(Key.createKey(mjc1,mnrb),Value.createValue("05/02/1975".getBytes()));
  ...

In the referenced post, to actually aggregate the key values, we used the Hive collect_set aggregation function (see here for Hive aggregation functions). The collect_set aggregation function returns a set of objects with duplicates eliminated. To get the aggregation function behavior in ODI with the correct group by, we must tell ODI about the Hive aggregation function. We can define a new language element for collect_set in the Topology tree, define the element as a group function, and also define the expression for Hive under the Implementation tab;

We are then able to define expressions which reference this aggregation function and get the exact syntax defined in the earlier post. Below we see the Hive expressions using collect_set;

From this design and the definition of the aggregation function in ODI, when it's executed you can see the generated HiveQL with the correct columns in the grouping function;

The target Hive datastore in the interface I defined has been loaded with the key values from the NoSQL key store - cool!

Those are a few of the missing pieces which would let you query NoSQL through Hive external tables, hopefully some useful pointers. 

Monday Dec 31, 2012

ODI - Hive and NoSQL

The Hive external table lets us do lots of cool stuff, including processing data from NoSQL. We have seen how custom SerDes are used; Hive storage handlers also provide some cool capabilities. Using the Hive storage handler defined here, an external table can be defined to project data from a NoSQL key-value store. The external table can then be used as a source in ODI - very simple.

The illustration on github has the following data stored in an Oracle NoSQL Database (the key is the lastname/firstname etc.):

  • /Smith/Bob/-/birthdate: 05/02/1975
  • /Smith/Bob/-/phonenumber: 1111-1111
  • /Smith/Bob/-/userid: 1
  • /Smith/Patricia/-/birthdate: 10/25/1967
  • /Smith/Patricia/-/phonenumber: 2222-2222
  • /Smith/Patricia/-/userid: 2
  • /Wong/Bill/-/birthdate: 03/10/1982
  • /Wong/Bill/-/phonenumber: 3333-3333
  • /Wong/Bill/-/userid: 3

Using the Hive external table and the custom storage handler for a key value store, we define a mask to project the data through the external table. 

  ADD JAR /home/oracle/kv/HiveKVStorageHandler.jar;
  CREATE EXTERNAL TABLE MY_KV_TABLE (lastname string, firstname string, birthdate string, phonenumber string, userid string)
        STORED BY 'org.vilcek.hive.kv.KVHiveStorageHandler'
        WITH SERDEPROPERTIES ("kv.major.keys.mapping" = "lastname,firstname", "kv.minor.keys.mapping" = "birthdate,phonenumber,userID")
        TBLPROPERTIES ("kv.host.port" = "localhost:5000", "kv.name" = "kvstore");

There are a few interesting properties here;

  • we specify the key-value store using TBLPROPERTIES, identifying the host/port and the KV store name (kvstore).
  • the SerDe properties contain the mapping of the keys to column names; you will get a row for each value of birthdate, phonenumber, userID.

Fairly straightforward. We can then reverse engineer this into ODI, using the same mechanism as I described in previous posts here setting the ODI_HIVE_SESSION_JARS and so forth. The data projected looks like this;

  hive> SELECT * FROM MY_KV_TABLE;
  OK
  Smith   Patricia   10/25/1967   NULL        NULL
  Smith   Patricia   NULL         2222-2222   NULL
  Smith   Patricia   NULL         NULL        2
  Smith   Bob        05/02/1975   NULL        NULL
  Smith   Bob        NULL         1111-1111   NULL
  Smith   Bob        NULL         NULL        1
  Wong    Bill       03/10/1982   NULL        NULL
  Wong    Bill       NULL         3333-3333   NULL
  Wong    Bill       NULL         NULL        3

In ODI, by defining the Hive collect_set function as an aggregation function, we can then aggregate the data and pivot it to get one row per key;

  SELECT lastname, firstname, collect_set(birthdate)[0], collect_set(phonenumber)[0], collect_set(userid)[0]
        FROM MY_KV_TABLE
        GROUP BY lastname, firstname;
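Run against the sample data above, that pivot collapses the nine sparse rows into one row per person, along these lines (row order may vary):

  Smith   Bob        05/02/1975   1111-1111   1
  Smith   Patricia   10/25/1967   2222-2222   2
  Wong    Bill       03/10/1982   3333-3333   3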

So another interesting illustration of external tables in Hive and what they can provide.

Saturday Dec 29, 2012

ODI - Hive External Tables, reverse engineering

Here we see Hive external tables on JSON data files in action and being reverse engineered from Hive into ODI. Carrying on from my earlier post on Hive reverse engineering here, we will tackle external tables with primitive data types. Just like Oracle external tables for accessing data on the filesystem, Hive external tables can access big data files in HDFS. This example uses a JSON SerDe for Hive which was downloaded from here. The external tables in Hive are quite nice! You just specify the HDFS directory, and all files in that directory are projected through the external table definition. You can also specify sophisticated SerDe classes that comply with the Hive SerDe interfaces - these classes can reverse engineer complex types too, which makes them very useful.

Before launching ODI or the agent for this, I added the SerDe JAR file to the environment variable used by ODI; here is what I did on my system;

  • export ODI_HIVE_SESSION_JARS=$ODI_HIVE_SESSION_JARS:/home/oracle/json/hive-json-serde-0.2.jar

I have 2 JSON files with data in my HDFS under /user/oracle/json; I will specify this directory in the external table definition, along with the JSON SerDe. Within my Hive shell I can define the external table, remembering to add the JSON SerDe file (using the ADD JAR command) just before I create the external table.

  ADD JAR /home/oracle/json/hive-json-serde-0.2.jar;
  CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
            field1 string, field2 int, field3 string, field4 double
         )
         ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
         LOCATION '/user/oracle/json';

The data in the first JSON file looks like the following; this was just an example from the Google Code snippets;

  • {"field1":"data1","field2":100,"field3":"more data1","field4":123.001}
  • {"field1":"data2","field2":200,"field3":"more data2","field4":123.002}
  • {"field1":"data3","field2":300,"field3":"more data3","field4":123.003}
  • {"field1":"data4","field2":400,"field3":"more data4","field4":123.004}

I have defined these fields in the external table. Back in ODI, in my model below I am using the customized reverse with the RKM Hive mentioned above. For this reverse I just want one table imported, so I defined the mask to be just my_table. Clicking the 'Reverse Engineer' button on the top left of the panel starts the reverse engineering.

If there are errors, then check the ODI Operator - that's where you will see anything that goes wrong. The table in ODI has just some rudimentary information about the number of columns; all of the external table definition for this is defined in Hive, which is the same as when we reverse engineer an Oracle external table. Once the table is captured in ODI I can view the data;

The custom Hive RKM I used is here - this has some changes to the one shipped with ODI to include external table reverse (I included EXTERNAL_TABLE in the getTables API call), plus support for custom SerDe JARs loaded via the environment variable ODI_HIVE_SESSION_JARS (just like the ODI IKMs load the JARs, the RKM also needs to).

Anyway, that's a quick peek at external tables in Hive and reverse engineering them into ODI. Once the table is captured in ODI you can use it in mappings just like other tables!

Thursday Dec 20, 2012

ODI 11g - Loading More Files Faster

This post shows you how to leverage more of your Oracle system; you can load more data, faster, using this kind of best practice. The LKM here lets you leverage the Oracle database's preprocessor to load MORE data faster. The KM also has the capability to use a regular expression to identify the file or files, so that you can load much, much more, and it sorts the files based on size to comply with the most efficient way of loading into Oracle via the external table. The preprocessor can basically do anything, but is commonly used to uncompress data on the fly - so you can eliminate transferring large uncompressed files around or needlessly uncompressing them! There are various creative blog posts around showing you what some customers have done - some even include the transfer in the preprocessing phase.

I posted the LKM on the java.net site here

You can see the LKM configured below; I have used the preprocessor and a regular expression to load all gzipped files in the directory;

When the code is generated below you can see the PREPROCESSOR directive used, plus you can see that the file names are going to be generated from a Java API call where the regular expression and directory are passed.
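For reference, the PREPROCESSOR directive ends up in the external table's access parameters. A trimmed-down sketch of what such a definition might look like (the table name, columns and directories are placeholders, not the exact KM output):

  create table EXT_SALES_FILES (
    sale_id   number,
    sale_date varchar2(20),
    amount    number
  )
  organization external (
    type ORACLE_LOADER
    default directory DAT_DIR
    access parameters (
      records delimited by newline
      preprocessor EXE_DIR:'gunzipdb.bat'
      fields terminated by ','
    )
    location ('sales_1.gz', 'sales_2.gz')
  )
  reject limit unlimited;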

My gunzipdb.bat script has the following definition, wrapping the gzip utility (note the use of the -d and -c options);

  • @echo off
  • c:\ext_tab_test\gzip -d -c %1

Check out this use of the PREPROCESSOR directive here (External table to load compressed data residing on remote locations), which does some creative work in the preprocessing script. This could do pretty much anything you want, including transforming some complex file format to project through an external table.

Thursday Jul 14, 2011

ODI 11g - Parallel, Pipelined Unload to File

Here we can see how we can leverage table functions as targets in ODI interfaces to do cool stuff! It carries on from my pipelining post. Table functions let you do all kinds of powerful transformations and are a great way of incorporating in-memory custom transformations on sets. The illustration here uses a table function to parallel unload to file; it was initially posted on the OWB blog here.

There is a viewlet right here that walks you through the creation of the interface.

There are controls at the bottom if you need to pause, or step forward/backward. 

 

Tuesday Jul 05, 2011

ODI 11g - Pipelines

Carrying on from the simple, flexible, powerful posting on the code generation capabilities of ODI 11g which illustrated the SQL as a Source option, we can see how additional new code generation components can be defined (which let us capture information about specific transformations) and assembled in a pipeline. This is the kind of info I covered in my ODTUG KScope11 'Pushing the Envelope with ODIEE' talk at Long Beach this year - had a great time meeting everyone there!

The example we will illustrate here uses table functions in the Oracle database (it could be other transformations such as SQL unpivot) – table functions are functions that can be queried like a table and support parallelization along with piping of data, which are critical elements for improving performance and leveraging the capabilities of the server (the same approach can be used for DB2 table functions or Microsoft SQL Server CROSS APPLY style capabilities). The flow can be constructed by piping SQL and transformations (like table functions) together in an assembly-like fashion, building the interfaces modularly.

Below is an illustration of a number of interfaces assembled in a pipeline which can leverage the code generation capabilities in ODI 11g for assembling components.

We have seen from other posts how the interfaces can be organized into flows which are resolved into a single SQL (for example) with sub-queries. In this post you will see how transformations such as table functions can be incorporated into a design and the components assembled. The ODI 11g capability to define the SQL subquery generator within a KM opens the door to this flexibility. Table functions themselves are useful for many things;

  • parallelizing arbitrarily complex (PLSQL, SQL PL) code in a SQL manner
  • eliminating intermediate storage
  • SQL-izing the stored procedure

In the illustration below I have a pipelined table function FX, represented as a temporary target datastore, processing data from datastores T1 and T2. The output of the table function is then joined with T3 and further consumed in the target FX_TAB. The temporary datastore FX (representing the table function) has user defined properties set which indicate the inputs and outputs for FX (A and B are inputs and have expressions, X and Y are outputs).
The image shows the two interfaces side by side: on the left we have a temporary interface where T1 and T2 are joined and provide data for table function FX; on the right, the result of the table function is joined with T3 and written to FX_TAB.


The above illustrates how table functions can be used in a pipelined manner with query inputs. The meaning of the columns in the FX temporary target datastore is inferred from the user defined property settings – this lets us distinguish the inputs from the outputs. Below, in the KM definition (SQL_TableFunction), we can see how the SQL for the transformation type is generated.


Things to note are that the KM supports both cursor and scalar definitions (so a query can be the input to the table function, or alternatively you can provide scalar values such as ODI variables; the option CURSOR_INPUT provides the switch between cursor and scalar). So, importantly, table functions can be used as a data source, as transformation pipes and as targets (since there is a step in the KM for this). A good use case for a target table function is when the results of the SQL are being written in parallel to file, for example.

In my case I used the cursor approach, so the CURSOR_INPUT option for the SQL_TableFunction KM assignment is set to true (the default);


The cursor option in the KM lets us generate the following style for the cursor;

…from TABLE ( <tablefunction> ( CURSOR(SELECT <columns> FROM <tables> where <conditions>) ) )

or for scalar inputs;

…from TABLE ( <tablefunction> ( <variables> ) )
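Putting the cursor style together with the FX example above, the generated source SQL might look roughly like this (the join condition and input expressions are illustrative only):

  select X, Y
  from TABLE(FX(CURSOR(
         select T1.A_EXPR A, T2.B_EXPR B
         from T1, T2
         where T1.ID = T2.ID
       )))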

Simple and flexible. As you can see, we can assemble a flow of ODI interfaces into a series of interconnecting pipes, leveraging ODI's declarative description for the capabilities that are supported well out of the box, and extending it to encompass table functions for incorporating pipelining or other transformation capabilities that you can think of. A topical area of assembly-style problem solving is the divide-and-conquer Map-Reduce style problem; check out Jean-Pierre's example from the Oracle Data Warehousing blog, MapReduce + Oracle = Table Functions, and we can see how this maps to the components we have described above;

each mapper and reducer is an ODI interface.

Where is the sample KM? It's on the Oracle Code Sample site here;

https://www.samplecode.oracle.com/tracker/tracking/linkid/prpl1003?id=CS627

The example above was illustrated using SQL and table functions to chain pipes together; there are other avenues for creating named pipes between interfaces to do much, much more.

Monday Nov 09, 2009

The Benefits of ODI Knowledge Modules: a Template Approach for Better Data Integration

This post assumes that you have some level of familiarity with ODI. The concept of a Knowledge Module is used here assuming that you understand it in the context of ODI. If you need more details on these elements, please refer to the ODI Tutorial for a quick introduction, or to the complete ODI documentation for detailed information.

At the core, ODI knowledge modules are templates of code that will be leveraged for data integration tasks: they pre-define data integration steps that are required to extract, load, stage - if needed - and integrate data.

Several types of Knowledge Modules are available, and are grouped in families for Loading operations (LKMs), Integration operations (IKMs), Data Control operations (CKMs), and more.

For a more detailed description of what a knowledge module is, simply picture the multiple steps required to load data from a flat file into a database. You can connect to the file using JDBC, or leverage the database native loading utility (sqlldr for Oracle, bcp for SQL Server or Sybase, load for db2, etc.). External tables are another alternative for databases that support this feature.
As you use one or the other technique, you may first want to stage the data before loading your actual target table; in other cases, staging will only slow down your data load.

As far as the integration in the target system is concerned, again multiple strategies are available: simple inserts, inserts and updates, upserts, slowly changing dimension... these techniques may be as simple as one step, or be a complex series of commands that must be issued to your database for proper execution.

The Knowledge Modules will basically list these steps so that a developer who needs to repeat the same integration pattern only has to select the appropriate templates, versus re-developing the same logic over and over again.

The immediate benefits of this approach are well known and well documented:
- All developers use the same approach, and code development is consistent across the company, hence guaranteeing the quality of the code
- Productivity is greatly improved, as proven paths are re-used versus being re-developed
- Code improvements and modifications can be centralized and have a much broader impact: optimizations and regulatory changes are done once and inherited by all processes
- Maintenance is greatly simplified

To fully appreciate all the benefits of using Knowledge Modules, there is a lot more that needs to be exposed and understood about the technology. This post is a modest attempt at addressing this need.

GENERATION OF CODE AND TRANSFORMATIONS

Most tools today will offer the ability to generate SQL code (or some other type of code, such as scripts) on your source or target system. As most products come with a transformation engine, they will also generate proprietary code for this engine where data is staged (I'll skip the debate here as to whether a transformation engine is a staging area or not - the point being that code can be generated on either source, "middle-tier" or target).

However, real life requirements are rarely either/or. Often times, it makes sense to leverage all systems to optimize the processing: spread out the load for the transformations, reduce the amount of data to be transferred over the network, process the data where it is versus moving the data around solely for the purpose of transformations.

To achieve this, Data Integration tools must be able to distribute the transformation logic across the different systems.


Only ODI will effectively generate code and transformations on all systems. This feature is only possible thanks to the KM technology.

Beyond the ability to generate code, you have to make sure that the generated code is the best possible code for the selected technology. Too often, tools first generate code that is then translated for the appropriate database. With the KM technology, no translation is required: the generated code was conceived explicitly for a given technology from the start, hence taking advantage of all the specifics of this technology.

And since the KMs are technology specific, there is no limit to what can be leveraged on the databases, including user defined functions or stored procedures.


CODE ADAPTABILITY

Whenever a tool generates code, the most common complaint is that there is very little (if any) control over the generated result. What if a simple modification of the code could provide dramatic performance improvements? Basic examples would include index management, statistics generation, joins management, and a lot more.

The KM technology is open and extensible so that developers have complete control over the code generation process. Beyond the ability to optimize the code, they can extend their solution to define and enforce in-house best practices, and comply with corporate, industry or regulatory requirements. KM modifications are done directly from the developers' graphical interface.

One point that can easily be adapted is whether data have to be materialized throughout the integration process. Some out-of-the-box KMs will explicitly land data in a physical file or tables. Others will avoid I/Os by leveraging pipes instead of files, views and synonyms instead of tables. Again, developers can adapt the behavior to their actual requirements.

EXPANDING THE TOOL TO NEW TECHNOLOGIES

How much time does it take to adapt your code to a new release of your database? How much time does it take to add a new technology altogether? In both cases, KMs will provide a quick and easy answer.

Let us start with the case of a new version of the database. While our engineering teams will release new KMs as quickly as possible to take advantage of the latest releases of any new database, you do not have to wait for them. A new release typically means new parameters for your DDL and DML, as well as new functions for your existing transformations. Adapt the existing KMs with the features you need, and in minutes your code is ready to leverage the latest and greatest of your database.

Likewise, if you ever need to define a new technology that is not listed by ODI (in spite of the already extensive list we provide), simply define the behavior of this technology in the Topology interface, and design technology-specific KMs to take advantage of the specific features of this database. I can guarantee you that 80% of the code you need (at least!) is already available in an existing KM... thus dramatically reducing the amount of effort required to generate code for your own technology.

ARE KM MODIFICATIONS REQUIRED?

I am a strong advocate of the customization of KMs: I like to get the best I can out of what I am given. But oftentimes, good enough is more than enough. I will always remember trying to optimize performance for a customer: we did not know initially what our processing window would be - other than "give us your best possible performance". The first out-of-the-box KM we tried processed the required 30,000,000 records in 20 minutes. Due to IT limitations, we could only leverage lesser systems for the faster KMs... but we still reduced the processing time to 6 minutes for the same volume of data. We started modifying KMs to get even better results, when the customer admitted that we actually had 3 hours for the process to complete... At this point, spending time on KM modifications was clearly not needed anymore.

KMs are meant to give the best possible performance out of the box. But every environment is unique, and assuming that we can have the best possible code for you before knowing your own specific challenges would be an illusion - hence the ability to push the product and the code to the limit.

Another common question is: do you have to leverage both source and target systems as part of your transformations? Clearly, the answer is no. But in most cases, it is crucial to have the flexibility to leverage all systems, versus being cornered into using only one of them. Over time, you will want to reduce the volume of data transferred over the network; you will want to distribute some of your processing... all the more reason to leverage all available engines in your environment.

Do not hesitate to share with us how you extend your KMs!

Screenshots were taken using version 10.1.3.5 of ODI. Actual icons and graphical representations may vary with other versions of ODI.

 

Friday Oct 09, 2009

Did You Know that ODI Automatically Summarizes Data Errors?

Looking for Data Integration at OpenWorld 2009? Click here!

This post assumes that you have some level of familiarity with ODI. The concepts of Interface, Flow and Static Control, as well as Knowledge Module, are used here assuming that you understand them in the context of ODI. If you need more details on these elements, please refer to the ODI Tutorial for a quick introduction, or to the complete ODI documentation for detailed information.


TABLES GENERATED BY ODI TO IDENTIFY ERRORS

If you take advantage of either Flow Control or Static Control in your interfaces, you know that ODI will automatically trap errors for you as you run your interfaces.

When you select the Controls tab of your interface, where you will decide which Knowledge Module will be used to identify the errors, you have an option to drop the Error table and another one to drop the Check table. Have you ever wondered what these are?

Interface Controls Tab

The Error table is the table that will be created by ODI to store all errors trapped by the FLOW_CONTROL and STATIC_CONTROL of your interface. You have probably already used the error table. This table is structured after your target table, along with administrative information needed to re-cycle or re-process the invalid records. It is loaded by ODI with all records that fail to pass the validation of the rules defined on your Target table. This feature is often referred to as a Data Quality Firewall as only the "good" data will make it to the target table.

Once all errors have been identified for a given interface, ODI will summarize them into another table: the Check table. There will be only one such table per data server: all target tables in the server (irrespective of their schema) will share the same summary table. The name of this table is defined by default by the CKMs as SNP_CHECK_TAB.


LOCATION OF THE CHECK TABLE

You will find the check table in the default work schema of your server. To locate this schema, you have to go back to topology, in the Physical Architecture tab. Expand your data server to list the different physical schemas. One of the schemas is your default schema and will be identified by a checkmark on the schema icon (see SALES_DWH in the example below).

Default Schema

When you edit the schema, it has an associated work schema. The work schema associated with your default schema is your default work schema: ODI_TMP in the following example.

Default Work Schema

 

Note that you can change your default schema by selecting/unselecting the default option in the schema definition. But remember that you will always need exactly one default schema for each server.

 

Now that we know where to find this table, let's look at its structure:

  • CATALOG_NAME, SCHEMA_NAME: location of the table that was being loaded (i.e. the target table)
  • RESOURCE_NAME, FULL_RES_NAME: name of the table that was being loaded
  • ERR_TYPE: type of control that was performed (Flow Control or Static Control)
  • ERR_MESS: plain English error message associated with the error
  • CHECK_DATE: date and time of the control
  • ORIGIN: name of the ODI process that identified the errors
  • CONS_NAME: name of the constraint (as defined in the ODI Models) that defines the rule that the record violated
  • CONS_TYPE: type of error (duplicate primary key, invalid reference, conditional check failed, Null Value)
  • ERR_COUNT: number of records identified by the process that failed to pass that specific control rule.

 


A sample of the data available in that summary table is shown below (we split the content in 2 screenshots to make this more readable - this is one and only one table):

Errors Summary Data

Errors Summary Data2

There are many possible uses for this table: decision making in your ODI processes based on the number of errors identified or the type of errors identified, basic reporting on errors trapped by ODI, trend analysis or the evolution of errors over time...
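As a simple illustration (a sketch only - the ODI_TMP work schema comes from the earlier example, and the date filter assumes CHECK_DATE is stored as a date), a query like the following could drive a report or a decision step in a package:

  select ORIGIN, CONS_TYPE, CONS_NAME, sum(ERR_COUNT) as TOTAL_ERRORS
  from ODI_TMP.SNP_CHECK_TAB
  where CHECK_DATE >= sysdate - 7
  group by ORIGIN, CONS_TYPE, CONS_NAME
  order by TOTAL_ERRORS desc;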

Do not hesitate to share with us how you leverage this table!

Screenshots were taken using version 10.1.3.5 of ODI. Actual icons and graphical representations may vary with other versions of ODI.

 

Thursday Oct 01, 2009

Creating a New Knowledge Module for Sample Data Sets Generation

Looking for Data Integration at OpenWorld 2009? Click here!

The posts in this series assume that you have some level of familiarity with ODI. The concepts of Interface and Knowledge Module are used here assuming that you understand them in the context of ODI. If you need more details on these elements, please refer to the ODI Tutorial for a quick introduction, or to the complete ODI documentation for detailed information.
In particular, to learn more on Knowledge Modules, I strongly recommend the Knowledge Module Developer's Guide - Fundamentals that comes with the product. You will have to download and install ODI to access this document in the Documentation Library.

This post will look into "when" and "how" to create a knowledge module. Then it will walk through some of the choices that can be made when designing a Knowledge Module.

To illustrate the descriptions, we are working on an example described previously in this post.


1. WHAT BELONGS TO A KNOWLEDGE MODULE?

The first element to look into is what parts of the logic of your code are reusable. What you are typically looking for are the following:

  • Sequences of steps that are repeated commonly, even though some steps may be optional. For instance: creation of a staging table, creation of a script or parameter file for a utility, invoking an external program, extraction of data from a database, etc.
  • For each step, identification of the non-variable parts vs. the variable parts. For instance, in a select statement, the body of the code remains the same. In the following example, the elements in brackets are variables, the others are fixed:
    • Insert into [TableName] ([ListOfColumns]) select ([ListOfColumns and Expressions]) from [List of Tables] where [conditions]

  • For your module to be re-usable, you want to make sure that no information that physically ties your code to any particular system or table structure is left in the code. The idea behind the KMs is that as developers build their transformations and mappings, ODI will "fill in the blanks" with the appropriate data.

The easiest way to get started with a knowledge module is actually to take an existing one and modify it. As the syntax has already been validated in existing KMs, the amount of work required to produce valid code will be greatly reduced.
In most cases, column names and mapping expressions do not belong in a knowledge module. The exception would be administrative columns that you add as part of the logic of your KM. For instance, most Incremental Update knowledge modules that ship with ODI create an IND_UPDATE column to differentiate records that will be updated from those that will be inserted. These columns definitely belong in the code of the KM.

 

Likewise, you may want to create your own tables (administrative tables, audit tables, etc.) with a very static name. These can be created by the Knowledge Module. But in general, it is better to dynamically generate the table name after the table being loaded, to prevent multiple processes running in parallel from trying to use the same intermediate table.


2. DO I HAVE A CASE FOR A KNOWLEDGE MODULE?

Any technique used to extract data out of a database (or file, or messaging system, or web service for that matter) can be a good opportunity to create a new KM. The same is true for loading techniques and integration techniques: inserts, updates, slowly changing dimension, etc.

In the scenario that we are contemplating, we want to insert data (albeit random data) into a table, so we probably have a good case for a knowledge module.

The first step is usually to look for available techniques, try the code independently of any knowledge module, and check out how it behaves: how is performance? How does the code behave when data volume grows? You want to make sure that the code you will integrate as a template is as good as it can be before you share it with the entire corporation!

Typically, extracting from a source system to stage data is done in an LKM. Loading data into a target table is done with an IKM. In our case, we will clearly create an IKM.


3. RANDOM DATA GENERATION: CODE GENESIS

For our example, we will start with a KM that works exclusively for Oracle databases. Adaptations of the code will be possible later on to make similar processes run on other databases.

The Oracle database provides a fantastic feature that we can leverage to generate a large number of records: GROUP BY CUBE, which returns all the possible permutations of the selected columns. So the following code:

select NULL from dual group by cube(1,1,1)

returns 8 records (2 to the power 3). Add columns to the list, and the number of records grows exponentially.

Now when I played with this function on my (very) little installation of the database, I seemed to hit a limit for (very) large permutation numbers. I have to admit that I am not using the function in its expected fashion, so I cannot really complain. But at least I can easily generate 1024 records (2 to the power 10). Now from a usability perspective, I do not really want to use that number for the users of my KM (1024 has a geeky flavor to it, doesn't it?). How about generating a table with just 1,000 records?

The following code will do the trick:

select NULL
from (select NULL from dual group by cube(1,1,1,1,1,1,1,1,1,1))
where rownum <= 1000

Note that so far, all the instructions we have are hard-coded. We still do not have anything that would be dynamic in nature.

Now we need to use the above query to create some table with our 1,000 records. Again, we can hard-code the table name - but this does not make for very portable code. In particular, from one environment to the next, the database schema names will vary. We have three options to create our staging table, from the least portable to the most portable:

  • Hardcoded table name and schema name: myschema.SEED
  • Dynamic schema name, hardcoded table name: let ODI retrieve the proper schema name through a substitution method and update the code at execution time (generated code: myschema.SEED)
  • Fully dynamic table name and schema name (usually, dynamic tables are named after the target table with some sort of extension such as _SEED; generated code: if you are loading TRG_CUSTOMERS, then the seed table name is myschema.TRG_CUSTOMERS_SEED)

Best practice is of course to use the last one of these options to allow for multiple processes to run in parallel. To keep our explanations simple, we will use the second option above - but keep in mind that best practice would be to use the fully dynamic one.

 

As we will use our KM over and over, it is important to make the developer's life easy. Steps have to be included here to create our seeding table, drop it when we are done, and make sure before we create it that it is not there from a previous run that could have failed.
The typical sequence of steps for a KM creating any type of staging table is:

  • Drop table (and ignore errors - if there is no table, we are fine)
  • Create table (re-create it to reflect any possible meta-data changes)
  • Load the table with staging data - in our case a sequence of numbers that we will be able to leverage later on for filtering... (please be patient: we will come back to this). Here a rownum will do the trick...
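A sketch of what these three steps could look like in Oracle SQL (the SEED table name follows the simplified second option above; in the KM itself the schema would come from an ODI substitution method):

  -- step 1: drop the staging table, ignoring "table does not exist" errors
  drop table myschema.SEED;

  -- step 2: re-create it with a single numeric column
  create table myschema.SEED (SEED_ID number);

  -- step 3: load 1,000 rows, numbering them so they can be used for filtering later
  insert into myschema.SEED (SEED_ID)
  select rownum
  from (select NULL from dual group by cube(1,1,1,1,1,1,1,1,1,1))
  where rownum <= 1000;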

Now that we have the code to insert data into our staging table, we can put all the pieces together and have the first three steps of our knowledge module. Keep in mind that you have to be consistent from one step to the next as you name your table. The actual knowledge module with all the matching code is available here (look for KM_IKM Oracle - Build Sample Data - Gen II.xml).

 


4. COMPLETING THE CODE

So far our table only has 1,000 records. Not much in terms of volume. But all we need now to create a table with 1,000,000 records... is a Cartesian product (you know, the one thing your mother told you NOT to do with a database? It comes in very handy here!):

insert into [Target] ([Columns]) select * from SEED S1, SEED S2

And if we want to return fewer records, all we have to do is filter on the S2 table. For instance the clause:

where S2.SEED_ID<=10

will return 10,000 records! Remember when we stored rownums in this table earlier? This is where it becomes very handy...

So far the only thing we have done is to generate a fairly large number of records. Where the exercise becomes even more interesting is if we can generate data for each record that matches our requirements for sample data. In a previous post we have seen how to generate User Functions in ODI to abstract this type of random generation. The code for the sample data generation typically does not belong to the Knowledge Module as it would not give us enough flexibility for all the possible combinations out there.

The User Function examples used before can generate numerics and strings. We could expand their design to generate random phone numbers, or random social security numbers... and work with random data that will now look like real data, instead of exposing sensitive information.
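Putting the pieces together, the statement the KM generates (before ODI substitutes the RandomString user function described in the user-function post with its DBMS_RANDOM implementation) might look something like this - the target table and its columns are illustrative only:

  insert into myschema.TRG_CUSTOMERS (CUST_NAME, CUST_PHONE)
  select RandomString('U', 5, 30), RandomString('U', 10, 10)
  from myschema.SEED S1, myschema.SEED S2
  where S2.SEED_ID <= 10;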

5. ADDING FLEXIBILITY TO THE KNOWLEDGE MODULE

The fact that User Functions do not belong in the code of the Knowledge Module does not mean that there is no flexibility in the Knowledge Modules. Here, we are building a sample table out of thin air: it is very possible that the table does not exist in the first place. Or if we want to run multiple tests with different numbers of records each time, we may want to truncate the table before each new run.

KM options are a very simple way to toggle such behaviors in a KM. We can create 2 options: CREATE_TABLE and TRUNCATE. Then create the appropriate steps in our KM:

  • Create Target Table
  • Truncate table

When you want to associate an option to a given step, edit the step itself; then click on the "option" tab and un-check "always execute". Select only the appropriate option in the list and click OK...

 


As we define the options, it is also good to think of the most common usage for the KM. In our case, chances are we will often want to create the table and truncate it for successive runs: we can then define that these steps will be executed by default (set the default for the options to "Yes"). More conventional KMs would typically have these options, but their defaults would be set to "No".

We now have a complete Knowledge Module that can be used to generate between 1,000 and 1,000,000 records in any table of your choice, complete with options that will let the users of the KM adapt the behavior to their actual needs...

Again, if you want to review all the code in details, it is available here (look for KM_IKM Oracle - Build Sample Data - Gen II.xml).

Enjoy!

Screenshots were taken using version 10.1.3.5 of ODI. Actual icons and graphical representations may vary with other versions of ODI.

Sunday Sep 20, 2009

ODI User Functions: A Case Study

Looking for Data Integration at OpenWorld 2009? Check out here!

The posts in this series assume that you have some level of familiarity with ODI. The concepts of Interface and User Function are used here assuming that you understand them in the context of ODI. If you need more details on these elements, please refer to the ODI Tutorial for a quick introduction, or to the complete ODI documentation for detailed information.

This post will give some examples of how and where user functions can be used in ODI. We will look back at a previous post and see why and how user functions were created for this case.

1. A CASE FOR USER FUNCTIONS

As I was trying to design a simple way to generate random data, I needed a simple approach to specify the type of data to be generated. An example would be to easily generate a random string.

Working on Oracle for this example, I could leverage the database function DBMS_RANDOM.STRING. It takes 2 parameters: one for the type of characters (uppercase, lowercase, mixed case), and one for the length of the string. But I want a little more than this: I also want the ability to force my generation to have a minimum number of characters. For this, I now need to generate a random number. The database function DBMS_RANDOM.VALUE does this, but returns a decimal value. Well, the TRUNC function can take care of this... but my mapping expression becomes somewhat complex:

DBMS_RANDOM.STRING(Format, TRUNC(DBMS_RANDOM.VALUE(MinLen,MaxLen)))

Imagine now using this formula over and over again in your mappings - not the easiest or most readable piece of code to handle. And maintenance will easily become a nightmare if you simply cut and paste...

A user function makes sense at this point, and will provide the following benefits:


  • A readable name that makes the usage and understanding of the transformation formula a lot easier
  • A central place to maintain the transformation code
  • The ability to share the transformation logic with other developers who do not have to come up with the code for this anymore.

 

From a code generation perspective, if you use the User Function in your interfaces, ODI will replace the function name with the associated code and send that code to the database. Databases will never see the User Function name - the substitution is part of the code generation.
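For example, with the implementation we are about to define, a mapping expression written as:

RandomString('U', 5, 10)

would reach the Oracle database as:

DBMS_RANDOM.STRING('U', TRUNC(DBMS_RANDOM.VALUE(5,10)))

The substitution is textual: ODI pastes the implementation code and replaces each parameter with the value you typed in the mapping.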

2. BUILDING THE USER FUNCTION

You need to provide 3 elements when you build a user function:


  • A name (and a group name to organize the functions)
  • A syntax
  • The actual code to be generated when developers use the function

 


2.1 Naming the User Function

The name of the user function is yours to choose, but make sure that it properly describes what this function is doing: this will make the function all the more usable for others. You will also notice a drop down menu that will let you select a group for this function. If you want to create a new group, you can directly type the group name in the drop down itself.

For our example, we will name the user Function RandomString and create a new group called RandomGenerators.

User Function Name


2.2 The User Function Syntax

The next step will be to define the syntax for the function. Parameters are defined with a leading dollar sign ($) and their names are enclosed in parentheses, as in $(ParameterName). You can name your parameters as you want: these names will be used when you put together the code for the user function, along with the $ and (). You can have no parameters or as many parameters as you want...

In our case, we need 3 parameters: One for the string format, one for the minimum length of the generated string, one for the maximum length. Our syntax will hence be:

RandomString($(Format), $(MinLen), $(MaxLen))

If you want to force the data types for the parameters, you can do so by adding the appropriate character after the parameter name: s for String, n for Numeric and d for Date. In that case our User Function syntax would be:

RandomString($(Format)s, $(MinLen)n, $(MaxLen)n)

User Function Syntax


2.3 Code of the User Function

The last part in the definition of the user function is to define the associated SQL code. To do this, click on the Implementation tab and click the Add button. A window will pop up with two parts: the upper part is for the code per se, the bottom part is for the selection of the technologies where this syntax can be used. Whenever you use the User Function in your mappings, if the underlying technology is one of those you have selected, ODI will substitute the function name with the code you enter here.

For our example, select Oracle in the list of Linked Technologies and type the following code in the upper part:

DBMS_RANDOM.STRING($(Format), TRUNC(DBMS_RANDOM.VALUE($(MinLen),$(MaxLen))))

Note that here we are replacing the column names with the names of the parameters we defined previously in the syntax field.

User Function Code

Click on the Ok button to save your syntax and on the Ok or Apply button to save your User Function.

3. EXPANDING THE USER FUNCTIONS TO OTHER TECHNOLOGIES

Once we are done with the previous step, the user function is ready to be used. One downside with our design so far: it can only be used on one technology. Chances are we will need more flexibility.

One key feature of User Functions is that they give you the opportunity to enter the matching syntax for any other database of your choice. No matter which technology you use later on, you will only have to provide the name of the user function, irrespective of the underlying technology.

To add other implementations and technologies, simply go back to the Implementation tab of the User Function and click the Add button. Select the technology for which you are defining the code, and add the code.

Note that you can select multiple technologies for any given code implementation: these technologies will then be listed next to one another.
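As an illustration of a second implementation (not part of the examples in this post), a simpler companion function such as RandomNumber($(Min), $(Max)) could be given a Microsoft SQL Server implementation along these lines:

-- a common SQL Server idiom for a per-row random integer between Min and Max
ABS(CHECKSUM(NEWID())) % ($(Max) - $(Min) + 1) + $(Min)

The interfaces themselves keep calling RandomNumber(...); only the list of implementations behind the User Function grows.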

4. USING THE USER FUNCTIONS IN INTERFACES

To use the User Functions in your interfaces, simply enter them in your mappings, filters, joins and constraints the same way you would use SQL code.

Interface Mappings

When you execute your interface, the generated code can be reviewed in the Operator interface. Note that you should never see the function names in the generated code. If you do, check the following elements:


  • User Function names are case sensitive. Make sure that you are using the appropriate combination of uppercase and lowercase characters
  • Make sure that you are using the appropriate number of parameters, and that they have the appropriate type (string, number or date)
  • Make sure that there is a definition for the User Function for the technology in which it is running. This last case may be the easiest one to overlook, so try to keep it in mind!

As long as you see SQL code in place of the User Function name, the substitution happened successfully.

 

Enjoy!

All Screenshots were taken using version 10.1.3.5 of ODI. Actual icons and graphical representations may vary with other versions of ODI.

Wednesday Sep 09, 2009

Generating Sample Data with ODI: A Case Study For Knowledge Modules and User Functions

Looking for Data Integration at OpenWorld 2009? Look no further: all you need is here!

The posts in this series assume that you have some level of familiarity with ODI. The concepts of Interface, Model, Knowledge Module and User Function are used here assuming that you understand them in the context of ODI. If you need more details on these elements, please refer to the ODI Tutorial for a quick introduction, or to the complete ODI documentation for detailed information.


We've all been there: we start coding, waiting for a set of sample data to be available. We move along with the code... and the data is not available. Or we need to build a small (or not so small) data set quickly. Sure, we all have sample databases left and right for that purpose. But recently I was looking for a decent size data set for some tests (more than the traditional 30 sample records) and could not put my hands on what I needed. What the heck: why not have ODI build this for me?

The techniques that we will leverage for this are the following:

  • Creation of a temporary interface to create the sample table (See this previous post for details on how to create a temporary interface)
  • Creation of a new knowledge module to generate enough records in the new table
  • Creation of ODI User Functions to simplify the generation of random values

 


All the objects mentioned in this article can be downloaded. Save this XML file if you want to import into your repository a project that already contains all the objects (IKM and User Functions). Click here if you want to download a file that will let you import the different objects individually; in the latter case, you will have to unzip the file before importing the objects.

The samples provided here have all been designed for an Oracle database, but can be modified and adapted for other technologies.

Today we will discuss the different elements that allow us to generate the sample data set. In future posts, we will dissect the Knowledge Modules and User Functions to see what technological choices were made based on the different challenges that had to be solved.

1. THE INTERFACE

For more details on how to create a temporary interface, you can refer to this post. For our example, we will create a new table in an existing schema. When you create your temporary interface, remember to set the following elements:

  • Select your staging area (in the Definition tab of the interface)
  • Name your target table
  • Select the location of your target table (work schema / data schema)
  • Name the Columns, and set their individual data type and length

 


Interface Definition Tab

For our example, we will use a fairly simple table structure (equivalent DDL is sketched below):

TABLE NAME: SAMPLER

COLUMNS:
SAMPLER_ID number(3)
SAMPLER_NAME varchar2(30)
SAMPLER_PROMOTION varchar2(10)
SAMPLER_PRICE number(10,2)
SAMPLER_RELEASE_DATE date
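
The table created by the interface will be equivalent to the DDL below (shown here only for reference - the create statement itself is generated by the IKM):

create table SAMPLER
(
SAMPLER_ID number(3),
SAMPLER_NAME varchar2(30),
SAMPLER_PROMOTION varchar2(10),
SAMPLER_PRICE number(10,2),
SAMPLER_RELEASE_DATE date
)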

 

Sampler Interface Table Creation

2. USER FUNCTIONS

The Oracle database comes with a package called DBMS_RANDOM. Other random generators can be used (DBMS_CRYPTO, for instance, has random generation functions as well). These functions take more or fewer parameters, and if we realize after creating dozens of mappings that using the "other" package would have been better... we would be in a lot of trouble. Creating user functions will allow us to:

  • Have a naming convention that is simplified
  • Limit the number of parameters
  • Limit the complexity of the code
  • Later maintain the code independently of our interfaces, in a centralized location: if we decide to change the code entirely, we will make modifications in one single place - no matter how often we use that function (see the sketch after this list).
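For instance, if RandomBool() is first implemented on Oracle with DBMS_RANDOM and we later decide to switch to DBMS_CRYPTO (assuming execute privileges on that package), only the User Function implementation changes - a sketch:

-- possible initial implementation
TRUNC(DBMS_RANDOM.VALUE(0, 2))

-- possible replacement; every interface using RandomBool() picks it up at the next code generation
ABS(MOD(DBMS_CRYPTO.RANDOMINTEGER, 2))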

 


For our example, we will use 5 ODI User Functions (again, these can be downloaded here); possible Oracle implementations for two of them are sketched after the list:

  • RandomDecimal(Min, Max): generates a random value (with decimals) between the Min and Max values
  • RandomNumber(Min, Max): generates a random value (without decimals) between the Min and Max values
  • RandomBool(): generate a 0 or a 1
  • RandomDate(MinDate, MaxDate): returns a date between MinDate and MaxDate (make sure MinDate and MaxDate are valid dates for Oracle)
  • RandomString(Format, Min, Max): generates a random string with a minimum of Min characters and a maximum of Max characters. Valid formats are:
    • 'u', 'U' - returning string in uppercase alpha characters
    • 'l', 'L' - returning string in lowercase alpha characters
    • 'a', 'A' - returning string in mixed case alpha characters
    • 'x', 'X' - returning string in uppercase alpha-numeric characters
    • 'p', 'P' - returning string in any printable characters.
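As an illustration, the first two functions could be implemented on Oracle as follows (the definitions in the downloadable file may differ slightly):

-- RandomDecimal($(Min), $(Max)): DBMS_RANDOM.VALUE already returns decimals
DBMS_RANDOM.VALUE($(Min), $(Max))

-- RandomNumber($(Min), $(Max)): truncate to drop the decimals
TRUNC(DBMS_RANDOM.VALUE($(Min), $(Max) + 1))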

 


Sampler User Functions

We can either use these functions as is or as part of a more complex code logic, such as a case...when statement.

For our example, we will build the following mappings (the generated code they expand to is sketched after this list):

Column                  Mapping
SAMPLER_ID              RandomNumber(1, 999)
SAMPLER_NAME            RandomString('A', 1, 30)
SAMPLER_PROMOTION       case when RandomBool()=0 then 'FALSE' else 'TRUE' end
SAMPLER_PRICE           RandomDecimal(1, 10000)
SAMPLER_RELEASE_DATE    RandomDate('01-JAN-2000', sysdate)
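
At code generation time, ODI replaces each User Function with its Oracle implementation; the SAMPLER_PROMOTION mapping, for example, would reach the database roughly as follows (a sketch assuming RandomBool() is implemented with DBMS_RANDOM as suggested earlier):

case when TRUNC(DBMS_RANDOM.VALUE(0, 2)) = 0 then 'FALSE' else 'TRUE' end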


In ODI, the mappings will look like this:

Sampler Interface Mappings

3. THE KNOWLEDGE MODULE

Since we do not have any source table in this interface, we only need an IKM. The IKM provided with this example needs to be imported into your project.

Because the purpose of this KM is to generate sample data, it has a few options whose default values differ from those of the usual KMs:

  • TRUNCATE defaults to 'YES': we assume here that if you re-run the interface, you want to create a new sample. If you only want to add more records to an existing table, simply set this option to 'NO' in your interface.
  • CREATE_TABLE defaults to 'YES': we assume that the table to be loaded does not exist yet. You can turn that option to 'NO' if there is no need to create the table.
  • THOUSANDS_OF_RECORDS: set this to any value between 1 and 1,000 to generate between 1,000 and 1,000,000 records (see the sketch after this list)
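Behind the scenes, the THOUSANDS_OF_RECORDS value typically ends up driving the filter on the Cartesian product, along the lines of the sketch below (a simplification: SEED stands for the 1,000-row staging table, [Columns] and [Mappings] for the names and expressions the KM resolves through the odiRef substitution methods):

insert into <%=odiRef.getTable("L", "TARG_NAME", "A")%> ([Columns])
select [Mappings]
from SEED S1, SEED S2
where S2.SEED_ID <= <%=odiRef.getOption("THOUSANDS_OF_RECORDS")%>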

 

Sampler IKM

Once you have set the values for your KM, you can run the interface and let it generate the random data set.

With the above configuration, and using a standard laptop (dual core 1.86GHz processor and 2 GB of RAM) equipped with Oracle XE, my statistics were as follows:

10,000 records generated in 5 seconds
100,000 records generated in 24 to 35 seconds (about 30 seconds on average)
1,000,000 records generated in 211 to 235 seconds (about 4 minutes on average)

Note that the machine was not dedicated to this process and was running other processes.

Statistics are available in the Operator interface.

Sampler Stats

To review the data loaded by ODI in your target table, simply reverse-engineer this table in a model, then right-click on the table and select View Data to see what was generated!

Sampler Data

4. EXPANDING TO OTHER TECHNOLOGIES

One question: why did I stop here and not try to make this work for other technologies? Well, it turns out that ODI is really meant to move and transform data. As long as I have at least ONE table with random data in any one of my databases, it is faster to just create a regular ODI interface and move the data across... The design will take less than a minute. The data transfer should not take much time either. Who would spend more time coding when the solution is that simple?

But if you want to make this work for other databases, here are your entry points:

  • Duplicate the KM and modify it to use SQL that would work on these other databases
  • Update the user functions to make sure that they use the appropriate functions for the given databases
  • Use the same logic to create your interface

Enjoy!

 


All Screenshots were taken using version 10.1.3.5 of ODI. Actual icons and graphical representations may vary with other versions of ODI.

