Monday Dec 31, 2012

ODI - Hive and NoSQL

The Hive external table let's us do lots of cool stuff including processing data from NoSQL. We have seen how custom SerDes are used, Hive storage handlers also provide some cool capabilities. Using the Hive storage handler defined here, an external table can be defined to project data from a NoSQL key-value store. The external table can then be used as a source in ODI, very simple.

The illustration on github has the following data stored in a Oracle NoSQL Database (the key is the lastname/firstname etc):

  • /Smith/Bob/-/birthdate: 05/02/1975
  • /Smith/Bob/-/phonenumber: 1111-1111
  • /Smith/Bob/-/userid: 1
  • /Smith/Patricia/-/birthdate: 10/25/1967
  • /Smith/Patricia/-/phonenumber: 2222-2222
  • /Smith/Patricia/-/userid: 2
  • /Wong/Bill/-/birthdate: 03/10/1982
  • /Wong/Bill/-/phonenumber: 3333-3333
  • /Wong/Bill/-/userid: 3

Using the Hive external table and the custom storage handler for a key value store, we define a mask to project the data through the external table. 

  1. ADD JAR /home/oracle/kv/HiveKVStorageHandler.jar;
  2. CREATE EXTERNAL TABLE MY_KV_TABLE (lastname string, firstname string, birthdate string, phonenumber string, userid string)
  3.       STORED BY 'org.vilcek.hive.kv.KVHiveStorageHandler'
  4.       WITH SERDEPROPERTIES ("kv.major.keys.mapping" = "lastname,firstname", "kv.minor.keys.mapping" = "birthdate,phonenumber,userID")
  5.       TBLPROPERTIES ("" = "localhost:5000", "" = "kvstore");

There are a few interesting properties here;

  • we specify the keyvalue store using TBLPROPERTIES, identify the host/port and the keystore name (kvstore).
  • the SerDe properties contains the mapping of the keys to column names, you will get a row for each value of birthdate, phonenumber, userID

Fairly straightforward. We can then reverse engineer this into ODI, using the same mechanism as I described in previous posts here setting the ODI_HIVE_SESSION_JARS and so forth. The data projected looks like this;

  2. OK
  3. Smith     Patricia     10/25/1967     NULL NULL
  4. Smith Patricia NULL 2222-2222     NULL
  5. Smith Patricia NULL NULL 2
  6. Smith Bob 05/02/1975 NULL NULL
  7. Smith Bob NULL 1111-1111 NULL
  8. Smith Bob NULL NULL 1
  9. Wong Bill 03/10/1982 NULL NULL
  10. Wong Bill NULL 3333-3333 NULL
  11. Wong Bill NULL NULL 3

In ODI by defining the Hive collect_set function as an aggregation function, we can then aggregate the data and pivot the data to get it as a row;

  1. SELECT lastname, firstname, collect_set(birthdate)[0], collect_set(phonenumber)[0], collect_set(userid)[0]
  2.       FROM MY_KV_TABLE
  3.       GROUP BY lastname, firstname;

So another interesting illustration of external tables in Hive and what they can provide.

Saturday Dec 29, 2012

ODI - Hive and Complex JSON

Time for a nested JSON example using Hive external tables. ODI treats nested complex types in Hive the same way it treats types in other technologies such as Oracle, the type name is captured, not the definition - you can see XMLType or SDO_GEOMETRY as an example within the ODI Oracle technology. The Hive technology in ODI also has the ARRAY, STRUCT and MAP types seeded. For the nested JSON example I will use the example define in this JSON SerDe page here.

The external table definition is below, I have defined this in Hive and reverse engineered into ODI just like the previous post. Note it is using a different SerDe from the post here, when I tried using that Google SerDe the data was not projected properly (all data was projected as beware of components used). Just like the previous post we need to add the jar whenever it is used (plus don't forget the ODI_HIVE_SESSION_JARS), either from Hive or ODI;

  1. ADD JAR /home/oracle/json/hive-serde-1.0.jar;
  3.   messageid string,
  4.   messagesize int,
  5.   sender string,
  6.   recipients array<string>,
  7.   messageparts array<struct<
  8.     extension: string,
  9.     size: int
  10.   >>,
  11.   headers map<string,string>
  12. )
  13. ROW FORMAT SERDE 'com.proofpoint.hive.serde.JsonSerde'
  14. LOCATION '/user/oracle/json_complex';

This external table has ARRAY fields, STRUCT fields and MAP fields, so we are going above and beyond simple types. The data I will use is the same as the referenced web page;

  1. {
  2.   "messageId": "34dd0d3c-f53b-11e0-ac12-d3e782dff199",
  3.   "messageSize": 12345,
  4.   "sender": "",
  5.   "recipients": ["", ""],
  6.   "messageParts": [
  7.     {
  8.       "extension": "pdf",
  9.       "size": 4567
  10.     },
  11.     {
  12.       "extension": "jpg",
  13.       "size": 9451
  14.     }
  15.   ],
  16.   "headers": {
  17.     "Received-SPF": "pass",
  18.     "X-Broadcast-Id": "9876"
  19.   }
  20. }

Again, I will use the Hive RKM that I mentioned in the post here in order to reverse engineer the external table defined in Hive into ODI. Below you can see the table and how it is represented in ODI. You can see the recipients and messageparts columns are defined as ARRAY, and headers is a MAP.

We can view the data in ODI just like other tables - remember the external table is projected the JSON data from HDFS through Hive;

When the datastore is used in an interface these are the columns that are available in mapping, you can use any of the Hive functions or operators available on these complex types, plus constructors for building them. For example to retrieve the 1st element in the array the following Hive QL can be used;

  • select sender, recipients[0] from message;

 You can build such expressions in ODI;

You just need to be aware of the datatypes you are using and the functions available. Haven't gone into complex SQL/Hive QL here, but you see the basic mechanics are pretty straightforward. One of the points that comes home here is the functionality level of the non-core pieces of Hadoop, so the first 2 JSON SerDe libraries I have used support different capabilities of JSON - so be aware.

ODI - Hive External Tables, reverse engineering

Here we see Hive external tables on JSON data files in action and being reversed engineered from Hive into ODI. Carrying on from my earlier post on Hive reverse engineering here we will tackle external tables with primitive data types. Just like Oracle external tables for accessing data on the filesystem, Hive external tables can access big data files in HDFS. This example uses a JSON SerDe for Hive which was downloaded from here. The external tables in Hive are quite nice! You just specify the HDFS directory and all files in that directory are projected through the external table definition, also you can specify sophisticated SerDe classes that comply with the Hive SerDe interfaces - these classes can reverse engineer complex types also which make them very useful.

Before launching ODI or the agent for this I added the SerDe JAR file to this environment variable used by ODI, here is what I did in my system;

  • export ODI_HIVE_SESSION_JARS=$ODI_HIVE_SESSION_JARS:/home/oracle/json/hive-json-serde-0.2.jar

I have 2 JSON files with data in my HDFS under /user/oracle/json I will specify this directory in the external table definition, along with the JSON SerDe. Within my Hive shell I can define the external table, remembering to add the JSON SerDe file (using the ADD JAR command) just before I create the external table.

  1. ADD JAR /home/oracle/json/hive-json-serde-0.2.jar;
  3.           field1 string, field2 int, field3 string, field4 double
  4.        )
  5.        ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
  6.        LOCATION '/user/oracle/json';

 The data in the first JSON file looks like the following, this was just an example from the google code snippets;

  • {"field1":"data1","field2":100,"field3":"more data1","field4":123.001}
  • {"field1":"data2","field2":200,"field3":"more data2","field4":123.002}
  • {"field1":"data3","field2":300,"field3":"more data3","field4":123.003}
  • {"field1":"data4","field2":400,"field3":"more data4","field4":123.004}

I have defined these fields in the external table. Back in ODI, in my model below I am using the customized reverse with the RKM Hive mentioned above, also for this reverse I just want one table imported, so I defined the mask just to be my_table. Clicking the 'Reverse Engineer' button on the top left of the panel starts the reverse engineering.

If there are errors, then check the ODI Operator, that's where you will see anything that goes wrong. The table in ODI has just some rudimentary information about the number of columns, all of the external table definition for this is defined in Hive, this is the same as when we reverse engineer an Oracle external table. Once the table is captured in ODI I can view the data;

The custom Hive RKM I used is here - this has some changes to the one shipped with ODI to include external table reverse (I included EXTERNAL_TABLE in the getTables API call), plus support for custom SerDe JARs loaded via the environment variable ODI_HIVE_SESSION_JARS (just like the ODI IKMs load the JARs, the RKM also needs to).

Anyway, that's a quick peek at external tables in Hive and reverse engineering them into ODI. Once the table is captured in ODI you can use it in mappings just like other tables!

Friday Dec 28, 2012

ODI - Basic Hive Queries

Here we will see a basic example joining the movie lens data and then loading a Hive table based on the tables from the Reverse Engineering Hive Tables post. The Hive table was defined and created via ODI, I duplicated the movies table and added a column for the rating, just for demo purposes...

When I build my interface, I add movies for my source and movies_info as my target, the auto mapping completes much of the mapping, the rating (which is not mapped below) comes from another table - this is where ODI's incremental design is nice, I can add in a new datastore as a source and map columns from it, then describe the join. 

 After I have added the movie ratings table, I will define the join just by dragging movie_id from movies to the ratings table movie_id column. That's the join...mostly defined.

The other thing you need to check is that the ordered join property is set. This will generate the ordered join (ANSI style, but using the Hive technology's template) syntax.

 We can also perform transformations using built in or user defined functions, below I am performing the Hive built-in UPPER function on the movie name for example.

In the physical, or flow view I am using the Hive Control Append IKM, I am using ODI to create the target table in Hive and also performing a truncate if it exists. Also have the control flow switched off. 

Executing this is just like any other interface apart from we leverage Hive to perform the heavy lifting. The resultant execution can be inspected in the ODI operator or console and the resultant table inspected when complete. 

ODI - Reverse Engineering Hive Tables

ODI can reverse engineer Hive tables via the standard reverse engineer and also an RKM to reverse engineer tables defined in Hive, this makes it very easy to capture table designs in ODI from Hive for integrating. To illustrate I will use the movie lens data set which is a common data set used in Hadoop training.

I have defined 2 tables in Hive for movies and their ratings as below, one file has fields delimited with '|' the other is tab delimited. 

  1. create table movies (movie_id int, movie_name string, release_date string, vid_release_date string,imdb_url string) row format delimited fields terminated by '|';
  2. create table movie_ratings (user_id string, movie_id string, rating float, tmstmp string) row format delimited fields terminated by '\t';

For this example I have loaded the Hive tables manually from my local filesystem (into Hive/HDFS) using the following LOAD DATA Hive commands and the movie lens data set mentioned earlier; 

  1. load data local inpath '/home/oracle/data/u.item' OVERWRITE INTO TABLE movies;
  2. load data local inpath '/home/oracle/data/' OVERWRITE INTO TABLE movie_ratings;

The data set in the file u.item data file looks like the following with '|' delimiter;

  • 1|Toy Story (1995)|01-Jan-1995|||0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
  • 2|GoldenEye (1995)|01-Jan-1995|||0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
  • 3|Four Rooms (1995)|01-Jan-1995|||0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

In ODI I can define my Hive data server and logical schema, here is the JDBC connection for my Hive database (I just used the default);

I can then define my model and perform a selective reverse using standard ODI functionality, below I am reversing just the movies table and the movie ratings table;


After the reverse is complete, the tables will appear in the model in the tree, the data can be inspected just like regular datastores;

From here we see the data in the regular data view;

The ODI RKM for Hive performs logging that is useful in debugging if you hit issues with the reverse engineer. This is a very basic example of how some of the capabilities hang together, ODI can also be used to design the load of the file into Hive, transformations within it and subsequent loads using Oracle Loader for Hadoop into Oracle and on and on.

Monday Dec 24, 2012

ODI 11g - Changing the Work Repository Connection

To change the work repository connection information, there is a button in the editor. I have heard a few folk have problems finding it. In 10g it was much more obvious, it had large text and was easily identifiable. Users search for this generally after cloning or upgrade when they need to connect the new master to the new cloned work repository.

In ODI 11g there is a small solitary button in the toolbar of the editor, doesn't exactly jump out saying 'press me, press me', the tooltip says 'Connection', pressing this button brings up the repository connection information editor. 

Clicking this button opens the work repository connection information where you can change the URL, connection credentials and so forth.

Thursday Dec 20, 2012

ODI 11g - Loading More Files Faster

This post shows you how to leverage more of your Oracle system, you can load more data and faster using these kind of best practices. The LKM here, lets you leverage the Oracle database's preprocessor to load MORE data faster. The KM also has the capability to leverage a regular expression to identify the file or files so that you can load much much more, it also sorts the files based on size to comply with the most efficient way of loading into Oracle via the external table. The preprocessor can basically do anything, but is commonly used to uncompress data on the fly - so you can eliminate transferring large uncompressed files around or needlessly uncompressing them! There are various creative blog posts around showing you what some customers have done - some even include the transfer in the preprocessing phase.

I posted the LKM on the site here

You can see the LKM configured below, I have used the preprocessor and a regular expression to load all gzipped files in the directory; 

When the code is generated below you can see the PREPROCESSOR directive used, plus you can see that the file names are going to be generated from a Java API call where the regular expression and directory are passed.

My gunzipdb.bat script has the following definition, wrappering the gzip utility (note the use of the -d -c directives);

  • @echo off
  • c:\ext_tab_test\gzip -d -c %1

Check this creative use of the PREPROCESSOR directive here (External table to load compressed data residing on remote locations) which does some creative work in the preprocessing script. This could do pretty much anything you want including transforming from some complex file to project through an external table.

Monday Dec 10, 2012

Maximum Availability with Oracle GoldenGate

Oracle Database offers a variety of built-in and optional products for maximum availability, and it is well known for its robust high availability and disaster recovery solutions. With its heterogeneous, real-time, transactional data movement capabilities, Oracle GoldenGate is a key part of the Maximum Availability Architecture for Oracle Database.

This week on Thursday Dec. 13th we will be presenting in a live webcast how Oracle GoldenGate fits into Oracle Database Maximum Availability Architecture (MAA). Joe Meeks from the Oracle Database High Availability team will discuss how Oracle GoldenGate complements other key products within MAA such as Active Data Guard. Nick Wagner from GoldenGate PM team will present how to upgrade to latest Oracle Database release without any downtime. Nick will also cover 2 new features of  Oracle GoldenGate 11gR2:  Integrated Capture for Oracle Database and Automated Conflict Detection and Resolution. Nick will provide in depth review of these new features with examples.

Oracle GoldenGate also offers maximum availability for non-Oracle databases, such as HP NonStop, SQL Server, DB2 (LUW, iSeries, or zSeries) and more. The same robust, reliable real-time, bidirectional data movement capabilities apply to all supported databases.

 I'd like to invite you to join us on Thursday Dec. 13th 10am PT/1pm ET to hear from the product experts on how to use GoldenGate for maximizing database availability and to ask your questions. You can find the registration link below.

Webcast: Maximum Availability with Oracle GoldenGate

Thursday Dec. 13th 10am PT/1pm ET

Look forward to another great webcast with lots of interaction with the audience.

ODI 11g - Dynamic and Flexible Code Generation

ODI supports conditional branching at execution time in its code generation framework. This is a little used, little known, but very powerful capability - this let's one piece of template code behave dynamically based on a runtime variable's value for example. Generally knowledge module's are free of any variable dependency. Using variable's within a knowledge module for this kind of dynamic capability is a valid use case - definitely in the highly specialized area.

The example I will illustrate is much simpler - how to define a filter (based on mapping here) that may or may not be included depending on whether at runtime a certain value is defined for a variable. I define a variable V_COND, if I set this variable's value to 1, then I will include the filter condition 'EMP.SAL > 1' otherwise I will just use '1=1' as the filter condition. I use ODIs substitution tags using a special tag '<$' which is processed just prior to execution in the runtime code - so this code is included in the ODI scenario code and it is processed after variables are substituted (unlike the '<?' tag).

 So the lines below are not equal ...

  • <$ if ( "#V_COND".equals("1")  ) { $> EMP.SAL > 1 <$ } else { $> 1 = 1 <$ } $>
  • <? if ( "#V_COND".equals("1")  ) { ?> EMP.SAL > 1 <? } else { ?> 1 = 1 <? } ?>

When the <? code is evaluated the code is executed without variable substitution - so we do not get the desired semantics, must use the <$ code. You can see the jython (java) code in red is the conditional if statement that drives whether the 'EMP.SAL > 1' or '1=1' is included in the generated code. For this illustration you need at least the ODI release - with the vanilla release it didn't work for me (may be patches?). As I mentioned, normally KMs don't have dependencies on variables - since any users must then have these variables defined etc. but it does afford a lot of runtime flexibility if such capabilities are required - something to keep in mind, definitely.

Thursday Dec 06, 2012

ODI 11g - Cleaning control characters and User Functions

In ODI user functions have a poor name really, they should be user expressions - a way of wrapping common expressions that you may wish to reuse many times - across many different technologies is an added bonus. To illustrate look at the problem of how to remove control characters from text. Users ask these types of questions over all technologies - Microsoft SQL Server, Oracle, DB2 and for many years - how do I clean a string, how do I tokenize a string and so on. After some searching around you will find a few ways of doing this, in Oracle there is a convenient way of using the TRANSLATE and REPLACE functions. So you can convert some text using the following SQL;

  • replace( translate('This is my string'||chr(9)||' which has a control character', chr(3)||chr(4)||chr(5)||chr(9), chr(3) ), chr(3), '' )

If you had many columns to perform this kind of transformation on, in the Oracle database the natural solution you'd go to would be to code this as a PLSQL function since you don't want the code splattered everywhere. Someone tells you that there is another control character that needs added equals a maintenance headache. Coding it as a PLSQL function will incur a context switch between SQL and PLSQL which could prove costly.

In ODI user functions let you capture this expression text and reference it many times across your mappings. This will protect the expression from being copy-pasted by developers and make maintenance much simpler - change the expression definition in one place.

Firstly define a name and a syntax for the user function, I am calling it UF_STRIP_BAD_CHARACTERS and it has one parameter an input string; 

We then can define an implementation for each technology we will use it, I will define Oracle's using the inputString parameter and the TRANSLATE and REPLACE functions with whatever control characters I want to replace;

I can then use this inside mapping expressions in ODI, below I am cleaning the ENAME column - a fabricated example but you get the gist.

 Note when I use the user function the function name remains in the text of the mapping, the actual expression is not substituted until I generate the scenario. If you generate the scenario and export the scenario you can have a peak at the code that is processed in the runtime - below you can see a snippet of my export scenario;

 That's all for now, hopefully a useful snippet of info.

Monday Dec 03, 2012

ODI 11g - Scripting a Reverse Engineer

A common question is related to how to script the reverse engineer using the ODI SDK. This follows on from some of my posts on scripting in general and accelerated model and topology setup.

Check out this viewlet here to see how to define a reverse engineering process using ODI's package.

Using the ODI SDK, you can script this up using the OdiPackage and StepOdiCommand classes as follows;

  1.  OdiPackage pkg = new OdiPackage(folder, "Pkg_Rev"+modName);
  2.   StepOdiCommand step1 = new StepOdiCommand(pkg,"step1_cmd_reset");
  3.   step1.setCommandExpression(new Expression("OdiReverseResetTable \"-MODEL="+mod.getModelId()+"\"",null, Expression.SqlGroupType.NONE));
  4.   StepOdiCommand step2 = new StepOdiCommand(pkg,"step2_cmd_reset");
  5.   step2.setCommandExpression(new Expression("OdiReverseGetMetaData \"-MODEL="+mod.getModelId()+"\"",null, Expression.SqlGroupType.NONE));
  6.   StepOdiCommand step3 = new StepOdiCommand(pkg,"step3_cmd_reset");
  7.   step3.setCommandExpression(new Expression("OdiReverseSetMetaData \"-MODEL="+mod.getModelId()+"\"",null, Expression.SqlGroupType.NONE));
  8.   pkg.setFirstStep(step1);
  9.   step1.setNextStepAfterSuccess(step2);
  10.   step2.setNextStepAfterSuccess(step3);

The biggest leap of faith for users is getting to know which SDK classes have to be used to build the objects in the design, using StepOdiCommand isn't necessarily obvious, once you see it in action though it is very simple to use. The above snippet uses an OdiModel variable named mod, its a snippet I added to the accelerated model creation script in the post linked above.


Learn the latest trends, use cases, product updates, and customer success examples for Oracle's data integration products-- including Oracle Data Integrator, Oracle GoldenGate and Oracle Enterprise Data Quality


« December 2012 »