Tuesday May 07, 2013

ODI - Integrating more social data

Carrying on from the earlier post on integrating social data from the Facebook Graph API, the functions you need to integrate JSON are here for download. I have wrapped the code into 2 functions; there is a readme in the zip describing what you need to do;

  1. JSON_to_XML
  2. Zipped_JSON_to_XML

After you have imported the ODI user functions you can easily call them from your code, in ODI procedures for example. I have a procedure with a Groovy task that simply calls the function as;

  • JSON_to_XML("http://graph.facebook.com/search?q=exadata&type=post", "D:\\output\\fb_exadata.xml", "", "facebook");

The first parameter is a URI (not a regular filesystem path) representing the input data to be processed, and the second parameter is a filesystem path for the generated file. The 3rd and 4th parameters configure the generated XML: the root node name and the name given to unnamed elements.

Here is an example of a zipped input taken from the filesystem;

  • Zipped_JSON_to_XML("file:///D:/input/fb_exadata.zip", "D:\\output\\fb_exadata.xml", "", "facebook");

The download is in the ODI user function samples download on the Java.net site. It's a community area, so try it out, make changes, and let me know how it goes. There are a few challenges mapping names from JSON to XML; the code has some support for that but could be better.

The 3rd and 4th parameters let us handle JSON with arrays of unnamed elements;

  • [{"id":"345", "name":"bob"},{"id":"123", "name":"jim"}]

so we can generate the following XML by passing company as the 3rd parameter (the array name) and emp as the unnamed element parameter; most commonly only the 4th parameter needs a value;

  <?xml version='1.0' encoding='UTF-8'?>
  <company>
  <emp><id>345</id><name>bob</name></emp>
  <emp><id>123</id><name>jim</name></emp>
  </company>
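
To make the 3rd/4th parameter behaviour concrete, here is a minimal Python sketch of the same transformation. This is an illustration only, not the downloadable ODI function; the function name json_array_to_xml is hypothetical;

```python
import json
import xml.etree.ElementTree as ET

def json_array_to_xml(json_text, root_name, element_name):
    # Wrap an array of unnamed JSON objects in a named root element,
    # naming each array entry with element_name - the same roles the
    # 3rd and 4th parameters of JSON_to_XML play.
    root = ET.Element(root_name)
    for obj in json.loads(json_text):
        elem = ET.SubElement(root, element_name)
        for key, value in obj.items():
            # Each JSON key becomes a child element holding its value.
            ET.SubElement(elem, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

xml = json_array_to_xml(
    '[{"id":"345", "name":"bob"},{"id":"123", "name":"jim"}]',
    "company", "emp")
print(xml)
```

Running this prints the same company/emp structure shown above (minus the XML declaration, which ElementTree omits for unicode output).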

There are a few other posts I will share when I get a moment, including using named pipes and an expert for specific technologies (such as XML). I wrote this post for fast model creation, which is very useful for relational oriented systems, but for XML we can have a much more specific one incorporating many of the sensible defaults and an option for using a system other than in-memory.

Tuesday Apr 16, 2013

ODI - Integrating social data from Facebook Graph API in 0 to 60

I'm always on the lookout for enabling technology to push the envelope. Here is a viewlet illustrating how to integrate Facebook data into your flow. It uses a preprocessor that will help you integrate JSON in general into your data integration world.

There are a few pieces to the jigsaw. ODI provides an XML driver that can be used to process XML data from the filesystem, HTTP or FTP for example. ODI has improved this greatly by leveraging Oracle's standard XML schema and parsing components. Transforming JSON to XML is one approach, convenient when you already have XML based tools or utilities. With this utility you can load JSON data into your data integration flows and in the process understand more about the data (when we reverse engineer the JSON we get a schema). Using this we can integrate arbitrary JSON data such as social data feeds from Twitter, Facebook, Google, and aggregators like Datasift, among many others.

Watch the viewlet to see it in action.

Saturday Dec 29, 2012

ODI - Hive and Complex JSON

Time for a nested JSON example using Hive external tables. ODI treats nested complex types in Hive the same way it treats types in other technologies such as Oracle: the type name is captured, not the definition - you can see XMLType or SDO_GEOMETRY as examples within the ODI Oracle technology. The Hive technology in ODI also has the ARRAY, STRUCT and MAP types seeded. For the nested JSON example I will use the example defined on this JSON SerDe page here.

The external table definition is below; I have defined this in Hive and reverse engineered it into ODI just like in the previous post. Note it is using a different SerDe from the post here - when I tried that Google SerDe the data was not projected properly (all data was projected as null... so beware of the components used). Just like in the previous post we need to add the jar whenever it is used (plus don't forget ODI_HIVE_SESSION_JARS), either from Hive or ODI;

  ADD JAR /home/oracle/json/hive-serde-1.0.jar;
  CREATE EXTERNAL TABLE message (
    messageid string,
    messagesize int,
    sender string,
    recipients array<string>,
    messageparts array<struct<
      extension: string,
      size: int
    >>,
    headers map<string,string>
  )
  ROW FORMAT SERDE 'com.proofpoint.hive.serde.JsonSerde'
  LOCATION '/user/oracle/json_complex';

This external table has ARRAY fields, STRUCT fields and MAP fields, so we are going above and beyond simple types. The data I will use is the same as the referenced web page;

  {
    "messageId": "34dd0d3c-f53b-11e0-ac12-d3e782dff199",
    "messageSize": 12345,
    "sender": "alice@example.com",
    "recipients": ["joe@example.com", "bob@example.com"],
    "messageParts": [
      {
        "extension": "pdf",
        "size": 4567
      },
      {
        "extension": "jpg",
        "size": 9451
      }
    ],
    "headers": {
      "Received-SPF": "pass",
      "X-Broadcast-Id": "9876"
    }
  }

Again, I will use the Hive RKM that I mentioned in the post here in order to reverse engineer the external table defined in Hive into ODI. Below you can see the table and how it is represented in ODI. You can see the recipients and messageparts columns are defined as ARRAY, and headers is a MAP.

We can view the data in ODI just like other tables - remember the external table projects the JSON data from HDFS through Hive;

When the datastore is used in an interface these are the columns that are available for mapping; you can use any of the Hive functions or operators available on these complex types, plus constructors for building them. For example, to retrieve the 1st element in the array the following Hive QL can be used;

  • select sender, recipients[0] from message;
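
To make the complex-type navigation concrete, here is a small Python sketch over the same message record shown above, with the Hive QL equivalents noted in comments. This is an illustration of the data shapes only, not ODI or Hive code;

```python
import json

# The same message record the Hive external table projects (from the post).
record = json.loads("""
{
  "messageId": "34dd0d3c-f53b-11e0-ac12-d3e782dff199",
  "messageSize": 12345,
  "sender": "alice@example.com",
  "recipients": ["joe@example.com", "bob@example.com"],
  "messageParts": [{"extension": "pdf", "size": 4567},
                   {"extension": "jpg", "size": 9451}],
  "headers": {"Received-SPF": "pass", "X-Broadcast-Id": "9876"}
}
""")

# ARRAY access - like recipients[0] in Hive QL
first_recipient = record["recipients"][0]
# STRUCT field inside an ARRAY - like messageparts[1].extension
second_ext = record["messageParts"][1]["extension"]
# MAP lookup - like headers['Received-SPF']
spf = record["headers"]["Received-SPF"]

print(first_recipient, second_ext, spf)
```

The indexing in the sketch mirrors how the Hive operators drill into ARRAY, STRUCT and MAP columns in an interface mapping.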

You can build such expressions in ODI;

You just need to be aware of the datatypes you are using and the functions available. I haven't gone into complex SQL/Hive QL here, but you can see the basic mechanics are pretty straightforward. One of the points that comes home here is the functionality level of the non-core pieces of Hadoop; the first 2 JSON SerDe libraries I have used support different capabilities of JSON - so be aware.

ODI - Hive External Tables, reverse engineering

Here we see Hive external tables on JSON data files in action, being reverse engineered from Hive into ODI. Carrying on from my earlier post on Hive reverse engineering, here we will tackle external tables with primitive data types. Just like Oracle external tables for accessing data on the filesystem, Hive external tables can access big data files in HDFS. This example uses a JSON SerDe for Hive which was downloaded from here. The external tables in Hive are quite nice! You just specify the HDFS directory and all files in that directory are projected through the external table definition; also you can specify sophisticated SerDe classes that comply with the Hive SerDe interfaces - these classes can reverse engineer complex types too, which makes them very useful.

Before launching ODI or the agent I added the SerDe JAR file to the environment variable used by ODI; here is what I did on my system;

  • export ODI_HIVE_SESSION_JARS=$ODI_HIVE_SESSION_JARS:/home/oracle/json/hive-json-serde-0.2.jar

I have 2 JSON files with data in my HDFS under /user/oracle/json. I will specify this directory in the external table definition, along with the JSON SerDe. Within my Hive shell I can define the external table, remembering to add the JSON SerDe file (using the ADD JAR command) just before I create the external table;

  ADD JAR /home/oracle/json/hive-json-serde-0.2.jar;
  CREATE EXTERNAL TABLE my_table (
    field1 string, field2 int, field3 string, field4 double
  )
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
  LOCATION '/user/oracle/json';

The data in the first JSON file looks like the following; this was just an example from the Google Code snippets;

  • {"field1":"data1","field2":100,"field3":"more data1","field4":123.001}
  • {"field1":"data2","field2":200,"field3":"more data2","field4":123.002}
  • {"field1":"data3","field2":300,"field3":"more data3","field4":123.003}
  • {"field1":"data4","field2":400,"field3":"more data4","field4":123.004}
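
The SerDe treats each line of each file under the table's LOCATION as one JSON record and projects its keys onto the table's columns. A minimal Python sketch of that per-line projection (an illustration only, not the SerDe's actual implementation);

```python
import json

# Newline-delimited JSON, one record per row, as the SerDe sees each
# line of the files under the external table's HDFS location.
lines = [
    '{"field1":"data1","field2":100,"field3":"more data1","field4":123.001}',
    '{"field1":"data2","field2":200,"field3":"more data2","field4":123.002}',
]

# Parse each line independently; each key maps onto the matching
# external table column (field1 string, field2 int, ...).
rows = [json.loads(line) for line in lines]
for row in rows:
    print(row["field1"], row["field2"], row["field4"])
```

This is why no schema lives with the data files themselves: the column definitions stay in the Hive metastore, and each row is interpreted against them at read time.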

I have defined these fields in the external table. Back in ODI, in my model below I am using the customized reverse engineering with the Hive RKM mentioned above; for this reverse I just want one table imported, so I defined the mask to be just my_table. Clicking the 'Reverse Engineer' button on the top left of the panel starts the reverse engineering.

If there are errors, check the ODI Operator; that's where you will see anything that goes wrong. The table in ODI has just some rudimentary information about the number of columns; all of the external table definition is in Hive, the same as when we reverse engineer an Oracle external table. Once the table is captured in ODI I can view the data;

The custom Hive RKM I used is here - this has some changes to the one shipped with ODI to include external table reverse (I included EXTERNAL_TABLE in the getTables API call), plus support for custom SerDe JARs loaded via the environment variable ODI_HIVE_SESSION_JARS (just like the ODI IKMs load the JARs, the RKM also needs to).

Anyway, that's a quick peek at external tables in Hive and reverse engineering them into ODI. Once the table is captured in ODI you can use it in mappings just like other tables!


Learn the latest trends, use cases, product updates, and customer success examples for Oracle's data integration products, including Oracle Data Integrator, Oracle GoldenGate and Oracle Enterprise Data Quality.

