Saturday Dec 29, 2012

ODI - Hive and Complex JSON

Time for a nested JSON example using Hive external tables. ODI treats nested complex types in Hive the same way it treats types in other technologies such as Oracle, the type name is captured, not the definition - you can see XMLType or SDO_GEOMETRY as an example within the ODI Oracle technology. The Hive technology in ODI also has the ARRAY, STRUCT and MAP types seeded. For the nested JSON example I will use the example define in this JSON SerDe page here.

The external table definition is below, I have defined this in Hive and reverse engineered into ODI just like the previous post. Note it is using a different SerDe from the post here, when I tried using that Google SerDe the data was not projected properly (all data was projected as null...so beware of components used). Just like the previous post we need to add the jar whenever it is used (plus don't forget the ODI_HIVE_SESSION_JARS), either from Hive or ODI;

  1. ADD JAR /home/oracle/json/hive-serde-1.0.jar;
  2. CREATE EXTERNAL TABLE message (
  3.   messageid string,
  4.   messagesize int,
  5.   sender string,
  6.   recipients array<string>,
  7.   messageparts array<struct<
  8.     extension: string,
  9.     size: int
  10.   >>,
  11.   headers map<string,string>
  12. )
  13. ROW FORMAT SERDE 'com.proofpoint.hive.serde.JsonSerde'
  14. LOCATION '/user/oracle/json_complex';

This external table has ARRAY fields, STRUCT fields and MAP fields, so we are going above and beyond simple types. The data I will use is the same as the referenced web page;

  1. {
  2.   "messageId": "34dd0d3c-f53b-11e0-ac12-d3e782dff199",
  3.   "messageSize": 12345,
  4.   "sender": "alice@example.com",
  5.   "recipients": ["joe@example.com", "bob@example.com"],
  6.   "messageParts": [
  7.     {
  8.       "extension": "pdf",
  9.       "size": 4567
  10.     },
  11.     {
  12.       "extension": "jpg",
  13.       "size": 9451
  14.     }
  15.   ],
  16.   "headers": {
  17.     "Received-SPF": "pass",
  18.     "X-Broadcast-Id": "9876"
  19.   }
  20. }

Again, I will use the Hive RKM that I mentioned in the post here in order to reverse engineer the external table defined in Hive into ODI. Below you can see the table and how it is represented in ODI. You can see the recipients and messageparts columns are defined as ARRAY, and headers is a MAP.

We can view the data in ODI just like other tables - remember the external table is projected the JSON data from HDFS through Hive;

When the datastore is used in an interface these are the columns that are available in mapping, you can use any of the Hive functions or operators available on these complex types, plus constructors for building them. For example to retrieve the 1st element in the array the following Hive QL can be used;

  • select sender, recipients[0] from message;

 You can build such expressions in ODI;

You just need to be aware of the datatypes you are using and the functions available. Haven't gone into complex SQL/Hive QL here, but you see the basic mechanics are pretty straightforward. One of the points that comes home here is the functionality level of the non-core pieces of Hadoop, so the first 2 JSON SerDe libraries I have used support different capabilities of JSON - so be aware.

ODI - Hive External Tables, reverse engineering

Here we see Hive external tables on JSON data files in action and being reversed engineered from Hive into ODI. Carrying on from my earlier post on Hive reverse engineering here we will tackle external tables with primitive data types. Just like Oracle external tables for accessing data on the filesystem, Hive external tables can access big data files in HDFS. This example uses a JSON SerDe for Hive which was downloaded from here. The external tables in Hive are quite nice! You just specify the HDFS directory and all files in that directory are projected through the external table definition, also you can specify sophisticated SerDe classes that comply with the Hive SerDe interfaces - these classes can reverse engineer complex types also which make them very useful.

Before launching ODI or the agent for this I added the SerDe JAR file to this environment variable used by ODI, here is what I did in my system;

  • export ODI_HIVE_SESSION_JARS=$ODI_HIVE_SESSION_JARS:/home/oracle/json/hive-json-serde-0.2.jar

I have 2 JSON files with data in my HDFS under /user/oracle/json I will specify this directory in the external table definition, along with the JSON SerDe. Within my Hive shell I can define the external table, remembering to add the JSON SerDe file (using the ADD JAR command) just before I create the external table.

  1. ADD JAR /home/oracle/json/hive-json-serde-0.2.jar;
  2. CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
  3.           field1 string, field2 int, field3 string, field4 double
  4.        )
  5.        ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
  6.        LOCATION '/user/oracle/json';

 The data in the first JSON file looks like the following, this was just an example from the google code snippets;

  • {"field1":"data1","field2":100,"field3":"more data1","field4":123.001}
  • {"field1":"data2","field2":200,"field3":"more data2","field4":123.002}
  • {"field1":"data3","field2":300,"field3":"more data3","field4":123.003}
  • {"field1":"data4","field2":400,"field3":"more data4","field4":123.004}

I have defined these fields in the external table. Back in ODI, in my model below I am using the customized reverse with the RKM Hive mentioned above, also for this reverse I just want one table imported, so I defined the mask just to be my_table. Clicking the 'Reverse Engineer' button on the top left of the panel starts the reverse engineering.

If there are errors, then check the ODI Operator, that's where you will see anything that goes wrong. The table in ODI has just some rudimentary information about the number of columns, all of the external table definition for this is defined in Hive, this is the same as when we reverse engineer an Oracle external table. Once the table is captured in ODI I can view the data;

The custom Hive RKM I used is here - this has some changes to the one shipped with ODI to include external table reverse (I included EXTERNAL_TABLE in the getTables API call), plus support for custom SerDe JARs loaded via the environment variable ODI_HIVE_SESSION_JARS (just like the ODI IKMs load the JARs, the RKM also needs to).

Anyway, that's a quick peek at external tables in Hive and reverse engineering them into ODI. Once the table is captured in ODI you can use it in mappings just like other tables!

About

Learn the latest trends, use cases, product updates, and customer success examples for Oracle's data integration products-- including Oracle Data Integrator, Oracle GoldenGate and Oracle Enterprise Data Quality

Search

Archives
« December 2012 »
SunMonTueWedThuFriSat
      
1
2
4
5
7
8
9
11
12
13
14
15
16
17
18
19
21
22
23
25
26
27
30
     
Today