ODI - Hive and Complex JSON

Time for a nested JSON example using Hive external tables. ODI treats nested complex types in Hive the same way it treats types in other technologies such as Oracle, the type name is captured, not the definition - you can see XMLType or SDO_GEOMETRY as an example within the ODI Oracle technology. The Hive technology in ODI also has the ARRAY, STRUCT and MAP types seeded. For the nested JSON example I will use the example define in this JSON SerDe page here.

The external table definition is below, I have defined this in Hive and reverse engineered into ODI just like the previous post. Note it is using a different SerDe from the post here, when I tried using that Google SerDe the data was not projected properly (all data was projected as null...so beware of components used). Just like the previous post we need to add the jar whenever it is used (plus don't forget the ODI_HIVE_SESSION_JARS), either from Hive or ODI;

  1. ADD JAR /home/oracle/json/hive-serde-1.0.jar;
  2. CREATE EXTERNAL TABLE message (
  3.   messageid string,
  4.   messagesize int,
  5.   sender string,
  6.   recipients array<string>,
  7.   messageparts array<struct<
  8.     extension: string,
  9.     size: int
  10.   >>,
  11.   headers map<string,string>
  12. )
  13. ROW FORMAT SERDE 'com.proofpoint.hive.serde.JsonSerde'
  14. LOCATION '/user/oracle/json_complex';

This external table has ARRAY fields, STRUCT fields and MAP fields, so we are going above and beyond simple types. The data I will use is the same as the referenced web page;

  1. {
  2.   "messageId": "34dd0d3c-f53b-11e0-ac12-d3e782dff199",
  3.   "messageSize": 12345,
  4.   "sender": "alice@example.com",
  5.   "recipients": ["joe@example.com", "bob@example.com"],
  6.   "messageParts": [
  7.     {
  8.       "extension": "pdf",
  9.       "size": 4567
  10.     },
  11.     {
  12.       "extension": "jpg",
  13.       "size": 9451
  14.     }
  15.   ],
  16.   "headers": {
  17.     "Received-SPF": "pass",
  18.     "X-Broadcast-Id": "9876"
  19.   }
  20. }

Again, I will use the Hive RKM that I mentioned in the post here in order to reverse engineer the external table defined in Hive into ODI. Below you can see the table and how it is represented in ODI. You can see the recipients and messageparts columns are defined as ARRAY, and headers is a MAP.

We can view the data in ODI just like other tables - remember the external table is projected the JSON data from HDFS through Hive;

When the datastore is used in an interface these are the columns that are available in mapping, you can use any of the Hive functions or operators available on these complex types, plus constructors for building them. For example to retrieve the 1st element in the array the following Hive QL can be used;

  • select sender, recipients[0] from message;

 You can build such expressions in ODI;

You just need to be aware of the datatypes you are using and the functions available. Haven't gone into complex SQL/Hive QL here, but you see the basic mechanics are pretty straightforward. One of the points that comes home here is the functionality level of the non-core pieces of Hadoop, so the first 2 JSON SerDe libraries I have used support different capabilities of JSON - so be aware.

Comments:

Another JSON SerDe I came across is ;
http://www.congiu.com/a-json-readwrite-serde-for-hive/

The one I used in this blog post is here;
https://github.com/proofpoint/hive-serde

The first one I used was this;
http://code.google.com/p/hive-json-serde/wiki/ProjectOverview

Cheers
David

Posted by David on December 31, 2012 at 10:14 AM PST #

Post a Comment:
  • HTML Syntax: NOT allowed
About

Learn the latest trends, use cases, product updates, and customer success examples for Oracle's data integration products-- including Oracle Data Integrator, Oracle GoldenGate and Oracle Enterprise Data Quality

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
2
3
5
6
7
8
9
10
12
13
14
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today