ODI - Hive and NoSQL
By David Allan-Oracle on Dec 31, 2012
The Hive external table let's us do lots of cool stuff including processing data from NoSQL. We have seen how custom SerDes are used, Hive storage handlers also provide some cool capabilities. Using the Hive storage handler defined here, an external table can be defined to project data from a NoSQL key-value store. The external table can then be used as a source in ODI, very simple.
The illustration on github has the following data stored in a Oracle NoSQL Database (the key is the lastname/firstname etc):
- /Smith/Bob/-/birthdate: 05/02/1975
- /Smith/Bob/-/phonenumber: 1111-1111
- /Smith/Bob/-/userid: 1
- /Smith/Patricia/-/birthdate: 10/25/1967
- /Smith/Patricia/-/phonenumber: 2222-2222
- /Smith/Patricia/-/userid: 2
- /Wong/Bill/-/birthdate: 03/10/1982
- /Wong/Bill/-/phonenumber: 3333-3333
- /Wong/Bill/-/userid: 3
Using the Hive external table and the custom storage handler for a key value store, we define a mask to project the data through the external table.
- ADD JAR /home/oracle/kv/HiveKVStorageHandler.jar;
- CREATE EXTERNAL TABLE MY_KV_TABLE (lastname string, firstname string, birthdate string, phonenumber string, userid string)
- STORED BY 'org.vilcek.hive.kv.KVHiveStorageHandler'
- WITH SERDEPROPERTIES ("kv.major.keys.mapping" = "lastname,firstname", "kv.minor.keys.mapping" = "birthdate,phonenumber,userID")
- TBLPROPERTIES ("kv.host.port" = "localhost:5000", "kv.name" = "kvstore");
There are a few interesting properties here;
- we specify the keyvalue store using TBLPROPERTIES, identify the host/port and the keystore name (kvstore).
- the SerDe properties contains the mapping of the keys to column names, you will get a row for each value of birthdate, phonenumber, userID
Fairly straightforward. We can then reverse engineer this into ODI, using the same mechanism as I described in previous posts here setting the ODI_HIVE_SESSION_JARS and so forth. The data projected looks like this;
- hive> SELECT * FROM MY_KV_TABLE;
- Smith Patricia 10/25/1967 NULL NULL
- Smith Patricia NULL 2222-2222 NULL
- Smith Patricia NULL NULL 2
- Smith Bob 05/02/1975 NULL NULL
- Smith Bob NULL 1111-1111 NULL
- Smith Bob NULL NULL 1
- Wong Bill 03/10/1982 NULL NULL
- Wong Bill NULL 3333-3333 NULL
- Wong Bill NULL NULL 3
In ODI by defining the Hive collect_set function as an aggregation function, we can then aggregate the data and pivot the data to get it as a row;
- SELECT lastname, firstname, collect_set(birthdate), collect_set(phonenumber), collect_set(userid)
- FROM MY_KV_TABLE
- GROUP BY lastname, firstname;
So another interesting illustration of external tables in Hive and what they can provide.