By David Allan on Jan 02, 2013
This post includes the Java client demonstration code used in the Hive and NoSQL post illustrated here. The BasicBigData.java code is a NoSQL client that populates a key-value store, which can then be queried using the Hive external table from that post. It didn't take long to code; a few peeks at the NoSQL javadoc were enough to get it going. You can take this Java code, compile it, and run it (the instructions for compiling are similar to those for the verification demo here, and it is very easy).
The Java code uses the NoSQL major/minor path constructor to describe the Key. Below is a snippet used to define the birthdate for Bob Smith:
- ArrayList<String> mjc1 = new ArrayList<String>();
- ArrayList<String> mnrb = new ArrayList<String>();
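To make the major/minor path idea concrete, here is a minimal, library-free sketch. It only reuses the two list names from the snippet; the path components ("Smith", "Bob", "birthdate"), their order, and the `toPath` helper are assumptions for illustration and are not taken from BasicBigData.java. In the Oracle NoSQL Java API, a key of this shape is normally built with `Key.createKey(majorPath, minorPath)`; the helper here just renders the same structure as a path string so the split between major and minor components is visible.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for illustration only: a real client would build an oracle.kv.Key
// from these two lists instead of rendering a string.
class KeyPathSketch {

    // Renders a major/minor key in "/major/-/minor" path style;
    // the "/-" marker separates the major path from the minor path.
    static String toPath(List<String> major, List<String> minor) {
        StringBuilder sb = new StringBuilder();
        for (String part : major) {
            sb.append('/').append(part);
        }
        if (!minor.isEmpty()) {
            sb.append("/-");
            for (String part : minor) {
                sb.append('/').append(part);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Major path identifies the person (component order is an assumption).
        ArrayList<String> mjc1 = new ArrayList<String>();
        mjc1.add("Smith");
        mjc1.add("Bob");

        // Minor path names the attribute stored under that person.
        ArrayList<String> mnrb = new ArrayList<String>();
        mnrb.add("birthdate");

        System.out.println(toPath(mjc1, mnrb)); // prints /Smith/Bob/-/birthdate
    }
}
```

Keeping the person in the major path and the attribute in the minor path means all of a person's attributes share one major key, which is what makes it natural to aggregate them per person later on.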
In the referenced post, to actually aggregate the key values, we used the Hive collect_set aggregation function (see here for Hive aggregation functions). The collect_set aggregation function returns a set of objects with duplicates eliminated. To get the aggregation function behavior in ODI with the correct group by, we must tell ODI about the Hive aggregation function. We can define a new language element for collect_set in the Topology tree, define the element as a group function, and also define the expression for Hive under the Implementation tab.
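For readers less familiar with collect_set, its grouping behavior can be sketched in plain Java. This is only a rough analogue of what Hive computes for a group by with collect_set, not the code ODI generates; the class name, the `collectSet` helper, and the sample rows are all hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Rough analogue of: SELECT k, collect_set(v) FROM rows GROUP BY k
class CollectSetSketch {

    // Each row is a {groupKey, value} pair; duplicate values within a
    // group are dropped because they are collected into a Set.
    static Map<String, Set<String>> collectSet(List<String[]> rows) {
        Map<String, Set<String>> groups = new TreeMap<>();
        for (String[] row : rows) {
            groups.computeIfAbsent(row[0], k -> new LinkedHashSet<>()).add(row[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows: the repeated pair collapses to one entry.
        List<String[]> rows = Arrays.asList(
                new String[] {"Smith/Bob", "birthdate"},
                new String[] {"Smith/Bob", "phonenumber"},
                new String[] {"Smith/Bob", "birthdate"},
                new String[] {"Jones/Ann", "birthdate"});

        System.out.println(collectSet(rows));
    }
}
```

The sketch keeps values in insertion order only to make the output easy to read; Hive does not promise any particular order for the elements collect_set returns.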
We are then able to define expressions that reference this aggregation function and get the exact syntax defined in the earlier post. The Hive expressions in the interface use collect_set.
From this design and the definition of the aggregation function in ODI, when the interface is executed you can see the generated Hive QL with the correct columns in the group by.
The target Hive datastore in the interface I defined has been loaded with the key values from the NoSQL key-value store, cool!
Those are a few of the missing pieces that let you query NoSQL through Hive external tables; hopefully they are useful pointers.