X

The Oracle NoSQL Database Blog covers all things Oracle NoSQL Database. On-Prem, Cloud and more.

  • October 15, 2014

Revisiting Nosql into Hive External Tables

About two years ago, in the Data Integration blog, David Allan wrote about the integration between Oracle NoSql and Hive, in his post David mentioned the work of Alexander Vilcek to implement a HiveStorageHandler to have access to NoSql data via Hive Sql select commands.

Unfortunately the Java code does not compile anymore with the NoSql Java Api (3.0+).I've modified the java source code to comply with the 3.0 java KV API. The new Hive storage handler has new definition possibilities.

The current  limitations are the same as the ones stated by the original Vilcek project code:

  • Supports only external non-native Hive tables.
  • Writing data to Oracle NoSQLDB is not supported .
  • Parsing of Hive SerDe properties is very rudimentary yet and spaces
    between NoSQL DB keys definitions in the key mappings properties in the
    Hive table create statement will cause key names to be misinterpreted.
  • Columns names and types specified in the Hive table definition are
    ignored; only NoSQL DB Major and Minor Keys mappings in the Hive table
    create statement define the column names.
  • A NoSQL DB Value for a given key is always interpred as string in the Hive table.


Hive CREATE TABLE Syntax:

CREATE EXTERNAL TABLE <hive_table_name> (column_name column_type,column_name column type, ...)
STORED BY 'nosql.example.oracle.com.NosqlStorageHandler'

WITH SERDEPROPERTIES ("kv.major.keys.mapping" =
"<majorKey1,majorKey2,...>", "kv.minor.keys.mapping" = 
"<minorKey1,minorKey2,...>"

"kv.minor.metadata" = "true|false", "kv.key.prefix" = "H", "kv.value.type" = "<avro|other>", "kv.key.range" = "<key restriction range>", "kv.key.ismajor" = "true|false","kv.host.port" = "<kvstore hostname>:<kvstore port number>", "kv.name" = "<kvstore name>"); 

Example 1:

Data stored in Oracle NoSQL Database:

/Smith/Bob/-/birthdate: 05/02/1975

/Smith/Bob/-/phonenumber: 1111-1111 

For this kind of data the minor key birthdate or phonenumber are metadata (field names) and the value of the key value pair is the real value for the metadata

Example 2:

/USAir1/gauge4/-/10-11-2012:12:12:56:77789:45678 

/USAir1/gauge4/-/10-11-2012:12:12:56:77790:45640

For this kind of data the minor key is a timestamp for a measure (45678, 45640) of a gauge of a plane. There in no metadata on the minor key.

The new table definitions  have the following parameters:

  • "kv.value.type" when the value type is avro the value is transformed into string by using the json transformation of the avro value
  • "kv.minor.metadata"  it's true or false value allows to make a difference between the two uses of key value pairs, true for Example 1, false for Example 2
  • "kv.key.prefix"  allows to define an starting path for a major key
  • "kv.key.ismajor" it's value is true if the key prefix is a complete major key path or false otherwise  
  • "kv.key.range"  is used in cases  where the key prefix does not correspond to a major key or when there is no key prefix

In real bigdata store you should not use null values for both "kv.key.prefix"   and "kv.key.range" .



Detail of metadata Example 1:

The following put operations have been done on our store using the kvstore client commands: 

  • put kv -key /H/Smith/Bob/-/birthdate -value 05/02/1975
  • put kv -key /H/Smith/Bob/-/phonenumber -value 1111-1111
  • put kv -key /H/Smith/Bob/-/userid -value 1
  • put kv -key /H/Smith/Patricia/-/birthdate -value 10/25/1967
  • put kv -key /H/Smith/Patricia/-/phonenumber -value 2222-2222
  • put kv -key /H/Smith/Patricia/-/userid -value 2
  • put kv -key /H/Wong/Bill/-/birthdate -value 03/10/1982
  • put kv -key /H/Wong/Bill/-/phonenumber -value 3333-3333
  • put kv -key /H/Wong/Bill/-/userid -value 3

A table creation to include all the above insertions is:

CREATE EXTERNAL TABLE MY_KV_TABLE (lastname string, firstname string, birthdate string, phonenumber string, userid string)

      STORED BY 'nosql.example.oracle.com.NosqlStorageHandler'

      WITH SERDEPROPERTIES ("kv.major.keys.mapping" = "lastname,firstname", "kv.minor.keys.mapping" = "birthdate,phonenumber,userid", 

"kv.minor.metadata" = "true", "k.key.pvrefix" = "H", "kv.value.type" = "String", "kv.key.range" = "", "kv.key.ismajor" = "false", "kv.host.port" = "bigdatalite:5000", "kv.name" = "kvstore");

 a select * from my_kv_table returns 9 values.

The creation and the select commands needs that the jar containing  nosql.example.oracle.com.NosqlStorageHandler be a part of the hive lib jars, lets say this jar is kvhive.jar.

any select applied to the table (different to the select *) invokes map/reduce procedures and needs hive add jar commands to target kvclient.jar and  kvhive.jar.

 For example: 

hive> SELECT lastname, firstname, collect_set(birthdate)[0], collect_set(phonenumber)[0], collect_set(userid)[0] FROM my_kv_table GROUP BY lastname, firstname;

 produces:

OK
Smith      Bob      05/02/1975      1111-1111      1
Smith Patricia      10/25/1967 2222-2222 2
Wong Bill 03/10/1982 3333-3333 3



Detail of a non-metadata example 2:

We have defined an Avro schema to record the list of min and max values of a given measure for a range of time stamps. 

The Hive table declaration in this case is:

CREATE EXTERNAL TABLE MY_KV_A_TABLE (flight string, sensor string, timeref string, stuff string)

      STORED BY 'nosql.example.oracle.com.NosqlStorageHandler'

      WITH SERDEPROPERTIES ("kv.major.keys.mapping" = "flight,sensor", "kv.minor.metadata" = "false", "kv.minor.keys.mapping" = "timeref", "kv.key.prefix" = "A/8/11", "kv.value.type" = "avro", "kv.key.range" = "", "kv.host.port" = "bigdatalite:5000", "kv.name" = "kvstore","kv.key.ismajor" = "true");

 In this case the value type specified is avro, the key prefix A/8/11 is A for analytic data prefix 8 for plane number 8 and 11 for gauge measure number 8.

A select * from my_kv_a_table returns records of the type:

 8 11 W2412 {"values": [{"min": -2.141749606E9, "max": 2.14331348E9}, {"min": -2.118390555E9, "max": 2.090954548E9}, {"min": -2.130274733E9, "max": 2.098153258E9}, {"min": -2.082685648E9, "max": 2.023497823E9}]}

 {"values": [{"min": -2.141749606E9, "max": 2.14331348E9}, {"min": -2.118390555E9, "max": 2.090954548E9}, {"min": -2.130274733E9, "max": 2.098153258E9}, {"min": -2.082685648E9, "max": 2.023497823E9}]} is the json format of the avro values stored in the nosql database.

The HiveSorageHandle Jdeveloper code can be get here.

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha
Oracle

Integrated Cloud Applications & Platform Services