ODI - Hive and MongoDB

I've been experimenting with another Hive storage handler, this time for MongoDB. There are a few out there, including this one from MongoDB. The one I have been using supports basic primitive types and supports both read and write, using the standard approach of a storage handler class plus custom properties to describe the data mapping. This lets you access MongoDB via a Hive external table very easily and abstracts away a lot of integration complexity, which also makes it ideal for use in ODI. I have been using it on my Linux VM, where I have Hive running, to access MongoDB running on another machine. The storage handler is found here; I used it to access the same example I blogged about here. Below is the external table definition;

  ADD JAR /home/oracle/mongo/hive-mongo.jar;

  create external table mongo_emps(EMPNO string, ENAME string, SAL int)
  stored by "org.yong3.hive.mongo.MongoStorageHandler"
  with serdeproperties ( "mongo.column.mapping" = "EMPNO,ENAME,SAL" )
  tblproperties ( "mongo.host" = "<my_mongo_ipaddress>", "mongo.port" = "27017",
  "mongo.db" = "test", "mongo.collection" = "myColl" );

Very simple. The nice aspect of the Hive external table is the SerDe properties that can be specified: simple, but a nice flexible approach. I can then reverse engineer this into ODI (see the reverse engineering posting here) and use it in my Hive integration mappings to read from, and potentially write to, MongoDB.

The supported primitive types can also be used to project nested document types. For example, the document below (taken from here), in which name, contribs and awards have JSON structures but are projected as strings,

  {
      "_id" : 1,
      "name" : {
          "first" : "John",
          "last" : "Backus"
      },
      "birth" : ISODate("1924-12-03T05:00:00Z"),
      "death" : ISODate("2007-03-17T04:00:00Z"),
      "contribs" : [ "Fortran", "ALGOL", "Backus-Naur Form", "FP" ],
      "awards" : [
          {
              "award" : "W.W. McDowell Award",
              "year" : 1967,
              "by" : "IEEE Computer Society"
          },
          {
              "award" : "National Medal of Science",
              "year" : 1975,
              "by" : "National Science Foundation"
          },
          {
              "award" : "Turing Award",
              "year" : 1977,
              "by" : "ACM"
          },
          {
              "award" : "Draper Prize",
              "year" : 1993,
              "by" : "National Academy of Engineering"
          }
      ]
  }

can be processed with the following external table definition, which can then be used in ODI;

  create external table mongo_bios(name string, birth string, death string, contribs string, awards string)
  stored by "org.yong3.hive.mongo.MongoStorageHandler"
  with serdeproperties ( "mongo.column.mapping" = "name,birth,death,contribs,awards" )
  tblproperties ( "mongo.host" = "<my_ip_address>", "mongo.port" = "27017",
  "mongo.db" = "test", "mongo.collection" = "bios" );
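Once projected through the external table, a complex field such as awards arrives downstream as a plain JSON string rather than a Hive complex type, so it can be parsed back into a structure wherever that is needed. A minimal sketch in Python (illustrative only, not part of the Hive/ODI flow; the sample string is a fragment of the bios document above):

```python
import json

# The value Hive projects for the "awards" column: a JSON string,
# not a Hive array/struct type.
awards_str = (
    '[{"award": "Turing Award", "year": 1977, "by": "ACM"},'
    ' {"award": "Draper Prize", "year": 1993,'
    ' "by": "National Academy of Engineering"}]'
)

# Parse the string back into a structure and pick out fields.
awards = json.loads(awards_str)
years = [a["year"] for a in awards]
print(years)  # [1977, 1993]
```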

All very simple, and that's what makes it so appealing. Anyway, that's a quick follow-up on using external tables with MongoDB and Hive, to go with the SQL oriented approach I described here that used Java table functions.

Comments:

I've been trying to overcome the nested document types... how did you manage to convert nested structures in Hive into a single string that you insert into "contribs" and "awards"? Please advise, thanks.

Posted by guest on March 04, 2013 at 09:07 AM PST #

Hi

The storage handler I used only supports primitive types, including string, so I simply named the projected column 'awards'; although the data value is complex in structure, it is converted and projected as a (JSON) string through the external table. You can then apply whatever expression you want to that (JSON) string column.

If you look at the BasicDBObject for example from the MongoDB Java API, you will see there is a toString method that returns a JSON serialization of an object (as a string);
http://api.mongodb.org/java/current/com/mongodb/BasicDBObject.html

Cheers
David

Posted by David on March 04, 2013 at 09:43 AM PST #

Thanks for the prompt reply. I'm assuming the BasicDBObject.toString() is relevant if we are calling a Hive select (expecting to get data from MongoDB). What about the other way round - inserting data into MongoDB from Hive queries, as shown in yc-huang's storage handler example...

insert overwrite table mongo_users select id, name,age from hive_test;

In this example, what if one of the fields is a complex type? Perhaps name is further nested into first,last,middle, etc.

Posted by guest on March 04, 2013 at 05:21 PM PST #

Hi

The JSON string value representing the complex column value can be constructed in ODI as an expression and then simply assigned. There are Hive to Hive IKMs for inserting into Hive targets.
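To make that concrete, the expression just has to produce a valid JSON string for the complex column. A rough Python sketch of the idea (illustrative only; the field values are taken from the bios example, and in ODI this would be an expression rather than Python):

```python
import json

# Building the JSON string that the mapping expression would need to
# produce for a complex "name" column before assigning it.
first, last = "John", "Backus"
name_expr = json.dumps({"first": first, "last": last})
print(name_expr)  # {"first": "John", "last": "Backus"}
```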

An alternative to using the storage handler is creating an IKM for MongoDB; it's very easy to have a Groovy task in an IKM, for example, that inserts MongoDB documents into a collection from some source.
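As a sketch of what such an IKM task could do (shown in Python rather than Groovy for brevity): build MongoDB documents from source rows, then hand them to the client driver. The connection details are hypothetical and the insert call is only indicated, not executed here:

```python
# Build MongoDB documents from source rows (column names follow the
# mongo_emps example above; the rows themselves are made up).
source_rows = [(7839, "KING", 5000), (7698, "BLAKE", 2850)]
docs = [{"EMPNO": str(empno), "ENAME": ename, "SAL": sal}
        for empno, ename, sal in source_rows]
print(len(docs))  # 2

# Against a live server, the insert step would look something like:
# from pymongo import MongoClient
# MongoClient("<my_mongo_ipaddress>", 27017)["test"]["myColl"].insert_many(docs)
```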

Cheers
David

Posted by David on March 04, 2013 at 08:27 PM PST #

Hmm... on second thought, not sure that will work though (on the insert); it would probably store the value as a string rather than as a complex value, I think... I'll have to try it and see.

Cheers
David

Posted by David on March 04, 2013 at 08:58 PM PST #
