Friday Dec 28, 2012

ODI - Basic Hive Queries

Here is a basic example that joins the MovieLens data and loads a Hive table, building on the tables from the Reverse Engineering Hive Tables post. The target Hive table was defined and created via ODI: I duplicated the movies table and added a column for the rating, just for demo purposes.

When I build my interface, I add movies as my source and movies_info as my target. Auto mapping completes much of the mapping, but the rating is left unmapped because it comes from another table. This is where ODI's incremental design is nice: I can add a new datastore as a source, map columns from it, and then describe the join.

After I have added the movie ratings table, I define the join just by dragging movie_id from the movies table to the movie_id column of the ratings table. That's the join mostly defined.

The other thing you need to check is that the ordered join property is set. This generates the ordered join syntax (ANSI style, but using the Hive technology's template).
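To give a feel for what ordered join syntax means in Hive, here is a hand-written sketch of the join described above. It is not the code ODI generates. The aliases m and r are mine, and the CAST is an assumption I added because the table definitions in the Reverse Engineering Hive Tables post declare movie_id as a string in movie_ratings and as an int in movies.

```sql
-- Hand-written sketch of an ordered (ANSI-style) join in HiveQL.
-- Not the statement ODI generates; aliases m and r are illustrative.
SELECT m.movie_id,
       m.movie_name,
       r.rating
FROM movies m
JOIN movie_ratings r
  -- movie_ratings.movie_id is declared as a string, so cast it explicitly
  ON (m.movie_id = CAST(r.movie_id AS INT));
```

The join condition sits in the ON clause next to the JOIN keyword rather than in the WHERE clause.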

We can also perform transformations using built-in or user-defined functions. For example, I apply the Hive built-in UPPER function to the movie name.
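As a rough illustration, the expression behind that mapping is just a HiveQL function call. The query below is a sketch I wrote by hand to try the function directly in Hive. It is not taken from the generated code.

```sql
-- Try the built-in UPPER function directly against the source table.
SELECT movie_id,
       UPPER(movie_name) AS movie_name
FROM movies
LIMIT 3;
```

For a name such as Toy Story (1995), UPPER returns TOY STORY (1995).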

In the physical (flow) view I am using the Hive Control Append IKM. I have set it so that ODI creates the target table in Hive and truncates it if it already exists. I also have flow control switched off.
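In HiveQL terms, those IKM settings correspond roughly to a create-if-missing followed by an overwrite of the target. The sketch below is my approximation, not the code the IKM produces. The movies_info column list assumes the target is a copy of movies plus a rating column of type float (the column name and type are assumptions), and INSERT OVERWRITE stands in for the truncate-then-load step.

```sql
-- Approximation only; not the code generated by the IKM.
-- Assumed target layout: the movies columns plus a rating column.
CREATE TABLE IF NOT EXISTS movies_info (
  movie_id         int,
  movie_name       string,
  release_date     string,
  vid_release_date string,
  imdb_url         string,
  rating           float
);

-- INSERT OVERWRITE replaces any existing rows, so it plays the role of
-- truncate followed by load.
INSERT OVERWRITE TABLE movies_info
SELECT m.movie_id,
       UPPER(m.movie_name),
       m.release_date,
       m.vid_release_date,
       m.imdb_url,
       r.rating
FROM movies m
JOIN movie_ratings r
  ON (m.movie_id = CAST(r.movie_id AS INT));
```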

Executing this is just like executing any other interface, except that Hive does the heavy lifting. The execution can be inspected in the ODI Operator or console, and the resulting table can be inspected once it completes.

ODI - Reverse Engineering Hive Tables

ODI can reverse engineer Hive tables via the standard reverse engineer and also via an RKM for tables defined in Hive. This makes it very easy to capture table designs from Hive in ODI for integration work. To illustrate, I will use the MovieLens data set, which is commonly used in Hadoop training.

I have defined 2 tables in Hive for movies and their ratings as below. One file has fields delimited with '|'; the other is tab delimited.

  1. create table movies (movie_id int, movie_name string, release_date string, vid_release_date string,imdb_url string) row format delimited fields terminated by '|';
  2. create table movie_ratings (user_id string, movie_id string, rating float, tmstmp string) row format delimited fields terminated by '\t';

For this example I have loaded the Hive tables manually from my local filesystem (into Hive/HDFS) using the following LOAD DATA Hive commands and the MovieLens data set mentioned earlier:

  1. load data local inpath '/home/oracle/data/u.item' OVERWRITE INTO TABLE movies;
  2. load data local inpath '/home/oracle/data/' OVERWRITE INTO TABLE movie_ratings;

The data in the u.item file looks like the following, with '|' as the delimiter:

  • 1|Toy Story (1995)|01-Jan-1995|||0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
  • 2|GoldenEye (1995)|01-Jan-1995|||0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
  • 3|Four Rooms (1995)|01-Jan-1995|||0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
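
Each sample row carries 24 '|'-separated fields, while the movies table declares only five columns. As far as I am aware, Hive's delimited text format maps fields to columns by position and ignores the extra trailing fields, so the 19 values of 0 or 1 at the end of each line are simply not exposed. A quick query such as the following (a sketch, run from the Hive command line) shows the five mapped columns for the first few rows:

```sql
-- Show the columns Hive maps from the '|'-delimited u.item rows.
SELECT movie_id, movie_name, release_date, vid_release_date, imdb_url
FROM movies
LIMIT 3;
```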

In ODI I can define my Hive data server and logical schema. Here is the JDBC connection for my Hive database (I just used the default):

I can then define my model and perform a selective reverse using standard ODI functionality. Below I am reversing just the movies table and the movie ratings table:


After the reverse is complete, the tables appear in the model in the tree, and the data can be inspected just like regular datastores.

From here we see the data in the regular data view.

The ODI RKM for Hive performs logging that is useful for debugging if you hit issues with the reverse engineer. This is a very basic example of how some of the capabilities hang together. ODI can also be used to design the load of the file into Hive, the transformations within it, and subsequent loads into Oracle using Oracle Loader for Hadoop, and so on.

