X

Move data between Oracle Database and Apache Hadoop using high speed connectors.

How to Load Oracle and Hive tables using OHSH (Part 5 - Using "loadermap" when loading Oracle tables)

In this tutorial we deal with columns and tables.   Specifically how to map Oracle columns to specific data fields in HDFS files or to Hive columns.

Previously in Part 3, we discussed the methods for loading Oracle tables for relatively simple cases.  The simplifying assumptions included the following:

  • When loading delimited text into Oracle tables, the physical order of the delimited text fields reflects the declared order of the columns in the Oracle table
  • When loading Hive tables into Oracle tables the column names are the same for both Hive and Oracle tables
  • Dates and timestamp fields have uniform formats. (i.e. in one load all date  fields loaded have the same format and all timestamp fields have the same format).  The data and timestamp formats can be specified using OHSH "set dateformat" for "jdbc" and "directpath" load methods, and "set datemask" and "set timestampmask" for the "exttab" load method

OHSH's "loadermap" construct deals with situations when the above assumptions are not true.  The "loadermap" is only implemented and needed for loading Oracle tables.   When creating and loading Hive tables from Oracle, data pump files are used.  They contain metadata that is used by the Oracle DPSerde to create a Hive table with identical column names, so no special mapping constructs need to be specified.

"loadermap" semantics

The construct is optional and follows the <LoadSource> construct in an OHSH "load oracle table" statement.

('explain')? 'load' 'oracle' 'table' <ResourceName>':'<TableName> 'from' <LoadSource> <LoaderMap>? <Using>? <LoadConf>?

A "loadermap" is a list of column mappings which are a series of triplets specifying the name of a column in the target table, and optional mapping to data living in the source, and an optional "format" field.

<LoaderMap> : 'loadermap' <ColumnMapping> (',' <ColumnMapping>)*

<ColumnMapping> : ('tcolname' ('=')?)? <STRORDQSTR>
                                 (('field' ('=')? )? <STRING>)?
                                 (('format' ('=')?)? <DQSTRING>)?

A "loadermap" specification is analogous to the select column clause in SQL and does four things:

  • Specifies the target columns that are to be loaded in a table
  • Maps each specified target column to a field position in a delimited text file (if loading from HDFS files)
  • Maps each specified target column to a Hive field (if loading from a Hive table)
  • Optionally specifies the format of the field being loaded (typically an Oracle DATE or TIMESTAMP)

The syntax looks complex but the defaulting allows it to be abbreviated, simple, and intuitive.

Using a "loadermap" for a reverse mapping

In earlier examples we showed how to use OHSH to load tables from delimited text files in HDFS, where we assumed the fields in the delimited text were in the same order as the columns declared in the Oracle table.  So when loading the MOVIE_RATINGS table below, "custid" would be the first field in each row of delimited text, while "movieid" would be the second field and so on.

ohsh>%sql describe movie_ratings;
Name                                      Null?    Type
----------------------------------------- -------- ----------------------------
CUSTID                                             NUMBER
MOVIEID                                            NUMBER
GENREID                                           NUMBER
TIME                                                   DATE
RECOMMENDED                               NUMBER
ACTIVITY                                           NUMBER
RATING                                              NUMBER
SALES                                               NUMBER

But what if the table was declared as this? Now the column order does not naturally match the field order in the text files used earlier. For example CUSTID is the 0 field in the text file, but it is the seventh column in the table below.

ohsh>%sql describe MOVIE_RATINGS_SHUFFLED;
Name                                      Null?    Type
----------------------------------------- -------- ----------------------------
MOVIEID                                            NUMBER
SALES                                                NUMBER
GENREID                                           NUMBER
TIME                                                   DATE
ACTIVITY                                           NUMBER
RECOMMENDED                               NUMBER
CUSTID                                              NUMBER
RATING                                              NUMBER

To remedy this using implicit "loadermaps" we simply need to declare the columns in the "loadermap" construct in the order they exist in the delimited text file.

Informally we call this an implicit order "loadermap" because the order is implied by the order of columns listed in the "loadermap".  So in the example below "custid" maps to the 0 field, "movieid" to the 1 field, and so on.  (Note that unless a column is delimited by double quotes it is treated as being a case-insensitive name.)


ohsh>set datemask "YYYY-MM-DD:HH24:MI:SS"
ohsh>load oracle table olhp:movie_ratings_shuffled \
  from path \
  hadoop0: "/user/${USER}/moviedemo/movie_ratings/delimited_text"  \
  loadermap \
  (custid,movieid,genreid,time,recommended,activity,rating,sales) \
  using exttab

The alternative is to us explicit order loader maps.   They require you to identify the field position of each target column being loaded.   Explicit ordering is zero based (i.e. fields are numbered starting with 0).

With explicit ordering, it doesn't matter about what order they are declared as long as they have a positive integer associated with it that is unique and maps to the position of one field in the delimited text file.

The restrictions are that you can only map one column to one field, and you can't mix explicit loader map syntax with implicit order syntax.  You either provide a number position for all columns specified or don't specify any positions explicitly.   Explicit ordering allows you to pick and choose delimited text fields of interest.   You don't have to enumerate all the fields, just the ones that map to columns in your table.


ohsh>set dateformat "yyyy-MM-dd:HH:mm:ss"
ohsh>load oracle table olhp:movie_ratings_shuffled \
   from path \
  hadoop0:"/user/${USER}/moviedemo/movie_ratings/delimited_text" \
   loadermap \
   (genreid 2, custid 0, sales 7, movieid 1, rating 6, time 3, \
   recommended 4,activity 5) \
   using jdbc 

"loadermaps" for loading Oracle tables from Hive tables are similar in structure, only rather than mapping target columns to numbers, they map to Hive column names.  Note that the order of declaration of column names in Hive and Oracle tables are not a problem.  If order of declarations don't match, OHSH won't get confused.

For example, suppose we are loading the following Oracle table whose column names are different from the column names in Hive.

ohsh>%sql describe movie_ratings_difcols_witness;

 Name                                      Null?    Type
-----------------------------------------
CUSTID_1                                           NUMBER
MOVIEID_1                                         NUMBER
GENREID_1                                        NUMBER
TIME_1                                                DATE
RECOMMENDED_1                            NUMBER
ACTIVITY_1                                         NUMBER
RATING_1                                            NUMBER
SALES_1                                              NUMBER

These column names don't match the column names in the Hive table we want to load from. 

ohsh>%hive0 describe movie_ratings;
OK
custid                   int                                        
movieid                 int                                        
genreid                 int                                        
time                     string                                     
recommended       int                                        
activity                 int                                        
rating                   int                                        
sales                   float        

Since the column names are different in the two tables, loading of the Oracle table requires the use of a "loadermap".


ohsh>set datemask "YYYY-MM-DD:HH24:MI:SS"
ohsh>load oracle table olhp:movie_ratings_difcols \
    from hive table moviedemo:movie_ratings \
    loadermap (custid_1 custid, movieid_1 movieid, genreid_1 genreid, \
    time_1 time,recommended_1 recommended, activity_1 activity     rating_1 rating, sales_1 sales)  \

Note that in the previous examples we have been setting either the "datemask" or the "dateformat" depending upon whether we are loading using Oracle external tables (i.e. the "exttab" method) or loading from Hadoop using "directpath" or "jdbc").

The "loadermap" verbiage allows for a format to be associated with a particular column.  Below the "datemask" format is embedded as a third format value in the the time_1 specification below.  This allows different date and time formats to be assigned to different columns being loaded.


ohsh>load oracle table olhp:movie_ratings_difcols \
    from hive table moviedemo:movie_ratings \
    loadermap (custid_1 custid, movieid_1 movieid, genreid_1 genreid,  \
    time_1 time "YYYY-MM-DD:HH24:MI:SS",\
    recommended_1 recommended, activity_1 activity,   \
    rating_1 rating, sales_1 sales)  \
    using exttab

Using a "loadermap" for projection

The following shows loadermaps serving to project columns.  (Again if you use loadermaps you need to specify all columns by name that you want to load).

Let's say our target table looks like this:  

ohsh>%sql describe movie_ratings_project;
Name                                      Null?    Type
----------------------------------------- -------- ----------------------------
CUSTID                                             NUMBER
MOVIEID                                           NUMBER
GENREID                                          NUMBER

We want to load three columns and project out the rest.  For delimited text we enumerate the columns and their mappings explicitly.

ohsh>set dateformat "yyyy-MM-dd:HH:mm:ss"
ohsh>load oracle table olhp:movie_ratings_project \
  from hive table moviedemo:movie_ratings \
  loadermap (custid 0, movieid 1, genreid 2) \
  using jdbc

Since we want the first three fields, Implicit mappings to (fields 0, 1, and 2) work here as well.

ohsh>set datemask "YYYY-MM-DD:HH24:MI:SS"
ohsh>load oracle table olhp:movie_ratings_project \
  from hive table moviedemo:movie_ratings \
  loadermap (custid, movieid, genreid) \
  using exttab

 

 

 

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.