
Move data between Oracle Database and Apache Hadoop using high speed connectors.

How to Load Oracle and Hive Tables using OHSH (Part 1 - Introduction)

This is the first in a series of blog posts that revisits an earlier set of tutorials about loading Oracle tables from Hadoop.

The intention of this tutorial is to introduce a tool formally known as OHSH (Oracle Shell for Hadoop Loaders), nicknamed "ohshell". This is a CLI that supports commands to load Oracle tables from data living in Hadoop, and to create, replace, or incrementally load data from Oracle tables into Hive. OHSH is a Hadoop application which layers on three underlying Oracle products: Oracle Loader for Hadoop (OLH), Oracle SQL Connector for HDFS (OSCH), and the Copy To Hadoop feature of Big Data SQL (CP2HADOOP). The last technology is more recent and was not discussed in the original tutorials; it will be explained in depth here.

Why OHSH?

OHSH was built for a few reasons.  

One reason was to create a standardized declarative model that lets users say what they want done rather than struggle with the complexities of setting the appropriate set of low-level properties when submitting a load job. OHSH takes simple input from the user and automatically generates the correct properties needed for a particular type of load.
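To make the contrast concrete, here is a hedged illustration. Submitting an OLH job directly means assembling a configuration file of low-level properties and launching a map/reduce job by hand (the property names in the comments are representative OLH settings; check the OLH documentation for your version), while OHSH reduces the same intent to a single declarative command.

  # Direct OLH submission: a hand-built XML configuration, then a hadoop jar call.
  # The configuration spells out properties such as oracle.hadoop.loader.targetTable
  # and oracle.hadoop.loader.connection.url, plus input format and field names.
  $ hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
        -conf movie_ratings_conf.xml

  # The OHSH equivalent: one declarative command.
  ohsh>load oracle table omovies:movie_ratings from path
       hadoop0:/user/${USER}/moviedemo/movie_ratings_delimitedtext
       using directpath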

Another reason was to simplify and unify the usage of the underlying technologies to move data between Oracle and Hadoop.  This includes automating the mechanical steps that were otherwise left for the users to do on their own.

For instance, if you use OSCH directly, it will create an external table that maps to content living in HDFS, but you then need to compose and execute a correct Oracle SQL INSERT statement to load the Oracle target table from the external table that OSCH created.

When using external tables to export Oracle table content to Hadoop, the user has to create the CTAS external table export statement, manually copy the Data Pump files to HDFS, and manually create a Hive external table that maps to those files.

OHSH does all of this work under the covers.  You don't have to create or manage Oracle external tables serving OSCH or CP2HADOOP, since they are created and deleted on-the-fly by OHSH as artifacts of a single load operation.

Additionally, it was realized that the process of loading goes beyond the actual load operation. It typically involves a lot of ad hoc interaction to prep for a load, sanity check the results, and clean up afterward. This means working with various resources beyond the underlying products OHSH manages: tables living in Oracle schemas or Hive databases, and files and directories living in HDFS or on the local file system.

One could resort to writing custom bash or Perl scripts to do such work, but those scripts are invariably hard to read and maintain.

OHSH defines a minimalist domain specific language whose focus is on what is needed to move data between Oracle and Hadoop.  

If you are loading an Oracle table from content living in Hadoop, you typically need to identify the name of the Oracle table, the source of the content in Hadoop (either a path to an HDFS directory or the name of a Hive table), and the preferred method for loading.

If you are loading or creating a Hive table from an Oracle table, you typically need to specify the Hive and Oracle databases, the names of the tables, and the preferred method of loading.
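In sketch form (the resource names, table names, paths, and methods below are placeholders, not literal syntax to copy), the two command shapes look like this:

  ohsh>load oracle table <jdbc_resource>:<oracle_table> from path
       hadoop0:<hdfs_directory> using <method>

  ohsh>create hive table <hive_resource>:<hive_table>
       from oracle table <jdbc_resource>:<oracle_table> using <method>

Concrete versions of both commands appear in the tour below.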

Actually, even specifying a method is optional, although at the time of this writing choosing one explicitly is highly recommended once you understand the pros and cons of the different methods offered. The trade-offs become clear once you start using the tool, since you can quickly test the various load methods interactively.

OHSH Dependencies

OHSH runs on Linux and can run on the system running Oracle Database, on a Hadoop cluster, or on an edge node that has client access to both Oracle and a Hadoop cluster. Typical users of OHSH will be Oracle DBAs and developers who use Oracle SQL*Plus and SQL*Loader on a regular basis. OHSH is installed on Oracle BDA, and is available for download along with Oracle Big Data Connectors on OTN and other download sites.

General dependencies include the following:

  • JDBC connectivity to one or more Oracle databases
  • Hive client software
  • Hadoop client software
  • SQL*Plus client
  • Linux OS and Bash

Underlying Oracle loader dependencies include:

  • Oracle Big Data Connectors for Hadoop (OLH and OSCH) for loading Oracle tables from data in Hadoop and/or
  • Copy To Hadoop (CP2HADOOP) for loading Hive tables from data in Oracle tables

OHSH, Oracle Big Data Connectors, and Copy To Hadoop are all available on Big Data Cloud Service, installed and ready to use.

If Oracle Big Data Connectors are licensed with Oracle Big Data Appliance, OHSH and the connectors come installed and configured. On other Hadoop clusters, OHSH needs to be configured to point to the homes of OLH and OSCH and/or the home of CP2HADOOP. OHSH is tightly coupled to these underlying technologies, so you need to download and install the appropriate versions of the aforementioned Oracle loader dependencies.
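As a rough sketch only (the variable names and paths below are assumptions for illustration; consult the OHSH installation instructions for your version), pointing OHSH at the loader homes amounts to setting a few variables in its setup script:

  # Illustrative OHSH setup variables; names and paths are assumptions.
  export OLH_HOME=/opt/oracle/oraloader         # home of Oracle Loader for Hadoop
  export OSCH_HOME=/opt/oracle/orahdfs          # home of Oracle SQL Connector for HDFS
  export CP2HADOOP_HOME=/opt/oracle/orahivedp   # home of Copy To Hadoop
  export HADOOP_HOME=/usr/lib/hadoop            # Hadoop client software
  export HIVE_HOME=/usr/lib/hive                # Hive client software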

Note that SQL*Plus is installed on the BDA and on Oracle Database systems. If you want to use OHSH on your own Hadoop cluster or on an edge node, you should download and install Oracle Instant Client, which is a free download.

Typical OHSH Users

Moving large data sets (potentially terabytes) between Oracle and Hadoop is a resource-intensive operation that needs to be carefully managed, especially on production systems.

In general OHSH is designed for DBAs and developers who are defining and managing schemas and tables, managing Oracle directories, and who spend a lot of time using SQL*Plus, SQL*Loader, and/or Data Pump utilities.

As a convenience, OHSH allows one to integrate SQL*Plus commands within the context of an OHSH script.

OHSH Integration with Hadoop, Hive, Bash, SQL*Plus resources

OHSH is meant to integrate the usage of the various tools needed to move and load data between Hadoop and Oracle. The internals are split between an upper layer (ohsh.jar), which provides a shell-like environment and CLI to harvest user input, and a loading engine (smartloader.jar) that does the real work.

The CLI layer of OHSH is basically a REPL (read, evaluate, print loop) that performs two levels of parsing. The top level captures a complete opaque command, determines which resource needs to execute it (Beeline/Hive, Hadoop, Bash, SQL*Plus, or OHSH itself), performs environment variable substitution, and then routes the command to the appropriate resource for execution. If the command is an OHSH command, it is parsed at a finer level, its input harvested, and then handed to an execution engine to either manage state or execute a load operation.

Non-OHSH commands use the delegation operator "%", prefixed to the name of a command line resource, to identify the resource that should execute the command.

For example:

ohsh>%hive0 show tables;

sends a "show tables" command to BeeLine/Hive CLI.

The implementation of a CLI that invokes other CLIs is somewhat unconventional, but here it makes sense.  When loading a table, one might need to create the table in Oracle before doing the load.  Rather than re-implementing the complexities of an Oracle CREATE TABLE command in OHSH, it makes sense to simply delegate the CREATE TABLE statement to SQL*Plus but keep the step in the workflow of the OHSH script orchestrating a load.
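A hedged sketch of that workflow (the table definition is purely illustrative):

  ohsh>%sql create table movie_ratings (custid number, movieid number, rating number, ratingdate date);
  ohsh>load oracle table omovies:movie_ratings from path
       hadoop0:/user/${USER}/moviedemo/movie_ratings_delimitedtext
       using directpath

The CREATE TABLE step runs in SQL*Plus, the load runs through OHSH, and both live in one script.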

The OHSH CLI also supports shell amenities such as command history tracking with bang (!) recall, spooling, setting environment variables, and up-arrow/down-arrow command line recall.

As mentioned earlier, load commands require minimal user input. The rest of the input is supplied by OHSH default settings. Most of these settings need to be set only once for your operational environment and should work across various load operations. These include things like Oracle date masks, Oracle DOP (degree of parallelism), Hadoop reduce tasks, and so on.
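For example, a site might pin down its defaults once at the top of a script (the specific values here are illustrative; each of these set commands appears later in this post):

  ohsh>set dateformat "yyyy-MM-dd"
  ohsh>set datemask "YYYY-MM-DD"
  ohsh>set dop 8
  ohsh>set reducetasks 18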

A Brief Tour Starting with Loading an Oracle table

Let's look at an OHSH script that loads an Oracle table using a standard OHSH script pattern.  In this case we will be loading a table from HDFS delimited text files using OLH with OCI directpath load.

ohsh>set echo on
ohsh>@mdresources
ohsh>%sql truncate table movie_ratings;
ohsh>set reducetasks 18 
ohsh>load oracle table omovies:movie_ratings from path
 hadoop0:/user/${USER}/moviedemo/movie_ratings_delimitedtext
 using directpath
ohsh>%sql select count(*) from movie_ratings;

The "set echo on" tells OHSH to echo statements that are executed in subordinate scripts (e.g. "@mdresources.ohsh").

The @mdresources.ohsh executes a subordinate script which creates three resources named "sql", "omovies", and "hmovies". "sql" is a user-defined name given to a SQL*Plus resource that executes SQL*Plus commands in an Oracle user/schema called "MOVIEDEMO". The "omovies" resource is a JDBC connection to the same schema. The "omovies" JDBC resource is needed for load commands, while the "sql" resource is for executing statements in SQL*Plus. "hmovies" is a Hive resource that connects to a Hive database called "moviedemo".
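A hedged sketch of what mdresources.ohsh might contain (the connect identifiers and URLs are illustrative; the resource-creation statements follow the general form documented for OHSH, but check the syntax for your version):

  create sqlplus resource sql connectid="myhost:1521/MYPDB"
  create oracle jdbc resource omovies connectid="myhost:1521/MYPDB"
  create hive resource hmovies connectionurl="jdbc:hive2:///moviedemo"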

The "%sql truncated table movie_ratings;" command delegates a truncate table DDL statement to SQL*Plus for execution.

The "set reducetasks 18" says that the reduce phase in Hadoop map/reduce which does the loading is restricted to using 18 reducers.  This is a Hadoop related tuning option.

The "load oracle table omovies:movie_ratings ... using directpath" command is compiled by the SmartLoader layer of OHSH and is executed.  It will load all of the delimited text files living in the HDFS path "/user/${USER}/movie_ratings_delimitedtext" into a table called movie_ratings. The table lives in the Oracle schema "OLHP" using the name of the JDBC resource declared above.  

The last statement delegates a command back to SQL*Plus to count the number of rows loaded into the table after it was truncated.

Loading the Same Table with other OHSH Methods

If one wanted to execute a Hadoop map/reduce load using the jdbc load method rather than OCI directpath (for example, because OCI directpath requires the Oracle target table to be partitioned), one would simply change the load method above from "directpath" to "jdbc". The jdbc load method is typically slower than OCI directpath.

ohsh>load oracle table omovies:movie_ratings from path
     hadoop0:/user/${USER}/moviedemo/movie_ratings_delimitedtext
     using jdbc

Everything else is the same.

If one wants to execute a load using an Oracle external table, the script would set the preferred DOP and then execute the same command, only using the "exttab" method.

ohsh> set dop 18

ohsh>load oracle table omovies:movie_ratings from path
     hadoop0:/user/${USER}/moviedemo/movie_ratings_delimitedtext
     using exttab

It is worth pointing out that Oracle table loads using Hadoop map/reduce and loads using Oracle external tables have very different execution strategies, but the commands for both are declarative and almost identical. This makes it easy to test all the methods, understand what is going on under the covers, and decide which one suits your use case and environment.

Loading Oracle tables from Hive Tables

Loading data from Hive tables is similar to the commands above, only the source of the data is a named Hive table (in this case living in a Hive "moviedemo" database).

This is a load oracle table command from a Hive table using the "directpath" method.

ohsh>set dateformat "yyyy-MM-dd"
ohsh>load oracle table omovies:movie_ratings from hive table hmovies:movie_ratings using directpath 

This is a similar command using OSCH.

ohsh>set datemask "YYYY-MM-DD"
ohsh>load oracle table omovies:movie_ratings from hive table hmovies:movie_ratings using exttab

Note that different date and time defaults apply to "directpath" and "exttab": "dateformat" for the map/reduce methods takes a Java-style date format, while "datemask" for the external table method takes an Oracle-style date mask. (This will be explained in depth in later blog posts that do a deep dive into these mechanisms, along with some best-practice recommendations.)

Creating or Loading Hive tables from Oracle tables

OHSH supports creating, creating or replacing, replacing, and loading Hive tables from content in Oracle tables.

ohsh>create hive table hmovies:movie_ratings_oracle \
          from oracle table omovies:movie_ratings \
          hiverootdir /user/${USER}/moviedemo/movie_ratings_datapump \
          using directcopy

This creates a Hive table called movie_ratings_oracle from the Oracle table movie_ratings using a method called "directcopy". The "hiverootdir" is a root directory under which the data files of copied Oracle tables are stored and managed.
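The other Hive-side variants mentioned above follow the same pattern. A hedged sketch (these forms mirror the create-or-replace and load operations listed earlier; the exact keywords may vary by OHSH version):

  ohsh>create or replace hive table hmovies:movie_ratings_oracle \
            from oracle table omovies:movie_ratings \
            hiverootdir /user/${USER}/moviedemo/movie_ratings_datapump \
            using directcopy

  ohsh>load hive table hmovies:movie_ratings_oracle \
            from oracle table omovies:movie_ratings \
            using directcopy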

Next Steps

That completes our nickel tour. The next tutorial will get into the nuts and bolts of configuring the tool and taking advantage of its ease-of-use features. That will be followed by tutorials that look at the tool in depth, focusing on loading Oracle tables and Hive tables and explaining special cases such as loading Hive partitions, using loader maps to arbitrarily map table columns to data, and ensuring that loads into the Oracle database won't happen unless the data is guaranteed to be clean for loading.

 
