
Move data between Oracle Database and Apache Hadoop using high speed connectors.

How to Load Oracle and Hive Tables Using OHSH (Part 4 - Loading Hive Tables)

 

In Part 3 we described how to use OHSH to load Oracle tables with content living in either HDFS files or Hive tables.  In this tutorial we focus on how to use OHSH to do the opposite: create, replace, and incrementally load Hive tables with content living in Oracle tables.  To do this you need to download and configure the Copy to Hadoop feature of Big Data SQL (CP2HADOOP), as described in Part 1 and Part 2 of this series of tutorials.

When launching OHSH, the banner should tell you that Copy to Hadoop is enabled.  (If the banner does not mention "Copy to Hadoop", OHSH is not configured correctly for these types of operations.)

$ohsh
Oracle Shell for Hadoop Loaders Release 1.2.0 - Production   (Build:20161214114823)
 Copyright (c) 2016, 2017, Oracle and/or its affiliates. All rights reserved.
Oracle Loader for Hadoop (3.8.0), Oracle SQL Connector for HDFS (3.7.0), The Copy to Hadoop feature of Oracle Big Data SQL (3.1.0) enabled.

Copying data from Oracle to Hadoop can be done using three methods: "stage", "fuse", or "directcopy". The "stage" and "fuse" methods push data from Oracle into Hadoop, while "directcopy" pulls data from Oracle using a Hadoop map-only job.  The "stage" method is the most restricted: it typically only works when you are running OHSH on the Oracle system (or on a host sharing an NFS-mounted directory with it, as described below).

All of the methods produce Oracle data pump files in an HDFS directory and create a Hive external table that points to those data pump files.  The HDFS directory where these files are placed must be readable by Hive.  The Hive external table uses an Oracle SerDe implementation called "oracle.hadoop.hive.datapump.DPSerDe", which can open and read Oracle data pump files.

Note that in this discussion the term "external table" can refer either to Oracle external tables used to export data from Oracle as data pump files, or to Hive external tables used to read those data pump files in Hive.  Conceptually, external tables in Oracle and Hive serve the same purpose: mapping a table definition onto data controlled and managed by users, not by Hive or Oracle internal storage.

OHSH Copy To Hadoop Operations

OHSH supports four types of operations when loading content from Oracle to Hive:

  • create  - creates an external table in Hive with content from an Oracle table
  • replace - replaces an external table in Hive with content from an Oracle table
  • create or replace - creates or replaces an external table in Hive with content from an Oracle table
  • load - incrementally loads an existing external Hive table with content from an Oracle table

When doing a "create" operation, OHSH assumes that the external table in Hive being created is from scratch.  Basically a "create" will fail if the table already exists or the HDFS directory where the table data would be stored exists.

When doing a "replace" operation, OHSH drops an existing external Hive table and the related storage directory in HDFS.  It then proceeds to do a standard "create" operation.  If it can't drop the Hive table and cannot find and delete the HDFS storage directory containing table data, the operation doesn't proceed and returns with an error.

The "create or replace" operation is more forgiving and drops up an existing table if it exists, and deletes the related storage directory if it exists.  It then proceeds with a normal "create" operation.  This is what you want to call if a "create" or a "replace" fails and leaves things in an inconsistent state and don't want to deal with the hassle of cleaning up Hive state.

The "load" operation assumes a Hive external table exists and the related directory containing table data exists.  It simply adds data from an existing Oracle table to the data files already living under a Hive table.

Why does OHSH support all of these operations for Hive tables, but only the load operation when loading Oracle tables?  Because normal tables in Oracle are not external and are an order of magnitude more sophisticated.  A "CREATE TABLE" command for an Oracle table has all sorts of powerful and fascinating options that are expressed in Oracle DDL, so creation and replacement of Oracle tables should be delegated to OHSH call-outs to SQL*Plus.

How "directcopy" works

"directcopy" is the simple and preferred model both in operating complexity and security. It kicks off a Hadoop map-only job that connects to the Oracle data base, decomposes an Oracle table into table splits and reads splits from Oracle, writing the content into data pump files in HDFS into the directory that serves the Hive table.  The security is simpler because it eliminates the stage directory involved (as is the case with "stage" or "fuse" methods).  The OHSH user just needs to provide the credentials to connect to the Oracle schema that can read the table and needs to have HDFS privileges to write to the target HDFS directory.

How "stage" and "fuse" methods work

The "stage" and "fuse" methods work by producing data pump files using Oracle external tables in an Oracle CTAS (i.e. "CREATE TABLE AS SELECT") statement.   Typically you will use these methods only when the Hadoop cluster cannot connect to the Oracle database because of firewall and security issues, which prevents you from using "directcopy".  When the CTAS statement is executed, Oracle writes table data to files living in an Oracle directory object as data pump export files.  The name of the Oracle directory object is specified in OHSH default "locationdirectory", and this setting becomes external location directories used for exporting data out of Oracle.  From Oracle's perspective all this activity appears local to the system where Oracle is running.  Note that this directory serves an interim scratch area to land data pump files which will then be copied or moved to the user specified final destination someplace in the Hadoop cluster.

The number of data pump files created is typically equal to the OHSH "dop" setting, which enables the CTAS statement to write data pump files in parallel.  If the payload is very small, fewer data pump files may be created.
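For example, with a dop of 4 the CTAS runs with parallel degree 4 and names four data pump files in its LOCATION clause, as you will see in the verbose "stage" output later in this tutorial:

ohsh>set dop 4

With this setting, the verbose output later shows "PARALLEL 4" in the CTAS statement and four location files named OSL_<timestamp>-0.dmp through OSL_<timestamp>-3.dmp.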

When using the "stage" method this scratch directory is readable and writable from the file system local to the Oracle database as a local OS directory.  When using "fuse" the scratch directory references an HDFS fuse mount to a directory living directly in HDFS.  This means that the Oracle external tables produce data pump files directly in HDFS.   Oracle is unaware of this.  It is naively writing to what appears to be a local directory but in-fact is a mount point into HDFS.

Once the data pump files are written by Oracle to this scratch directory, OHSH creates a sub-directory living directly under a user-specified root directory in HDFS that serves as the final destination.  OHSH then moves the files to that final destination in HDFS.

If the "stage" method is being used, the move operation involves copying the files into HDFS and deleting the interim copies in the scratch directory living on the Oracle system.  For the "fuse" method, the move is done by HDFS.  It is lightweight operation simply changing the HDFS path to the existing data pump files.

The final step involves telling Hive to create a Hive external table to point to the HDFS directory containing the data pump files.  (This step is omitted if the operation is "load".)

Comparing "stage" and "fuse"

The "fuse" method is more efficient than the "stage" method because the data pump files are written once directly into HDFS.  The "stage" method writes data pump files twice, once to an OS file path visible to Oracle, and then to a directory living in HDFS.

Oracle writes to the Oracle directory object specified by the OHSH default "locationdirectory".  (For those not familiar with Oracle, a directory object is a named object in Oracle that maps to an OS directory accessible to the Oracle system.)  This directory path needs to be readable and writable by both Oracle and the end user running OHSH.
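If the Oracle directory object does not already exist, a DBA can create it and grant access to the loading schema with standard Oracle DDL.  A sketch, using the directory object name and schema from the examples below (the OS path shown is the staged path that appears in the verbose output later; substitute your own):

SQL> CREATE OR REPLACE DIRECTORY MOVIEDEMO_STAGE_DIR AS '/user/rhanckel/oracle/work/cp2bda_stage';
SQL> GRANT READ, WRITE ON DIRECTORY MOVIEDEMO_STAGE_DIR TO OLHP;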

For "stage", this directory is either a local disk on the Oracle host (e.g. "/tmp/ohshscratch") or an NFS mount (e.g. "/mnt/shared/tmp/ohshscratch").

If the directory is local to the Oracle host and you want to use the "stage" method you will need to run OHSH on the Oracle host.  If the directory is an NFS mount, you can run "stage" anywhere as long as the directory path (e.g. "/mnt/shared/tmp/ohshscratch") is visible and has read and write access for Oracle and the OHSH user.

For "fuse", the Fuse HDFS mount point (e.g. "/mnt/hdfs/tmp/ohshscratch") needs to be readable and writable by the Oracle host, and the underlying HDFS path (e.g. "/tmp/ohshscratch") needs to be HDFS readable and writable by the OHSH user.

Obviously, both "stage" and "fuse" methods impose administration and security complexity if you are not running OHSH on the Oracle system as user "oracle".  It involves configuring directories that are writable and readable by both Oracle and by the OHSH user.  This requires some workable combination of user and group file privileges on local disk or shared storage.

Making data pump files readable by Hive

Regardless of which method is used to create data pump files in an HDFS directory, Hive needs to be able to read them when queries are issued against the Hive external table.

The data pump files and the directory they live in are owned by the OHSH user in HDFS.  By default the data pump files landed in Hadoop are given user only (i.e. 700) file permissions.

If Hive impersonation is enabled, Hive can read the files because it is accessing the files under the identity of the OHSH user.

If Hive impersonation is not enabled, OHSH will change permissions at the end of the load to ensure the files are readable by the user identity of the HiveServer2 process (i.e. typically "hive").

OHSH detects whether Hive impersonation is enabled.  If it is not, OHSH takes administrative steps to allow the HiveServer2 process to read the files, either by adding an HDFS ACL to the storage directory or by relaxing the directory permissions to 750.  In the latter case, the primary group of the OHSH user in HDFS needs to include the user identity of the HiveServer2 process.
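The adjustments OHSH makes are roughly equivalent to the following manual HDFS commands, shown against the storage directory used in the examples below and assuming the HiveServer2 process runs as "hive":

$ hdfs dfs -setfacl -R -m user:hive:r-x /user/rhanckel/moviedemo/movie_ratings/moviedemo.movie_ratings_oracle
$ hdfs dfs -chmod -R 750 /user/rhanckel/moviedemo/movie_ratings/moviedemo.movie_ratings_oracle

The second form only helps if the primary HDFS group of the OHSH user includes the "hive" user, as noted above.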

Examples of loading using "directcopy" and "stage" methods

The Oracle table we are copying to Hive looks like this:

ohsh>%sql describe movie_ratings;

 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUSTID                                              NUMBER
 MOVIEID                                             NUMBER
 GENREID                                             NUMBER
 TIME                                                DATE
 RECOMMENDED                                         NUMBER
 ACTIVITY                                            NUMBER
 RATING                                              NUMBER
 SALES                                               NUMBER

Composing a "create or replace" Hive Table command

The OHSH "create or replace hive table" command requires identifying the following:

  • The name of the Oracle table to be copied, and the associated OHSH JDBC and SQL*Plus resources that access that table.
  • The name of the table to be created in Hive, and the associated Hive resource that accesses the Hive database where the table is to be created.
  • A root directory in HDFS that is writable by the user running OHSH and serves as a repository for tables copied from Oracle to Hive.  This is the directory tree where the data pump files will live.  It is specified by the OHSH default "hiverootdir", which can be overridden in the "create hive table" command.
  • For the "stage" or "fuse" methods, the OHSH default "locationdirectory" set to the name of the Oracle directory object that serves as a scratch directory where Oracle can write data pump files.  (If running the "stage" method this must be an OS directory visible to both Oracle and the OHSH user.  If running the "fuse" method it must be an HDFS Fuse mount of a directory visible to Oracle.  In either case the directory must be readable and writable by both Oracle and the OHSH user.)

OHSH wants to exclusively manage the sub-directories underneath "hiverootdir".  When copying a table from Oracle, OHSH creates a folder under "hiverootdir" whose name reflects the Hive database and Hive table into which the Oracle data has been copied.

Before running any of these methods, you need to define OHSH resources.  For "sql" and "omovies" (i.e. SQL*Plus and Oracle JDBC) we are using a TNS alias "inst1" to identify the Oracle database and relying on Oracle Wallet to do authentication transparently.

ohsh>create sqlplus resource sql connectid="inst1"
ohsh>create oracle jdbc resource omovies connectid="inst1"


We then define "hmovies" to connect to Hive and use the "moviedemo" database.

ohsh>create hive resource hmovies connectionurl="jdbc:hive2:///moviedemo"

In all of the methods below we will create or replace a Hive table called "movie_ratings_oracle" living in a Hive database called "moviedemo".   The actual data pump files backing this Hive table will live in the HDFS folder /user/rhanckel/moviedemo/movie_ratings/moviedemo.movie_ratings_oracle.   The "hiverootdir" will be set to /user/rhanckel/moviedemo/movie_ratings, and OHSH will create and manage the sub-folder underneath it that identifies the Hive database and table (i.e. moviedemo.movie_ratings_oracle).

Creating a Hive table from an Oracle table using "directcopy"

For "directcopy", you need to define the number of data pump files you want to be created, using the "set dpfilecount" command.

ohsh>set outputlevel verbose
ohsh>set dpfilecount 4
ohsh>set hiverootdir /user/rhanckel/moviedemo/movie_ratings
ohsh>create or replace hive table hmovies:movie_ratings_oracle from oracle table omovies:movie_ratings using directcopy

The verbose output reflects a Hadoop map-only job that reads Oracle data and produces the data pump files in HDFS.  When the map-only job finishes, the Hive external table is created by pointing it at the HDFS storage directory containing the data pump files and telling Hive to use oracle.hadoop.hive.datapump.DPSerDe.

Creating or replacing Hive table moviedemo.movie_ratings_oracle
  From Oracle table "OLHP"."MOVIE_RATINGS"
  Using direct copy to move content to Hive for transport to HDFS path /user/rhanckel/moviedemo/movie_ratings/moviedemo.movie_ratings_oracle

--------------------------------------------------------------------------------------------------------------------------------
Oracle Database Connectivity Settings
--------------------------------------------------------------------------------------------------------------------------------


TNS usage: CTOH (directcopy) job using client TNS path on hadoop cluster.
Wallet usage: CTOH (directcopy) job using client wallet path on hadoop cluster.

--------------------------------------------------------------------------------------------------------------------------------
Checking Hive metastore and preparing HDFS directory that will store exported data pump files
--------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------------------------
Dropping existing Hive table
--------------------------------------------------------------------------------------------------------------------------------


drop table moviedemo.movie_ratings_oracle;


scan complete in 3ms
Connecting to jdbc:hive2:///moviedemo;

Connected to: Apache Hive (version 1.1.0-cdh5.8.0)
Driver: Hive JDBC (version 1.1.0-cdh5.8.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (2.758 seconds)
Beeline version 1.1.0-cdh5.8.0 by Apache Hive
Closing: 0: jdbc:hive2:///moviedemo;
Deleting existing storage directory /user/rhanckel/moviedemo/movie_ratings/moviedemo.movie_ratings_oracle
Creating storage directory /user/rhanckel/moviedemo/movie_ratings/moviedemo.movie_ratings_oracle with permission 700.


--------------------------------------------------------------------------------------------------------------------------------
CTOH map-only job will be submitted to the Hadoop cluster to load a Hive table using DIRECTCOPY
--------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------------------------
Hadoop MapReduce Configuration Settings
--------------------------------------------------------------------------------------------------------------------------------
mapreduce.am.max-attempts=2
mapreduce.app-submission.cross-platform=false
mapreduce.client.completion.pollinterval=5000
mapreduce.client.genericoptionsparser.used=true
mapreduce.client.output.filter=FAILED
mapreduce.client.progressmonitor.pollinterval=1000
mapreduce.client.submit.file.replication=10

Lots of Hadoop configuration settings...


yarn.app.mapreduce.client.job.retry-interval=2000
yarn.app.mapreduce.client.max-retries=3
yarn.app.mapreduce.shuffle.log.backups=0
yarn.app.mapreduce.shuffle.log.limit.kb=0
yarn.app.mapreduce.shuffle.log.separate=true
yarn.app.mapreduce.task.container.log.backups=0

--------------------------------------------------------------------------------------------------------------------------------
CTOH Configuration Settings
--------------------------------------------------------------------------------------------------------------------------------


oracle.hadoop.ctoh.connection.tnsAdmin=/user/rhanckel/oracle/work
oracle.hadoop.ctoh.connection.username="OLHP"
oracle.hadoop.ctoh.connection.walletLoc=/user/rhanckel/oracle/work
oracle.hadoop.ctoh.datapump.basename=OSL_170707_215753
oracle.hadoop.ctoh.datapump.extension=.dmp
oracle.hadoop.ctoh.datapump.output=/user/rhanckel/moviedemo/movie_ratings/moviedemo.movie_ratings_oracle
oracle.hadoop.ctoh.home=/user/rhanckel/oracle/work/kits_scratch/orahivedp-3.1.3
oracle.hadoop.ctoh.jdbc.url=jdbc:oracle:thin:@inst1
oracle.hadoop.ctoh.maxSplits=4
oracle.hadoop.ctoh.splitterType=BLOCK_SPLITTER
oracle.hadoop.ctoh.table="OLHP"."MOVIE_RATINGS"

--------------------------------------------------------------------------------------------------------------------------------
Begin CTOH map-only execution on Hadoop
--------------------------------------------------------------------------------------------------------------------------------


[INFO] 2017-07-07 21:58:18,809 [oracle.hadoop.ctoh.CtohDriver]  ctoh-conf resource: jar:file:/net/slc03lfz/scratch/rhanckel/view_storage/rhanckel_hadoop15/work/kits_scratch/orahivedp-3.1.3/jlib/orahivedp.jar!/oracle/hadoop/ctoh/ctoh-conf.xml
[INFO] 2017-07-07 21:58:19,463 [oracle.hadoop.ctoh.CtohDriver]  Schema: OLHP Table: MOVIE_RATINGS
[INFO] 2017-07-07 21:58:19,947 [hadoop.conf.Configuration.deprecation]  session.id is deprecated. Instead, use dfs.metrics.session-id
[INFO] 2017-07-07 21:58:19,948 [hadoop.metrics.jvm.JvmMetrics]  Initializing JVM Metrics with processName=JobTracker, sessionId=
[INFO] 2017-07-07 21:58:20,797 [hadoop.ctoh.split.DBParallelSplitFactory]  Minimum number of chunks 3 < 4
[INFO] 2017-07-07 21:58:21,055 [hadoop.ctoh.split.DBParallelSplitFactory]  Number of Chunks: 6
[INFO] 2017-07-07 21:58:21,196 [apache.hadoop.mapreduce.JobSubmitter]  number of splits:4
[INFO] 2017-07-07 21:58:21,266 [apache.hadoop.mapreduce.JobSubmitter]  Submitting tokens for job: job_local1250038248_0001

...

 SELECT /*+ no_parallel */ "CUSTID", "MOVIEID", "GENREID", "TIME", "RECOMMENDED", "ACTIVITY", "RATING", "SALES" FROM "OLHP"."MOVIE_RATINGS" WHERE (ROWID BETWEEN ? and ?)
[INFO] 2017-07-07 21:58:24,230 [oracle.hadoop.ctoh.OjdbcRecordReader]  Query for Split: SELECT /*+ no_parallel */ "CUSTID", "MOVIEID", "GENREID", "TIME", "RECOMMENDED", "ACTIVITY", "RATING", "SALES" FROM "OLHP"."MOVIE_RATINGS" WHERE (ROWID BETWEEN ? and ?)

...

[INFO] 2017-07-07 21:58:24,230 [oracle.hadoop.ctoh.OjdbcRecordReader]  bindings Length=2
[INFO] 2017-07-07 21:58:29,108 [apache.hadoop.mapred.LocalJobRunner]  map task executor complete.
[INFO] 2017-07-07 21:58:29,618 [apache.hadoop.mapreduce.Job]   map 100% reduce 0%
[INFO] 2017-07-07 21:58:29,619 [apache.hadoop.mapreduce.Job]  Job job_local1250038248_0001 completed successfully
[INFO] 2017-07-07 21:58:29,654 [apache.hadoop.mapreduce.Job]  Counters: 16
    File System Counters
        FILE: Number of bytes read=159695873
        FILE: Number of bytes written=165760746
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=39716
        Map output records=39716
        Input split bytes=1196
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=11
        Total committed heap usage (bytes)=956301312
    oracle.hadoop.ctoh.
        rowCountCounter=39716
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=2598942

[INFO] 2017-07-07 21:58:29,690 [oracle.hadoop.ctoh.CtohDriver]  Pattern for transferring files OSL_170707_215753*.dmp
[INFO] 2017-07-07 21:58:29,701 [oracle.hadoop.ctoh.CtohDriver]  Number of files to transfer: 4
[INFO] 2017-07-07 21:58:29,703 [oracle.hadoop.ctoh.CtohDriver]  Moved File OSL_170707_215753-m-00000.dmp
[INFO] 2017-07-07 21:58:29,703 [oracle.hadoop.ctoh.CtohDriver]  Moved File OSL_170707_215753-m-00001.dmp
[INFO] 2017-07-07 21:58:29,704 [oracle.hadoop.ctoh.CtohDriver]  Moved File OSL_170707_215753-m-00002.dmp
[INFO] 2017-07-07 21:58:29,705 [oracle.hadoop.ctoh.CtohDriver]  Moved File OSL_170707_215753-m-00003.dmp

--------------------------------------------------------------------------------------------------------------------------------
End CTOH map-only execution on Hadoop
--------------------------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------------------------------------------------------
Creating the Hive table
--------------------------------------------------------------------------------------------------------------------------------

add jars /user/rhanckel/oracle/work/kits_scratch/orahivedp-3.1.3/jlib/orahivedp.jar /user/rhanckel/oracle/work/kits_scratch/orahivedp-3.1.3/jlib/oraloader.jar /user/rhanckel/oracle/work/kits_scratch/orahivedp-3.1.3/jlib/ojdbc7.jar /user/rhanckel/oracle/work/kits_scratch/orahivedp-3.1.3/jlib/orai18n.jar /user/rhanckel/oracle/work/kits_scratch/orahivedp-3.1.3/jlib/ora-hadoop-common.jar;CREATE EXTERNAL TABLE moviedemo.movie_ratings_oracle
  ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe'
  STORED AS
  INPUTFORMAT 'oracle.hadoop.hive.datapump.DPInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '/user/rhanckel/moviedemo/movie_ratings/moviedemo.movie_ratings_oracle';

Shutting down embedded metastore database.
scan complete in 3ms
Connecting to jdbc:hive2:///moviedemo;
Connected to: Apache Hive (version 1.1.0-cdh5.8.0)
Driver: Hive JDBC (version 1.1.0-cdh5.8.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
No rows affected (0.123 seconds)
OK
No rows affected (1.731 seconds)
Beeline version 1.1.0-cdh5.8.0 by Apache Hive
Closing: 0: jdbc:hive2:///moviedemo;

--------------------------------------------------------------------------------------------------------------------------------
Hive table creation successful
--------------------------------------------------------------------------------------------------------------------------------

All load processing has completed.
Load operation OSL_170707_215753 successful. Load time = 0:00:49

--------------------------------------------------------------------------------------------------------------------------------
End of CREATE_OR_REPLACE Hive table execution at 2017-07-07:21:58:43
--------------------------------------------------------------------------------------------------------------------------------

Creating a Hive table from an Oracle table using "stage" or "fuse"

The processing steps for "stage" and "fuse" are almost identical.   The following describes the processing of the "stage" method.

ohsh>set outputlevel verbose
ohsh>set locationdirectory MOVIEDEMO_STAGE_DIR
ohsh>set hiverootdir /user/rhanckel/moviedemo/movie_ratings
ohsh>create or replace hive table hmovies:movie_ratings_oracle from oracle table omovies:movie_ratings using stage

The verbose output below describes the steps being processed.

  • The existing external table is dropped and the related HDFS storage directory is deleted
  • Oracle exports new data pump files from the Oracle table to an OS stage directory
  • The data pump files are then moved to a final HDFS directory underneath the "hiverootdir" directory which reflects the database qualified name of the Hive table  
  • A Hive external table is created to point to the data pump contents living in this directory
  • Artifacts of the operation that are transient (e.g. the Oracle external table used to write data pump files) are deleted or dropped at the end of the operation

Creating or replacing Hive table moviedemo.movie_ratings_oracle
  From Oracle table "OLHP"."MOVIE_RATINGS"
  Using a local stage directory to move content to Hive for transport to HDFS path /user/rhanckel/moviedemo/movie_ratings/moviedemo.movie_ratings_oracle

--------------------------------------------------------------------------------------------------------------------------------
Checking Hive metastore and preparing HDFS directory that will store exported data pump files
--------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------------------------
Dropping existing Hive table
--------------------------------------------------------------------------------------------------------------------------------


drop table moviedemo.movie_ratings_oracle;
scan complete in 2ms
Connecting to jdbc:hive2:///moviedemo;
Connected to: Apache Hive (version 1.1.0-cdh5.8.0)
Driver: Hive JDBC (version 1.1.0-cdh5.8.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (2.33 seconds)
Beeline version 1.1.0-cdh5.8.0 by Apache Hive
Closing: 0: jdbc:hive2:///moviedemo;
Deleting existing storage directory /user/rhanckel/moviedemo/movie_ratings/moviedemo.movie_ratings_oracle
Creating storage directory /user/rhanckel/moviedemo/movie_ratings/moviedemo.movie_ratings_oracle with permission 700.

--------------------------------------------------------------------------------------------------------------------------------
Writing data pump files containing Oracle table content
--------------------------------------------------------------------------------------------------------------------------------

ALTER SESSION FORCE PARALLEL DML PARALLEL 4
ALTER SESSION FORCE PARALLEL QUERY PARALLEL 4

CREATE TABLE "OLHP"."OSL_170707_092253_EXT" ORGANIZATION EXTERNAL (TYPE ORACLE_DATAPUMP  DEFAULT DIRECTORY "MOVIEDEMO_DEFAULT_DIR" LOCATION (
"MOVIEDEMO_STAGE_DIR":'OSL_170707_092253-0.dmp',
"MOVIEDEMO_STAGE_DIR":'OSL_170707_092253-1.dmp',
"MOVIEDEMO_STAGE_DIR":'OSL_170707_092253-2.dmp', MOVIEDEMO_STAGE_DIR":'OSL_170707_092253-3.dmp'))

PARALLEL 4 AS SELECT  * FROM "OLHP"."MOVIE_RATINGS"

--------------------------------------------------------------------------------------------------------------------------------
Moving location file /user/rhanckel/oracle/work/cp2bda_stage/OSL_170707_092253-0.dmp from locally staged directory to HDFS directory /user/rhanckel/moviedemo/movie_ratings/moviedemo.movie_ratings_oracle
--------------------------------------------------------------------------------------------------------------------------------

...

--------------------------------------------------------------------------------------------------------------------------------
Moving location file /user/rhanckel/oracle/work/cp2bda_stage/OSL_170707_092253-3.dmp from locally staged directory to HDFS directory /user/rhanckel/moviedemo/movie_ratings/moviedemo.movie_ratings_oracle
--------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------------------------
Creating the Hive table
--------------------------------------------------------------------------------------------------------------------------------

add jars /user/rhanckel/oracle/work/kits_scratch/orahivedp-3.1.3/jlib/orahivedp.jar /user/rhanckel/oracle/work/kits_scratch/orahivedp-3.1.3/jlib/oraloader.jar /user/rhanckel/oracle/work/kits_scratch/orahivedp-3.1.3/jlib/ojdbc7.jar /user/rhanckel/oracle/work/kits_scratch/orahivedp-3.1.3/jlib/orai18n.jar /user/rhanckel/oracle/work/kits_scratch/orahivedp-3.1.3/jlib/ora-hadoop-common.jar;CREATE EXTERNAL TABLE moviedemo.movie_ratings_oracle
  ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe'
  STORED AS
  INPUTFORMAT 'oracle.hadoop.hive.datapump.DPInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '/user/rhanckel/moviedemo/movie_ratings/moviedemo.movie_ratings_oracle';
[WARN] 2017-07-07 09:23:16,064 [hadoop.hive.conf.HiveConf]  HiveConf of name hive.metastore.local does not exist
Shutting down embedded metastore database.
Connected to: Apache Hive (version 1.1.0-cdh5.8.0)
Driver: Hive JDBC (version 1.1.0-cdh5.8.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
No rows affected (0.114 seconds)
OK
No rows affected (1.646 seconds)
Beeline version 1.1.0-cdh5.8.0 by Apache Hive
Closing: 0: jdbc:hive2:///moviedemo;

--------------------------------------------------------------------------------------------------------------------------------
Hive table creation successful
--------------------------------------------------------------------------------------------------------------------------------


All load processing has completed.
Load operation OSL_170707_092253 successful. Load time = 0:00:35

--------------------------------------------------------------------------------------------------------------------------------
End of CREATE_OR_REPLACE Hive table execution at 2017-07-07:09:23:29
--------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------------------------
Drop Oracle external table used for exporting content.
--------------------------------------------------------------------------------------------------------------------------------

DROP TABLE "OLHP"."OSL_170707_092253_EXT" PURGE

ohsh>set outputlevel minimal

ohsh>%hmovies select count(*) from movie_ratings_oracle;
+--------+--+
|  _c0   |
+--------+--+
| 39716  |
+--------+--+

Incrementally loading a Hive table from an Oracle table

Once a Hive table has been created from an Oracle table, how do you add new rows that have been inserted into the source table?  First you need to figure out the appropriate where clause that identifies the new rows.
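One way to find that cutoff, assuming the "TIME" column only moves forward as new rows arrive, is to ask Hive for the latest value already loaded and use it as the lower bound of the where clause:

ohsh>%hmovies select max(time) from movie_ratings_oracle;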

Let's say the Oracle source table has new rows not yet copied to the Hive table, and these rows all have a "TIME" column value greater than or equal to October 1, 2012.  You can issue the OHSH load command, using any of the methods (i.e. "fuse", "stage", or "directcopy"), to inject the new rows into the data living under the target Hive table.


ohsh>load hive table hmovies:movie_ratings_oracle \
     from oracle table omovies:movie_ratings using stage \
     where "(time >= '01-OCT-12')"

The load operation runs through the usual steps of creating the data pump files and copying them to the HDFS storage directory supporting the Hive external table.

This works for appending new rows, not for updating rows that already exist in the Hive external table.

Similar syntax works for "directcopy" and "fuse":

ohsh>load hive table hmovies:movie_ratings_oracle \
     from oracle table omovies:movie_ratings using directcopy \
     where "(time >= '01-OCT-12')"

Best practices for loading Hive tables using OHSH

  • When possible, use "directcopy".  It is the simplest and most direct method for moving data and avoids the configuration complexities of "stage" and "fuse".  The case where "stage" and "fuse" make sense is when the Hadoop cluster cannot connect to the Oracle database because of network firewall constraints.  In such cases the data needs to be pushed from Oracle to Hadoop.
  • When using "stage" or "fuse", set the OHSH default "dop" to the largest dop value you are allowed to use in Oracle, to maximize the parallelism of exporting data to data pump files.
  • Security models work best when they are simple to implement and understand.   To simplify security and configuration in production environments, one good option is to run OHSH on the Oracle host as the OS user running Oracle (e.g. "oracle") and propagate data to HDFS living under the same OS user identity.  This keeps the data files under "oracle" control.  If HDFS ACLs are enabled, OHSH will allow read access to the "oracle"-owned data pump files to the owner of the HiveServer2 process (e.g. "hive").  Access to the data can then be controlled by Hive "GRANT SELECT" statements to a set of privileged users (see the example after this list).  This side-steps having to deal with clumsy group file privileges in HDFS.
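For example, with Hive authorization enabled, read access to the external table created above can be granted to individual users (the user name here is illustrative, and the exact GRANT syntax depends on the Hive authorization model in use):

GRANT SELECT ON TABLE moviedemo.movie_ratings_oracle TO USER analyst1;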
