How to Load Oracle and Hive tables using OHSH (Part 6 - Using the "etl" method for loading Oracle tables)

In this post we will discuss the "etl" method for loading tables. It was briefly mentioned in earlier posts, but because it is a hybrid of both Hadoop map/reduce processing and Oracle external tables, it made sense to defer its discussion until the simpler methods of loading Oracle tables (i.e. "exttab", "directcopy", and "jdbc") had been explained.

The "etl" method runs an OLH job on the Hadoop cluster to produce data pump files living in HDFS.  This is followed by an OSCH job which reads the data pump files using OSCH generated external tables and loads them into the target table.  The method name "etl" is short-hand for "extract, transform, and load".  In this case OLH is doing the extraction and tranforming, while OSCH is doing the loading.

Why is the method useful?

  • It can be used to back out of a job that encounters even one data conversion problem due to dirty data without wasting any resources on the system where Oracle is running.
  • It can load data pump files via "exttab" more quickly and efficiently than it can load delimited text directly. When loading HDFS delimited text files using "exttab", the heavy lifting is the translation of text fields into native Oracle data types, and the CPU cycles consumed are on the Oracle system. With "etl", that translation is offloaded to the Hadoop cluster in a standard map/reduce job. When the "exttab" method then reads the data pump files it is basically doing IO, moving opaque bytes into storage without much CPU processing.
  • The "exttab" phase can be deferred. This means the time when the Oracle database actually loads the data can be scheduled in a cron job to run at a point of low usage. More importantly, because that phase deals only with data pump files, the chance of running into a problem during it is very low.

Both Oracle and Hadoop have settings to abort a load if some number of rows is rejected because of dirty data that cannot be converted to specific data types. But because both Oracle and Hadoop have parallel architectures, the reject limit is scoped to each PQ slave or Hadoop task, and is not an absolute value across all slaves or tasks. It is basically a fuzzy setting in both architectures, used to control runaway processing disasters caused by an intrusion of lots of dirty data.

In general the sooner you detect such data the better.  

With the "etl" method you can assert that no Oracle load will occur if any dirty data is detected on the Hadoop side of the processing.  If the Hadoop map/reduce job completes without rejecting any records, you are guaranteed that the Oracle load won't fail because of Oracle data type conversion problems.  If it does, it is our (i.e. Oracle's) bug.

What this gate-keeping avoids is having to churn through Oracle external table bad files to determine which records were rejected and why. It also avoids having to identify all the bad records in HDFS, extract them, clean them up, and run a compensating load to complete the intended load.

Clearly there are use cases where a few bad apples don't spoil the rest of the barrel, but if they do, then "etl" is a good way to go.

Loading using "etl" method

The loading syntax is identical to that of the other methods used for loading Oracle tables, including OHSH loadermap functionality.

ohsh>set dateformat "yyyy-MM-dd:HH:mm:ss"
ohsh>load oracle table olhp:movie_ratings from path \
   hadoop0:/user/${USER}/moviedemo/movie_ratings/delimited_text \
   using etl

The output shows two discrete phases being executed. The first is a map/reduce job (using OLH) that creates Oracle data pump files in HDFS. The second loads those HDFS files using Oracle external tables generated by OSCH. The output is verbose, but it gives a clear idea of what is going on under the hood.

This is the first phase using Hadoop map/reduce to generate data pump files in HDFS.

-----------------------------------------------------------------------------------
Begin ETL execution at 2017-02-24:11:34:34
-----------------------------------------------------------------------------------

###################################################################################
Starting SmartLoader Job OSL_170224_113434
Local OS user: rhanckel
Hadoop user:   rhanckel
###################################################################################

Loading Oracle table "OLHP"."MOVIE_RATINGS"
  From HDFS files located at /user/rhanckel/moviedemo/movie_ratings/delimited_text
  Using Oracle Loader for Hadoop (OLH) Datapump file output
    into HDFS for subsequent OSCH load

-----------------------------------------------------------------------------------
Hadoop MapReduce Configuration Settings
-----------------------------------------------------------------------------------

mapreduce.am.max-attempts=2
mapreduce.app-submission.cross-platform=false
mapreduce.client.completion.pollinterval=5000
..

yarn.app.mapreduce.client.job.retry-interval=2000
yarn.app.mapreduce.client.max-retries=3
yarn.app.mapreduce.task.container.log.backups=0

-----------------------------------------------------------------------------------
OLH Configuration Settings
-----------------------------------------------------------------------------------

oracle.hadoop.loader.connection.tns_admin=/user/rhanckel/work
oracle.hadoop.loader.connection.url=jdbc:oracle:thin:@inst1
oracle.hadoop.loader.connection.wallet_location=/user/rhanckel/work
oracle.hadoop.loader.defaultDateFormat=yyyy-MM-dd:HH:mm:ss
oracle.hadoop.loader.input.fieldNames=F0,F1,F2,F3,F4,F5,F6,F7
oracle.hadoop.loader.input.fieldTerminator=\u002c
oracle.hadoop.loader.loaderMap."ACTIVITY".field=F5
oracle.hadoop.loader.loaderMap."CUSTID".field=F0
oracle.hadoop.loader.loaderMap."GENREID".field=F2
oracle.hadoop.loader.loaderMap."MOVIEID".field=F1
oracle.hadoop.loader.loaderMap."RATING".field=F6
oracle.hadoop.loader.loaderMap."RECOMMENDED".field=F4
oracle.hadoop.loader.loaderMap."SALES".field=F7
oracle.hadoop.loader.loaderMap."TIME".field=F3
oracle.hadoop.loader.loaderMap.columnNames="CUSTID","MOVIEID","GENREID","TIME","RECOMMENDED","ACTIVITY","RATING","SALES"
oracle.hadoop.loader.loaderMap.targetTable="OLHP"."MOVIE_RATINGS"
oracle.hadoop.loader.logBadRecords=true
oracle.hadoop.loader.rejectLimit=-1

-----------------------------------------------------------------------------------
Oracle Database Connectivity Settings
-----------------------------------------------------------------------------------

TNS usage: OLH job using client TNS path on hadoop cluster.
Wallet usage: OLH job using client wallet path on hadoop cluster.

-----------------------------------------------------------------------------------
Begin OLH execution at 2017-02-24:11:34:34
-----------------------------------------------------------------------------------

-----------------------------------------------------------------------------------
OLH MapReduce job is submitted to the Hadoop cluster to load an Oracle table using ETL_ONLINE
-----------------------------------------------------------------------------------

[INFO] 2017-02-24 11:34:34,153 [oracle.hadoop.loader.OraLoader]  Oracle Loader for Hadoop Release 3.8.1 - Production

 Copyright (c) 2011, 2017, Oracle and/or its affiliates. All rights reserved.

[INFO] 2017-02-24 11:34:34,153 [oracle.hadoop.loader.OraLoader]  Built-Against: hadoop-2.2.0 hive-0.13.0 avro-1.8.1 jackson-1.8.8
[INFO] 2017-02-24 11:34:34,453 [oracle.hadoop.loader.OraLoader]  oracle.hadoop.loader.enableSorting disabled, no sorting key provided
[INFO] 2017-02-24 11:34:34,770 [oracle.hadoop.balancer.Balancer]  Creating balancer
[INFO] 2017-02-24 11:34:34,836 [oracle.hadoop.balancer.Balancer]  Starting Balancer
[INFO] 2017-02-24 11:34:36,773 [oracle.hadoop.loader.OraLoader]  Submitting OraLoader job OSL_170224_113434
[INFO] 2017-02-24 11:34:36,795 [hadoop.conf.Configuration.deprecation]  session.id is deprecated. Instead, use dfs.metrics.session-id
[INFO] 2017-02-24 11:34:36,796 [hadoop.metrics.jvm.JvmMetrics]  Initializing JVM Metrics with processName=JobTracker, sessionId=
[INFO] 2017-02-24 11:34:37,137 [mapreduce.lib.input.FileInputFormat]  Total input paths to process : 1
[INFO] 2017-02-24 11:34:37,169 [apache.hadoop.mapreduce.JobSubmitter]  number of splits:1
[INFO] 2017-02-24 11:34:37,323 [apache.hadoop.mapreduce.JobSubmitter]  Submitting tokens for job: job_local978287496_0001

..

[INFO] 2017-02-24 11:34:42,223 [apache.hadoop.mapred.LocalJobRunner]  1 / 1 copied.
[INFO] 2017-02-24 11:34:42,223 [apache.hadoop.mapred.Task]  Task attempt_local978287496_0001_r_000004_0 is allowed to commit now
[INFO] 2017-02-24 11:34:42,225 [apache.hadoop.mapred.LocalJobRunner]  reduce > reduce
[INFO] 2017-02-24 11:34:42,226 [apache.hadoop.mapred.Task]  Task 'attempt_local978287496_0001_r_000004_0' done.
[INFO] 2017-02-24 11:34:42,226 [apache.hadoop.mapred.LocalJobRunner]  Finishing task: attempt_local978287496_0001_r_000004_0
[INFO] 2017-02-24 11:34:42,230 [apache.hadoop.mapred.LocalJobRunner]  reduce task executor complete.
[INFO] 2017-02-24 11:34:42,322 [oracle.hadoop.loader.OraLoader]  map 100% reduce 100%
[INFO] 2017-02-24 11:34:42,322 [oracle.hadoop.loader.OraLoader]  Job complete: OSL_170224_113434 (job_local978287496_0001)
[INFO] 2017-02-24 11:34:42,376 [oracle.hadoop.loader.OraLoader]  Counters: 30
        File System Counters
                FILE: Number of bytes read=303621893
                FILE: Number of bytes written=298705381
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=39716
                Map output records=39716
                Map output bytes=2540885
                Map output materialized bytes=2620347
                Input split bytes=142
                Combine input records=0
                Combine output records=0
                Reduce input groups=5
                Reduce shuffle bytes=2620347
                Reduce input records=39716
                Reduce output records=39716
                Spilled Records=79432
                Shuffled Maps =5
                Failed Shuffles=0
                Merged Map outputs=5
                GC time elapsed (ms)=215
                Total committed heap usage (bytes)=2119696384
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1688514
        File Output Format Counters
                Bytes Written=2619598

-----------------------------------------------------------------------------------
End OLH execution at 2017-02-24:11:34:42
-----------------------------------------------------------------------------------

This is the second phase, used to load the data pump files generated above.

Loading Oracle table "OLHP"."MOVIE_RATINGS"
  From HDFS data pump files located at  /user/rhanckel/smartloader/jobhistory/oracle/OLHP/MOVIE_RATINGS/OSL_170224_113434
  Using Oracle SQL Connector for Hadoop (OSCH) external tables

-----------------------------------------------------------------------------------
OSCH Configuration Settings
-----------------------------------------------------------------------------------

oracle.hadoop.connection.url=jdbc:oracle:thin:@inst1
oracle.hadoop.connection.user="OLHP"
oracle.hadoop.exttab.createBadFiles=true
oracle.hadoop.exttab.createLogFiles=false
oracle.hadoop.exttab.dataPaths=/user/rhanckel/smartloader/jobhistory/oracle/OLHP/MOVIE_RATINGS/OSL_170224_113434/oraloader-0*.dat
oracle.hadoop.exttab.defaultDirectory="OLHP_DEFAULT_DIR"
oracle.hadoop.exttab.locationFileCount=4
oracle.hadoop.exttab.sourceType=datapump
oracle.hadoop.exttab.tableName="OSL_170224_113434_EXT"

-----------------------------------------------------------------------------------
Create OSCH external table(s) used for accessing Hadoop content
-----------------------------------------------------------------------------------

Oracle SQL Connector for HDFS Release 3.7.1 - Production

 Copyright (c) 2011, 2017, Oracle and/or its affiliates. All rights reserved.

The create table command succeeded.

User: OLHP performed the following actions in schema: OLHP

CREATE TABLE "OLHP"."OSL_170224_113434_EXT"
(
  "CUSTID"                         NUMBER,
  "MOVIEID"                        NUMBER,
  "GENREID"                        NUMBER,
  "TIME"                           DATE,
  "RECOMMENDED"                    NUMBER,
  "ACTIVITY"                       NUMBER,
  "RATING"                         NUMBER,
  "SALES"                          NUMBER
)
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY "OLHP_DEFAULT_DIR"
   ACCESS PARAMETERS
   (
     external variable data
     NOLOGFILE
     PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
   )
   LOCATION
   (
     'osch-494CD2456A4A3007E0533A58F20AB90B-1-1',
     'osch-494CD2456A4A3007E0533A58F20AB90B-1-2',
     'osch-494CD2456A4A3007E0533A58F20AB90B-1-3',
     'osch-494CD2456A4A3007E0533A58F20AB90B-1-4',
     'osch-494CD2456A4A3007E0533A58F20AB90B-1-5'
   )
) PARALLEL REJECT LIMIT UNLIMITED;

The following location files were created.

osch-494CD2456A4A3007E0533A58F20AB90B-1-1 contains 1 URI, 266240 bytes

      266240 file:/user/rhanckel/smartloader/jobhistory/oracle/OLHP/MOVIE_RATINGS/OSL_170224_113434/oraloader-00000-dp-1.dat

...

osch-494CD2456A4A3007E0533A58F20AB90B-1-5 contains 1 URI, 270336 bytes

      270336 file:/user/rhanckel/smartloader/jobhistory/oracle/OLHP/MOVIE_RATINGS/OSL_170224_113434/oraloader-00004-dp-1.dat

-----------------------------------------------------------------------------------
Begin OSCH execution  at 2017-02-24:11:34:42
-----------------------------------------------------------------------------------

-----------------------------------------------------------------------------------
Begin single load (1/1) at 2017-02-24:11:34:43
-----------------------------------------------------------------------------------

ALTER SESSION FORCE PARALLEL DML PARALLEL 4
ALTER SESSION FORCE PARALLEL QUERY PARALLEL 4
INSERT /*+ append pq_distribute("OLHP"."MOVIE_RATINGS", none) */ INTO "OLHP"."MOVIE_RATINGS"("CUSTID","MOVIEID","GENREID","TIME","RECOMMENDED","ACTIVITY","RATING","SALES") SELECT "CUSTID","MOVIEID","GENREID","TIME","RECOMMENDED","ACTIVITY","RATING","SALES" FROM "OSL_170224_113434_EXT"

Transaction is committed.

-----------------------------------------------------------------------------------
End single load at 2017-02-24:11:35:05. Elapsed load time = 0:00:22.
-----------------------------------------------------------------------------------

-----------------------------------------------------------------------------------
End OSCH execution at 2017-02-24:11:35:05
-----------------------------------------------------------------------------------

-----------------------------------------------------------------------------------
Executing OSCH post-load cleanup
-----------------------------------------------------------------------------------

Oracle SQL Connector for HDFS Release 3.7.1 - Production

 Copyright (c) 2011, 2017, Oracle and/or its affiliates. All rights reserved.

The drop command was successful.

User: OLHP dropped OSCH external table OLHP.OSL_170224_113434_EXT and all associated location files.

Successfully dropped transient tables/views created for the load

Deleting temporary datapump files used for loading.

Deleting file:/user/rhanckel/smartloader/jobhistory/oracle/OLHP/MOVIE_RATINGS/OSL_170224_113434/oraloader-00001-dp-1.dat
...
Deleting file:/user/rhanckel/smartloader/jobhistory/oracle/OLHP/MOVIE_RATINGS/OSL_170224_113434/oraloader-00002-dp-1.dat

###################################################################################
Ending SmartLoader Job OSL_170224_113434
Local OS user: rhanckel
Hadoop user:   rhanckel
###################################################################################

Load operation successful. Load time = 0:00:35

--------------------------------------------------------------------------------------------------------------------------------
End of ETL load  operation at 2017-02-24:11:35:10
--------------------------------------------------------------------------------------------------------------------------------

ohsh>%sql select count(*) from movie_ratings;

  COUNT(*)
----------
     39716

Loading using "etl" method with rejectlimit set to 0

Now let's run the same job as above, but first introduce one dirty record.

The load declares the following dateformat:

ohsh>set dateformat "yyyy-MM-dd:HH:mm:ss"

The first record in the file has a timestamp value compliant with this format:

1363545,27205,9,2012-07-01:00:04:03,1,5,,

We will change it to this, which should cause the row to be rejected:

1363545,27205,9,2012-07-01 00:04:03,1,5,,

We copy the modified file to another HDFS directory and load it again, this time specifying that we don't want to execute the "exttab" phase if one or more bad records are found.
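
For illustration, one way to produce that modified copy is with %bash0 commands from the OHSH prompt. This is only a sketch: the source file name tkhp_moviedata.csv is an assumption (the job output below only shows the target name tkhp_moviedata_onebadrecord.csv).

ohsh>%bash0 hdfs dfs -get /user/${USER}/moviedemo/movie_ratings/delimited_text/tkhp_moviedata.csv /tmp/tkhp_moviedata_onebadrecord.csv
ohsh>%bash0 sed -i '1s/2012-07-01:00:04:03/2012-07-01 00:04:03/' /tmp/tkhp_moviedata_onebadrecord.csv
ohsh>%bash0 hdfs dfs -mkdir -p /user/${USER}/moviedemo/movie_ratings/delimited_text_bad
ohsh>%bash0 hdfs dfs -put /tmp/tkhp_moviedata_onebadrecord.csv /user/${USER}/moviedemo/movie_ratings/delimited_text_bad/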

The Hadoop map/reduce phase of the "etl" job will see the bad record and abort the Oracle side of the load operation.  The "exttab" phase will not be run.

ohsh>set rejectlimit 0
ohsh>load oracle table olhp:movie_ratings from path  \
  hadoop0:/user/rhanckel/moviedemo/movie_ratings/delimited_text_bad \
  using etl

The first phase begins on the Hadoop side, and the "etl" job discovers one record is bad.

-----------------------------------------------------------------------------------
Begin ETL_ONLINE execution at 2017-02-24:12:04:07
-----------------------------------------------------------------------------------

###################################################################################
Starting SmartLoader Job OSL_170224_120407
Local OS user: rhanckel
Hadoop user:   rhanckel
###################################################################################

Loading Oracle table "OLHP"."MOVIE_RATINGS"
  From HDFS files located at    
  /user/rhanckel/moviedemo/movie_ratings/delimited_text_bad
  Using Oracle Loader for Hadoop (OLH) Datapump file output into HDFS
  for subsequent OSCH load

-----------------------------------------------------------------------------------
Hadoop MapReduce Configuration Settings
-----------------------------------------------------------------------------------

mapreduce.am.max-attempts=2
mapreduce.app-submission.cross-platform=false
mapreduce.client.completion.pollinterval=5000

Lots of Hadoop settings...

yarn.app.mapreduce.client.job.retry-interval=2000
yarn.app.mapreduce.client.max-retries=3
yarn.app.mapreduce.task.container.log.backups=0

-----------------------------------------------------------------------------------
OLH Configuration Settings
-----------------------------------------------------------------------------------

oracle.hadoop.loader.connection.tns_admin=/user/rhanckel/work
oracle.hadoop.loader.connection.url=jdbc:oracle:thin:@inst1
oracle.hadoop.loader.connection.wallet_location=/user/rhanckel/work
oracle.hadoop.loader.defaultDateFormat=yyyy-MM-dd:HH:mm:ss
oracle.hadoop.loader.input.fieldNames=F0,F1,F2,F3,F4,F5,F6,F7
oracle.hadoop.loader.input.fieldTerminator=\u002c
oracle.hadoop.loader.loaderMap."ACTIVITY".field=F5
oracle.hadoop.loader.loaderMap."CUSTID".field=F0
oracle.hadoop.loader.loaderMap."GENREID".field=F2
oracle.hadoop.loader.loaderMap."MOVIEID".field=F1
oracle.hadoop.loader.loaderMap."RATING".field=F6
oracle.hadoop.loader.loaderMap."RECOMMENDED".field=F4
oracle.hadoop.loader.loaderMap."SALES".field=F7
oracle.hadoop.loader.loaderMap."TIME".field=F3
oracle.hadoop.loader.loaderMap.columnNames="CUSTID","MOVIEID","GENREID","TIME","RECOMMENDED","ACTIVITY","RATING","SALES"
oracle.hadoop.loader.loaderMap.targetTable="OLHP"."MOVIE_RATINGS"
oracle.hadoop.loader.logBadRecords=true
oracle.hadoop.loader.rejectLimit=0

-----------------------------------------------------------------------------------
Oracle Database Connectivity Settings
-----------------------------------------------------------------------------------

TNS usage: OLH job using client TNS path on hadoop cluster.
Wallet usage: OLH job using client wallet path on hadoop cluster.

-----------------------------------------------------------------------------------
Begin OLH execution at 2017-02-24:12:04:07
-----------------------------------------------------------------------------------

-----------------------------------------------------------------------------------
OLH MapReduce job is submitted to the Hadoop cluster to load an Oracle table using ETL_ONLINE
-----------------------------------------------------------------------------------

[INFO] 2017-02-24 12:04:07,360 [oracle.hadoop.loader.OraLoader]  Oracle Loader for Hadoop Release 3.8.1 - Production

 Copyright (c) 2011, 2017, Oracle and/or its affiliates. All rights reserved.

[INFO] 2017-02-24 12:04:07,360 [oracle.hadoop.loader.OraLoader]  Built-Against: hadoop-2.2.0 hive-0.13.0 avro-1.8.1 jackson-1.8.8
[INFO] 2017-02-24 12:04:07,671 [oracle.hadoop.loader.OraLoader]  oracle.hadoop.loader.enableSorting disabled, no sorting key provided
[INFO] 2017-02-24 12:04:08,119 [hadoop.conf.Configuration.deprecation]  mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
[INFO] 2017-02-24 12:04:08,120 [hadoop.conf.Configuration.deprecation]  mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
[INFO] 2017-02-24 12:04:08,157 [mapreduce.lib.input.FileInputFormat]  Total input paths to process : 1
[INFO] 2017-02-24 12:04:08,243 [mapreduce.lib.output.FileOutputCommitter]  File Output Committer Algorithm version is 1
[INFO] 2017-02-24 12:04:08,249 [mapreduce.lib.output.FileOutputCommitter]  File Output Committer Algorithm version is 1
[WARN] 2017-02-24 12:04:08,574 [oracle.hadoop.loader.OraLoaderMapper]  oracle.hadoop.loader.OraLoaderException: error parsing input data: Unparseable date: "2012-07-01 00:04:03"
[WARN] 2017-02-24 12:04:08,579 [oracle.hadoop.loader.OraLoaderMapper]  skipping record [Offset 0 in
file:///user/rhanckel/moviedemo/movie_ratings/delimited_text_bad/tkhp_moviedata_onebadrecord.csv]
[INFO] 2017-02-24 12:04:12,632 [apache.hadoop.mapred.MapTask]  Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
[WARN] 2017-02-24 12:04:12,696 [oracle.hadoop.loader.OraLoaderMapper]  oracle.hadoop.loader.OraLoaderException: error parsing input data: Unparseable date: "2012-07-01 00:04:03"
[WARN] 2017-02-24 12:04:12,697 [oracle.hadoop.loader.OraLoaderMapper]  skipping record [Offset 0 in
file:/user/rhanckel/moviedemo/movie_ratings/delimited_text_bad/tkhp_moviedata_onebadrecord.csv]
[INFO] 2017-02-24 12:04:15,565 [oracle.hadoop.loader.OraLoader]  Counters: 32
        File System Counters
                FILE: Number of bytes read=47308629
                FILE: Number of bytes written=47170746
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=39716
                Map output records=39715
                Map output bytes=2540821
                Map output materialized bytes=2620281
                Input split bytes=159
                Combine input records=0
                Combine output records=0
                Reduce input groups=0
                Reduce shuffle bytes=524096
                Reduce input records=0
                Reduce output records=0
                Spilled Records=39715
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=410
                Total committed heap usage (bytes)=424148992
        Rows skipped by input error
                Parse Error=1
                Total rows skipped by input error=1
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1675410
        File Output Format Counters
                Bytes Written=0
[WARN] 2017-02-24 12:04:15,580 [oracle.hadoop.balancer.Balancer]  Cannot save balancer state: Job did not complete or was not successful

-----------------------------------------------------------------------------------
End OLH execution at 2017-02-24:12:04:15
-----------------------------------------------------------------------------------

The second phase of the "etl" method is aborted, so no Oracle resources are spent churning through a bad load.

Error: oracle.hadoop.smartloader.api.SmartLoaderException: ETL load was aborted during the OLH phase and no rows were loaded into Oracle.

Typically this is because the reject limit was exceeded. See the OLH job history for details.

The load fails and aborts the "exttab" load into Oracle.  No Oracle system resources are wasted on a bad load.

Loading using "etl deferred" method

The "etl" model can be used to split the time when the map/reduce phase produces data pump files, and the time when "exttab" is used to actually load the data into Oracle.  Presumably the actual load to Oracle would be fired off by a user defined "cron" job.  

This method is called "etl deferred".  The command itself has the same pattern as other Oracle table loading commands.

ohsh>load oracle table olhp:movie_ratings from path \
    hadoop0:/user/${USER}/moviedemo/movie_ratings/delimited_text \
    using etl deferred scriptdir=/tmp

The command above runs an OLH job that produces data pump files in HDFS, and then writes an OHSH script, local to the system where OHSH is running, that finishes the job of loading the table using "exttab".

Note that the "etl deferred" method is followed by "scriptdir=/tmp". This clause is specific to the "etl deferred" method: "scriptdir" names the directory where the generated script that loads the table using "exttab" is written. In this example, the script will be generated in "/tmp" on the system where OHSH is run.

The filename of the generated script will reflect the OHSH job name.

e.g. /tmp/OSL_170226_160417_offline.ohsh

The script contains the following, which when executed will complete the load job.

ohsh>load oracle table olhp:MOVIE_RATINGS from path \
  hadoop0:"/user/rhanckel/smartloader/jobhistory/oracle/OLHP/MOVIE_RATINGS/OSL_170226_160417" \
  fileformat datapump  \
  using exttab

The HDFS path is the directory where the OLH phase created and populated the data pump files containing the contents to be loaded into the MOVIE_RATINGS table.

But how does OHSH or the caller know the name of the generated file (e.g. OSL_170226_160417_offline.ohsh)?

The "etl deferred" command finishes by setting an OHSH environment variable called OHSH_LATEST_DEFERRED, which holds the name of the generated OHSH script. User logic can read this variable after the command completes in order to schedule the rest of the load job.

Below is a full example where %bash0 operations are used to generate a wrapper OHSH file that can be picked up by a cron job.

ohsh>@/tmp/mdresources

ohsh>%sql truncate table movie_ratings;
ohsh>set outputlevel verbose
ohsh>set dateformat "yyyy-MM-dd:HH:mm:ss"

ohsh> load oracle table olhp:movie_ratings from path \
    hadoop0:"/user/${USER}/moviedemo/movie_ratings/delimited_text"\
    using etl deferred scriptdir=/tmp

At this point OHSH has run an OLH job and produced data pump files living in HDFS. It has also written the generated script OSL_170226_160417_offline.ohsh, which contains the following one-line OHSH command to load the generated data pump files using "exttab".

load oracle table olhp:MOVIE_RATINGS from path hadoop0:"/user/rhanckel/smartloader/jobhistory/oracle/OLHP/MOVIE_RATINGS/OSL_170226_160417/etldeferred" fileformat datapump using exttab

Getting back to the parent script above, we can add %bash0 commands to create a wrapper OHSH script (i.e. "OSL_170226_160417_offline.ohsh.wrapper.ohsh") that sets the run-time context needed by the generated load command.

%bash0 echo "@/tmp/mdresources" > /tmp/${OHSH_LATEST_DEFERRED}.wrapper.ohsh
%bash0 echo "set dop 4" >> /tmp/${OHSH_LATEST_DEFERRED}.wrapper.ohsh
%bash0 echo "@/tmp/${OHSH_LATEST_DEFERRED}" >>  /tmp/${OHSH_LATEST_DEFERRED}.wrapper.ohsh
%bash0 echo "%sql select count(*) from movie_ratings;" >> /tmp/${OHSH_LATEST_DEFERRED}.wrapper.ohsh

The wrapper code is an OHSH script that looks like this:

@/tmp/mdresources
set dop 4
@/tmp/OSL_170226_160417_offline.ohsh
%sql select count(*) from movie_ratings;

The wrapper script can be invoked by appending a call to it to some file executed by a crontab job that does nightly loads, as sketched below.
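
For example, continuing the %bash0 pattern above, the wrapper invocation could be appended to a hypothetical /tmp/nightly_loads.ohsh script that a cron job feeds to OHSH each night (the nightly_loads.ohsh name and the cron wiring are assumptions, not part of OHSH itself):

%bash0 echo "@/tmp/${OHSH_LATEST_DEFERRED}.wrapper.ohsh" >> /tmp/nightly_loads.ohsh

How the cron entry itself launches OHSH with that script depends on how OHSH is installed and invoked in your environment.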

Assumptions when using the "etl" method

This model works assuming the following:

  • That an Oracle wallet is used for authentication
  • That the resources and defaults that are needed have been properly declared within a wrapper OHSH script
  • That the user will clean up the data pump files living in HDFS after the load has been done (see the sketch after this list)
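
A minimal cleanup sketch, assuming the job history directory layout shown in the output above (substitute the job name your deferred run actually produced):

ohsh>%bash0 hdfs dfs -rm -r /user/${USER}/smartloader/jobhistory/oracle/OLHP/MOVIE_RATINGS/OSL_170226_160417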
