How to Load Oracle Tables From Hadoop Tutorial (Part 3 - Direct Path)

Oracle Loader for Hadoop: OCI Direct Path

In the previous tutorial post we discussed the basic mechanics and structure of an OLH job using JDBC. In this post we move on to the more mainstream method used for OLH, specifically OCI Direct Path. The focus here is on loading Oracle tables with really big data, and we will discuss how to do this efficiently and provide some basic rules for optimizing load performance. We will cover the mechanics of submitting an OLH job, and then take a dive into why this OLH load method is what you want to use for most situations.

The Structure of an OLH Command using OCI Direct Path

The structure of an OLH command using OCI Direct Path is very similar to the structure we described for submitting a JDBC load:

$HADOOP_HOME/bin/hadoop jar \
  $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -D oracle.hadoop.loader.jobName=OLHP_fivdti_dtext_oci_10_723 \
  -D oracle.hadoop.loader.loaderMapFile=file:/tmp/loaderMap_fivdti.xml \
  -D mapred.reduce.tasks=10 \
  -D mapred.input.dir=/user/olh_performance/fivdti/56000000_90 \
  -D mapred.output.dir=/user/oracle/olh_test/results/fivdti/723 \
  -conf /tmp/oracle_connection.xml \
  -conf /tmp/dtextInput.xml \
  -conf /tmp/dlOutput.xml

Aside from cosmetic changes (e.g. the job name), the key differences between this and the JDBC command discussed in lesson 2 are a non-zero value for the “mapred.reduce.tasks” property and a different conf file for specifying the type of output (i.e. the “mapred.reduce.tasks” setting and the final “-conf /tmp/dlOutput.xml” line above).

The new file we are using, “dlOutput.xml”, specifies the output format is OCI Direct Path (and not JDBC):

<configuration>
  <property>
    <name>mapreduce.outputformat.class</name>
    <value>oracle.hadoop.loader.lib.output.OCIOutputFormat</value>
  </property>
</configuration>
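
For comparison, the conf file used for JDBC loading in the previous post points the same property at the JDBC output format class in the OLH library. A minimal sketch (assuming the JDBCOutputFormat class name shipped with OLH):

<configuration>
  <property>
    <name>mapreduce.outputformat.class</name>
    <value>oracle.hadoop.loader.lib.output.JDBCOutputFormat</value>
  </property>
</configuration>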

So switching from JDBC to OCI Direct Path is trivial. A little less trivial is why OCI Direct Path is preferred and what rules you should know to make this type of loading perform well and to maximize efficiency.

Rule 1: When using OCI Direct Path the target table must be partitioned.

This might sound like a constraint, but practically speaking it isn’t.

Exploiting Oracle Table Partitioning


A full understanding of Oracle table partitioning goes beyond the scope of this tutorial, and you are advised to read the related documentation that covers the subject in depth. But for the sake of readers who live mostly in the world of Hadoop and have a limited understanding of Oracle, let’s briefly outline the basics of what Oracle table partitioning is and why it is essential to understand.

Rule 2: If you are loading really big data into an Oracle table, your Oracle table will want to be partitioned.

The reason is pretty simple. Table partitions are Oracle’s method of breaking up a table into workloads that can be optimized transparently by SQL. In the same way MapReduce jobs scale out by breaking up a workload into data blocks and scheduling tasks to work in parallel against data blocks, Oracle SQL does the same with partitions. This is not only true for querying but it is also true for doing big loads.

Let’s look at the “fivdti” table we have been using. A flat table would be declared like this:

CREATE TABLE fivdti
  (f1 NUMBER,
  i2 INT,
  v3 VARCHAR2(50),
  d4 DATE,
  t5 TIMESTAMP,
  v6 VARCHAR2(200),
  i7 INT);

A partitioned table declaration, using a hash partitioning scheme would look like this:

CREATE TABLE fivdti
  (f1 NUMBER,
  i2 INT,
  v3 VARCHAR2(50),
  d4 DATE,
  t5 TIMESTAMP,
  v6 VARCHAR2(200),
  i7 INT)
PARTITION BY HASH(i7)
PARTITIONS 10
PARALLEL;

With the simple addition of the partition clause at the bottom of the CREATE TABLE statement, you’ve empowered Oracle to exploit big optimizations for processing. The clause tells Oracle that the table should be divided into 10 partitions, and the partition for a row is determined by performing a hash operation on the value of the i7 column. If you were to compare load rates into the flat table and into the partitioned table, using OLH, SQL*Loader, or SQL, you would typically see a dramatic difference that favors partitioning. The same holds true for querying: against partitioned tables, SQL can do all sorts of tricks under the covers to use parallel query technology that subdivides a job and maximizes parallel CPU and IO.
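
As an illustrative example (a hypothetical query, not from the original benchmark), an equality predicate on the hash partition key lets the optimizer prune the scan to the single partition that can contain matching rows, rather than reading all ten:

-- Only the hash partition that maps i7 = 42 is scanned (partition pruning).
SELECT COUNT(*) FROM fivdti WHERE i7 = 42;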

Oracle table partitioning comes in various flavors such as hash, list, and range, and these can be combined into composite schemes (e.g. range-hash). OLH supports all partition methods except reference partitioning and virtual column-based partitioning.
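
For instance, a range-partitioned variant of the same table might be declared as sketched below (the table name, partition names, and date boundaries are made up for illustration):

CREATE TABLE fivdti_by_day
  (f1 NUMBER,
  i2 INT,
  v3 VARCHAR2(50),
  d4 DATE,
  t5 TIMESTAMP,
  v6 VARCHAR2(200),
  i7 INT)
PARTITION BY RANGE (d4)
  (PARTITION p_2013_h1 VALUES LESS THAN (TO_DATE('2013-07-01','YYYY-MM-DD')),
   PARTITION p_2013_h2 VALUES LESS THAN (TO_DATE('2014-01-01','YYYY-MM-DD')),
   PARTITION p_max VALUES LESS THAN (MAXVALUE));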

Advantages of OCI Direct Path

OCI Direct Path is a well-established method of loading data into Oracle using OCI (Oracle’s C based client interface) or SQL*Loader. It is a code path dedicated to bulk loading and its key advantage is that it bypasses Oracle SQL, which makes it very efficient.

Virtually all relational database systems including Oracle are built on two layers of software: one for managing data at the row level (i.e. SQL), and another for managing data at the block level (i.e. storage). Loading through SQL (i.e. Oracle’s front door) is expensive. It’s okay when one is inserting a singleton row or a small array of rows, but it uses a lot of code path before the rows are passed onto storage and are copied into data blocks that ultimately get written to disk.

OCI Direct Path load is a shortcut with an API whose code path, both on the client and in the Oracle database, is streamlined for loading. It does the work of preparing rows for storage in data blocks using client resources. (In our case the client is OLH running in Hadoop.) It then sends blocks of rows to Oracle’s storage layer in a form close to what will be written to disk, on a code path that minimizes contention: rows don’t need to pass through Oracle's buffer cache layer. It also maximizes parallelism for multi-block IO. OCI Direct Path can also take advantage of presorted data, which helps if it needs to build indexes for a table.

Running an OLH Job With OCI Direct Path

This pretty much looks the same as running a job with JDBC, except that the reduce phase always executes (since the target table is partitioned) and the load is much faster. For both JDBC and OCI Direct Path the actual loading of the Oracle table occurs when the reduce phase is 67% complete. For large loads approaching or exceeding a terabyte you will see a big difference in the time spent in this phase: OCI Direct Path is much faster than JDBC.

Are You Balanced?


Rule 3: After running an OLH load, check out the Oraloader report to see if it is balanced.

After a successful OLH job, a report called “oraloader-report.txt” is written to the output directory (specified by the “mapred.output.dir” property) that details the work done in the reduce phase. It identifies the reducer tasks that ran and the statistics of their workloads: bytes loaded, records loaded, and duration of the task (in seconds). If the load is not balanced, the values for bytes and duration will vary significantly between reduce tasks, and you will want to make adjustments.
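
You can also sanity-check the distribution on the database side. A quick query such as the following (a sketch; NUM_ROWS is only populated after gathering optimizer statistics, e.g. with DBMS_STATS.GATHER_TABLE_STATS) shows how many rows landed in each partition:

-- Per-partition row counts for the target table.
SELECT partition_name, num_rows
  FROM user_tab_partitions
 WHERE table_name = 'FIVDTI'
 ORDER BY partition_name;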

Optimizing OLH and OCI Direct Path

Now we will discuss basic steps to optimize OLH using OCI Direct Path:

· Choosing a good number for Reducer Tasks

· Enabling the OLH Sampler

· Finding the sweet spot for Hadoop Map Reduce payloads

· If possible, load using the SDP transport protocol

Choosing a Number for Reducer Tasks

Rule 4: When using OCI Direct Path you want to choose the number of reducer tasks to be close to a multiple of the number of reducer slots allocated on your Hadoop cluster.

Reducer slots in Hadoop are the number of reduce processes that can run in a Hadoop cluster at once; for an OLH job they perform the reduce phase that does the actual loading. The Hadoop Map/Reduce Administration UI displays this as Reduce Task Capacity. Typically you choose some multiple of the number of reducer slots available. For example, if the reduce task capacity in the Hadoop cluster is 50, then a mapred.reduce.tasks value of 50 or 100 should work well.

The purpose of this rule is to get reducers running and loading at the same time, and to make sure all available slots are being used. Not doing this can be costly. For example, suppose there are 50 reducer slots but you set the number of reducer tasks to 51. If reduce loads are balanced then the 50 reducer slots will start and finish at roughly the same time, but you will have to wait for the singleton 51st task to run, which roughly doubles the time the reduce phase spends loading the data.

Rule 4 only works fully to your advantage when the data sets are balanced (i.e. you are using the Sampler) and your OLH job is not competing with other ongoing Map Reduce jobs that can steal reduce slots that you were expecting to use.  Note that Apache actually recommends a value close to a multiple of the number of reducer slots, for dealing with situations where reducers are not balanced.

This takes us to the next rule.

Rule 5: Always use the OLH Sampler.

The OLH Sampler


The OLH Sampler is an optional feature of OLH that does a great job of balancing the workloads of reducer tasks when partitions are not balanced.  (Note that the Sampler works with all OLH load methods, not just OCI Direct Path).  You can control the Sampler manually by setting the following property to “true” or “false” (for recent versions of OLH the Sampler is turned on by default):

-D oracle.hadoop.loader.sampler.enableSampling=true

For example, suppose you had a customer table partitioned using list partitioning on the fifty states in the United States. Most likely the partition representing California will be much larger than the one for New Hampshire. Without the OLH Sampler enabled, a single reducer task has the burden of publishing a whole partition. This means that one reducer will have to publish the California records while another is tasked to publish the records from New Hampshire. This causes skew, where some tasks have much bigger workloads than others. The OLH Sampler addresses this pathology and breaks up large partitions into smaller, equal-sized units that can be dispatched evenly across the various reducer tasks.

The overhead of the OLH Sampler is very small for big data payloads. A Hadoop Map Reduce job typically takes minutes or hours, while the Sampler overhead typically takes a few seconds. (OLH console output tells you at the outset if the Sampler is running and how much time it cost.) It runs at the beginning of the Map Reduce job and samples the dataset to determine differences between partition sizes; it then creates a partitioning strategy that balances the reduce load evenly.

Another pathology that the Sampler addresses is when you have more available reducer slots than partitions in your table. For instance, suppose your table has 10 partitions but your Hadoop cluster has 50 reducer slots free. You would want to set the number of reduce tasks to take advantage of all these reducer slots to speed up the load.

-D mapred.reduce.tasks=50

But without the Sampler enabled this tuning knob would not have the desired effect. When the Sampler is not enabled, each partition is restricted to a single reducer task, which means that only 10 reducers will do real work and the other 40 reduce slots will have nothing to do.

Based on our experience the Sampler should be used virtually all the time. The only situation to be wary of is when the Hadoop input splits are clustered by the reduce key (e.g. the input data living in HDFS files is sorted by the value of the partition column). Under these circumstances loads might still be unbalanced. The work-around for clustered data is to force the Sampler to spend more time looking at the distribution of data by looking at more splits (by default it looks at at least five). This is done by using the following property and setting <N> to a higher number.

-D oracle.hadoop.loader.sampler.minSplits=<N>

Again, a higher number imposes more Sampler overhead at the beginning of the job, but this should be rewarded with more efficient use of Hadoop resources.
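
If you prefer keeping these settings in a conf file rather than passing -D flags on the command line, the same Sampler properties can live in one of the XML files passed with -conf. A sketch (the minSplits value of 10 is just an illustrative number larger than the default):

<configuration>
  <property>
    <name>oracle.hadoop.loader.sampler.enableSampling</name>
    <value>true</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.sampler.minSplits</name>
    <value>10</value>
  </property>
</configuration>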

Finding the Sweet Spot for Hadoop Map Reduce Payloads

Rule 6: Experiment with different sized payloads.

Hadoop is a great technology that does a good job of making sure that Map Reduce payloads scale. That being said, the resources of a Hadoop cluster are still finite, and there is a breaking point where load sizes are simply too big. Hadoop’s scaling typically breaks down in the reduce shuffle/sort stage, where a tremendous amount of disk and network IO goes on within a Hadoop cluster to move sorted data to the designated systems where reducer tasks will do the actual loading. A telling sign is when you see your Hadoop job start to suffer from failed and restarted task attempts in the reduce phase. The other obvious sign is that when you double your payload, the time to process the load increases by more than a factor of 2.

It’s a good idea to spend some time experimenting with different load sizes to see what your Hadoop configuration can handle. Obviously, if you break down a single big job into a series of smaller jobs, you will pay a higher cost of overhead for starting up and tearing down multiple Map Reduce jobs. That being said, breaking down a 90-minute OLH payload into three smaller 30-minute payloads is a perfectly reasonable strategy, since the startup/teardown overhead for running each OLH job is still very small compared to the total time running.
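
As a sketch of this strategy (the per-slice input subdirectories are hypothetical; any split of the input data into separate jobs works the same way), you can simply run the same OLH command once per slice of the input:

# Hypothetical input layout: one subdirectory per slice of the payload.
for slice in slice1 slice2 slice3
do
  $HADOOP_HOME/bin/hadoop jar \
    $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
    -D oracle.hadoop.loader.jobName=OLHP_fivdti_${slice} \
    -D oracle.hadoop.loader.loaderMapFile=file:/tmp/loaderMap_fivdti.xml \
    -D mapred.reduce.tasks=50 \
    -D mapred.input.dir=/user/olh_performance/fivdti/${slice} \
    -D mapred.output.dir=/user/oracle/olh_test/results/fivdti/${slice} \
    -conf /tmp/oracle_connection.xml \
    -conf /tmp/dtextInput.xml \
    -conf /tmp/dlOutput.xml
done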

Use the SDP Protocol on Exadata and the BDA


Rule 7: If you are using Oracle Exadata and Oracle BDA with Infiniband, use SDP protocol.

SDP is a network transport protocol supported for loading tables living in an Oracle Exadata machine with HDFS data living in Oracle BDA (Big Data Appliance). Exadata and BDA move data using Infiniband, which has very high throughput and low latency. Because Infiniband has such high bandwidth, conventional TCP sockets can become the bottleneck.

SDP is an alternative networking protocol that uses RDMA technology, which allows network interfaces to move data packets directly into RAM without involving the CPU. In other words it avoids the network bottleneck that is seen when using TCP. In performance test runs we’ve found that using SDP improves the load stage of an OLH OCI Direct Path job by ten to twenty percent.

If you are running OLH Direct Path jobs using Infiniband, you will want to take advantage of SDP. The way this is done is to configure the Exadata listeners with an SDP port, and to specify an additional Oracle connection descriptor dedicated to SDP when running OLH:

<property>
  <name>oracle.hadoop.loader.connection.oci_url</name>
  <value>(DESCRIPTION=(ADDRESS=(PROTOCOL=SDP)(HOST=192.168.40.200)(PORT=1523))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=dbm)))</value>
</property>

This esoteric property restricts SDP usage to the connections that OLH reduce tasks create to Oracle to execute OCI Direct Path loading. All other network activity uses standard TCP connections.
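
Putting it together, a connection conf file for an SDP-enabled OCI Direct Path load carries both descriptors: the standard JDBC-style URL (used for everything except the bulk load) and the SDP descriptor used by the reduce tasks. The sketch below uses placeholder host names, ports, service name, and credentials:

<configuration>
  <!-- Standard TCP connection used for metadata and setup work. -->
  <property>
    <name>oracle.hadoop.loader.connection.url</name>
    <value>jdbc:oracle:thin:@//myhost:1521/my_db_service_name</value>
  </property>
  <!-- SDP connection used only by reduce tasks for OCI Direct Path loading. -->
  <property>
    <name>oracle.hadoop.loader.connection.oci_url</name>
    <value>(DESCRIPTION=(ADDRESS=(PROTOCOL=SDP)(HOST=192.168.40.200)(PORT=1523))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=dbm)))</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.user</name>
    <value>SCOTT</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.password</name>
    <value>tiger</value>
  </property>
</configuration>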


Comments:

Bob,
Nice writeup. Always enjoy your writings. A few questions.
1. You mention Sockets Direct Protocol (SDP) is used between Exadata and BDA via Infiniband. I thought Oracle prefer to use RDS for the network traffic via Infiniband on Exadata. At least for the traffic inside the Exadata Rack, it is using RDS. So if use Exadata Expansion Rack, which network protocol will be used for the network traffic via Infiniband between the multiple Exadata systems, RDS or SDP? If it is using RDS, why traffic between Exadata and BDA is using SDP?

2. If I plan to use this SDP port on Exadata, I guess by default this port is not configured in the listener and I need to manually configure and enable this feature. Correct?

3. In your connection string, you are using 192.168.40.200 IP as the host. I guess it's definitely not a scan IP address as it is on the public/client network and Infiniband is on private network. But if I give a private IP of one DB node, although I may be able to see it from BDA, it's tied to a physical host. In the case if I need to reboot the db node, the connection will not failover to other db nodes like regular VIPs. So what's kind of IP you used in your example?

Posted by guest on July 28, 2013 at 07:58 PM PDT #

Hi Bob,
I've been following your articles and really enjoy the easy to understand writing style. I am working with my own hadoop cluster and trying to incorporate Oracle, but whenever I attempt to load tables using OLH my job is unsuccessful. The output states that the job is complete, but no data has been copied, and from the output file I gather that the map task failed. I was wondering if you have ever encountered this problem, or could offer any advice?

Thanks

Posted by guest on August 05, 2013 at 11:10 AM PDT #

Sorry you are having problems. Could you post the console output of the OLH job where it reports a failure? Also did the map tasks launch? If they did try to track down a mapper log using standard Hadoop Map/Reduce Admin GUI and please post any errors that you see.

Regards,

Bob

Posted by guest on August 05, 2013 at 04:42 PM PDT #

| I thought Oracle prefer to use RDS for the network traffic via
| Infiniband on Exadata. At least for the traffic inside the Exadata
| Rack, it is using RDS.

I will have to follow up on this with some folks who engineer the BDA to give a definitive answer on where RDS fits in.

The SDP protocol is supported by Java and by extension Hadoop and by standard Oracle client/server network infrastructure, so it is easy to use. (By the way our 192.168.40.200 IP address is a private IB address.)

You do have to manually configure it as a scan VIP. We use the VIP for load balancing over three IB addresses that are private to the Exadata and BDA box.

Posted by Bob on August 05, 2013 at 05:03 PM PDT #

Hi, Bob,
Thank you for the reply. Figured out SDP vs RDS Between Exadata and BDA, it is indeed using SDP similar to the traffic between Exadata and Exalytics. We configured the IB Listener on Exadata to listen to both TCP and SDP. They are running fine on Exadata.

I have two more questions:
1) In your example for using OCI Direct Path, your connection file oracle_connection.xml is still using the one from last example. However, that example is using jdbc connection string for the oracle.hadoop.loader.connection.url property. I thought OCI Direct Path is trying to avoid using jdbc. I am thinking it might use the property like oracle.hadoop.loader.connection.oci_url as shown in your sdp example.
2) I tested the SDP between Exadata and BDA. The JDBC type connection works. However, I tried OCI Direct Path and it kept giving me error like " Oracle database connection error: either connection url or TNS entry must be specified." I have tried many different combinations in the value field for oracle.hadoop.loader.connection.oci_url property and none of them works. Here is the connection file I used
<configuration>
<property>
<name>oracle.hadoop.loader.connection.oci_url</name>
<value>
(HOST=192.168.12.31)(PORT=1522))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=dbm)))
</value>
</property>

<property>
<name>oracle.hadoop.loader.connection.user</name>
<value>wzhou</value>
</property>

<property>
<name>oracle.hadoop.loader.connection.password</name>
<value>wzhou</value>
</property>
</configuration>

192.168.12.31 is enkx3db01-ibvip on Exadata. The example in your example seems missing (. Even I add it and it also did not work.
I also tried the following entry and it also did not work
(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=SDP)(HOST=192.168.12.31)(PORT=1522)))(CONNECT_DATA=(SERVICE_NAME=dbm)))

Would you please post an example of your connection file for OCI Direct Path. Thank you very much for your help.

Weidong

Posted by guest on August 06, 2013 at 02:54 AM PDT #

Hi Weidong,

Two connections are used when using SDP. One is the standard one we use for TCP access, the other is SDP which is isolated for the actual loading.

The standard one is a jdbc type connection setting, but it is a form that works for both JDBC and OCI. OLH code if it is using OCI Direct Load parses the jdbc form to extract the credentials to build a connection string that OCI uses.

My example had a typo and left out the ADDRESS and PROTOCOL verbiage.

The difference I see between what you are doing and what we are doing is the port (i.e. 1523), which is different from the TCP port (which for us is 1522).

Here is a cut and paste example of what works for us:

<property>
<name>oracle.hadoop.loader.connection.oci_url</name>
<value>(DESCRIPTION=(ADDRESS=(PROTOCOL=SDP)(HOST=192.168.40.200)(PORT=1523))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=dbm)))</value>

<description>
The database connection string used by the OCI Direct Path Output Format.
This property allows the OCI client to connect to the database using
different connection parameters than the JDBC connection url. If this value
is not defined, then the oracle.hadoop.loader.connection.url value is used.

The following example specifies the Socket Direct Protocol (SDP)
for OCI connections

(DESCRIPTION=(ADDRESS_LIST=
(ADDRESS=(PROTOCOL=SDP)(HOST=myhost)(PORT=1521)))
(CONNECT_DATA=(SERVICE_NAME=my_db_service_name)))

This connection string does not require the prefix "jdbc:oracle:thin:@".
All characters up to and including the first '@' character are removed.
</description>
</property>

Posted by guest on August 06, 2013 at 09:05 AM PDT #

Thanks a lot for the response. It clear a lot of my questions. So the only thing left on my side is the SDP connection issue using url_oci. I checked our db side and it seems ok and use the port 1522.

[enkx3db01:oracle:dbm1] /home/oracle
> srvctl config listener -l listener_ib
Name: LISTENER_IB
Network: 2, Owner: oracle
Home: <CRS home>
End points: TCP:1522/SDP:1522

[enkx3db01:oracle:dbm1] /home/oracle
> srvctl status listener
Listener LISTENER is enabled
Listener LISTENER is running on node(s): enkx3db02,enkx3db01
Listener LISTENER_IB is enabled
Listener LISTENER_IB is running on node(s): enkx3db01

[enkx3db01:oracle:dbm1] /home/oracle
> srvctl config network -k 2
Network exists: 2/192.168.12.0/255.255.255.0/bondib0, type static

It could be configuration issues on BDA side as I did fix one configuration issue on BDA to make OLH working. I will post the result if I can find anything.

Thank you very much for the great help and post.

Weidong

Posted by Weidong Zhou on August 06, 2013 at 09:35 AM PDT #

Hi Bob, I'm from the post at "August 05, 2013 at 11:10 AM"

The OLH job does not actually report the job as a failure, though it does say that a map task failed.

The error in my mapper log seems to be here, but I am not sure what this means:
MapAttempt TASK_TYPE="MAP" TASKID="task_201307291839_0011_m_000000" TASK_ATTEMPT_ID="attempt_201307291839_0011_m_000000_0" START_TIME="1375305443979" TRACKER_NAME="tracker_rk-bst79:localhost\.localdomain/127\.0\.0\.1:12581" HTTP_PORT="50060" .
MapAttempt TASK_TYPE="MAP" TASKID="task_201307291839_0011_m_000000" TASK_ATTEMPT_ID="attempt_201307291839_0011_m_000000_0" TASK_STATUS="FAILED" FINISH_TIME="1375305449961" HOSTNAME="rk-bst79" ERROR="java\.lang\.ClassCastException: org\.apache\.hadoop\.io\.Text cannot be cast to org\.apache\.avro\.generic\.IndexedRecord
at oracle\.hadoop\.loader\.OraLoaderMapper\.run(OraLoaderMapper\.java:339)
at org\.apache\.hadoop\.mapred\.MapTask\.runNewMapper(MapTask\.java:764)
at org\.apache\.hadoop\.mapred\.MapTask\.run(MapTask\.java:370)
at org\.apache\.hadoop\.mapred\.Child$4\.run(Child\.java:255)
at java\.security\.AccessController\.doPrivileged(Native Method)
at javax\.security\.auth\.Subject\.doAs(Subject\.java:416)
at org\.apache\.hadoop\.security\.UserGroupInformation\.doAs(UserGroupInformation\.java:1149)
at org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:249)
" .

Thank you,
Myko

Posted by guest on August 09, 2013 at 12:46 PM PDT #

Hi Myko,

Feedback from OLH engineers on your problem.

##################################################################

First guess: no input format was specified, so TextInputFormat got used by default.

TextInputFormat outputs Text values (and LongWritable keys), while OLH expects IndexedRecord values (and takes anything for the key).

Hence the "ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.avro.generic.IndexedRecord" error message.

Posted by Bob on August 09, 2013 at 02:58 PM PDT #

Hi Bob,

Thanks for the writeup.
We are using Hadoop(CDH4.4) and Oracle and are trying to replace the existing ETL process with Oracle Hadoop Loader. I was able to get Oracle Hadoop Loader to work using JDBCOutputFormat, but I am not able to load the same records using OCIOutputFormat.
- Our table is partitioned using Range Partition (PARTITION BY RANGE ("DAY") INTERVAL (NUMTODSINTERVAL(1,'DAY')) )
- We have created our own CustomInputFormat that reads the sequenceFile and converts it into Avro Record.
- Only thing that I changed between the successful import using JDBCOutputFormat and unsuccessful attempt with OCIOutputFormat - was to change the 'mapreduce.outputformat.class' property to 'oracle.hadoop.loader.lib.output.OCIOutputFormat'.

There is no error while executing the job but it seems all the 5 records are rejected during import. Here is the error that I see in <outputdir>/_olh/oraloader-00000-oci.log

Insertion Errors
================

Partition: 745
Failure with error code 14401 at row 0.
Failure with error code 14401 at row 1.
Failure with error code 14401 at row 2.

Let me know if I am missing something and would really appreciate any pointers on how to debug this further.

Thanks,
Ankita

Posted by guest on February 03, 2014 at 05:35 PM PST #

Sorry about your problem. I am checking with our developers to see if I can come up with some answers.

Posted by guest on February 04, 2014 at 08:47 AM PST #

Hi Ankita,

The ORA-14401 is an error where the partition key is outside
the scope of the partitions defined.

Could you post the full DDL of the table you are trying to populate
and if possible post a few rows of data that is giving you the problem?

Thanks.

Posted by bob on February 04, 2014 at 09:33 AM PST #

Thanks Bob. Pls. find the table schema and sample rows below.

I looked at the error but was not sure how the partition key would be treated differently in JDBC vs OCI format.
We are storing dates in the hdfs file as long timestamp. Since Avro doesn't seem to have Date type, in the custom input format we are formatting the dates using SimpleDateFormat("yyyyMMdd") and timezone is UTC. This is the same format that I have specified in the conf file. This seems to work for JDBCOutputFormat.
Let me know if you need more details and any pointers on how to debug this on our side.

Thanks again for your help,
Ankita

---- TABLE SCHEMA -----
CREATE TABLE "MY_SCHEMA"."TARGET_TABLE"
( "DAY" DATE,
"COLUMN_1" DATE,
"COLUMN_2" DATE,
"COLUMN_3" VARCHAR2(1024 CHAR),
"COLUMN_4" VARCHAR2(1024 CHAR),
"COLUMN_5" VARCHAR2(1024 CHAR),
"COLUMN_6" VARCHAR2(1024 CHAR),
"COLUMN_7" NUMBER,
"COLUMN_8" NUMBER,
"COLUMN_9" NUMBER,
"COLUMN_10" NUMBER(19,0),
"COLUMN_11" TIMESTAMP (3) DEFAULT SYSTIMESTAMP
) PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255
STORAGE(
BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
TABLESPACE "DATA_TS"
PARTITION BY RANGE ("DAY") INTERVAL (NUMTODSINTERVAL(1,'DAY'))
(PARTITION "P0" VALUES LESS THAN (TO_DATE(' 2012-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')) SEGMENT CREATION DEFERRED
PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255
COMPRESS FOR OLTP LOGGING
STORAGE(
BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
TABLESPACE "DATA_TS" ) ;
---------------------------------

--There are BITMAP INDEX on almost all columns. Example
CREATE BITMAP INDEX "MY_SCHEMA"."TARGET_TABLE_COLUMN_3" ON "MY_SCHEMA"."TARGET_TABLE" ("COLUMN_3")
PCTFREE 10 INITRANS 2 MAXTRANS 255
STORAGE(
BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT) LOCAL
(PARTITION "P0"
PCTFREE 10 INITRANS 2 MAXTRANS 255 LOGGING
STORAGE(
BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
TABLESPACE "DATA_TS" ) ;
---------------------------------

-- 2 ROWS THAT WERE INSERTED THROUGH JDBCOutputFormat but failed with OCIOutputFormat
2014-01-13 00:00:00,2014-01-13 00:00:00,2014-01-01 00:00:00,"SomeString1","SomeString2","SomeString3","SomeString4",55.808,49.152,0.128,4,,
2014-01-13 00:00:00,2014-01-13 00:00:00,2014-01-01 00:00:00,"SomeString5","SomeString6","SomeString7","SomeString8",39.808,116.352,0.128,1,,

Posted by Ankita Bakshi on February 04, 2014 at 10:31 AM PST #

Thanks. We are taking a look at this to see if we can reproduce.

Posted by bob on February 04, 2014 at 12:42 PM PST #

Thanks Bob for the help. Here are some relevant debug level logs from the reduce task. Let me know if you need more details.

---------
2014-02-04 20:55:07,231 INFO oracle.hadoop.loader.lib.output.OCIOutputFormat: OCIOutputFormat() logon successful
2014-02-04 20:55:07,589 INFO oracle.hadoop.loader.lib.output.DBOutputFormat: conf prop: defaultExecuteBatch: 10000
2014-02-04 20:55:07,589 INFO oracle.hadoop.loader.lib.output.DBOutputFormat: conf prop: loadByPartition: true
2014-02-04 20:55:07,607 INFO oracle.hadoop.loader.lib.output.OCIOutputFormat: OCIDPRecordWriter: partFile: true intPartFlag: 1
2014-02-04 20:55:07,625 DEBUG oracle.hadoop.loader.OraLoaderReducer: beginning reduce() for key: 745
2014-02-04 20:55:07,626 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCISetLoadTable [owner="MY_SCHEMA"] [tableName="TARGET_TABLE"] [batchSize=10000]
2014-02-04 20:55:07,626 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCISetLoadColumn [colName="DAY"] [colType=12] [maxSize=7] [csid=0] [csform=0] [scale=-2147483648]
2014-02-04 20:55:07,626 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCISetLoadColumn [colName="COLUMN_1"] [colType=12] [maxSize=7] [csid=0] [csform=0] [scale=-2147483648]
2014-02-04 20:55:07,626 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCISetLoadColumn [colName="COLUMN_2"] [colType=12] [maxSize=7] [csid=0] [csform=0] [scale=-2147483648]
2014-02-04 20:55:07,626 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCISetLoadColumn [colName="COLUMN_3"] [colType=1] [maxSize=3072] [csid=871] [csform=1] [scale=0]
2014-02-04 20:55:07,637 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCISetLoadColumn [colName="COLUMN_4"] [colType=1] [maxSize=3072] [csid=871] [csform=1] [scale=0]
2014-02-04 20:55:07,649 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCISetLoadColumn [colName="COLUMN_5"] [colType=1] [maxSize=3072] [csid=871] [csform=1] [scale=0]
2014-02-04 20:55:07,660 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCISetLoadColumn [colName="COLUMN_6"] [colType=1] [maxSize=3072] [csid=871] [csform=1] [scale=0]
2014-02-04 20:55:07,671 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCISetLoadColumn [colName="COLUMN_7"] [colType=2] [maxSize=22] [csid=0] [csform=0] [scale=0]
2014-02-04 20:55:07,672 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCISetLoadColumn [colName="COLUMN_8"] [colType=2] [maxSize=22] [csid=0] [csform=0] [scale=0]
2014-02-04 20:55:07,672 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCISetLoadColumn [colName="COLUMN_9"] [colType=2] [maxSize=22] [csid=0] [csform=0] [scale=0]
2014-02-04 20:55:07,672 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCISetLoadColumn [colName="COLUMN_10"] [colType=2] [maxSize=22] [csid=0] [csform=0] [scale=0]
2014-02-04 20:55:07,672 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCISetLoadColumn [colName="COLUMN_11"] [colType=187] [maxSize=11] [csid=0] [csform=0] [scale=3]
2014-02-04 20:55:07,734 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: openBindContext: partName[TO_DATE('2014-01-13 00:00:00','SYYYY-MM-DD HH24:MI:SS')] fragnum[745] intPartFlag[1]
2014-02-04 20:55:07,734 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCIPrepareDirectPath [pname="TO_DATE('2014-01-13 00:00:00','SYYYY-MM-DD HH24:MI:SS')"] [intPartFlag=1] [parallel0] [initExtent=1048576] [nextExtent=1048576]
2014-02-04 20:55:07,763 DEBUG oracle.hadoop.loader.OraLoaderReducer: finished reduce() for key: 745
2014-02-04 20:55:07,763 DEBUG oracle.hadoop.loader.OraLoaderReducer: entering cleanup()
2014-02-04 20:55:08,072 DEBUG oracle.hadoop.loader.OraLoaderReducer: exiting cleanup()
2014-02-04 20:55:08,072 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCIPrepareColumnData [idx=0] [rowCnt=5]
2014-02-04 20:55:08,072 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCIPrepareColumnData [idx=1] [rowCnt=5]
2014-02-04 20:55:08,072 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCIPrepareColumnData [idx=2] [rowCnt=5]
2014-02-04 20:55:08,072 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCIPrepareColumnData [idx=3] [rowCnt=5]
2014-02-04 20:55:08,073 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCIPrepareColumnData [idx=4] [rowCnt=5]
2014-02-04 20:55:08,073 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCIPrepareColumnData [idx=5] [rowCnt=5]
2014-02-04 20:55:08,073 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCIPrepareColumnData [idx=6] [rowCnt=5]
2014-02-04 20:55:08,073 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCIPrepareColumnData [idx=7] [rowCnt=5]
2014-02-04 20:55:08,073 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCIPrepareColumnData [idx=8] [rowCnt=5]
2014-02-04 20:55:08,073 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCIPrepareColumnData [idx=9] [rowCnt=5]
2014-02-04 20:55:08,073 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCIPrepareColumnData [idx=10] [rowCnt=5]
2014-02-04 20:55:08,073 DEBUG oracle.hadoop.loader.lib.output.OCIOutputFormat: Call OCIPrepareColumnData [idx=11] [rowCnt=5]
2014-02-04 20:55:08,281 INFO org.apache.hadoop.mapred.Task: Task:attempt_201312131808_3852_r_000000_0 is done. And is in the process of commiting
2014-02-04 20:55:09,324 INFO org.apache.hadoop.mapred.Task: Task attempt_201312131808_3852_r_000000_0 is allowed to commit now
---------

Posted by Ankita Bakshi on February 04, 2014 at 01:11 PM PST #

One last input. Did you use the loader map? If so could you post it. Also post the OLH command you used.

Thanks.

Posted by guest on February 04, 2014 at 01:23 PM PST #

Thanks. I did not use loader map instead we are using targetTable property.

--- command used to run OLH
export OLH_HOME="/app/oraloader-2.3.1-h2"
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$OLH_HOME/jlib/*"
export JAVA_LIBRARY_PATH=$OLH_HOME/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${JAVA_LIBRARY_PATH}
echo "Starting oracle loader job..."

$HADOOP_HOME/bin/hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader -libjars=/app/ankita/inputFormat.jar -conf /app/ankita/OracleLoaderConf.xml
------
--- non-default properties (OracleLoaderConf.xml is a copy of oraloader-conf.xml (which is shipped with the package) with the following modified properties)
<property>
<name>oracle.hadoop.loader.loaderMap.targetTable</name>
<value>MY_SCHEMA.TARGET_TABLE</value>
</property>
<property>
<name>oracle.hadoop.loader.defaultDateFormat</name>
<value>yyyyMMdd</value>
</property>
<property>
<name>mapreduce.inputformat.class</name>
<value>com.company.etl.OracleLoaderInputFormat</value>
</property>
<property>
<name>mapreduce.outputformat.class</name>
<value>oracle.hadoop.loader.lib.output.OCIOutputFormat</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>ankita/ETL_Out/OracleLoader</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>cubes/20140113/part-00492.seq</value>
</property>
<property>
<name>oracle.hadoop.loader.connection.url</name>
<value>jdbc:oracle:thin:@//<CONNECTION_URL></value>
</property>
<property>
<name>oracle.hadoop.loader.connection.user</name>
<value>OWNER</value>
</property>
<property>
<name>oracle.hadoop.loader.connection.password</name>
<value>PASSWORD</value>
<description>Password for the connecting user.</description>
</property>
<property>
<name>oracle.hadoop.loader.connection.defaultExecuteBatch</name>
<value>10000</value>
</property>
<property>
<name>oracle.hadoop.loader.connection.sessionTimezone</name>
<value>UTC</value>
</property>
<property>
<name>oracle.hadoop.loader.sampler.numThreads</name>
<value>3</value>
</property>

Posted by Ankita Bakshi on February 04, 2014 at 02:02 PM PST #

Before you replied we used a loader map and did not reproduce your
error. The default date format you have above doesn't have hyphens as separators but your data did. You might want to check that. If you still have problems, please give us the version of OLH and of the Oracle database.

Thanks.

Posted by bob on February 04, 2014 at 02:45 PM PST #

Thanks Bob. We are storing dates in the hdfs file as timestamp. Since Avro doesn't seem to have Date type, in the custom input format, while creating Avro record, we were formatting the dates using SimpleDateFormat("yyyyMMdd") and timezone is UTC. This is the same format that I have specified in the conf file.
The data that I have posted earlier were the rows that finally got inserted into the table using JDBCOutputFormat. I was under the impression that somewhere in the OracleLoader it parses the date using the format specified in the conf file and converts it into oracle date format which is different.

-ROWS THAT WERE INSERTED THROUGH JDBCOutputFormat but failed with OCIOutputFormat
2014-01-13 00:00:00,2014-01-13 00:00:00,2014-01-01 00:00:00,"SomeString1","SomeString2","SomeString3","SomeString4",55.808,49.152,0.128,4,,

-This is the toString() for avro indexed record returns (Note: in this run I have changed the oracle.hadoop.loader.defaultDateFormat to yyyy-MM-dd HH:mm:ss to avoid confusion. )
{"DAY": "2014-01-13 00:00:00", "COLUMN-1": "2014-01-13 00:00:00", "COLUMN-2": "2014-01-01 00:00:00", "COLUMN-3": "SomeString1", "COLUMN-4": "SomeString2", "COLUMN-5": "SomeString3", "COLUMN-6": "SomeString4", "COLUMN-7": 55.808, "COLUMN-8": 49.152, "COLUMN-9": 0.128, "COLUMN-10": 4, "COLUMN-11": null}

It might be hard for me to post the raw data from hdfs files because they are in the form of key value pair and we transform it into fields in the CustomInputFormat. My OLH version is 2.3.1-h2 and Oracle is 11.2.0.3.0 - 64bit. I will try the load with loader Map.

Let me know if I am missing something. One more thing, I have set the oracle.hadoop.loader.logBadRecords property to true but still I am not able to find the corresponding logs. I am wondering if anything else needs to be turned on.

Thanks,
Ankita

Posted by Ankita Bakshi on February 04, 2014 at 05:27 PM PST #

Hi Ankita,

Developers are suggesting that you turn on debug output for the
OCI, using the following property.

-D log4j.logger.oracle.hadoop.loader.lib.output.OCIOutputFormat=DEBUG

and send us your logs. To be safe just tar up everything in your results
directory for the run.

Send it to robert<dot>hanckel<at>oracle<dot>com.

I'll check on the logBadRecords setting and see if it needs some associated properties.

Posted by bob on February 05, 2014 at 08:59 AM PST #
