After following the tips below your Oracle Loader for Hadoop jobs should fly!
The first thing to check is the network bandwidth you have
between your Hadoop cluster and Oracle Database. You are more likely to be bound by network
bandwidth than by CPU. So if your
connection is 1 GigE see whether you can make it 10 GigE or more. InfiniBand would be great, if possible.
OutputFormat determines how the data is written to the database. You have the choice of using the JDBCOutputFormat or OCIOutputFormat for your job.
JDBCOutputFormat uses INSERT statements to write to the database. It can work with partitioned and non-partitioned target tables. Inserted rows are committed in batches (see section 'Load Batch Size' below for more details). In the event of failure in the middle of a load job, only that batch will be rolled back. When a job fails identifying the rows that did not load and re-starting the load has to be done manually.
OCIOutputFormat uses direct path load to write to the database. It requires the target table to be partitioned. Inserted rows are committed after the entire job completes. So in the event of failure the entire job is rolled back.
OCIOutputFormat is faster than JDBCOutputFormat. If the target table is partitioned, we always recommend OCIOutputFormat. If the target table is not partitioned, try to partition the table. Use JDBCOutputFormat only when you cannot use OCIOutputFormat.
The most important feature in
performance tuning of Oracle Loader for Hadoop is parallelism. The most significant performance boost comes when the load is parallel.
Below are some tips to ensure and verify your Oracle Loader for Hadoop job is running in parallel.
of parallelism is the same as the number of map tasks (when the target table is not partitioned) and the same as the number of reduce tasks (when the target table is partitioned).
When the target table is not partitioned:
Determining the number of map tasks happens automatically as part of the job. No user action is needed. The maximum number of map tasks is the minimum of (number
of splits, maximum capacity of the cluster).
appears that there is no parallelism, it might because only one
map task is created. One possible reason for that is the data is non-splittable. For
example, gzip compressed data is non-splittable. Then the entire file will be loaded by one
map task. Try to split the data into
multiple files that are individually gzip compressed to increase parallelism.
Manager (if using CDH) or Ambari (if using HDP) to check the number of map
tasks. The number of map tasks will
match the number of database connections. This is the degree of parallelism during load.
When the target table is partitioned:
Refer to the section below When using OCIOutputFormat.
of parallelism is the same as the number of reduce tasks. Use
the property mapreduce.job.reduces to select the number of reduce tasks. Again, use Cloudera Manager or Ambari to
confirm that the number of reduce tasks of the job. The number of reduce tasks will match the number
of database connections.
target database is Exadata, you can start with the number of reduce tasks at
64. Then increase the number of reduce
tasks and measure load performance to determine an optimal number of reduce
tasks to use.
requires the target table to be partitioned.
The number of reduce tasks can write to a single partition. Even if the table has only one partition, multiple reduce tasks will
parallelize the load.
When using JDBCOutputFormat, data is loaded in batches. The default size is 100. You can increase this value by setting the
property oracle.hadoop.loader.defaultExecuteBatch. However, we do not recommend increasing this value by
a lot, as it will use more memory during load and you might not see real load
speed gains. In practice the default
value of 100 works very well.
This property does not apply to OCIOutputFormat.
Using the SDP transport protocol can improve performance by
about 5-10%. In particular, if your
network connection between Hadoop and the database is InfiniBand you will want
to use SDP.
SDP is a network transport protocol that uses RDMA
technology that allows network interfaces to move data packets directly into
RAM without involving CPU.
Use of SDP protocol only applies to OCIOutputFormat.
If you are using InfiniBand, you are probably using Oracle
Big Data Appliance and Oracle Exadata. Configure the Exadata listener with an SDP port (refer to the Oracle Big Data Appliance Software User's Guide for details), and use an additional
(in addition to the standard database connection string) Oracle connection
descriptor for SDP:
This only applies to OCIOutputFormat.
Data is often skewed. For example, when loading into a partitioned target table, some
partitions might have more data than others. Without balancing, reduce tasks loading into the larger partitions will
do more work than others. Other reduce tasks will finish early.
Oracle Loader for Hadoop has a nice feature to balance the
load so that all reduce tasks share the work equally. This makes the load faster, as the load is
only as fast as the longest running reducer. The balancer feature is enabled by default, whenever the target table is
partitioned and when the load method is OCIOutputFormat. You can enable or disable balancer by using the property oracle.hadoop.loader.sampler.enableSampling.
You can verify that the data is being sampled and used for
balancing the load by checking console output when you run a load job. The first half page of output will contain
information on whether balancing is enabled, such as:
…… INFO balancer.Balancer: Creating balancer
…… INFO balancer.Balancer: Starting Balancer
----- a few lines later ----
…… INFO balancer.Balancer: Balancer completed
Another indication that balancer was enabled is the creation
of the _balancer directory.
$ hadoop fs -ls <value
of property mapreduce.output.fileoutputformat.outputdir>
Found 3 items
-rw-r--r-- 1 oracle
oracle 0 2017-02-17 16:55 <HDFS
drwxr-xr-x - oracle
oracle 0 2017-02-17 16:55
drwxr-xr-x - oracle
oracle 0 2017-02-17 16:55