How to Load Oracle Tables From Hadoop Tutorial (Part 3 - Direct Path)
By Bob Hanckel-Oracle on Jul 26, 2013
Oracle Loader for Hadoop: OCI Direct
the previous tutorial post we discussed the basic mechanics and structure of an
OLH job using JDBC. In this post we move
on to the more mainstream method used for OLH, specifically OCI Direct
Path. The focus here is on loading
Oracle tables with really big data, and we will discuss how to do this
efficiently, and provide some basic rules for optimizing load performance. We will discuss the mechanics of
submitting an OLH job, and then take a dive into why this OLH load method is
what you want to use for most situations.
The Structure of an OLH Command using OCI Direct Path
The structure of an OLH command using OCI Direct Path is very similar to the structure we described for submitting a JDBC load:
The new file we are using, “dlOutput.xml”, specifies the output format is OCI Direct Path (and not JDBC):
So switching from JDBC to OCI Direct Path is trivial. A little less trivial is why OCI Direct Path is preferred and what rules you should know to make this type of loading perform well and to maximize efficiency.
Rule 1: When using OCI Direct Path the target table must be partitioned.
This might sounds like a constraint, but practically speaking it isn’t.
Exploiting Oracle Table Partitioning
A full understanding of Oracle table partitioning goes beyond the scope of this tutorial, and you would be advised to read related documentation that gets into this subject in depth, but for the sake of readers who live mostly in the world of Hadoop and have a limited understanding of Oracle let’s briefly outline the basics of what Oracle table partitioning is and why it is essential to understand.
Rule 2: If you are loading really big data into an Oracle table, your Oracle table will want to be partitioned.
The reason is pretty simple. Table partitions are Oracle’s method of breaking up a table into workloads that can be optimized transparently by SQL. In the same way MapReduce jobs scale out by breaking up a workload into data blocks and scheduling tasks to work in parallel against data blocks, Oracle SQL does the same with partitions. This is not only true for querying but it is also true for doing big loads.
Let’s look at the “fivdti” table we have been using. A flat table would be declared like this:
A partitioned table declaration, using a hash partitioning scheme would look like this:
PARTITION BY HASH(i7)
With the simple addition of the partition clause at the bottom of the CREATE TABLE clause, you’ve empowered Oracle to exploit big optimizations for processing. The clause tells Oracle that the table should be divided into 10 partitions, and the partition for a row is determined by performing a hash operation on the value of the i7 column. If you were to compare load rates using OLH, SQL*Loader, or SQL for the flat table and the table that is partitioned, you would typically see a dramatic difference that favors partitioning. The same holds true for SQL. When querying partitioned tables, SQL can do all sorts of tricks under the covers to use parallel query technology that subdivides a job and maximizes parallel CPU and IO.
Oracle table partitioning comes in various flavors such as hash, list, and range. They also can be composites of the same. OLH supports all partition methods except reference partitioning and virtual column-based partitioning.
Advantages of OCI Direct Path
OCI Direct Path is a well-established method of loading data into Oracle using OCI (Oracle’s C based client interface) or SQL*Loader. It is a code path dedicated to bulk loading and its key advantage is that it bypasses Oracle SQL, which makes it very efficient.
Virtually all relational database systems including Oracle are built on two layers of software: one for managing data at the row level (i.e. SQL), and another for managing data at the block level (i.e. storage). Loading through SQL (i.e. Oracle’s front door) is expensive. It’s okay when one is inserting a singleton row or a small array of rows, but it uses a lot of code path before the rows are passed onto storage and are copied into data blocks that ultimately get written to disk.
OCI Direct Path load is a short cut with an API whose code path both on the client and in the Oracle database is streamlined for loading. It does the work of preparing rows for storage in data blocks using client resources. (For our case the client is OLH running in Hadoop.) It then sends blocks of rows to Oracle’s storage layer in a form close to what will be written to disk on a code path that minimizes contention: rows don’t need to pass through Oracle's buffer cache layer. It also maximizes parallelism for multi-block IO. OCI Direct Path can also take advantage of presorted data which helps if it needs to build indexes for a table.
Running an OLH Job With OCI Direct Path
This pretty much looks the same as running a job with JDBC, except that the reduce phase always executes (since the target table is partitioned) , and it is much faster. For both JDBC and OCI Direct Path the actual loading of the Oracle table occurs when the Reduce phase is 67% complete. For large loads approximating or exceeding a terabyte you will see a big difference in the time spent in this phase. OCI Direct Path is much faster than JDBC.
Are You Balanced?
Rule 3: After running an OLH load, check out the Oraloader report to see if it is balanced.
After the run of a successful OLH job, the output directory (specified by the “mapred.output.dir” property) generates an elegant report called “oraloader-report.txt” that details the work done in the reduce phase. It identifies reducer tasks that ran and associated statistics of their workload: bytes loaded, records loaded, and duration of the tasks (in seconds). If the load is not balanced, the values for bytes and duration will vary significantly between reduce tasks, and you will want to make adjustments.
Optimizing OLH and OCI Direct Path
Now we will discuss basic steps to optimize OLH using OCI Direct Path:
· Choosing a good number for Reducer Tasks
· Enabling the OLH Sampler
· Finding the sweet spot for Hadoop Map Reduce payloads
· If possible load using SDP transport protocol
Choosing a Number for Reducer Tasks
Rule 4: When using OCI Direct Path you want to choose the number of reducer tasks to be close to a multiple of the number of reducer slots allocated on your Hadoop cluster.
Reducer slots in Hadoop mean the number of processes that can run in a Hadoop cluster at once, performing the reduce phase for an OLH job. The Hadoop Map/Reduce Administration UI displays this as Reduce Task Capacity. Typically you choose some multiple of the number of reducer slots available. For example if the reduce task capacity in the Hadoop cluster is 50, then a mapred.reduce.tasks value of 50 or 100 should work well.
The purpose of this rule is to try to get reducers running and loading at the same time, and to make sure all available slots are being used. Not doing this can be costly. For example, suppose there are 50 reducer slots but you set the number of reducer tasks to 51. If reduce loads are balanced then the 50 reducer slots will start and finish at roughly the same time, but you will to wait for the singleton 51st task to run, which will double the time the reduce phase spends loading the data.
Rule 4 only works fully to your advantage when the data
sets are balanced (i.e. you are using the Sampler) and your OLH job is not
competing with other ongoing Map Reduce jobs that can steal reduce slots that
you were expecting to use. Note that Apache actually recommends a value close to a multiple of the number of reducer slots, for dealing with situations where reducers are not balanced.
This takes us to the next rule.
Rule 5: Always use the OLH Sampler.
The OLH Sampler
The OLH Sampler is an optional feature of OLH that does a great job of balancing the workloads of reducer tasks when partitions are not balanced. (Note that the Sampler works with all OLH load methods, not just OCI Direct Path). You can control the Sampler manually by setting the following property to “true” or “false” (for recent versions of OLH the Sampler is turned on by default):
For example, suppose I had a customer table which was partitioned using list partitioning representing the fifty states in the United States. Most likely the partition representing California will be much larger than the state of New Hampshire. Without enabling the OLH Sampler, a single reducer task has the burden of publishing a whole partition. This means that one reducer will have to publish California records while another will be tasked to publish the records from New Hampshire. This will cause skew, where some tasks have bigger workloads than others. The OLH Sampler addresses this pathology, and breaks up large partitions into smaller equal sized units that can be dispatched evenly across various reducer tasks.
The overhead of the OLH Sampler is very small for big data payloads. A Hadoop Map Reduce job typically takes minutes or hours, and the sampler overhead typically takes a few seconds. (OLH console output tells you at the outset if the Sampler is running and how much time it cost.) It runs at the beginning of the Map Reduce job and samples the dataset to determine differences between partition sizes, it then creates an partition strategy which balances the reduce load evenly.
Another pathology that the Samper addresses is when you have more available reducer slots than partitions in your table. For instance suppose your table has 10 partitions but your Hadoop cluster has 50 reducer slots free. You would want to set the number of reduce tasks to take advantage of all these reducer slots to speed up the load.
But without the Sampler enabled this tuning knob would not have the desired effect. When the Sampler is not enabled, partitions are restricted to a single reducer task, which means that only 10 reducers will do real work, and the other 40 reduce slots will have nothing to do.
Based on our experience the Sampler should be used virtually all the time. The only situation to be wary of is when the Hadoop input splits are clustered by the reduce key. (e.g. the input data living in HDFS files is sorted by the value of the partition column). Under these circumstances loads might still be unbalanced. The work-around for clustered data is to force the Sampler to spend more time looking at the distribution of data by looking at more splits. (By default it looks at at least five). This is done by using the following property and setting <N > to a higher number.
Again the higher number will impose more Sampler overhead
at the beginning of the job but this should be rewarded with more efficient use
of Hadoop resources
Finding the Sweet Spot for Hadoop Map Reduce Payloads
Rule 6: Experiment with different sized payloads.
Hadoop is great technology that does a good job of making sure that Map Reduce payloads scale. That being said, the resources of a Hadoop cluster are still finite, and there is a breaking point where load sizes are simply too big. Hadoop’s scaling typically breaks down in the reduce shuffle/sort stage where there is a tremendous amount of disk and network IO going on within a Hadoop cluster to move sorted data to designated systems where reducer tasks will do the actual loading. A telling sign is when you see your Hadoop job start to suffer from failed and restarted task attempts in the reduce phase. The other obvious sign is that when you double your payload, the time to process the load is greater than a factor of 2.
It’s a good idea to spend some time experimenting with different load sizes to see what your Hadoop configuration can handle. Obviously, if you break down a single big job into a series of smaller jobs, you will be paying a higher cost of overhead for starting up and tearing down multiple Map Reduce jobs. That being said, breaking down a 90 minute OLH payload into three smaller 30 minute payloads is a perfectly reasonable strategy, since the startup/teardown overhead for running each OLH job is still very small compared to the total time running.
Use the SDP Protocol on Exadata and the BDA
Rule 7: If you are using Oracle Exadata and Oracle BDA with Infiniband,
use SDP protocol.
SDP is an alternative networking protocol that uses RDMA technology which allows network interfaces to move data packets directly into RAM without involving CPU. In other words it doesn’t reproduce the network bottleneck that is seen when using TCP. In performance test runs we’ve found that using SDP improves the load stage of an OLH Direct Path by ten to twenty percent.
If you are running OLH Direct Path jobs using Infiniband, you will want to take advantage of SDP. The way this is done is to configure Exadata listeners with an SDP port, and assert an additional Oracle connection descriptor dedicated to SDP when running OLH.
This esoteric property isolates SDP usage only when OLH reduce tasks create connections to Oracle to execute OCI Direct Path loading. All other network activity uses standard TCP connections.