So far we've discussed OLH and OSCH as alternative strategies for loading Oracle tables with data living in Hadoop. OLH uses the map-reduce framework to do the heavy lifting of converting text data into Oracle native data types and of splitting the load so the pieces map to Oracle table partitions; it then uses OCI Direct Path to push data from Hadoop into Oracle. OSCH uses Oracle parallel query processing to pull data from Hadoop into Oracle through Oracle external tables, so the native data type conversion and partitioning of data all happen on the Oracle side. The key difference between OLH and OSCH is efficiency versus speed. OLH is a map-reduce batch job that dramatically reduces the CPU resources Oracle needs to complete a load, but the load only starts at the end of the map-reduce job. OSCH starts loading immediately, but uses more Oracle resources.
In this lesson we will talk about using them together in a hybrid solution based on Data Pump files. Speaking for myself, if I were administering the loading of data from Hadoop clusters into Oracle databases on a regular basis, this is the solution I would go with.
About ten years ago Oracle revamped its model for importing and exporting data, taking advantage of parallel technology and calling the result Oracle Data Pump. It has since become a well-established utility for moving data from one Oracle database to another. It offers two standalone utilities, “expdp” and “impdp”, to export and import Oracle tables and other schema objects to and from files, and it is integrated into Oracle's external table infrastructure. The files it exports and imports are in “Data Pump” format, which means the data is already converted into native Oracle data types and the files carry embedded metadata describing the format of an Oracle table. Pulling data into Oracle via external tables reading Data Pump files is more efficient than reading text files because the type conversion of table columns is, by definition, already done.
The Data Pump model for loading data from Hadoop into Oracle is to use OLH to run a map-reduce job that produces Data Pump files in HDFS, and then use OSCH to access these files later. In other words, OLH stages data in HDFS for OSCH to pick up and load at a later time. In the first lesson of this tutorial I argued that staging data adds an additional write and read of content and is costly. That's true, and it would be crippling if you were to stage data on a standard file system, but when you stage it in HDFS (which is massively distributed and parallel) such costs are minimized.
There are several reasons why this is an attractive solution.
When OLH runs, whether using JDBC or OCI Direct Path, the load uses both the Oracle database and the Hadoop cluster at roughly the same time. In production environments these systems very often have their own schedules, and finding a time slot in which to dedicate both systems to loading can be difficult. With the Data Pump model, the OLH job can run whenever it suits the Hadoop administrators. The Oracle DBA simply needs to know where the Data Pump load files live in HDFS, and can load them at a later time that suits the database's own schedule.
When an external table references text files you need to concern yourself with converting the text into native Oracle types, either using access parameter verbiage in the external table definition, or by using CAST operators while loading the data into an Oracle table. It's not rocket science, but it does require some work to get right. Data Pump external tables bypass these issues, so the external table DDL is streamlined and simple. To create a Data Pump external table you just point OSCH's External Table tool at a set of Data Pump files generated by OLH and living in HDFS. The tool will extract table schema information from the files to create an external table. It will also sanity-check that all Data Pump files referenced have the same table schema.
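To make “streamlined” concrete, the DDL the tool generates looks something like the following sketch. The table name, column list, directory names, and location file names here are all hypothetical, and the exact access parameters vary by OSCH release, but note the absence of any field-conversion verbiage:

```sql
-- Sketch of tool-generated DDL for a Data Pump external table.
-- Names are hypothetical; the column list is derived by the tool
-- from the metadata embedded in the Data Pump files.
CREATE TABLE helloworld_exttab
(
  id    NUMBER,
  value VARCHAR2(100)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY osch_dir
  ACCESS PARAMETERS
  (
    EXTERNAL VARIABLE DATA
    PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
  )
  LOCATION ('osch-helloworld-1', 'osch-helloworld-2')
)
PARALLEL
REJECT LIMIT UNLIMITED;
```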
We haven't talked much about what happens when OLH or OSCH runs into badly formatted data that can't be converted properly to Oracle native data types. OSCH, when used against HDFS text files, relies on standard Oracle external table functionality to specify how the external table should react when it gets bad data. Options include simply skipping the bad row, recording the bad row and its associated data in a dedicated log, or aborting the load immediately or after some threshold of bad records is exceeded.
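For reference, those options map to the BADFILE/LOGFILE access parameters and the REJECT LIMIT clause of a text-based external table. A sketch, with hypothetical table, directory, and file names:

```sql
-- Hypothetical text-file external table showing bad-record handling.
CREATE TABLE helloworld_text_exttab
(
  id    NUMBER,
  value VARCHAR2(100)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY osch_dir
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    BADFILE log_dir:'helloworld.bad'   -- rejected rows land here
    LOGFILE log_dir:'helloworld.log'   -- reasons are recorded here
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('osch-helloworld-1')
)
REJECT LIMIT 100;  -- abort the load after 100 bad records
```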
The point is that with the Data Pump model you catch your bad-record errors when running OLH, before you load the database. Any Data Pump files that OLH produces will be bulletproof, so you can be confident that when you load them with OSCH, the load will not fail because of dirty data. You can think of OLH generating Data Pump files as the final transform of an ETL process.
When OSCH loads Data Pump files it uses substantially fewer database resources than when it loads HDFS text files, because the work and cost of converting text to native Oracle data types has been offloaded to OLH. Typically for big loads, an OSCH load of HDFS Data Pump files runs anywhere from ten to thirty percent faster than the same payload of HDFS text files. OSCH against text files is fast. OSCH against Data Pump files is faster and uses fewer database resources when you are dealing with partitioned tables.
Generating Data Pump files in HDFS is accomplished by running a vanilla OLH job.
The structure of an OLH command is very similar to the structure we described for OCI Direct Path. You specify the number of reduce tasks (assuming the table is partitioned), the input directory in HDFS, the job name, the HDFS output directory, and so on. The only difference from OCI Direct Path is the output format, buried in the dpOutput.xml file below:
$HADOOP_HOME/bin/hadoop jar $OLH_HOME/jlib/oraloader.jar
The key difference between this and the OLH Direct Path command discussed in Lesson 3 is the new file we are using, “dpOutput.xml”, which specifies that the output format is Data Pump.
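The interesting part of dpOutput.xml is a single property that swaps the output format. The property and class names below are taken from the OLH documentation, but verify them against the release you are running:

```xml
<configuration>
  <!-- Tell OLH to write Data Pump files into HDFS
       instead of pushing rows into Oracle directly. -->
  <property>
    <name>mapreduce.job.outputformat.class</name>
    <value>oracle.hadoop.loader.lib.output.DataPumpOutputFormat</value>
  </property>
</configuration>
```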
Rather than loading an Oracle database, the output will be stored as Data Pump files in the “mapred.output.dir” specified in the command (e.g. /user/oracle/olh_test/results/749).
They will have this type of pattern in the filename: “oraloader-00059-dp-85.dat”.
One obvious question: since this OLH command is not loading an Oracle table, why do you need the “oracle_connection.xml” file that allows OLH to connect to the database? The reason is that OLH still needs a connection to get metadata about the target table (e.g. "helloworld") that will ultimately be loaded via OSCH. This is not resource intensive: OLH simply connects to Oracle, gets the metadata and a quick cup of coffee, and then does the rest of the work off-line. While connectivity is not strictly required for this type of OLH job (there is a workaround where you specify the precise metadata information OLH needs), connecting is easier and more convenient.
Note: since this series of tutorials was started, the loader map mechanism has been deprecated (but still supported) in favor of using new OLH properties.
This is where the ease of use kicks in. You use OSCH's External Table tool to create the external table for you. Since metadata is embedded in the Data Pump files you want to access, you don't have to fiddle with table column definitions as you do when loading HDFS text files. The tool will look at your Data Pump files in HDFS, extract schema information, sanity-check that all the Data Pump files share the same single definition of the schema, and create an external table for you, mapping it to the Data Pump files.
"Metadata? We don't need no stinkin metadata!"
The structure of the command is similar to the publish command in OSCH. You provide the standard connection information, the name of the Oracle user, and the name of the external table. The key differences from publish are that you specify the source type as “datapump”, specify the Oracle directory name you created where the location files live, point to the same results directory that was specified in the OLH job you ran above, and ask the tool to create the external table.
/home/oracle/osch/hadoop-2.0.0/bin/hadoop jar /home/oracle/osch/orahdfs-2.3.0/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable
At this point, your Data Pump files are bound to your external table, which can be used for access or for loading the Oracle target table.
INSERT INTO helloworld SELECT * FROM helloworld_exttab;
When loading text files, we enumerated a lot of rules for figuring out the optimal number of location files and data files, driven by the DOP we expected to use when running OSCH. These rules don't apply when loading Data Pump files, simply because the number of Data Pump files created by OLH is not strictly controllable.
For this model there are three rules you want to follow:
Rule 1: For optimal performance use this model against partitioned tables.
The model will work for non-partitioned tables, but not as efficiently or as fast. In the non-partitioned case, the OLH job becomes a map-only job and creates a Data Pump file for every map task (i.e. input split). Since input splits are relatively small (e.g. 250MB), this tends to create a lot of Data Pump files, and correspondingly a lot of location files when you publish using OSCH. With non-partitioned tables you still get the ease-of-use and load-integrity benefits mentioned above, but not the load efficiency. In this case OSCH against vanilla text files will probably be a better choice.
Rule 2: If the table is partitioned, use the sampler when generating Data Pump files with the OLH job.
This will help balance out the size of the generated Data Pump files.
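The sampler is controlled by an OLH job property. A sketch of the configuration, assuming the property name used by recent OLH releases (check your version's documentation):

```xml
<!-- Enable OLH's sampler so reduce tasks (and hence the generated
     Data Pump files) come out roughly balanced in size. -->
<property>
  <name>oracle.hadoop.loader.sampler.enableSampling</name>
  <value>true</value>
</property>
```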
Rule 3: Adjust your session to the desired DOP (or use the “parallel” hint), and use the standard “append” and “pq_distribute” hints.
ALTER SESSION FORCE PARALLEL DML;
ALTER SESSION FORCE PARALLEL QUERY PARALLEL 8;
INSERT /*+ append pq_distribute(helloworld, none) */ INTO helloworld SELECT * FROM helloworld_exttab;