Move data between Oracle Database and Apache Hadoop using high speed connectors.

  • March 15, 2017

Using Copy to Hadoop to copy Oracle Database data into Parquet format in HDFS

It is very easy to copy Oracle Database tables to Parquet format in HDFS.

When Using Copy to Hadoop with OHSH 

If you use Copy to Hadoop with OHSH, one additional step converts the copied Oracle Data Pump files into Parquet.

ohsh> %hive_moviedemo create movie_sessions_tab_parquet stored as parquet as select * from movie_sessions_tab;

hive_moviedemo is a Hive resource (created in the earlier blog post on using Copy to Hadoop with OHSH). Within OHSH you are using Hive to convert the Data Pump files to Parquet.
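To confirm the conversion from within OHSH, you can run a query through the same Hive resource. This is an illustrative sketch (movie_sessions_tab_parquet is the table created in the example above); the count should match the row count of the source table in Oracle Database:

ohsh> %hive_moviedemo select count(*) from movie_sessions_tab_parquet;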


When Using Copy to Hadoop with SQL Developer

You can select Parquet as the destination format when using SQL Developer.

When Running Copy to Hadoop as a Hadoop job (for power users)

The following is the syntax for running the Hadoop job with the directcopy option. Refer to Appendix B in the Oracle Big Data SQL User's Guide for more details.

# hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver \
-D oracle.hadoop.ctoh.connection.tnsEntry=<my-oracle-tns-entry> \
-D oracle.hadoop.ctoh.connection.walletLoc=<local-oracle-wallet-dir> \
-D oracle.hadoop.ctoh.connection.tnsAdmin=<local-oracle-wallet-dir> \
-D oracle.hadoop.ctoh.connection.clusterWalletLoc=<oracle-wallet-dir-on-hadoop-cluster> \
-D oracle.hadoop.ctoh.connection.clusterTnsAdmin=<oracle-wallet-dir-on-hadoop-cluster> \
-D mapreduce.output.fileoutputformat.outputdir=<mytab-hdfs-output-dir> \
-D oracle.hadoop.ctoh.splitterType="BLOCK_SPLITTER" \
-D oracle.hadoop.ctoh.table=<dbSchema.dbTable> \
-D oracle.hadoop.ctoh.maxSplits=10


You would then create a Hive table over the Oracle Data Pump files in <mytab-hdfs-output-dir>. This makes the data accessible in Hive.

hive> create external table <mytab-hive>
row format
  serde 'oracle.hadoop.hive.datapump.DPSerDe'
stored as
  inputformat 'oracle.hadoop.hive.datapump.DPInputFormat'
  outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '<mytab-hdfs-output-dir>';

Now you can convert the data to Parquet in Hive.

hive> create table <mytab-parquet> stored as parquet as select * from <mytab-hive>;
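To sanity-check the result, you can verify the storage format and row count in Hive. These checks are illustrative, using the placeholder names from the steps above:

hive> describe formatted <mytab-parquet>;
hive> select count(*) from <mytab-parquet>;

The describe output should show Parquet input/output formats, and the count should match that of the Data Pump-backed table <mytab-hive>.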



