Move data between Oracle Database and Apache Hadoop using high-speed connectors.

  • December 23, 2016

Loading Parquet Files Using Oracle Loader for Hadoop

Oracle Loader for Hadoop (OLH) can load Parquet data. It does so by reading the Parquet data via a Hive table.

So if you have Parquet data that you want to load into Oracle Database using Oracle Loader for Hadoop, the first step is to create a Hive table over the Parquet data. For example:

create external table oracletest.my_hivetab_on_parquet(
    f1 decimal(38,18),
    i2 int,
    v3 varchar(50),
    d4 timestamp,
    t5 timestamp,
    v6 varchar(200),
    i7 int)
  stored as parquet
  location '/user/hive/warehouse/oracletest.db/parquet_data';

It would be nice if the syntax were simpler than listing out all the Hive column names and data types, wouldn't it? Unfortunately, there is no way around that for now.
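You can at least script the column list instead of typing it by hand. A minimal sketch, assuming the example table's names and types (the generator itself is just an illustration, not part of OLH or Hive):

```shell
#!/bin/bash
# Build the Hive DDL column list from parallel name/type arrays.
# Names and types mirror the example table; adjust for your schema.
names=(f1 i2 v3 d4 t5 v6 i7)
types=("decimal(38,18)" "int" "varchar(50)" "timestamp" "timestamp" "varchar(200)" "int")

cols=""
for i in "${!names[@]}"; do
  cols+="${names[$i]} ${types[$i]}, "
done
cols="${cols%, }"  # drop the trailing comma and space

echo "create external table oracletest.my_hivetab_on_parquet($cols)
  stored as parquet
  location '/user/hive/warehouse/oracletest.db/parquet_data';"
```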

Once you have the Parquet-backed Hive table, you can assemble the configuration properties just as you would when using Oracle Loader for Hadoop to load from any other Hive table. Note that the Hive table name in the configuration properties below (the value of the property oracle.hadoop.loader.input.hive.tableName) is the Hive table we just created, my_hivetab_on_parquet.
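In this example the Oracle column names are simply the uppercased Hive column names, so the oracle.hadoop.loader.input.fieldNames value can be derived mechanically. A small sketch, assuming that naming convention holds (it won't for every schema):

```shell
#!/bin/bash
# Uppercase the Hive column list to get the loader fieldNames value.
hive_cols="f1,i2,v3,d4,t5,v6,i7"
field_names=$(printf '%s' "$hive_cols" | tr '[:lower:]' '[:upper:]')
echo "$field_names"   # F1,I2,V3,D4,T5,V6,I7
```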

olh_parquet.xml:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>

  <!-- Input settings -->
  <property>
    <name>mapreduce.inputformat.class</name>
    <value>oracle.hadoop.loader.lib.input.HiveToAvroInputFormat</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.input.hive.databaseName</name>
    <value>oracletest</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.input.hive.tableName</name>
    <value>my_hivetab_on_parquet</value>
  </property>

  <!-- Output settings -->
  <property>
    <name>mapreduce.job.outputformat.class</name>
    <value>oracle.hadoop.loader.lib.output.OCIOutputFormat</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.outputdir</name>
    <value>temp_out_session_p</value>
  </property>

  <!-- Table information -->
  <property>
    <name>oracle.hadoop.loader.loaderMap.targetTable</name>
    <value>MY_ORA_TAB</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.input.fieldNames</name>
    <value>F1,I2,V3,D4,T5,V6,I7</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.defaultDateFormat</name>
    <value>yyyy-MM-dd HH:mm:ss</value>
  </property>

  <!-- Connection information -->
  <..... the usual connection information .....>

</configuration>
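The defaultDateFormat property above takes a Java SimpleDateFormat pattern, and timestamp strings in the input data must match it. A quick shell sketch of a value that conforms to yyyy-MM-dd HH:mm:ss (illustrative only; the loader itself parses with Java, not with date):

```shell
#!/bin/bash
# Produce a timestamp shaped like the yyyy-MM-dd HH:mm:ss pattern.
ts=$(date '+%Y-%m-%d %H:%M:%S')
echo "$ts"

# Check the shape with a regex (four-digit year, two digits elsewhere).
if [[ "$ts" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}\ [0-9]{2}:[0-9]{2}:[0-9]{2}$ ]]; then
  echo "matches pattern"
fi
```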

Now you can submit the Oracle Loader for Hadoop job to load the data in Parquet format:

#!/bin/bash

export OLH_HOME=/opt/oracle/olh
export HIVE_CONF_DIR=/etc/hive/conf
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$OLH_HOME/jlib/*:$HIVE_HOME/lib/*:$HIVE_CONF_DIR

hadoop jar ${OLH_HOME}/jlib/oraloader.jar \
    oracle.hadoop.loader.OraLoader \
    -conf /home/oracle/olh_parquet.xml \
    -libjars /opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar,/opt/cloudera/parcels/CDH/lib/hive/lib/hive-metastore.jar,/opt/cloudera/parcels/CDH/lib/hive/lib/libfb303-0.9.2.jar \
    -D mapred.reduce.tasks=4
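One thing to watch: -libjars takes a single comma-separated argument, so a stray space or line break inside the list makes hadoop treat the remainder as a separate argument. A tiny sketch of joining jar paths safely (the join_jars helper is hypothetical, not an OLH utility):

```shell
#!/bin/bash
# Join arguments with commas, producing a value with no embedded whitespace.
join_jars() {
  local IFS=','
  echo "$*"
}

libjars=$(join_jars \
  /opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar \
  /opt/cloudera/parcels/CDH/lib/hive/lib/hive-metastore.jar \
  /opt/cloudera/parcels/CDH/lib/hive/lib/libfb303-0.9.2.jar)
echo "$libjars"
```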
