Guest post by Javier De La Torre Medina and Iwona Rajca
In this post we describe how to set up and use Oracle SQL
Connector for HDFS (OSCH) to query data in Cloudera’s Distribution
Including Apache Hadoop (CDH), hosted on AWS, from Oracle Exadata. OSCH leverages Oracle’s
external tables technology and enables queries on Hadoop and Hive from the database. AWS
is a good starting point for building your own test Hadoop environment.
Many enterprises experimenting with Hadoop in the
cloud will likely have engineered systems like Exadata on premises – and
will look to combine the two to create a powerful blended architecture.
Part 1: Configure Exadata as a Hadoop client
The first thing to do is to set up networking between Exadata
and Cloudera. Make sure that Exadata can access Hadoop on AWS. There are a few
things to be aware of: check the firewall setup and confirm that the ports for
the Hive Metastore, DataNodes, ZooKeeper, and the NameNode are open.
To set up the networking you might need to make changes on
the AWS cluster. The rest of the configuration is done solely on the Exadata side.
Oracle provides comprehensive documentation on how to set up OSCH
for a remote Hadoop cluster – check the latest Connectors User’s Guide here.
The main tasks you need to perform are:
- Download & install OSCH on Exadata.
- Download and install selected Cloudera packages
(Hadoop and Hive) on Exadata.
- Configure Hadoop client on Exadata.
The documentation contains detailed steps.
Download & install OSCH.
Download the OSCH connector from OTN
and unpack it. The OSCH directory should be placed in a shared directory so
that all nodes in Exadata can access it.
Download and install selected Cloudera packages.
To be able to run queries from Exadata on AWS-hosted CDH, you
need to install and configure the Hadoop and Hive client packages on the client
system. You can do an RPM install with Cloudera-provided packages. Alternatively, you can download a
tarball that matches the version of your CDH on AWS and unzip the packages it contains. Once the packages
are installed, set up your HADOOP_CLASSPATH environment variable to point to:
- the path to the JAR files for OSCH (your OSCH installation
folder), e.g. path/orahdfs-3.3.0/jlib/*
- Hive JAR files and conf directory (in a default
RPM install these are: /usr/lib/hive/lib/*, /etc/hive/conf)
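Putting the two entries together, the classpath setup can be sketched as follows; the OSCH_HOME path is an assumption for illustration – point it at your actual installation folder:

```shell
# Assumed locations: OSCH unpacked under /opt/oracle, default RPM install for Hive.
export OSCH_HOME=/opt/oracle/orahdfs-3.3.0
export HADOOP_CLASSPATH="$OSCH_HOME/jlib/*:/usr/lib/hive/lib/*:/etc/hive/conf:$HADOOP_CLASSPATH"
echo "$HADOOP_CLASSPATH"
```

Adding these lines to the shell profile of the user running OSCH keeps the classpath across sessions.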
Configure Hadoop client.
After installing the CDH packages, you must configure the Hadoop client for use with CDH
on AWS. Cloudera Manager automatically generates the configuration files for
the Hadoop client – they include your cluster network configuration
details. To get them, log in to Cloudera Manager as admin, click the drop-down
next to your cluster name, and select the “View Client Configuration URLs”
option. From there download the YARN and Hive packages. Once you have unpacked the zip
contents, move them to the configuration directories of Hadoop and Hive – in
a default RPM install these are /etc/hadoop/conf and /etc/hive/conf respectively.
Test the connection by accessing HDFS files and running Hive queries.
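A quick smoke test might look like the following (run on Exadata; requires a reachable cluster, so treat it as a sketch):

```shell
# List HDFS contents on the remote cluster - verifies the NameNode is reachable.
hdfs dfs -ls /
# Run a trivial Hive query - verifies the connection to the Hive metastore.
hive -e 'SHOW DATABASES;'
```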
Part 2: Configure Oracle SQL Connector for HDFS
Once you can run your Hive jobs, configuring OSCH is a
matter of minutes.
Start by creating a folder in a shared location that will
host external table location files when an external table is created – e.g. path/ext_table_dir. You can use the ASM
Cluster File System in Exadata to have a shared folder.
To access Hive, an Oracle user requires the basic CREATE
SESSION, CREATE TABLE, and CREATE VIEW privileges. For OSCH to make a Hive-to-Oracle
mapping, a SYSDBA should do the following:
- create a database directory that points to the
orahdfs-version/bin directory where the hdfs_stream file resides, e.g.: create or
replace directory osch_bin_path as 'path/orahdfs-3.1.0/bin'
- grant read, execute on directory osch_bin_path to <username>
- create a database directory where Hive files
will be published to, e.g.: create or replace directory ext_table_dir as 'path/ext_table_dir'
- grant read, write on directory ext_table_dir to <username>
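Collected together, the SYSDBA steps above can be sketched as the following SQL; the paths and the <username> grantee are placeholders to substitute:

```sql
-- Run as SYSDBA. Paths and <username> are placeholders.
CREATE OR REPLACE DIRECTORY osch_bin_path AS 'path/orahdfs-3.1.0/bin';
CREATE OR REPLACE DIRECTORY ext_table_dir AS 'path/ext_table_dir';
GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW TO <username>;
GRANT READ, EXECUTE ON DIRECTORY osch_bin_path TO <username>;
GRANT READ, WRITE ON DIRECTORY ext_table_dir TO <username>;
```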
Part 3: Create the external table
Creating an external table is simple – as long as you
specify the correct Hive table and the destination table name, Oracle will use
Hive’s metadata to describe the new table. To create a table, simply run the
ExternalTable tool from the Linux CLI, or put the configuration properties in a separate file and point to it. A sample CREATE command:
$ hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=<username>.oracle_table \ <- pick a name for the external table
-D oracle.hadoop.exttab.sourceType=hive \ <- specifies the external table type
-D oracle.hadoop.exttab.hive.tableName=my_hive_table \ <- point to your Hive table
-D oracle.hadoop.exttab.hive.databaseName=default \ <- point to your Hive database
-D oracle.hadoop.exttab.defaultDirectory=ext_table_dir \ <- point to your external directory
-D oracle.hadoop.connection.url=<jdbc_connection_string> \ <- specify the connection to the Oracle database
-D oracle.hadoop.connection.user=<username> \ <- specify the database user (unless using Oracle Wallet)
-createTable
This will create the table.
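Once created, the external table can be queried with plain SQL like any other table; the name below follows the example invocation:

```sql
-- oracle_table is the external table created above.
SELECT COUNT(*) FROM oracle_table;
SELECT * FROM oracle_table WHERE ROWNUM <= 10;
```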
Worth noting – the Oracle external table is not a
"live" Hive table. After changes are made to the Hive table, you must
use the ExternalTable tool to drop the existing external table and create a new
one.
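A refresh can be sketched as re-running the tool with the same properties; my_osch_conf.xml is a hypothetical file holding the -D settings shown earlier:

```shell
# Hypothetical my_osch_conf.xml carries the oracle.hadoop.* properties used at creation.
hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
  -conf my_osch_conf.xml -drop
hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
  -conf my_osch_conf.xml -createTable
```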