By Mannamal-Oracle on Dec 04, 2015
Guest post by Javier De La Torre Medina and Iwona Rajca
In this post we describe how to setup and use Oracle SQL Connector for Hadoop (OSCH) for data query between Cloudera’s Distribution Including Apache Hadoop hosted on AWS and Oracle Exadata. OSCH leverages Oracle’s External tables technology and enables queries on Hadoop and Hive from the database.AWS is a good starting point to build your own test Hadoop environment.
Many enterprises who are experimenting with Hadoop in the cloud setup will likely have engineered systems like Exadata on premise – and look to combine the two to create a powerful blended architecture.
Part 1: Configure Exadata as a Hadoop client
The first thing to do is networking setup between Exadata and Cloudera. Make sure that Exadata can access Hadoop on AWS. There are a few things to be aware of: check the firewall setup and confirm that the ports for HiveMetastore, DataNodes, Zookeeper, and Namenode are open.
To set up the networking you might need to make changes on the AWS cluster. The rest of the configuration is done solely on the Exadata side.
Oracle provides comprehensive documentation on how to setup OSCH for a remote Hadoop cluster – check the latest Connnectors User’s Guide here. The main tasks you need to perform are:
- Download & install OSCH on Exadata.
- Download and install selected Cloudera packages (Hadoop and Hive) on Exadata.
- Configure Hadoop client on Exadata.
The documentation contains detailed steps.
Part 1: Setup OSCH for a remote Hadoop cluster
Download & install OSCH.
Download the OSCH connector form OTN and unpack it. The OSCH directory should be placed in a shared directory so that all nodes in Exadata can access it.Download and install selected Cloudera Hadoop client.
To be able to run queries from Exadata on AWS-hosted CDH, you need to install and configure Hadoop and Hive client packages on the client system. You can do an RPM install with Cloudera-provided packages. Alternatively you can download a Cloudera-provided tarball that matches the version of your CDH on AWS. For the tarball install unzip the following packages:
When these are installed set up your HADOOP_CLASSPATH environmental variable to point to:
- the path to the JAR files for OSCH (your OSCH installation folder), e.g. path/orahdfs-3.3.0/jlib/*
- Hive JAR files and conf directory (in a default RPM install these are: /usr/lib/hive/lib/*, /etc/hive/conf)
Configure Hadoop client.
After installing CDH packages, you must configure the Hadoop client for use with CDH on AWS. Cloudera Manager automatically generates the configuration files for the Hadoop client – they will include your cluster network configuration details. To get them, log in to Cloudera Manager as admin and right next to you cluster name click on a drop down and select “View Client Configuration URLs” option. From there download YARN and Hive packages. Once you get the zip contents unpacked, move them to configuration directories of Hadoop and Hive – in a default RPM install these are /etc/hadoop/conf and /etc/hive/conf respectively.
Test the connection by accessing HDFS files and running Hive queries.
Part 2: Configure Oracle SQL Connector for Hadoop
Once you can run your Hive jobs, configuring OSCH is a matter of minutes.
Start by creating a folder in a shared location that will host external table location files when an external table is created – e.g. path/ext_table_dir. You can use the ASM Cluster File System in Exadata to have a shared folder.
To access Hive, an Oracle user requires the basic create session, create table and create view privileges. For OSCH to make a Hive to Oracle mapping, a Sysdba should specify the following:
- create a database directory to point to the orahdfs-version/bin directory where hdfs_stream file resides. E.g.: create or replace directory osch_bin_path as 'path/orahdfs-3.1.0/bin'
- grant read, execute on directory osch_bin_path to <username>
- create a database directory where Hive files will be published to, e.g.: create or replace directory ext_table_dir as 'path/ext_table_dir'
- grant read, write on ext_table_dir to <username>
Part 3: Create the external table
Creating an External table is simple – as long as you specify the correct Hive table and the destination table, Oracle will use Hive’s metadata to describe the new table. To create a table, simply use the Linux CLI or put the configuration under a separate file and point to it. A sample CREATE TABLE statement:
$ hadoop jar OSCH_HOME/jlib/orahdfs.jar \
-D oracle.hadoop.exttab.tableName=<username>.oracle_table \ <- pick a name for the external table
-D oracle.hadoop.exttab.sourceType=hive \ <- specifies external table type
-D oracle.hadoop.exttab.hive.tableName=my_hive_table \ <- point to your Hive table
-D oracle.hadoop.exttab.hive.databaseName=default \ <- point to you Hive database
-D oracle.hadoop.exttab.defaultDirectory= ext_table_dir \ <- point to your external directory
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \ <- specify the connection to Oracle database
-D oracle.hadoop.connection.user=<username> \ <- specify the database user (unless using Oracle wallet)
This will create the table.
Worth noting - The Oracle external table is not a "live" Hive table. After changes are made to a Hive table, you must use the ExternalTable tool to drop the existing external table and create a new one.