
Move data between Oracle Database and Apache Hadoop using high-speed connectors.

  • January 23, 2018

Oracle Big Data Cloud and Oracle Database

Connecting to Oracle Database from Big Data Cloud

Big Data Cloud release 17.4.6 comes with pre-installed and configured high-speed connectors for loading data into Oracle Database and for copying data from Oracle Database. These are the Oracle Loader for Hadoop and Copy to Hadoop products you might be familiar with from other deployments.

The interface for these connectors is OHSH (Oracle Shell for Hadoop Loaders). See here for a comprehensive set of blog posts on OHSH.

A brief summary of OHSH

OHSH (Oracle Shell for Hadoop Loaders), nicknamed "ohshell", is a CLI for running commands that load Oracle Database tables with data in Hadoop and copy Oracle Database tables to Hive. OHSH layers on the underlying connector products, Oracle Loader for Hadoop and Copy to Hadoop.

OHSH integrates the various tools needed to move and load data between Hadoop and Oracle Database: Beeline/Hive, Hadoop, bash, and SQL*Plus. A delegation operator "%" combined with a resource name identifies the resource that should execute a command. For example,

ohsh> %hive0 show tables;

sends a “show tables” to Beeline/Hive CLI.

hive0, hadoop0 and bash0 are resources available when you start OHSH. A user can create additional resources, such as resources to send a query to the database, as we will see in this blog post.
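
The same delegation pattern works for the other default resources; the commands below are only illustrative (any Hadoop filesystem or bash command can follow the resource name):

ohsh> %hadoop0 fs -ls /user     # run a Hadoop filesystem command
ohsh> %bash0 date               # run a bash command on the local node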

In 17.4.6 you can use OHSH by making an ssh connection to a Big Data Cloud node and starting up OHSH.

Getting Started

On the Database

  1. First, create a database user and tablespace in your Oracle Database Cloud Service.
    $ sqlplus / as sysdba

    sql> alter session set container=<container name>;   -- example container name is PDB1

    sql> create tablespace tbs_bdc datafile 'tbs_bdc.dat' size 1M autoextend on next 100K segment space management auto;

    sql> create user bdcusr identified by <password of your choice> default tablespace tbs_bdc;

    sql> grant create session, alter session, create table, create view to bdcusr;
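
    Note: depending on how your database is configured, the new user may also need a quota on the tablespace before it can create segments there. This step is not in the original example; if table creation or loading later fails with a quota error, a grant along these lines (adjust to your environment) should resolve it:

    sql> alter user bdcusr quota unlimited on tbs_bdc;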

On Big Data Cloud

  1. ssh to a node in Big Data Cloud, log in, and switch to the oracle user:

$ sudo su oracle
 

  2. Add /opt/oracle/dbconnector/ohsh/bin to your PATH. The ohsh executable is at this location. For example, in a bash session:
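    $ export PATH=/opt/oracle/dbconnector/ohsh/bin:$PATH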
     
  3. There is a comprehensive set of example scripts in /opt/oracle/dbconnector/ohsh/examples (along with a README.txt) for using the connectors. In this blog post we select a few commands from these scripts.
     
  4. Load data from this directory into HDFS, for use in this blog post.
    $ cd /opt/oracle/dbconnector/ohsh/examples
    $ ohsh
    ohsh> %hadoop0 fs -rm -r ohshdata           # use the hadoop0 resource to delete existing directories
    ohsh> @create_hdfs_data_tables.ohsh         # example script that loads data into HDFS

  5. Create a Hive table over this data for use in the examples.

    ohsh> %hive0 >>
    drop table fivdti;
    create table fivdti
      (f1 decimal(38,18),
       i2 bigint,
       v3 string,
       d4 date,
       t5 timestamp,
       v6 string,
       i7 bigint)
    row format delimited fields terminated by ','
    stored as textfile;
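
    As a quick sanity check (not part of the original example scripts), you can query the new table through the same hive0 resource:

    ohsh> %hive0 select count(*) from fivdti;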

  6. Create a file setresources_blog.ohsh in your local directory. This is similar to /opt/oracle/dbconnector/ohsh/examples/setresources.ohsh, but it creates fewer resources: only the resources needed to connect to the database. You will need the database connection string. Add the following commands to setresources_blog.ohsh.

    # Create a sqlplus command-line resource that lets you run SQL queries in the database
    # from the OHSH prompt. You will be prompted for the database username/password you
    # created in the 'Getting Started' section.

    create sqlplus resource sql0 connectid="bdctest.compute-uspm032.oraclecloud.internal:1521/pdb1.uspm032.oraclecloud.internal"

    # Create a JDBC resource that enables jobs submitted via OHSH to make JDBC connections
    # to the database. Again, you will be prompted for the database username/password.

    create oracle jdbc resource jdbc0 connectid="bdctest.compute-uspm032.oraclecloud.internal:1521/pdb1.uspm032.oraclecloud.internal"

 

Now you can execute it in an OHSH session:

 

ohsh> @setresources_blog.ohsh

 

Resources are valid for the duration of an OHSH session. If you start a new OHSH session, create the resources again.
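
To list the resources currently defined in a session, OHSH provides a show command (check help in your OHSH version for the exact syntax); for example:

ohsh> show resources;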

 

  7. Create tables in the database.

    We create tables with a schema that matches the data we are going to load. You can either connect to the database and create the tables, or create them from ohsh as shown below. If using ohsh, note the use of the sql0 resource we created earlier.

    ohsh> %sql0 >>
    CREATE TABLE ohsh_fivdti
      (f1 NUMBER
      ,i2 INT
      ,v3 VARCHAR2(50)
      ,d4 DATE
      ,t5 TIMESTAMP
      ,v6 VARCHAR2(200)
      ,i7 INT
      )
    PARTITION BY HASH(i7)
      PARTITIONS 4
    NOLOGGING PARALLEL;

    CREATE TABLE ohsh_fivdti_reverse
      (i7 NUMBER(38,0)
      ,v6 VARCHAR2(200)
      ,t5 TIMESTAMP
      ,d4 DATE
      ,v3 VARCHAR2(50)
      ,i2 NUMBER(38,0)
      ,f1 NUMBER
      )
    PARTITION BY HASH(i7)
      PARTITIONS 4
    NOLOGGING PARALLEL;

    CREATE TABLE ohsh_fivdti_part
      (state VARCHAR2(2)
      ,f1 NUMBER
      ,i2 INT
      ,v3 VARCHAR2(50)
      ,d4 DATE
      ,t5 TIMESTAMP
      ,v6 VARCHAR2(200)
      ,i7 INT
      )
    PARTITION BY HASH(state)
      PARTITIONS 6
    NOLOGGING PARALLEL;
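
    To confirm the tables were created (a quick check, not part of the original example scripts), you can query the data dictionary through the same sql0 resource:

    ohsh> %sql0 select table_name from user_tables;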

 

Now you are ready to start using OHSH and the connectors.

Load data from Oracle Big Data Cloud to a Database Cloud Service

 

The following ohsh commands are similar to the commands in the example script load_directpath.ohsh.


$ ohsh

# Truncate the destination table before the new load if you need to

ohsh> %sql0 truncate table ohsh_fivdti;    

# Set the number of reduce tasks that execute the load on the Hadoop side. The higher this number,
# the faster the load, as long as the cluster has the compute resources to support it.

ohsh> set reducetasks 4;    

# Load data with the directpath load option

ohsh> load oracle table jdbc0:OHSH_FIVDTI from \
   path hadoop0:/user/${HADOOP_USER}/ohshdata/fivdti using directpath;

OHSH resolves the environment variable HADOOP_USER to the OS user, or to the user that holds the Kerberos ticket if the cluster is a secure cluster.
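
# Optionally confirm the source path before loading (an extra check, not in the original example script)

ohsh> %hadoop0 fs -ls ohshdata/fivdti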

# Load data with the directpath load option, from a Hive table

ohsh> load oracle table jdbc0:OHSH_FIVDTI from \
   hive table hive0:fivdti using directpath;

 

Copy Data from a Database Cloud Service to Big Data Cloud

The following ohsh commands are similar to the commands in the example script createreplace_directcopy.ohsh.

$ ohsh

# Count the number of rows in the database table

ohsh> %sql0 select count(*) from ohsh_fivdti;    

# Set the number of files that the database data is copied into. The higher this number, the greater
# the parallelism, as long as the database and the cluster have the compute resources to support it.

ohsh>  set dpfilecount 2;    

# Copy data with the directcopy option and create a Hive table on these files

ohsh> create or replace hive table hive0:cp2hadoop_fivdti from \
   oracle table jdbc0:ohsh_fivdti using directcopy;

# Count the number of rows in the hive table

ohsh> %hive0 select count(*) from cp2hadoop_fivdti;    

 

  
