Big Data Cloud release 17.4.6 comes with pre-installed and configured high-speed connectors to load data into Oracle Database and to copy data from Oracle Database. These are the Oracle Loader for Hadoop and Copy to Hadoop products you might be familiar with from other deployments.
The UI for these connectors is OHSH (Oracle Shell for Hadoop Loaders). See here for a comprehensive set of blog posts on OHSH.
OHSH (Oracle Shell for Hadoop Loaders), nicknamed “ohshell”, is a CLI for running commands that load Oracle Database tables with data in Hadoop and that copy Oracle Database tables to Hive. OHSH layers on the underlying connector products, Oracle Loader for Hadoop and Copy to Hadoop.
OHSH integrates the various tools needed to move and load data between Hadoop and Oracle Database: Beeline/Hive, Hadoop, bash, and SQL*Plus. A delegation operator “%” combined with a resource name identifies the resource that should execute a command. For example,
ohsh> %hive0 show tables;
sends “show tables” to the Beeline/Hive CLI.
hive0, hadoop0 and bash0 are resources available when you start OHSH. A user can create additional resources, such as resources to send a query to the database, as we will see in this blog post.
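The other default resources are used the same way. The following lines are just an illustration of delegation to bash0 and hadoop0; the HDFS path is a hypothetical example, not one from this post:
ohsh> %bash0 date
ohsh> %hadoop0 fs -ls /user/oracle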
In 17.4.6, you can use OHSH by making an SSH connection to a Big Data Cloud node and starting up OHSH.
$ sudo su oracle
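Then start the shell; OHSH presents the ohsh> prompt used in the examples below:
$ ohsh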
5. Create a Hive table over this data for use in the examples.
ohsh> %hive0 >>
drop table fivdti;
create table fivdti
(f1 decimal(38,18),
i2 bigint,
v3 string,
d4 date,
t5 timestamp,
v6 string,
i7 bigint)
row format delimited fields terminated by ','
stored as textfile;
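To confirm the table definition, you can describe it through the same hive0 resource:
ohsh> %hive0 describe fivdti;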
Next, create resources to connect to the database. These commands can be placed in a script (setresources_blog.ohsh, run below) after editing them with your connect information.
# Create a SQL*Plus resource that lets OHSH send SQL statements and queries
# to the database.
create sqlplus resource sql0 connectid="bdctest.compute-uspm032.oraclecloud.internal:1521/pdb1.uspm032.oraclecloud.internal"
# Create a JDBC resource that enables jobs submitted via OHSH to make JDBC
# connections to the database. You will be prompted for the database
# username/password you created in the ‘Getting Started’ section.
create oracle jdbc resource jdbc0 connectid="bdctest.compute-uspm032.oraclecloud.internal:1521/pdb1.uspm032.oraclecloud.internal"
Now you can execute the script in an OHSH session.
ohsh> @setresources_blog.ohsh
Resources are valid during the entire OHSH session. If you start a new OHSH session, create the resources again.
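A quick way to check that the new resources work is to send a trivial query through sql0 (the query itself is just an illustration):
ohsh> %sql0 select sysdate from dual;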
Create the target tables in the Oracle Database for the examples that follow, for example from SQL*Plus or through the sql0 resource.
CREATE TABLE ohsh_fivdti
(f1 NUMBER
,i2 INT
,v3 VARCHAR2(50)
,d4 DATE
,t5 TIMESTAMP
,v6 VARCHAR2(200)
,i7 INT
)
PARTITION BY HASH(i7)
PARTITIONS 4
NOLOGGING PARALLEL;
CREATE TABLE ohsh_fivdti_reverse
(
i7 NUMBER(38,0)
,v6 VARCHAR2(200)
,t5 TIMESTAMP
,d4 DATE
,v3 VARCHAR2(50)
,i2 NUMBER(38,0)
,f1 NUMBER
)
PARTITION BY HASH(i7)
PARTITIONS 4
NOLOGGING PARALLEL;
CREATE TABLE ohsh_fivdti_part
(
state VARCHAR2(2)
,f1 NUMBER
,i2 INT
,v3 VARCHAR2(50)
,d4 DATE
,t5 TIMESTAMP
,v6 VARCHAR2(200)
,i7 INT
)
PARTITION BY HASH(state)
PARTITIONS 6
NOLOGGING PARALLEL;
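After running the DDL, you can confirm the table definitions from within OHSH using the sql0 resource, for example:
ohsh> %sql0 describe ohsh_fivdti;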
Now you are ready to start using OHSH and the connectors.
The following OHSH commands are similar to the commands in the example script load_directpath.ohsh:
$ ohsh
# Truncate the destination table before the new load if you need to
ohsh> %sql0 truncate table ohsh_fivdti;
# Set the number of threads that execute the load on the Hadoop side. The higher this number,
# the faster the load, as long as the cluster has the compute resources to support it.
ohsh> set reducetasks 4;
# Load data with the directpath load option
ohsh> load oracle table jdbc0:OHSH_FIVDTI from \
path hadoop0:/user/${HADOOP_USER}/ohshdata/fivdti using directpath;
OHSH resolves the environment variable HADOOP_USER to the OS user, or to the user that holds the Kerberos ticket if the cluster is a secure cluster.
# Load data with the directpath load option, from a Hive table
ohsh> load oracle table jdbc0:OHSH_FIVDTI from \
hive table hive0:fivdti using directpath;
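After either load, you can compare row counts on the database and Hadoop sides through the sql0 and hive0 resources:
ohsh> %sql0 select count(*) from ohsh_fivdti;
ohsh> %hive0 select count(*) from fivdti;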
The following OHSH commands are similar to the commands in the example script createreplace_directcopy.ohsh:
$ ohsh
# Count the number of rows in the database table
ohsh> %sql0 select count(*) from ohsh_fivdti;
# Set the number of files that the database data is copied to. The higher this number, the greater
# the parallelism, as long as the database and the cluster have the compute resources to support it.
ohsh> set dpfilecount 2;
# Copy data with the directcopy option and create a Hive table on these files
ohsh> create or replace hive table hive0:cp2hadoop_fivdti from \
oracle table jdbc0:ohsh_fivdti using directcopy;
# Count the number of rows in the Hive table
ohsh> %hive0 select count(*) from cp2hadoop_fivdti;
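To append additional rows to the existing Hive table later, OHSH also provides a load hive table form of the command. The sketch below assumes the same tables and resources; check the OHSH examples (for instance a load_directcopy.ohsh script, if present in your examples directory) for the exact usage:
ohsh> load hive table hive0:cp2hadoop_fivdti from \
oracle table jdbc0:ohsh_fivdti using directcopy;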