Move data between Oracle Database and Apache Hadoop using high speed connectors.

How to Load Oracle and Hive Tables using OHSH (Part 2 - OHSH Configuration and CLI Usage)

In this "eat your vegetables before having dessert" tutorial we will talk about how to configure OHSH.  We will also discuss groups of OHSH commands that you will need to know about before you start using the tool to perform loading tasks.

Configuration and SQL*Plus and Hadoop/Hive Dependencies

Configuration of OHSH will differ depending upon where you want to run it.  The short story is that OHSH expects to run in a Linux environment with Bash shell available, and have SQL*Plus, Hadoop and Hive client software installed.   The latter two need to be configured to access the Hadoop cluster of interest.

An Oracle database system serves as a natural home for this tool since loading is a DBA centric activity.  (It's where SQL*Plus is installed.  It's also where OSCH kit needs to be installed, if you are using Oracle Connectors for Hadoop).  If OSCH was previously installed, then the dependent Hadoop and Hive clients should already be on the system.  For production systems, it is most secure to run OHSH logged in as "oracle" or some other OS user on the Oracle system that can do DBA type activities.

Alternatively, your can run OHSH on a Hadoop cluster.  Oracle's BDA and most Hadoop clusters comes automatically installed and configured for Hadoop and Hive.  If SQL*Plus is not installed you should install Oracle Instant Client.  You then just need to twiddle with Oracle TNS names or connection URLs to ensure you can connect to the Oracle database with JDBC and SQL*Plus.

If you want to run it on an edge node you will need to install Hadoop and Hive clients and Oracle Instant client.  If your SQL*Plus client connects to your Oracle database and Hadoop and Hive can find their Hadoop cluster, you are good to go.

The OHSH kit has three configuration READMEs for Oracle, Hadoop cluster, and edge node environments, that walk through the configuration details step by step.

Once you have all these dependencies installed, check to see that Bash, Hadoop, Beeline or HiveCLI, and SQL*Plus are working and can access the appropriate resources (i.e. Hadoop and Hive command line work and can access the Hadoop cluster, and SQL*Plus can connect to your Oracle database system.)

Oracle Connectors and/or Copy To Hadoop Dependencies

You need to download and install the Oracle Connectors (OLH and OSCH) and/or the Copy To Hadoop (CP2HADOOP) feature from Big Data SQL.

NOTE: OHSH is tightly coupled to particular versions of dependencies so check the OHSH README to know the precise version of each product it wants to use.

These kits need to live on some directories on your system that you can point to.  OLH and CP2HADOOP are just a bunch of jars that OHSH needs to find. OSCH is a little more complicated because it has a tool component that OHSH calls to generate Oracle external tables to access Hadoop HDFS content, and it has run-time components that needs to live on the Oracle database system and serve as a gateway to HDFS content.

If your system is not the database system, the OSCH kit also needs to be copied and configured on the database system.  Configuration of OSCH on an Oracle system requires installing a Hadoop client that can access the Hadoop cluster, creating some Oracle directories, and setting some Oracle privileges.  OSCH's installation guide explains all of this.  Once a Hadoop client is installed on the Oracle system, configuring OSCH on the system takes a few minutes.

Configuring ohsh_config.sh

Once all the dependencies are in place, you need to edit ohsh_config.sh in the bin directory of OHSH.

1.) (REQUIRED) You need to define OLH_HOME, OSCH_HOME, and/or CP2HADOOP_HOME to the directories on your system where these kits were unzipped and installed.  To use Oracle Big Data Connectors, OLH_HOME and OSCH_HOME must be both set.  

2.) (REQUIRED) You need to define how to connect to HiveServer2

Specifically you need to define HS2_HOST_PORT=<host>:<port> where <host> is the host name on which HiveServer2 is running and <port> is the port number on which HiveServer2 is listening.

You also need to define HIVE_SESS_VAR_LIST that contains session variables required to connect to HiveServer2.

To connect to HiveServer2 with Kerberos authentication, "principal=<Server_Principal_of_HiveServer2>"

To connect to HiveServer2 running in HTTP mode, "transportMode=http;httpPath=<http_endpoint>"

3.) (OPTIONAL) If Hadoop and Beeline/Hive CLI do not default to the appropriate configuration to find the right Hadoop cluster, you need to define HADOOP_CONF_DIR and HIVE_CONF_DIR.

4.) (OPTIONAL) If you needed to install Oracle Instant Client you will need to define SQLCLIENT_LIB.

Post Configuration

4.) (OPTIONAL) If you want to use TNS aliases for connecting to Oracle you need to define TNS_ADMIN to point to the directory where "tnsnames.ora" lives.  Typically this lives under $ORACLE_HOME/network/admin, but you can create your own TNS file with your own aliases for connecting to particular Oracle databases.  They just need to be readable from the OS user running OHSH.

5.) (OPTIONAL) If you have set up wallet security for connecting to Oracle set the WALLET_LOCATION to the directory containing Oracle Wallet artifacts.  The wallet should live in a directory owned by the OS user running OHSH whose privileges are 700 (readable or writable only by your OS user name).

Please note that the Oracle Wallet is what you always should use when running OHSH in a production environment. 

If you are experimenting with the tool and kicking the tires in the OHSH examples, you can run without it.  When the Oracle Wallet is not configured OHSH will prompt you for Oracle user name and password when creating a JDBC or SQL*Plus resource.  However if you have a few minutes to configure the Oracle Wallet it will be well worth your while, since you won't be consistently prodded for user credentials every time you invoke OHSH.

OHSH Invocation

You should put $OHSH_HOME/bin on your PATH.

To launch OHSH interactively, type "ohsh".  The banner will be accompanied by the underling Oracle Connector and Copy To Hadoop technologies that have be installed and configured.

bash3.2> ohsh
Oracle Shell for Hadoop Loaders Release 1.2.0 - Production   (Build:20161119092118)
Copyright (c) 2016, 2017, Oracle and/or its affiliates. All rights reserved.
Oracle Loader for Hadoop (3.8.0), Oracle SQL Connector for HDFS (3.7.0),
The Copy to Hadoop feature of Oracle Big Data SQL (3.1.0) enabled.

OHSH can be invoked non-interactively with the -f switch.

bash3.2> ohsh -f mytableload.ohsh

It also supports an initialization script that executes before the interactive prompt appears.

bash3.2> ohsh -i initresources.ohsh
Oracle Shell for Hadoop Loaders Release 1.2.0 - Production   (Build:20161119092118)
Copyright (c) 2016, 2017, Oracle and/or its affiliates. All rights reserved.
Oracle Loader for Hadoop (3.8.0), Oracle SQL Connector for HDFS (3.7.0),
The Copy to Hadoop feature of Oracle Big Data SQL (3.1.0) enabled.

The -i and -f switches can be used together for non-interactive sessions.

bash3.2> ohsh -i initresources.ohsh -f mytableload.ohsh

This is useful for executing commands to initialize resources used for all loading activities (e.g. JDBC connection to a specific Oracle database), before executing a specific load operation.

OHSH runs as a Hadoop application.  It is launched as a jar by the "hadoop" command and inherits Hadoop's default configuration.  When OHSH performs load operations, it clones the configuration settings and adds new properties and/or alters default properties to serve a particular job.  

bash3.2> ohsh -c 

This dumps the properties OHSH inherits from hadoop and from settings in the "ohsh" launch script.  This is useful for inspecting the configuration that exists before OHSH modifies it before running a particular job.

OHSH help outlines the groups of commands that you will be working with.


Help <sub-commands>

help load     - load Oracle tables or load/create/replace Hive tables commands
help set      - set load defaults
help show     - show load defaults
help shell    - shell commands
help resource - create and drop resources

Resource Commands

Resources describe named entities that you will be interacting with to perform load operations.

When you launch OHSH you get predefined named resources for bash, hadoop, and beeline/hive.

ohsh>show resources
hadoop0 : Hadoop Command Line Resource
  HADOOP_CONF_DIR = /home/rhanckel/hadoop_scratch/conf
hive0 : Hive Command Line Resource
  HIVE_CONF_DIR = /ade/rhanckel/hive_scratch/conf
  Connect to jdbc:hive2:///default;
  hadoop resource id = hadoop0
bash0 : Bash Command Line Resource
  path = /usr/local/bin/bash

These are command line resources and are invoked using the "%" operator.

Check to see if these resources above (i.e. bash0, hadoop0, hive0) work. 

ohsh>%bash0 uname

ohsh>%hadoop0 fs -ls /user
Found 2 items
drwxrwxr-x   - rhanckel g578       4096 2016-06-21 15:43 /user/oracle
drwxrwxrwx   - root     root       4096 2016-10-30 18:24 /user/rhanckel

ohsh>%hive0 show tables;
|           tab_name            |
| movie_ratings                 |

Create SQL*Plus and Oracle JDBC Resources

Now lets define SQL*Plus and JDBC resources that are related to an Oracle database.  The Oracle user schema of interest is called MOVIEDEMO.  The connect id to the database is "movieserver:15210/movieservice".  (If you have TNS aliases enabled, your connect id can be an alias.)  If an Oracle Wallet has not been configured, these commands will prompt for an Oracle user name and password.

ohsh>create sqlplus resource sql connectid="movieserver:15210/movieservice"

ohsh>create oracle jdbc resource omovies connectid="movieserver:15210/movieservice"

Now we should be able to delegate commands to SQL*Plus connecting as "OLHP".

NOTE: Don't forget the semicolon at the end of your SQL*Plus input. Otherwise SQL*Plus will quietly treat the input as a NOOP.

ohsh>%sql select table_name from tabs;


1 rows selected.

User defined resources need to be declared with a resource name that doesn't clash with other currently defined resources. 

Additional Hive resources can be declared to map to other Hive databases managed by Hive (e.g. "hmovie" rather than "default").  Hive "databases" are roughly equivalent to Oracle schemas.  They define a namespace for user defined tables but live in the same Hadoop/Hive system.

ohsh>create hive resource hmovies connectionurl="jdbc:hive2:///moviedemo"

The "show resources" command should reflect these three additional resources.

ohsh>show resources
omovies : Oracle JDBC Resource
sql : Oracle SQL*Plus Command Line Resource
hadoop0 : Hadoop Command Line Resource
hmovies : Hive Command Line Resource
  HIVE_CONF_DIR = /user/rhanckel/hive_scratch/conf
  Connect to jdbc:hive2:///moviedemo;
  hadoop resource id = hadoop0
hive0 : Hive Command Line Resource
  HIVE_CONF_DIR = /user/rhanckel/hive_scratch/conf
  Connect to jdbc:hive2:///default;
  hadoop resource id = hadoop0
bash0 : Bash Command Line Resource
  path = /usr/local/bin/bash

"hive0" and any user created resources can be dropped.  

ohsh>drop resource hmovie;

Only "bash0" and "hadoop0" resources are not drop-able.

Command Line Resources and Multi-line Statements

In OHSH the delegation operator (i.e. "%") simply takes one opaque line of input and feeds it to bash, Beeline, Hadoop, or SQL*Plus for execution in a sub-process.  

SQL*Plus, Beeline, and bash support multiple lines of input.  This situation is indicated by using the ">>" operator with the delegation operator.  This allows one push multiple commands to a resource.  The
termination of the input is a line with a single ";" 

ohsh>%sql >>
select table_name from tabs where table_name like 'M%';
select directory_name from all_directories where directory_name like 'OHSH%';


1 rows selected.


ohsh>%bash0 >>
echo $HOME


Shell Commands

OHSH supports ease of use shell features:

  • OHSH inherits the environmental variables of the parent process
  • OHSH supports setenv and printenv
  • Environmental variables are expanded using formal bracketed ${<envvar_name>} syntax that appears in a command.  Note that ${USER} will be expanded. $USER will not be expanded.
  • Up arrow and down arrow causes command line recall
  • history recall is supported
  • the bang operator recalls interactive commands
  • @<filename> input redirection (with ".ohsh" as default file extension)
  • spooling operations (by default to ohshspool.txt in the local directory)

       e.g. ohsh>spool on

       Turns spooling on to ohshspool.txt

       e.g. ohsh>spool off

       Turns spooling off

       e.g. set spool "mytest.txt"

       Directs spooling to a user defined file.

History is only recorded for interactive sessions. It is persisted in a shadow directory under ${HOME}/.ohsh.

User input is tokenized by OHSH as STRING and DQSTRING.  A STRING is a token that is not double quoted and has the typical characters one expects to see in Oracle and Hive identifiers, or in normal looking file paths, or URLs.  The DQSTRING is a double quoted string and accepts virtually anything as input.  DQSTRING always preserves case.  When STRING is used as an Oracle identifier will be normalized to upper case.

Sometimes user input clashes with defined tokens.

For example:

line 1:1 mismatched input 'resources' expecting {STRING, DQSTRING}
Error: ohsh syntax error encountered
Use "help" to check correct syntax. 

The problem is that "resources" is a keyword in OHSH grammar, so  it is getting confused when it is being used as a filename.  (The full grammar specification is specified at the bottom of this chapter.)

You can always work around the problem by double quoting the input.


For this case adding the file extension also works since the string is no longer a keyword.


Show and Set Defaults

OHSH relies heavily on default settings that augment user input when executing log commands.  "show defaults" lists all defaults and their settings.

ohsh>show defaults
createlogfiles = false
dateformat = "yyyy-MM-dd HH:mm:ss"
datemask = "YYYY-MM-DD HH24:MI:SS"
defaultdirectory = "OLHP_DEFAULT_DIR"
dop = 4
dpfilecount = 4
fieldterminator = "\u002c"
fusepath = ""
hadooptnsadmin = ""
hadoopwalletlocation = ""
hiverootdir = "/user/rhanckel/oracle_warehouse"
locationdirectory = "OLHP_STAGE_DIR"
logbadrecords = true
logdirectory = ""
multibyte = false
outputlevel = verbose
reducetasks = 5
rejectlimit = unlimited
rowdelimiter = "\u000a"
sessionconf = ""
skipcolprefix = "OSL_SKIPCOL_"
timestampmask = "YYYY-MM-DD HH24:MI:SS.FF"

Many of these defaults are specific to one or more load methods.

The semantics of these defaults are explained by:

ohsh>help set

This elaborates all the defaults and explains what load methods they impact.

For example, the default "dateformat" has this explanation.

'set' 'dateformat' <DQSTRING>

 Sets the default value of a Java dateformat.

 Load methods: directpath, etl (hadoop-phase), jdbc

 Sets oracle.hadoop.loader.defaultDateFormat.

Practically speaking only a few of these properties might need to be set on a per load basis.  Usually these are the data and timestamp default formats, assuming the formats differ from load to load.

Load Statements

Load statements will be the subject of the Parts 3, 4, 5, and 6 of the tutorial.  In Part 1 we walked through some simple cases.  The formal load syntax looks like this:

<LoadOracleTable> : 
('explain')? 'load' 'oracle' 'table' <ResourceName>':'<TableName> 'from' <LoadSource> <LoaderMap>? <Using>? <LoadConf>?

<LoadSource> : (<HdfsSource> | <HiveSource>)

<HdfsSource> : 'path' <ResourceName>':'<HdfsDirectoryPath> (<FileFormat>)?

<HiveSource> : 'hive' 'table' <ResourceName>':'<TableName> 
               ('partition' ('filter')? ('=')? <DQSTRING>)?

<FileFormat> : 'fileformat' ('datapump' | 'text' <TextDelimiter>)
<LoaderMap> : 'loadermap' <ColumnMapping> (',' <ColumnMapping>)*
<ColumnMapping> : ('tcolname' ('=')?)? <STRORDQSTR> 
			(('field' ('=')? )? <STRING>)? 
			(('format' ('=')?)? <DQSTRING>)?
<TextDelimiter> : ('fieldterminator' ('=')? <DQSTRING>)?
        		('rowdelimiter' ('=')? <DQSTRING>)? 
        		('initialencloser' ('=')? <DQSTRING> 
			('trailingencloser' ('=')? <DQSTRING>)?)?

<Using> :  'using' (
                   'jdbc' |
                   'directpath' |
                   'exttab' <OracleWhereClause>? |
                   'etl' <OracleWhereClause>? |
                   'etl' 'deferred' <OracleWhereClause>? 
			     ('scriptdir' <STRORDQSTR>)?

<LoadHiveTable> :  
    ('explain')? ('create' | 'create' 'or' 'replace' | 'load' | 'replace' )
    'hive' 'table' <ResourceName>':'<TableName> 
    'from' 'oracle' 'table'  <ResourceName>':'<TableName> <HiveStorage>?

<HiveUsing>? <OracleWhereClause>?
<HiveStorage> : 'hiverootdir' <HdfsDirectoryPath>  
<HiveUsing> :  'using' ('directcopy' | 'fuse' | 'stage') 

--------------- Common SubClauses --------------

<HdfsDirectoryPath> : <STRORDQSTR>
<OracleWhereClause> : <DQSTRING>
<ResourceName>      : <STRING>
<TableName>         : <STRORDQSTR>

--------------- Lexical Primitives ---------------

<BOOLEAN> : 'true' | 'false'
<DQSTRING> : "[.]*"
  e.g. "anything in double quotes!"
<STRING> : ('a'..'z'|'A'..'Z'|'0'..'9'|'.'|'/'|'\'|'_'|'+'|'-'|'~')+
  e.g. Scott, 123A, /tmp/foo.txt
<INTEGER> : ['0'-'9']+
  e.g. 1245

Next Steps

The next tutorial will focus on using OHSH to load Oracle tables from content living in HDFS files and Hive tables.


Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.