
Move data between Oracle Database and Apache Hadoop using high speed connectors.

  • July 28, 2017

How to Load Oracle Tables From Hadoop Tutorial (Part 1 - Overview)

Introduction

This is the first of a series of blog posts that will discuss how to load data living in the Hadoop ecosystem into Oracle tables. The goal is to offer insights, discuss pros and cons, and share best practices for achieving optimal load times and flexibility, from an experienced developer’s point of view.

Oracle and Hadoop are complementary technologies where the whole is greater than the sum of the parts. They both have parallel architectures which, if used intelligently, can move data at an impressive rate. Last year, we achieved a load rate of 12TB (terabytes) per hour between Oracle Exadata and Hadoop running on Oracle’s Big Data Appliance (BDA). The ability to distill big data in Hadoop and then seamlessly move large result sets into the Oracle stack creates enormous added value in solving Big Data problems.

In supporting customers who need this functionality, we’ve noticed that more often than not we are talking to people who are either Hadoop experts or Oracle heavyweights, but rarely both. To help bridge the two technologies, we will include breakout sections with some rudimentary background notes about Hadoop and Oracle that we think are important for using these tools effectively. Additional specialized posts will go into loading issues specific to RAC and Exadata environments.

Why Use Oracle Big Data Connectors?

Hadoop developers might be asking themselves the following question: Oracle has been around for a long time managing huge sets of data in tables, and those tables had to be loaded somehow. What’s the added value of the Big Data Connectors? Can’t we use the standard utilities Oracle has always provided to load tables?

The quick answer is yes. But if you are dealing with Big Data, you really don’t want to.

Some Background about Conventional Loading Tools and Oracle

Oracle's off-the-shelf utility for loading data from external sources is called SQL*Loader. It does a great job of loading files of various formats into an Oracle table. The following SQL*Loader control file illustrates what this utility does:

LOAD DATA
INFILE file1.dat
INFILE file2.dat
INFILE file3.dat
APPEND
INTO TABLE emp
( empno  POSITION(1:4)   INTEGER EXTERNAL,
  ename  POSITION(6:15)  CHAR,
  deptno POSITION(17:18) CHAR,
  mgr    POSITION(20:23) INTEGER EXTERNAL
)

SQL*Loader is being told to open three files and append the existing table “emp” with data from those files; the column mapping, physical positions, and representation of the fields are spelled out between the parentheses. SQL*Loader is a really powerful tool for processing files of various formats.

But to use this tool with Hadoop you need to work around several problems. The first is that Hadoop content lives in Hadoop Distributed File System (HDFS) files, not in standard OS file systems. SQL*Loader does not know how to access HDFS directly, so the “INFILE” clause is a non-starter.

You could work around this problem in two ways. One way is to copy the file from Hadoop onto a local disk on a system where SQL*Loader is installed. The problem with this solution is that Hadoop files are big, very often bigger than any storage you have on a single system. Remember that a single Hadoop file can potentially be huge (say 18TB, larger than the digital content of the Library of Congress). That’s a big storage requirement for a single system, especially for a transient need such as staging data. You can also assume that whatever storage requirements you have today for Big Data, they will certainly grow fast.

Secondly, in order to get the data from HDFS into an Oracle table you are doubling the amount of IO resources consumed: “read from HDFS, write into an Oracle table” becomes “read from HDFS, write to a staged file, read from the staged file, write into an Oracle table.” When operating against Big Data, doubling the IO overhead is worth avoiding.

An alternative approach is to use FUSE technology (Mountable HDFS), which creates a mount point for HDFS. It is an elegant solution, but it is substantially slower than the Oracle Big Data Connectors (by a factor of 5) and consumes about three times the CPU.

And in both cases you would be forced to run SQL*Loader on the machine where Oracle lives, not because of some functional limitation of SQL*Loader (you can run it anywhere) but because of the practicalities of working with HDFS, which is inherently distributed. Running SQL*Loader on a non-Oracle system means moving huge blocks of distributed data living on any number of Hadoop DataNodes through the network to a single system, which is then tasked with passing the entire payload over the network again to Oracle. This model doesn’t scale.

 

[Image: the wrong tool for the job]

Exploiting Parallelism in Oracle and Hadoop to Load Data

The best solution for loading data from Hadoop to Oracle is to use and align the mechanisms for doing parallel work in both environments.

Parallelism in Oracle Loader for Hadoop (OLH)

For OLH this means running MapReduce programs in Hadoop to break a load operation into tasks running on all available MapReduce nodes in a Hadoop cluster. These MapReduce tasks run concurrently, naturally dividing the workload into discrete payloads; each task uses Oracle’s MapReduce loader code to connect to Oracle Database remotely and load data into a target table. It’s a natural parallel model for Hadoop, since the loading logic is encapsulated and run as vanilla MapReduce jobs. And it’s a natural model for Oracle, since the Oracle database system is simply being asked to serve multiple clients (i.e. MapReduce tasks) loading data at once, using the standard client-server architecture that’s been around for decades.

Parallelism in Oracle SQL Connector for Hadoop Distributed File System (OSCH)

OSCH is the alternative approach, one that marries two other parallel mechanisms: Oracle Parallel Query for Oracle external tables, and the Hadoop HDFS Client. To explain how these mechanisms align, let’s first talk about external tables and Parallel Query.

External Tables

External tables are tables defined in Oracle that manage data not living in Oracle. For example, suppose you had an application that managed and frequently updated some structured text files on a system, but you needed to access that data to join it to some Oracle table. You would define an Oracle external table that points to the same structured text files updated by the application, accompanied by access parameters that look strikingly similar to the SQL*Loader syntax discussed above. That’s not a coincidence: Oracle external tables of this type (ORACLE_LOADER) use a driver that executes SQL*Loader code under the covers.
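To make that concrete, here is a minimal sketch of such a definition over two comma-delimited text files. The table name (emp_ext), the directory object (app_data_dir), and the file names are hypothetical, not anything prescribed by Oracle:

-- emp_ext exposes two comma-delimited OS files as a read-only table.
-- app_data_dir is assumed to be an Oracle directory object that maps to
-- the OS directory holding emp1.dat and emp2.dat.
CREATE TABLE emp_ext
( empno  NUMBER,
  ename  VARCHAR2(10),
  deptno NUMBER
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY app_data_dir
  ACCESS PARAMETERS
  ( RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('emp1.dat', 'emp2.dat')
)
REJECT LIMIT UNLIMITED;

Once created, the files can be queried and joined like any regular table, and the same access-parameter style carries over to the HDFS-backed external tables shown later in the comments.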

Parallel Query

Parallel Query (PQ) is a “divide and conquer” strategy that decomposes a SQL statement into partitioned tasks that can execute in parallel, and then merges the results. PQ exploits the fact that SQL tables are symmetric and can be logically subdivided into horizontal partitions (i.e. sets of rows). With PQ, if you want to execute:

SELECT last_name FROM emp WHERE salary > 30000

Oracle can decompose this query into smaller units of work that perform the identical query in parallel against mutually exclusive sets of rows in the “emp” table. For PQ to give you this advantage it needs to be enabled and properly configured (a detail we will talk about in a future post). For now you simply need to understand that PQ breaks down SQL statements into worker bees (i.e. PQ slaves) that divide the load and execute in parallel. In particular, PQ can be enabled for external tables, which allows SQL to access data outside of Oracle in parallel. The amount of parallelism an external table gets is dictated by its DOP (degree of parallelism), which can be asserted in various ways: as an attribute of the table, within a SQL statement that uses the table, or at the session level after the user connects to Oracle.
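As a rough sketch (the table names and the DOP of 8 are illustrative, not recommendations), those three ways of asserting the DOP look like this:

-- 1. As an attribute of the table:
ALTER TABLE emp PARALLEL 8;

-- 2. Within a SQL statement, via a hint on the table:
SELECT /*+ PARALLEL(e, 8) */ last_name FROM emp e WHERE salary > 30000;

-- 3. At the session level, after connecting to Oracle:
ALTER SESSION FORCE PARALLEL QUERY PARALLEL 8;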

HDFS Client

Now let’s talk about the Hadoop HDFS Client. This is a Java API living in Hadoop that acts as a client to HDFS file systems. It looks like a standard programmatic file system interface, with open, read, write, and close methods. But because it works against HDFS, which distributes the individual blocks of a file across a Hadoop cluster, there is a lot of parallelism going on in the back end. Blocks are served up by Hadoop DataNodes, daemons running on the Hadoop nodes that serve the data blocks stored locally on their node. If you run a lot of HDFS clients concurrently against different HDFS files, you are doing lots of concurrent IO and concurrent streaming of data from every Hadoop node that has a running DataNode. In other words, you are maximizing retrieval of data from the Hadoop cluster.

Putting It All Together

OSCH works by using all of these mechanisms together. It defines a specialized external table that invokes HDFS Client software to access data in HDFS. When PQ is enabled for this type of external table, a SQL SELECT statement gets decomposed into N PQ slaves (where N is the DOP). In other words, a single SELECT statement can kick off N PQ slaves, each accessing K Hadoop DataNodes in parallel. This access of HDFS blocks by PQ slaves maximizes disk IO, network bandwidth, and processing in both Oracle and Hadoop.

[Image: data stampede]

With this model, you load an Oracle table (e.g. “MY_TABLE”) by executing a single SQL INSERT statement, one that gets its data from a subordinate SELECT statement referencing the external table that retrieves data from HDFS (e.g. “MY_EXTERNAL_TABLE”).

INSERT INTO MY_TABLE SELECT * FROM MY_EXTERNAL_TABLE;

Actually I lied. It takes two statements.

COMMIT;

Just sayin.
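In practice, if you also want the insert itself to run in parallel and use direct-path writes, the load needs a bit more ceremony. Here is a sketch only; the DOP of 8 is illustrative, and whether direct-path and parallel DML make sense depends on your environment:

-- Parallel DML must be enabled for the session before the INSERT runs.
ALTER SESSION ENABLE PARALLEL DML;

-- APPEND requests direct-path inserts; PARALLEL sets the DOP on the target.
INSERT /*+ APPEND PARALLEL(MY_TABLE, 8) */ INTO MY_TABLE
  SELECT * FROM MY_EXTERNAL_TABLE;

COMMIT;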

Next Topic

In the following post we will look at OLH in depth, starting with JDBC. We will walk through the execution model and discuss the basics of configuring and tuning a MapReduce job used to load a table living in Oracle.

Author’s Background

My background in this space involves both product development and performance. I was pulled into this project about 20 months ago from doing work in Oracle middleware (specifically Oracle Coherence). I am currently working with a talented team that developed OLH and OSCH from scratch. My contribution was to design and prototype OSCH to the point that it scaled, and then to spin up on Oracle BDA/Exadata/Hadoop internals to do performance benchmarks and testing. Because I’m almost the newest member of the team, the experience of spinning up in this space is still fresh in my mind, so I have a healthy respect for what it’s like to wrap one’s brain around both technologies. Many readers will have much deeper knowledge in either the Oracle space or about Hadoop, so questions or clarifications are welcome.

 

 


Comments
  • Michael Segel Thursday, May 2, 2013

    Hi,

    I'm interested in the OSCH approach. Just a quick question... how do you create the external table in Oracle?

    Can it be done programmatically ?

    (It would be nice to see an example of the ddl syntax specifying a file sitting on an HDFS cluster...)

    Also are you limited to Cloudera's release (CDH4) or can you hit Hortonworks, MapR, and Apache which may or may not be on the same release?

    Thx


  • guest Thursday, May 16, 2013

    Hi Mike,

    We are certified to work with Cloudera and Hadoop 1.0 and 1.1. We expect that these mechanisms would work fine with other distributions.

    In future posts we will get into the mechanics of creating external tables for Hadoop but I can post an example here:

    CREATE TABLE FOO
    ( F1 NUMBER,
      I2 NUMBER(38,0),
      V3 VARCHAR2(50),
      D4 DATE,
      T5 TIMESTAMP (6),
      V6 VARCHAR2(200),
      I7 NUMBER(38,0)
    )
    ORGANIZATION EXTERNAL
    ( TYPE ORACLE_LOADER
      DEFAULT DIRECTORY "OLHP_DEFAULT_DIR"
      ACCESS PARAMETERS
      ( RECORDS DELIMITED BY NEWLINE
        disable_directory_link_check
        PREPROCESSOR OSCH_BIN_PATH:hdfs_stream
        FIELDS TERMINATED BY ','
        MISSING FIELD VALUES ARE NULL
        ( f1 DECIMAL EXTERNAL,
          i2 DECIMAL EXTERNAL,
          v3 CHAR(200),
          d4 CHAR DATE_FORMAT DATE MASK "YYYY-MM-DD HH24:MI:SS",
          t5 CHAR DATE_FORMAT TIMESTAMP MASK "YYYY-MM-DD HH24:MI:SS.FF",
          v6 CHAR(200),
          i7 DECIMAL EXTERNAL
        )
      )
      LOCATION
      ( 'lf_0.hdfsm',
        'lf_1.hdfsm'
      )
    );

    The PREPROCESSOR directive is what makes this external table definition get data from Hadoop. Take it out, and the external table will look for data in the local file system.

    External tables are like any other SQL DDL. They can be created programmatically and then processed by Oracle using JDBC, ODBC, OCI, or PL/SQL.

    The latest revision of OSCH V2.1 will generate external tables for simple cases.

    But I'd recommend spending some time with the formal Oracle documentation to understand how to create these things manually.

    http://docs.oracle.com/cd/B19306_01/server.102/b14215/et_concepts.htm

    Regards,

    Bob


  • Melli Annamalai Thursday, May 16, 2013

    Hi Mike,

    To add to Bob's comments:

    Oracle SQL Connector for HDFS includes a command-line utility to automatically generate the external table. The utility takes as input a set of configuration parameters: Hive database and table name (if accessing a Hive table), input file directory and delimiter (if accessing files on HDFS), Oracle Database connection information, and so on. A table such as the one Bob describes above is automatically created in the database for your input data.

    Oracle SQL Connector for HDFS and Oracle Loader for Hadoop are certified for:

    - Cloudera's Distribution Including Apache Hadoop (CDH3 and CDH4)

    - Apache Hadoop 1.0 and 1.1

    We expect the connectors to work for Hadoop distributions that are based on these. We have a validation test kit that vendors and customers can use to test distributions not on the certification list.


  • Melli Annamalai Thursday, May 16, 2013

    See the Oracle Big Data Connectors User's Guide for more information: http://docs.oracle.com/cd/E40622_01/doc.21/e36961/toc.htm


  • guest Saturday, December 28, 2013

    Hi Bob, do you provide Hadoop training?


  • guest Wednesday, January 8, 2014

    Sorry, I'm a database developer. You might try Cloudera; they have online tutorials.



  • guest Friday, February 27, 2015

    Hi, nice tutorial on how to load Oracle tables. This has cleared some of my doubts in Hadoop.


  • James Thursday, January 12, 2017

    Here you mentioned Oracle Exadata, but what type of performance is expected with a standard Oracle 11g database using RAC? Also, which method between OLH and OSCH works best?


  • guest Friday, January 13, 2017

    The network bandwidth between the system running Oracle and the cluster running Hadoop is typically the biggest restriction on performance.

    Exadata is typically tethered to the BDA using Infiniband. A 10 Gigabit network between RAC and a Hadoop cluster should be very good.

    OSCH and OLH load times are similar when measuring the actual loading of data from Hadoop into Oracle. The difference is that OLH loading is preceded by a MapReduce job before the loading starts, so the total elapsed time is typically about three or four times longer. But OLH uses a fraction of the resources on the Oracle system compared to OSCH.

    That is because OLH uses OCI Direct Path to load (when the target table is partitioned) and because the data type conversion is done using Hadoop cycles during the MapReduce phase.

