Monday Jun 17, 2013

How to Load Oracle Tables From Hadoop Tutorial (Part 2 - Hello World)

[Image: HelloWorld]


A “Hello World” Example using OLH

This post will discuss the basic mechanics for loading an Oracle table using Oracle Loader for Hadoop (OLH). For this “Hello World” discussion, we will use JDBC to drive the example, loading delimited text living in HDFS files into a simple un-partitioned table called “FIVDTI” living in an Oracle schema. It will illustrate the bare bones structure of an OLH job, and touch upon the minimal configuration properties you need to set to get something working end to end. This tutorial assumes that you know how to run basic MapReduce jobs and know how to connect to Oracle with SQL*Plus to create and drop tables using Oracle SQL.

Restating what was explained in the introduction to this tutorial, OLH uses a MapReduce job to read data living in HDFS and to load it into a target table living in Oracle. If the data you are loading is in a typical form (e.g. delimited text or CSV files), you should be able to load a table interactively with a single command.

Requirements for Running an OLH Job

Since OLH runs MapReduce jobs, the OLH command will need to run either on some system in a Hadoop cluster, or on a system that has client access to the cluster. That system will also need JDBC access to an Oracle database that has a schema with the table that you want to load. Finally, the system will need access to an OLH installation and its jar files (i.e. OLH_HOME in our running example below).

If you want to use a Hadoop client, setting it up on a development node is not a lot of hard work. You simply need to download and install the same Hadoop software that is running on the cluster, using the configuration files specific to that cluster.

You want to sanity test everything to make sure the plumbing works. If you can kick off the Hadoop Wordcount MapReduce job and can read/write/delete HDFS files interactively using Hadoop, your Hadoop plumbing should be ready to accept an OLH job. (See http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html for details on sanity checking MapReduce.)
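A quick smoke test along these lines exercises both HDFS access and MapReduce. (The file and directory names below are just placeholders, and the examples jar name and location vary by Hadoop distribution.)

# Read, write, and delete HDFS files interactively
hadoop fs -ls /
hadoop fs -put /etc/hosts /tmp/hosts_copy
hadoop fs -cat /tmp/hosts_copy
hadoop fs -rm /tmp/hosts_copy

# Run the stock WordCount job that ships with Hadoop
hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount /tmp/wc_input /tmp/wc_output
hadoop fs -cat /tmp/wc_output/part-*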

At this point it would be good to carve out a subdirectory in HDFS that you own, which will hold result directories of OLH jobs that you want to run. (In our working example in this post the result log directory will live under “/user/oracle/olh_test/results/fivdti”.) You use the hadoop command-line utility to do this:

hadoop fs -mkdir /user/oracle/olh_test/results/fivdti

You want to make sure that you are either the owner of this HDFS directory or at least have read and write access to it.

Connecting to an Oracle database requires you to connect using an Oracle connection URL with the name of the Oracle user and the Oracle password. In this example, we are assuming the Oracle schema and the Oracle user are the same. (See http://radiofreetooting.blogspot.com/2007/02/user-schema.html for an explanation of the difference between the two concepts.)

Before kicking off an OLH job it is worth taking the time to ensure the Oracle connection credentials are correct. Remember that OLH will load an Oracle table across many Hadoop map or reduce tasks, each of which will try to make a JDBC connection to the database, so you want to eliminate trial-and-error authentication hiccups before passing the credentials on to OLH. (See http://www.dba-oracle.com/t_oracle_jdbc_connection_testing.htm for details.)
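One simple check (assuming SQL*Plus is available somewhere, and using the host, port, service name, and credentials that appear in the connection file later in this post) is an EZConnect login followed by a trivial query against the schema:

sqlplus olhp/welcome1@//myoraclehost:1511/dbm

SQL> SELECT table_name FROM user_tables;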

While configuring and testing the OLH framework you will also want some minimal administration of Hadoop and Oracle. Specifically you will want to browse the Hadoop URL that contains JobTracker information where you can monitor a load job. You will also need to connect to Oracle via SQL*Plus to do administration with your schema and to manually inspect the target table using SQL. For purposes of OLH, installing a formal Oracle client on your development system, though convenient, is overkill. You simply need to log onto a system where SQL*Plus is available (e.g. the system where Oracle is running) and where you can connect to the Oracle database.

The Structure of an OLH Command

[Image: Plumbing parts]

Let’s start by looking at an OLH command that you would invoke to kick off an OLH MapReduce job. Again this will use JDBC to connect to the Oracle database and load a table with delimited text living in files in HDFS.

$HADOOP_HOME/bin/hadoop jar \
  $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -D oracle.hadoop.loader.jobName=OLHP_fivdti_dtext_jdbc_0_722 \
  -D oracle.hadoop.loader.loaderMapFile=file:/tmp/loaderMap_fivdti.xml \
  -D mapred.reduce.tasks=0 \
  -D mapred.input.dir=/user/olh_performance/fivdti/56000000_90 \
  -D mapred.output.dir=/user/oracle/olh_test/results/fivdti/722 \
  -conf /tmp/oracle_connection.xml \
  -conf /tmp/dtextInput.xml \
  -conf /tmp/jdbcOutput.xml

The command starts with an invocation of hadoop (living in the Hadoop client) passing the “jar” command.

$HADOOP_HOME/bin/hadoop jar

This is followed by a reference to OLH’s jar file (“oraloader.jar”) and the OLH loader class (fully qualified by its package name) that defines an OLH load job.

$OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader

The following two properties are consumed by OLH. (Note that the space between "-D" and the property name is not a typo: it is Hadoop’s convention for setting properties that are directed at Hadoop rather than at the JVM running underneath it.)

The first is to give the job a meaningful name when it executes in Hadoop. When creating a name, I find it useful to capture the name of the table being loaded, the type of input, the load method, the number of reducers used, and a unique job number for the OLH run (i.e. 722).

-D oracle.hadoop.loader.jobName=OLHP_fivdti_dtext_jdbc_0_722

MapReduce jobs are long running batch jobs. While debugging and tuning OLH load operations you will want to spend a lot of time looking at Hadoop’s JobTracker URL to find your job by name and see how it behaves.

The second property points to the loader map. The loader map is an XML file that indicates what Oracle table will be loaded, and indicates the fields in a delimited text file that map to columns in the target table.

-D oracle.hadoop.loader.loaderMapFile=file:/tmp/loaderMap_fivdti.xml

The next three properties are consumed by Hadoop.

The first designates the number of reduce tasks to run. The reduce stage in OLH is used to improve load performance by sorting records by table partitions and optionally by table columns. If the table being loaded is not partitioned and there is no value in sorting by columns, then you will want to set this value to zero. This means that the MapReduce job only runs the map stage, which simply loads the Oracle table. Note that for JDBC, a zero value can be used even if the table is partitioned. But if a table is partitioned you will probably want to use the OCI Direct Load method, which does require a non-zero value. (Using OCI Direct Load to load partitioned tables will be discussed in depth in the next post.)

-D mapred.reduce.tasks=0

The second property designates the HDFS directory that holds input files of delimited text that are to serve as the payload for the target table.

-D mapred.input.dir=/user/olh_performance/fivdti/56000000_90

Finally you specify an HDFS directory where OLH can write logging output specific to the job.

In this case we carved out a directory under /user/oracle/olh_test/results, and assigned a unique integer for the job (i.e. 722). An important note: the subdirectory “722” is created by Hadoop once the job is submitted. If it already exists when the job is submitted, Hadoop will fail the job.

-D mapred.output.dir=/user/oracle/olh_test/results/fivdti/722/

The final arguments are files that complete configuration. For purposes of modularity they are split across three configuration files that can be reused for different types of jobs.

The first file has the credentials needed to connect to the Oracle database.

-conf /tmp/oracle_connection.xml

The second file designates the form of input (“delimited text”).

-conf /tmp/dtextInput.xml

The last file designates the output (i.e. online loading of an Oracle table via JDBC).

-conf /tmp/jdbcOutput.xml

Again all of these properties embedded in these various configuration files could be bundled into one large configuration file or called out explicitly using in-line “-D” properties.

However, assuming you don’t have a penchant for obscurity or verbosity, it’s probably prudent to break down the various configuration settings into configuration files that are organized to be reusable. Tuning Hadoop to run big MapReduce jobs can take some time to get right, and once you get there it’s a good idea to isolate the settings so they can be easily reused for different jobs. It’s also easier to test different combinations on the fly (e.g. trying JDBC and then trying OCI Direct Path) against the same table: you just need to swap jdbcOutput.xml for the analogous specification for OCI Direct Path.
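For example, an ociOutput.xml counterpart to jdbcOutput.xml would simply name OLH’s OCI Direct Path output class instead. A minimal sketch (OCI-specific tuning properties are deferred to the next post):

<configuration>
  <property>
    <name>mapreduce.outputformat.class</name>
    <value>oracle.hadoop.loader.lib.output.OCIOutputFormat</value>
  </property>
</configuration>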

Loader Maps

Loader maps are used to map fields read as input to columns living in the Oracle table that will be loaded. They also identify the table being loaded (e.g. “FIVDTI”) and the schema (“OLHP”) where the table resides.

First let’s look at the Oracle table we are loading:

SQL> describe fivdti

Name                          Null?    Type
----------------------------- -------- --------------
F1                                     NUMBER
I2                                     NUMBER(38)
V3                                     VARCHAR2(50)
D4                                     DATE
T5                                     TIMESTAMP(6)
V6                                     VARCHAR2(200)
I7                                     NUMBER(38)

Again, the loader map file is identified by the following OLH property:

-D oracle.hadoop.loader.loaderMapFile=file:/tmp/loaderMap_fivdti.xml

The file contains this specification:

<LOADER_MAP>
  <SCHEMA>OLHP</SCHEMA>
  <TABLE>FIVDTI</TABLE>
  <COLUMN field="F0">F1</COLUMN>
  <COLUMN field="F1">I2</COLUMN>
  <COLUMN field="F2">V3</COLUMN>
  <COLUMN field="F3" format="yyyy-MM-dd HH:mm:ss">D4</COLUMN>
  <COLUMN field="F4">T5</COLUMN>
  <COLUMN field="F5">V6</COLUMN>
  <COLUMN field="F6">I7</COLUMN>
</LOADER_MAP>

We need to map fields in the delimited text to the column names specified in the table. Fields in delimited text files default to a simple naming convention, “F0, F1, F2…”, reflecting the physical order in which they appear in a line of CSV text. These field names are then paired with column names. At the top of the specification are the schema name and table name.

What is critical about loader maps is ensuring that the text fields being loaded can be legally converted to the Oracle data types of the mapped columns, considering issues such as precision and scale that are asserted on the Oracle columns. Typically for standard scalar types this is straightforward. DATE columns, however, are fussier and require the user to describe an explicit format. (Strictly speaking, loader maps are not required in cases where column names reflect field names and the DATE data type is not used, but in this case we are using default field names that are different from the formal column names of the Oracle table.)

Configuration Specified in Files

Let’s look at the settings living in the configuration files specified above.

OLH Connection Properties

The oracle_connection.xml file in the example above contains the credentials needed to find the Oracle database and to connect to a schema. This information gets passed to the OLH MapReduce jobs.

The connection URL follows the “jdbc:oracle:thin” pattern. You need to know the host where the Oracle database is running (e.g. “myoraclehost”), the listening port (e.g. 1511), and the database service name (e.g. “dbm”). You also need to identify the Oracle user (e.g. “olhp”) and password (e.g. “welcome1”). (In a production environment you will want to use Oracle Wallet to avoid storing passwords in clear text, but we will save that issue for a later post.)

<configuration>
  <property>
    <name>oracle.hadoop.loader.connection.url</name>
    <value>jdbc:oracle:thin:@myoraclehost:1511/dbm</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.user</name>
    <value>olhp</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.password</name>
    <value>welcome1</value>
  </property>
</configuration>

OLH Input Format

The dtextInput.xml configuration file in the example above describes the physical characteristics of the rows of data that OLH will read. OLH provides built-in input format implementations that cover many common formats you would expect to see living in Hadoop data files, including delimited text and CSV, text parsed with regular expressions, Hive formats, and Oracle NoSQL Database. In general, it’s much easier to keep your data in a form the existing built-ins understand, although rolling your own input format class is a supported option.

In our running example we are using the default settings for delimited text, which is CSV. Additional properties for delimited text allow you to specify an alternative field terminator and explicit field enclosers (useful when field values contain embedded commas); an example follows the configuration below.

<configuration>
  <property>
    <name>mapreduce.inputformat.class</name>
    <value>oracle.hadoop.loader.lib.input.DelimitedTextInputFormat</value>
  </property>
</configuration>
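For instance, if your input used pipe-separated fields enclosed in double quotes, the input configuration might look something like this. (The delimited text property names shown are the ones documented for OLH; treat the specific values as illustrative and check the documentation for your release.)

<configuration>
  <property>
    <name>mapreduce.inputformat.class</name>
    <value>oracle.hadoop.loader.lib.input.DelimitedTextInputFormat</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.input.fieldTerminator</name>
    <value>|</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.input.initialFieldEncloser</name>
    <value>"</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.input.trailingFieldEncloser</name>
    <value>"</value>
  </property>
</configuration>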

OLH Output Format

The jdbcOutput.xml file is used to specify how we want to load Oracle. Output formats can be divided into two types: those that load the data directly into an Oracle database table, and those that store the data in a new set of files in HDFS that can be pulled into Oracle later.

For our running example, we are going to use JDBC.

<configuration>
  <property>
    <name>mapreduce.outputformat.class</name>
    <value>oracle.hadoop.loader.lib.output.JDBCOutputFormat</value>
  </property>
</configuration>

Running the OLH Job

When you kick off the OLH Job it will give you console output that looks something like this:

Copyright (c) 2011, 2013, Oracle and/or its affiliates. All rights reserved.
13/05/30 09:18:00 INFO loader.OraLoader: Oracle Loader for Hadoop Release 2.1.0 - Production

Copyright (c) 2011, 2013, Oracle and/or its affiliates. All rights reserved.

13/05/30 09:18:00 INFO loader.OraLoader: Built-Against: hadoop-2.0.0-mr1-cdh4.1.2 hive-0.9.0-cdh4.1.2 avro-1.6.3 jackson-1.8.8
13/05/30 09:18:02 INFO loader.OraLoader: oracle.hadoop.loader.loadByPartition is disabled because mapred.reduce.tasks=0
13/05/30 09:18:02 INFO loader.OraLoader: oracle.hadoop.loader.enableSorting disabled: cannot sort by key when number of reducers is zero
13/05/30 09:18:02 INFO output.DBOutputFormat: Setting map tasks speculative execution to false for : oracle.hadoop.loader.lib.output.JDBCOutputFormat
13/05/30 09:18:02 WARN loader.OraLoader: Sampler error: the number of reduce tasks must be greater than one; the configured value is 0 . Job will continue without sampled information.
13/05/30 09:18:02 INFO loader.OraLoader: Sampling time=0D:0h:0m:0s:14ms (14 ms)
13/05/30 09:18:02 INFO loader.OraLoader: Submitting OraLoader job OraLoader .
13/05/30 09:18:02 INFO input.FileInputFormat: Total input paths to process : 90
13/05/30 09:18:04 INFO loader.OraLoader: map 0% reduce 0%
13/05/30 09:18:19 INFO loader.OraLoader: map 1% reduce 0%
13/05/30 09:18:20 INFO loader.OraLoader: map 2% reduce 0%
13/05/30 09:18:22 INFO loader.OraLoader: map 3% reduce 0%

….
13/05/30 09:21:13 INFO loader.OraLoader: map 95% reduce 0%
13/05/30 09:21:16 INFO loader.OraLoader: map 96% reduce 0%
13/05/30 09:21:18 INFO loader.OraLoader: map 97% reduce 0%
13/05/30 09:21:20 INFO loader.OraLoader: map 98% reduce 0%
13/05/30 09:21:23 INFO loader.OraLoader: map 99% reduce 0%
13/05/30 09:21:31 INFO loader.OraLoader: map 100% reduce 0%
13/05/30 09:21:33 INFO loader.OraLoader: Job complete: OraLoader (job_201305201106_0524)

Note that because we are loading a simple non-partitioned table, this is a map-only job where loading is done by the mappers and there is no reduce phase. The warning message at the outset of the job concerns an OLH feature called the sampler, which is used when tables are partitioned to balance the reducers doing a load. Since the target table is not partitioned, the warning about the sampler being disabled can be safely ignored.

Where to look if something went wrong

I’ve run OLH jobs interactively daily for more than a year. When something goes wrong, the Hadoop console output will make it obvious and typically gives a pretty good idea of what problem you are having. I also rely on Hadoop’s JobTracker and TaskTracker UIs, which allow you to drill down to a failed task and look at the output it produced: typically Java stack dumps and log messages that detail the problems it was having.

The results directory in HDFS that was specified in the "mapred.output.dir" setting of the OLH command contains lots of information about a job run. In that directory there will be a top-level report called “oraloader-report.txt”. It offers a clean breakdown of the time spent by the Hadoop tasks that were used to load the target table, and it is probably the quickest way of looking at the workloads and determining whether they are unbalanced.
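Because the report lives in HDFS, you can view it with the Hadoop client; for the job above that would be something like:

hadoop fs -cat /user/oracle/olh_test/results/fivdti/722/oraloader-report.txt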

Hadoop and Connection Resources

[Image: Overload]

About the only additional issue that you need to be concerned about for this kind of job is the number of concurrent connections the Oracle database accepts. This becomes a problem when the number of Hadoop tasks that are loading a table concurrently exceeds the number of connections that Oracle will accept. If that happens the loading tasks will fail with an ORA-00020 error.

You want to check the number of map and reduce slots that are configured for Hadoop. For map-only jobs, if the number of map slots is less than the number of connections Oracle accepts, there won’t be a problem. The same holds true for full MapReduce jobs if the number of reduce slots is less than the maximum number of connections Oracle accepts.
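If you are not sure how many connections Oracle will accept, you can ask the database; the relevant initialization parameters are “processes” and “sessions” (this assumes your user has privileges to query V$PARAMETER):

SQL> SELECT name, value FROM v$parameter WHERE name IN ('processes', 'sessions');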

If this is not true you need to artificially restrict the concurrency of load tasks.

For map-only jobs (like the one illustrated above) this means you will need to restrict the number of map slots in the cluster available to the OLH job to something less than the number of connections Oracle allows.

For full MapReduce OLH jobs (which are more typical) loading occurs in the reduce phase, and this can be easily controlled in the OLH command by tweaking the “mapred.reduce.tasks” property mentioned above, and setting it to an appropriate number.

-D mapred.reduce.tasks=20

Summary

To summarize, a bare bones OLH configuration typically needs the following information:

  • How to connect to Oracle
  • How many reduce tasks to run
  • The form of input (e.g. CSV) and output (e.g. JDBC)
  • An HDFS directory containing input files
  • An HDFS directory where OLH can write information about job results
  • A loader map that tells what fields correspond to what columns in a table

Done


That's it.  Easy-peasy lemon squeezy.  All OLH commands pretty much look like the one used above with minor substitutions.  Creating configurations for other tables typically requires only creating a new loader map specification; all the other configuration files can be reused.

The next post will discuss using OLH with OCI Direct Load which is what you will want to use when loading big tables that are sorted by key or are partitioned. We will spend a lot of time discussing performance issues such as load balancing, using the OLH sampler, and using the SDP transport protocol.

Thursday May 16, 2013

Oracle Big Data Connectors 2.1

Oracle Big Data Connectors 2.1 is now available.  

Oracle Loader for Hadoop and Oracle SQL Connector for HDFS add certification with CDH 4.2 and Apache Hadoop 1.1.1 in this release.

Enhancements to Oracle Loader for Hadoop: 

 - Ability to load from Hive partitioned tables
 - Improved usability and error handling
 - Sort by user-specified key before load

Tuesday Apr 30, 2013

How to Load Oracle Tables From Hadoop Tutorial (Part 1 - Overview)


Introduction

This is the first of a series of blog posts that will discuss how to load data living in the Hadoop Ecosphere into Oracle tables. The goal is to give insights, discuss pros and cons, and share best practices for achieving optimal load times and flexibility, from an experienced developer’s point of view.

Oracle and Hadoop are complementary technologies where the whole is greater than the sum of the parts. They both have parallel architectures, which, if used intelligently, can move data at an impressive rate. Last year, we achieved a load rate of 12TB (terabytes) per hour between Oracle Exadata and Hadoop running on Oracle’s Big Data Appliance (BDA). The ability to distill big data in Hadoop and then seamlessly move large result sets into the Oracle stack creates enormous added value in solving Big Data problems.

In supporting customers who need this functionality, we’ve noticed that more often than not we are talking to people who are either Hadoop experts or Oracle heavyweights, but not both. In our attempt to explain these two technologies we will include breakout sections with some rudimentary background notes about Hadoop and Oracle that we think are important to understand, so you can use these tools effectively. Additional specialized topics will also go into loading issues specific to RAC and Exadata environments.

Why Use Oracle Big Data Connectors?

Hadoop developers might be asking themselves the following question: Oracle has been around for a long time managing huge sets of data in tables, and those tables had to be loaded somehow. So what’s the added value of the Big Data Connectors? Can’t we use the standard utilities Oracle has provided to load tables?

The quick answer is yes. But if you are dealing with Big Data, you really don’t want to.

Some Background about Conventional Loading Tools and Oracle

Oracle's off-the-shelf utility for loading data from external sources is called SQL*Loader. It does a great job of loading files of various formats into an Oracle table.

The following SQL*Loader control file illustrates what this utility does:

LOAD DATA
INFILE file1.dat
INFILE file2.dat
INFILE file3.dat
APPEND
INTO TABLE emp
( empno  POSITION(1:4)   INTEGER EXTERNAL,
  ename  POSITION(6:15)  CHAR,
  deptno POSITION(17:18) CHAR,
  mgr    POSITION(20:23) INTEGER EXTERNAL
)

SQL*Loader is being told to open three files and append data to an existing table “emp”, with the column mapping, physical position, and representation of the data articulated between the parentheses. SQL*Loader is really powerful for processing files of various formats.

But to use this tool with Hadoop you need to work around several problems. The first is that Hadoop content lives in Hadoop Distributed File System (HDFS) files, not in standard OS file systems. SQL*Loader does not know how to access HDFS directly, so the “INFILE” verbiage is a non-starter.

You could work around this problem two ways. One way is to copy the file from Hadoop onto a local disk on a system where SQL*Loader is installed. The problem with this solution is that Hadoop files are big, very often bigger than any storage you have on a single system. Remember that a single Hadoop file can potentially be huge (say 18TB, larger than the digital content of the Library of Congress). That’s a big storage requirement for a single system, especially for a transient requirement such as staging data. Also you can assume that whatever storage requirements you have today for Big Data, they will certainly grow fast.

Secondly, in order to get the data from HDFS into an Oracle table you are doubling the amount of IO resources consumed. (“Read from HDFS, write into an Oracle table” becomes “Read from HDFS, write to staged file, read from staged file, write into an Oracle table”). When operating against Big Data, doubling the IO overhead is worth avoiding.
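To make the overhead concrete, the staging workaround amounts to something like the following (the paths, credentials, and control file are placeholders): one full pass to copy the data out of HDFS, then a second full read as SQL*Loader loads the staged copy.

# First pass: copy the HDFS file to local storage
hadoop fs -get /user/hadoop/data/file1.dat /staging/file1.dat

# Second pass: load the staged copy into Oracle
sqlldr userid=scott/tiger control=emp.ctl data=/staging/file1.dat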

An alternative approach is to use FUSE technology (Mountable HDFS) that creates a mount point for HDFS. It is an elegant solution but it is substantially slower than Oracle Big Data Connectors (by a factor of 5) and consumes about three times the CPU.

And in both cases you would be forced to run SQL*Loader on the machine where Oracle lives, not because of some functional limitation of SQL*Loader (you can run it anywhere) but because of the practicalities of working with HDFS which is inherently distributed. Running SQL*Loader on a non-Oracle system means you are moving huge data blocks of distributed data living on any number of Hadoop DataNodes through the network to a single system which will be tasked to pass the entire payload over the network again to Oracle. This model doesn’t scale.

[Image: WrongToolForTheJob]

Exploiting Parallelism in Oracle and Hadoop to Load Data

The best solution for loading data from Hadoop to Oracle is to use and align the mechanisms for doing parallel work in both environments.

Parallelism in Oracle Loader for Hadoop (OLH)

For OLH this means running MapReduce programs in Hadoop to break up a load operation into tasks running on all available MapReduce nodes in a Hadoop cluster. These MapReduce tasks run concurrently, naturally dividing the workload into discrete payloads that use Oracle’s MapReduce code to connect to the Oracle database remotely and load data into a target table. It’s a natural parallel model for Hadoop, since the loading logic is encapsulated and run as vanilla MapReduce jobs. And it’s a natural model for Oracle, since the Oracle database system is being tasked to serve multiple clients (i.e. MapReduce tasks) loading data at once, using the standard client-server architecture that’s been around for decades.

Parallelism in Oracle SQL Connector for Hadoop Distributed File System (OSCH)

OSCH is the alternative approach that marries two other parallel mechanisms: Oracle Parallel Query for Oracle External Tables and Hadoop HDFS Client. To explain how these mechanisms align, let’s first talk about External tables and Parallel Query.

External Tables

External tables are tables defined in Oracle that manage data not living in Oracle. For example, suppose you had an application that managed and frequently updated some structured text files on a system, but you needed to access that data to join it to some Oracle table. You would define an Oracle External table pointing to the same structured text files updated by the application, accompanied by verbiage that looks strikingly similar to the SQL*Loader verbiage discussed above. That’s not a coincidence: Oracle External tables use the ORACLE_LOADER access driver, which executes SQL*Loader code under the covers.
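To make that concrete, a minimal External table over delimited files might look like the sketch below (the directory object “ext_dir”, the file names, and the columns are hypothetical):

CREATE TABLE emp_ext (
  empno  NUMBER(4),
  ename  VARCHAR2(10),
  deptno VARCHAR2(2),
  mgr    NUMBER(4)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ext_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('file1.dat', 'file2.dat')
);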

Parallel Query

Parallel Query (PQ) is a “divide and conquer” strategy that decomposes a SQL statement into partitioned tasks that can execute in parallel and merge the results. PQ exploits the fact that SQL tables are symmetric and can be logically subdivided into horizontal partitions (i.e. sets of rows). With PQ if you want to execute:

SELECT last_name FROM emp WHERE salary > 30000

Oracle can decompose this query into smaller units of work which perform the identical query in parallel against mutually exclusive sets of rows in the “emp” table. For PQ to give you this advantage it needs to be enabled and properly configured (a detail we will talk about in a future post). For now you simply need to understand that PQ works to break down SQL statements into worker bees (i.e. PQ slaves) that divide the load and execute in parallel. In particular, PQ can be enabled for External tables, which allows SQL to access data outside of Oracle in parallel. The amount of parallelism an External table has is dictated by its configured DOP (degree of parallelism). The DOP can be asserted in various ways: as an attribute of the table, within a SQL statement that uses the table, or at the session level after the user connects to Oracle.
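For example (standard Oracle SQL; the table name and the DOP of 8 are arbitrary):

-- DOP as an attribute of the table
ALTER TABLE emp PARALLEL 8;

-- DOP within a single statement, via a hint
SELECT /*+ PARALLEL(emp, 8) */ last_name FROM emp WHERE salary > 30000;

-- DOP at the session level, after connecting
ALTER SESSION FORCE PARALLEL QUERY PARALLEL 8;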

HDFS Client

Now let’s talk about Hadoop HDFS Client. This is a Java API living in Hadoop that acts as a client to HDFS file systems. It looks like your standard file system programmatic interface: with open, read, write, and close methods. But because it works against HDFS which distributes individual blocks of a file across a Hadoop cluster, there is a lot of parallelism going on in the back end. Blocks are served up to HDFS by Hadoop DataNodes that are daemons running on Hadoop nodes, serving up data blocks that are stored locally to the node. If you run a lot of HDFS Clients concurrently against different HDFS files, you are doing lots of concurrent IO and concurrent streaming of data, from every Hadoop node that has a running DataNode. In other words you are maximizing retrieval of data from the Hadoop Cluster.
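A minimal sketch of what an HDFS Client call sequence looks like in Java (the file path is hypothetical and error handling is pared down):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster settings from the Hadoop client configuration files
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Open an HDFS file and stream its contents to stdout;
    // blocks are served by whichever DataNodes hold them
    FSDataInputStream in = fs.open(new Path("/user/hadoop/data/file1.dat"));
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}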

Putting It All Together

OSCH works by using all these mechanisms together. It defines a specialized External table which can invoke HDFS Client software to access data in HDFS. And when PQ is enabled for this type of External table, a SQL select statement gets decomposed into N PQ slaves (where N is the DOP). In other words a SQL select statement can kick off N PQ slaves that are each accessing K Hadoop DataNodes in parallel. Access of HDFS blocks by PQ slaves maximizes disk IO, network bandwidth, and processing both in Oracle and in Hadoop.

[Image: DataStampede]

With this model, you load an Oracle table (e.g. “MY_TABLE”) by executing a single SQL INSERT statement, one that gets its data from a subordinate SELECT statement referencing the External table that retrieves data from HDFS (e.g. “MY_EXTERNAL_TABLE”).

INSERT INTO MY_TABLE SELECT * FROM MY_EXTERNAL_TABLE;

Actually I lied. It takes two statements.

COMMIT;

Just sayin.

Next Topic

In the following post we will look at OLH in depth, starting with JDBC. We will look at the execution model and discuss the basics for configuring and tuning a MapReduce job used to load a table living in Oracle.

Author’s Background

My background in this space involves both product development and performance. I was pulled into this project about 20 months ago from doing work in Oracle middleware (specifically Oracle Coherence). I am currently working with a talented team that developed OLH and OSCH from scratch. My contribution was to design and prototype OSCH to the point that it scaled, and then spin up on Oracle BDA/Exadata/Hadoop internals to do performance benchmarks and testing. Because I’m almost the newest member of the team, the experience of spinning up in this space is still fresh in my mind, so I have a healthy respect for what it’s like to wrap one’s brain around both technologies. Many readers will have much deeper knowledge in either the Oracle space or in Hadoop, so questions or clarifications are welcome.
