This is the first of a series of blog posts that will discuss how to load data living in the Hadoop Ecosphere into Oracle tables. The goal is to give insights, discuss pros and cons, and best practices for achieving optimal load times and flexibility from an experienced developer’s point of view.
Oracle and Hadoop are complementary technologies where the whole is greater than the sum of the parts. They both have parallel architectures, which, if used intelligently can move data at an impressive rate. Last year, we achieved a load rate of 12TB (terabytes) per hour between Oracle Exadata and Hadoop running on Oracle’s Big Data Appliance (BDA). The ability to distill big data in Hadoop and then to seamlessly move large result sets into the Oracle stack creates enormous added value in solving Big Data problems.
In supporting customers who need this functionality we’ve noticed that more frequently than not, we are talking to people who are either Hadoop experts or Oracle heavyweights but not both. In our attempt to explain these two technologies we will offer breakout sections that offer some rudimentary background notes about Hadoop and Oracle that we think are important to understand, so you can use these tools effectively. Additional specialized topics will also go into loading issues specific to RAC and Exadata environments.
Hadoop developers might be asking themselves the following question: Oracle has been around for a long time managing huge sets of data in tables. These tables had to be loaded somehow? What’s the added value of the Big Data Connectors? Can’t we use the standard utilities Oracle has provided to load tables?
The quick answer is yes. But if you are dealing with Big Data, you really don’t want to.
Oracle's off-the-shelf utility used for loading data from external source is called SQL*Loader. It does a great job loading files of various formats into an Oracle table. The following SQL*Loader control file illustrates what this utility does:
INTO TABLE emp
( empno POSITION(1:4) INTEGER EXTERNAL,
ename POSITION(6:15) CHAR,
deptno POSITION(17:18) CHAR,
mgr POSITION(20:23) INTEGER EXTERNAL
SQL*Loader is being told to open three files and append an existing table “emp” with data from the files whose column mapping, physical position, and representation are articulated between the parenthesis. SQL*Loader is really powerful for processing files of various formats.
But to use this tool with Hadoop you need to work around several problems. The first of which is that Hadoop content lives in Hadoop Distributed File System (HDFS) files, not standard OS file systems. SQL*Loader does not know how to access HDFS directly, so the “INFILE” verbiage is a non-starter.
You could work around this problem two ways. One way is to copy the file from Hadoop onto a local disk on a system where SQL*Loader is installed. The problem with this solution is that Hadoop files are big, very often bigger than any storage you have on a single system. Remember that a single Hadoop file can potentially be huge (say 18TB, larger than the digital content of the Library of Congress). That’s a big storage requirement for a single system, especially for a transient requirement such as staging data. Also you can assume that whatever storage requirements you have today for Big Data, they will certainly grow fast.
Secondly, in order to get the data from HDFS into an Oracle table you are doubling the amount of IO resources consumed. (“Read from HDFS, write into an Oracle table” becomes “Read from HDFS, write to staged file, read from staged file, write into an Oracle table”). When operating against Big Data, doubling the IO overhead is worth avoiding.
An alternative approach is to use FUSE technology (Mountable HDFS) that creates a mount point for HDFS. It is an elegant solution but it is substantially slower than Oracle Big Data Connectors (by a factor of 5) and consumes about three times the CPU.
And in both cases you would be forced to run SQL*Loader on the machine where Oracle lives, not because of some functional limitation of SQL*Loader (you can run it anywhere) but because of the practicalities of working with HDFS which is inherently distributed. Running SQL*Loader on a non-Oracle system means you are moving huge data blocks of distributed data living on any number of Hadoop DataNodes through the network to a single system which will be tasked to pass the entire payload over the network again to Oracle. This model doesn’t scale.
The best solution for loading data from Hadoop to Oracle is to use and align the mechanisms for doing parallel work in both environments.
For OLH this means running MapReduce programs in Hadoop to break up a load operation into tasks running on all available MapReduce nodes in a Hadoop cluster. These MapReduce tasks run concurrently, naturally dividing the workload into discrete payloads that use Oracle MapReduce code to connect to Oracle Database remotely and load data into a target table. It’s a natural parallel model for Hadoop since the loading logic is encapsulated and run as vanilla MapReduce jobs. And it’s a natural model for Oracle, since the Oracle database system is being tasked to serve multiple clients (i.e MapReduce tasks) loading data at once, using standard client-server architecture that’s been around for decades.
OSCH is the alternative approach that marries two other parallel mechanisms: Oracle Parallel Query for Oracle External Tables and Hadoop HDFS Client. To explain how these mechanisms align, let’s first talk about External tables and Parallel Query.
External tables are tables defined in Oracle which manage data not living in Oracle. For example, suppose you had an application that managed and frequently updated some structured text files in a system, but you needed to access that data to join it to some Oracle table. You would define an Oracle External table which pointed it to the same structured text files updated by the application, accompanied by verbiage that looks striking similar to the SQL*Loader verbiage discussed above. That’s not a coincidence. The Oracle External tables use the SQL*Loader driver which executes SQL*Loader code under the covers.
Parallel Query (PQ) is a “divide and conquer” strategy that decomposes a SQL statement into partitioned tasks that can execute in parallel and merge the results. PQ exploits the fact that SQL tables are symmetric and can be logically subdivided into horizontal partitions (i.e. sets of rows). With PQ if you want to execute:
SELECT last_name FROM emp WHERE salary > 30000
Oracle can decompose this query into smaller units of work which perform the identical query in parallel against mutually exclusive sets of rows in the “emp” table. For PQ to give you this advantage it needs to be enabled and properly configured (a detail we will talk about in a future post.) For now you simply need to understand that PQ works to break down SQL statements into worker bees (i.e. PQ Slaves) that divide the load and execute in parallel. In particular, PQ can be enabled for External tables which allow SQL to access data outside of Oracle in parallel. The amount of parallelism an External table has is configurable and is dictated by configuring the DOP (degree of parallelism). The DOP can be asserted various ways: as an attribute of a table, or within a SQL statement using a table, or at the session level after the user connects to Oracle.
Now let’s talk about Hadoop HDFS Client. This is a Java API living in Hadoop that acts as a client to HDFS file systems. It looks like your standard file system programmatic interface: with open, read, write, and close methods. But because it works against HDFS which distributes individual blocks of a file across a Hadoop cluster, there is a lot of parallelism going on in the back end. Blocks are served up to HDFS by Hadoop DataNodes that are daemons running on Hadoop nodes, serving up data blocks that are stored locally to the node. If you run a lot of HDFS Clients concurrently against different HDFS files, you are doing lots of concurrent IO and concurrent streaming of data, from every Hadoop node that has a running DataNode. In other words you are maximizing retrieval of data from the Hadoop Cluster.
OSCH works by using all these mechanisms together. It defines a specialized External table which can invoke HDFS Client software to access data in HDFS. And when PQ is enabled for this type of External table, a SQL select statement gets decomposed into N PQ slaves (where N is the DOP). In other words a SQL select statement can kick off N PQ slaves that are each accessing K Hadoop DataNodes in parallel. Access of HDFS blocks by PQ slaves maximizes disk IO, network bandwidth, and processing both in Oracle and in Hadoop.
With this model, you load an Oracle table (e.g. “MY_TABLE”) by executing a single SQL Insert statement. One that gets its data from a subordinate Select statement that references the external table retrieving data from HDFS (e.g. “MY_EXTERNAL_TABLE”).
INSERT INTO MY_TABLE as SELECT * FROM MY_EXTERNAL_TABLE;
Actually I lied. It takes two statements.
In following post we will look at OLH in depth starting with JDBC. We will look at the execution model, and discuss the basics for configuring and tuning a MapReduce job used to load a table living in Oracle.
My background in this space involves both product development and performance. I was pulled into this project about 20 months ago from doing work in Oracle middleware (specifically Oracle Coherence). I am currently working with a talented team that developed OLH and OSCH from scratch. My contribution was to design and prototype OSCH to the point that it scaled, and then spin up on Oracle BDA/Exadata/Hadoop internals to do performance benchmarks and testing. Because I’m almost the newest member of the team, the experience of spinning up in this space is still fresh in my mind, so I have a healthy respect for what it’s like to wrap ones brain around both technologies. Many readers will have much deeper knowledge in either the Oracle space or about Hadoop, so questions or clarifications are welcome.