Thursday May 22, 2014

Loading Apache Web Logs with Oracle Loader For Hadoop

We will discuss how to load Apache access logs in the Combined Log Format using Oracle Loader for Hadoop (OLH). Let's start with a brief introduction to Apache web logs as described in http://httpd.apache.org/docs/2.4/logs.html.

Apache Web Logs

The Apache HTTP Server provides a variety of mechanisms for logging everything that happens on the server. It generates server access logs and server error logs. The server access log contains records of all requests processed by the server. The server error log contains error and diagnostic information for errors encountered while processing requests.

A typical configuration for the Apache Combined Log Format might look like this:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""

Example:

157.166.226.21 - - [25/Jan/2014:18:41:40 -0400] "GET / HTTP/1.1" 200 - "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.124 Safari/534.30"

Each part of the log format string is described below:

%h is the IP address of the client (remote host) that made the request to the server.

%l is the identity of the client; it is almost never used.

A "hyphen" in the output indicates that the requested piece of information is not available.

%u is the userid as determined by HTTP authentication.

%t is the time that the request was received.

The format is: [day/month/year:hour:minute:second zone].

   [25/Jan/2014:18:41:40 -0400] 

\"%r\" is the request line from the client, given in double quotes.

        "GET / HTTP/1.1" 

%>s is the status code that the server sends back to the client (e.g. 200, 404).

%b is the size of the object returned to the client.

\"%{Referer}i\" is the site that the client reports having been referred from.

\"%{User-agent}i\" is the identifying information that the client browser reports about itself.

 "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.124 Safari/534.30" 


Input Format:

We will use the regular expression input format class oracle.hadoop.loader.lib.input.RegexInputFormat to read Apache log files. This input format uses a regular expression to match a line of text.  In this example we assume the lines in the log file are separated by newline characters.

The regular expression is specified by the oracle.hadoop.loader.input.regexPattern property. 

The job configuration properties are shown below:

<!-- input format -->
<property>
  <name>mapreduce.inputformat.class</name>
  <value>oracle.hadoop.loader.lib.input.RegexInputFormat</value>
 </property>

<!-- regular expression -->
<property>
   <name>oracle.hadoop.loader.input.regexPattern</name>
   <value>([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ((?:[^ \"]*)|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?</value>
</property>
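To see how this pattern carves a Combined Log Format line into fields, here is a small standalone Java sketch (the class name is made up and it is not part of OLH; the pattern and sample line are copied from above, and the \" escapes in the XML value are not needed inside a Java string because a plain quote matches itself in a regular expression):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApacheLogRegexDemo {
  public static void main(String[] args) {
    // Same pattern as the oracle.hadoop.loader.input.regexPattern value above
    String regex = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ((?:[^ \"]*)|\"[^\"]*\") "
                 + "(-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?";
    String line = "157.166.226.21 - - [25/Jan/2014:18:41:40 -0400] \"GET / HTTP/1.1\" 200 - "
                + "\"-\" \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) "
                + "Chrome/12.0.742.124 Safari/534.30\"";

    Matcher m = Pattern.compile(regex).matcher(line);
    if (m.matches()) {
      // Capture groups 1..9 correspond to the nine fields of the combined log format,
      // in the same order as the columns of the APACHE_COMBINED_LOG table created below
      for (int i = 1; i <= m.groupCount(); i++) {
        System.out.println("group " + i + ": " + m.group(i));
      }
    }
  }
}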


See the Complex Text Input Formats section of the Oracle Loader for Hadoop documentation for more information about this input format.

Output Format:

We will use the output format class oracle.hadoop.loader.lib.output.OCIOutputFormat to load the Apache web logs into a partitioned database table. Note that OCIOutputFormat requires the database table to be partitioned.
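By analogy with the input format property shown above, the output format class is selected in the job configuration with a property along these lines (an illustrative fragment only, written in the same older property style used in this post; the complete set of output and connection properties is in the downloadable job configuration file referenced below):

<!-- output format -->
<property>
  <name>mapreduce.outputformat.class</name>
  <value>oracle.hadoop.loader.lib.output.OCIOutputFormat</value>
</property>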

Sample Log Files 

You can download sample data from https://blogs.oracle.com/bigdataconnectors/resource/apache_access_log.1 to /tmp/apache_access_log.1.

After downloading, copy the log file to HDFS: 

$ hadoop fs -mkdir apache_log_dir

$ hadoop fs -put /tmp/apache_access_log.1 apache_log_dir

Steps:

  1. Create the database table APACHE_COMBINED_LOG.
  2. Next create the job configuration file apache_log_jobconf.xml.
  3. Then run an OLH job to load the Apache logs from HDFS into the database table.
  4. Lastly query the database table APACHE_COMBINED_LOG to see the data.
  5. Also see the job reports generated by OLH.

Details

Please note: All the code listed in this document is for reference purposes and will not work “AS-IS” in your environment. You must customize the environment variables such as HADOOP_HOME, OLH_HOME and other job configuration properties to get this example to work in your environment. Sample data and configuration files referenced in this blog can be used as a starting point to run this example.

Step 1: Create the database table APACHE_COMBINED_LOG

>sqlplus SCOTT

SQL*Plus: Release 12.1.0.1.0 Production on Tue May 13 16:22:53 2014

Copyright (c) 1982, 2013, Oracle. All rights reserved.

Enter password: enter password for SCOTT

SQL> create table APACHE_COMBINED_LOG
(
  CLIENT_IP_ADDRESS   VARCHAR2(15),
  CLIENT_ID           VARCHAR2(40),
  REMOTE_USER_ID      VARCHAR2(40),
  REQUEST_DATE        DATE,
  REQUEST_LINE        VARCHAR2(3000),
  RETURN_STATUS_CODE  NUMBER,
  RETURN_SIZE         VARCHAR2(20),
  REFERER             VARCHAR2(3000),
  USER_AGENT          VARCHAR2(3000)
) partition by hash(REQUEST_DATE) partitions 2;

Table created.

SQL>

Step 2: Create the job configuration file

Download the job configuration file https://blogs.oracle.com/bigdataconnectors/resource/apache_log_jobconf.xml into your current directory.  Edit the configuration file to set values for HOST, PORT, SERVICE_NAME, database user, database password, job input and output directories. Review all the properties listed in this file.
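For reference, the connection-related entries you will edit look roughly like the fragment below. HOST, PORT, SERVICE_NAME, and the user/password values are placeholders; verify the exact property names and values against the file you downloaded:

<!-- connection information (values are placeholders) -->
<property>
  <name>oracle.hadoop.loader.connection.url</name>
  <value>jdbc:oracle:thin:@//HOST:PORT/SERVICE_NAME</value>
</property>

<property>
  <name>oracle.hadoop.loader.connection.user</name>
  <value>SCOTT</value>
</property>

<property>
  <name>oracle.hadoop.loader.connection.password</name>
  <value>database password for SCOTT</value>
</property>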

Step 3: Run OLH job

NOTE: Change the values of HADOOP_HOME and OLH_HOME to match your environment.

#!/bin/bash

export HADOOP_HOME=/usr/lib/hadoop
export PATH=${HADOOP_HOME}/bin:${PATH}
export OLH_HOME=/myhome/olh/oraloader-3.0.0-h2
export HADOOP_CLASSPATH=${OLH_HOME}/jlib/*

#remove job output directory
hadoop fs -rmr myoutputdir

#Run OLH

hadoop oracle.hadoop.loader.OraLoader -conf ./apache_log_jobconf.xml


Sample job output:

Note: You may see a lot of deprecated property warnings if you are using a newer release of Hadoop such as CDH 5.0.0. This is expected, and is because the examples use some older properties for compatibility across different Hadoop versions.

14/05/15 10:18:51 INFO loader.OraLoader: Oracle Loader for Hadoop Release 3.0.0 - Production
Copyright (c) 2011, 2014, Oracle and/or its affiliates. All rights reserved.
14/05/15 10:18:51 INFO loader.OraLoader: Built-Against: hadoop-2.3.0-cdh5.0.0 hive-0.12.0-cdh5.0.0 avro-1.7.3 jackson-1.8.8
...
14/05/15 10:18:54 INFO output.OCIOutputFormat: Adding OCI libraries to distributed cache
...
14/05/15 10:19:07 INFO loader.OraLoader: map 0% reduce 0%
14/05/15 10:19:41 INFO loader.OraLoader: map 100% reduce 0%
14/05/15 10:19:55 INFO loader.OraLoader: map 100% reduce 100%
14/05/15 10:19:56 INFO loader.OraLoader: Job complete: olh_regex_apache_combined_log (job_1399674267702_0150)
...
File Input Format Counters
    Bytes Read=1216
File Output Format Counters
    Bytes Written=2223
...

Step 4: Query the APACHE_COMBINED_LOG table

 Verify that the log data has been loaded into the table.

SQL> connect scott

SQL> select count(*) from apache_combined_log;

  COUNT(*)
----------
         8

SQL> select client_ip_address, request_date, return_status_code from apache_combined_log;

CLIENT_IP_ADDRE REQUEST_D RETURN_STATUS_CODE
--------------- --------- ------------------
157.166.226.25  26-JUL-11                401
23.7.70.140     26-JUL-11                200
157.166.226.25  25-JUL-11                404
157.166.226.25  25-JUL-11                200
157.166.226.25  26-JUL-11                401
23.7.70.140     26-JUL-11                401
157.166.226.25  25-JUL-11                304
157.166.226.25  25-JUL-11                404

8 rows selected.

Step 5: Job Report and debugging tips

Oracle Loader for Hadoop consolidates reporting information from individual tasks into a file named ${mapred.output.dir}/_olh/oraloader-report.txt. In this example ${mapred.output.dir} points to myoutputdir. Among other statistics, the report shows the number of errors, broken out by type and task, for each mapper and reducer.

In this example we set the oracle.hadoop.loader.logBadRecords configuration property to true, which directs OLH to log bad records into one or more “.bad” files in the myoutputdir/_olh/ directory. You can inspect the “.bad” files to find out which rows were rejected and why.
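To inspect the report, or any rejected-record files, you can read them directly from HDFS. For example (a sketch; the “.bad” files exist only when some records were actually rejected, and their names vary by task):

$ hadoop fs -cat myoutputdir/_olh/oraloader-report.txt

$ hadoop fs -cat 'myoutputdir/_olh/*.bad'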

A listing of the job output directory for this example looks like this:

$ hadoop fs -ls myoutputdir

Found 2 items
-rw-r--r--   3 yala hadoop      0 2014-05-15 10:19 myoutputdir/_SUCCESS
drwxr-xr-x   - yala hadoop      0 2014-05-15 10:19 myoutputdir/_olh

$ hadoop fs -ls myoutputdir/_olh

Found 5 items
-rw-r--r--   3 yala hadoop   1209 2014-05-15 10:19 myoutputdir/_olh/oraloader-00000-oci.log
-rw-r--r--   3 yala hadoop   1014 2014-05-15 10:19 myoutputdir/_olh/oraloader-00000-oci.xml
-rw-r--r--   3 yala hadoop    377 2014-05-15 10:19 myoutputdir/_olh/oraloader-00000-r.xml
-rw-r--r--   3 yala hadoop   6285 2014-05-15 10:19 myoutputdir/_olh/oraloader-report.txt
-rw-r--r--   1 yala hadoop  19607 2014-05-15 10:18 myoutputdir/_olh/tableMetadata.xml

Software used in this example:

Oracle Loader For Hadoop 3.0.0: oraloader-3.0.0-h2 (from http://www.oracle.com/technetwork/database/database-technologies/bdc/big-data-connectors/downloads/big-data-downloads-1451048.html)

Hadoop: hadoop 2.3.0-cdh5.0.0 from Cloudera

Oracle Database: Oracle Release 12.1.0.1.0 Production with partitioning option

Sample files:

apache_log_jobconf.xml:  https://blogs.oracle.com/bigdataconnectors/resource/apache_log_jobconf.xml

sample data:  https://blogs.oracle.com/bigdataconnectors/resource/apache_access_log.1


Tuesday Apr 30, 2013

How to Load Oracle Tables From Hadoop Tutorial (Part 1 - Overview)


Introduction

This is the first of a series of blog posts that will discuss how to load data living in the Hadoop Ecosphere into Oracle tables. The goal is to give insights, discuss pros and cons, and share best practices for achieving optimal load times and flexibility, from an experienced developer’s point of view.

Oracle and Hadoop are complementary technologies where the whole is greater than the sum of the parts. They both have parallel architectures, which, if used intelligently, can move data at an impressive rate. Last year, we achieved a load rate of 12TB (terabytes) per hour between Oracle Exadata and Hadoop running on Oracle’s Big Data Appliance (BDA). The ability to distill big data in Hadoop and then seamlessly move large result sets into the Oracle stack creates enormous added value in solving Big Data problems.

In supporting customers who need this functionality, we’ve noticed that more often than not we are talking to people who are either Hadoop experts or Oracle heavyweights, but not both. To explain these two technologies, we will include breakout sections with some rudimentary background notes about Hadoop and Oracle that we think are important to understand, so you can use these tools effectively. Additional specialized topics will also go into loading issues specific to RAC and Exadata environments.

Why Use Oracle Big Data Connectors?

Hadoop developers might be asking themselves the following question: Oracle has been around for a long time managing huge sets of data in tables. Those tables had to be loaded somehow, right? What’s the added value of the Big Data Connectors? Can’t we use the standard utilities Oracle already provides to load tables?

The quick answer is yes. But if you are dealing with Big Data, you really don’t want to.

Some Background about Conventional Loading Tools and Oracle

Oracle's off-the-shelf utility for loading data from external sources is called SQL*Loader. It does a great job loading files of various formats into an Oracle table.

The following SQL*Loader control file illustrates what this utility does:

LOAD DATA
INFILE file1.dat
INFILE file2.dat
INFILE file3.dat
APPEND
INTO TABLE emp
( empno  POSITION(1:4)   INTEGER EXTERNAL,
  ename  POSITION(6:15)  CHAR,
  deptno POSITION(17:18) CHAR,
  mgr    POSITION(20:23) INTEGER EXTERNAL
)

SQL*Loader is being told to open three files and append data to the existing table “emp”; the column mapping, physical position, and representation of the data in the files are articulated between the parentheses. SQL*Loader is really powerful for processing files of various formats.

But to use this tool with Hadoop you need to work around several problems, the first of which is that Hadoop content lives in Hadoop Distributed File System (HDFS) files, not standard OS file systems. SQL*Loader does not know how to access HDFS directly, so the “INFILE” verbiage is a non-starter.

You could work around this problem in two ways. One way is to copy the file from Hadoop onto a local disk on a system where SQL*Loader is installed. The problem with this solution is that Hadoop files are big, very often bigger than any storage you have on a single system. Remember that a single Hadoop file can potentially be huge (say 18TB, larger than the digital content of the Library of Congress). That’s a big storage requirement for a single system, especially for a transient requirement such as staging data. Also you can assume that whatever storage requirements you have today for Big Data, they will certainly grow fast.

Secondly, in order to get the data from HDFS into an Oracle table you are doubling the amount of IO resources consumed. (“Read from HDFS, write into an Oracle table” becomes “Read from HDFS, write to staged file, read from staged file, write into an Oracle table”). When operating against Big Data, doubling the IO overhead is worth avoiding.
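To make the doubled IO concrete, the staging workaround looks something like this (the paths and file names here are hypothetical):

# Pass 1: read from HDFS, write to a local staging disk
hadoop fs -get /user/scott/big_results/part-00000 /staging/part-00000.dat

# Pass 2: read the staged copy, write into the Oracle table with SQL*Loader
sqlldr scott CONTROL=emp.ctl DATA=/staging/part-00000.dat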

An alternative approach is to use FUSE technology (Mountable HDFS) that creates a mount point for HDFS. It is an elegant solution but it is substantially slower than Oracle Big Data Connectors (by a factor of 5) and consumes about three times the CPU.

And in both cases you would be forced to run SQL*Loader on the machine where Oracle lives, not because of some functional limitation of SQL*Loader (you can run it anywhere) but because of the practicalities of working with HDFS, which is inherently distributed. Running SQL*Loader on a non-Oracle system means moving huge blocks of distributed data living on any number of Hadoop DataNodes through the network to a single system, which is then tasked with passing the entire payload over the network again to Oracle. This model doesn’t scale.


Exploiting Parallelism in Oracle and Hadoop to Load Data

The best solution for loading data from Hadoop to Oracle is to use and align the mechanisms for doing parallel work in both environments.

Parallelism in Oracle Loader for Hadoop (OLH)

For OLH this means running MapReduce programs in Hadoop to break up a load operation into tasks running on all available MapReduce nodes in a Hadoop cluster. These MapReduce tasks run concurrently, naturally dividing the workload into discrete payloads that use Oracle MapReduce code to connect to Oracle Database remotely and load data into a target table. It’s a natural parallel model for Hadoop, since the loading logic is encapsulated and run as vanilla MapReduce jobs. And it’s a natural model for Oracle, since the Oracle database system is being tasked to serve multiple clients (i.e. MapReduce tasks) loading data at once, using the standard client-server architecture that’s been around for decades.

Parallelism in Oracle SQL Connector for Hadoop Distributed File System (OSCH)

OSCH is the alternative approach that marries two other parallel mechanisms: Oracle Parallel Query for Oracle External Tables and Hadoop HDFS Client. To explain how these mechanisms align, let’s first talk about External tables and Parallel Query.

External Tables

External tables are tables defined in Oracle that manage data not living in Oracle. For example, suppose you had an application that managed and frequently updated some structured text files on a system, but you needed to access that data to join it to some Oracle table. You would define an Oracle External table pointing to the same structured text files updated by the application, accompanied by verbiage that looks strikingly similar to the SQL*Loader verbiage discussed above. That’s not a coincidence: Oracle External tables use the ORACLE_LOADER access driver, which executes SQL*Loader code under the covers.
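For illustration only, an External table over the same kind of fixed-position files as the SQL*Loader example above might be declared roughly like this (the directory, table, and file names are made up):

-- A directory object pointing at the OS directory that holds the text files
CREATE DIRECTORY emp_dat_dir AS '/data/emp_files';

-- The access parameters mirror the SQL*Loader control file syntax
CREATE TABLE emp_ext (
  empno   NUMBER(4),
  ename   VARCHAR2(10),
  deptno  VARCHAR2(2),
  mgr     NUMBER(4)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY emp_dat_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS ( empno  POSITION(1:4)   INTEGER EXTERNAL,
             ename  POSITION(6:15)  CHAR,
             deptno POSITION(17:18) CHAR,
             mgr    POSITION(20:23) INTEGER EXTERNAL )
  )
  LOCATION ('file1.dat', 'file2.dat', 'file3.dat')
)
REJECT LIMIT UNLIMITED;

Once defined, emp_ext can be queried or joined like any other table, and the SQL*Loader-style parsing happens under the covers at query time.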

Parallel Query

Parallel Query (PQ) is a “divide and conquer” strategy that decomposes a SQL statement into partitioned tasks that can execute in parallel and merge the results. PQ exploits the fact that SQL tables are symmetric and can be logically subdivided into horizontal partitions (i.e. sets of rows). With PQ if you want to execute:

SELECT last_name FROM emp WHERE salary > 30000

Oracle can decompose this query into smaller units of work which perform the identical query in parallel against mutually exclusive sets of rows in the “emp” table. For PQ to give you this advantage it needs to be enabled and properly configured (a detail we will talk about in a future post). For now you simply need to understand that PQ works by breaking down SQL statements into worker bees (i.e. PQ slaves) that divide the load and execute in parallel. In particular, PQ can be enabled for External tables, which allows SQL to access data outside of Oracle in parallel. The amount of parallelism an External table has is dictated by its configured DOP (degree of parallelism). The DOP can be asserted in various ways: as an attribute of the table, within a SQL statement that uses the table, or at the session level after the user connects to Oracle.
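For example, the DOP could be asserted in any of these ways (the table name and the DOP of 8 are just placeholders):

-- As an attribute of the table
ALTER TABLE emp PARALLEL 8;

-- Within a SQL statement that uses the table, via a hint
SELECT /*+ PARALLEL(emp, 8) */ last_name FROM emp WHERE salary > 30000;

-- At the session level, after connecting to Oracle
ALTER SESSION FORCE PARALLEL QUERY PARALLEL 8;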

HDFS Client

Now let’s talk about Hadoop HDFS Client. This is a Java API living in Hadoop that acts as a client to HDFS file systems. It looks like your standard file system programmatic interface, with open, read, write, and close methods. But because it works against HDFS, which distributes the individual blocks of a file across a Hadoop cluster, there is a lot of parallelism going on in the back end. Blocks are served up by Hadoop DataNodes, daemons running on Hadoop nodes that serve the data blocks stored locally on each node. If you run a lot of HDFS Clients concurrently against different HDFS files, you are doing lots of concurrent IO and concurrent streaming of data from every Hadoop node that has a running DataNode. In other words, you are maximizing retrieval of data from the Hadoop cluster.
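A minimal HDFS Client sketch in Java, reading a file a line at a time (the path is hypothetical, and fs.defaultFS is taken from the Hadoop configuration on the classpath):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS and other settings from the Hadoop config on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // The blocks of this file may live on many DataNodes, but the client sees one ordinary stream
    Path path = new Path("apache_log_dir/apache_access_log.1");
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}

Each such client stream is fed by whichever DataNodes hold the file’s blocks, so many concurrent clients translate into many concurrent streams out of the cluster.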

Putting It All Together

OSCH works by using all these mechanisms together. It defines a specialized External table which can invoke HDFS Client software to access data in HDFS. And when PQ is enabled for this type of External table, a SQL select statement gets decomposed into N PQ slaves (where N is the DOP). In other words a SQL select statement can kick off N PQ slaves that are each accessing K Hadoop DataNodes in parallel. Access of HDFS blocks by PQ slaves maximizes disk IO, network bandwidth, and processing both in Oracle and in Hadoop.


With this model, you load an Oracle table (e.g. “MY_TABLE”) by executing a single SQL insert statement, one that gets its data from a subordinate select statement referencing the external table that retrieves data from HDFS (e.g. “MY_EXTERNAL_TABLE”).

INSERT INTO MY_TABLE SELECT * FROM MY_EXTERNAL_TABLE;

Actually I lied. It takes two statements.

COMMIT;

Just sayin.

Next Topic

In the following post we will look at OLH in depth, starting with JDBC. We will look at the execution model, and discuss the basics of configuring and tuning a MapReduce job used to load a table living in Oracle.

Author’s Background

My background in this space involves both product development and performance. I was pulled into this project about 20 months ago from doing work in Oracle middleware (specifically Oracle Coherence). I am currently working with a talented team that developed OLH and OSCH from scratch. My contribution was to design and prototype OSCH to the point that it scaled, and then to spin up on Oracle BDA/Exadata/Hadoop internals to do performance benchmarks and testing. Because I’m almost the newest member of the team, the experience of spinning up in this space is still fresh in my mind, so I have a healthy respect for what it’s like to wrap one’s brain around both technologies. Many readers will have much deeper knowledge in either the Oracle space or about Hadoop, so questions or clarifications are welcome.
