X

Access External Data Sources from Oracle Database

  • May 22, 2014

Loading Apache Web Logs with Oracle Loader For Hadoop

Manjari Yalavarthy
Principal Member

We will discuss how
to load Apache access logs in the Combined Log Format using Oracle
Loader for Hadoop (OLH). Let's start with a brief introduction to Apache web logs
as described in http://httpd.apache.org/docs/2.4/logs.html.

Apache Web Logs

Apache HTTP Servers provide a variety of different
mechanisms for logging everything that happens on the HTTP server.  They generate
server access logs and server error logs. The server access log contains records of all requests
processed by the server. The server error log contains error and diagnostic
information for errors encountered while processing requests.

A
typical configuration for the Apache Combined Log Format might look like this

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""

Example:

157.166.226.21157.166.226.21 - - [25/Jan/2014:18:41:40 -0400] "GET / HTTP/1.1" 200 - "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML,
like Gecko) Chrome/12.0.742.124 Safari/534.30"

Each part of the log format string is described below

 %h is the IP address of the client (remote host) which made the request to the server

%l is the identity of the client, almost never used.

The "hyphen" in the output indicates that the requested piece of information is not available.

 %u is the userid as determined by HTTP authentication.

 %t is the time that the request was received.

The format is: [day/month/year:hour:minute:second zone].

   [25/Jan/2014:18:41:40 -0400] 

\"%r\" is the request line from the client, given in double quotes

        "GET / HTTP/1.1" 

%>s is the status code that the server sends back to the client (e.g. 200, 404).

%b is the size of the object returned to the client.

"%{Referer}i\" is the site that the client reports having been referred from.  

 \"%{User-agent}i\" is the identifying information that the client browser reports about itself.

 "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML,
like Gecko) Chrome/12.0.742.124 Safari/534.30"


Input Format:

We will use the regular
expression input format class oracle.hadoop.loader.lib.input.RegexInputFormat
to read Apache log files. This input format uses a regular
expression to match a line of text.  In this example we assume the lines in the log file are separated by newline characters.

The regular expression is
specified by the oracle.hadoop.loader.input.regexPattern
property. 

The job configuration properties are shown below

<!-- input format -->
<property>
  <name>mapreduce.inputformat.class</name>
  <value>oracle.hadoop.loader.lib.input.RegexInputFormat</value>
 </property>

<!-- regular expression -->
<property>
   <name>oracle.hadoop.loader.input.regexPattern</name>
   <value>([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ((?:[^ \"]*)|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?</value>
</property>


See the Complex
Text Input Formats
section for more information about this input format.

Output Format:

We will use the output
format class oracle.hadoop.loader.lib.output.OCIOutputFormat
to load the Apache web logs into a partitioned database table. Note that OCIOutputFormat requires the database
table to be partitioned.

Sample Log Files 

  You can download sample data  from //cdn.app.compendium.com/uploads/user/e7c690e8-6ff9-102a-ac6d-e4aebca50425/f4a5b21d-66fa-4885-92bf-c4e81c06d916/File/1c81c152db7abac017f7a2d6cdee5a5b/apache_access_log.1 to/tmp/apache_access_log.1.

After copying you can copy the log file to HDFS: 

$ hadoop fs -mkdir apache_log_dir

$ hadoop fs -put /tmp/apache_access_log.1 apache_log_dir

Steps:

  1. Create the database table APACHE_COMBINED_LOG.
  2. Next create the job configuration file apachelog_jobconf.xml.
  3. Then run an OLH job to load the apache logs from HDFS to into the database table.
  4. Lastly query the the database table APACHE_COMBINED_LOG to see the data.
  5. Also see the job reports generated by OLH.

Details

Please note: All the code
listed in this document is for reference purposes and will not work “AS-IS” in
your environment. You must customize the environment variables such as
HADOOP_HOME, OLH_HOME and other job
configuration properties to get this example to work in your
environment. Sample data and configuration files referenced in this blog can be used as a starting point to run this example.

Step 1: Create the database table APACHE_COMBINED_LOG

>sqlplus SCOTT

SQL*Plus: Release 12.1.0.1.0 Production on Tue May 13 16:22:53 2014

Copyright (c) 1982, 2013, Oracle. All rights reserved.

Enter password: enter password for
SCOTT

SQL>
create table APACHE_COMBINED_LOG

(

CLIENT_IP_ADDRESS VARCHAR2(15),

CLIENT_ID VARCHAR2(40),

REMOTE_USER_ID VARCHAR2(40),

REQUEST_DATE DATE,

REQUEST_LINE VARCHAR(3000),

RETURN_STATUS_CODE NUMBER,

RETURN_SIZE VARCHAR2(20),

REFERER VARCHAR2(3000),

USER_AGENT VARCHAR2(3000)

)
partition by hash(REQUEST_DATE) partitions 2;

Table created.

SQL>

Step 2: Create the job configuration file

Download the job configuration file //cdn.app.compendium.com/uploads/user/e7c690e8-6ff9-102a-ac6d-e4aebca50425/f4a5b21d-66fa-4885-92bf-c4e81c06d916/File/8816bd297c381d35b700f39772e08cca/apache_log_jobconf.xml into your current directory.  Edit the configuration file to set values for HOST, PORT, SERVICE_NAME, database user, database password, job input and output directories. Review all the properties listed in this file.

Step 3: Run OLH job

NOTE 1: Change values for HADOOP_HOME and OLH_HOME

#!/bin/bash

export HADOOP_HOME=/usr/lib/hadoop

export PATH=${HADOOP_HOME}/bin:${PATH}
export OLH_HOME=/myhome/olh/oraloader-3.0.0-h2

export HADOOP_CLASSPATH=${OLH_HOME}/jlib/*

#remove job output directory

hadoop fs -rmr myoutputdir

#Run OLH

hadoop
oracle.hadoop.loader.OraLoader -conf ./apache_log_jobconf.xml


Sample job output:

Note: You may see a lot deprecated property warnings if you
are using the newer release of Hadoop such as CDH5.0.0. This is
expected, and is because the examples use some older properties
for compatibility across different Hadoop versions.  

14/05/15
10:18:51 INFO loader.OraLoader: Oracle Loader for Hadoop Release 3.0.0 -
Production

Copyright
(c) 2011, 2014, Oracle and/or its affiliates. All rights reserved.

14/05/15
10:18:51 INFO loader.OraLoader: Built-Against: hadoop-2.3.0-cdh5.0.0
hive-0.12.0-cdh5.0.0 avro-1.7.3 jackson-1.8.8

 ...

14/05/15
10:18:54 INFO output.OCIOutputFormat: Adding OCI libraries to distributed cache

...

14/05/15
10:19:07 INFO loader.OraLoader: map 0% reduce 0%

14/05/15
10:19:41 INFO loader.OraLoader: map 100% reduce 0%

14/05/15
10:19:55 INFO loader.OraLoader: map 100% reduce 100%

14/05/15
10:19:56 INFO loader.OraLoader: Job complete: olh_regex_apache_combined_log
(job_1399674267702_0150)

...

File Input Format Counters

Bytes Read=1216

File Output Format Counters

Bytes Written=2223

...

Step 4: Query the APACHE_COMBINED_LOG table

 Verify that the log data has been loaded into the table.

sql> connect scott

SQL>
select count(*) from apache_combined_log;

COUNT(*)

----------

8

SQL> select client_ip_address, request_date, return_status_code from
apache_combined_log;

CLIENT_IP_ADDRE REQUEST_D RETURN_STATUS_CODE

--------------- --------- ------------------

157.166.226.25 26-JUL-11 401

23.7.70.140 26-JUL-11 200

157.166.226.25 25-JUL-11 404

157.166.226.25 25-JUL-11 200

157.166.226.25 26-JUL-11 401

23.7.70.140 26-JUL-11 401

157.166.226.25 25-JUL-11 304

157.166.226.25 25-JUL-11 404


8 rows selected.

Step 5: Job Report and debugging tips

Oracle Loader for Hadoop consolidates reporting information
from individual tasks into a file named ${mapred.output.dir}/_olh/oraloader-report.txt. In this example ${mapred.output.dir}
points to myoutputdir. Among other statistics, the report shows the number of errors, broken out by
type and task, for each mapper and reducer.























In this example we set oracle.hadoop.loader.logBadRecords configuration property to true,
which directs OLH to log bad records into one or more ".bad" files in
the myoutputdir/_olh/ directory.
You can inspect “.bad” files to find out which rows were rejected and why they
were rejected.

A listing of the job output directory for
this example looks like this.

$ hadoop fs -ls myoutputdir

Found 2 items

-rw-r--r-- 3 yala hadoop 0 2014-05-15 100 2014-05-15 10:19
myoutputdir/_SUCCESS

drwxr-xr-x - yala hadoop 0 2014-05-15 100 2014-05-15 10:19 myoutputdir/_olh

$ hadoop fs -ls myoutputdir/_olh

Found 5 items

-rw-r--r-- 3 yala hadoop 1209 2014-05-15 10:19
myoutputdir/_olh/oraloader-00000-oci.log

-rw-r--r-- 3 yala hadoop 1014 2014-05-15 10:19
myoutputdir/_olh/oraloader-00000-oci.xml

-rw-r--r-- 3 yala hadoop 377 2014-05-15 10:19 myoutputdir/_olh/oraloader-00000-r.xml

-rw-r--r-- 3 yala hadoop 6285 2014-05-15 10:19
myoutputdir/_olh/oraloader-report.txt

-rw-r--r-- 1 yala hadoop 19607 2014-05-15 10:18
myoutputdir/_olh/tableMetadata.xml

Software used in
this example:

Oracle Loader For
Hadoop 3.0.0:
oraloader-3.0.0-h2 (from http://www.oracle.com/technetwork/database/database-technologies/bdc/big-data-connectors/downloads/big-data-downloads-1451048.html)

Hadoop: hadoop
2.3.0-cdh5.0.0 from Cloudera

Oracle Database: Oracle Release 12.1.0.1.0 Production with
partitioning option

Sample files:

apache_log_jobconf.xml:  //cdn.app.compendium.com/uploads/user/e7c690e8-6ff9-102a-ac6d-e4aebca50425/f4a5b21d-66fa-4885-92bf-c4e81c06d916/File/8816bd297c381d35b700f39772e08cca/apache_log_jobconf.xml

sample data:  //cdn.app.compendium.com/uploads/user/e7c690e8-6ff9-102a-ac6d-e4aebca50425/f4a5b21d-66fa-4885-92bf-c4e81c06d916/File/1c81c152db7abac017f7a2d6cdee5a5b/apache_access_log.1


You'll need Skype CreditFree via Skype

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.