Thursday May 22, 2014

Loading Apache Web Logs with Oracle Loader For Hadoop

We will discuss how to load Apache access logs in the Combined Log Format using Oracle Loader for Hadoop (OLH). Let's start with a brief introduction to Apache web logs as described in http://httpd.apache.org/docs/2.4/logs.html.

Apache Web Logs

The Apache HTTP Server provides a variety of mechanisms for logging everything that happens on the server. It generates server access logs and server error logs. The server access log contains records of all requests processed by the server; the server error log contains error and diagnostic information for errors encountered while processing requests.

A typical configuration for the Apache Combined Log Format might look like this:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined

Example:

157.166.226.21 - - [25/Jan/2014:18:41:40 -0400] "GET / HTTP/1.1" 200 - "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.124 Safari/534.30"

Each part of the log format string is described below:

%h is the IP address of the client (remote host) that made the request to the server.

%l is the identity of the client; it is almost never used. A hyphen in the output indicates that the requested piece of information is not available.

%u is the userid as determined by HTTP authentication.

%t is the time that the request was received. The format is [day/month/year:hour:minute:second zone], for example:

   [25/Jan/2014:18:41:40 -0400]

\"%r\" is the request line from the client, given in double quotes:

   "GET / HTTP/1.1"

%>s is the status code that the server sends back to the client (e.g. 200, 404).

%b is the size of the object returned to the client. A hyphen is logged when no content was returned.

\"%{Referer}i\" is the site that the client reports having been referred from.

\"%{User-agent}i\" is the identifying information that the client browser reports about itself, for example:

   "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.124 Safari/534.30"
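The %t timestamp deserves a closer look, because later in this example it is loaded into an Oracle DATE column. Below is a minimal, self-contained Java sketch of how the bracketed value parses; the SimpleDateFormat pattern is our illustration of the Apache layout, not necessarily what OLH applies internally (the date handling for the actual load is governed by the job configuration file in Step 2).

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class RequestDateCheck {
    public static void main(String[] args) throws Exception {
        // The %t field exactly as it appears in the example log line
        String t = "[25/Jan/2014:18:41:40 -0400]";
        // Brackets are literal characters in a SimpleDateFormat pattern;
        // dd/MMM/yyyy:HH:mm:ss Z matches day/month/year:hour:minute:second zone
        SimpleDateFormat fmt =
            new SimpleDateFormat("[dd/MMM/yyyy:HH:mm:ss Z]", Locale.ENGLISH);
        Date d = fmt.parse(t);
        System.out.println(d); // printed in your local time zone
    }
}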


Input Format:

We will use the regular expression input format class oracle.hadoop.loader.lib.input.RegexInputFormat to read Apache log files. This input format uses a regular expression to match a line of text.  In this example we assume the lines in the log file are separated by newline characters.

The regular expression is specified by the oracle.hadoop.loader.input.regexPattern property. 

The job configuration properties are shown below:

<!-- input format -->
<property>
  <name>mapreduce.inputformat.class</name>
  <value>oracle.hadoop.loader.lib.input.RegexInputFormat</value>
</property>

<!-- regular expression -->
<property>
  <name>oracle.hadoop.loader.input.regexPattern</name>
  <value>([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ((?:[^ \"]*)|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?</value>
</property>


See the Complex Text Input Formats section for more information about this input format.
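Before submitting a job, you can sanity-check this pattern against the example log line from the beginning of this post with a few lines of plain Java. The sketch below is only a convenience for experimenting outside of Hadoop; the regular expression is the same one as above, rewritten as a Java string literal (backslashes doubled, double quotes escaped):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPatternCheck {
    public static void main(String[] args) {
        // Same expression as oracle.hadoop.loader.input.regexPattern above
        String regex = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\])"
                + " ((?:[^ \\\"]*)|\\\"[^\\\"]*\\\") (-|[0-9]*) (-|[0-9]*)"
                + "(?: ([^ \\\"]*|\\\"[^\\\"]*\\\") ([^ \\\"]*|\\\"[^\\\"]*\\\"))?";
        String line = "157.166.226.21 - - [25/Jan/2014:18:41:40 -0400]"
                + " \"GET / HTTP/1.1\" 200 - \"-\""
                + " \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30"
                + " (KHTML, like Gecko) Chrome/12.0.742.124 Safari/534.30\"";
        Matcher m = Pattern.compile(regex).matcher(line);
        if (m.matches()) {
            // One capture group per field:
            // %h, %l, %u, %t, "%r", %>s, %b, referer, user-agent
            for (int i = 1; i <= m.groupCount(); i++) {
                System.out.println("group " + i + ": " + m.group(i));
            }
        } else {
            System.out.println("no match");
        }
    }
}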

Output Format:

We will use the output format class oracle.hadoop.loader.lib.output.OCIOutputFormat to load the Apache web logs into a partitioned database table. Note that OCIOutputFormat requires the database table to be partitioned.

Sample Log Files 

  You can download sample data from https://blogs.oracle.com/bigdataconnectors/resource/apache_access_log.1 and save it to /tmp/apache_access_log.1.

Then copy the log file to HDFS:

$ hadoop fs -mkdir apache_log_dir

$ hadoop fs -put /tmp/apache_access_log.1 apache_log_dir

Steps:

  1. Create the database table APACHE_COMBINED_LOG.
  2. Next create the job configuration file apache_log_jobconf.xml.
  3. Then run an OLH job to load the Apache logs from HDFS into the database table.
  4. Query the database table APACHE_COMBINED_LOG to see the data.
  5. Lastly, examine the job reports generated by OLH.

Details

Please note: All the code listed in this document is for reference purposes and will not work "AS-IS" in your environment. You must customize environment variables such as HADOOP_HOME and OLH_HOME, as well as the job configuration properties, to get this example to work in your environment. The sample data and configuration files referenced in this blog can be used as a starting point for running this example.

Step 1: Create the database table APACHE_COMBINED_LOG

$ sqlplus SCOTT

SQL*Plus: Release 12.1.0.1.0 Production on Tue May 13 16:22:53 2014

Copyright (c) 1982, 2013, Oracle. All rights reserved.

Enter password: enter password for SCOTT

SQL> create table APACHE_COMBINED_LOG
(
  CLIENT_IP_ADDRESS   VARCHAR2(15),
  CLIENT_ID           VARCHAR2(40),
  REMOTE_USER_ID      VARCHAR2(40),
  REQUEST_DATE        DATE,
  REQUEST_LINE        VARCHAR2(3000),
  RETURN_STATUS_CODE  NUMBER,
  RETURN_SIZE         VARCHAR2(20),
  REFERER             VARCHAR2(3000),
  USER_AGENT          VARCHAR2(3000)
) partition by hash(REQUEST_DATE) partitions 2;

Table created.

SQL>

Step 2: Create the job configuration file

Download the job configuration file https://blogs.oracle.com/bigdataconnectors/resource/apache_log_jobconf.xml into your current directory. Edit the configuration file to set values for HOST, PORT, SERVICE_NAME, database user, database password, and the job input and output directories. Review all the properties listed in this file.
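Before submitting the job, it can be useful to confirm that your edits parse and resolve as expected. Here is a minimal sketch using the standard Hadoop Configuration API; the property names are the ones shown earlier in this post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class JobConfCheck {
    public static void main(String[] args) {
        // Load only the edited file, without Hadoop's default resources
        Configuration conf = new Configuration(false);
        conf.addResource(new Path("apache_log_jobconf.xml"));
        // Spot-check properties edited in this step
        System.out.println("input format: "
            + conf.get("mapreduce.inputformat.class"));
        System.out.println("regex: "
            + conf.get("oracle.hadoop.loader.input.regexPattern"));
    }
}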

Step 3: Run OLH job

Note: Change the values of HADOOP_HOME and OLH_HOME to match your environment.

#!/bin/bash

export HADOOP_HOME=/usr/lib/hadoop
export PATH=${HADOOP_HOME}/bin:${PATH}
export OLH_HOME=/myhome/olh/oraloader-3.0.0-h2
export HADOOP_CLASSPATH=${OLH_HOME}/jlib/*

#remove job output directory
hadoop fs -rmr myoutputdir

#run OLH
hadoop oracle.hadoop.loader.OraLoader -conf ./apache_log_jobconf.xml


Sample job output:

Note: You may see many deprecated property warnings if you are using a newer release of Hadoop, such as CDH 5.0.0. This is expected: the examples use some older property names for compatibility across different Hadoop versions.

14/05/15 10:18:51 INFO loader.OraLoader: Oracle Loader for Hadoop Release 3.0.0 - Production
Copyright (c) 2011, 2014, Oracle and/or its affiliates. All rights reserved.
14/05/15 10:18:51 INFO loader.OraLoader: Built-Against: hadoop-2.3.0-cdh5.0.0 hive-0.12.0-cdh5.0.0 avro-1.7.3 jackson-1.8.8
...
14/05/15 10:18:54 INFO output.OCIOutputFormat: Adding OCI libraries to distributed cache
...
14/05/15 10:19:07 INFO loader.OraLoader: map 0% reduce 0%
14/05/15 10:19:41 INFO loader.OraLoader: map 100% reduce 0%
14/05/15 10:19:55 INFO loader.OraLoader: map 100% reduce 100%
14/05/15 10:19:56 INFO loader.OraLoader: Job complete: olh_regex_apache_combined_log (job_1399674267702_0150)
...
File Input Format Counters
    Bytes Read=1216
File Output Format Counters
    Bytes Written=2223
...

Step 4: Query the APACHE_COMBINED_LOG table

 Verify that the log data has been loaded into the table.

SQL> connect scott
SQL> select count(*) from apache_combined_log;

  COUNT(*)
----------
         8

SQL> select client_ip_address, request_date, return_status_code from apache_combined_log;

CLIENT_IP_ADDRE REQUEST_D RETURN_STATUS_CODE
--------------- --------- ------------------
157.166.226.25  26-JUL-11                401
23.7.70.140     26-JUL-11                200
157.166.226.25  25-JUL-11                404
157.166.226.25  25-JUL-11                200
157.166.226.25  26-JUL-11                401
23.7.70.140     26-JUL-11                401
157.166.226.25  25-JUL-11                304
157.166.226.25  25-JUL-11                404

8 rows selected.
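The same check can be done from Java with plain JDBC, which is handy if you want to verify loads from a test harness. This is a hedged sketch: the connect string host, port, and service name are placeholders you must replace, the password is read from the command line, and the Oracle JDBC driver jar must be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RowCountCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string: replace dbhost, 1521, and orcl
        String url = "jdbc:oracle:thin:@//dbhost:1521/orcl";
        try (Connection conn = DriverManager.getConnection(url, "SCOTT", args[0]);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "select count(*) from APACHE_COMBINED_LOG")) {
            rs.next();
            System.out.println("rows loaded: " + rs.getLong(1));
        }
    }
}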

Step 5: Job Report and debugging tips

Oracle Loader for Hadoop consolidates reporting information from individual tasks into a file named ${mapred.output.dir}/_olh/oraloader-report.txt. In this example ${mapred.output.dir} points to myoutputdir. Among other statistics, the report shows the number of errors, broken out by type and task, for each mapper and reducer.

In this example we set the oracle.hadoop.loader.logBadRecords configuration property to true, which directs OLH to log bad records into one or more ".bad" files in the myoutputdir/_olh/ directory. You can inspect the ".bad" files to find out which rows were rejected and why.

A listing of the job output directory for this example looks like this:

$ hadoop fs -ls myoutputdir
Found 2 items
-rw-r--r--   3 yala hadoop      0 2014-05-15 10:19 myoutputdir/_SUCCESS
drwxr-xr-x   - yala hadoop      0 2014-05-15 10:19 myoutputdir/_olh

$ hadoop fs -ls myoutputdir/_olh
Found 5 items
-rw-r--r--   3 yala hadoop   1209 2014-05-15 10:19 myoutputdir/_olh/oraloader-00000-oci.log
-rw-r--r--   3 yala hadoop   1014 2014-05-15 10:19 myoutputdir/_olh/oraloader-00000-oci.xml
-rw-r--r--   3 yala hadoop    377 2014-05-15 10:19 myoutputdir/_olh/oraloader-00000-r.xml
-rw-r--r--   3 yala hadoop   6285 2014-05-15 10:19 myoutputdir/_olh/oraloader-report.txt
-rw-r--r--   1 yala hadoop  19607 2014-05-15 10:18 myoutputdir/_olh/tableMetadata.xml
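The same listing can be produced programmatically with the standard Hadoop FileSystem API, which is convenient if you want to collect oraloader-report.txt and any ".bad" files from a script or test. A sketch, assuming your Hadoop client configuration is on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListJobReports {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml/hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // "myoutputdir" is the job output directory used in this example
        for (FileStatus status : fs.listStatus(new Path("myoutputdir/_olh"))) {
            System.out.println(status.getLen() + "\t" + status.getPath().getName());
        }
    }
}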

Software used in this example:

Oracle Loader For Hadoop 3.0.0: oraloader-3.0.0-h2 (from http://www.oracle.com/technetwork/database/database-technologies/bdc/big-data-connectors/downloads/big-data-downloads-1451048.html)

Hadoop: hadoop 2.3.0-cdh5.0.0 from Cloudera

Oracle Database: Oracle Release 12.1.0.1.0 Production with partitioning option

Sample files:

apache_log_jobconf.xml:  https://blogs.oracle.com/bigdataconnectors/resource/apache_log_jobconf.xml

sample data:  https://blogs.oracle.com/bigdataconnectors/resource/apache_access_log.1


Tuesday May 20, 2014

Hadoop Certification Matrix

The table below gives an overview of the Hadoop distributions that work with Oracle Big Data Connectors.


Wednesday Apr 02, 2014

Oracle Big Data Connectors at Collaborate14

Attending Collaborate14 and interested in integrating Hadoop with Oracle Database? Add the following session to your schedule:

519: Integrating Hadoop Data with your Data Warehouse
Date: 04/08/2014
Time: 5:30 PM
Location: Level 3, San Polo 3405

Abstract:

Learn how to use the high speed Oracle Big Data Connectors to integrate data on Hadoop with data in Oracle Database. Oracle Loader for Hadoop and Oracle SQL Connector for HDFS provide very fast and efficient loading from Hadoop into the database, as well as Oracle SQL access to data on Hadoop. These connectors are essential components of any big data solution, integrating data from JSON, Hive, HBase, Oracle NoSQL Database and other data sources with data in the database. See how easy it is to use data from unstructured data sources in the database after transformation on Hadoop.
