Friday Dec 04, 2015

A Hybrid Marriage: Connecting Exadata and Hadoop hosted on AWS

Guest post by Javier De La Torre Medina and Iwona Rajca

In this post we describe how to set up and use Oracle SQL Connector for HDFS (OSCH) to query data in Cloudera’s Distribution Including Apache Hadoop (CDH) hosted on AWS from Oracle Exadata. OSCH leverages Oracle’s external table technology to enable queries on Hadoop and Hive data from the database. AWS is a good starting point for building your own test Hadoop environment.

Many enterprises experimenting with Hadoop in the cloud will likely also have engineered systems such as Exadata on premises – and will look to combine the two to create a powerful blended architecture.

Part 1: Configure Exadata as a Hadoop client

The first step is to set up networking between Exadata and the Cloudera cluster. Make sure that Exadata can access Hadoop on AWS. There are a few things to be aware of: check the firewall setup and confirm that the ports for the Hive Metastore, the DataNodes, ZooKeeper, and the NameNode are open.
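A quick way to verify connectivity is to probe the relevant ports from an Exadata compute node. The host names below are placeholders and the port numbers are common CDH defaults (8020 NameNode RPC, 9083 Hive Metastore, 2181 ZooKeeper, 50010 DataNode); substitute the values used by your cluster:

$ nc -vz namenode.mycluster.aws.example.com 8020
$ nc -vz metastore.mycluster.aws.example.com 9083
$ nc -vz zookeeper.mycluster.aws.example.com 2181
$ nc -vz datanode1.mycluster.aws.example.com 50010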

To set up the networking you might need to make changes on the AWS cluster. The rest of the configuration is done solely on the Exadata side.

Oracle provides comprehensive documentation on how to set up OSCH for a remote Hadoop cluster – check the latest Connectors User’s Guide here. The main tasks you need to perform are:

  • Download & install OSCH on Exadata.
  • Download and install selected Cloudera packages (Hadoop and Hive) on Exadata.
  • Configure Hadoop client on Exadata.

The documentation contains detailed steps.

Part 1 in detail: Setting up OSCH for a remote Hadoop cluster

Download & install OSCH.

Download the OSCH connector from OTN and unpack it. The OSCH directory should be placed in a shared directory so that all nodes in Exadata can access it.
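For example, assuming the connector was downloaded to /tmp and /u01/app/osch is a directory visible to all database nodes (both paths and the archive name are placeholders – use your own):

$ unzip /tmp/oraosch-3.3.0.zip -d /u01/app/osch
$ ls /u01/app/osch/orahdfs-3.3.0/bin/hdfs_stream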

Download and install selected Cloudera Hadoop client.

To be able to run queries from Exadata on AWS-hosted CDH, you need to install and configure Hadoop and Hive client packages on Exadata (the client system). You can do an RPM install with Cloudera-provided packages, or download a Cloudera-provided tarball that matches the version of CDH running on AWS. For the tarball install, unpack the following packages:

  • hadoop-<version>.tar.gz
  • hive-<version>.tar.gz

When these are installed, set the HADOOP_CLASSPATH environment variable (see the sketch after this list) to point to:

  • the path to the JAR files for OSCH (your OSCH installation folder), e.g. path/orahdfs-3.3.0/jlib/*
  • Hive JAR files and conf directory (in a default RPM install these are: /usr/lib/hive/lib/*, /etc/hive/conf)
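A minimal sketch of the environment setup, assuming OSCH was unpacked under path/orahdfs-3.3.0 and the Hive client is in the default RPM locations (adjust both to your actual install paths):

$ export OSCH_HOME=path/orahdfs-3.3.0
$ export HADOOP_CLASSPATH=${OSCH_HOME}/jlib/*:/usr/lib/hive/lib/*:/etc/hive/conf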


Configure Hadoop client.

After installing the CDH packages, you must configure the Hadoop client for use with CDH on AWS. Cloudera Manager automatically generates the configuration files for the Hadoop client – they include your cluster’s network configuration details. To get them, log in to Cloudera Manager as admin, click the drop-down next to your cluster name, and select the “View Client Configuration URLs” option. From there download the YARN and Hive packages. Unpack the zip files and move their contents to the configuration directories of Hadoop and Hive – in a default RPM install these are /etc/hadoop/conf and /etc/hive/conf respectively.
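For example, assuming the downloaded archives are named yarn-clientconfig.zip and hive-clientconfig.zip and unpack into yarn-conf and hive-conf directories (names can vary between Cloudera Manager versions):

$ unzip yarn-clientconfig.zip
$ unzip hive-clientconfig.zip
$ cp yarn-conf/* /etc/hadoop/conf/
$ cp hive-conf/* /etc/hive/conf/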

Test the connection by accessing HDFS files and running Hive queries.
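A simple smoke test from the Exadata side might look like this (the Hive table name is just an illustration):

$ hadoop fs -ls /user/hive/warehouse
$ hive -e "show databases; select count(*) from default.my_hive_table;"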

Part 2: Configure Oracle SQL Connector for HDFS

Once you can run your Hive jobs, configuring OSCH is a matter of minutes.

Start by creating a folder in a shared location that will host external table location files when an external table is created – e.g. path/ext_table_dir. You can use the ASM Cluster File System in Exadata to have a shared folder.
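For example, on an ACFS mount point (the path and ownership below are placeholders for your environment):

$ mkdir -p /u02/acfs/ext_table_dir
$ chown oracle:oinstall /u02/acfs/ext_table_dir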

To access Hive, the Oracle database user needs the basic CREATE SESSION, CREATE TABLE and CREATE VIEW privileges. For OSCH to map Hive tables to Oracle, a SYSDBA should also run the following:

  • Create a database directory pointing to the orahdfs-<version>/bin directory where the hdfs_stream script resides, e.g.: create or replace directory osch_bin_path as 'path/orahdfs-3.3.0/bin'

  • grant read, execute on directory osch_bin_path to <username>

  • Create a database directory where the Hive location files will be published, e.g.: create or replace directory ext_table_dir as 'path/ext_table_dir'

  • grant read, write on directory ext_table_dir to <username>

Part 3: Create the external table

Creating an external table is simple – as long as you specify the correct Hive table and the destination table name, Oracle will use Hive’s metadata to describe the new table. To create a table, run the ExternalTable tool from the Linux command line, passing the properties inline or putting them in a separate configuration file that you point to. A sample table-creation command:

$ hadoop jar OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=<username>.oracle_table \   <- pick a name for the external table
-D oracle.hadoop.exttab.sourceType=hive \   <- specifies the source type
-D oracle.hadoop.exttab.hive.tableName=my_hive_table \   <- point to your Hive table
-D oracle.hadoop.exttab.hive.databaseName=default \   <- point to your Hive database
-D oracle.hadoop.exttab.defaultDirectory=ext_table_dir \   <- point to your external directory
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \   <- specify the connection to the Oracle database
-D oracle.hadoop.connection.user=<username> \   <- specify the database user (unless using Oracle wallet)
-createTable

This will create the table.

Worth noting: the Oracle external table is not a "live" view of the Hive table. After changes are made to the Hive table, you must use the ExternalTable tool to drop the existing external table and create a new one.
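Once created, the external table can be queried from the database like any other table, and refreshed with the ExternalTable tool when the Hive data changes. A short sketch, reusing the placeholder names from the command above (the -drop option is available in the ExternalTable tool in recent OSCH releases – check your version’s documentation):

SQL> select count(*) from <username>.oracle_table;

$ hadoop jar OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=<username>.oracle_table \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//myhost:1521/myservicename \
-D oracle.hadoop.connection.user=<username> \
-drop

After the drop, re-run the -createTable command shown above to pick up the new Hive metadata.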

Thursday May 22, 2014

Loading Apache Web Logs with Oracle Loader For Hadoop

We will discuss how to load Apache access logs in the Combined Log Format using Oracle Loader for Hadoop (OLH). Let's start with a brief introduction to Apache web logs as described in http://httpd.apache.org/docs/2.4/logs.html.

Apache Web Logs

The Apache HTTP Server provides a variety of mechanisms for logging everything that happens on the server. It generates server access logs and server error logs. The server access log contains records of all requests processed by the server. The server error log contains error and diagnostic information for errors encountered while processing requests.

A typical configuration for the Apache Combined Log Format might look like this:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined

Example:

157.166.226.21 - - [25/Jan/2014:18:41:40 -0400] "GET / HTTP/1.1" 200 - "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.124 Safari/534.30"

Each part of the log format string is described below:

  • %h is the IP address of the client (remote host) that made the request to the server.
  • %l is the identity of the client; it is almost never used. A "hyphen" in the output indicates that the requested piece of information is not available.
  • %u is the userid as determined by HTTP authentication.
  • %t is the time that the request was received, in the format [day/month/year:hour:minute:second zone], e.g. [25/Jan/2014:18:41:40 -0400].
  • \"%r\" is the request line from the client, given in double quotes, e.g. "GET / HTTP/1.1".
  • %>s is the status code that the server sends back to the client (e.g. 200, 404).
  • %b is the size of the object returned to the client.
  • \"%{Referer}i\" is the site that the client reports having been referred from.
  • \"%{User-agent}i\" is the identifying information that the client browser reports about itself, e.g. "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.124 Safari/534.30".


Input Format:

We will use the regular expression input format class oracle.hadoop.loader.lib.input.RegexInputFormat to read Apache log files. This input format uses a regular expression to match a line of text.  In this example we assume the lines in the log file are separated by newline characters.

The regular expression is specified by the oracle.hadoop.loader.input.regexPattern property. 

The job configuration properties are shown below

<!-- input format -->
<property>
  <name>mapreduce.inputformat.class</name>
  <value>oracle.hadoop.loader.lib.input.RegexInputFormat</value>
 </property>

<!-- regular expression -->
<property>
   <name>oracle.hadoop.loader.input.regexPattern</name>
   <value>([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ((?:[^ \"]*)|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?</value>
</property>


See the Complex Text Input Formats section for more information about this input format.

Output Format:

We will use the output format class oracle.hadoop.loader.lib.output.OCIOutputFormat to load the Apache web logs into a partitioned database table. Note that OCIOutputFormat requires the database table to be partitioned.

Sample Log Files 

You can download sample data from https://blogs.oracle.com/bigdataconnectors/resource/apache_access_log.1 to /tmp/apache_access_log.1.
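For example, using curl (wget works equally well):

$ curl -o /tmp/apache_access_log.1 https://blogs.oracle.com/bigdataconnectors/resource/apache_access_log.1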

Then copy the log file to HDFS:

$ hadoop fs -mkdir apache_log_dir

$ hadoop fs -put /tmp/apache_access_log.1 apache_log_dir
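You can verify the upload with a quick listing (the exact output will vary by environment):

$ hadoop fs -ls apache_log_dir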

Steps:

  1. Create the database table APACHE_COMBINED_LOG.
  2. Next create the job configuration file apachelog_jobconf.xml.
  3. Then run an OLH job to load the Apache logs from HDFS into the database table.
  4. Lastly query the database table APACHE_COMBINED_LOG to see the data.
  5. Also see the job reports generated by OLH.

Details

Please note: All the code listed in this document is for reference purposes and will not work “AS-IS” in your environment. You must customize the environment variables such as HADOOP_HOME, OLH_HOME and other job configuration properties to get this example to work in your environment. Sample data and configuration files referenced in this blog can be used as a starting point to run this example.

Step 1: Create the database table APACHE_COMBINED_LOG

>sqlplus SCOTT

SQL*Plus: Release 12.1.0.1.0 Production on Tue May 13 16:22:53 2014
Copyright (c) 1982, 2013, Oracle. All rights reserved.

Enter password: enter password for SCOTT

SQL> create table APACHE_COMBINED_LOG
(
CLIENT_IP_ADDRESS VARCHAR2(15),
CLIENT_ID VARCHAR2(40),
REMOTE_USER_ID VARCHAR2(40),
REQUEST_DATE DATE,
REQUEST_LINE VARCHAR(3000),
RETURN_STATUS_CODE NUMBER,
RETURN_SIZE VARCHAR2(20),
REFERER VARCHAR2(3000),
USER_AGENT VARCHAR2(3000)
) partition by hash(REQUEST_DATE) partitions 2;

Table created.

SQL>

Step 2: Create the job configuration file

Download the job configuration file https://blogs.oracle.com/bigdataconnectors/resource/apache_log_jobconf.xml into your current directory.  Edit the configuration file to set values for HOST, PORT, SERVICE_NAME, database user, database password, job input and output directories. Review all the properties listed in this file.

Step 3: Run OLH job

NOTE 1: Change values for HADOOP_HOME and OLH_HOME

#!/bin/bash

export HADOOP_HOME=/usr/lib/hadoop
export PATH=${HADOOP_HOME}/bin:${PATH}
export OLH_HOME=/myhome/olh/oraloader-3.0.0-h2
export HADOOP_CLASSPATH=${OLH_HOME}/jlib/*

#remove job output directory
hadoop fs -rmr myoutputdir

#Run OLH

hadoop oracle.hadoop.loader.OraLoader -conf ./apache_log_jobconf.xml


Sample job output:

Note: You may see a lot of deprecated property warnings if you are using a newer release of Hadoop such as CDH 5.0.0. This is expected, because the examples use some older properties for compatibility across different Hadoop versions.

14/05/15 10:18:51 INFO loader.OraLoader: Oracle Loader for Hadoop Release 3.0.0 - Production
Copyright (c) 2011, 2014, Oracle and/or its affiliates. All rights reserved.
14/05/15 10:18:51 INFO loader.OraLoader: Built-Against: hadoop-2.3.0-cdh5.0.0 hive-0.12.0-cdh5.0.0 avro-1.7.3 jackson-1.8.8
...
14/05/15 10:18:54 INFO output.OCIOutputFormat: Adding OCI libraries to distributed cache
...
14/05/15 10:19:07 INFO loader.OraLoader: map 0% reduce 0%
14/05/15 10:19:41 INFO loader.OraLoader: map 100% reduce 0%
14/05/15 10:19:55 INFO loader.OraLoader: map 100% reduce 100%
14/05/15 10:19:56 INFO loader.OraLoader: Job complete: olh_regex_apache_combined_log (job_1399674267702_0150)
...
File Input Format Counters
        Bytes Read=1216
File Output Format Counters
        Bytes Written=2223
...

Step 4: Query the APACHE_COMBINED_LOG table

 Verify that the log data has been loaded into the table.

SQL> connect scott

SQL> select count(*) from apache_combined_log;

COUNT(*)
----------
8

SQL> select client_ip_address, request_date, return_status_code from apache_combined_log;

CLIENT_IP_ADDRE REQUEST_D RETURN_STATUS_CODE
--------------- --------- ------------------
157.166.226.25  26-JUL-11                401
23.7.70.140     26-JUL-11                200
157.166.226.25  25-JUL-11                404
157.166.226.25  25-JUL-11                200
157.166.226.25  26-JUL-11                401
23.7.70.140     26-JUL-11                401
157.166.226.25  25-JUL-11                304
157.166.226.25  25-JUL-11                404

8 rows selected.

Step 5: Job Report and debugging tips

Oracle Loader for Hadoop consolidates reporting information from individual tasks into a file named ${mapred.output.dir}/_olh/oraloader-report.txt. In this example ${mapred.output.dir} points to myoutputdir. Among other statistics, the report shows the number of errors, broken out by type and task, for each mapper and reducer.

In this example we set oracle.hadoop.loader.logBadRecords configuration property to true, which directs OLH to log bad records into one or more ".bad" files in the myoutputdir/_olh/ directory. You can inspect “.bad” files to find out which rows were rejected and why they were rejected.
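To inspect the report and any bad-record files directly, something like the following works (the paths follow the directory layout shown below):

$ hadoop fs -cat myoutputdir/_olh/oraloader-report.txt
$ hadoop fs -ls myoutputdir/_olh/*.bad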

A listing of the job output directory for this example looks like this:

$ hadoop fs -ls myoutputdir
Found 2 items
-rw-r--r--   3 yala hadoop       0 2014-05-15 10:19 myoutputdir/_SUCCESS
drwxr-xr-x   - yala hadoop       0 2014-05-15 10:19 myoutputdir/_olh

$ hadoop fs -ls myoutputdir/_olh
Found 5 items
-rw-r--r--   3 yala hadoop    1209 2014-05-15 10:19 myoutputdir/_olh/oraloader-00000-oci.log
-rw-r--r--   3 yala hadoop    1014 2014-05-15 10:19 myoutputdir/_olh/oraloader-00000-oci.xml
-rw-r--r--   3 yala hadoop     377 2014-05-15 10:19 myoutputdir/_olh/oraloader-00000-r.xml
-rw-r--r--   3 yala hadoop    6285 2014-05-15 10:19 myoutputdir/_olh/oraloader-report.txt
-rw-r--r--   1 yala hadoop   19607 2014-05-15 10:18 myoutputdir/_olh/tableMetadata.xml

Software used in this example:

Oracle Loader For Hadoop 3.0.0: oraloader-3.0.0-h2 (from http://www.oracle.com/technetwork/database/database-technologies/bdc/big-data-connectors/downloads/big-data-downloads-1451048.html)

Hadoop: hadoop 2.3.0-cdh5.0.0 from Cloudera

Oracle Database: Oracle Release 12.1.0.1.0 Production with partitioning option

Sample files:

apache_log_jobconf.xml:  https://blogs.oracle.com/bigdataconnectors/resource/apache_log_jobconf.xml

sample data:  https://blogs.oracle.com/bigdataconnectors/resource/apache_access_log.1


Tuesday May 20, 2014

Hadoop Certification Matrix

The table below gives an overview of the Hadoop distributions that work with Oracle Big Data Connectors.

