Friday Aug 23, 2013

Big Data Connectors and Oracle Database 12c

 

Oracle Loader for Hadoop and Oracle SQL Connector for HDFS work with Oracle Database 12c. For convenience they ship with some database jar files, and some of those jars are version specific. The jar files specific to Oracle Database 12c are not included and have to be copied over from the database installation.

The only required 12c database jars are the Oracle Wallet Manager jars.   So if you are using Oracle Wallet Manager, copy the following jars:

   ORACLE_HOME/jlib/oraclepki.jar
   ORACLE_HOME/jlib/osdt_cert.jar
   ORACLE_HOME/jlib/osdt_core.jar 


to OLH_HOME/jlib and OSCH_HOME/jlib.
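The copy can be scripted. The sketch below assumes the standard jlib layout shown above and takes the three home directories as arguments; the function name and example paths are just for illustration:

```shell
# Illustrative helper (the name is hypothetical): copies the Oracle Wallet
# jars from a database home into the OLH and OSCH jlib directories.
copy_wallet_jars() {
  oracle_home=$1; olh_home=$2; osch_home=$3
  for jar in oraclepki.jar osdt_cert.jar osdt_core.jar; do
    cp "$oracle_home/jlib/$jar" "$olh_home/jlib/"  || return 1
    cp "$oracle_home/jlib/$jar" "$osch_home/jlib/" || return 1
  done
}

# Example invocation (paths are placeholders):
# copy_wallet_jars /u01/app/oracle/product/12.1.0/dbhome_1 /opt/oraloader /opt/orahdfs
```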

The next release of Oracle Big Data Connectors will include these 12c jar files, so the copy will no longer be necessary.

All other functionality for these two connectors will work with Oracle Database 12c without change.  

Thursday Aug 15, 2013

Parallelism while Querying and Loading Data using Oracle SQL Connector for HDFS

For fast performance Oracle SQL Connector for HDFS (OSCH) can use parallel threads in the database to read data files on HDFS.   This post discusses the settings in OSCH that are relevant for using parallel threads, and some heuristics to set them.

First, a quick background on ‘location files.’ OSCH generates an external table that is used to query or load data from HDFS. This external table has a LOCATION clause with a list of location files (see example of a generated external table at the end of this post). OSCH populates the location files with URIs of the data files on HDFS. When there are multiple location files and multiple data files, OSCH uses load balancing to distribute the files (by size) among the location files. Data files are not split up, but the URIs are distributed by data file size as evenly as possible. See the documentation on location files in the Oracle® Big Data Connectors User's Guide Release 2 (2.2) for more details.

Number of Location Files

Degree of Parallelism (DOP) is the number of parallel threads used by the database to read the data. When reading data from external tables, the DOP specified in the database must be matched by a sufficient number of location files in the external table. If there is only one location file, there will be no parallelism while reading the data. The default number of location files is 4. A higher value, to support a higher DOP, can be specified with the locationFileCount property. Keep in mind the resources available in the database when determining the number of parallel threads to use.

Specifying locationFileCount while Using OSCH

As a -D parameter: -D oracle.hadoop.exttab.locationFileCount=256

In a configuration file:

<property>
  <name>oracle.hadoop.exttab.locationFileCount</name>
  <value>256</value>
</property>

A value of 256 will result in an external table with 256 location files, which in turn means that 256 parallel threads can be used to access the data.
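Either form is passed to the OSCH command-line tool when the external table is created. The sketch below is illustrative only: the HDFS path, connection URL, user, and table name are placeholders, and the command must be run on a system where Hadoop and OSCH are installed:

```shell
hadoop jar "$OSCH_HOME/jlib/orahdfs.jar" \
    oracle.hadoop.exttab.ExternalTable \
    -D oracle.hadoop.exttab.tableName=TEXT_ORA_EXT_TAB \
    -D oracle.hadoop.exttab.locationFileCount=256 \
    -D oracle.hadoop.exttab.dataPaths=hdfs:/user/oracle/moviedata \
    -D oracle.hadoop.exttab.defaultDirectory=ORACLETEST_DIR \
    -D oracle.hadoop.exttab.connection.url=jdbc:oracle:thin:@//dbhost:1521/orcl \
    -D oracle.hadoop.exttab.connection.user=HDFSUSER \
    -createTable
```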

Number of Data Files

If the number of data files accessed is less than the number of location files, only that many location files will be populated, and the number of parallel threads will be constrained accordingly. If possible, split up the data files to support higher numbers of parallel threads. If there are more data files than location files, OSCH will load balance to distribute the data files evenly by size across the location files. A good heuristic: make the number of data files a multiple of the number of location files.

Reading from a Hive table

If OSCH is accessing data from a Hive table with a single underlying data file, a bucketed Hive table can be created to split the input into multiple files. If there are more buckets than location files, OSCH will load balance. A good heuristic: make the number of buckets a multiple of the number of location files.
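A bucketed Hive table can be created and populated along these lines. The table and column names are hypothetical, and on Hive releases of this vintage hive.enforce.bucketing must be set so that the insert actually produces one file per bucket:

```sql
-- Hypothetical table and column names; bucket count chosen per the heuristics above.
SET hive.enforce.bucketing = true;

CREATE TABLE movie_fact_bucketed (
  cust_id  STRING,
  movie_id STRING,
  rating   STRING,
  sales    STRING
)
CLUSTERED BY (cust_id) INTO 2048 BUCKETS;

INSERT OVERWRITE TABLE movie_fact_bucketed
SELECT cust_id, movie_id, rating, sales
FROM movie_fact;
```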

Example on the Oracle Big Data Appliance

Data file size: 1 TB.

If the data is in a Hive table, a bucketed Hive table can be created with 2048 buckets (each bucket file will have two blocks on the BDA, which has a block size of 256 MB). A locationFileCount value of 256 will then result in each parallel thread reading about 8 files.
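The arithmetic behind these numbers can be checked with a quick calculation (shell arithmetic only, using the values from the example above):

```shell
# Sizing check for the example above (all values come from the text).
data_bytes=$((1024 * 1024 * 1024 * 1024))   # 1 TB of data
block_bytes=$((256 * 1024 * 1024))          # 256 MB BDA block size
buckets=2048                                # buckets in the Hive table
location_files=256                          # locationFileCount

blocks_per_file=$(( data_bytes / block_bytes / buckets ))  # blocks per bucket file
files_per_thread=$(( buckets / location_files ))           # files read per thread
echo "$blocks_per_file blocks per file, $files_per_thread files per thread"
# prints "2 blocks per file, 8 files per thread"
```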

Note on enabling parallel query in the database

Parallel query has to be enabled on the database for data files to be accessed in parallel from HDFS. From the Oracle® Big Data Connectors User's Guide Release 2 (2.2):

Enable parallel query with this SQL command:

ALTER SESSION ENABLE PARALLEL QUERY;

Before loading the data into an Oracle database from the external files created by Oracle SQL Connector for HDFS, enable parallel DDL:

ALTER SESSION ENABLE PARALLEL DDL;

Before inserting data into an existing database table, enable parallel DML with this SQL command:

ALTER SESSION ENABLE PARALLEL DML;

Hints such as APPEND and PQ_DISTRIBUTE can further improve performance when inserting data.
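Putting the session settings and hints together, a load from the external table might look like the following sketch. MOVIE_FACT is a hypothetical target table; TEXT_ORA_EXT_TAB is the external table from the example below:

```sql
ALTER SESSION ENABLE PARALLEL DML;

-- APPEND requests a direct-path insert; PQ_DISTRIBUTE(MOVIE_FACT, NONE) asks
-- each parallel process to insert into the target without redistributing rows.
INSERT /*+ APPEND PQ_DISTRIBUTE(MOVIE_FACT, NONE) */ INTO MOVIE_FACT
SELECT * FROM "HDFSUSER"."TEXT_ORA_EXT_TAB";

COMMIT;
```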

Example external table generated by OSCH (note the LOCATION clause and the location files)

CREATE TABLE "HDFSUSER"."TEXT_ORA_EXT_TAB"
( "CUST_ID"     VARCHAR2(4000),
  "MOVIE_ID"    VARCHAR2(4000),
  "GENRE_ID"    VARCHAR2(4000),
  "TIME_ID"     VARCHAR2(4000),
  "RECOMMENDED" VARCHAR2(4000),
  "ACTIVITY_ID" VARCHAR2(4000),
  "RATING"      VARCHAR2(4000),
  "SALES"       VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY "ORACLETEST_DIR"
  ACCESS PARAMETERS
  ( RECORDS DELIMITED BY 0X'0A'
    CHARACTERSET AL32UTF8
    STRING SIZES ARE IN CHARACTERS
    PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
    FIELDS TERMINATED BY 0X'09'
    MISSING FIELD VALUES ARE NULL
    ( "CUST_ID"     CHAR(4000),
      "MOVIE_ID"    CHAR(4000),
      "GENRE_ID"    CHAR(4000),
      "TIME_ID"     CHAR(4000),
      "RECOMMENDED" CHAR(4000),
      "ACTIVITY_ID" CHAR(4000),
      "RATING"      CHAR(4000),
      "SALES"       CHAR(4000)
    )
  )
  LOCATION
  ( 'osch-20130725105856-1332-1',
    'osch-20130725105856-1332-2',
    'osch-20130725105856-1332-3',
    'osch-20130725105856-1332-4'
    ........
  )
)
REJECT LIMIT UNLIMITED
PARALLEL
