X

Access External Data Sources from Oracle Database

  • August 15, 2013

Parallelism while Querying and Loading Data using Oracle SQL Connector for HDFS

For fast performance Oracle SQL Connector for HDFS (OSCH) can
use parallel threads in the database to read data files on HDFS.   This post discusses the settings in OSCH
that are relevant for using parallel threads, and some heuristics to set them.

First, a quick background on
‘location files.’ OSCH generates an
external table that is used to query or load data from HDFS. This external table has a LOCATION clause
with a list of location files (see example of a generated external table at the
end of this post). OSCH populates the
location files with URIs of the data files on HDFS. When there are multiple location files and
multiple data files, OSCH uses load balancing to distribute the files (by size)
among the location files. Data files are
not split up, but the URIs are distributed by data file size as evenly as
possible. See the documentation
on location files
in the Oracle® Big Data Connectors User's Guide Release 2 (2.2) for
more details.

Number of Location Files

Degree of Parallelism (DOP) is the number of parallel
threads used by the database to read the data. When reading data from external tables, the DOP specified in the
database has to be supported by the relevant number of location files in the
external table. If there is only one location
file, there will be no parallelism while reading the data. The default value for the number of location
files is 4. Higher numbers to support a higher DOP can
be specified using the locationFileCount
property. Keep in mind the resources available in the
database while determining the number of parallel threads to use.

Specifying locationFileCount while Using OSCH

As a -D parameter: -D
oracle.hadoop.exttab.locationFileCount=256

In a configuration file:

<property>
<name>oracle.hadoop.exttab.locationFileCount</name>
<value>256</value>

</property>

A value of 256 will result in an external table with 256
location files, which in turn means that 256 parallel threads can be used to
access the data.

Number of Data Files

If the number of data files accessed is less than the number of
location files, the number of parallel threads will be constrained by the
number of populated location files. If
possible, split up the data files to support higher numbers of parallel
threads. If there are more data files than
location files OSCH will load balance to evenly distribute data files by size
across location files. A good
heuristic: the number of data files should be a multiple of the number of
location files.

Reading from a Hive table

If OSCH is accessing data from a Hive table with a single
underlying data file, a bucketed Hive table can be created to split the input
files. If there are more buckets than
location files OSCH will load balance. A
good heuristic: the number of buckets
should be a multiple of the number of location files.

Example on the Oracle Big Data Appliance

Data file size: 1 TB.

If the data is in a Hive table, a bucketed Hive table can be
created with 2048 buckets (each file will have 2 blocks on the BDA which has a
block size of 256MB). A locationFileCount value of 256 will
result in each thread reading about 8 files.

Note on enabling parallel query in the database

Parallel query has to be enabled on the database for data
files to be accessed in parallel from HDFS. From the Oracle® Big Data Connectors User's Guide Release 2 (2.2):

Enable parallel query with this SQL
command:

ALTER SESSION ENABLE PARALLEL QUERY;

Before loading the data into an Oracle
database from the external files created by Oracle SQL Connector for HDFS,
enable parallel DDL:

ALTER
SESSION ENABLE PARALLEL DDL;

Before inserting data into an existing
database table, enable parallel DML with this SQL command:

ALTER
SESSION ENABLE PARALLEL DML;

Hints such as APPEND and PQ_DISTRIBUTE also
improve performance while inserting data.

Example external table generated by OSCH (LOCATION
clause and location files in red)

CREATE TABLE
"HDFSUSER"."TEXT_ORA_EXT_TAB"

( "CUST_ID" VARCHAR2(4000),

"MOVIE_ID" VARCHAR2(4000),

"GENRE_ID" VARCHAR2(4000),

"TIME_ID" VARCHAR2(4000),

"RECOMMENDED" VARCHAR2(4000),

"ACTIVITY_ID" VARCHAR2(4000),

"RATING" VARCHAR2(4000),

"SALES" VARCHAR2(4000)

)

ORGANIZATION EXTERNAL

( TYPE ORACLE_LOADER

DEFAULT DIRECTORY
"ORACLETEST_DIR"

ACCESS PARAMETERS

( RECORDS DELIMITED BY 0X'0A'

CHARACTERSET AL32UTF8

STRING SIZES ARE IN CHARACTERS

PREPROCESSOR
"OSCH_BIN_PATH":'hdfs_stream'

FIELDS TERMINATED BY 0X'09'

MISSING FIELD VALUES ARE NULL

(

"CUST_ID" CHAR(4000),

"MOVIE_ID" CHAR(4000),

"GENRE_ID" CHAR(4000),

"TIME_ID" CHAR(4000),

"RECOMMENDED" CHAR(4000),

"ACTIVITY_ID" CHAR(4000),

"RATING" CHAR(4000),

"SALES" CHAR(4000)

)

)

LOCATION

( 'osch-20130725105856-1332-1’,

  'osch-20130725105856-1332-2',

  'osch-20130725105856-1332-3',

  'osch-20130725105856-1332-4'

 ........

     ........ 

)

)

REJECT LIMIT UNLIMITED

PARALLEL

Join the discussion

Comments ( 1 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.