Parallelism while Querying and Loading Data using Oracle SQL Connector for HDFS
By Mannamal-Oracle on Aug 15, 2013
For fast performance Oracle SQL Connector for HDFS (OSCH) can use parallel threads in the database to read data files on HDFS. This post discusses the settings in OSCH that are relevant for using parallel threads, and some heuristics to set them.
First, a quick background on ‘location files.’ OSCH generates an external table that is used to query or load data from HDFS. This external table has a LOCATION clause with a list of location files (see example of a generated external table at the end of this post). OSCH populates the location files with URIs of the data files on HDFS. When there are multiple location files and multiple data files, OSCH uses load balancing to distribute the files (by size) among the location files. Data files are not split up, but the URIs are distributed by data file size as evenly as possible. See the documentation on location files in the Oracle® Big Data Connectors User's Guide Release 2 (2.2) for more details.
Number of Location Files
Degree of Parallelism (DOP) is the number of parallel threads used by the database to read the data. When reading data from external tables, the DOP specified in the database has to be supported by the relevant number of location files in the external table. If there is only one location file, there will be no parallelism while reading the data. The default value for the number of location files is 4. Higher numbers to support a higher DOP can be specified using the locationFileCount property. Keep in mind the resources available in the database while determining the number of parallel threads to use.
Specifying locationFileCount while Using OSCH
As a -D parameter: -D oracle.hadoop.exttab.locationFileCount=256
In a configuration file:
A value of 256 will result in an external table with 256 location files, which in turn means that 256 parallel threads can be used to access the data.
Number of Data Files
If the number of data files accessed is less than the number of location files, the number of parallel threads will be constrained by the number of populated location files. If possible, split up the data files to support higher numbers of parallel threads. If there are more data files than location files OSCH will load balance to evenly distribute data files by size across location files. A good heuristic: the number of data files should be a multiple of the number of location files.
Reading from a Hive table
If OSCH is accessing data from a Hive table with a single underlying data file, a bucketed Hive table can be created to split the input files. If there are more buckets than location files OSCH will load balance. A good heuristic: the number of buckets should be a multiple of the number of location files.
Example on the Oracle Big Data Appliance
Data file size: 1 TB.
If the data is in a Hive table, a bucketed Hive table can be created with 2048 buckets (each file will have 2 blocks on the BDA which has a block size of 256MB). A locationFileCount value of 256 will result in each thread reading about 8 files.
Note on enabling parallel query in the database
Parallel query has to be enabled on the database for data files to be accessed in parallel from HDFS. From the Oracle® Big Data Connectors User's Guide Release 2 (2.2):
Enable parallel query with this SQL command:
Before loading the data into an Oracle database from the external files created by Oracle SQL Connector for HDFS, enable parallel DDL:
ALTER SESSION ENABLE PARALLEL DDL;
Before inserting data into an existing database table, enable parallel DML with this SQL command:
ALTER SESSION ENABLE PARALLEL DML;
Example external table generated by OSCH (LOCATION clause and location files in red)
CREATE TABLE "HDFSUSER"."TEXT_ORA_EXT_TAB"
( "CUST_ID" VARCHAR2(4000),
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY "ORACLETEST_DIR"
( RECORDS DELIMITED BY 0X'0A'
STRING SIZES ARE IN CHARACTERS
FIELDS TERMINATED BY 0X'09'
MISSING FIELD VALUES ARE NULL
REJECT LIMIT UNLIMITED