Thursday Feb 06, 2014

Try the connectors in Oracle BigDataLite VM

You can now try all the examples technical expert Bob has been blogging about in a VM.  Download the VM from here, and the instructions (hands on lab) for using the connectors in the VM from here. The VM contains complete examples for using the connectors and other big data products. Check it out!

The VM contains a single-node Hadoop cluster, an Oracle Database, the Oracle Big Data Connectors and a lot more.  With everything in the same VM it can be used to test performance, but it can be used to play around with the technology and test functionality.

Wednesday Oct 16, 2013

Loading from Multiple Data Sources with Oracle Loader for Hadoop

Oracle Loader for Hadoop can be used to load data from multiple data sources (for example Hive, HBase), and data in multiple formats (for example Apache weblogs, JSON files).   There are two ways to do this:

(1) Use an input format implementation.  Oracle Loader for Hadoop includes several input format implementations.  In addition, a user can develop their own input format implementation for proprietary data sources and formats.

(2) Leverage the capabilities of Hive, and use Oracle Loader for Hadoop to load from Hive.

These approaches are discussed in our Oracle Open World 2013 presentation

Thursday Aug 15, 2013

Parallelism while Querying and Loading Data using Oracle SQL Connector for HDFS

For fast performance Oracle SQL Connector for HDFS (OSCH) can use parallel threads in the database to read data files on HDFS.   This post discusses the settings in OSCH that are relevant for using parallel threads, and some heuristics to set them.

First, a quick background on ‘location files.’ OSCH generates an external table that is used to query or load data from HDFS. This external table has a LOCATION clause with a list of location files (see example of a generated external table at the end of this post). OSCH populates the location files with URIs of the data files on HDFS. When there are multiple location files and multiple data files, OSCH uses load balancing to distribute the files (by size) among the location files. Data files are not split up, but the URIs are distributed by data file size as evenly as possible. See the documentation on location files in the Oracle® Big Data Connectors User's Guide Release 2 (2.2) for more details.

Number of Location Files

Degree of Parallelism (DOP) is the number of parallel threads used by the database to read the data. When reading data from external tables, the DOP specified in the database has to be supported by the relevant number of location files in the external table. If there is only one location file, there will be no parallelism while reading the data. The default value for the number of location files is 4. Higher numbers to support a higher DOP can be specified using the locationFileCount property. Keep in mind the resources available in the database while determining the number of parallel threads to use.

Specifying locationFileCount while Using OSCH

As a -D parameter: -D oracle.hadoop.exttab.locationFileCount=256

In a configuration file:

<property>
<name>oracle.hadoop.exttab.locationFileCount</name>
<value>256</value>
</property>

A value of 256 will result in an external table with 256 location files, which in turn means that 256 parallel threads can be used to access the data.

Number of Data Files

If the number of data files accessed is less than the number of location files, the number of parallel threads will be constrained by the number of populated location files. If possible, split up the data files to support higher numbers of parallel threads. If there are more data files than location files OSCH will load balance to evenly distribute data files by size across location files. A good heuristic: the number of data files should be a multiple of the number of location files.

Reading from a Hive table

If OSCH is accessing data from a Hive table with a single underlying data file, a bucketed Hive table can be created to split the input files. If there are more buckets than location files OSCH will load balance. A good heuristic: the number of buckets should be a multiple of the number of location files.

Example on the Oracle Big Data Appliance

Data file size: 1 TB.

If the data is in a Hive table, a bucketed Hive table can be created with 2048 buckets (each file will have 2 blocks on the BDA which has a block size of 256MB). A locationFileCount value of 256 will result in each thread reading about 8 files.

Note on enabling parallel query in the database

Parallel query has to be enabled on the database for data files to be accessed in parallel from HDFS. From the Oracle® Big Data Connectors User's Guide Release 2 (2.2):

Enable parallel query with this SQL command:

ALTER SESSION ENABLE PARALLEL QUERY;

Before loading the data into an Oracle database from the external files created by Oracle SQL Connector for HDFS, enable parallel DDL:

ALTER SESSION ENABLE PARALLEL DDL;

Before inserting data into an existing database table, enable parallel DML with this SQL command:

ALTER SESSION ENABLE PARALLEL DML;

Hints such as APPEND and PQ_DISTRIBUTE also improve performance while inserting data.

Example external table generated by OSCH (LOCATION clause and location files in red)

CREATE TABLE "HDFSUSER"."TEXT_ORA_EXT_TAB"

( "CUST_ID" VARCHAR2(4000),

"MOVIE_ID" VARCHAR2(4000),

"GENRE_ID" VARCHAR2(4000),

"TIME_ID" VARCHAR2(4000),

"RECOMMENDED" VARCHAR2(4000),

"ACTIVITY_ID" VARCHAR2(4000),

"RATING" VARCHAR2(4000),

"SALES" VARCHAR2(4000)

)

ORGANIZATION EXTERNAL

( TYPE ORACLE_LOADER

DEFAULT DIRECTORY "ORACLETEST_DIR"

ACCESS PARAMETERS

( RECORDS DELIMITED BY 0X'0A'

CHARACTERSET AL32UTF8

STRING SIZES ARE IN CHARACTERS

PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'

FIELDS TERMINATED BY 0X'09'

MISSING FIELD VALUES ARE NULL

(

"CUST_ID" CHAR(4000),

"MOVIE_ID" CHAR(4000),

"GENRE_ID" CHAR(4000),

"TIME_ID" CHAR(4000),

"RECOMMENDED" CHAR(4000),

"ACTIVITY_ID" CHAR(4000),

"RATING" CHAR(4000),

"SALES" CHAR(4000)

)

)

LOCATION

( 'osch-20130725105856-1332-1’,

  'osch-20130725105856-1332-2',

  'osch-20130725105856-1332-3',

  'osch-20130725105856-1332-4'

 ........

     ........ 

)

)

REJECT LIMIT UNLIMITED

PARALLEL

About

Oracle Loader for Hadoop and Oracle SQL Connector for HDFS

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today