Friday Jul 19, 2013

Oracle R Connector for Hadoop 2.2.0 released

Oracle R Connector for Hadoop 2.2.0 is now available for download. The Oracle R Connector for Hadoop 2.x series has introduced numerous enhancements, which are highlighted in this article and summarized as follows:

ORCH 2.0.0:

  • Analytic functions: orch.lm, orch.lmf, orch.neural, orch.nmf
  • Oracle Loader for Hadoop (OLH) support
  • CDH 4.2.0
  • ORCHhive transparency layer

ORCH 2.1.0:

  • Analytic functions: orch.cor, orch.cov, orch.kmeans, orch.princomp, orch.sample (by percent)
  • Configurable delimiters in text input data files
  • Map-only and reduce-only jobs
  • Keyless map/reduce output
  • "Pristine" data mode for high performance data access
  • HDFS cache of metadata
  • Hadoop Abstraction Layer (HAL)

ORCH 2.2.0:

  • Analytic functions: orch.sample (by number of rows)
  • CDH 4.3.0
  • Full online documentation
  • Support integer and matrix data types in hdfs.attach with detection of "pristine" data
  • Out-of-the-box support for "pristine" mode for high I/O performance
  • HDFS cache to improve interactive performance when navigating HDFS directories and file lists
  • HDFS multi-file upload and download performance enhancements
  • HAL for Hortonworks Data Platform 1.2 and Apache Hadoop 1.0

ORCH 2.0.0

In ORCH 2.0.0, we introduced four Hadoop-enabled analytic functions supporting linear regression, low-rank matrix factorization, neural networks, and non-negative matrix factorization. These enable R users to immediately begin using advanced analytics functions on HDFS data using the MapReduce paradigm on a Hadoop cluster without having to design and implement such algorithms themselves.

While ORCH 1.x supported moving data between the database and HDFS using sqoop, ORCH 2.0.0 supports the use of Oracle Loader for Hadoop (OLH) to move very large data volumes from HDFS to Oracle Database in an efficient, high-performance manner.

ORCH 2.0.0 supported Cloudera Distribution for Hadoop (CDH) version 4.2.0 and introduced the ORCHhive transparency layer, which leverages the Oracle R Enterprise transparency layer for SQL, but instead maps to HiveQL, a SQL-like language for manipulating HDFS data via Hive tables.

ORCH 2.1.0

In ORCH 2.1.0, we added several more analytic functions, including correlation and covariance, clustering via K-Means, principal component analysis (PCA), and sampling by specifying the percent of records to return.

ORCH 2.1.0 also brought a variety of features, including configurable delimiters (any ASCII delimiter in text files, not just commas), the ability to specify map-only and reduce-only jobs, and the output of NULL keys from mapper and reducer functions.

To speed the loading of data into Hadoop jobs, ORCH 2.1.0 introduced “pristine” mode, in which the user guarantees that the data meets certain requirements so that ORCH can skip a time-consuming data validation step. “Pristine” data requires that numeric columns contain only numeric data, that missing values are either R’s NA or the null string, and that all rows have the same number of columns. Skipping validation improves the performance of hdfs.get on a 1 GB file by a factor of 10.
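The three "pristine" requirements can be illustrated with a small standalone checker. This is only a sketch in Python, not ORCH code: the function names `is_number` and `is_pristine`, the `numeric_columns` parameter, and the use of Python's own number parsing are all assumptions made for illustration, and ORCH's actual validation step may differ.

```python
import csv

# Missing values allowed in numeric columns: R's NA or the empty (null) string.
MISSING = ("", "NA")


def is_number(text):
    """Return True if Python can parse the text as a number (a stand-in rule)."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def is_pristine(lines, numeric_columns, delimiter=","):
    """Check the three stated requirements on an iterable of text lines.

    numeric_columns is a hypothetical parameter: zero-based indexes of the
    columns that are expected to hold numeric data.
    """
    expected_width = None
    for row in csv.reader(lines, delimiter=delimiter):
        if expected_width is None:
            # The first row fixes the column count for the whole file.
            expected_width = len(row)
            if any(i >= expected_width for i in numeric_columns):
                return False
        elif len(row) != expected_width:
            return False  # rows must all have the same number of columns
        for i in numeric_columns:
            if row[i] not in MISSING and not is_number(row[i]):
                return False  # non-numeric data in a numeric column
    return True
```

A check like this has to read every row, which is the kind of cost that a user guarantee lets ORCH avoid.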

ORCH 2.1.0 introduced caching of ORCH metadata, which makes ORCH functions such as hdfs.describe and hdfs.mget between 5x and 70x faster.

The Hadoop Abstraction Layer, or HAL, enables ORCH to work on top of various Hadoop versions or variants, including Apache/Hortonworks and the Cloudera Hadoop distributions CDH3 and CDH 4.x with MR1 and MR2.
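To show why caching metadata helps interactive use, here is a minimal, generic sketch of a time-bounded metadata cache in Python. It does not reflect ORCH internals, which the release notes do not describe: the class name `MetadataCache`, the `fetch` callback, and the `max_age_seconds` expiry rule are all hypothetical.

```python
import time


class MetadataCache:
    """Cache the result of an expensive metadata lookup, keyed by path."""

    def __init__(self, fetch, max_age_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch            # hypothetical slow lookup, e.g. a cluster call
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the cache can be tested
        self._entries = {}             # path -> (time stored, metadata)

    def describe(self, path):
        """Return cached metadata if it is fresh, otherwise fetch and store it."""
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]
        metadata = self._fetch(path)
        self._entries[path] = (now, metadata)
        return metadata

    def invalidate(self, path):
        """Drop a cached entry, for example after the file has been modified."""
        self._entries.pop(path, None)
```

Repeated lookups of the same path then cost one dictionary access instead of one round trip, which is where speedups of this kind typically come from.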

ORCH 2.2.0

In the latest release, ORCH 2.2.0, we’ve augmented orch.sample to allow specifying the number of rows in addition to the percentage of rows. CDH 4.3 is now supported, and ORCH functions provide full online documentation via R's help function or ?. The function hdfs.attach now supports integer and matrix data types and can detect pristine data automatically. The performance of bulk HDFS directory uploads and downloads was also improved. Through the caching and automatic synchronization of ORCH metadata and file lists, the responsiveness of metadata-related HDFS functions has improved by 3x over ORCH 2.1.0, which also improves the performance of functions such as hadoop.exec. These improvements in turn bring a more interactive experience for the R user when working with HDFS.
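The difference between sampling by percentage and sampling by number of rows can be sketched with two textbook techniques. This Python example is not how orch.sample is implemented (the source does not say); it assumes independent per-row selection for the percentage case and reservoir sampling for the fixed-count case, and the function names and `seed` parameter are hypothetical.

```python
import random


def sample_by_percent(rows, percent, seed=None):
    """Keep each row independently with probability percent/100.

    The size of the result varies from run to run around the target.
    """
    rng = random.Random(seed)
    keep_probability = percent / 100.0
    return [row for row in rows if rng.random() < keep_probability]


def sample_by_count(rows, count, seed=None):
    """Return exactly `count` rows (or all rows if fewer exist).

    Uses reservoir sampling, so the input is read once and its total
    length does not need to be known in advance.
    """
    rng = random.Random(seed)
    reservoir = []
    for seen, row in enumerate(rows, start=1):
        if len(reservoir) < count:
            reservoir.append(row)
        else:
            # Replace a stored row with probability count/seen.
            slot = rng.randrange(seen)
            if slot < count:
                reservoir[slot] = row
    return reservoir
```

Sampling by count is convenient when a downstream step needs a predictable amount of data regardless of how large the input file is.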

In ORCH 2.2.0, we also introduced out-of-the-box tuning optimizations for high performance and expanded HDFS caching to include file lists, which further improves the performance of HDFS-related functions.

The function hdfs.upload now supports the option to upload multi-file directories in a single invocation, which optimizes the process. Downloading an HDFS directory is now optimized to issue a single HDFS command that downloads the files into one local temporary directory before the separate parts are combined into a single file.

The Hadoop Abstraction Layer (HAL) was extended to support Hortonworks Data Platform 1.2 and Apache Hadoop 1.0. In addition, ORCH now allows the user to override the Hadoop Abstraction Layer version through system environment variables, for use with distributions of Hadoop that are not officially supported. This enables testing and certification of ORCH by other Hadoop distribution vendors.

Certification of ORCH on platforms that are not officially supported can be done using a separate test kit (available for download upon request) that includes an extensive set of tests for core ORCH functionality and can be run using the ORCH built-in testing framework. Running the tests pinpoints any failures and verifies that ORCH is compatible with the target platform.
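The final step of the download path, combining the separate parts into a single file, can be sketched as follows. This Python example covers only the local merge and is an illustration rather than ORCH's code: the function name `combine_parts` and the `part-*` file pattern are assumptions, and the single HDFS command that fetches the directory is not shown.

```python
import shutil
from pathlib import Path


def combine_parts(parts_dir, output_file, pattern="part-*"):
    """Concatenate part files from a local directory into one file.

    Parts are appended in sorted name order. Returns the number of parts merged.
    """
    parts = sorted(Path(parts_dir).glob(pattern))
    with open(output_file, "wb") as combined:
        for part in parts:
            with open(part, "rb") as source:
                # Stream in chunks so large parts are never held fully in memory.
                shutil.copyfileobj(source, combined)
    return len(parts)
```

Fetching the whole directory once and merging locally avoids issuing a separate remote command for every part file.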

See the ORCH 2.2.0 Change List and Release Notes for additional details. ORCH 2.2.0 can be downloaded here.

