By Mark Hornick on Jul 19, 2013
Oracle R Connector for Hadoop 2.2.0 is now available for download. The Oracle R Connector for Hadoop 2.x series has introduced numerous enhancements, which are highlighted in this article and summarized as follows:
ORCH 2.0.0
- Analytic functions
- Oracle Loader for Hadoop (OLH) support
- ORCHhive transparency layer

ORCH 2.1.0
- Analytic functions
- Configurable delimiters in text input data files
- Map-only and reduce-only jobs
- Keyless map/reduce output
- "Pristine" data mode for high performance data access
- HDFS cache of metadata
- Hadoop Abstraction Layer (HAL)

ORCH 2.2.0
- Full online documentation
- Support for integer and matrix data types in hdfs.attach, with detection of "pristine" data
- Out-of-the-box support for "pristine" mode for high I/O performance
- HDFS cache to improve interactive performance when navigating HDFS directories and file lists
- HDFS multi-file upload and download performance enhancements
- HAL support for Hortonworks Data Platform 1.2 and Apache Hadoop 1.0
In ORCH 2.0.0, we introduced four Hadoop-enabled analytic functions supporting linear regression, low rank matrix factorization, neural network, and
non-negative matrix factorization. These enable R users to immediately begin using advanced analytics functions on HDFS data using the MapReduce paradigm on a Hadoop cluster without having to design and implement such algorithms themselves.
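As a rough sketch of how such a function is invoked from R (the function names follow ORCH conventions; the HDFS path, formula, and column names are hypothetical, and the exact signature may differ by release — see the ORCH help):

```r
library(ORCH)

# Attach an existing HDFS file as an ORCH data object.
dfs.dat <- hdfs.attach("/user/oracle/ontime.csv")

# Fit a linear regression in parallel via MapReduce on the cluster.
# ARRDELAY, DISTANCE, and DEPDELAY are hypothetical column names.
fit <- orch.lm(ARRDELAY ~ DISTANCE + DEPDELAY, dfs.dat)
summary(fit)
```

The point is that the MapReduce decomposition of the algorithm is entirely hidden: the R user writes a familiar model formula, and ORCH distributes the computation across the cluster.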
While ORCH 1.x supported moving data between the database and HDFS using sqoop, ORCH 2.0.0 supports the use of Oracle Loader for Hadoop (OLH) to move very large data volumes from HDFS to Oracle Database in an efficient, high-performance manner.
ORCH 2.0.0 supported Cloudera Distribution for Hadoop (CDH) version 4.2.0 and introduced the ORCHhive transparency layer, which leverages the Oracle R Enterprise transparency layer for SQL, but instead maps to HiveQL, a SQL-like language for manipulating HDFS data via Hive tables.
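A minimal sketch of the ORCHhive workflow, assuming a Hive table named ONTIME already exists (connection details and the table are placeholders for your cluster):

```r
library(ORCH)

# Point the ORE transparency layer at Hive instead of Oracle Database.
ore.connect(type = "HIVE")
ore.sync()     # map Hive tables to R proxy objects
ore.attach()

# Standard R operations on the proxy object are translated to HiveQL
# and executed against the Hive table, not in R memory.
aggregate(ONTIME$DEPDELAY, by = list(ONTIME$DEST), FUN = mean)
```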
In ORCH 2.1.0, we added several more analytic functions, including correlation and covariance, clustering via K-Means, principal component analysis (PCA), and sampling by specifying the percent of records to return.
ORCH 2.1.0 also brought a variety of features, including: configurable delimiters (beyond comma delimited text files, using any ASCII delimiter), the ability to specify mapper-only and reduce-only jobs, and the output of NULL keys in mapper and reducer functions.
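The following sketch combines two of these 2.1.0 features — a mapper-only job (no reducer supplied) whose mapper emits a NULL key. The HDFS path is hypothetical, and the exact way to request a map-only job may differ by release; consult ?hadoop.run:

```r
library(ORCH)

dfs.dat <- hdfs.attach("/user/oracle/events.csv")

# Mapper-only filtering job: no reducer is given, and orch.keyval is
# called with a NULL key so the output is written without keys.
res <- hadoop.run(
  dfs.dat,
  mapper = function(key, val) {
    # keep only rows whose second column is positive
    orch.keyval(NULL, val[val[, 2] > 0, ])
  }
)
hdfs.get(res)
```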
To speed the loading of data into Hadoop jobs, ORCH introduced “pristine” mode where the user guarantees that the data meets certain requirements so that ORCH skips a time-consuming data validation step. “Pristine” data requires that numeric columns contain only numeric data, that missing values are either R’s NA or the null string, and that all rows have the same number of columns. This improves performance of hdfs.get on a 1GB file by a factor of 10.
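If a file already satisfies these guarantees, the user can assert that when moving it into HDFS so the validation pass is skipped. The `pristine` flag shown below is an assumption about the option name — check the documentation for hdfs.upload/hdfs.put in your release:

```r
library(ORCH)

# clean.csv is guaranteed by the user to be "pristine": numeric columns
# contain only numbers, missing values are NA or the empty string, and
# every row has the same number of columns. Asserting this lets ORCH
# skip its time-consuming validation step.
dfs.dat <- hdfs.upload("clean.csv", dfs.name = "clean_data", pristine = TRUE)
```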
ORCH 2.1.0 introduced caching of ORCH metadata, making functions such as hdfs.ls, hdfs.describe, and hdfs.mget between 5x and 70x faster.
The Hadoop Abstraction Layer, or HAL, enables ORCH to work on top of various Hadoop versions and variants, including Apache Hadoop, Hortonworks, and the Cloudera distributions CDH3 and CDH 4.x with MR1 and MR2.
In the latest release, ORCH 2.2.0, we’ve augmented orch.sample to allow specifying the
number of rows in addition to percentage of rows. CDH 4.3 is now supported, and ORCH functions provide
full online documentation via R's help function or the ? operator. The function hdfs.attach now supports integer and matrix data types and detects "pristine" data automatically. HDFS bulk directory upload and download performance was also improved. Through the caching and automatic synchronization of ORCH metadata and file lists, the responsiveness of HDFS metadata functions has improved 3x over ORCH 2.1.0, which also improves performance of the hadoop.run and hadoop.exec functions. These improvements bring a more interactive experience for the R user when working with HDFS.
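A short sketch of the new 2.2.0 conveniences; the argument names `percent` and `nrows` are assumptions — see ?orch.sample for the exact interface:

```r
library(ORCH)

?orch.lm   # full online documentation, new in ORCH 2.2.0

dfs.dat <- hdfs.attach("/user/oracle/ontime.csv")

# Sampling by percentage of rows (ORCH 2.1.0) or by an absolute
# row count (new in ORCH 2.2.0).
s1 <- orch.sample(dfs.dat, percent = 10)
s2 <- orch.sample(dfs.dat, nrows = 1000)
```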
Starting in ORCH 2.2.0, we introduced out-of-the-box tuning optimizations for high performance and expanded HDFS caching to include the caching of file lists, which further improves performance of HDFS-related functions.
The function hdfs.upload now supports uploading a multi-file directory in a single invocation, eliminating per-file overhead. When downloading an HDFS directory, hdfs.download issues a single HDFS command to fetch the files into one local temporary directory before combining the separate parts into a single file.
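For example (paths and argument names such as `dfs.name` and `filename` are illustrative; check ?hdfs.upload and ?hdfs.download for the exact signatures):

```r
library(ORCH)

# Upload a local directory of part files in a single call (ORCH 2.2.0).
dfs.parts <- hdfs.upload("local/parts_dir", dfs.name = "parts")

# Download a multi-part HDFS directory: the parts are fetched into one
# temporary directory and combined into a single local file.
hdfs.download(dfs.parts, filename = "all_parts.csv")
```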
The Hadoop Abstraction Layer (HAL) was extended to support Hortonworks Data Platform 1.2 and Apache Hadoop 1.0. In addition, ORCH now allows the user to override the Hadoop Abstraction Layer version for use with unofficially supported distributions of Hadoop using system environment variables. This enables testing and certification of ORCH by other Hadoop distribution vendors.
Certification of ORCH on non-officially supported platforms can be done using a separate test kit (available for download upon request: firstname.lastname@example.org) that includes an extensive set of tests for core ORCH functionality, run using ORCH's built-in testing framework. Running the tests pinpoints any failures and verifies that ORCH is compatible with the target platform.