Tuesday Feb 18, 2014

Low-Rank Matrix Factorization in Oracle R Advanced Analytics for Hadoop

This guest post from Arun Kumar, a graduate student in the Department of Computer Sciences at the University of Wisconsin-Madison, describes work done during his internship in the Oracle Advanced Analytics group.

Oracle R Advanced Analytics for Hadoop (ORAAH), a component of Oracle’s Big Data Connectors software suite, is a collection of statistical and predictive techniques implemented on Hadoop infrastructure. In this post, we introduce and explain techniques for a popular machine learning task that has diverse applications ranging from predicting ratings in recommendation systems to feature extraction in text mining, namely matrix completion and factorization. Training, scoring, and prediction phases for matrix completion and factorization are available in ORAAH. The models generated can also be transparently loaded into R for ad-hoc inspection. In this blog post, we describe implementation specifics of these two techniques available in ORAAH.


Consider an e-commerce company that displays products to potential customers on its webpage and collects data about views, purchases, ratings (e.g., 1 to 5 stars), etc. Increasingly, such online retailers are using machine learning techniques to predict in advance which products a customer is likely to rate highly and recommend such products to the customers in the hope that they might purchase them. Users build a statistical model based on the past history of ratings by all customers on all products. One popular model to generate predictions from such a hyper-sparse matrix is the latent factor model, also known as the low-rank matrix factorization model (LMF).

The setup is the following – we are given a large dataset of past ratings (potentially in the billions), say, with the schema (Customer ID, Product ID, Rating). Here, Customer ID refers to a distinct customer, Product ID refers to a distinct product, and Rating refers to a rating value, e.g., 1 to 5. Conceptually, this dataset represents a large matrix D with m rows (number of customers) and n columns (number of products), where the entries are the available ratings. Notice that this matrix is likely to be extremely sparse, i.e., many ratings could be missing since most customers typically rate only a few products. Thus, the task here is matrix completion – we need to predict the missing ratings so that they can be used for downstream processing such as displaying the top recommendations for each customer.
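As a small illustration of this setup, the sketch below builds the matrix D from a handful of (Customer ID, Product ID, Rating) triples in plain R. The rating values and the column names are invented for the example, and a dataset of realistic size would not be materialized as a dense matrix like this.

```r
# Toy ratings in the (Customer ID, Product ID, Rating) schema; values are invented.
ratings <- data.frame(
  customer = c(1, 1, 2, 3, 4),
  product  = c(1, 3, 2, 3, 1),
  rating   = c(5, 3, 4, 1, 2)
)

m <- max(ratings$customer)  # number of customers (rows of D)
n <- max(ratings$product)   # number of products (columns of D)

# Missing ratings are stored as NA, not as zero.
D <- matrix(NA_real_, nrow = m, ncol = n)
D[cbind(ratings$customer, ratings$product)] <- ratings$rating

# Fraction of entries that are missing
sparsity <- mean(is.na(D))
```

Even in this toy case more than half of the entries are missing, which is the situation matrix completion is meant to handle.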

The LMF model assumes that the ratings matrix can be approximately generated as a product of two factor matrices, L and R, which are much smaller than D (lower rank). The idea is that the product L x R will approximately reconstruct the existing ratings and also automatically predict the missing ratings in D. More precisely, for each available rating (i,j,v) in D, we have (L x R) [i,j] ≈ v, while for each missing rating (i',j') in D, the predicted rating is (L x R) [i',j']. The model has a parameter r, which dictates the rank of the factor matrices, i.e., L is m x r, while R is r x n.
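To make the shapes concrete, here is a minimal sketch in base R with randomly generated factor matrices. The sizes and the helper name predict_rating are assumptions chosen for illustration; nothing here is fitted to data.

```r
set.seed(1)
m <- 1000; n <- 500; r <- 10

L <- matrix(runif(m * r), nrow = m, ncol = r)  # m x r customer factors
R <- matrix(runif(r * n), nrow = r, ncol = n)  # r x n product factors

# Full reconstruction: every entry of L x R is a predicted rating
D_hat <- L %*% R

# A single prediction only needs row i of L and column j of R
predict_rating <- function(L, R, i, j) sum(L[i, ] * R[, j])
p <- predict_rating(L, R, 2, 3)

# Storage: the factors hold (m + n) * r numbers instead of m * n
sizes <- c(factors = (m + n) * r, full = m * n)
```

With these sizes the two factors hold 15,000 numbers, against 500,000 for the full matrix, which is why the factors can be kept even when D itself is far too large to fill in.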

Matrix Completion in ORAAH

LMF can be invoked out-of-the-box using the routine orch.lmf. An execution based on the above example is shown below. The dataset of ratings is in a CSV file on HDFS with the schema above (named “retail_ratings” here).

input <- hdfs.attach("retail_ratings")
fit <- orch.lmf(input)

# Export the model into R memory
lr <- orch.export.fit(fit)

# Compute the prediction for the point (100, 50)

# The first column of each factor matrix holds the id; the rest are the factors
rank <- ncol(lr$L) - 1

# First column of lr$L contains the userid
userid <- lr$L[,1] == 100 # find row corresponding to user id 100
L <- lr$L[, 2:(rank+1)]

# First column of lr$R contains the itemid
itemid <- lr$R[,1] == 50 # find row corresponding to item id 50
R <- lr$R[, 2:(rank+1)]

# dot product as sum of terms obtained through component-wise multiplication
pred <- sum(L[userid,] * R[itemid,])

The factor matrices can be transparently loaded into R for further inspection and for ad-hoc predictions of specific customer ratings using R. The algorithm we use for training the LMF model is called Incremental Gradient Descent (IGD), which has been shown to be one of the fastest algorithms for this task [1, 2].
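For readers unfamiliar with IGD, the following is a minimal single-machine sketch of the idea in base R: visit the observed ratings one at a time in random order and nudge the matching row of L and column of R against the gradient of the squared error. The function names, the step size, the regularization weight lambda, and the synthetic data are all assumptions made for illustration; this is not the ORAAH implementation, which runs partitioned on Hadoop [2].

```r
# Incremental (stochastic) gradient descent for low-rank factorization.
# obs: data.frame with columns i, j, v (row index, column index, rating).
igd_lmf <- function(obs, m, n, r, epochs = 200, step = 0.05, lambda = 0.01) {
  L <- matrix(rnorm(m * r, sd = 0.1), m, r)
  R <- matrix(rnorm(r * n, sd = 0.1), r, n)
  for (e in seq_len(epochs)) {
    for (k in sample(nrow(obs))) {           # visit ratings in random order
      i <- obs$i[k]; j <- obs$j[k]
      err <- sum(L[i, ] * R[, j]) - obs$v[k]  # prediction error on one rating
      Li <- L[i, ]                            # keep the old row for R's update
      L[i, ] <- Li     - step * (err * R[, j] + lambda * Li)
      R[, j] <- R[, j] - step * (err * Li     + lambda * R[, j])
    }
  }
  list(L = L, R = R)
}

# Root mean square error of a fit on a set of ratings
rmse <- function(fit, obs) {
  pred <- rowSums(fit$L[obs$i, , drop = FALSE] * t(fit$R[, obs$j, drop = FALSE]))
  sqrt(mean((pred - obs$v)^2))
}

# Synthetic rank-2 data with half of the entries observed
set.seed(42)
m <- 30; n <- 20; r <- 2
truth <- matrix(runif(m * r), m, r) %*% matrix(runif(r * n), r, n)
idx <- sample(m * n, 300)
obs <- data.frame(i = (idx - 1) %% m + 1,    # row of each sampled entry
                  j = (idx - 1) %/% m + 1,   # column of each sampled entry
                  v = truth[idx])

fit <- igd_lmf(obs, m, n, r)
train_rmse <- rmse(fit, obs)
```

Each update touches only one row of L and one column of R, which is what makes the method cheap per rating and amenable to partitioning.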

The entire set of arguments for the function orch.lmf along with a brief description of each and their default values is given in the table below. The latin parameter configures the degree of parallelism for executing IGD for LMF on Hadoop [2]. ORAAH sets this automatically based on the dimensions of the problem and the memory available to each Mapper. Each Mapper fits its partition of the model in memory, and the multiple partitions run in parallel to learn different parts of the model. The last five parameters configure IGD and need to be tuned by the user to a given dataset since they can impact the quality of the model obtained.

ORAAH also provides routines for predicting ratings as well as for evaluating the model (computing the error of the model on a given labeled dataset) on a large scale over HDFS-resident datasets. The routine for prediction of ratings is predict, and for evaluating is orch.evaluate. Use help(orch.lmf) for online documentation, and demo(orch_lmf_jellyfish) for a fully working example including model fit, evaluation, and prediction.

Other Matrix Factorization Tasks

While LMF is primarily used for matrix completion tasks, it can also be used for other matrix factorization tasks that arise in text mining, computer vision, and bio-informatics, e.g., dimension reduction and feature extraction. In these applications, the input data matrix need not necessarily be sparse. Although many zeros might be present, they are not treated as missing values. The goal here is simply to obtain a low-rank factorization D ≈ L x R as accurately as possible, i.e., the product L x R should recover all entries in D, including the zeros. Typically, such applications use a Non-Negative Matrix Factorization (NMF) approach due to non-negativity constraints on the factor matrix entries. However, many of these applications often do not need non-negativity in the factor matrices. Using NMF algorithms for such applications leads to poorer-quality solutions. Our implementation of matrix factorization for such NMF-style tasks can be invoked out-of-the-box in ORAAH using the routine orch.nmf, which has the same set of arguments as LMF.
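The contrast with matrix completion can be seen on a small dense matrix. The sketch below uses a truncated SVD from base R as a stand-in technique for obtaining an unconstrained low-rank factorization; it is not what orch.nmf does internally, and the helper low_rank is a hypothetical name. Zeros in D are ordinary values that the product L x R has to reproduce.

```r
set.seed(7)
# Dense count matrix in which zeros are real values, not missing entries
D <- matrix(rpois(8 * 6, lambda = 1), nrow = 8, ncol = 6)

# Rank-r factorization D ~ L x R from the leading singular vectors
low_rank <- function(D, r) {
  s <- svd(D, nu = r, nv = r)
  L <- s$u %*% diag(s$d[1:r], nrow = r)  # nrow(D) x r
  R <- t(s$v)                            # r x ncol(D)
  list(L = L, R = R)
}

# Reconstruction error over all entries, zeros included
err <- function(f) sqrt(mean((D - f$L %*% f$R)^2))

f2 <- low_rank(D, 2)
f4 <- low_rank(D, 4)
errors <- c(rank2 = err(f2), rank4 = err(f4))
```

The factors produced this way are free to contain negative entries, and raising the rank can only lower the reconstruction error.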

Experimental Results & Comparison with Apache Mahout

We now present an empirical evaluation of the performance, quality, and scalability of the ORAAH LMF tool based on IGD and compare it to the most widely used off-the-shelf tool for LMF on Hadoop – an implementation of the ALS algorithm from Apache Mahout [3].

All our experiments are run on an Oracle Big Data Appliance Hadoop cluster with nine nodes, each with Intel Xeon X5675 12-core 3.07GHz processors, 48 GB RAM, and 20 TB disk. We use 256MB HDFS blocks and 10 reducers for MapReduce jobs.

We use two standard public datasets for recommendation tasks – MovieLens10M (referred to as MLens) and Netflix – for the performance and quality comparisons. To study scalability aspects, we use several synthetic datasets of different sizes by changing the number of rows, number of columns, and/or number of ratings. The table below presents the data set statistics.

Results: Performance and Quality

We first present an end-to-end overview of the performance and quality achieved by our implementation and Mahout on MLens and Netflix. The rank parameter was set at 50 (a typical choice for such tasks) and the other parameters for both tools were chosen using a grid search. The quality of the factor matrices was determined using the standard measure of root mean square error (RMSE) [2]. We use a 70%-15%-15% Wold holdout of the datasets, i.e., 70% for training, 15% for testing, and 15% for validation of generalization error. The training was performed until 0.1% convergence, i.e., until the fractional decrease in the training RMSE after every iteration reached 0.1%. The table below presents the results.
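The evaluation protocol can be sketched in a few lines of base R. The ratings table below is random filler, and the helper names rmse and converged are assumptions for illustration; only the 70%-15%-15% proportions and the 0.1% threshold come from the description above.

```r
set.seed(123)
# Hypothetical ratings table; only its row count matters for the split
ratings <- data.frame(i = sample(100, 1000, TRUE),
                      j = sample(50, 1000, TRUE),
                      v = sample(1:5, 1000, TRUE))

# 70% / 15% / 15% random split of the observed ratings
grp <- sample(rep(c("train", "test", "valid"),
                  times = round(nrow(ratings) * c(0.70, 0.15, 0.15))))
parts <- split(ratings, grp)

# Root mean square error between predicted and actual ratings
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# Stop once the fractional decrease in training RMSE falls to 0.1%
converged <- function(prev_rmse, curr_rmse, tol = 0.001) {
  (prev_rmse - curr_rmse) / prev_rmse <= tol
}

converged(1.0000, 0.9500)  # FALSE: a 5% improvement, keep iterating
converged(0.9500, 0.9495)  # TRUE: about a 0.05% improvement
```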

1. ORAAH LMF is faster than Mahout LMF on the overall training runtime on both datasets – 1.8x faster on MLens and 2.3x faster on Netflix.
2. The per-iteration runtime of ORAAH LMF is much lower than that of Mahout LMF – between 4.4x and 5.4x.
3. Although ORAAH LMF runs more iterations than Mahout LMF, the huge difference in the per-iteration runtimes makes the overall runtime smaller for ORAAH LMF.
4. The training quality (training RMSE) achieved is comparable across both tools on both datasets. Similarly, the generalization quality is also comparable. Thus, ORAAH LMF can offer state-of-the-art quality along with faster performance.

Results: Scalability

The ability to scale along all possible dimensions of the data is key to big data analytics. Both ORAAH LMF and Mahout LMF are able to scale to billions of ratings by parallelizing and distributing computations on Hadoop. But we now show that unlike Mahout LMF, ORAAH LMF is also able to scale to hundreds of millions of customers (m) and products (n), and also scales well with the rank parameter (r, which affects the size of the factor matrices). The figure below presents the scalability results along these three dimensions – m, n, and r.

1. Figures (A) and (B) plot the results for the Syn-row and Syn-col datasets, respectively (r = 2). ORAAH LMF scales linearly with both number of rows (m) and number of columns (n), while Mahout LMF does not show up on either plot because it crashes at all these values of m. In fact, we verified that Mahout LMF does not scale beyond even m = 20 M! The situation is similar with n. This is because Mahout LMF assumes that the factor matrices L and R fit entirely in the memory of each Mapper. In contrast, ORAAH LMF uses a clever partitioning scheme on all matrices ([2]) and can thus scale seamlessly on all dataset dimensions.
2. Figure (C) shows the impact of the rank parameter r. ORAAH LMF scales linearly with r and the per-iteration runtime roughly doubles between r = 20 and r = 100. However, the per-iteration runtime of Mahout LMF varies quadratically with r, and in fact, increases by a factor of 40x between r = 20 and r = 100! Thus, ORAAH LMF is also able to scale better with r.
3. Finally, on the tera-scale dataset Syn-tera with 1 billion rows, 10 million columns, and 20 billion ratings, ORAAH LMF (for r = 2) finishes an iteration in just under 2 hours!


The matrix factorization features in ORAAH were implemented and benchmarked by Arun Kumar during his summer internship at Oracle under the guidance of Vaishnavi Sashikanth. He is pursuing his PhD in computer science at the University of Wisconsin-Madison. This work is the result of a collaboration between Oracle and the research group of Dr. Christopher Ré, who is now at Stanford University. Anand Srinivasan helped integrate these features into ORAAH.


[1] Towards a Unified Architecture for in-RDBMS Analytics. Xixuan Feng, Arun Kumar, Benjamin Recht, and Christopher Ré. ACM SIGMOD 2012.

[2] Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion. Benjamin Recht and Christopher Ré. Mathematical Programming Computation 2013.

[3] Apache Mahout. http://mahout.apache.org/.

Friday Jul 19, 2013

Oracle R Connector for Hadoop 2.2.0 released

Oracle R Connector for Hadoop 2.2.0 is now available for download. The Oracle R Connector for Hadoop 2.x series has introduced numerous enhancements, which are highlighted in this article and summarized as follows:

ORCH 2.0.0:
  • Analytic functions: orch.lm, orch.lmf, orch.neural, orch.nmf
  • Oracle Loader for Hadoop (OLH) support
  • CDH 4.2.0
  • ORCHhive transparency layer

ORCH 2.1.0:
  • Analytic functions: orch.cor, orch.cov, orch.kmeans, orch.princomp, orch.sample (by percent)
  • Configurable delimiters in text input data files
  • Map-only and reduce-only jobs
  • Keyless map/reduce output
  • "Pristine" data mode for high performance data access
  • HDFS cache of metadata
  • Hadoop Abstraction Layer (HAL)

ORCH 2.2.0:
  • Analytic functions: orch.sample (by number of rows)
  • CDH 4.3.0
  • Full online documentation
  • Support integer and matrix data types in hdfs.attach with detection of "pristine" data
  • Out-of-the-box support for "pristine" mode for high I/O performance
  • HDFS cache to improve interactive performance when navigating HDFS directories and file lists
  • HDFS multi-file upload and download performance enhancements
  • HAL for Hortonworks Data Platform 1.2 and Apache Hadoop 1.0

ORCH 2.0.0

In ORCH 2.0.0, we introduced four Hadoop-enabled analytic functions supporting linear regression, low-rank matrix factorization, neural network, and non-negative matrix factorization. These enable R users to immediately begin using advanced analytics functions on HDFS data using the MapReduce paradigm on a Hadoop cluster without having to design and implement such algorithms themselves.

While ORCH 1.x supported moving data between the database and HDFS using sqoop, ORCH 2.0.0 supports the use of Oracle Loader for Hadoop (OLH) to move very large data volumes from HDFS to Oracle Database in an efficient and high-performance manner.

ORCH 2.0.0 supported Cloudera Distribution for Hadoop (CDH) version 4.2.0 and introduced the ORCHhive transparency layer, which leverages the Oracle R Enterprise transparency layer for SQL, but instead maps to HiveQL, a SQL-like language for manipulating HDFS data via Hive tables.

ORCH 2.1.0

In ORCH 2.1.0, we added several more analytic functions, including correlation and covariance, clustering via K-Means, principal component analysis (PCA), and sampling by specifying the percent of records to return.

ORCH 2.1.0 also brought a variety of features, including: configurable delimiters (beyond comma delimited text files, using any ASCII delimiter), the ability to specify mapper-only and reduce-only jobs, and the output of NULL keys in mapper and reducer functions.

To speed the loading of data into Hadoop jobs, ORCH introduced “pristine” mode where the user guarantees that the data meets certain requirements so that ORCH skips a time-consuming data validation step. “Pristine” data requires that numeric columns contain only numeric data, that missing values are either R’s NA or the null string, and that all rows have the same number of columns. This improves performance of hdfs.get on a 1GB file by a factor of 10.
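The three conditions can be checked on a small file with a few lines of base R. The helper is_pristine below is hypothetical and is not part of ORCH; it also assumes that every column is meant to be numeric, which holds for the all-numeric files used elsewhere on this blog but not for data in general.

```r
# Hypothetical helper: checks the three "pristine" conditions on a delimited
# text file, assuming every column is supposed to be numeric.
is_pristine <- function(path, sep = ",") {
  lines <- readLines(path)
  # Appending a separator keeps trailing empty fields when splitting
  fields <- strsplit(paste0(lines, sep), sep, fixed = TRUE)

  # Condition 1: all rows have the same number of columns
  same_width <- length(unique(lengths(fields))) == 1

  # Condition 2: missing values are either NA or the empty string
  vals <- unlist(fields)
  missing <- vals == "" | vals == "NA"

  # Condition 3: everything else parses as a number
  numeric_ok <- suppressWarnings(!is.na(as.numeric(vals[!missing])))

  same_width && all(numeric_ok)
}

good <- tempfile()
writeLines(c("1,2.5,3", "4,NA,6", "7,,9"), good)
bad <- tempfile()
writeLines(c("1,2,3", "4,x,6", "7,8"), bad)

is_pristine(good)  # TRUE
is_pristine(bad)   # FALSE
```

A check like this reads the whole file, which is exactly the cost that pristine mode lets ORCH avoid when the user can already vouch for the data.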

ORCH 2.1.0 introduced the caching of ORCH metadata, which makes ORCH functions such as hdfs.ls, hdfs.describe, and hdfs.mget between 5x and 70x faster.

The Hadoop Abstraction Layer, or HAL, enables ORCH to work on top of various Hadoop versions or variants, including Apache/Hortonworks and the Cloudera Hadoop distributions CDH3 and CDH 4.x with MR1 and MR2.

ORCH 2.2.0

In the latest release, ORCH 2.2.0, we’ve augmented orch.sample to allow specifying the number of rows in addition to percentage of rows. CDH 4.3 is now supported, and ORCH functions provide full online documentation via R's help function or ?. The function hdfs.attach now supports integer and matrix data types and the ability to detect pristine data automatically. HDFS bulk directory upload and download performance speeds were also improved. Through the caching and automatic synchronization of ORCH metadata and file lists, the responsiveness of metadata HDFS-related functions has improved by 3x over ORCH 2.1.0, which also improves performance of hadoop.run and hadoop.exec functions. These improvements in turn bring a more interactive user experience for the R user when working with HDFS.

Starting in ORCH 2.2.0, we introduced out-of-the-box tuning optimizations for high performance and expanded HDFS caching to include the caching of file lists, which further improves performance of HDFS-related functions.

The function hdfs.upload now supports the option to upload multi-file directories in a single invocation, which optimizes the process. When downloading an HDFS directory, hdfs.download is optimized to issue a single HDFS command to download files into one local temporary directory before combining the separate parts into a single file.

The Hadoop Abstraction Layer (HAL) was extended to support Hortonworks Data Platform 1.2 and Apache Hadoop 1.0. In addition, ORCH now allows the user to override the Hadoop Abstraction Layer version for use with unofficially supported distributions of Hadoop using system environment variables. This enables testing and certification of ORCH by other Hadoop distribution vendors.

Certification of ORCH on non-officially supported platforms can be done using a separate test kit (available for download upon request: mark.hornick@oracle.com) that includes an extensive set of tests for core ORCH functionality and that can be run using the ORCH built-in testing framework. Running the tests pinpoints the failures and ensures that ORCH is compatible with the target platform.

See the ORCH 2.2.0 Change List and Release Notes for additional details. ORCH 2.2.0 can be downloaded here.

Monday Jun 10, 2013

Bringing R to the Enterprise - new white paper available

Check out this new white paper entitled "Bringing R to the Enterprise -  A Familiar R Environment with Enterprise-Caliber Performance, Scalability, and Security."

In this white paper, we begin with "Beyond the Laptop" exploring the ability to run R code in the database, working with CRAN packages at the database server, operationalizing R analytics, and leveraging Hadoop from the comfort of the R language and environment.

Excerpt: "Oracle Advanced Analytics and Oracle R Connector for Hadoop combine the advantages of R with the power and scalability of Oracle Database and Hadoop. R programs and libraries can be used in conjunction with these database assets to process large amounts of data in a secure environment. Customers can build statistical models and execute them against local data stores as well as run R commands and scripts against data stored in a secure corporate database."

The white paper continues with three use cases involving Oracle Database and Hadoop: analyzing credit risk, detecting fraud, and preventing customer churn.  The conclusion: providing analytics for the enterprise based on the R environment is here!

Wednesday May 22, 2013

Big Data Analytics in R – the tORCH has been lit!

This guest post from Anand Srinivasan compares performance of the Oracle R Connector for Hadoop with the R {parallel} package for covariance matrix computation, sampling, and parallel linear model fitting. 

Oracle R Connector for Hadoop (ORCH) is a collection of R packages that enables Big Data analytics from the R environment. It enables a Data Scientist/Analyst to work on data straddling multiple data platforms (HDFS, Hive, Oracle Database, local files) from the comfort of the R environment and benefit from the R ecosystem.

ORCH provides:

1) Out-of-the-box predictive analytic techniques for linear regression, neural networks for prediction, matrix completion using low-rank matrix factorization, non-negative matrix factorization, k-means clustering, principal components analysis and multivariate analysis. While all these techniques have R interfaces, they are implemented either in Java or in R as distributed parallel implementations leveraging all nodes of your Hadoop cluster.

2) A general framework, where a user can use the R language to write custom logic executable in a distributed parallel manner using available compute and storage resources.

The main idea behind the ORCH architecture and its approach to Big Data analytics is to leverage the Hadoop infrastructure and thereby inherit all its advantages.

The crux of ORCH is read parallelization and robust methods over parallelized data. Efficient parallelization of reads is the single most important step necessary for Big Data Analytics because it is either expensive or impractical to load all available data in a single thread.

ORCH is often compared/contrasted with the other options available in R, in particular the popular open source R package called parallel. The parallel package provides a low-level infrastructure for “coarse-grained” distributed and parallel computation. While it is fairly general, it tends to encourage an approach that is based on using the aggregate RAM in the cluster as opposed to using the file system. Specifically, it lacks a data management component, a task management component and an administrative interface for monitoring. Programming, however, follows the broad Map Reduce paradigm.

 In the rest of this article, we assume that the reader has basic familiarity with the parallel package and proceed to compare ORCH and its approach with the parallel package. The goal of this comparison is to explain what it takes for a user to build a solution for their requirement using each of these technologies and also to understand the performance characteristics of these solutions.

We do this comparison using three concrete use cases – covariance matrix computation, sampling and partitioned linear model fitting. The exercise is designed to be repeatable, so you, the reader, can try this “at home”. We will demonstrate that ORCH is functionally and performance-wise superior to the available alternative of using R’s parallel package.

A six node Oracle Big Data Appliance v2.1.1 cluster is used in the experiments. Each node in this test environment has 48GB RAM and 24 CPU cores.

Covariance Matrix Computation

Computing covariance matrices is one of the most fundamental of statistical techniques.

In this use case, we have a single input file, “allnumeric_200col_10GB” (see appendix on how to generate this data set), that is about 10GB in size and has a data matrix with about 3 million rows and 200 columns. The requirement is to compute the covariance matrix of this input matrix.

Since a single node in the test environment has 48GB RAM and the input file is only 10GB, we start with the approach of loading the entire file into memory and then computing the covariance matrix using R’s cov function.

> system.time(m <- matrix(scan(file="/tmp/allnumeric_200col_10GB",what=0.0, sep=","), ncol=200, byrow=TRUE))

Read 611200000 items

user system elapsed

683.159 17.023 712.527

> system.time(res <- cov(m))

user system elapsed

561.627 0.009 563.044

We observe that the loading of data takes 712 seconds (vs. 563 seconds for the actual covariance computation) and dominates the cost. It would be even more pronounced (relative to the total elapsed time) if the cov(m) computation were parallelized using mclapply from the parallel package.

Based on this, we see that for an efficient parallel solution, the main requirement is to parallelize the data loading. This requires that the single input file be split into multiple smaller-sized files. The parallel package does not offer any data management facilities; hence this step has to be performed manually using a Linux command like split. Since there are 24 CPU cores, we split the input file into 24 smaller files.

time(split -l 127334 /tmp/allnumeric_200col_10GB)

real 0m54.343s

user 0m3.598s

sys 0m24.233s

Now, we can run the R script:


library(parallel)

# Read the data
readInput <- function(id) {
  infile <- file.path("/home/oracle/anasrini/cov", paste("p", id, sep=""))
  m <- matrix(scan(file=infile, what=0.0, sep=","), ncol=200, byrow=TRUE)
  m
}

# Main MAPPER function
compCov <- function(id) {
  m <- readInput(id)   # read the input
  cs <- colSums(m)     # compute col sums
  nr <- nrow(m)        # num rows
  mtm <- crossprod(m)  # compute main cov portion
  list(mat=mtm, colsum=cs, nrow=nr)
}

numfiles <- 24
numCores <- 24

# Map step
system.time(mapres <- mclapply(seq_len(numfiles), compCov, mc.cores=numCores))

# Reduce step
system.time(xy <- Reduce("+", lapply(mapres, function(x) x$mat)))
system.time(csf <- Reduce("+", lapply(mapres, function(x) x$colsum)))
system.time(nrf <- Reduce("+", lapply(mapres, function(x) x$nrow)))

sts <- csf %*% t(csf)
m1 <- xy / (nrf - 1)
m2 <- sts / (nrf * (nrf - 1))
m3 <- 2 * sts / (nrf * (nrf - 1))
covmat <- m1 + m2 - m3

user system elapsed

1661.196 21.209 77.781

We observe that the elapsed time (excluding time to split the files) has now come down to 77 seconds. However, it took 54 seconds for splitting the input file into smaller files, making it a significant portion of the total elapsed time of 77+54 = 131 seconds.
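The reason the per-chunk results can simply be added up is that the covariance matrix depends on the data only through three sums: the cross-product X'X, the column sums s, and the row count n, with cov = X'X / (n - 1) - s s' / (n (n - 1)). The sketch below checks this on a small random matrix in base R, using lapply in place of mclapply and four in-memory chunks in place of the 24 files; the sizes are arbitrary.

```r
set.seed(2013)
m <- matrix(rnorm(1000 * 5), ncol = 5)

# Split the rows into 4 chunks, standing in for the separate input files
chunk_id <- rep(1:4, each = 250)
chunks <- lapply(split(seq_len(nrow(m)), chunk_id),
                 function(rows) m[rows, , drop = FALSE])

# Map: per-chunk sufficient statistics
mapres <- lapply(chunks, function(x)
  list(mat = crossprod(x), colsum = colSums(x), nrow = nrow(x)))

# Reduce: add the pieces up
xy  <- Reduce("+", lapply(mapres, function(x) x$mat))
csf <- Reduce("+", lapply(mapres, function(x) x$colsum))
nrf <- Reduce("+", lapply(mapres, function(x) x$nrow))

# cov = X'X / (n - 1) - s s' / (n (n - 1)), with s the column sums
covmat <- xy / (nrf - 1) - tcrossprod(csf) / (nrf * (nrf - 1))

all.equal(covmat, cov(m))  # TRUE
```

This is the same quantity as m1 + m2 - m3 in the script, since m3 is twice m2.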

Besides impacting performance, there are a number of more serious problems with having to deal with data management manually. We list a few of them here:

1) In other scenarios, with larger files or larger number of chunks, placement of chunks also becomes a factor that influences I/O parallelism. Optimal placement of chunks of data over the available set of disks is a non-trivial problem

2) Requirement of root access – Optimal placement of file chunks on different disks often requires root access. For example, only root has permissions to create files on disks corresponding to the File Systems mounted on /u03, /u04 etc on an Oracle Big Data Appliance node

3) When multiple nodes are involved in the computation, moving fragments of the original data into different nodes manually can drain productivity

4) This form of split can only work in a static environment – in a real-world dynamic environment, information about other workloads and their resource utilization cannot be factored in a practical manner by a human

5) Requires admin to provide user access to all nodes of the cluster in order to allow the user to move data to different nodes

ORCH-based solution

On the other hand, using ORCH, we can directly use the out of the box support for multivariate analysis. Further, no manual steps related to data management (like splitting files and addressing chunk placement issues) are required since Hadoop (specifically HDFS) handles all those requirements seamlessly.

>x <- hdfs.attach("allnumeric_200col_10GB")

> system.time(res <- orch.cov(x))

user system elapsed

18.179 3.991 85.640

Forty-two concurrent map tasks were involved in the computation above as determined by Hadoop.

To conclude, we can see the following advantages of the ORCH based approach in this scenario:

1) No manual steps. Data Management completely handled transparently by HDFS

2) Out of the box support for cov. The distributed parallel algorithm is available out of the box and the user does not have to work it out from scratch

3) Using ORCH we get comparable performance to that obtained through manual coding without any of the manual overheads


Sampling

We use the same single input file, “allnumeric_200col_10GB” in this case as well. The requirement is to obtain a uniform random sample from the input data set. The size of the sample required is specified as a percentage of the input data set size.

Once again for the solution using the parallel package, the input file has to be split into smaller sized files for better read parallelism.


library(parallel)

# Read the data
readInput <- function(id) {
  infile <- file.path("/home/oracle/anasrini/cov", paste("p", id, sep=""))
  system.time(m <- matrix(scan(file=infile, what=0.0, sep=","),
                          ncol=200, byrow=TRUE))
  m   # return the matrix rather than the timing
}

# Main MAPPER function
samplemap <- function(id, percent) {
  m <- readInput(id)    # read the input
  v <- runif(nrow(m))   # Generate runif
  # Pick only those rows where random < percent*0.01
  keep <- which(v < percent*0.01)
  m1 <- m[keep,,drop=FALSE]
  m1
}

numfiles <- 24
numCores <- 24

# Map step
percent <- 0.001
system.time(mapres <- mclapply(seq_len(numfiles), samplemap, percent,
                               mc.cores=numCores))


user system elapsed

1112.998 23.196 49.561

ORCH based solution

>x <- hdfs.attach("allnumeric_200col_10GB_single")

>system.time(res <- orch.sample(x, percent=0.001))

user system elapsed

8.173 0.704 33.590

The ORCH based solution out-performs the solution based on the parallel package. This is because orch.sample is implemented in Java and the read rates obtained by a Java implementation are superior to what can be achieved in R.
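Whichever implementation is used, the sampling scheme in the parallel-package script is the same: every row is kept independently with probability percent*0.01, so each chunk can be sampled without knowing anything about the others. A minimal base R sketch on synthetic data follows; the chunking, the sizes, and the helper name sample_chunk are assumptions for illustration.

```r
set.seed(99)
m <- matrix(rnorm(200000 * 3), ncol = 3)
percent <- 1   # keep about 1% of the rows

# Keep each row independently with probability percent * 0.01
sample_chunk <- function(x, percent) {
  keep <- runif(nrow(x)) < percent * 0.01
  x[keep, , drop = FALSE]
}

# Sample four chunks separately, then combine the results
chunk_id <- rep(1:4, length.out = nrow(m))
pieces <- lapply(split(seq_len(nrow(m)), chunk_id),
                 function(rows) sample_chunk(m[rows, , drop = FALSE], percent))
smp <- do.call(rbind, pieces)

frac <- nrow(smp) / nrow(m)   # close to 0.01, but not exactly
```

The size of the sample is itself random under this scheme, so the requested percentage is met only approximately.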

Partitioned Linear Model Fitting

Partitioned Linear Model Fitting is a very popular use case. The requirement here is to fit separate linear models, one for each partition of the data. The data itself is partitioned based on a user-specified partitioning key.

For example, using the ONTIME data set, the user could specify destination city as the partitioning key indicating the requirement for separate linear models (with, for example, ArrDelay as target), 1 per destination city.

ORCH based solution

dfs_res <- hadoop.run(
  data = input,
  mapper = function(k, v) { orch.keyvals(v$Dest, v) },
  reducer = function(k, v) {
    lm_x <- lm(ArrDelay ~ DepDelay + Distance, v)
    orch.keyval(k, orch.pack(model=lm_x, count = nrow(v)))
  },
  config = new("mapred.config",
    job.name = "ORCH Partitioned lm by Destination City",
    map.output = mapOut,
    mapred.pristine = TRUE,
    reduce.output = data.frame(key="", model="packed")
  )
)



Notice that the Map Reduce framework is performing the partitioning. The mapper just picks out the partitioning key and the Map Reduce framework handles the rest. The linear model for each partition is then fitted in the reducer.

parallel based solution

As in the previous use cases, for good read parallelism, the single input file needs to be split into smaller files. However, unlike the previous use cases, there is a twist here.

We noted that with the ORCH based solution it is the Map Reduce framework that does the actual partitioning. There is no such out of the box feature available with a parallel package-based solution. There are two options:

1) Break up the file arbitrarily into smaller pieces for better read parallelism. Implement your own partitioning logic mimicking what the Map Reduce framework provides. Then fit linear models on each of these partitions in parallel.


2) Break the file into smaller pieces such that each piece is a separate partition. Fit linear models on each of these partitions in parallel 

Neither of these options is easy, and both require a lot of user effort. The custom coding required for achieving parallel reads is significant.
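For data that already fits in memory, the partition-then-fit logic itself is short, as the base R sketch below shows on a synthetic stand-in for the ONTIME data. The generated values and the three destination codes are invented; only the column names follow the example above. What the sketch leaves out is precisely the hard part discussed here: reading a large file in parallel and routing each row to the right partition.

```r
set.seed(5)
# Synthetic stand-in for the ONTIME data; values are invented
n <- 600
ontime <- data.frame(
  Dest     = sample(c("SFO", "ORD", "JFK"), n, replace = TRUE),
  DepDelay = rnorm(n, 10, 15),
  Distance = runif(n, 200, 2500)
)
ontime$ArrDelay <- 2 + 0.9 * ontime$DepDelay + 0.001 * ontime$Distance + rnorm(n)

# Partition by key, then fit one linear model per partition
parts <- split(ontime, ontime$Dest)
models <- lapply(parts, function(v)
  list(model = lm(ArrDelay ~ DepDelay + Distance, data = v), count = nrow(v)))

# One DepDelay coefficient per destination
slopes <- sapply(models, function(x) coef(x$model)[["DepDelay"]])
```

The lapply call could be swapped for mclapply from the parallel package to fit the partitions concurrently, but the split step still assumes the whole data set has been loaded first.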


Conclusion

ORCH provides a holistic approach to Big Data Analytics in the R environment. By leveraging the Hadoop infrastructure, ORCH inherits several key components that are all required to address real world analytics requirements.

The rich set of out-of-the-box predictive analytic techniques along with the possibility of authoring custom parallel distributed analytics using the framework (as demonstrated in the partitioned linear model fitting case) helps simplify the user’s task while meeting the performance and scalability requirements. 

Appendix – Data Generation

We show the steps required to generate the single input file “allnumeric_200col_10GB”.

Run the following in R:

x <- orch.datagen(datasize=10*1024*1024*1024, numeric.col.count=200,


hdfs.mv(x, "allnumeric_200col_10GB")

Then, from the Linux shell:

hdfs dfs -rm -r -skipTrash /user/oracle/allnumeric_200col_10GB/__ORCHMETA__

hdfs dfs -getmerge /user/oracle/allnumeric_200col_10GB /tmp/allnumeric_200col_10GB

Tuesday Oct 02, 2012

Oracle R Enterprise Tutorial Series on Oracle Learning Library

Oracle Server Technologies Curriculum has just released the Oracle R Enterprise Tutorial Series, which is publicly available on Oracle Learning Library (OLL). This 8-part interactive lecture series with review sessions covers Oracle R Enterprise 1.1 and an introduction to Oracle R Connector for Hadoop 1.1:
  • Introducing Oracle R Enterprise
  • Getting Started with ORE
  • R Language Basics
  • Producing Graphs in R
  • The ORE Transparency Layer
  • ORE Embedded R Scripts: R Interface
  • ORE Embedded R Scripts: SQL Interface
  • Using the Oracle R Connector for Hadoop

We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop.  As always, we welcome comments and questions on the Oracle R Forum.



