Friday Jan 03, 2014

ORAAH - Enabling high performance R workloads on Hadoop


One of the features of Oracle R Advanced Analytics for Hadoop (ORAAH) is enabling Hadoop jobs written in the R language. R is a popular open-source language and environment for statistical computing and graphics. ORAAH enables R programmers to leverage a Hadoop cluster operating on data resident in HDFS files.

In this blog post, we examine the performance characteristics of ORAAH with an example and explain what makes ORAAH the fastest alternative available to run Hadoop-R jobs. We also compare the results with another popular Hadoop interface for R, rmr.

Credit to Vlad Sharanhovich and Anand Srinivasan for providing the content for this blog post.

In probability theory and statistics, covariance is a measure of how much two variables change together.

Variables that tend to show similar behavior exhibit positive covariance. Conversely, if the greater values of one variable correspond with the smaller values of the other, the covariance between the variables is negative. We use covariance computation as the running example below, which you can use to reproduce the results detailed here.
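
As a quick illustration in plain R (separate from the cluster example that follows), variables that move together yield a positive value from cov(), while variables that move in opposite directions yield a negative one:

# Small base-R illustration of the sign of covariance
set.seed(42)
x     <- 1:100
y_pos <-  2 * x + rnorm(100)   # tends to move with x     -> positive covariance
y_neg <- -2 * x + rnorm(100)   # tends to move against x  -> negative covariance
cov(x, y_pos)   # large positive value
cov(x, y_neg)   # large negative value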

The tests were performed on a 6-node cluster running ORAAH version 2.3.1.


Cluster configuration:


  • 6 node cluster

  • BDA v2.3.1 (based on CDH 4.4)

  • 4 tasktrackers

  • CPU: Intel® Xeon® X5675 @ 3.07GHz

  • RAM: 47GB


ORAAH comes with a convenience function for data generation. We use this function to generate a 100 GB HDFS input file with numeric values for 200 variables (columns), as follows:


# Generate 100GB input dataset (using pre-release 2.4.0)
inputCsv <- orch.datagen(1e+11, numeric.col.count=200, parts=100)
# 15 mins, 53 sec
# 200 mappers

ORAAH supports 2 types of HDFS input: delimited text files and a binary RDATA representation (R's own binary representation). In many cases, RDATA representation provides much better I/O throughput compared to delimited text files.


# Converting into ORAAH native format
inputRdata <- hdfs.toRData(inputCsv, out.name="100G_200n_rd")
# 4 mins, 54 sec
# 400 mappers

Next, we write the mapper and reducer code for parallel/distributed covariance computation. Below, the mapper function accepts a data.frame representation of the input data and generates structured output with 3 components: a matrix, a vector of column sums, and input row count. The single reducer function merges the structured output generated from the mappers to produce the final covariance matrix.

The mapper and reducer functions are supplied as input to ORAAH's hadoop.run() function. This function additionally takes an HDFS file as input in the data argument. The structured output from the mapper and reducer is defined in the mapred.config data structure. Optionally, the Hadoop job can be given a name (in this case "cov") for traceability.

ORCH_cov <- function(x) {
  hadoop.run(
    data = x,
    mapper = function(k, v) {
      # Each mapper computes the partial cross-product, column sums, and row
      # count for its split, and emits them as a single packed structure.
      m <- as.matrix(v)
      cs <- colSums(m)
      nr <- nrow(m)
      mtm <- t(m) %*% m
      l <- list(mat=mtm, colsum=cs, nrow=nr)
      orch.keyval(NULL, orch.pack(l))
    },
    reducer = function(k, v) {
      # The single reducer unpacks and merges the mapper outputs, then
      # assembles the final covariance matrix.
      mapres <- orch.unpack(v$val, as.list=T)
      xy  <- Reduce("+", lapply(mapres, function(x) x$mat))
      csf <- Reduce("+", lapply(mapres, function(x) x$colsum))
      nrf <- Reduce("+", lapply(mapres, function(x) x$nrow))
      sts <- csf %*% t(csf)
      m1 <- xy / (nrf - 1)
      m2 <- sts / (nrf * (nrf - 1))
      m3 <- 2 * sts / (nrf * (nrf - 1))
      covmat <- m1 + m2 - m3
      orch.keyval(NULL, orch.pack(covmat))
    },
    config = new("mapred.config",
                 map.output    = data.frame(key="none", val="c"),
                 reduce.output = data.frame(key="none", val="c")),
    job.name = "cov")
}

The execution times are shown below:


# Using text input
ORCH_cov(inputCsv)
# 7 mins, 19 sec
# 400 mappers / 45 reducers



# Using binary input
ORCH_cov(inputRdata)
# 4 mins, 18 sec
# 400 mappers / 45 reducers

A few points to observe:
1. Before an HDFS file can be used with ORAAH's hadoop.run() function, its metadata must be known. ORAAH automatically determines the data types of the columns in the HDFS file by sampling rows. The metadata is created during the hdfs.attach() call (a minimal sketch follows this list). This metadata enables ORAAH to generate highly optimized scan routines to read rows from the file.
2. ORAAH caches the input and output structures of mappers and reducers, relieving the R engine of the burden of handling large data volumes directly.
3. ORAAH's orch.pack() and orch.unpack() functions enable transfer of structured constructs between mappers and reducers, which further improves I/O throughput by eliminating the need to scan/parse string inputs.
4. ORAAH leverages R's own RDATA representation as the binary representation. The key to better execution performance of R jobs on Hadoop is managing I/O throughput and carefully bypassing R's inherent limitations with parsing strings.
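
As a minimal sketch of point 1 (the HDFS file name here is hypothetical, not part of the benchmark above), attaching a delimited file creates the metadata that later hadoop.run() calls rely on:

# Hedged sketch: attach an existing delimited HDFS file so that ORAAH samples rows,
# infers the column types, and records the metadata. "weblog_data" is a made-up path.
dfs.id <- hdfs.attach("weblog_data")
hdfs.describe(dfs.id)   # inspect the inferred column names and types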

We contrast ORAAH's performance with that of the popular open source package rmr (https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr) by repeating the covariance calculation on the same cluster and input data set. For this experiment, we used rmr version 2.3.0.

The covariance computation is written in rmr as shown below.


RMR2_cov <- function(x, input.format) {
  mapreduce(
    x,
    input.format = input.format,
    map = function(k, v) {
      m <- as.matrix(v)
      cs <- colSums(m)
      nr <- nrow(m)
      mtm <- t(m) %*% m
      l <- list(mat=mtm, colsum=cs, nrow=nr)
      # mapper output must be serialized as a string
      keyval(1, paste0(deparse(l), collapse="\n"))
    },
    reduce = function(k, v) {
      # the reducer must parse the strings back into R objects
      mapres <- lapply(v, function(x) eval(parse(text=x)))
      xy  <- Reduce("+", lapply(mapres, function(x) x$mat))
      csf <- Reduce("+", lapply(mapres, function(x) x$colsum))
      nrf <- Reduce("+", lapply(mapres, function(x) x$nrow))
      sts <- csf %*% t(csf)
      m1 <- xy / (nrf - 1)
      m2 <- sts / (nrf * (nrf - 1))
      m3 <- 2 * sts / (nrf * (nrf - 1))
      covmat <- m1 + m2 - m3
      keyval(1, paste0(deparse(covmat), collapse="\n"))
    }
  )
}

Notice that the code is similar to the ORAAH version, with three key differences:
   i) Mapper output is serialized as a string
  ii) The reducer is thus forced to parse its input strings
 iii) Reducer output is once again a string, requiring the client to reconstruct the covariance matrix (a sketch of this step follows)
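
As a hedged sketch of that client-side reconstruction (this step is not shown in the original code; "out" simply stands for the object returned by RMR2_cov()), the caller fetches the reducer output from HDFS with rmr's from.dfs() and re-evaluates the deparsed string:

# Hedged sketch: pulling the reducer output back to the client and parsing it.
library(rmr2)
out     <- RMR2_cov(inputCsv, make.input.format("csv", sep=","))
res_str <- values(from.dfs(out))[[1]]    # the deparse()d covariance matrix, as text
covmat  <- eval(parse(text = res_str))   # reconstruct the numeric matrix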

Not only is passing strings limiting from the R programmer's perspective, it also has a negative effect on I/O throughput. rmr also supports a proprietary binary representation of delimited text data.

Below we repeat the tests with both the delimited and the binary representations. Not only is the conversion to rmr's binary representation more expensive, the resulting I/O throughput still falls well short of ORAAH's.


# Convert to RMR2 native format
inputRMR <- mapreduce(inputCsv, 
  input.format = make.input.format("csv",sep=","),
  map = function(k, v) keyval(NULL, v)
)
# 20 mins, 17 sec
# 400 mappers
#
# Using text input
RMR2_cov(inputCsv, make.input.format("csv",sep=","))
# 32 mins, 14 sec
# 400 mappers / 45 reducers
#
# Using binary input
RMR2_cov(inputRMR, "native")
# 17 mins, 18 sec
# 400 mappers / 45 reducers

To summarize, ORAAH is at least 4x faster than rmr out of the box for a simple covariance calculation:

              Text Input         Binary Input       Text to Binary Conversion
ORAAH         7 min, 19 sec      4 min, 18 sec      4 min, 54 sec
rmr           32 min, 14 sec     17 min, 18 sec     20 min, 17 sec
ORAAH gain    4.4x faster        4x faster          4.14x faster

Friday Jul 19, 2013

Oracle R Connector for Hadoop 2.2.0 released

Oracle R Connector for Hadoop 2.2.0 is now available for download. The Oracle R Connector for Hadoop 2.x series has introduced numerous enhancements, which are highlighted in this article and summarized as follows:

ORCH 2.0.0

  • Analytic functions: orch.lm, orch.lmf, orch.neural, orch.nmf
  • Oracle Loader for Hadoop (OLH) support
  • CDH 4.2.0
  • ORCHhive transparency layer

ORCH 2.1.0

  • Analytic functions: orch.cor, orch.cov, orch.kmeans, orch.princomp, orch.sample (by percent)
  • Configurable delimiters in text input data files
  • Map-only and reduce-only jobs
  • Keyless map/reduce output
  • "Pristine" data mode for high performance data access
  • HDFS cache of metadata
  • Hadoop Abstraction Layer (HAL)

ORCH 2.2.0

  • Analytic functions: orch.sample (by number of rows)
  • CDH 4.3.0
  • Full online documentation
  • Support for integer and matrix data types in hdfs.attach, with detection of "pristine" data
  • Out-of-the-box support for "pristine" mode for high I/O performance
  • HDFS cache to improve interactive performance when navigating HDFS directories and file lists
  • HDFS multi-file upload and download performance enhancements
  • HAL for Hortonworks Data Platform 1.2 and Apache Hadoop 1.0

ORCH 2.0.0

In ORCH 2.0.0, we introduced four Hadoop-enabled analytic functions supporting linear  regression, low rank matrix factorization, neural network, and non-negative matrix factorization. These enable R users to immediately begin using advanced analytics functions on HDFS data using the MapReduce paradigm on a Hadoop cluster without having to design and implement such algorithms themselves.
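
As a hedged sketch (the HDFS file name is hypothetical and the argument layout follows the pattern of the other orch.* calls in this blog), fitting a distributed linear model looks roughly like this:

# Hedged sketch: a Hadoop-enabled linear regression with ORCH 2.x.
# "ontime_2012" is a made-up HDFS file; attach it first to create its metadata.
ontime_dfs <- hdfs.attach("ontime_2012")
fit <- orch.lm(ArrDelay ~ DepDelay + Distance, ontime_dfs)  # runs as MapReduce jobs on the cluster
print(fit)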

While ORCH 1.x supported moving data between the database and HDFS using sqoop, ORCH 2.0.0 supports the use of Oracle Loader for Hadoop (OLH) to move very large data volumes from HDFS to Oracle Database in an efficient and high-performance manner.

ORCH 2.0.0 supported Cloudera Distribution for Hadoop (CDH) version 4.2.0 and introduced the ORCHhive transparency layer, which leverages the Oracle R Enterprise transparency layer for SQL, but instead maps to HiveQL, a SQL-like language for manipulating HDFS data via Hive tables.

ORCH 2.1.0

In ORCH 2.1.0, we added several more analytic functions, including correlation and covariance, clustering via K-Means, principal component analysis (PCA), and sampling by specifying the percent of records to return.

ORCH 2.1.0 also brought a variety of features, including: configurable delimiters (beyond comma delimited text files, using any ASCII delimiter), the ability to specify mapper-only and reduce-only jobs, and the output of NULL keys in mapper and reducer functions.

To speed the loading of data into Hadoop jobs, ORCH introduced “pristine” mode where the user guarantees that the data meets certain requirements so that ORCH skips a time-consuming data validation step. “Pristine” data requires that numeric columns contain only numeric data, that missing values are either R’s NA or the null string, and that all rows have the same number of columns. This improves performance of hdfs.get on a 1GB file by a factor of 10.

ORCH 2.1.0 introduced caching of ORCH metadata, improving the response time of functions such as hdfs.ls, hdfs.describe, and hdfs.mget by between 5x and 70x.

The Hadoop Abstraction Layer, or HAL, enables ORCH to work on top of various Hadoop versions and variants, including Apache/Hortonworks Hadoop and the Cloudera distributions CDH3 and CDH 4.x with MR1 and MR2.

ORCH 2.2.0

In the latest release, ORCH 2.2.0, we’ve augmented orch.sample to allow specifying the number of rows in addition to a percentage of rows. CDH 4.3 is now supported, and ORCH functions provide full online documentation via R's help function or ?. The function hdfs.attach now supports integer and matrix data types and can detect pristine data automatically. HDFS bulk directory upload and download performance was also improved. Through the caching and automatic synchronization of ORCH metadata and file lists, the responsiveness of HDFS metadata-related functions has improved by 3x over ORCH 2.1.0, which also improves the performance of the hadoop.run and hadoop.exec functions. These improvements in turn bring a more interactive experience for the R user when working with HDFS.

Starting in ORCH 2.2.0, we introduced out-of-the-box tuning optimizations for high performance and expanded HDFS caching to include the caching of file lists, which further improves performance of HDFS-related functions.

The function hdfs.upload now supports uploading multi-file directories in a single invocation, which streamlines the process. When downloading an HDFS directory, hdfs.download now issues a single HDFS command to download the files into one local temporary directory before combining the separate parts into a single file.

The Hadoop Abstraction Layer (HAL) was extended to support Hortonworks Data Platform 1.2 and Apache Hadoop 1.0. In addition, ORCH now allows the user to override the Hadoop Abstraction Layer version for use with unofficially supported distributions of Hadoop using system environment variables. This enables testing and certification of ORCH by other Hadoop distribution vendors.

Certification of ORCH on non-officially supported platforms can be done using a separate test kit (available for download upon request: mark.hornick@oracle.com) that includes an extensive set of tests for core ORCH functionality and that can be run using the ORCH built-in testing framework. Running the tests pinpoints the failures and ensures that ORCH is compatible with the target platform.

See the ORCH 2.2.0 Change List and Release Notes for additional details. ORCH 2.2.0 can be downloaded here.


Wednesday May 22, 2013

Big Data Analytics in R – the tORCH has been lit!

This guest post from Anand Srinivasan compares performance of the Oracle R Connector for Hadoop with the R {parallel} package for covariance matrix computation, sampling, and parallel linear model fitting. 

Oracle R Connector for Hadoop (ORCH) is a collection of R packages that enables Big Data analytics from the R environment. It enables a Data Scientist /Analyst to work on data straddling multiple data platforms (HDFS, Hive, Oracle Database, local files) from the comfort of the R environment and benefit from the R ecosystem.

ORCH provides:

1) Out of the box predictive analytic techniques for linear regression, neural networks for prediction, matrix completion using low rank matrix factorization, non-negative matrix factorization, kmeans clustering, principal components analysis and multivariate analysis. While all these techniques have R interfaces, they are implemented either in Java or in R as distributed parallel implementations leveraging all nodes of your Hadoop cluster

2) A general framework, where a user can use the R language to write custom logic executable in a distributed parallel manner using available compute and storage resources.

The main idea behind the ORCH architecture and its approach to Big Data analytics is to leverage the Hadoop infrastructure and thereby inherit all its advantages.

The crux of ORCH is read parallelization and robust methods over parallelized data. Efficient parallelization of reads is the single most important step necessary for Big Data Analytics because it is either expensive or impractical to load all available data in a single thread.

ORCH is often compared/contrasted with the other options available in R, in particular the popular open source R package called parallel. The parallel package provides a low-level infrastructure for “coarse-grained” distributed and parallel computation. While it is fairly general, it tends to encourage an approach that is based on using the aggregate RAM in the cluster as opposed to using the file system. Specifically, it lacks a data management component, a task management component and an administrative interface for monitoring. Programming, however, follows the broad Map Reduce paradigm.

 In the rest of this article, we assume that the reader has basic familiarity with the parallel package and proceed to compare ORCH and its approach with the parallel package. The goal of this comparison is to explain what it takes for a user to build a solution for their requirement using each of these technologies and also to understand the performance characteristics of these solutions.

We do this comparison using three concrete use cases – covariance matrix computation, sampling and partitioned linear model fitting. The exercise is designed to be repeatable, so you, the reader, can try this “at home”. We will demonstrate that ORCH is functionally and performance-wise superior to the available alternative of using R’s parallel package.

A six node Oracle Big Data Appliance v2.1.1 cluster is used in the experiments. Each node in this test environment has 48GB RAM and 24 CPU cores.

Covariance Matrix Computation

Computing covariance matrices is one of the most fundamental of statistical techniques.

In this use case, we have a single input file, “allnumeric_200col_10GB” (see appendix on how to generate this data set), that is about 10GB in size and has a data matrix with about 3 million rows and 200 columns. The requirement is to compute the covariance matrix of this input matrix.

Since a single node in the test environment has 48GB RAM and the input file is only 10GB, we start with the approach of loading the entire file into memory and then computing the covariance matrix using R’s cov function.

> system.time(m <- matrix(scan(file="/tmp/allnumeric_200col_10GB", what=0.0,
+                              sep=","), ncol=200, byrow=TRUE))
Read 611200000 items
   user  system elapsed
683.159  17.023 712.527

> system.time(res <- cov(m))
   user  system elapsed
561.627   0.009 563.044

We observe that the loading of data takes 712 seconds (vs. 563 seconds for the actual covariance computation) and dominates the cost. The imbalance would be even more pronounced (relative to the total elapsed time) if the cov(m) computation itself were parallelized using mclapply from the parallel package.
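
For reference, a minimal sketch of that in-memory parallelization (assuming m is already loaded as above) splits the columns into blocks and computes each block's covariance against all columns:

# Sketch only: parallelize cov(m) over column blocks with mclapply
library(parallel)
blocks <- split(seq_len(ncol(m)), cut(seq_len(ncol(m)), 24, labels = FALSE))
parts  <- mclapply(blocks, function(cols) cov(m[, cols], m), mc.cores = 24)
covmat <- do.call(rbind, parts)   # blocks come back in order, giving the full 200 x 200 matrix

Even so, the 712 seconds spent in the single-threaded scan() would remain untouched, which is why the rest of this section focuses on parallelizing the read.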

Based on this, we see that for an efficient parallel solution, the main requirement is to parallelize the data loading. This requires that the single input file be split into multiple smaller-sized files. The parallel package does not offer any data management facilities; hence this step has to be performed manually using a Linux command like split. Since there are 24 CPU cores, we split the input file into 24 smaller files.

time(split -l 127334 /tmp/allnumeric_200col_10GB)

real    0m54.343s
user    0m3.598s
sys     0m24.233s

Now, we can run the R script:

library(parallel)

# Read one chunk of the data
readInput <- function(id) {
  infile <- file.path("/home/oracle/anasrini/cov", paste("p", id, sep=""))
  print(infile)
  m <- matrix(scan(file=infile, what=0.0, sep=","), ncol=200, byrow=TRUE)
  m
}

# Main MAPPER function
compCov <- function(id) {
  m <- readInput(id)   # read the input
  cs <- colSums(m)     # compute col sums, num rows
  nr <- nrow(m)
  mtm <- crossprod(m)  # compute main cov portion
  list(mat=mtm, colsum=cs, nrow=nr)
}

numfiles <- 24
numCores <- 24

# Map step
system.time(mapres <- mclapply(seq_len(numfiles), compCov, mc.cores=numCores))

# Reduce step
system.time(xy  <- Reduce("+", lapply(mapres, function(x) x$mat)))
system.time(csf <- Reduce("+", lapply(mapres, function(x) x$colsum)))
system.time(nrf <- Reduce("+", lapply(mapres, function(x) x$nrow)))

sts <- csf %*% t(csf)
m1 <- xy / (nrf - 1)
m2 <- sts / (nrf * (nrf - 1))
m3 <- 2 * sts / (nrf * (nrf - 1))
covmat <- m1 + m2 - m3

    user   system  elapsed
1661.196   21.209   77.781

We observe that the elapsed time (excluding time to split the files) has now come down to 77 seconds. However, it took 54 seconds for splitting the input file into smaller files, making it a significant portion of the total elapsed time of 77+54 = 131 seconds.

Besides impacting performance, there are a number of more serious problems with having to deal with data management manually. We list a few of them here:

1) In other scenarios, with larger files or larger number of chunks, placement of chunks also becomes a factor that influences I/O parallelism. Optimal placement of chunks of data over the available set of disks is a non-trivial problem

2) Requirement of root access – Optimal placement of file chunks on different disks often requires root access. For example, only root has permissions to create files on disks corresponding to the File Systems mounted on /u03, /u04 etc on an Oracle Big Data Appliance node

3) When multiple nodes are involved in the computation, moving fragments of the original data into different nodes manually can drain productivity

4) This form of split can only work in a static environment – in a real-world dynamic environment, information about other workloads and their resource utilization cannot be factored in a practical manner by a human

5) Requires admin to provide user access to all nodes of the cluster in order to allow the user to move data to different nodes

ORCH-based solution

On the other hand, using ORCH, we can directly use the out of the box support for multivariate analysis. Further, no manual steps related to data management (like splitting files and addressing chunk placement issues) are required since Hadoop (specifically HDFS) handles all those requirements seamlessly.

> x <- hdfs.attach("allnumeric_200col_10GB")
> system.time(res <- orch.cov(x))
   user  system elapsed
 18.179   3.991  85.640

Forty-two concurrent map tasks were involved in the computation above as determined by Hadoop.

To conclude, we can see the following advantages of the ORCH based approach in this scenario :

1) No manual steps. Data Management completely handled transparently by HDFS

2) Out of the box support for cov. The distributed parallel algorithm is available out of the box and the user does not have to work it out from scratch

3) Using ORCH we get comparable performance to that obtained through manual coding without any of the manual overheads

Sampling

We use the same single input file, “allnumeric_200col_10GB” in this case as well. The requirement is to obtain a uniform random sample from the input data set. The size of the sample required is specified as a percentage of the input data set size.

Once again for the solution using the parallel package, the input file has to be split into smaller sized files for better read parallelism.

library(parallel)

# Read one chunk of the data
readInput <- function(id) {
  infile <- file.path("/home/oracle/anasrini/cov", paste("p", id, sep=""))
  print(infile)
  system.time(m <- matrix(scan(file=infile, what=0.0, sep=","),
                          ncol=200, byrow=TRUE))
  m
}

# Main MAPPER function
samplemap <- function(id, percent) {
  m <- readInput(id)    # read the input
  v <- runif(nrow(m))   # generate one uniform random number per row
  # pick only those rows where random < percent*0.01
  keep <- which(v < percent*0.01)
  m1 <- m[keep, , drop=FALSE]
  m1
}

numfiles <- 24
numCores <- 24

# Map step
percent <- 0.001
system.time(mapres <- mclapply(seq_len(numfiles), samplemap, percent,
                               mc.cores=numCores))

    user   system  elapsed
1112.998   23.196   49.561

ORCH based solution

> x <- hdfs.attach("allnumeric_200col_10GB_single")
> system.time(res <- orch.sample(x, percent=0.001))
   user  system elapsed
  8.173   0.704  33.590

The ORCH based solution out-performs the solution based on the parallel package. This is because orch.sample is implemented in Java and the read rates obtained by a Java implementation are superior to what can be achieved in R.

Partitioned Linear Model Fitting

Partitioned Linear Model Fitting is a very popular use case. The requirement here is to fit separate linear models, one for each partition of the data. The data itself is partitioned based on a user-specified partitioning key.

For example, using the ONTIME data set, the user could specify destination city as the partitioning key, indicating the requirement for separate linear models (with, for example, ArrDelay as the target), one per destination city.

ORCH based solution

# mapOut is assumed to be defined earlier as a data.frame describing the
# structure of the mapper output.
dfs_res <- hadoop.run(
  data = input,
  mapper = function(k, v) { orch.keyvals(v$Dest, v) },
  reducer = function(k, v) {
    lm_x <- lm(ArrDelay ~ DepDelay + Distance, v)
    orch.keyval(k, orch.pack(model=lm_x, count=nrow(v)))
  },
  config = new("mapred.config",
               job.name        = "ORCH Partitioned lm by Destination City",
               map.output      = mapOut,
               mapred.pristine = TRUE,
               reduce.output   = data.frame(key="", model="packed"))
)

Notice that the Map Reduce framework is performing the partitioning. The mapper just picks out the partitioning key and the Map Reduce framework handles the rest. The linear model for each partition is then fitted in the reducer.

parallel based solution

As in the previous use cases, for good read parallelism, the single input file needs to be split into smaller files. However, unlike the previous use cases, there is a twist here.

We noted that with the ORCH based solution it is the Map Reduce framework that does the actual partitioning. There is no such out of the box feature available with a parallel package-based solution. There are two options:

1) Break up the file arbitrarily into smaller pieces for better read parallelism. Implement your own partitioning logic mimicking what the Map Reduce framework provides. Then fit linear models on each of these partitions in parallel.

OR

2) Break the file into smaller pieces such that each piece is a separate partition. Fit linear models on each of these partitions in parallel 

Neither of these options is easy, and both require substantial user effort; the custom coding needed to achieve parallel reads alone is significant.
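
To make the effort concrete, here is a hedged sketch of option 2, assuming the ONTIME data has already been split by hand into one CSV file per destination city (the directory layout and file names are hypothetical):

# Sketch only: fit one linear model per partition, where each partition is a
# separate file under a hypothetical directory /tmp/ontime_by_dest/.
library(parallel)

fit_one <- function(f) {
  d <- read.csv(f)                               # one destination city per file
  lm(ArrDelay ~ DepDelay + Distance, data = d)   # same model as the ORCH reducer
}

files  <- list.files("/tmp/ontime_by_dest", full.names = TRUE)
models <- mclapply(files, fit_one, mc.cores = 24)
names(models) <- basename(files)

Even this sketch leaves the hard part, producing those per-city files from a single large input, to manual scripting, which is exactly the work the Map Reduce framework absorbs in the ORCH version.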

Conclusion

ORCH provides a holistic approach to Big Data Analytics in the R environment. By leveraging the Hadoop infrastructure, ORCH inherits several key components that are all required to address real world analytics requirements.

The rich set of out-of-the-box predictive analytic techniques along with the possibility of authoring custom parallel distributed analytics using the framework (as demonstrated in the partitioned linear model fitting case) helps simplify the user’s task while meeting the performance and scalability requirements. 

Appendix – Data Generation

We show the steps required to generate the single input file “allnumeric_200col_10GB”.

Run the following in R:

x <- orch.datagen(datasize=10*1024*1024*1024, numeric.col.count=200,
                  map.degree=40)
hdfs.mv(x, "allnumeric_200col_10GB")

Then, from the Linux shell:

hdfs dfs -rm -r -skipTrash /user/oracle/allnumeric_200col_10GB/__ORCHMETA__

hdfs dfs -getmerge /user/oracle/allnumeric_200col_10GB /tmp/allnumeric_200col_10GB


Thursday Feb 02, 2012

Announcing Oracle R Enterprise 1.0

Analyzing huge data sets presents a challenging opportunity for IT decision makers, who must balance the maintenance and support of existing IT infrastructure with the need to analyze rapidly growing data stores. In many cases, processing this data requires a fresh approach because traditional techniques fail when applied to massive data sets. To extract immediate value from big data, we need tools that efficiently access, organize, analyze and maintain a variety of data types.

Oracle R Enterprise (ORE), a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition, emerges as the clear solution to these challenges. ORE integrates the popular open-source R statistical programming environment with Oracle Database 11g, Oracle Exadata and the Oracle Big Data Appliance, delivering enterprise-level analytics based on R scripts and parallelized, in-database modeling.

How do R and Oracle R Enterprise work together?

The powerful R programming environment enables the creation of sophisticated graphics, statistical analyses, and simulations. It contains a vast set of built-in functions which may be extended to build custom statistical packages. The R engine is limited by capacity and performance for large data, but with Oracle R Enterprise, users bypass these constraints by leveraging the database as the analytics engine directly from their R session.

The components that support Oracle R Enterprise include:

1. The Oracle R Enterprise transparency layer - a collection of R packages with functions to connect to Oracle Database and use R functionality in Oracle Database. This enables R users to work with data too large to fit into the memory of a user's desktop system, and leverage the scalable Oracle Database as a computational engine.

2. The Oracle statistics engine - a collection of statistical functions and procedures corresponding to commonly-used statistical libraries. The statistics engine packages also execute in Oracle Database.

3. SQL extensions supporting embedded R execution on the database server. R users can execute R closures (functions) using an R or SQL API, while taking advantage of data parallelism. Using the SQL API for embedded R execution, sophisticated R graphics and results can be exposed in OBIEE dashboards and BI Publisher documents. (A brief sketch of the R API follows this list.)

4. Oracle R Connector for Hadoop (ORCH) - an R package that interfaces with the Hadoop Distributed File System (HDFS) and enables executing MapReduce jobs. ORCH enables R users to work directly with an Oracle Hadoop cluster, executing computations from the R environment, written in the R language and working on data resident in HDFS, Oracle Database, or local files.
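
As a hedged sketch of component 3 using the R API (the connection details below are hypothetical and the exact arguments may vary by release), an R closure can be shipped to the database for execution:

# Hedged sketch: embedded R execution invoked from the R client.
library(ORE)
ore.connect(user = "rquser", sid = "orcl", host = "dbhost",
            password = "...", all = TRUE)           # hypothetical connection details
res <- ore.doEval(function() summary(rnorm(1000)))  # the closure runs in the database
res                                                 # results are returned to the client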

Using a simple R workflow, R users can seamlessly utilize the parallel processing architecture of ORE and ORCH for scalability and better performance. Analytics and reporting tasks are moved to the Oracle Database, eliminating long approval chains for data movement and dramatically increasing processing speed. R users are not required to learn SQL because the R-to-SQL translation is shipped to the database and processed behind the scenes. The significant benefits to IT include improved data security, data maintenance and audit compliance practices.

We’re proud to announce Oracle R Enterprise 1.0 and look forward to your comments. To learn more about ORE, visit our product page.

Tuesday Jan 17, 2012

Welcome to Oracle R Enterprise!

Welcome to the Oracle R Enterprise blog - brought to you by the Oracle Advanced Analytics group. We'll be sharing best practices, tips, and tricks for applying Oracle R Enterprise and Oracle R Connector for Hadoop in both traditional and new "big data" environments. Oracle R Enterprise and Oracle Data Mining are the two components of the new Oracle Advanced Analytics Option for Oracle Database.

Here's a brief introduction to Oracle's R offerings: Oracle R Distribution, Oracle R Enterprise, and Oracle R Connector for Hadoop.

Oracle R Distribution provides an Oracle-supported distribution of open source R — enhanced with Intel’s MKL libraries for high performance mathematical computations on x86 hardware. The Oracle R Distribution facilitates enterprise acceptance of R, since the lack of a major corporate sponsor has made some companies concerned about fully adopting R.

Oracle R Enterprise (ORE) integrates the open-source R statistical environment and language with Oracle Database 11g, and the Oracle engineered solutions of Oracle Exadata and Oracle Big Data Appliance. ORE delivers enterprise-level advanced analytics based on the R environment, leveraging the database as an analytical compute engine. This allows R users like data analysts and statisticians to use the R client directly against data stored in Oracle Database 11g—vastly increasing scalability, performance, and security.

As an embedded component of the RDBMS, ORE eliminates R’s memory constraints since it can work on data directly in the database. R users can also execute R scripts in Oracle Database to support enterprise production applications. R's data.frame results and sophisticated graphics can be delivered through Oracle BI Publisher documents and OBIEE dashboards. Since it’s R, users are also able to leverage the latest contributed open source packages.
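
As a hedged sketch of what this looks like in practice (the table name and connection details are hypothetical), ordinary R functions applied to an ore.frame proxy are translated and executed in the database:

# Hedged sketch: working with a database table through the transparency layer.
library(ORE)
ore.connect(user = "rquser", sid = "orcl", host = "dbhost",
            password = "...", all = TRUE)  # all=TRUE syncs and attaches the schema's tables
dim(SALES_HISTORY)                         # hypothetical table proxy; computed in-database
head(SALES_HISTORY)                        # only a handful of rows travel to the R client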

For data mining, R users not only can build models using any of the algorithms in the CRAN machine learning task view, but also leverage in-database implementations for predictions (e.g., stepwise regression, GLM, SVM), attribute selection, clustering, feature extraction via non-negative matrix factorization, association rules, and anomaly detection.

Oracle R Connector for Hadoop, one of the connectors available for Oracle Big Data Appliance, allows R users to work with the Hadoop Distributed File System (HDFS) and execute MapReduce programs on the Big Data Appliance Hadoop Cluster. R users write mapper and reducer functions in the R language, and invoke MapReduce jobs from the R environment.

We'll be exploring these components and their application in future posts.


 

About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.
