Tuesday May 28, 2013

Converting Existing R Scripts to ORE - Getting Started

Oracle R Enterprise provides a comprehensive, database-centric environment for end-to-end analytical processes in R, with immediate deployment to production environments. This message really resonates with our customers who are interested in executing R functions on database-resident data while seamlessly leveraging Oracle Database as a high-performance computing (HPC) environment. The ability to develop and operationalize R scripts for analytical applications in one step is quite appealing.

One frequently asked question is how to convert existing R code that accesses data in flat files or the database to use Oracle R Enterprise. In this blog post, we look at a few common scenarios and how to begin converting existing R code to use Oracle R Enterprise.

Consider the following scenarios:

Scenario 1: A stand-alone R script that generates its own data and simply returns a result. Data is not obtained from the file system or database. This may result from performing simulations where data is dynamically generated, or perhaps from data accessed via a URL on the internet.

Scenario 2: An R script that loads data from a flat file such as a CSV file, performs some computations in R, and then writes the result back to a file.

Scenario 3: An R script that loads data from a database table, via one of the database connector packages like RODBC, RJDBC, or ROracle, and writes a result back to the database, using SQL statements or package functions.

Scenario 1

A stand-alone R script might normally be run on a user’s desktop, invoked as a cron job, or even via Java to spawn an R engine and retrieve the result, but we’d like to operationalize its execution as part of a database application, invoked from SQL. Here’s a simple script to illustrate the concept of converting such a script to be executed at the database server using ORE’s embedded R execution. The script generates a data.frame with some random columns, performs summary on that data and returns the summary statistics, which are represented as an R table.

# generate data

set.seed(1)

n <- 1000

df <- 3

x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

# perform some analysis

res <- summary(x)

#return the result

res


To convert this to use ORE, create a function with appropriate arguments and body, for example:

myFunction1 <- function(n = 1000, df = 3, seed = 1) {

set.seed(seed)

x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

res <- summary(x)

res

}

Next, load the ORE packages and connect to Oracle Database using the ore.connect function. Setting the all argument to TRUE loads metadata for all the tables and views in that schema. We then store the function in the R script repository and invoke it via ore.doEval.

# load ORE packages and connect to Oracle Database

library(ORE)

ore.connect("schema","sid","hostname","password",port=1521, all=TRUE)

# load function into R script repository

ore.scriptDrop("myFunction-1")

ore.scriptCreate("myFunction-1", myFunction1)

# invoke using embedded R execution at the database server

ore.doEval(FUN.NAME="myFunction-1")

> ore.doEval(FUN.NAME="myFunction-1")
       a                b                  c           
 Min.   :   1.0   Min.   :-3.00805   Min.   : 0.03449  
 1st Qu.: 250.8   1st Qu.:-0.69737   1st Qu.: 1.27386  
 Median : 500.5   Median :-0.03532   Median : 2.36454  
 Mean   : 500.5   Mean   :-0.01165   Mean   : 3.07924  
 3rd Qu.: 750.2   3rd Qu.: 0.68843   3rd Qu.: 4.25994  
 Max.   :1000.0   Max.   : 3.81028   Max.   :17.56720  

Of course, we’re using default values here. To provide different arguments, change the invocation as follows:

ore.doEval(FUN.NAME="myFunction-1", n=500, df=5, seed=2)

> ore.doEval(FUN.NAME="myFunction-1", n=500, df=5, seed=2)
       a               b                  c          
 Min.   :  1.0   Min.   :-2.72182   Min.   : 0.1621  
 1st Qu.:125.8   1st Qu.:-0.65346   1st Qu.: 2.6144  
 Median :250.5   Median : 0.04392   Median : 4.4592  
 Mean   :250.5   Mean   : 0.06169   Mean   : 5.0386  
 3rd Qu.:375.2   3rd Qu.: 0.79096   3rd Qu.: 6.8467  
 Max.   :500.0   Max.   : 2.88842   Max.   :17.0367  

Having successfully invoked this from the R client (my laptop), we can now invoke it from SQL. Here, we retrieve the summary result, which is an R table, as an XML string.

select *

from table(rqEval( NULL,'XML','myFunction-1'));

The result can be viewed from SQL Developer.

The following shows the XML output in a more structured manner.


What if we wanted the result to appear as a SQL table? Since the current result is an R table (an R object), we need to convert it to a data.frame to return it. We’ll make a few modifications to “myFunction-1” above. Most notable is the need to convert the table object res to a data.frame. There are a variety of ways to do this.

myFunction2 <- function(n = 1000, df = 3, seed = 1) {

# generate data

set.seed(seed)

x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

# perform some analysis

res <- summary(x)

# convert the table result to a data.frame

res.df <- as.matrix(res)

res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))

names(res.sum) <- c('a','b','c')

res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")

res.sum <- res.sum[,c(4,1:3)]

res.sum

}

# load function into R script repository

ore.scriptDrop("myFunction-2")

ore.scriptCreate("myFunction-2", myFunction2)

We’ll now modify the SQL statement to specify the format of the result.

select *

from table(rqEval( NULL,'select cast(''a'' as VARCHAR2(12)) as "statname",

1 "a", 1 "b", 1 "c" from dual ','myFunction-2'));

Here’s the result as viewed from SQL Developer.


This type of result could be incorporated into any SQL application accepting table or view input from a SQL query. That is particularly useful in combination with OBIEE dashboards via an RPD.

Scenario 2

If you’ve been loading data from a flat file, perhaps a CSV file, your R code may look like the following, which builds a clustering model and writes that model to a file for future use, perhaps in scoring. It also generates a graph of the clusters highlighting the individual points, colored by their cluster id, with the centroids indicated by a star.

# read data

setwd("D:/datasets")

dat <- read.csv("myDataFile.csv")

# build a clustering model

cl <- kmeans(dat, 2)

# write model to file

save(cl, file="myClusterModel.dat")

# create a graph and write it to a file

pdf("myGraphFile.pdf")

plot(dat, col = cl$cluster)

points(cl$centers, col = 1:2, pch = 8, cex=2)

dev.off()

The resulting PDF file contains the following image.


To convert this script for use in ORE, there are several options. We’ll explore two: the first involving minimal change to use embedded R execution, and the second leveraging in-database techniques. First, we’ll want the data we used above in variable dat to be loaded into the database.

# create a row id to enable ordered results (if a key doesn’t already exist)

dat$ID <- 1:nrow(dat)

# remove the table if it exists

ore.drop("MY_DATA")

# create the table using the R data.frame, resulting in an ore.frame named MY_DATA

ore.create(dat,"MY_DATA")

# assign the ID column as the row.names of the ore.frame

row.names(MY_DATA) <- MY_DATA$ID

In the first example, we’ll use embedded R execution and pass the data to the function via ore.tableApply. We’ll generate the graph, but simply display it within the function to allow embedded R execution to return the graph as a result. (Note we could also write the graph to a file in any directory accessible to the database server.) Instead of writing the model to a file, which requires keeping track of its location, as well as worrying about backup and recovery, we store the model in the database R datastore using ore.save. All this requires minimal change. As above, we could store the function in the R script repository and invoke it by name – both from R and SQL. In this example, we simply provide the function itself as an argument.

myClusterFunction1 <- function(x) {

cl <- kmeans(x, 2)

ore.save(cl, name="myClusterModel",overwrite=TRUE)

plot(x, col = cl$cluster)

points(cl$centers, col = 1:2, pch = 8, cex=2)

TRUE

}

ore.tableApply(MY_DATA[,c('x','y')], myClusterFunction1,

ore.connect=TRUE,ore.png.height=700,ore.png.width=700)

The ore.tableApply function projects the x and y columns of MY_DATA as input and also specifies ore.connect as TRUE since we are using the R datastore, which requires a database connection. Optionally, we can specify control arguments to the PNG output. In this example, these are the height and width of the image.
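When the model is needed later, perhaps in a new R session or inside another embedded R function, it can be restored from the datastore by name using ore.load. Here is a minimal sketch, assuming a connected session and the datastore name used above:

# restore the k-means model saved above from the database R datastore
ore.load(name="myClusterModel")   # recreates the object cl in the current session
cl$centers                        # inspect the cluster centroids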

For the second example, we convert this to leverage the ORE Transparency Layer. We’ll use the in-database K-Means algorithm and save the model in a datastore named “myClusterModel”, as we did above. Since ore.odmKMeans doesn’t automatically assign cluster ids (the data may be very large, or the assignments may not be required), the scoring is done separately. Note, however, that the prediction results also exist in the database as an ore.frame. To ensure ordering, we also assign row.names to the ore.frame pred. Lastly, we create the plot. Coloring the points requires pulling the cluster assignments; however, the points themselves can be accessed from the ore.frame. The centroid points are obtained from the cl$centers2 component of the cluster model.

# build a clustering model in-database

cl <- ore.odmKMeans(~., MY_DATA, 2, auto.data.prep=FALSE)

# save model in database R datastore

ore.save(cl,name="myClusterModel",overwrite=TRUE)

# generate predictions to assign each row a cluster id, supplement with original data

pred <- predict(cl,MY_DATA,supp=c('x','y','ID'),type="class")

# assign row names to ensure ordering of results

row.names(pred) <- pred$ID

# create the graph

plot(pred[,c('x','y')], col = ore.pull(pred$CLUSTER_ID))

points(cl$centers2[,c('x','y')], col = c(2,3), pch = 8, cex=2)

We can also combine using the transparency layer within an embedded R function. But we’ll leave that as an exercise to the reader.

Scenario 3

In this last scenario, the data already exists in the database, and one of the database interface packages, such as RODBC, RJDBC, or ROracle, is used to retrieve data from and write data to the database. We’ll illustrate this with ROracle, but the same approach holds for the other two packages.

# connect to the database using ROracle

library(ROracle)

drv <- dbDriver("Oracle")

con <- dbConnect(drv, "mySchema", "myPassword")

# retrieve the data specifying a SQL query

dat <- dbGetQuery(con, 'select * from MY_RANDOM_DATA where "a" > 100')

# perform some analysis

res <- summary(dat)

# convert the table result to a data.frame for output as table

res.df <- as.matrix(res)

res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))

names(res.sum) <- c('a','b','c')

res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")

res.sum <- res.sum[,c(4,1:3)]

res.sum

dbWriteTable(con, "SUMMARY_STATS", res.sum)

Converting this to ORE is straightforward. We’re already connected to the database using ore.connect from the previous scenarios, so the existing table MY_RANDOM_DATA was already loaded in the environment as an ore.frame. Executing ore.ls shows this table in the result, so we can just start using it.

> ore.ls(pattern="MY_RAND")

[1] "MY_RANDOM_DATA"

# no need to retrieve the data, use the transparency layer to compute summary

res <- with(MY_RANDOM_DATA, summary(MY_RANDOM_DATA[a > 100, ]))

# convert the table result to a data.frame for output as table

res.df <- as.matrix(res)

res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))

names(res.sum) <- c('a','b','c')

res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")

res.sum <- res.sum[,c(4,1:3)]

# create the database table

ore.create(res.sum, "SUMMARY_STATS")

SUMMARY_STATS


As we did in previous scenarios, this script can also be wrapped in a function and used in embedded R execution. This too is left as an exercise to the reader.

Summary

As you can see from the three scenarios discussed here, converting a script that accesses no external data, accesses and manipulates file data, or accesses and manipulates database data can be accomplished with a few strategic modifications. More involved scripts, of course, may require additional manipulation. For example, if the SQL query performs complex joins and filtering, along with derived column creation, the user may want to convert this SQL to the corresponding ORE Transparency Layer code, thereby eliminating reliance on SQL. But that’s a topic for another post.

Friday May 24, 2013

HOWTO: X11 Forwarding for Oracle R Enterprise

Oracle R Enterprise enables users to generate R graphs at the database server and return them in a variety of ways: an XML representation using base 64 encoding of the PNG images, in a table with a BLOB column containing the PNG images, and interactively returning the actual image to the R user at the client. This last case allows users to generate images at the database server machine and have the actual PNG image display at the user’s client R engine.

To take advantage of this capability, users may need to ensure their X11 environment is properly configured. This blog highlights a solution to a common problem involving X11. When using a graphics-based function in Oracle R Enterprise, if you’ve encountered errors such as:

Error in X11(paste("png::", filename, sep = ""), width, height, pointsize,
unable to start device PNG
then read on. The issue is likely that your database server is not configured to run graphics programs locally. 

The X Window system, or X11, allows you to forward a program display from a remote system to a local computer. X11 is the native windowing interface on Linux. However, X11 is not the default for all Unix operating systems, and additional configuration steps may be required to display graphical programs if your server is running Unix. Follow the instructions below to configure your server to forward the graphics display to your local client machine.


X11 forwarding from a Linux client

There are two options presented here. The first uses SSH and the second uses telnet.

Option 1: Usually, when you want to connect to your Unix server from a remote Linux client, you use SSH (Secure Shell). Before logging in to your Unix server, confirm that /etc/ssh/sshd_config contains the following X11 tunneling options:

X11forwarding yes

X11DisplayOffset 10
X11UseLocalhost yes

SSH allows you to make a secure terminal connection to your Unix server from your Linux client using this syntax:

ssh -Y <userid@unixserver>

The -Y option to ssh treats the Unix server as trusted; the -X option treats it as untrusted. Check with your server or network admin about which flag to use. This command also sets the remote DISPLAY to localhost:10.0.

Option 2: Connecting to the server via telnet

If you choose to connect to the server using telnet, keep in mind that unlike SSH, telnet does not offer security measures to protect against users with malicious intent. With telnet, the X11 server must be set manually to a Linux client that is capable of graphical display. To confirm the graphical capability, verify that a terminal window appears after entering the following at the Linux prompt:

xterm

You also need to know the DISPLAY environment variable setting on the X11 server, that is, the Linux client:

echo $DISPLAY

The DISPLAY environment variable stores the displaynumber and screennumber that the X11 server uses to display. These addresses are in the form:

localhost:displaynumber.screennumber

A typical example would be:

localhost:0.0

Next, enable the Unix server to display on the Linux client:

xhost + <unixserver>

Then telnet into the Unix server and set its DISPLAY to the X11 server, the Linux client:

export DISPLAY=X11server:displaynumber.screennumber

After following the steps for either option, you should now be able to launch a remote graphical application locally. As a quick check, launch your remote Unix server's clock on your client desktop using X11 forwarding:

1. Type xclock at the Unix server command prompt and hit enter.
2. Your remote server’s X11 GUI clock should appear on your client desktop.

3. If the xclock test succeeds, launch the ORE client to verify the same DISPLAY setting is used by embedded R:

R> ore.connect(user="<username>", sid="<sid>", host="<hostname>", password="<password>", all=TRUE)
R> ore.is.connected()

TRUE
R> ore.doEval(function() Sys.getenv("DISPLAY"), ore.graphics=FALSE)

If the last returned value matches the DISPLAY setting, you will be able to display images at the client machine.

X11 forwarding from a Windows client

To connect to your remote Unix server from Windows and use its graphical interface, you need two pieces of software: an SSH program to establish the remote connection and an X Server to handle the local display. For the SSH program we'll use PuTTY. For the server, we'll use Xming.

PuTTY is a free SSH client that allows you to connect to a remote Linux computer and use the command line. PuTTY can also be used to forward secure data over SSH to other programs - this is called tunneling.

When you connect to your remote Linux computer, you will need to set several connection settings to make everything work correctly. PuTTY lets you save these settings in a session so you can reuse them the next time you connect. To create a session that allows PuTTY to forward your Linux computer's X11 graphical interface over SSH:

1. Open PuTTY on your Windows desktop. PuTTY will open and display the Session panel. In the Host Name field, type the hostname or IP address of your Unix server.

2. In the field underneath the Saved Sessions label, type a name for your saved session.


3. Under the Connection category, expand SSH and choose X11. Click the Enable X11 Forwarding checkbox.




4. Go back to the Session category and click Save to save your session connection settings.


 Now we're ready to set up the X server using Xming, which is a free X Window server for the Windows desktop. With Xming, you can display graphical applications from your remote Linux computer on your Windows desktop. Xming provides a simple utility called Xlaunch that allows you to configure Xming easily, and also save your configuration for future use. To run Xming, open XLaunch and select the configuration outlined here:

1. Open XLaunch from the program menu. Select Multiple Windows and click Next. This tells Xming to open
each remote Linux application in a new window.



2. Select Start No Client and click Next. This tells Xming to launch and wait for commands from
another program (like PuTTY).





3. Make sure that Clipboard is selected and click Next. This tells Xming to enable your remote
Linux applications to share a unified clipboard.



4. Click the Finish button to launch Xming.




Now that Xming is running, you can open your PuTTY session and launch a graphical application. As a quick check, launch your remote Unix server’s clock on your Windows desktop through the SSH connection using X11 forwarding:

1. Open PuTTY.
2. Double-click on the saved session you created earlier. PuTTY will create an SSH connection to your remote Unix server.
3. Login to your Unix server.
4. Type xclock at the command prompt and hit enter.
5. Your remote server’s GUI clock should appear on your client desktop.

6. If the xclock test succeeds, launch the ORE client to verify the same DISPLAY setting is used by embedded R:

R> ore.connect(user="<username>", sid="<sid>", host="<hostname>", password="<password>", all=TRUE)
R> ore.is.connected()

TRUE
R> ore.doEval(function() Sys.getenv("DISPLAY"), ore.graphics=FALSE)

If the last returned value matches the DISPLAY setting, you will be able to display images at the client machine.  Here's an example that creates a panel plot using the R dataset mtcars:

ore.doEval(function() {
  library(lattice)
  # lattice plots must be explicitly print()ed to render from inside a function
  print(xyplot(mpg ~ hp | factor(cyl),
               data=mtcars,
               type=c("p", "r"),
               main="Fuel economy vs. Performance with Number of Cylinders",
               xlab="Performance (horse power)",
               ylab="Fuel economy (miles per gallon)",
               scales=list(cex=0.75)))
})


Wednesday May 22, 2013

Big Data Analytics in R – the tORCH has been lit!

This guest post from Anand Srinivasan compares performance of the Oracle R Connector for Hadoop with the R {parallel} package for covariance matrix computation, sampling, and parallel linear model fitting. 

Oracle R Connector for Hadoop (ORCH) is a collection of R packages that enables Big Data analytics from the R environment. It enables a data scientist/analyst to work on data straddling multiple data platforms (HDFS, Hive, Oracle Database, local files) from the comfort of the R environment and benefit from the R ecosystem.

ORCH provides:

1) Out of the box predictive analytic techniques for linear regression, neural networks for prediction, matrix completion using low rank matrix factorization, non-negative matrix factorization, k-means clustering, principal components analysis, and multivariate analysis. While all these techniques have R interfaces, they are implemented either in Java or in R as distributed parallel implementations leveraging all nodes of your Hadoop cluster.

2) A general framework, where a user can use the R language to write custom logic executable in a distributed parallel manner using available compute and storage resources.

The main idea behind the ORCH architecture and its approach to Big Data analytics is to leverage the Hadoop infrastructure and thereby inherit all its advantages.

The crux of ORCH is read parallelization and robust methods over parallelized data. Efficient parallelization of reads is the single most important step necessary for Big Data Analytics because it is either expensive or impractical to load all available data in a single thread.

ORCH is often compared/contrasted with the other options available in R, in particular the popular open source R package called parallel. The parallel package provides a low-level infrastructure for “coarse-grained” distributed and parallel computation. While it is fairly general, it tends to encourage an approach that is based on using the aggregate RAM in the cluster as opposed to using the file system. Specifically, it lacks a data management component, a task management component and an administrative interface for monitoring. Programming, however, follows the broad Map Reduce paradigm.

 In the rest of this article, we assume that the reader has basic familiarity with the parallel package and proceed to compare ORCH and its approach with the parallel package. The goal of this comparison is to explain what it takes for a user to build a solution for their requirement using each of these technologies and also to understand the performance characteristics of these solutions.

We do this comparison using three concrete use cases – covariance matrix computation, sampling and partitioned linear model fitting. The exercise is designed to be repeatable, so you, the reader, can try this “at home”. We will demonstrate that ORCH is functionally and performance-wise superior to the available alternative of using R’s parallel package.

A six node Oracle Big Data Appliance v2.1.1 cluster is used in the experiments. Each node in this test environment has 48GB RAM and 24 CPU cores.

Covariance Matrix Computation

Computing covariance matrices is one of the most fundamental of statistical techniques.

In this use case, we have a single input file, “allnumeric_200col_10GB” (see appendix on how to generate this data set), that is about 10GB in size and has a data matrix with about 3 million rows and 200 columns. The requirement is to compute the covariance matrix of this input matrix.

Since a single node in the test environment has 48GB RAM and the input file is only 10GB, we start with the approach of loading the entire file into memory and then computing the covariance matrix using R’s cov function.

> system.time(m <- matrix(scan(file="/tmp/allnumeric_200col_10GB",what=0.0, sep=","), ncol=200, byrow=TRUE))

Read 611200000 items

user system elapsed

683.159 17.023 712.527

> system.time(res <- cov(m))

user system elapsed

561.627 0.009 563.044

We observe that the loading of data takes 712 seconds (vs. 563 seconds for the actual covariance computation) and dominates the cost. It would be even more pronounced (relative to the total elapsed time) if the cov(m) computation were parallelized using mclapply from the parallel package.

Based on this, we see that for an efficient parallel solution, the main requirement is to parallelize the data loading. This requires that the single input file be split into multiple smaller-sized files. The parallel package does not offer any data management facilities; hence this step has to be performed manually using a Linux command like split. Since there are 24 CPU cores, we split the input file into 24 smaller files.

time(split -l 127334 /tmp/allnumeric_200col_10GB)

real 0m54.343s

user 0m3.598s

sys 0m24.233s

Now, we can run the R script:

library(parallel)

# Read the data

readInput <- function(id) {

infile <- file.path("/home/oracle/anasrini/cov",paste("p",id,sep=""))

print(infile)

m <- matrix(scan(file=infile, what=0.0, sep=","), ncol=200, byrow=TRUE)

m

}

# Main MAPPER function

compCov <- function(id) {

m <- readInput(id)  # read the input

cs <- colSums(m)    # compute col sums, num rows

# compute main cov portion

nr <- nrow(m)      

mtm <- crossprod(m)

list(mat=mtm, colsum=cs, nrow=nr)

}

numfiles <- 24

numCores <- 24

# Map step

system.time(mapres <- mclapply(seq_len(numfiles), compCov, mc.cores=numCores))

# Reduce step

system.time(xy <- Reduce("+", lapply(mapres, function(x) x$mat)))

system.time(csf <- Reduce("+", lapply(mapres, function(x) x$colsum)))

system.time(nrf <- Reduce("+", lapply(mapres, function(x) x$nrow)))
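# Combine the partial results into the covariance matrix:
#   cov(X) = (X'X - (colsums %*% t(colsums)) / n) / (n - 1)
# where xy accumulates X'X, csf the column sums, and nrf the total row count.
# The m1 + m2 - m3 expression below simplifies to xy/(nrf-1) - sts/(nrf*(nrf-1)).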

sts <- csf %*% t(csf)

m1 <- xy / (nrf -1)

m2 <- sts / (nrf * (nrf-1))

m3 <- 2 * sts / (nrf * (nrf-1))

covmat <- m1 + m2 - m3

user system elapsed

1661.196 21.209 77.781

We observe that the elapsed time (excluding time to split the files) has now come down to 77 seconds. However, it took 54 seconds for splitting the input file into smaller files, making it a significant portion of the total elapsed time of 77+54 = 131 seconds.

Besides impacting performance, there are a number of more serious problems with having to deal with data management manually. We list a few of them here:

1) In other scenarios, with larger files or larger number of chunks, placement of chunks also becomes a factor that influences I/O parallelism. Optimal placement of chunks of data over the available set of disks is a non-trivial problem

2) Requirement of root access – Optimal placement of file chunks on different disks often requires root access. For example, only root has permissions to create files on disks corresponding to the File Systems mounted on /u03, /u04 etc on an Oracle Big Data Appliance node

3) When multiple nodes are involved in the computation, moving fragments of the original data into different nodes manually can drain productivity

4) This form of split can only work in a static environment – in a real-world dynamic environment, information about other workloads and their resource utilization cannot be factored in a practical manner by a human

5) Requires admin to provide user access to all nodes of the cluster in order to allow the user to move data to different nodes

ORCH-based solution

On the other hand, using ORCH, we can directly use the out of the box support for multivariate analysis. Further, no manual steps related to data management (like splitting files and addressing chunk placement issues) are required since Hadoop (specifically HDFS) handles all those requirements seamlessly.

>x <- hdfs.attach("allnumeric_200col_10GB")

> system.time(res <- orch.cov(x))

user system elapsed

18.179 3.991 85.640

Forty-two concurrent map tasks were involved in the computation above as determined by Hadoop.

To conclude, we can see the following advantages of the ORCH based approach in this scenario:

1) No manual steps. Data Management completely handled transparently by HDFS

2) Out of the box support for cov. The distributed parallel algorithm is available out of the box and the user does not have to work it out from scratch

3) Using ORCH we get comparable performance to that obtained through manual coding without any of the manual overheads

Sampling

We use the same single input file, “allnumeric_200col_10GB” in this case as well. The requirement is to obtain a uniform random sample from the input data set. The size of the sample required is specified as a percentage of the input data set size.

Once again for the solution using the parallel package, the input file has to be split into smaller sized files for better read parallelism.

library(parallel)

# Read the data

readInput <- function(id) {

infile <- file.path("/home/oracle/anasrini/cov", paste("p",id,sep=""))

print(infile)

system.time(m <- matrix(scan(file=infile, what=0.0, sep=","),

ncol=200, byrow=TRUE))

m

}

# Main MAPPER function

samplemap <- function(id, percent) {

m <- readInput(id)    # read the input

v <- runif(nrow(m))   # Generate runif

# Pick only those rows where random < percent*0.01

keep <- which(v < percent*0.01)

m1 <- m[keep,,drop=FALSE]

m1

}

numfiles <- 24

numCores <- 24

# Map step

percent <- 0.001

system.time(mapres <- mclapply(seq_len(numfiles), samplemap, percent,

mc.cores=numCores))

user system elapsed

1112.998 23.196 49.561
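Note that the map step above only produces a list of per-chunk samples; assembling the final sample still requires a manual reduce step, along these lines:

# combine the per-chunk samples into a single matrix
samp <- do.call(rbind, mapres)
nrow(samp)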

ORCH based solution

>x <- hdfs.attach("allnumeric_200col_10GB_single")

>system.time(res <- orch.sample(x, percent=0.001))

user system elapsed

8.173 0.704 33.590

The ORCH based solution out-performs the solution based on the parallel package. This is because orch.sample is implemented in Java and the read rates obtained by a Java implementation are superior to what can be achieved in R.
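Because the resulting sample is small relative to the 10GB input, it can then be pulled from HDFS into local R memory for further exploration. A minimal sketch, assuming hdfs.get returns the sampled rows as an in-memory data.frame:

# copy the sampled rows from HDFS into local R memory
# (assumes the sample is small enough to fit in client RAM)
samp <- hdfs.get(res)
dim(samp)
summary(samp)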

Partitioned Linear Model Fitting

Partitioned Linear Model Fitting is a very popular use case. The requirement here is to fit separate linear models, one for each partition of the data. The data itself is partitioned based on a user-specified partitioning key.

For example, using the ONTIME data set, the user could specify destination city as the partitioning key, indicating the requirement for separate linear models (with, for example, ArrDelay as target), one per destination city.

ORCH based solution

dfs_res <- hadoop.run(

data = input,

mapper = function(k, v) { orch.keyvals(v$Dest, v) },

reducer = function(k, v) {

lm_x <- lm(ArrDelay ~ DepDelay + Distance, v)

orch.keyval(k, orch.pack(model=lm_x, count = nrow(v)))

},

# mapOut is a data.frame describing the mapper output structure (defined elsewhere)

config = new("mapred.config",

job.name = "ORCH Partitioned lm by Destination City",

map.output = mapOut,

mapred.pristine = TRUE,

reduce.output = data.frame(key="", model="packed")

)

)

Notice that the Map Reduce framework is performing the partitioning. The mapper just picks out the partitioning key and the Map Reduce framework handles the rest. The linear model for each partition is then fitted in the reducer.

parallel based solution

As in the previous use cases, for good read parallelism, the single input file needs to be split into smaller files. However, unlike the previous use cases, there is a twist here.

We noted that with the ORCH based solution it is the Map Reduce framework that does the actual partitioning. There is no such out of the box feature available with a parallel package-based solution. There are two options:

1) Break up the file arbitrarily into smaller pieces for better read parallelism. Implement your own partitioning logic mimicking what the Map Reduce framework provides. Then fit linear models on each of these partitions in parallel.

OR

2) Break the file into smaller pieces such that each piece is a separate partition. Fit linear models on each of these partitions in parallel 

Neither of these options is easy, and both require significant user effort; the custom coding needed to achieve parallel reads is substantial. A rough sketch of the second option is shown below.
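For illustration, here is a minimal sketch of the second option using the parallel package. The directory layout (one pre-split CSV file per destination city) and the path are assumptions for illustration; producing that layout is itself the manual data management work that ORCH avoids.

library(parallel)

# assume the ONTIME data has already been split manually into one CSV file
# per destination city under this (hypothetical) directory
files <- list.files("/home/oracle/ontime_by_dest", pattern="\\.csv$", full.names=TRUE)

# fit one linear model per partition, mirroring the reducer in the ORCH solution
fitPartition <- function(f) {
  d <- read.csv(f)
  lm(ArrDelay ~ DepDelay + Distance, data=d)
}

models <- mclapply(files, fitPartition, mc.cores=24)
names(models) <- sub("\\.csv$", "", basename(files))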

Conclusion

ORCH provides a holistic approach to Big Data Analytics in the R environment. By leveraging the Hadoop infrastructure, ORCH inherits several key components that are all required to address real world analytics requirements.

The rich set of out-of-the-box predictive analytic techniques along with the possibility of authoring custom parallel distributed analytics using the framework (as demonstrated in the partitioned linear model fitting case) helps simplify the user’s task while meeting the performance and scalability requirements. 

Appendix – Data Generation

We show the steps required to generate the single input file “allnumeric_200col_10GB”.

Run the following in R:

x <- orch.datagen(datasize=10*1024*1024*1024, numeric.col.count=200,

map.degree=40)

hdfs.mv(x, "allnumeric_200col_10GB")

Then, from the Linux shell:

hdfs dfs -rm -r -skipTrash /user/oracle/allnumeric_200col_10GB/__ORCHMETA__

hdfs dfs -getmerge /user/oracle/allnumeric_200col_10GB /tmp/allnumeric_200col_10GB


Monday May 06, 2013

Oracle R Distribution for R 2.15.2 available on public-yum

Oracle R Distribution (ORD) for R 2.15.2 on Linux is now available for download from Oracle's public-yum repository.  R 2.15.2 is a maintenance update that includes improved performance and reduced memory usage for some commonly-used functions, increased memory available for data on 64-bit systems, enhanced localization for Polish language users, and a number of bug fixes.  Detailed updates can be found in the NEWS file - see the section 'CHANGES IN R VERSION 2.15.2'.

The most recent update to Oracle R Enterprise, version 1.3.1, is certified with both R 2.15.1 and R 2.15.2. Installing ORD from public-yum will pull the most recently posted version, R 2.15.2.  For example, on Oracle Linux 5, as root, cd to yum.repos.d, download the public yum repository configuration file, and enable the required repositories:

    cd /etc/yum.repos.d
    wget http://public-yum.oracle.com/public-yum-el5.repo
    Edit file public-yum-el5.repo and set
        "enabled=1" for [el5_addons]
        "enabled=1" for [el5_latest]

Next, install ORD:

    yum install R

Start R.  Oracle R Distribution for R 2.15.2 is installed.



To install an older version of ORD such as R 2.15.1, simply specify the R version at the install step:

    yum install R-2.15.1

Detailed instructions for installing Oracle R Distribution are in the Oracle R Enterprise Installation and Administration Guide.  Oracle R Distribution for R 2.15.2 on AIX, Solaris X86 and Solaris SPARC will be available on Oracle's Free and Open Source Software portal in the coming weeks.

About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.
