Thursday Mar 20, 2014

ROracle 1-1.11 released - binaries for Windows and other platforms available on OTN


We are pleased to announce the latest update of the open source ROracle package, version 1-1.11, with enhancements and bug fixes. ROracle provides high performance and scalable interaction from R with Oracle Database. In addition to availability on CRAN, ROracle binaries specific to Windows and other platforms can be downloaded from the Oracle Technology Network. Users of ROracle, please take our brief survey. We want to hear from you!

Latest enhancements in version 1-1.11 of ROracle:

• Performance enhancements for RAW data types and large result sets
• Ability to cache the result set in memory to reduce memory consumption on successive reads
• Added session mode to connect as SYSDBA or using external authentication
• bug 17383542: Enhanced dbWritetable() & dbRemoveTable() to work on global schema

Users of ROracle are quite pleased with the performance and functionality:


"In my position as a quantitative researcher, I regularly analyze database data up to a gigabyte in size on client-side R engines. I switched to ROracle from RJDBC because the performance of ROracle is vastly superior, especially when writing large tables. I've also come to depend on ROracle for transactional support, pulling data to my R client, and general scalability. I have been very satisfied with the support from Oracle -- their response has been prompt, friendly and knowledgeable."

           -- Antonio Daggett, Quantitative Researcher in Finance Industry


"Having used ROracle for over a year now with our Oracle Database data, I've come to rely on ROracle for high performance read/write of large data sets (greater than 100 GB), and SQL execution with transactional support for building predictive models in R. We tried RODBC but found ROracle to be faster, much more stable, and scalable."

           -- Dr. Robert Musk, Senior Forest Biometrician, Forestry Tasmania


See the ROracle NEWS for the complete list of updates.

We encourage ROracle users to post questions and provide feedback on the Oracle R Technology Forum.

In addition to being a high performance database interface to Oracle Database from R for general use, ROracle supports database access for Oracle R Enterprise.

Thursday Feb 13, 2014

Monitoring progress of embedded R functions

When you run R functions in the database, especially functions involving multiple R engines in parallel, you can monitor their progress using the Oracle R Enterprise datastore as a central location for progress notifications, or any intermediate status or results. In the following example, based on ore.groupApply, we illustrate instrumenting a simple function that builds a linear model to predict flight arrival delay based on a few other variables.

In the function modelBuildWithStatus, the function verifies that there are rows for building the model after eliminating incomplete cases supplied in argument dat. If not empty, the function builds a model and reports “success”, otherwise, it reports “no data.” It’s likely that the user would like to use this model in some way or save it in a datastore for future use, but for this example, we just build the model and discard it, validating that a model can be built on the data.


modelBuildWithStatus <-
  function(dat) {
    dat <- dat[complete.cases(dat),]
    if (nrow(dat)>0L) {
      mod <- lm(ARRDELAY ~ DISTANCE + AIRTIME + DEPDELAY, dat);
      "success"
    } else
      "no_data"
    }

When we invoke this using ore.groupApply, the goal is to build one model per “unique carrier” or airline. Using an ORE 1.4 feature, we specify the degree of parallelism using the parallel argument, setting it to 2.


res <- ore.groupApply(ONTIME_S[, c("UNIQUECARRIER","DISTANCE", "ARRDELAY", "DEPDELAY", "AIRTIME")],
        ONTIME_S$UNIQUECARRIER,
        modelBuildWithStatus,
        parallel=2L)

res.local<-ore.pull(res)
res.local[unlist(res.local)=="no_data"]

The result tells us about the status of each execution. Below, we print the unique carries that had no data.


R> res.local<-ore.pull(res)
R> res.local[unlist(res.local)=="no_data"]
$EA
[1] "no_data"

$`ML(1)`
[1] "no_data"

$`PA(1)`
[1] "no_data"

$PI
[1] "no_data"

$PS
[1] "no_data"

To monitor the progress of each execution, we can identify the group of data being processed in each function invocation using the value from the UNIQUECARRIER column. For this particular data set, we use the first two characters of the carrier’s symbol appended to “group.” to form a unique object name for storing in the datastore identified by job.name. (If we don’t do this, the value will form an invalid object name.) Note that since the UNIQUECARRIER column contains uniform data, we need only the first value.

The general idea for monitoring progress is to save an object in the datastore named for each execution of the function on a group. We can then list the contents of the named datastore and compute a percentage complete, which is discussed later in this post. For the “success” case, we assign the value “SUCCESS” to the variable named by the string in nm that we created earlier. Using ore.save, this uniquely named object is stored in the datastore with the name in job.name. We use the append=TRUE flag to indicate that the various function executions will be sharing the same named datastore.
If there is no data left in dat, we assign “NO DATA” to the variable named in nm and save that. Notice in both cases, we’re still returning “success” or “no data” so these come back in the list returned by ore.groupApply. However, we can return other values instead, e.g., the model produced.


modelBuildWithMonitoring <-
  function(dat, job.name) {
  nm <- paste("group.", substr(as.character(dat$UNIQUECARRIER[1L]),1,2), sep="")
  dat <- dat[complete.cases(dat),]
  if (nrow(dat)>0L) {
    mod <- lm(ARRDELAY ~ DISTANCE + AIRTIME + DEPDELAY, dat);
    assign(nm, "SUCCESS")
    ore.save(list=nm, name=job.name, append=TRUE)
    "success"
  } else {
    assign(nm, "NO DATA")
    ore.save(list=nm, name=job.name, append=TRUE)
    "no data"
  }
}

When we use this function in ore.groupApply, we provide the job.name and ore.connect arguments as well. The variable ore.connect must be set to TRUE in order to use the datastore. As the ore.groupApply executes, the datastore named by job.name will be increasingly getting objects added with the name of the carrier. First, delete the datastore named “job1”, if it exists.


ore.delete(name="job1")

res <- ore.groupApply(ONTIME_S[, c("UNIQUECARRIER","DISTANCE", "ARRDELAY", "DEPDELAY", "AIRTIME")],
        ONTIME_S$UNIQUECARRIER,
        modelBuildWithMonitoring,
        job.name="job1", parallel=2L, ore.connect=TRUE)

To see the progress during execution, we can use the following function, which takes a job name and the cardinality of the INDEX column to determine the percent complete. This function is invoked in a separate R engine connected to the same schema. If the job name is found, we print the percent complete, otherwise stop with an error message.


check.progress <- function(job.name, total.groups) {
  if ( job.name %in% ore.datastore()$datastore.name )
    print(sprintf("%.1f%%", nrow(ore.datastoreSummary(name=job.name))/total.groups*100L))
  else
    stop(paste("Job", job.name, " does not exist"))
}

To invoke this, compute the total number of groups and provide this and the job name to the function check.progress.
total.groups <- length(unique(ONTIME_S$UNIQUECARRIER))
check.progress("job1",total.groups)

However, we really want a loop to report on the progress automatically. One simple approach is to set up a while loop with a sleep delay. When we reach 100%, stop. To be self-contained, we include a simplification of the function above as a local function.


check.progress.loop <- function(job.name, total.groups, sleep.time=2) {
  check.progress <- function(job.name, total.groups) {
    if ( job.name %in% ore.datastore()$datastore.name )
      print(sprintf("%.1f%%", nrow(ore.datastoreSummary(name=job.name))/total.groups*100L))
    else
      paste("Job", job.name, " does not exist")
  }
  while(1) {
    try(x <- check.progress(job.name,total.groups))
    Sys.sleep(sleep.time)
    if(x=="100.0%") break
  }
}

As before, this function is invoked in a separate R engine connected to the same schema.


check.progress.loop("job1",total.groups)

Looking at the results, we can see the progress reported at one second intervals. Since the models build quickly, it doesn’t take long to reach 100%. For functions that take longer to execute or where there are more groups to process, you may choose a longer sleep time. Following this, we look at the datastore “job1” using ore.datastore and its contents using ore.datastoreSummary.


R> check.progress.loop("job1",total.groups,sleep.time=1)
[1] "6.9%"
[1] "96.6%"
[1] "100.0%"

R> ore.datastore(name="job1")
  datastore.name object.count size      creation.date description
1 job1 29 1073 2014-02-13 22:03:20
R> ore.datastoreSummary(name="job1")
object.name class size length row.count col.count
1 group.9E character 37 1 NA NA
2 group.AA character 37 1 NA NA
3 group.AQ character 37 1 NA NA
4 group.AS character 37 1 NA NA
5 group.B6 character 37 1 NA NA
6 group.CO character 37 1 NA NA
7 group.DH character 37 1 NA NA
8 group.DL character 37 1 NA NA
9 group.EA character 37 1 NA NA
10 group.EV character 37 1 NA NA
11 group.F9 character 37 1 NA NA
12 group.FL character 37 1 NA NA
13 group.HA character 37 1 NA NA
14 group.HP character 37 1 NA NA
15 group.ML character 37 1 NA NA
16 group.MQ character 37 1 NA NA
17 group.NW character 37 1 NA NA
18 group.OH character 37 1 NA NA
19 group.OO character 37 1 NA NA
20 group.PA character 37 1 NA NA
21 group.PI character 37 1 NA NA
22 group.PS character 37 1 NA NA
23 group.TW character 37 1 NA NA
24 group.TZ character 37 1 NA NA
25 group.UA character 37 1 NA NA
26 group.US character 37 1 NA NA
27 group.WN character 37 1 NA NA
28 group.XE character 37 1 NA NA
29 group.YV character 37 1 NA NA

The same basic technique can be used to note progress in any long running or complex embedded R function, e.g., in ore.tableApply or ore.doEval. At various points in the function, sequence-named objects can be added to a datastore. Moreover, the contents of those objects can contain incremental or partial results, or even debug output.

While we’ve focused on the R API for embedded R execution, the same functions could be invoked using the SQL API. However, monitoring would still be done from an interactive R engine.

Thursday Jan 09, 2014

Invoking R scripts via Oracle Database: Theme and Variation, Part 4

In the first three parts of Invoking R scripts via Oracle Database: Theme and Variation, we introduced features of Oracle R Enterprise embedded R execution involving the functions ore.doEval / rqEval, ore.tableApply / rqTableEval, and ore.groupApply / “rqGroupApply”. In this blog post, we’ll cover the next in our theme and variation series involving ore.rowApply and rqRowEval. The “row apply” function is also one of the parallel-enabled embedded R execution functions. It supports data-parallel execution, where one or more R engines perform the same R function, or task, on disjoint chunks of data. This functionality is essential to enable scalable model scoring/predictions on large data sets and for taking advantage of high-performance computing hardware like Exadata.

As for ore.groupApply, Oracle Database handles the management and control of potentially multiple R engines at the database server machine, automatically chunking and passing data to parallel executing R engines. Oracle Database ensures that R function executions for all chunks of rows complete, or the ORE function returns an error. The result from the execution of each user-defined embedded R function is gathered in an ore.list. This list remains in the database until the user requires the result. However, we’ll also show how data.frame results from each execution can be combined into a single ore.frame. This features works for return values of other embedded R functions as well.

The variation on embedded R execution for ore.rowApply involves passing not only an ore.frame to the function such that the first parameter of your embedded R function receives a data.frame, but also the number of rows that should be passed to each invocation of the user-defined R function. The last chunk, of course, may have fewer rows than specified.

Let’s look at an example. We’re going to use the C50 package to score churn data (i.e., predict which customers are likely to churn) using the C5.0 decision tree models we built in the previous blog post with ore.groupApply. (Well, almost. We need to rebuild the models to take into account the full data set levels.) The goal is to score the customers in parallel leveraging the power of a high performance computing platform, such as Exadata.


library(C50)
data(churn)

ore.create(churnTest, "CHURN_TEST")

myFunction <- function(dat, xlevels, datastorePrefix) {
  library(C50)
  state <- dat[1,"state"]
  datastoreName <- paste(datastorePrefix,state,sep="_")
  dat$state <- NULL
  for (j in names(xlevels))
    dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
  ore.load(name=datastoreName)
  res <- data.frame(pred=predict(mod,dat, type="class"),
        actual=dat$churn,
        state=state)
  res
}

xlevels <- ore.getXlevels(~ ., CHURN_TEST[,-1])
scoreList <- ore.groupApply(
  CHURN_TEST,
  INDEX=CHURN_TEST$state,
  myFunction,
  datastorePrefix="myC5.0model3",xlevels=xlevels, ore.connect=TRUE)
score.MA <- ore.pull(scoreList$MA)
table(score.MA$actual, score.MA$pred)

A few points to highlight:

• Instead of computing the levels using the as.factor function inside the user-defined function, we’ll use ore.getXlevels, which returns the levels for each factor column. We don’t need this for the state column, so we exclude it (“-1”). In the previous post we noted that factor data is passed as character columns in the data.frame. Computing the levels first can ensure that all possible levels are provided during model building, even if there are no rows with some of the level values.
• When building models where some levels were missing (due to using as.factor on each partition of data), scoring can fail if the test data has unknown level values. For this reason, the models built in Part 3 need to be rebuilt using the approach above with ore.getXlevels. This is left as an exercise for the reader.
• Assign the function to the variable “myFunction” to facilitate reuse (see below).
• We construct the datastore name to be the same as when we were building the models, i.e., appending the state value to the datastore prefix separated by an ‘_’.
• The for loop iterates over the levels passed in as xlevels, creating a factor using the provided levels and assigning it back to the data.frame.
• Loading the datastore by name, we have access to the variable mod, which contains the model for the particular state.
• The result is constructed as a data.frame with the prediction and the actual values.
• Three arguments are passed: the datastore prefix, the levels that were pre-computed, and that we need to connect to the database because we’re using a datastore.
• The results are stored as a list of ore.frames. We can pull the scores for MA and compute a confusion matrix using table.

This is fine. However, we likely don’t want to have a list of separate ore.frames as the result. We’d prefer to have a single ore.frame with all the results. This can be accomplished using the FUN.VALUE argument. Whenever a data.frame is the result of the user-defined R function, and if the structure of that data.frame is the same across all invocations of the group apply or row apply, you can combine them into a single result by defining the structure as follows:

scores <- ore.groupApply(
  CHURN_TEST,
  INDEX=CHURN_TEST$state,
  myFunction,
  datastorePrefix="myC5.0model3",xlevels=xlevels, ore.connect=TRUE,
  FUN.VALUE=data.frame(pred=character(0),
        actual=character(0),
        state=character(0)));
head(scores)
scores.local <- ore.pull(scores)
table(scores.local[scores.local$state=="MA",c("actual","pred")])

scores.MA <- scores[scores$state=="MA",c("actual","pred")]
table(scores.MA$actual, scores.MA$pred)

A few important points to highlight:

• FUN.VALUE is set to a data.frame that describes the format of the result. By providing this argument, you will get back a single ore.frame, not an ore.list object.
• The group apply completes instantaneously because it is only defining the ore.frame, not actually performing the scoring. Not until the values are needed does the result get computed. We invoke head on the ore.frame in scores to highlight this.
• We can pull the scores to the client to invoke table as before, but subselecting for state MA. However, we can also do this computation in the database using the transparency layer. First, we filter the rows for MA in scores.MA, and then invoke table on the two columns. Note: ORE requires passing the two columns explicitly to the overloaded function table.
• To do this in parallel, add the argument parallel=TRUE to the ore.groupApply call.

Wait! What happened to ore.rowApply?

Above, we showed how to score with multiple models using ore.groupApply. But what if we had customers from a single state that we wanted to score in parallel? We can use ore.rowApply and rqRowEval to invoke a function on chunks of data (rows) at a time, from 1 to the total number of rows. (Note that values closer to the latter will have no benefit from parallelism, obviously.)


scores <- ore.rowApply(
  CHURN_TEST[CHURN_TEST$state=="MA",],
  myFunction,
  datastorePrefix="myC5.0model3",xlevels=xlevels,
  ore.connect=TRUE, parallel=TRUE,
  FUN.VALUE=data.frame(pred=character(0),
        actual=character(0),
        state=character(0)),
  rows=200)
scores
table(scores$actual, scores$pred)

A few points to highlight:

• Since we want to perform the scoring in parallel by state, we filter the rows for MA. This will ensure that all rows processed can use the same predictive model.
• We set the rows argument to 200. CHURN_TEST has 1667 rows, so this will result in nine executions of myFunction. The first eight receiving 200 rows each and the last receiving 67 rows.
• We also set parallel=TRUE above since we want the scoring performed in parallel.
• The invocation of ore.rowApply returns immediately. Not until we print scores do we incur the cost of executing the underlying query. However, also note that each time we access scores, for example in the following call to table, we incur the cost of executing the query. If the result will be used many times in subsequent operations, you may want to create a table with the result using ore.create.

In SQL, we can do the same, but we’ll need to store the function in the R script repository (perhaps called "myScoringFunction") and also store xlevels in a datastore (perhaps called "myXLevels"). While we can pass complex objects in the R interface to embedded R functions, we cannot do that in SQL. Instead, we must pass the name of a datastore. Since the xlevels are in a datastore, the user-defined R function needs to be modified to take this other datastore name and load that datastore to have access to xlevels. This set of changes is left to the reader as an exercise.


select * from table(rqRowEval(
  cursor(select /*+ parallel(t, 4) */ *
        from CHURN_TEST t
        where "state" = 'MA'),
  cursor(select 1 as "ore.connect",
        'myC5.0model3' as "datastorePrefix",
        'myXLevels' as "xlevelsDatastore"
        from dual),
  'select ''aaa'' "pred",''aaa'' "actual" , ''aa'' "state" from dual',
    200, 'myScoringFunction'));

A few points to highlight:

• The input cursor specifies a parallel hint on the input data cursor and filtering data for MA as well.
• Several arguments are being passed, including the new argument to our function myXLevels.
• The output form is specified in the SQL string. Care must be taken to ensure that the column names, ordering, and the length of character strings match the returned data.frame.

Map Reduce

The “row apply” functionality can be thought of in terms of the map-reduce paradigm where the mapper performs the scoring and outputs a data.frame value (no key required). There is no reducer, or the reducer is simply a pass-through.

Memory and performance considerations

Unlike with group apply, the rows argument in row apply ensures an upper bound on the number of rows (and hence memory requirement). The value of rows should be chosen to balance memory and parallel performance. The usual measures can be taken regarding setting memory limits on the R engine – as noted in Part 2.

There may be instances where setting rows = 1 makes sense. For example, if the computation per row is intensive (i.e., takes a long time), sending one row per R engine may be appropriate. Experiment with a range of values for rows to determine the best value for your particular scenario.

Thursday Jan 02, 2014

Invoking R scripts via Oracle Database: Theme and Variation


Oracle R Enterprise provides several ways for you to invoke R scripts through Oracle Database. From the same R script you can get structured data, an XML representation of R objects and images, and even PNG images via a BLOB column in a database table. This series of blog posts will take you through the various ways you can interact with R scripts and Oracle Database. In this first post, we explore the benefits of embedded R execution and usage of the ore.doEval and rqEval functions. Subsequent posts will detail the use of the other embedded R execution functions, and in the case of data- and task-parallel capabilities, how these relate to the MapReduce paradigm.

Introduction


In Oracle R Enterprise, we use the phrase “embedded R execution” to characterize the storing of R scripts in Oracle Database – using the ORE R script repository – and invoking that script in one or more database-side R engines.

This is a powerful capability for several reasons:


  • enable data- and task-parallel execution of user-defined R functions that correspond to special cases of Hadoop Map-Reduce jobs

  • leverage a more powerful database server machine for R engine execution – both RAM and CPU

  • transfer data between Oracle Database and R engine much faster than to a separate client R engine

  • invoke user-defined R functions from SQL and retrieve results in various forms depending on application requirements: tables, XML, PNG BLOB

  • leverage open source CRAN packages at the database server machine

  • schedule R scripts for automatic execution via SQL with Oracle Database DBMS_SCHEDULER PL/SQL package


Users can interactively develop R scripts using their favorite R IDE, and then deploy the script as an R function to the database where it can be invoked either from R or SQL. Embedded R execution facilitates application use of R scripts with better performance and throughput than using a client-side R engine. Executing R scripts from SQL enables integration of R script results with OBIEE, Oracle BI Publisher, and other SQL-enabled tools for structured data, R objects, and images.

Table 1 provides a summary of the embedded R execution functions and R script repository functions available. The function f refers to the user-defined R code, or script, that is provided as either an R function object or a named R function in the database R script repository. To create functions in the R script repository, ORE has functions as described in Table 1.












































R API SQL API Description

ore.doEval

rqEval

Executes f with no automatic transfer of data.

ore.tableApply

rqTableEval

Executes f passing all rows of provided input ore.frame as the first parameter of f. First parameter provided as a data.frame.

ore.groupApply

“rqGroupEval”
(must
be explicitly defined as function by user)

Executes f by partitioning data according to an “index” column’s values. Each data partition provided as a data.frame in the first parameter of f. Supports parallel execution of each f invocation in a pool of database server-side R engines.

ore.rowApply

rqRowEval

Executes f passing a specified number of rows (a “chunk”) of the provided input ore.frame. Each chunk provided as a data.frame in the first parameter of f. Supports parallel execution of each f invocation in a pool of database server-side R engines.

ore.indexApply

N/A

Executes f with no automatic transfer of data, but provides the index of the invocation, 1 through n, where n is the number of functions to invoke. Supports parallel execution of each f invocation in a pool of database server-side R engines.

ore.scriptCreate

sys.rqScriptCreate

Load the provided R function into the R script repository with the provided name.

ore.scriptDrop

sys.rqScriptDrop

Remove the named R function from the R script repository.

Table 1: Embedded R Execution and R Script Repository function summary


Using ore.doEval and rqEval


The first of the embedded R functions we cover are also the simplest. The R function ore.doEval and the SQL function rqEval do not automatically receive any data from the database. They simply execute the function f provided. Any needed data is either generated within f or explicitly retrieved from a data source such as Oracle Database, other databases, or flat files.

R API


In the R interface, users can specify an R function as an argument to ore.doEval, or use the name of that function as stored in the R script repository. For example, the function RandomRedDots returns a data.frame with two columns and plots 100 random normal values. To invoke the function through the database server requires minimal specification with ore.doEval.

RandomRedDots <- function(divisor=100){
id<- 1:10
plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2 )
data.frame(id=id, val=id / divisor)
}
ore.doEval(RandomRedDots)


Here is the result, where the image is displayed at the client, but generated by the database server R engine that executed the function f.


We can provide arguments to f as well. To override the divisor argument, provide it as an argument to ore.doEval. Note
that any number of parameters, including more complex R objects such as models can be passed as arguments this way.

ore.doEval(RandomRedDots, divisor=50)


Behind the scenes: when passing the function itself (as above), ORE implicitly stores the function in the R script repository before executing it. When finished executing, the function is dropped from the repository. If we want to store this function explicitly in the repository, we can use ore.scriptCreate:

ore.scriptCreate("myRandomRedDots",
RandomRedDots)


Now, the function can be invoked by name:

ore.doEval(FUN.NAME="myRandomRedDots",divisor=50)


The return value of f is a data.frame, however, if we capture the result in a variable, we’ll notice two things: the class of the return value is ore.object and the image does not display.

res <- ore.doEval(FUN.NAME="myRandomRedDots",
divisor=50)
class(res)



To get back the data.frame, we must invoke ore.pull to pull the result to the client R engine.

res.local <- ore.pull(res)
class(res.local)


If we wanted to return an ore.frame instead of an ore.object, specify the argument FUN.VALUE that describes the structure of the result.

res.of <- ore.doEval(FUN.NAME="myRandomRedDots", divisor=50,
FUN.VALUE= data.frame(id=1, val=1))
class(res.of)


SQL API


Now, we’ll look at executing the same R function f using the SQL interface of embedded R execution, while pointing out a few significant differences in the API. The first difference is that the R functions are defined as strings, not R objects. This should be no surprise since we’ll be using a SQL interface like SQL Developer or SQL*Plus. Also, the R function string cannot be passed directly in the rqEval function, but must first be stored in the R script repository. The call to sys.rqScriptCreate must be wrapped in a BEGIN-END PL/SQL block.

begin
sys.rqScriptCreate('myRandomRedDots2',
'function(divisor=100,numDots=100){
id <- 1:10
plot( 1:numDots, rnorm(numDots), pch = 21, bg = "red", cex = 2 )
data.frame(id = id, val = id / divisor)
}');
end;
/


Invoking the function myRandomRedDots2 occurs in a SQL SELECT statement as shown below. The first NULL argument to rqEval indicates that no arguments are supplied to the function myRandomRedDots2. In the SQL API, we can ask for the data.frame returned by f to appear as a SQL table. For this, the second parameter can take a SQL string that describes the column names and data types that correspond to the returned data.frame. You can provide a prototype row using the dual dummy table, however, the select statement can be based on an existing table or view as well.

select *
from table(rqEval(NULL, 'select 1 id, 1 val from dual', 'myRandomRedDots2'));


To pass parameters in SQL, we can replace the first NULL argument with a cursor that specifies a single row of scalar values. Multiple arguments can be specified as shown below. Note that argument names are case sensitive, so it is best to include column names in double quotes. Note also that the first argument is a cursor whereas the second parameter is a string. The former provides data values, whereas the latter is parsed to determine the structure of the result.

select *
from table(rqEval(cursor(select 50 "divisor", 500 "numDots" from dual),
'select 1 id, 1 val from dual',
'myRandomRedDots2'));


When specifying a table structure for the result as above, any image data is discarded. To get back both structured data and images, we replace the second argument with ‘XML’. This instructs the database to generate an XML string, first with any structured or semi-structured R objects, followed by the image or images generated by the R function f. Images are returned as a base 64 encoding of the PNG representation.

select *
from table(rqEval(cursor(select 50 "divisor", 500 "numDots" from dual),
'XML',
'myRandomRedDots2'));


Advanced features


To establish a connection to Oracle Database within the R function f, a special argument ore.connect can be set to TRUE. This uses the credentials of the user who invoked the embedded R function ore.doEval or rqEval to establish a connection and also automatically load the ORE package. This capability can be useful to explicitly use the ORE Transparency Layer or to save and load objects with ORE R object datastores.

RandomRedDots <- function(divisor=100, datatstore.name="myDatastore"){
id <- 1:10
plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2 )
ore.load(datastore.name) # contains numeric variable myVar
data.frame(id=id, val=id / divisor, num=myVar)
}


ore.doEval(RandomRedDots, datastore.name="datastore_1", ore.connect=TRUE)


Notice the additions in red. We pass the name of a datastore to load. That datastore is expected to contain a variable myVar. Arguments prefixed with ‘ore.’ are control arguments and are not passed to f. Other control arguments include: ore.drop which if set to TRUE converts a one-column input data.frame to a vector, ore.graphics which if set to TRUE starts a graphical driver to look for images being returned from f, ore.png.* which provides additional parameters for the PNG graphics device. The ore.png.* control arguments include (replace * with): width, height, units, pointsize, bg, res, type, etc.

In the next post, we will explore ore.tableApply and rqTableEval.

Wednesday Oct 23, 2013

Migrating R Scripts from Development to Production

“How do I move my R scripts stored in one database instance to another? I have my development/test system and want to migrate to production.”

Users of Oracle R Enterprise Embedded R Execution will often store their R scripts in the R Script Repository in Oracle Database, especially when using the ORE SQL API. From previous blog posts, you may recall that Embedded R Execution enables running R scripts managed by Oracle Database using both R and SQL interfaces. In ORE 1.3.1., the SQL API requires scripts to be stored in the database and referenced by name in SQL queries. The SQL API enables seamless integration with database-based applications and ease of production deployment.

Loading R scripts in the repository

Before talking about migration, we’ll first introduce how users store R scripts in Oracle Database. Users can add R scripts to the repository in R using the function ore.scriptCreate, or SQL using the function sys.rqScriptCreate.

For the sample R script

    id <- 1:10
    plot(1:100,rnorm(100),pch=21,bg="red",cex =2)
    data.frame(id=id, val=id / 100)

users wrap this in a function and store it in the R Script Repository with a name. In R, this looks like

ore.scriptCreate("RandomRedDots",
function () {
line-height: 115%; font-family: "Courier New";">     id <- 1:10
    plot(1:100,rnorm(100),pch=21,bg="red",cex =2)
    data.frame(id=id, val=id / 100))
})

In SQL, this looks like

begin
sys.rqScriptCreate('RandomRedDots',

 'function(){
    id <- 1:10
    plot(1:100,rnorm(100),pch=21,bg="red",cex =2)
    data.frame(id=id, val=id / 100)
  }');
end;
/

The R function ore.scriptDrop and SQL function sys.rqScriptDrop can be used to drop these scripts as well. Note that the system will give an error if the script name already exists.

Accessing R scripts once they’ve been loaded

If you’re not using a source code control system, it is possible that your R scripts can be misplaced or files modified, making what is stored in Oracle Database to only or best copy of your R code. If you’ve loaded your R scripts to the database, it is straightforward to access these scripts from the database table SYS.RQ_SCRIPTS. For example,

select * from sys.rq_scripts where name='myScriptName';

From R, scripts in the repository can be loaded into the R client engine using a function similar to the following:

ore.scriptLoad <- function(name) {
query <- paste("select script from sys.rq_scripts where name='",name,"'",sep="")
str.f <- OREbase:::.ore.dbGetQuery(query)
assign(name,eval(parse(text = str.f)),pos=1)
}

ore.scriptLoad("myFunctionName")

This function is also useful if you want to load an existing R script from the repository into another R script in the repository – think modular coding style. Just include this function in the body of the other function and load the named script.

Migrating R scripts from one database instance to another

To move a set of functions from one system to another, the following script loads the functions from one R script repository into the client R engine, then connects to the target database and creates the scripts there with the same names.

scriptNames <- OREbase:::.ore.dbGetQuery("select name from sys.rq_scripts where name not like 'RQG$%' and name not like 'RQ$%'")$NAME

for(s in scriptNames) {
cat(s,"\n")
ore.scriptLoad(s)
}

ore.disconnect()
ore.connect("rquser","orcl","localhost","rquser")

for(s in scriptNames) {
cat(s,"\n")
ore.scriptDrop(s)
ore.scriptCreate(s,get(s))
}

Best Practice

When naming R scripts, keep in mind that the name can be up to 128 characters. As such, consider organizing scripts in a directory structure manner. For example, if an organization has multiple groups or applications sharing the same database and there are multiple components, use “/” to facilitate the function organization:

line-height: 115%;">ore.scriptCreate("/org1/app1/component1/myFuntion1", myFunction1)
ore.scriptCreate("/org1/app1/component1/myFuntion2", myFunction2)
ore.scriptCreate("/org1/app2/component2/myFuntion2", myFunction2)
ore.scriptCreate("/org2/app2/component1/myFuntion3", myFunction3)
ore.scriptCreate("/org3/app2/component1/myFuntion4", myFunction4)

Users can then query for all functions using the path prefix when looking up functions.

Monday Aug 12, 2013

Quick! Swap those models – I’ve got a better one

(or, Why in-database analytics enables real-time scoring and can make model deployment easy)

Refreshing predictive models is a standard part of the process when deploying advanced analytics solutions in production environments. In addition, many predictive models need to be used in a real-time setting for scoring customers, whether that is for fraud detection, predicting churn, or recommending next likely product. One of the problems with using vanilla R is that real-time scoring often requires starting an R engine for each score, or enabling some ad hoc mechanism for real-time scoring, which can increase application complexity.

In this blog post, we look at how Oracle R Enterprise enables:

  • Building models in-database on database data from R
  • Renaming in-database models for use by a stored procedure
  • Invoking the stored procedure to make predictions from SQL
  • Building a second model and swapping it with the original
  • Moving a model from development environment to production environment

Building the model in R

So let’s start with building a generalized linear model (GLM) in Oracle Database. For illustration purposes, we’ll use the longley data set from R – a macroeconomic data set that provides a well-known example for a highly collinear regression. In R, type ?longley for the full description of the data set.

Using the following R script, we create the database table LONGLEY_TABLE from the longley data.frame and then build the model using the in-database GLM algorithm. We’re predicting the number of people employed using the remaining variables. Then, we view the model details using summary and the auto-generated fit.name. This fit.name corresponds to the name of the Oracle Data Mining (ODM) model in the database, which is auto-generated. Next, we use the model to predict using the original data, just for a confirmation that the model works as expected.

ore.connect("rquser","my_sid","my_host","rquser_pswd",1521, all=TRUE)

ore.create(longley, table="LONGLEY_TABLE")

mod.glm <- ore.odmGLM(Employed ~ ., data = LONGLEY_TABLE)

summary(mod.glm)

mod.glm$fit.name

predict(fit1, LONGLEY_TABLE)

While a user can refer to the ODM model by its name in fit.name, for example, when working with it in SQL or the Oracle Data Miner GUI, this may not be convenient since it will look something like ORE$23_123. In addition, unless the R object mod.glm is saved in an ORE datastore (an ORE feature corresponding to R’s save and load functions using ore.save and ore.load, but in the database), at the end of the session, this object and corresponding ODM model will be removed.

In addition, we’ll want to have a common name for the model so that we can swap an existing model with a new model and not have the change higher level code. To rename an ODM model, we can use the PL/SQL statement shown here, invoked with R using ore.exec. Of course, this could also be done from any SQL interface, e.g., SQL*Plus, SQL Developer, etc., just supplying the explicit SQL.

ore.exec(paste("BEGIN DBMS_DATA_MINING.RENAME_MODEL(model_name => '", mod.glm$fit.name, "', new_model_name => 'MY_GLM_MODEL'); END;",sep=""))

So now, we have the ODM model named MY_GLM_MODEL. Keep in mind, after the model is renamed, the original model no longer exists and the R object is invalid – at least from the standpoint of being able to use it in functions like summary or predict.

Scoring data from a SQL procedure

As noted above, users can score in batch from R, however, they can also score in batch from SQL. But we’re interested in real-time scoring from the database using the in-database model. This can be done directly in a SQL query but providing the input data in the query itself. This eliminates having to write data to a database table and then doing a lookup to retrieve the data for scoring – making it real-time.

The following SQL does just this. The WITH clause defines the input data, selecting from dual. The SELECT clause uses the model MY_GLM_MODEL to make the prediction using the data defined by data_in.

WITH data_in as (select 2013 "Year",

234.289 "GNP",

235.6 "Unemployed",

107.608 "Population",

159 "Armed.Forces",

83 "GNP.deflator",

60.323 "Employed"

from dual )

SELECT PREDICTION(MY_GLM_MODEL USING *) "PRED"

FROM data_in

While we could invoke the SQL directly, having a stored procedure in the database can give us more flexibility. Here’s the stored procedure version in PL/SQL.

CREATE OR REPLACE

PROCEDURE MY_SCORING_PROC (year_in IN NUMBER,

gnp_in IN BINARY_DOUBLE,

unemployed_in IN BINARY_DOUBLE,

population_in IN BINARY_DOUBLE,

armed_forces_in IN BINARY_DOUBLE,

gnp_deflator_in IN BINARY_DOUBLE,

employed_in IN BINARY_DOUBLE,

pred_out OUT NUMBER) AS

BEGIN

WITH data_in as (select year_in "Year",

gnp_in "GNP",

unemployed_in "Unemployed",

population_in "Population",

armed_forces_in "Armed.Forces",

gnp_deflator_in "GNP.deflator",

employed_in "Employed"

from dual ),

model_score as (SELECT PREDICTION(MY_GLM_MODEL USING *) "PRED"

FROM data_in )

select PRED into pred_out from model_score;

EXCEPTION

WHEN OTHERS THEN

raise_application_error(-20001,

'An error was encountered - '||SQLCODE||' -ERROR- '||SQLERRM);

END;

To invoke the stored procedure, we can do the following:

SET SERVEROUTPUT ON

DECLARE

score NUMBER;

BEGIN

MY_SCORING_PROC(1947, 234.289, 235.6, 107.608, 159, 83, 60.323, score);

DBMS_OUTPUT.PUT_LINE('Score: '|| score);

END;

Refreshing the model from R

Let’s say the model above has been in production for a while, but has become stale – that is, it’s not predicting as well as it used to due to changing patterns in the data. To refresh it, we build a new model. For illustration purposes, we’re going to use the same data (so an identical model will be produced, except for its name).

mod.glm2 <- ore.odmGLM(Employed ~ ., data = LONGLEY_TABLE)

summary(mod.glm2)

mod.glm2$fit.name

To swap the models, we delete the existing model called MY_GLM_MODEL and rename the new model to MY_GLM_MODEL. Again, we can do this from R using PL/SQL and through ore.exec.

ore.exec(paste("BEGIN DBMS_DATA_MINING.DROP_MODEL('MY_GLM_MODEL'); DBMS_DATA_MINING.RENAME_MODEL(model_name => '",mod.glm2$fit.name,"', new_model_name => 'MY_GLM_MODEL'); END;",sep=""))

We can now re-execute the stored procedure and the new model will be used.

SET SERVEROUTPUT ON

DECLARE

score NUMBER;

BEGIN

MY_SCORING_PROC(1947, 234.289, 235.6, 107.608, 159, 83, 60.323, score);

DBMS_OUTPUT.PUT_LINE('Score: '|| score);

END;

You may have noticed that this approach can introduce a brief period where no model is accessible - between the DROP_MODEL and RENAME_MODEL. A better approach involves the use of SYNONYMs. In general, synonyms provide both data independence and location transparency, being an alternative name for a table, view, sequence, procedure, stored function, and other database objects. We can use this in conjunction with our stored procedure above. First, create a synonym for the original scoring procedure.

CREATE or REPLACE SYNONYM MY_SCORING_PROC_SYM for MY_SCORING_PROC;

When invoking the procedure from your application, use the name MY_SCORING_PROC_SYM in place of MY_SCORING_PROC.  Instead of renaming the model, create a second stored procedure, with a different name, e.g., MY_SCORING_PROC_2. The new procedure references the name of the newly build model internally. 

When it is time to swap the models, invoke the following to change the procedures.

 

CREATE or REPLACE SYNONYM MY_SCORING_PROC_SYM for MY_SCORING_PROC_2;

Another benefit of this approach is that replaced models can still be kept should you need to revert to a previous version. 

Moving an in-database model from one machine to another

In a production deployment, there’s often the need to move a model from the development environment to the production environment. For example, the data scientist may have built the model in a development / sandbox environment and now needs to move it to the production machine(s).

In-database models provide functions EXPORT_MODEL and IMPORT_MODEL as part of the DBMS_DATA_MINING SQL package. See the 11g documentation for details. These calls can be invoked from R, but we’ll show this from SQL just to keep the flow easier to see.

From a SQL prompt, e.g., from SQL*Plus, connect to the schema that contains the model. Create a DIRECTORY object where the exported model file will be stored. List the model names available to this schema, which should contain MY_GLM_MODEL. Then, export the model

CONNECT rquser/rquser_psw

CREATE OR REPLACE DIRECTORY rquserdir AS '/home/MY_DIRECTORY';

-- list the models available to rquser

SELECT name FROM dm_user_models;

-- export the model called MY_GLM_MODEL to a dump file in same schema

EXECUTE DBMS_DATA_MINING.EXPORT_MODEL ('MY_GLM_MODEL_out',

'RQUSERDIR',

'name = ''MY_GLM_MODEL''');

At this point, you have the ODM model named MY_GLM_MODEL in the file MY_GLM_MODEL_out01.dmp stored in the file system under /home/MY_DIRECTORY. This file can now be moved to the production environment and the model loaded into the target schema.

Log into the new schema and invoke IMPORT_MODEL.

CONNECT rquser2/rquser2_psw

EXECUTE DBMS_DATA_MINING.IMPORT_MODEL (MY_GLM_MODEL_out01.dmp',

'RQUSERDIR', 'name = ''MY_GLM_MODEL''',

'IMPORT', NULL, 'glm_imp_job', 'rquser:rquser2');

Summary

In this post, we’ve highlighted how to build an in-database model in R and use it for scoring through SQL in a production, re-time settings. In addition, we showed how it is possible to swap, or refresh, models in a way that can leave your application code untouched. Finally, we highlighted database functionality that allows you to move in-database models from one database environment to another.

Users should note that all the functionality shown involving SQL, or being invoked through ore.exec, can be easily wrapped in R functions that could ultimately become part of ORE. If any of our readers are interested in giving this a try, we can post your solution here to share with the R and Oracle community. For the truly adventurous, check out the Oracle Database package DBMS_FILE_TRANSFER to consider wrapping the ability to move model files from R as well.

Wednesday Apr 17, 2013

Mind Reading... What are our customers thinking?

Overhauling analytics processes is becoming a recurring theme among customers. A major telecommunication provider recently embarked on overhauling their analytics process for customer surveys. They had three broad technical goals:

  • Provide an agile environment that empowers business analysts to test hypotheses based on survey results
  • Allow dynamic customer segmentation based on survey responses and even specific survey questions to drive hypothesis testing
  • Make results of new surveys readily available for research

The ultimate goal is to derive greater value from survey research that drives measurable improvements in survey service delivery, and as a result, overall customer satisfaction.

This provider chose Oracle Advanced Analytics (OAA) to power their survey research. Survey results and analytics are maintained in Oracle Database and delivered via a parameterized BI dashboard. Both the database and BI infrastructure are standard components in their architecture.

A parameterized BI dashboard enables analysts to create samples for hypothesis testing by filtering respondents to a survey question based on a variety of filtering criteria. This provider required the ability to deploy a range of statistical techniques depending on the survey variables, level of measurement of each variable, and the needs of survey research analysts.

Oracle Advanced Analytics offers a range of in-database statistical techniques complemented by a unique architecture supporting deployment of open source R packages in-database to optimize data transport to and from database-side R engines. Additionally, depending on the nature of functionality in such R packages, it is possible to leverage data-parallelism constructs available as part of in-database R integration. Finally, all OAA functionality is exposed through SQL, the ubiquitous language of the IT environment. This enables OAA-based solutions to be readily integrated with BI and other IT technologies.

The survey application noted above has been in production for 3 months. It supports a team of 20 business analysts and has already begun to demonstrate measurable improvements in customer satisfaction.

In the rest of this blog, we explore the range of statistical techniques deployed as part of this application.

At the heart of survey research is hypothesis testing. A completed customer satisfaction survey contains data used to draw conclusions about the state of the world. In the survey domain, hypothesis testing is comparing the significance of answers to specific survey questions across two distinct groups of customers - such groups are identified based on knowledge of the business and technically specified through filtering predicates.

Hypothesis testing sets up the world as consisting of 2 mutually exclusive hypotheses:

a) Null hypothesis - states that there is no difference in satisfaction levels between the 2 groups of customers

b) Alternate hypothesis states that there is a significant difference in satisfaction levels between the 2 groups of customers

Obviously only one of these can be true and the true-ness is determined by the strength, probability, or likelihood of the null hypothesis over the alternate hypothesis. Simplistically, the degree of difference between, e.g., the average score from a specific survey question across two customer groups could provide the necessary evidence in helping decide which hypothesis is true.

In practice the process of providing evidence to make a decision involves having access to a range of test statistics – a number calculated from each group that helps determine the choice of null or alternate hypothesis. A great deal of theory, experience, and business knowledge goes into selecting the right statistic based on the problem at hand.

The t-statistic (available in-database) is a fundamental function used in hypothesis testing that helps understand the differences in means across two groups. When the t-values across 2 groups of customers for a specific survey question are extreme then the alternative hypothesis is likely to be true. It is common to set a critical value that the observed t-value should exceed to conclude that the satisfaction survey results across the two groups are significantly different. Other similar statistics available in-database include F-test, cross tabulation (frequencies of various response combinations captured as a table), related hypothesis testing functions such as chi-square functions, Fisher's exact test, Kendall's coefficients, correlation coefficients and a range of lambda functions.

If an analyst desires to compare across more than 2 groups then analysis of variance (ANOVA) is a collection of techniques that is commonly used. This is an area where the R package ecosystem is rich with several proven implementations. The R stats package has implementations of several test statistics and function glm allows analysis of count data common in survey results including building Poisson and log linear models. R's MASS package implements a popular survey analysis technique called iterative proportional fitting. R's survey package has a rich collection of features (http://faculty.washington.edu/tlumley/survey/).

The provider was specifically interested in one function in the survey package - raking (also known as sample balancing) - a process that assigns a weight to each customer that responded to a survey such that the weighted distribution of the sample is in very close agreement with other customer attributes, such as the type of cellular plan, demographics, or average bill amount. Raking is an iterative process that uses the sample design weight as the starting weight and terminates when a convergence is achieved.

For this survey application, R scripts that expose a wide variety of statistical techniques - some in-database accessible through the transparency layer in Oracle R Enterprise and some in CRAN packages - were built and stored in the Oracle R Enterprise in-database R script repository. These parameterized scripts accept various arguments that identify samples of customers to work with as well as specific constraints for the various hypothesis test functions. The net result is greater agility since the business analyst determines both the set of samples to analyze as well as the application of the appropriate technique to the sample based on the hypothesis being pursued.

For more information see these links for Oracle's R Technologies software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop

Wednesday Feb 06, 2013

Oracle R Enterprise 1.3 gives predictive analytics an in-database performance boost

Recently released Oracle R Enterprise 1.3 adds packages to R that enable even more in-database analytics. These packages provide horizontal, commonly used techniques that are blazingly fast in-database for large data. With Oracle R Enterprise 1.3, Oracle makes R even better and usable in enterprise settings. (You can download ORE 1.3 here and documentation here.)

When it comes to predictive analytics, scoring (predicting outcomes using a data mining model) is often a time critical operation. Scoring can be done online (real-time), e.g., while a customer is browsing a webpage or using a mobile app, where on-the-spot recommendations can be made based on current actions. Scoring can also be done offline (batch), e.g., predict which of your 100 million customers will respond to each of a dozen offers, e.g., where applications leverage results to identify which customers should be targeted with a particular ad campaign or special offer.

In this blog post, we explore where using Oracle R Enterprise pays huge dividends. When working with small data, R can be sufficient, even when pulling data from a database. However, depending on the algorithm, benefits of in-database computation can be seen in a few thousand rows. The time difference with 10s of thousands of rows makes an interactive session more interactive, whereas 100s of thousands of rows becomes a real productivity gain, and on millions (or billions) of rows, becomes a competitive advantage! In addition to performance benefits, ORE integrates R into the database enabling you to leave data in place.

We’ll look at a few proof points across Oracle R Enterprise features, including:

  • OREdm – a new package that provides R access to several in-database Oracle Data Mining algorithms (Attribute Importance, Decision Tree, Generalized Linear Models, K-Means, Naïve Bayes, Support Vector Machine).
  • OREpredict – a new package that enables scoring models built using select standard R algorithms in the database (glm, negbin, hclust, kmeans, lm, multinom, nnet, rpart).
  • Embedded R Execution – an ORE feature that allows running R under database control and boosts real performance of CRAN predictive analytics packages by providing faster access to data than occurs between the database and client, as well as leveraging a more powerful database machine with greater RAM and CPU resources.

OREdm

Pulling data out of a database for any analytical tool impedes interactive data analysis due to access latency, either directly when pulling data out of the database or indirectly via an IT process that involves requesting data to be staged in flat files. Such latencies can quickly become intolerable. On the R front, you’ll also need to consider whether the data will fit in memory. If flat files are involved, consideration needs to be given to how files will be stored, backed up, and secured.

Of course, model building and data scoring execution time is only part of the story. Consider a scenario A, the “build combined script,” where data is extracted from the database, and an R model built and persisted for later use. In the corresponding scenario B, the “score combined script”, data is pulled from the database, a previously built model loaded, data scored, and the scores written to the database. This is a typical scenario for use in, e.g., enterprise dashboards or within an application supporting campaign management or next-best-offer generation. In-database execution provides significant performance benefits, even for relatively small data sets as included below. Readers should be able to reproduce such results at these scales. We’ve also included a Big Data example by replicating the 123.5 million row ONTIME data set to 1 billion rows. Consider the following examples:

Linear Models: We compared R lm and ORE ore.lm in-database algorithm on the combined scripts. On datasets ranging from 500K to 1.5M rows with 3-predictors, in-database analytics showed an average 2x-3x performance improvement for build, and nearly 4x performance improvement for scoring. Notice in Figure 1 that the trend is significantly less for ore.lm than lm, indicating greater scalability for ore.lm.

Figure 1. Overall lm and ore.lm execution time for model building (A) and data scoring (B)

Figure 2 provides a more detailed view comparing data pull and model build time for build detail, followed by data pull, data scoring, and score writing for score detail. For model building, notice that while data pull is a significant part of lm’s total build time, the actual build time is still greater than ore.lm. A similar statement can be made in the case of scoring.

Figure 2. Execution time components for lm and ore.lm (excluding model write and load)

Naïve Bayes from the e1071 package: On 20-predictor datasets ranging from 50k to 150k rows, in-database ore.odmNB improved data scoring performance by a factor of 118x to 418x, while the full scenario B execution time yielded a 13x performance improvement, as depicted in Figure 3B. Using a non-parallel execution of ore.odmNB, we see the cross-over point where ore.odmNB overtakes R, but more importantly, the slope of the trend points to the greater scalability of ORE, as depicted in Figure 3A for the full scenario A execution time.


Figure 3. Overall naiveBayes and ore.odmNB execution time for model building (A) and data scoring (B)

K-Means clustering: Using 6 numeric columns from the ONTIME airline data set ranging from 1 million to 1 billion rows, we compare in-database ore.odmKMeans with R kmeans through embedded R execution with ore.tableApply. At 100 million rows, ore.odmKMeans demonstrates better performance than kmeans , and scalability at 1 billion rows. The performance results depicted in Figure 4 uses a log-log plot. The legend shows the function invoked and corresponding parameters, using subset of ONTIME data set d. While ore.odmKMeans scales linearly with number of rows, R kmeans does not. Further, R kmeans did not complete at 1 billion rows.

Figure 4: K-Means clustering model building on Big Data

OREpredict

With OREpredict, R users can also benefit from in-database scoring of R models. This becomes evident not only when considering the full “round trip” of pulling data from the database, scoring in R, and writing data back to the database, but also for the scoring itself.

Consider an lm model built using a dataset with 4-predictors and 1 million to 5 million rows. Pulling data from the database, scoring, and writing the results back to the database shows a pure R-based approach taking 4x - 9x longer than in-database scoring using ore.predict with that same R model. Notice in Figure 5 that the slope of the trend is dramatically less for ore.predict than predict, indicating greater scalability. When considering the scoring time only, ore.predict was 20x faster than predict in R for 5M rows. In ORE 1.3, ore.predict is recommended and will provide speedup over R for numeric predictors.

Figure 5. Overall lm execution time using R predict vs. ORE ore.predict

For rpart, we see a similar result. On a 20-predictor, 1 million to 5 million row data set, ore.predict resulted in a 6x – 7x faster execution. In Figure 5, we again see that the slope of the trend is dramatically less for ore.predict than predict, indicating greater scalability. When considering the scoring time only, ore.predict was 123x faster than predict in R for 5 million rows.

Figure 6. Overall rpart Execution Time using R predict vs. ORE ore.predict

This scenario is summarized in Figure 7. In the client R engine, we have the ORE packages installed. There, we invoke the pure R-based script, which requires pulling data from the database. We also invoke the ORE-based script that keeps the data in the database.

Figure 7. Summary of OREpredict performance gains

To use a real world data set, we again consider the ONTIME airline data set with 123.5 million rows. We will build lm models with varying number of coefficients derived by converting categorical data to multiple columns. The variable p corresponds to the number of coefficients resulting from the transformed formula and is dependent on the number of distinct values in the column. For example, DAYOFWEEK has 7 values, so with DEPDELAY, p=9. In Figure 8, you see that using an lm model with embedded R for a single row (e.g., one-off or real-time scoring), has much more overhead (as expected given that an R engine is being started) compared to ore.predict, which shows subsecond response time through 40 coefficients at 0.54 seconds, and the 106 coefficients at 1.1 seconds. Here are the formulas describing the columns included in the analysis:

  • ARRDELY ~ DEPDELAY (p=2)
  • ARRDELY ~ DEPDELAY + DAYOFWEEK (p=8)
  • ARRDELY ~ DEPDELAY + DAYOFWEEK + MONTH (p=19)
  • ARRDELY ~ DEPDELAY + DAYOFWEEK + MONTH + YEAR (p=40)
  • ARRDELY ~ DEPDELAY + DAYOFWEEK + MONTH + YEAR (p=106)


Figure 8. Comparing performance of ore.predict with Embedded R Execution for lm

Compare this with scoring the entire ONTIME table of 123.5 million rows. We see that ore.predict outperforms embedded R until about 80 coefficients, when embedded R becomes the preferred choice. 

Data Movement between R and Database: Embedded R Execution

One advantage of R is its community and CRAN packages. The goal for Oracle R Enterprise with CRAN packages is to enable reuse of these packages while:

  • Leveraging the parallelization and efficient data processing capabilities of Oracle Database
  • Minimizing data transfer and communication overhead between R and the database
  • Leveraging R as a programming language for writing custom analytics

There are three ways in which we’ll explore the performance of pulling data.

1) Using ore.pull at a separate client R engine to pull data from the database

2) Using Embedded R Execution and ore.pull within an embedded R script from a database-spawned R engine

3) Using Embedded R Execution functions for data-parallelism and task-parallelism to pass database data to the embedded R script via function parameter

With ORE Embedded R Execution (ERE), the database delivers data-parallelism and task-parallelism, and reduces data access latency due to optimized data transfers into R. Essentially, R runs under the control of the database. As illustrated in Figure 9, loading data at the database server is 12x faster than loading data from the database to a separate R client. Embedded R Execution also provides a 13x advantage when using ore.pull invoked at the database server within an R closure (function) compared with a separate R client. The data load from database to R client is depicted as 1x – the baseline for comparison with embedded R execution data loading.

Figure 9. Summary of Embedded R Execution data load performance gains

Data transfer rates are displayed in Figure 10, for a table with 11 columns and 5 million to 15 million rows of data. Loading data via ORE embedded R execution using server-side ore.pull or through the framework with, e.g., ore.tableApply (one of the embedded R execution functions) is dramatically faster than a non-local client load via ore.pull. The numbers shown reflect MB/sec data transfer rates, so a bigger bar is better!

Figure 10. Data load and write execution time with 11 columns

While this is impressive, let’s expand our data up to 1 billion rows. To create our 1 billion row data set (1.112 billion rows), we duplicated the 123.5 million row ONTIME dataset 9 times, replacing rows with year 1987 with years 2010 through 2033, and selecting 6 integer columns (YEAR, MONTH, DAYOFMONTH, ARRDELAY, DEPDELAY, DISTANCE) with bitmap index of columns (YEAR, MONTH, DAYOFMONTH). The full data set weighs in at ~53 GB.

In Figure 11, we see linear scalability for loading data into the client R engine. Times range from 2.8 seconds for 1 million rows, to 2700 seconds for 1 billion rows. While your typical user may not need to load 1 billion rows into R memory, this graph demonstrates the feasibility to do so.

Figure 11. Client Load of Data via ore.pull for Big Data

In Figure12, we look at how degree of parallelism (DOP) affects data load times involving ore.rowApply. This test addresses the question of how fast ORE can load 1 billion, e.g., when scoring data. The degree of parallelism corresponds to the number of R engines that are spawned for concurrent execution at the database server. The number of chunks the data is divided into is 1 for a single degree of parallelism, and 10 times the DOP for the remaining tests. For DOP of 160, the data was divided into 1600 chunks, i.e., 160 R engines were spawned, each processing 10 chunks. The graph on the left depicts that execution times improve for the 1 billion row data set through DOP of 160. As expected, at some point, the overhead of spawning additional R engines and partitioning the data outweighs the benefit. At its best time, processing 1 billion rows took 43 seconds.

Figure 12. Client Load of Data via ore.pull for Big Data

In the second graph of Figure 12, we contrast execution time for the “sweet spot” identified in the previous graph with varying number of rows. Using this DOP of 160, with 1600 chunks of data, we see that through 100 million rows, there is very little increase in execution time (between 6.4 and 8.5 seconds in actual time). While 1 billion rows took significantly more, it took only 43 seconds.

We can also consider data write at this scale. In Figure 13, we also depict linear scalability from 1 million through 1 billion rows using the ore.create function to creating database tables from R data. Actual times ranged from 2.6 seconds to roughly 2600 seconds.

Figure 13. Data Write using ore.create for Big Data

ORE supports data-parallelism to enable, e.g., building predictive models in parallel on partitions of the data. Consider a marketing firm that micro-segments customers and builds predictive models on each segment. ORE embedded R execution automatically partitions the data, spawns R engines according to the degree of parallelism specified, and executes the specified user R function. To address how efficiently ore.groupApply can process data, Figure 14 shows the total execution time to process the 123.5M rows from the ONTIME data with varying number of columns. The figure shows that ore.groupApply scales linearly as the number of columns increases. Three columns were selected based on their number of distinct values: TAILNUM 12861, DEST 352, and UNIQUECARRIER 29. For UNIQUECARRIER, all columns (total of 29 columns) could not be completed since 29 categories resulted in data too large for a single R engine.

Figure 14. Processing time for 123.5M rows via ore.groupApply

ORE also supports row-parallelism, where the same embedded R function can be invoked on chunks of rows. As with ore.groupApply, depending on the specified degree of parallelism, a different chunk of rows will be submitted to a dynamically spawned database server-side R engine. Figure 15 depicts a near linear execution time to process the 123.5M rows from ONTIME with varying number of columns. The chunk size can be specified, however, testing 3 chunk sizes (10k, 50k, and 100k rows) showed no significant difference in overall execution time, hence a single line is graphed.

Figure 15. Processing time for 123.5M rows via ore.rowApply for chunk sizes 10k-100k

All tests were performed on an Exadata X3-8. Except as noted, the client R session and database were actually on the same machine, so network latency for data read and write were minimum. Over a LAN or WAN, the benefits of in-database execution and ORE will be even more dramatic.

Monday Feb 20, 2012

Announcing Oracle R Distribution

Oracle has released the Oracle R Distribution, an Oracle-supported distribution of open source R. This is provided as a free download from Oracle. Support for Oracle R Distribution is provided to customers of the Oracle Advanced Analytics option and Oracle Big Data Appliance. The Oracle R Distribution facilitates enterprise acceptance of R, since the lack of a major corporate sponsor has made some companies concerned about fully adopting R. With the Oracle R Distribution, Oracle plans to contribute bug fixes and relevant enhancements to open source R.

Oracle has already taken responsibility for and contributed modifications to ROracle - an Oracle database interface (DBI) driver for R based on OCI. As ROracle is LGPL and used for Oracle Database connectivity from R, we are committed to ensuring this is the best package for Oracle connectivity.
About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today