Wednesday Apr 16, 2014

Oracle's Strategy for Advanced Analytics

At Oracle our goal is to enable you to get timely insight from all of your data. We continuously enhance Oracle Database to allow workloads that have traditionally required extracting data from the database to run in-place. We do this to narrow the gap that exists between insights that can be obtained and available data - because any data movement introduces latencies, complexity due to more moving parts, the ensuing need for data reconciliation and governance, as well as increased cost. The Oracle tool set considers the needs of all types of enterprise users - users preferring GUI based access to analytics with smart defaults and heuristics out of the box, users choosing to work interactively and quantitatively with data using R, and users preferring SQL and focusing on operationalization of models.

Oracle recognized the need to support data analysts, statisticians, and data scientists with a widely used and rapidly growing statistical programming language. Oracle chose R - recognizing it as the new de facto standard for computational statistics and advanced analytics. Oracle supports R in at least 3 ways:


  • R as the language of interaction with the database

  • R as the language in which analytics can be written and executed in the database as a high performance computing platform

  • R as the language in which several native high performance analytics have been written that execute in database


Additionally, you may choose to use any of the CRAN algorithms in R scripts executed at the database server, taking advantage of several forms of data parallelism.

With Oracle R Distribution, Oracle released the first and only supported commercial distribution of R from an established company. In 2012, Oracle embarked on the Hadoop journey, acknowledging the alternative data management options emerging in open source for unstructured or not-yet-structured data. In keeping with our strategy of delivering analytics close to where data is stored, Oracle extended Advanced Analytics capabilities to execute on HDFS-resident data in Hadoop environments. R has been integrated into Hadoop in the same manner as it has been with the database.

Recognizing that data is stored in both database and non-database environments, Oracle gives users options for where to store their data (Oracle Database, HDFS, or Spark RDD), where to perform computations (in-database or on the Hadoop cluster), and where to store results (Oracle Database or HDFS). Users can write R scripts that can be leveraged across database and Hadoop environments. Oracle Database, as a preferred location for storing R scripts, data, and result objects, provides a real-time scoring and deployment platform. It also makes it easy to create a model factory environment with authorization, roles, and privileges, combined with auditing, backup, recovery, and security.

Oracle provides a common infrastructure that supports both in-database and custom R algorithms. Oracle also provides an integrated GUI for business users. Oracle provides both R-based access and GUI-based access to in-database analytics. A major part of Oracle's strategy is to maintain agility in our portfolio of supported techniques - being responsive to customer needs.

Thursday Mar 27, 2014

Why choose Oracle for Advanced Analytics?

If you're an enterprise company, chances are you have your data in an Oracle database. You chose Oracle for its global reputation for providing the best software products (and now engineered systems) to support your organization. Oracle Database is known for stellar performance and scalability, and Oracle delivers world-class support.

If your data is already in Oracle Database or moving in that direction, leverage the high performance computing environment of the database to analyze your data. Traditionally it was common practice to move data to separate analytic servers for the explicit purpose of model building. This is no longer necessary, nor is it scalable as your organization seeks to deliver value from Big Data. Oracle Database now provides several state-of-the-art algorithms that execute in a parallel and distributed architecture directly in-database, augmented by custom algorithms written in the R statistical programming language. Leveraging Oracle Database for Advanced Analytics has benefits including:


  • Eliminates data movement to analytic servers

  • Enables analysis of all data not just samples

  • Puts your database infrastructure to even greater use

  • Eliminates impedance mismatch in the form of model translation when operationalizing models

  • All aspects of modeling and deployment are optionally available via SQL, easing integration with other IT software

  • Leverage CRAN algorithms directly in the database

Customers such as Stubhub, dunnhumby, CERN OpenLab, Financiera Uno, Turkcell, and others leverage Oracle Advanced Analytics to scale their applications, simplify their analytics architecture, and reduce time to market of predictive models from weeks to hours or even minutes.

Oracle leverages its own advanced analytics products, using Oracle Advanced Analytics in a wide range of Oracle Applications and internal deployments, including:


  • Human Capital Management with Predictive Workforce to predict employee turnover and performance, and to support "what if" analysis

  • Customer Relationship Management with Sales Prediction Engine to predict sales opportunities, what to sell, how much, and when

  • Supply Chain Management with Spend Classification to flag non-compliance or anomalies in expense submissions

  • Retail Analytics with Oracle Retail Customer Analytics to perform shopping cart analysis and next best offers

  • Oracle Financial Services Analytic Applications to enable quantitative analysts in credit risk management divisions to author rules/models directly in R


Oracle wants you to be successful with advanced analytics. By working closely with customers to make Oracle Advanced Analytics an integral part of their analytics strategy, Oracle helps them put advanced analytics into production much faster.

Thursday Mar 20, 2014

ROracle 1-1.11 released - binaries for Windows and other platforms available on OTN


We are pleased to announce the latest update of the open source ROracle package, version 1-1.11, with enhancements and bug fixes. ROracle provides high performance and scalable interaction from R with Oracle Database. In addition to availability on CRAN, ROracle binaries specific to Windows and other platforms can be downloaded from the Oracle Technology Network. Users of ROracle, please take our brief survey. We want to hear from you!
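For orientation, here is a minimal sketch of an ROracle session. The credentials, connect string, and table name RORACLE_TEST are placeholders for your environment, not values from this release.

library(ROracle)
drv <- dbDriver("Oracle")
# placeholder credentials and connect string -- replace with your own
con <- dbConnect(drv, username = "scott", password = "tiger",
                 dbname = "myhost:1521/orcl")

# write a small data.frame to a table, then read a value back
df  <- data.frame(ID = 1:3, VAL = c("a", "b", "c"))
dbWriteTable(con, "RORACLE_TEST", df, overwrite = TRUE)
res <- dbGetQuery(con, 'select count(*) as "N" from RORACLE_TEST')
print(res$N)

dbDisconnect(con)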

Latest enhancements in version 1-1.11 of ROracle:

• Performance enhancements for RAW data types and large result sets
• Ability to cache the result set in memory to reduce memory consumption on successive reads
• Added session mode to connect as SYSDBA or using external authentication
• Bug 17383542: enhanced dbWriteTable() and dbRemoveTable() to work on a global schema

Users of ROracle are quite pleased with the performance and functionality:


"In my position as a quantitative researcher, I regularly analyze database data up to a gigabyte in size on client-side R engines. I switched to ROracle from RJDBC because the performance of ROracle is vastly superior, especially when writing large tables. I've also come to depend on ROracle for transactional support, pulling data to my R client, and general scalability. I have been very satisfied with the support from Oracle -- their response has been prompt, friendly and knowledgeable."

           -- Antonio Daggett, Quantitative Researcher in Finance Industry


"Having used ROracle for over a year now with our Oracle Database data, I've come to rely on ROracle for high performance read/write of large data sets (greater than 100 GB), and SQL execution with transactional support for building predictive models in R. We tried RODBC but found ROracle to be faster, much more stable, and scalable."

           -- Dr. Robert Musk, Senior Forest Biometrician, Forestry Tasmania


See the ROracle NEWS for the complete list of updates.

We encourage ROracle users to post questions and provide feedback on the Oracle R Technology Forum.

In addition to being a high performance database interface to Oracle Database from R for general use, ROracle supports database access for Oracle R Enterprise.

Thursday Feb 13, 2014

Monitoring progress of embedded R functions

When you run R functions in the database, especially functions involving multiple R engines in parallel, you can monitor their progress using the Oracle R Enterprise datastore as a central location for progress notifications, or any intermediate status or results. In the following example, based on ore.groupApply, we illustrate instrumenting a simple function that builds a linear model to predict flight arrival delay based on a few other variables.

In the function modelBuildWithStatus, the function verifies that there are rows for building the model after eliminating incomplete cases from the data supplied in argument dat. If the result is not empty, the function builds a model and reports “success”; otherwise, it reports “no_data”. It’s likely that the user would like to use this model in some way or save it in a datastore for future use, but for this example, we just build the model and discard it, validating that a model can be built on the data.


modelBuildWithStatus <-
  function(dat) {
    dat <- dat[complete.cases(dat),]
    if (nrow(dat)>0L) {
      mod <- lm(ARRDELAY ~ DISTANCE + AIRTIME + DEPDELAY, dat)
      "success"
    } else
      "no_data"
  }

When we invoke this using ore.groupApply, the goal is to build one model per “unique carrier” or airline. Using an ORE 1.4 feature, we specify the degree of parallelism using the parallel argument, setting it to 2.


res <- ore.groupApply(ONTIME_S[, c("UNIQUECARRIER","DISTANCE", "ARRDELAY", "DEPDELAY", "AIRTIME")],
        ONTIME_S$UNIQUECARRIER,
        modelBuildWithStatus,
        parallel=2L)

res.local<-ore.pull(res)
res.local[unlist(res.local)=="no_data"]

The result tells us about the status of each execution. Below, we print the unique carriers that had no data.


R> res.local<-ore.pull(res)
R> res.local[unlist(res.local)=="no_data"]
$EA
[1] "no_data"

$`ML(1)`
[1] "no_data"

$`PA(1)`
[1] "no_data"

$PI
[1] "no_data"

$PS
[1] "no_data"

To monitor the progress of each execution, we can identify the group of data being processed in each function invocation using the value from the UNIQUECARRIER column. For this particular data set, we use the first two characters of the carrier’s symbol appended to “group.” to form a unique object name for storing in the datastore identified by job.name. (Without this, carrier values such as “ML(1)” would form invalid object names.) Note that since the UNIQUECARRIER column is constant within each group, we need only its first value.

The general idea for monitoring progress is to save an object in the datastore named for each execution of the function on a group. We can then list the contents of the named datastore and compute a percentage complete, which is discussed later in this post. For the “success” case, we assign the value “SUCCESS” to the variable named by the string in nm that we created earlier. Using ore.save, this uniquely named object is stored in the datastore with the name in job.name. We use the append=TRUE flag to indicate that the various function executions will be sharing the same named datastore.
If there is no data left in dat, we assign “NO DATA” to the variable named in nm and save that. Notice in both cases, we’re still returning “success” or “no data” so these come back in the list returned by ore.groupApply. However, we can return other values instead, e.g., the model produced.


modelBuildWithMonitoring <-
  function(dat, job.name) {
  nm <- paste("group.", substr(as.character(dat$UNIQUECARRIER[1L]),1,2), sep="")
  dat <- dat[complete.cases(dat),]
  if (nrow(dat)>0L) {
    mod <- lm(ARRDELAY ~ DISTANCE + AIRTIME + DEPDELAY, dat);
    assign(nm, "SUCCESS")
    ore.save(list=nm, name=job.name, append=TRUE)
    "success"
  } else {
    assign(nm, "NO DATA")
    ore.save(list=nm, name=job.name, append=TRUE)
    "no data"
  }
}

When we use this function in ore.groupApply, we provide the job.name and ore.connect arguments as well. The variable ore.connect must be set to TRUE in order to use the datastore. As ore.groupApply executes, the datastore named by job.name accumulates objects named for each carrier. First, delete the datastore named “job1”, if it exists.


ore.delete(name="job1")

res <- ore.groupApply(ONTIME_S[, c("UNIQUECARRIER","DISTANCE", "ARRDELAY", "DEPDELAY", "AIRTIME")],
        ONTIME_S$UNIQUECARRIER,
        modelBuildWithMonitoring,
        job.name="job1", parallel=2L, ore.connect=TRUE)

To see the progress during execution, we can use the following function, which takes a job name and the cardinality of the INDEX column to determine the percent complete. This function is invoked in a separate R engine connected to the same schema. If the job name is found, we print the percent complete, otherwise stop with an error message.


check.progress <- function(job.name, total.groups) {
  if ( job.name %in% ore.datastore()$datastore.name )
    print(sprintf("%.1f%%", nrow(ore.datastoreSummary(name=job.name))/total.groups*100L))
  else
    stop(paste("Job", job.name, " does not exist"))
}

To invoke this, compute the total number of groups and provide this and the job name to the function check.progress.
total.groups <- length(unique(ONTIME_S$UNIQUECARRIER))
check.progress("job1",total.groups)

However, we really want a loop to report on the progress automatically. One simple approach is to set up a while loop with a sleep delay. When we reach 100%, stop. To be self-contained, we include a simplification of the function above as a local function.


check.progress.loop <- function(job.name, total.groups, sleep.time=2) {
  check.progress <- function(job.name, total.groups) {
    if ( job.name %in% ore.datastore()$datastore.name )
      print(sprintf("%.1f%%", nrow(ore.datastoreSummary(name=job.name))/total.groups*100L))
    else
      paste("Job", job.name, " does not exist")
  }
  while(1) {
    try(x <- check.progress(job.name,total.groups))
    Sys.sleep(sleep.time)
    if(x=="100.0%") break
  }
}

As before, this function is invoked in a separate R engine connected to the same schema.


check.progress.loop("job1",total.groups)

Looking at the results, we can see the progress reported at one second intervals. Since the models build quickly, it doesn’t take long to reach 100%. For functions that take longer to execute or where there are more groups to process, you may choose a longer sleep time. Following this, we look at the datastore “job1” using ore.datastore and its contents using ore.datastoreSummary.


R> check.progress.loop("job1",total.groups,sleep.time=1)
[1] "6.9%"
[1] "96.6%"
[1] "100.0%"

R> ore.datastore(name="job1")
  datastore.name object.count size       creation.date description
1           job1           29 1073 2014-02-13 22:03:20
R> ore.datastoreSummary(name="job1")
object.name class size length row.count col.count
1 group.9E character 37 1 NA NA
2 group.AA character 37 1 NA NA
3 group.AQ character 37 1 NA NA
4 group.AS character 37 1 NA NA
5 group.B6 character 37 1 NA NA
6 group.CO character 37 1 NA NA
7 group.DH character 37 1 NA NA
8 group.DL character 37 1 NA NA
9 group.EA character 37 1 NA NA
10 group.EV character 37 1 NA NA
11 group.F9 character 37 1 NA NA
12 group.FL character 37 1 NA NA
13 group.HA character 37 1 NA NA
14 group.HP character 37 1 NA NA
15 group.ML character 37 1 NA NA
16 group.MQ character 37 1 NA NA
17 group.NW character 37 1 NA NA
18 group.OH character 37 1 NA NA
19 group.OO character 37 1 NA NA
20 group.PA character 37 1 NA NA
21 group.PI character 37 1 NA NA
22 group.PS character 37 1 NA NA
23 group.TW character 37 1 NA NA
24 group.TZ character 37 1 NA NA
25 group.UA character 37 1 NA NA
26 group.US character 37 1 NA NA
27 group.WN character 37 1 NA NA
28 group.XE character 37 1 NA NA
29 group.YV character 37 1 NA NA

The same basic technique can be used to note progress in any long running or complex embedded R function, e.g., in ore.tableApply or ore.doEval. At various points in the function, sequence-named objects can be added to a datastore. Moreover, the contents of those objects can contain incremental or partial results, or even debug output.
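As a sketch of that idea (using the datastore conventions above; the datastore name "job2" and the status strings are illustrative only), an ore.doEval invocation might checkpoint its progress like this:

ore.doEval(function(job.name) {
    status.step1 <- "data prepared"        # could hold partial results instead
    ore.save(status.step1, name=job.name, append=TRUE)
    # ... long-running work here ...
    status.step2 <- "model built"
    ore.save(status.step2, name=job.name, append=TRUE)
    "done"
  },
  job.name="job2", ore.connect=TRUE)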

While we’ve focused on the R API for embedded R execution, the same functions could be invoked using the SQL API. However, monitoring would still be done from an interactive R engine.

Tuesday Feb 04, 2014

Invoking R scripts via Oracle Database: Theme and Variation, Part 6

How can I use "group apply" to partition data over multiple columns for parallel execution?
How can I use R for statistical computations and return results as a database table?

In this blog post of our theme and variation series, we answer these two questions through several examples, highlighting both R and SQL interfaces.

So far in this blog series on Oracle R Enterprise embedded R execution we've covered:

Part 1: ore.doEval / rqEval
Part 2: ore.tableApply / rqTableEval
Part 3: ore.groupApply / “rqGroupApply”
Part 4: ore.rowApply / rqRowEval
Part 5: ore.indexApply

Using ore.groupApply for partitioning data on multiple columns

While the “group apply” functionality is quite powerful as it is, users sometimes want to partition data on multiple columns. Since ore.groupApply currently takes only a single column for the INDEX argument, users can create a new column that is the concatenation of the columns of interest, and provide this column to the INDEX argument. We’ll illustrate this first using the R API, and then the SQL API.

R API

We adapt an example from Part 3 to illustrate partitioning data on multiple columns. Instead of building a C5.0 model, we’ll use the same CHURN_TRAIN data set, but build an rpart model since it will produce rules on the partitions of data we’ve chosen for the example, namely, voice_mail_plan and international_plan. To understand the number of rows we can expect in each partition, we’ll use the R table function. We then add a new column that pastes together the two columns of interest to create a new column called “vmp_ip”.


library(C50)
data(churn)

ore.create(churnTrain, "CHURN_TRAIN")

table(CHURN_TRAIN$international_plan, CHURN_TRAIN$voice_mail_plan)
CT <- CHURN_TRAIN
CT$vmp_ip <- paste(CT$voice_mail_plan,CT$international_plan,sep="-")
head(CT)

Each invocation of the function “my.rpartFunction” will receive data from one of the partitions identified in vmp_ip. Since the partition columns are constant within each partition, we set them to NULL to exclude them from the model. The character vectors are converted to factors and the model is built to predict churn and saved in an appropriately named datastore. Instead of returning TRUE as done in the previous example, we create a list to return the specific partition column values, the distribution of churn values, and the model itself.


ore.scriptDrop("my.rpartFunction")
ore.scriptCreate("my.rpartFunction",
  function(dat,datastorePrefix) {
    library(rpart)
    vmp <- dat[1,"voice_mail_plan"]
    ip <- dat[1,"international_plan"]
    datastoreName <- paste(datastorePrefix,vmp,ip,sep="_")
    dat$voice_mail_plan <- NULL
    dat$international_plan <- NULL
    dat$state <- as.factor(dat$state)
    dat$churn <- as.factor(dat$churn)
    dat$area_code <- as.factor(dat$area_code)
    mod <- rpart(churn ~ ., data = dat)
    ore.save(mod, name=datastoreName, overwrite=TRUE)
    list(voice_mail_plan=vmp,
        international_plan=ip,
        churn.table=table(dat$churn),
        rpart.model = mod)
  })

After loading the rpart library and setting the datastore prefix, we invoke ore.groupApply using the derived column vmp_ip as the input to argument INDEX. After building the models, we’ll look at the first entry in the list returned. Using ore.load, we can load the model for the case where the customer neither has the voice mail plan, nor the international plan.


library(rpart)

datastorePrefix="my.rpartModel"

res <- ore.groupApply( CT, INDEX=CT$vmp_ip,
      FUN.NAME="my.rpartFunction",
      datastorePrefix=datastorePrefix,
      ore.connect=TRUE)
res[[1]]
ore.load(name=paste(datastorePrefix,"no","no",sep="_"))
mod

SQL API

To invoke this from the SQL API, we use the same approach as covered in Part 3. While we could create the table CT from the ore.frame used above, instead the following illustrates creating the derived column in SQL and explicitly defining a VIEW.
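As an aside, that R-side alternative is a one-liner; a sketch, assuming the CT ore.frame from above is still in scope:

ore.create(CT, table = "CT")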


CREATE OR REPLACE VIEW CT AS
  SELECT t.*, "voice_mail_plan" || '-' || "international_plan" as "vmp_ip"
  FROM CHURN_TRAIN t;

Next, we create a PL/SQL PACKAGE and FUNCTION for the invocation.


CREATE OR REPLACE PACKAGE churnPkg AS
  TYPE cur IS REF CURSOR RETURN CT%ROWTYPE;
END churnPkg;
/
CREATE OR REPLACE FUNCTION churnGroupEval(
  inp_cur churnPkg.cur,
  par_cur SYS_REFCURSOR,
  out_qry VARCHAR2,
  grp_col VARCHAR2,
  exp_txt CLOB)
RETURN SYS.AnyDataSet
PIPELINED PARALLEL_ENABLE (PARTITION inp_cur BY HASH ("vmp_ip"))
CLUSTER inp_cur BY ("vmp_ip")
USING rqGroupEvalImpl;
/

Then, we can invoke the R function by name in the SELECT statement as follows:


select *
from table(churnGroupEval(
  cursor(select * from CT),
  cursor(select 1 as "ore.connect", 'my.rpartModel2' as "datastorePrefix" from dual),
  'XML', 'vmp_ip', 'my.rpartFunction'));

As another variation on this theme, suppose that you didn’t want to include all the columns from the source data set. To achieve this, you could create a view and define the PACKAGE from the view. However, you could also define a record that contains the specific columns of interest. This is a standard PL/SQL specification that can be used in combination with “group apply”.


CREATE OR REPLACE PACKAGE churnPkg2 AS
  TYPE rec IS RECORD ("vmp_ip" varchar2(8),
    "churn" varchar2(4),
    "state" varchar2(4),
    "account_length" NUMBER(38));
  TYPE cur IS REF CURSOR RETURN rec;
END churnPkg2;
/

If you don’t want to or cannot create a view, this allows you to specify the exact columns required for model building. Reducing the number of columns on input can improve performance, since only required data will be passed to the server-side R engine. Notice that we could have used this approach above, since we remove the source partition columns in the function anyway.

How to return results from R statistical functions as database table data

R provides a wide range of statistical and advanced analytics functions. While Oracle Database contains a wide range of statistical functions in SQL, R further extends this set. In this next topic, we illustrate how to return statistical results as a SQL table for use with other SQL queries or to feed SQL-based applications.

As our example, we’ll use the R principal components function princomp. Our goal is to return the loadings of the PCA model as a database table. For our data set, we’ll use the USArrests data set provided with R. We can view the results of princomp in the mod variable, which has class “princomp”. We then push this data to Oracle Database, getting an ore.frame object.


mod <- princomp(USArrests, cor = TRUE)
class(mod)
mod
dat <- ore.push(USArrests)

R> mod <- princomp(USArrests, cor = TRUE)
R> class(mod)
[1] "princomp"
R> mod
Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

4 variables and 50 observations.
R> dat <- ore.push(USArrests)

In the first case considered, we use ore.tableApply to return simply the princomp object. When we do this we’re getting back a serialized object of type ore.object, but the actual princomp object still resides in the database. We can pull this object from the database to get a local princomp object, but this type of result cannot be directly returned as a SQL table because we need an object of class data.frame (which we’ll address later).


res <- ore.tableApply(dat,
      function(dat) {
        princomp(dat, cor=TRUE)
      })
class(res)
res.local <- ore.pull(res)
class(res.local)
str(res.local)
res.local
res

In the following output, we see the result is an ore.object that we pull from the database to get a princomp object. We examine the structure of the object and focus on the loadings element. In the example, we print res.local and res. Since res is an ore.object, it automatically gets pulled to the client before printing it.


R> res <- ore.tableApply(dat,
+ function(dat) {
+ princomp(dat, cor=TRUE)
+ })
R> class(res)
[1] "ore.object"
attr(,"package")
[1] "OREembed"
R> res.local <- ore.pull(res)
R> class(res.local)
[1] "princomp"
R> str(res.local)
List of 7
$ sdev : Named num [1:4] 1.575 0.995 0.597 0.416
..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
$ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
.. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
$ center : Named num [1:4] 7.79 170.76 65.54 21.23
..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
$ scale : Named num [1:4] 4.31 82.5 14.33 9.27
..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
$ n.obs : int 50
$ scores : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:50] "1" "2" "3" "4" ...
.. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
$ call : language princomp(x = dat, cor = TRUE)
- attr(*, "class")= chr "princomp"
R> res.local
Call:
princomp(x = dat, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

4 variables and 50 observations.
R> res
Call:
princomp(x = dat, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

4 variables and 50 observations.

In this next case, we focus on the loadings component of the princomp object, which contains the matrix of variable loadings, that is, a matrix whose columns contain the eigenvectors. This is of class "loadings" – still not a data.frame. To convert the loadings component to a data.frame, we determine the dimensions of the matrix and then construct a data.frame by accessing the cells of the loadings object. To get the variables associated with each row, we assign the row names of the loadings to the column variables. Finally, we return the loadings data.frame.


res <- ore.tableApply(dat,
      function(dat) {
        mod <- princomp(dat, cor=TRUE)
        dd <- dim(mod$loadings)
        ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
        ldgs$variables <- row.names(ldgs)
        ldgs
      })
class(res)
res

In the output below, notice that we still have an ore.object being returned, but it’s in the form of a data.frame.


R> res <- ore.tableApply(dat,
+ function(dat) {
+ mod <- princomp(dat, cor=TRUE)
+ dd <- dim(mod$loadings)
+ ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
+ ldgs$variables <- row.names(ldgs)
+ ldgs
+ })
R> class(res)
[1] "ore.object"
attr(,"package")
[1] "OREembed"
R> res
             Comp.1     Comp.2     Comp.3      Comp.4 variables
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780    Murder
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748   Assault
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773  UrbanPop
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432      Rape

We can address this last issue by specifying the FUN.VALUE argument to get an ore.frame result (left as an exercise to the reader; a sketch follows below). But our main goal is to enable returning the loadings from SQL as a database table. For that, we create the function in the R script repository and construct the appropriate SQL query. In preparation for the next example, we’ll create the table USARRESTS using the R data set.
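Before creating that table, here is a minimal sketch of the FUN.VALUE variation mentioned above. One assumption is that the column names given in FUN.VALUE match those of the returned data.frame; ORE may adjust non-syntactic names, as seen earlier with the summary statistics example.

res <- ore.tableApply(dat,
      function(dat) {
        mod <- princomp(dat, cor=TRUE)
        dd <- dim(mod$loadings)
        ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
        ldgs$variables <- row.names(ldgs)
        ldgs
      },
      FUN.VALUE=data.frame(Comp.1=numeric(0),
        Comp.2=numeric(0),
        Comp.3=numeric(0),
        Comp.4=numeric(0),
        variables=character(0)))
class(res)   # expected to be ore.frame rather than ore.object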


ore.create(USArrests,table="USARRESTS")

Now, we’ll switch to SQL. We’re introducing the functions sys.rqScriptDrop and sys.rqScriptCreate, which are used within a BEGIN END PL/SQL block, to store the R function ‘princomp.loadings’.


begin
--sys.rqScriptDrop('princomp.loadings');
sys.rqScriptCreate('princomp.loadings',
      'function(dat) {
        mod <- princomp(dat, cor=TRUE)
        dd <- dim(mod$loadings)
        ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
        ldgs$variables <- row.names(ldgs)
        ldgs
      }');
end;
/

The SELECT statement provides input data by selecting all data from USARRESTS. There are no arguments to pass, so the next parameter is NULL. The SELECT string describes the format of the result. Notice that the column names must match in name (including case) and type. The last parameter is the name of the function stored in the R script repository.


select *
from table(rqTableEval( cursor(select * from USARRESTS),NULL,
          'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual','princomp.loadings'));

SQL> select *
from table(rqTableEval( cursor(select * from USARRESTS),NULL,
          'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual','princomp.loadings'));
2 3
    Comp.1     Comp.2     Comp.3     Comp.4 variables
---------- ---------- ---------- ---------- ------------
-.53589947 .418180865 -.34123273 .649227804 Murder
-.58318363 .187985604 -.26814843 -.74340748 Assault
-.27819087 -.87280619 -.37801579 .133877731 UrbanPop
-.54343209 -.16731864 .817777908 .089024323 Rape

If you have interesting embedded R scenarios to share with the ORE community, please consider posting a comment.

Monday Jan 20, 2014

Invoking R scripts via Oracle Database: Theme and Variation, Part 5


In the first four parts of Invoking R scripts via Oracle Database: Theme and Variation, we introduced features of Oracle R Enterprise embedded R execution involving the functions ore.doEval / rqEval, ore.tableApply / rqTableEval, ore.groupApply / “rqGroupApply”, and ore.rowApply / rqRowEval. In this blog post, we cover ore.indexApply. Note that there is no corresponding rqIndexEval – more on that later. The “index apply” function is also one of the parallel-enabled embedded R execution functions. It supports task-parallel execution, where one or more R engines perform the same or different calculations, or tasks. A number, associated with the index of the execution, is provided to the function. Any required data is expected to be explicitly generated or loaded within the function.

This functionality is valuable in a variety of settings, e.g., simulations, for taking advantage of high-performance computing hardware like Exadata.

As for “group apply” and “row apply”, Oracle Database handles the management and control of potentially multiple R engines at the database server machine, with only the index passed to the function as the first argument. Oracle Database ensures that each R function execution completes, otherwise the ORE function returns an error. Output formats as supported by the other embedded R functions are possible for ore.indexApply, for example, returning an ore.list or combining data.frame data into an ore.frame.

The variation on embedded R execution for ore.indexApply involves passing as an argument the number of times the user-defined R function should be executed.

Let’s look at a simple example.

The following code specifies to execute the function five times in parallel.


res <- ore.indexApply(5,
      function(index) {
        paste("IndexApply:",index)
      },
    parallel=TRUE)
class(res)
res

Notice that the class of the result is an ore.list, and when we print res, we have 5 character vectors, each with the index that was passed to the user-defined function. As with other parallel embedded R functions, the number of concurrently executing R engines can be limited by specifying the degree of parallelism of the database. As we’ll see in ORE 1.4, the parallel argument can specify a preferred number of parallel R engines, as an upper bound.


> class(res)
[1] "ore.list"
attr(,"package")
[1] "OREbase"
> res
$`1`
[1] "IndexApply: 1"

$`2`
[1] "IndexApply: 2"

$`3`
[1] "IndexApply: 3"

$`4`
[1] "IndexApply: 4"

$`5`
[1] "IndexApply: 5"

Column-parallel use case

If we wanted to parallelize R’s summary function, we could compute the summary statistics on each column in parallel and combine them into a final result. The following example does exactly that. While we could generalize this example, we focus on the iris data set and computing summary statistics on the first four numeric columns. Since iris comes standard with R, there’s no need to load data from any other source; we simply access it. The first argument to ore.indexApply is 4, the number of columns we wish to summarize in parallel. The function takes one argument, index, which will be a value between 1 and 4, and will be used to select the column to summarize. We massage the result of summary into a data.frame and add the column name to the result. Note that the function returns a single row: the summary statistics for the column.


res <- NULL
res <- ore.indexApply(4,
      function(index) {
        ss <- summary(iris[,index])
        attr.names <- attr(ss,"names")
        stats <- data.frame(matrix(ss,1,length(ss)))
        names(stats) <- attr.names
        stats$col <- names(iris)[index]
        stats
      },
      parallel=TRUE)
res

The result comes back as an ore.list object:


> res
$`1`
  Min. 1st Qu. Median  Mean 3rd Qu. Max.          col
1  4.3     5.1    5.8 5.843     6.4  7.9 Sepal.Length

$`2`
  Min. 1st Qu. Median  Mean 3rd Qu. Max.         col
1    2     2.8      3 3.057     3.3  4.4 Sepal.Width

$`3`
  Min. 1st Qu. Median  Mean 3rd Qu. Max.          col
1    1     1.6   4.35 3.758     5.1  6.9 Petal.Length

$`4`
  Min. 1st Qu. Median  Mean 3rd Qu. Max.         col
1  0.1     0.3    1.3 1.199     1.8  2.5 Petal.Width

This is good, but it would be better if the result was returned as an ore.frame, especially since all the columns are the same. To enable this, we’ll do a slight variation on the result by specifying FUN.VALUE with the structure of the result defined.


res <- ore.indexApply(4,
      function(index) {
        ss <- summary(iris[,index])
        attr.names <- attr(ss,"names")
        stats <- data.frame(matrix(ss,1,length(ss)))
        names(stats) <- attr.names
        stats$col <- names(iris)[index]
        stats
      },
      FUN.VALUE=data.frame(Min.=numeric(0),
        "1st Qu."=numeric(0),
        Median=numeric(0),
        Mean=numeric(0),
        "3rd Qu."=numeric(0),
        Max.=numeric(0),
        col=character(0)),
      parallel=TRUE)
res

Now, the result comes back as an ore.frame.


> res
  Min. X1st.Qu. Median  Mean X3rd.Qu. Max.          col
1  0.1      0.3   1.30 1.199      1.8  2.5  Petal.Width
2  1.0      1.6   4.35 3.758      5.1  6.9 Petal.Length
3  4.3      5.1   5.80 5.843      6.4  7.9 Sepal.Length
4  2.0      2.8   3.00 3.057      3.3  4.4  Sepal.Width

Simulation use case

The ore.indexApply function can be used in simulations as well. In this next example we take multiple samples from a random normal distribution with the goal to compare the distribution of the summary statistics. For this, we build upon the example above. We provide parameters such as the sample size, mean and standard deviation of the random numbers, and the number of simulations we want to perform. Each one of these simulations will occur in a separate R engine, in parallel, up to the degree of parallelism allowed by the database.

We specify num.simulations as the first parameter to ore.indexApply. Inside the user-defined function, we pass the index and three arguments to the function. The function then sets the random seed based on the index. This allows each invocation to generate a different set of random numbers. Using rnorm, the function produces sample.size random normal values. We invoke summary on the vector of random numbers, and then prepare a data.frame result to be returned. We’re using the FUN.VALUE to get an ore.frame as the final result.


res <- NULL
sample.size = 1000
mean.val = 100
std.dev.val = 10
num.simulations = 1000

res <- ore.indexApply(num.simulations,
      function(index, sample.size=1000, mean=0, std.dev=1) {
        set.seed(index)
        x <- rnorm(sample.size, mean, std.dev)
        ss <- summary(x)
        attr.names <- attr(ss,"names")
        stats <- data.frame(matrix(ss,1,length(ss)))
        names(stats) <- attr.names
        stats$index <- index
        stats
      },
      FUN.VALUE=data.frame(Min.=numeric(0),
        "1st Qu."=numeric(0),
        Median=numeric(0),
        Mean=numeric(0),
        "3rd Qu."=numeric(0),
        Max.=numeric(0),
        index=numeric(0)),
      parallel=TRUE,
      sample.size=sample.size,
      mean=mean.val, std.dev=std.dev.val)
res
boxplot(ore.pull(res[,1:6]),
  main=sprintf("Boxplot of %d rnorm samples size %d, mean=%d, sd=%d",
        num.simulations, sample.size, mean.val, std.dev.val))

To get the distribution of samples, we invoke boxplot on the data.frame after pulling the result to the client.

Here are a couple of plots showing results for different parameters:


In both cases, we run 10,000 samples. The first graph uses a sample size of 10 and the second uses a sample size of 1000. From these results, it is clear that a larger sample size significantly reduces the variance in each of the summary statistics - confirming our Statistics 101 understanding.

Error reporting

As introduced above, Oracle Database ensures that each embedded R user-defined function execution completes, otherwise the ORE function returns an error. Of course, any side-effects of the user-defined function need to be manually cleaned up. Operations that produce files, create tables in the database, or result in completed database transactions through ROracle will remain intact. The ORE embedded R infrastructure will report errors as produced by the function as illustrated in the following example.

The code specifies invoking 4 parallel R engines. If the index has value 3, the function attempts to load the non-existent package "abc123" (which produces an error); otherwise, it returns the index value.


R> ore.indexApply(4,
+ function(index) {
+ if (index==3) {library(abc123)}
+ else {return(index)}
+ }
+ )
Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch, :
ORA-12801: error signaled in parallel query server P000
ORA-20000: RQuery error
Error in library(abc123) : there is no package called 'abc123'
ORA-06512: at "RQSYS.RQGROUPEVALIMPL", line 121
ORA-06512: at "RQSYS.RQGROUPEVALIMPL", line 118

Notice that the first reported error is ORA-12801: error signaled in parallel query server. Then the ORA-20000: RQuery error indicates the error as returned by the R engine. Also interesting to note is that the ORA-06512 errors reveal the underlying implementation of ore.indexApply: "RQSYS.RQGROUPEVALIMPL". Which leads us to the next topic.

No rqIndexEval?

“Index apply” is really a variation of “group apply” where the INDEX column is a numeric vector that is pushed to the database. With n distinct numbers, one number is provided to each function as its index. As a result, there is no corresponding rqIndexEval in the SQL API. The user would have to create a similar package and function as was illustrated in the blog post on “group apply.”
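As a hedged sketch of that equivalence, the five-fold example at the start of this post could be emulated with ore.groupApply by pushing an index vector to the database:

idx <- ore.push(data.frame(index = 1:5))
res <- ore.groupApply(idx, INDEX = idx$index,
         function(dat) paste("IndexApply:", dat$index[1]),
         parallel = TRUE)
ore.pull(res)

Each group then contains a single row whose index value plays the role of the ore.indexApply index.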

Thursday Jan 09, 2014

Invoking R scripts via Oracle Database: Theme and Variation, Part 4

In the first three parts of Invoking R scripts via Oracle Database: Theme and Variation, we introduced features of Oracle R Enterprise embedded R execution involving the functions ore.doEval / rqEval, ore.tableApply / rqTableEval, and ore.groupApply / “rqGroupApply”. In this blog post, we’ll cover the next in our theme and variation series involving ore.rowApply and rqRowEval. The “row apply” function is also one of the parallel-enabled embedded R execution functions. It supports data-parallel execution, where one or more R engines perform the same R function, or task, on disjoint chunks of data. This functionality is essential to enable scalable model scoring/predictions on large data sets and for taking advantage of high-performance computing hardware like Exadata.

As for ore.groupApply, Oracle Database handles the management and control of potentially multiple R engines at the database server machine, automatically chunking and passing data to parallel executing R engines. Oracle Database ensures that R function executions for all chunks of rows complete, or the ORE function returns an error. The result from the execution of each user-defined embedded R function is gathered in an ore.list. This list remains in the database until the user requires the result. However, we’ll also show how data.frame results from each execution can be combined into a single ore.frame. This feature works for return values of other embedded R functions as well.

The variation on embedded R execution for ore.rowApply involves passing not only an ore.frame to the function such that the first parameter of your embedded R function receives a data.frame, but also the number of rows that should be passed to each invocation of the user-defined R function. The last chunk, of course, may have fewer rows than specified.

Let’s look at an example. We’re going to use the C50 package to score churn data (i.e., predict which customers are likely to churn) using the C5.0 decision tree models we built in the previous blog post with ore.groupApply. (Well, almost. We need to rebuild the models to take into account the full data set levels.) The goal is to score the customers in parallel leveraging the power of a high performance computing platform, such as Exadata.


library(C50)
data(churn)

ore.create(churnTest, "CHURN_TEST")

myFunction <- function(dat, xlevels, datastorePrefix) {
  library(C50)
  state <- dat[1,"state"]
  datastoreName <- paste(datastorePrefix,state,sep="_")
  dat$state <- NULL
  for (j in names(xlevels))
    dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
  ore.load(name=datastoreName)
  res <- data.frame(pred=predict(mod,dat, type="class"),
        actual=dat$churn,
        state=state)
  res
}

xlevels <- ore.getXlevels(~ ., CHURN_TEST[,-1])
scoreList <- ore.groupApply(
  CHURN_TEST,
  INDEX=CHURN_TEST$state,
  myFunction,
  datastorePrefix="myC5.0model3",xlevels=xlevels, ore.connect=TRUE)
score.MA <- ore.pull(scoreList$MA)
table(score.MA$actual, score.MA$pred)

A few points to highlight:

• Instead of computing the levels using the as.factor function inside the user-defined function, we’ll use ore.getXlevels, which returns the levels for each factor column. We don’t need this for the state column, so we exclude it (“-1”). In the previous post we noted that factor data is passed as character columns in the data.frame. Computing the levels first can ensure that all possible levels are provided during model building, even if there are no rows with some of the level values.
• When building models where some levels were missing (due to using as.factor on each partition of data), scoring can fail if the test data has unknown level values. For this reason, the models built in Part 3 need to be rebuilt using the approach above with ore.getXlevels. This is left as an exercise for the reader.
• Assign the function to the variable “myFunction” to facilitate reuse (see below).
• We construct the datastore name to be the same as when we were building the models, i.e., appending the state value to the datastore prefix separated by an ‘_’.
• The for loop iterates over the levels passed in as xlevels, creating a factor using the provided levels and assigning it back to the data.frame.
• Loading the datastore by name, we have access to the variable mod, which contains the model for the particular state.
• The result is constructed as a data.frame with the prediction and the actual values.
• Three arguments are passed: the datastore prefix, the levels that were pre-computed, and that we need to connect to the database because we’re using a datastore.
• The results are stored as a list of ore.frames. We can pull the scores for MA and compute a confusion matrix using table.

This is fine. However, we likely don’t want to have a list of separate ore.frames as the result. We’d prefer to have a single ore.frame with all the results. This can be accomplished using the FUN.VALUE argument. Whenever a data.frame is the result of the user-defined R function, and if the structure of that data.frame is the same across all invocations of the group apply or row apply, you can combine them into a single result by defining the structure as follows:

scores <- ore.groupApply(
  CHURN_TEST,
  INDEX=CHURN_TEST$state,
  myFunction,
  datastorePrefix="myC5.0model3",xlevels=xlevels, ore.connect=TRUE,
  FUN.VALUE=data.frame(pred=character(0),
        actual=character(0),
        state=character(0)));
head(scores)
scores.local <- ore.pull(scores)
table(scores.local[scores.local$state=="MA",c("actual","pred")])

scores.MA <- scores[scores$state=="MA",c("actual","pred")]
table(scores.MA$actual, scores.MA$pred)

A few important points to highlight:

• FUN.VALUE is set to a data.frame that describes the format of the result. By providing this argument, you will get back a single ore.frame, not an ore.list object.
• The group apply completes instantaneously because it is only defining the ore.frame, not actually performing the scoring. Not until the values are needed does the result get computed. We invoke head on the ore.frame in scores to highlight this.
• We can pull the scores to the client to invoke table as before, but subselecting for state MA. However, we can also do this computation in the database using the transparency layer. First, we filter the rows for MA in scores.MA, and then invoke table on the two columns. Note: ORE requires passing the two columns explicitly to the overloaded function table.
• To do this in parallel, add the argument parallel=TRUE to the ore.groupApply call, as sketched just after this list.
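A minimal sketch of that parallel variant, repeating the call above with parallel=TRUE added:

scores <- ore.groupApply(
  CHURN_TEST,
  INDEX=CHURN_TEST$state,
  myFunction,
  datastorePrefix="myC5.0model3", xlevels=xlevels, ore.connect=TRUE,
  parallel=TRUE,
  FUN.VALUE=data.frame(pred=character(0),
        actual=character(0),
        state=character(0)))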

Wait! What happened to ore.rowApply?

Above, we showed how to score with multiple models using ore.groupApply. But what if we had customers from a single state that we wanted to score in parallel? We can use ore.rowApply and rqRowEval to invoke a function on chunks of data (rows) at a time, from 1 to the total number of rows. (Note that values closer to the latter will have no benefit from parallelism, obviously.)


scores <- ore.rowApply(
  CHURN_TEST[CHURN_TEST$state=="MA",],
  myFunction,
  datastorePrefix="myC5.0model3",xlevels=xlevels,
  ore.connect=TRUE, parallel=TRUE,
  FUN.VALUE=data.frame(pred=character(0),
        actual=character(0),
        state=character(0)),
  rows=200)
scores
table(scores$actual, scores$pred)

A few points to highlight:

• Since we want to perform the scoring in parallel by state, we filter the rows for MA. This will ensure that all rows processed can use the same predictive model.
• We set the rows argument to 200. CHURN_TEST has 1667 rows, so this will result in nine executions of myFunction: the first eight receive 200 rows each and the last receives 67 rows.
• We also set parallel=TRUE above since we want the scoring performed in parallel.
• The invocation of ore.rowApply returns immediately. Not until we print scores do we incur the cost of executing the underlying query. However, also note that each time we access scores, for example in the following call to table, we incur the cost of executing the query. If the result will be used many times in subsequent operations, you may want to create a table with the result using ore.create (a sketch follows this list).
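A minimal sketch of materializing the result; the table name CHURN_SCORES_MA is a placeholder:

# persist the lazily evaluated scores so later operations read stored rows
# rather than re-running the embedded R scoring
ore.create(scores, table = "CHURN_SCORES_MA")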

In SQL, we can do the same, but we’ll need to store the function in the R script repository (perhaps called "myScoringFunction") and also store xlevels in a datastore (perhaps called "myXLevels"). While we can pass complex objects in the R interface to embedded R functions, we cannot do that in SQL. Instead, we must pass the name of a datastore. Since the xlevels are in a datastore, the user-defined R function needs to be modified to take this other datastore name and load that datastore to have access to xlevels. This set of changes is left to the reader as an exercise.
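A hedged sketch of those changes is shown below. The names myScoringFunction and myXLevels match the hypothetical ones used in the query that follows, and we assume xlevels is first saved to a datastore from the client.

# sketch: save xlevels to a datastore, then register a scoring function that
# loads it by datastore name instead of receiving the R object directly
ore.save(xlevels, name="myXLevels", overwrite=TRUE)

ore.scriptCreate("myScoringFunction",
  function(dat, datastorePrefix, xlevelsDatastore) {
    library(C50)
    state <- dat[1,"state"]
    datastoreName <- paste(datastorePrefix, state, sep="_")
    ore.load(name=xlevelsDatastore)          # brings xlevels into scope
    dat$state <- NULL
    for (j in names(xlevels))
      dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
    ore.load(name=datastoreName)             # brings mod into scope
    data.frame(pred=predict(mod, dat, type="class"),
               actual=dat$churn,
               state=state)
  })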


select * from table(rqRowEval(
  cursor(select /*+ parallel(t, 4) */ *
        from CHURN_TEST t
        where "state" = 'MA'),
  cursor(select 1 as "ore.connect",
        'myC5.0model3' as "datastorePrefix",
        'myXLevels' as "xlevelsDatastore"
        from dual),
  'select ''aaa'' "pred",''aaa'' "actual" , ''aa'' "state" from dual',
    200, 'myScoringFunction'));

A few points to highlight:

• The input cursor specifies a parallel hint on the input data cursor and filtering data for MA as well.
• Several arguments are being passed, including the new argument to our function myXLevels.
• The output form is specified in the SQL string. Care must be taken to ensure that the column names, ordering, and the length of character strings match the returned data.frame.

Map Reduce

The “row apply” functionality can be thought of in terms of the map-reduce paradigm where the mapper performs the scoring and outputs a data.frame value (no key required). There is no reducer, or the reducer is simply a pass-through.

Memory and performance considerations

Unlike with group apply, the rows argument in row apply ensures an upper bound on the number of rows (and hence memory requirement). The value of rows should be chosen to balance memory and parallel performance. The usual measures can be taken regarding setting memory limits on the R engine – as noted in Part 2.

There may be instances where setting rows = 1 makes sense. For example, if the computation per row is intensive (i.e., takes a long time), sending one row per R engine may be appropriate. Experiment with a range of values for rows to determine the best value for your particular scenario.
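A hedged sketch of such an experiment, reusing the arguments from the ore.rowApply call above and forcing computation with ore.pull so the timing reflects the scoring work; the chunk sizes are illustrative:

# compare elapsed times for a few chunk sizes
for (r in c(50, 200, 800)) {
  t <- system.time(
    ore.pull(ore.rowApply(
      CHURN_TEST[CHURN_TEST$state=="MA",],
      myFunction,
      datastorePrefix="myC5.0model3", xlevels=xlevels,
      ore.connect=TRUE, parallel=TRUE,
      FUN.VALUE=data.frame(pred=character(0),
            actual=character(0),
            state=character(0)),
      rows=r)))
  print(c(rows=r, elapsed=unname(t["elapsed"])))
}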

Sunday Jan 05, 2014

Invoking R scripts via Oracle Database: Theme and Variation, Part 3

In the first two parts of Invoking R scripts via Oracle Database: Theme and Variation, we introduced features of Oracle R Enterprise embedded R execution, focusing on the functions ore.doEval / rqEval and ore.tableApply / rqTableEval. In this blog post, we’ll cover the next in our theme and variation series involving ore.groupApply and the corresponding definitions required for SQL execution. The “group apply” function is one of the parallel-enabled embedded R execution functions. It supports data-parallel execution, where one or more R engines perform the same R function, or task, on different partitions of data. This functionality is essential to enable the building of potentially 10s or 100s of thousands of predictive models, e.g., one per customer, and for taking advantage of high-performance computing hardware like Exadata.

Oracle Database handles the management and control of potentially multiple R engines at the database server machine, automatically partitioning and passing data to parallel executing R engines. It ensures that all R function executions for all partitions complete, or the ORE function returns an error. The result from the execution of each user-defined embedded R function is gathered in an ore.list. This list remains in the database until the user requires the result.

The variation on embedded R execution for ore.groupApply involves passing not only an ore.frame to the function such that the first parameter of your embedded R function receives a data.frame, but also an INDEX argument that specifies the name of a column by which the rows will be partitioned for processing by a user-defined R function.

Let’s look at an example. We’re going to use the C50 package to build a C5.0 decision tree model on the churn data set from C50. The goal is to build one churn model on the data for each state.


library(C50)
data(churn)

ore.create(churnTrain, "CHURN_TRAIN")

modList <- ore.groupApply(
  CHURN_TRAIN,
  INDEX=CHURN_TRAIN$state,
    function(dat) {
      library(C50)
      dat$state <- NULL
      dat$churn <- as.factor(dat$churn)
      dat$area_code <- as.factor(dat$area_code)
      dat$international_plan <- as.factor(dat$international_plan)
      dat$voice_mail_plan <- as.factor(dat$voice_mail_plan)
      C5.0(churn ~ ., data = dat, rules = TRUE)
    });
mod.MA <- ore.pull(modList$MA)
summary(mod.MA)

A few points to highlight:

• As noted in Part 2 of this series, to use the CRAN package C50 on the client, we first load the library, and then the churn data set.
• Since the data is a data.frame, we’ll create a table in the database with this data. Notice that if you compare the results of str(churnTrain) with str(CHURN_TRAIN), you will see that the factor columns have been retained. This becomes relevant later.
• The function ore.groupApply will return a list of models stored as ore.object instances. The first argument is the ore.frame CHURN_TRAIN and the second argument indicates to partition the data on column state such that the user-defined function is invoked on each partition of the data.
• The next argument specifies the function, which could alternatively have been the function name if the FUN.NAME argument were used and the function saved explicitly in the R script repository. The function’s first argument (whatever its name) will receive one partition of data, e.g., all data associated with a single state.
• Regarding the user-defined function body, we explicitly load the package we’re using, C50, so the function body has access to it. Recall that this user-defined R function will execute at the database server in a separate R engine from the client.
• Since we don’t need to know which state we’re working with and we don’t want this included in the model, we delete the column from the data.frame.
• Although the ore.frame CHURN_TRAIN has factor columns, when the data is passed to the user-defined embedded R function, factors appear as character vectors. As a result, we need to convert them back to factors explicitly.
• The model is built and returned from the function.
• The result from ore.groupApply is a list containing the results from the execution of the user-defined function on each partition of the data. In this case, it will be one C5.0 model per state.
• To view the model, we first use ore.pull to retrieve it from the database and then invoke summary on it. The class of mod.MA is “C5.0”.

SQL API

We can invoke the function through the SQL API by storing the function in the R script repository. Previously we showed doing this using the SQL API; here we use the R API. We also modify the function to store the resulting models in an ORE datastore by state name:


ore.scriptCreate("myC5.0Function",
  function(dat,datastorePrefix) {
    library(C50)
    datastoreName <- paste(datastorePrefix,dat[1,"state"],sep="_")
    dat$state <- NULL
    dat$churn <- as.factor(dat$churn)
    dat$area_code <- as.factor(dat$area_code)
    dat$international_plan <- as.factor(dat$international_plan)
    dat$voice_mail_plan <- as.factor(dat$voice_mail_plan)
    mod <- C5.0(churn ~ ., data = dat, rules = TRUE)
    ore.save(mod, name=datastoreName)
    TRUE
  })

Just for comparison, we could invoke this from the R API as follows:


res <- ore.groupApply( CHURN_TRAIN, INDEX=CHURN_TRAIN$state,
          FUN.NAME="myC5.0Function",
          datastorePrefix="myC5.0model", ore.connect=TRUE)
res
res <- ore.pull(res)
all(as.logical(res) == TRUE)

Since we’re using a datastore, we need to connect to the database setting ore.connect to TRUE. We also pass the datastorePrefix. The result res is an ore.list of logical values. To test if all are TRUE, we first pull the result and use the R all function.

Back to the SQL API…Now that we can refer to the function in the SQL API, we invoke the function that places one model per datastore, each with the given prefix and state.


select *
from table(churnGroupEval(
  cursor(select * from CHURN_TRAIN),
  cursor(select 1 as "ore.connect", 'myC5.0model2' as "datastorePrefix" from dual),
  'XML', 'state', 'myC5.0Function'));

There’s one thing missing, however. We don’t have the function churnGroupEval. There is no generic “rqGroupEval” in the API – we need to define our own table function that matches the data provided. Due to this and the parallel nature of the implementation, we need to create a PL/SQL FUNCTION and supporting PACKAGE:


CREATE OR REPLACE PACKAGE churnPkg AS
  TYPE cur IS REF CURSOR RETURN CHURN_TRAIN%ROWTYPE;
END churnPkg;
/
CREATE OR REPLACE FUNCTION churnGroupEval(
  inp_cur churnPkg.cur,
  par_cur SYS_REFCURSOR,
  out_qry VARCHAR2,
  grp_col VARCHAR2,
  exp_txt CLOB)
RETURN SYS.AnyDataSet
PIPELINED PARALLEL_ENABLE (PARTITION inp_cur BY HASH ("state"))
CLUSTER inp_cur BY ("state")
USING rqGroupEvalImpl;
/

The pieces that need to be changed to create this function for any particular data set are the input table type in the package (CHURN_TRAIN%ROWTYPE) and the partitioning and clustering column ("state"). There are other variants, but this will get you quite far.

To validate that our datastores were created, we invoke ore.datastore(). This returns the datastores present and we will see 51 such entries – one for each state and the District of Columbia.
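A quick check from R might look like the following minimal sketch; the datastore.name column name and the grepl filter are assumptions for illustration.


# List datastores visible to this user and count those created by the group apply run;
# the "datastore.name" column name is assumed for illustration
ds <- ore.datastore()
sum(grepl("^myC5.0model", ds$datastore.name))   # expect 51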

Parallelism

Above, we mentioned that “group apply” supports data parallelism. By default, parallelism is turned off. To enable parallelism, the parallel argument to ore.groupApply needs to be set to TRUE.


ore.groupApply( CHURN_TRAIN, INDEX=CHURN_TRAIN$state,
          FUN.NAME="myC5.0Function",
          datastorePrefix="myC5.0model",
          ore.connect=TRUE,
          parallel=TRUE
)

In the case of the SQL API, a parallel hint can be provided with the input cursor. The hint below requests a degree of parallelism of up to 4.


select *
from table(churnGroupEval(
  cursor(select /*+ parallel(t,4) */ * from CHURN_TRAIN t),
  cursor(select 1 as "ore.connect", 'myC5.0model2' as "datastorePrefix" from dual),
  'XML', 'state', 'myC5.0Function'));

Map Reduce

The “group apply” functionality can be thought of in terms of the map-reduce paradigm, where the mapper performs the partitioning by outputting the INDEX value as the key and the data.frame as the value. Each reducer then receives the rows associated with one key. In our example above, INDEX was the column state, so each reducer receives the rows associated with a single state.

Memory and performance considerations

While the data is partitioned by the INDEX column, it is still possible that a given partition is quite large, such that either the partition of data will not fit in the R engine memory or the user-defined embedded R function will not be able to execute to completion. The usual remedial measures can be taken regarding setting memory limits – as noted in Part 2.

If the partitions are not balanced, you would have to configure the system’s memory for the largest partition. This will also have implications for performance, obviously, since smaller partitions of data will likely complete faster than larger ones.

The blog post Managing Memory Limits and Configuring Exadata for Embedded R Execution discusses how to instrument your code to understand the memory usage of your R function. This is done in the context of ore.indexApply (to be discussed later in this blog series), but the approach is analogous for “group apply.”

Friday Jan 03, 2014

Invoking R scripts via Oracle Database: Theme and Variation, Part 2

In part 1 of Invoking R scripts via Oracle Database: Theme and Variation, we introduced features of Oracle R Enterprise embedded R execution, focusing on the functions ore.doEval and rqEval. In this blog post, we’ll cover the next in our theme and variation series involving ore.tableApply and rqTableEval.

The variation on embedded R execution for ore.tableApply involves passing an ore.frame to the function such that the first parameter of your embedded R function receives a data.frame. The rqTableEval function in SQL allows users to specify a data cursor to be delivered to your embedded R function as a data.frame.

Let’s look at a few examples.


R API

In the following example, we’re using ore.tableApply to build a Naïve Bayes model on the iris data set. Naïve Bayes is found in the e1071 package, which must be installed on both the client and database server machine R engines.

library(e1071)
mod <- ore.tableApply(
  ore.push(iris),
  function(dat) {
    library(e1071)
    dat$Species <- as.factor(dat$Species)
    naiveBayes(Species ~ ., dat)
  })
class(mod)
mod

A few points to highlight:
• To use the CRAN package e1071 on the client, we first load the library.
• The iris data set is pushed to the database with ore.push to create an ore.frame, which is the first argument to ore.tableApply. Normally, this argument is an ore.frame corresponding to a table that already exists in Oracle Database. If not obvious, note that we could have previously assigned dat <- ore.push(iris) and passed dat as the argument as well.
• The embedded R function is supplied as the second argument to ore.tableApply as a function object. Recall from Part 1 that we could have alternatively assigned this function to a variable and passed the variable as an argument, or stored the function in the R script repository and passed the argument FUN.NAME with the assigned function name.
• The user-defined embedded R function takes dat as its first argument which will contain a data.frame derived from the ore.frame supplied.
• The model itself is returned from the function.
• The result of the ore.tableApply execution will be an ore.object.

SQL API

We can invoke the function through the SQL API by storing the function in the R script repository. Recall that the call to sys.rqScriptCreate must be wrapped in a BEGIN-END PL/SQL block.


begin
  sys.rqScriptCreate('myNaiveBayesModel',
    'function(dat) {
       library(e1071)
       dat$Species <- as.factor(dat$Species)
       naiveBayes(Species ~ ., dat)
     }');
end;
/

Invoking the function myNaiveBayesModel occurs in a SQL SELECT statement as shown below. The first argument to rqTableEval specifies a cursor that retrieves the IRIS table. Note that the IRIS table could have been created earlier using ore.create(iris,"IRIS"). The second argument, NULL, indicates that no arguments are supplied to the function.

The function returns an R object of type naiveBayes, but as a serialized object that is chunked into a table. This likely is not useful to most users.


select *
from table(rqTableEval(cursor(select * from IRIS), NULL, NULL, 'myNaiveBayesModel'));

If we want to keep the model in a more usable form, we can store it in an ORE datastore in Oracle Database. For this, we require a change to the user-defined R function and the SQL invocation.


begin
  sys.rqScriptCreate('myNaiveBayesModel',
    'function(dat) {
       library(e1071)
       dat$Species <- as.factor(dat$Species)
       mod <- naiveBayes(Species ~ ., dat)
       ore.save(mod, name="myNaiveBayesDatastore")
       TRUE
     }');
end;
/

select *
from table(rqTableEval(
  cursor(select * from IRIS),
  cursor(select 1 as "ore.connect" from dual),
  'XML',
  'myNaiveBayesModel'));

In this version, we’ve stored the model in the datastore named ‘myNaiveBayesDatastore’ and returned TRUE so there is a simple value to show as the result of the function execution. In the SQL query, we changed the third parameter to ‘XML’ to return an XML string containing “TRUE”. The name of the datastore could also be passed as an argument, as follows:


begin
  sys.rqScriptCreate('myNaiveBayesModel',
    'function(dat, datastoreName) {
       library(e1071)
       dat$Species <- as.factor(dat$Species)
       mod <- naiveBayes(Species ~ ., dat)
       ore.save(mod, name=datastoreName)
       TRUE
     }');
end;
/
select *
from table(rqTableEval(
  cursor(select * from IRIS),
  cursor(select 'myNaiveBayesDatastore' "datastoreName", 1 as "ore.connect" from dual),
  'XML',
  'myNaiveBayesModel'));

Memory considerations with ore.tableApply and rqTableEval

The input data provided as the first argument to a user-defined R function invoked using ore.tableApply or rqTableEval is physically being moved from Oracle Database to the database server R engine. It’s important to realize that R’s memory limitations still apply. If your database server machine has 32 GB RAM and your data table is 64 GB, ORE will not be able to load the data into the R function’s dat argument.
You may see errors like:


Error : vector memory exhausted (limit reached)

or

ORA-28579: network error during callback from external procedure agent

See the blog post on Managing Memory Limits and Configuring Exadata for Embedded R Execution where we discuss setting memory limits for the database server R engine. This can be necessary to load reasonably sized data tables.

Parallelism

As with ore.doEval / rqEval, user-defined R functions invoked using ore.tableApply / rqTableEval are not executed in parallel, i.e., a single R engine is used to execute the user-defined R function.

Invoking certain ORE advanced analytics functions

In the current ORE release, some advanced analytics functions, like ore.lm or ore.glm, which use the embedded R execution framework, cannot be used within other embedded R calls such as ore.doEval / rqEval and ore.tableApply / rqTableEval.

You can expect to see an error like the following:


ORA-28580: recursive external procedures are not supported

In the next post in this series, I’ll discuss ore.groupApply and the corresponding definitions required for SQL execution, since there is no generic rqGroupApply function. I’ll also cover the relationship of the various “group apply” constructs to the map-reduce paradigm.

Thursday Jan 02, 2014

Invoking R scripts via Oracle Database: Theme and Variation


Oracle R Enterprise provides several ways for you to invoke R scripts through Oracle Database. From the same R script you can get structured data, an XML representation of R objects and images, and even PNG images via a BLOB column in a database table. This series of blog posts will take you through the various ways you can interact with R scripts and Oracle Database. In this first post, we explore the benefits of embedded R execution and usage of the ore.doEval and rqEval functions. Subsequent posts will detail the use of the other embedded R execution functions, and in the case of data- and task-parallel capabilities, how these relate to the MapReduce paradigm.

Introduction


In Oracle R Enterprise, we use the phrase “embedded R execution” to characterize the storing of R scripts in Oracle Database – using the ORE R script repository – and invoking that script in one or more database-side R engines.

This is a powerful capability for several reasons:


  • enable data- and task-parallel execution of user-defined R functions that correspond to special cases of Hadoop Map-Reduce jobs

  • leverage a more powerful database server machine for R engine execution – both RAM and CPU

  • transfer data between Oracle Database and R engine much faster than to a separate client R engine

  • invoke user-defined R functions from SQL and retrieve results in various forms depending on application requirements: tables, XML, PNG BLOB

  • leverage open source CRAN packages at the database server machine

  • schedule R scripts for automatic execution via SQL with Oracle Database DBMS_SCHEDULER PL/SQL package


Users can interactively develop R scripts using their favorite R IDE, and then deploy the script as an R function to the database where it can be invoked either from R or SQL. Embedded R execution facilitates application use of R scripts with better performance and throughput than using a client-side R engine. Executing R scripts from SQL enables integration of R script results with OBIEE, Oracle BI Publisher, and other SQL-enabled tools for structured data, R objects, and images.

Table 1 provides a summary of the embedded R execution functions and the R script repository functions available. The function f refers to the user-defined R code, or script, that is provided either as an R function object or as a named R function in the database R script repository. The functions for creating and dropping scripts in the repository are also listed.

ore.doEval / rqEval
Executes f with no automatic transfer of data.

ore.tableApply / rqTableEval
Executes f passing all rows of the provided input ore.frame as the first parameter of f. The first parameter is provided as a data.frame.

ore.groupApply / “rqGroupEval” (must be explicitly defined as a function by the user)
Executes f by partitioning data according to an “index” column’s values. Each data partition is provided as a data.frame in the first parameter of f. Supports parallel execution of each f invocation in a pool of database server-side R engines.

ore.rowApply / rqRowEval
Executes f passing a specified number of rows (a “chunk”) of the provided input ore.frame. Each chunk is provided as a data.frame in the first parameter of f. Supports parallel execution of each f invocation in a pool of database server-side R engines.

ore.indexApply / N/A (no SQL equivalent)
Executes f with no automatic transfer of data, but provides the index of the invocation, 1 through n, where n is the number of functions to invoke. Supports parallel execution of each f invocation in a pool of database server-side R engines.

ore.scriptCreate / sys.rqScriptCreate
Loads the provided R function into the R script repository with the provided name.

ore.scriptDrop / sys.rqScriptDrop
Removes the named R function from the R script repository.

Table 1: Embedded R Execution and R Script Repository function summary


Using ore.doEval and rqEval


The first embedded R functions we cover are also the simplest. The R function ore.doEval and the SQL function rqEval do not automatically receive any data from the database. They simply execute the function f provided. Any data needed by f is either generated within f or explicitly retrieved from a data source such as Oracle Database, another database, or flat files.

R API


In the R interface, users can specify an R function as an argument to ore.doEval, or use the name of that function as stored in the R script repository. For example, the function RandomRedDots returns a data.frame with two columns and plots 100 random normal values. To invoke the function through the database server requires minimal specification with ore.doEval.

RandomRedDots <- function(divisor = 100) {
  id <- 1:10
  plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2)
  data.frame(id = id, val = id / divisor)
}
ore.doEval(RandomRedDots)


Here is the result, where the image is displayed at the client, but generated by the database server R engine that executed the function f.


We can provide arguments to f as well. To override the divisor argument, provide it as an argument to ore.doEval. Note that any number of parameters, including more complex R objects such as models, can be passed as arguments this way.

ore.doEval(RandomRedDots, divisor=50)


Behind the scenes: when passing the function itself (as above), ORE implicitly stores the function in the R script repository before executing it. When finished executing, the function is dropped from the repository. If we want to store this function explicitly in the repository, we can use ore.scriptCreate:

ore.scriptCreate("myRandomRedDots",
RandomRedDots)


Now, the function can be invoked by name:

ore.doEval(FUN.NAME="myRandomRedDots",divisor=50)


The return value of f is a data.frame; however, if we capture the result in a variable, we’ll notice two things: the class of the return value is ore.object, and the image does not display.

res <- ore.doEval(FUN.NAME="myRandomRedDots", divisor=50)
class(res)



To get back the data.frame, we must invoke ore.pull to pull the result to the client R engine.

res.local <- ore.pull(res)
class(res.local)


If we want to return an ore.frame instead of an ore.object, we specify the argument FUN.VALUE, which describes the structure of the result.

res.of <- ore.doEval(FUN.NAME="myRandomRedDots", divisor=50,
                     FUN.VALUE=data.frame(id=1, val=1))
class(res.of)


SQL API


Now, we’ll look at executing the same R function f using the SQL interface of embedded R execution, while pointing out a few significant differences in the API. The first difference is that the R functions are defined as strings, not R objects. This should be no surprise since we’ll be using a SQL interface like SQL Developer or SQL*Plus. Also, the R function string cannot be passed directly in the rqEval function, but must first be stored in the R script repository. The call to sys.rqScriptCreate must be wrapped in a BEGIN-END PL/SQL block.

begin
  sys.rqScriptCreate('myRandomRedDots2',
    'function(divisor=100, numDots=100){
       id <- 1:10
       plot(1:numDots, rnorm(numDots), pch = 21, bg = "red", cex = 2)
       data.frame(id = id, val = id / divisor)
     }');
end;
/


Invoking the function myRandomRedDots2 occurs in a SQL SELECT statement as shown below. The first NULL argument to rqEval indicates that no arguments are supplied to the function myRandomRedDots2. In the SQL API, we can ask for the data.frame returned by f to appear as a SQL table. For this, the second parameter can take a SQL string that describes the column names and data types corresponding to the returned data.frame. You can provide a prototype row using the dual dummy table; the select statement can also be based on an existing table or view.

select *
from table(rqEval(NULL, 'select 1 id, 1 val from dual', 'myRandomRedDots2'));


To pass parameters in SQL, we can replace the first NULL argument with a cursor that specifies a single row of scalar values. Multiple arguments can be specified as shown below. Note that argument names are case sensitive, so it is best to include column names in double quotes. Note also that the first argument is a cursor whereas the second parameter is a string. The former provides data values, whereas the latter is parsed to determine the structure of the result.

select *
from table(rqEval(cursor(select 50 "divisor", 500 "numDots" from dual),
                  'select 1 id, 1 val from dual',
                  'myRandomRedDots2'));


When specifying a table structure for the result as above, any image data is discarded. To get back both structured data and images, we replace the second argument with ‘XML’. This instructs the database to generate an XML string, first with any structured or semi-structured R objects, followed by the image or images generated by the R function f. Images are returned as a base 64 encoding of the PNG representation.

select *
from table(rqEval(cursor(select 50 "divisor", 500 "numDots" from dual),
                  'XML',
                  'myRandomRedDots2'));


Advanced features


To establish a connection to Oracle Database within the R function f, a special argument ore.connect can be set to TRUE. This uses the credentials of the user who invoked the embedded R function ore.doEval or rqEval to establish a connection and also automatically load the ORE package. This capability can be useful to explicitly use the ORE Transparency Layer or to save and load objects with ORE R object datastores.

RandomRedDots <- function(divisor = 100, datastore.name = "myDatastore") {
  id <- 1:10
  plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2)
  ore.load(datastore.name)   # contains numeric variable myVar
  data.frame(id = id, val = id / divisor, num = myVar)
}


ore.doEval(RandomRedDots, datastore.name="datastore_1", ore.connect=TRUE)


Notice the additions: we pass the name of a datastore to load, and that datastore is expected to contain a variable myVar. Arguments prefixed with ‘ore.’ are control arguments and are not passed to f. Other control arguments include ore.drop, which if set to TRUE converts a one-column input data.frame to a vector; ore.graphics, which if set to TRUE starts a graphics driver to capture images returned from f; and ore.png.*, which provides additional parameters for the PNG graphics device. The ore.png.* control arguments include (replace * with): width, height, units, pointsize, bg, res, type, etc.
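As a minimal sketch of these control arguments (exact defaults may vary by ORE version; the width and height values are arbitrary):


ore.doEval(FUN.NAME="myRandomRedDots", divisor=50,
           ore.graphics=TRUE,     # capture images generated by f at the server
           ore.png.width=800,     # passed through to the PNG graphics device
           ore.png.height=600)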

In the next post, we will explore ore.tableApply and rqTableEval.

Sunday Dec 08, 2013

Explore Oracle's R Technologies at BIWA Summit 2014

It’s getting to be that time of year again. The Oracle BIWA Summit '14 will be taking place January 14-16 at Oracle HQ Conference Center, Redwood Shores, CA. Check out the detailed agenda.

BIWA Summit provides a wide range of sessions on Business Intelligence, Warehousing, and Analytics, including novel and interesting use cases of Oracle Big Data, Exadata, Advanced Analytics/Data Mining, OBIEE, Spatial, Endeca, and more! You’ll also have opportunities to get hands-on experience with products in the Hands-on Labs, hear great customer case studies, and attend talks by Oracle technical professionals and partners. Meet with technical experts on the technology you want and need to use.

Click HERE to read detailed abstracts and speaker profiles.  Use the SPECIAL DISCOUNT code ORACLE12c and registration is only $199 for the 2.5 day technically focused Oracle user group event.

On the topic of Oracle’s R technologies, don't miss:

  • Introduction to Oracle's R Technologies
  • Applying Oracle's R Technologies to Big Data Problems
  • Hands-on Lab: Learn to use Oracle R Enterprise
  • OBIEE + OAA Integration Paths : interactive OAA in SampleApp Dashboards
  • Blazing Business Analytics: Analytic Options to the Oracle Database
  • Best Practices for In-Database Analytics

We look forward to meeting you there!

Wednesday Oct 23, 2013

Migrating R Scripts from Development to Production

“How do I move my R scripts stored in one database instance to another? I have my development/test system and want to migrate to production.”

Users of Oracle R Enterprise Embedded R Execution will often store their R scripts in the R Script Repository in Oracle Database, especially when using the ORE SQL API. From previous blog posts, you may recall that Embedded R Execution enables running R scripts managed by Oracle Database using both R and SQL interfaces. In ORE 1.3.1, the SQL API requires scripts to be stored in the database and referenced by name in SQL queries. The SQL API enables seamless integration with database-based applications and ease of production deployment.

Loading R scripts in the repository

Before talking about migration, we’ll first introduce how users store R scripts in Oracle Database. Users can add R scripts to the repository in R using the function ore.scriptCreate, or SQL using the function sys.rqScriptCreate.

For the sample R script

    id <- 1:10
    plot(1:100,rnorm(100),pch=21,bg="red",cex =2)
    data.frame(id=id, val=id / 100)

users wrap this in a function and store it in the R Script Repository with a name. In R, this looks like

ore.scriptCreate("RandomRedDots",
function () {
line-height: 115%; font-family: "Courier New";">     id <- 1:10
    plot(1:100,rnorm(100),pch=21,bg="red",cex =2)
    data.frame(id=id, val=id / 100))
})

In SQL, this looks like

begin
  sys.rqScriptCreate('RandomRedDots',
    'function() {
       id <- 1:10
       plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2)
       data.frame(id = id, val = id / 100)
     }');
end;
/

The R function ore.scriptDrop and SQL function sys.rqScriptDrop can be used to drop these scripts as well. Note that the system will give an error if the script name already exists.
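For example, a defensive re-deployment pattern is to drop the script first, ignoring the error if it does not yet exist, and then create it (a minimal sketch):


try(ore.scriptDrop("RandomRedDots"), silent = TRUE)   # ignore the error if the script is absent
ore.scriptCreate("RandomRedDots",
  function() {
    id <- 1:10
    plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2)
    data.frame(id = id, val = id / 100)
  })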

Accessing R scripts once they’ve been loaded

If you’re not using a source code control system, it is possible that your R scripts can be misplaced or files modified, making what is stored in Oracle Database the only, or best, copy of your R code. If you’ve loaded your R scripts to the database, it is straightforward to access these scripts from the database table SYS.RQ_SCRIPTS. For example,

select * from sys.rq_scripts where name='myScriptName';

From R, scripts in the repository can be loaded into the R client engine using a function similar to the following:

ore.scriptLoad <- function(name) {
  query <- paste("select script from sys.rq_scripts where name='", name, "'", sep="")
  str.f <- OREbase:::.ore.dbGetQuery(query)
  assign(name, eval(parse(text = str.f)), pos = 1)
}

ore.scriptLoad("myFunctionName")

This function is also useful if you want to load an existing R script from the repository into another R script in the repository – think modular coding style. Just include this function in the body of the other function and load the named script.
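A hypothetical sketch of that modular style follows; it assumes a script named "myHelper" already exists in the repository and that a database connection is available when the outer function runs (e.g., invoked with ore.connect=TRUE):


ore.scriptCreate("myCompositeScript",
  function() {
    # same loader as above, included in the body of the outer function
    ore.scriptLoad <- function(name) {
      query <- paste("select script from sys.rq_scripts where name='", name, "'", sep="")
      str.f <- OREbase:::.ore.dbGetQuery(query)
      assign(name, eval(parse(text = str.f)), pos = 1)
    }
    ore.scriptLoad("myHelper")   # hypothetical script name
    myHelper()                   # invoke the function just loaded
  })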

Migrating R scripts from one database instance to another

To move a set of functions from one system to another, the following script loads the functions from one R script repository into the client R engine, then connects to the target database and creates the scripts there with the same names.

scriptNames <- OREbase:::.ore.dbGetQuery(
  "select name from sys.rq_scripts where name not like 'RQG$%' and name not like 'RQ$%'")$NAME

for(s in scriptNames) {
  cat(s, "\n")
  ore.scriptLoad(s)
}

ore.disconnect()
ore.connect("rquser", "orcl", "localhost", "rquser")

for(s in scriptNames) {
  cat(s, "\n")
  ore.scriptDrop(s)
  ore.scriptCreate(s, get(s))
}

Best Practice

When naming R scripts, keep in mind that the name can be up to 128 characters. As such, consider organizing scripts in a directory-like structure. For example, if an organization has multiple groups or applications sharing the same database and there are multiple components, use “/” in the name to reflect that organization:

line-height: 115%;">ore.scriptCreate("/org1/app1/component1/myFuntion1", myFunction1)
ore.scriptCreate("/org1/app1/component1/myFuntion2", myFunction2)
ore.scriptCreate("/org1/app2/component2/myFuntion2", myFunction2)
ore.scriptCreate("/org2/app2/component1/myFuntion3", myFunction3)
ore.scriptCreate("/org3/app2/component1/myFuntion4", myFunction4)

Users can then query for all functions using the path prefix when looking up functions.
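From R, such a lookup might reuse the internal query shown earlier (a sketch; the "/org1/app1/" prefix is illustrative):


# Find all repository scripts under a given organizational prefix
prefix <- "/org1/app1/"
query  <- paste("select name from sys.rq_scripts where name like '", prefix, "%'", sep="")
OREbase:::.ore.dbGetQuery(query)$NAME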

Friday Oct 11, 2013

Take Oracle R Enterprise for a Test Drive

If you'd like to try Oracle R Enterprise, Oracle Partner Vlamis Software Solutions provides a quick and easy way to get started using a virtual machine (VM) loaded with all the software you require and hosted on Amazon Web Services (AWS). Follow this link and within a few clicks you'll have a "Remote Desktop" connection to the cloud, with sample scripts for you to explore both the R language and Oracle R Enterprise, from both R and SQL.

Oracle R Enterprise, a component of the Oracle Advanced Analytics option, makes the open source R statistical programming language and environment ready for the enterprise and big data. It provides a comprehensive, database-centric environment for end-to-end analytical processes in R, with immediate deployment to production environments. R users can operationalize entire R scripts in production applications, thereby eliminating porting of R code to other languages or reinventing code to integrate R results into existing applications. Oracle R Enterprise allows users to seamlessly leverage Oracle Database as a high performance computing (HPC) environment for R scripts, providing data parallelism and resources management.


Wednesday Oct 02, 2013

Managing Memory Limits and Configuring Exadata for Embedded R Execution

An R engine can consume significant memory resources in the course of running R scripts. R users who work with Oracle R Enterprise Embedded R Execution on sizable data, especially application designers and database administrators (DBAs), have a vested interest in understanding and controlling the memory demands of R script execution to help ensure that sufficient memory resources are available for both their application and Oracle Database. ORE Embedded R Execution enables running R scripts managed by Oracle Database, both through R and SQL APIs. The SQL API enables seamless integration with database-based applications, data-parallel and task-parallel R script execution, and ease of production deployment.

To provide greater control over R memory consumption, Oracle R Enterprise provides a privileged SQL function for configuring a database server with R memory limits. In this blog post, we discuss R memory usage and garbage collection, and how this SQL function can be used to limit the amount of memory consumed by individual R engines started as part of ORE’s embedded R execution framework. We follow with an example involving memory limit calculations on Exadata and some recommendations for DBAs to consider when configuring Exadata for embedded R execution. Such calculations and configuration settings are applicable to non-Exadata (single instance or custom RAC) environments as well. At the end, there is a “tip of the day” for R memory management.

Garbage Collection as a concept

For those familiar with languages like C, memory is explicitly managed by the programmer through invocations of functions to allocate and free memory (malloc, calloc, free). Failing to free memory when finished with it results in “memory leaks” that can cause a process to consume (or exhaust) memory unnecessarily, often resulting in a program or system crash.

To alleviate programmers from this burden, languages like R and Java rely on garbage collection. “Garbage” is memory that is no longer being used, i.e., no longer referenceable, within your program. With garbage collection, programmers avoid dealing with memory management. The underlying system determines what memory is used or available, and frees memory periodically. Garbage collection, however, is not a panacea. Garbage collection can take time to process, e.g., on the order of seconds, which can make response time for certain functions unpredictable – although modern garbage collection mechanisms have largely mitigated this drawback. In addition, when garbage collection occurs is essentially non-deterministic, depending on heuristics set up by the language implementation. This means that memory may be retained longer than necessary.

Memory in R

In R, memory can be characterized along two dimensions: memory allocated for vectors and arrays (referred to as Vcells), and memory allocated for objects such as lists (referred to as “cons” cells or Ncells). When invoking R’s garbage collection function, gc(), you’ll see results like these:

The function gc() returns a matrix with rows Ncells, which corresponds to the cons cells, and Vcells, which corresponds to vector heap memory. The Ncells are 56 bytes/cell (49.2*1024*1024/.920477) on a 64-bit machine, and Vcells are ~8 bytes/cell (22.6*1024*1024/2.956944). The “used” column indicates the number of cells allocated, along with their corresponding megabytes. The column “gc trigger” indicates at what point garbage collection will kick in. The column “max used” indicates the maximum space used since the last call to gc(reset=TRUE) or since R started if gc(reset=TRUE) wasn’t invoked.

As an example of affecting Ncells, consider initializing a list as a sequence of 100K numbers. We see that roughly 5.4 MB of RAM is consumed for the 100K cells.

For Vcells, we create a vector of 1M elements. This consumes roughly 3.8 MB of RAM for the 1M cells. R optimizes for integers.

The same test with floats consumes 7.6 MB of RAM for the 1M cells of floats.
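The measurements above can be reproduced with a short experiment along these lines (a minimal sketch; exact megabyte figures vary by R version and platform):


before <- gc(reset = TRUE)       # snapshot of Ncells/Vcells usage
lst   <- as.list(1:100000)       # ~100K small objects counted in Ncells
v.int <- rep(1L, 1000000)        # 1M integer cells in Vcells (~4 bytes each)
v.dbl <- rep(1.0, 1000000)       # 1M double cells in Vcells (~8 bytes each)
after <- gc()
after[, 1:2] - before[, 1:2]     # growth in cells and Mb, by Ncells/Vcells row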

How does R’s garbage collector use VSize and NSize?

We’ll discuss VSize, as NSize is analogous. The garbage collector recovers memory that is no longer in use, determining when to perform garbage collection and how much memory to recover. Looking at heap memory for Vcells, as depicted in the figure below, there are a few key points: Min_VSize, VSizeInUse, R VSize, and Max_VSize. The R VSize serves as the gc() trigger. The Min_VSize and Max_VSize are the specified lower and upper memory limits. Min_VSize is the minimal size for the vector heap as well as its initial value. From there, R grows or shrinks the vector heap depending on memory demands. However, it doesn’t exceed the Max_VSize limit nor go below the Min_VSize limit. In the figure, VSizeInUse reflects the memory currently used by R objects. R_VSize is how much memory can be requested without triggering gc(). As you would expect: Min_VSize <= R_VSize <= Max_VSize and VSizeInUse < R_VSize.


Limiting memory on the database server R engine

Oracle R Enterprise provides the SQL function sys.rqconfigset to set memory limits. Use of this function requires the sys privilege and the setting is applied only to embedded R engines. Consider the following examples:

sys.rqconfigset('MIN_VSIZE', '10M') -- min heap 10MB, default 32MB
sys.rqconfigset('MAX_VSIZE', '100M') -- max heap 100MB, default 4GB
sys.rqconfigset('MIN_NSIZE', '500K') -- min number cons cells 500x1024, default 1M
sys.rqconfigset('MAX_NSIZE', '2M') -- max number cons cells 2M, default 20M

Note that either numeric or string values can be provided to sys.rqconfigset. Default constants are defined as follows:


#define RQET_DEF_MINVSZ 33554432 /* RQER DEFault MIN_VSiZe 32Mb */
#define RQET_DEF_MAXVSZ 4294967296 /* RQER DEFault MAX_VSiZe 4Gb */
#define RQET_DEF_MINNSZ 1048576 /* RQER DEFault MIN_NSiZe 1M */
#define RQET_DEF_MAXNSZ 20971520 /* RQER DEFault MAX_NSiZe 20M */

Getting memory settings and usage through an embedded R engine

To obtain the current set of values, you can query the table sys.rq_config when connected to the database server:

select name, value from sys.rq_config;

This can be done from within an embedded R function invocation using, for example:

getMemorySettings <- function() {
  con <- dbConnect(Extproc())
  rs <- dbSendQuery(con, "select name, value from sys.rq_config")
  dat <- fetch(rs)
  dat
}
ore.doEval(getMemorySettings, ore.connect=TRUE)

To obtain the current memory usage within an individual embedded R engine, instrumenting your embedded R function with gc() and returning the results of gc() will provide this insight:

getMemoryUse <- function() {
  gc.dat <- gc()
  list(pid=Sys.getpid(), gc.dat=gc.dat)
}
ore.doEval(getMemoryUse)

Note that the result from an embedded R call will also include the memory limits set for the R engines, shown below in column 6 “limit (Mb)” of the result. This occurs whenever memory limits are in place for an R engine. Here, we’ve included getting the process id of the R engine.

An example of computing memory limits

Consider you have an Exadata X2-2 that has 1152 GB RAM (~1.2 TB) and your DBA allocates you a maximum of 60 GB RAM for parallel R engines per Exadata node. If we set the degree of parallelism at 32, to enable 32 R engines to execute concurrently, this allows 1.875 GB RAM / R engine. If we allocate 2/3 of this for Vcells, we would allocate ~1.25 GB for the MAX_VSIZE. The remaining 1/3, or 625 MB, would translate into 11.6M cells for MAX_NSIZE.

60 GB allocated to R engines per Exadata node

DOP=32

60GB / 32 R engines = 1.875 GB / R Engine

~2/3 for Vcells = 1.25 GB; 1.25 GB / 8 Bytes/Cell = 156.25M Cells

~1/3 for Ncells = 625 MB; 625 MB / 54 Bytes/Cell = 11.6M Cells

sys.rqconfigset('MAX_VSIZE', '1250M')
sys.rqconfigset('MAX_NSIZE', '11600K')

While this example focuses on parallel execution, such as for ore.groupApply, ore.rowApply, and ore.indexApply (or SQL rqRowEval and “rqGroupEval”), the same type of analysis applies to non-parallel embedded R functions, like ore.doEval and ore.tableApply (or SQL rqEval and rqTableEval).

Consider an example that builds a randomForest model using the ore.doEval function. We can compute the amount of RAM consumed by the function by invoking gc() at the beginning and end of the function and subtracting the “max used (Mb)” columns.

The result is that 1.4 MB were consumed for Ncells and 11.1 MB for Vcells. This can similarly be done for ore.indexApply to see the amount of RAM consumed by each embedded R function execution and to sum up the actual usage for each of the embedded R engines (assuming they run fully concurrently).
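The instrumentation itself can be sketched as follows, assuming the randomForest package is installed on the database server R engine and using the iris data purely for illustration:


res <- ore.doEval(function() {
  library(randomForest)
  gc.start <- gc(reset = TRUE)                   # snapshot before the model build
  mod <- randomForest(Species ~ ., data = iris)
  gc.end <- gc()                                 # snapshot after the model build
  list(gc.start = gc.start, gc.end = gc.end)
})
ore.pull(res)   # subtract the "max used (Mb)" columns of the two matrices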

Generating such numbers on real data gives users a sense of how much memory embedded R jobs may require.

For DBAs

When configuring a database on Exadata for parallel R engines, consider the following options. Below, we contrast the case where any one embedded R function executes quickly, e.g., in tens of seconds, and there are many such executions, with the case of fewer parallel R engines running longer-executing functions. Note that these settings must be considered in the context of other Exadata uses:

· Set parallel_degree_policy to MANUAL. This allows ORE to choose when to apply parallelism, as opposed to setting it to AUTO which allows Oracle Database to decide.

· Set parallel_min_servers to the number of parallel slave processes to be started when the database instances start, e.g., 64, which is the number of parallel slave processes per Exadata node. This avoids incurring the time required to start these processes as needed to service R engines, and is particularly important when individual embedded R function execution time is short, e.g., 10s of seconds. If embedded R function execution time is long, the percentage of time for starting up the parallel slave will not dominate the overall execution time.

· Set parallel_max_servers to the maximum number of parallel slave processes that should be allowed per Exadata node, e.g., 128. This ensures that no more than parallel_max_servers will be active at one time, and in turn corresponds to the maximum number of R engines that can be active at one time.

· To avoid overloading the CPUs if the parallel_max_servers limit is reached, set the hidden parameter _parallel_statement_queuing to TRUE. This parameter is turned off by setting parallel_degree_policy to MANUAL. The _parallel_statement_queuing parameter allows for queuing of parallel requests when they exceed the parallel_server_target, which should be set to a value between parallel_min_servers and parallel_max_servers, e.g., 96. Once the parallel_server_target is reached, an embedded R execution will be allowed to execute in parallel using the remaining available parallel servers. If none are available, parallel requests will be queued. This ensures that parallel requests will be run in parallel, as opposed to being forced to serial execution, and be able to take advantage of parallel slaves as they become available. This can dramatically improve overall embedded R execution completion time. Note that parallel_max_servers cannot be changed during database operation, but the parallel_server_target can be to tune Exadata performance. Note that queuing effectively takes no CPU resources.

· To minimize RAC or cluster overhead for fast-executing individual embedded R functions, set parallel_force_local to TRUE to keep all parallel servers allocated and running on the same database server node. With this setting, an embedded R execution started with DOP 32 will have all 32 R engines run on the same Exadata node. A new embedded R execution, also with DOP 32, may be started on a different node. If the embedded R functions are long-running, the setup time is proportionately small, so spreading the I/O over multiple nodes will not adversely impact overall performance. Having parallel slaves span multiple Exadata nodes results in communication and handshaking across nodes, which requires more resources. If the embedded R functions are fast, this overhead can adversely impact overall performance. When all parallel slaves are local, fewer resources are used.

· Where applicable, set application tables and their indexes to DOP 1 to reinforce the ability of ORE to determine when to use parallelism and not be overridden by table or index settings of DEFAULT or a specific degree of parallelism.

R memory management tip

There are many optimizations for making more efficient use of memory. To end this post, here is a tip that can reduce memory consumption significantly and avoid unnecessary replication of data.

If you know the size of your result in advance, pre-allocate the memory required, whether a vector, list, or matrix, as opposed to building up the result incrementally such as using cbind for adding columns to a matrix or data.frame. For example:

num.rows <- 1000
num.cols <- 2000
myFunction <- function(col) { col:(num.rows + col - 1) }   # produces a vector of values
myMatrix <- matrix(NA, num.rows, num.cols)                 # pre-allocate the required memory

for(col in 1:num.cols) {
  myMatrix[, col] <- myFunction(col)
}

A note of thanks to Qin Wang and Martin Farber for their input on this blog post.

Monday Aug 12, 2013

Quick! Swap those models – I’ve got a better one

(or, Why in-database analytics enables real-time scoring and can make model deployment easy)

Refreshing predictive models is a standard part of the process when deploying advanced analytics solutions in production environments. In addition, many predictive models need to be used in a real-time setting for scoring customers, whether that is for fraud detection, predicting churn, or recommending next likely product. One of the problems with using vanilla R is that real-time scoring often requires starting an R engine for each score, or enabling some ad hoc mechanism for real-time scoring, which can increase application complexity.

In this blog post, we look at how Oracle R Enterprise enables:

  • Building models in-database on database data from R
  • Renaming in-database models for use by a stored procedure
  • Invoking the stored procedure to make predictions from SQL
  • Building a second model and swapping it with the original
  • Moving a model from development environment to production environment

Building the model in R

So let’s start with building a generalized linear model (GLM) in Oracle Database. For illustration purposes, we’ll use the longley data set from R – a macroeconomic data set that provides a well-known example for a highly collinear regression. In R, type ?longley for the full description of the data set.

Using the following R script, we create the database table LONGLEY_TABLE from the longley data.frame and then build the model using the in-database GLM algorithm, predicting the number of people employed from the remaining variables. We then view the model details using summary, along with fit.name, the auto-generated name of the corresponding Oracle Data Mining (ODM) model in the database. Finally, we use the model to predict on the original data, just to confirm that the model works as expected.

ore.connect("rquser","my_sid","my_host","rquser_pswd",1521, all=TRUE)

ore.create(longley, table="LONGLEY_TABLE")

mod.glm <- ore.odmGLM(Employed ~ ., data = LONGLEY_TABLE)

summary(mod.glm)

mod.glm$fit.name

predict(fit1, LONGLEY_TABLE)

While a user can refer to the ODM model by its name in fit.name, for example, when working with it in SQL or the Oracle Data Miner GUI, this may not be convenient since it will look something like ORE$23_123. In addition, unless the R object mod.glm is saved in an ORE datastore (an ORE feature corresponding to R’s save and load functions using ore.save and ore.load, but in the database), at the end of the session, this object and corresponding ODM model will be removed.
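If you do want the R model object to survive the session, a minimal sketch using those datastore functions is:


ore.save(mod.glm, name="myGLMmodelDatastore")   # persist the R object in the database
# In a later session, after ore.connect(...):
# ore.load("myGLMmodelDatastore")               # restores mod.glm into the workspace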

In addition, we’ll want a common name for the model so that we can swap a new model in for an existing one without having to change higher-level code. To rename an ODM model, we can use the PL/SQL statement shown here, invoked from R using ore.exec. Of course, this could also be done from any SQL interface, e.g., SQL*Plus, SQL Developer, etc., by supplying the explicit SQL.

ore.exec(paste("BEGIN DBMS_DATA_MINING.RENAME_MODEL(model_name => '", mod.glm$fit.name, "', new_model_name => 'MY_GLM_MODEL'); END;",sep=""))

So now, we have the ODM model named MY_GLM_MODEL. Keep in mind, after the model is renamed, the original model no longer exists and the R object is invalid – at least from the standpoint of being able to use it in functions like summary or predict.

Scoring data from a SQL procedure

As noted above, users can score in batch from R, and they can also score in batch from SQL. But here we’re interested in real-time scoring from the database using the in-database model. This can be done directly in a SQL query by providing the input data in the query itself. It eliminates having to write data to a database table and then doing a lookup to retrieve the data for scoring – making it real-time.

The following SQL does just this. The WITH clause defines the input data, selecting from dual. The SELECT clause uses the model MY_GLM_MODEL to make the prediction using the data defined by data_in.

WITH data_in as (select 2013    "Year",
                        234.289 "GNP",
                        235.6   "Unemployed",
                        107.608 "Population",
                        159     "Armed.Forces",
                        83      "GNP.deflator",
                        60.323  "Employed"
                 from dual)
SELECT PREDICTION(MY_GLM_MODEL USING *) "PRED"
FROM data_in

While we could invoke the SQL directly, having a stored procedure in the database can give us more flexibility. Here’s the stored procedure version in PL/SQL.

CREATE OR REPLACE
PROCEDURE MY_SCORING_PROC (year_in         IN NUMBER,
                           gnp_in          IN BINARY_DOUBLE,
                           unemployed_in   IN BINARY_DOUBLE,
                           population_in   IN BINARY_DOUBLE,
                           armed_forces_in IN BINARY_DOUBLE,
                           gnp_deflator_in IN BINARY_DOUBLE,
                           employed_in     IN BINARY_DOUBLE,
                           pred_out        OUT NUMBER) AS
BEGIN
  WITH data_in as (select year_in         "Year",
                          gnp_in          "GNP",
                          unemployed_in   "Unemployed",
                          population_in   "Population",
                          armed_forces_in "Armed.Forces",
                          gnp_deflator_in "GNP.deflator",
                          employed_in     "Employed"
                   from dual),
       model_score as (SELECT PREDICTION(MY_GLM_MODEL USING *) "PRED"
                       FROM data_in)
  select PRED into pred_out from model_score;
EXCEPTION
  WHEN OTHERS THEN
    raise_application_error(-20001,
      'An error was encountered - '||SQLCODE||' -ERROR- '||SQLERRM);
END;
/

To invoke the stored procedure, we can do the following:

SET SERVEROUTPUT ON
DECLARE
  score NUMBER;
BEGIN
  MY_SCORING_PROC(1947, 234.289, 235.6, 107.608, 159, 83, 60.323, score);
  DBMS_OUTPUT.PUT_LINE('Score: ' || score);
END;
/

Refreshing the model from R

Let’s say the model above has been in production for a while, but has become stale – that is, it’s not predicting as well as it used to due to changing patterns in the data. To refresh it, we build a new model. For illustration purposes, we’re going to use the same data (so an identical model will be produced, except for its name).

mod.glm2 <- ore.odmGLM(Employed ~ ., data = LONGLEY_TABLE)
summary(mod.glm2)
mod.glm2$fit.name

To swap the models, we delete the existing model called MY_GLM_MODEL and rename the new model to MY_GLM_MODEL. Again, we can do this from R using PL/SQL and through ore.exec.

ore.exec(paste("BEGIN DBMS_DATA_MINING.DROP_MODEL('MY_GLM_MODEL'); DBMS_DATA_MINING.RENAME_MODEL(model_name => '",mod.glm2$fit.name,"', new_model_name => 'MY_GLM_MODEL'); END;",sep=""))

We can now re-execute the stored procedure and the new model will be used.

SET SERVEROUTPUT ON
DECLARE
  score NUMBER;
BEGIN
  MY_SCORING_PROC(1947, 234.289, 235.6, 107.608, 159, 83, 60.323, score);
  DBMS_OUTPUT.PUT_LINE('Score: ' || score);
END;
/

You may have noticed that this approach can introduce a brief period where no model is accessible - between the DROP_MODEL and RENAME_MODEL. A better approach involves the use of SYNONYMs. In general, synonyms provide both data independence and location transparency, being an alternative name for a table, view, sequence, procedure, stored function, and other database objects. We can use this in conjunction with our stored procedure above. First, create a synonym for the original scoring procedure.

CREATE or REPLACE SYNONYM MY_SCORING_PROC_SYM for MY_SCORING_PROC;

When invoking the procedure from your application, use the name MY_SCORING_PROC_SYM in place of MY_SCORING_PROC. Instead of renaming the model, create a second stored procedure with a different name, e.g., MY_SCORING_PROC_2, that references the name of the newly built model internally.

When it is time to swap the models, invoke the following to repoint the synonym to the new procedure.

CREATE or REPLACE SYNONYM MY_SCORING_PROC_SYM for MY_SCORING_PROC_2;

Another benefit of this approach is that replaced models can still be kept should you need to revert to a previous version. 
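If you prefer to drive the swap from R, the same statement can be issued through ore.exec (a sketch, assuming both procedures already exist):


# Point the synonym at the procedure that references the new model
ore.exec("CREATE OR REPLACE SYNONYM MY_SCORING_PROC_SYM FOR MY_SCORING_PROC_2")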

Moving an in-database model from one machine to another

In a production deployment, there’s often the need to move a model from the development environment to the production environment. For example, the data scientist may have built the model in a development / sandbox environment and now needs to move it to the production machine(s).

In-database models provide functions EXPORT_MODEL and IMPORT_MODEL as part of the DBMS_DATA_MINING SQL package. See the 11g documentation for details. These calls can be invoked from R, but we’ll show this from SQL just to keep the flow easier to see.

From a SQL prompt, e.g., SQL*Plus, connect to the schema that contains the model. Create a DIRECTORY object where the exported model file will be stored. List the model names available to this schema, which should include MY_GLM_MODEL. Then, export the model.

CONNECT rquser/rquser_psw

CREATE OR REPLACE DIRECTORY rquserdir AS '/home/MY_DIRECTORY';

-- list the models available to rquser
SELECT name FROM dm_user_models;

-- export the model called MY_GLM_MODEL to a dump file in the same schema
EXECUTE DBMS_DATA_MINING.EXPORT_MODEL ('MY_GLM_MODEL_out',
                                       'RQUSERDIR',
                                       'name = ''MY_GLM_MODEL''');

At this point, you have the ODM model named MY_GLM_MODEL in the file MY_GLM_MODEL_out01.dmp stored in the file system under /home/MY_DIRECTORY. This file can now be moved to the production environment and the model loaded into the target schema.

Log into the new schema and invoke IMPORT_MODEL.

CONNECT rquser2/rquser2_psw

EXECUTE DBMS_DATA_MINING.IMPORT_MODEL ('MY_GLM_MODEL_out01.dmp',
                                       'RQUSERDIR', 'name = ''MY_GLM_MODEL''',
                                       'IMPORT', NULL, 'glm_imp_job', 'rquser:rquser2');

Summary

In this post, we’ve highlighted how to build an in-database model in R and use it for scoring through SQL in a production, real-time setting. In addition, we showed how it is possible to swap, or refresh, models in a way that can leave your application code untouched. Finally, we highlighted database functionality that allows you to move in-database models from one database environment to another.

Users should note that all the functionality shown involving SQL, or being invoked through ore.exec, can be easily wrapped in R functions that could ultimately become part of ORE. If any of our readers are interested in giving this a try, we can post your solution here to share with the R and Oracle community. For the truly adventurous, check out the Oracle Database package DBMS_FILE_TRANSFER to consider wrapping the ability to move model files from R as well.

Thursday Jul 18, 2013

Simple and Advanced Time series with Oracle R Enterprise

This guest post from Marcos Arancibia describes how to use Oracle R Enterprise to analyze Time Series data.

In this article, we give an overview of how to use Time Series Analysis against data stored in Oracle Database, using the Embedded R Execution capability to run time series computations on the Oracle Database server instead of processing them at the client. We also show how to retrieve the final series or forecasts to the client for plotting, forecasting, and diagnostics.

One key thing to keep in mind when using Time Series techniques with data that is stored in Oracle Database is the order of the rows, or records. Because of the parallel capabilities of Oracle Database, when queried for records, one might end up receiving records out of order if an option for order is not specified.

Simple Example using Stock Data

Let’s start with a simple Time Series example. First we will need to connect to our Oracle Database using ORE. Then, using the package TTR, we will access Oracle Stock data from YahooData service, from January 1, 2008 to January 1, 2013 and push it to the database.

# Load the ORE library and connect to Oracle Database
library(ORE)
ore.connect("myuser", "mysid", "myserver", "mypass", port=1521, all=TRUE)

library(TTR)

# Get data in XTS format
xts.orcl <- getYahooData("ORCL", 20080101, 20130101)

# Convert it to a data frame and make the date the index
df.orcl <- data.frame(xts.orcl)
df.orcl$date <- (data.frame(date=index(xts.orcl))$date)

# Create/overwrite data in Oracle Database in a table called ORCLSTOCK
ore.drop(table="ORCLSTOCK")
ore.create(df.orcl, table="ORCLSTOCK")

# IMPORTANT STEP!!!
# Ensure indexing is kept by date
rownames(ORCLSTOCK) <- ORCLSTOCK$date

# Ensure the data is in the database
ore.ls()

# Review column names, data statistics, and a sample of the data
names(ORCLSTOCK)
[1] "Open"        "High"        "Low"         "Close"       "Volume"
[6] "Unadj.Close" "Div"         "Split"       "Adj.Div"     "date"

summary(ORCLSTOCK$Close)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.36   20.53   24.22   24.79   29.70   35.73

head(ORCLSTOCK)
                        Open     High      Low    Close   Volume
2008-01-02 01:00:00 21.74414 22.00449 21.58022 21.68629 44360179
2008-01-03 01:00:00 21.62843 22.28413 21.62843 22.28413 43600532
2008-01-04 01:00:00 21.95628 22.06235 21.01130 21.24272 46391263
2008-01-07 01:00:00 21.17523 21.67664 21.01130 21.45486 41527032
2008-01-08 01:00:00 21.44522 21.52236 20.38453 20.39417 45155398
2008-01-09 01:00:00 20.57738 20.91487 20.39417 20.83773 49750304
                    Unadj.Close Div Split Adj.Div       date
2008-01-02 01:00:00       22.49  NA    NA      NA 2008-01-02
2008-01-03 01:00:00       23.11  NA    NA      NA 2008-01-03
2008-01-04 01:00:00       22.03  NA    NA      NA 2008-01-04
2008-01-07 01:00:00       22.25  NA    NA      NA 2008-01-07
2008-01-08 01:00:00       21.15  NA    NA      NA 2008-01-08
2008-01-09 01:00:00       21.61  NA    NA      NA 2008-01-09

Pull data from the database for a simple plot

# Pull data from Oracle Database (only the necessary columns)
orcl <- ore.pull(ORCLSTOCK[, c("date", "Close", "Open", "Low", "High")])

# Simple plot with base libraries - Closing
plot(orcl$date, orcl$Close, type="l", col="red", xlab="Date", ylab="US$",
     main="Base plot: Daily ORACLE Stock Closing points")

# Simple plot with base libraries - Other Series
plot(orcl$date, orcl$Open, type="l", col="blue", xlab="Date", ylab="US$",
     main="Base plot: Daily ORACLE Stock: Open/High/Low points")
lines(orcl$date, orcl$High, col="green")
lines(orcl$date, orcl$Low, col="orange")
legend("topleft", c("Opening", "High", "Low"),
       col=c("blue", "green", "orange"), lwd=2, title="Series", bty="n")

A different plot option, using the package xts

library(xts)

# Pull data from Oracle Database (only the necessary columns)
orcl <- ore.pull(ORCLSTOCK[, c("date", "Close", "Open", "Low", "High")])

# Convert data to Time Series format
orcl.xts <- as.xts(orcl, order.by=orcl$date, dateFormat="POSIXct")

# Plot original series
plot(orcl.xts$Close, major.ticks='months', minor.ticks=FALSE,
     main="Time Series plot: Daily ORACLE Stock Closing points", col="red")

Simple Time Series: Moving Average Smoothing

We might be tempted to call functions like the Smoothing Moving Average from open-source CRAN packages against Oracle Database Tables, but those packages do not know what to do with an “ore.frame”. For that process to work correctly, we can either load the data locally or send the process for remote execution on the Database Server by using Embedded R Execution.

We will also explore the built-in Moving Average process from ore.rollmean() as a third alternative.

ALTERNATIVE 1 – The first example pulls the data from Oracle Database into a ts (time series) object for a client-side smoothing process.

library(TTR)

# Pull part of the database table into a local data.frame
sm.orcl <- ore.pull(ORCLSTOCK[, c("date", "Close")])

# Convert the "Close" attribute into a Time Series (ts)
ts.orcl <- ts(sm.orcl$Close)

# Use SMA - Smoothing Moving Average algorithm from package TTR
ts.sm.orcl <- ts(SMA(ts.orcl, n=30), frequency=365, start=c(2008, 1))

# Plot both series together
plot(sm.orcl$date, sm.orcl$Close, type="l", col="red", xlab="Date", ylab="US$",
     main="ORCL Stock Close CLIENT-side Smoothed Series n=30 days")
lines(sm.orcl$date, ts.sm.orcl, col="blue")
legend("topleft", c("Closing", "MA(30) of Closing"),
       col=c("red", "blue"), lwd=2, title="Series", bty="n")

ALTERNATIVE 2 – In this alternative, we run the smoothing moving average on the server side, without bringing all the data to the client. Only the result is brought locally for plotting. Remember that the TTR package has to be installed on the database server in order to be called there.

# Server execution call using ore.tableApply
# Result is an ore.list that remains in the database until needed
sv.orcl.ma30 <-
  ore.tableApply(ORCLSTOCK[, c("date", "Close")], ore.connect = TRUE,
    function(dat) {
      library(TTR)
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")), ]
      list(res1 <- ts(ordered$Close, frequency=365, start=c(2008, 1)),
           res2 <- ts(SMA(res1, n=30), frequency=365, start=c(2008, 1)),
           res3 <- ordered$date)
    }
  );

# Bring the results locally for plotting
local.orcl.ma30 <- ore.pull(sv.orcl.ma30)

# Plot two series side by side
# (the third element of the list is the date)
plot(local.orcl.ma30[[3]], local.orcl.ma30[[1]], type="l",
     col="red", xlab="Date", ylab="US$",
     main="ORCL Stock Close SERVER-side Smoothed Series n=30 days")

# Add smoothed series
lines(local.orcl.ma30[[3]], local.orcl.ma30[[2]], col="blue", type="l")

# Add legend
legend("topleft", c("Closing", "Server MA(30) of Closing"),
       col=c("red", "blue"), lwd=2, title="Series", bty="n")

ALTERNATIVE 3 – In this alternative, we compute the moving average in-database using the native ORE functions, without bringing the data to the client. Only the result is brought locally for plotting.

Just one line of code is needed to compute the moving average in-database and create a new virtual column in Oracle Database. We will call this new column rollmean30.

We will use the function ore.rollmean(). The option align="right" makes the MA look at only the past k days (30 in this case), or less, depending on the point in time. This creates a small difference between this method and the previous methods in the beginning of the series, since ore.rollmean() can actually calculate the first sets of days using smaller sets of data available, while other methods discard this data.
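
As a toy illustration of that start-of-series difference (client-side only, with made-up numbers rather than ORCLSTOCK), compare a trailing simple moving average, which returns NA until a full window is available, with a running mean over the points seen so far, which is similar in spirit to what ore.rollmean() returns for the first few rows:

library(TTR)
x <- c(10, 12, 11, 13, 14)
SMA(x, n = 3)              # NA NA 11.00000 12.00000 12.66667
cumsum(x) / seq_along(x)   # 10.0 11.0 11.0 11.5 12.0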

# Moving Average done directly in Oracle Database

ORCLSTOCK$rollmean30 <- ore.rollmean(ORCLSTOCK$Close, k = 30, align="right")

# Check that new variable is in the database

head(ORCLSTOCK)

>head(ORCLSTOCK)

Open High Low Close Volume

2008-01-02 01:00:00 21.74414 22.00449 21.58022 21.68629 44360179

2008-01-03 01:00:00 21.62843 22.28413 21.62843 22.28413 43600532

2008-01-04 01:00:00 21.95628 22.06235 21.01130 21.24272 46391263

2008-01-07 01:00:00 21.17523 21.67664 21.01130 21.45486 41527032

2008-01-08 01:00:00 21.44522 21.52236 20.38453 20.39417 45155398

2008-01-09 01:00:00 20.57738 20.91487 20.39417 20.83773 49750304

Unadj.Close Div Split Adj.Div date rollmean30

2008-01-02 01:00:00 22.49 NA NA NA 2008-01-02 21.68629

2008-01-03 01:00:00 23.11 NA NA NA 2008-01-03 21.98521

2008-01-04 01:00:00 22.03 NA NA NA 2008-01-04 21.73771

2008-01-07 01:00:00 22.25 NA NA NA 2008-01-07 21.66700

2008-01-08 01:00:00 21.15 NA NA NA 2008-01-08 21.41243

2008-01-09 01:00:00 21.61 NA NA NA 2008-01-09 21.31665

# Get results locally for plotting

local.orcl <- ore.pull(ORCLSTOCK[,c("date","Close", "rollmean30")])

sub.orcl <- subset(local.orcl,local.orcl$date> as.Date("2011-12-16"))

# Plot the two series side by side

# First plot original series

plot(local.orcl$date, local.orcl$Close,type="l", col="red",xlab="Date",ylab="US$",

main="ORCL Stock Close ORE Computation of Smoothed Series n=30 days")

# Add smoothed series

lines(local.orcl$date,local.orcl$rollmean30,col="blue",type="l")

# Add legend

legend("topleft", c("Closing","ORE MA(30) of Closing"),

col=c("red","blue"),lwd=2,title = "Series",bty="n")

Seasonal Decomposition for Time Series Diagnostics

Now that we have seen how to execute these processes using Embedded R, we can apply other methodologies commonly used in Time Series analysis with the same pattern of server-side computation and local plotting.

It is typical for an analyst to try to understand a Time Series better by looking at basic diagnostics like the Seasonal Decomposition of Time Series by Loess. This can be done using the stl() function in the following process:

# Server execution

sv.orcl.dcom <-

ore.tableApply(ORCLSTOCK[,c("date","Close")],ore.connect = TRUE,

function(dat) {

ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]

ts.orcl <- ts(ordered$Close,frequency=365, start=c(2008,1))

res <- stl(ts.orcl,s.window="periodic")

}

);

# Get result for plotting

local.orcl.dcom <- ore.pull(sv.orcl.dcom)

plot(local.orcl.dcom, main="Server-side Decomposition of ORCL Time-Series",col="blue")

Another typical set of diagnostic charts includes Autocorrelation and Partial Autocorrelation function plots. These can be achieved by using the acf() command with the proper options in Embedded R Execution, so computations happen at the Oracle Database server:

# Server-side ACF and PACF computation

# Use function acf() and save result as a list

sv.orcl.acf <-

ore.tableApply(ORCLSTOCK[,c("date","Close")],ore.connect=TRUE,

function(dat){

ts.orcl <- ts(dat$Close,frequency=365, start=c(2008,1))

list(res1 <- acf(ts.orcl,lag.max=120,type="correlation"),res2 <- acf(ts.orcl,lag.max=30, type="partial"))

}

);

# Get results for plotting

# ACF and PACF as members of the list pulled

local.orcl.acf <- ore.pull(sv.orcl.acf)

plot(local.orcl.acf[[1]],main="Server-side ACF Analysis for Series ORCL",col="blue",lwd=2)

plot(local.orcl.acf[[2]],main="Server-side PACF Analysis for Series ORCL",col="blue",lwd=5)

Simple Exponential Smoothing

Using the popular package “forecast”, we will use the ses() function to forecast a 90-day horizon (h=90) into the future, with a fixed smoothing parameter (alpha=0.1) and simple initialization. The package forecast needs to be installed on the Oracle Database server R engine.

Then, we will bring the resulting model locally for plotting. Remember to load the “forecast” library locally as well, so that the ses() output is interpreted correctly when it is brought to the client.

# Execute ses() call in the server

sv.orcl.ses <-

ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,

function(dat) {

library(forecast)

ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]

ts.orcl <- ts(ordered$Close,frequency=365, start=c(2008,1) )

res <- ses(ts.orcl, h=90, alpha=0.1, initial="simple")

}

);

# Get SES result locally for plotting

# Since remote object contains a SES model from package forecast,

# load package locally as well

library(forecast)

plot.orcl.ses <- ore.pull(sv.orcl.ses)

plot(plot.orcl.ses,col="blue",fcol="red",

main="ORCL with Server-side SES - Simple Exponential Smoothing Forecast")

Holt Exponential Smoothing

Using the popular package “forecast”, we will use the holt() function to forecast a 90-day horizon (h=90) into the future, requesting 80% and 95% confidence intervals. Again, the package “forecast” needs to be installed on the Oracle Database server R engine.

Then, we will bring the resulting model locally for plotting. Remember to load the forecast library locally as well, so that the holt() output is interpreted correctly when it is brought to the client.

# Execute holt() call in the server

sv.orcl.ets <-

ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,

function(dat) {

library(forecast)

ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]

ts.orcl <- ts(ordered$Close,frequency=365, start=c(2008,1))

res <- holt(ts.orcl, h=90, level=c(80,95), initial="optimal")

}

);

# Get resulting model from the server

# Since remote object contains a Holt Exponential Smoothing

# model from package forecast, load package locally as well

library(forecast)

local.orcl.ets <- ore.pull(sv.orcl.ets)

plot(local.orcl.ets,col="blue",fcol="red",

main="ORCL Original Series Stock Close with Server-side Holt Forecast")

ARIMA – Auto-Regressive Integrated Moving Average

There are at least two options for fitting an ARIMA model to a Time Series. One option is to use the package “forecast”, which provides automatic ARIMA fitting (auto.arima) to search for the best parameters for the series.

For more advanced users, the arima() function in the “stats” package itself allows for choosing the model parameters.

# ARIMA models on the server using auto.arima() from package forecast

arimaModel <-

ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,

FUN = function(dat){

# load forecast library to use auto.arima

library(forecast)

# sort the table into a temp file by date

ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]

# convert column into a Time Series

# format ts(...) and request creation of an automatic

# ARIMA model auto.arima(...)

res <- auto.arima(ts(ordered$Close,frequency=365, start=c(2008,1)),

stepwise=TRUE, seasonal=TRUE)

})

# Alternative using the arima() from package “stats”.

arimaModel <-

ore.tableApply(ORCLSTOCK[,c("date","Close")],ore.connect=TRUE,

FUN = function(dat){

# sort table into a temp file by date

ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]

# convert column into a Time Series

# format ts(...) and request creation of a specific

# ARIMA model using arima(), for example an ARIMA(2,1,2)

res <- arima(ts(ordered$Close,frequency=365, start=c(2008,1)),

order = c(2,1,2))

})

# Load forecast package locally to use the model

# for plotting and producing forecasts

library(forecast)

# Show remote resulting Time Series model

>arimaModel

Series: ts(ordered$Close, frequency = 365, start = c(2008, 1))

ARIMA(2,1,0)

Coefficients:

ar1 ar2

-0.0935 -0.0192

s.e. 0.0282 0.0282

sigma^2 estimated as 0.2323: log likelihood=-866.77

AIC=1739.55 AICc=1739.57 BIC=1754.96

# Get remote model using ore.pull for local prediction and plotting

local.arimaModel <- ore.pull(arimaModel)

# Generate forecasts for the next 15 days

fore.arimaModel <- forecast(local.arimaModel, h=15)

# Use the following option if you need to remove scientific notation of

# numbers that are too large in charts

options(scipen=10)

# Generate the plot of forecasts, including interval of confidence

# Main title is generated automatically indicating the type of model

# chosen by the Auto ARIMA process

plot(fore.arimaModel,type="l", col="blue", xlab="Date",

ylab="Closing value (US$)", cex.axis=0.75, font.lab="serif EUC",

sub="Auto-generated ARIMA for ORCL Stock Closing"

)

# Generate and print forecasted data points plus standard errors

# of the next 15 days

forecasts <- predict(local.arimaModel, n.ahead = 15)

>forecasts

$pred

Time Series:

Start = c(2011, 165)

End = c(2011, 179)

Frequency = 365

[1] 33.29677 33.29317 33.29395 33.29395 33.29393 33.29393 33.29393 33.29393 33.29393 33.29393 33.29393

[12] 33.29393 33.29393 33.29393 33.29393

$se

Time Series:

Start = c(2011, 165)

End = c(2011, 179)

Frequency = 365

[1] 0.4819417 0.6504925 0.7807798 0.8928901 0.9924032 1.0827998 1.1662115 1.2440430 1.3172839 1.3866617

[11] 1.4527300 1.5159216 1.5765824 1.6349941 1.6913898

Monday Jun 10, 2013

Bringing R to the Enterprise - new white paper available

Check out this new white paper entitled "Bringing R to the Enterprise -  A Familiar R Environment with Enterprise-Caliber Performance, Scalability, and Security."

In this white paper, we begin with "Beyond the Laptop" exploring the ability to run R code in the database, working with CRAN packages at the database server, operationalizing R analytics, and leveraging Hadoop from the comfort of the R language and environment.

Excerpt: "Oracle Advanced Analytics and Oracle R Connector for Hadoop combine the advantages of R with the power and scalability of Oracle Database and Hadoop. R programs and libraries can be used in conjunction with these database assets to process large amounts of data in a secure environment. Customers can build statistical models and execute them against local data stores as well as run R commands and scripts against data stored in a secure corporate database."

The white paper continues with three use cases involving Oracle Database and Hadoop: analyzing credit risk, detecting fraud, and preventing customer churn.  The conclusion: providing analytics for the enterprise based on the R environment is here!


Tuesday May 28, 2013

Converting Existing R Scripts to ORE - Getting Started

Oracle R Enterprise provides a comprehensive, database-centric environment for end-to-end analytical processes in R, with immediate deployment to production environments. This message really resonates with our customers who are interested in executing R functions on database-resident data while seamlessly leveraging Oracle Database as a high-performance computing (HPC) environment. The ability to develop and operationalize R scripts for analytical applications in one step is quite appealing.

One frequently asked question is how to convert existing R code that accesses data in flat files or the database to use Oracle R Enterprise. In this blog post, we talk about a few scenarios and how to begin a conversion from existing R code to using Oracle R Enterprise.

Consider the following scenarios:

Scenario 1: A stand-alone R script that generates its own data and simply returns a result. Data is not obtained from the file system or database. This may result from performing simulations where data is dynamically generated, or perhaps data accessed from a URL on the internet.

Scenario 2: An R script that loads data from a flat file such as a CSV file, performs some computations in R, and then writes the result back to a file.

Scenario 3: An R script that loads data from a database table, via one of the database connector packages like RODBC, RJDBC, or ROracle, and writes a result back to the database using SQL statements or package functions.

Scenario 1

A stand-alone R script might normally be run on a user’s desktop, invoked as a cron job, or even via Java to spawn an R engine and retrieve the result, but we’d like to operationalize its execution as part of a database application, invoked from SQL. Here’s a simple script to illustrate the concept of converting such a script to be executed at the database server using ORE’s embedded R execution. The script generates a data.frame with some random columns, performs summary on that data and returns the summary statistics, which are represented as an R table.

# generate data

set.seed(1)

n <- 1000

df <- 3

x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

# perform some analysis

res <- summary(x)

#return the result

res


To convert this to use ORE, create a function with appropriate arguments and body, for example:

myFunction1 <- function (n = 1000, df = 3,seed=1) {

set.seed(seed)

x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

res <- summary(x)

res

}

Next, load the ORE packages and connect to Oracle Database using the ore.connect function. Setting the all argument to TRUE loads metadata for all the tables and views in that schema. We then store the function in the R script repository, invoking it via ore.doEval.

# load ORE packages and connect to Oracle Database

library(ORE)

ore.connect("schema","sid","hostname","password",port=1521, all=TRUE)

# load function into R script repository

ore.scriptDrop("myFunction-1")

ore.scriptCreate("myFunction-1", myFunction1)

# invoke using embedded R execution at the database server

ore.doEval(FUN.NAME="myFunction-1")

> ore.doEval(FUN.NAME="myFunction-1")
       a                b                  c           
 Min.   :   1.0   Min.   :-3.00805   Min.   : 0.03449  
 1st Qu.: 250.8   1st Qu.:-0.69737   1st Qu.: 1.27386  
 Median : 500.5   Median :-0.03532   Median : 2.36454  
 Mean   : 500.5   Mean   :-0.01165   Mean   : 3.07924  
 3rd Qu.: 750.2   3rd Qu.: 0.68843   3rd Qu.: 4.25994  
 Max.   :1000.0   Max.   : 3.81028   Max.   :17.56720  

Of course, we’re using default values here. To provide different arguments, change the invocation as follows:

ore.doEval(FUN.NAME="myFunction-1", n=500, df=5, seed=2)

> ore.doEval(FUN.NAME="myFunction-1", n=500, df=5, seed=2)
       a               b                  c          
 Min.   :  1.0   Min.   :-2.72182   Min.   : 0.1621  
 1st Qu.:125.8   1st Qu.:-0.65346   1st Qu.: 2.6144  
 Median :250.5   Median : 0.04392   Median : 4.4592  
 Mean   :250.5   Mean   : 0.06169   Mean   : 5.0386  
 3rd Qu.:375.2   3rd Qu.: 0.79096   3rd Qu.: 6.8467  
 Max.   :500.0   Max.   : 2.88842   Max.   :17.0367  

Having successfully invoked this from the R client (my laptop), we can now invoke it from SQL. Here, we retrieve the summary result, which is an R table, as an XML string.

select *

from table(rqEval( NULL,'XML','myFunction-1'));

The result, including the XML output in a more structured form, can be viewed from SQL Developer.


What if we wanted to get the result to appear as a SQL table? Since the current result is an R table (an R object), we need to convert it to a data.frame to return it. We’ll make a few modifications to “myFunction-1” above. Most notable is the need to convert the table object in res to a data.frame. There are a variety of ways to do this.

myFunction2 <- function (n = 1000, df = 3,seed=1) {

# generate data

set.seed(seed)

x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

# perform some analysis

res <- summary(x)

# convert the table result to a data.frame

res.df <- as.matrix(res)

res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))

names(res.sum) <- c('a','b','c')

res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")

res.sum <- res.sum[,c(4,1:3)]

res.sum

}

# load function into R script repository

ore.scriptDrop("myFunction-2")

ore.scriptCreate("myFunction-2", myFunction2)

We’ll now modify the SQL statement to specify the format of the result.

select *

from table(rqEval( NULL,'select cast(''a'' as VARCHAR2(12)) as "statname",

1 "a", 1 "b", 1 "c" from dual ','myFunction-2'));

Here’s the result as viewed from SQL Developer.


This type of result could be incorporated into any SQL application accepting table or view input from a SQL query. That is particularly useful in combination with OBIEE dashboards via an RPD.

Scenario 2

If you’ve been loading data from a flat file, perhaps a CSV file, your R code may look like the following: it builds a model and writes that model to a file for future use, perhaps in scoring. It also generates a graph of the clusters, highlighting the individual points colored by their cluster id, with the centroids indicated by a star.

# read data

setwd("D:/datasets")

dat <- read.csv("myDataFile.csv")

# build a clustering model

cl <- kmeans(dat, 2)

# write model to file

save(cl, file="myClusterModel.dat")

# create a graph and write it to a file

pdf("myGraphFile.pdf")

plot(dat, col = cl$cluster)

points(cl$centers, col = 1:2, pch = 8, cex=2)

dev.off()

The resulting PDF file contains the cluster plot described above.


To convert this script for use in ORE, there are several options. We’ll explore two: the first involving minimal change to use embedded R execution, and the second leveraging in-database techniques. First, we’ll want the data we used above in variable dat to be loaded into the database.

# create a row id to enable ordered results (if a key doesn’t already exist)

dat$ID <- 1:nrow(dat)

# remove the table if it exists

ore.drop("MY_DATA")

# create the table using the R data.frame, resulting in an ore.frame named MY_DATA

ore.create(dat,"MY_DATA")

# assign the ID column as the row.names of the ore.frame

row.names(MY_DATA) <- MY_DATA$ID

In the first example, we’ll use embedded R execution and pass the data to the function via ore.tableApply. We’ll generate the graph, but simply display it within the function to allow embedded R execution to return the graph as a result. (Note we could also write the graph to a file in any directory accessible to the database server.) Instead of writing the model to a file, which requires keeping track of its location, as well as worrying about backup and recovery, we store the model in the database R datastore using ore.save. All this requires minimal change. As above, we could store the function in the R script repository and invoke it by name – both from R and SQL. In this example, we simply provide the function itself as argument.

myClusterFunction1 <- function(x) {

cl <- kmeans(x, 2)

ore.save(cl, name="myClusterModel",overwrite=TRUE)

plot(x, col = cl$cluster)

points(cl$centers, col = 1:2, pch = 8, cex=2)

TRUE

}

ore.tableApply(MY_DATA[,c('x','y')], myClusterFunction1,

ore.connect=TRUE,ore.png.height=700,ore.png.width=700)

The ore.tableApply function projects the x and y columns of MY_DATA as input and also specifies ore.connect as TRUE since we are using the R datastore, which requires a database connection. Optionally, we can specify control arguments to the PNG output. In this example, these are the height and width of the image.
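
As a brief follow-on sketch (same datastore name as above; ore.load is the companion to ore.save), the saved model can later be restored in any R session connected to the same schema, for example when scoring:

ore.load(name = "myClusterModel")   # restores the saved object 'cl' into the session
cl$centers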

For the second example, we convert this to leverage the ORE Transparency Layer. We’ll use the in-database K-Means algorithm and save the model in a datastore named “myClusterModel”, as we did above. Since ore.odmKMeans doesn’t automatically assign cluster ids (the data may be very large, or the assignments may not be required), the scoring is done separately. Note, however, that the prediction results also exist in the database as an ore.frame. To ensure ordering, we also assign row.names to the ore.frame pred. Lastly, we create the plot. Coloring the nodes requires pulling the cluster assignments; however, the points themselves can be accessed from the ore.frame. The centroid points are obtained from the cl$centers2 component of the cluster model.

# build a clustering model in-database

cl <- ore.odmKMeans(~., MY_DATA, 2, auto.data.prep=FALSE)

# save model in database R datastore

ore.save(cl,name="myClusterModel",overwrite=TRUE)

# generate predictions to assign each row a cluster id, supplement with original data

pred <- predict(cl,MY_DATA,supp=c('x','y','ID'),type="class")

# assign row names to ensure ordering of results

row.names(pred) <- pred$ID

# create the graph

plot(pred[,c('x','y')], col = ore.pull(pred$CLUSTER_ID))

points(cl$centers2[,c('x','y')], col = c(2,3), pch = 8, cex=2)

We can also combine using the transparency layer within an embedded R function. But we’ll leave that as an exercise to the reader.

Scenario 3

In this last scenario, the data already exists in the database, and one of the database interface packages, such as RODBC, RJDBC, or ROracle, is used to retrieve data from and write data to the database. We’ll illustrate this with ROracle, but the same holds for the other two packages.

# connect to the database

drv <- dbDriver("Oracle")

con <- dbConnect(drv, "mySchema", "myPassword")

# retrieve the data specifying a SQL query

dat <- dbGetQuery(con, 'select * from MY_RANDOM_DATA where "a" > 100')

# perform some analysis

res <- summary(dat)

# convert the table result to a data.frame for output as table

res.df <- as.matrix(res)

res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))

names(res.sum) <- c('a','b','c')

res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")

res.sum <- res.sum[,c(4,1:3)]

res.sum

dbWriteTable(con, "SUMMARY_STATS", res.sum)

Converting this to ORE is straightforward. We’re already connected to the database using ore.connect from previous scenarios, so the existing table MY_RANDOM_DATA is already available in the environment as an ore.frame. Executing ore.ls shows this table in the result, so we can just start using it.

> ore.ls(pattern="MY_RAND")

[1] "MY_RANDOM_DATA"

# no need to retrieve the data, use the transparency layer to compute summary

res <- with(MY_RANDOM_DATA , summary(MY_RANDOM_DATA[a > 100,]))

# convert the table result to a data.frame for output as table

res.df <- as.matrix(res)

res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))

names(res.sum) <- c('a','b','c')

res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")

res.sum <- res.sum[,c(4,1:3)]

# create the database table

ore.create(res.sum, "SUMMARY_STATS")

SUMMARY_STATS


As we did in previous scenarios, this script can also be wrapped in a function and used in embedded R execution. This too is left as an exercise to the reader.

Summary

As you can see from the three scenarios discussed here, converting a script that accesses no external data, accesses and manipulates file data, or accesses and manipulates database data can be accomplished with a few strategic modifications. More involved scripts, of course, may require additional manipulation. For example, if the SQL query performs complex joins and filtering, along with derived column creation, the user may want to convert this SQL to the corresponding ORE Transparency Layer code, thereby eliminating reliance on SQL. But that’s a topic for another post.
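
To make that last point concrete, here is a minimal sketch of the idea; the CUSTOMERS and ORDERS tables and their columns are hypothetical, assumed to be visible as ore.frames after ore.connect(..., all=TRUE):

# filter, join, and derive a column with the ORE Transparency Layer;
# the work is pushed to Oracle Database rather than expressed as a SQL string
big.orders <- ORDERS[ORDERS$AMOUNT > 100, ]
cust.orders <- merge(CUSTOMERS, big.orders, by = "CUST_ID")
cust.orders$AMOUNT_K <- cust.orders$AMOUNT / 1000
head(cust.orders)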

Wednesday Apr 17, 2013

Mind Reading... What are our customers thinking?

Overhauling analytics processes is becoming a recurring theme among customers. A major telecommunication provider recently embarked on overhauling their analytics process for customer surveys. They had three broad technical goals:

  • Provide an agile environment that empowers business analysts to test hypotheses based on survey results
  • Allow dynamic customer segmentation based on survey responses and even specific survey questions to drive hypothesis testing
  • Make results of new surveys readily available for research

The ultimate goal is to derive greater value from survey research that drives measurable improvements in survey service delivery, and as a result, overall customer satisfaction.

This provider chose Oracle Advanced Analytics (OAA) to power their survey research. Survey results and analytics are maintained in Oracle Database and delivered via a parameterized BI dashboard. Both the database and BI infrastructure are standard components in their architecture.

A parameterized BI dashboard enables analysts to create samples for hypothesis testing by filtering respondents to a survey question based on a variety of filtering criteria. This provider required the ability to deploy a range of statistical techniques depending on the survey variables, level of measurement of each variable, and the needs of survey research analysts.

Oracle Advanced Analytics offers a range of in-database statistical techniques complemented by a unique architecture supporting deployment of open source R packages in-database to optimize data transport to and from database-side R engines. Additionally, depending on the nature of functionality in such R packages, it is possible to leverage data-parallelism constructs available as part of in-database R integration. Finally, all OAA functionality is exposed through SQL, the ubiquitous language of the IT environment. This enables OAA-based solutions to be readily integrated with BI and other IT technologies.
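
For example, one such data-parallelism construct is ore.groupApply, which partitions an ore.frame by a column and runs an R function on each partition at the database server, potentially in parallel depending on database resources. A minimal sketch (SURVEY, SEGMENT, and SCORE are hypothetical names):

res <- ore.groupApply(SURVEY, SURVEY$SEGMENT,
         function(dat) c(n = nrow(dat), mean.score = mean(dat$SCORE)))
ore.pull(res)   # a list with one element per customer segment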

The survey application noted above has been in production for 3 months. It supports a team of 20 business analysts and has already begun to demonstrate measurable improvements in customer satisfaction.

In the rest of this blog, we explore the range of statistical techniques deployed as part of this application.

At the heart of survey research is hypothesis testing. A completed customer satisfaction survey contains data used to draw conclusions about the state of the world. In the survey domain, hypothesis testing compares answers to specific survey questions across two distinct groups of customers - such groups are identified based on knowledge of the business and specified technically through filtering predicates.

Hypothesis testing sets up the world as consisting of 2 mutually exclusive hypotheses:

a) Null hypothesis - states that there is no difference in satisfaction levels between the 2 groups of customers

b) Alternate hypothesis - states that there is a significant difference in satisfaction levels between the 2 groups of customers

Only one of these can be true, and which one is determined by the strength of evidence - the probability or likelihood - for the null hypothesis over the alternate hypothesis. Simplistically, the degree of difference between, e.g., the average score from a specific survey question across two customer groups could provide the necessary evidence in helping decide which hypothesis is true.

In practice the process of providing evidence to make a decision involves having access to a range of test statistics – a number calculated from each group that helps determine the choice of null or alternate hypothesis. A great deal of theory, experience, and business knowledge goes into selecting the right statistic based on the problem at hand.

The t-statistic (available in-database) is a fundamental function used in hypothesis testing that helps understand the differences in means across two groups. When the t-value computed for a specific survey question across 2 groups of customers is extreme, the alternative hypothesis is likely to be true. It is common to set a critical value that the observed t-value should exceed to conclude that the satisfaction survey results across the two groups are significantly different. Other similar statistics available in-database include the F-test, cross tabulation (frequencies of various response combinations captured as a table), related hypothesis testing functions such as chi-square functions, Fisher's exact test, Kendall's coefficients, correlation coefficients, and a range of lambda functions.
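
As a toy illustration of the mechanics, in plain base R with made-up scores (not the in-database implementation or the customer's data):

set.seed(42)
groupA <- rnorm(200, mean = 7.8, sd = 1.2)   # satisfaction scores, customer group A
groupB <- rnorm(180, mean = 7.4, sd = 1.3)   # satisfaction scores, customer group B
t.test(groupA, groupB)   # an extreme t-value / small p-value favors the alternate hypothesis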

If an analyst desires to compare across more than 2 groups then analysis of variance (ANOVA) is a collection of techniques that is commonly used. This is an area where the R package ecosystem is rich with several proven implementations. The R stats package has implementations of several test statistics and function glm allows analysis of count data common in survey results including building Poisson and log linear models. R's MASS package implements a popular survey analysis technique called iterative proportional fitting. R's survey package has a rich collection of features (http://faculty.washington.edu/tlumley/survey/).
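
For instance, a count outcome such as the number of calls per respondent can be modeled with glm; a toy sketch with made-up data:

set.seed(7)
d <- data.frame(calls = rpois(300, lambda = 2),
                plan  = sample(c("basic", "premium"), 300, replace = TRUE))
summary(glm(calls ~ plan, data = d, family = poisson))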

The provider was specifically interested in one function in the survey package - raking (also known as sample balancing) - a process that assigns a weight to each customer that responded to a survey such that the weighted distribution of the sample is in very close agreement with other customer attributes, such as the type of cellular plan, demographics, or average bill amount. Raking is an iterative process that uses the sample design weight as the starting weight and terminates when a convergence is achieved.
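
A minimal sketch of raking with the survey package follows; the column names and margin totals are made up for illustration:

library(survey)
set.seed(1)
resp <- data.frame(plan   = sample(c("prepaid", "postpaid"), 500, replace = TRUE),
                   region = sample(c("N", "S", "E", "W"), 500, replace = TRUE),
                   score  = sample(1:10, 500, replace = TRUE))
des <- svydesign(ids = ~1, data = resp)          # starting design, equal weights
pop.plan   <- data.frame(plan   = c("prepaid", "postpaid"), Freq = c(300, 200))
pop.region <- data.frame(region = c("N", "S", "E", "W"),    Freq = c(125, 125, 125, 125))
raked <- rake(des, sample.margins = list(~plan, ~region),
              population.margins = list(pop.plan, pop.region))
svymean(~score, raked)                           # weighted mean under the raked weights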

For this survey application, R scripts that expose a wide variety of statistical techniques - some in-database accessible through the transparency layer in Oracle R Enterprise and some in CRAN packages - were built and stored in the Oracle R Enterprise in-database R script repository. These parameterized scripts accept various arguments that identify samples of customers to work with as well as specific constraints for the various hypothesis test functions. The net result is greater agility since the business analyst determines both the set of samples to analyze as well as the application of the appropriate technique to the sample based on the hypothesis being pursued.

For more information see these links for Oracle's R Technologies software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop

Monday Apr 15, 2013

Is the size of your lm model causing you headaches?

If you build an R lm model with a relatively large number of rows, you may be surprised by just how large that lm model is and what impact it has on your environment and application.

Why might you care about size? The most obvious is that the size of R objects impacts the amount of RAM available for further R processing or loading of more data. However, it also has implications for how much space is required to save that model or the time required to move it around the network. For example, you may want to move the model from the database server R engine to the client R engine when using Oracle R Enterprise Embedded R Execution. If the model is too large, you may encounter latency when trying to retrieve the model or even receive the following error:

Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch,  :
  ORA-20000: RQuery error
Error : serialization is too large to store in a raw vector

If you get this error, there are at least a few options:

  • Perform summary component access, like coefficients, inside the embedded R function and return only what is needed (see the sketch after this list)
  • Save the model in a database R datastore and manipulate that model at the database server to avoid pulling it to the client (also sketched after this list)
  • Reduce the size of the model by eliminating large and unneeded components
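
Brief sketches of the first two options (assuming the ONTIME_S ore.frame introduced below, an existing ore.connect session, and a made-up datastore name lmDatastore):

# Option 1: build the model in the embedded R function, return only the coefficients
coefs <- ore.pull(ore.tableApply(ONTIME_S[, c("ARRDELAY", "DISTANCE", "DEPDELAY")],
           function(dat) coef(lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat))))

# Option 2: keep the model at the database server in an R datastore
ore.tableApply(ONTIME_S[, c("ARRDELAY", "DISTANCE", "DEPDELAY")], ore.connect = TRUE,
  function(dat) {
    mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
    ore.save(mod, name = "lmDatastore", overwrite = TRUE)
    TRUE
  })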

In this blog post, we focus on the third approach and look at the size of lm model components, what you can do to control lm model size, and the implications for doing so. With vanilla R, objects are the "memory" serving as the repository for repeatability. As a result, models tend to be populated with the data used to build them to ensure model build repeatability.

When working with database tables, this "memory" is not needed because governance mechanisms are already in place to ensure either data does not change or logs are available to know what changes took place. Hence it is unnecessary to store the data used to build the model into the model object.

An lm model consists of several components, for example:

coefficients, residuals, effects, fitted.values, rank, qr, df.residual, call, terms, xlevels, model, na.action

Some of these components may appear deceptively small using R’s object.size function. The following script builds an lm model to help reveal what R reports for the size of various components. The examples use a sample of the ONTIME airline arrival and departure delays data set for domestic flights. The ONTIME_S data set is an ore.frame proxy object for data stored in an Oracle database and consists of 219932 rows and 26 columns. The R data.frame ontime_s is this same data pulled to the client R engine using ore.pull and is ~39.4MB.
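
For reference, that local copy can be created with a single call, assuming an existing ore.connect session in which ONTIME_S is visible:

ontime_s <- ore.pull(ONTIME_S)   # ~39.4MB data.frame on the client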

Note: The results reported below use R 2.15.2 on Windows. Serialization of some components in the lm model has been improved in R 3.0.0, but the implications are the same.

f.lm.1 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)

lm.fit.1 <- f.lm.1(ontime_s)

object.size(lm.fit.1)

54807720 bytes

Using the object.size function on the resulting model, the size is about 55MB. If we only plan to score data with this model, that seems like a lot of bloat for the few coefficients needed for scoring. Also, moving this object over a network will not be instantaneous. But is this the true size of the model?

A better way to determine just how big an object is, and what space is actually required to store the model or time to move it across a network, is the R serialize function.

length(serialize(lm.fit.1,NULL))

[1] 65826324

Notice that the size reported by object.size is different from that of serialize – a difference of 11MB or ~20% greater.

What is taking up so much space? Let’s invoke object.size on each component of this lm model:

lapply(lm.fit.1, object.size)
$coefficients

424 bytes

$residuals

13769600 bytes

$effects

3442760 bytes

$rank

48 bytes

$fitted.values

13769600 bytes

$assign

56 bytes

$qr

17213536 bytes

$df.residual

48 bytes

$na.action

287504 bytes

$xlevels

192 bytes

$call

1008 bytes

$terms

4432 bytes

$model

6317192 bytes

The components residuals, fitted.values, qr, model, and even na.action are large. Do we need all these components?

The lm function provides arguments to control some aspects of model size. This can be done, for example, by specifying model=FALSE and qr=FALSE. However, as we saw above, there are other components that contribute heavily to model size.

f.lm.2 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY,
                           data = dat, model=FALSE, qr=FALSE)

lm.fit.2 <- f.lm.2(ontime_s)

length(serialize(lm.fit.2,NULL))

[1] 51650410

object.size(lm.fit.2)

31277216 bytes

The resulting serialized model size is down to about 52MB, which is not significantly smaller than the full model. The difference from the size reported by object.size is now ~20MB, or 39% smaller.

Does removing these components have any effect on the usefulness of an lm model? We’ll explore this using four commonly used functions: coef, summary, anova, and predict. If we try to invoke summary on lm.fit.2, the following error results:

summary(lm.fit.2)

Error in qr.lm(object) : lm object does not have a proper 'qr' component.

Rank zero or should not have used lm(.., qr=FALSE).

The same error results when we try to run anova. Unfortunately, the predict function also fails with the error above. The qr component is necessary for these functions. Function coef returns without error.

coef(lm.fit.2)

(Intercept) DISTANCE DEPDELAY

0.225378249 -0.001217511 0.962528054

If only coefficients are required, these settings may be acceptable. However, as we’ve seen, removing the model and qr components, while each is large, still leaves a large model. The really large components appear to be the effects, residuals, and fitted.values. We can explicitly nullify them to remove them from the model.

f.lm.3 <- function(dat) {
mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY,
data = dat, model=FALSE, qr=FALSE)
mod$effects <- mod$residuals <- mod$fitted.values <- NULL
  mod
}

lm.fit.3 <- f.lm.3(ontime_s)

length(serialize(lm.fit.3,NULL))

[1] 24089000

object.size(lm.fit.3)

294968 bytes

Thinking the model size should be small, we might be surprised to see the results above. The function object.size reports ~295KB, but serializing the model shows 24MB, a difference of 23.8MB or 98.8%. What happened? We’ll get to that in a moment. First, let’s explore what effect nullifying these additional components has on the model.

To answer this, we’ll keep model and qr (the lm defaults), and focus on effects, residuals, and fitted.values. If we nullify effects, the anova results are invalid, but the other results are fine. If we nullify residuals, summary cannot produce residual and coefficient statistics, and it also produces an odd F-statistic with a warning:

Warning message:

In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

The function anova produces invalid F values and residual statistics, and issues a warning:

Warning message:

In anova.lm(mod) :

ANOVA F-tests on an essentially perfect fit are unreliable

Otherwise, both predict and coef work fine.

If we nullify fitted.values, summary produces an invalid F-statistic, issuing the warning:

Warning message:

In mean.default(f) : argument is not numeric or logical: returning NA


However, there are no adverse effects on the results of the other three functions.

Depending on what we need from our model, some of these components could be eliminated. But let’s continue looking at each remaining component, not with object.size, but with serialize. Below, we use lapply to compute the serialized length of each model component. This reveals that the terms component is actually the largest component, despite object.size reporting only 4432 bytes above.

as.matrix(lapply(lm.fit.3, function(x) length(serialize(x,NULL))))

             [,1]
coefficients 130
rank         26
assign       34
df.residual  26
na.action    84056
xlevels      55
call         275
terms        24004509

If we nullify the terms component, the model becomes quite compact. (By the way, if we simply nullify terms, then summary, anova, and predict all fail.) Why is the terms component so large? It turns out it has an environment object as an attribute. The environment contains the variable dat, which holds the original data with 219932 rows and 26 columns. R’s serialize function includes this object, which is why the serialized model is so large. The function object.size ignores these objects.

attr(lm.fit.1$terms, ".Environment")  
<environment: 0x1d6778f8>
ls(envir = attr(lm.fit.1$terms, ".Environment"))        
[1] "dat"
d <- get("dat",envir=envir)
dim(d)
[1] 219932 26
length(serialize(attr(lm.fit.1$terms, ".Environment"), NULL))
[1] 38959319
object.size(attr(lm.fit.1$terms, ".Environment"))
56 bytes

If we remove this object from the environment, the serialized object size also becomes small.

rm(list=ls(envir = attr(lm.fit.1$terms, ".Environment")),
envir = attr(lm.fit.1$terms, ".Environment"))  
ls(envir = attr(lm.fit.1$terms, ".Environment"))
character(0)
length(serialize(lm.fit.1, NULL))
[1] 85500
lm.fit.1

Call:
lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE,
    qr = FALSE)

Coefficients:
(Intercept)     DISTANCE     DEPDELAY 
   0.225378    -0.001218     0.962528 

Is the associated environment essential to the model? If not, we could empty it to significantly reduce model size. We'll rebuild the model using the function f.lm.full:

f.lm.full <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
lm.fit.full <- f.lm.full(ontime_s)
ls(envir=attr(lm.fit.full$terms, ".Environment"))
[1] "dat"
length(serialize(lm.fit.full,NULL))
[1] 65826324

We'll create the model removing some components, as defined in the function f.lm.small below:

f.lm.small <- function(dat) {
  f.lm <- function(dat) {
    mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model=FALSE)
    mod$fitted.values <- NULL
    mod
  }
  mod <- f.lm(dat)
  # empty the environment associated with the local function
  e <- attr(mod$terms, ".Environment")
  # set parent env to .GlobalEnv so serialization doesn't include its contents
  parent.env(e) <- .GlobalEnv
  rm(list=ls(envir=e), envir=e) # remove all objects from this environment
  mod
}

lm.fit.small <- f.lm.small(ontime_s)
ls(envir=attr(lm.fit.small$terms, ".Environment")) 
character(0)
length(serialize(lm.fit.small, NULL))
[1] 16219251

We can use the same function with embedded R execution.

lm.fit.ere <- ore.pull(ore.tableApply(ONTIME_S, f.lm.small))
ls(envir=attr(lm.fit.ere$terms, ".Environment"))
character(0)
length(serialize(lm.fit.ere, NULL))
[1] 16219251
as.matrix(lapply(lm.fit.ere, function(x) length(serialize(x,NULL))))    
              [,1]  
coefficients  130   
residuals     4624354
effects       3442434
rank          26    
fitted.values 4624354
assign        34    
qr            8067072
df.residual   26    
na.action     84056 
xlevels       55    
call          245   
terms         938   

Making this change does not affect the workings of the model for coef, summary, anova, or predict. For example, summary produces expected results:

summary(lm.fit.ere)

Call:
lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE)

Residuals:
     Min       1Q   Median       3Q      Max
-1462.45    -6.97    -1.36     5.07   925.08

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  2.254e-01  5.197e-02   4.336 1.45e-05 ***
DISTANCE    -1.218e-03  5.803e-05 -20.979  < 2e-16 ***
DEPDELAY     9.625e-01  1.151e-03 836.289  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.73 on 215144 degrees of freedom
  (4785 observations deleted due to missingness)
Multiple R-squared: 0.7647,     Adjusted R-squared: 0.7647
F-statistic: 3.497e+05 on 2 and 215144 DF,  p-value: < 2.2e-16

Using the model for prediction also produces expected results.

lm.pred <- function(dat, mod) {
  prd <- predict(mod, newdata=dat)
  prd[as.integer(rownames(prd))] <- prd
  cbind(dat, PRED = prd)
}

dat.test <- with(ontime_s, ontime_s[YEAR == 2003 & MONTH == 5,
c("ARRDELAY", "DISTANCE", "DEPDELAY")])
head(lm.pred(dat.test, lm.fit.ere))
       ARRDELAY DISTANCE DEPDELAY        PRED
163267        0      748       -2 -2.61037575
163268       -8      361        0 -0.21414306
163269       -5      484        0 -0.36389686
163270       -3      299        3  2.74892676
163271        6      857       -6 -6.59319662
163272      -21      659       -8 -8.27718564
163273       -2     1448        0 -1.53757703
163274        5      238        9  8.59836323
163275       -5      744        0 -0.68044960
163276       -3      199        0 -0.01690635

As shown above, an lm model can become quite large. At least for some applications, several of these components may be unnecessary, allowing the user to significantly reduce the size of the model and space required for saving or time for transporting the model. Relying on Oracle Database to store the data instead of the R model object further allows for significant reduction in model size.

About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.
