Tuesday May 27, 2014

R Package Installation with Oracle R Enterprise


Programming languages give developers the opportunity to write reusable functions and to bundle those functions into logical deployable entities. In R, these are called packages. R has thousands of such packages provided by an almost equally large group of third-party contributors. To allow others to benefit from these packages, users can share packages on the CRAN system for use by the vast R development community worldwide.

R's package system along with the CRAN framework provides a process for authoring, documenting and distributing packages to millions of users. In this post, we'll illustrate the various ways in which such R packages can be installed for use with R and together with Oracle R Enterprise. In the following, the same instructions apply when using either open source R or Oracle R Distribution.

Specifically, we cover the following package installation scenarios:


R command line
Linux shell command line
Use with Oracle R Enterprise
Installation on Exadata or RAC
Installing all packages in a CRAN Task View
Troubleshooting common errors


1. R Package Installation Basics

R package installation basics are outlined in Chapter 6 of the R Installation and Administration Guide. There are two ways to install packages from the command line: from the R command line and from the shell command line. For this first example on Oracle Linux using Oracle R Distribution, we’ll install the arules package as root so that packages will be installed in the default R system-wide location where all users can access it, /usr/lib64/R/library.

Within R, the install.packages function always attempts to install the latest version of the requested package available on CRAN:

R> install.packages("arules")


If the arules package depends upon other packages that are not already installed locally, the R installer automatically downloads and installs those required packages. This is a huge benefit that frees users from the task of identifying and resolving those dependencies.


You can also install R packages from the shell command line. This is useful when an internet connection is not available on the target machine or for installing packages not yet uploaded to CRAN. To install packages this way, first locate the package on
CRAN and then download the package source to your local machine. For example:

$ wget http://cran.r-project.org/src/contrib/arules_1.1-2.tar.gz

Then, install the package using the command R CMD INSTALL:

$ R CMD INSTALL arules_1.1-2.tar.gz


A major difference between installing R packages at the R command line and at the shell command line is that package dependencies must be resolved manually when installing at the shell command line.
Package dependencies are listed in the Depends section of the package’s CRAN site. If dependencies are not identified and installed prior to the package’s installation, you will see an error similar to:

ERROR: dependency ‘xxx’ is not available for package ‘yyy’


As a best practice and to save time, always refer to the package’s CRAN site to understand the package dependencies prior to attempting an installation.
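
One way to check those dependencies from within R before installing is to query the CRAN package database using the base tools package (a short sketch; it assumes your CRAN repository is set and reachable):

R> db <- available.packages()                 # current CRAN package metadata
R> tools::package_dependencies("arules", db = db, which = c("Depends", "Imports"))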


If you don’t run R as root, you won’t have permission to write packages into the default system-wide location and you will be prompted to create a personal library accessible by your userid. You can accept the personal library path chosen by R, or specify the library location by passing parameters to the
install.packages function. For example, to create an R package repository in your home directory:

R> install.packages("arules", lib="/home/username/Rpackages")

or


$ R CMD INSTALL arules_1.1-2.tar.gz --library=/home/username/Rpackages


Refer to the
install.packages help file in R or execute R CMD INSTALL --help at
the shell command line for a full list of command line options.


To set the library location and avoid having to specify this at every package install, simply create the R startup environment file .Renviron in your home directory if it does not already exist, and add the following line to it:

R_LIBS_USER = "/home/username/Rpackages"
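
On the next R startup, you can confirm the setting took effect (a quick check; the output depends on your environment):

R> Sys.getenv("R_LIBS_USER")   # should return "/home/username/Rpackages"
R> .libPaths()                 # the personal library should now appear first in the search path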


2. Setting the Repository

Each time you install an R package from the R command line, you are asked which CRAN mirror, or server, R should use. To set the repository and avoid having to specify this during every package installation, create the R startup command file .Rprofile in your home directory and add the following R code to it:

cat("Setting Seattle repository")
r = getOption("repos") 
r["CRAN"] = "http://cran.fhcrc.org/"
options(repos = r)
rm(r)

This code snippet sets the R package repository to the Seattle CRAN mirror at the start of each R session.


3. Installing R Packages for use with Oracle R Enterprise

Embedded R execution with Oracle R Enterprise allows the use of CRAN or other third-party R packages in user-defined R functions executed on the Oracle Database server. The steps for installing and configuring packages for use with Oracle R Enterprise are the same as for open source R. The database-side R engine just needs to know where to find the R packages.

The Oracle R Enterprise installation is performed by user
oracle, which typically does not have write permission to the default site-wide library, /usr/lib64/R/library. On Linux and UNIX platforms, the Oracle R Enterprise Server installation provides the ORE script, which is executed from the operating system shell to install R packages and to start R. The ORE script is a wrapper for the default R script, a shell wrapper for the R executable. It can be used to start R, run batch scripts, and build or install R packages. Unlike the default R script, the ORE script installs packages to a location writable by user oracle and accessible by all ORE users - $ORACLE_HOME/R/library.
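
As a quick sanity check after starting R with the ORE script (a sketch; the exact path depends on your Oracle home), the Oracle R Enterprise library should appear in the library search path:

R> .libPaths()   # expect an entry such as $ORACLE_HOME/R/library ahead of /usr/lib64/R/library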

To install a package on the database server so that it can be used by any R user and for use in embedded R execution, an Oracle DBA would typically download 
the package source from CRAN using wget. If the package depends on any packages that are not in the R distribution in use, download the sources for those packages, also. 

For a single Oracle Database instance, replace the R script with ORE to install the packages in the same location as the Oracle R Enterprise packages.

$ wget http://cran.r-project.org/src/contrib/arules_1.1-2.tar.gz
$ ORE CMD INSTALL arules_1.1-2.tar.gz


Behind the scenes, the ORE script performs the equivalent of setting R_LIBS_USER to the value of
$ORACLE_HOME/R/library, and all R packages installed with the ORE script are installed to this location. For installing a package on multiple database servers, such as those in an Oracle Real Application Clusters (Oracle RAC) or a multinode Oracle Exadata Database Machine environment, use the ORE script in conjunction with the Exadata Distributed Command Line Interface (DCLI) utility.

$ dcli -g nodes -l oracle ORE CMD INSTALL arules_1.1-2.tar.gz

The DCLI -g flag designates a file containing a list of nodes to install on, and the -l flag specifies the user id to use when executing the commands. For more information on using DCLI with Oracle R Enterprise, see Chapter 5 in the Oracle R Enterprise Installation Guide.

If you are using an Oracle R Enterprise client, install the package the same as any R package, bearing in mind that you must install the same version of the package on both the client and server machines to avoid incompatibilities.


4. CRAN Task Views

CRAN also maintains a set of Task Views that identify packages associated with a particular task or methodology. Task Views are helpful in guiding users through the huge set of available R packages. They are actively maintained by volunteers who include detailed annotations for routines and packages. If you find one of the task views is a perfect match, you can install every package in that view using the ctv package - an R package for automating package installation.

To use the ctv package to install a task view, first, install and load the ctv package.

R> install.packages("ctv")

R> library(ctv)


Then query the names of the available task views and install the view you choose.


R> available.views()
R> install.views("TimeSeries")


5. Using and Managing R packages

To use a package, start up R and load packages one at a time with the library command.

Load the
arules package in your R session.

R> library(arules)

Verify the version of
arules installed.

R> packageVersion("arules")

[1] '1.1.2'


Verify the version of
arules installed on the database server using embedded R execution.


R> ore.doEval(function() packageVersion("arules"))


View the help file for the apriori function in the arules package:


R> ?apriori


Over time, your package repository will contain more and more packages, especially if you are using the system-wide repository where others are adding additional packages. It’s good to know the entire set of R packages accessible in your environment. To list all available packages in your local R session, use the
installed.packages command:

R> myLocalPackages <- row.names(installed.packages())

R> myLocalPackages


To access the list of available packages on the ORE database server from the ORE client, use the following embedded R syntax:

R> myServerPackages <- ore.doEval(function() row.names(installed.packages()))
R> myServerPackages
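
With both lists in hand, a simple comparison shows which packages are present on only one side (a small sketch using base R):

R> setdiff(myServerPackages, myLocalPackages)   # installed on the database server only
R> setdiff(myLocalPackages, myServerPackages)   # installed on the client only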


6. Troubleshooting Common Problems

Installing Older Versions of R packages

If you immediately upgrade to the latest version of R, you will have no problem installing the most recent versions of R packages. However, if your version of R is older, some of the more recent package releases will not work and
install.packages will generate a message such as:

Warning message:
In install.packages("arules")
: package ‘arules’ is not available


This is when you have to go to the
Old sources link on the CRAN page for the arules
package and determine which version is compatible with your version of R.


Begin by determining what version of R you are using:


$ R --version

Oracle Distribution of R version 3.0.1 (--) -- "Good Sport"
Copyright (C) The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)


Given that R-3.0.1 was
released May 16, 2013, any version of the arules package released after this date may work. Scanning the arules archive, we might try installing version 1.1-1, released in January of 2014:

$ wget http://cran.r-project.org/src/contrib/Archive/arules/arules_1.1-1.tar.gz
$ R CMD INSTALL arules_1.1-1.tar.gz

For use with ORE:

$ ORE CMD INSTALL arules_1.1-1.tar.gz

The "package not available" error can also be thrown if the package you’re trying to install lives elsewhere, either another R package site, or it’s been removed from CRAN. A quick Google search usually leads to more information on the package’s location and status.


Oracle R Enterprise is not in the R library path

On Linux hosts, after installing the ORE server components, starting R, and attempting to load the ORE packages, you may receive the error:


R> library(ORE)
Error in library(ORE) : there is no package called ‘ORE’

If you know the ORE packages have been installed and you receive this error, this is the result of not starting R with the ORE script. To resolve this problem, exit R and restart using the ORE script. After restarting R and running the command to load the ORE packages, you should not receive any errors.

$ ORE
R> library(ORE)

On Windows servers, the solution is to make the location of the ORE packages visible to R by adding them to the R library paths. To accomplish this, exit R, then add the following line to the .Rprofile file, which on Windows is located in the R etc directory, C:\Program Files\R\R-<version>\etc:

.libPaths("<path to $ORACLE_HOME>/R/library")

The above line will tell R to include the R directory in the Oracle home as part of its search path. When you start R, the path above will be included, and future R package installations will also be saved to $ORACLE_HOME/R/library. This path should be writable by the user oracle, or the userid for the DBA tasked with installing R packages.

Binary package compiled with different version of R

By default, R will install pre-compiled versions of packages if they are found. If the version of R under which the package was compiled does not match your installed version of R, you will get a message such as:

Warning message: package ‘xxx’ was built under R version 3.0.0

The solution is to download the package source and build it for your version of R.

$ wget
http://cran.r-project.org/src/contrib/Archive/arules/arules_1.1-1.tar.gz
$ R CMD INSTALL arules_1.1-1.tar.gz

For use with ORE:

$ ORE CMD INSTALL arules_1.1-1.tar.gz

Unable to execute files in /tmp directory

By default, R uses the /tmp directory to install packages. On security-conscious machines, the /tmp directory is often marked as "noexec" in the /etc/fstab file. This means that no file under /tmp can ever be executed, and users who attempt to install R packages will receive an error:

ERROR: 'configure' exists but is not executable -- see the 'R Installation and Administration Manual'

The solution is to set the TMP and TMPDIR environment variables to a location which R will use as the compilation directory. For example:

$ mkdir <some path>/tmp
$ export TMPDIR=<some path>/tmp
$ export TMP=<some path>/tmp

This error typically appears on Linux client machines and not database servers, as Oracle Database writes to the value of the 
TMP environment variable for several tasks, including holding temporary files during database installation.
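
After restarting R, you can verify that the new location is being picked up (a quick check; the actual path depends on your settings):

R> Sys.getenv(c("TMP", "TMPDIR"))
R> tempdir()    # should now point under <some path>/tmp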


7. Creating your own R package

Creating your own package and submitting to CRAN is for advanced users, but it is not difficult. The procedure to follow, along with details of R's package system, is detailed in the
Writing R Extensions manual.

Monday May 19, 2014

Model cross-validation with ore.CV()

In this blog post we illustrate how to use Oracle R Enterprise for performing cross-validation of regression and classification models. We describe a new utility R function ore.CV that leverages features of Oracle R Enterprise and is available for download and use.

Predictive models are usually built on given data and verified on held-aside or unseen data. Cross-validation is a model improvement technique that avoids the limitations of a single train-and-test experiment by building and testing multiple models via repeated sampling from the available data. Its purpose is to offer better insight into how well the model would generalize to new data and to avoid over-fitting and deriving wrong conclusions from misleading peculiarities of the seen data.

In a k-fold cross-validation the data is partitioned into k (roughly) equal size subsets. One of the subsets is retained for testing and the remaining k-1 subsets are used for training. The process is repeated k times with each of the k subsets serving exactly once as testing data. Thus, all observations in the original data set are used for both training and testing.

The choice of k depends, in practice, on the size n of the data set. For large data, k=3 could be sufficient. At the other extreme, for very small data, leave-one-out cross-validation (LOOCV, where k=n) uses a single observation from the original sample as testing data and the remaining observations as training data. Common choices are k=10 or k=5.
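
For illustration, fold membership can be assigned with a couple of lines of plain R (a sketch of the general idea, not the internal implementation of ore.CV):

k <- 5
n <- nrow(iris)
folds <- sample(rep(1:k, length.out = n))   # random fold label 1..k for each observation
table(folds)                                # the subsets are of (roughly) equal size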

For a select set of algorithms and cases, the function ore.CV performs cross-validation for models generated by ORE regression and classification functions using in-database data. ORE embedded R execution is leveraged to also support cross-validation for models built with vanilla R functions.

Usage

ore.CV(funType, function, formula, dataset, nFolds=<nb.folds>, fun.args=NULL, pred.args=NULL, pckg.lst=NULL)
  • funType - "regression" or "classification"
  • function - ORE predictive modeling functions for regression & classification or R function (regression only)
  • formula - object of class "formula"
  • dataset - name of the ore.frame
  • nFolds - number of folds
  • fun.args - list of supplementary arguments for 'function'
  • pred.args - list of supplementary arguments for 'predict'. Must be consistent with the model object/model generator 'function'.
  • pckg.lst - list of packages to be loaded by the DB R engine for embedded execution.
The set of functions supported for ORE include:
  • ore.lm
  • ore.stepwise
  • ore.neural
  • ore.glm
  • ore.odmDT
  • ore.odmSVM
  • ore.odmGLM
  • ore.odmNB
The set of functions supported for R include:
  • lm
  • glm
  • svm
Note: The 'ggplot' and 'reshape' packages are required on the R client side for data post-processing and plotting (classification CV).

Examples

In the following examples, we illustrate various ways to invoke ore.CV using some datasets we have seen in previous posts. The datasets can be created as ore.frame objects using:
 
IRIS <- ore.push(iris)
LONGLEY <- ore.push(longley)
library(rpart)
KYPHOSIS <- ore.push(kyphosis)
library(PASWR)
TITANIC3 <- ore.push(titanic3)
MTCARS <- ore.push(mtcars)
(A) Cross-validation for models generated with ORE functions.
 
# Basic specification
ore.CV("regression","ore.lm",Sepal.Length~.-Species,"IRIS",nFolds=5)
ore.CV("regression","ore.neural",Employed~GNP+Population+Year,
            "LONGLEY",nFolds=5)

#Specification of function arguments
ore.CV("regression","ore.stepwise",Employed~.,"LONGLEY",nFolds=5,
            fun.args= list(add.p=0.15,drop.p=0.15))
ore.CV("regression","ore.odmSVM",Employed~GNP+Population+Year,
             "LONGLEY",nFolds=5, fun.args="regression")

#Specification of function arguments and prediction arguments
ore.CV("classification","ore.glm",Kyphosis~.,"KYPHOSIS",nFolds=5,
             fun.args=list(family=binomial()),pred.args=list(type="response"))
ore.CV("classification","ore.odmGLM",Kyphosis~.,"KYPHOSIS",nFolds=5,
            fun.args= list(type="logistic"),pred.args=list(type="response"))
 
(B) Cross-validation for models generated with R functions via the ORE embedded execution mechanism.

ore.CV("regression","lm",mpg~cyl+disp+hp+drat+wt+qsec,"MTCARS",nFolds=3)
ore.CV("regression","svm",Sepal.Length~.-Species,"IRIS",nFolds=5,
             fun.args=list(type="eps-regression"), pckg.lst=c("e1071")) 


Restrictions

  • The signature of the model generator ‘function’ must be of the following type: function(formula, data, ...). For example, functions like ore.stepwise, ore.odmGLM, and lm are supported, but the R step(object, scope, ...) function for AIC model selection via the stepwise algorithm does not satisfy this requirement.
  • The model validation process requires the prediction function to return a (1-dimensional) vector with the predicted values. If the (default) returned object is different, the requirement must be met by providing an appropriate argument through ‘pred.args’. For example, for classification with ore.glm or ore.odmGLM the user should specify pred.args=list(type="response").
  • Cross-validation of classification models via embedded R execution of vanilla R functions is not supported yet.
  • Remark: Cross-validation is not a technique intended for large data, as the cost of multiple model training and testing can become prohibitive. Moreover, with large data sets, an effective train and test split can be produced by sampling. The current ore.CV does not impose any restrictions on the size of the input, and users working with large data should use good judgment when choosing the model generator and the number of folds.

    Output

    The function ore.CV provides output on several levels: datastores to contain model results, plots, and text output.

    Datastores

    The results of each cross-validation run are saved into a datastore named dsCV_funTyp_data_Target_function_nFxx, where funTyp, function, and nF(=nFolds) have been described above, data is the name of the dataset, and Target is the left-hand side of the formula. For example, if one runs the ore.neural, ore.glm, and ore.odmNB-based cross-validation examples from above, the following three datastores are produced:
    
    R> ds <- ore.datastore(pattern="dsCV")
    R> print(ds)
    datastore.name object.count size creation.date description
    1 dsCV_classification_KYPHOSIS_Kyphosis_ore.glm_nF5 10 4480326 2014-04-30 18:19:55 <NA>
    2 dsCV_classification_TITANIC3_survived_ore.odmNB_nF5 10 592083 2014-04-30 18:21:35 <NA>
    3 dsCV_regression_LONGLEY_Employed_ore.neural_nF5 10 497204 2014-04-30 18:16:35 <NA>
    
    Each datastore contains the models and prediction tables for every fold. Every prediction table has 3 columns: the fold index together with the target variable/class and the predicted values. If we consider the example from above and examine the most recent datastore (the Naive Bayes classification CV), we would see:
    
    R> ds.last <- ds$datastore.name[which.max(as.numeric(ds$creation.date))]
    R> ore.datastoreSummary(name=ds.last)
    object.name class size length row.count col.count
    1 model.fold1 ore.odmNB 66138 9 NA NA
    2 model.fold2 ore.odmNB 88475 9 NA NA
    3 model.fold3 ore.odmNB 110598 9 NA NA
    4 model.fold4 ore.odmNB 133051 9 NA NA
    5 model.fold5 ore.odmNB 155366 9 NA NA
    6 test.fold1 ore.frame 7691 3 261 3
    7 test.fold2 ore.frame 7691 3 262 3
    8 test.fold3 ore.frame 7691 3 262 3
    9 test.fold4 ore.frame 7691 3 262 3
    10 test.fold5 ore.frame 7691 3 262 3
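
    To inspect the objects from a particular fold on the client, one could restore them from the datastore with ore.load (a sketch; the object names are those listed above):

    R> ore.load(name = ds.last, list = "test.fold1")   # restores the ore.frame test.fold1 into the workspace
    R> head(test.fold1)                                # fold index, actual class, and predicted values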
    
    

    Plots

    The following plots are generated automatically by ore.CV and saved in an automatically generated OUTPUT directory:

  • Regression: ore.CV produces plots of predicted vs. target values and per-fold boxplots of root mean square error (RMSE) and relative error (RERR). The example below is based on 5-fold cross-validation with the ore.lm regression model for Sepal.Length ~.-Species using the ore.frame IRIS dataset.
  • Classification: ore.CV outputs a multi-plot figure for classification metrics such as Precision, Recall, and F-measure. Each metric is captured per target class (side-by-side barplots) and fold (groups of barplots). The example below is based on the 5-fold CV of the ore.odmSVM classification model for Species ~. using the ore.frame IRIS dataset.
  • Text output
    For classification problems, the confusion tables for each fold are saved in an output file residing in the OUTPUT directory, together with a summary table displaying the precision, recall, and F-measure metrics for every fold and predicted class.
    file.show("OUTDIR/tbl_CV_classification_IRIS_Species_ore.odmSVM_nF5")
    
    Confusion table for fold 1 :           
                 setosa versicolor virginica
      setosa          9          0         0
      versicolor      0         12         1
      virginica       0          1         7
    Confusion table for fold 2 :            
                 setosa versicolor virginica
      setosa          9          0         0
      versicolor      0          8         1
      virginica       0          2        10
    Confusion table for fold 3 :           
                 setosa versicolor virginica
      setosa         11          0         0
      versicolor      0         10         2
      virginica       0          0         7
    Confusion table for fold 4 :            
                 setosa versicolor virginica
      setosa          9          0         0
      versicolor      0         10         0
      virginica       0          2         9
    Confusion table for fold 5 :            
                 setosa versicolor virginica
      setosa         12          0         0
      versicolor      0          5         1
      virginica       0          0        12
    Accuracy, Recall & F-measure table per {class,fold}
       fold      class TP  m  n Precision Recall F_meas
    1     1     setosa  9  9  9     1.000  1.000  1.000
    2     1 versicolor 12 13 13     0.923  0.923  0.923
    3     1  virginica  7  8  8     0.875  0.875  0.875
    4     2     setosa  9  9  9     1.000  1.000  1.000
    5     2 versicolor  8  9 10     0.889  0.800  0.842
    6     2  virginica 10 12 11     0.833  0.909  0.870
    7     3     setosa 11 11 11     1.000  1.000  1.000
    8     3 versicolor 10 12 10     0.833  1.000  0.909
    9     3  virginica  7  7  9     1.000  0.778  0.875
    10    4     setosa  9  9  9     1.000  1.000  1.000
    11    4 versicolor 10 10 12     1.000  0.833  0.909
    12    4  virginica  9 11  9     0.818  1.000  0.900
    13    5     setosa 12 12 12     1.000  1.000  1.000
    14    5 versicolor  5  6  5     0.833  1.000  0.909
    15    5  virginica 12 12 13     1.000  0.923  0.960
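
    To see where these numbers come from, each row of the summary table can be recomputed by hand from the corresponding confusion table. For example, for class versicolor in fold 1 (a quick sketch in plain R arithmetic):

    TP <- 12; m <- 13; n <- 13          # correct predictions, total predicted, total actual for versicolor in fold 1
    precision <- TP / m                  # 0.923
    recall    <- TP / n                  # 0.923
    F_meas    <- 2 * precision * recall / (precision + recall)   # 0.923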
    
              
    What's next
    Several extensions of ore.CV are possible involving sampling, parallel model training and testing, support for vanilla R classifiers, post-processing and output. More material for future posts.

    Thursday May 01, 2014

    "Darden uses analytics to understand customer restaurants"

    See the InformationWeek article Darden Uses Analytics To Understand Restaurant Customers highlighting Darden's use of Oracle Advanced Analytics:


    "Check-Level Analytics is one of the tools Darden plans to use to boost sales and customer loyalty, while pulling together data from across all its operations to show how they can work together better. ... it brought in some new tools, including Oracle Data Miner and Oracle R Enterprise, both included in the Oracle Advanced Analytics option to Oracle Database, to spot correlations and meaningful patterns in the data." 

    Darden's data analytics project earned them the 5th spot on this year's InformationWeek Elite 100 ranking.

    Sunday Apr 27, 2014

    Step-by-step: Returning R statistical results as a Database Table


    R provides a rich set of statistical functions that we may want to use directly from SQL. Many of these results can be readily expressed as structured table data for use with other SQL tables, or for use by SQL-enabled applications, e.g., dashboards or other statistical tools.

    In this blog post, we illustrate in a sequence of five simple steps how to go from an R function to a SQL-enabled result. Taken from a recent "proof of concept" customer engagement, our example involves the function princomp, which performs a principal components analysis on a given numeric data matrix and returns the results as an object of class princomp. The customer actively uses this R function to produce loadings used in subsequent computations and analysis. The loadings object is a matrix whose columns contain the eigenvectors.

    The current process of pulling data from their Oracle Database, starting an R  engine, invoking the R script, and placing the results back in the database was proving non-performant and unnecessarily complex. The goal was to leverage Oracle R Enterprise to streamline this process and allow the results to be immediately accessible
    through SQL.

    As a best practice, here is a process that can get you from start to finish:

    Step 1: Invoke from command line, understand results

    If you're using a particular R function, chances are you are familiar with the results it produces. However, you may not be familiar with the result's underlying structure. We'll use an example from the R princomp documentation that uses the USArrests data set. We see that the class of the result is of type princomp, and the model prints the call and standard deviations of the components. To understand the underlying structure, we invoke the function str and see there are seven elements in the list, one of which is the matrix loadings.

    mod <- princomp(USArrests, cor = TRUE)
    class(mod)
    mod
    str(mod)


    Results:

    R> mod <- princomp(USArrests, cor = TRUE)
    R> class(mod)
    [1] "princomp"
    R> mod
    Call:
    princomp(x = USArrests, cor = TRUE)

    Standard deviations:
       Comp.1    Comp.2    Comp.3    Comp.4
    1.5748783 0.9948694 0.5971291 0.4164494

    4 variables and 50 observations.

    R> str(mod)
    List of 7
    $ sdev : Named num [1:4] 1.575 0.995 0.597 0.416
    ..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ center : Named num [1:4] 7.79 170.76 65.54 21.23
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ scale : Named num [1:4] 4.31 82.5 14.33 9.27
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ n.obs : int 50
    $ scores : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:50] "1" "2" "3" "4" ...
    .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ call : language princomp(x = USArrests, cor = TRUE)
    - attr(*, "class")= chr "princomp"


    Step 2: Wrap script in a function, and invoke from ore.tableApply

    Since we want to invoke princomp on database data, we first push the demo data, USArrests, to the database to create an ore.frame. Other data we wish to use will also be in database tables.

    We'll use ore.tableApply (for the reasons cited in the previous blog post)  providing the ore.frame as the first argument and simply returning within our function the model produced by princomp. We'll then look at its class, retrieve the result from the database, and check its class and structure once again.

    Notice that we are able to obtain the exact same result we received using our local R engine as with the database R engine through embedded R execution.

    dat <- ore.push(USArrests)
    computePrincomp <- function(dat) princomp(dat, cor=TRUE)
    res <- ore.tableApply(dat, computePrincomp)


    class(res)
    res.local <- ore.pull(res)
    class(res.local)
    str(res.local)
    res.local
    res


    Results:

    R> dat <- ore.push(USArrests)
    R> computePrincomp <- function(dat) princomp(dat, cor=TRUE)
    R> res <- ore.tableApply(dat, computePrincomp)
    R> class(res)
    [1] "ore.object"
    attr(,"package")
    [1] "OREembed"
    R> res.local <- ore.pull(res)
    R> class(res.local)
    [1] "princomp"


    R> str(res.local)
    List of 7
    $ sdev : Named num [1:4] 1.575 0.995 0.597 0.416
    ..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ center : Named num [1:4] 7.79 170.76 65.54 21.23
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ scale : Named num [1:4] 4.31 82.5 14.33 9.27
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ n.obs : int 50
    $ scores : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:50] "1" "2" "3" "4" ...
    .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ call : language princomp(x = dat, cor = TRUE)
    - attr(*, "class")= chr "princomp"

    R> res.local
    Call:
    princomp(x = dat, cor = TRUE)

    Standard deviations:
       Comp.1    Comp.2    Comp.3    Comp.4
    1.5748783 0.9948694 0.5971291 0.4164494

    4 variables and 50 observations.
    R> res
    Call:
    princomp(x = dat, cor = TRUE)

    Standard deviations:
       Comp.1    Comp.2    Comp.3    Comp.4
    1.5748783 0.9948694 0.5971291 0.4164494


    4 variables and 50 observations.


    Step 3: Determine what results we really need

    Since we are only interested in the loadings and any result we return needs to be a data.frame to turn it into a database row set (table result), we build the model, transform the loadings object into a data.frame, and return the data.frame as the function result. We then view the class of the result and its values.

    Since we do this from the R API, we can simply print res to display the returned data.frame, as the print does an implicit ore.pull.

    returnLoadings <- function(dat) {
                        mod <- princomp(dat, cor=TRUE)
                        dd <- dim(mod$loadings)
                        ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
                        ldgs$variables <- row.names(ldgs)
                        ldgs
                      }
    res <- ore.tableApply(dat, returnLoadings)
    class(res)
    res

    ore.create(USArrests,table="USARRESTS")


    Results:

    R> res <- ore.tableApply(dat, returnLoadings)
    R> class(res)
    [1] "ore.object"
    attr(,"package")
    [1] "OREembed"
    R> res
                 Comp.1     Comp.2     Comp.3     Comp.4 variables
    Murder   -0.5358995  0.4181809 -0.3412327  0.64922780 Murder
    Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748 Assault
    UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773 UrbanPop
    Rape     -0.5434321 -0.1673186  0.8177779  0.08902432 Rape


    Step 4: Load script into the R Script Repository in the database

    We're at the point of being able to load the script into the R Script Repository before invoking it from SQL. We can create the function from R or from SQL. In R,

    ore.scriptCreate('princomp.loadings', returnLoadings)

    or from SQL,

    begin
    --sys.rqScriptDrop('princomp.loadings');
    sys.rqScriptCreate('princomp.loadings',
          'function(dat) {
            mod <- princomp(dat, cor=TRUE)
            dd <- dim(mod$loadings)
            ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
            ldgs$variables <- row.names(ldgs)
            ldgs
          }');
    end;
    /


    Step 5: invoke from SQL SELECT statement

    Finally, we're able to invoke the function from SQL using the rqTableEval table function. We pass in a cursor with the data from our USARRESTS table. We have no parameters, so the next argument is NULL. To get the results as a table, we specify a SELECT string that defines the structure of the result. Note that the column names must be identical to what is returned in the R data.frame. The last parameter is the name of the function we want to invoke from the R script repository.

    Invoking this, we see the result as a table from the SELECT statement.

    select *
    from table(rqTableEval( cursor(select * from USARRESTS),
                            NULL,
                           'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual',
                            'princomp.loadings'));


    Results:

    SQL> select *
    from table(rqTableEval( cursor(select * from USARRESTS),NULL,
              'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual','princomp.loadings'));
    2 3
        Comp.1     Comp.2     Comp.3     Comp.4  variables
    ---------- ---------- ---------- ---------- ------------
    -.53589947  .418180865 -.34123273  .649227804 Murder
    -.58318363  .187985604 -.26814843 -.74340748  Assault
    -.27819087 -.87280619  -.37801579  .133877731 UrbanPop
    -.54343209 -.16731864   .817777908 .089024323 Rape

    As you see above, we have the loadings result returned as a SQL table.


    In this example, we walked through the steps of moving from invoking an R function to obtain a specific result to producing that same result from SQL by invoking an R script at the database server under the control of Oracle Database.

    Friday Apr 25, 2014

    ore.doEval and ore.tableApply: Which one is right for me?


    When beginning to use Oracle R Enterprise, users quickly grasp the techniques and benefits of using embedded R to run scripts in database-side R engines, and gain a solid sense of the available functions for executing R scripts through Oracle Database. However, various embedded R functions are closely related, and a few tips can help in learning which functions are most appropriate for the problem you wish to solve. In this post, we'll demonstrate best practices for two of the non-parallel embedded R functions, ore.doEval and ore.tableApply.

    As with all embedded R functions, both ore.doEval and ore.tableApply invoke R scripts at the database server in an R engine. The difference is that ore.doEval does not take a data table as an input parameter as it's designed
    simply to execute the function provided.  In contrast, ore.tableApply accepts a data table (i.e., ore.frame) as input to be delivered to the embedded R function. The functions ore.doEval and ore.tableApply can be made equivalent simply by passing the name of the database table and pulling the table data within the ore.doEval function.   

    In the following examples, we show embedded R run times for ore.doEval and ore.tableApply using simple functions that build linear models to predict flight arrival delay based on distance traveled and departure delay.

    Model 1: Although ore.doEval does not explicitly accept data from a dedicated input argument, it's possible to retrieve data from the database using ore.sync
    and ore.pull within the function:

    R> system.time(mod1 <- ore.doEval(function(){
               ore.sync(table="ONTIME_S")
               dat <- ore.pull(ore.get("ONTIME_S"))
               lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)},
               ore.connect = TRUE))

        user  system elapsed
        0.008   0.000   4.941

    Model 2: Data can also be passed to a function in the R interface of embedded R execution, as shown here with ore.doEval, when connected to the database schema where the data table resides.

    R> system.time(mod2 <- ore.doEval(function(dat){
               lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)},
    dat = ONTIME_S))

       user  system elapsed
       3.196   0.128   9.476


    Model 3: The ore.tableApply function is designed to accept a database table, that is, an ore.frame, as the first input argument.

    R> system.time(mod3 <- ore.tableApply(ONTIME_S, function(dat){
                           lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)}))

       user  system elapsed
       0.001   0.000   3.870

    As the elapsed timings show, ore.tableApply (Model 3) is faster than both ore.doEval implementations (Model 1 and Model 2). Results may vary depending on data size and the operation being performed.  The ONTIME_S airline data used in these examples contains 220,000 rows and 26 columns, and the tests were executed on a Linux 5.8 server with 12 GB RAM and a single processor.

    In summary, ore.doEval takes a function parameter, but can be programmed to source data from Oracle Database or another external source. If your processing is driven by a database table, ore.tableApply is preferable because it's optimized for data transfer from Oracle Database to R. For both approaches, the data must be able to fit in the Database R Engine’s available memory.

    Unlike other embedded R functions, ore.doEval and ore.tableApply run serially by executing a single R process with the entire data in memory.  Other embedded R functions are enabled for parallel execution, and each has a distinct use case: row-wise "chunked" computations can be executed using ore.rowApply. The function ore.groupApply can be applied to grouped data for data sets that have natural partitioning. Lastly, the function ore.indexApply supports task-based execution, where one or more R engines perform the same or different calculations, or tasks, a specified number of times.
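
    As a brief illustration of the grouped alternative, ore.groupApply fits one model per partition of the data, with the partitions processed by parallel database-side R engines. The sketch below assumes ONTIME_S contains a DEST (destination airport) column; adjust the grouping column to your data:

    R> modList <- ore.groupApply(ONTIME_S, INDEX = ONTIME_S$DEST,
                      function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat),
                      parallel = TRUE)
    R> length(ore.pull(modList))   # one linear model per destination airport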

    In this post, we haven't yet addressed the functions rqEval and rqTableEval, the SQL equivalents of ore.doEval and ore.tableApply. One distinction between the R and SQL embedded R execution interfaces is that you can pass a data.frame or ore.frame as an argument in the R interface, as illustrated with ore.doEval above; the SQL interface, however, takes only scalar arguments as input, as with the rqEval function. The rqTableEval function accepts a full table using a dedicated input argument, and this input may be a cursor or the result of a query.


    Wednesday Apr 16, 2014

    Oracle's Strategy for Advanced Analytics

    At Oracle our goal is to enable you to get timely insight from all of your data. We continuously enhance Oracle Database to allow workloads that have traditionally required extracting data from the database to run in-place. We do this to narrow the gap that exists between insights that can be obtained and available data - because any data movement introduces latencies, complexity due to more moving parts, the ensuing need for data reconciliation and governance, as well as increased cost. The Oracle tool set considers the needs of all types of enterprise users - users preferring GUI based access to analytics with smart defaults and heuristics out of the box, users choosing to work interactively and quantitatively with data using R, and users preferring SQL and focusing on operationalization of models.

    Oracle recognized the need to support data analysts, statisticians, and data scientists with a widely used and rapidly growing statistical programming language. Oracle chose R - recognizing it as the new de facto standard for computational statistics and advanced analytics. Oracle supports R in at least 3 ways:


    • R as the language of interaction with the database

    • R as the language in which analytics can be written and executed in the database as a high performance computing platform

    • R as the language in which several native high performance analytics have been written that execute in database


    Additionally, of course, you may choose to use any of the CRAN algorithms in R scripts executed at the database server, leveraging several forms of data parallelism.

    Providing the first and only supported commercial distribution of R from an established company, Oracle released Oracle R Distribution. In 2012, Oracle embarked on the Hadoop journey, acknowledging the alternative data management options emerging in open source for management of unstructured or not-yet-structured data. In keeping with our strategy of delivering analytics close to where data is stored, Oracle extended Advanced Analytics capabilities to execute on HDFS-resident data in Hadoop environments. R has been integrated into Hadoop in exactly the same manner as it has been with the database.

    Realizing that data is stored in both database and non-database environment, Oracle provides users options for storing their data (in Oracle Database, HDFS, and Spark RDD), where to perform computations (in-database or the Hadoop cluster), and where results should be stored (Oracle Database or HDFS). Users can write R scripts that can be leveraged across database and Hadoop environments. Oracle Database, as a preferred location for storing R scripts, data, and result objects, provides a real-time scoring and deployment platform. It is also easy to create a model factory environment with authorization, roles, and privileges, combined with auditing, backup, recovery, and security.

    Oracle provides a common infrastructure that supports both in-database and custom R algorithms. Oracle also provides an integrated GUI for business users. Oracle provides both R-based access and GUI-based access to in-database analytics. A major part of Oracle's strategy is to maintain agility in our portfolio of supported techniques - being responsive to customer needs.

    Thursday Mar 27, 2014

    Why choose Oracle for Advanced Analytics?

    If you're an enterprise company, chances are you have your data in an Oracle database. You chose Oracle for its global reputation for providing the best software products (and now engineered systems) to support your organization. Oracle Database is known for stellar performance and scalability, and Oracle delivers world-class support.

    If your data is already in Oracle Database or moving in that direction, leverage the high performance computing environment of the database to analyze your data. Traditionally, it was common practice to move data to separate analytic servers for the explicit purpose of model building. This is no longer necessary, nor is it scalable as your organization seeks to deliver value from Big Data. Oracle Database now has several state-of-the-art algorithms that execute in a parallel and distributed architecture directly in-database, augmented by custom algorithms in the R statistical programming language. Leveraging Oracle Database for Advanced Analytics has benefits including:


    • Eliminates data movement to analytic servers

    • Enables analysis of all data not just samples

    • Puts your database infrastructure to even greater use

    • Eliminates impedance mismatch in the form of model translation when operationalizing models

    • All aspects of modeling and deployment are optionally available via SQL, making integration with other IT software straightforward

    • Leverages CRAN algorithms directly in the database

    Customers such as Stubhub, dunnhumby, CERN OpenLab, Financiera Uno, Turkcell, and others leverage Oracle Advanced Analytics to scale their applications, simplify their analytics architecture, and reduce time to market of predictive models from weeks to hours or even minutes.

    Oracle leverages its own advanced analytics products, for example, by using Oracle Advanced Analytics in a wide range of Oracle Applications and internal deployments, including:


    • Human Capital Management with Predictive Workforce to produce employee turnover and performance predictions, and "what if" analysis

    • Customer Relationship Management with Sales Prediction Engine to predict sales opportunities, what to sell, how much, and when

    • Supply Chain Management with Spend Classification to flag non-compliance or anomalies in expense submissions

    • Retail Analytics with Oracle Retail Customer Analytics to perform shopping cart analysis and next best offers

    • Oracle Financial Services Analytic Applications to enable quantitative analysts in credit risk management divisions to author rules/models directly in R


    Oracle wants you to be successful with advanced analytics. Working closely with customers to integrate Oracle Advanced Analytics as an integral process of their analytics strategy, customers are able to put their advanced analytics into production much faster.

    Thursday Mar 20, 2014

    ROracle 1-1.11 released - binaries for Windows and other platforms available on OTN


    We are pleased to announce the latest update of the open source ROracle package, version 1-1.11, with enhancements and bug fixes. ROracle provides high performance and scalable interaction from R with Oracle Database. In addition to availability on CRAN, ROracle binaries specific to Windows and other platforms can be downloaded from the Oracle Technology Network. Users of ROracle, please take our brief survey. We want to hear from you!

    Latest enhancements in version 1-1.11 of ROracle:

    • Performance enhancements for RAW data types and large result sets
    • Ability to cache the result set in memory to reduce memory consumption on successive reads
    • Added session mode to connect as SYSDBA or using external authentication
    • Bug 17383542: Enhanced dbWriteTable() & dbRemoveTable() to work on global schema

    Users of ROracle are quite pleased with the performance and functionality:


    "In my position as a quantitative researcher, I regularly analyze database data up to a gigabyte in size on client-side R engines. I switched to ROracle from RJDBC because the performance of ROracle is vastly superior, especially when writing large tables. I've also come to depend on ROracle for transactional support, pulling data to my R client, and general scalability. I have been very satisfied with the support from Oracle -- their response has been prompt, friendly and knowledgeable."

               -- Antonio Daggett, Quantitative Researcher in Finance Industry


    "Having used ROracle for over a year now with our Oracle Database data, I've come to rely on ROracle for high performance read/write of large data sets (greater than 100 GB), and SQL execution with transactional support for building predictive models in R. We tried RODBC but found ROracle to be faster, much more stable, and scalable."

               -- Dr. Robert Musk, Senior Forest Biometrician, Forestry Tasmania


    See the ROracle NEWS for the complete list of updates.

    We encourage ROracle users to post questions and provide feedback on the Oracle R Technology Forum.

    In addition to being a high performance database interface to Oracle Database from R for general use, ROracle supports database access for Oracle R Enterprise.

    Monday Mar 17, 2014

    Oracle R Enterprise Upgrade Steps

    We've recently announced that Oracle R Enterprise 1.4 is available on all platforms. To upgrade Oracle R Enterprise to the latest version:

      1. *Install the version of R that is required for the new version of Oracle R Enterprise.
             See the Oracle R Enterprise supported platforms matrix for the latest requirements.
      2. Update Oracle R Enterprise Server on the database server by running the install.sh script
             and follow the prompts for the upgrade path.
      3. Update the Oracle R Enterprise Supporting packages on the database server.
      4. Update the Oracle R Enterprise Client and Supporting packages on the client.

    For RAC/Exadata installations, upgrade items 1, 2 and 3 must be performed on all compute nodes. 

    *If you've changed the R installation directory between releases, manually update the location of the R_HOME directory in the Oracle R Enterprise configuration table.  The sys.rqconfigset procedure edits settings in a configuration table called sys.rq_config. Use of this function requires the sys privilege. You can view the contents of this table to verify various environment settings for Oracle R Enterprise. Among the settings stored in sys.rq_config is the R installation directory, or R_HOME. The following query shows sample values stored in sys.rq_config for a Linux server:

    SQL> select * from sys.rq_config;
    NAME        VALUE
    ------------------------------------------------------------------
    R_HOME       /usr/lib64/R
    MIN_VSIZE    32M
    MAX_VSIZE    4G
    R_LIBS_USER  /u01/app/oracle/product/12.0.1/dbhome_1/R/library
    VERSION      1.4
    MIN_NSIZE    2M
    MAX_NSIZE    20M

    7 rows selected.

    To point to the correct R_HOME:

    SQL> exec sys.rqconfigset('R_HOME', '<path to current R installation directory>')

    All Oracle R Enterprise downloads are available on the Oracle Technology Network. Refer to the instructions in section 8.3 of the Oracle R Enterprise Installations Guide for detailed steps on upgrading Oracle R Enterprise, and don't hesitate to post questions to the Oracle R forum.



    Saturday Mar 15, 2014

    Oracle R Enterprise 1.4 Released

    We’re pleased to announce that Oracle R Enterprise (ORE) 1.4 is now available for download on all supported platforms. In addition to numerous bug fixes, ORE 1.4 introduces an enhanced high performance computing infrastructure, new and enhanced parallel distributed predictive algorithms for both scalability and performance, added support for production deployment, and compatibility with the latest R versions.  These updates enable IT administrators to easily migrate the ORE database schema to speed production deployment, and statisticians and analysts have access to a larger set of analytics techniques for more powerful predictive models.

    Here are the highlights for the new and upgraded features in ORE 1.4:

    Upgraded R version compatibility


    ORE 1.4 is certified with R-3.0.1 - both open source R and Oracle R Distribution. See the server support matrix for the complete list of supported R versions. R-3.0.1 brings improved performance and big-vector support to R, and compatibility with more than 5000 community-contributed R packages.

    High Performance Computing Enhancements

    • Ability to specify the degree of parallelism (DOP) for parallel-enabled functions (ore.groupApply, ore.rowApply, and ore.indexApply)
    • An additional global option, ore.parallel, to set the number of parallel threads used in embedded R execution

    Data Transformations and Analytics

    • ore.neural now provides a highly flexible network architecture with a wide range of activation functions, supporting 1000s of formula-derived columns, in addition to being a parallel and distributed implementation capable of supporting billion-row data sets
    • ore.glm now also prevents selection of less optimal coefficient methods with parallel distributed in-database execution
    • Support for weights in regression models
    • New ore.esm enables time series analysis, supporting both simple and double exponential smoothing for scalable in-database execution
    • Execute standard R functions for Principal Component Analysis (princomp), ANOVA (anova), and factor analysis (factanal) on database data

    Oracle Data Mining Model Algorithm Functions

    Newly exposed in-database Oracle Data Mining algorithms:

    • ore.odmAssocRules function for building Oracle Data Mining association models using the apriori algorithm
    • ore.odmNMF function for building Oracle Data Mining feature extraction models using the Non-Negative Matrix Factorization (NMF) algorithm
    • ore.odmOC function for building Oracle Data Mining clustering models using the Orthogonal Partitioning Cluster (O-Cluster) algorithm

    Production Deployment

    • New migration utility eases production deployment from development environments
    • "Snapshotting" of production environments for debugging in test systems

    For a complete list of new features, see the Oracle R Enterprise User's Guide. To learn more about Oracle R Enterprise, check out the white paper entitled, "Bringing R to the Enterprise -  A Familiar R Environment with Enterprise-Caliber Performance, Scalability, and Security.", visit Oracle R Enterprise on Oracle's Technology Network, or review the variety of use cases on the Oracle R blog.

    Monday Mar 10, 2014

    Oracle R Distribution 3.0.1 Benchmarks

    Oracle R Distribution, Oracle's distribution of open source R, improves performance by dynamically linking to optimized, multi-threaded BLAS libraries. Unlike open source R, Oracle R Distribution uses all available cores and processors when dynamically linked against an optimized BLAS, resulting in increased performance. Thus, the more cores available to Oracle R Distribution, the higher the performance for many operations.

    How is this possible? Standard R's internal BLAS library was created when multi-core machines were not widely used, so it is single-threaded, i.e., it operates on a single core. However, the BLAS API in R allows linking to different, multi-threaded BLAS libraries that allow linear algebra computations to use all cores and therefore run much faster. Oracle R Distribution simplifies the linking process by loading the high performance math library after it's added to PATH or LD_LIBRARY_PATH, depending on the operating system. Then you are set to use optimized math libraries: the Intel Math Kernel Library (MKL), AMD Core Math Library (ACML), or the Sun Performance Library on Solaris.

    The benchmarks in this section demonstrate the performance of Oracle R Distribution 3.0.1 with and without dynamically loaded MKL. The R-25 benchmark script developed by the R community consists of fifteen tests. They are split into three groups (matrix calculation, matrix functions and "programmation") with trimmed means for each group, and each test is run three times. For this comparison, we report the mean for the three test runs. The benchmarks show that using Oracle R Distribution with dynamically loaded MKL libraries on a 30-core machine is significantly faster than the single core time.
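
    A quick way to see the effect on your own hardware, without running the full R-25 suite, is to time a large matrix product before and after linking the optimized BLAS (a minimal sketch; sizes and timings will vary):

    R> set.seed(1)
    R> a <- matrix(rnorm(4000 * 4000), nrow = 4000)
    R> system.time(b <- crossprod(a))   # elapsed time drops sharply once a multi-threaded BLAS is loaded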


    [Chart: R-25 benchmark results for Oracle R Distribution 3.0.1, with and without dynamically loaded MKL. This benchmark was executed on a 3-node cluster, with 24 cores at 3.07GHz per CPU and 47 GB RAM, using Linux 5.5.]


    In-Database Scalability and Parallelism with Oracle R Enterprise


    Oracle R Enterprise, the set of big data analytics R packages provided by Oracle, provides scalable, parallel in-database data manipulation and algorithms for analyzing very large data sets. Oracle R Enterprise functions implement parallel, out-of-core algorithms that overcome R's limitations of being memory-bound and single-threaded by executing requested R calculations on data in Oracle Database, using the database itself as the computational engine. Oracle R Enterprise allows users to further leverage Oracle's engineered systems, like Exadata, Big Data Appliance, and Exalytics, for enterprise-wide analytics, as well as reporting tools like Oracle Business Intelligence Enterprise Edition dashboards and BI Publisher documents. The combination of Oracle Database and R delivers an enterprise-ready, integrated environment for advanced analytics.

    At the time of this post, Oracle R Distribution 3.0.1 is certified with Oracle R Enterprise 1.4.  See the Install Guide's Oracle R Enterprise support matrix  for a list of current Oracle R Distribution supported configurations and platforms, and this link for instructions on enabling high performance library support for Oracle R Distribution on a Windows or Linux client.

    Tuesday Feb 18, 2014

    Low-Rank Matrix Factorization in Oracle R Advanced Analytics for Hadoop

    This guest post from Arun Kumar, a graduate student in the Department of Computer Sciences at the University of Wisconsin-Madison, describes work done during his internship in the Oracle Advanced Analytics group.

    Oracle R Advanced Analytics For Hadoop (ORAAH), a component of Oracle's Big Data Connectors software suite, is a collection of statistical and predictive techniques implemented on Hadoop infrastructure. In this post, we introduce and explain techniques for a popular machine learning task with diverse applications, ranging from predicting ratings in recommendation systems to feature extraction in text mining: matrix completion and factorization. Training, scoring, and prediction phases for matrix completion and factorization are available in ORAAH. The models generated can also be transparently loaded into R for ad hoc inspection. In this blog post, we describe implementation specifics of these two techniques as available in ORAAH.

    Motivation

    Consider an e-commerce company that displays products to potential customers on its webpage and collects data about views, purchases, ratings (e.g., 1 to 5 stars), etc. Increasingly, such online retailers are using machine learning techniques to predict in advance which products a customer is likely to rate highly and recommend such products to the customers in the hope that they might purchase them. Users build a statistical model based on the past history of ratings by all customers on all products. One popular model to generate predictions from such a hyper-sparse matrix is the latent factor model, also known as the low-rank matrix factorization model (LMF).

    The setup is the following – we are given a large dataset of past ratings (potentially in the billions), say, with the schema (Customer ID, Product ID, Rating). Here, Customer ID refers to a distinct customer, Product ID refers to a distinct product, and Rating refers to a rating value, e.g., 1 to 5. Conceptually, this dataset represents a large matrix D with m rows (number of customers) and n columns (number of products), where the entries are the available ratings. Notice that this matrix is likely to be extremely sparse, i.e., many ratings could be missing since most customers typically rate only a few products. Thus, the task here is matrix completion – we need to predict the missing ratings so that it can be used for downstream processing such as displaying the top recommendations for each customer.

    The LMF model assumes that the ratings matrix can be approximately generated as a product of two factor matrices, L and R, which are much smaller than D (lower rank). The idea is that the product L * R will approximately reconstruct the existing ratings and also automatically predict the missing ratings in D. More precisely, for each available rating (i,j,v) in D, we have (L x R) [i,j] ≈ v, while for each missing rating (i',j') in D, the predicted rating is (L x R) [i',j']. The model has a parameter r, which dictates the rank of the factor matrices, i.e., L is m x r, while R is r x n.
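
    As a toy illustration of this idea in plain R (separate from ORAAH), the snippet below builds small factor matrices with the dimensions described above and reads a predicted rating off their product; all values are made up for illustration:

    set.seed(1)
    m <- 4; n <- 5; r <- 2                 # 4 customers, 5 products, rank 2
    L <- matrix(runif(m * r), nrow = m)    # m x r customer factor matrix
    R <- matrix(runif(r * n), nrow = r)    # r x n product factor matrix
    D.approx <- L %*% R                    # (L x R)[i,j] approximates or predicts rating (i,j)
    D.approx[2, 3]                         # predicted rating of customer 2 for product 3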

    Matrix Completion in ORAAH

    LMF can be invoked out-of-the-box using the routine orch.lmf. An execution based on the above example is shown below. The dataset of ratings is in a CSV file on HDFS with the schema above (named “retail_ratings” here).


    input <- hdfs.attach("retail_ratings")
    fit <- orch.lmf(input)

    # Export the model into R memory
    lr <- orch.export.fit(fit)

    # rank (r) of the factor matrices; must match the rank used when fitting the
    # model (the value below is an assumption for illustration)
    rank <- 10

    # Compute the prediction for the point (100, 50)

    # First column of lr$L contains the userid
    userid <- lr$L[,1] == 100 # find row corresponding to user id 100
    L <- lr$L[, 2:(rank+1)]

    # First column of lr$R contains the itemid
    itemid <- lr$R[,1] == 50 # find row corresponding to item id 50
    R <- lr$R[, 2:(rank+1)]

    # dot product as sum of terms obtained through component-wise multiplication
    pred <- sum(L[userid,] * R[itemid,])

    The factor matrices can be transparently loaded into R for further inspection and for ad-hoc predictions of specific customer ratings using R. The algorithm we use for training the LMF model is called Incremental Gradient Descent (IGD), which has been shown to be one of the fastest algorithms for this task [1, 2].

    The entire set of arguments for the function orch.lmf along with a brief description of each and their default values is given in the table below. The latin parameter configures the degree of parallelism for executing IGD for LMF on Hadoop [2]. ORAAH sets this automatically based on the dimensions of the problem and the memory available to each Mapper. Each Mapper fits its partition of the model in memory, and the multiple partitions run in parallel to learn different parts of the model. The last five parameters configure IGD and need to be tuned by the user to a given dataset since they can impact the quality of the model obtained.

    ORAAH also provides routines for predicting ratings as well as for evaluating the model (computing the error of the model on a given labeled dataset) on a large scale over HDFS-resident datasets. The routine for prediction of ratings is predict, and for evaluating is orch.evaluate. Use help(orch.lmf) for online documentation, and demo(orch_lmf_jellyfish) for a fully working example including model fit, evaluation, and prediction.

    Other Matrix Factorization Tasks

    While LMF is primarily used for matrix completion tasks, it can also be used for other matrix factorization tasks that arise in text mining, computer vision, and bio-informatics, e.g., dimension reduction and feature extraction. In these applications, the input data matrix need not necessarily be sparse. Although many zeros might be present, they are not treated as missing values. The goal here is simply to obtain a low-rank factorization D ≈ L x R as accurately as possible, i.e., the product L x R should recover all entries in D, including the zeros. Typically, such applications use a Non-Negative Matrix Factorization (NMF) approach due to non-negativity constraints on the factor matrix entries. However, many of these applications often do not need non-negativity in the factor matrices. Using NMF algorithms for such applications leads to poorer-quality solutions. Our implementation of matrix factorization for such NMF-style tasks can be invoked out-of-the-box in ORAAH using the routine orch.nmf, which has the same set of arguments as LMF.
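
    For instance, assuming the same HDFS-attached input as in the matrix completion example above, the invocation is analogous (a sketch only; per the description above, orch.nmf accepts the same arguments as orch.lmf):

    nmf.fit <- orch.nmf(input)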

    Experimental Results & Comparison with Apache Mahout

    We now present an empirical evaluation of the performance, quality, and scalability of the ORAAH LMF tool based on IGD and compare it to the most widely used off-the-shelf tool for LMF on Hadoop – an implementation of the ALS algorithm from Apache Mahout [3].

    All our experiments are run on an Oracle Big Data Appliance Hadoop cluster with nine nodes, each with Intel Xeon X5675 12-core 3.07GHz processors, 48 GB RAM, and 20 TB disk. We use 256MB HDFS blocks and 10 reducers for MapReduce jobs.

    We use two standard public datasets for recommendation tasks – MovieLens10M (referred to as MLens) and Netflix – for the performance and quality comparisons. To study scalability aspects, we use several synthetic datasets of different sizes, created by changing the number of rows, number of columns, and/or number of ratings. The table below presents the data set statistics.


    Results: Performance and Quality

    We first present an end-to-end overview of the performance and quality achieved by our implementation and Mahout on MLens and Netflix. The rank parameter was set to 50 (a typical choice for such tasks) and the other parameters for both tools were chosen using a grid search. The quality of the factor matrices was determined using the standard measure of root mean square error (RMSE) [2]. We use a 70%-15%-15% holdout split of the datasets, i.e., 70% for training, 15% for testing, and 15% for validation of generalization error. The training was performed until 0.1% convergence, i.e., until the fractional decrease in the training RMSE after every iteration reached 0.1%. The table below presents the results.

    1. ORAAH LMF is faster than Mahout LMF in overall training runtime on both datasets – 1.8x faster on MLens and 2.3x faster on Netflix.
    2. The per-iteration runtime of ORAAH LMF is much lower than that of Mahout LMF – between 4.4x and 5.4x.
    3. Although ORAAH LMF runs more iterations than Mahout LMF, the large difference in per-iteration runtimes makes the overall runtime smaller for ORAAH LMF.
    4. The training quality (training RMSE) achieved is comparable across both tools on both datasets. Similarly, the generalization quality is also comparable. Thus, ORAAH LMF offers state-of-the-art quality along with faster performance.

    Results: Scalability

    The ability to scale along all possible dimensions of the data is key to big data analytics. Both ORAAH LMF and Mahout LMF are able to scale to billions of ratings by parallelizing and distributing computations on Hadoop. But we now show that, unlike Mahout LMF, ORAAH LMF is also able to scale to hundreds of millions of customers (m) and products (n), and scales well with the rank parameter (r, which affects the size of the factor matrices). The figure below presents the scalability results along these three dimensions – m, n, and r.

    1. Figures (A) and (B) plot the results for the Syn-row and Syn-col datasets, respectively (r = 2). ORAAH LMF scales linearly with both number of rows (m) and number of columns (n), while Mahout LMF does not show up on either plot because it crashes at all these values of m. In fact, we verified that Mahout LMF does not scale beyond even m = 20 M! The situation is similar with n. This is because Mahout LMF assumes that the factor matrices L and R fit entirely in the memory of each Mapper. In contrast, ORAAH LMF uses a clever partitioning scheme on all matrices ([2]) and can thus scale seamlessly on all dataset dimensions.
    2. Figure (C) shows the impact of the rank parameter r. ORAAH LMF scales linearly with r and the per-iteration runtime roughly doubles between r = 20 and r = 100. However, the per-iteration runtime of Mahout LMF varies quadratically with r, and in fact, increases by a factor of 40x between r = 20 and r = 100! Thus, ORAAH LMF is also able to scale better with r.
    3. Finally, on the tera-scale dataset Syn-tera with 1 billion rows, 10 million columns, and 20 billion ratings, ORAAH LMF (for r = 2) finishes an iteration in just under 2 hours!

    Acknowledgements

    The matrix factorization features in ORAAH were implemented and benchmarked by Arun Kumar during his summer internship at Oracle under the guidance of Vaishnavi Sashikanth. He is pursuing his PhD in computer science from the University of Wisconsin-Madison. This work is the result of a collaboration between Oracle and the research group of Dr. Christopher Ré, who is now at Stanford University. Anand Srinivasan helped integrate these features into ORAAH.

    References

    [1] Towards a Unified Architecture for in-RDBMS Analytics. Xixuan Feng, Arun Kumar, Benjamin Recht, and Christopher Ré. ACM SIGMOD 2012.

    [2] Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion. Benjamin Recht and Christopher Ré. Mathematical Programming Computation 2013.

    [3] Apache Mahout. http://mahout.apache.org/.

    Thursday Feb 13, 2014

    Monitoring progress of embedded R functions

    When you run R functions in the database, especially functions involving multiple R engines in parallel, you can monitor their progress using the Oracle R Enterprise datastore as a central location for progress notifications, or any intermediate status or results. In the following example, based on ore.groupApply, we illustrate instrumenting a simple function that builds a linear model to predict flight arrival delay based on a few other variables.

    The function modelBuildWithStatus first eliminates incomplete cases from the data supplied in argument dat and then verifies that rows remain for building the model. If the data is not empty, the function builds a model and reports “success”; otherwise, it reports “no data.” In practice, the user would likely use this model in some way or save it in a datastore for future use, but for this example, we just build the model and discard it, validating that a model can be built on the data.


    modelBuildWithStatus <-
      function(dat) {
        dat <- dat[complete.cases(dat),]
        if (nrow(dat)>0L) {
          mod <- lm(ARRDELAY ~ DISTANCE + AIRTIME + DEPDELAY, dat)
          "success"
        } else
          "no_data"
      }

    When we invoke this using ore.groupApply, the goal is to build one model per “unique carrier” or airline. Using an ORE 1.4 feature, we specify the degree of parallelism using the parallel argument, setting it to 2.


    res <- ore.groupApply(ONTIME_S[, c("UNIQUECARRIER","DISTANCE", "ARRDELAY", "DEPDELAY", "AIRTIME")],
            ONTIME_S$UNIQUECARRIER,
            modelBuildWithStatus,
            parallel=2L)

    res.local<-ore.pull(res)
    res.local[unlist(res.local)=="no_data"]

    The result tells us about the status of each execution. Below, we print the unique carriers that had no data.


    R> res.local<-ore.pull(res)
    R> res.local[unlist(res.local)=="no_data"]
    $EA
    [1] "no_data"

    $`ML(1)`
    [1] "no_data"

    $`PA(1)`
    [1] "no_data"

    $PI
    [1] "no_data"

    $PS
    [1] "no_data"

    To monitor the progress of each execution, we can identify the group of data being processed in each function invocation using the value from the UNIQUECARRIER column. For this particular data set, we append the first two characters of the carrier’s symbol to “group.” to form a unique object name for storing in the datastore identified by job.name. (If we don’t do this, some carrier values would form invalid object names.) Note that since the UNIQUECARRIER column contains a single value within each group, we need only the first value.

    The general idea for monitoring progress is to save an object in the datastore named for each execution of the function on a group. We can then list the contents of the named datastore and compute a percentage complete, which is discussed later in this post. For the “success” case, we assign the value “SUCCESS” to the variable named by the string in nm that we created earlier. Using ore.save, this uniquely named object is stored in the datastore with the name in job.name. We use the append=TRUE flag to indicate that the various function executions will be sharing the same named datastore.
    If there is no data left in dat, we assign “NO DATA” to the variable named in nm and save that. Notice in both cases, we’re still returning “success” or “no data” so these come back in the list returned by ore.groupApply. However, we can return other values instead, e.g., the model produced.


    modelBuildWithMonitoring <-
      function(dat, job.name) {
      nm <- paste("group.", substr(as.character(dat$UNIQUECARRIER[1L]),1,2), sep="")
      dat <- dat[complete.cases(dat),]
      if (nrow(dat)>0L) {
        mod <- lm(ARRDELAY ~ DISTANCE + AIRTIME + DEPDELAY, dat);
        assign(nm, "SUCCESS")
        ore.save(list=nm, name=job.name, append=TRUE)
        "success"
      } else {
        assign(nm, "NO DATA")
        ore.save(list=nm, name=job.name, append=TRUE)
        "no data"
      }
    }

    When we use this function in ore.groupApply, we provide the job.name and ore.connect arguments as well. The argument ore.connect must be set to TRUE in order to use the datastore. As ore.groupApply executes, the datastore named by job.name accumulates objects, one named for each carrier. First, delete the datastore named “job1”, if it exists.


    ore.delete(name="job1")

    res <- ore.groupApply(ONTIME_S[, c("UNIQUECARRIER","DISTANCE", "ARRDELAY", "DEPDELAY", "AIRTIME")],
            ONTIME_S$UNIQUECARRIER,
            modelBuildWithMonitoring,
            job.name="job1", parallel=2L, ore.connect=TRUE)

    To see the progress during execution, we can use the following function, which takes a job name and the cardinality of the INDEX column to determine the percent complete. This function is invoked in a separate R engine connected to the same schema. If the job name is found, we print the percent complete, otherwise stop with an error message.


    check.progress <- function(job.name, total.groups) {
      if ( job.name %in% ore.datastore()$datastore.name )
        print(sprintf("%.1f%%", nrow(ore.datastoreSummary(name=job.name))/total.groups*100L))
      else
        stop(paste("Job", job.name, " does not exist"))
    }

    To invoke this, compute the total number of groups and provide this and the job name to the function check.progress.


    total.groups <- length(unique(ONTIME_S$UNIQUECARRIER))
    check.progress("job1",total.groups)

    However, we really want a loop to report on the progress automatically. One simple approach is to set up a while loop with a sleep delay. When we reach 100%, stop. To be self-contained, we include a simplification of the function above as a local function.


    check.progress.loop <- function(job.name, total.groups, sleep.time=2) {
      check.progress <- function(job.name, total.groups) {
        if ( job.name %in% ore.datastore()$datastore.name )
          print(sprintf("%.1f%%", nrow(ore.datastoreSummary(name=job.name))/total.groups*100L))
        else
          paste("Job", job.name, " does not exist")
      }
      while(1) {
        try(x <- check.progress(job.name,total.groups))
        Sys.sleep(sleep.time)
        if(x=="100.0%") break
      }
    }

    As before, this function is invoked in a separate R engine connected to the same schema.


    check.progress.loop("job1",total.groups)

    Looking at the results, we can see the progress reported at one second intervals. Since the models build quickly, it doesn’t take long to reach 100%. For functions that take longer to execute or where there are more groups to process, you may choose a longer sleep time. Following this, we look at the datastore “job1” using ore.datastore and its contents using ore.datastoreSummary.


    R> check.progress.loop("job1",total.groups,sleep.time=1)
    [1] "6.9%"
    [1] "96.6%"
    [1] "100.0%"

    R> ore.datastore(name="job1")
      datastore.name object.count size       creation.date description
    1           job1           29 1073 2014-02-13 22:03:20
    R> ore.datastoreSummary(name="job1")
       object.name     class size length row.count col.count
     1    group.9E character   37      1        NA        NA
     2    group.AA character   37      1        NA        NA
     3    group.AQ character   37      1        NA        NA
     4    group.AS character   37      1        NA        NA
     5    group.B6 character   37      1        NA        NA
     6    group.CO character   37      1        NA        NA
     7    group.DH character   37      1        NA        NA
     8    group.DL character   37      1        NA        NA
     9    group.EA character   37      1        NA        NA
    10    group.EV character   37      1        NA        NA
    11    group.F9 character   37      1        NA        NA
    12    group.FL character   37      1        NA        NA
    13    group.HA character   37      1        NA        NA
    14    group.HP character   37      1        NA        NA
    15    group.ML character   37      1        NA        NA
    16    group.MQ character   37      1        NA        NA
    17    group.NW character   37      1        NA        NA
    18    group.OH character   37      1        NA        NA
    19    group.OO character   37      1        NA        NA
    20    group.PA character   37      1        NA        NA
    21    group.PI character   37      1        NA        NA
    22    group.PS character   37      1        NA        NA
    23    group.TW character   37      1        NA        NA
    24    group.TZ character   37      1        NA        NA
    25    group.UA character   37      1        NA        NA
    26    group.US character   37      1        NA        NA
    27    group.WN character   37      1        NA        NA
    28    group.XE character   37      1        NA        NA
    29    group.YV character   37      1        NA        NA

    The same basic technique can be used to note progress in any long running or complex embedded R function, e.g., in ore.tableApply or ore.doEval. At various points in the function, sequence-named objects can be added to a datastore. Moreover, the contents of those objects can contain incremental or partial results, or even debug output.
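
    For instance, a long-running ore.doEval function could checkpoint its progress in the same way. The sketch below assumes a connected session; the datastore name "job2" and the step objects are hypothetical names chosen for illustration:

    res <- ore.doEval(function(job.name) {
             step.1 <- "data prepared"               # could instead hold partial results or debug output
             ore.save(step.1, name=job.name, append=TRUE)
             # ... long-running work ...
             step.2 <- "model built"
             ore.save(step.2, name=job.name, append=TRUE)
             "done"
           },
           job.name="job2", ore.connect=TRUE)

    # From a separate R engine connected to the same schema:
    ore.datastoreSummary(name="job2")   # lists step.1, step.2, ... as they are added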

    While we’ve focused on the R API for embedded R execution, the same functions could be invoked using the SQL API. However, monitoring would still be done from an interactive R engine.

    Tuesday Feb 04, 2014

    Invoking R scripts via Oracle Database: Theme and Variation, Part 6

    How can I use "group apply" to partition data over multiple columns for parallel execution?
    How can I use R for statistical computations and return results as a database table?

    In this blog post of our theme and variation series, we answer these two questions through several examples, highlighting both R and SQL interfaces.

    So far in this blog series on Oracle R Enterprise embedded R execution we've covered:

    Part 1: ore.doEval / rqEval
    Part 2: ore.tableApply / rqTableEval
    Part 3: ore.groupApply / “rqGroupApply”
    Part 4: ore.rowApply / rqRowEval
    Part 5: ore.indexApply

    Using ore.groupApply for partitioning data on multiple columns

    While the “group apply” functionality is quite powerful as it is, users sometimes want to partition data on multiple columns. Since ore.groupApply currently takes only a single column for the INDEX argument, users can create a new column that is the concatenation of the columns of interest, and provide this column to the INDEX argument. We’ll illustrate this first using the R API, and then the SQL API.

    R API

    We adapt an example from Part 3 to illustrate partitioning data on multiple columns. Instead of building a C5.0 model, we’ll use the same CHURN_TRAIN data set, but build an rpart model since it will produce rules on the partitions of data we’ve chosen for the example, namely, voice_mail_plan and international_plan. To understand the number of rows we can expect in each partition, we’ll use the R table function. We then add a new column that pastes together the two columns of interest to create a new column called “vmp_ip”.


    library(C50)
    data(churn)

    ore.create(churnTrain, "CHURN_TRAIN")

    table(CHURN_TRAIN$international_plan, CHURN_TRAIN$voice_mail_plan)
    CT <- CHURN_TRAIN
    CT$vmp_ip <- paste(CT$voice_mail_plan,CT$international_plan,sep="-")
    head(CT)

    Each invocation of the function “my.rpartFunction” will receive data from one of the partitions identified in vmp_ip. Since the source partition columns are constant within each partition, we remove them from the data by setting them to NULL. The character vectors are converted to factors, and the model is built to predict churn and saved in an appropriately named datastore. Instead of returning TRUE as done in the previous example, we create a list to return the specific partition column values, the distribution of churn values, and the model itself.


    ore.scriptDrop("my.rpartFunction")
    ore.scriptCreate("my.rpartFunction",
      function(dat,datastorePrefix) {
        library(rpart)
        vmp <- dat[1,"voice_mail_plan"]
        ip <- dat[1,"international_plan"]
        datastoreName <- paste(datastorePrefix,vmp,ip,sep="_")
        dat$voice_mail_plan <- NULL
        dat$international_plan <- NULL
        dat$state <- as.factor(dat$state)
        dat$churn <- as.factor(dat$churn)
        dat$area_code <- as.factor(dat$area_code)
        mod <- rpart(churn ~ ., data = dat)
        ore.save(mod, name=datastoreName, overwrite=TRUE)
        list(voice_mail_plan=vmp,
            international_plan=ip,
            churn.table=table(dat$churn),
            rpart.model = mod)
      })

    After loading the rpart library and setting the datastore prefix, we invoke ore.groupApply using the derived column vmp_ip as the input to argument INDEX. After building the models, we’ll look at the first entry in the list returned. Using ore.load, we can load the model for the case where the customer neither has the voice mail plan, nor the international plan.


    library(rpart)

    datastorePrefix="my.rpartModel"

    res <- ore.groupApply( CT, INDEX=CT$vmp_ip,
          FUN.NAME="my.rpartFunction",
          datastorePrefix=datastorePrefix,
          ore.connect=TRUE)
    res[[1]]
    ore.load(name=paste(datastorePrefix,"no","no",sep="_"))
    mod

    SQL API

    To invoke this from the SQL API, we use the same approach as covered in Part 3. While we could create the table CT from the ore.frame used above, instead the following illustrates creating the derived column in SQL and explicitly defining a VIEW.


    CREATE OR REPLACE VIEW CT AS
      SELECT t.*, "voice_mail_plan" || '-' || "international_plan" as "vmp_ip"
      FROM CHURN_TRAIN t;

    Next, we create a PL/SQL PACKAGE and FUNCTION for the invocation.


    CREATE OR REPLACE PACKAGE churnPkg AS
      TYPE cur IS REF CURSOR RETURN CT%ROWTYPE;
    END churnPkg;
    /
    CREATE OR REPLACE FUNCTION churnGroupEval(
      inp_cur churnPkg.cur,
      par_cur SYS_REFCURSOR,
      out_qry VARCHAR2,
      grp_col VARCHAR2,
      exp_txt CLOB)
    RETURN SYS.AnyDataSet
    PIPELINED PARALLEL_ENABLE (PARTITION inp_cur BY HASH ("vmp_ip"))
    CLUSTER inp_cur BY ("vmp_ip")
    USING rqGroupEvalImpl;
    /

    Then, we can invoke the R function by name in the SELECT statement as follows:


    select *
    from table(churnGroupEval(
      cursor(select * from CT),
      cursor(select 1 as "ore.connect", 'my.rpartModel2' as "datastorePrefix" from dual),
      'XML', 'state', 'my.rpartFunction'));

    As another variation on this theme, suppose that you didn’t want to include all the columns from the source data set. To achieve this, you could create a view and define the PACKAGE from the view. However, you could also define a record that contains the specific columns of interest. This is a standard PL/SQL specification that can be used in combination with “group apply”.


    CREATE OR REPLACE PACKAGE churnPkg2 AS
      TYPE rec IS RECORD ("vmp_ip" varchar2(8),
        "churn" varchar2(4),
        "state" varchar2(4),
        "account_length" NUMBER(38));
      TYPE cur IS REF CURSOR RETURN rec;
    END churnPkg2;
    /

    If you don’t want to or cannot create a view, this approach allows you to specify the exact columns required for model building. Reducing the number of input columns can improve performance, since only required data will be passed to the server-side R engine. Notice that we could have used this above, since the function removes the source partition columns anyway.

    How to return results from R statistical functions as database table data

    R provides a wide range of statistical and advanced analytics functions. While Oracle Database contains a wide range of statistical functions in SQL, R further extends this set. In this next topic, we illustrate how to return statistical results as a SQL table for use with other SQL queries or to feed SQL-based applications.

    As our example, we’ll use the R principal components function princomp. Our goal is to return the loadings of the PCA model as a database table. For our data set, we’ll use the USArrests data set provided with R. We can view the results of princomp in the mod variable, which has class “princomp”. We then push this data to Oracle Database, getting an ore.frame object.


    mod <- princomp(USArrests, cor = TRUE)
    class(mod)
    mod
    dat <- ore.push(USArrests)

    R> mod <- princomp(USArrests, cor = TRUE)
    R> class(mod)
    [1] "princomp"
    R> mod
    Call:
    princomp(x = USArrests, cor = TRUE)

    Standard deviations:
       Comp.1    Comp.2    Comp.3    Comp.4
    1.5748783 0.9948694 0.5971291 0.4164494

    4 variables and 50 observations.
    R> dat <- ore.push(USArrests)

    In the first case considered, we use ore.tableApply to return simply the princomp object. When we do this we’re getting back a serialized object of type ore.object, but the actual princomp object still resides in the database. We can pull this object from the database to get a local princomp object, but this type of result cannot be directly returned as a SQL table because we need an object of class data.frame (which we’ll address later).


    res <- ore.tableApply(dat,
          function(dat) {
            princomp(dat, cor=TRUE)
          })
    class(res)
    res.local <- ore.pull(res)
    class(res.local)
    str(res.local)
    res.local
    res

    In the following output, we see the result is an ore.object that we pull from the database to get a princomp object. We examine the structure of the object and focus on the loadings element. In the example, we print res.local and res. Since res is an ore.object, it automatically gets pulled to the client before printing it.


    R> res <- ore.tableApply(dat,
    + function(dat) {
    + princomp(dat, cor=TRUE)
    + })
    R> class(res)
    [1] "ore.object"
    attr(,"package")
    [1] "OREembed"
    R> res.local <- ore.pull(res)
    R> class(res.local)
    [1] "princomp"
    R> str(res.local)
    List of 7
    $ sdev : Named num [1:4] 1.575 0.995 0.597 0.416
    ..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ center : Named num [1:4] 7.79 170.76 65.54 21.23
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ scale : Named num [1:4] 4.31 82.5 14.33 9.27
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ n.obs : int 50
    $ scores : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:50] "1" "2" "3" "4" ...
    .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ call : language princomp(x = dat, cor = TRUE)
    - attr(*, "class")= chr "princomp"
    R> res.local
    Call:
    princomp(x = dat, cor = TRUE)

    Standard deviations:
       Comp.1    Comp.2    Comp.3    Comp.4
    1.5748783 0.9948694 0.5971291 0.4164494

    4 variables and 50 observations.
    R> res
    Call:
    princomp(x = dat, cor = TRUE)

    Standard deviations:
       Comp.1    Comp.2    Comp.3    Comp.4
    1.5748783 0.9948694 0.5971291 0.4164494

    4 variables and 50 observations.

    In this next case, we focus on the loadings component of the princomp object, which contains the matrix of variable loadings, that is, a matrix whose columns contain the eigenvectors. This is of class "loadings"…still not a data.frame. To convert the loadings component to a data.frame, we determine the dimensions of the matrix and then construct a data.frame by accessing the cells of the loadings object. To associate the variables with each row, we assign the row names of the loadings to a column named variables. Finally, we return the loadings data.frame.


    res <- ore.tableApply(dat,
          function(dat) {
            mod <- princomp(dat, cor=TRUE)
            dd <- dim(mod$loadings)
            ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
            ldgs$variables <- row.names(ldgs)
            ldgs
          })
    class(res)
    res

    In the output below, notice that we still have an ore.object being returned, but it’s in the form of a data.frame.


    R> res <- ore.tableApply(dat,
    + function(dat) {
    + mod <- princomp(dat, cor=TRUE)
    + dd <- dim(mod$loadings)
    + ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
    + ldgs$variables <- row.names(ldgs)
    + ldgs
    + })
    R> class(res)
    [1] "ore.object"
    attr(,"package")
    [1] "OREembed"
    R> res
                 Comp.1     Comp.2     Comp.3      Comp.4 variables
    Murder   -0.5358995  0.4181809 -0.3412327  0.64922780    Murder
    Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748   Assault
    UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773  UrbanPop
    Rape     -0.5434321 -0.1673186  0.8177779  0.08902432      Rape

    We can address this last issue by specifying the FUN.VALUE argument to get an ore.frame result; a minimal sketch of that approach follows this paragraph. But our main goal is to enable returning the loadings from SQL as a database table. For that, we create the function in the R script repository and construct the appropriate SQL query. In preparation for the next example, we’ll create the table USARRESTS using the R data set (the ore.create call appears after the sketch).
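
    Here is one possible sketch of the FUN.VALUE variation, reusing the function body above; the column names and types supplied to FUN.VALUE are assumptions chosen to match the loadings data.frame the function returns:

    res <- ore.tableApply(dat,
          function(dat) {
            mod <- princomp(dat, cor=TRUE)
            dd <- dim(mod$loadings)
            ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
            ldgs$variables <- row.names(ldgs)
            ldgs
          },
          FUN.VALUE=data.frame(Comp.1=numeric(0),
            Comp.2=numeric(0),
            Comp.3=numeric(0),
            Comp.4=numeric(0),
            variables=character(0)))
    class(res)   # now an ore.frame rather than an ore.object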


    ore.create(USArrests,table="USARRESTS")

    Now, we’ll switch to SQL. We’re introducing the functions sys.rqScriptDrop and sys.rqScriptCreate, which are used within a BEGIN END PL/SQL block, to store the R function ‘princomp.loadings’.


    begin
    --sys.rqScriptDrop('princomp.loadings');
    sys.rqScriptCreate('princomp.loadings',
          'function(dat) {
            mod <- princomp(dat, cor=TRUE)
            dd <- dim(mod$loadings)
            ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
            ldgs$variables <- row.names(ldgs)
            ldgs
          }');
    end;
    /

    The SELECT statement provides input data by selecting all data from USARRESTS. There are no arguments to pass, so the next parameter is NULL. The SELECT string describes the format of the result. Notice that the column names must match in name (including case) and type. The last parameter is the name of the function stored in the R script repository.


    select *
    from table(rqTableEval( cursor(select * from USARRESTS),NULL,
              'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual','princomp.loadings'));

    SQL> select *
    from table(rqTableEval( cursor(select * from USARRESTS),NULL,
              'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual','princomp.loadings'));
    2 3
        Comp.1     Comp.2     Comp.3     Comp.4 variables
    ---------- ---------- ---------- ---------- ------------
    -.53589947 .418180865 -.34123273 .649227804 Murder
    -.58318363 .187985604 -.26814843 -.74340748 Assault
    -.27819087 -.87280619 -.37801579 .133877731 UrbanPop
    -.54343209 -.16731864 .817777908 .089024323 Rape

    If you have interesting embedded R scenarios to share with the ORE community, please consider posting a comment.

    Monday Jan 20, 2014

    Invoking R scripts via Oracle Database: Theme and Variation, Part 5


    In the first four parts of Invoking R scripts via Oracle Database: Theme and Variation, we introduced features of Oracle R Enterprise embedded R execution involving the functions ore.doEval / rqEval, ore.tableApply / rqTableEval, ore.groupApply / “rqGroupApply”, and ore.rowApply / rqRowEval. In this blog post, we cover ore.indexApply. Note that there is no corresponding rqIndexEval – more on that later. The “index apply” function is also one of the parallel-enabled embedded R execution functions. It supports task-parallel execution, where one or more R engines perform the same or different calculations, or tasks. A number, associated with the index of the execution, is provided to the function. Any required data is expected to be explicitly generated or loaded within the function.

    This functionality is valuable in a variety of settings, e.g., simulations, for taking advantage of high-performance computing hardware like Exadata.

    As for “group apply” and “row apply”, Oracle Database handles the management and control of potentially multiple R engines at the database server machine, with only the index passed to the function as the first argument. Oracle Database ensures that each R function execution completes, otherwise the ORE function returns an error. Output formats as supported by the other embedded R functions are possible for ore.indexApply, for example, returning an ore.list or combining data.frame data into an ore.frame.

    The variation on embedded R execution for ore.indexApply involves passing as an argument the number of times the user-defined R function should be executed.

    Let’s look at a simple example.

    The following code specifies to execute the function five times in parallel.


    res <- ore.indexApply(5,
          function(index) {
            paste("IndexApply:",index)
          },
        parallel=TRUE)
    class(res)
    res

    Notice that the class of the result is an ore.list, and when we print res, we have 5 character vectors, each with the index that was passed to the user-defined function. As with other parallel embedded R functions, the number of concurrently executing R engines can be limited by specifying the degree of parallelism of the database. As we’ll see in ORE 1.4, the parallel argument can specify a preferred number of parallel R engines, as an upper bound.


    > class(res)
    [1] "ore.list"
    attr(,"package")
    [1] "OREbase"
    > res
    $`1`
    [1] "IndexApply: 1"

    $`2`
    [1] "IndexApply: 2"

    $`3`
    [1] "IndexApply: 3"

    $`4`
    [1] "IndexApply: 4"

    $`5`
    [1] "IndexApply: 5"

    Column-parallel use case

    If we wanted to parallelize R’s summary function, we could compute the summary statistics on each column in parallel and combine them into a final result. The following example does exactly that. While we could generalize this example, we focus on the iris data set and computing summary statistics on the first four numeric columns. Since iris comes standard with R, there’s no need to load data from any other source; we simply access it. The first argument to ore.indexApply is 4, the number of columns we wish to summarize in parallel. The function takes one argument, index, which will be a value between 1 and 4, and is used to select the column to summarize. We massage the result of summary into a data.frame and add the column name to the result. Note that the function returns a single row: the summary statistics for the column.


    res <- NULL
    res <- ore.indexApply(4,
          function(index) {
            ss <- summary(iris[,index])
            attr.names <- attr(ss,"names")
            stats <- data.frame(matrix(ss,1,length(ss)))
            names(stats) <- attr.names
            stats$col <- names(iris)[index]
            stats
          },
          parallel=TRUE)
    res

    The result comes back as an ore.list object:


    > res
    $`1`
      Min. 1st Qu. Median  Mean 3rd Qu. Max.          col
    1  4.3     5.1    5.8 5.843     6.4  7.9 Sepal.Length

    $`2`
      Min. 1st Qu. Median  Mean 3rd Qu. Max.          col
    1    2     2.8      3 3.057     3.3  4.4  Sepal.Width

    $`3`
      Min. 1st Qu. Median  Mean 3rd Qu. Max.          col
    1    1     1.6   4.35 3.758     5.1  6.9 Petal.Length

    $`4`
      Min. 1st Qu. Median  Mean 3rd Qu. Max.          col
    1  0.1     0.3    1.3 1.199     1.8  2.5  Petal.Width

    This is good, but it would be better if the result was returned as an ore.frame, especially since all the columns are the same. To enable this, we’ll do a slight variation on the result by specifying FUN.VALUE with the structure of the result defined.


    res <- ore.indexApply(4,
          function(index) {
            ss <- summary(iris[,index])
            attr.names <- attr(ss,"names")
            stats <- data.frame(matrix(ss,1,length(ss)))
            names(stats) <- attr.names
            stats$col <- names(iris)[index]
            stats
          },
          FUN.VALUE=data.frame(Min.=numeric(0),
            "1st Qu."=numeric(0),
            Median=numeric(0),
            Mean=numeric(0),
            "3rd Qu."=numeric(0),
            Max.=numeric(0),
            col=character(0)),
          parallel=TRUE)
    res

    Now, the result comes back as an ore.frame.


    > res
      Min. X1st.Qu. Median  Mean X3rd.Qu. Max.          col
    1  0.1      0.3   1.30 1.199      1.8  2.5  Petal.Width
    2  1.0      1.6   4.35 3.758      5.1  6.9 Petal.Length
    3  4.3      5.1   5.80 5.843      6.4  7.9 Sepal.Length
    4  2.0      2.8   3.00 3.057      3.3  4.4  Sepal.Width

    Simulation use case

    The ore.indexApply function can be used in simulations as well. In this next example, we take multiple samples from a random normal distribution with the goal of comparing the distributions of the summary statistics. For this, we build upon the example above. We provide parameters such as the sample size, the mean and standard deviation of the random numbers, and the number of simulations we want to perform. Each simulation will occur in a separate R engine, in parallel, up to the degree of parallelism allowed by the database.

    We specify num.simulations as the first parameter to ore.indexApply. The user-defined function receives the index as its first argument, along with three additional arguments: the sample size, mean, and standard deviation. The function sets the random seed based on the index, which allows each invocation to generate a different set of random numbers. Using rnorm, the function produces sample.size random normal values. We invoke summary on the vector of random numbers, and then prepare a data.frame result to be returned. We use FUN.VALUE to get an ore.frame as the final result.


    res <- NULL
    sample.size = 1000
    mean.val = 100
    std.dev.val = 10
    num.simulations = 1000

    res <- ore.indexApply(num.simulations,
          function(index, sample.size=1000, mean=0, std.dev=1) {
            set.seed(index)
            x <- rnorm(sample.size, mean, std.dev)
            ss <- summary(x)
            attr.names <- attr(ss,"names")
            stats <- data.frame(matrix(ss,1,length(ss)))
            names(stats) <- attr.names
            stats$index <- index
            stats
          },
          FUN.VALUE=data.frame(Min.=numeric(0),
            "1st Qu."=numeric(0),
            Median=numeric(0),
            Mean=numeric(0),
            "3rd Qu."=numeric(0),
            Max.=numeric(0),
            index=numeric(0)),
          parallel=TRUE,
          sample.size=sample.size,
          mean=mean.val, std.dev=std.dev.val)
    res
    boxplot(ore.pull(res[,1:6]),
      main=sprintf("Boxplot of %d rnorm samples size %d, mean=%d, sd=%d",
            num.simulations, sample.size, mean.val, std.dev.val))

    To get the distribution of samples, we invoke boxplot on the data.frame after pulling the result to the client.

    Here are a couple of plots showing results for different parameters:


    In both cases, we run 10,000 samples. The first graph uses a sample size of 10 and the second uses a sample size of 1000. From these results, it is clear that a larger sample size significantly reduces the variance in each of the summary statistics - confirming our Statistics 101 understanding.

    Error reporting

    As introduced above, Oracle Database ensures that each embedded R user-defined function execution completes, otherwise the ORE function returns an error. Of course, any side-effects of the user-defined function need to be manually cleaned up. Operations that produce files, create tables in the database, or result in completed database transactions through ROracle will remain intact. The ORE embedded R infrastructure will report errors as produced by the function as illustrated in the following example.

    The code below specifies to invoke 4 parallel R engines. If the index has value 3, the function attempts to load the non-existent package "abc123" (which produces an error); otherwise, it returns the index value.


    R> ore.indexApply(4,
    + function(index) {
    + if (index==3) {library(abc123)}
    + else {return(index)}
    + }
    + )
    Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch, :
    ORA-12801: error signaled in parallel query server P000
    ORA-20000: RQuery error
    Error in library(abc123) : there is no package called 'abc123'
    ORA-06512: at "RQSYS.RQGROUPEVALIMPL", line 121
    ORA-06512: at "RQSYS.RQGROUPEVALIMPL", line 118

    Notice that the first reported error is ORA-12801: error signaled in parallel query server. Then the ORA-20000: RQuery error indicates the error as returned by the R engine. Also interesting to note is that the ORA-06512 errors reveal the underlying implementation of ore.indexApply, "RQSYS.RQGROUPEVALIMPL", which leads us to the next topic.

    No rqIndexEval?

    “Index apply” is really a variation of “group apply” where the INDEX column is a numeric vector that is pushed to the database. With n distinct numbers, one number is provided to each function as its index. As a result, there is no corresponding rqIndexEval in the SQL API. The user would have to create a similar package and function as was illustrated in the blog post on “group apply.”
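
    To make the equivalence concrete, the sketch below emulates the earlier five-execution example using ore.groupApply on a pushed numeric index column; the object names here are illustrative:

    idx <- ore.push(data.frame(index = 1:5))     # numeric index column pushed to the database
    res <- ore.groupApply(idx, INDEX = idx$index,
             function(dat) paste("IndexApply:", dat$index[1]),
             parallel = TRUE)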

    Thursday Jan 09, 2014

    Invoking R scripts via Oracle Database: Theme and Variation, Part 4

    In the first three parts of Invoking R scripts via Oracle Database: Theme and Variation, we introduced features of Oracle R Enterprise embedded R execution involving the functions ore.doEval / rqEval, ore.tableApply / rqTableEval, and ore.groupApply / “rqGroupApply”. In this blog post, we’ll cover the next in our theme and variation series involving ore.rowApply and rqRowEval. The “row apply” function is also one of the parallel-enabled embedded R execution functions. It supports data-parallel execution, where one or more R engines perform the same R function, or task, on disjoint chunks of data. This functionality is essential to enable scalable model scoring/predictions on large data sets and for taking advantage of high-performance computing hardware like Exadata.

    As for ore.groupApply, Oracle Database handles the management and control of potentially multiple R engines at the database server machine, automatically chunking and passing data to parallel executing R engines. Oracle Database ensures that R function executions for all chunks of rows complete, or the ORE function returns an error. The result from the execution of each user-defined embedded R function is gathered in an ore.list. This list remains in the database until the user requires the result. However, we’ll also show how data.frame results from each execution can be combined into a single ore.frame. This feature works for return values of other embedded R functions as well.

    The variation on embedded R execution for ore.rowApply involves passing not only an ore.frame to the function such that the first parameter of your embedded R function receives a data.frame, but also the number of rows that should be passed to each invocation of the user-defined R function. The last chunk, of course, may have fewer rows than specified.

    Let’s look at an example. We’re going to use the C50 package to score churn data (i.e., predict which customers are likely to churn) using the C5.0 decision tree models we built in the previous blog post with ore.groupApply. (Well, almost. We need to rebuild the models to take into account the full data set levels.) The goal is to score the customers in parallel leveraging the power of a high performance computing platform, such as Exadata.


    library(C50)
    data(churn)

    ore.create(churnTest, "CHURN_TEST")

    myFunction <- function(dat, xlevels, datastorePrefix) {
      library(C50)
      state <- dat[1,"state"]
      datastoreName <- paste(datastorePrefix,state,sep="_")
      dat$state <- NULL
      for (j in names(xlevels))
        dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
      ore.load(name=datastoreName)
      res <- data.frame(pred=predict(mod,dat, type="class"),
            actual=dat$churn,
            state=state)
      res
    }

    xlevels <- ore.getXlevels(~ ., CHURN_TEST[,-1])
    scoreList <- ore.groupApply(
      CHURN_TEST,
      INDEX=CHURN_TEST$state,
      myFunction,
      datastorePrefix="myC5.0model3",xlevels=xlevels, ore.connect=TRUE)
    score.MA <- ore.pull(scoreList$MA)
    table(score.MA$actual, score.MA$pred)

    A few points to highlight:

    • Instead of computing the levels using the as.factor function inside the user-defined function, we’ll use ore.getXlevels, which returns the levels for each factor column. We don’t need this for the state column, so we exclude it (“-1”). In the previous post we noted that factor data is passed as character columns in the data.frame. Computing the levels first can ensure that all possible levels are provided during model building, even if there are no rows with some of the level values.
    • When building models where some levels were missing (due to using as.factor on each partition of data), scoring can fail if the test data has unknown level values. For this reason, the models built in Part 3 need to be rebuilt using the approach above with ore.getXlevels. This is left as an exercise for the reader.
    • Assign the function to the variable “myFunction” to facilitate reuse (see below).
    • We construct the datastore name to be the same as when we were building the models, i.e., appending the state value to the datastore prefix separated by an ‘_’.
    • The for loop iterates over the levels passed in as xlevels, creating a factor using the provided levels and assigning it back to the data.frame.
    • Loading the datastore by name, we have access to the variable mod, which contains the model for the particular state.
    • The result is constructed as a data.frame with the prediction and the actual values.
    • Three arguments are passed: the datastore prefix, the levels that were pre-computed, and that we need to connect to the database because we’re using a datastore.
    • The results are stored as a list of ore.frames. We can pull the scores for MA and compute a confusion matrix using table.

    This is fine. However, we likely don’t want to have a list of separate ore.frames as the result. We’d prefer to have a single ore.frame with all the results. This can be accomplished using the FUN.VALUE argument. Whenever a data.frame is the result of the user-defined R function, and if the structure of that data.frame is the same across all invocations of the group apply or row apply, you can combine them into a single result by defining the structure as follows:

    scores <- ore.groupApply(
      CHURN_TEST,
      INDEX=CHURN_TEST$state,
      myFunction,
      datastorePrefix="myC5.0model3",xlevels=xlevels, ore.connect=TRUE,
      FUN.VALUE=data.frame(pred=character(0),
            actual=character(0),
            state=character(0)));
    head(scores)
    scores.local <- ore.pull(scores)
    table(scores.local[scores.local$state=="MA",c("actual","pred")])

    scores.MA <- scores[scores$state=="MA",c("actual","pred")]
    table(scores.MA$actual, scores.MA$pred)

    A few important points to highlight:

    • FUN.VALUE is set to a data.frame that describes the format of the result. By providing this argument, you will get back a single ore.frame, not an ore.list object.
    • The group apply completes instantaneously because it is only defining the ore.frame, not actually performing the scoring. Not until the values are needed does the result get computed. We invoke head on the ore.frame in scores to highlight this.
    • We can pull the scores to the client to invoke table as before, but subselecting for state MA. However, we can also do this computation in the database using the transparency layer. First, we filter the rows for MA in scores.MA, and then invoke table on the two columns. Note: ORE requires passing the two columns explicitly to the overloaded function table.
    • To do this in parallel, add the argument parallel=TRUE to the ore.groupApply call.

    Wait! What happened to ore.rowApply?

    Above, we showed how to score with multiple models using ore.groupApply. But what if we had customers from a single state that we wanted to score in parallel? We can use ore.rowApply and rqRowEval to invoke a function on chunks of data (rows) at a time, from 1 to the total number of rows. (Note that values closer to the latter will have no benefit from parallelism, obviously.)


    scores <- ore.rowApply(
      CHURN_TEST[CHURN_TEST$state=="MA",],
      myFunction,
      datastorePrefix="myC5.0model3",xlevels=xlevels,
      ore.connect=TRUE, parallel=TRUE,
      FUN.VALUE=data.frame(pred=character(0),
            actual=character(0),
            state=character(0)),
      rows=200)
    scores
    table(scores$actual, scores$pred)

    A few points to highlight:

    • Since we want to perform the scoring in parallel by state, we filter the rows for MA. This will ensure that all rows processed can use the same predictive model.
    • We set the rows argument to 200. CHURN_TEST has 1667 rows, so this will result in nine executions of myFunction: the first eight receive 200 rows each and the last receives 67 rows.
    • We also set parallel=TRUE above since we want the scoring performed in parallel.
    • The invocation of ore.rowApply returns immediately. Not until we print scores do we incur the cost of executing the underlying query. However, also note that each time we access scores, for example in the following call to table, we incur the cost of executing the query. If the result will be used many times in subsequent operations, you may want to create a table with the result using ore.create.

    In SQL, we can do the same, but we’ll need to store the function in the R script repository (perhaps called "myScoringFunction") and also store xlevels in a datastore (perhaps called "myXLevels"). While we can pass complex objects in the R interface to embedded R functions, we cannot do that in SQL. Instead, we must pass the name of a datastore. Since the xlevels are in a datastore, the user-defined R function needs to be modified to take this other datastore name and load that datastore to have access to xlevels. This set of changes is left to the reader as an exercise.
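
    One possible sketch of that modification is shown below; it assumes the xlevels object was saved earlier with ore.save(xlevels, name="myXLevels") and that the function is stored in the R script repository as "myScoringFunction":

    myScoringFunction <- function(dat, datastorePrefix, xlevelsDatastore) {
      library(C50)
      ore.load(name=xlevelsDatastore)            # restores xlevels saved earlier
      state <- dat[1,"state"]
      datastoreName <- paste(datastorePrefix,state,sep="_")
      dat$state <- NULL
      for (j in names(xlevels))
        dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
      ore.load(name=datastoreName)               # restores mod, the per-state C5.0 model
      data.frame(pred=predict(mod,dat, type="class"),
            actual=dat$churn,
            state=state)
    }

    # ore.save(xlevels, name="myXLevels")                       # run once beforehand
    # ore.scriptCreate("myScoringFunction", myScoringFunction)  # store for use from SQL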


    select * from table(rqRowEval(
      cursor(select /*+ parallel(t, 4) */ *
            from CHURN_TEST t
            where "state" = 'MA'),
      cursor(select 1 as "ore.connect",
            'myC5.0model3' as "datastorePrefix",
            'myXLevels' as "xlevelsDatastore"
            from dual),
      'select ''aaa'' "pred",''aaa'' "actual" , ''aa'' "state" from dual',
        200, 'myScoringFunction'));

    A few points to highlight:

    • The input cursor specifies a parallel hint on the input data and filters the rows for MA.
    • Several arguments are being passed, including the new argument to our function myXLevels.
    • The output form is specified in the SQL string. Care must be taken to ensure that the column names, ordering, and the length of character strings match the returned data.frame.

    Map Reduce

    The “row apply” functionality can be thought of in terms of the map-reduce paradigm where the mapper performs the scoring and outputs a data.frame value (no key required). There is no reducer, or the reducer is simply a pass-through.

    Memory and performance considerations

    Unlike with group apply, the rows argument in row apply ensures an upper bound on the number of rows (and hence memory requirement). The value of rows should be chosen to balance memory and parallel performance. The usual measures can be taken regarding setting memory limits on the R engine – as noted in Part 2.

    There may be instances where setting rows = 1 makes sense. For example, if the computation per row is intensive (i.e., takes a long time), sending one row per R engine may be appropriate. Experiment with a range of values for rows to determine the best value for your particular scenario.

    Sunday Jan 05, 2014

    Invoking R scripts via Oracle Database: Theme and Variation, Part 3

    In the first two parts of Invoking R scripts via Oracle Database: Theme and Variation, we introduced features of Oracle R Enterprise embedded R execution, focusing on the functions ore.doEval / rqEval and ore.tableApply / rqTableEval. In this blog post, we’ll cover the next in our theme and variation series involving ore.groupApply and the corresponding definitions required for SQL execution. The “group apply” function is one of the parallel-enabled embedded R execution functions. It supports data-parallel execution, where one or more R engines perform the same R function, or task, on different partitions of data. This functionality is essential to enable the building of potentially 10s or 100s of thousands of predictive models, e.g., one per customer, and for taking advantage of high-performance computing hardware like Exadata.

    Oracle Database handles the management and control of potentially multiple R engines at the database server machine, automatically partitioning and passing data to parallel executing R engines. It ensures that all R function executions for all partitions complete, or the ORE function returns an error. The result from the execution of each user-defined embedded R function is gathered in an ore.list. This list remains in the database until the user requires the result.

    The variation on embedded R execution for ore.groupApply involves passing not only an ore.frame to the function such that the first parameter of your embedded R function receives a data.frame, but also an INDEX argument that specifies the name of a column by which the rows will be partitioned for processing by a user-defined R function.

    Let’s look at an example. We’re going to use the C50 package to build a C5.0 decision tree model on the churn data set from C50. The goal is to build one churn model on the data for each state.


    library(C50)
    data(churn)

    ore.create(churnTrain, "CHURN_TRAIN")

    modList <- ore.groupApply(
      CHURN_TRAIN,
      INDEX=CHURN_TRAIN$state,
        function(dat) {
          library(C50)
          dat$state <- NULL
          dat$churn <- as.factor(dat$churn)
          dat$area_code <- as.factor(dat$area_code)
          dat$international_plan <- as.factor(dat$international_plan)
          dat$voice_mail_plan <- as.factor(dat$voice_mail_plan)
          C5.0(churn ~ ., data = dat, rules = TRUE)
        });
    mod.MA <- ore.pull(modList$MA)
    summary(mod.MA)

    A few points to highlight:
    • As noted in Part 2 of this series, to use the CRAN package C50 on the client, we first load the library, and then the churn data set.
    • Since the data is a data.frame, we’ll create a table in the database with this data. Notice that if you compare the results of str(churnTrain) with str(CHURN_TRAIN), you will see that the factor columns have been retained. This becomes relevant later.
    • The function ore.groupApply will return a list of models stored as ore.object instances. The first argument is the ore.frame CHURN_TRAIN and the second argument indicates to partition the data on column state such that the user-defined function is invoked on each partition of the data.
    • The next argument specifies the function, which could alternatively have been the function name if the FUN.NAME argument were used and the function saved explicitly in the R script repository. The function’s first argument (whatever its name) will receive one partition of data, e.g., all data associated with a single state.
    • Regarding the user-defined function body, we explicitly load the package we’re using, C50, so the function body has access to it. Recall that this user-defined R function will execute at the database server in a separate R engine from the client.
    • Since we don’t need to know which state we’re working with and we don’t want this included in the model, we delete the column from the data.frame.
    • Although the corresponding columns of the ore.frame are factors, when the data is loaded into the user-defined embedded R function, those columns arrive as character vectors. As a result, we need to convert them back to factors explicitly.
    • The model is built and returned from the function.
    • The result from ore.groupApply is a list containing the results from the execution of the user-defined function on each partition of the data. In this case, it will be one C5.0 model per state.
    • To view the model, we first use ore.pull to retrieve it from the database and then invoke summary on it. The class of mod.MA is “C5.0”.

    SQL API

    We can invoke the function through the SQL API by storing the function in the R script repository. Previously we showed storing a function using the SQL API (sys.rqScriptCreate); we can also do this from R with ore.scriptCreate. Here, we also modify the function to store the resulting models in an ORE datastore named by state:


    ore.scriptCreate("myC5.0Function",
      function(dat,datastorePrefix) {
        library(C50)
        datastoreName <- paste(datastorePrefix,dat[1,"state"],sep="_")
        dat$state <- NULL
        dat$churn <- as.factor(dat$churn)
        dat$area_code <- as.factor(dat$area_code)
        dat$international_plan <- as.factor(dat$international_plan)
        dat$voice_mail_plan <- as.factor(dat$voice_mail_plan)
        mod <- C5.0(churn ~ ., data = dat, rules = TRUE)
        ore.save(mod, name=datastoreName)
        TRUE
      })

    Just for comparison, we could invoke this from the R API as follows:


    res <- ore.groupApply( CHURN_TRAIN, INDEX=CHURN_TRAIN$state,
              FUN.NAME="myC5.0Function",
              datastorePrefix="myC5.0model", ore.connect=TRUE)
    res
    res <- ore.pull(res)
    all(as.logical(res) == TRUE)

    Since we’re using a datastore, we need to connect to the database by setting ore.connect to TRUE. We also pass the datastorePrefix. The result res is an ore.list of logical values. To test whether all are TRUE, we first pull the result to the client and then use the R all function.

    Back to the SQL API. Now that the function is in the R script repository, we can invoke it from SQL so that it places one model per datastore, each named with the given prefix and state.


    select *
    from table(churnGroupEval(
      cursor(select * from CHURN_TRAIN),
      cursor(select 1 as "ore.connect", 'myC5.0model2' as "datastorePrefix" from dual),
      'XML', 'state', 'myC5.0Function'));

    There’s one thing missing, however. We don’t have the function churnGroupEval. There is no generic “rqGroupEval” in the API – we need to define our own table function that matches the data provided. Due to this and the parallel nature of the implementation, we need to create a PL/SQL FUNCTION and supporting PACKAGE:


    CREATE OR REPLACE PACKAGE churnPkg AS
      TYPE cur IS REF CURSOR RETURN CHURN_TRAIN%ROWTYPE;
    END churnPkg;
    /
    CREATE OR REPLACE FUNCTION churnGroupEval(
      inp_cur churnPkg.cur,
      par_cur SYS_REFCURSOR,
      out_qry VARCHAR2,
      grp_col VARCHAR2,
      exp_txt CLOB)
    RETURN SYS.AnyDataSet
    PIPELINED PARALLEL_ENABLE (PARTITION inp_cur BY HASH ("state"))
    CLUSTER inp_cur BY ("state")
    USING rqGroupEvalImpl;
    /

    The pieces that need to be changed to create this function for a particular data set are the REF CURSOR's %ROWTYPE table (here CHURN_TRAIN), the package and function names, and the partitioning column (here "state") used in the PARTITION BY and CLUSTER BY clauses. There are other variants, but this will get you quite far.

    To validate that our datastores were created, we invoke ore.datastore(). This returns the datastores present and we will see 51 such entries – one for each state and the District of Columbia.
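    As a quick check in R, we can list the datastores by name pattern and restore one of the models; the sketch below assumes the "myC5.0model" prefix used in the R API example above.

    ds <- ore.datastore(pattern = "myC5.0model")   # one entry per state
    nrow(ds)                                       # expect 51
    ore.load("myC5.0model_MA")                     # restores the C5.0 object 'mod'
    summary(mod)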

    Parallelism

    Above, we mentioned that “group apply” supports data parallelism. By default, parallelism is turned off. To enable parallelism, the parallel argument to ore.groupApply needs to be set to TRUE:


    ore.groupApply( CHURN_TRAIN, INDEX=CHURN_TRAIN$state,
              FUN.NAME="myC5.0Function",
              datastorePrefix="myC5.0model",
              ore.connect=TRUE,
              parallel=TRUE
    )

    In the case of the SQL API, the parallel hint can be provided with the input cursor. The hint below indicates that a degree of parallelism of up to 4 should be enabled.


    select *
    from table(churnGroupEval(
      cursor(select /*+ parallel(t,4) */ * from CHURN_TRAIN t),
      cursor(select 1 as "ore.connect", 'myC5.0model2' as "datastorePrefix" from dual),
      'XML', 'state', 'myC5.0Function'));

    Map Reduce

    The “group apply” functionality can be thought of in terms of the map-reduce paradigm where the mapper performs the partitioning by outputting the INDEX value as key and the data.frame as value. Then, each reducer receives the rows associated with one key. In our example above, INDEX was the column state and so each reducer would receive rows associated with a single state.

    Memory and performance considerations

    While the data is partitioned by the INDEX column, it is still possible that a given partition is quite large, such that either the partition of data will not fit in the R engine memory or the user-defined embedded R function will not be able to execute to completion. The usual remedial measures can be taken regarding setting memory limits – as noted in Part 2.

    If the partitions are not balanced, you would have to configure the system’s memory for the largest partition. This will also have implications for performance, obviously, since smaller partitions of data will likely complete faster than larger ones.

    The blog post Managing Memory Limits and Configuring Exadata for Embedded R Execution discusses how to instrument your code to understand the memory usage of your R function. This is done in the context of ore.indexApply (to be discussed later in this blog series), but the approach is analogous for “group apply.”
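    As a simple way to gauge partition skew before running the full model build, here is a hedged sketch that returns each partition's row count and approximate in-memory size from inside a group-apply function.

    sizes <- ore.groupApply(CHURN_TRAIN, INDEX = CHURN_TRAIN$state,
      function(dat) {
        data.frame(state = dat[1, "state"],
                   rows  = nrow(dat),
                   mb    = round(as.numeric(object.size(dat)) / 2^20, 2))
      })
    do.call(rbind, ore.pull(sizes))   # one row per state: rows and size in MB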

    Friday Jan 03, 2014

    Invoking R scripts via Oracle Database: Theme and Variation, Part 2

    In part 1 of Invoking R scripts via Oracle Database: Theme and Variation, we introduced features of Oracle R Enterprise embedded R execution, focusing on the functions ore.doEval and rqEval. In this blog post, we’ll cover the next in our theme and variation series involving ore.tableApply and rqTableEval.

    The variation on embedded R execution for ore.tableApply involves passing an ore.frame to the function such that the first parameter of your embedded R function receives a data.frame. The rqTableEval function in SQL allows users to specify a data cursor to be delivered to your embedded R function as a data.frame.

    Let’s look at a few examples.


    R API

    In the following example, we’re using ore.tableApply to build a Naïve Bayes model on the iris data set. Naïve Bayes is found in the e1071 package, which must be installed on both the client and database server machine R engines.

    library(e1071)
    mod <- ore.tableApply(
      ore.push(iris),
      function(dat) {
        library(e1071)
        dat$Species <- as.factor(dat$Species)
        naiveBayes(Species ~ ., dat)
      })
    class(mod)
    mod

    A few points to highlight:
    • To use the CRAN package e1071 on the client, we first load the library.
    • The iris data set is pushed to the database to create an ore.frame, which is supplied as the first argument to ore.tableApply. Normally this argument would be an ore.frame corresponding to a table that already exists in Oracle Database. Note that we could equally have assigned dat <- ore.push(iris) beforehand and passed dat as the argument.
    • The embedded R function is supplied as the second argument to ore.tableApply as a function object. Recall from Part 1 that we could have alternatively assigned this function to a variable and passed the variable as an argument, or stored the function in the R script repository and passed the argument FUN.NAME with the assigned function name.
    • The user-defined embedded R function takes dat as its first argument, which will contain a data.frame derived from the ore.frame supplied.
    • The model itself is returned from the function.
    • The result of the ore.tableApply execution will be an ore.object.
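    For example, a minimal sketch of using the result on the client:

    mod.local <- ore.pull(mod)               # materialize the naiveBayes model locally
    class(mod.local)                         # "naiveBayes"
    predict(mod.local, newdata = head(iris))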

    SQL API

    We can invoke the function through the SQL API by storing the function in the R script repository. Recall that the call to sys.rqScriptCreate must be wrapped in a BEGIN-END PL/SQL block.


    begin
      sys.rqScriptCreate('myNaiveBayesModel',
        'function(dat) {
           library(e1071)
           dat$Species <- as.factor(dat$Species)
           naiveBayes(Species ~ ., dat)
         }');
    end;
    /

    Invoking the function myNaiveBayesModel occurs in a SQL SELECT statement as shown below. The first argument to rqTableEval specifies a cursor that retrieves the IRIS table. Note that the IRIS table could have been created earlier using ore.create(iris,"IRIS"). The second argument, NULL, indicates that no arguments are supplied to the function.

    The function returns an R object of type naiveBayes, but as a serialized object that is chunked into a table. This likely is not useful to most users.


    select *
    from table(rqTableEval(cursor(select * from IRIS), NULL, NULL, 'myNaiveBayesModel'));

    If we want to keep the model in a more usable form, we can store it in an ORE datastore in Oracle Database. For this, we require a change to the user-defined R function and the SQL invocation.


    begin
      sys.rqScriptCreate('myNaiveBayesModel',
        'function(dat) {
           library(e1071)
           dat$Species <- as.factor(dat$Species)
           mod <- naiveBayes(Species ~ ., dat)
           ore.save(mod, name="myNaiveBayesDatastore")
           TRUE
         }');
    end;
    /

    select *
    from table(rqTableEval(
      cursor(select * from IRIS),
      cursor(select 1 as "ore.connect" from dual),
      'XML',
      'myNaiveBayesModel'));

    In this version, the function stores the model in the datastore named ‘myNaiveBayesDatastore’ via ore.save and returns TRUE as a simple value that can show up as the result of the function execution. In the SQL query, we changed the third parameter to ‘XML’ to return an XML string containing “TRUE”. The name of the datastore could instead be passed as an argument, as follows:


    begin
      sys.rqScriptCreate('myNaiveBayesModel',
        'function(dat, datastoreName) {
           library(e1071)
           dat$Species <- as.factor(dat$Species)
           mod <- naiveBayes(Species ~ ., dat)
           ore.save(mod, name=datastoreName)
           TRUE
         }');
    end;
    /

    select *
    from table(rqTableEval(
      cursor(select * from IRIS),
      cursor(select 'myNaiveBayesDatastore' "datastoreName", 1 as "ore.connect" from dual),
      'XML',
      'myNaiveBayesModel'));
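    To use that model later from R, a minimal sketch (assuming a session connected with ore.connect and the e1071 package installed locally):

    library(e1071)
    ore.load("myNaiveBayesDatastore")      # restores the object 'mod'
    predict(mod, newdata = head(iris))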

    Memory considerations with ore.tableApply and rqTableEval

    The input data provided as the first argument to a user-defined R function invoked using ore.tableApply or rqTableEval is physically being moved from Oracle Database to the database server R engine. It’s important to realize that R’s memory limitations still apply. If your database server machine has 32 GB RAM and your data table is 64 GB, ORE will not be able to load the data into the R function’s dat argument.
    You may see errors like:


    Error : vector memory exhausted (limit reached)

    or

    ORA-28579: network error during callback from external procedure agent

    See the blog post on Managing Memory Limits and Configuring Exadata for Embedded R Execution where we discuss setting memory limits for the database server R engine. This can be necessary to load reasonably sized data tables.

    Parallelism

    As with ore.doEval / rqEval, user-defined R functions invoked using ore.tableApply / rqTableEval are not executed in parallel, i.e., a single R engine is used to execute the user-defined R function.

    Invoking certain ORE advanced analytics functions

    In the current ORE release, some advanced analytics functions, like ore.lm or ore.glm, which use the embedded R execution framework, cannot be used within other embedded R calls such as ore.doEval / rqEval and ore.tableApply / rqTableEval.

    You can expect to see an error like the following:


    ORA-28580: recursive external procedures are not supported

    In the next post in this series, I’ll discuss ore.groupApply and the corresponding definitions required for SQL execution, since there is no rqGroupApply function. I’ll also cover how the various “group apply” constructs relate to the map-reduce paradigm.

    ORAAH - Enabling high performance R workloads on Hadoop


    One of the features of Oracle R Advanced Analytics for Hadoop (ORAAH) is enabling Hadoop jobs written in the R language. R is a popular open-source language and environment for statistical computing and graphics. ORAAH enables R programmers to leverage a Hadoop cluster operating on data resident in HDFS files.

    In this blog post, we examine the performance characteristics of ORAAH with an example and explain what makes ORAAH the fastest alternative available to run Hadoop-R jobs. We also compare the results with another popular Hadoop interface for R, rmr.

    Credit to Vlad Sharanhovich and Anand Srinivasan for providing the content for this blog post.

    In probability theory and statistics, covariance is a measure of how much two variables change together.

    Variables that tend to show similar behavior exhibit positive covariance. Alternatively, if the greater values of one variable correspond with the smaller values of another, then the covariance between the variables is negative. We use covariance computation as the running example below, which you can use to reproduce the results detailed here.

    The tests were performed on a 6-node cluster running ORAAH version 2.3.1.


    Cluster configuration:


    • 6 node cluster

    • BDA v2.3.1 (based on CDH 4.4)

    • 4 tasktrackers

    • CPU: Intel® Xeon® X5675 @ 3.07 GHz

    • RAM: 47GB


    ORAAH comes with a convenience function for data generation. We use this function to generate a 100 GB HDFS input file with numeric values for 200 variables (columns), as follows:


    # Generate 100GB input dataset (using pre-release 2.4.0)
    inputCsv <- orch.datagen(1e+11, numeric.col.count=200, parts=100)
    # 15 mins, 53 sec
    # 200 mappers

    ORAAH supports two types of HDFS input: delimited text files and RDATA, R's own binary representation. In many cases, the RDATA representation provides much better I/O throughput than delimited text files.


    # Converting into ORAAH native format
    inputRdata <- hdfs.toRData(inputCsv, out.name="100G_200n_rd")
    # 4 mins, 54 sec
    # 400 mappers

    Next, we write the mapper and reducer code for parallel/distributed covariance computation. Below, the mapper function accepts a data.frame representation of the input data and generates structured output with three components: a cross-product matrix, a vector of column sums, and the input row count. The single reducer function merges the structured output generated from the mappers to produce the final covariance matrix.
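    As a quick check on why those three components suffice, the sample covariance can be assembled entirely from per-block partial sums; the sketch below verifies the identity on a small in-memory matrix.

    # cov(X) = t(X) %*% X / (n-1)  -  S %*% t(S) / (n * (n-1)),  with S = colSums(X)
    X <- matrix(rnorm(1000), nrow = 100)
    S <- colSums(X); n <- nrow(X)
    all.equal(cov(X), t(X) %*% X / (n - 1) - S %*% t(S) / (n * (n - 1)))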

    The mapper and reducer functions are supplied as input to ORAAH's hadoop.run() function. This function additionally takes an HDFS file as input in the data argument. The structured output from the mapper and reducer is described by a mapred.config data structure. Optionally, the Hadoop job can be given a name (in this case "cov") for traceability.

    ORCH_cov <- function(x) {
      hadoop.run(
        data = x,
        mapper = function(k, v) {
          m <- as.matrix(v)
          cs <- colSums(m)
          nr <- nrow(m)
          mtm <- t(m) %*% m
          l <- list(mat=mtm, colsum=cs, nrow=nr)
          orch.keyval(NULL, orch.pack(l))
        },
        reducer = function(k, v) {
          mapres <- orch.unpack(v$val, as.list=T)
          xy <- Reduce("+", lapply(mapres, function(x) x$mat))
          csf <- Reduce("+", lapply(mapres, function(x) x$colsum))
          nrf <- Reduce("+", lapply(mapres, function(x) x$nrow))
          sts <- csf %*% t(csf)
          m1 <- xy / (nrf - 1)
          m2 <- sts / (nrf * (nrf - 1))
          m3 <- 2 * sts / (nrf * (nrf - 1))
          covmat <- m1 + m2 - m3
          orch.keyval(NULL, orch.pack(covmat))
        },
        config = new("mapred.config",
                     map.output = data.frame(key="none", val="c"),
                     reduce.output = data.frame(key="none", val="c")),
        job.name = "cov")
    }

    The execution times are shown below:


    # Using text input
    ORCH_cov(inputCsv)
    # 7 mins, 19 sec
    # 400 mappers / 45 reducers



    # Using binary input
    ORCH_cov(inputRdata)
    # 4 mins, 18 sec
    # 400 mappers / 45 reducers

    A few points to observe:
    1. Before an HDFS file can be used with ORAAH's hadoop.run() function, its metadata must be known. ORAAH automatically determines the data types of the columns in the HDFS file by sampling rows. The metadata is created during the hdfs.attach() call (see the sketch after this list). This metadata enables ORAAH to generate highly optimized scan routines to read rows from the file.
    2. ORAAH implements caching of input and output structures from mappers and reducers, thereby lifting the burden of dealing with large data volumes from the R engine.
    3. ORAAH's orch.pack() and orch.unpack() functions enable transfer of structured constructs between mappers and reducers, which further improves I/O throughput by eliminating the need to scan/parse string inputs.
    4. ORAAH leverages R's own RDATA representation as the binary representation. The key to better execution performance of R jobs on Hadoop is managing I/O throughput and carefully bypassing R's inherent limitations with parsing strings.
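    For reference, a minimal sketch of attaching an existing delimited HDFS file (the path is purely illustrative), so that this metadata is collected before the file is used with hadoop.run():

    csvData <- hdfs.attach("/user/oracle/demo/100G_200n.csv")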

    We contrast ORAAH's performance with that of the open source package rmr (https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr) by repeating the covariance calculation on the same cluster and input data set. For this experiment, we used rmr version 2.3.0.

    The covariance computation is written in rmr as shown below.


    RMR2_cov <- function(x, input.format) {
      mapreduce(
        x,
        input.format = input.format,
        map = function(k, v) {
          m <- as.matrix(v)
          cs <- colSums(m)
          nr <- nrow(m)
          mtm <- t(m) %*% m
          l <- list(mat=mtm, colsum=cs, nrow=nr)
          keyval(1, paste0(deparse(l), collapse="\n"))
        },
        reduce = function(k, v) {
          mapres <- lapply(v, function(x) eval(parse(text=x)))
          xy <- Reduce("+", lapply(mapres, function(x) x$mat))
          csf <- Reduce("+", lapply(mapres, function(x) x$colsum))
          nrf <- Reduce("+", lapply(mapres, function(x) x$nrow))
          sts <- csf %*% t(csf)
          m1 <- xy / (nrf - 1)
          m2 <- sts / (nrf * (nrf - 1))
          m3 <- 2 * sts / (nrf * (nrf - 1))
          covmat <- m1 + m2 - m3
          keyval(1, paste0(deparse(covmat), collapse="\n"))
        }
      )
    }

    Notice that the code is somewhat similar to the version used with ORAAH, with three key differences:
       i) Mapper output is serialized as a string
     ii) Reducer, thus, is forced to parse input strings
    iii) Reducer output is once again a string, requiring client to reconstruct the covariance matrix

    Not only is passing strings limiting from the R programmer's perspective, but it also has a negative effect on I/O throughput. rmr also supports a proprietary binary representation of delimited text data.

    Below we repeat the tests with both the delimited and binary representations. Not only is the conversion to the binary representation more expensive, but the resulting I/O throughput is not substantially better.


    # Convert to RMR2 native format
    inputRMR <- mapreduce(inputCsv, 
      input.format = make.input.format("csv",sep=","),
      map = function(k, v) keyval(NULL, v)
    )
    # 20 mins, 17 sec
    # 400 mappers
    #
    # Using text input
    RMR2_cov(inputCsv, make.input.format("csv",sep=","))
    # 32 mins, 14 sec
    # 400 mappers / 45 reducers
    #
    # Using binary input
    RMR2_cov(inputRMR, "native")
    # 17 mins, 18 sec
    # 400 mappers / 45 reducers

    To summarize, ORAAH is 4x faster than rmr out of the box for a simple covariance calculation, as the table below shows.

                       Text Input        Binary Input      Text to Binary Conversion
    ORAAH              7 min, 19 sec     4 min, 18 sec     4 min, 54 sec
    rmr                32 min, 14 sec    17 min, 18 sec    20 min, 17 sec
    ORAAH advantage    4.4x faster       4x faster         4.14x faster

    Thursday Jan 02, 2014

    Invoking R scripts via Oracle Database: Theme and Variation


    Oracle R Enterprise provides several ways for you to invoke R scripts through Oracle Database. From the same R script you can get structured data, an XML representation of R objects and images, and even PNG images via a BLOB column in a database table. This series of blog posts will take you through the various ways you can interact with R scripts and Oracle Database. In this first post, we explore the benefits of embedded R execution and usage of the ore.doEval and rqEval functions. Subsequent posts will detail the use of the other embedded R execution functions, and in the case of data- and task-parallel capabilities, how these relate to the MapReduce paradigm.

    Introduction


    In Oracle R Enterprise, we use the phrase “embedded R execution” to characterize the storing of R scripts in Oracle Database – using the ORE R script repository – and invoking that script in one or more database-side R engines.

    This is a powerful capability for several reasons:


    • enable data- and task-parallel execution of user-defined R functions that correspond to special cases of Hadoop Map-Reduce jobs

    • leverage a more powerful database server machine for R engine execution – both RAM and CPU

    • transfer data between Oracle Database and R engine much faster than to a separate client R engine

    • invoke user-defined R functions from SQL and retrieve results in various forms depending on application requirements: tables, XML, PNG BLOB

    • leverage open source CRAN packages at the database server machine

    • schedule R scripts for automatic execution via SQL with Oracle Database DBMS_SCHEDULER PL/SQL package


    Users can interactively develop R scripts using their favorite R IDE, and then deploy the script as an R function to the database where it can be invoked either from R or SQL. Embedded R execution facilitates application use of R scripts with better performance and throughput than using a client-side R engine. Executing R scripts from SQL enables integration of R script results with OBIEE, Oracle BI Publisher, and other SQL-enabled tools for structured data, R objects, and images.

    Table 1 provides a summary of the embedded R execution functions and R script repository functions available. The function f refers to the user-defined R code, or script, that is provided as either an R function object or a named R function in the database R script repository. The functions for creating and dropping entries in the R script repository are also listed in Table 1.


    R API: ore.doEval        SQL API: rqEval
      Executes f with no automatic transfer of data.

    R API: ore.tableApply    SQL API: rqTableEval
      Executes f passing all rows of the provided input ore.frame as the first parameter of f. The first parameter is provided as a data.frame.

    R API: ore.groupApply    SQL API: “rqGroupEval” (must be explicitly defined as a function by the user)
      Executes f by partitioning data according to an “index” column’s values. Each data partition is provided as a data.frame in the first parameter of f. Supports parallel execution of each f invocation in a pool of database server-side R engines.

    R API: ore.rowApply      SQL API: rqRowEval
      Executes f passing a specified number of rows (a “chunk”) of the provided input ore.frame. Each chunk is provided as a data.frame in the first parameter of f. Supports parallel execution of each f invocation in a pool of database server-side R engines.

    R API: ore.indexApply    SQL API: N/A
      Executes f with no automatic transfer of data, but provides the index of the invocation, 1 through n, where n is the number of functions to invoke. Supports parallel execution of each f invocation in a pool of database server-side R engines.

    R API: ore.scriptCreate  SQL API: sys.rqScriptCreate
      Loads the provided R function into the R script repository with the provided name.

    R API: ore.scriptDrop    SQL API: sys.rqScriptDrop
      Removes the named R function from the R script repository.

    Table 1: Embedded R Execution and R Script Repository function summary


    Using ore.doEval and rqEval


    The first embedded R execution functions we cover are also the simplest. The R function ore.doEval and the SQL function rqEval do not automatically receive any data from the database. They simply execute the function f provided. Any needed data is either generated within f or explicitly retrieved from a data source such as Oracle Database, other databases, or flat files.

    R API


    In the R interface, users can specify an R function as an argument to ore.doEval, or use the name of that function as stored in the R script repository. For example, the function RandomRedDots returns a data.frame with two columns and plots 100 random normal values. To invoke the function through the database server requires minimal specification with ore.doEval.

    RandomRedDots <- function(divisor=100){
      id <- 1:10
      plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2)
      data.frame(id=id, val=id / divisor)
    }
    ore.doEval(RandomRedDots)


    The resulting image is displayed at the client, but it is generated by the database server R engine that executed the function f.


    We can provide arguments to f as well. To override the divisor argument, provide it as an argument to ore.doEval. Note that any number of parameters, including more complex R objects such as models, can be passed as arguments this way.

    ore.doEval(RandomRedDots, divisor=50)


    Behind the scenes: when passing the function itself (as above), ORE implicitly stores the function in the R script repository before executing it. When finished executing, the function is dropped from the repository. If we want to store this function explicitly in the repository, we can use ore.scriptCreate:

    ore.scriptCreate("myRandomRedDots",
    RandomRedDots)


    Now, the function can be invoked by name:

    ore.doEval(FUN.NAME="myRandomRedDots",divisor=50)


    The return value of f is a data.frame; however, if we capture the result in a variable, we’ll notice two things: the class of the return value is ore.object, and the image does not display.

    res <- ore.doEval(FUN.NAME="myRandomRedDots",
    divisor=50)
    class(res)



    To get back the data.frame, we must invoke ore.pull to pull the result to the client R engine.

    res.local <- ore.pull(res)
    class(res.local)


    If we want to return an ore.frame instead of an ore.object, we specify the argument FUN.VALUE, which describes the structure of the result.

    res.of <- ore.doEval(FUN.NAME="myRandomRedDots", divisor=50,
    FUN.VALUE= data.frame(id=1, val=1))
    class(res.of)


    SQL API


    Now, we’ll look at executing the same R function f using the SQL interface of embedded R execution, while pointing out a few significant differences in the API. The first difference is that the R functions are defined as strings, not R objects. This should be no surprise since we’ll be using a SQL interface like SQL Developer or SQL*Plus. Also, the R function string cannot be passed directly in the rqEval function, but must first be stored in the R script repository. The call to sys.rqScriptCreate must be wrapped in a BEGIN-END PL/SQL block.

    begin
      sys.rqScriptCreate('myRandomRedDots2',
        'function(divisor=100, numDots=100){
           id <- 1:10
           plot(1:numDots, rnorm(numDots), pch = 21, bg = "red", cex = 2)
           data.frame(id = id, val = id / divisor)
         }');
    end;
    /


    Invoking the function myRandomRedDots2 occurs in a SQL SELECT statement as shown below. The first NULL argument to rqEval indicates that no arguments are supplied to the function myRandomRedDots2. In the SQL API, we can ask for the data.frame returned by f to appear as a SQL table. For this, the second parameter can take a SQL string that describes the column names and data types that correspond to the returned data.frame. You can provide a prototype row using the dual dummy table; however, the select statement can be based on an existing table or view as well.

    select *
    from table(rqEval(NULL, 'select 1 id, 1 val from dual', 'myRandomRedDots2'));


    To pass parameters in SQL, we can replace the first NULL argument with a cursor that specifies a single row of scalar values. Multiple arguments can be specified as shown below. Note that argument names are case sensitive, so it is best to include column names in double quotes. Note also that the first argument is a cursor whereas the second parameter is a string. The former provides data values, whereas the latter is parsed to determine the structure of the result.

    select *
    from table(rqEval(cursor(select 50 "divisor", 500 "numDots" from dual),
    'select 1 id, 1 val from dual',
    'myRandomRedDots2'));


    When specifying a table structure for the result as above, any image data is discarded. To get back both structured data and images, we replace the second argument with ‘XML’. This instructs the database to generate an XML string, first with any structured or semi-structured R objects, followed by the image or images generated by the R function f. Images are returned as a base 64 encoding of the PNG representation.

    select *
    from table(rqEval(cursor(select 50 "divisor", 500 "numDots" from dual),
    'XML',
    'myRandomRedDots2'));


    Advanced features


    To establish a connection to Oracle Database within the R function f, a special argument ore.connect can be set to TRUE. This uses the credentials of the user who invoked the embedded R function ore.doEval or rqEval to establish a connection and also automatically load the ORE package. This capability can be useful to explicitly use the ORE Transparency Layer or to save and load objects with ORE R object datastores.

    RandomRedDots <- function(divisor=100, datastore.name="myDatastore"){
      id <- 1:10
      plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2)
      ore.load(datastore.name) # contains numeric variable myVar
      data.frame(id=id, val=id / divisor, num=myVar)
    }


    ore.doEval(RandomRedDots, datastore.name="datastore_1", ore.connect=TRUE)


    Notice the additions: we pass the name of a datastore to load, and that datastore is expected to contain a variable myVar. Arguments prefixed with ‘ore.’ are control arguments and are not passed to f. Other control arguments include ore.drop, which if set to TRUE converts a one-column input data.frame to a vector; ore.graphics, which if set to TRUE starts a graphics driver to capture images returned from f; and ore.png.*, which provides additional parameters for the PNG graphics device. The ore.png.* control arguments include (replace * with): width, height, units, pointsize, bg, res, type, etc.
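    For example, a hedged sketch of adjusting the server-side PNG device through these control arguments:

    res <- ore.doEval(FUN.NAME = "myRandomRedDots", divisor = 50,
                      ore.graphics = TRUE,
                      ore.png.width = 700, ore.png.height = 500)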

    In the next post, we will explore ore.tableApply and rqTableEval.
