Friday Aug 17, 2012

Experience with Oracle R Enterprise in the Oracle micro-processor tools environment

ore.stepwise for high-performance stepwise regression

This guest post from Alexandre Ardelea describes the Oracle R Enterprise deployment in the Oracle micro-processor tools environment.

The deployment of ORE within the Oracle micro-processor tools environment introduced a technology that significantly expands our data analysis capabilities and opens the door to new applications.

Oracle R Enterprise (ORE) was recently deployed in the Oracle micro-processor tools environment, replacing a popular commercial tool as a production engine for data analysis.  Fit/response models are important components of the simulation flows in this environment; such models are used for a variety of purposes ranging from library generation to design yield prediction and optimization. Several tools were targeted for the migration to ORE; these tools are constructed around an iterative loop that processes hundreds of data structures. At each iteration, a simulator engine generates data for multiple figures of merit (targets), and a fit engine is called to construct response models for each target. The fit models are assembled into libraries for subsequent simulation within other flows or used on the fly.

A common characteristic of these models is that they need to express strong nonlinear relations between the targets and large sets of explanatory variables. Multiple interactions and many non-linear dependencies are considered as candidate effects for the model construction; they often result from an automatic generation procedure attempting to cover a large variety of anticipated relations between the dependent and independent variables. For example, for a case with O[10^2] main explanatory variables, the total number of candidate regressors, nregs_tot, could quickly rise to O[10^3-10^4]. Linear regression models with such a high number of terms are not only too expensive to use, as they have, potentially, a large number of terms with negligible coefficients, but are also likely to lead to instability and inaccuracy problems. For example, overfitting is an important concern with models expressing fluctuations in the data rather than capturing the trend. Moreover, if strong quasi-linear interactions occur between large numbers of regressors, the variance of the calculated (model) coefficients can be massively inflated.
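The variance inflation caused by near-linear dependencies between regressors can be reproduced in a few lines of base R. The following is a minimal sketch on synthetic data; the variable names, sample size, and noise levels are illustrative assumptions and are not taken from the deployment described here.

```r
# Synthetic illustration: the same response fitted once with an independent
# second regressor and once with a nearly collinear one.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2_indep <- rnorm(n)                    # unrelated to x1
x2_coll  <- x1 + rnorm(n, sd = 0.01)    # almost a copy of x1
y  <- 1 + 2 * x1 + rnorm(n)

fit_indep <- lm(y ~ x1 + x2_indep)
fit_coll  <- lm(y ~ x1 + x2_coll)

# Standard error of the x1 coefficient in each fit
se_indep <- summary(fit_indep)$coefficients["x1", "Std. Error"]
se_coll  <- summary(fit_coll)$coefficients["x1", "Std. Error"]

# Variance inflation factor for x1 in the collinear fit: 1 / (1 - R^2),
# where R^2 comes from regressing x1 on the other regressor
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2_coll))$r.squared)

print(c(se_indep = se_indep, se_coll = se_coll, vif_x1 = vif_x1))
```

The standard error of the x1 coefficient grows by orders of magnitude in the collinear fit even though the response is generated identically, which is the instability that motivates reducing the candidate set.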

In order to reduce the size of the fit models while retaining significant dependencies, we use stepwise regression. Stepwise regression is an iterative fitting method which builds the model incrementally by adding and eliminating regressors (from a specified candidate set) using a pre-defined selection mechanism or test of statistical significance; the model converges when all regressors retained in the model satisfy the significance test criteria. Commercial and open source regression packages offer various flavors of stepwise regression, which differ in the choice of the selection mechanism (F-test, information criteria such as AIC and BIC, etc.), the choice of the 'direction' (forward, backward, or both), the flexibility for specifying the model scope, the significance threshold(s), the handling of interaction terms, etc.
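As a point of reference, one of these flavors is available in base R as step, which selects by an information criterion. The sketch below runs it on a synthetic data set; the data, the variable names, and the choice of the BIC penalty are assumptions made only for illustration.

```r
# Base R's step() selects by AIC; k = log(n) switches the penalty to BIC.
set.seed(1)
n   <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x3 + 1.5 * dat$x1 * dat$x3 + rnorm(n)

# Scope: intercept-only lower bound, main effects plus pairwise
# interactions as the upper bound
scope <- list(lower = y ~ 1, upper = y ~ (x1 + x2 + x3 + x4)^2)

fit <- step(lm(y ~ 1, data = dat), scope = scope,
            direction = "both", trace = 0, k = log(n))

# Terms retained in the final model
selected_terms <- attr(terms(fit), "term.labels")
print(selected_terms)
```

Starting from the lower bound with direction = "both" lets the procedure add terms and later remove any that become redundant.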

ORE provides a proprietary in-database algorithm for stepwise regression, ore.stepwise, which complements the functionality of R's step and, above all, offers a very significant performance improvement through faster, scalable algorithms and in-database execution. The basic syntax is the following:

ore.stepwise(start_model, DB_data, model_scope, direction, add.p, drop.p)

ore.stepwise allows the user to specify a model scope and stepwise direction, and uses the F-test for regressor selection with the add.p and drop.p significance levels for adding and removing regressors while the model is iteratively constructed.
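The add.p / drop.p mechanism can be illustrated with partial F-tests in base R. The helper below, f_stepwise, is hypothetical: it is not the ore.stepwise implementation, it works on an in-memory data.frame rather than in the database, and its name, arguments, and default thresholds are assumptions chosen for this sketch.

```r
# Hypothetical helper: F-test-based stepwise selection in base R.
# Illustrates add.p / drop.p only; NOT the ore.stepwise algorithm.
f_stepwise <- function(response, candidates, data,
                       add.p = 0.01, drop.p = 0.05, max_iter = 50) {
  # Fit a linear model containing the given terms (intercept only if none)
  fit_terms <- function(terms) {
    rhs <- if (length(terms) > 0) terms else "1"
    lm(reformulate(rhs, response = response), data = data)
  }
  # p-value of the partial F-test comparing two nested fits
  f_pvalue <- function(small, big) anova(small, big)[["Pr(>F)"]][2]

  selected <- character(0)
  for (iter in seq_len(max_iter)) {
    changed <- FALSE
    current <- fit_terms(selected)

    # Forward step: add the most significant remaining candidate
    remaining <- setdiff(candidates, selected)
    if (length(remaining) > 0) {
      p_add <- vapply(remaining, function(term)
        f_pvalue(current, fit_terms(c(selected, term))), numeric(1))
      if (min(p_add) < add.p) {
        selected <- c(selected, remaining[which.min(p_add)])
        current  <- fit_terms(selected)
        changed  <- TRUE
      }
    }

    # Backward step: drop the least significant retained regressor
    if (length(selected) > 0) {
      p_drop <- vapply(selected, function(term)
        f_pvalue(fit_terms(setdiff(selected, term)), current), numeric(1))
      if (max(p_drop) > drop.p) {
        selected <- setdiff(selected, selected[which.max(p_drop)])
        changed  <- TRUE
      }
    }

    if (!changed) break   # converged: nothing added or dropped
  }
  fit_terms(selected)
}

# Usage on synthetic data where only x1 and x3 drive the response
set.seed(7)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 2 * dat$x1 - 3 * dat$x3 + rnorm(n)
fit <- f_stepwise("y", c("x1", "x2", "x3", "x4"), dat)
print(names(coef(fit)))
```

Keeping add.p below drop.p makes it less likely that a regressor is added and removed in alternation; max_iter is a simple safeguard against that case.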

A typical usage for ore.stepwise is illustrated below:

  list_modnames <- c("mdlA","mdlB",...)
  Effects_mdlA <- ...
  Effects_mdlB <- ...
  ...
  for(modname in list_modnames){
    # upper bound: the candidate effects specific to this target
    Upper_model <- as.formula(sprintf("%s ~ %s", modname,
      get(sprintf("Effects_%s", modname))))
    # lower bound: intercept only
    Lower_model <- as.formula(sprintf("%s ~ 1", modname))
    Scope_model <- list(lower = Lower_model, upper = Upper_model)
    ...
    ...
    # store the fitted model under the name <target>_model
    assign(sprintf("%s_model", modname),
      ore.stepwise(Lower_model, data = dataDB, scope = Scope_model,
        direction="both", add.p=..., drop.p=...))
    ...
  }
 
The in-database ore frame dataDB contains observations for several target models (mdlA, mdlB, ...), and a list with all target names (list_modnames) is assembled for iterative processing. For each target, a model scope is specified within lower and upper bounds. In the example above the lower bound is the intercept, but the upper bound is customized so that each target model can be constructed from its own collection of regressors. The results shown in Figure 1 illustrate the performance difference between ore.stepwise in ORE and base R’s step function for both a bilinear and a fully quadratic model constructed from 34 independent variables and 10k data observations.
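The per-target scope construction can be tried without a database connection. In this sketch the target names and the effect strings are hypothetical placeholders, used only to show how sprintf, get and as.formula combine into lower and upper bounds.

```r
# Hypothetical target names and candidate-effect strings (illustration only)
list_modnames <- c("mdlA", "mdlB")
Effects_mdlA  <- "(x1 + x2)^2"          # main effects plus the x1:x2 interaction
Effects_mdlB  <- "x1 + x3 + I(x3^2)"    # a different candidate set

scopes <- list()
for (modname in list_modnames) {
  # Look up the effect string for this target by name
  upper <- as.formula(sprintf("%s ~ %s", modname,
                              get(sprintf("Effects_%s", modname))))
  lower <- as.formula(sprintf("%s ~ 1", modname))
  scopes[[modname]] <- list(lower = lower, upper = upper)
}

# Candidate regressors that each upper bound expands to
candidate_terms <- lapply(scopes, function(s) attr(terms(s$upper), "term.labels"))
print(candidate_terms)
```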


Bilinear model

method         R^2      Number of Regressors   mean(rel_error)   Elapsed Time (seconds)
step           0.9658   86                     3.52e-02          2110.0
ore.stepwise   0.9966   124                    3.50e-02          32.1

Performance difference: ore.stepwise is approx. 66X faster than step at similar R^2 and relative error.

Quadratic model

method         R^2      Number of Regressors   mean(rel_error)   Elapsed Time (seconds)
step           0.9962   154                    1.05e-02          12600.0
ore.stepwise   0.9963   210                    1.04e-02          69.5

Performance difference: ore.stepwise is approx. 180X faster than step at similar R^2 and relative error.

Figure 1: Comparison of results for R's step function and ORE's ore.stepwise function for both bilinear and quadratic models


ore.stepwise is faster than R's step by a factor of 66-180X. The larger the data set and the number of regressors, the greater the performance advantage we observed for ore.stepwise over R's step.  The models produced by R’s step and ore.stepwise have a different number of regressors because both the selection mechanisms and interaction terms are handled differently.  step favors the main terms: x1:x2 will be added only if x1 and x2 were previously added and, conversely, x1:x2 will be eliminated before x1 and x2 are eliminated, whereas ore.stepwise does not differentiate between main terms and interactions. With respect to collinearity, ore.stepwise detects strict linear dependencies and eliminates from the start the regressors involved in multi-collinear relations.
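For readers who want to see what a strict linear dependency looks like in practice, the following base R sketch detects one with a QR decomposition. It is only an illustration of the idea on synthetic data; how ore.stepwise performs this check internally is not described here, and the variable names are assumptions.

```r
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2                 # exact linear combination of x1 and x2
y  <- 1 + x1 - x2 + rnorm(n)

# Design matrix with intercept, x1, x2 and the redundant x3
X    <- model.matrix(~ x1 + x2 + x3)
qr_X <- qr(X)

# Columns pivoted beyond the numerical rank are the redundant ones
redundant <- colnames(X)[qr_X$pivot[-seq_len(qr_X$rank)]]
print(redundant)

# lm() reports the same dependency as an NA coefficient
coefs <- coef(lm(y ~ x1 + x2 + x3))
print(coefs)
```

Removing such regressors before the stepwise iterations start avoids fitting rank-deficient candidate models.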

In summary, the ORE capabilities for stepwise regression far surpass similar functionality in tools we considered as alternatives to ORE. The deployment of ORE within the Oracle micro-processor tools environment introduced a technology which significantly expands the data analysis capabilities through the R ecosystem combined with in-database high performance algorithms and opens the door to new applications. This technology leverages the flexibility and extensibility of the R environment and allows massive and complex data analysis sustained by the scalability and performance of the Oracle database for Big Data.

Alexandre Ardelea is a principal hardware engineer at Oracle Corporation. Alex has a PhD and MS in Physics from Ecole polytechnique fédérale de Lausanne and post-doctoral research in non-linear physics, CFD and parallel methods. Alex's specialities include response surface modeling, optimization strategies for multi-parametric/objective/constraint problems, statistical process characterization, circuit analysis and RF algorithms. 

_______________________________________________________________________________________________________________

Oracle R Enterprise (ORE) implements a transparency layer on top of the R engine that allows R computations specified in the R environment to be pushed for execution in Oracle Database. A mapping is established between a special R object called an ore frame and a corresponding database table or view, allowing analytical functions such as ore.stepwise to be executed on these mapped objects in Oracle Database. The overloaded functions in the ORE packages generate SQL statements in the background, export the expensive computations to the database for execution, and return results to the R environment.  Here is a simple example using ore.stepwise with the longley data, which is shipped in the datasets package with R:


# load the Oracle R Enterprise library and connect to Oracle Database
R> library(ORE)
R> ore.connect(user = "rquser", host = "localhost", sid = "oracle", 
     password = "rquser")

R> # push the longley data to a database table:
R> LONGLEY <- ore.push(longley)

R> class(LONGLEY)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"

# Fit full model
R> oreFit <- ore.lm(Employed ~ ., data = LONGLEY)
R> summary(oreFit)

Call:
ore.lm(formula = Employed ~ ., data = LONGLEY)

Residuals:
     Min       1Q   Median       3Q      Max
-0.41011 -0.15980 -0.02816  0.15681  0.45539

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 **
GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141
GNP          -3.582e-02  3.349e-02  -1.070 0.312681
Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 **
Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
Population   -5.110e-02  2.261e-01  -0.226 0.826212
Year          1.829e+00  4.555e-01   4.016 0.003037 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared: 0.9955,     Adjusted R-squared: 0.9925
F-statistic: 330.3 on 6 and 9 DF,  p-value: 4.984e-10


# perform stepwise variable selection
R> oreStep <- ore.stepwise(Employed ~ .^2, data = LONGLEY, add.p =
0.1, drop.p = 0.1)

# View a summary of ore.stepwise object
R> summary(oreStep)

Call:
ore.stepwise(formula = Employed ~ .^2, data = LONGLEY, add.p = 0.1,
    drop.p = 0.1)

Residuals:
   Min     1Q Median     3Q    Max
 18.42  22.08  24.08  29.40  33.19

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)             -3.539e-01  8.455e-03 -41.852  < 2e-16 ***
Year                     3.589e-05  1.821e-05   1.971   0.0675 .
GNP.deflator:GNP        -2.978e-03  3.039e-04  -9.800 6.51e-08 ***
GNP.deflator:Unemployed  2.326e-04  5.720e-06  40.656  < 2e-16 ***
GNP.deflator:Population  2.303e-05  2.293e-06  10.044 4.72e-08 ***
GNP:Armed.Forces         6.875e-06  8.415e-07   8.170 6.64e-07 ***
GNP:Year                 2.007e-04  1.890e-05  10.618 2.26e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9829 on 15 degrees of freedom
Multiple R-squared: 0.9972,     Adjusted R-squared: 0.9961
F-statistic:  3242 on 6 and 15 DF,  p-value: < 2.2e-16

For more information on how Oracle R Enterprise leverages Oracle Database as a computational engine, see the Oracle R Training Series. Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition. We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop.  As always, we welcome comments and questions on the Oracle R Forum.

Thursday Aug 02, 2012

Data Parallelism Using Oracle R Enterprise

Modern computer processors are adequately optimized for many statistical calculations, but large data operations may require hours or days to return a result.  Oracle R Enterprise (ORE), a set of R packages designed to process large data computations in Oracle Database, can run many R operations in parallel, significantly reducing processing time. ORE supports parallelism through the transparency layer, where the database is used as a computational engine, and embedded R execution, where R scripts can be executed in a data parallel manner.

The backbone of parallel computing is breaking down a resource intensive computation into chunks that can be performed independently, while maintaining a framework that allows for the results of those independent computations to be combined.  Writing parallel code is typically trickier than writing serial code, but this is simplified using ORE, as there is no need for the user to create worker instances or combine results. Using the transparency layer, users simply execute their ORE code and the database implicitly manages the entire process, returning results for further processing.
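For comparison, this is roughly what the manual version of that pattern looks like in plain R with the parallel package, outside ORE. It is a minimal sketch: the chunk count, the core count, and the toy computation are arbitrary choices made for illustration.

```r
library(parallel)

# Forking is unavailable on Windows, so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

x <- as.numeric(1:1e5)

# 1. Break the work into independent chunks
chunks <- split(x, cut(seq_along(x), 4, labels = FALSE))

# 2. Process each chunk independently (here: a partial sum of squares)
partial <- mclapply(chunks, function(chunk) sum(chunk^2), mc.cores = n_cores)

# 3. Combine the partial results into the final answer
total <- Reduce(`+`, partial)
print(total)
```

The user is responsible for all three steps here; with the ORE transparency layer the splitting, scheduling, and combining are left to the database.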

With ORE, each R function invocation that operates on an ORE object, such as ore.frame, is translated to a SQL statement behind the scenes. This SQL, which may be stacked after several function invocations, undergoes optimization and parallelization when parsed and executed. This technique enables deferred evaluation, but that's a topic for another blog. Depending on the resource requirements of the statement, the database decides if it should leverage parallel execution. 

For embedded R execution, database degree of parallelism settings help determine the number of parallel R engines to start.  When data parallel functions execute in parallel, each unit of work is sent to a different R external process, or extproc, at the database server. The results are automatically collated and returned as R-proxy objects, e.g., ore.frame objects, in the R interface and SQL objects in the SQL interface, which can be processed further in R or by SQL functions. The SQL functions enable the operationalization or productization of R scripts as part of a database-based application, in what we refer to as "lights out" mode.

In the ORE Transparency Layer, where the database executes SQL generated from overloaded R functions, parallelism is automatic, assuming the database or table is configured for parallelism. Parallel computations in the transparency layer are ideal for bigger data where functionality exists in the database.

Using Embedded R Script Execution, parallelism is enabled for row, group and index operations if specified using a function parameter or parallel cursor hint:

  • ore.groupApply and rqGroupEval* split the data into grouped partitions and invoke the R function on each partition in a separate engine at the database server.
  • ore.rowApply and rqRowEval split the data into row chunks and invoke the R function on each chunk in a separate engine at the database server.
  • ore.indexApply runs an R function x times, with each iteration of the function invoked in a separate engine at the database server.
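The three partitioning patterns in the list above have simple local analogues in base R, which can help when prototyping the function that will later run at the server. The sketch below is serial and uses a made-up data frame; it shows only how the data is divided among function calls and does not call any ORE function.

```r
# Local, serial analogues of the three data-parallel patterns (base R only).
# In ORE each function call below would run in its own R engine at the server.
dat <- data.frame(g = rep(c("a", "b", "c"), each = 4), v = 1:12)

# Group pattern (cf. ore.groupApply): one call per partition of a grouping column
by_group <- lapply(split(dat, dat$g), function(part) mean(part$v))

# Row pattern (cf. ore.rowApply): one call per chunk of rows
chunk_id <- ceiling(seq_len(nrow(dat)) / 5)     # chunks of up to 5 rows
by_rows  <- lapply(split(dat, chunk_id), function(part) nrow(part))

# Index pattern (cf. ore.indexApply): one call per index value 1..x
by_index <- lapply(seq_len(3), function(i) i^2)

print(unlist(by_group))
print(unlist(by_rows))
print(unlist(by_index))
```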

With embedded R execution, the expectation is that the database server machine has greater RAM and CPU capacity than the user's client machine, so executing R scripts at the server inherently allows larger data sets to be processed by an individual R engine.

In addition, users can include contributed R packages in their embedded R scripts. Consider an example using a sample of the airline on-time performance data from the Research and Innovative Technology Administration (RITA), which coordinates the U.S. Department of Transportation (DOT) research programs. The data sample consists of 220K records of U.S. domestic commercial flights between 1987 and 2008.

We use the R interface to embedded R to partition the airline data table (ONTIME_S) by the DAYOFWEEK variable, fit a linear model using the biglm package, and then combine the results. Note: To run this example, the biglm package must be installed on both the database server and client machine.

res <- ore.groupApply(ONTIME_S,
               INDEX = ONTIME_S$DAYOFWEEK,
               parallel = TRUE,
               function(dat) {
                library(biglm)
                library(ORE)
                ore.connect("rquser", "orcl", "localhost", "rquser")
                biglm(ARRDELAY ~ DEPDELAY + DISTANCE, dat)
               })
R> summary(res$Monday)
Large data regression model: biglm(ARRDELAY ~ DEPDELAY + DISTANCE, dat)
Sample size =  31649
               Coef    (95%     CI)     SE     p
(Intercept)  0.5177  0.2295  0.8058 0.1441 3e-04
DEPDELAY     0.9242  0.9178  0.9305 0.0032 0e+00
DISTANCE    -0.0014 -0.0017 -0.0011 0.0002 0e+00

The call to ore.groupApply uses Oracle Database to partition the ONTIME_S table by the categories in the DAYOFWEEK variable.  Each category is sent to an R engine at the database server machine to apply the R function in parallel.  The individual category results are combined in the returned result.  Using embedded R alleviates the typical memory problems associated with running R serially because each R engine holds only a single partition, or day of the week, in memory. On a Linux server with 8 GB RAM and 4 CPUs, fitting the model in parallel by setting parallel = TRUE in the call to ore.groupApply reduces the processing time from approximately 30 seconds to 10 seconds.

If the goal is to integrate the model results as an operationalized process, we can use rqGroupEval, the SQL interface equivalent to ore.groupApply.  We create a script to set up the structure of the input and grouping column and then run the script in SQL. The nature of pipelined table functions requires that we explicitly represent the type of the result, captured in the package, and create a function that includes the column used for partitioning explicitly.

# setup  

SQL> CREATE PACKAGE airlinePkg AS
  2  TYPE cur IS REF CURSOR RETURN ONTIME_S%ROWTYPE;
  3  END airlinePkg;
  4  /

Package created.

SQL> CREATE FUNCTION ontimeGroupEval(
  inp_cur  airlinePkg.cur,
  par_cur  SYS_REFCURSOR, 
  out_qry  VARCHAR2,
  grp_col  VARCHAR2,
  exp_txt  CLOB)
RETURN SYS.AnyDataSet
PIPELINED PARALLEL_ENABLE (PARTITION inp_cur BY HASH (DAYOFWEEK))
CLUSTER inp_cur BY (DAYOFWEEK)
USING rqGroupEvalImpl;
/

# model build

alter table ONTIME_S parallel;

SQL> begin
  sys.rqScriptCreate('GroupingExample',
 'function(dat) {
      library(biglm)
      library(ORE)
      ore.connect("rquser", "orcl", "localhost", "rquser")
      result <- biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
      result}');
end;
/

create table ONTIME_LM as
select *
  from table(ontimeGroupEval(
         cursor(select /*+ parallel(ONTIME_S)*/ *
         from ONTIME_S),
         NULL, NULL, 'DAYOFWEEK', 'GroupingExample'));

begin
  sys.rqScriptDrop('GroupingExample');
end;
/


We use a parallel hint on the cursor that is the input to our rqGroupEval function to enable Oracle Database to use parallel R engines.  In this case, using the same Linux server, the processing time is reduced from approximately 25 seconds to 7 seconds as we used 7 parallel R engines (one for each day of the week) on a single server.  Of course, a real-world scenario may utilize hundreds of parallel engines across many servers, returning results on large amounts of data in a short period of time.

Additional details on how parallel execution works in Oracle Database can be found here. We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop.  As always, we welcome comments and questions on the Oracle R Forum.

*To enable execution of an R script in the SQL interface, ORE provides variants of ore.doEval, ore.groupApply and ore.indexApply in SQL. These functions are rqEval, rqTableEval, rqRowEval and rqGroupEval. The ore.groupApply feature does not have a direct parallel in the SQL interface. We refer to rqGroupEval as a concept; however, specific code is required to enable this feature. This is highlighted in the second example.
