Thursday Oct 17, 2013

ORE graphics using Remote Desktop Protocol

Oracle R Enterprise graphics are returned as raster, or bitmap graphics. Raster images consist of tiny squares of color information referred to as pixels that form points of color to create a complete image. Plots that contain raster images render quickly in R and create small, high-quality exported image files in a wide variety of formats.

However, it is a known issue that the rendering of raster images can be problematic when creating graphics using a Remote Desktop connection. Raster images do not display in the windows device using Remote Desktop under the default settings. This happens because Remote Desktop restricts the number of colors when connecting to a Windows machine to 16 bits per pixel, and interpolating raster graphics requires many colors, at least 32 bits per pixel..

For example, this simple embedded R image plot will be returned in a raster-based format using a standalone Windows machine:

 R> library(ORE)
 R> ore.connect(user="rquser", sid="orcl", host="localhost", password="rquser", all=TRUE)
 R> ore.doEval(function() image(volcano, col=terrain.colors(30)))


Here, we first load the ORE packages and connect to the database instance using database login credentials. The ore.doEval function executes the R code within the database embedded R engine and returns the image back to the client R session.

Over a Remote Desktop connection under the default settings, this graph will appear blank due to the restricted number of colors. Users who encounter this issue have two options to display ORE graphics over Remote Desktop: either raise Remote Desktop's Color Depth or direct the plot output to an alternate device.

Option #1: Raise Remote Desktop Color Depth setting

In a Remote Desktop session, all environment variables, including display variables determining Color Depth, are determined by the RCP-Tcp connection settings. For example, users can reduce the Color Depth when connecting over a slow connection. The different settings are 15 bits, 16 bits, 24 bits, or 32 bits per pixel. To raise the Remote Desktop color depth:

On the Windows server, launch Remote Desktop Session Host Configuration from the Accessories menu.
Under Connections, right click on RDP-Tcp and select Properties.
On the Client Settings tab either uncheck LimitMaximum Color Depth or set it to 32 bits per pixel.




Click Apply, then OK, log out of the remote session and reconnect.

After reconnecting, the Color Depth on the Display tab will be set to 32 bits per pixel.  Raster graphics will now display as expected. For ORE users, the increased color depth results in slightly reduced performance during plot creation, but the graph will be created instead of displaying an empty plot.

Option #2: Direct plot output to alternate device

Plotting to a non-windows device is a good option if it's not possible to increase Remote Desktop Color Depth, or if performance is degraded when creating the graph. Several device drivers are available for off-screen graphics in R, such as postscript, pdf, and png. On-screen devices include windows, X11 and Cairo.

Here we output to the Cairo device to render an on-screen raster graphic.  The grid.raster function in the grid package is analogous to other grid graphical primitives - it draws a raster image within the current plot's grid.

 R> options(device = "CairoWin") # use Cairo device for plotting during the session
 R> library(Cairo) # load Cairo, grid and png libraries
 R> library(grid)
 R> library(png)
 R> res <- ore.doEval(function()image(volcano,col=terrain.colors(30))) # create embedded R plot
 R> img <- ore.pull(res, graphics = TRUE)$img[[1]] # extract image
 R> grid.raster(as.raster(readPNG(img)), interpolate = FALSE) # generate raster graph
 R> dev.off() # turn off first device

 


By default, the
interpolate argument to grid.raster is TRUE, which means that what is actually drawn by R is a linear interpolation of the pixels in the original image. Setting interpolate to FALSE uses a sample from the pixels in the original image.

A list of graphics devices available in R can be found in the Devices help file from the grDevices package:

R> help(Devices)

Friday Oct 11, 2013

Take Oracle R Enterprise for a Test Drive

If you'd like try Oracle R Enterprise, Oracle Partner Vlamis Software Solutions provides a quick and easy way to get started using a virtual machine (VM) loaded with all the software you require and hosted on Amazon Web Services (AWS).  Follow this link and within a few clicks you'll have a "Remote Desktop" connection to the cloud with sample scripts for you to explore both the R language and Oracle R Enterprise, both from R and SQL.

Oracle R Enterprise, a component of the Oracle Advanced Analytics option, makes the open source R statistical programming language and environment ready for the enterprise and big data. It provides a comprehensive, database-centric environment for end-to-end analytical processes in R, with immediate deployment to production environments. R users can operationalize entire R scripts in production applications, thereby eliminating porting of R code to other languages or reinventing code to integrate R results into existing applications. Oracle R Enterprise allows users to seamlessly leverage Oracle Database as a high performance computing (HPC) environment for R scripts, providing data parallelism and resources management.


Wednesday Oct 02, 2013

Managing Memory Limits and Configuring Exadata for Embedded R Execution

An R engine can consume significant memory resources in the course of running R scripts. R users who work with Oracle R Enterprise Embedded R Execution on sizable data, especially application designers and database administrators (DBAs), have a vested interest in understanding and controlling the memory demands of R script execution to help ensure that sufficient memory resources are available for both their application and Oracle Database. ORE Embedded R Execution enables running R scripts managed by Oracle Database, both through R and SQL APIs. The SQL API enables seamless integration with database-based applications, data-parallel and task-parallel R script execution, and ease of production deployment.

To provide greater control over R memory consumption, Oracle R Enterprise provides a privileged SQL function for configuring a database server with R memory limits. In this blog post, we provide a discussion of R memory usage and garbage collection, and how this SQL function can be used to limit the amount of memory consumed by individual R engines started as part of ORE’s embedded R execution framework. We follow with an example of involving memory limit calculations on Exadata and some recommendations for DBAs to consider when configuring Exadata for embedded R execution. Note that such calculations and configuration settings are applicable to non-Exadata (single instance or custom RAC) environments as well. At the end, there a “tip of the day” for R memory management.

Garbage Collection as a concept

For those familiar with languages like C, memory is explicitly managed by the programmer through invocations of functions to allocate and free memory (malloc, calloc, free). Failing to free memory when finished with it results in “memory leaks” that can cause a process to consume (or exhaust) memory unnecessarily, often resulting in a program or system crash.

To alleviate programmers from this burden, languages like R and Java rely on garbage collection. “Garbage” is memory that is no longer being used, i.e., no longer referenceable, within your program. With garbage collection, programmers avoid dealing with memory management. The underlying system determines what memory is used or available, and frees memory periodically. Garbage collection, however, is not a panacea. Garbage collection can take time to process, e.g., on the order of seconds, which can make response time for certain functions unpredictable – although modern garbage collection mechanisms have largely mitigated this drawback. In addition, when garbage collection occurs is essentially non-deterministic, depending on heuristics set up by the language implementation. This means that memory may be retained longer than necessary.

Memory in R

In R, memory can be characterized along two dimensions: memory allocated for vectors and arrays (referred to as Vcells), and memory allocated for objects such as lists (referred to as “cons” cells or Ncells). When invoking R’s garbage collection function, gc(), you’ll see results like these:

The function gc() returns a matrix with rows Ncells, which corresponds to the cons cells, and Vcells, which corresponds to vector heap memory. The Ncells are 56 bytes/cell (49.2*1024*1024/.920477) on a 64-bit machine, and Vcells are ~8 bytes/cell (22.6*1024*1024/2.956944). The “used” column indicates the number of cells allocated, along with their corresponding megabytes. The column “gc trigger” indicates at what point garbage collection will kick in. The column “max used” indicates the maximum space used since the last call to gc(reset=TRUE) or since R started if gc(reset=TRUE) wasn’t invoked.

As an example of affecting Ncells, consider the following example where we initialize a list as a sequence of 100K numbers. We see that roughly 5.4 MB of RAM were consumed for the 100K cells.

For Vcells, we create a vector of 1M elements. This consumes roughly 3.8 MB of RAM for the 1M cells. R optimizes for integers.

The same test with floats consumes 7.6 MB of RAM for the 1M cells of floats.

How does R’s garbage collector use VSize and NSize?

We’ll discuss VSize, as NSize is analogous. The garbage collector recovers memory that is no longer in use, determining when to perform garbage collection and how much memory to recover. Looking at heap memory for Vcells, as depicted in the figure below, there are a few key points: Min_VSize, VSizeInUse, R VSize, and Max_VSize. The R VSize serves as the gc() trigger. The Min_VSize and Max_VSize are the specified lower and upper memory limits. Min_VSize is the minimal size for the vector heap as well as its initial value. From there, R grows or shrinks the vector heap depending on memory demands. However, it doesn’t exceed the Max_VSize limit nor go below the Min_VSize limit. In the figure, VSizeInUse reflects the memory currently used by R objects. R_VSize is how much memory can be requested without triggering gc(). As you would expect: Min_VSize <= R_VSize <= Max_VSize and VSizeInUse < R_VSize.


Limiting memory on the database server R engine

Oracle R Enterprise provides the SQL function sys.rqconfigset to set memory limits. Use of this function requires the sys privilege and the setting is applied only to embedded R engines. Consider the following examples:

sys.rqconfigset('MIN_VSIZE', '10M') -- min heap 10MB, default 32MB
sys.rqconfigset('MAX_VSIZE', '100M') -- max heap 100MB, default 4GB
sys.rqconfigset('MIN_NSIZE', '500K') -- min number cons cells 500x1024, default 1M
sys.rqconfigset('MAX_NSIZE', '2M') -- max number cons cells 2M, default 20M

Note that either numeric or string values can be provided to sys.rqconfigset. Default constants are defined as follows:


#define RQET_DEF_MINVSZ 33554432 /* RQER DEFault MIN_VSiZe 32Mb */
#define RQET_DEF_MAXVSZ 4294967296 /* RQER DEFault MAX_VSiZe 4Gb */
#define RQET_DEF_MINNSZ 1048576 /* RQER DEFault MIN_NSiZe 1M */
#define RQET_DEF_MAXNSZ 20971520 /* RQER DEFault MAX_NSiZe 20M */

Getting memory settings and usage through an embedded R engine

To obtain the current set of default values, you can invoke the following SQL statement when connected to the database server using the table sys.rq_config.

select name, value from sys.rq_config;

This can be done from within an embedded R function invocation using, for example:

getMemorySettings <- function() {
con <- dbConnect(Extproc())
rs <- dbSendQuery(con, "select name, value from sys.rq_config")
dat <- fetch(rs)
dat
}
ore.doEval(getMemorySettings,ore.connect=TRUE)

To obtain the current memory usage within an individual embedded R engine, instrumenting your embedded R function with gc() and returning the results of gc() will provide this insight:

getMemoryUse <- function() {
gc.dat <- gc()
list(pid=Sys.getpid(), gc.dat=gc.dat)
}
ore.doEval(getMemoryUse)

Note that the result from an embedded R call will also include the memory limits set for the R engines, shown below in column 6 “limit (Mb)” of the result. This occurs whenever memory limits are in place for an R engine. Here, we’ve included getting the process id of the R engine.

An example of computing memory limits

Consider you have an Exadata X2-2 that has 1152 GB RAM (~1.2 TB) and your DBA allocates you a maximum of 60 GB RAM for parallel R engines per Exadata node. If we set the degree of parallelism at 32, to enable 32 R engines to execute concurrently, this allows 1.875 GB RAM / R engine. If we allocate 2/3 of this for Vcells, we would allocate ~1.25 GB for the MAX_VSIZE. The remaining 1/3, or 625 MB, would translate into 11.6M cells for MAX_NSIZE.

60 GB allocated to R engines per Exadata node

DOP=32

60GB / 32 R engines = 1.875 GB / R Engine

~2/3 for Vcells = 1.25 GB; 1.25 GB / 8 Bytes/Cell = 156.25M Cells

~1/3 for Ncells = 625 MB; 625 MB / 54 Bytes/Cell = 11.6M Cells

sys.rqconfigset('MAX_VSIZE', '1250M')
sys.rqconfigset('MAX_NSIZE', '11600K')

While this example focuses on parallel execution, such as for ore.groupApply, ore.rowApply, and ore.indexApply (or SQL rqRowEval and “rqGroupEval”), the same type of analysis applies to non-parallel embedded R functions, like ore.doEval and ore.tableApply (or SQL rqEval and rqTableEval).

Consider an example that builds a randomForest model using the ore.doEval function. We can compute the amount of RAM consumed by the function by invoking gc() at the beginning and end of the function and subtracting the max used “(Mb)” columns as depicted here:

The result is that 1.4 MB were consumed for Ncells and 11.1 MB for Vcells. This can similarly be done for ore.indexApply to see the amount of RAM consumed by each embedded R function execution and to sum up the actual usage for each of the embedded R engines (assuming they run fully concurrently).

Generating such numbers on real data gives users a sense of how much memory embedded R jobs may require.

For DBAs

When configuring a database on Exadata for parallel R engines, consider the following options. In the following scenario, we contrast the scenarios when the execution time of any one given embedded R function is fast, e.g., 10s of seconds, and there are many such executions, versus few parallel R engines where the execution time is long with fewer such executions. Note that these must be considered in context of other Exadata uses:

· Set parallel_degree_policy to MANUAL. This allows ORE to choose when to apply parallelism, as opposed to setting it to AUTO which allows Oracle Database to decide.

· Set parallel_min_servers to the number of parallel slave processes to be started when the database instances start, e.g., 64, which is the number of parallel slave processes per Exadata node. This avoids incurring the time required to start these processes as needed to service R engines, and is particularly important when individual embedded R function execution time is short, e.g., 10s of seconds. If embedded R function execution time is long, the percentage of time for starting up the parallel slave will not dominate the overall execution time.

· Set parallel_max_servers to the maximum number of parallel slave processes that should be allowed per Exadata node, e.g., 128. This ensures that no more than parallel_max_servers will be active at one time, and in turn corresponds to the maximum number of R engines that can be active at one time.

· To avoid overloading the CPUs if the parallel_max_servers limit is reached, set the hidden parameter _parallel_statement_queuing to TRUE. This parameter is turned off by setting parallel_degree_policy to MANUAL. The _parallel_statement_queuing parameter allows for queuing of parallel requests when they exceed the parallel_server_target, which should be set to a value between parallel_min_servers and parallel_max_servers, e.g., 96. Once the parallel_server_target is reached, an embedded R execution will be allowed to execute in parallel using the remaining available parallel servers. If none are available, parallel requests will be queued. This ensures that parallel requests will be run in parallel, as opposed to being forced to serial execution, and be able to take advantage of parallel slaves as they become available. This can dramatically improve overall embedded R execution completion time. Note that parallel_max_servers cannot be changed during database operation, but the parallel_server_target can be to tune Exadata performance. Note that queuing effectively takes no CPU resources.

· To minimize RAC or cluster overhead for fast-executing individual embedded R functions, set parallel_force_local to TRUE to keep all parallel servers allocated and running on the same database server node. With this setting, starting an embedded R execution with DOP 32, all 32 R engines will run on the same Exadata node. A new embedded R execution also with DOP 32 may be started on a different node. If the embedded R functions are long-running, the setup time is propotionately small so spreading the IO over multiple nodes will not adversely impact overall performance. Having parallel slaves span multiple Exadata nodes results in communication / handshaking across nodes, which requires more resources. If the embedded R functions are fast, this overhead can adversely impact overall performance. When all parallel slaves are local, fewer resources are used.

· Where applicable, set application tables and their indexes to DOP 1 to reinforce the ability of ORE to determine when to use parallelism and not be overridden by table or index settings of DEFAULT or a specific degree of parallelism.

R memory management tip

There are many optimizations to make more efficient use of memory. To end this point, here is a tip to reduce memory consumptions significantly and avoid unnecessary replication of data.

If you know the size of your result in advance, pre-allocate the memory required, whether a vector, list, or matrix, as opposed to building up the result incrementally such as using cbind for adding columns to a matrix or data.frame. For example:

num.rows <- 1000
num.cols <- 2000
myFunction <- function(col) {col:(num.rows+col-1)} # produces vector of values
myMatrix <- matrix(NA, num.rows, num.cols) # pre-allocate required memory

for(col in 1:num.cols) {
myMatrix[,col] <- myFunction(col)
}

A note of thanks to Qin Wang and Martin Farber for their input on this blog post.

Monday Aug 12, 2013

Quick! Swap those models – I’ve got a better one

(or, Why in-database analytics enables real-time scoring and can make model deployment easy)

Refreshing predictive models is a standard part of the process when deploying advanced analytics solutions in production environments. In addition, many predictive models need to be used in a real-time setting for scoring customers, whether that is for fraud detection, predicting churn, or recommending next likely product. One of the problems with using vanilla R is that real-time scoring often requires starting an R engine for each score, or enabling some ad hoc mechanism for real-time scoring, which can increase application complexity.

In this blog post, we look at how Oracle R Enterprise enables:

  • Building models in-database on database data from R
  • Renaming in-database models for use by a stored procedure
  • Invoking the stored procedure to make predictions from SQL
  • Building a second model and swapping it with the original
  • Moving a model from development environment to production environment

Building the model in R

So let’s start with building a generalized linear model (GLM) in Oracle Database. For illustration purposes, we’ll use the longley data set from R – a macroeconomic data set that provides a well-known example for a highly collinear regression. In R, type ?longley for the full description of the data set.

Using the following R script, we create the database table LONGLEY_TABLE from the longley data.frame and then build the model using the in-database GLM algorithm. We’re predicting the number of people employed using the remaining variables. Then, we view the model details using summary and the auto-generated fit.name. This fit.name corresponds to the name of the Oracle Data Mining (ODM) model in the database, which is auto-generated. Next, we use the model to predict using the original data, just for a confirmation that the model works as expected.

ore.connect("rquser","my_sid","my_host","rquser_pswd",1521, all=TRUE)

ore.create(longley, table="LONGLEY_TABLE")

mod.glm <- ore.odmGLM(Employed ~ ., data = LONGLEY_TABLE)

summary(mod.glm)

mod.glm$fit.name

predict(fit1, LONGLEY_TABLE)

While a user can refer to the ODM model by its name in fit.name, for example, when working with it in SQL or the Oracle Data Miner GUI, this may not be convenient since it will look something like ORE$23_123. In addition, unless the R object mod.glm is saved in an ORE datastore (an ORE feature corresponding to R’s save and load functions using ore.save and ore.load, but in the database), at the end of the session, this object and corresponding ODM model will be removed.

In addition, we’ll want to have a common name for the model so that we can swap an existing model with a new model and not have the change higher level code. To rename an ODM model, we can use the PL/SQL statement shown here, invoked with R using ore.exec. Of course, this could also be done from any SQL interface, e.g., SQL*Plus, SQL Developer, etc., just supplying the explicit SQL.

ore.exec(paste("BEGIN DBMS_DATA_MINING.RENAME_MODEL(model_name => '", mod.glm$fit.name, "', new_model_name => 'MY_GLM_MODEL'); END;",sep=""))

So now, we have the ODM model named MY_GLM_MODEL. Keep in mind, after the model is renamed, the original model no longer exists and the R object is invalid – at least from the standpoint of being able to use it in functions like summary or predict.

Scoring data from a SQL procedure

As noted above, users can score in batch from R, however, they can also score in batch from SQL. But we’re interested in real-time scoring from the database using the in-database model. This can be done directly in a SQL query but providing the input data in the query itself. This eliminates having to write data to a database table and then doing a lookup to retrieve the data for scoring – making it real-time.

The following SQL does just this. The WITH clause defines the input data, selecting from dual. The SELECT clause uses the model MY_GLM_MODEL to make the prediction using the data defined by data_in.

WITH data_in as (select 2013 "Year",

234.289 "GNP",

235.6 "Unemployed",

107.608 "Population",

159 "Armed.Forces",

83 "GNP.deflator",

60.323 "Employed"

from dual )

SELECT PREDICTION(MY_GLM_MODEL USING *) "PRED"

FROM data_in

While we could invoke the SQL directly, having a stored procedure in the database can give us more flexibility. Here’s the stored procedure version in PL/SQL.

CREATE OR REPLACE

PROCEDURE MY_SCORING_PROC (year_in IN NUMBER,

gnp_in IN BINARY_DOUBLE,

unemployed_in IN BINARY_DOUBLE,

population_in IN BINARY_DOUBLE,

armed_forces_in IN BINARY_DOUBLE,

gnp_deflator_in IN BINARY_DOUBLE,

employed_in IN BINARY_DOUBLE,

pred_out OUT NUMBER) AS

BEGIN

WITH data_in as (select year_in "Year",

gnp_in "GNP",

unemployed_in "Unemployed",

population_in "Population",

armed_forces_in "Armed.Forces",

gnp_deflator_in "GNP.deflator",

employed_in "Employed"

from dual ),

model_score as (SELECT PREDICTION(MY_GLM_MODEL USING *) "PRED"

FROM data_in )

select PRED into pred_out from model_score;

EXCEPTION

WHEN OTHERS THEN

raise_application_error(-20001,

'An error was encountered - '||SQLCODE||' -ERROR- '||SQLERRM);

END;

To invoke the stored procedure, we can do the following:

SET SERVEROUTPUT ON

DECLARE

score NUMBER;

BEGIN

MY_SCORING_PROC(1947, 234.289, 235.6, 107.608, 159, 83, 60.323, score);

DBMS_OUTPUT.PUT_LINE('Score: '|| score);

END;

Refreshing the model from R

Let’s say the model above has been in production for a while, but has become stale – that is, it’s not predicting as well as it used to due to changing patterns in the data. To refresh it, we build a new model. For illustration purposes, we’re going to use the same data (so an identical model will be produced, except for its name).

mod.glm2 <- ore.odmGLM(Employed ~ ., data = LONGLEY_TABLE)

summary(mod.glm2)

mod.glm2$fit.name

To swap the models, we delete the existing model called MY_GLM_MODEL and rename the new model to MY_GLM_MODEL. Again, we can do this from R using PL/SQL and through ore.exec.

ore.exec(paste("BEGIN DBMS_DATA_MINING.DROP_MODEL('MY_GLM_MODEL'); DBMS_DATA_MINING.RENAME_MODEL(model_name => '",mod.glm2$fit.name,"', new_model_name => 'MY_GLM_MODEL'); END;",sep=""))

We can now re-execute the stored procedure and the new model will be used.

SET SERVEROUTPUT ON

DECLARE

score NUMBER;

BEGIN

MY_SCORING_PROC(1947, 234.289, 235.6, 107.608, 159, 83, 60.323, score);

DBMS_OUTPUT.PUT_LINE('Score: '|| score);

END;

You may have noticed that this approach can introduce a brief period where no model is accessible - between the DROP_MODEL and RENAME_MODEL. A better approach involves the use of SYNONYMs. In general, synonyms provide both data independence and location transparency, being an alternative name for a table, view, sequence, procedure, stored function, and other database objects. We can use this in conjunction with our stored procedure above. First, create a synonym for the original scoring procedure.

CREATE or REPLACE SYNONYM MY_SCORING_PROC_SYM for MY_SCORING_PROC;

When invoking the procedure from your application, use the name MY_SCORING_PROC_SYM in place of MY_SCORING_PROC.  Instead of renaming the model, create a second stored procedure, with a different name, e.g., MY_SCORING_PROC_2. The new procedure references the name of the newly build model internally. 

When it is time to swap the models, invoke the following to change the procedures.

 

CREATE or REPLACE SYNONYM MY_SCORING_PROC_SYM for MY_SCORING_PROC_2;

Another benefit of this approach is that replaced models can still be kept should you need to revert to a previous version. 

Moving an in-database model from one machine to another

In a production deployment, there’s often the need to move a model from the development environment to the production environment. For example, the data scientist may have built the model in a development / sandbox environment and now needs to move it to the production machine(s).

In-database models provide functions EXPORT_MODEL and IMPORT_MODEL as part of the DBMS_DATA_MINING SQL package. See the 11g documentation for details. These calls can be invoked from R, but we’ll show this from SQL just to keep the flow easier to see.

From a SQL prompt, e.g., from SQL*Plus, connect to the schema that contains the model. Create a DIRECTORY object where the exported model file will be stored. List the model names available to this schema, which should contain MY_GLM_MODEL. Then, export the model

CONNECT rquser/rquser_psw

CREATE OR REPLACE DIRECTORY rquserdir AS '/home/MY_DIRECTORY';

-- list the models available to rquser

SELECT name FROM dm_user_models;

-- export the model called MY_GLM_MODEL to a dump file in same schema

EXECUTE DBMS_DATA_MINING.EXPORT_MODEL ('MY_GLM_MODEL_out',

'RQUSERDIR',

'name = ''MY_GLM_MODEL''');

At this point, you have the ODM model named MY_GLM_MODEL in the file MY_GLM_MODEL_out01.dmp stored in the file system under /home/MY_DIRECTORY. This file can now be moved to the production environment and the model loaded into the target schema.

Log into the new schema and invoke IMPORT_MODEL.

CONNECT rquser2/rquser2_psw

EXECUTE DBMS_DATA_MINING.IMPORT_MODEL (MY_GLM_MODEL_out01.dmp',

'RQUSERDIR', 'name = ''MY_GLM_MODEL''',

'IMPORT', NULL, 'glm_imp_job', 'rquser:rquser2');

Summary

In this post, we’ve highlighted how to build an in-database model in R and use it for scoring through SQL in a production, re-time settings. In addition, we showed how it is possible to swap, or refresh, models in a way that can leave your application code untouched. Finally, we highlighted database functionality that allows you to move in-database models from one database environment to another.

Users should note that all the functionality shown involving SQL, or being invoked through ore.exec, can be easily wrapped in R functions that could ultimately become part of ORE. If any of our readers are interested in giving this a try, we can post your solution here to share with the R and Oracle community. For the truly adventurous, check out the Oracle Database package DBMS_FILE_TRANSFER to consider wrapping the ability to move model files from R as well.

Monday Jul 29, 2013

Oracle R Distribution for R-3.0.1 released

We're pleased to announce that the Oracle R Distribution 3.0.1 Linux RPMs are now available on Oracle's public yum. R-3.0.1, code-named "Good Sport", is the second release in the R-3.0.x series. This new series in R doesn't announce new features, but indicates that the code base has developed to a new level of maturity.

However, there are some significant improvements in the 3.0 series worth mentioning.  R-3.0.0 introduces the use of large vectors in R, and eliminates some restrictions in the core R engine by allowing R to use the memory available on 64-bit systems more efficiently. Prior to this release, objects had a hard-coded limit of 2^31-1 elements, or roughly 2.1 billion elements.  Objects exceeding this limit were treated as missing (NA) and R sometimes returned a warning, regardless of available memory on the system. Starting in R-3.0.0, objects can exceed this limit, which is a significant improvement. Here's the relevant statement from the R-devel NEWS file:

 There is a subtle change in behaviour for numeric index 
 values 2^31 and larger. These never used to be legitimate 
 and so were treated as NA, sometimes with a warning. They 
 are now legal for long vectors so there is no longer a 
 warning, and x[2^31] <- y will now extend the vector on a 
 64-bit platform and give an error on a 32-bit one. 

R-3.0.1 adds to these updates by improving serialization for big objects and fixing a variety of bugs.

Older open source R packages will need to be re-installed after upgrading from ORD 2.15.x to ORD 3.0.1, which is accomplished by running:

R> update.packages(checkBuilt = TRUE) 

This command upgrades open source packages if a more recent version exists on CRAN or if the installed package was build with an older version of R.

Oracle R Distribution 3.0.1 will be compatible with future versions of Oracle R Enterprise.  As of this posting, we recommend using ORD 2.15.3 with Oracle R Enterprise 1.3.1.  When installing ORD for use with ORE 1.3.1, be sure to use the command yum install R- 2.15.3, otherwise R-3.0.1 will be installed by default.

 
ORD 3.0.1 binaries for AIX, Solaris x86, and Solaris SPARC platforms will be available from Oracle's free and Open Source portal soon. Please check back for updates.

Friday Jul 19, 2013

Oracle R Connector for Hadoop 2.2.0 released

Oracle R Connector for Hadoop 2.2.0 is now available for download. The Oracle R Connector for Hadoop 2.x series has introduced numerous enhancements, which are highlighted in this article and summarized as follows:

 ORCH 2.0.0
 ORCH 2.1.0
 ORCH 2.2.0

 Analytic Functions

  • orch.lm
  • orch.lmf
  • orch.neural
  • orch.nmf

Oracle Loader for Hadoop (OLH) support

CDH 4.2.0

ORCHhive transparency layer

.

.

.

.

.

.

Analytic Functions
  • orch.cor
  • orch.cov
  • orch.kmeans
  • orch.princomp
  • orch.sample - by percent

Configurable delimiters in text input data files

Map-only and reduce-only jobs

Keyless map/reduce output

"Pristine" data mode for high performance data access

HDFS cache of metadata

Hadoop Abstraction Layer (HAL)

.

Analytic Functions
  • orch.sample - by number of rows

CDH 4.3.0

Full online documentation

Support integer and matrix data types in hdfs.attach with detection of "pristine" data

Out-of-the-box support for "pristine" mode for high I/O performance

HDFS cache to improve interactive performance when navigating HDFS directories and file lists

HDFS multi-file upload and download performance enhancements

HAL for Hortonworks Data Platform 1.2 and Apache Hadoop 1.0

ORCH 2.0.0

In ORCH 2.0.0, we introduced four Hadoop-enabled analytic functions supporting linear  regression, low rank matrix factorization, neural network, and non-negative matrix factorization. These enable R users to immediately begin using advanced analytics functions on HDFS data using the MapReduce paradigm on a Hadoop cluster without having to design and implement such algorithms themselves.

While ORCH 1.x supported moving data between the database and HDFS using sqoop, ORCH 2.0.0 supports the use of Oracle Loader for Hadoop (OLH) to move very large data volumes from HDFS to Oracle Database in a efficient and high performance manner.

ORCH 2.0.0 supported Cloudera Distribution for Hadoop (CDH) version 4.2.0 and introduced the ORCHhive transparency layer, which leverages the Oracle R Enterprise transparency layer for SQL, but instead maps to HiveQL, a SQL-like language for manipulating HDFS data via Hive tables.

ORCH 2.1.0

In ORCH 2.1.0, we added several more analytic functions, including correlation and covariance, clustering via K-Means, principle component analysis (PCA), and sampling by specifying the percent of records to return.

ORCH 2.1.0 also brought a variety of features, including: configurable delimiters (beyond comma delimited text files, using any ASCII delimiter), the ability to specify mapper-only and reduce-only jobs, and the output of NULL keys in mapper and reducer functions.

To speed the loading of data into Hadoop jobs, ORCH introduced “pristine” mode where the user guarantees that the data meets certain requirements so that ORCH skips a time-consuming data validation step. “Pristine” data requires that numeric columns contain only numeric data, that missing values are either R’s NA or the null string, and that all rows have the same number of columns. This improves performance of hdfs.get on a 1GB file by a factor of 10.

ORCH 2.1.0 introduced the caching of ORCH metadata to improve response time of ORCH functions, such as hdfs.ls, hdfs.describe, and hdfs.mget between 5x and 70x faster.

The Hadoop Abstraction Layer, or HAL, enables ORCH to work on top of various Hadoop versions or variants, including Apache/Hortonworks, Cloudera Hadoop distributions: CDH3, and CDH 4.x with MR1 and MR2.

ORCH 2.2.0

In the latest release, ORCH 2.2.0, we’ve augmented orch.sample to allow specifying the number of rows in addition to percentage of rows. CDH 4.3 is now supported, and ORCH functions provide full online documentation via R's help function or ?. The function hdfs.attach now support integer and matrix data types and the ability to detect pristine data automatically. HDFS bulk directory upload and download performance speeds were also improved. Through the caching and automatic synchronization of ORCH metadata and file lists, the responsiveness of metadata HDFS-related functions has improved by 3x over ORCH 2.1.0, which also improves performance of hadoop.run and hadoop.exec functions. These improvements in turn bring a more interactive user experience for the R user when working with HDFS.

Starting in ORCH 2.2.0, we introduced out-of-the-box tuning optimizations for high performance and expanded HDFS caching to include the caching of file lists, which further improves performance of HDFS-related functions.

The function hdfs.upload now supports the option to upload multi-file directories in a single invocation, which optimizes the process. When downloading an HDFS directory, hdfs.download is optimized to issue a single HDFS command to download files into one local temporary directory before combining the separate parts into a single file.

The Hadoop Abstraction Layer (HAL) was extended to support Hortonworks Data Platform 1.2 and Apache Hadoop 1.0. In addition, ORCH now allows the user to override the Hadoop Abstraction Layer version for use with unofficially supported distributions of Hadoop using system environment variables. This enables testing and certification of ORCH by other Hadoop distribution vendors.

Certification of ORCH on non-officially supported platforms can be done using a separate test kit (available for download upon request: mark.hornick@oracle.com) that includes an extensive set of tests for core ORCH functionality and that can be run using the ORCH built-in testing framework. Running the tests pinpoints the failures and ensures that ORCH is compatible with the target platform.

See the ORCH 2.2.0 Change List and Release Notes for additional details. ORCH 2.2.0 can be downloaded here.


Thursday Jul 18, 2013

Simple and Advanced Time series with Oracle R Enterprise

This guest post from Marcos Arancibia describes how to use Oracle R Enterprise to analyze Time Series data.

In this article, we give an overview of how to use Time Series Analysis against data stored in Oracle Database, using the Embedded R Execution capability to send time series computations to the Oracle Database server instead processing at the client. We will also learn how to retrieve the final series or forecasts and retrieve them to the client for plotting, forecasting, and diagnosing.

One key thing to keep in mind when using Time Series techniques with data that is stored in Oracle Database is the order of the rows, or records. Because of the parallel capabilities of Oracle Database, when queried for records, one might end up receiving records out of order if an option for order is not specified.

Simple Example using Stock Data

Let’s start with a simple Time Series example. First we will need to connect to our Oracle Database using ORE. Then, using the package TTR, we will access Oracle Stock data from YahooData service, from January 1, 2008 to January 1, 2013 and push it to the database.

# Load the ORE library and connect to Oracle Database

library(ORE)

ore.connect("myuser","mysid","myserver","mypass",port=1521,all=TRUE)

library(TTR)

# Get data in XTS format

xts.orcl <- getYahooData("ORCL", 20080101, 20130101)

# Convert it to a data frame and gets the date

# Makes the date the Index

df.orcl <- data.frame(xts.orcl)

df.orcl$date <- (data.frame(date=index(xts.orcl))$date)

# Create/overwrite data in Oracle Database

# to a Table called ORCLSTOCK

ore.drop(table="ORCLSTOCK")

ore.create(df.orcl,table="ORCLSTOCK")

# IMPORTANT STEP!!!

# Ensure indexing is kept by date

rownames(ORCLSTOCK) <- ORCLSTOCK$date

# Ensure the data is in the DB

ore.ls()

# Review column names, data statistics and

# print a sample of the data

names(ORCLSTOCK)

>names(ORCLSTOCK)

[1] "Open" "High" "Low" "Close" "Volume"

[6] "Unadj.Close" "Div" "Split" "Adj.Div" "date"

summary(ORCLSTOCK$Close)

>summary(ORCLSTOCK$Close)

Min. 1st Qu. Median Mean 3rd Qu. Max.

13.36 20.53 24.22 24.79 29.70 35.73

head(ORCLSTOCK)

>head(ORCLSTOCK)

Open High Low Close Volume

2008-01-02 01:00:00 21.74414 22.00449 21.58022 21.68629 44360179

2008-01-03 01:00:00 21.62843 22.28413 21.62843 22.28413 43600532

2008-01-04 01:00:00 21.95628 22.06235 21.01130 21.24272 46391263

2008-01-07 01:00:00 21.17523 21.67664 21.01130 21.45486 41527032

2008-01-08 01:00:00 21.44522 21.52236 20.38453 20.39417 45155398

2008-01-09 01:00:00 20.57738 20.91487 20.39417 20.83773 49750304

Unadj.Close Div Split Adj.Div date

2008-01-02 01:00:00 22.49 NA NA NA 2008-01-02

2008-01-03 01:00:00 23.11 NA NA NA 2008-01-03

2008-01-04 01:00:00 22.03 NA NA NA 2008-01-04

2008-01-07 01:00:00 22.25 NA NA NA 2008-01-07

2008-01-08 01:00:00 21.15 NA NA NA 2008-01-08

2008-01-09 01:00:00 21.61 NA NA NA 2008-01-09

Pull data from the database for a simple plot

# Pull data from Oracle Database (only the necessary columns)

orcl <- ore.pull(ORCLSTOCK[,c("date","Close","Open","Low","High")])

# Simple plot with base libraries - Closing

plot(orcl$date,orcl$Close,type="l",col="red",xlab="Date",ylab="US$",

main="Base plot:Daily ORACLE Stock Closing points")

# Simple plot with base libraries - Other Series

plot(orcl$date,orcl$Open,type="l",col="blue",xlab="Date",ylab="US$",

main="Base plot:Daily ORACLE Stock: Open/High/Low points")

lines(orcl$date,orcl$High,col="green")

lines(orcl$date,orcl$Low,col="orange")

legend("topleft", c("Opening","High","Low"),

col=c("blue","green","orange"),lwd=2,title = "Series",bty="n")

A different plot option, using the package xts

library(xts)

# Pull data from Oracle Database (only the necessary columns)

orcl <- ore.pull(ORCLSTOCK[,c("date","Close","Open","Low","High")])

# Convert data to Time Series format

orcl.xts <- as.xts(orcl,order.by=orcl$date,dateFormat="POSIXct")

# Plot original series

plot(orcl.xts$Close,major.ticks='months',minor.ticks=FALSE,

main="Time Series plot:Daily ORACLE Stock Closing points",col="red")

Simple Time Series: Moving Average Smoothing

We might be tempted to call functions like the Smoothing Moving Average from open-source CRAN packages against Oracle Database Tables, but those packages do not know what to do with an “ore.frame”. For that process to work correctly, we can either load the data locally or send the process for remote execution on the Database Server by using Embedded R Execution.

We will also explore the built-in Moving Average process from ore.rollmean() as a third alternative.

ALTERNATIVE 1 - The first example is pulling the data from Oracle Database into a ts (time series) object first, for a Client-side smoothing Process.

library(TTR)

# Pull part of the database table into a local data.frame

sm.orcl <- ore.pull(ORCLSTOCK[,c("date","Close")])

# Convert "Close" attribute into a Time Series (ts)

ts.orcl <- ts(sm.orcl$Close)

# Use SMA - Smoothing Moving Average algorithm from package TTR

ts.sm.orcl <-ts(SMA(ts.orcl,n=30),frequency=365, start=c(2008,1) )

# Plot both Series together

plot(sm.orcl$date,sm.orcl$Close,type="l",col="red",xlab="Date",ylab="US$",

main="ORCL Stock Close CLIENT-side Smoothed Series n=30 days")

lines(sm.orcl$date,ts.sm.orcl,col="blue")

legend("topleft", c("Closing","MA(30) of Closing"),

col=c("red","blue"),lwd=2,title = "Series",bty="n")

ALTERNATIVE 2 – In this alternative, we will use a Server-side example for running the Smoothing via Moving Average, without bringing all data to the client. Only the result is brought locally for plotting. Remember that the TTR package has to be installed on the Server in order to be called.

# Server execution call using ore.tableApply

# Result is an ore.list that remains in the database until needed

sv.orcl.ma30 <-

ore.tableApply(ORCLSTOCK[,c("date","Close")],ore.connect = TRUE,

function(dat) {

library(TTR)

ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]

list(res1 <- ts(ordered$Close,frequency=365, start=c(2008,1)),

res2 <- ts(SMA(res1,n=30),frequency=365, start=c(2008,1)),

res3 <- ordered$date)

}

);

# Bring the results locally for plotting

local.orcl.ma30 <- ore.pull(sv.orcl.ma30)

# Plot two series side by side

# (the third element of the list is the date)

plot(local.orcl.ma30[[3]],local.orcl.ma30[[1]],type="l",

col="red",xlab="Date",ylab="US$",

main="ORCL Stock Close SERVER-side Smoothed Series n=30 days")

# Add smoothed series

lines(local.orcl.ma30[[3]],

local.orcl.ma30[[2]],col="blue",type="l")

# Add legend

legend("topleft", c("Closing","Server MA(30) of Closing"),

col=c("red","blue"), lwd=2,title = "Series", bty="n")

ALTERNATIVE 3 – In this alternative we will use a Server-side example with the computation of Moving Averages using the native ORE in-Database functions without bringing data to the client. Only the result is brought locally for plotting.

Just one line of code is needed to generate an in-Database Computation of Moving averages and the creation of a new VIRTUAL column in the Oracle Database. We will call this new column rollmean30.

We will use the function ore.rollmean(). The option align="right" makes the MA look at only the past k days (30 in this case), or less, depending on the point in time. This creates a small difference between this method and the previous methods in the beginning of the series, since ore.rollmean() can actually calculate the first sets of days using smaller sets of data available, while other methods discard this data.

# Moving Average done directly in Oracle Database

ORCLSTOCK$rollmean30 <- ore.rollmean(ORCLSTOCK$Close, k = 30, align="right")

# Check that new variable is in the database

head(ORCLSTOCK)

>head(ORCLSTOCK)

Open High Low Close Volume

2008-01-02 01:00:00 21.74414 22.00449 21.58022 21.68629 44360179

2008-01-03 01:00:00 21.62843 22.28413 21.62843 22.28413 43600532

2008-01-04 01:00:00 21.95628 22.06235 21.01130 21.24272 46391263

2008-01-07 01:00:00 21.17523 21.67664 21.01130 21.45486 41527032

2008-01-08 01:00:00 21.44522 21.52236 20.38453 20.39417 45155398

2008-01-09 01:00:00 20.57738 20.91487 20.39417 20.83773 49750304

Unadj.Close Div Split Adj.Div date rollmean30

2008-01-02 01:00:00 22.49 NA NA NA 2008-01-02 21.68629

2008-01-03 01:00:00 23.11 NA NA NA 2008-01-03 21.98521

2008-01-04 01:00:00 22.03 NA NA NA 2008-01-04 21.73771

2008-01-07 01:00:00 22.25 NA NA NA 2008-01-07 21.66700

2008-01-08 01:00:00 21.15 NA NA NA 2008-01-08 21.41243

2008-01-09 01:00:00 21.61 NA NA NA 2008-01-09 21.31665

# Get results locally for plotting

local.orcl <- ore.pull(ORCLSTOCK[,c("date","Close", "rollmean30")])

sub.orcl <- subset(local.orcl,local.orcl$date> as.Date("2011-12-16"))

# Plot the two series side by side

# First plot original series

plot(local.orcl$date, local.orcl$Close,type="l", col="red",xlab="Date",ylab="US$",

main="ORCL Stock Close ORE Computation of Smoothed Series n=30 days")

# Add smoothed series

lines(local.orcl$date,local.orcl$rollmean30,col="blue",type="l")

# Add legend

legend("topleft", c("Closing","ORE MA(30) of Closing"),

col=c("red","blue"),lwd=2,title = "Series",bty="n")

Seasonal Decomposition for Time Series Diagnostics

Now that we have learned how to execute these processes using Embedded R, we can start using other methodologies required for Time Series using the same Server-side computation and local plotting.

It is typical for an analyst to try to understand a Time Series better by looking at some of the basic diagnostics like the Seasonal Decomposition of Time Series by Loess. These can be achieved by using the stl() command in the following process:

# Server execution

sv.orcl.dcom <-

ore.tableApply(ORCLSTOCK[,c("date","Close")],ore.connect = TRUE,

function(dat) {

ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]

ts.orcl <- ts(ordered$Close,frequency=365, start=c(2008,1))

res <- stl(ts.orcl,s.window="periodic")

}

);

# Get result for plotting

local.orcl.dcom <- ore.pull(sv.orcl.dcom)

plot(local.orcl.dcom, main="Server-side Decomposition of ORCL Time-Series",col="blue")

Another typical set of diagnostic charts includes Autocorrelation and Partial Autocorrelation function plots. These can be achieved by using the acf() command with the proper options in Embedded R Execution, so computations happen at the Oracle Database server:

# Server-side ACF and PACF computation

# Use function acf() and save result as a list

sv.orcl.acf <-

ore.tableApply(ORCLSTOCK[,c("date","Close")],ore.connect=TRUE,

function(dat){

ts.orcl <- ts(dat$Close,frequency=365, start=c(2008,1))

list(res1 <- acf(ts.orcl,lag.max=120,type="correlation"),res2 <- acf(ts.orcl,lag.max=30, type="partial"))

}

);

# Get results for plotting

# ACF and PACF as members of the list pulled

local.orcl.acf <- ore.pull(sv.orcl.acf)

plot(local.orcl.acf[[1]],main="Server-side ACF Analysis for Series ORCL",col="blue",lwd=2)

plot(local.orcl.acf[[2]],main="Server-side PACF Analysis for Series ORCL",col="blue",lwd=5)

Simple Exponential Smoothing

Using the popular package “forecast”, we will use the ses() function to calculate a 90 days horizon (h=90) into the future, using the option criterion=MSE for the model. The package forecast needs to be installed on the Oracle Database server R engine.

Then, we will bring the resulting model locally for plotting. Remember to load the library “forecast” locally as well, to be able to interpret the meaning of the ses() output when it’s brought locally.

# Execute ses() call in the server

sv.orcl.ses <-

ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,

function(dat) {

library(forecast)

ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]

ts.orcl <- ts(ordered$Close,frequency=365, start=c(2008,1) )

res <- ses(ts.orcl, h=90, alpha=0.1, initial="simple")

}

);

# Get SES result locally for plotting

# Since remote object contains a SES model from package forecast,

# load package locally as well

library(forecast)

plot.orcl.ses <- ore.pull(sv.orcl.ses)

plot(plot.orcl.ses,col="blue",fcol="red",

main="ORCL with Server-side SES - Simple Exponential Smoothing Forecast")

Holt Exponential Smoothing

Using the popular package “forecast”, we will use the holt() function to calculate a 90 days horizon (h=90) into the future, requesting the Intervals of confidence of 80 and 95%. Again. the package “forecast” needs to be installed on the Oracle Database server R engine.

Then, we will bring the resulting model locally for plotting. Remember to load the library forecast locally as well, to be able to interpret the meaning of the holt() output when it’s brought locally.

# Execute holt() call in the server

sv.orcl.ets <-

ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,

function(dat) {

library(forecast)

ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]

ts.orcl <- ts(ordered$Close,frequency=365, start=c(2008,1))

res <- holt(ts.orcl, h=90, level=c(80,95), initial="optimal")

}

);

# Get resulting model from the server

# Since remote object contains a Holt Exponential Smoothing

# model from package forecast, load package locally as well

library(forecast)

local.orcl.ets <- ore.pull(sv.orcl.ets)

plot(local.orcl.ets,col="blue",fcol="red",

main="ORCL Original Series Stock Close with Server-side Holt Forecast")

ARIMA – Auto-Regressive Interactive Moving Average

There are at least two options for fitting an ARIMA model into a Time Series. One option is to use the package “forecast”, that allows for an automatic arima fitting (auto.arima) to find the best parameters possible based on the series.

For more advanced users, the arima() function in the “stats” package itself allows for choosing the model parameters.

# ARIMA models on the server using auto.arima() from package forecast

arimaModel <-

ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,

FUN = function(dat){

# load forecast library to use auto.arima

library(forecast)

# sort the table into a temp file by date

ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]

# convert column into a Time Series

# format ts(...) and request creation of an automatic

# ARIMA model auto.arima(...)

res <- auto.arima(ts(ordered$Close,frequency=365, start=c(2008,1)),

stepwise=TRUE, seasonal=TRUE)

})

# Alternative using the arima() from package “stats”.

arimaModel <-

ore.tableApply(ORCLSTOCK[,c("date","Close")],ore.connect=TRUE,

FUN = function(dat){

# sort table into a temp file by date

ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]

# convert column into a Time Series

# format ts(...) and request creation of a specific

# ARIMA model using arima(), for example an ARIMA(2,1,2)

res <- arima(ts(ordered$Close,frequency=365, start=c(2008,1)),

order = c(2,1,2))

})

# Load forecast package locally to use the model

# for plotting and producing forecasts

library(forecast)

# Show remote resulting Time Series model

>arimaModel

Series: ts(ordered$Close, frequency = 365, start = c(2008, 1))

ARIMA(2,1,0)

Coefficients:

ar1 ar2

-0.0935 -0.0192

s.e. 0.0282 0.0282

sigma^2 estimated as 0.2323: log likelihood=-866.77

AIC=1739.55 AICc=1739.57 BIC=1754.96

# Get remote model using ore.pull for local prediction and plotting

local.arimaModel <- ore.pull(arimaModel)

# Generate forecasts for the next 15 days

fore.arimaModel <- forecast(local.arimaModel, h=15)

# Use the following option if you need to remove scientific notation of

# numbers that are too large in charts

options(scipen=10)

# Generate the plot of forecasts, including interval of confidence

# Main title is generated automatically indicating the type of model

# chosen by the Auto ARIMA process

plot(fore.arimaModel,type="l", col="blue", xlab="Date",

ylab="Closing value (US$)", cex.axis=0.75, font.lab="serif EUC",

sub="Auto-generated ARIMA for ORCL Stock Closing"

)

# Generate and print forecasted data points plus standard errors

# of the next 15 days

forecasts <- predict(local.arimaModel, n.ahead = 15)

>forecasts

$pred

Time Series:

Start = c(2011, 165)

End = c(2011, 179)

Frequency = 365

[1] 33.29677 33.29317 33.29395 33.29395 33.29393 33.29393 33.29393 33.29393 33.29393 33.29393 33.29393

[12] 33.29393 33.29393 33.29393 33.29393

$se

Time Series:

Start = c(2011, 165)

End = c(2011, 179)

Frequency = 365

[1] 0.4819417 0.6504925 0.7807798 0.8928901 0.9924032 1.0827998 1.1662115 1.2440430 1.3172839 1.3866617

[11] 1.4527300 1.5159216 1.5765824 1.6349941 1.6913898

Monday Jul 08, 2013

Accessing Data from Multiple Schemas using Oracle R Enterprise

The most common Oracle R Enterprise configuration is to connect directly to a database schema that contains tables you wish to analyze. However, users may occasionally need to access tables that exist in other schemas. Oracle R Enterprise allows several options when accessing tables from another schema is desired. Database tables and views are currently supported, and these form the basis for our recommendations.

Named Schema Access

If you have SELECT TABLE or SELECT ANY TABLE privilege on tables in another schema, you can access these tables after connecting to the database with your own schema credentials. The function ore.sync synchronizes database table and view metadata with the R client environment. For example, by setting schema to “user2”, user1 will see all of user2's tables on which user1 has been granted access:

R> library(ORE)
R> ore.connect(user="user1", sid="sid", host="hostname", password="password")
R> ore.sync(schema = "user2", table="myTable")
R> ore.attach(schema = "user2")
R> ore.ls()
[1] "myTable" 

Here, we combine the schema and table arguments to look at a specific table, but this can be omitted to access all tables available in schema "user2" at once:

R> ore.sync(schema = "user2")
R> ore.ls()
  [1] "myTable" "anotherTable"

Accessing a materialized table typically offers the best query performance for operations such as joins, however in 
other cases, such as calculating simple summaries, the performance advantage may be negligible.

Create Views in Local Schema

Another option is to map views in your own schema to the tables or views in the another schema. You can restrict users to the view instead of the underlying table, thereby enhancing security, and also include in the view only those columns needed. For example, if a user exports the contents of a carefully defined view, they will see only the table columns selected by the view - no unselected columns, unique identifiers or table keys. Views also simplify the user experience by exposing only those database tables the user can or should access. The only catch when using views is that you must update those views if the underlying tables or views change.

Provided you've been granted CREATE VIEW privilege and SELECT TABLE access, use the ore.exec function to execute the SQL statement that will create the view from the R client

R> library(ORE)
R> ore.connect(user="user1", sid="sid", host="hostname", password="password")
R> ore.exec("create view myView as select * from user2.myTable")
R> ore.sync(table = "myView")
  [1] "myView" 

The code above assumes you already have privileges to access the table or view. If you do not, log in as sysdba or to the schema of interest in invoke:

SQL> grant select on MYTABLE to user1;

Oracle Wallet

Password credentials for connecting to databases can be stored in a client-side Oracle Wallet, a container used to encrypt authentication credentials. The contents of the wallet are not readable, eliminating the need to expose schema credentials when connecting to the database.  Security risks are reduced because such passwords are not exposed in clear text. Oracle R Enterprise 1.3 and later is integrated with Oracle Wallet, providing a secure way for R scripts to avoid storing passwords in the script. For detailed information about creating wallets, see Oracle Database Advanced Security Administrator's GuideSteps for using Oracle Wallet with Oracle R Enterprise are provided in the Oracle R Enterprise Installation and Administration Guide.

If you have a creative technique for accessing data across schemas or other platforms, please recommend it in the blog comments, along with any opinions you have on these approaches.


Monday Jun 17, 2013

Oracle R Connector for Hadoop 2.1.0 released

Oracle R Connector for Hadoop (ORCH), a collection of R packages that enables Big Data analytics using HDFS, Hive, and Oracle Database from a local R environment, continues to make advancements. ORCH 2.1.0 is now available, providing a flexible framework while remarkably improving performance and adding new analytics based on the ORCH framework.

Previous releases enabled users to write MapReduce tasks in the R language and run them in HDFS. The API was then expanded to include support for Hive data sources, providing easy access to Hive data from R, leveraging the same transparency interface as found in Oracle R Enterprise. ORCH HAL was included to enable portability and compatibility of ORCH with any Cloudera's Hadoop distribution starting from version 3.x up to 4.3.

In this release, new analytic functions that work in parallel, distributed mode and execute on the Hadoop cluster, include:

  • Covariance and Correlation matrix computation
  • Principal Component Analysis
  • K-means clustering
  • Linear regression
  • Single layer feed forward neural networks for linear regression
  • Matrix completion using low rank matrix factorization
  • Non negative matrix factorization
  • Sampling
  • Predict methods

ORCH 2.1.0 also adds support for keyless mapReduce output and many other improvements that contribute to overall performance enhancements.

You can find an ORCH technical reference here, and download Oracle R Connector for Hadoop here.






Wednesday Jun 12, 2013

R to Oracle Database Connectivity: Use ROracle for both Performance and Scalability

R users have a few choices of how to connect to their Oracle Database. The most commonly seen include: RODBC, RJDBC, and ROracle. However, these three packages have significantly different performance and scalability characteristics which can greatly impact your application development. In this blog, we’ll discuss these options and highlight performance benchmark results on a wide range of data sets.

If you use ROracle, we'd like to hear about your experience. Please take this brief survey.

By way of introduction, RODBC is an R package that implements ODBC database connectivity. There are two groups of functions: the largely internal odbc* functions implement low-level access to the corresponding ODBC functions having a similar name, and the higher level sql* functions that support read, save, copy, and manipulation of data between R data.frame objects and database tables. Here is an example using RODBC:

library(RODBC)

con <- odbcConnect("DD1", uid="rquser", pwd="rquser", rows_at_time = 500)

sqlSave(con, test_table, "TEST_TABLE")

sqlQuery(con, "select count(*) from TEST_TABLE")

d <- sqlQuery(con, "select * from TEST_TABLE")

close(con)

The R package RJDBC is an implementation of the R DBI package – database interface – that uses JDBC as the back-end connection to the database. Any database that supports a JDBC driver can be used in connection with RJDBC. Here is an example using RJDBC:

library(RJDBC)

drv <- JDBC("oracle.jdbc.OracleDriver",

classPath="…tklocal/instantclient_11_2/ojdbc5.jar", " ")
con <- dbConnect(drv, "
jdbc:oracle:thin:@myHost:1521:db", "rquser", "rquser")

dbWriteTable(con, "TEST_TABLE", test_table)

dbGetQuery(con, "select count(*) from TEST_TABLE")

d <- dbReadTable(con, "TEST_TABLE")
dbDisconnect(con)

The ROracle package is an implementation of the R DBI package that uses Oracle OCI for high performance and scalability with Oracle Databases. It requires Oracle Instant Client or Oracle Database Client to be installed on the client machine. Here is an example using ROracle:

library(ROracle)

drv <- dbDriver("Oracle")

con <- dbConnect(drv, "rquser", "rquser")

dbWriteTable(con,”TEST_TABLE”, test_table)

dbGetQuery(con, "select count(*) from TEST_TABLE")

d <- dbReadTable(con, "TEST_TABLE")

dbDisconnect(con)

Notice that since both RJDBC and ROracle implement the DBI interface, their code is the same except for the driver and connection details.

To compare these interfaces, we prepared tests along several dimensions:

  • Number of rows – 1K, 10K, 100K, and 1M
  • Number of columns – 10, 100, 1000
  • Data types – NUMBER, BINARY_DOUBLE, TIMESTAMP, and VARCHAR; Numeric data is randomly generated, all character data is 10 characters long.
  • Interface: RODBC 1.3-6 (with Data Direct 7.0 driver), RJDBC 0.2-1 (with rJava 0.9-4 with increased memory limit in JRIBootstrap.java),
    and ROracle 1.1-10 (with Oracle Database Client 11.2.0.4)
  • Types of operations: select *, create table, connect

Loading database data to an R data.frame

Where an in-database capability as provided by Oracle R Enterprise is not available, typical usage is to pull data to the R client for subsequent processing. In Figure 1, we compare the execution time to pull 10, 100, and 1000 columns of data from 1K, 10, 100K, and 1M rows for BINARY_DOUBLE data on a log-log scale. Notice that RJDBC does not scale to 100 columns x 1M rows, or above 1000 cols x 100K records. While RODBC and ROracle both scale to these volumes, ROracle is consistently faster than RODBC: up to 2.5X faster. For RJDBC, ROracle is up to 79X faster.

Figure 1: Comparison of RJDBC, RODBC, and ROracle for BINARY_DOUBLE for Select *

In Figure 2, we provide the range of results for RODBC, ROracle, and RJDBC across all data types. Notice that only ROracle provides the full range of scalability while providing superior performance in general.

ROracle is virtually always faster than RODBC: NUMBER data up to 2.5X faster, VARCHAR2 data up to 142X faster, and time stamp data up to 214X faster. RODBC fails to process 1000 columns at 1M rows.

For RJDBC, ROracle is up to 13X faster on NUMBER data, 79X faster on binary double data, 3X for VARCHAR2 data (excluding the 25X over the smallest data set). Note that RODBC and RJDBC have a limit of 255 characters on the length the VARCHAR2 columns. TIMESTAMP data is the one area where RJDBC initially shines, but then fails to scale to larger data sets.

Figure 2: Comparing the three interfaces for select * from <table>

Data set sizes represented in megabytes are captured in Table 1 for all data types. With only minor variation, the data sizes are the same across data types.

Table 1: Dataset sizes in megabytes

Creating database tables from an R data.frame

Data or results created in R may need to be written to a database table. In Figure 3, we compare the execution time to create tables with 10, 100, and 1000 columns of data with 1K, 10, 100K, and 1M rows for BINARY_DOUBLE. Notice that in all three cases, RJDBC is slowest and does not scale. RJDBC does not support the NUMBER or BINARY_DOUBLE data types, but uses FLOAT(126) instead. ROracle scaled across the remaining data types, while RODBC and RJDBC were not tested.

ROracle is 61faster than RODBC for 10 columns x 10K rows, with a median of 5X faster across all data sets. ROracle is 630X faster on 10 columns x 10K rows, with a median of 135X faster across all data sets. RJDBC did not scale to the 1M row data sets.

Figure 3: Comparison of RJDBC, RODBC, and ROracle for BINARY_DOUBLE create table

Connecting to Oracle Database

Depending on the application any sub-second response time may be sufficient. However, as depicted in Figure 4, ROracle introduces minimal time to establish a database connection. ROracle is nearly 10X faster than RJDBC and 1.6X faster than RODBC.

Figure 4: Database connection times for ROracle, RODBC, and RJDBC

In summary, for maximal performance and scalability, ROracle can support a wide range of application needs. RJDBC has significant limitations in both performance and scalability. RODBC can be more difficult to configure on various platforms and while it largely scales to the datasets tested here, its performance lags behind ROracle.

If you use ROracle, we'd like to hear about your experience. Please take this brief survey.

All tests were performed on a 16 processor machine with 4 core Intel Xeon E5540 CPUs @ 2.53 GHz and 74 GB RAM. Oracle Database was version 11.2.0.4. For JDBC, the following was modified before installing rJava.

rJava_0.9-4.tar.gz\rJava_0.9-4.tar\rJava\jri\bootstrap\JRIBootstrap.java was modified to use 2GB :

try {

System.out.println(jl.toString()+" -cp "+System.getProperty("java.class.path")+" -Xmx2g -Dstage=2 Boot");

Process p = Runtime.getRuntime().exec(new String[] {

jl.toString(), "-cp", System.getProperty("java.class.path"),"-Xmx2g", "-Dstage=2", "Boot" });

System.out.println("Started stage 2 ("+p+"), waiting for it to finish...");

System.exit(p.waitFor());

} catch (Exception re) {}

Monday Jun 10, 2013

Bringing R to the Enterprise - new white paper available

Check out this new white paper entitled "Bringing R to the Enterprise -  A Familiar R Environment with Enterprise-Caliber Performance, Scalability, and Security."

In this white paper, we begin with "Beyond the Laptop" exploring the ability to run R code in the database, working with CRAN packages at the database server, operationalizing R analytics, and leveraging Hadoop from the comfort of the R language and environment.

Excerpt: "Oracle Advanced Analytics and Oracle R Connector for Hadoop combine the advantages of R with the power and scalability of Oracle Database and Hadoop. R programs and libraries can be used in conjunction with these database assets to process large amounts of data in a secure environment. Customers can build statistical models and execute them against local data stores as well as run R commands and scripts against data stored in a secure corporate database."

The white paper continues with three use cases involving Oracle Database and Hadoop: analyzing credit risk, detecting fraud, and preventing customer churn.  The conclusion: providing analytics for the enterprise based on the R environment is here!


Wednesday Jun 05, 2013

Oracle R Distribution for R 2.15.3 is released

We are pleased to announce that Oracle R Distribution (ORD) for R 2.15.3 is available for download today. This update consists of mostly minor bug fixes, and is the final release of the R 2.x series.

Oracle recommends using yum to install ORD from our public yum serverTo install ORD 2.15.3, first remove the previously installed ORD.  For example, if ORD 2.15.2 is installed, remove it along with it's dependencies:

1. Become root

sudo su -

2. Uninstall ORD 2.15.2 RPMs, in this order:

rpm -e R-2.15.2-1.el5.x86_64
rpm -e R-devel
rpm -e R-core
rpm -e libRmath-devel
rpm -e libRmath

3. Go to http://public-yum.oracle.com/ and follow these steps to install ORD 2.15.3:

Install the yum repository as follows

cd /etc/yum.repos.d

4a. Use this command to download the Oracle Linux 5 yum repository (el5.repo):

wget http://public-yum.oracle.com/public-yum-el5.repo

Open a text editor on the file just downloaded

vi public-yum-el5.repo

Set

"enabled=1" for [el5_addons]
"enabled=1" for [el5_latest]

4b. Use this command to download the Oracle Linux 6 yum repository (ol6.repo):

wget http://public-yum.oracle.com/public-yum-ol6.repo

Open a text editor on the file just downloaded

vi public-yum-ol6.repo

Set

"enabled=1" for [ol6_addons]
"enabled=1" for [ol6_latest]

5. To install ORD, use the command

yum install R.x86_64

You can check that ORD was installed by starting it from the command line. You will see this startup 
message:


6. Older open source R packages may need to be re-installed after an ORD upgrade, which is 
accomplished by running:

R> update.packages(checkBuilt=TRUE)

This command upgrades open source packages if a more recent version exists on CRAN or if the
installed package was build with an older version of R.

Oracle R Distribution for R 2.15.3 is certified with Oracle R Enterprise 1.3.1, available for download on Oracle Technology Network. Oracle offers support for users of Oracle R Distribution on Linux, AIX and Solaris 64 bit platforms.

Tuesday May 28, 2013

Converting Existing R Scripts to ORE - Getting Started

Oracle R Enterprise provides a comprehensive, database-centric environment for end-to-end analytical processes in R, with immediate deployment to production environments. This message really resonates with our customers who are interested in executing R functions on database-resident data while seamlessly leveraging Oracle Database as a high-performance computing (HPC) environment. The ability to develop and operationalize R scripts for analytical applications in one step is quite appealing.

One frequently asked question is how to convert existing R code that access data in flat files or the database to use Oracle R Enterprise. In this blog post, we talk about a few scenarios and how to begin a conversion from existing R code to using Oracle R Enterprise.

Consider the following scenarios:

Scenario 1: A stand-alone R script that generates its own data and simply returns a result. Data is not obtained from the file system or database. This may result from performing simulations where dadta is dynamically generated, or perhaps access from a URL on the internet.

Scenario 2: An R script that loads data from a flat file such as a CSV file, performs some computations in R, and then writes the result back to a file.

Scenario 3: An R script that loads data from a database table, via one of the database connector packages like RODBC, RJDBC, or ROracle, and writes a result back to the database –using SQL statements or package functions.

Scenario 1

A stand-alone R script might normally be run on a user’s desktop, invoked as a cron job, or even via Java to spawn an R engine and retrieve the result, but we’d like to operationalize its execution as part of a database application, invoked from SQL. Here’s a simple script to illustrate the concept of converting such a script to be executed at the database server using ORE’s embedded R execution. The script generates a data.frame with some random columns, performs summary on that data and returns the summary statistics, which are represented as an R table.

# generate data

set.seed(1)

n <- 1000

df <- 3

x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

# perform some analysis

res <- summary(x)

#return the result

res


To convert this to use ORE, create a function with appropriate arguments and body, for example:

myFunction1 <- function (n = 1000, df = 3,seed=1) {

set.seed(seed)

x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

res <- summary(x)

res

}

Next, load the ORE packages and connect to Oracle Database using the ore.connect function. Using the all argument set to TRUE loads metadata for all the tables and views in that schema. We then store the function in the R script repository, invoking it via ore.doEval.

# load ORE packages and connect to Oracle Database

library(ORE)

ore.connect("schema","sid","hostname","password",port=1521, all=TRUE)

# load function into R script repository

ore.scriptDrop("myFunction-1")

ore.scriptCreate("myFunction-1", myFunction1)

# invoke using embedded R execution at the database server

ore.doEval(FUN.NAME="myFunction-1")

> ore.doEval(FUN.NAME="myFunction-1")
       a                b                  c           
 Min.   :   1.0   Min.   :-3.00805   Min.   : 0.03449  
 1st Qu.: 250.8   1st Qu.:-0.69737   1st Qu.: 1.27386  
 Median : 500.5   Median :-0.03532   Median : 2.36454  
 Mean   : 500.5   Mean   :-0.01165   Mean   : 3.07924  
 3rd Qu.: 750.2   3rd Qu.: 0.68843   3rd Qu.: 4.25994  
 Max.   :1000.0   Max.   : 3.81028   Max.   :17.56720  

Of course, we’re using default values here. To provide different arguments, change the invocation with arguments as follows:

ore.doEval(FUN.NAME="myFunction-1", n=500, df=5, seed=2)

> ore.doEval(FUN.NAME="myFunction-1", n=500, df=5, seed=2)
       a               b                  c          
 Min.   :  1.0   Min.   :-2.72182   Min.   : 0.1621  
 1st Qu.:125.8   1st Qu.:-0.65346   1st Qu.: 2.6144  
 Median :250.5   Median : 0.04392   Median : 4.4592  
 Mean   :250.5   Mean   : 0.06169   Mean   : 5.0386  
 3rd Qu.:375.2   3rd Qu.: 0.79096   3rd Qu.: 6.8467  
 Max.   :500.0   Max.   : 2.88842   Max.   :17.0367  

Having successfully invoked this from the R client (my laptop), we can now invoke it from SQL. Here, we retrieve the summary result, which is an R table, as an XML string.

select *

from table(rqEval( NULL,'XML','myFunction-1'));

The result can be viewed from SQL Developer.

The following shows the XML output in a more structured manner.


What if we wanted to get the result to appear as a SQL table? Since the current result is an R table (an R object), we need to convert this to a data.frame to return it. We’ll make a few modifications to “myFunction-1” above. Most notably is the need to convert the table object in res to a data.frame. There are a variety of ways to do this.

myFunction2 <- function (n = 1000, df = 3,seed=1) {

# generate data

set.seed(seed)

x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

# perform some analysis

res <- summary(x)

# convert the table result to a data.frame

res.df <- as.matrix(res)

res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))

names(res.sum) <- c('a','b','c')

res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")

res.sum <- res.sum[,c(4,1:3)]

res.sum

}

# load function into R script repository

ore.scriptDrop("myFunction-2")

ore.scriptCreate("myFunction-2", myFunction2)

We’ll now modify the SQL statement to specify the format of the result.

select *

from table(rqEval( NULL,'select cast(''a'' as VARCHAR2(12)) as "statname",

1 "a", 1 "b", 1 "c" from dual ','myFunction-2'));

Here’s the result as viewed from SQL Developer.


This type of result could be incorporated into any SQL application accepting table or view input from a SQL query. That is particular useful in combination with OBIEE dashboards via an RPD.

Scenario 2

If you’ve been loading data from a flat file, perhaps a CSV file, your R code may look like the following, where it specifies to builds a model and write hat model to a file for future use, perhaps in scoring. It also generates a graph of the clusters highlighting the individual points, colored by their cluster id, with the centroids indicated with a star.

# read data

setwd("D:/datasets")

dat <- read.csv("myDataFile.csv")

# build a clustering model

cl <- kmeans(x, 2)

# write model to file

save(cl, file="myClusterModel.dat")

# create a graph and write it to a file

pdf("myGraphFile.pdf")

plot(x, col = cl$cluster)

points(cl$centers, col = 1:2, pch = 8, cex=2)

dev.off()

The resulting PDF file contains the following image.


To convert this script for use in ORE, there are several options. We’ll explore two: the first involving minimal change to use embedded R execution, and the second leveraging in-database techniques. First, we’ll want the data we used above in variable dat to be loaded into the database.

# create a row id to enable ordered results (if a key doesn’t already exist)

dat$ID <- 1:nrow(dat)

# remove the table if it exists

ore.drop("MY_DATA")

# create the table using the R data.frame, resulting in an ore.frame named MY_DATA

ore.create(dat,"MY_DATA")

# assign the ID column as the row.names of the ore.frame

row.names(MY_DATA) <- MY_DATA$ID

In the first example, we’ll use embedded R execution and pass the data to the function via ore.tableApply. We’ll generate the graph, but simply display it within the function to allow embedded R execution to return the graph as a result. (Note we could also write the graph to a file in any directory accessible to the database server.) Instead of writing the model to a file, which requires keeping track of its location, as well as worring about backup and recovery, we store the model in the database R datastore using ore.save. All this requires minimal change. As above, we could store the function in the R script repository and invoke it by name – both from R and SQL. In this example, we simply provide the function itself as argument.

myClusterFunction1 <- function(x) {

cl <- kmeans(x, 2)

ore.save(cl, name="myClusterModel",overwrite=TRUE)

plot(x, col = cl$cluster)

points(cl$centers, col = 1:2, pch = 8, cex=2)

TRUE

}

ore.tableApply(MY_DATA[,c('x','y')], myClusterFunction1,

ore.connect=TRUE,ore.png.height=700,ore.png.width=700)

The ore.tableApply function projects the x and y columns of MY_DATA as input and also specifies ore.connect as TRUE since we are using the R datastore, which requires a database connection. Optionally, we can specify control arguments to the PNG output. In this example, these are the height and width of the image.

For the second example, we convert this to leverage the ORE Transparency Layer. We’ll use the in-database K-Means algorithm and save the model in a datastore named “myClusteringModel”, as we did above. Since ore.odmKMeans doesn’t automatically assign cluster ids (since the data may be very large or are not required), the scoring is done separately. Note, however, that the prediction results also exist in the database as an ore.frame. To ensure ordering, we also assign row.names to the ore.frame pred. Lastly, we create the plot. Coloring the nodes requires pulling the cluster assignments; however, the points themselves can be accessed from the ore.frame. The centroids points are obtained from cl$centers2 of the cluster model.

# build a clustering model in-database

cl <- ore.odmKMeans(~., MY_DATA, 2, auto.data.prep=FALSE)

# save model in database R datastore

ore.save(cl,name="myClusterModel",overwrite=TRUE)

# generate predictions to assign each row a cluster id, supplement with original data

pred <- predict(cl,MY_DATA,supp=c('x','y','ID'),type="class")

# assign row names to ensure ordering of results

row.names(pred) <- pred$ID

# create the graph

plot(pred[,c('x','y')], col = ore.pull(pred$CLUSTER_ID))

points(cl$centers2[,c('x','y')], col = c(2,3), pch = 8, cex=2)

We can also combine using the transparency layer within an embedded R function. But we’ll leave that as an exercise to the reader.

Scenario 3

In this last scenario, the data already exists in the database and one of the database interface packages, such as RODBC, RJDBC, and ROracle is be used to retrieve data from and write data to the database. We’ll illustrate this with ROracle, but the same holds for the other two packages.

# connect to the database

drv <- dbDriver("Oracle")

con <- dbConnect(drv, "mySchema", "myPassword")

# retrieve the data specifying a SQL query

dat <- dbGetQuery(con, 'select * from MY_RANDOM_DATA where "a" > 100')

# perform some analysis

res <- summary(dat)

# convert the table result to a data.frame for output as table

res.df <- as.matrix(res)

res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))

names(res.sum) <- c('a','b','c')

res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")

res.sum <- res.sum[,c(4,1:3)]

res.sum

dbWriteTable(con, "SUMMARY_STATS", res.sum)

Converting this to ORE is straightforward. We’re already connected to the database using ore.connect from previous scenarios, so the existing table MY_RANDOM_DATA was already loaded in the environment as an ore.frame. Executing ore.ls lists this table is the result, so we can just start using it.

> ore.ls(pattern="MY_RAND")

[1] "MY_RANDOM_DATA"

# no need to retrieve the data, use the transparency layer to compute summary

res <- with(MY_RANDOM_DATA , summary(MY_RANDOM_DATA[a > 100,]))

# convert the table result to a data.frame for output as table

res.df <- as.matrix(res)

res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))

names(res.sum) <- c('a','b','c')

res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")

res.sum <- res.sum[,c(4,1:3)]

# create the database table

ore.create(res.sum, "SUMMARY_STATS")

SUMMARY_STATS


As we did in previous scenarios, this script can also be wrapped in a function and used in embedded R execution. This too is left as an exercise to the reader.

Summary

As you can see from the three scenarios discussed here, converting a script that accesses no external data, accesses and manipulates file data, or accesses and manipulates database data can be accomplished with a few strategic modifications. More involved scripts, of course, may require additional manipulation. For example, if the SQL query performs complex joins and filtering, along with derived column creation, the user may want to convert this SQL to the corresponding ORE Transparency Layer code, thereby eliminating reliance on SQL. But that’s a topic for another post.

Friday May 24, 2013

HOWTO: X11 Forwarding for Oracle R Enterprise

Oracle R Enterprise enables users to generate R graphs at the database server and return them in a variety of ways: an XML representation using base 64 encoding of the PNG images, in a table with a BLOB column containing the PNG images, and interactively returning the actual image to the R user at the client. This last case allows users to generate images at the database server machine and have the actual PNG image display at the user’s client R engine.

To take advantage of this capability, users may need to ensure their X11 is properly configured. This blog highlights a solution to a common problem involving X11. W
hen using a graphically based function in Oracle R Enterprise, if you’ve encountered errors such as:

Error in X11(paste("png::", filename, sep = ""), width, height, pointsize,
unable to start device PNG
then read on. The issue is likely that your database server is not configured to run graphics programs locally. 

The X Window system, or X11, allows you to forward a program display from a remote system to a local computer.  X11 is the native windowing interface on Linux However, X11 is not the default for all Unix Operating Systems, and  additional configuration steps may be required to display graphical programs if your server is running Unix. Follow the instructions below to configure your server to forward graphics the display to your local client machine.


X11 forwarding from a Linux client

There are two options presented here. The first uses SSH and the second uses telnet.

Option 1: Usually, when you want to connect to your Unix server from a remote Linux client, you use SSH (Secure Shell). Before logging in to your Unix server, confirm that /etc/ssh/sshd_config contains the following X11 tunneling options:

X11forwarding yes

X11DisplayOffset 10
X11UseLocalhost yes

SSH allows you to make a secure terminal connection to your Unix server from your Linux client using this syntax:

ssh -Y <userid@unixserver>

The –Y option to ssh treats the Unix server as trusted , -X treats it as untrusted. Check with your server or network admin about which flag to use. This command also sets the remote DISPLAY to localhost:10.0.

Option 2: Connecting to the server via telnet

If you choose to connect to the server using telnet, keep in mind that unlike SSH, telnet does not offer the security measures that protect users against anyone with malicious intent. Using telnet, an X11 server can be manually set at a Linux client that is capable of graphical display.
To confirm the graphical capability, verify that a terminal appears after entering at the Linux prompt:

xterm

You also need to know the display environment variable setting on X11 server, the Linux client:

echo $DISPLAY

The DISPLAY environment variable stores the displaynumber and screennumber that the X11 server uses to display. These addresses are in the form:

localhost:displaynumber.screennumber

A typical example would be:

localhost:0.0

Next, enable the Unix server to display on the Linux client:

xhost + <unixserver >

and telnet into the Unix server and set its DISPLAY to the X11 server – the Linux client.

export DISPLAY= X11server:displaynumber.screennumber

After following the steps for either option, you should now be able to launch a remote graphical application locally.  As a quick check, launch your remote Unix server's clock on your client desktop through the SSH connection using X11 forwarding:

1. Type xclock at the Unix server command prompt and hit enter.
2  Your remote server’s X11 GUI clock should appear on your client desktop.

3. If the xclock tests succeeds, launch the ORE client to verify the same DISPLAY setting is used by embedded R:

R> ore.connect(user="<username>",sid="<sid>",host="<hostname>",password="<password>", all=TRUE)
R> ore.is.connected()

TRUE
R
> ore.eval(function() Sys.getenv("DISPLAY"), ore.graphics=FALSE)

If the last returned value matches the DISPLAY setting, you will be able to display images at the client machine.
X11 forwarding from a Windows client

To connect to your remote Unix server from Windows and use its graphical interface, you need two pieces of software: an SSH program to establish the remote connection and an X Server to handle the local display. For the SSH program we'll use PuTTY. For the server, we'll use Xming.

PuTTY is a free SSH client that allows you to connect to a remote Linux computer and use the command line. PuTTY can also be used to forward secure data over SSH to other programs - this is called tunneling.

When you connect to your remote Linux computer, you will need to set several connection settings to make everything work correctly. PuTTY lets you save these settings in a session so you can reuse them the next time you connect. To create a session that allows PuTTY to forward your Linux computer's X11 graphical interface over SSH:

1. Open PuTTY on your Windows desktop. Putty will open and display the Session panel. In the Host Name field, type the hostname or IP address of your Unix server.

2. In the field underneath the Saved Sessions label, type a name for your saved session.


3. Under the Connection category, expand SSH and choose X11. Click the Enable X11
Forwarding checkbox.




4. Go back to the Session category and click Save to save your session connection settings.


 Now we're ready to set up the X server using Xming, which is a free X Window server for the Windows desktop. With Xming, you can display graphical applications from your remote Linux computer on your Windows desktop. Xming provides a simple utility called Xlaunch that allows you to configure Xming easily, and also save your configuration for future use. To run Xming, open XLaunch and select the configuration outlined here:

1. Open XLaunch from the program menu. Select Multiple Windows and click Next. This tells Xming to open
each remote Linux application in a new window.



2. Select Start No Client and click Next. This tells Xming to launch and wait for commands from
another program (like PuTTY).





3. Make sure that Clipboard is selected and click Next. This tells Xming to enable your remote
Linux applications to share a unified clipboard.



4. Click the Finish button to launch Xming.




Now that Xming is running, you can open your PuTTY session and launch a graphical application. As a quick check, launch your remote Unix server’s clock on your Windows desktop through the SSH connection using X11 forwarding:

1. Open PuTTY.
2. Double-click on the saved session you created earlier. PuTTY will create an SSH connection to your remote Unix server.
3. Login to your Unix server.
4. Type xclock at the command prompt and hit enter.
5. Your remote server’s GUI clock should appear on your client desktop.

6. If the xclock test succeeds, If the xclock tests succeeds, launch the ORE client to verify the same DISPLAY setting is used by embedded R:

R> ore.connect(user="<username>",sid="<sid>",host="<hostname>",password="<password>", all=TRUE)
R> ore.is.connected()

TRUE
R
> ore.eval(function() Sys.getenv("DISPLAY"), ore.graphics=FALSE)

If the last returned value matches the DISPLAY setting, you will be able to display images at the client machine.  Here's an example that creates a panel plot using the R dataset mtcars:

ore.doEval(function() {
  library(lattice)
  xyplot(mpg ~ hp | factor(cyl),
         data=mtcars,
         type=c("p", "r"),
         main="Fuel economy vs. Performance with Number of Cylinders",
         xlab="Performance (horse power)",
         ylab="Fuel economy (miles per gallon)",
         scales=list(cex=0.75))
})


Wednesday May 22, 2013

Big Data Analytics in R – the tORCH has been lit!

This guest post from Anand Srinivasan compares performance of the Oracle R Connector for Hadoop with the R {parallel} package for covariance matrix computation, sampling, and parallel linear model fitting. 

Oracle R Connector for Hadoop (ORCH) is a collection of R packages that enables Big Data analytics from the R environment. It enables a Data Scientist /Analyst to work on data straddling multiple data platforms (HDFS, Hive, Oracle Database, local files) from the comfort of the R environment and benefit from the R ecosystem.

ORCH provides:

1) Out of the box predictive analytic techniques for linear regression, neural networks for prediction, matrix completion using low rank matrix factorization, non-negative matrix factorization, kmeans clustering, principal components analysis and multivariate analysis. While all these techniques have R interfaces, they are implemented either in Java or in R as distributed parallel implementations leveraging all nodes of your Hadoop cluster

2) A general framework, where a user can use the R language to write custom logic executable in a distributed parallel manner using available compute and storage resources.

The main idea behind the ORCH architecture and its approach to Big Data analytics is to leverage the Hadoop infrastructure and thereby inherit all its advantages.

The crux of ORCH is read parallelization and robust methods over parallelized data. Efficient parallelization of reads is the single most important step necessary for Big Data Analytics because it is either expensive or impractical to load all available data in a single thread.

ORCH is often compared/contrasted with the other options available in R, in particular the popular open source R package called parallel. The parallel package provides a low-level infrastructure for “coarse-grained” distributed and parallel computation. While it is fairly general, it tends to encourage an approach that is based on using the aggregate RAM in the cluster as opposed to using the file system. Specifically, it lacks a data management component, a task management component and an administrative interface for monitoring. Programming, however, follows the broad Map Reduce paradigm.

 In the rest of this article, we assume that the reader has basic familiarity with the parallel package and proceed to compare ORCH and its approach with the parallel package. The goal of this comparison is to explain what it takes for a user to build a solution for their requirement using each of these technologies and also to understand the performance characteristics of these solutions.

We do this comparison using three concrete use cases – covariance matrix computation, sampling and partitioned linear model fitting. The exercise is designed to be repeatable, so you, the reader, can try this “at home”. We will demonstrate that ORCH is functionally and performance-wise superior to the available alternative of using R’s parallel package.

A six node Oracle Big Data Appliance v2.1.1 cluster is used in the experiments. Each node in this test environment has 48GB RAM and 24 CPU cores.

Covariance Matrix Computation

Computing covariance matrices is one of the most fundamental of statistical techniques.

In this use case, we have a single input file, “allnumeric_200col_10GB” (see appendix on how to generate this data set), that is about 10GB in size and has a data matrix with about 3 million rows and 200 columns. The requirement is to compute the covariance matrix of this input matrix.

Since a single node in the test environment has 48GB RAM and the input file is only 10GB, we start with the approach of loading the entire file into memory and then computing the covariance matrix using R’s cov function.

> system.time(m <- matrix(scan(file="/tmp/allnumeric_200col_10GB",what=0.0, sep=","), ncol=200, byrow=TRUE))

Read 611200000 items

user system elapsed

683.159 17.023 712.527

> system.time(res <- cov(m))

user system elapsed

561.627 0.009 563.044

We observe that the loading of data takes 712 seconds (vs. 563 seconds for the actual covariane computation) and dominates the cost. It would be even more pronounced (relative to the total elapsed time) if the cov(m) computation were parallelized using mclapply from the parallel package.

Based on this, we see that for an efficient parallel solution, the main requirement is to parallelize the data loading. This requires that the single input file be split into multiple smaller-sized files. The parallel package does not offer any data management facilities; hence this step has to be performed manually using a Linux command like split. Since there are 24 CPU cores, we split the input file into 24 smaller files.

time(split -l 127334 /tmp/allnumeric_200col_10GB)

real 0m54.343s

user 0m3.598s

sys 0m24.233s

Now, we can run the R script:

library(parallel)

# Read the data

readInput <- function(id) {

infile <- file.path("/home/oracle/anasrini/cov",paste("p",id,sep=""))

print(infile)

m <- matrix(scan(file=infile, what=0.0, sep=","), ncol=200, byrow=TRUE)

m

}

# Main MAPPER function

compCov <- function(id) {

m <- readInput(id)  # read the input

cs <- colSums(m)    # compute col sums, num rows

# compute main cov portion

nr <- nrow(m)      

mtm <- crossprod(m)

list(mat=mtm, colsum=cs, nrow=nr)

}

numfiles <- 24

numCores <- 24

# Map step

system.time(mapres <- mclapply(seq_len(numfiles), compCov, mc.cores=numCores))

# Reduce step

system.time(xy <- Reduce("+", lapply(mapres, function(x) x$mat)))

system.time(csf <- Reduce("+", lapply(mapres, function(x) x$colsum)))

system.time(nrf <- Reduce("+", lapply(mapres, function(x) x$nrow)))

sts <- csf %*% t(csf)

m1 <- xy / (nrf -1)

m2 <- sts / (nrf * (nrf-1))

m3 <- 2 * sts / (nrf * (nrf-1))

covmat <- m1 + m2 - m3

user system elapsed

1661.196 21.209 77.781

We observe that the elapsed time (excluding time to split the files) has now come down to 77 seconds. However, it took 54 seconds for splitting the input file into smaller files, making it a significant portion of the total elapsed time of 77+54 = 131 seconds.

Besides impacting performance, there are a number of more serious problems with having to deal with data management manually. We list a few of them here:

1) In other scenarios, with larger files or larger number of chunks, placement of chunks also becomes a factor that influences I/O parallelism. Optimal placement of chunks of data over the available set of disks is a non-trivial problem

2) Requirement of root access – Optimal placement of file chunks on different disks often requires root access. For example, only root has permissions to create files on disks corresponding to the File Systems mounted on /u03, /u04 etc on an Oracle Big Data Appliance node

3) When multiple nodes are involved in the computation, moving fragments of the original data into different nodes manually can drain productivity

4) This form of split can only work in a static environment – in a real-world dynamic environment, information about other workloads and their resource utilization cannot be factored in a practical manner by a human

5) Requires admin to provide user access to all nodes of the cluster in order to allow the user to move data to different nodes

ORCH-based solution

On the other hand, using ORCH, we can directly use the out of the box support for multivariate analysis. Further, no manual steps related to data management (like splitting files and addressing chunk placement issues) are required since Hadoop (specifically HDFS) handles all those requirements seamlessly.

>x <- hdfs.attach("allnumeric_200col_10GB")

> system.time(res <- orch.cov(x))

user system elapsed

18.179 3.991 85.640

Forty-two concurrent map tasks were involved in the computation above as determined by Hadoop.

To conclude, we can see the following advantages of the ORCH based approach in this scenario :

1) No manual steps. Data Management completely handled transparently by HDFS

2) Out of the box support for cov. The distributed parallel algorithm is available out of the box and the user does not have to work it out from scratch

3) Using ORCH we get comparable performance to that obtained through manual coding without any of the manual overheads

Sampling

We use the same single input file, “allnumeric_200col_10GB” in this case as well. The requirement is to obtain a uniform random sample from the input data set. The size of the sample required is specified as a percentage of the input data set size.

Once again for the solution using the parallel package, the input file has to be split into smaller sized files for better read parallelism.

library(parallel)

# Read the data

readInput <- function(id) {

infile <- file.path("/home/oracle/anasrini/cov", paste("p",id,sep=""))

print(infile)

system.time(m <- matrix(scan(file=infile, what=0.0, sep=","),

ncol=200, byrow=TRUE))

m

}

# Main MAPPER function

samplemap <- function(id, percent) {

m <- readInput(id)    # read the input

v <- runif(nrow(m))   # Generate runif

# Pick only those rows where random < percent*0.01

keep <- which(v < percent*0.01)

m1 <- m[keep,,drop=FALSE]

m1

}

numfiles <- 24

numCores <- 24

# Map step

percent <- 0.001

system.time(mapres <- mclapply(seq_len(numfiles), samplemap, percent,

mc.cores=numCores))

user system elapsed

1112.998 23.196 49.561

ORCH based solution

>x <- hdfs.attach("allnumeric_200col_10GB_single")

>system.time(res <- orch.sample(x, percent=0.001))

user system elapsed

8.173 0.704 33.590

The ORCH based solution out-performs the solution based on the parallel package. This is because orch.sample is implemented in Java and the read rates obtained by a Java implementation are superior to what can be achieved in R.

Partitioned Linear Model Fitting

Partitioned Linear Model Fitting is a very popular use case. The requirement here is to fit separate linear models, one for each partition of the data. The data itself is partitioned based on a user-specified partitioning key.

For example, using the ONTIME data set, the user could specify destination city as the partitioning key indicating the requirement for separate linear models (with, for example, ArrDelay as target), 1 per destination city.

ORCH based solution

dfs_res <- hadoop.run(

data = input,

mapper = function(k, v) { orch.keyvals(v$Dest, v) },

reducer = function(k, v) {

lm_x <- lm(ArrDelay ~ DepDelay + Distance, v)

orch.keyval(k, orch.pack(model=lm_x, count = nrow(v)))

},

config = new("mapred.config",

job.name = "ORCH Partitioned lm by Destination City",

map.output = mapOut,

mapred.pristine = TRUE,

reduce.output = data.frame(key="", model="packed"),

)

)

Notice that the Map Reduce framework is performing the partitioning. The mapper just picks out the partitioning key and the Map Reduce framework handles the rest. The linear model for each partition is then fitted in the reducer.

parallel based solution

As in the previous use cases, for good read parallelism, the single input file needs to be split into smaller files. However, unlike the previous use cases, there is a twist here.

We noted that with the ORCH based solution it is the Map Reduce framework that does the actual partitioning. There is no such out of the box feature available with a parallel package-based solution. There are two options:

1) Break up the file arbitrarily into smaller pieces for better read parallelism. Implement your own partitioning logic mimicking what the Map Reduce framework provides. Then fit linear models on each of these partitions in parallel.

OR

2) Break the file into smaller pieces such that each piece is a separate partition. Fit linear models on each of these partitions in parallel 

Both of these options are not easy and require a lot of user effort. The custom coding required for achieving parallel reads is significant.

Conclusion

ORCH provides a holistic approach to Big Data Analytics in the R environment. By leveraging the Hadoop infrastructure, ORCH inherits several key components that are all required to address real world analytics requirements.

The rich set of out-of-the-box predictive analytic techniques along with the possibility of authoring custom parallel distributed analytics using the framework (as demonstrated in the partitioned linear model fitting case) helps simplify the user’s task while meeting the performance and scalability requirements. 

Appendix – Data Generation

We show the steps required to generate the single input file “allnumeric_200col_10GB”.

Run the following in R:

x <- orch.datagen(datasize=10*1024*1024*1024, numeric.col.count=200,

map.degree=40)

hdfs.mv(x, "allnumeric_200col_10GB")

Then, from the Linux shell:

hdfs dfs –rm –r –skipTrash /user/oracle/allnumeric_200col_10GB/__ORCHMETA__

hdfs dfs –getmerge /user/oracle/allnumeric_200col_10GB /tmp/allnumeric_200col_10GB


Monday May 06, 2013

Oracle R Distribution for R 2.15.2 available on public-yum

Oracle R Distribution (ORD) for R 2.15.2 on Linux is now available for download from Oracle's public-yum repository.  R 2.15.2 is a maintenance update that includes improved performance and reduced memory usage for some commonly-used functions, increased memory available for data on 64-bit systems, enhanced localization for Polish language users, and a number of bug fixes.  Detailed updates can be found in the NEWS file - see the section 'CHANGES IN R VERSION 2.15.2'.

The most recent update to Oracle R Enterprise, version 1.3.1, is certified with both R 2.15.1 and R 2.15.2. Installing ORD from public-yum will pull the most recently posted version, R 2.15.2.  For example, on Oracle Linux 5, as root, cd to yum.repos.d, download the public yum repository configuration file, and enable the required repositories:

    cd /etc/yum.repos.d
    wget http://public-yum.oracle.com/public-yum-el5.repo
    Edit file public-yum-el5.repo and set
        "enabled=1" for [el5_addons]
        "enabled=1" for [el5_latest]

Next, install ORD:



Start R.  Oracle R Distribution for R 2.15.2 is installed.



To install an older version of ORD such as R 2.15.1, simply specify the R version at the install step:

    yum install R-2.15.1

Detailed instructions for installing Oracle R Distribution are in the Oracle R Enterprise Installation and Administration Guide.  Oracle R Distribution for R 2.15.2 on AIX, Solaris X86 and Solaris SPARC will be available on Oracle's Free and Open Source Software portal in the coming weeks.

Wednesday Apr 17, 2013

Mind Reading... What are our customers thinking?

Overhauling analytics processes is becoming a recurring theme among customers. A major telecommunication provider recently embarked on overhauling their analytics process for customer surveys. They had three broad technical goals:

  • Provide an agile environment that empowers business analysts to test hypotheses based on survey results
  • Allow dynamic customer segmentation based on survey responses and even specific survey questions to drive hypothesis testing
  • Make results of new surveys readily available for research

The ultimate goal is to derive greater value from survey research that drives measurable improvements in survey service delivery, and as a result, overall customer satisfaction.

This provider chose Oracle Advanced Analytics (OAA) to power their survey research. Survey results and analytics are maintained in Oracle Database and delivered via a parameterized BI dashboard. Both the database and BI infrastructure are standard components in their architecture.

A parameterized BI dashboard enables analysts to create samples for hypothesis testing by filtering respondents to a survey question based on a variety of filtering criteria. This provider required the ability to deploy a range of statistical techniques depending on the survey variables, level of measurement of each variable, and the needs of survey research analysts.

Oracle Advanced Analytics offers a range of in-database statistical techniques complemented by a unique architecture supporting deployment of open source R packages in-database to optimize data transport to and from database-side R engines. Additionally, depending on the nature of functionality in such R packages, it is possible to leverage data-parallelism constructs available as part of in-database R integration. Finally, all OAA functionality is exposed through SQL, the ubiquitous language of the IT environment. This enables OAA-based solutions to be readily integrated with BI and other IT technologies.

The survey application noted above has been in production for 3 months. It supports a team of 20 business analysts and has already begun to demonstrate measurable improvements in customer satisfaction.

In the rest of this blog, we explore the range of statistical techniques deployed as part of this application.

At the heart of survey research is hypothesis testing. A completed customer satisfaction survey contains data used to draw conclusions about the state of the world. In the survey domain, hypothesis testing is comparing the significance of answers to specific survey questions across two distinct groups of customers - such groups are identified based on knowledge of the business and technically specified through filtering predicates.

Hypothesis testing sets up the world as consisting of 2 mutually exclusive hypotheses:

a) Null hypothesis - states that there is no difference in satisfaction levels between the 2 groups of customers

b) Alternate hypothesis states that there is a significant difference in satisfaction levels between the 2 groups of customers

Obviously only one of these can be true and the true-ness is determined by the strength, probability, or likelihood of the null hypothesis over the alternate hypothesis. Simplistically, the degree of difference between, e.g., the average score from a specific survey question across two customer groups could provide the necessary evidence in helping decide which hypothesis is true.

In practice the process of providing evidence to make a decision involves having access to a range of test statistics – a number calculated from each group that helps determine the choice of null or alternate hypothesis. A great deal of theory, experience, and business knowledge goes into selecting the right statistic based on the problem at hand.

The t-statistic (available in-database) is a fundamental function used in hypothesis testing that helps understand the differences in means across two groups. When the t-values across 2 groups of customers for a specific survey question are extreme then the alternative hypothesis is likely to be true. It is common to set a critical value that the observed t-value should exceed to conclude that the satisfaction survey results across the two groups are significantly different. Other similar statistics available in-database include F-test, cross tabulation (frequencies of various response combinations captured as a table), related hypothesis testing functions such as chi-square functions, Fisher's exact test, Kendall's coefficients, correlation coefficients and a range of lambda functions.

If an analyst desires to compare across more than 2 groups then analysis of variance (ANOVA) is a collection of techniques that is commonly used. This is an area where the R package ecosystem is rich with several proven implementations. The R stats package has implementations of several test statistics and function glm allows analysis of count data common in survey results including building Poisson and log linear models. R's MASS package implements a popular survey analysis technique called iterative proportional fitting. R's survey package has a rich collection of features (http://faculty.washington.edu/tlumley/survey/).

The provider was specifically interested in one function in the survey package - raking (also known as sample balancing) - a process that assigns a weight to each customer that responded to a survey such that the weighted distribution of the sample is in very close agreement with other customer attributes, such as the type of cellular plan, demographics, or average bill amount. Raking is an iterative process that uses the sample design weight as the starting weight and terminates when a convergence is achieved.

For this survey application, R scripts that expose a wide variety of statistical techniques - some in-database accessible through the transparency layer in Oracle R Enterprise and some in CRAN packages - were built and stored in the Oracle R Enterprise in-database R script repository. These parameterized scripts accept various arguments that identify samples of customers to work with as well as specific constraints for the various hypothesis test functions. The net result is greater agility since the business analyst determines both the set of samples to analyze as well as the application of the appropriate technique to the sample based on the hypothesis being pursued.

For more information see these links for Oracle's R Technologies software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop

Monday Apr 15, 2013

Is the size of your lm model causing you headaches?

If you build an R lm model with a relatively large number of rows, you may be surprised by just how large that lm model is and what impact it has on your environment and application.

Why might you care about size? The most obvious is that the size of R objects impacts the amount of RAM available for further R processing or loading of more data. However, it also has implications for how much space is required to save that model or the time required to move it around the network. For example, you may want to move the model from the database server R engine to the client R engine when using Oracle R Enterprise Embedded R Execution. If the model is too large, you may encounter latency when trying to retrieve the model or even receive the following error:

Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch,  :
  ORA-20000: RQuery error
Error : serialization is too large to store in a raw vector

If you get this error, there are at least a few options:

  • Perform summary component access, like coefficients, inside the embedded R function and return only what is needed
  • Save the model in a database R datastore and manipulate that model at the database server to avoid pulling it to the client
  • Reduce the size of the model by eliminating large and unneeded components

In this blog post, we focus on the third approach and look at the size of lm model components, what you can do to control lm model size, and the implications for doing so. With vanilla R, objects are the "memory" serving as the repository for repeatability. As a result, models tend to be populated with the data used to build them to ensure model build repeatability.

When working with database tables, this "memory" is not needed because governance mechanisms are already in place to ensure either data does not change or logs are available to know what changes took place. Hence it is unnecessary to store the data used to build the model into the model object.

An lm model consists of several components, for example:

coefficients, residuals, effects, fitted.values, rank, qr, df.residual, call, terms, xlevels, model, na.action

Some of these components may appear deceptively small using R’s object.size function. The following script builds an lm model to help reveal what R reports for the size of various components. The examples use a sample of the ONTIME airline arrival and departure delays data set for domestic flights. The ONTIME_S data set is an ore.frame proxy object for data stored in an Oracle database and consists of 219932 rows and 26 columns. The R data.frame ontime_s is this same data pulled to the client R engine using ore.pull and is ~39.4MB.

Note: The results reported below use R 2.15.2 on Windows. Serialization of some components in the lm model has been improved in R 3.0.0, but the implications are the same.

f.lm.1 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)

lm.fit.1 <- f.lm.1(ontime_s)

object.size(lm.fit.1)

54807720 bytes

Using the object.size function on the resulting model, the size is about 55MB. If only scoring data with this model, it seems like a lot of bloat for the few coefficients assumed needed for scoring. Also, to move this object over a network will not be instantaneous. But is this the true size of the model?

A better way to determine just how big an object is, and what space is actually required to store the model or time to move it across a network, is the R serialize function.

length(serialize(lm.fit.1,NULL))

[1] 65826324

Notice that the size reported by object.size is different from that of serialize – a difference of 11MB or ~20% greater.

What is taking up so much space? Let’s invoke object.size on each component of this lm model:

lapply(lm.fit.1, object.size)
$coefficients

424 bytes

$residuals

13769600 bytes

$effects

3442760 bytes

$rank

48 bytes

$fitted.values

13769600 bytes

$assign

56 bytes

$qr

17213536 bytes

$df.residual

48 bytes

$na.action

287504 bytes

$xlevels

192 bytes

$call

1008 bytes

$terms

4432 bytes

$model

6317192 bytes

The components residuals, fitted.values, qr, model, and even na.action are large. Do we need all these components?

The lm function provides arguments to control some aspects of model size. This can be done, for example, by specifying model=FALSE and qr=FALSE. However, as we saw above, there are other components that contribute heavily to model size.

f.lm.2 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY,
                           data = dat, model=FALSE, qr=FALSE)

lm.fit.2 <- f.lm.2(ontime_s)

length(serialize(lm.fit.2,NULL))

[1] 51650410

object.size(lm.fit.2)

31277216 bytes

The resulting serialized model size is down to about ~52MB, which is not significantly smaller than the full model.The difference with the result reported by object.size is now ~20MB, or 39% smaller.

Does removing these components have any effect on the usefulness of an lm model? We’ll explore this using four commonly used functions: coef, summary, anova, and predict. If we try to invoke summary on lm.fit.2, the following error results:

summary(lm.fit.2)

Error in qr.lm(object) : lm object does not have a proper 'qr' component.

Rank zero or should not have used lm(.., qr=FALSE).

The same error results when we try to run anova. Unfortunately, the predict function also fails with the error above. The qr component is necessary for these functions. Function coef returns without error.

coef(lm.fit.2)

(Intercept) DISTANCE DEPDELAY

0.225378249 -0.001217511 0.962528054

If only coefficients are required, these settings may be acceptable. However, as we’ve seen, removing the model and qr components, while each is large, still leaves a large model. The really large components appear to be the effects, residuals, and fitted.values. We can explicitly nullify them to remove them from the model.

f.lm.3 <- function(dat) {
mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY,
data = dat, model=FALSE, qr=FALSE)
mod$effects <- mod$residuals <- mod$fitted.values <- NULL
  mod
}

lm.fit.3 <- f.lm.3(ontime_s)

length(serialize(lm.fit.3,NULL))

[1] 24089000

object.size(lm.fit.3)

294968 bytes

Thinking the model size should be small, we might be surprised to see the results above. The function object.size reports ~295KB, but serializing the model shows 24MB, a difference of 23.8MB or 98.8%. What happened? We’ll get to that in a moment. First, let’s explore what effect nullifying these additional components has on the model.

To answer this, we’ll turn on model and qr, and focus on effects, residuals, and fitted.values. If we nullify effects, the anova results are invalid, but the other results are fine. If we nullify residuals, summary cannot produce residual and coefficient statistics, but it also produces an odd F-statistic with a warning:

Warning message:

In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

The function anova produces invalid F values and residual statistics, clarifying with a warning:

Warning message:

In anova.lm(mod) :

ANOVA F-tests on an essentially perfect fit are unreliable

Otherwise, both predict and coef work fine.

If we nullify fitted.values, summary produces an invalid F-statistics issuing the warning:

Warning message:

In mean.default(f) : argument is not numeric or logical: returning NA


However, there are no adverse effects on results on the other three functions.

Depending on what we need from our model, some of these components could be eliminated. But let’s continue looking at each remaining component, not with object.size, but serialize. Below, we use lapply to compute the serialized length of each model component. This reveals that the terms component is actually the largest component, despite object.size reporting only 4432 bytes above.

as.matrix(lapply(lm.fit.3, function(x) length(serialize(x,NULL))))

[,1]

coefficients 130

rank 26

assign 34

df.residual 26

na.action 84056

xlevels 55

call 275

terms 24004509

If we nullify the terms component, the model becomes quite compact. (By the way, if we simply nullify terms, summary, anova, and predict all fail.) Why is the terms component so large? It turns out it has an environment object as an attribute. The environment contains the variable dat, which contains the original data with 219932 rows and 26 columns. R’s serialize function includes this object and hence the reason the model is so large. The function object.size ignores these objects.

attr(lm.fit.1$terms, ".Environment")  
<environment: 0x1d6778f8>
ls(envir = attr(lm.fit.1$terms, ".Environment"))        
[1] "dat"
d <- get("dat",envir=envir)
dim(d)
[1] 219932 26
length(serialize(attr(lm.fit.1$terms, ".Environment"), NULL))
[1] 38959319
object.size(attr(lm.fit.1$terms, ".Environment"))
56 bytes

If we remove this object from the environment, the serialized object size also becomes small.

rm(list=ls(envir = attr(lm.fit.1$terms, ".Environment")),
envir = attr(lm.fit.1$terms, ".Environment"))  
ls(envir = attr(lm.fit.1$terms, ".Environment"))
character(0)
length(serialize(lm.fit.1, NULL))
[1] 85500
lm.fit.1

Call:
lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE,
    qr = FALSE)

Coefficients:
(Intercept)     DISTANCE     DEPDELAY 
   0.225378    -0.001218     0.962528 

Is the associated environment essential to the model? If not, we could empty it to significantly reduce model size. We'll rebuild the model using the function f.lm.full

f.lm.full <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
lm.fit.full <- f.lm.full(ontime_s)
ls(envir=attr(lm.fit.full$terms, ".Environment"))
[1] "dat"
length(serialize(lm.fit.full,NULL))
[1] 65826324

We'll create the model removing some components as defined in function:

line-height: 115%; font-family: "Courier New";">f.lm.small <- function(dat) {
  f.lm <- function(dat) {
  mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model=FALSE)   
  mod$fitted.values <- NULL
  mod
}
  mod <- f.lm(dat)
  # empty the env associated with local function
  e <- attr(mod$terms, ".Environment")
  # set parent env to .GlobalEnv so serialization doesn’t include contents
parent.env(e) <- .GlobalEnv    
  rm(list=ls(envir=e), envir=e) # remove all objects from this environment
  mod
}

lm.fit.small <- f.lm.small(ontime_s)
ls(envir=attr(lm.fit.small$terms, ".Environment")) 
character(0)
length(serialize(lm.fit.small, NULL))
[1] 16219251

We can use the same function with embedded R execution.

lm.fit.ere <- ore.pull(ore.tableApply(ONTIME_S, f.lm.small))
ls(envir=attr(lm.fit.ere$terms, ".Environment"))
character(0)
length(serialize(lm.fit.ere, NULL))
[1] 16219251
as.matrix(lapply(lm.fit.ere, function(x) length(serialize(x,NULL))))    
              [,1]  
coefficients  130   
residuals     4624354
effects       3442434
rank          26    
fitted.values 4624354
assign        34    
qr            8067072
df.residual   26    
na.action     84056 
xlevels       55    
call          245   
terms         938   

Making this change does not affect the workings of the model for coef, summary, anova, or predict. For example, summary produces expected results:

summary(lm.fit.ere)

Call:
lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE)

Residuals:
     Min       1Q   Median       3Q      Max
-1462.45    -6.97    -1.36     5.07   925.08

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  2.254e-01  5.197e-02   4.336 1.45e-05 ***
DISTANCE    -1.218e-03  5.803e-05 -20.979  < 2e-16 ***
DEPDELAY     9.625e-01  1.151e-03 836.289  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.73 on 215144 degrees of freedom
  (4785 observations deleted due to missingness)
Multiple R-squared: 0.7647,     Adjusted R-squared: 0.7647
F-statistic: 3.497e+05 on 2 and 215144 DF,  p-value: < 2.2e-16

Using the model for prediction also produces expected results.

lm.pred <- function(dat, mod) {
prd <- predict(mod, newdata=dat)
prd[as.integer(rownames(prd))] <- prd
cbind(dat, PRED = prd)
}

dat.test <- with(ontime_s, ontime_s[YEAR == 2003 & MONTH == 5,
c("ARRDELAY", "DISTANCE", "DEPDELAY")])
head(lm.pred(dat.test, lm.fit.ere))
       ARRDELAY DISTANCE DEPDELAY        PRED
163267        0      748       -2 -2.61037575
163268       -8      361        0 -0.21414306
163269       -5      484        0 -0.36389686
163270       -3      299        3  2.74892676
163271        6      857       -6 -6.59319662
163272      -21      659       -8 -8.27718564
163273       -2     1448        0 -1.53757703
163274        5      238        9  8.59836323
163275       -5      744        0 -0.68044960
163276       -3      199        0 -0.01690635

As shown above, an lm model can become quite large. At least for some applications, several of these components may be unnecessary, allowing the user to significantly reduce the size of the model and space required for saving or time for transporting the model. Relying on Oracle Database to store the data instead of the R model object further allows for significant reduction in model size.

Friday Mar 22, 2013

Are you a Type I or Type II Data Scientist?

The role of Data Scientist has been getting a lot of attention lately. Brendan Tierney's blog post titled Type I and Type II Data Scientists adds an interesting perspective by defining and characterizing two key types of Data Scientist, both of which are needed in an organization.

Tierney writes about Type I Data Scientists, "These are people who know a lot about and are really good at a technique or technology that is associated with Data Science. ...have a deep knowledge of their topic and can tell/show you lots of detail about how best to to explore data in their given field." Whereas Type II Data Scientists "concentrate on the business goals and business problems that the organisation are facing. Based on these they will identify what the data scientist project will focus on, ensuring that there is a measurable outcome and business goal." Type IIs are also characterized as good communicators and story tellers. 

As the role of Data Scientist continues to evolve, continuing to add structure and definition to the role, or roles, will help organizations make the most of advanced analytics and data science projects.

Monday Feb 18, 2013

Saving R Objects in Oracle Database using Oracle R Enterprise 1.3 Datastore

R allows users to save R objects to disk. The whole workspace of R objects can be saved to a file, and reloaded across R sessions, which allows users to return to their previous R environment even after quitting R or to avoid recreating objects needed in the future. One such type of object includes predictive models, which can be built in one session, and saved for scoring in another R session, or even multiple, possibly parallel R sessions.

R provides the save and load functions where objects in memory are serialized and unserialized, respectively. Figure 1 depicts an example where two R objects, a linear model and data.frame are saved to a file, and then reloaded. When objects are restored, they have the same names as when they were saved.

Figure 1: Using R save() and load() functions

Oracle R Enterprise (ORE) 1.3 supports object persistence using an R datastore in the database. Now, ORE proxy objects, as well as any R objects, can be saved and restored across R sessions as a named entity in Oracle Database. Serializing ORE objects, such as ore.frames, and saving them doesn’t work across sessions, since any referenced temporary tables or other database objects are not saved across R sessions. If these ore.frame proxy object references are not maintained, restoring them makes them incomplete and inoperative.

Figure 2 has an example similar to the previous example. The main difference is that we are using ore.save and ore.load, and providing the name of the datastore from which we want to retrieve ORE objects.

Figure 2: Using ORE datastore functions ore.save and ore.load

Each schema has its own datastore table where R objects are saved. By being managed in Oracle Database, ORE provides referential integrity of saved objects such that when otherwise temporary database objects are no longer referenced, they are auto-deleted at the end of the R session. This applies, for example, to tables created via ore.push or Oracle Data Mining models produced using the OREdm package.

Here’s a simple example:

DAT1 <- ore.push(iris)

ore.lm.mod <- ore.lm(Sepal.Length~., DAT1 )

lm.mod <- lm(mpg ~ cyl + disp + hp + wt + gear, mtcars)

nb.mod <- ore.odmNB(YEAR ~ ARRDELAY + DEPDELAY + log(DISTANCE), ONTIME_S)

ore.save(ore.lm.mod, lm.mod, nb.mod, name = "myModels")

We’re creating four objects: an ore.frame “DAT1” consisting of the iris data set, an ore.lm model that uses DAT1,a standard R lm model using the mtcars dataset, and an ODM naïve Bayes model using ONTIME_S. We then invoke ore.save for the three models and use the datastore name “myModels”.

The R objects are saved in the datastore, and any referenced data tables or ODM models are kept in the database, otherwise these are treated as temporary database objects and dropped when the ORE session ends.

To load these objects, invoke:

ore.load(name = "myModels")

 To see the content of a datastore, you can invoke ore.datastoreSummary with the name of the datastore.

The datastore also makes it easy to access R and ORE objects within ORE embedded R execution functions. Simply, the name of the datastore is passed as an embedded R function argument. Within the function, ore.load is invoked with the name of that datastore. We'll see an example of this using the SQL API below. In addition, by maintaining persisted R objects in the database, ORE facilitates application deployment while leveraging existing database backup, recovery, and security mechanisms. 

Consider the following ORE embedded R execution example using the SQL API.  We build a simple linear model and store it in a datastore with the name "myDatastore". This R script will be stored in the database R script repository with the name "BuildModel-1". To invoke this R script, we use the rqTableEval function, providing the input data from ONTIME_S, passing the parameters that include the name of the datastore, requesting XML output for the result, and specifying the R script by name. 

begin
  sys.rqScriptCreate('BuildModel-1',
 'function(dat,datastore_name) {
    mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
    ore.save(mod,name=datastore_name, overwrite=TRUE)
    TRUE
}');
end;
/

select *

from table(rqTableEval(
  cursor(select ARRDELAY,DISTANCE,DEPDELAY from ONTIME_S),
  cursor(select 1 as "ore.connect", 'myDatastore' as "datastore_name"
         from dual),
  'XML',
  'BuildModel-1'));

To score using this model, we create an R script named "ScoreData-1" that loads the model from the named datastore, invokes predict, and binds the predictions with the predictors to be returned as the result.To invoke this script, we again use the rqTableEval function that takes as parameters a cursor specifying the data to be scored, parameters, a description of the result as a SQL query, and the R script name.

begin
  sys.rqScriptCreate('ScoreData-1',
 'function(dat, datastore_name) {
     ore.load(datastore_name)
     pred <- predict(mod, newdata=dat)
     pred[as.integer(rownames(pred))] <- pred
     cbind(dat, PRED = pred)
     }');
end;
/

select *
from table(rqTableEval(
    cursor(select ARRDELAY, DISTANCE, DEPDELAY from ONTIME_S
           where YEAR = 2003 and MONTH = 5 and DAYOFMONTH = 2),
    cursor(select 1 as "ore.connect",
                 'myDatastore' as "datastore_name" from dual),
    'select ARRDELAY, DISTANCE, DEPDELAY, 1 PRED from ontime_s',
    'ScoreData-1'))
order by 1, 2, 3;

Datastore functionality supports the following interface: ore.save, ore.load, ore.datastore, ore.datastoreSummary, and ore.delete. See the online documentation for details. A presentation on the ORE transparency layer that provides additional examples is available here.

About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today