Wednesday Mar 30, 2016

Real-time model scoring for streaming data - a prototype based on Oracle Stream Explorer and Oracle R Enterprise

Whether applied to manufacturing, financial services, energy, transportation, retail, government, security or other domains, real-time analytics is an umbrella term which covers a broad spectrum of capabilities (data integration, analytics, business intelligence) built on streaming input from multiple channels. Examples of such channels are: sensor data, log data, market data, click streams, social media and monitoring imagery.

Key metrics separating real-time analytics from more traditional, batch, off-line analytics are latency and availability. At one end of the analytics spectrum are complex, long-running batch analyses with slow response time and low availability requirements. At the other end are real-time, lightweight analytic applications with fast response time (O[ms]) and high availability (99.99..%). Another distinction is between the capability for responding to individual events and/or ordered sequences of events versus the capability for handling only event collections in micro batches without preservation of their ordered characteristics. The complexity of the analysis performed on the real-time data is also a big differentiator: capabilities range from simple filtering and aggregations to complex predictive procedures. The level of integration between the model generation and the model scoring functionalities also needs to be considered for real-time applications. Machine learning algorithms specially designed for online model building exist and are offered by some streaming data platforms, but their number is small. Practical solutions can be built by combining an off-line model generation platform with a data streaming platform augmented with scoring capabilities.

In this blog we describe a new prototype for real-time analytics integrating two components: Oracle Stream Explorer (OSX) and Oracle R Enterprise (ORE). Examples of target applications for this type of integration are equipment monitoring through sensors, anomaly detection, and failure prediction for large systems made up of a high number of components.

The basic architecture is illustrated below:

ORE is used for model building, in batch mode, at low frequency; OSX handles the high-frequency streams, pushes data toward a scoring application, performs predictions in real time, and returns results to consumer applications connected to the output channels.

OSX is a middleware platform for developing streaming data applications. These applications monitor and process large amounts of streaming data in real time, from a multitude of sources like sensors, social media, financial feeds, etc. Readers unfamiliar with OSX should visit Getting Started with Event Processing for OSX.

In OSX, streaming data flows into, through, and out of an application. The applications can be created, configured, and deployed with pre-built components provided with the platform, or built from customized adapters and event beans. The application in this case is a custom scoring application for real-time data. A thorough description of the application building process can be found in the following guide: Developing Applications for Event Processing with Oracle Stream Explorer.

In our solution prototype for streaming analytics, the model exchange between ORE and OSX is realized by converting the R models to a PMML representation. After that, JPMML - the Java Evaluator API for PMML - is leveraged for reading the model and building a custom OSX scoring application.
The end-to-end workflow is represented below:

and the subsequent sections of this blog will summarize the essential aspects.

Model Generation

As previously stated, the use cases targeted by this OSX-ORE integration prototype consist of systems made of a large number of different components. Each component type is abstracted by a different model. We leverage ORE's Embedded R Execution capability for data and task parallelism to generate a large number of models concurrently. This is accomplished, for example, with ore.groupApply():

res <- ore.groupApply(
   X, INDEX,
   function(dat,frml) {mdl<-...},
   frml)
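A fuller sketch of this pattern might look as follows; note that the SENSORS table, its COMPONENT_TYPE grouping column, and the formula are all hypothetical stand-ins, not part of the prototype:

```r
# Hypothetical sketch: build one model per component type, in parallel,
# inside the database; SENSORS, COMPONENT_TYPE and the formula are assumptions
res <- ore.groupApply(
   X = SENSORS,                      # ore.frame proxy for the training table
   INDEX = SENSORS$COMPONENT_TYPE,   # one group (and one model) per component type
   function(dat, frml) {
     mdl <- lm(frml, data = dat)     # any R modeling call can go here
     mdl
   },
   frml = MEASURE ~ TEMP + PRESSURE,
   parallel = TRUE)
```

The result is a list of models, one per group, that can then be converted to PMML individually.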

Model representation in PMML

The model transfer between the model generator and the scoring engine is enabled by conversion to a PMML representation. PMML is a mature, XML-based standard for model exchange. A model in PMML format is represented by a collection of XML elements, or PMML components, which completely describe the modeling flow. For example, the Data Dictionary component contains the definitions for all fields used by the model (attribute types, value ranges, etc.), the Data Transformations component describes the mapping functions between the raw data and its desired form for the modeling algorithms, and the Mining Schema component assigns the active and target variables and enumerates the policies for missing data, outliers, and so on. Besides the specifications for the data mining algorithms together with the accompanying pre- and post-processing steps, PMML can also describe more complex modeling concepts like model composition, model hierarchies, model verification and field scoping - to find out more about PMML's structure and functionality go to General Structure. PMML representations have been standardized for several classes of data mining algorithms. Details are available at the same location.
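As an illustration (hand-written skeleton, not generated from a real model), a PMML document arranges these components roughly like this:

```xml
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="illustrative skeleton only"/>
  <DataDictionary numberOfFields="2">
    <DataField name="temperature" optype="continuous" dataType="double"/>
    <DataField name="status" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="example" functionName="classification">
    <MiningSchema>
      <MiningField name="temperature" usageType="active"/>
      <MiningField name="status" usageType="target"/>
    </MiningSchema>
    <!-- Node elements with split predicates would follow here -->
  </TreeModel>
</PMML>
```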


In R, the conversion/translation to PMML format is enabled through the pmml package. The following algorithms are supported:

ada (ada)
arules (arules)
coxph (survival)
glm (stats)
glmnet (glmnet)
hclust (stats)
kmeans (stats)
ksvm (kernlab)
lm (stats)
multinom (nnet)
naiveBayes (e1071)
nnet (nnet)
randomForest (randomForest)
rfsrc (randomForestSRC)
rpart (rpart)
svm (e1071)

The r2pmml package offers complementary support for

gbm (gbm)
and a much better (performance-wise) conversion to PMML for randomForest. Check the details at converting_randomforest.
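As a sketch (the model object and output file name are illustrative), the r2pmml conversion is a single call that writes the PMML file directly:

```r
library(r2pmml)
library(randomForest)

# Train a small model, then convert it straight to a PMML file
mdl <- randomForest(Species ~ ., data = iris, ntree = 100)
r2pmml(mdl, "iris_rf.pmml")
```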

The conversion to PMML is done via the pmml() generic function, which dispatches the appropriate method for the supplied model, depending on its class.

mdl <- randomForest(...)
pmld <- pmml(mdl)

Exporting the PMML model

In the current prototype, the PMML model is exported to the streaming platform as a physical XML file. A better solution is to leverage R's serialization interface, which supports a rich set of connections through pipes, URLs, sockets, etc.
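Since pmml() returns an XML node tree, the physical file can be written with the XML package's saveXML(); a minimal sketch (the model and file name are illustrative):

```r
library(pmml)
library(XML)

mdl  <- lm(Sepal.Length ~ Petal.Length, data = iris)
pmld <- pmml(mdl)

# Export the PMML representation as a physical XML file
saveXML(pmld, file = "iris_lm.pmml")
```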
The PMML objects can also be saved in ORE datastores within the database, and specific policies can be implemented to control access and usage.

ore.grant(name=dsname, type="datastore", user=...)
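A sketch of the datastore route, assuming pmld holds the PMML object from the conversion step (the datastore name and grantee are illustrative):

```r
# Save the PMML object in an ORE datastore inside the database ...
dsname <- "pmml_models"
ore.save(pmld, name = dsname, overwrite = TRUE)

# ... and grant read access to the database user running the scoring side
ore.grant(name = dsname, type = "datastore", user = "SCORER")
```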

OSX Applications and the Event Processing Network (EPN)

The OSX workflow, implemented as an OSX application, consists of three logical steps: the PMML model is imported into OSX, a scoring application is created, and scoring is performed on the input streams.

In OSX, applications are modeled as data flow graphs named Event Processing Networks (EPNs). Data flows into, through, and out of EPNs. When raw data flows into an OSX application it is first converted into events. Events flow through the different stages of the application, where they are processed according to the specifics of the application. At the end, events are converted back to data in a format suitable for consumption by downstream applications.

The EPN for our prototype is basic:

Streaming data flows from the Input Adapters through Input Channels, reaches the Scoring Processor where the prediction is performed, flows through the Output Channel to an Output Adapter, and exits the application in a desired form. In our demo application the data is streamed out of a CSV file into the Input Adapter. The top adapters (left & right) on the EPN diagram represent connections to the Stream Explorer User Interface (UI). Their purpose is to demonstrate options for controlling the scoring process (for example, changing the model while the application is still running) and visualizing the predictions.

The JPMML-Evaluator

The Scoring Processor was implemented by leveraging the open source JPMML library, the Java Evaluator API for PMML. The methods of this class allow, among other things, pre-processing the active & target fields according to the DataDictionary and MiningSchema elements, evaluating the model for several classes of algorithms, and post-processing the results according to the Targets element.

JPMML offers support for:

Association Rules
Cluster Models
General Regression
k-Nearest Neighbors
Naïve Bayes
Neural Networks
Tree Models
Support Vector Machines
Ensemble Models
This covers most of the models that can be converted to PMML from R using the pmml() method, except for time series, sequence rules & text models.

The Scoring Processor

The Scoring Processor (see EPN) is implemented as a Java class with methods that automate scoring based on the PMML model. The important steps of this automation are enumerated below:

  • The PMML schema is loaded from the XML document:

  •     pmml = pmmlUtil.loadModel(pmmlFileName);

  • An instance of the Model Evaluator is created. In the example below we assume that we don't know what type of model we are dealing with so the instantiation is delegated to an instance of a ModelEvaluatorFactory class.

  •     ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
        ModelEvaluator<?> evaluator = modelEvaluatorFactory.newModelManager(pmml);

  • This Model Evaluator instance is queried for the fields definitions. For the active fields:

  •     List<FieldName>  activeModelFields = evaluator.getActiveFields();

  • The subsequent data preparation performs several tasks: value conversions between the Java type system and the PMML type system, validation of these values according to the specifications in the Data Field element, handling of invalid, missing values and outliers as per the Mining Field element.

  •     FieldValue activeValue = evaluator.prepare(activeField, inputValue)
        pmmlArguments.put(activeField, activeValue);

  • The prediction is executed next:

  •     Map<FieldName, ?> results = evaluator.evaluate(pmmlArguments);

  • Once this is done, the mapping of the scoring results & other fields to output events is performed. This needs to differentiate between the cases where the target values are Java primitive values or something else.

  •     FieldName targetName = evaluator.getTargetField();
        Object targetValue = results.get(targetName);
        if (targetValue instanceof Computable){ ….

    More details about this approach can be found at JPMML-Evaluator: Preparing arguments for evaluation and Java (JPMML) Prediction using R PMML model.

    The key aspect is that the JPMML Evaluator API provides the functionality for implementing the Scoring Processor independently of the actual model being used. The active variables, mappings, assignments, and predictor invocations are figured out automatically from the PMML representation. This approach allows flexibility for the scoring application. Suppose that several PMML models have been generated off-line for the same system component/equipment part, etc. Then, for example, an n-variable logistic model could be replaced by an m-variable decision tree model via the UI control by just pointing a Scoring Processor variable to the new PMML object. Moreover, the substitution can be executed via signal events sent through the UI Application Control (upper left of EPN) without stopping and restarting the scoring application. This is practical because the real-time data keeps flowing in!

    Tested models

    The R algorithms listed below were tested and identical results were obtained for predictions based on the OSX PMML/JPMML scoring application and predictions in R.

    lm (stats)
    glm (stats)
    rpart (rpart)
    naiveBayes (e1071)
    nnet (nnet)
    randomForest (randomForest)

    The prototype is new and other algorithms are currently being tested. Details will follow in a subsequent post.


    The OSX-ORE PMML/JPMML-based prototype for real-time scoring was developed together with Mauricio Arango | A-Team Cloud Solutions Architects. The work was presented at BIWA 2016.

    Friday Apr 17, 2015

    The Intersection of “Data Capital” and Advanced Analytics

    We’ve heard about the Three Laws of Data Capital from Paul Sonderegger at Oracle: data comes from activity, data tends to make more data, and platforms tend to win. Advanced analytics enables enterprises to take full advantage of the data their activity produces, ranging from IoT sensors and PoS transactions to social media and image/video. Traditional BI tools also produce more data from data – summaries – but they provide a view of the past: what did happen. Advanced analytics likewise produces more data from data, but this data is transformative, generating previously unknown insights and providing a view of future behavior or outcomes: what will likely happen. Oracle provides a platform for advanced analytics today through Oracle Advanced Analytics on Oracle Database, and Oracle R Advanced Analytics for Hadoop on Big Data Appliance, to support investing data.

    Enterprises need to put their data to work to realize a return on their investment in data capture, cleansing, and maintenance. Investing data through advanced analytics algorithms has repeatedly been shown to dramatically increase ROI. For examples, see customer quotes and videos from StubHub, dunnhumby, and CERN, among others. Too often, data centers are perceived as imposing a “tax” instead of yielding a “dividend.” If you cannot extract new insights from your data and use it to perform revenue-enhancing actions such as predicting customer behavior, understanding root causes, and reducing fraud, the costs to maintain large volumes of historical data may feel like a tax. How do enterprises convert data centers to dividend-yielding assets?

    One approach is to reduce “transaction costs.” Typically, these transaction costs involve the cost for moving data into environments where predictive models can be produced or sampling data to be small enough to fit existing hardware and software architectures. Then, there is the cost for putting those models into production. Transaction costs result in multi-step efforts that are labor intensive and make enterprises postpone investing their data and deriving value. Oracle has long recognized the origins of these high transaction costs and produced tools and a platform to eliminate or dramatically lower these costs.

    Further, consider the data scientist or analyst as the “data capital manager,” the person or persons striving to extract the maximum yield from data assets. To achieve high dividends with low transaction costs, the data capital manager needs to be supported with tools and a platform that automates activities – making them more productive – and ultimately more heroic within the enterprise – doing more with less because it’s faster and easier. Oracle removes a lot of the grunt work from the advanced analytics process: data is readily accessible, data manipulation and model building / data scoring is scalable, and deployment is immediate. To learn more about how to increase dividends from your data capital, see Oracle Advanced Analytics and Oracle R Advanced Analytics for Hadoop.

    Wednesday Apr 16, 2014

    Oracle's Strategy for Advanced Analytics

    At Oracle our goal is to enable you to get timely insight from all of your data. We continuously enhance Oracle Database to allow workloads that have traditionally required extracting data from the database to run in-place. We do this to narrow the gap that exists between insights that can be obtained and available data - because any data movement introduces latencies, complexity due to more moving parts, the ensuing need for data reconciliation and governance, as well as increased cost. The Oracle tool set considers the needs of all types of enterprise users - users preferring GUI based access to analytics with smart defaults and heuristics out of the box, users choosing to work interactively and quantitatively with data using R, and users preferring SQL and focusing on operationalization of models.

    Oracle recognized the need to support data analysts, statisticians, and data scientists with a widely used and rapidly growing statistical programming language. Oracle chose R - recognizing it as the new de facto standard for computational statistics and advanced analytics. Oracle supports R in at least 3 ways:

    • R as the language of interaction with the database

    • R as the language in which analytics can be written and executed in the database as a high performance computing platform

    • R as the language in which several native high performance analytics have been written that execute in database

    Additionally, of course, you may choose to leverage any of the CRAN algorithms and execute R scripts at the database server, leveraging several forms of data parallelism.
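For instance, a CRAN algorithm can be shipped to the database server through Embedded R Execution; a sketch, where the ore.frame DF and the column TARGET are illustrative names, and rpart must be installed on the server:

```r
# Run a CRAN algorithm (rpart) at the database server on a table proxied
# by the ore.frame DF; the fitted model is returned to the client session
res <- ore.tableApply(DF,
   function(dat) {
     library(rpart)
     rpart(TARGET ~ ., data = dat)
   })
```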

    Providing the first and only supported commercial distribution of R from an established company, Oracle released Oracle R Distribution. In 2012 Oracle embarked on the Hadoop journey acknowledging alternative data management options emerging in the open source for management of unstructured or not-yet-structured data. In keeping with our strategy of delivering analytics close to where data is stored, Oracle extended Advanced Analytics capabilities to execute on HDFS resident data in Hadoop environments. R has been integrated into Hadoop in exactly the same manner as it has been with the database.

    Realizing that data is stored in both database and non-database environments, Oracle provides users options for storing their data (in Oracle Database, HDFS, and Spark RDD), where to perform computations (in-database or the Hadoop cluster), and where results should be stored (Oracle Database or HDFS). Users can write R scripts that can be leveraged across database and Hadoop environments. Oracle Database, as a preferred location for storing R scripts, data, and result objects, provides a real-time scoring and deployment platform. It is also easy to create a model factory environment with authorization, roles, and privileges, combined with auditing, backup, recovery, and security.

    Oracle provides a common infrastructure that supports both in-database and custom R algorithms. Oracle also provides an integrated GUI for business users. Oracle provides both R-based access and GUI-based access to in-database analytics. A major part of Oracle's strategy is to maintain agility in our portfolio of supported techniques - being responsive to customer needs.

    Friday Jul 19, 2013

    Oracle R Connector for Hadoop 2.2.0 released

    Oracle R Connector for Hadoop 2.2.0 is now available for download. The Oracle R Connector for Hadoop 2.x series has introduced numerous enhancements, which are highlighted in this article and summarized as follows:

     ORCH 2.0.0

     Analytic Functions
    • orch.lm
    • orch.lmf
    • orch.neural
    • orch.nmf

    Oracle Loader for Hadoop (OLH) support

    CDH 4.2.0

    ORCHhive transparency layer

     ORCH 2.1.0

     Analytic Functions
    • orch.cor
    • orch.cov
    • orch.kmeans
    • orch.princomp
    • orch.sample - by percent

    Configurable delimiters in text input data files

    Map-only and reduce-only jobs

    Keyless map/reduce output

    "Pristine" data mode for high performance data access

    HDFS cache of metadata

    Hadoop Abstraction Layer (HAL)

     ORCH 2.2.0

     Analytic Functions
    • orch.sample - by number of rows

    CDH 4.3.0

    Full online documentation

    Support for integer and matrix data types in hdfs.attach with detection of "pristine" data

    Out-of-the-box support for "pristine" mode for high I/O performance

    HDFS cache to improve interactive performance when navigating HDFS directories and file lists

    HDFS multi-file upload and download performance enhancements

    HAL for Hortonworks Data Platform 1.2 and Apache Hadoop 1.0

    ORCH 2.0.0

    In ORCH 2.0.0, we introduced four Hadoop-enabled analytic functions supporting linear  regression, low rank matrix factorization, neural network, and non-negative matrix factorization. These enable R users to immediately begin using advanced analytics functions on HDFS data using the MapReduce paradigm on a Hadoop cluster without having to design and implement such algorithms themselves.

    While ORCH 1.x supported moving data between the database and HDFS using sqoop, ORCH 2.0.0 supports the use of Oracle Loader for Hadoop (OLH) to move very large data volumes from HDFS to Oracle Database in an efficient and high performance manner.

    ORCH 2.0.0 supported Cloudera Distribution for Hadoop (CDH) version 4.2.0 and introduced the ORCHhive transparency layer, which leverages the Oracle R Enterprise transparency layer for SQL, but instead maps to HiveQL, a SQL-like language for manipulating HDFS data via Hive tables.

    ORCH 2.1.0

    In ORCH 2.1.0, we added several more analytic functions, including correlation and covariance, clustering via K-Means, principal component analysis (PCA), and sampling by specifying the percent of records to return.

    ORCH 2.1.0 also brought a variety of features, including: configurable delimiters (beyond comma delimited text files, using any ASCII delimiter), the ability to specify mapper-only and reduce-only jobs, and the output of NULL keys in mapper and reducer functions.

    To speed the loading of data into Hadoop jobs, ORCH introduced “pristine” mode where the user guarantees that the data meets certain requirements so that ORCH skips a time-consuming data validation step. “Pristine” data requires that numeric columns contain only numeric data, that missing values are either R’s NA or the null string, and that all rows have the same number of columns. This improves performance of hdfs.get on a 1GB file by a factor of 10.

    ORCH 2.1.0 introduced the caching of ORCH metadata to improve the response time of ORCH functions such as hdfs.describe and hdfs.mget, making them between 5x and 70x faster.

    The Hadoop Abstraction Layer, or HAL, enables ORCH to work on top of various Hadoop versions or variants, including the Apache/Hortonworks and Cloudera Hadoop distributions: CDH3, and CDH 4.x with MR1 and MR2.

    ORCH 2.2.0

    In the latest release, ORCH 2.2.0, we’ve augmented orch.sample to allow specifying the number of rows in addition to a percentage of rows. CDH 4.3 is now supported, and ORCH functions provide full online documentation via R's help function or ?. The function hdfs.attach now supports integer and matrix data types and can detect "pristine" data automatically. HDFS bulk directory upload and download performance was also improved. Through the caching and automatic synchronization of ORCH metadata and file lists, the responsiveness of metadata-related HDFS functions has improved by 3x over ORCH 2.1.0, which also improves performance of functions such as hadoop.exec. These improvements in turn bring a more interactive user experience for the R user when working with HDFS.

    Starting in ORCH 2.2.0, we introduced out-of-the-box tuning optimizations for high performance and expanded HDFS caching to include the caching of file lists, which further improves performance of HDFS-related functions.

    The function hdfs.upload now supports the option to upload multi-file directories in a single invocation, which streamlines the process. Downloading an HDFS directory is likewise optimized to issue a single HDFS command to download files into one local temporary directory before combining the separate parts into a single file.

    The Hadoop Abstraction Layer (HAL) was extended to support Hortonworks Data Platform 1.2 and Apache Hadoop 1.0. In addition, ORCH now allows the user to override the Hadoop Abstraction Layer version for use with unofficially supported distributions of Hadoop using system environment variables. This enables testing and certification of ORCH by other Hadoop distribution vendors.

    Certification of ORCH on non-officially supported platforms can be done using a separate test kit (available for download upon request) that includes an extensive set of tests for core ORCH functionality and that can be run using the ORCH built-in testing framework. Running the tests pinpoints any failures and ensures that ORCH is compatible with the target platform.

    See the ORCH 2.2.0 Change List and Release Notes for additional details. ORCH 2.2.0 can be downloaded here.

    Thursday Jul 18, 2013

    Simple and Advanced Time series with Oracle R Enterprise

    This guest post from Marcos Arancibia describes how to use Oracle R Enterprise to analyze Time Series data.

    In this article, we give an overview of how to use Time Series Analysis against data stored in Oracle Database, using the Embedded R Execution capability to send time series computations to the Oracle Database server instead of processing at the client. We will also learn how to return the final series or forecasts to the client for plotting, forecasting, and diagnosing.

    One key thing to keep in mind when using Time Series techniques with data that is stored in Oracle Database is the order of the rows, or records. Because of the parallel capabilities of Oracle Database, when queried for records, one might end up receiving records out of order if an option for order is not specified.
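The examples in this article therefore impose an explicit sort on the date column before constructing a ts object; the basic pattern inside an embedded R function is:

```r
# Impose a deterministic row order by date before building the series
ordered  <- dat[order(as.Date(dat$date, format="%Y-%m-%d")), ]
ts.close <- ts(ordered$Close, frequency=365, start=c(2008,1))
```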

    Simple Example using Stock Data

    Let’s start with a simple Time Series example. First we will need to connect to our Oracle Database using ORE. Then, using the package TTR, we will access Oracle stock data from the Yahoo data service, from January 1, 2008 to January 1, 2013, and push it to the database.

    # Load the ORE library and connect to Oracle Database

    library(ORE)

    library(TTR)

    ore.connect(user=..., sid=..., host=..., password=..., all=TRUE)
    # Get data in XTS format

    xts.orcl <- getYahooData("ORCL", 20080101, 20130101)

    # Convert it to a data frame and get the date

    # Make the date the index

    df.orcl <- data.frame(xts.orcl)

    df.orcl$date <- (data.frame(date=index(xts.orcl))$date)

    # Create/overwrite data in Oracle Database

    # to a Table called ORCLSTOCK

    ore.drop(table="ORCLSTOCK")

    ore.create(df.orcl, table="ORCLSTOCK")
    # Ensure indexing is kept by date

    rownames(ORCLSTOCK) <- ORCLSTOCK$date

    # Ensure the data is in the DB

    ore.exists("ORCLSTOCK")

    # Review column names, data statistics and

    # print a sample of the data

    names(ORCLSTOCK)

    summary(ORCLSTOCK$Close)

    head(ORCLSTOCK)

    [1] "Open" "High" "Low" "Close" "Volume"

    [6] "Unadj.Close" "Div" "Split" "Adj.Div" "date"



    Min. 1st Qu. Median Mean 3rd Qu. Max.

    13.36 20.53 24.22 24.79 29.70 35.73



    Open High Low Close Volume

    2008-01-02 01:00:00 21.74414 22.00449 21.58022 21.68629 44360179

    2008-01-03 01:00:00 21.62843 22.28413 21.62843 22.28413 43600532

    2008-01-04 01:00:00 21.95628 22.06235 21.01130 21.24272 46391263

    2008-01-07 01:00:00 21.17523 21.67664 21.01130 21.45486 41527032

    2008-01-08 01:00:00 21.44522 21.52236 20.38453 20.39417 45155398

    2008-01-09 01:00:00 20.57738 20.91487 20.39417 20.83773 49750304

    Unadj.Close Div Split Adj.Div date

    2008-01-02 01:00:00 22.49 NA NA NA 2008-01-02

    2008-01-03 01:00:00 23.11 NA NA NA 2008-01-03

    2008-01-04 01:00:00 22.03 NA NA NA 2008-01-04

    2008-01-07 01:00:00 22.25 NA NA NA 2008-01-07

    2008-01-08 01:00:00 21.15 NA NA NA 2008-01-08

    2008-01-09 01:00:00 21.61 NA NA NA 2008-01-09

    Pull data from the database for a simple plot

    # Pull data from Oracle Database (only the necessary columns)

    orcl <- ore.pull(ORCLSTOCK[,c("date","Close","Open","Low","High")])

    # Simple plot with base libraries - Closing

    plot(orcl$date, orcl$Close, type="l", col="red", xlab="Date", ylab="US$",

    main="Base plot:Daily ORACLE Stock Closing points")

    # Simple plot with base libraries - Other Series

    plot(orcl$date, orcl$Open, type="l", col="blue", xlab="Date", ylab="US$",

    main="Base plot:Daily ORACLE Stock: Open/High/Low points")

    lines(orcl$date, orcl$High, col="green")

    lines(orcl$date, orcl$Low, col="orange")

    legend("topleft", c("Opening","High","Low"),

    col=c("blue","green","orange"),lwd=2,title = "Series",bty="n")

    A different plot option, using the package xts


    # Pull data from Oracle Database (only the necessary columns)

    orcl <- ore.pull(ORCLSTOCK[,c("date","Close","Open","Low","High")])

    # Convert data to Time Series format

    orcl.xts <- as.xts(orcl[,c("Close","Open","Low","High")], order.by=orcl$date, dateFormat="POSIXct")

    # Plot original series

    plot(orcl.xts$Close,

    main="Time Series plot:Daily ORACLE Stock Closing points",col="red")

    Simple Time Series: Moving Average Smoothing

    We might be tempted to call functions like the Smoothing Moving Average from open-source CRAN packages against Oracle Database Tables, but those packages do not know what to do with an “ore.frame”. For that process to work correctly, we can either load the data locally or send the process for remote execution on the Database Server by using Embedded R Execution.

    We will also explore the built-in Moving Average process from ore.rollmean() as a third alternative.

    ALTERNATIVE 1 - The first example is pulling the data from Oracle Database into a ts (time series) object first, for a Client-side smoothing Process.


    # Pull part of the database table into a local data.frame

    sm.orcl <- ore.pull(ORCLSTOCK[,c("date","Close")])

    # Convert "Close" attribute into a Time Series (ts)

    ts.orcl <- ts(sm.orcl$Close)

    # Use SMA - Smoothing Moving Average algorithm from package TTR

    ts.sm.orcl <- ts(SMA(ts.orcl,n=30),frequency=365, start=c(2008,1))

    # Plot both Series together

    plot(ts.orcl, col="red", ylab="US$",

    main="ORCL Stock Close CLIENT-side Smoothed Series n=30 days")

    lines(ts.sm.orcl, col="blue")

    legend("topleft", c("Closing","MA(30) of Closing"),

    col=c("red","blue"),lwd=2,title = "Series",bty="n")

    ALTERNATIVE 2 – In this alternative, we will use a Server-side example for running the Smoothing via Moving Average, without bringing all data to the client. Only the result is brought locally for plotting. Remember that the TTR package has to be installed on the Server in order to be called.

    # Server execution call using ore.tableApply

    # Result is an ore.list that remains in the database until needed

    sv.orcl.ma30 <-

    ore.tableApply(ORCLSTOCK[,c("date","Close")],ore.connect = TRUE,

    function(dat) {

    library(TTR)

    ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]

    list(res1 <- ts(ordered$Close,frequency=365, start=c(2008,1)),

    res2 <- ts(SMA(res1,n=30),frequency=365, start=c(2008,1)),

    res3 <- ordered$date)

    })

    # Bring the results locally for plotting

    local.orcl.ma30 <- ore.pull(sv.orcl.ma30)

    # Plot two series side by side

    # (the third element of the list is the date)

    plot(local.orcl.ma30[[3]], local.orcl.ma30[[1]], type="l", col="red", xlab="Date", ylab="US$",

    main="ORCL Stock Close SERVER-side Smoothed Series n=30 days")

    # Add smoothed series

    lines(local.orcl.ma30[[3]], local.orcl.ma30[[2]], col="blue")
    # Add legend

    legend("topleft", c("Closing","Server MA(30) of Closing"),

    col=c("red","blue"), lwd=2,title = "Series", bty="n")

    ALTERNATIVE 3 – In this alternative we will use a Server-side example with the computation of Moving Averages using the native ORE in-Database functions without bringing data to the client. Only the result is brought locally for plotting.

    Just one line of code is needed to generate an in-Database Computation of Moving averages and the creation of a new VIRTUAL column in the Oracle Database. We will call this new column rollmean30.

    We will use the function ore.rollmean(). The option align="right" makes the MA look at only the past k days (30 in this case), or less, depending on the point in time. This creates a small difference between this method and the previous methods in the beginning of the series, since ore.rollmean() can actually calculate the first sets of days using smaller sets of data available, while other methods discard this data.

    # Moving Average done directly in Oracle Database

    ORCLSTOCK$rollmean30 <- ore.rollmean(ORCLSTOCK$Close, k = 30, align="right")

    # Check that new variable is in the database

    head(ORCLSTOCK)

    Open High Low Close Volume

    2008-01-02 01:00:00 21.74414 22.00449 21.58022 21.68629 44360179

    2008-01-03 01:00:00 21.62843 22.28413 21.62843 22.28413 43600532

    2008-01-04 01:00:00 21.95628 22.06235 21.01130 21.24272 46391263

    2008-01-07 01:00:00 21.17523 21.67664 21.01130 21.45486 41527032

    2008-01-08 01:00:00 21.44522 21.52236 20.38453 20.39417 45155398

    2008-01-09 01:00:00 20.57738 20.91487 20.39417 20.83773 49750304

    Unadj.Close Div Split Adj.Div date rollmean30

    2008-01-02 01:00:00 22.49 NA NA NA 2008-01-02 21.68629

    2008-01-03 01:00:00 23.11 NA NA NA 2008-01-03 21.98521

    2008-01-04 01:00:00 22.03 NA NA NA 2008-01-04 21.73771

    2008-01-07 01:00:00 22.25 NA NA NA 2008-01-07 21.66700

    2008-01-08 01:00:00 21.15 NA NA NA 2008-01-08 21.41243

    2008-01-09 01:00:00 21.61 NA NA NA 2008-01-09 21.31665

    # Get results locally for plotting

    local.orcl <- ore.pull(ORCLSTOCK[,c("date","Close", "rollmean30")])

    sub.orcl <- subset(local.orcl,local.orcl$date> as.Date("2011-12-16"))

    # Plot the two series side by side

    # First plot original series

    plot(local.orcl$date, local.orcl$Close,type="l", col="red",xlab="Date",ylab="US$",

    main="ORCL Stock Close ORE Computation of Smoothed Series n=30 days")

    # Add smoothed series
    lines(local.orcl$date, local.orcl$rollmean30, col="blue", lwd=2)
    # Add legend

    legend("topleft", c("Closing","ORE MA(30) of Closing"),

    col=c("red","blue"),lwd=2,title = "Series",bty="n")
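The right-aligned, partial-window behavior of ore.rollmean() described above can be checked in plain R. The helper below is an illustrative sketch, not the ORE implementation; it just mimics align="right" with partial windows on a tiny vector:

```r
# Right-aligned moving average with partial windows at the start,
# mimicking ore.rollmean(x, k, align = "right") on a local vector
right_ma <- function(x, k) {
  sapply(seq_along(x), function(i) mean(x[max(1, i - k + 1):i]))
}

x <- c(10, 12, 11, 13, 14)
right_ma(x, 3)
# the first two values average windows of size 1 and 2;
# a strict MA such as TTR::SMA() would return NA there instead
```

This is exactly the difference noted above: the first k-1 points are computed from whatever history is available rather than discarded.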

    Seasonal Decomposition for Time Series Diagnostics

    Now that we have learned how to execute these processes using Embedded R, we can apply the same pattern of server-side computation and local plotting to other methodologies required for Time Series.

    It is typical for an analyst to try to understand a Time Series better by looking at basic diagnostics such as the Seasonal Decomposition of Time Series by Loess. This can be achieved with the stl() function in the following process:

    # Server execution

    sv.orcl.dcom <-

    ore.tableApply(ORCLSTOCK[,c("date","Close")],ore.connect = TRUE,

    function(dat) {

    ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]

    ts.orcl <- ts(ordered$Close,frequency=365, start=c(2008,1))

    res <- stl(ts.orcl, s.window="periodic")
    res
    })

    # Get result for plotting

    local.orcl.dcom <- ore.pull(sv.orcl.dcom)

    plot(local.orcl.dcom, main="Server-side Decomposition of ORCL Time-Series",col="blue")
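To see what the decomposition object contains, the same stl() call can be tried locally on synthetic data. This is an illustrative sketch; the series below is made up, not the ORCL data:

```r
# Seasonal + trend + noise on a synthetic monthly series,
# decomposed with stl() as in the server-side call above
set.seed(1)
y <- ts(sin(2 * pi * (1:72) / 12) + 0.05 * (1:72) + rnorm(72, sd = 0.1),
        frequency = 12)
dec <- stl(y, s.window = "periodic")
colnames(dec$time.series)  # "seasonal" "trend" "remainder"
```

The three columns of dec$time.series are exactly the panels drawn by plot() on the stl result.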

    Another typical set of diagnostic charts includes Autocorrelation and Partial Autocorrelation function plots. These can be achieved by using the acf() command with the proper options in Embedded R Execution, so computations happen at the Oracle Database server:

    # Server-side ACF and PACF computation

    # Use function acf() and save result as a list

    sv.orcl.acf <-
    ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect = TRUE,
      function(dat) {
        ts.orcl <- ts(dat$Close, frequency=365, start=c(2008,1))
        list(res1 <- acf(ts.orcl, lag.max=120, type="correlation"),
             res2 <- acf(ts.orcl, lag.max=30, type="partial"))
      })

    # Get results for plotting

    # ACF and PACF as members of the list pulled

    local.orcl.acf <- ore.pull(sv.orcl.acf)

    plot(local.orcl.acf[[1]],main="Server-side ACF Analysis for Series ORCL",col="blue",lwd=2)

    plot(local.orcl.acf[[2]],main="Server-side PACF Analysis for Series ORCL",col="blue",lwd=5)

    Simple Exponential Smoothing

    Using the popular package “forecast”, we will use the ses() function to forecast a 90-day horizon (h=90), with alpha=0.1 and simple initialization for the model. The package forecast needs to be installed on the Oracle Database server R engine.

    Then, we will bring the resulting model locally for plotting. Remember to load the library “forecast” locally as well, to be able to interpret the ses() output when it is brought locally.

    # Execute ses() call in the server
    sv.orcl.ses <-
    ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,
      function(dat) {
        library(forecast)
        ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
        ts.orcl <- ts(ordered$Close, frequency=365, start=c(2008,1))
        res <- ses(ts.orcl, h=90, alpha=0.1, initial="simple")
        res
      })


    # Get SES result locally for plotting
    # Since the remote object contains a SES model from package forecast,
    # load the package locally as well
    library(forecast)
    local.orcl.ses <- ore.pull(sv.orcl.ses)
    plot(local.orcl.ses, col="blue",
         main="ORCL with Server-side SES - Simple Exponential Smoothing Forecast")
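What ses() computes can be written out by hand. The recursion below is a sketch of simple exponential smoothing, s_t = alpha*x_t + (1-alpha)*s_{t-1}; the point forecast at any horizon is flat at the last smoothed level, which is why the SES forecast line is horizontal:

```r
# Hand-rolled SES recursion; alpha = 0.1 matches the server-side call above
ses_level <- function(x, alpha) {
  Reduce(function(s, xi) alpha * xi + (1 - alpha) * s, x, accumulate = TRUE)
}

lv <- ses_level(c(20, 21, 19, 22), alpha = 0.1)
lv           # smoothed levels: 20.000 20.100 19.990 20.191
tail(lv, 1)  # every h-step-ahead SES point forecast equals this last level
```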

    Holt Exponential Smoothing

    Using the popular package “forecast”, we will use the holt() function to forecast a 90-day horizon (h=90), requesting confidence intervals of 80 and 95%. Again, the package “forecast” needs to be installed on the Oracle Database server R engine.

    Then, we will bring the resulting model locally for plotting. Remember to load the library forecast locally as well, to be able to interpret the meaning of the holt() output when it’s brought locally.

    # Execute holt() call in the server

    sv.orcl.ets <-
    ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,
      function(dat) {
        library(forecast)
        ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
        ts.orcl <- ts(ordered$Close, frequency=365, start=c(2008,1))
        res <- holt(ts.orcl, h=90, level=c(80,95), initial="optimal")
        res
      })

    # Get resulting model from the server
    # Since the remote object contains a Holt Exponential Smoothing
    # model from package forecast, load the package locally as well
    library(forecast)
    local.orcl.ets <- ore.pull(sv.orcl.ets)
    plot(local.orcl.ets, col="blue",
         main="ORCL Original Series Stock Close with Server-side Holt Forecast")
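Similarly, the level/trend recursion behind holt() can be sketched by hand. With a perfectly linear series, the forecast continues the trend exactly. The alpha and beta values here are fixed for illustration, whereas holt() estimates them from the data:

```r
# Hand-rolled Holt (double) exponential smoothing:
#   level: l_t = alpha*x_t + (1-alpha)*(l_{t-1} + b_{t-1})
#   trend: b_t = beta*(l_t - l_{t-1}) + (1-beta)*b_{t-1}
holt_forecast <- function(x, alpha, beta) {
  l <- x[1]; b <- x[2] - x[1]
  for (t in 2:length(x)) {
    l_prev <- l
    l <- alpha * x[t] + (1 - alpha) * (l + b)
    b <- beta * (l - l_prev) + (1 - beta) * b
  }
  function(h) l + h * b  # h-step-ahead point forecast
}

f <- holt_forecast(c(10, 12, 14, 16), alpha = 0.5, beta = 0.5)
f(1)  # 18: the linear trend is continued
```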

    ARIMA – Auto-Regressive Integrated Moving Average

    There are at least two options for fitting an ARIMA model to a Time Series. One option is the package “forecast”, which provides automatic ARIMA fitting (auto.arima) to find the best parameters possible based on the series.

    For more advanced users, the arima() function in the “stats” package itself allows for choosing the model parameters.

    # ARIMA models on the server using auto.arima() from package forecast

    arimaModel <-

    ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,

    FUN = function(dat){
      # load forecast library to use auto.arima
      library(forecast)
      # sort the table by date
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
      # convert the column into a Time Series with ts(...) and request
      # creation of an automatic ARIMA model with auto.arima(...)
      res <- auto.arima(ts(ordered$Close, frequency=365, start=c(2008,1)),
                        stepwise=TRUE, seasonal=TRUE)
      res
    })
    # Alternative using arima() from package “stats”
    arimaModel <-
    ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,
    FUN = function(dat){
      # sort the table by date
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
      # convert the column into a Time Series with ts(...) and request
      # creation of a specific ARIMA model using arima(),
      # for example an ARIMA(2,1,2)
      res <- arima(ts(ordered$Close, frequency=365, start=c(2008,1)),
                   order = c(2,1,2))
      res
    })
    # Load forecast package locally to use the model
    # for plotting and producing forecasts
    library(forecast)
    # Show the remote Time Series model
    arimaModel
    Series: ts(ordered$Close, frequency = 365, start = c(2008, 1))

    Coefficients:
              ar1      ar2
          -0.0935  -0.0192
    s.e.   0.0282   0.0282

    sigma^2 estimated as 0.2323: log likelihood=-866.77
    AIC=1739.55 AICc=1739.57 BIC=1754.96

    # Get remote model using ore.pull for local prediction and plotting

    local.arimaModel <- ore.pull(arimaModel)

    # Generate forecasts for the next 15 days

    fore.arimaModel <- forecast(local.arimaModel, h=15)

    # Use the following option if you need to remove scientific notation of
    # numbers that are too large in charts
    options(scipen=10)
    # Generate the plot of forecasts, including interval of confidence

    # Main title is generated automatically indicating the type of model

    # chosen by the Auto ARIMA process

    plot(fore.arimaModel, type="l", col="blue", xlab="Date",
         ylab="Closing value (US$)", cex.axis=0.75, font.lab="serif EUC",
         sub="Auto-generated ARIMA for ORCL Stock Closing")


    # Generate and print forecasted data points plus standard errors

    # of the next 15 days

    forecasts <- predict(local.arimaModel, n.ahead = 15)
    forecasts

    $pred
    Time Series:
    Start = c(2011, 165)
    End = c(2011, 179)
    Frequency = 365
     [1] 33.29677 33.29317 33.29395 33.29395 33.29393 33.29393 33.29393 33.29393 33.29393 33.29393 33.29393
    [12] 33.29393 33.29393 33.29393 33.29393

    $se
    Time Series:
    Start = c(2011, 165)
    End = c(2011, 179)
    Frequency = 365
     [1] 0.4819417 0.6504925 0.7807798 0.8928901 0.9924032 1.0827998 1.1662115 1.2440430 1.3172839 1.3866617
    [11] 1.4527300 1.5159216 1.5765824 1.6349941 1.6913898
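The same fit-and-forecast pattern can be reproduced locally with no database, on a synthetic series. This sketch uses stats::arima() with the same ARIMA(2,1,2) specification as above; the data are randomly generated, not ORCL prices:

```r
# Local ARIMA(2,1,2) fit and 15-step prediction on synthetic data
set.seed(42)
d <- arima.sim(model = list(ar = c(0.4, -0.2), ma = c(0.3, 0.1)), n = 300)
y <- ts(cumsum(d))                   # integrate once so d=1 is appropriate
fit <- arima(y, order = c(2, 1, 2))
p <- predict(fit, n.ahead = 15)
str(p)  # a list with $pred (point forecasts) and $se (standard errors)
```

The $pred and $se components are what the server-side forecast output above is showing.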

    Monday Jun 17, 2013

    Oracle R Connector for Hadoop 2.1.0 released

    Oracle R Connector for Hadoop (ORCH), a collection of R packages that enables Big Data analytics using HDFS, Hive, and Oracle Database from a local R environment, continues to make advancements. ORCH 2.1.0 is now available, providing a flexible framework while remarkably improving performance and adding new analytics based on the ORCH framework.

    Previous releases enabled users to write MapReduce tasks in the R language and run them on data in HDFS. The API was then expanded to include support for Hive data sources, providing easy access to Hive data from R and leveraging the same transparency interface found in Oracle R Enterprise. ORCH HAL was included to enable portability and compatibility of ORCH with any Cloudera Hadoop distribution from version 3.x up to 4.3.

    In this release, new analytic functions that work in parallel, distributed mode and execute on the Hadoop cluster, include:

    • Covariance and Correlation matrix computation
    • Principal Component Analysis
    • K-means clustering
    • Linear regression
    • Single layer feed forward neural networks for linear regression
    • Matrix completion using low rank matrix factorization
    • Non negative matrix factorization
    • Sampling
    • Predict methods

    ORCH 2.1.0 also adds support for keyless mapReduce output and many other improvements that contribute to overall performance enhancements.

    You can find an ORCH technical reference here, and download Oracle R Connector for Hadoop here.

    Wednesday Jun 12, 2013

    R to Oracle Database Connectivity: Use ROracle for both Performance and Scalability

    R users have a few choices of how to connect to their Oracle Database. The most commonly seen include: RODBC, RJDBC, and ROracle. However, these three packages have significantly different performance and scalability characteristics which can greatly impact your application development. In this blog, we’ll discuss these options and highlight performance benchmark results on a wide range of data sets.

    If you use ROracle, we'd like to hear about your experience. Please take this brief survey.

    By way of introduction, RODBC is an R package that implements ODBC database connectivity. There are two groups of functions: the largely internal odbc* functions implement low-level access to the corresponding ODBC functions having a similar name, and the higher level sql* functions that support read, save, copy, and manipulation of data between R data.frame objects and database tables. Here is an example using RODBC:


    library(RODBC)
    con <- odbcConnect("DD1", uid="rquser", pwd="rquser", rows_at_time = 500)
    sqlSave(con, test_table, "TEST_TABLE")
    sqlQuery(con, "select count(*) from TEST_TABLE")
    d <- sqlQuery(con, "select * from TEST_TABLE")


    The R package RJDBC is an implementation of the R DBI package – database interface – that uses JDBC as the back-end connection to the database. Any database that supports a JDBC driver can be used in connection with RJDBC. Here is an example using RJDBC:


    library(RJDBC)
    drv <- JDBC("oracle.jdbc.OracleDriver",
                classPath="…tklocal/instantclient_11_2/ojdbc5.jar", " ")
    con <- dbConnect(drv, "jdbc:oracle:thin:@myHost:1521:db", "rquser", "rquser")
    dbWriteTable(con, "TEST_TABLE", test_table)
    dbGetQuery(con, "select count(*) from TEST_TABLE")
    d <- dbReadTable(con, "TEST_TABLE")

    The ROracle package is an implementation of the R DBI package that uses Oracle OCI for high performance and scalability with Oracle Databases. It requires Oracle Instant Client or Oracle Database Client to be installed on the client machine. Here is an example using ROracle:


    library(ROracle)
    drv <- dbDriver("Oracle")
    con <- dbConnect(drv, "rquser", "rquser")
    dbWriteTable(con, "TEST_TABLE", test_table)
    dbGetQuery(con, "select count(*) from TEST_TABLE")
    d <- dbReadTable(con, "TEST_TABLE")


    Notice that since both RJDBC and ROracle implement the DBI interface, their code is the same except for the driver and connection details.

    To compare these interfaces, we prepared tests along several dimensions:

    • Number of rows – 1K, 10K, 100K, and 1M
    • Number of columns – 10, 100, 1000
    • Data types – NUMBER, BINARY_DOUBLE, TIMESTAMP, and VARCHAR; Numeric data is randomly generated, all character data is 10 characters long.
    • Interface: RODBC 1.3-6 (with Data Direct 7.0 driver), RJDBC 0.2-1 (with rJava 0.9-4, with an increased memory limit in, and ROracle 1.1-10 (with Oracle Database Client)
    • Types of operations: select *, create table, connect

    Loading database data to an R data.frame

    Where an in-database capability as provided by Oracle R Enterprise is not available, typical usage is to pull data to the R client for subsequent processing. In Figure 1, we compare the execution time to pull 10, 100, and 1000 columns of data from 1K, 10K, 100K, and 1M rows of BINARY_DOUBLE data on a log-log scale. Notice that RJDBC does not scale to 100 columns x 1M rows, or above 1000 columns x 100K rows. While RODBC and ROracle both scale to these volumes, ROracle is consistently faster than RODBC: up to 2.5X faster. Compared to RJDBC, ROracle is up to 79X faster.

    Figure 1: Comparison of RJDBC, RODBC, and ROracle for BINARY_DOUBLE for Select *

    In Figure 2, we provide the range of results for RODBC, ROracle, and RJDBC across all data types. Notice that only ROracle provides the full range of scalability while providing superior performance in general.

    ROracle is virtually always faster than RODBC: NUMBER data up to 2.5X faster, VARCHAR2 data up to 142X faster, and time stamp data up to 214X faster. RODBC fails to process 1000 columns at 1M rows.

    Compared to RJDBC, ROracle is up to 13X faster on NUMBER data, 79X faster on BINARY_DOUBLE data, and 3X faster for VARCHAR2 data (excluding the 25X advantage over the smallest data set). Note that RODBC and RJDBC have a limit of 255 characters on the length of VARCHAR2 columns. TIMESTAMP data is the one area where RJDBC initially shines, but it then fails to scale to larger data sets.

    Figure 2: Comparing the three interfaces for select * from <table>

    Data set sizes represented in megabytes are captured in Table 1 for all data types. With only minor variation, the data sizes are the same across data types.

    Table 1: Dataset sizes in megabytes

    Creating database tables from an R data.frame

    Data or results created in R may need to be written to a database table. In Figure 3, we compare the execution time to create tables with 10, 100, and 1000 columns of data and 1K, 10K, 100K, and 1M rows for BINARY_DOUBLE. Notice that in all three cases, RJDBC is slowest and does not scale. RJDBC does not support the NUMBER or BINARY_DOUBLE data types, but uses FLOAT(126) instead. ROracle scaled across the remaining data types, for which RODBC and RJDBC were not tested.

    ROracle is 61X faster than RODBC for 10 columns x 10K rows, with a median of 5X faster across all data sets. Compared to RJDBC, ROracle is 630X faster on 10 columns x 10K rows, with a median of 135X faster across all data sets. RJDBC did not scale to the 1M row data sets.

    Figure 3: Comparison of RJDBC, RODBC, and ROracle for BINARY_DOUBLE create table

    Connecting to Oracle Database

    Depending on the application, any sub-second response time may be sufficient. However, as depicted in Figure 4, ROracle introduces minimal time to establish a database connection: it is nearly 10X faster than RJDBC and 1.6X faster than RODBC.

    Figure 4: Database connection times for ROracle, RODBC, and RJDBC

    In summary, for maximal performance and scalability, ROracle can support a wide range of application needs. RJDBC has significant limitations in both performance and scalability. RODBC can be more difficult to configure on various platforms and while it largely scales to the datasets tested here, its performance lags behind ROracle.

    If you use ROracle, we'd like to hear about your experience. Please take this brief survey.

    All tests were performed on a 16 processor machine with 4 core Intel Xeon E5540 CPUs @ 2.53 GHz and 74 GB RAM, with Oracle Database. For JDBC, the following was modified before installing rJava.

    rJava_0.9-4.tar.gz\rJava_0.9-4.tar\rJava\jri\bootstrap\ was modified to use 2GB:

    try {
        System.out.println(jl.toString()+" -cp "+System.getProperty("java.class.path")+" -Xmx2g -Dstage=2 Boot");
        Process p = Runtime.getRuntime().exec(new String[] {
            jl.toString(), "-cp", System.getProperty("java.class.path"), "-Xmx2g", "-Dstage=2", "Boot" });
        System.out.println("Started stage 2 ("+p+"), waiting for it to finish...");
        p.waitFor();
    } catch (Exception re) {}

    Monday Jun 10, 2013

    Bringing R to the Enterprise - new white paper available

    Check out this new white paper entitled "Bringing R to the Enterprise -  A Familiar R Environment with Enterprise-Caliber Performance, Scalability, and Security."

    In this white paper, we begin with "Beyond the Laptop" exploring the ability to run R code in the database, working with CRAN packages at the database server, operationalizing R analytics, and leveraging Hadoop from the comfort of the R language and environment.

    Excerpt: "Oracle Advanced Analytics and Oracle R Connector for Hadoop combine the advantages of R with the power and scalability of Oracle Database and Hadoop. R programs and libraries can be used in conjunction with these database assets to process large amounts of data in a secure environment. Customers can build statistical models and execute them against local data stores as well as run R commands and scripts against data stored in a secure corporate database."

    The white paper continues with three use cases involving Oracle Database and Hadoop: analyzing credit risk, detecting fraud, and preventing customer churn.  The conclusion: providing analytics for the enterprise based on the R environment is here!

    Wednesday Jun 05, 2013

    Oracle R Distribution for R 2.15.3 is released

    We are pleased to announce that Oracle R Distribution (ORD) for R 2.15.3 is available for download today. This update consists of mostly minor bug fixes, and is the final release of the R 2.x series.

    Oracle recommends using yum to install ORD from our public yum server. To install ORD 2.15.3, first remove the previously installed ORD. For example, if ORD 2.15.2 is installed, remove it along with its dependencies:

    1. Become root

    sudo su -

    2. Uninstall ORD 2.15.2 RPMs, in this order:

    rpm -e R-2.15.2-1.el5.x86_64
    rpm -e R-devel
    rpm -e R-core
    rpm -e libRmath-devel
    rpm -e libRmath

    3. Go to and follow these steps to install ORD 2.15.3:

    Install the yum repository as follows

    cd /etc/yum.repos.d

    4a. Use this command to download the Oracle Linux 5 yum repository (el5.repo):


    Open a text editor on the file just downloaded

    vi public-yum-el5.repo


    Set "enabled=1" for [el5_addons]
    Set "enabled=1" for [el5_latest]

    4b. Use this command to download the Oracle Linux 6 yum repository (ol6.repo):


    Open a text editor on the file just downloaded

    vi public-yum-ol6.repo


    Set "enabled=1" for [ol6_addons]
    Set "enabled=1" for [ol6_latest]

    5. To install ORD, use the command

    yum install R.x86_64
    You can check that ORD was installed by starting it from the command line; the startup banner identifies it as Oracle R Distribution.

    6. Older open source R packages may need to be re-installed after an ORD upgrade, which is accomplished by running:

    R> update.packages(checkBuilt=TRUE)

    This command upgrades open source packages if a more recent version exists on CRAN or if the installed package was built with an older version of R.

    Oracle R Distribution for R 2.15.3 is certified with Oracle R Enterprise 1.3.1, available for download on Oracle Technology Network. Oracle offers support for users of Oracle R Distribution on Linux, AIX and Solaris 64 bit platforms.

    Tuesday May 28, 2013

    Converting Existing R Scripts to ORE - Getting Started

    Oracle R Enterprise provides a comprehensive, database-centric environment for end-to-end analytical processes in R, with immediate deployment to production environments. This message really resonates with our customers who are interested in executing R functions on database-resident data while seamlessly leveraging Oracle Database as a high-performance computing (HPC) environment. The ability to develop and operationalize R scripts for analytical applications in one step is quite appealing.

    One frequently asked question is how to convert existing R code that accesses data in flat files or the database to use Oracle R Enterprise. In this blog post, we discuss a few scenarios and how to begin converting existing R code to use Oracle R Enterprise.

    Consider the following scenarios:

    Scenario 1: A stand-alone R script that generates its own data and simply returns a result. Data is not obtained from the file system or database. This may result from performing simulations where data is dynamically generated, or perhaps from data accessed via a URL on the internet.

    Scenario 2: An R script that loads data from a flat file such as a CSV file, performs some computations in R, and then writes the result back to a file.

    Scenario 3: An R script that loads data from a database table via one of the database connector packages like RODBC, RJDBC, or ROracle, and writes a result back to the database using SQL statements or package functions.

    Scenario 1

    A stand-alone R script might normally be run on a user’s desktop, invoked as a cron job, or even via Java to spawn an R engine and retrieve the result, but we’d like to operationalize its execution as part of a database application, invoked from SQL. Here’s a simple script to illustrate the concept of converting such a script to be executed at the database server using ORE’s embedded R execution. The script generates a data.frame with some random columns, performs summary on that data and returns the summary statistics, which are represented as an R table.

    # generate data
    set.seed(1)
    n <- 1000
    df <- 3
    x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n, df=df))
    # perform some analysis
    res <- summary(x)
    # return the result
    res
    To convert this to use ORE, create a function with appropriate arguments and body, for example:

    myFunction1 <- function(n = 1000, df = 3, seed = 1) {
      set.seed(seed)
      x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n, df=df))
      res <- summary(x)
      res
    }

    Next, load the ORE packages and connect to Oracle Database using the ore.connect function. Using the all argument set to TRUE loads metadata for all the tables and views in that schema. We then store the function in the R script repository, invoking it via ore.doEval.

    # load ORE packages and connect to Oracle Database
    library(ORE)
    ore.connect("schema","sid","hostname","password", port=1521, all=TRUE)

    # load function into R script repository


    ore.scriptCreate("myFunction-1", myFunction1)

    # invoke using embedded R execution at the database server


    > ore.doEval(FUN.NAME="myFunction-1")
           a                b                  c           
     Min.   :   1.0   Min.   :-3.00805   Min.   : 0.03449  
     1st Qu.: 250.8   1st Qu.:-0.69737   1st Qu.: 1.27386  
     Median : 500.5   Median :-0.03532   Median : 2.36454  
     Mean   : 500.5   Mean   :-0.01165   Mean   : 3.07924  
     3rd Qu.: 750.2   3rd Qu.: 0.68843   3rd Qu.: 4.25994  
     Max.   :1000.0   Max.   : 3.81028   Max.   :17.56720  

    Of course, we’re using default values here. To provide different arguments, change the invocation with arguments as follows:

    ore.doEval(FUN.NAME="myFunction-1", n=500, df=5, seed=2)

    > ore.doEval(FUN.NAME="myFunction-1", n=500, df=5, seed=2)
           a               b                  c          
     Min.   :  1.0   Min.   :-2.72182   Min.   : 0.1621  
     1st Qu.:125.8   1st Qu.:-0.65346   1st Qu.: 2.6144  
     Median :250.5   Median : 0.04392   Median : 4.4592  
     Mean   :250.5   Mean   : 0.06169   Mean   : 5.0386  
     3rd Qu.:375.2   3rd Qu.: 0.79096   3rd Qu.: 6.8467  
     Max.   :500.0   Max.   : 2.88842   Max.   :17.0367  

    Having successfully invoked this from the R client (my laptop), we can now invoke it from SQL. Here, we retrieve the summary result, which is an R table, as an XML string.

    select *

    from table(rqEval( NULL,'XML','myFunction-1'));

    The result can be viewed from SQL Developer.

    The following shows the XML output in a more structured manner.

    What if we wanted to get the result to appear as a SQL table? Since the current result is an R table (an R object), we need to convert it to a data.frame to return it. We’ll make a few modifications to “myFunction-1” above. Most notable is the need to convert the table object in res to a data.frame. There are a variety of ways to do this.
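One such conversion can be sketched on a small local data.frame: each cell of the summary table is text like "Min.   :1.00", so the numeric part can be parsed out. This is an illustration only, separate from the function used in this post:

```r
# Parse summary() table cells into numeric columns of a data.frame
x <- data.frame(a = 1:4, b = c(2, 4, 6, 8))
res <- summary(x)                      # class "table"; cells like "Min.   :1.00  "
vals <- apply(as.matrix(res), 2,
              function(col) as.numeric(sub("^.*:", "", col)))
stats <-, 6, 2))
names(stats) <- c("a", "b")
stats$statname <- c("min", "1stQ", "median", "mean", "3rdQ", "max")
stats[, c(3, 1, 2)]                    # statname first, then the data columns
```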

    myFunction2 <- function(n = 1000, df = 3, seed = 1) {
      # generate data
      set.seed(seed)
      x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n, df=df))
      # perform some analysis
      res <- summary(x)
      # convert the table result to a data.frame
      res.df <- as.matrix(res)
      res.sum <-, 9, 20)), 6, 3))
      names(res.sum) <- c('a','b','c')
      res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")
      res.sum <- res.sum[,c(4,1:3)]
      res.sum
    }

    # load function into R script repository


    ore.scriptCreate("myFunction-2", myFunction2)

    We’ll now modify the SQL statement to specify the format of the result.

    select *

    from table(rqEval( NULL,'select cast(''a'' as VARCHAR2(12)) as "statname",

    1 "a", 1 "b", 1 "c" from dual ','myFunction-2'));

    Here’s the result as viewed from SQL Developer.

    This type of result could be incorporated into any SQL application accepting table or view input from a SQL query. That is particularly useful in combination with OBIEE dashboards via an RPD.

    Scenario 2

    If you’ve been loading data from a flat file, perhaps a CSV file, your R code may look like the following, which builds a model and writes that model to a file for future use, perhaps in scoring. It also generates a graph of the clusters highlighting the individual points, colored by their cluster id, with the centroids indicated with a star.

    # read data
    dat <- read.csv("myDataFile.csv")
    # build a clustering model
    cl <- kmeans(dat, 2)
    # write model to file
    save(cl, file="myClusterModel.dat")
    # create a graph and write it to a file
    pdf("myGraphFile.pdf")
    plot(dat, col = cl$cluster)
    points(cl$centers, col = 1:2, pch = 8, cex=2)
    The resulting PDF file contains the following image.
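Since myDataFile.csv itself is not shown, here is a self-contained variant of the same script on synthetic two-cluster data; the data and file names are illustrative stand-ins:

```r
# Two well-separated Gaussian clusters stand in for the CSV data
set.seed(7)
dat <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))
colnames(dat) <- c("x", "y")
cl <- kmeans(dat, 2)
save(cl, file = tempfile(fileext = ".dat"))   # stands in for myClusterModel.dat
plot(dat, col = cl$cluster)                   # points colored by cluster id
points(cl$centers, col = 1:2, pch = 8, cex = 2)
```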

    To convert this script for use in ORE, there are several options. We’ll explore two: the first involving minimal change to use embedded R execution, and the second leveraging in-database techniques. First, we’ll want the data we used above in variable dat to be loaded into the database.

    # create a row id to enable ordered results (if a key doesn’t already exist)

    dat$ID <- 1:nrow(dat)

    # remove the table if it exists
    ore.drop(table="MY_DATA")
    # create the table using the R data.frame, resulting in an ore.frame named MY_DATA
    ore.create(dat, table="MY_DATA")
    # assign the ID column as the row.names of the ore.frame

    row.names(MY_DATA) <- MY_DATA$ID

    In the first example, we’ll use embedded R execution and pass the data to the function via ore.tableApply. We’ll generate the graph, but simply display it within the function to allow embedded R execution to return the graph as a result. (Note we could also write the graph to a file in any directory accessible to the database server.) Instead of writing the model to a file, which requires keeping track of its location, as well as worrying about backup and recovery, we store the model in the database R datastore using All this requires minimal change. As above, we could store the function in the R script repository and invoke it by name – both from R and SQL. In this example, we simply provide the function itself as argument.

    myClusterFunction1 <- function(x) {
      cl <- kmeans(x, 2)
      # store the model in the database R datastore, name="myClusterModel", overwrite=TRUE)
      plot(x, col = cl$cluster)
      points(cl$centers, col = 1:2, pch = 8, cex=2)
    }

    ore.tableApply(MY_DATA[,c('x','y')], myClusterFunction1,
                   ore.connect = TRUE)
    The ore.tableApply function projects the x and y columns of MY_DATA as input and also specifies ore.connect as TRUE since we are using the R datastore, which requires a database connection. Optionally, we can specify control arguments to the PNG output. In this example, these are the height and width of the image.

    For the second example, we convert this to leverage the ORE Transparency Layer. We’ll use the in-database K-Means algorithm and save the model in a datastore named “myClusterModel”, as we did above. Since ore.odmKMeans doesn’t automatically assign cluster ids (the data may be very large, or the assignments may not be required), the scoring is done separately. Note, however, that the prediction results also exist in the database as an ore.frame. To ensure ordering, we also assign row.names to the ore.frame pred. Lastly, we create the plot. Coloring the nodes requires pulling the cluster assignments; however, the points themselves can be accessed from the ore.frame. The centroid points are obtained from cl$centers2 of the cluster model.

    # build a clustering model in-database

    cl <- ore.odmKMeans(~., MY_DATA, 2)
    # save model in database R datastore, name="myClusterModel", overwrite=TRUE)

    # generate predictions to assign each row a cluster id, supplement with original data

    pred <- predict(cl,MY_DATA,supp=c('x','y','ID'),type="class")

    # assign row names to ensure ordering of results

    row.names(pred) <- pred$ID

    # create the graph

    plot(pred[,c('x','y')], col = ore.pull(pred$CLUSTER_ID))

    points(cl$centers2[,c('x','y')], col = c(2,3), pch = 8, cex=2)

    We can also combine using the transparency layer within an embedded R function. But we’ll leave that as an exercise to the reader.

    Scenario 3

    In this last scenario, the data already exists in the database and one of the database interface packages, such as RODBC, RJDBC, or ROracle, is used to retrieve data from and write data to the database. We’ll illustrate this with ROracle, but the same holds for the other two packages.

    # connect to the database
    library(ROracle)
    drv <- dbDriver("Oracle")
    con <- dbConnect(drv, "mySchema", "myPassword")

    # retrieve the data specifying a SQL query

    dat <- dbGetQuery(con, 'select * from MY_RANDOM_DATA where "a" > 100')

    # perform some analysis

    res <- summary(dat)

    # convert the table result to a data.frame for output as table

    res.df <- as.matrix(res)

    res.sum <-, 9, 20)), 6, 3))

    names(res.sum) <- c('a','b','c')

    res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")

    res.sum <- res.sum[,c(4,1:3)]


    # write the result back to the database
    dbWriteTable(con, "SUMMARY_STATS", res.sum)

    Converting this to ORE is straightforward. We’re already connected to the database using ore.connect from previous scenarios, so the existing table MY_RANDOM_DATA was already loaded in the environment as an ore.frame. Executing lists this table is the result, so we can just start using it.


    ore.ls()
    [1] "MY_RANDOM_DATA"

    # no need to retrieve the data, use the transparency layer to compute summary

    res <- with(MY_RANDOM_DATA , summary(MY_RANDOM_DATA[a > 100,]))

    # convert the table result to a data.frame for output as table

    res.df <- as.matrix(res)

    res.sum <-,9,20)),6,3))

    names(res.sum) <- c('a','b','c')

    res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")

    res.sum <- res.sum[,c(4,1:3)]

    # create the database table

    ore.create(res.sum, "SUMMARY_STATS")


    As we did in previous scenarios, this script can also be wrapped in a function and used in embedded R execution. This too is left as an exercise to the reader.


    As you can see from the three scenarios discussed here, converting a script that accesses no external data, accesses and manipulates file data, or accesses and manipulates database data can be accomplished with a few strategic modifications. More involved scripts, of course, may require additional manipulation. For example, if the SQL query performs complex joins and filtering, along with derived column creation, the user may want to convert this SQL to the corresponding ORE Transparency Layer code, thereby eliminating reliance on SQL. But that’s a topic for another post.

    Wednesday May 22, 2013

    Big Data Analytics in R – the tORCH has been lit!

    This guest post from Anand Srinivasan compares performance of the Oracle R Connector for Hadoop with the R {parallel} package for covariance matrix computation, sampling, and parallel linear model fitting. 

    Oracle R Connector for Hadoop (ORCH) is a collection of R packages that enables Big Data analytics from the R environment. It enables a Data Scientist /Analyst to work on data straddling multiple data platforms (HDFS, Hive, Oracle Database, local files) from the comfort of the R environment and benefit from the R ecosystem.

    ORCH provides:

    1) Out of the box predictive analytic techniques for linear regression, neural networks for prediction, matrix completion using low rank matrix factorization, non-negative matrix factorization, k-means clustering, principal components analysis and multivariate analysis. While all these techniques have R interfaces, they are implemented either in Java or in R as distributed parallel implementations leveraging all nodes of your Hadoop cluster.

    2) A general framework, where a user can use the R language to write custom logic executable in a distributed parallel manner using available compute and storage resources.

    The main idea behind the ORCH architecture and its approach to Big Data analytics is to leverage the Hadoop infrastructure and thereby inherit all its advantages.

    The crux of ORCH is read parallelization and robust methods over parallelized data. Efficient parallelization of reads is the single most important step necessary for Big Data Analytics because it is either expensive or impractical to load all available data in a single thread.

    ORCH is often compared/contrasted with the other options available in R, in particular the popular open source R package called parallel. The parallel package provides a low-level infrastructure for “coarse-grained” distributed and parallel computation. While it is fairly general, it tends to encourage an approach that is based on using the aggregate RAM in the cluster as opposed to using the file system. Specifically, it lacks a data management component, a task management component and an administrative interface for monitoring. Programming, however, follows the broad Map Reduce paradigm.

     In the rest of this article, we assume that the reader has basic familiarity with the parallel package and proceed to compare ORCH and its approach with the parallel package. The goal of this comparison is to explain what it takes for a user to build a solution for their requirement using each of these technologies and also to understand the performance characteristics of these solutions.

    We do this comparison using three concrete use cases – covariance matrix computation, sampling and partitioned linear model fitting. The exercise is designed to be repeatable, so you, the reader, can try this “at home”. We will demonstrate that ORCH is functionally and performance-wise superior to the available alternative of using R’s parallel package.

    A six node Oracle Big Data Appliance v2.1.1 cluster is used in the experiments. Each node in this test environment has 48GB RAM and 24 CPU cores.

    Covariance Matrix Computation

    Computing covariance matrices is one of the most fundamental of statistical techniques.

    In this use case, we have a single input file, “allnumeric_200col_10GB” (see appendix on how to generate this data set), that is about 10GB in size and has a data matrix with about 3 million rows and 200 columns. The requirement is to compute the covariance matrix of this input matrix.

    Since a single node in the test environment has 48GB RAM and the input file is only 10GB, we start with the approach of loading the entire file into memory and then computing the covariance matrix using R’s cov function.

    > system.time(m <- matrix(scan(file="/tmp/allnumeric_200col_10GB",what=0.0, sep=","), ncol=200, byrow=TRUE))

    Read 611200000 items

    user system elapsed

    683.159 17.023 712.527

    > system.time(res <- cov(m))

    user system elapsed

    561.627 0.009 563.044

    We observe that the loading of data takes 712 seconds (vs. 563 seconds for the actual covariance computation) and dominates the cost. It would be even more pronounced (relative to the total elapsed time) if the cov(m) computation were parallelized using mclapply from the parallel package.

    Based on this, we see that for an efficient parallel solution, the main requirement is to parallelize the data loading. This requires that the single input file be split into multiple smaller-sized files. The parallel package does not offer any data management facilities; hence this step has to be performed manually using a Linux command like split. Since there are 24 CPU cores, we split the input file into 24 smaller files.

    time split -l 127334 /tmp/allnumeric_200col_10GB

    real 0m54.343s

    user 0m3.598s

    sys 0m24.233s

    Now, we can run the R script:


    # Read the data
    readInput <- function(id) {
      infile <- file.path("/home/oracle/anasrini/cov", paste("p", id, sep=""))
      matrix(scan(file=infile, what=0.0, sep=","), ncol=200, byrow=TRUE)

    # Main MAPPER function
    compCov <- function(id) {
      m <- readInput(id)   # read the input
      cs <- colSums(m)     # compute col sums
      nr <- nrow(m)        # num rows
      mtm <- crossprod(m)  # compute main cov portion
      list(mat=mtm, colsum=cs, nrow=nr)


    numfiles <- 24
    numCores <- 24

    # Map step
    system.time(mapres <- mclapply(seq_len(numfiles), compCov, mc.cores=numCores))

    # Reduce step
    system.time(xy <- Reduce("+", lapply(mapres, function(x) x$mat)))
    system.time(csf <- Reduce("+", lapply(mapres, function(x) x$colsum)))
    system.time(nrf <- Reduce("+", lapply(mapres, function(x) x$nrow)))
    sts <- csf %*% t(csf)
    m1 <- xy / (nrf - 1)
    m2 <- sts / (nrf * (nrf - 1))
    m3 <- 2 * sts / (nrf * (nrf - 1))
    covmat <- m1 + m2 - m3

    user     system  elapsed
    1661.196 21.209  77.781

    We observe that the elapsed time (excluding time to split the files) has now come down to 77 seconds. However, it took 54 seconds for splitting the input file into smaller files, making it a significant portion of the total elapsed time of 77+54 = 131 seconds.
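    The algebra in the reduce step (covmat <- m1 + m2 - m3, which reduces to m1 - m2 since m3 = 2*m2) can be verified in plain R on a small in-memory matrix:

```r
# Verify that the chunked crossprod/colSums reduction reproduces cov().
set.seed(42)
m <- matrix(rnorm(60), ncol = 3)
chunks <- list(m[1:10, ], m[11:20, ])   # simulate two file chunks
mapres <- lapply(chunks, function(x)
  list(mat = crossprod(x), colsum = colSums(x), nrow = nrow(x)))
xy  <- Reduce("+", lapply(mapres, function(x) x$mat))
csf <- Reduce("+", lapply(mapres, function(x) x$colsum))
nrf <- Reduce("+", lapply(mapres, function(x) x$nrow))
sts <- csf %*% t(csf)
covmat <- xy / (nrf - 1) + sts / (nrf * (nrf - 1)) - 2 * sts / (nrf * (nrf - 1))
all.equal(covmat, cov(m), check.attributes = FALSE)  # TRUE
```

    This works because cov(X) = (X'X - csf csf'/n) / (n - 1), and each chunk contributes additively to X'X, the column sums, and the row count.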

    Besides impacting performance, there are a number of more serious problems with having to deal with data management manually. We list a few of them here:

    1) In other scenarios, with larger files or larger number of chunks, placement of chunks also becomes a factor that influences I/O parallelism. Optimal placement of chunks of data over the available set of disks is a non-trivial problem

    2) Requirement of root access – Optimal placement of file chunks on different disks often requires root access. For example, only root has permissions to create files on disks corresponding to the File Systems mounted on /u03, /u04 etc on an Oracle Big Data Appliance node

    3) When multiple nodes are involved in the computation, moving fragments of the original data into different nodes manually can drain productivity

    4) This form of split can only work in a static environment – in a real-world dynamic environment, information about other workloads and their resource utilization cannot be factored in a practical manner by a human

    5) Requires admin to provide user access to all nodes of the cluster in order to allow the user to move data to different nodes

    ORCH-based solution

    On the other hand, using ORCH, we can directly use the out of the box support for multivariate analysis. Further, no manual steps related to data management (like splitting files and addressing chunk placement issues) are required since Hadoop (specifically HDFS) handles all those requirements seamlessly.

    >x <- hdfs.attach("allnumeric_200col_10GB")

    > system.time(res <- orch.cov(x))

    user system elapsed

    18.179 3.991 85.640

    Forty-two concurrent map tasks were involved in the computation above as determined by Hadoop.

    To conclude, we can see the following advantages of the ORCH based approach in this scenario :

    1) No manual steps. Data Management completely handled transparently by HDFS

    2) Out of the box support for cov. The distributed parallel algorithm is available out of the box and the user does not have to work it out from scratch

    3) Using ORCH we get comparable performance to that obtained through manual coding without any of the manual overheads


    Sampling

    We use the same single input file, “allnumeric_200col_10GB” in this case as well. The requirement is to obtain a uniform random sample from the input data set. The size of the sample required is specified as a percentage of the input data set size.

    Once again for the solution using the parallel package, the input file has to be split into smaller sized files for better read parallelism.


    # Read the data
    readInput <- function(id) {
      infile <- file.path("/home/oracle/anasrini/cov", paste("p", id, sep=""))
      matrix(scan(file=infile, what=0.0, sep=","), ncol=200, byrow=TRUE)

    # Main MAPPER function
    samplemap <- function(id, percent) {
      m <- readInput(id)    # read the input
      v <- runif(nrow(m))   # generate runif
      # pick only those rows where random < percent*0.01
      keep <- which(v < percent*0.01)
      m[keep, , drop=FALSE]

    numfiles <- 24
    numCores <- 24

    # Map step
    percent <- 0.001
    system.time(mapres <- mclapply(seq_len(numfiles), samplemap, percent,
                                   mc.cores=numCores))


    user system elapsed

    1112.998 23.196 49.561
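    The per-chunk samples returned by mclapply still need to be combined, e.g. with; the kept fraction should track the requested percentage. A small in-memory sketch of the same logic (with a much larger percentage so the effect is visible on toy data):

```r
set.seed(1)
# two simulated chunks of a larger matrix, 1000 rows x 2 cols each
chunks <- list(matrix(rnorm(2000), ncol = 2), matrix(rnorm(2000), ncol = 2))
percent <- 10   # keep roughly 10% of rows
samplemap <- function(m, percent) {
  keep <- which(runif(nrow(m)) < percent * 0.01)
  m[keep, , drop = FALSE]
}
mapres <- lapply(chunks, samplemap, percent)
samp <-, mapres)   # combine the per-chunk samples
nrow(samp) / 2000            # close to 0.10
```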

    ORCH based solution

    >x <- hdfs.attach("allnumeric_200col_10GB_single")

    >system.time(res <- orch.sample(x, percent=0.001))

    user system elapsed

    8.173 0.704 33.590

    The ORCH based solution outperforms the solution based on the parallel package. This is because orch.sample is implemented in Java, and the read rates obtained by a Java implementation are superior to what can be achieved in R.

    Partitioned Linear Model Fitting

    Partitioned Linear Model Fitting is a very popular use case. The requirement here is to fit separate linear models, one for each partition of the data. The data itself is partitioned based on a user-specified partitioning key.

    For example, using the ONTIME data set, the user could specify destination city as the partitioning key indicating the requirement for separate linear models (with, for example, ArrDelay as target), 1 per destination city.

    ORCH based solution

    dfs_res <-
        data = input,
        mapper = function(k, v) { orch.keyvals(v$Dest, v) },
        reducer = function(k, v) {
          lm_x <- lm(ArrDelay ~ DepDelay + Distance, v)
          orch.keyval(k, orch.pack(model=lm_x, count = nrow(v)))
        },
        config = new("mapred.config",
           = "ORCH Partitioned lm by Destination City",
            map.output      = mapOut,
            mapred.pristine = TRUE,
            reduce.output   = data.frame(key="", model="packed")))



    Notice that the Map Reduce framework is performing the partitioning. The mapper just picks out the partitioning key and the Map Reduce framework handles the rest. The linear model for each partition is then fitted in the reducer.

    parallel based solution

    As in the previous use cases, for good read parallelism, the single input file needs to be split into smaller files. However, unlike the previous use cases, there is a twist here.

    We noted that with the ORCH based solution it is the Map Reduce framework that does the actual partitioning. There is no such out of the box feature available with a parallel package-based solution. There are two options:

    1) Break up the file arbitrarily into smaller pieces for better read parallelism. Implement your own partitioning logic mimicking what the Map Reduce framework provides. Then fit linear models on each of these partitions in parallel.


    2) Break the file into smaller pieces such that each piece is a separate partition. Fit linear models on each of these partitions in parallel 

    Neither of these options is easy, and both require substantial user effort. The custom coding required for achieving parallel reads is significant.
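    Option 2 can be sketched in plain R with split plus mclapply; the partitioning that the Map Reduce framework provides for free must be coded by hand. The data frame below is synthetic, with columns that mimic the ONTIME fields:

```r
library(parallel)
set.seed(3)
# synthetic stand-in for the ONTIME data set
flights <- data.frame(
  Dest     = sample(c("SFO", "JFK", "ORD"), 3000, replace = TRUE),
  DepDelay = rnorm(3000, mean = 10, sd = 5),
  Distance = runif(3000, 200, 3000))
flights$ArrDelay <- 0.9 * flights$DepDelay + rnorm(3000)

# hand-coded partitioning step (ORCH's reducer gets this from the framework)
parts  <- split(flights, flights$Dest)
models <- mclapply(parts, function(d)
  lm(ArrDelay ~ DepDelay + Distance, data = d), mc.cores = 2)
sapply(models, function(m) coef(m)["DepDelay"])  # near 0.9 for each city
```

    Even this small sketch hides the harder parts: splitting the raw file so that reads parallelize, and moving partitions between nodes.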


    ORCH provides a holistic approach to Big Data Analytics in the R environment. By leveraging the Hadoop infrastructure, ORCH inherits several key components that are all required to address real world analytics requirements.

    The rich set of out-of-the-box predictive analytic techniques along with the possibility of authoring custom parallel distributed analytics using the framework (as demonstrated in the partitioned linear model fitting case) helps simplify the user’s task while meeting the performance and scalability requirements. 

    Appendix – Data Generation

    We show the steps required to generate the single input file “allnumeric_200col_10GB”.

    Run the following in R:

    x <- orch.datagen(datasize=10*1024*1024*1024, numeric.col.count=200,
, "allnumeric_200col_10GB")

    Then, from the Linux shell:

    hdfs dfs -rm -r -skipTrash /user/oracle/allnumeric_200col_10GB/__ORCHMETA__

    hdfs dfs -getmerge /user/oracle/allnumeric_200col_10GB /tmp/allnumeric_200col_10GB

    Monday May 06, 2013

    Oracle R Distribution for R 2.15.2 available on public-yum

    Oracle R Distribution (ORD) for R 2.15.2 on Linux is now available for download from Oracle's public-yum repository.  R 2.15.2 is a maintenance update that includes improved performance and reduced memory usage for some commonly-used functions, increased memory available for data on 64-bit systems, enhanced localization for Polish language users, and a number of bug fixes.  Detailed updates can be found in the NEWS file - see the section 'CHANGES IN R VERSION 2.15.2'.

    The most recent update to Oracle R Enterprise, version 1.3.1, is certified with both R 2.15.1 and R 2.15.2. Installing ORD from public-yum will pull the most recently posted version, R 2.15.2.  For example, on Oracle Linux 5, as root, cd to yum.repos.d, download the public yum repository configuration file, and enable the required repositories:

        cd /etc/yum.repos.d
        Edit file public-yum-el5.repo and set
            "enabled=1" for [el5_addons]
            "enabled=1" for [el5_latest]

    Next, install ORD:

        yum install R

    Start R.  Oracle R Distribution for R 2.15.2 is installed.

    To install an older version of ORD such as R 2.15.1, simply specify the R version at the install step:

        yum install R-2.15.1

    Detailed instructions for installing Oracle R Distribution are in the Oracle R Enterprise Installation and Administration Guide.  Oracle R Distribution for R 2.15.2 on AIX, Solaris X86 and Solaris SPARC will be available on Oracle's Free and Open Source Software portal in the coming weeks.

    Wednesday Apr 17, 2013

    Mind Reading... What are our customers thinking?

    Overhauling analytics processes is becoming a recurring theme among customers. A major telecommunication provider recently embarked on overhauling their analytics process for customer surveys. They had three broad technical goals:

    • Provide an agile environment that empowers business analysts to test hypotheses based on survey results
    • Allow dynamic customer segmentation based on survey responses and even specific survey questions to drive hypothesis testing
    • Make results of new surveys readily available for research

    The ultimate goal is to derive greater value from survey research that drives measurable improvements in survey service delivery, and as a result, overall customer satisfaction.

    This provider chose Oracle Advanced Analytics (OAA) to power their survey research. Survey results and analytics are maintained in Oracle Database and delivered via a parameterized BI dashboard. Both the database and BI infrastructure are standard components in their architecture.

    A parameterized BI dashboard enables analysts to create samples for hypothesis testing by filtering respondents to a survey question based on a variety of filtering criteria. This provider required the ability to deploy a range of statistical techniques depending on the survey variables, level of measurement of each variable, and the needs of survey research analysts.

    Oracle Advanced Analytics offers a range of in-database statistical techniques complemented by a unique architecture supporting deployment of open source R packages in-database to optimize data transport to and from database-side R engines. Additionally, depending on the nature of functionality in such R packages, it is possible to leverage data-parallelism constructs available as part of in-database R integration. Finally, all OAA functionality is exposed through SQL, the ubiquitous language of the IT environment. This enables OAA-based solutions to be readily integrated with BI and other IT technologies.

    The survey application noted above has been in production for 3 months. It supports a team of 20 business analysts and has already begun to demonstrate measurable improvements in customer satisfaction.

    In the rest of this blog, we explore the range of statistical techniques deployed as part of this application.

    At the heart of survey research is hypothesis testing. A completed customer satisfaction survey contains data used to draw conclusions about the state of the world. In the survey domain, hypothesis testing is comparing the significance of answers to specific survey questions across two distinct groups of customers - such groups are identified based on knowledge of the business and technically specified through filtering predicates.

    Hypothesis testing sets up the world as consisting of 2 mutually exclusive hypotheses:

    a) Null hypothesis - states that there is no difference in satisfaction levels between the 2 groups of customers

    b) Alternate hypothesis states that there is a significant difference in satisfaction levels between the 2 groups of customers

    Obviously, only one of these can be true, and which one is determined by the strength, probability, or likelihood of the null hypothesis relative to the alternate hypothesis. Simplistically, the degree of difference between, e.g., the average score from a specific survey question across two customer groups could provide the necessary evidence to decide which hypothesis is true.

    In practice the process of providing evidence to make a decision involves having access to a range of test statistics – a number calculated from each group that helps determine the choice of null or alternate hypothesis. A great deal of theory, experience, and business knowledge goes into selecting the right statistic based on the problem at hand.

    The t-statistic (available in-database) is a fundamental function used in hypothesis testing that helps understand the differences in means across two groups. When the t-values across 2 groups of customers for a specific survey question are extreme then the alternative hypothesis is likely to be true. It is common to set a critical value that the observed t-value should exceed to conclude that the satisfaction survey results across the two groups are significantly different. Other similar statistics available in-database include F-test, cross tabulation (frequencies of various response combinations captured as a table), related hypothesis testing functions such as chi-square functions, Fisher's exact test, Kendall's coefficients, correlation coefficients and a range of lambda functions.
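    As a small open-source illustration of the mechanics (the provider used the in-database equivalents; the scores below are simulated), a two-sample t-test comparing satisfaction scores across two customer groups:

```r
set.seed(7)
# simulated 1-10 satisfaction scores for two customer segments
groupA <- pmin(10, pmax(1, round(rnorm(200, mean = 7.2, sd = 1.5))))
groupB <- pmin(10, pmax(1, round(rnorm(200, mean = 6.5, sd = 1.5))))
tt <- t.test(groupA, groupB)   # Welch two-sample t-test
tt$statistic                   # the t-value, compared against a critical value
tt$p.value                     # small values favor the alternate hypothesis
```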

    If an analyst desires to compare more than 2 groups, then analysis of variance (ANOVA) provides a collection of commonly used techniques. This is an area where the R package ecosystem is rich, with several proven implementations. The R stats package implements several test statistics, and the function glm allows analysis of count data common in survey results, including building Poisson and log-linear models. R's MASS package implements a popular survey analysis technique called iterative proportional fitting. R's survey package has a rich collection of features for complex survey analysis.

    The provider was specifically interested in one function in the survey package - raking (also known as sample balancing) - a process that assigns a weight to each customer that responded to a survey such that the weighted distribution of the sample is in very close agreement with other customer attributes, such as the type of cellular plan, demographics, or average bill amount. Raking is an iterative process that uses the sample design weight as the starting weight and terminates when a convergence is achieved.
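    At its core, raking is iterative proportional fitting; a minimal sketch on a 2x2 table (the counts and margins are invented) shows the mechanics: alternately rescale rows and columns until both margins match the population targets.

```r
# Minimal raking (iterative proportional fitting) sketch.
tab <- matrix(c(30, 20, 25, 25), nrow = 2)  # observed sample counts
row.target <- c(60, 40)   # known population row margins
col.target <- c(45, 55)   # known population column margins
for (i in 1:50) {
  tab <- tab * (row.target / rowSums(tab))        # scale rows to targets
  tab <- t(t(tab) * (col.target / colSums(tab)))  # scale columns to targets
}
rowSums(tab)  # matches row.target
colSums(tab)  # matches col.target
```

    Production raking, as in the survey package, adds design weights, convergence checks, and many more margins, but the iteration is the same idea.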

    For this survey application, R scripts that expose a wide variety of statistical techniques - some in-database accessible through the transparency layer in Oracle R Enterprise and some in CRAN packages - were built and stored in the Oracle R Enterprise in-database R script repository. These parameterized scripts accept various arguments that identify samples of customers to work with as well as specific constraints for the various hypothesis test functions. The net result is greater agility since the business analyst determines both the set of samples to analyze as well as the application of the appropriate technique to the sample based on the hypothesis being pursued.

    For more information see these links for Oracle's R Technologies software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop

    Monday Apr 15, 2013

    Is the size of your lm model causing you headaches?

    If you build an R lm model with a relatively large number of rows, you may be surprised by just how large that lm model is and what impact it has on your environment and application.

    Why might you care about size? The most obvious is that the size of R objects impacts the amount of RAM available for further R processing or loading of more data. However, it also has implications for how much space is required to save that model or the time required to move it around the network. For example, you may want to move the model from the database server R engine to the client R engine when using Oracle R Enterprise Embedded R Execution. If the model is too large, you may encounter latency when trying to retrieve the model or even receive the following error:

    Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch,  :
      ORA-20000: RQuery error
    Error : serialization is too large to store in a raw vector

    If you get this error, there are at least a few options:

    • Perform summary component access, like coefficients, inside the embedded R function and return only what is needed
    • Save the model in a database R datastore and manipulate that model at the database server to avoid pulling it to the client
    • Reduce the size of the model by eliminating large and unneeded components

    In this blog post, we focus on the third approach and look at the size of lm model components, what you can do to control lm model size, and the implications for doing so. With vanilla R, objects are the "memory" serving as the repository for repeatability. As a result, models tend to be populated with the data used to build them to ensure model build repeatability.

    When working with database tables, this "memory" is not needed because governance mechanisms are already in place to ensure either data does not change or logs are available to know what changes took place. Hence it is unnecessary to store the data used to build the model into the model object.

    An lm model consists of several components, for example:

    coefficients, residuals, effects, fitted.values, rank, qr, df.residual, call, terms, xlevels, model, na.action

    Some of these components may appear deceptively small using R’s object.size function. The following script builds an lm model to help reveal what R reports for the size of various components. The examples use a sample of the ONTIME airline arrival and departure delays data set for domestic flights. The ONTIME_S data set is an ore.frame proxy object for data stored in an Oracle database and consists of 219932 rows and 26 columns. The R data.frame ontime_s is this same data pulled to the client R engine using ore.pull and is ~39.4MB.

    Note: The results reported below use R 2.15.2 on Windows. Serialization of some components in the lm model has been improved in R 3.0.0, but the implications are the same.

    f.lm.1 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat) <- f.lm.1(ontime_s)


    54807720 bytes

    Using the object.size function on the resulting model, the size is about 55MB. If we are only scoring data with this model, that seems like a lot of bloat for the few coefficients needed for scoring. Moving this object over a network will also not be instantaneous. But is this the true size of the model?

    A better way to determine just how big an object is, and what space is actually required to store the model or time to move it across a network, is the R serialize function.


    length(serialize(, NULL))
    [1] 65826324

    Notice that the size reported by object.size is different from that of serialize – a difference of 11MB or ~20% greater.

    What is taking up so much space? Let’s invoke object.size on each component of this lm model:

    lapply(, object.size)

    424 bytes

    13769600 bytes

    3442760 bytes

    48 bytes

    13769600 bytes

    56 bytes

    17213536 bytes

    48 bytes

    287504 bytes

    192 bytes

    1008 bytes

    4432 bytes

    6317192 bytes
    The components residuals, fitted.values, qr, model, and even na.action are large. Do we need all these components?

    The lm function provides arguments to control some aspects of model size. This can be done, for example, by specifying model=FALSE and qr=FALSE. However, as we saw above, there are other components that contribute heavily to model size.

    f.lm.2 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY,
                               data = dat, model=FALSE, qr=FALSE) <- f.lm.2(ontime_s)


    length(serialize(, NULL))
    [1] 51650410


    31277216 bytes

    The resulting serialized model size is down to ~52MB, which is not significantly smaller than the full model. The difference from the result reported by object.size is now ~20MB, or 39% smaller.

    Does removing these components have any effect on the usefulness of an lm model? We’ll explore this using four commonly used functions: coef, summary, anova, and predict. If we try to invoke summary on, the following error results:


    Error in qr.lm(object) : lm object does not have a proper 'qr' component.

    Rank zero or should not have used lm(.., qr=FALSE).

    The same error results when we try to run anova. Unfortunately, the predict function also fails with the error above. The qr component is necessary for these functions. Function coef returns without error.



     (Intercept)      DISTANCE      DEPDELAY
     0.225378249  -0.001217511   0.962528054

    If only coefficients are required, these settings may be acceptable. However, as we’ve seen, removing the model and qr components, while each is large, still leaves a large model. The really large components appear to be the effects, residuals, and fitted.values. We can explicitly nullify them to remove them from the model.

    f.lm.3 <- function(dat) {
      mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY,
                data = dat, model=FALSE, qr=FALSE)
      mod$effects <- mod$residuals <- mod$fitted.values <- NULL
    } <- f.lm.3(ontime_s)


    length(serialize(, NULL))
    [1] 24089000


    294968 bytes

    Thinking the model size should be small, we might be surprised to see the results above. The function object.size reports ~295KB, but serializing the model shows 24MB, a difference of 23.8MB or 98.8%. What happened? We’ll get to that in a moment. First, let’s explore what effect nullifying these additional components has on the model.

    To answer this, we’ll turn on model and qr, and focus on effects, residuals, and fitted.values. If we nullify effects, the anova results are invalid, but the other results are fine. If we nullify residuals, summary cannot produce residual and coefficient statistics, and it produces an odd F-statistic with a warning:

    Warning message:

    In : applied to non-(list or vector) of type 'NULL'

    The function anova produces invalid F values and residual statistics, clarifying with a warning:

    Warning message:

    In anova.lm(mod) :

    ANOVA F-tests on an essentially perfect fit are unreliable

    Otherwise, both predict and coef work fine.

    If we nullify fitted.values, summary produces an invalid F-statistic, issuing the warning:

    Warning message:

    In mean.default(f) : argument is not numeric or logical: returning NA

    However, there are no adverse effects on the results of the other three functions.

    Depending on what we need from our model, some of these components could be eliminated. But let’s continue looking at each remaining component, not with object.size, but serialize. Below, we use lapply to compute the serialized length of each model component. This reveals that the terms component is actually the largest component, despite object.size reporting only 4432 bytes above.

    as.matrix(lapply(, function(x) length(serialize(x,NULL))))

    coefficients 130
    rank         26
    assign       34
    df.residual  26
    na.action    84056
    xlevels      55
    call         275
    terms        24004509

    If we nullify the terms component, the model becomes quite compact. (By the way, if we simply nullify terms, summary, anova, and predict all fail.) Why is the terms component so large? It turns out it has an environment object as an attribute. The environment contains the variable dat, which holds the original data with 219932 rows and 26 columns. R’s serialize function includes this object, which is why the model is so large. The function object.size ignores the contents of such environments.

    attr($terms, ".Environment")
    <environment: 0x1d6778f8>
    ls(envir = attr($terms, ".Environment"))
    [1] "dat"
    envir <- attr($terms, ".Environment")
    d <- get("dat", envir=envir)
    dim(d)
    [1] 219932     26
    length(serialize(attr($terms, ".Environment"), NULL))
    [1] 38959319
    object.size(attr($terms, ".Environment"))
    56 bytes

    If we remove this object from the environment, the serialized object size also becomes small.

    rm(list=ls(envir = attr($terms, ".Environment")),
       envir = attr($terms, ".Environment"))
    ls(envir = attr($terms, ".Environment"))
    character(0)
    length(serialize(, NULL))
    [1] 85500

    Call:
    lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE,
        qr = FALSE)

    (Intercept)     DISTANCE     DEPDELAY 
       0.225378    -0.001218     0.962528 

Is the associated environment essential to the model? If not, we could empty it to significantly reduce model size. We'll rebuild the model using the function f.lm.full:

f.lm.full <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
mod.full <- f.lm.full(ontime_s)
ls(envir = attr(mod.full$terms, ".Environment"))
[1] "dat"
length(serialize(mod.full, NULL))
[1] 65826324

Next, we create the model while removing some components, as defined in the function f.lm.small:

f.lm.small <- function(dat) {
  f.lm <- function(dat) {
    mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE)
    mod$fitted.values <- NULL
    mod
  }
  mod <- f.lm(dat)
  # empty the env associated with the local function
  e <- attr(mod$terms, ".Environment")
  # set parent env to .GlobalEnv so serialization doesn't include its contents
  parent.env(e) <- .GlobalEnv
  rm(list = ls(envir = e), envir = e)  # remove all objects from this environment
  mod
}
mod.small <- f.lm.small(ontime_s)
ls(envir = attr(mod.small$terms, ".Environment"))
length(serialize(mod.small, NULL))
[1] 16219251

We can use the same function with embedded R execution.

mod.small <- ore.pull(ore.tableApply(ONTIME_S, f.lm.small))
ls(envir = attr(mod.small$terms, ".Environment"))
length(serialize(mod.small, NULL))
[1] 16219251
as.matrix(lapply(mod.small, function(x) length(serialize(x, NULL))))
    coefficients  130   
    residuals     4624354
    effects       3442434
    rank          26    
    fitted.values 4624354
    assign        34    
    qr            8067072
    df.residual   26    
    na.action     84056 
    xlevels       55    
    call          245   
    terms         938   

    Making this change does not affect the workings of the model for coef, summary, anova, or predict. For example, summary produces expected results:


Call:
lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE)

Residuals:
     Min       1Q   Median       3Q      Max
-1462.45    -6.97    -1.36     5.07   925.08

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.254e-01  5.197e-02   4.336 1.45e-05 ***
DISTANCE    -1.218e-03  5.803e-05 -20.979  < 2e-16 ***
DEPDELAY     9.625e-01  1.151e-03 836.289  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.73 on 215144 degrees of freedom
  (4785 observations deleted due to missingness)
Multiple R-squared: 0.7647,     Adjusted R-squared: 0.7647
F-statistic: 3.497e+05 on 2 and 215144 DF,  p-value: < 2.2e-16

    Using the model for prediction also produces expected results.

lm.pred <- function(dat, mod) {
  prd <- predict(mod, newdata = dat)
  prd[as.integer(rownames(prd))] <- prd
  cbind(dat, PRED = prd)
}

dat.test <- with(ontime_s, ontime_s[YEAR == 2003 & MONTH == 5,
                 c("ARRDELAY", "DISTANCE", "DEPDELAY")])
lm.pred(dat.test, mod.small)
       ARRDELAY DISTANCE DEPDELAY        PRED
163267        0      748       -2 -2.61037575
163268       -8      361        0 -0.21414306
163269       -5      484        0 -0.36389686
163270       -3      299        3  2.74892676
163271        6      857       -6 -6.59319662
163272      -21      659       -8 -8.27718564
163273       -2     1448        0 -1.53757703
163274        5      238        9  8.59836323
163275       -5      744        0 -0.68044960
163276       -3      199        0 -0.01690635

    As shown above, an lm model can become quite large. At least for some applications, several of these components may be unnecessary, allowing the user to significantly reduce the size of the model and space required for saving or time for transporting the model. Relying on Oracle Database to store the data instead of the R model object further allows for significant reduction in model size.
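The trimming steps above can be collected into one reusable helper. This is a base-R sketch, not code from the original post; the name slim_lm is hypothetical, and it drops only the components the post shows to be safely removable (the stored model frame, the fitted values, and the contents of the captured environment):

```r
# Hypothetical helper: shrink an lm model for storage or transport.
slim_lm <- function(mod) {
  mod$model <- NULL          # drop the stored model frame, if present
  mod$fitted.values <- NULL  # drop fitted values
  e <- attr(mod$terms, ".Environment")
  if (!identical(e, globalenv())) {
    # detach from the enclosing scope so serialize() doesn't drag it along,
    # then remove captured objects such as the original data
    parent.env(e) <- globalenv()
    rm(list = ls(envir = e, all.names = TRUE), envir = e)
  }
  mod
}
```

Built inside a function (as with f.lm.small above), the returned model still supports predict with newdata while serializing to a fraction of the original size.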

    Friday Mar 22, 2013

    Are you a Type I or Type II Data Scientist?

    The role of Data Scientist has been getting a lot of attention lately. Brendan Tierney's blog post titled Type I and Type II Data Scientists adds an interesting perspective by defining and characterizing two key types of Data Scientist, both of which are needed in an organization.

Tierney writes about Type I Data Scientists, "These are people who know a lot about and are really good at a technique or technology that is associated with Data Science. ...have a deep knowledge of their topic and can tell/show you lots of detail about how best to explore data in their given field." Whereas Type II Data Scientists "concentrate on the business goals and business problems that the organisation are facing. Based on these they will identify what the data scientist project will focus on, ensuring that there is a measurable outcome and business goal." Type IIs are also characterized as good communicators and story tellers. 

    As the role of Data Scientist continues to evolve, continuing to add structure and definition to the role, or roles, will help organizations make the most of advanced analytics and data science projects.

    Monday Feb 18, 2013

    Saving R Objects in Oracle Database using Oracle R Enterprise 1.3 Datastore

R allows users to save R objects to disk. The whole workspace of R objects can be saved to a file and reloaded across R sessions, which allows users to return to their previous R environment even after quitting R, or to avoid recreating objects needed in the future. One such type of object is a predictive model, which can be built in one session and saved for scoring in another R session, or even in multiple, possibly parallel, R sessions.

    R provides the save and load functions where objects in memory are serialized and unserialized, respectively. Figure 1 depicts an example where two R objects, a linear model and data.frame are saved to a file, and then reloaded. When objects are restored, they have the same names as when they were saved.

    Figure 1: Using R save() and load() functions
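The round trip in Figure 1 can be sketched as follows (the object and file names here are illustrative):

```r
# Save a linear model and its data.frame, then restore both by name.
dat <- data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10))
lm.mod <- lm(y ~ x, data = dat)
f <- tempfile(fileext = ".RData")
save(lm.mod, dat, file = f)   # serialize both objects to disk
rm(lm.mod, dat)               # the objects are gone from the workspace
load(f)                       # restored under their original names
exists("lm.mod") && exists("dat")
# TRUE
```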

Oracle R Enterprise (ORE) 1.3 supports object persistence using an R datastore in the database. Now, ORE proxy objects, as well as any R objects, can be saved and restored across R sessions as a named entity in Oracle Database. Serializing ORE proxy objects, such as ore.frames, and saving them to a file doesn’t work across sessions, since any referenced temporary tables or other database objects are not preserved across R sessions. If these ore.frame proxy object references are not maintained, restoring them leaves them incomplete and inoperative.

Figure 2 shows an example similar to the previous one. The main difference is that we use and ore.load, providing the name of the datastore from which we want to retrieve ORE objects.

    Figure 2: Using ORE datastore functions and ore.load

Each schema has its own datastore table where R objects are saved. Because the datastore is managed in Oracle Database, ORE maintains referential integrity for saved objects: database objects referenced by a saved R object are retained, while otherwise temporary database objects that are no longer referenced are automatically deleted at the end of the R session. This applies, for example, to tables created via ore.push or Oracle Data Mining models produced using the OREdm package.

    Here’s a simple example:

    DAT1 <- ore.push(iris)

    ore.lm.mod <- ore.lm(Sepal.Length~., DAT1 )

    lm.mod <- lm(mpg ~ cyl + disp + hp + wt + gear, mtcars)

nb.mod <- ore.odmNB(YEAR ~ ARRDELAY + DEPDELAY + log(DISTANCE), ONTIME_S), lm.mod, nb.mod, name = "myModels")

We’re creating four objects: an ore.frame “DAT1” consisting of the iris data set, an ore.lm model that uses DAT1, a standard R lm model using the mtcars data set, and an ODM naïve Bayes model using ONTIME_S. We then invoke for the three models, using the datastore name “myModels”.

    The R objects are saved in the datastore, and any referenced data tables or ODM models are kept in the database, otherwise these are treated as temporary database objects and dropped when the ORE session ends.

    To load these objects, invoke:

    ore.load(name = "myModels")

     To see the content of a datastore, you can invoke ore.datastoreSummary with the name of the datastore.
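For example, a hypothetical session might look like this (these are ORE functions requiring a live database connection, so this is a sketch only):

```r
# Requires an established ORE connection (ore.connect).
ore.datastore()                          # list datastores in the current schema
ore.datastoreSummary(name = "myModels")  # object names, classes, and sizes
```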

    The datastore also makes it easy to access R and ORE objects within ORE embedded R execution functions. Simply, the name of the datastore is passed as an embedded R function argument. Within the function, ore.load is invoked with the name of that datastore. We'll see an example of this using the SQL API below. In addition, by maintaining persisted R objects in the database, ORE facilitates application deployment while leveraging existing database backup, recovery, and security mechanisms. 

    Consider the following ORE embedded R execution example using the SQL API.  We build a simple linear model and store it in a datastore with the name "myDatastore". This R script will be stored in the database R script repository with the name "BuildModel-1". To invoke this R script, we use the rqTableEval function, providing the input data from ONTIME_S, passing the parameters that include the name of the datastore, requesting XML output for the result, and specifying the R script by name. 

begin
  sys.rqScriptCreate('BuildModel-1',
   'function(dat, datastore_name) {
      mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat), name = datastore_name, overwrite = TRUE)
    }');
end;
/

select *
from table(rqTableEval(
  cursor(select ARRDELAY, DISTANCE, DEPDELAY from ONTIME_S),
  cursor(select 1 as "ore.connect", 'myDatastore' as "datastore_name"
         from dual),
  'XML',
  'BuildModel-1'));

To score using this model, we create an R script named "ScoreData-1" that loads the model from the named datastore, invokes predict, and binds the predictions with the predictors to be returned as the result. To invoke this script, we again use the rqTableEval function, which takes as parameters a cursor specifying the data to be scored, a parameter cursor, a description of the result as a SQL query, and the R script name.

begin
  sys.rqScriptCreate('ScoreData-1',
   'function(dat, datastore_name) {
      ore.load(datastore_name)
      pred <- predict(mod, newdata = dat)
      pred[as.integer(rownames(pred))] <- pred
      cbind(dat, PRED = pred)
    }');
end;
/

select *
from table(rqTableEval(
    cursor(select ARRDELAY, DISTANCE, DEPDELAY from ONTIME_S
           where YEAR = 2003 and MONTH = 5 and DAYOFMONTH = 2),
    cursor(select 1 as "ore.connect",
           'myDatastore' as "datastore_name" from dual),
    'select ARRDELAY, DISTANCE, DEPDELAY, 1 PRED from ontime_s',
    'ScoreData-1'))
order by 1, 2, 3;

    Datastore functionality supports the following interface:, ore.load, ore.datastore, ore.datastoreSummary, and ore.delete. See the online documentation for details. A presentation on the ORE transparency layer that provides additional examples is available here.

    Wednesday Feb 06, 2013

    Oracle R Enterprise 1.3 gives predictive analytics an in-database performance boost

    Recently released Oracle R Enterprise 1.3 adds packages to R that enable even more in-database analytics. These packages provide horizontal, commonly used techniques that are blazingly fast in-database for large data. With Oracle R Enterprise 1.3, Oracle makes R even better and usable in enterprise settings. (You can download ORE 1.3 here and documentation here.)

When it comes to predictive analytics, scoring (predicting outcomes using a data mining model) is often a time critical operation. Scoring can be done online (real-time), e.g., while a customer is browsing a webpage or using a mobile app, where on-the-spot recommendations can be made based on current actions. Scoring can also be done offline (batch), e.g., predicting which of your 100 million customers will respond to each of a dozen offers, where applications leverage the results to identify which customers should be targeted with a particular ad campaign or special offer.

In this blog post, we explore where using Oracle R Enterprise pays huge dividends. When working with small data, R can be sufficient, even when pulling data from a database. However, depending on the algorithm, the benefits of in-database computation can be seen at a few thousand rows. The time difference with tens of thousands of rows makes an interactive session more interactive; with hundreds of thousands of rows it becomes a real productivity gain; and on millions (or billions) of rows, it becomes a competitive advantage! In addition to performance benefits, ORE integrates R with the database, enabling you to leave data in place.

    We’ll look at a few proof points across Oracle R Enterprise features, including:

    • OREdm – a new package that provides R access to several in-database Oracle Data Mining algorithms (Attribute Importance, Decision Tree, Generalized Linear Models, K-Means, Naïve Bayes, Support Vector Machine).
    • OREpredict – a new package that enables scoring models built using select standard R algorithms in the database (glm, negbin, hclust, kmeans, lm, multinom, nnet, rpart).
• Embedded R Execution – an ORE feature that allows running R under database control, boosting the real-world performance of CRAN predictive analytics packages by providing faster access to data than is possible between the database and a client, as well as by leveraging a more powerful database machine with greater RAM and CPU resources.


    Pulling data out of a database for any analytical tool impedes interactive data analysis due to access latency, either directly when pulling data out of the database or indirectly via an IT process that involves requesting data to be staged in flat files. Such latencies can quickly become intolerable. On the R front, you’ll also need to consider whether the data will fit in memory. If flat files are involved, consideration needs to be given to how files will be stored, backed up, and secured.

Of course, model building and data scoring execution time is only part of the story. Consider scenario A, the “build combined script,” where data is extracted from the database, and an R model is built and persisted for later use. In the corresponding scenario B, the “score combined script,” data is pulled from the database, a previously built model loaded, data scored, and the scores written to the database. This is a typical scenario for use in, e.g., enterprise dashboards or within an application supporting campaign management or next-best-offer generation. In-database execution provides significant performance benefits, even for the relatively small data sets shown below. Readers should be able to reproduce such results at these scales. We’ve also included a Big Data example by replicating the 123.5 million row ONTIME data set to 1 billion rows. Consider the following examples:

Linear Models: We compared R's lm with ORE's in-database ore.lm algorithm on the combined scripts. On datasets ranging from 500K to 1.5M rows with 3 predictors, in-database analytics showed an average 2x-3x performance improvement for build, and nearly 4x performance improvement for scoring. Notice in Figure 1 that the slope of the trend is significantly less for ore.lm than lm, indicating greater scalability for ore.lm.

    Figure 1. Overall lm and ore.lm execution time for model building (A) and data scoring (B)

Figure 2 provides a more detailed view, breaking the build scenario into data pull and model build time, and the score scenario into data pull, data scoring, and score writing. For model building, notice that while data pull is a significant part of lm’s total build time, lm's actual build time alone is still greater than ore.lm's. A similar statement can be made in the case of scoring.

    Figure 2. Execution time components for lm and ore.lm (excluding model write and load)

    Naïve Bayes from the e1071 package: On 20-predictor datasets ranging from 50k to 150k rows, in-database ore.odmNB improved data scoring performance by a factor of 118x to 418x, while the full scenario B execution time yielded a 13x performance improvement, as depicted in Figure 3B. Using a non-parallel execution of ore.odmNB, we see the cross-over point where ore.odmNB overtakes R, but more importantly, the slope of the trend points to the greater scalability of ORE, as depicted in Figure 3A for the full scenario A execution time.

    Figure 3. Overall naiveBayes and ore.odmNB execution time for model building (A) and data scoring (B)

K-Means clustering: Using 6 numeric columns from the ONTIME airline data set, ranging from 1 million to 1 billion rows, we compare in-database ore.odmKMeans with R kmeans through embedded R execution with ore.tableApply. At 100 million rows, ore.odmKMeans demonstrates better performance than kmeans, and it remains scalable at 1 billion rows. The performance results depicted in Figure 4 use a log-log plot. The legend shows the function invoked and corresponding parameters, using a subset d of the ONTIME data set. While ore.odmKMeans scales linearly with the number of rows, R kmeans does not; further, R kmeans did not complete at 1 billion rows.

    Figure 4: K-Means clustering model building on Big Data


    With OREpredict, R users can also benefit from in-database scoring of R models. This becomes evident not only when considering the full “round trip” of pulling data from the database, scoring in R, and writing data back to the database, but also for the scoring itself.

Consider an lm model built using a dataset with 4 predictors and 1 million to 5 million rows. Pulling data from the database, scoring, and writing the results back to the database shows a pure R-based approach taking 4x - 9x longer than in-database scoring using ore.predict with that same R model. Notice in Figure 5 that the slope of the trend is dramatically less for ore.predict than predict, indicating greater scalability. When considering the scoring time only, ore.predict was 20x faster than predict in R for 5M rows. In ORE 1.3, ore.predict is recommended and will provide speedup over R for numeric predictors.

    Figure 5. Overall lm execution time using R predict vs. ORE ore.predict
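As a sketch of the ore.predict usage measured here (this requires ORE, a local copy ontime_s of the data, and the ONTIME_S proxy table, so it is not runnable stand-alone):

```r
# Build the model in open source R, score it inside Oracle Database.
mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = ontime_s)  # local R model
PRED <- ore.predict(mod, ONTIME_S)  # scoring executes in-database
head(PRED)
```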

For rpart, we see a similar result. On a 20-predictor, 1 million to 5 million row data set, ore.predict resulted in 6x – 7x faster execution. In Figure 6, we again see that the slope of the trend is dramatically less for ore.predict than predict, indicating greater scalability. When considering the scoring time only, ore.predict was 123x faster than predict in R for 5 million rows.

    Figure 6. Overall rpart Execution Time using R predict vs. ORE ore.predict

    This scenario is summarized in Figure 7. In the client R engine, we have the ORE packages installed. There, we invoke the pure R-based script, which requires pulling data from the database. We also invoke the ORE-based script that keeps the data in the database.

    Figure 7. Summary of OREpredict performance gains

To use a real world data set, we again consider the ONTIME airline data set with 123.5 million rows. We will build lm models with a varying number of coefficients, derived by converting categorical data to multiple columns. The variable p corresponds to the number of coefficients resulting from the transformed formula and depends on the number of distinct values in the column. For example, DAYOFWEEK has 7 values, so with DEPDELAY, p=9. In Figure 8, you see that using an lm model with embedded R for a single row (e.g., one-off or real-time scoring) has much more overhead (as expected, given that an R engine must be started) compared to ore.predict, which shows subsecond response time: 0.54 seconds at 40 coefficients and 1.1 seconds at 106 coefficients. Here are the formulas describing the columns included in the analysis:

• ARRDELAY ~ DEPDELAY (p=2)

    Figure 8. Comparing performance of ore.predict with Embedded R Execution for lm

    Compare this with scoring the entire ONTIME table of 123.5 million rows. We see that ore.predict outperforms embedded R until about 80 coefficients, when embedded R becomes the preferred choice. 

    Data Movement between R and Database: Embedded R Execution

    One advantage of R is its community and CRAN packages. The goal for Oracle R Enterprise with CRAN packages is to enable reuse of these packages while:

    • Leveraging the parallelization and efficient data processing capabilities of Oracle Database
    • Minimizing data transfer and communication overhead between R and the database
    • Leveraging R as a programming language for writing custom analytics

    There are three ways in which we’ll explore the performance of pulling data.

    1) Using ore.pull at a separate client R engine to pull data from the database

    2) Using Embedded R Execution and ore.pull within an embedded R script from a database-spawned R engine

    3) Using Embedded R Execution functions for data-parallelism and task-parallelism to pass database data to the embedded R script via function parameter
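The three approaches can be sketched as follows (this requires ORE and the ONTIME_S proxy table; ore.get and the ore.connect control argument are part of the ORE API, but treat the details here as illustrative):

```r
# 1) Client-side pull into local R memory
dat <- ore.pull(ONTIME_S)

# 2) Server-side pull inside an embedded R function
n <- ore.doEval(function() {
  dat <- ore.pull(ore.get("ONTIME_S"))  # runs in a database-spawned R engine
  nrow(dat)
}, ore.connect = TRUE)

# 3) Data passed to the embedded R function as an argument by the framework
n <- ore.tableApply(ONTIME_S, function(dat) nrow(dat))
```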

    With ORE Embedded R Execution (ERE), the database delivers data-parallelism and task-parallelism, and reduces data access latency due to optimized data transfers into R. Essentially, R runs under the control of the database. As illustrated in Figure 9, loading data at the database server is 12x faster than loading data from the database to a separate R client. Embedded R Execution also provides a 13x advantage when using ore.pull invoked at the database server within an R closure (function) compared with a separate R client. The data load from database to R client is depicted as 1x – the baseline for comparison with embedded R execution data loading.

    Figure 9. Summary of Embedded R Execution data load performance gains

    Data transfer rates are displayed in Figure 10, for a table with 11 columns and 5 million to 15 million rows of data. Loading data via ORE embedded R execution using server-side ore.pull or through the framework with, e.g., ore.tableApply (one of the embedded R execution functions) is dramatically faster than a non-local client load via ore.pull. The numbers shown reflect MB/sec data transfer rates, so a bigger bar is better!

    Figure 10. Data load and write execution time with 11 columns

While this is impressive, let’s expand our data up to 1 billion rows. To create our 1 billion row data set (1.112 billion rows), we duplicated the 123.5 million row ONTIME data set 9 times, replacing rows with year 1987 with years 2010 through 2033, and selecting 6 integer columns (YEAR, MONTH, DAYOFMONTH, ARRDELAY, DEPDELAY, DISTANCE), with a bitmap index on columns (YEAR, MONTH, DAYOFMONTH). The full data set weighs in at ~53 GB.

    In Figure 11, we see linear scalability for loading data into the client R engine. Times range from 2.8 seconds for 1 million rows, to 2700 seconds for 1 billion rows. While your typical user may not need to load 1 billion rows into R memory, this graph demonstrates the feasibility to do so.

    Figure 11. Client Load of Data via ore.pull for Big Data

In Figure 12, we look at how the degree of parallelism (DOP) affects data load times involving ore.rowApply. This test addresses the question of how fast ORE can load 1 billion rows, e.g., when scoring data. The degree of parallelism corresponds to the number of R engines spawned for concurrent execution at the database server. The number of chunks the data is divided into is 1 for a DOP of one, and 10 times the DOP for the remaining tests. For a DOP of 160, the data was divided into 1600 chunks, i.e., 160 R engines were spawned, each processing 10 chunks. The graph on the left shows that execution times for the 1 billion row data set improve through a DOP of 160. As expected, at some point the overhead of spawning additional R engines and partitioning the data outweighs the benefit. At its best time, processing 1 billion rows took 43 seconds.

Figure 12. Data load times via ore.rowApply with varying degree of parallelism

In the second graph of Figure 12, we contrast execution time for the “sweet spot” DOP identified in the previous graph with varying numbers of rows. Using this DOP of 160, with 1600 chunks of data, we see very little increase in execution time through 100 million rows (between 6.4 and 8.5 seconds in actual time). While 1 billion rows took significantly more, it was still only 43 seconds.

We can also consider data write at this scale. In Figure 13, we depict linear scalability from 1 million through 1 billion rows using the ore.create function to create database tables from R data. Actual times ranged from 2.6 seconds to roughly 2600 seconds.

    Figure 13. Data Write using ore.create for Big Data

ORE supports data-parallelism to enable, e.g., building predictive models in parallel on partitions of the data. Consider a marketing firm that micro-segments customers and builds predictive models on each segment. ORE embedded R execution automatically partitions the data, spawns R engines according to the degree of parallelism specified, and executes the specified user R function. To address how efficiently ore.groupApply can process data, Figure 14 shows the total execution time to process the 123.5M rows from ONTIME with a varying number of columns. The figure shows that ore.groupApply scales linearly as the number of columns increases. Three partitioning columns were selected based on their number of distinct values: TAILNUM (12861), DEST (352), and UNIQUECARRIER (29). For UNIQUECARRIER, the run with all columns (29 in total) could not be completed, since partitioning on 29 categories resulted in data too large for a single R engine.

    Figure 14. Processing time for 123.5M rows via ore.groupApply

    ORE also supports row-parallelism, where the same embedded R function can be invoked on chunks of rows. As with ore.groupApply, depending on the specified degree of parallelism, a different chunk of rows will be submitted to a dynamically spawned database server-side R engine. Figure 15 depicts a near linear execution time to process the 123.5M rows from ONTIME with varying number of columns. The chunk size can be specified, however, testing 3 chunk sizes (10k, 50k, and 100k rows) showed no significant difference in overall execution time, hence a single line is graphed.

    Figure 15. Processing time for 123.5M rows via ore.rowApply for chunk sizes 10k-100k
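A sketch of chunk-wise processing with ore.rowApply (requires ORE; here mod stands for a previously built model, and the chunk size is illustrative):

```r
# Score ONTIME_S in chunks; each chunk is handed to a server-side R engine.
res <- ore.rowApply(
  ONTIME_S,
  function(dat, mod) cbind(dat, PRED = predict(mod, newdata = dat)),
  mod = mod,      # model shipped to each spawned R engine
  rows = 100000)  # rows per chunk handed to each invocation
```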

All tests were performed on an Exadata X3-8. Except as noted, the client R session and database were on the same machine, so network latency for data read and write was minimal. Over a LAN or WAN, the benefits of in-database execution and ORE will be even more dramatic.

    Friday Jan 18, 2013

    Oracle R Distribution Performance Benchmark


    Oracle R Distribution provides dramatic performance gains with MKL

    Using the recognized R benchmark R-benchmark-25.R test script, we compared the performance of Oracle R Distribution with and without the dynamically loaded high performance Math Kernel Library (MKL) from Intel. The benchmark results show Oracle R Distribution is significantly faster with the dynamically loaded high performance library. R users can immediately gain performance enhancements over open source R, analyzing data on 64-bit architectures and leveraging parallel processing within specific R functions that invoke computations performed by these high performance libraries.

    The Community-developed test consists of matrix calculations and functions, program control, matrix multiplication, Cholesky Factorization, Singular Value Decomposition (SVD), Principal Component Analysis (PCA), and Linear Discriminant Analysis. Such computations form a core component of many real-world problems, often taking the majority of compute time. The ability to speed up these computations means faster results for faster decision making.
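You can get a rough feel for the BLAS behind your own R installation by timing one of these kernels yourself; recent R versions also report the linked BLAS/LAPACK libraries in sessionInfo():

```r
# Time a large matrix multiply; this is dramatically faster when R is
# linked against a tuned BLAS such as MKL or ACML.
n <- 1000
a <- matrix(rnorm(n * n), n)
b <- matrix(rnorm(n * n), n)
elapsed <- system.time(ab <- a %*% b)["elapsed"]
dim(ab)
# [1] 1000 1000
```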

    While the benchmark results reported were conducted using Intel MKL, Oracle R Distribution also supports AMD Core Math Library (ACML) and Solaris Sun Performance Library.

    Oracle R Distribution 2.15.1 x64 Benchmark Results (time in seconds)

Test                           ORD internal    ORD+MKL   ORD+MKL    ORD+MKL    ORD+MKL    Gain, MKL   Gain, MKL
                               BLAS/LAPACK,    1 thread  2 threads  4 threads  8 threads  4 threads   8 threads
                               1 thread
Matrix Calculations                 11.2          1.9       1.3        1.1        0.9        9.2x       11.4x
Matrix Functions                     7.2          1.1       0.6        0.4        0.4       17.0x       17.0x
Program Control                      1.4          1.3       1.5        1.4        0.8        0.0x        0.8x
Matrix Multiply                    517.6         21.2      10.9        5.8        3.1       88.2x      166.0x
Cholesky Factorization              25.0          3.9       2.1        1.3        0.8       18.2x       29.4x
Singular Value Decomposition       103.5         15.1       7.8        4.9        3.4       20.1x       40.9x
Principal Component Analysis       490.1         42.7      24.9       15.9       11.7       29.8x       40.9x
Linear Discriminant Analysis       419.8        120.9     110.8       94.1       88.0        3.5x        3.8x

    This benchmark was executed on a 3-node cluster, with 24 cores at 3.07GHz per CPU and 47 GB RAM, using Linux 5.5.

    In the first graph, we see significant performance improvements. For example, SVD with ORD plus MKL executes 20 times faster using 4 threads, and 29 times faster using 8 threads. For Cholesky Factorization, ORD plus MKL is 18 and 30 times faster for 4 and 8 threads, respectively.

In the second graph, we focus on the three longer running tests. Matrix multiplication is 88 and 166 times faster for 4 and 8 threads, respectively. PCA is 30 and 50 times faster, and LDA is over 3 times faster.

    This level of performance improvement can significantly reduce application execution time and make interactive, dynamically generated results readily achievable. Note that ORD plus MKL not only impacts performance on the client side, but also when used in combination with R scripts executed using Oracle R Enterprise Embedded R Execution. Such R scripts, executing at the database server machine, reap these performance gains as well. 

    Wednesday Dec 26, 2012

    Oracle R Enterprise 1.3 released

    We're pleased to announce the latest release of Oracle R Enterprise, now available for download. Oracle R Enterprise 1.3 features new predictive analytics interfaces for in-database model building and scoring, support for in-database sampling and partitioning techniques, and transparent support for Oracle DATE and TIMESTAMP data types to facilitate data preparation for time series analysis and forecasting. Oracle R Enterprise further enables transparent access to Oracle Database tables from R by enabling integer indexing and ensuring consistent ordering between data in R data frames and Oracle Database tables. The latest release also includes improved programming efficiencies and performance improvements.

    The key additions in version 1.3 include:

    Enhanced Model Scoring: The new package OREpredict enables in-database scoring of R-generated models. Supported models include linear regression (lm) and generalized linear models (glm), hierarchical clustering (hclust), k-means clustering (kmeans), multinomial log-linear models (multinom), neural networks (nnet), and recursive partitioning and regression trees (rpart).

Oracle Data Mining Support: The new package OREdm provides an R interface for in-database Oracle Data Mining predictive analytics and data mining algorithms. Supported models include attribute importance, decision trees, generalized linear models, k-means clustering, naïve Bayes, and support vector machines.

    Neural Network Modeling: A new feed-forward neural network algorithm with in-database execution.

    Date and Time Support: Support for Oracle DATE and TIMESTAMP data types and analytic capabilities that allow date arithmetic, aggregations, percentile calculations and moving window calculations for in-database execution.

    Sampling Methods: Enables in-database sampling and partitioning techniques for use against database-resident data. Techniques include simple random sampling, systematic sampling, stratified sampling, cluster sampling, quota sampling and accidental sampling.

    Object Persistence: New capabilities for saving and restoring R objects in an Oracle Database “datastore”, which supports not only in-database persistence of R objects, but the ability to easily pass any type of R objects to embedded R execution functions.

    Database Auto-Connection:  New functionality for automatically establishing database connectivity using contextual credentials inside embedded R scripts, allowing convenient and secure connections to Oracle Database.

    When used in conjunction with Oracle Exadata Database Machine and Oracle Big Data Appliance, Oracle R Enterprise and Oracle R Connector for Hadoop provide a full set of engineered systems to access and analyze big data. With Oracle R Enterprise, IT organizations can rapidly deploy advanced analytical solutions, while providing the knowledge to act on critical decisions.

    Stay tuned for blogs about the new ORE 1.3 features in upcoming posts. You can find more details about the features in Oracle R Enterprise 1.3 in our New Features Guide and Reference Manual.

    Monday Nov 19, 2012

    Join us at BIWA Summit 2013!

    Registration is now open for BIWA Summit 2013.  This event, focused on Business Intelligence, Data Warehousing and Analytics, is hosted by the BIWA SIG of the IOUG on January 9 and 10 at the Hotel Sofitel, near Oracle headquarters in Redwood City, California.

    Be sure to check out our featured speakers, including Oracle executives Balaji Yelamanchili, Vaishnavi Sashikanth, and Tom Kyte, and Ari Kaplan, sports analyst, as well as the many other internationally recognized speakers.  Hands-on labs will give you the opportunity to try out much of the Oracle software for yourself (including Oracle R Enterprise)--be sure to bring a laptop capable of running Windows Remote Desktop.  There will be over 35 sessions on a wide range of BIWA-related topics.  See the BIWA Summit 2013 web site for details and be sure to register soon, while early bird rates still apply.

