Thursday Sep 27, 2012

ROracle support for TimesTen In-Memory Database

Today's guest post comes from Jason Feldhaus, a Consulting Member of Technical Staff in the TimesTen Database organization at Oracle.  He shares with us a sample session using ROracle with the TimesTen In-Memory database. 

Beginning in version 1.1-4, ROracle includes support for the Oracle TimesTen In-Memory Database, version 11.2.2. TimesTen is a relational database that delivers very low response time and high throughput through its memory-centric architecture. TimesTen is designed for low latency, high-volume data, and event and transaction management. A TimesTen database resides entirely in memory, so no disk I/O is required for transactions and query operations. TimesTen is used in applications requiring very fast and predictable response time, such as real-time financial services trading applications and large web applications. TimesTen can be used as the database of record or as a relational cache database to Oracle Database.

ROracle provides an interface between R and the database, providing the rich functionality of the R statistical programming environment using the SQL query language. ROracle uses the OCI libraries to handle database connections, providing much better performance than standard ODBC.

The latest ROracle enhancements include:

  • Support for Oracle TimesTen In-Memory Database
  • Support for Date-Time using R's POSIXct/POSIXlt data types
  • RAW, BLOB and BFILE data type support
  • Option to specify number of rows per fetch operation
  • Option to prefetch LOB data
  • Break support using Ctrl-C
  • Statement caching support

TimesTen 11.2.2 contains enhanced support for analytics workloads and complex queries:

  • Analytic functions: AVG, SUM, COUNT, MAX, MIN, DENSE_RANK, RANK, ROW_NUMBER, FIRST_VALUE and LAST_VALUE
  • Analytic clauses: OVER PARTITION BY and OVER ORDER BY
  • Multidimensional grouping operators:
      • Grouping clauses: GROUP BY CUBE, GROUP BY ROLLUP, GROUP BY GROUPING SETS
      • Grouping functions: GROUPING, GROUPING_ID, GROUP_ID
  • WITH clause, which allows repeated references to a named subquery block
  • Aggregate expressions over DISTINCT expressions
  • General expressions that return a character string in the source or a pattern within the LIKE predicate
  • Ability to order nulls first or last in a sort result (NULLS FIRST or NULLS LAST in the ORDER BY clause)

Note: Some functionality is available only with Oracle Exalytics; refer to the TimesTen product licensing document for details.

Connecting to TimesTen is easy with ROracle.  Simply install and load the ROracle package and load 
the driver.
  
   > install.packages("ROracle")
   > library(ROracle)
    Loading required package: DBI
   > drv <- dbDriver("Oracle")

Once the ROracle package is installed, create a database connection object and connect to a 
TimesTen direct driver DSN as the OS user.

   > conn <- dbConnect(drv, username ="", password="",
                   dbname = "localhost/SampleDb_1122:timesten_direct")

You can verify which server type the connection uses, Oracle Database or TimesTen, by reporting it from the connection object:

   > print (paste ("Server type =", dbGetInfo (conn)$serverType))
   [1] "Server type = TimesTen IMDB"

To create tables in the database using R data frame objects, use the function dbWriteTable.  
In the following example we write the built-in iris data frame to TimesTen. The iris data set is 
a small example data set containing 150 rows and 5 columns. We include it here not to highlight 
performance, but so users can easily run this example in their R session.

   > dbWriteTable (conn, "IRIS", iris, overwrite=TRUE, ora.number=FALSE)
   [1] TRUE

Verify that the newly created IRIS table is available in the database. To list the available tables and 
table columns in the database, use dbListTables and dbListFields, respectively.

   > dbListTables (conn)
   [1] "IRIS"
   > dbListFields (conn, "IRIS")
   [1] "SEPAL.LENGTH" "SEPAL.WIDTH"  "PETAL.LENGTH" "PETAL.WIDTH"  "SPECIES"

To retrieve a summary of the data from the database we need to save the results to a local object. The 
following call saves the results of the query as a local R object, iris.summary. The ROracle function 
dbGetQuery is used to execute an arbitrary SQL statement against the database. When connected to 
TimesTen, the SQL statement is processed completely within main memory for the fastest response 
time.
   
   > iris.summary <- dbGetQuery(conn, 'SELECT
                                       SPECIES,
                                        AVG ("SEPAL.LENGTH") AS AVG_SLENGTH,
                                        AVG ("SEPAL.WIDTH") AS AVG_SWIDTH,
                                        AVG ("PETAL.LENGTH") AS AVG_PLENGTH,
                                        AVG ("PETAL.WIDTH") AS AVG_PWIDTH
                                       FROM IRIS
                                       GROUP BY ROLLUP (SPECIES)')

   > iris.summary
        SPECIES AVG_SLENGTH AVG_SWIDTH AVG_PLENGTH AVG_PWIDTH
   1     setosa    5.006000   3.428000       1.462   0.246000
   2 versicolor    5.936000   2.770000       4.260   1.326000
   3  virginica    6.588000   2.974000       5.552   2.026000
   4       <NA>    5.843333   3.057333       3.758   1.199333
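
The analytic functions listed earlier can be exercised through the same connection. As a sketch (assuming the IRIS table created above still exists in the database), the following query ranks flowers by petal length within each species using RANK with an OVER (PARTITION BY ...) clause, evaluated entirely in the database:

```r
# Rank flowers by petal length within each species; the ranking is
# computed in the database by the window function, not in R.
ranked <- dbGetQuery(conn, '
  SELECT SPECIES, "PETAL.LENGTH",
         RANK() OVER (PARTITION BY SPECIES
                      ORDER BY "PETAL.LENGTH" DESC) AS PLENGTH_RANK
  FROM IRIS')
head(ranked)
```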


Finally, disconnect from the TimesTen Database.

   > dbCommit (conn)
   [1] TRUE
   > dbDisconnect (conn)
   [1] TRUE

We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for our software: TimesTen In-Memory Database and ROracle. As always, we welcome comments and questions on the TimesTen and Oracle R technical forums.







Sunday Sep 23, 2012

Video: Analyzing Big Data using Oracle R Enterprise

Learn how Oracle R Enterprise is used to generate new insight and new value to business, answering not only what happened, but why it happened. View this YouTube Oracle Channel video overview describing how analyzing big data using Oracle R Enterprise is different from other analytics tools at Oracle.

Oracle R Enterprise (ORE),  a component of the Oracle Advanced Analytics Option, couples the wealth of analytics packages in R with the performance, scalability, and security of Oracle Database. ORE executes base R functions transparently on database data without having to pull data from Oracle Database. As an embedded component of the database, Oracle R Enterprise can run your R script and open source packages via embedded R where the database manages the data served to the R engine and user-controlled data parallelism. The result is faster and more secure access to data. ORE also works with the full suite of in-database analytics, providing integrated results to the analyst.



Monday Sep 17, 2012

Podcast interview with Michael Kane

In this podcast interview with Michael Kane, Data Scientist and Associate Researcher at Yale University, Michael discusses the R statistical programming language, computational challenges associated with big data, and two data analysis projects he conducted: one on the stock market "flash crash" of May 6, 2010, and one tracking transportation routes for the bird flu virus H5N1. Michael also worked with Oracle on Oracle R Enterprise, a component of the Advanced Analytics option to Oracle Database Enterprise Edition. In the closing segment of the interview, Michael comments on the relationship between the data analyst and the database administrator and how Oracle R Enterprise provides secure data management, transparent access to data, and improved performance to facilitate this relationship.

Listen now...

Friday Aug 17, 2012

Experience with Oracle R Enterprise in the Oracle micro-processor tools environment

ore.stepwise for high-performance stepwise regression

This guest post from Alexandre Ardelea describes the Oracle R Enterprise deployment in the Oracle micro-processor tools environment.

The deployment of ORE within the Oracle micro-processor tools environment introduced a technology that significantly expands our data analysis capabilities and opens the door to new applications.

Oracle R Enterprise (ORE) has recently been deployed in the Oracle micro-processor tools environment, replacing a popular commercial tool as a production engine for data analysis.  Fit/response models are important components of the simulation flows in the Oracle micro-processor tools environment; such models are used for a variety of purposes ranging from library generation to design yield prediction and optimization. Several tools were targeted for the migration to ORE; these tools are constructed around an iterative loop processing hundreds of data structures. At each iteration, a simulator engine generates data for multiple figures of merit (targets), and a fit engine is called to construct response models for each target. The fit models are assembled into libraries for subsequent simulation within other flows or used on the fly.

A common characteristic of these models is that they need to express strong nonlinear relations between the targets and large sets of explanatory variables. Multiple interactions and many non-linear dependencies are considered as candidate effects for the model construction; they often result from an automatic generation procedure attempting to cover a large variety of anticipated relations between the dependent and independent variables. For example, for a case with O[10^2] main explanatory variables, the total number of candidate regressors, nregs_tot, could quickly rise to O[10^3-10^4]. Linear regression models with such a high number of terms are not only too expensive to use, as they have, potentially, a large number of terms with negligible coefficients, but are also likely to lead to instability and inaccuracy problems. For example, overfitting is an important concern with models expressing fluctuations in the data rather than capturing the trend. Moreover, if strong quasi-linear interactions occur between large numbers of regressors, the variance of the calculated (model) coefficients can be massively inflated.

In order to reduce the size of the fit models while retaining significant dependencies, we use stepwise regression. Stepwise regression is an iterative fitting method which builds the model incrementally by adding and eliminating regressors (from a specified candidate set) using a pre-defined selection mechanism/test of statistical significance - the model converges when all regressors retained in the model satisfy the significance test criteria. Commercial and open source regression packages offer various flavors of stepwise regression which differ in multiple ways through the choice of the selection mechanism (F-test, Information Criteria: AIC, BIC, etc), choice of the 'direction' (forward, backward, or both), flexibility for specifying the model scope, the significance threshold(s), the handling of interaction terms, etc.
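
As a small illustration of the procedure in open source R (the data set and candidate regressors here are illustrative only, not those of the production flow), base R's step function grows and prunes a model using AIC:

```r
# Start from the intercept-only model; step() iteratively adds and
# drops candidate regressors, keeping the change that most improves
# the AIC, until no single move improves it further.
lower <- lm(mpg ~ 1, data = mtcars)
fit <- step(lower,
            scope = list(lower = ~ 1, upper = ~ wt + hp + disp + qsec),
            direction = "both", trace = 0)
formula(fit)
```

ore.stepwise, discussed below, uses an F-test with explicit add/drop significance levels instead of AIC, but the shape of the iteration is the same.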

ORE has developed a proprietary in-database algorithm for stepwise regression, ore.stepwise, which complements the functionality of R's step and, especially, offers a very significant performance improvement through faster, scalable algorithms and in-database execution. The basic syntax is the following:

ore.stepwise(start_model, DB_data, model_scope, direction, add.p, drop.p)

ore.stepwise allows the user to specify a model scope and stepwise direction, and uses the F-test for regressor selection with the add.p and drop.p significance levels for adding and removing regressors while the model is iteratively constructed.

A typical usage for ore.stepwise is illustrated below:

  list_modnames <- c("mdlA", "mdlB", ...)
  Effects_mdlA <- ...
  Effects_mdlB <- ...
  ...
  for (modname in list_modnames) {
    Upper_model <- as.formula(sprintf("%s ~ %s", modname,
                      get(sprintf("Effects_%s", modname))))
    Lower_model <- as.formula(sprintf("%s ~ 1", modname))
    Scope_model <- list(lower = Lower_model, upper = Upper_model)
    ...
    assign(sprintf("%s_model", modname),
       ore.stepwise(Lower_model, data = dataDB, scope = Scope_model,
          direction = "both", add.p = ..., drop.p = ...))
    ...
  }
The in-database ore frame dataDB contains observations for several target models mdlA,mdlB,..and 
a list with all target names (list_modnames) is assembled for iterative processing. For each target, a 
model scope is specified within lower and upper bounds. In the example above the lower bound is the intercept 
but the upper bound is customized so that each target model can be constructed from its own collection 
of regressors. The results shown in Figure 1 illustrate the performance difference between ORE's 
ore.stepwise and base R's step function for both a bi-linear and a fully quadratic model constructed from 
34 independent variables and 10K data observations.


Bilinear model

   method         R^2      Number of Regressors   mean(rel_error)   Elapsed Time (seconds)
   step           0.9658   86                     3.52e-02          2110.0
   ore.stepwise   0.9966   124                    3.50e-02          32.1

Performance difference: ore.stepwise is approx. 65X faster than step, at similar R^2 and relative error.

Quadratic model

   method         R^2      Number of Regressors   mean(rel_error)   Elapsed Time (seconds)
   step           0.9962   154                    1.05e-02          12600.0
   ore.stepwise   0.9963   210                    1.04e-02          69.5

Performance difference: ore.stepwise is approx. 180X faster than step, at similar R^2 and relative error.

Figure 1: Comparison of results for R's step function and ORE's ore.stepwise function for both bi-linear and quadratic models


ore.stepwise is faster than R's step by a factor of 65-180X, and the larger the data set and the number of regressors, the greater the performance advantage we observed for ore.stepwise.  The models produced by R's step and ore.stepwise have a different number of regressors because both the selection mechanisms and interaction terms are handled differently.  step favors the main terms: x1:x2 will be added only if x1 and x2 were previously added and, conversely, x1:x2 will be eliminated before x1 and x2 are eliminated, whereas ore.stepwise does not differentiate between main terms and interactions. With respect to collinearity, ore.stepwise detects strict linear dependencies and eliminates from the start the regressors involved in multi-collinear relations.

In summary, the ORE capabilities for stepwise regression far surpass similar functionality in tools we considered as alternatives to ORE. The deployment of ORE within the Oracle micro-processor tools environment introduced a technology which significantly expands the data analysis capabilities through the R ecosystem combined with in-database high performance algorithms and opens the door to new applications. This technology leverages the flexibility and extensibility of the R environment and allows massive and complex data analysis sustained by the scalability and performance of the Oracle database for Big Data.

Alexandre Ardelea is a principal hardware engineer at Oracle Corporation. Alex has a PhD and MS in Physics from Ecole polytechnique fédérale de Lausanne and post-doctoral research in non-linear physics, CFD and parallel methods. Alex's specialities include response surface modeling, optimization strategies for multi-parametric/objective/constraint problems, statistical process characterization, circuit analysis and RF algorithms. 

_______________________________________________________________________________________________________________

Oracle R Enterprise (ORE) implements a transparency layer on top of the R engine that allows R computations specified in the R environment to be pushed for execution in Oracle Database. A mapping is established between a special R object called an ore frame and a corresponding database table or view, allowing analytical functions such as ore.stepwise to be executed on these mapped objects in Oracle Database. The overloaded functions in the ORE packages generate SQL statements in the background, export the expensive computations to the database for execution, and return results to the R environment.  Here is a simple example using ore.stepwise with the longley data, which is shipped in the datasets package with R:


# load the Oracle R Enterprise library and connect to Oracle Database
R> library(ORE)
R> ore.connect(user = "rquser", host = "localhost", sid = "oracle", 
     password = "rquser")

R> # push the longley data to a database table:
R> LONGLEY <- ore.push(longley)

R> class(LONGLEY)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"

# Fit full model
R> oreFit <- ore.lm(Employed ~ ., data = LONGLEY)
R> summary(oreFit)

Call:
ore.lm(formula = Employed ~ ., data = LONGLEY)

Residuals:
     Min       1Q   Median       3Q      Max
-0.41011 -0.15980 -0.02816  0.15681  0.45539

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 **
GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141
GNP          -3.582e-02  3.349e-02  -1.070 0.312681
Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 **
Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
Population   -5.110e-02  2.261e-01  -0.226 0.826212
Year          1.829e+00  4.555e-01   4.016 0.003037 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared: 0.9955,     Adjusted R-squared: 0.9925
F-statistic: 330.3 on 6 and 9 DF,  p-value: 4.984e-10


# perform stepwise variable selection
R> oreStep <- ore.stepwise(Employed ~ .^2, data = LONGLEY, add.p =
0.1, drop.p = 0.1)

# View a summary of ore.stepwise object
R> summary(oreStep)

Call:
ore.stepwise(formula = Employed ~ .^2, data = LONGLEY, add.p = 0.1,
    drop.p = 0.1)

Residuals:
   Min     1Q Median     3Q    Max
 18.42  22.08  24.08  29.40  33.19

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)             -3.539e-01  8.455e-03 -41.852  < 2e-16 ***
Year                     3.589e-05  1.821e-05   1.971   0.0675 .
GNP.deflator:GNP        -2.978e-03  3.039e-04  -9.800 6.51e-08 ***
GNP.deflator:Unemployed  2.326e-04  5.720e-06  40.656  < 2e-16 ***
GNP.deflator:Population  2.303e-05  2.293e-06  10.044 4.72e-08 ***
GNP:Armed.Forces         6.875e-06  8.415e-07   8.170 6.64e-07 ***
GNP:Year                 2.007e-04  1.890e-05  10.618 2.26e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9829 on 15 degrees of freedom
Multiple R-squared: 0.9972,     Adjusted R-squared: 0.9961
F-statistic:  3242 on 6 and 15 DF,  p-value: < 2.2e-16

For more information on how Oracle R Enterprise leverages Oracle Database as a computational engine, see the Oracle R Training Series. Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition. We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop.  As always, we welcome comments and questions on the Oracle R Forum.

Thursday Aug 02, 2012

Data Parallelism Using Oracle R Enterprise

Modern computer processors are well optimized for many statistical calculations, but operations on large data may still require hours or days to return a result.  Oracle R Enterprise (ORE), a set of R packages designed to process large data computations in Oracle Database, can run many R operations in parallel, significantly reducing processing time. ORE supports parallelism through the transparency layer, where the database is used as a computational engine, and embedded R execution, where R scripts can be executed in a data parallel manner.

The backbone of parallel computing is breaking down a resource intensive computation into chunks that can be performed independently, while maintaining a framework that allows for the results of those independent computations to be combined.  Writing parallel code is typically trickier than writing serial code, but this is simplified using ORE, as there is no need for the user to create worker instances or combine results. Using the transparency layer, users simply execute their ORE code and the database implicitly manages the entire process, returning results for further processing.

With ORE, each R function invocation that operates on an ORE object, such as ore.frame, is translated to a SQL statement behind the scenes. This SQL, which may be stacked after several function invocations, undergoes optimization and parallelization when parsed and executed. This technique enables deferred evaluation, but that's a topic for another blog. Depending on the resource requirements of the statement, the database decides if it should leverage parallel execution. 

For embedded R execution, database degree of parallelism settings help determine the number of parallel R engines to start.  When data parallel functions execute in parallel, each unit of work is sent to a different R external process, or extproc, at the database server. The results are automatically collated and returned as R-proxy objects, e.g., ore.frame objects, in the R interface and SQL objects in the SQL interface, which can be processed further in R or by SQL functions. The SQL functions enable the operationalization or productization of R scripts as part of a database-based application, in what we refer to as "lights out" mode.

In the ORE Transparency Layer, where the database executes SQL generated from overloaded R functions, parallelism is automatic, assuming the database or table is configured for parallelism. Parallel computations in the transparency layer are ideal for bigger data where functionality exists in the database.
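
As an illustrative sketch (assuming an ORE connection and the ONTIME_S ore.frame used in the example below), an overloaded R function such as aggregate operates on the mapped table, and the generated SQL can be parallelized by the database:

```r
# This aggregate() call is overloaded by ORE and runs as a SQL
# GROUP BY inside the database, which can execute it in parallel;
# no rows are pulled into the R client until the result is fetched.
delay.by.day <- aggregate(ONTIME_S$ARRDELAY,
                          by = list(day = ONTIME_S$DAYOFWEEK),
                          FUN = mean)
head(delay.by.day)
```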

Using Embedded R Script Execution, parallelism is enabled for row, group and index operations if specified using a function parameter or parallel cursor hint:

  • ore.groupApply and rqGroupEval* split the data into grouped partitions and invoke the R function on each partition in a separate engine at the database server.
  • ore.rowApply and rqRowEval split the data into row chunks and invoke the R function on each chunk in a separate engine at the database server.
  • ore.indexApply runs an R function x times, with each iteration of the function invoked in a separate engine at the database server.

With embedded R execution, the expectation is that the database server machine has greater RAM and CPU capacity than the user's client machine, so executing R scripts at the server inherently allows larger data sets to be processed by an individual R engine.

In addition, users can include contributed R packages in their embedded R scripts. Consider an example using a sample of the airline on-time performance data from Research and Innovative Technology Administration (RITA), which coordinates the U.S. Department of Transportation (DOT) research programs. The data sample consists of 220K records of U.S. domestic commercial flights between 1987 and 2008. 

We use the R interface to embedded R to partition the airline data table (ONTIME_S) by the DAYOFWEEK variable, fit a linear model using the biglm package, and then combine the results. Note: To run this example, the biglm package must be installed on both the database server and client machine.

res <- ore.groupApply(ONTIME_S,
               INDEX = ONTIME_S$DAYOFWEEK,
               parallel = TRUE,
               function(dat) {
                library(biglm)
                library(ORE)
                ore.connect("rquser", "orcl", "localhost", "rquser")
                biglm(ARRDELAY ~ DEPDELAY + DISTANCE, dat)
               })
R> summary(res$Monday)
Large data regression model: biglm(ARRDELAY ~ DEPDELAY + DISTANCE, dat)
Sample size =  31649
               Coef    (95%     CI)     SE     p
(Intercept)  0.5177  0.2295  0.8058 0.1441 3e-04
DEPDELAY     0.9242  0.9178  0.9305 0.0032 0e+00
DISTANCE    -0.0014 -0.0017 -0.0011 0.0002 0e+00

The call to ore.groupApply uses Oracle Database to partition the ONTIME_S table by the categories in the DAYOFWEEK variable.  Each category is sent to an R engine at the database server machine to apply the R function in parallel.  The individual category results are combined in the returned result.  Using embedded R alleviates the typical memory problems associated with running R serially because we are fitting only a single partition, or day of the week, in memory of an R engine. Using a Linux server with 8 GB RAM and 4 CPUs, fitting the model in parallel by setting parallel = TRUE in the call to ore.groupApply, reduces the processing time from approximately 30 seconds to 10 seconds.

If the goal is to integrate the model results as an operationalized process, we can use rqGroupEval, the SQL interface equivalent to ore.groupApply.  We create a script to set up the structure of the input and grouping column and then run the script in SQL. The nature of pipelined table functions requires that we explicitly represent the type of the result, captured in the package, and create a function that includes the column used for partitioning explicitly.

# setup  

SQL> CREATE PACKAGE airlinePkg AS
  2  TYPE cur IS REF CURSOR RETURN ONTIME_S%ROWTYPE;
  3  END airlinePkg;
  4  /

Package created.

SQL> CREATE FUNCTION ontimeGroupEval(
  inp_cur  airlinePkg.cur,
  par_cur  SYS_REFCURSOR, 
  out_qry  VARCHAR2,
  grp_col  VARCHAR2,
  exp_txt  CLOB)
RETURN SYS.AnyDataSet
PIPELINED PARALLEL_ENABLE (PARTITION inp_cur BY HASH (DAYOFWEEK))
CLUSTER inp_cur BY (DAYOFWEEK)
USING rqGroupEvalImpl;
/

# model build

alter table ONTIME_S parallel;

SQL> begin
  sys.rqScriptCreate('GroupingExample',
 'function(dat) {
      library(biglm)
library(ORE)
      ore.connect("rquser", "orcl", "localhost", "rquser")
      result <- biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
      result}');
end;
/ 
create table ONTIME_LM as
select *
  from table(ontimeGroupEval(
         cursor(select /*+ parallel(ONTIME_S)*/ *
          from ONTIME_S),
         NULL, NULL, 'DAYOFWEEK', 'GroupingExample'));

begin
  sys.rqScriptDrop('GroupingExample');
end;
/


We use a parallel hint on the cursor that is the input to our rqGroupEval function to enable Oracle Database to use parallel R engines.  In this case, using the same Linux server, the processing time is reduced from approximately 25 seconds to 7 seconds as we used 7 parallel R engines (one for each day of the week) across a single server.  Of course, a real-world scenario may utilize hundreds of parallel engines across many servers, returning results on large amounts of data in short period of time.

Additional details on how parallel execution works in Oracle database can be found here. We encourage you download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop.  As always, we welcome comments and questions on the Oracle R Forum.

*To enable execution of an R script in the SQL interface, ORE provides SQL variants of the embedded R execution functions: rqEval, rqTableEval and rqRowEval correspond to ore.doEval, ore.tableApply and ore.rowApply. The ore.groupApply feature does not have a direct equivalent function in the SQL interface; we refer to rqGroupEval as a concept, but specific code is required to enable this feature, as highlighted in the second example.

Wednesday Jun 27, 2012

Solving Big Problems with Oracle R Enterprise, Part II

Part II – Solving Big Problems with Oracle R Enterprise

In the first post in this series (see https://blogs.oracle.com/R/entry/solving_big_problems_with_oracle), we showed how you can use R to perform historical rate of return calculations against investment data sourced from a spreadsheet.  We demonstrated the calculations against sample data for a small set of accounts.  While this worked fine, in the real-world the problem is much bigger because the amount of data is much bigger.  So much bigger that our approach in the previous post won’t scale to meet the real-world needs.


From our previous post, here are the challenges we need to conquer:

  • The actual data that needs to be used lives in a database, not in a spreadsheet
  • The actual data is much, much bigger: too big to fit into the normal R memory space and too big to want to move across the network
  • The overall process needs to run fast, much faster than a single processor allows
  • The actual data needs to be kept secure, another reason not to move it from the database and across the network
  • And the process of calculating the IRR needs to be integrated with other database ETL activities, so that IRRs can be calculated as part of the data warehouse refresh processes

In this post, we will show how we moved from a sample data environment to working with full-scale data.  This post is based on actual work we did for a financial services customer during a recent proof-of-concept.

...

  1. On average, we performed 8,200 executions of our R function per second (110s/911k accounts)
  2. On average, we did 41,000 single time period rate of return calculations per second (each of the 8,200 executions of our R function did rate of return calculations for 5 time periods)
  3. On average, we processed over 900,000 rows of database data in R per second (103m detail rows/110s)

...

R + Oracle R Enterprise: Best of R + Best of Oracle Database

This blog post series started by describing a real customer problem: how to perform a lot of calculations on a lot of data in a short period of time.  While standard R proved to be a very good fit for writing the necessary calculations, the challenge of working with a lot of data in a short period of time remained.


This blog post series showed how Oracle R Enterprise enables R to be used in conjunction with the Oracle Database to overcome the data volume and performance issues (as well as simplifying the operations and security issues).  It also showed that we could calculate 5 time periods of rate of returns for almost a million individual accounts in less than 2 minutes.



[Read More]

Thursday Jun 21, 2012

Solving Big Problems with Oracle R Enterprise, Part I

Abstract:

This blog post will show how we used Oracle R Enterprise to tackle a customer’s big calculation problem across a big data set.

Overview:

Databases are great for managing large amounts of data in a central place with rigorous enterprise-level controls.  R is great for doing advanced computations.  Sometimes you need to do advanced computations on large amounts of data, subject to rigorous enterprise-level concerns.  This blog post shows how Oracle R Enterprise, combining R with Oracle Database, enabled us to do some pretty sophisticated calculations across 1 million accounts (each with many detailed records) in minutes.

The problem:

A financial services customer of mine has a need to calculate the historical internal rate of return (IRR) for its customers’ portfolios.  This information is needed for customer statements and the online web application.  In the past, they had solved this with a home-grown application that pulled trade and account data out of their data warehouse and ran the calculations.  But this home-grown application was not able to do this fast enough, plus it was a challenge for them to write and maintain the code that did the IRR calculation.
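
For readers unfamiliar with the metric: the internal rate of return is the discount rate at which the net present value of a series of cash flows is zero. A minimal base R sketch (the function name and cash-flow layout are illustrative, not the customer's production code) finds it with a root solver:

```r
# IRR: the rate r at which NPV(r) = sum(cf / (1 + r)^t) equals zero.
# cf[1] is the cash flow at time 0, cf[2] at time 1, and so on.
irr <- function(cf, interval = c(-0.99, 10)) {
  npv <- function(r) sum(cf / (1 + r)^(seq_along(cf) - 1))
  uniroot(npv, interval)$root
}

# Invest 100, receive 110 one period later: a 10% rate of return.
irr(c(-100, 110))
```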

...


[Read More]

Monday Apr 23, 2012

Introduction to Oracle R Connector for Hadoop

MapReduce, the heart of Hadoop, is a programming framework that enables massive scalability across servers using data stored in the Hadoop Distributed File System (HDFS). The Oracle R Connector for Hadoop (ORCH) provides access to a Hadoop cluster from R, enabling manipulation of HDFS-resident data and the execution of MapReduce jobs.

Conceptually, MapReduce is similar to a combination of apply operations in R or GROUP BY in Oracle Database: transform elements of a list or table, compute an index, and apply a function to the specified groups. The value of MapReduce in ORCH is the extension beyond a single process to parallel processing using modern architectures: multiple cores, processes, machines, clusters, data appliances, or clouds.
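
The analogy can be made concrete in plain R with tapply, which splits values by a key and applies a reducing function to each group, much like a map phase emitting (key, value) pairs followed by a reduce phase per key:

```r
# "Map": pair each sepal length with its species as the key.
# "Reduce": collapse each key's values with mean().
avg.by.species <- tapply(iris$Sepal.Length, iris$Species, mean)
round(avg.by.species, 3)
#     setosa versicolor  virginica
#      5.006      5.936      6.588
```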

ORCH can be used on the Oracle Big Data Appliance or on non-Oracle Hadoop clusters. R users write mapper and reducer functions in R and execute MapReduce jobs from the R environment using a high level interface. As such, R users are not required to learn a new language, e.g., Java, or environment, e.g., cluster software and hardware, to work with Hadoop. Moreover, functionality from R open source packages can be used in the writing of mapper and reducer functions. ORCH also gives R users the ability to test their MapReduce programs locally, using the same function call, before deploying on the Hadoop cluster.


In the following example, we use the ONTIME_S data set typically installed in Oracle Database when Oracle R Enterprise is installed. ONTIME_S is a subset of the airline on-time performance data from the Research and Innovative Technology Administration (RITA), which coordinates the U.S. Department of Transportation (DOT) research programs.
We're providing a relatively large sample data set (220K rows), but this example could be performed in ORCH on the full data set, which contains 123 million rows and requires 12 GB of disk space. That data set is significantly larger than R can process on its own using a typical laptop with 8 GB RAM.

ONTIME_S is a database-resident table with metadata on the R side, represented by an ore.frame object.

> class(ONTIME_S)
[1] "ore.frame" attr(,"package")
[1] "OREbase"

ORCH includes functions for manipulating HDFS data. Users can move data between HDFS and the file system, R data frames, and Oracle Database tables and views. This next example shows one such function, hdfs.push, which accepts an ore.frame object as its first argument, followed by the name of the key column, and then the name of the file to be used within HDFS.

ontime.dfs_DB <- hdfs.push(ONTIME_S,
                           key = 'DEST',
                           dfs.name = 'ontime_DB')

The following R script example illustrates how users can attach to an existing HDFS file object, essentially getting a handle to the HDFS file.
Then, using the hadoop.run function in ORCH, we specify the HDFS file handle, followed by the mapper and reducer functions. The mapper function takes the key and value as arguments, which correspond to one row of data at a time from the HDFS block assigned to the mapper. The function keyval in the mapper returns data to Hadoop for further processing by the reducer.

The reducer function receives all the values associated with one key (resulting from the “shuffle and sort” of Hadoop processing). The result of the reducer is also returned to Hadoop using the keyval function. The results of the reducers are consolidated in an HDFS file, which can be obtained using the hdfs.get function.

The following example computes the average arrival delay for flights where the destination is San Francisco Airport (SFO).  It selects the SFO airport in the mapper and the mean of arrival delay in the reducer.

dfs <- hdfs.attach("ontime_DB")

res <- hadoop.run(
        dfs,
        mapper = function(key, value) {
          # keep only SFO rows with a non-missing arrival delay
          if (key == 'SFO' && !is.na(value$ARRDELAY)) {
            keyval(key, value)
          } else {
            NULL
          }
        },
        reducer = function(key, values) {
          sumAD <- 0
          count <- 0
          for (x in values) {
            sumAD <- sumAD + x$ARRDELAY
            count <- count + 1
          }
          keyval(key, sumAD / count)
        })

> hdfs.get(res)
   key     val1
1  SFO   17.44828

Oracle R Connector for Hadoop is part of the Oracle Big Data Connectors software suite and is supported for Oracle Big Data Appliance and Oracle R Enterprise customers.  We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop.  We welcome comments and questions on the Oracle R Forum.





Friday Apr 13, 2012

Oracle R Enterprise 1.1 Download Available

Oracle just released the latest update to Oracle R Enterprise, version 1.1. This release includes the Oracle R Distribution (based on open source R, version 2.13.2), an improved server installation, and much more.  The key new features include:

  • Extended Server Support: New support for Windows 32 and 64-bit server components, as well as continuing support for Linux 64-bit server components
  • Improved Installation: Linux 64-bit server installation now provides robust status updates and prerequisite checks
  • Performance Improvements: Improved performance for embedded R script execution calculations

In addition, the updated ROracle package, which is used with Oracle R Enterprise, now reads date data by conversion to character strings.

We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop.  As always, we welcome comments and questions on the Oracle R Forum.



Oracle R Distribution 2.13.2 Update Available

Oracle has released an update to the Oracle R Distribution, an Oracle-supported distribution of open source R. Oracle R Distribution 2.13.2 now contains the ability to dynamically link the following libraries on both Windows and Linux:

  • The Intel Math Kernel Library (MKL) on Intel chips
  • The AMD Core Math Library (ACML) on AMD chips

To take advantage of the performance enhancements provided by Intel MKL or AMD ACML in Oracle R Distribution, simply add the MKL or ACML shared library directory to the LD_LIBRARY_PATH system environment variable. This automatically enables MKL or ACML to make use of all available processors, vastly speeding up linear algebra computations and eliminating the need to recompile R.  Even on a single core, the optimized algorithms in the Intel MKL libraries are faster than using R's standard BLAS library.

Open-source R links to NetLib's reference BLAS libraries, which are not multi-threaded and use only one core. While R's internal BLAS is efficient for most computations, it's possible to recompile R to link to a different, multi-threaded BLAS library to improve performance on eligible calculations. Compiling and linking R yourself can be involved, but for many, the significantly improved calculation speed justifies the effort. Oracle R Distribution notably simplifies the process of using external math libraries by enabling R to auto-load MKL or ACML. For R commands that don't call into BLAS code, taking advantage of database parallelism using embedded R execution in Oracle R Enterprise is the route to improved performance.
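As a rough, machine-dependent illustration, you can time a dense matrix multiply in base R before and after adding MKL or ACML to LD_LIBRARY_PATH; the R code itself is identical either way:

```r
# Dense matrix multiplication exercises BLAS (dgemm). With MKL or ACML
# on LD_LIBRARY_PATH, this same call runs multi-threaded; with the
# reference BLAS it uses a single core. Absolute timings vary by machine.
set.seed(1)
n <- 1000
a <- matrix(rnorm(n * n), nrow = n)
b <- matrix(rnorm(n * n), nrow = n)
print(system.time(a %*% b))
```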

For more information about rebuilding R with different BLAS libraries, see the linear algebra section in the R Installation and Administration manual. As always, the Oracle R Distribution is available as a free download to anyone. Questions and comments are welcome on the Oracle R Forum.




Wednesday Apr 04, 2012

New Release of ROracle posted to CRAN

Oracle recently updated ROracle to version 1.1-2 on CRAN with enhancements and bug fixes. The major enhancements include the introduction of support for Oracle Wallet Manager and datetime and interval types. 

Oracle Wallet support in ROracle allows users to manage public key security from the client R session. Oracle Wallet allows passwords to be stored and read by Oracle Database, allowing safe storage of database login credentials. In addition, we added support for datetime and interval types when selecting data, which expands ROracle's support for date data. 
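With a wallet configured on the client, the connection call carries no embedded credentials. A minimal sketch, assuming a wallet has already been created and mapped to a TNS alias (the alias name mydb_wallet below is a placeholder):

```r
library(ROracle)

drv <- dbDriver("Oracle")
# Empty username/password: ROracle authenticates with the credentials
# stored in the Oracle Wallet associated with the connect identifier.
# "mydb_wallet" is a hypothetical TNS alias for illustration.
con <- dbConnect(drv, username = "", password = "", dbname = "mydb_wallet")
dbDisconnect(con)
```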

See the ROracle NEWS for the complete list of updates.

We encourage ROracle users to post questions and provide feedback on the Oracle R Forum.

In addition to being a high performance database interface to Oracle Database from R for general use, ROracle supports database access for Oracle R Enterprise.

Monday Apr 02, 2012

Introduction to ORE Embedded R Script Execution

This Oracle R Enterprise (ORE) tutorial, on embedded R execution, is the third in a series to help users get started using ORE. See these links for the first tutorial on the transparency layer and second tutorial on the statistics engine. Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

Embedded R Execution refers to the ability to execute an R script at the database server, which provides several benefits: spawning multiple R engines in parallel for data-parallel operations, transferring data more efficiently between Oracle Database and the R engine, leveraging a likely more powerful server with more CPUs and greater RAM, scheduling automated jobs, and taking advantage of open source R packages at the database server. Data aggregates are computed in parallel, significantly reducing computation time, without requiring sophisticated configuration steps.

ORE provides two interfaces for embedded R execution: one for R and one for SQL. The R interface enables interactive execution at the database server from the client R engine, e.g., your laptop. It also has transparency aspects for passing and returning R objects. In the R interface, ore.doEval schedules execution of the R code with the database-embedded R engine and returns the results back to the desktop for continued analysis. User-defined R functions can run in parallel on each row, on chunks of rows, or on each group of rows given a grouping column. The first two cases are covered by ore.rowApply, the third by the ore.groupApply function. ore.indexApply provides parallel simulation capability by invoking the script the number of times specified by the user.  The R interface returns results to the client as R objects that can be passed as arguments to R functions.
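As a minimal sketch of the R interface (assuming an established ORE connection; the function body is purely illustrative), ore.doEval ships a function to a database-side R engine and returns its result:

```r
# The function below executes in an R engine spawned at the database
# server; its return value comes back to the client as an R object.
res <- ore.doEval(function() {
  data.frame(id = 1:5, sq = (1:5)^2)
})
ore.pull(res)  # materialize the server-side result as a local data.frame
```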

The SQL interface enables interactive execution from any SQL interface, like SQL*Plus or SQL Developer, but it also enables R scripts to be included in production database-based systems. To enable execution of an R script in the SQL interface, ORE provides SQL variants of ore.doEval, ore.tableApply, ore.rowApply and ore.groupApply: rqEval, rqTableEval, rqRowEval and rqGroupEval, respectively (ore.indexApply has no SQL counterpart). The SQL interface allows for storing results directly in the database.

 R Interface Function (ore.*)   SQL Interface Function (rq*)   Purpose
 ore.doEval                     rqEval                         Invoke stand-alone R script
 ore.tableApply                 rqTableEval                    Invoke R script with full table input
 ore.rowApply                   rqRowEval                      Invoke R script one row at a time, or multiple rows in "chunks"
 ore.groupApply                 rqGroupEval                    Invoke R script on data indexed by grouping column
 ore.indexApply                 N/A                            Invoke R script N times

In addition, the SQL interface enables R results to be stored in a database table for subsequent use in another invocation (think data mining model building and scoring), and it enables returning structured R results as a table. Results can also be returned as XML; the XML interface supports returning both structured data, such as data frames and R objects, and graphs. This XML capability allows R graphs and structured results to be displayed in Oracle BI Publisher documents and OBIEE dashboards.

Embedded R Execution: R Interface

The following example uses the function ore.groupApply, one of several embedded R execution functions, to illustrate how R users can achieve data parallelism through the database. This example also illustrates that embedded R execution enables the use of open source packages. Here we see the use of the R package biglm.

We specify a column on which to partition the data. Each partition of the data is provided to the function through the first argument, in this case the function variable dat. There is no need to send data from the database to R - the R function is sent to the database, which processes the partitions in parallel. Output results may be stored directly in the database, or may be downloaded to R. Only when we want to see the results of these models do we need to retrieve them into R memory and apply, for example, the summary function.

modList <- ore.groupApply(
   X=ONTIME,
   INDEX=ONTIME$DEST,
   function(dat) {
     library(biglm)
     biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
   });

modList_local <- ore.pull(modList)

> summary(modList_local$BOS)
Large data regression model: biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
Sample size =  3928 
               Coef    (95%     CI)     SE      p
(Intercept)  0.0638 -0.7418  0.8693 0.4028 0.8742
DISTANCE    -0.0014 -0.0021 -0.0006 0.0004 0.0002
DEPDELAY     1.0552  1.0373  1.0731 0.0090 0.0000

Embedded R Execution: SQL Interface 

Whereas the previous example showed how to use embedded R execution from the R environment, we can also invoke R scripts from SQL. This next example illustrates returning a data frame from results computed in Oracle Database. We first create an R script in the database R script repository. The script is defined as a function that creates a vector of 10 elements, and returns a data frame with those elements in one column and those elements divided by 100 in a second column.

Once the script is created, we can invoke it through SQL. One of the SQL embedded R executions table functions available is rqEval. The first argument is NULL since we have no parameters to pass to the function. The second argument describes the structure of the result. Any valid SQL query that captures the name and type of resulting columns will suffice. The third argument is the name of the script to execute.

begin
  sys.rqScriptCreate('Example1',
'function() {
   ID <- 1:10
   res <- data.frame(ID = ID, RES = ID / 100)
   res}');
end;
/
select *
  from table(rqEval(NULL,
                    'select 1 id, 1 res from dual',
                    'Example1'));

The result is a data frame:



Embedded R scripts may generate any valid R object, including graphs. In addition, embedded R execution enables returning results from an R script as an XML string. Consider the following example, which creates a vector from the integers 1 to 10, plots 100 random normal points in a graph, and then returns the vector. After creating the script in the database R script repository, we invoke the script using rqEval, but instead of specifying the form of the result in a SQL query, we specify 'XML'.

begin
  sys.rqScriptCreate('Example6',
 'function(){
            res <- 1:10
            plot( 1:100, rnorm(100), pch = 21,
                  bg = "red", cex = 2 )
            res
            }');
end;
/
select value
from   table(rqEval( NULL,'XML','Example6'));

While the actual graph looks like the following, the output from this query will be an XML string.



In the execution results shown below, the VALUE column returned is a string that first contains the structured data in XML format (notice the numbers 1 through 10 set off by the <value> tags), followed by the image as a base 64-encoded PNG. This type of output can be consumed by Oracle Business Intelligence Publisher (BIP) to produce documents with R-generated graphs and structured content.  Oracle BIP templates can also be used to expose R-generated content in Oracle Business Intelligence Enterprise Edition (OBIEE) web browser-based dashboards.

You can see additional examples of embedded R execution in action in the Oracle R Enterprise Training, session 4, Embedded R Script Execution. These examples will run as written in R 2.13.2 after installing Oracle R Enterprise. We'll be posting more examples using embedded R script execution in the coming months. In the meantime, questions are always welcome on the Oracle R Forum.


Tuesday Mar 13, 2012

Oracle R Distribution and Open Source R

Oracle provides the Oracle R Distribution, an Oracle-supported distribution of open source R. Support for Oracle R Distribution is provided to customers of the Oracle Advanced Analytics option and the Oracle Big Data Appliance. The Oracle R Distribution facilitates enterprise acceptance of R, since the lack of a major corporate sponsor has made some companies concerned about fully adopting R. With the Oracle R Distribution, Oracle plans to contribute bug fixes and relevant enhancements to open source R. Oracle R Distribution also employs Intel's Math Kernel Library (MKL) to enable optimized, multi-threaded math routines, giving relevant R functions maximum performance on Intel hardware.

Oracle plans to actively maintain interoperability with current R versions by developing Oracle R Enterprise against the latest stable point release of open source R. The bug fixes and enhancements Oracle contributes will be made publicly available to the R community. Oracle has already released to the open source community an enhanced version of ROracle, which it now maintains. Support for Oracle R Distribution, Oracle R Enterprise, and Oracle R Connector for Hadoop is provided through standard Oracle Support channels for licensed customers.

Wednesday Feb 29, 2012

ROracle 1.1-1 Delivers Performance Improvements

The Oracle R Advanced Analytics team is happy to announce the release of the ROracle 1.1-1 package on the Comprehensive R Archive Network (CRAN).  We’ve rebuilt ROracle from the ground up, working hard to fix bugs and add optimizations. The new version introduces key improvements for interfacing with Oracle Database from open-source R.

Specific improvements in ROracle 1.1-1 include:

  • Lossless database reads: consistent accuracy for database read results
  • Performance improvements: faster database read and write operations
  • Simplified installation:  Oracle Instant Client now bundled on Windows, Instant or full Oracle Client on Linux

ROracle uses the Oracle Call Interface (OCI) libraries to handle database connections, providing a high-performance, native C-language interface to Oracle Database.  ROracle 1.1-1 is supported for R versions 2.12 through 2.13, and with Oracle Instant Client and Oracle Database Client versions 10.2 through 11.2, both 32- and 64-bit, running on Linux and Windows.
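A typical ROracle session follows the standard DBI pattern. A sketch; the credentials and connect identifier below are placeholders, not defaults:

```r
library(ROracle)

drv <- dbDriver("Oracle")
# Placeholder credentials and connect string; substitute your own.
con <- dbConnect(drv, username = "scott", password = "tiger", dbname = "orcl")

# Run a query and fetch the result as a data.frame
res <- dbGetQuery(con, "SELECT sysdate FROM dual")

dbDisconnect(con)
dbUnloadDriver(drv)
```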

We think ROracle 1.1-1 is a great step forward, allowing users to build high-performance, efficient R applications using Oracle Database. Whether you are upgrading your existing interface or using it for the first time, ROracle 1.1-1 is ready for download.  If you have any questions or comments, please post on the Oracle R discussion forum. We'd love to hear from you!

Thursday Feb 23, 2012

Introduction to the ORE Statistics Engine

This Oracle R Enterprise (ORE) statistics engine tutorial is the second in a series to help users get started using ORE. (See the first tutorial on the ORE transparency layer here). Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

The ORE statistics engine is a database library consisting of native database statistical functions and new functionality added specifically for ORE. ORE intercepts R functions and pushes their execution to Oracle Database to perform computationally expensive transformations and computations. This allows R users to use the R client directly against data stored in Oracle Database, greatly increasing R's scalability and performance.

A variety of useful and common routines are available in the ORE statistics engine:

Significance Tests

  • Chi-square, McNemar, Bowker
  • Simple and weighted kappas
  • Cochran-Mantel-Haenszel correlation
  • Cramer's V
  • Binomial, KS, t, F, Wilcoxon

Distribution Functions

Density, probability, and quantile functions for each of the following distributions:

  • Beta
  • Binomial
  • Cauchy
  • Chi-square
  • Exponential
  • F
  • Gamma
  • Geometric
  • Log-normal
  • Logistic
  • Negative binomial
  • Normal
  • Poisson
  • Sign rank
  • Student's t
  • Uniform
  • Weibull

Other Functions

  • Gamma function
  • Natural logarithm of the Gamma function
  • Digamma function
  • Trigamma function
  • Error function
  • Complementary error function

Base SAS Equivalents

  • Freq, Summary, Sort
  • Rank, Corr, Univariate

These R functions are overridden, such that when presented with ore.frame data, the function generates a SQL query that is transparently submitted to Oracle Database. This is the case for much of the base R and stats functionality, providing the benefit of employing the database as the computational engine and seamlessly breaking through R's memory barrier.

In this post, we introduce a simple yet typical data analysis using functions from the ORE transparency layer. We begin by configuring the local R environment with a few simple commands.

Load the ORE library and connect to Oracle Database:

> library(ORE)
> ore.connect("USER", "SID", "HOST", "PASSWORD")

Invoking ore.sync synchronizes the metadata in the database schema with the R environment:

> ore.sync("SCHEMA")
>
ore.ls()


Attaching the database schema provides access to views and tables so they can be manipulated from a local R session:

> ore.attach("SCHEMA")

We use the ONTIME_S data set typically installed in Oracle Database when ORE is installed. ONTIME_S is a subset of the airline on-time performance data from the Research and Innovative Technology Administration (RITA), which coordinates the U.S. Department of Transportation (DOT) research programs.  We're providing a relatively large sample data set (220K rows), but these examples could be performed in ORE on the full data set, which contains 123 million rows and requires 12 GB of disk space. That data set is significantly larger than R can process on its own.

ONTIME_S is a database-resident table with metadata on the R side, represented by an ore.frame object.

> class(ONTIME_S)
[1] "ore.frame" attr(,"package")
[1] "OREbase"

We focus on two columns of data: one numeric column, DEPDELAY (actual departure delay in minutes), and a categorical column, ORIGIN (airport of origin).

A typical place to begin is, of course, looking at the structure of selected variables. We call the function summary to obtain summary statistics for the variable measuring departure delay.

> with(ONTIME_S, summary(DEPDELAY))
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
  -75.000   -2.000    0.000    8.042    6.000 1438.000     3860

To investigate the skewness of the departure delay variable, we use the quantile function to calculate quantiles at 10, 20, 30,....90%.

> with(ONTIME_S, quantile(DEPDELAY, seq(0.1, 0.9, by = 0.1), na.rm = TRUE))
10% 20% 30% 40% 50% 60% 70% 80% 90% 
 -4  -2  -1   0   0   1   4  10  26 

This shows us that the 10% quantile (-4) is 4 units away from the median, while the 90% quantile (26) is 26 units from the median.  For a symmetric distribution, the two quantiles would be about the same distance from the median.  A measure of the data spread is the interquartile range, the difference between the 25% and 75% quantile. To allow computations on the data, we remove missing values by setting the na.rm parameter to TRUE, as we did above for the quantile function.
> with(ONTIME_S, IQR(DEPDELAY, na.rm = TRUE))
[1] 8

Other measures of data spread available are the variance and standard deviation.

> with(ONTIME_S, var(DEPDELAY, na.rm = TRUE))
[1] 771.7207
> with(ONTIME_S, sd(DEPDELAY, na.rm = TRUE))
[1] 27.77986

Using in-database aggregation summaries, we can investigate the relationship between departure delay and origin a bit further. We use aggregate to calculate the mean departure delay for each airport of origin.  Results for the first five airports are displayed using the function head.

> ONTIME.agg <- aggregate(ONTIME_S$DEPDELAY,
                          by = list(ONTIME_S$ORIGIN),
                          FUN = mean)
> head(ONTIME.agg, 5)
  Group.1    x
1     ABE   216
2     ABI    29
3     ABQ  1392
4     ABY   9
5     ACK   2


Now that we have gained a basic impression and some insights into the ONTIME_S data, we may choose to view the data graphically.  For example, we may want a visual impression of the distribution of departure delay.  We use the hist function, which displays a histogram skewed to the positive side, presumably because flights rarely leave early.

> with(ONTIME_S, hist(DEPDELAY, breaks = 100, col = rainbow(7), cex.lab = 0.8,
      main = "Distribution of Departure Delay", xlab = "Departure Delay (in minutes)"))

After analyzing the data through exploratory methods in ORE, we proceed to a possible next step: confirmatory statistics. Let's compute a Student's t-test using the origin and departure delay variables we examined earlier.  The goal is to decide whether the average departure delay of one airport differs from that of another.

> with(ONTIME_S, t.test(DEPDELAY[ORIGIN == "SEA"], DEPDELAY[ORIGIN == "SFO"], conf.level = 0.95))

    Welch Two Sample t-test

data:  DEPDELAY[ORIGIN == "SEA"] and DEPDELAY[ORIGIN == "SFO"]
t = -1.8406, df = 7571.893, p-value = 0.06572
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.21038010  0.06960489
sample estimates:
mean of x  mean of y
 8.475415  9.545802


The output tells us that we calculated the unpaired t-test and gives us the value of the t-distributed statistic.

We may also use a Kolmogorov–Smirnov test to determine whether the departure delay variable follows a normal distribution with the sample mean and standard deviation.

> with(ONTIME_S, ks.test(DEPDELAY, "pnorm", mean(DEPDELAY), sd(DEPDELAY)))

    One-sample Kolmogorov-Smirnov test

data:  DEPDELAY
D = 0.3497, p-value < 2.2e-16
alternative hypothesis: two-sided


At this point we could continue our data exploration by performing additional distribution tests, or proceed with the rich set of modeling functionality ORE offers.

ORE enables R users transparent access to data stored in Oracle Database while leveraging Oracle Database as a compute engine for scalability and high performance. We've only scratched the surface on ORE's statistical features - stay tuned for posts highlighting more advanced features of the statistical engine.

The Oracle R Enterprise User's Guide contains a number of examples demonstrating the functionality available in ORE. To view the documentation and training materials, visit our product page. Please feel free to visit our discussion forum and ask questions or provide comments about how we can help you and your team!







Monday Feb 20, 2012

Announcing Oracle R Distribution

Oracle has released the Oracle R Distribution, an Oracle-supported distribution of open source R. This is provided as a free download from Oracle. Support for Oracle R Distribution is provided to customers of the Oracle Advanced Analytics option and Oracle Big Data Appliance. The Oracle R Distribution facilitates enterprise acceptance of R, since the lack of a major corporate sponsor has made some companies concerned about fully adopting R. With the Oracle R Distribution, Oracle plans to contribute bug fixes and relevant enhancements to open source R.

Oracle has already taken responsibility for and contributed modifications to ROracle - an Oracle database interface (DBI) driver for R based on OCI. As ROracle is LGPL and used for Oracle Database connectivity from R, we are committed to ensuring this is the best package for Oracle connectivity.

Friday Feb 17, 2012

Introduction to the ORE Transparency Layer

This Oracle R Enterprise (ORE) transparency layer tutorial is the first in a series to help users get started using ORE. Oracle R Enterprise  is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

Oracle R Enterprise (ORE) implements a transparency layer on top of the R engine that allows R computations to be executed in Oracle Database from the R environment. A mapping is established between a special R object called an ORE frame and a corresponding database table or view, allowing a wide range of R functionality to be executed on these mapped objects.  The overloaded functions in the ORE packages generate SQL statements in the background, export the expensive computations to the database for execution, and return results to the R environment.  

Here's a quick overview and an example using the airline on-time performance data from Research and Innovative Technology Administration (RITA), which  coordinates the U.S. Department of Transportation (DOT) research programs. The data consists of 123 million records of U.S. domestic commercial flights between 1987 and 2008. 

Configuring the R Environment

We begin by configuring the local R environment by executing a few simple commands.  These commands may be saved in the .Rprofile file and executed during the initialization of each R session or typed directly into the R session.

Load the ORE library and connect to Oracle Database:

   >  library(ORE)
   >  ore.connect("USER", "SID", "HOST", "PASSWORD")

Syncing with the database synchronizes the metadata in the database schema with the R environment:

   >  ore.sync("SCHEMA")
   >  ore.ls()

Attaching the database provides access to views and tables so they can be manipulated from a local R session:

   >  ore.attach("SCHEMA")

Data Exploration


Travelers in the western United States may be interested in the mean arrival delay for the San Francisco, Los Angeles, and Seattle airports.
Executing this R code produces a boxplot representing the delay profile for these selected airports:

   > delay <- ontime$ARRDELAY[ontime$DEST %in% c("SEA", "SFO", "LAX")]
   > dest <- ontime$DEST[ontime$DEST %in% c("SEA", "SFO", "LAX")]
   > bd <- split(delay, dest)
   > boxplot(bd, notch = TRUE, col = "gold", cex = 0.5,
            outline = FALSE, horizontal = TRUE, yaxt = "n",
            main = "Arrival Delays by Destination",
            ylab = "Destination Airport", xlab = "Delay (minutes)")
   > labels <- levels(dest)
   > text(par("usr")[1] - 3, 1:length(labels), srt = 0, adj = 1, labels = labels, xpd = TRUE, cex = 0.75)


The delay profile shows that, on average, arrival delays are greater in San Francisco than in Seattle and Los Angeles.



With this information, we proceed with fitting a linear model where arrival delay is modeled as a linear function of departure delay and destination. Fitting models in ORE requires minimal modification to traditional R syntax. Simply replace R's linear modeling function lm with ore.lm, and the remaining syntax is transparent:

   > mod.ore <- ore.lm(ARRDELAY ~ DEST + DEPDELAY, data = ontime)
   > summary(mod.ore)

Many R functions compute and store more information than they report by default. Users can easily save these results to an object and extract the components they need. This allows the output of one function to be used as the input to another - a very powerful feature of the R programming environment.  We will take advantage of this functionality by generating predictions from the linear model built in the database:

   > newdat <- ontime[, c("ARRDELAY", "DEST", "DEPDELAY")]
   > newdat.sub <- subset(newdat, !(is.na(ARRDELAY) | is.na(DEST) | is.na(DEPDELAY)))
   > prd.ore <- predict(mod.ore, newdata = newdat.sub)
   > res.ore <- cbind(newdat.sub, PRED = prd.ore)


To complete our session, we may detach the schema if desired, and terminate the database connection:

    > ore.detach("SCHEMA")
    > ore.disconnect()

This a small subset of the functionality available in the ORE transparency layer.  The R functions overloaded in the transparency layer include:

  • data type assignment and verification
  • distribution testing
  • mathematical transformations
  • statistical summaries
  • arithmetic operators
  • comparison operators
  • logical operators
  • set operations
  • string manipulations
  • data manipulation
  • graphics
  • matrix operations
  • Gamma and Bessel functions
  • hypothesis testing
  • regression

 Summary

Using ORE objects, users transparently remain in the R language, requiring minimal modifications to their existing R scripts.  Although supported, it's not necessary to pull data from the database into R.  This eliminates the need to manipulate memory-bound R objects on the user's desktop system.  R users may access open-source R packages containing many standard and cutting-edge routines for data analysis.

To learn more about ORE offerings, including statistics and modeling features and advanced topics like the Oracle R Connector for Hadoop (ORCH), view the documentation and training materials on our product page.  To encourage collaboration, we provide a discussion forum for ORE topics.






Thursday Feb 16, 2012

R and Database Access

In an enterprise, databases are typically where data reside. So where data analytics are required, it's important for R and the database to work well together. The more seamlessly and naturally R users can access data, the easier it is to produce results. R users may leverage ODBC, JDBC, or similar types of connectivity to access database-resident data. However, this requires working with SQL to formulate queries that process or filter data in the database, or that pull data into the R environment for further processing. If R users, statisticians, or data analysts are unfamiliar with SQL or database tasks, or don't have database access, they often consult IT for data extracts.

Not having direct access to database-resident data introduces delays in obtaining data, and can make near real-time analytics impossible. In some instances, users request data sets much larger than required to avoid multiple requests to IT. Of course, this approach introduces costs of exporting, moving, and storing data, along with the associated backup, recovery, and security risks.

Oracle R Enterprise eliminates the need to know SQL to work with database-resident data. Through the Oracle R Enterprise transparency layer, R users can access data stored in tables and views as virtual data frames. Base R functions performed on these "ore.frames" are overloaded to generate SQL which is transparently sent to Oracle Database for execution - leveraging the database as a high-performance computational engine.
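A minimal sketch of that workflow (the connection parameters are placeholders for your environment; ONTIME_S is a sample table shipped with ORE):

    > library(ORE)
    > ore.connect(user="rquser", password="rquser", sid="orcl", host="localhost", all=TRUE)
    > ore.ls()                        # list tables and views visible as ore.frames
    > df <- ONTIME_S                  # an ore.frame: the data stays in the database
    > aggregate(df$ARRDELAY, by=list(df$DEST), FUN=mean)   # executes as SQL in-database

The aggregate call here never loads the table into R memory; the overloaded function generates the equivalent SQL and returns only the result.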

Check out Oracle R Enterprise for examples of the interface, documentation, and a link to download Oracle R Enterprise.

Oracle Announces Availability of Oracle Advanced Analytics for Big Data

Oracle Integrates R Statistical Programming Language into Oracle Database 11g

REDWOOD SHORES, Calif. - February 8, 2012

News Facts

Oracle today announced the availability of Oracle Advanced Analytics, a new option for Oracle Database 11g that bundles Oracle R Enterprise together with Oracle Data Mining.
Oracle R Enterprise delivers enterprise class performance for users of the R statistical programming language, increasing the scale of data that can be analyzed by orders of magnitude using Oracle Database 11g.
R has attracted over two million users since its introduction in 1995, and Oracle R Enterprise dramatically advances capability for R users. Their existing R development skills, tools, and scripts can now run transparently and scale against data stored in Oracle Database 11g.
Customer testing of Oracle R Enterprise for Big Data analytics on Oracle Exadata has shown up to a 100x performance increase in comparison to their current environment.
Oracle Data Mining, now part of Oracle Advanced Analytics, helps enable customers to easily build and deploy predictive analytic applications that help deliver new insights into business performance.
Oracle Advanced Analytics, in conjunction with Oracle Big Data Appliance, Oracle Exadata Database Machine and Oracle Exalytics In-Memory Machine, delivers the industry’s most integrated and comprehensive platform for Big Data analytics.

Comprehensive In-Database Platform for Advanced Analytics

Oracle Advanced Analytics brings analytic algorithms to data stored in Oracle Database 11g and Oracle Exadata as opposed to the traditional approach of extracting data to laptops or specialized servers.
With Oracle Advanced Analytics, customers have a comprehensive platform for real-time analytic applications that deliver insight into key business subjects such as churn prediction, product recommendations, and fraud alerting.
By providing direct and controlled access to data stored in Oracle Database 11g, customers can accelerate data analyst productivity while maintaining data security throughout the enterprise.
Powered by decades of Oracle Database innovation, Oracle R Enterprise helps enable analysts to run a variety of sophisticated numerical techniques on billion-row data sets in a matter of seconds, making iterative, speed-of-thought, high-quality numerical analysis on Big Data practical.
Oracle R Enterprise drastically reduces the time to deploy models by eliminating the need to translate the models to other languages before they can be deployed in production.
Oracle R Enterprise integrates the extensive set of Oracle Database data mining algorithms, analytics, and access to Oracle OLAP cubes into the R language for transparent use by R users.
Oracle Data Mining provides an extensive set of in-database data mining algorithms that solve a wide range of business problems. These predictive models can be deployed in Oracle Database 11g and use Oracle Exadata Smart Scan to rapidly score huge volumes of data.
The tight integration between R, Oracle Database 11g, and Hadoop enables R users to write one R script that can run in three different environments: a laptop running open source R, Hadoop running with Oracle Big Data Connectors, and Oracle Database 11g.
Oracle provides single vendor support for the entire Big Data platform spanning the hardware stack, operating system, open source R, Oracle R Enterprise and Oracle Database 11g.
To enable easy enterprise-wide Big Data analysis, results from Oracle Advanced Analytics can be viewed from Oracle Business Intelligence Foundation Suite and Oracle Exalytics In-Memory Machine.

Supporting Quotes

“Oracle is committed to meeting the challenges of Big Data analytics. By building upon the analytical depth of Oracle SQL, Oracle Data Mining and the R environment, Oracle is delivering a scalable and secure Big Data platform to help our customers solve the toughest analytics problems,” said Andrew Mendelsohn, senior vice president, Oracle Server Technologies.
“We work with leading edge customers who rely on us to deliver better BI from their Oracle Databases. The new Oracle R Enterprise functionality allows us to perform deep analytics on Big Data stored in Oracle Databases. By leveraging R and its library of open source contributed CRAN packages combined with the power and scalability of Oracle Database 11g, we can now do that,” said Mark Rittman, co-founder, Rittman Mead.

Supporting Resources

Connect with Oracle Database via Blog, Facebook and Twitter

About Oracle

Oracle engineers hardware and software to work together in the cloud and in your data center. For more information about Oracle (NASDAQ: ORCL), visit http://www.oracle.com.

Trademarks

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Contact Info

Eloy Ontiveros
Oracle
+1.650.607.6458
eloy.ontiveros@oracle.com

Joan Levy
Blanc & Otus for Oracle
+1.415.856.5110
jlevy@blancandotus.com

Tuesday Feb 14, 2012

Oracle R Enterprise: Getting Started

Following last week's press release, we wanted to post a series of demonstrations using Oracle R Enterprise.  Stay tuned to learn more about Oracle R Enterprise (ORE), a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

The R programming language and environment was originally designed to hold data in memory, providing fast and efficient calculations by avoiding access to data stored on disk. However, modern data set sizes have grown faster than available RAM. Consequently, R users often encounter errors similar to the following:

     "cannot allocate vector of length xxx"

This error occurs because R asks the operating system for a block of memory large enough to contain the contents of the data file, and the operating system responds that not enough memory is available. The maximum amount of memory that can be accessed by 32-bit R is about 3GB. On 64-bit versions of R, larger objects may be created - theoretically up to 8TB. However, the operating system imposes limits on the resources available to a single process, and operating on such large objects may be unacceptably slow.
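The error is easy to reproduce on any system; the exact size that fails depends on available RAM, so the figure below is illustrative and the request may succeed on a machine with sufficient memory:

    > x <- numeric(2^30)      # request a vector of 2^30 doubles, about 8GB
    Error: cannot allocate vector of size 8.0 Gb

Each double occupies 8 bytes, so 2^30 elements require roughly 8GB of contiguous allocation - more than many desktop systems can provide to a single R process.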

R Programmers with big data sets work around memory limitations in a variety of ways. Some opt to analyze data samples, and some divide the data into manageable batches, run jobs sequentially on a single processor, and then combine the results. This is both costly and time-consuming. For R users who like the flexibility of the R language and the support of the R community, the option to analyze and model large data sets in R is an exciting enhancement.

The Oracle R Enterprise framework allows R users to operate on tables and views directly from R in Oracle Database. Instead of loading large data files into memory, the R engine processing is moved to the database, requiring minimal resources on the user's system, regardless of the size of the data.

In this introductory series, we'll cover everything you need to know to get started with Oracle R Enterprise, including:

Part 1: The ORE transparency layer - a collection of R packages with functions to connect to Oracle Database and use R functionality in Oracle Database. This enables R users to work with data too large to fit into the memory of a user's desktop system, and leverage the scalable Oracle Database as a computational engine.

Part 2: The ORE statistics engine - a collection of statistical functions and procedures corresponding to commonly-used statistical libraries. The statistics engine packages also execute in Oracle Database.

Part 3: ORE SQL extensions supporting embedded R execution through the database on the database server. R users can execute R closures (functions) using an R or SQL API, while taking advantage of data parallelism. Using the SQL API for embedded R execution, sophisticated R graphics and results can be exposed in Oracle Business Intelligence EE dashboards and Oracle BI Publisher documents.
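As a preview, the R API for embedded execution centers on functions such as ore.doEval, which ships an R function to the database for execution in a database-spawned R engine (this sketch assumes an established ore.connect session):

    > res <- ore.doEval(function() summary(rnorm(100)))
    > print(res)

The function runs server-side, so the same pattern scales to computations over database-resident data without moving that data to the client.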

Part 4: Oracle R Connector for Hadoop (ORCH) - an R package that interfaces with the Hadoop Distributed File System (HDFS) and enables executing MapReduce jobs. ORCH enables R users to work directly with an Oracle Hadoop cluster, executing computations from the R environment, written in the R language and working on data resident in HDFS, Oracle Database, or local files.

But we won't stop there - expect to see posts discussing many new features in 2012, including expanded platform support and an extended set of analytics routines. Please come back frequently for updates that can help your organization mature in its implementation of in-database analytics.



About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.
