Wednesday Feb 06, 2013

Oracle R Enterprise 1.3 gives predictive analytics an in-database performance boost

Recently released Oracle R Enterprise 1.3 adds packages to R that enable even more in-database analytics. These packages provide horizontal, commonly used techniques that are blazingly fast in-database for large data. With Oracle R Enterprise 1.3, Oracle makes R even better and usable in enterprise settings. (You can download ORE 1.3 here and documentation here.)

When it comes to predictive analytics, scoring (predicting outcomes using a data mining model) is often a time-critical operation. Scoring can be done online (real-time), e.g., while a customer is browsing a webpage or using a mobile app, where on-the-spot recommendations can be made based on current actions. Scoring can also be done offline (batch), e.g., predicting which of your 100 million customers will respond to each of a dozen offers, where applications leverage the results to identify which customers should be targeted with a particular ad campaign or special offer.

In this blog post, we explore where using Oracle R Enterprise pays huge dividends. When working with small data, R alone can be sufficient, even when pulling data from a database. However, depending on the algorithm, the benefits of in-database computation appear at just a few thousand rows. With tens of thousands of rows, the time savings make an interactive session more interactive; with hundreds of thousands of rows, they become a real productivity gain; and on millions (or billions) of rows, they become a competitive advantage. In addition to performance benefits, ORE integrates R with the database, enabling you to leave data in place.

We’ll look at a few proof points across Oracle R Enterprise features, including:

  • OREdm – a new package that provides R access to several in-database Oracle Data Mining algorithms (Attribute Importance, Decision Tree, Generalized Linear Models, K-Means, Naïve Bayes, Support Vector Machine).
  • OREpredict – a new package that enables in-database scoring of models built with select standard R algorithms (glm, negbin, hclust, kmeans, lm, multinom, nnet, rpart).
  • Embedded R Execution – an ORE feature that allows running R under database control, boosting the performance of CRAN predictive analytics packages by providing faster data access than transfers between the database and a client, and by leveraging a more powerful database machine with greater RAM and CPU resources.

OREdm

Pulling data out of a database for any analytical tool impedes interactive data analysis due to access latency, whether the data is pulled directly from the database or staged in flat files via an IT request process. Such latencies can quickly become intolerable. On the R front, you'll also need to consider whether the data will fit in memory. If flat files are involved, consideration must be given to how the files will be stored, backed up, and secured.

Of course, model building and data scoring execution time is only part of the story. Consider scenario A, the “build combined script,” where data is extracted from the database, and an R model is built and persisted for later use. In the corresponding scenario B, the “score combined script,” data is pulled from the database, a previously built model is loaded, the data is scored, and the scores are written back to the database. This is a typical scenario for use in, e.g., enterprise dashboards or within an application supporting campaign management or next-best-offer generation. In-database execution provides significant performance benefits, even for the relatively small data sets included below, and readers should be able to reproduce such results at these scales. We’ve also included a Big Data example by replicating the 123.5 million row ONTIME data set to 1 billion rows. A minimal sketch of the two scripts appears next, followed by the examples.
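The sketch below uses hypothetical table names (BUILD_DATA and SCORE_DATA are assumed ore.frames) and an illustrative 3-predictor formula; ore.lm is the in-database variant compared against lm in the first example.

# Scenario A, "build combined script": open-source R path vs. in-database path
dat <- ore.pull(BUILD_DATA)                        # R: pull data to the client
fit <- lm(Y ~ X1 + X2 + X3, data = dat)            # R: build in client memory
save(fit, file = "fit.RData")                      # R: persist the model
fit.ore <- ore.lm(Y ~ X1 + X2 + X3, data = BUILD_DATA)  # ORE: data stays in-database

# Scenario B, "score combined script": open-source R path
new <- ore.pull(SCORE_DATA)                        # pull data to the client
load("fit.RData")                                  # load the previously built model
scores <- predict(fit, newdata = new)              # score in client memory
ore.create(cbind(new, SCORE = scores), table = "SCORES")  # write scores back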

Linear Models: We compared R’s lm and the in-database ore.lm algorithm on the combined scripts. On data sets ranging from 500K to 1.5M rows with 3 predictors, in-database analytics showed an average 2x-3x performance improvement for model building, and nearly 4x for scoring. Notice in Figure 1 that the slope of the trend is significantly less for ore.lm than for lm, indicating greater scalability for ore.lm.

Figure 1. Overall lm and ore.lm execution time for model building (A) and data scoring (B)

Figure 2 provides a more detailed view, breaking the build into data pull and model build time, and the score into data pull, data scoring, and score writing. For model building, notice that while data pull is a significant part of lm’s total build time, the actual build time still exceeds that of ore.lm. A similar observation holds for scoring.

Figure 2. Execution time components for lm and ore.lm (excluding model write and load)

Naïve Bayes from the e1071 package: On 20-predictor data sets ranging from 50K to 150K rows, in-database ore.odmNB improved data scoring performance by a factor of 118x to 418x, while the full scenario B execution time yielded a 13x performance improvement, as depicted in Figure 3B. Using a non-parallel execution of ore.odmNB, we see the cross-over point where ore.odmNB overtakes R; more importantly, the slope of the trend points to the greater scalability of ORE, as depicted in Figure 3A for the full scenario A execution time.


Figure 3. Overall naiveBayes and ore.odmNB execution time for model building (A) and data scoring (B)
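A minimal sketch of the ore.odmNB interface used in this comparison, assuming hypothetical ore.frames TRAIN and NEWDATA with a CLASS target column:

nb   <- ore.odmNB(CLASS ~ ., data = TRAIN)       # Naive Bayes build runs in-database
pred <- predict(nb, NEWDATA, type = "class")     # scoring also stays in-database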

K-Means clustering: Using 6 numeric columns from the ONTIME airline data set, ranging from 1 million to 1 billion rows, we compare in-database ore.odmKMeans with R kmeans run through embedded R execution with ore.tableApply. At 100 million rows, ore.odmKMeans demonstrates better performance than kmeans, and it remains scalable at 1 billion rows. The performance results depicted in Figure 4 use a log-log plot. The legend shows the function invoked and corresponding parameters, using a subset d of the ONTIME data set. While ore.odmKMeans scales linearly with the number of rows, R kmeans does not; further, R kmeans did not complete at 1 billion rows.

Figure 4: K-Means clustering model building on Big Data
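In outline, the two variants compared above look as follows; ONTIME6 stands for a hypothetical ore.frame holding the 6 numeric columns, and the cluster count is illustrative:

km.ore <- ore.odmKMeans(~ ., data = ONTIME6, num.centers = 10)     # in-database k-means
km.r   <- ore.tableApply(ONTIME6,                                  # open-source kmeans,
                         function(dat) kmeans(dat, centers = 10))  # run via embedded R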

OREpredict

With OREpredict, R users can also benefit from in-database scoring of R models. This becomes evident not only when considering the full “round trip” of pulling data from the database, scoring in R, and writing data back to the database, but also for the scoring itself.

Consider an lm model built using a data set with 4 predictors and 1 million to 5 million rows. Pulling data from the database, scoring, and writing the results back shows a pure R-based approach taking 4x-9x longer than in-database scoring using ore.predict with that same R model. Notice in Figure 5 that the slope of the trend is dramatically less for ore.predict than for predict, indicating greater scalability. Considering the scoring time only, ore.predict was 20x faster than R’s predict for 5M rows. In ORE 1.3, ore.predict is recommended, and will provide speedup over R, for numeric predictors.

Figure 5. Overall lm execution time using R predict vs. ORE ore.predict
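The pattern is simply to build with open-source R and score in-database. A minimal sketch, assuming a hypothetical local data frame local_df for the build and an ore.frame NEWDATA with matching column names for scoring:

fit    <- lm(y ~ x1 + x2 + x3 + x4, data = local_df)  # an ordinary R lm model
scores <- ore.predict(fit, newdata = NEWDATA)         # scoring executes in Oracle Database
head(scores)                                          # predictions returned as an ORE object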

For rpart, we see a similar result. On a 20-predictor, 1 million to 5 million row data set, ore.predict resulted in 6x-7x faster execution. In Figure 6, we again see that the slope of the trend is dramatically less for ore.predict than for predict, indicating greater scalability. Considering the scoring time only, ore.predict was 123x faster than R’s predict for 5 million rows.

Figure 6. Overall rpart execution time using R predict vs. ORE ore.predict

This scenario is summarized in Figure 7. In the client R engine, we have the ORE packages installed. There, we invoke the pure R-based script, which requires pulling data from the database. We also invoke the ORE-based script that keeps the data in the database.

Figure 7. Summary of OREpredict performance gains

To use a real-world data set, we again consider the ONTIME airline data set with 123.5 million rows. We build lm models with a varying number of coefficients, derived by converting categorical data to multiple columns. The variable p corresponds to the number of coefficients resulting from the transformed formula and depends on the number of distinct values in each column. For example, DAYOFWEEK has 7 values, so with DEPDELAY and the intercept, p=8. In Figure 8, you see that using an lm model with embedded R for a single row (e.g., one-off or real-time scoring) has much more overhead, as expected given that an R engine is being started, compared to ore.predict, which shows subsecond response time: 0.54 seconds at 40 coefficients and 1.1 seconds at 106 coefficients. Here are the formulas describing the columns included in the analysis:

  • ARRDELAY ~ DEPDELAY (p=2)
  • ARRDELAY ~ DEPDELAY + DAYOFWEEK (p=8)
  • ARRDELAY ~ DEPDELAY + DAYOFWEEK + MONTH (p=19)
  • ARRDELAY ~ DEPDELAY + DAYOFWEEK + MONTH + YEAR (p=40)
  • ARRDELAY ~ DEPDELAY + DAYOFWEEK + MONTH + YEAR (p=106)


Figure 8. Comparing performance of ore.predict with Embedded R Execution for lm

Compare this with scoring the entire ONTIME table of 123.5 million rows: ore.predict outperforms embedded R until about 80 coefficients, at which point embedded R becomes the preferred choice.

Data Movement between R and Database: Embedded R Execution

One advantage of R is its community and CRAN packages. The goal for Oracle R Enterprise with CRAN packages is to enable reuse of these packages while:

  • Leveraging the parallelization and efficient data processing capabilities of Oracle Database
  • Minimizing data transfer and communication overhead between R and the database
  • Leveraging R as a programming language for writing custom analytics

We’ll explore the performance of pulling data in three ways, each sketched in code after the list:

1) Using ore.pull at a separate client R engine to pull data from the database

2) Using Embedded R Execution and ore.pull within an embedded R script from a database-spawned R engine

3) Using Embedded R Execution functions for data-parallelism and task-parallelism to pass database data to the embedded R script via function parameter
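Illustrative sketches of the three approaches follow; credentials are placeholders, and ONTIME_S is the table used in later examples:

# 1) Client-side pull into a separate R session
dat <- ore.pull(ONTIME_S)

# 2) Server-side ore.pull inside an embedded R function
res <- ore.doEval(function() {
         library(ORE)
         ore.connect("rquser", "orcl", "localhost", "rquser")
         ore.sync(table = "ONTIME_S")
         nrow(ore.pull(ore.get("ONTIME_S")))
       })

# 3) Let the embedded R framework pass the data as the function argument
res <- ore.tableApply(ONTIME_S, function(dat) nrow(dat))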

With ORE Embedded R Execution (ERE), the database delivers data-parallelism and task-parallelism, and reduces data access latency due to optimized data transfers into R. Essentially, R runs under the control of the database. As illustrated in Figure 9, loading data at the database server is 12x faster than loading data from the database to a separate R client. Embedded R Execution also provides a 13x advantage when using ore.pull invoked at the database server within an R closure (function) compared with a separate R client. The data load from database to R client is depicted as 1x – the baseline for comparison with embedded R execution data loading.

Figure 9. Summary of Embedded R Execution data load performance gains

Data transfer rates are displayed in Figure 10, for a table with 11 columns and 5 million to 15 million rows of data. Loading data via ORE embedded R execution using server-side ore.pull or through the framework with, e.g., ore.tableApply (one of the embedded R execution functions) is dramatically faster than a non-local client load via ore.pull. The numbers shown reflect MB/sec data transfer rates, so a bigger bar is better!

Figure 10. Data load and write execution time with 11 columns

While this is impressive, let’s expand the data up to 1 billion rows. To create our 1 billion row data set (1.112 billion rows), we duplicated the 123.5 million row ONTIME data set 9 times, replacing rows with year 1987 with years 2010 through 2033, selecting 6 integer columns (YEAR, MONTH, DAYOFMONTH, ARRDELAY, DEPDELAY, DISTANCE), and creating a bitmap index on columns (YEAR, MONTH, DAYOFMONTH). The full data set weighs in at ~53 GB.

In Figure 11, we see linear scalability for loading data into the client R engine. Times range from 2.8 seconds for 1 million rows to 2700 seconds for 1 billion rows. While the typical user may not need to load 1 billion rows into R memory, this graph demonstrates the feasibility of doing so.

Figure 11. Client Load of Data via ore.pull for Big Data

In Figure 12, we look at how the degree of parallelism (DOP) affects data load times with ore.rowApply. This test addresses the question of how fast ORE can load 1 billion rows, e.g., when scoring data. The degree of parallelism corresponds to the number of R engines spawned for concurrent execution at the database server. The number of chunks the data is divided into is 1 for a single degree of parallelism, and 10 times the DOP for the remaining tests. For a DOP of 160, the data was divided into 1600 chunks, i.e., 160 R engines were spawned, each processing 10 chunks. The graph on the left shows that execution times for the 1 billion row data set improve through a DOP of 160. As expected, at some point the overhead of spawning additional R engines and partitioning the data outweighs the benefit. At its best time, processing 1 billion rows took 43 seconds.

Figure 12. Data load times via ore.rowApply for Big Data at varying degrees of parallelism

The second graph of Figure 12 contrasts execution time at the “sweet spot” identified in the first graph across a varying number of rows. Using this DOP of 160, with 1600 chunks of data, we see very little increase in execution time through 100 million rows (between 6.4 and 8.5 seconds); 1 billion rows took significantly longer, but still only 43 seconds.
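The shape of these tests is sketched below; ONTIME_1B is a hypothetical name for the 1.112 billion row table, the chunk size yields roughly 1600 chunks, and the per-chunk scoring work is elided:

res <- ore.rowApply(ONTIME_1B,
                    function(chunk) {
                      nrow(chunk)       # placeholder for per-chunk scoring work
                    },
                    rows = 695000,      # ~1600 chunks of the 1.112B-row table
                    parallel = TRUE)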

We can also consider data writes at this scale. Figure 13 depicts linear scalability from 1 million through 1 billion rows using the ore.create function to create database tables from R data. Actual times ranged from 2.6 seconds to roughly 2600 seconds.

Figure 13. Data Write using ore.create for Big Data

ORE supports data-parallelism to enable, e.g., building predictive models in parallel on partitions of the data. Consider a marketing firm that micro-segments customers and builds a predictive model on each segment. ORE embedded R execution automatically partitions the data, spawns R engines according to the specified degree of parallelism, and executes the specified user R function. To address how efficiently ore.groupApply can process data, Figure 14 shows the total execution time to process the 123.5M ONTIME rows with a varying number of columns. The figure shows that ore.groupApply scales linearly as the number of columns increases. Three partitioning columns were selected based on their number of distinct values: TAILNUM (12,861), DEST (352), and UNIQUECARRIER (29). For UNIQUECARRIER, the test with all columns (29 columns in total) could not complete, since partitioning on only 29 categories left each partition’s data too large for a single R engine.

Figure 14. Processing time for 123.5M rows via ore.groupApply
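In outline, such a test looks like the following, assuming the 123.5M-row table is attached as the ore.frame ONTIME and partitioning on DEST (352 groups):

res <- ore.groupApply(ONTIME,
                      INDEX = ONTIME$DEST,
                      function(dat) nrow(dat),   # placeholder for per-partition work
                      parallel = TRUE)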

ORE also supports row-parallelism, where the same embedded R function is invoked on chunks of rows. As with ore.groupApply, depending on the specified degree of parallelism, each chunk of rows is submitted to a dynamically spawned, database server-side R engine. Figure 15 depicts near-linear execution time to process the 123.5M ONTIME rows with a varying number of columns. The chunk size can be specified; however, testing 3 chunk sizes (10K, 50K, and 100K rows) showed no significant difference in overall execution time, hence a single line is graphed.

Figure 15. Processing time for 123.5M rows via ore.rowApply for chunk sizes 10k-100k

All tests were performed on an Exadata X3-8. Except as noted, the client R session and the database were on the same machine, so network latency for data reads and writes was minimal. Over a LAN or WAN, the benefits of in-database execution and ORE will be even more dramatic.

Friday Jan 18, 2013

Oracle R Distribution Performance Benchmark

Oracle R Distribution provides dramatic performance gains with MKL

Using the well-known R-benchmark-25.R test script, we compared the performance of Oracle R Distribution with and without the dynamically loaded, high-performance Math Kernel Library (MKL) from Intel. The benchmark results show that Oracle R Distribution is significantly faster with the dynamically loaded high-performance library. R users can immediately gain performance over open source R when analyzing data on 64-bit architectures, leveraging parallel processing within the specific R functions whose computations are performed by these high-performance libraries.

The community-developed test consists of matrix calculations and functions, program control, matrix multiplication, Cholesky factorization, singular value decomposition (SVD), principal component analysis (PCA), and linear discriminant analysis (LDA). Such computations form a core component of many real-world problems, often consuming the majority of compute time. Speeding up these computations means faster results for faster decision making.

While the benchmark results reported were conducted using Intel MKL, Oracle R Distribution also supports AMD Core Math Library (ACML) and Solaris Sun Performance Library.
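A quick, illustrative way to gauge the effect of the linked BLAS in your own session is to time a BLAS-bound operation such as a large matrix multiply:

n <- 2000
a <- matrix(rnorm(n * n), n, n)
system.time(b <- a %*% a)   # markedly faster when MKL or ACML is dynamically loaded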

Oracle R Distribution 2.15.1 x64 Benchmark Results (time in seconds)


Test                            ORD internal     ORD+MKL     ORD+MKL     ORD+MKL     ORD+MKL     Gain, MKL    Gain, MKL
                                BLAS/LAPACK,     1 thread    2 threads   4 threads   8 threads   4 threads    8 threads
                                1 thread
Matrix Calculations                 11.2            1.9         1.3         1.1         0.9          9.2x        11.4x
Matrix Functions                     7.2            1.1         0.6         0.4         0.4         17.0x        17.0x
Program Control                      1.4            1.3         1.5         1.4         0.8          0.0x         0.8x
Matrix Multiply                    517.6           21.2        10.9         5.8         3.1         88.2x       166.0x
Cholesky Factorization              25.0            3.9         2.1         1.3         0.8         18.2x        29.4x
Singular Value Decomposition       103.5           15.1         7.8         4.9         3.4         20.1x        40.9x
Principal Component Analysis       490.1           42.7        24.9        15.9        11.7        29.8x        40.9x
Linear Discriminant Analysis       419.8          120.9       110.8        94.1        88.0         3.5x         3.8x

This benchmark was executed on a 3-node cluster, with 24 cores at 3.07 GHz per CPU and 47 GB RAM, running Linux 5.5.

In the first graph, we see significant performance improvements. For example, SVD with ORD plus MKL executes 20 times faster using 4 threads, and 29 times faster using 8 threads. For Cholesky Factorization, ORD plus MKL is 18 and 30 times faster for 4 and 8 threads, respectively.

In the second graph, we focus on the three longest-running tests. Matrix multiplication is 88 and 166 times faster for 4 and 8 threads, respectively; PCA is 30 and over 40 times faster; and LDA is over 3 times faster.

This level of performance improvement can significantly reduce application execution time and make interactive, dynamically generated results readily achievable. Note that ORD plus MKL not only impacts performance on the client side, but also when used in combination with R scripts executed using Oracle R Enterprise Embedded R Execution. Such R scripts, executing at the database server machine, reap these performance gains as well. 

Wednesday Dec 26, 2012

Oracle R Enterprise 1.3 released

We're pleased to announce the latest release of Oracle R Enterprise, now available for download. Oracle R Enterprise 1.3 features new predictive analytics interfaces for in-database model building and scoring, support for in-database sampling and partitioning techniques, and transparent support for Oracle DATE and TIMESTAMP data types to facilitate data preparation for time series analysis and forecasting. Oracle R Enterprise 1.3 also extends transparent access to Oracle Database tables from R, with integer indexing and consistent ordering between data in R data frames and Oracle Database tables. The latest release also includes improved programming efficiencies and performance improvements.

The key additions in version 1.3 include:

Enhanced Model Scoring: The new package OREpredict enables in-database scoring of R-generated models. Supported models include linear regression (lm), generalized linear models (glm), hierarchical clustering (hclust), k-means clustering (kmeans), multinomial log-linear models (multinom), neural networks (nnet), and recursive partitioning and regression trees (rpart).

Oracle Data Mining Support: The new package OREdm provides an R interface for in-database Oracle Data Mining predictive analytics and data mining algorithms. Supported models include attribute importance, decision trees, generalized linear models, k-means clustering, Naive Bayes, and support vector machines.

Neural Network Modeling: A new feed-forward neural network algorithm with in-database execution.

Date and Time Support: Support for Oracle DATE and TIMESTAMP data types and analytic capabilities that allow date arithmetic, aggregations, percentile calculations and moving window calculations for in-database execution.

Sampling Methods: Enables in-database sampling and partitioning techniques for use against database-resident data. Techniques include simple random sampling, systematic sampling, stratified sampling, cluster sampling, quota sampling and accidental sampling.
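For example, simple random sampling can be expressed with the new integer row indexing; a minimal sketch, with ONTIME_S standing in for any ore.frame:

N   <- nrow(ONTIME_S)
idx <- sample(N, 1000)    # choose row positions in the client R session
smp <- ONTIME_S[idx, ]    # the subset itself is computed in the database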

Object Persistence: New capabilities for saving and restoring R objects in an Oracle Database “datastore”, which supports not only in-database persistence of R objects, but also the ability to easily pass any type of R object to embedded R execution functions.
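A minimal sketch of the datastore calls (ore.save and ore.load), using a toy lm model:

fit <- lm(mpg ~ wt, data = mtcars)
ore.save(fit, name = "myModels")   # persist the R object in a database datastore
rm(fit)
ore.load(name = "myModels")        # restores 'fit' into the current R session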

Database Auto-Connection:  New functionality for automatically establishing database connectivity using contextual credentials inside embedded R scripts, allowing convenient and secure connections to Oracle Database.

When used in conjunction with Oracle Exadata Database Machine and Oracle Big Data Appliance, Oracle R Enterprise and Oracle R Connector for Hadoop provide a full set of engineered systems to access and analyze big data. With Oracle R Enterprise, IT organizations can rapidly deploy advanced analytical solutions, while providing the knowledge to act on critical decisions.

Stay tuned for blogs about the new ORE 1.3 features in upcoming posts. You can find more details about the features in Oracle R Enterprise 1.3 in our New Features Guide and Reference Manual.

Monday Nov 19, 2012

Join us at BIWA Summit 2013!

Registration is now open for BIWA Summit 2013.  This event, focused on Business Intelligence, Data Warehousing and Analytics, is hosted by the BIWA SIG of the IOUG on January 9 and 10 at the Hotel Sofitel, near Oracle headquarters in Redwood City, California.

Be sure to check out our featured speakers, including Oracle executives Balaji Yelamanchili, Vaishnavi Sashikanth, and Tom Kyte, and Ari Kaplan, sports analyst, as well as the many other internationally recognized speakers.  Hands-on labs will give you the opportunity to try out much of the Oracle software for yourself (including Oracle R Enterprise)--be sure to bring a laptop capable of running Windows Remote Desktop.  There will be over 35 sessions on a wide range of BIWA-related topics.  See the BIWA Summit 2013 web site for details and be sure to register soon, while early bird rates still apply.

Tuesday Oct 02, 2012

Oracle R Enterprise Tutorial Series on Oracle Learning Library

Oracle Server Technologies Curriculum has just released the Oracle R Enterprise Tutorial Series, which is publicly available on the Oracle Learning Library (OLL). This 8-part interactive lecture series, with review sessions, covers Oracle R Enterprise 1.1 and an introduction to Oracle R Connector for Hadoop 1.1:
  • Introducing Oracle R Enterprise
  • Getting Started with ORE
  • R Language Basics
  • Producing Graphs in R
  • The ORE Transparency Layer
  • ORE Embedded R Scripts: R Interface
  • ORE Embedded R Scripts: SQL Interface
  • Using the Oracle R Connector for Hadoop

We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop.  As always, we welcome comments and questions on the Oracle R Forum.

Thursday Sep 27, 2012

ROracle support for TimesTen In-Memory Database

Today's guest post comes from Jason Feldhaus, a Consulting Member of Technical Staff in the TimesTen Database organization at Oracle.  He shares with us a sample session using ROracle with the TimesTen In-Memory database. 

Beginning in version 1.1-4, ROracle includes support for the Oracle TimesTen In-Memory Database, version 11.2.2. TimesTen is a relational database providing very fast response times and high throughput through its memory-centric architecture. TimesTen is designed for low-latency, high-volume data, event, and transaction management. A TimesTen database resides entirely in memory, so no disk I/O is required for transactions and query operations. TimesTen is used in applications requiring very fast and predictable response times, such as real-time financial services trading applications and large web applications. TimesTen can be used as the database of record or as a relational cache to Oracle Database.

ROracle provides an interface between R and the database, providing the rich functionality of the R statistical programming environment using the SQL query language. ROracle uses the OCI libraries to handle database connections, providing much better performance than standard ODBC.

The latest ROracle enhancements include:

  • Support for Oracle TimesTen In-Memory Database
  • Support for Date-Time using R's POSIXct/POSIXlt data types
  • RAW, BLOB and BFILE data type support
  • Option to specify number of rows per fetch operation
  • Option to prefetch LOB data
  • Break support using Ctrl-C
  • Statement caching support
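As one illustration of the rows-per-fetch option above, result sets can be retrieved in batches using the standard DBI pattern; a sketch assuming an open connection conn like the one created below:

   > rs <- dbSendQuery(conn, "SELECT * FROM IRIS")
   > while (!dbHasCompleted(rs)) {
   +   chunk <- fetch(rs, n = 50)   # retrieve 50 rows per fetch operation
   +   # ... process chunk ...
   + }
   > dbClearResult(rs)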

TimesTen 11.2.2 contains enhanced support for analytics workloads and complex queries:

  • Analytic functions: AVG, SUM, COUNT, MAX, MIN, DENSE_RANK, RANK, ROW_NUMBER, FIRST_VALUE and LAST_VALUE
  • Analytic clauses: OVER PARTITION BY and OVER ORDER BY
  • Multidimensional grouping operators:
      • Grouping clauses: GROUP BY CUBE, GROUP BY ROLLUP, GROUP BY GROUPING SETS
      • Grouping functions: GROUP, GROUPING_ID, GROUP_ID
  • WITH clause, which allows repeated references to a named subquery block
  • Aggregate expressions over DISTINCT expressions
  • General expressions that return a character string in the source or a pattern within the LIKE predicate
  • Ability to order nulls first or last in a sort result (NULLS FIRST or NULLS LAST in the ORDER BY clause)

Note: Some functionality is only available with Oracle Exalytics, refer to the TimesTen product licensing document for details.

Connecting to TimesTen is easy with ROracle. Simply install and load the ROracle package, then load the driver.
  
   > install.packages("ROracle")
   > library(ROracle)
    Loading required package: DBI
   > drv <- dbDriver("Oracle")

Once the ROracle package is installed, create a database connection object and connect to a 
TimesTen direct driver DSN as the OS user.

   > conn <- dbConnect(drv, username ="", password="",
                   dbname = "localhost/SampleDb_1122:timesten_direct")

You can check the server type to confirm whether you are connected to Oracle Database or to TimesTen:

   > print (paste ("Server type =", dbGetInfo (conn)$serverType))
   [1] "Server type = TimesTen IMDB"

To create tables in the database using R data frame objects, use the function dbWriteTable.  
In the following example we write the built-in iris data frame to TimesTen. The iris data set is 
a small example data set containing 150 rows and 5 columns. We include it here not to highlight 
performance, but so users can easily run this example in their R session.

   > dbWriteTable (conn, "IRIS", iris, overwrite=TRUE, ora.number=FALSE)
   [1] TRUE

Verify that the newly created IRIS table is available in the database. To list the available tables and 
table columns in the database, use dbListTables and dbListFields, respectively.

   > dbListTables (conn)
   [1] "IRIS"
   > dbListFields (conn, "IRIS")
   [1] "SEPAL.LENGTH" "SEPAL.WIDTH"  "PETAL.LENGTH" "PETAL.WIDTH"  "SPECIES"

To retrieve a summary of the data from the database we need to save the results to a local object. The 
following call saves the results of the query as a local R object, iris.summary. The ROracle function 
dbGetQuery is used to execute an arbitrary SQL statement against the database. When connected to 
TimesTen, the SQL statement is processed completely within main memory for the fastest response 
time.
   
   > iris.summary <- dbGetQuery(conn, 'SELECT
                                       SPECIES,
                                        AVG ("SEPAL.LENGTH") AS AVG_SLENGTH,
                                        AVG ("SEPAL.WIDTH") AS AVG_SWIDTH,
                                        AVG ("PETAL.LENGTH") AS AVG_PLENGTH,
                                        AVG ("PETAL.WIDTH") AS AVG_PWIDTH
                                       FROM IRIS
                                       GROUP BY ROLLUP (SPECIES)')

   > iris.summary
        SPECIES AVG_SLENGTH AVG_SWIDTH AVG_PLENGTH AVG_PWIDTH
   1     setosa    5.006000   3.428000       1.462   0.246000
   2 versicolor    5.936000   2.770000       4.260   1.326000
   3  virginica    6.588000   2.974000       5.552   2.026000
   4       <NA>    5.843333   3.057333       3.758   1.199333


Finally, commit the work and disconnect from the TimesTen database.

   > dbCommit (conn)
   [1] TRUE
   > dbDisconnect (conn)
   [1] TRUE

We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for our software: TimesTen In-Memory Database, ROracle. As always, we welcome comments and questions on the TimesTen and Oracle R technical forums.

Sunday Sep 23, 2012

Video: Analyzing Big Data using Oracle R Enterprise

Learn how Oracle R Enterprise is used to generate new insight and new value for the business, answering not only what happened, but why it happened. View this YouTube Oracle Channel video for an overview of how analyzing big data using Oracle R Enterprise differs from other analytics tools at Oracle.

Oracle R Enterprise (ORE), a component of the Oracle Advanced Analytics Option, couples the wealth of analytics packages in R with the performance, scalability, and security of Oracle Database. ORE executes base R functions transparently on database data without having to pull the data from Oracle Database. As an embedded component of the database, Oracle R Enterprise can run your R scripts and open source packages via embedded R execution, where the database manages the data served to the R engine and provides user-controlled data parallelism. The result is faster and more secure access to data. ORE also works with the full suite of in-database analytics, providing integrated results to the analyst.

Monday Sep 17, 2012

Podcast interview with Michael Kane

In this podcast interview with Michael Kane, Data Scientist and Associate Researcher at Yale University, Michael discusses the R statistical programming language, the computational challenges associated with big data, and two data analysis projects he conducted: one on the stock market "flash crash" of May 6, 2010, and another tracking bird flu (H5N1) along transportation routes. Michael also worked with Oracle on Oracle R Enterprise, a component of the Advanced Analytics Option to Oracle Database Enterprise Edition. In the closing segment of the interview, Michael comments on the relationship between the data analyst and the database administrator, and how Oracle R Enterprise provides secure data management, transparent access to data, and improved performance to facilitate this relationship.

Listen now...

Friday Aug 17, 2012

Experience with Oracle R Enterprise in the Oracle micro-processor tools environment

ore.stepwise for high-performance stepwise regression

This guest post from Alexandre Ardelea describes the Oracle R Enterprise deployment in the Oracle micro-processor tools environment.

The deployment of ORE within the Oracle micro-processor tools environment introduced a technology that significantly expands our data analysis capabilities and opens the door to new applications.

Oracle R Enterprise (ORE) has recently been deployed in the Oracle micro-processor tools environment, replacing a popular commercial tool as a production engine for data analysis. Fit/response models are important components of the simulation flows in this environment; such models are used for purposes ranging from library generation to design yield prediction and optimization. Several tools were targeted for the migration to ORE; these tools are constructed around an iterative loop processing hundreds of data structures. At each iteration, a simulator engine generates data for multiple figures of merit (targets), and a fit engine is called to construct response models for each target. The fit models are assembled into libraries for subsequent simulation within other flows or used on the fly.

A common characteristic of these models is that they need to express strong nonlinear relations between the targets and large sets of explanatory variables. Multiple interactions and many non-linear dependencies are considered as candidate effects for the model construction; they often result from an automatic generation procedure attempting to cover a large variety of anticipated relations between the dependent and independent variables. For example, for a case with O[10^2] main explanatory variables, the total number of candidate regressors, nregs_tot, could quickly rise to O[10^3-10^4]. Linear regression models with such a high number of terms are not only too expensive to use, as they have, potentially, a large number of terms with negligible coefficients, but are also likely to lead to instability and inaccuracy problems. For example, overfitting is an important concern with models expressing fluctuations in the data rather than capturing the trend. Moreover, if strong quasi-linear interactions occur between large numbers of regressors, the variance of the calculated (model) coefficients can be massively inflated.

In order to reduce the size of the fit models while retaining significant dependencies, we use stepwise regression. Stepwise regression is an iterative fitting method which builds the model incrementally by adding and eliminating regressors (from a specified candidate set) using a pre-defined selection mechanism/test of statistical significance - the model converges when all regressors retained in the model satisfy the significance test criteria. Commercial and open source regression packages offer various flavors of stepwise regression which differ in multiple ways through the choice of the selection mechanism (F-test, Information Criteria: AIC, BIC, etc), choice of the 'direction' (forward, backward, or both), flexibility for specifying the model scope, the significance threshold(s), the handling of interaction terms, etc.

ORE includes a proprietary in-database algorithm for stepwise regression, ore.stepwise, which complements the functionality of R's step and, especially, offers a very significant performance improvement through faster, scalable algorithms and in-database execution. The basic syntax is the following:

ore.stepwise(start_model, DB_data, model_scope, direction, add.p, drop.p)

ore.stepwise allows the user to specify a model scope and stepwise direction, and uses the F-test for regressor selection with the add.p and drop.p significance levels for adding and removing regressors while the model is iteratively constructed.

A typical usage for ore.stepwise is illustrated below:

  list_modnames <- c("mdlA", "mdlB", ...)
  Effects_mdlA <- ...
  Effects_mdlB <- ...
  ...
  for (modname in list_modnames) {
    Upper_model <- as.formula(sprintf("%s ~ %s", modname,
                                 get(sprintf("Effects_%s", modname))))
    Lower_model <- as.formula(sprintf("%s ~ 1", modname))
    Scope_model <- list(lower = Lower_model, upper = Upper_model)
    ...
    assign(sprintf("%s_model", modname),
           ore.stepwise(Lower_model, data = dataDB, scope = Scope_model,
                        direction = "both", add.p = ..., drop.p = ...))
    ...
  }
 
The in-database ore.frame dataDB contains observations for several target models mdlA, mdlB, ..., and a list of all target names (list_modnames) is assembled for iterative processing. For each target, a model scope is specified with lower and upper bounds. In the example above, the lower bound is the intercept, while the upper bound is customized so that each target model can be constructed from its own collection of regressors. The results shown in Figure 1 illustrate the performance difference between ore.stepwise in ORE and base R’s step function, for both a bi-linear and a fully quadratic model constructed from 34 independent variables and 10K data observations.


Bilinear model

method          R^2      Number of Regressors   mean(rel_error)   Elapsed Time (seconds)
step            0.9658            86                3.52e-02              2110.0
ore.stepwise    0.9966           124                3.50e-02                32.1

Performance difference: ore.stepwise is approx. 65X faster than step, at similar R^2 and relative error.

Quadratic model

method          R^2      Number of Regressors   mean(rel_error)   Elapsed Time (seconds)
step            0.9962           154                1.05e-02             12600.0
ore.stepwise    0.9963           210                1.04e-02                69.5

Performance difference: ore.stepwise is approx. 180X faster than step, at similar R^2 and relative error.

Figure 1: Comparison of results for R's step function and ORE's ore.stepwise function for both bi-linear and quadratic models


ore.stepwise is faster than R's step by a factor of 65X-180X; the larger the data set and the number of regressors, the greater the performance advantage we observed for ore.stepwise over R's step. The models produced by R's step and ore.stepwise have a different number of regressors because the selection mechanisms and interaction terms are handled differently: step favors the main terms (x1:x2 will be added only if x1 and x2 were previously added and, conversely, x1:x2 will be eliminated before x1 and x2 are), whereas ore.stepwise does not differentiate between main terms and interactions. With respect to collinearity, ore.stepwise detects strict linear dependencies and eliminates from the start the regressors involved in multi-collinear relations.

In summary, the ORE capabilities for stepwise regression far surpass similar functionality in tools we considered as alternatives to ORE. The deployment of ORE within the Oracle micro-processor tools environment introduced a technology which significantly expands the data analysis capabilities through the R ecosystem combined with in-database high performance algorithms and opens the door to new applications. This technology leverages the flexibility and extensibility of the R environment and allows massive and complex data analysis sustained by the scalability and performance of the Oracle database for Big Data.

Alexandre Ardelea is a principal hardware engineer at Oracle Corporation. Alex has a PhD and MS in Physics from Ecole polytechnique fédérale de Lausanne and post-doctoral research in non-linear physics, CFD and parallel methods. Alex's specialities include response surface modeling, optimization strategies for multi-parametric/objective/constraint problems, statistical process characterization, circuit analysis and RF algorithms. 

_______________________________________________________________________________________________________________

Oracle R Enterprise (ORE) implements a transparency layer on top of the R engine that allows R computations specified in the R environment to be pushed for execution in Oracle Database. A mapping is established between a special R object called an ore.frame and a corresponding database table or view, allowing analytical functions such as ore.stepwise to be executed on these mapped objects in Oracle Database. The overloaded functions in the ORE packages generate SQL statements in the background, push the expensive computations to the database for execution, and return results to the R environment. Here is a simple example using ore.stepwise with the longley data, which ships in the datasets package with R:


# load the Oracle R Enterprise library and connect to Oracle Database
R> library(ORE)
R> ore.connect(user = "rquser", host = "localhost", sid = "oracle", 
     password = "rquser")

R> # push the longley data to a database table:
R> LONGLEY <- ore.push(longley)

R> class(LONGLEY)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"

# Fit full model
R> oreFit <- ore.lm(Employed ~ ., data = LONGLEY)
R> summary(oreFit)

Call:
ore.lm(formula = Employed ~ ., data = LONGLEY)

Residuals:
     Min       1Q   Median       3Q      Max
-0.41011 -0.15980 -0.02816  0.15681  0.45539

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 **
GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141
GNP          -3.582e-02  3.349e-02  -1.070 0.312681
Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 **
Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
Population   -5.110e-02  2.261e-01  -0.226 0.826212
Year          1.829e+00  4.555e-01   4.016 0.003037 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared: 0.9955,     Adjusted R-squared: 0.9925
F-statistic: 330.3 on 6 and 9 DF,  p-value: 4.984e-10


# perform stepwise variable selection
R> oreStep <- ore.stepwise(Employed ~ .^2, data = LONGLEY, add.p =
0.1, drop.p = 0.1)

# View a summary of ore.stepwise object
R> summary(oreStep)

Call:
ore.stepwise(formula = Employed ~ .^2, data = LONGLEY, add.p = 0.1,
    drop.p = 0.1)

Residuals:
   Min     1Q Median     3Q    Max
 18.42  22.08  24.08  29.40  33.19

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)             -3.539e-01  8.455e-03 -41.852  < 2e-16 ***
Year                     3.589e-05  1.821e-05   1.971   0.0675 .
GNP.deflator:GNP        -2.978e-03  3.039e-04  -9.800 6.51e-08 ***
GNP.deflator:Unemployed  2.326e-04  5.720e-06  40.656  < 2e-16 ***
GNP.deflator:Population  2.303e-05  2.293e-06  10.044 4.72e-08 ***
GNP:Armed.Forces         6.875e-06  8.415e-07   8.170 6.64e-07 ***
GNP:Year                 2.007e-04  1.890e-05  10.618 2.26e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9829 on 15 degrees of freedom
Multiple R-squared: 0.9972,     Adjusted R-squared: 0.9961
F-statistic:  3242 on 6 and 15 DF,  p-value: < 2.2e-16

For more information on how Oracle R Enterprise leverages Oracle Database as a computational engine, see the Oracle R Training Series. Oracle R Enterprise is a component of the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition. We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop. As always, we welcome comments and questions on the Oracle R Forum.

Thursday Aug 02, 2012

Data Parallelism Using Oracle R Enterprise

Modern computer processors are adequately optimized for many statistical calculations, but large data operations may require hours or days to return a result.  Oracle R Enterprise (ORE), a set of R packages designed to process large data computations in Oracle Database, can run many R operations in parallel, significantly reducing processing time. ORE supports parallelism through the transparency layer, where the database is used as a computational engine, and embedded R execution, where R scripts can be executed in a data parallel manner.

The backbone of parallel computing is breaking down a resource intensive computation into chunks that can be performed independently, while maintaining a framework that allows for the results of those independent computations to be combined.  Writing parallel code is typically trickier than writing serial code, but this is simplified using ORE, as there is no need for the user to create worker instances or combine results. Using the transparency layer, users simply execute their ORE code and the database implicitly manages the entire process, returning results for further processing.

With ORE, each R function invocation that operates on an ORE object, such as ore.frame, is translated to a SQL statement behind the scenes. This SQL, which may be stacked after several function invocations, undergoes optimization and parallelization when parsed and executed. This technique enables deferred evaluation, but that's a topic for another blog. Depending on the resource requirements of the statement, the database decides if it should leverage parallel execution. 

For embedded R execution, database degree of parallelism settings help determine the number of parallel R engines to start. When data-parallel functions execute in parallel, each unit of work is sent to a different R external process (extproc) at the database server. The results are automatically collated and returned as R proxy objects, e.g., ore.frame objects, in the R interface, and as SQL objects in the SQL interface, which can be processed further in R or by SQL functions. The SQL functions enable the operationalization, or productization, of R scripts as part of a database-based application, in what we refer to as "lights out" mode.

In the ORE Transparency Layer, where the database executes SQL generated from overloaded R functions, parallelism is automatic, assuming the database or table is configured for parallelism. Parallel computations in the transparency layer are ideal for bigger data where functionality exists in the database.

Using Embedded R Script Execution, parallelism is enabled for row, group and index operations if specified using a function parameter or parallel cursor hint:

  • ore.groupApply and rqGroupEval* split the data into grouped partitions and invoke the R function on each partition in a separate engine at the database server.
  • ore.rowApply and rqRowEval split the data into row chunks and invoke the R function on each chunk in a separate engine at the database server.
  • ore.indexApply runs an R function x times, with each iteration of the function invoked in separate engine at the database server.
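As a minimal illustration of the last of these, the following runs a toy simulation 10 times, each invocation in its own server-side R engine (the function body is illustrative):

res <- ore.indexApply(10,
                      function(i) mean(rnorm(10000, mean = i)),
                      parallel = TRUE)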

With embedded R execution, the expectation is that the database server machine has greater RAM and CPU capacity than the user's client machine, so executing R scripts at the server inherently allows larger data sets to be processed by an individual R engine.

In addition, users can include contributed R packages in their embedded R scripts. Consider an example using a sample of the airline on-time performance data from Research and Innovative Technology Administration (RITA), which coordinates the U.S. Department of Transportation (DOT) research programs. The data sample consists of 220K records of U.S. domestic commercial flights between 1987 and 2008. 

We use the R interface to embedded R execution to partition the airline data table (ONTIME_S) by the DAYOFWEEK variable, fit a linear model to each partition using the biglm package, and then combine the results. Note: to run this example, the biglm package must be installed on both the database server and the client machine.

res <- ore.groupApply(ONTIME_S,
               INDEX = ONTIME_S$DAYOFWEEK,
               parallel = TRUE,
               function(dat) {
                library(biglm)
                library(ORE)
                ore.connect("rquser", "orcl", "localhost", "rquser")
                biglm(ARRDELAY ~ DEPDELAY + DISTANCE, dat)
               })
R> summary(res$Monday)
Large data regression model: biglm(ARRDELAY ~ DEPDELAY + DISTANCE, dat)
Sample size =  31649
               Coef    (95%     CI)     SE     p
(Intercept)  0.5177  0.2295  0.8058 0.1441 3e-04
DEPDELAY     0.9242  0.9178  0.9305 0.0032 0e+00
DISTANCE    -0.0014 -0.0017 -0.0011 0.0002 0e+00

The call to ore.groupApply uses Oracle Database to partition the ONTIME_S table by the categories in the DAYOFWEEK variable. Each category is sent to an R engine at the database server machine to apply the R function in parallel. The individual category results are combined in the returned result. Using embedded R alleviates the typical memory problems associated with running R serially, because each R engine fits only a single partition, or day of the week, in memory. On a Linux server with 8 GB RAM and 4 CPUs, fitting the model in parallel by setting parallel = TRUE in the call to ore.groupApply reduces the processing time from approximately 30 seconds to 10 seconds.

If the goal is to integrate the model results as an operationalized process, we can use rqGroupEval, the SQL interface equivalent of ore.groupApply. We create a script to set up the structure of the input and grouping column, and then run the script in SQL. The nature of pipelined table functions requires that we explicitly represent the result type, captured in the package, and create a function that explicitly names the column used for partitioning.

# setup  

SQL> CREATE PACKAGE airlinePkg AS
  2  TYPE cur IS REF CURSOR RETURN ONTIME_S%ROWTYPE;
  3  END airlinePkg;
  4  /

Package created.

SQL> CREATE FUNCTION ontimeGroupEval(
  inp_cur  airlinePkg.cur,
  par_cur  SYS_REFCURSOR, 
  out_qry  VARCHAR2,
  grp_col  VARCHAR2,
  exp_txt  CLOB)
RETURN SYS.AnyDataSet
PIPELINED PARALLEL_ENABLE (PARTITION inp_cur BY HASH (DAYOFWEEK))
CLUSTER inp_cur BY (DAYOFWEEK)
USING rqGroupEvalImpl;
/

# model build

alter table ONTIME_S parallel;

SQL> begin
  sys.rqScriptCreate('GroupingExample',
 'function(dat) {
      library(biglm)
      library(ORE)
      ore.connect("rquser", "orcl", "localhost", "rquser")
      result <- biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
      result}');
end;
/ 
create table ONTIME_LM as
select *
  from table(ontimeGroupEval(
         cursor(select /*+ parallel(ONTIME_S) */ *
                  from ONTIME_S),
         NULL, NULL, 'DAYOFWEEK', 'GroupingExample'));

begin
  sys.rqScriptDrop('GroupingExample');
end;
/


We use a parallel hint on the cursor that is the input to our rqGroupEval function to enable Oracle Database to use parallel R engines.  In this case, using the same Linux server, the processing time is reduced from approximately 25 seconds to 7 seconds as we used 7 parallel R engines (one for each day of the week) across a single server.  Of course, a real-world scenario may utilize hundreds of parallel engines across many servers, returning results on large amounts of data in short period of time.

Additional details on how parallel execution works in Oracle Database can be found here. We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop. As always, we welcome comments and questions on the Oracle R Forum.

*To enable execution of an R script in the SQL interface, ORE provides SQL variants of the embedded R execution functions: rqEval, rqTableEval, and rqRowEval. The ore.groupApply feature does not have a direct equivalent in the SQL interface; we refer to rqGroupEval as a concept, but specific code is required to enable it, as highlighted in the second example above.

Wednesday Jun 27, 2012

Solving Big Problems with Oracle R Enterprise, Part II

Part II – Solving Big Problems with Oracle R Enterprise

In the first post in this series (see https://blogs.oracle.com/R/entry/solving_big_problems_with_oracle), we showed how you can use R to perform historical rate of return calculations against investment data sourced from a spreadsheet. We demonstrated the calculations against sample data for a small set of accounts. While this worked fine, in the real world the problem is much bigger, because the amount of data is much bigger: so much bigger that the approach in the previous post won't scale to meet real-world needs.


From our previous post, here are the challenges we need to conquer:

  • The actual data that needs to be used lives in a database, not in a spreadsheet
  • The actual data is much, much bigger- too big to fit into the normal R memory space and too big to want to move across the network
  • The overall process needs to run fast- much faster than a single processor
  • The actual data needs to be kept secured- another reason to not want to move it from the database and across the network
  • And the process of calculating the IRR needs to be integrated with other database ETL activities, so that IRRs can be calculated as part of the data warehouse refresh processes

In this post, we show how we moved from the sample data environment to working with full-scale data. This post is based on actual work we did for a financial services customer during a recent proof-of-concept.
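The elided steps follow this general shape: a hypothetical sketch in which CASHFLOWS is an ore.frame of per-account cash flows (ACCOUNT_ID, AMOUNT, and YEARS are assumed column names) and the IRR body is simplified to a one-rate NPV root:

res <- ore.groupApply(CASHFLOWS,
                      INDEX = CASHFLOWS$ACCOUNT_ID,
                      function(dat) {
                        # net present value of this account's flows at rate r
                        npv <- function(r) sum(dat$AMOUNT / (1 + r)^dat$YEARS)
                        data.frame(ACCOUNT_ID = dat$ACCOUNT_ID[1],
                                   IRR = uniroot(npv, c(-0.99, 10))$root)
                      },
                      parallel = TRUE)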

...

  1. On average, we performed 8,200 executions of our R function per second (110s/911k accounts)
  2. On average, we did 41,000 single time period rate of return calculations per second (each of the 8,200 executions of our R function did rate of return calculations for 5 time periods)
  3. On average, we processed over 900,000 rows of database data in R per second (103m detail rows/110s)

...

R + Oracle R Enterprise: Best of R + Best of Oracle Database

This blog post series started by describing a real customer problem: how to perform a lot of calculations on a lot of data in a short period of time.  While standard R proved to be a very good fit for writing the necessary calculations, the challenge of working with a lot of data in a short period of time remained.


This blog post series showed how Oracle R Enterprise enables R to be used in conjunction with the Oracle Database to overcome the data volume and performance issues (as well as simplifying the operations and security issues).  It also showed that we could calculate 5 time periods of rate of returns for almost a million individual accounts in less than 2 minutes.




Thursday Jun 21, 2012

Solving Big Problems with Oracle R Enterprise, Part I

Abstract:

This blog post will show how we used Oracle R Enterprise to tackle a customer’s big calculation problem across a big data set.

Overview:

Databases are great for managing large amounts of data in a central place with rigorous enterprise-level controls. R is great for doing advanced computations. Sometimes you need to do advanced computations on large amounts of data, subject to rigorous enterprise-level concerns. This blog post shows how Oracle R Enterprise, combining R with Oracle Database, enabled us to do some fairly sophisticated calculations across 1 million accounts (each with many detailed records) in minutes.

The problem:

A financial services customer of mine has a need to calculate the historical internal rate of return (IRR) for its customers’ portfolios.  This information is needed for customer statements and the online web application.  In the past, they had solved this with a home-grown application that pulled trade and account data out of their data warehouse and ran the calculations.  But this home-grown application was not able to do this fast enough, plus it was a challenge for them to write and maintain the code that did the IRR calculation.
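For context, here is what a simple internal rate of return looks like in plain R; this is an illustration, not the customer's implementation:

irr <- function(amounts, dates) {
  npv <- function(rate) {
    yrs <- as.numeric(dates - dates[1]) / 365.25   # year fractions from first flow
    sum(amounts / (1 + rate)^yrs)
  }
  uniroot(npv, interval = c(-0.99, 10))$root       # rate where NPV crosses zero
}
irr(c(-1000, 50, 1100), as.Date(c("2011-01-01", "2011-07-01", "2012-01-01")))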

...



Monday Apr 23, 2012

Introduction to Oracle R Connector for Hadoop

MapReduce, the heart of Hadoop, is a programming framework that enables massive scalability across servers using data stored in the Hadoop Distributed File System (HDFS). The Oracle R Connector for Hadoop (ORCH) provides access to a Hadoop cluster from R, enabling manipulation of HDFS-resident data and the execution of MapReduce jobs.

Conceptually, MapReduce is similar to a combination of apply operations in R, or GROUP BY in Oracle Database: transform the elements of a list or table, compute an index, and apply a function to the specified groups. The value of MapReduce in ORCH is its extension beyond a single process to parallel processing using modern architectures: multiple cores, processes, machines, clusters, data appliances, or clouds.

ORCH can be used on the Oracle Big Data Appliance or on non-Oracle Hadoop clusters. R users write mapper and reducer functions in R and execute MapReduce jobs from the R environment using a high level interface. As such, R users are not required to learn a new language, e.g., Java, or environment, e.g., cluster software and hardware, to work with Hadoop. Moreover, functionality from R open source packages can be used in the writing of mapper and reducer functions. ORCH also gives R users the ability to test their MapReduce programs locally, using the same function call, before deploying on the Hadoop cluster.


In the following example, we use the ONTIME_S data set, typically installed in Oracle Database when Oracle R Enterprise is installed. ONTIME_S is a subset of the airline on-time performance data from the Research and Innovative Technology Administration (RITA), which coordinates the U.S. Department of Transportation (DOT) research programs.
We're providing a relatively large sample data set (220K rows), but this example could be performed in ORCH on the full data set, which contains 123 million rows and requires 12 GB of disk space. That data set is significantly larger than R can process on its own using a typical laptop with 8 GB RAM.

ONTIME_S is a database-resident table with metadata on the R side, represented by an ore.frame object.

> class(ONTIME_S)
[1] "ore.frame" attr(,"package")
[1] "OREbase"

ORCH includes functions for manipulating HDFS data. Users can move data between HDFS and the file system, R data frames, and Oracle Database tables and views. This next example shows one such function, hdfs.push, which accepts an ore.frame object as its first argument, followed by the name of the key column, and then the name of the file to be used within HDFS.

ontime.dfs_DB <- hdfs.push(ONTIME_S,
                       key='DEST',
                       dfs.name='ontime_DB')

The following R script example illustrates how users can attach to an existing HDFS file object, essentially getting a handle to the HDFS file.
Then, using the hadoop.run function in ORCH, we specify the HDFS file handle, followed by the mapper and reducer functions. The mapper function takes the key and value as arguments, which correspond to one row of data at a time from the HDFS block assigned to the mapper. The function keyval in the mapper returns data to Hadoop for further processing by the reducer.

The reducer function receives all the values associated with one key (resulting from the “shuffle and sort” of Hadoop processing). The result of the reducer is also returned to Hadoop using the keyval function. The results of the reducers are consolidated in an HDFS file, which can be obtained using the hdfs.get function.

The following example computes the average arrival delay for flights where the destination is San Francisco Airport (SFO).  It selects the SFO airport in the mapper and the mean of arrival delay in the reducer.

dfs <- hdfs.attach("ontime_DB")

res <- hadoop.run(
  dfs,
  mapper = function(key, value) {
    # keep only rows destined for SFO that have a non-missing arrival delay
    if (key == 'SFO' && !is.na(value$ARRDELAY)) {
      keyval(key, value)
    } else {
      NULL
    }
  },
  reducer = function(key, values) {
    # average the arrival delays collected for this key
    sumAD <- 0
    count <- 0
    for (x in values) {
      sumAD <- sumAD + x$ARRDELAY
      count <- count + 1
    }
    keyval(key, sumAD / count)
  })

> hdfs.get(res)
   key     val1
1  SFO   17.44828

Oracle R Connector for Hadoop is part of the Oracle Big Data Connectors software suite and is supported for Oracle Big Data Appliance and Oracle R Enterprise customers.  We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop.  We welcome comments and questions on the Oracle R Forum.





Friday Apr 13, 2012

Oracle R Enterprise 1.1 Download Available

Oracle just released the latest update to Oracle R Enterprise, version 1.1. This release includes the Oracle R Distribution (based on open source R, version 2.13.2), an improved server installation, and much more.  The key new features include:

  • Extended Server Support: New support for Windows 32 and 64-bit server components, as well as continuing support for Linux 64-bit server components
  • Improved Installation: Linux 64-bit server installation now provides robust status updates and prerequisite checks
  • Performance Improvements: Improved performance for embedded R script execution calculations

In addition, the updated ROracle package, which is used with Oracle R Enterprise, now reads DATE data by converting values to character strings.
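
As a hedged sketch of what this means for client code (con is an assumed existing ROracle connection; the table, column, and date format below are illustrative, not from this release's documentation), date values arriving as strings can be converted back explicitly:

# Sketch: DATE columns now arrive as character strings; convert them
# to POSIXct on the R side. Table, column, and format are illustrative.
df <- dbGetQuery(con, "select hiredate from emp")
df$HIREDATE <- as.POSIXct(df$HIREDATE, format = "%Y-%m-%d %H:%M:%S")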

We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop.  As always, we welcome comments and questions on the Oracle R Forum.



Oracle R Distribution 2-13.2 Update Available

Oracle has released an update to the Oracle R Distribution, an Oracle-supported distribution of open source R. Oracle R Distribution 2-13.2 now contains the ability to dynamically link the following libraries on both Windows and Linux:

  • The Intel Math Kernel Library (MKL) on Intel chips
  • The AMD Core Math Library (ACML) on AMD chips

To take advantage of the performance enhancements provided by Intel MKL or AMD ACML in Oracle R Distribution, simply add the MKL or ACML shared library directory to the LD_LIBRARY_PATH system environment variable. This automatically enables MKL or ACML to make use of all available processors, vastly speeding up linear algebra computations and eliminating the need to recompile R.  Even on a single core, the optimized algorithms in the Intel MKL libraries are faster than using R's standard BLAS library.
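
A quick way to gauge the effect is to time a BLAS-heavy operation with and without MKL or ACML on the library path. This small benchmark is only a sketch; absolute timings will vary by machine and library:

# Time a dense matrix multiply, which is dominated by BLAS routines.
# With MKL or ACML on LD_LIBRARY_PATH, this should run markedly faster
# (and use multiple cores) compared to R's reference BLAS.
set.seed(42)
n <- 2000
a <- matrix(rnorm(n * n), nrow = n)
system.time(a %*% a)        # dense matrix multiply
system.time(crossprod(a))   # t(a) %*% a, also BLAS-backed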

Open-source R is linked to NetLib's BLAS libraries, which are not multi-threaded and use only one core. While R's internal BLAS is efficient for most computations, it is possible to recompile R to link to a different, multi-threaded BLAS library to improve performance on eligible calculations. Compiling and linking R yourself can be involved, but for many, the significantly improved calculation speed justifies the effort. Oracle R Distribution notably simplifies the process of using external math libraries by enabling R to auto-load MKL or ACML. For R commands that don't link to BLAS code, taking advantage of database parallelism using embedded R execution in Oracle R Enterprise is the route to improved performance.

For more information about rebuilding R with different BLAS libraries, see the linear algebra section in the R Installation and Administration manual. As always, the Oracle R Distribution is available as a free download to anyone. Questions and comments are welcome on the Oracle R Forum.




Wednesday Apr 04, 2012

New Release of ROracle posted to CRAN

Oracle recently updated ROracle to version 1.1-2 on CRAN with enhancements and bug fixes. The major enhancements include the introduction of support for Oracle Wallet Manager and datetime and interval types. 

Oracle Wallet support in ROracle allows users to manage public key security from the client R session. With a wallet, database login credentials are stored securely and read by Oracle Database, so passwords need not appear in R scripts. In addition, we added support for datetime and interval types when selecting data, which expands ROracle's support for date data.
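
A minimal sketch of a wallet-based connection from R, assuming a wallet configured for the illustrative TNS alias mydb_alias (the empty username and password direct ROracle to the credentials stored in the wallet):

library(ROracle)
drv <- Oracle()
# Credentials come from the Oracle Wallet, not the R script
con <- dbConnect(drv, username = "", password = "", dbname = "mydb_alias")
dbGetQuery(con, "select count(*) from user_tables")
dbDisconnect(con)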

See the ROracle NEWS for the complete list of updates.

We encourage ROracle users to post questions and provide feedback on the Oracle R Forum.

In addition to being a high-performance interface to Oracle Database from R for general use, ROracle provides database access for Oracle R Enterprise.

Monday Apr 02, 2012

Introduction to ORE Embedded R Script Execution

This Oracle R Enterprise (ORE) tutorial, on embedded R execution, is the third in a series to help users get started using ORE. See these links for the first tutorial on the transparency layer and second tutorial on the statistics engine. Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

Embedded R Execution refers to the ability to execute an R script at the database server, which provides several benefits: spawning multiple R engines in parallel for data-parallel operations, transferring data more efficiently between Oracle Database and the R engine, leveraging a likely more powerful server with more CPUs and greater RAM, scheduling automated jobs, and taking advantage of open source R packages at the database server. Data aggregates are computed in parallel, significantly reducing computation time, without requiring sophisticated configuration steps.

ORE provides two interfaces for embedded R execution: one for R and one for SQL. The R interface enables interactive execution at the database server from the client R engine, e.g., your laptop. It also has transparency aspects for passing R objects and returning R objects. In the R interface, ore.doEval schedules execution of the R code with the database-embedded R engine and returns the results back to the desktop for continued analysis. User-defined R functions can run in parallel on each row, on sets of rows, or on each group of rows given a grouping column. The first two cases are covered by ore.rowApply, the third by the ore.groupApply function. ore.indexApply provides parallel simulation capability by invoking the script the number of times specified by the user.  The R interface returns results to the client as R objects that can be passed as arguments to R functions.
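
As a minimal illustration of the R interface (a sketch, not from the original tutorial), ore.doEval ships a function to a database-spawned R engine and returns its result to the client:

# Execute a trivial R function inside the database and pull the result
# back into the local R session as a data frame.
res <- ore.doEval(function() {
         id <- 1:10
         data.frame(ID = id, RES = id / 100)
       })
ore.pull(res)   # materialize the result locally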

The SQL interface enables interactive execution from any SQL interface, like SQL*Plus or SQL Developer, but it also enables R scripts to be included in production database-based systems. To enable execution of an R script in the SQL interface, ORE provides SQL variants of ore.doEval, ore.tableApply, ore.rowApply, and ore.groupApply, named rqEval, rqTableEval, rqRowEval, and rqGroupEval, respectively. The SQL interface also allows results to be stored directly in the database.

 R Interface Function (ore.*)   SQL Interface Function (rq*)   Purpose
 ore.doEval                     rqEval                         Invoke stand-alone R script
 ore.tableApply                 rqTableEval                    Invoke R script with full table input
 ore.rowApply                   rqRowEval                      Invoke R script one row at a time, or multiple rows in "chunks"
 ore.groupApply                 rqGroupEval                    Invoke R script on data indexed by grouping column
 ore.indexApply                 N/A                            Invoke R script N times

In addition, the SQL interface enables R results to be stored in a database table for subsequent use in another invocation (think data mining model building and scoring), and it enables returning structured R results in a table. Results can also be returned as XML; the XML option handles both structured data, such as data frames and other R objects, and images, such as graphs. The XML capability allows R graphs and structured results to be displayed in Oracle BI Publisher documents and OBIEE dashboards.

Embedded R Execution: R Interface

The following example uses the function ore.groupApply, one of several embedded R execution functions, to illustrate how R users can achieve data parallelism through the database. This example also illustrates that embedded R execution enables the use of open source packages. Here we see the use of the R package biglm.

We specify a column on which to partition the data. Each partition of the data is provided to the user-defined function through its first argument, in this case the function variable dat. There is no need to send data from the database to R: the R function is sent to the database, which runs it on the partitions in parallel. Output results may be stored directly in the database, or may be downloaded to R. Only when we want to see the results of these models do we need to retrieve them into R memory and call, for example, the summary function.

modList <- ore.groupApply(
   X=ONTIME,
   INDEX=ONTIME$DEST,
   function(dat) {
     library(biglm)
     biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
   });

modList_local <- ore.pull(modList)

> summary(modList_local$BOS)
Large data regression model: biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
Sample size =  3928 
               Coef    (95%     CI)     SE      p
(Intercept)  0.0638 -0.7418  0.8693 0.4028 0.8742
DISTANCE    -0.0014 -0.0021 -0.0006 0.0004 0.0002
DEPDELAY     1.0552  1.0373  1.0731 0.0090 0.0000

Embedded R Execution: SQL Interface 

Whereas the previous example showed how to use embedded R execution from the R environment, we can also invoke R scripts from SQL. This next example illustrates returning a data frame from results computed in Oracle Database. We first create an R script in the database R script repository. The script is defined as a function that creates a vector of 10 elements, and returns a data frame with those elements in one column and those elements divided by 100 in a second column.

Once the script is created, we can invoke it through SQL. One of the available SQL embedded R execution table functions is rqEval. The first argument is NULL since we have no parameters to pass to the function. The second argument describes the structure of the result; any valid SQL query that captures the name and type of the resulting columns will suffice. The third argument is the name of the script to execute.

begin
  sys.rqScriptCreate('Example1',
'function() {
   ID <- 1:10
   res <- data.frame(ID = ID, RES = ID / 100)
   res}');
end;
/
select *
  from table(rqEval(NULL,
                    'select 1 id, 1 res from dual',
                    'Example1'));

The result is a data frame of ten rows, with the integers 1 through 10 in column ID and the corresponding values 0.01 through 0.10 in column RES.


Embedded R scripts may generate any valid R object, including graphs. In addition, embedded R execution enables returning results from an R script as an XML string. Consider the following example that creates a vector from the integers 1 to 10, plots 100 random normal points in a graph, and then returns the vector. After creating the script in the database R script repository, we invoke the script using rqEval, but instead of specifying the form of the result in a SQL query, we specify XML.

begin
  sys.rqScriptCreate('Example6',
 'function(){
            res <- 1:10
            plot( 1:100, rnorm(100), pch = 21,
                  bg = "red", cex = 2 )
            res
            }');
end;
/
select value
from   table(rqEval( NULL,'XML','Example6'));

While the rendered graph is a scatter plot of 100 random normal points, the output from this query is an XML string.

In the execution results, the returned VALUE column is a string that contains first the structured data in XML format, with the numbers 1 through 10 set off by <value> tags, followed by the image in base 64 PNG representation. This type of output can be consumed by Oracle Business Intelligence Publisher (BIP) to produce documents with R-generated graphs and structured content.  Oracle BIP templates can also be used to expose R-generated content in Oracle Business Intelligence Enterprise Edition (OBIEE) web browser-based dashboards.

You can see additional examples of embedded R execution in action in the Oracle R Enterprise training, session 4, Embedded R Script Execution. These examples will run as written in R 2.13.2 after installing Oracle R Enterprise. We'll be posting more examples using embedded R script execution in the coming months. In the meantime, questions are always welcome on the Oracle R Forum.


Tuesday Mar 13, 2012

Oracle R Distribution and Open Source R

Oracle provides the Oracle R Distribution, an Oracle-supported distribution of open source R. Support for Oracle R Distribution is provided to customers of the Oracle Advanced Analytics option and the Oracle Big Data Appliance. The Oracle R Distribution facilitates enterprise acceptance of R, since the lack of a major corporate sponsor has made some companies concerned about fully adopting R. With the Oracle R Distribution, Oracle plans to contribute bug fixes and relevant enhancements to open source R. Oracle R Distribution also employs Intel's Math Kernel Library (MKL) to enable optimized, multi-threaded math routines, providing relevant R functions maximum performance on Intel hardware.

Oracle plans to actively maintain interoperability with current R versions by developing Oracle R Enterprise against the latest stable point release of open source R. All bug fixes and enhancements Oracle contributes will be made publicly available to the R community; Oracle has already released to the open source community an enhanced version of ROracle, which it now maintains. Support for Oracle R Distribution, Oracle R Enterprise, and Oracle R Connector for Hadoop is provided through standard Oracle Support channels for licensed customers.

Wednesday Feb 29, 2012

ROracle 1.1-1 Delivers Performance Improvements

The Oracle R Advanced Analytics team is happy to announce the release of the ROracle 1.1-1 package on the Comprehensive R Archive Network (CRAN).  We’ve rebuilt ROracle from the ground up, working hard to fix bugs and add optimizations. The new version introduces key improvements for interfacing with Oracle Database from open-source R.

Specific improvements in ROracle 1.1-1 include:

  • Lossless database reads: consistent accuracy for database read results
  • Performance improvements: faster database read and write operations
  • Simplified installation:  Oracle Instant Client now bundled on Windows, Instant or full Oracle Client on Linux

ROracle uses the Oracle Call Interface (OCI) libraries to handle the database connections, providing a high-performance, native C-language interface to the Oracle Database.  ROracle 1.1-1 is supported for R versions 2.12 through 2.13, and with Oracle Instant Client and Oracle Database Client versions 10.2 through 11.2, both 32 and 64-bit running Linux and Windows.
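
A minimal sketch of the basic connection flow, with illustrative credentials, connect identifier, and table names:

library(ROracle)
drv <- Oracle()                       # OCI-based driver
con <- dbConnect(drv, username = "scott", password = "tiger",
                 dbname = "orcl")     # illustrative connect identifier
df  <- dbGetQuery(con, "select * from emp where rownum <= 5")
dbWriteTable(con, "EMP_COPY", df)     # OCI-backed write path
dbDisconnect(con)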

We think ROracle 1.1-1 is a great step forward, allowing users to build high-performance and efficient R applications using Oracle Database. Whether you are upgrading an existing installation or using it for the first time, ROracle 1.1-1 is ready for download. If you have any questions or comments, please post on the Oracle R discussion forum. We'd love to hear from you!

Thursday Feb 23, 2012

Introduction to the ORE Statistics Engine

This Oracle R Enterprise (ORE) statistics engine tutorial is the second in a series to help users get started using ORE. (See the first tutorial on the ORE transparency layer here). Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

The ORE statistics engine is a database library consisting of native database statistical functions and new functionality added specifically for ORE. ORE intercepts R functions and pushes their execution to Oracle Database to perform computationally expensive transformations and computations. This allows R users to use the R client directly against data stored in Oracle Database, greatly increasing R's scalability and performance.

A variety of useful and common routines are available in the ORE statistics engine:

Significance Tests

  • Chi-square, McNemar, Bowker
  • Simple and weighted kappas
  • Cochran-Mantel-Haenszel correlation
  • Cramer's V
  • Binomial, KS, t, F, Wilcox

Distribution Functions (density, probability, and quantile functions are provided for each)

  • Beta distribution
  • Binomial distribution
  • Cauchy distribution
  • Chi-square distribution
  • Exponential distribution
  • F-distribution
  • Gamma distribution
  • Geometric distribution
  • Log Normal distribution
  • Logistic distribution
  • Negative Binomial distribution
  • Normal distribution
  • Poisson distribution
  • Sign Rank distribution
  • Student t distribution
  • Uniform distribution
  • Weibull distribution

Other Functions

  • Gamma function
  • Natural logarithm of the Gamma function
  • Digamma function
  • Trigamma function
  • Error function
  • Complementary error function

Base SAS Equivalents

  • Freq, Summary, Sort
  • Rank, Corr, Univariate

These R functions are overridden, such that when presented with ore.frame data, the function generates a SQL query that is transparently submitted to Oracle Database. This is the case for much of the base R and stats functionality, providing the benefit of employing the database as the computational engine and seamlessly breaking through R's memory barrier.

In this post, we introduce a simple yet typical data analysis using functions from the ORE transparency layer. We begin by configuring the local R environment with a few simple commands.

Load the ORE library and connect to Oracle Database:

> library(ORE)
> ore.connect("USER", "SID", "HOST", "PASSWORD")

Invoking ore.sync synchronizes the metadata in the database schema with the R environment:

> ore.sync("SCHEMA")
> ore.ls()


Attaching the database schema provides access to views and tables so they can be manipulated from a local R session:

> ore.attach("SCHEMA")

We use the ONTIME_S data set typically installed in Oracle Database when ORE is installed. ONTIME_S is a subset of the airline on-time performance data from the Research and Innovative Technology Administration (RITA), which coordinates U.S. Department of Transportation (DOT) research programs.  We're providing a relatively large sample data set (220K rows), but these examples could be performed in ORE on the full data set, which contains 123 million rows and requires 12 GB of disk space. This data set is significantly larger than R can process on its own.

ONTIME_S is a database-resident table with metadata on the R side, represented by an ore.frame object.

> class(ONTIME_S)
[1] "ore.frame" attr(,"package")
[1] "OREbase"

We focus on two columns of data: one numeric column, DEPDELAY (actual departure delay in minutes), and a categorical column, ORIGIN (airport of origin).

A typical place to begin is, of course, looking at the structure of selected variables. We call the function summary to obtain summary statistics on the variable measuring departure delay.

> with(ONTIME_S, summary(DEPDELAY))
      Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
   -75.000   -2.000    0.000    8.042    6.000 1438.000     3860 

To investigate the skewness of the departure delay variable, we use the quantile function to calculate quantiles at 10, 20, 30,....90%.

> with(ONTIME_S, quantile(DEPDELAY, seq(0.1, 0.9, by = 0.1), na.rm = TRUE))
10% 20% 30% 40% 50% 60% 70% 80% 90% 
 -4  -2  -1   0   0   1   4  10  26 

This shows us that the 10% quantile (-4) is 4 units away from the median, while the 90% quantile (26) is 26 units from the median.  For a symmetric distribution, the two quantiles would be about the same distance from the median.  A measure of the data spread is the interquartile range, the difference between the 25% and 75% quantile. To allow computations on the data, we remove missing values by setting the na.rm parameter to TRUE, as we did above for the quantile function.
> with(ONTIME_S, IQR(DEPDELAY, na.rm = TRUE))
[1] 8

Other measures of data spread available are the variance and standard deviation.

> with(ONTIME_S, var(DEPDELAY, na.rm = TRUE))
[1] 771.7207
> with(ONTIME_S, sd(DEPDELAY, na.rm = TRUE))
[1] 27.77986

Using in-database aggregation summaries, we can investigate the relationship between departure delay and origin a bit further. We use aggregate to calculate the mean departure delay for each airport of origin.  Results for the first five airports are displayed using the function head.

> ONTIME.agg <- aggregate(ONTIME_S$DEPDELAY,
                          by = list(ONTIME_S$ORIGIN),
                          FUN = mean)
> head(ONTIME.agg, 5)
  Group.1    x
1     ABE   216
2     ABI    29
3     ABQ  1392
4     ABY   9
5     ACK   2


Now that we have gained a basic impression and some insights into the ONTIME_S data, we may choose to view the data graphically.  For example, we may want to get a visual impression of the distribution of departure delay.  We use the hist function, which displays a histogram skewed to the positive side, presumably because flights rarely leave early.

> with(ONTIME_S, hist(DEPDELAY, breaks = 100, col = rainbow(7), cex.lab = 0.8,
      main = "Distribution of Departure Delay", xlab = "Departure Delay (in minutes)"))

After analyzing the data through exploratory methods in ORE, we proceed to a possible next step: confirmatory statistics. Let's compute a Student's t-test using  the origin and departure delay variables we examined earlier.  The goal is to decide whether average departure delay of one airport is different from the average delay of another.

> with(ONTIME_S, t.test(DEPDELAY[ORIGIN == "SEA"], DEPDELAY[ORIGIN == "SFO"], conf.level = 0.95))

    Welch Two Sample t-test

data:  DEPDELAY[ORIGIN == "SEA"] and DEPDELAY[ORIGIN == "SFO"]
t = -1.8406, df = 7571.893, p-value = 0.06572
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.21038010  0.06960489
sample estimates:
mean of x  mean of y
 8.475415  9.545802


The output tells us that we calculated the unpaired t-test and gives us the value of the t-distributed statistic.

We may also use a Kolmogorov-Smirnov test to determine whether the departure delay variable follows a normal distribution with the same mean and standard deviation.

> with(ONTIME_S, ks.test(DEPDELAY, "pnorm", mean(DEPDELAY), sd(DEPDELAY)))

    One-sample Kolmogorov-Smirnov test

data:  DEPDELAY
D = 0.3497, p-value < 2.2e-16
alternative hypothesis: two-sided


At this point we could continue our data exploration by performing additional distribution tests, or proceed with the rich set of modeling and functionality ORE offers.  

ORE enables R users transparent access to data stored in Oracle Database while leveraging Oracle Database as a compute engine for scalability and high performance. We've only scratched the surface on ORE's statistical features - stay tuned for posts highlighting more advanced features of the statistical engine.

The Oracle R Enterprise User's Guide contains a number of examples demonstrating the functionality available in ORE. To view the documentation and training materials, visit our product page. Please feel free to visit our discussion forum and ask questions or provide comments about how we can help you and your team!






 