Monday Feb 18, 2013

Saving R Objects in Oracle Database using Oracle R Enterprise 1.3 Datastore

R allows users to save R objects to disk. The whole workspace of R objects can be saved to a file, and reloaded across R sessions, which allows users to return to their previous R environment even after quitting R or to avoid recreating objects needed in the future. One such type of object includes predictive models, which can be built in one session, and saved for scoring in another R session, or even multiple, possibly parallel R sessions.

R provides the save and load functions, where objects in memory are serialized and deserialized, respectively. Figure 1 depicts an example where two R objects, a linear model and a data.frame, are saved to a file and then reloaded. When objects are restored, they have the same names as when they were saved.

Figure 1: Using R save() and load() functions
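
For instance, a minimal sketch of this save/load workflow (the object, formula, and file names are illustrative) might look like:

  # Build two R objects: a linear model and a data.frame
  lm.mod <- lm(mpg ~ wt + hp, data = mtcars)
  dat <- mtcars

  # Serialize both objects to a file on disk
  save(lm.mod, dat, file = "myObjects.RData")

  # In a later R session (simulated here by removing the objects),
  # load() restores them under their original names
  rm(lm.mod, dat)
  load("myObjects.RData")
  summary(lm.mod)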

Oracle R Enterprise (ORE) 1.3 supports object persistence using an R datastore in the database. Now, ORE proxy objects, as well as any R objects, can be saved and restored across R sessions as a named entity in Oracle Database. Simply serializing ORE proxy objects, such as ore.frames, with R’s save function doesn’t work across sessions, since any referenced temporary tables or other database objects are not preserved when the R session ends. If these proxy object references are not maintained, the restored objects are incomplete and inoperative.

Figure 2 has an example similar to the previous example. The main difference is that we are using ore.save and ore.load, and providing the name of the datastore from which we want to retrieve ORE objects.

Figure 2: Using ORE datastore functions ore.save and ore.load

Each schema has its own datastore table where R objects are saved. Because the datastore is managed in Oracle Database, ORE maintains referential integrity of saved objects: otherwise temporary database objects, such as tables created via ore.push or Oracle Data Mining models produced using the OREdm package, are retained while a saved object references them, and are auto-deleted at the end of the R session once they are no longer referenced.

Here’s a simple example:

DAT1 <- ore.push(iris)
ore.lm.mod <- ore.lm(Sepal.Length ~ ., DAT1)
lm.mod <- lm(mpg ~ cyl + disp + hp + wt + gear, mtcars)
nb.mod <- ore.odmNB(YEAR ~ ARRDELAY + DEPDELAY + log(DISTANCE), ONTIME_S)
ore.save(ore.lm.mod, lm.mod, nb.mod, name = "myModels")

We’re creating four objects: an ore.frame “DAT1” consisting of the iris data set, an ore.lm model that uses DAT1, a standard R lm model using the mtcars dataset, and an ODM naïve Bayes model using ONTIME_S. We then invoke ore.save for the three models and use the datastore name “myModels”.

The R objects are saved in the datastore, and any referenced data tables or ODM models are kept in the database; otherwise, these are treated as temporary database objects and dropped when the ORE session ends.

To load these objects, invoke:

ore.load(name = "myModels")

To see the contents of a datastore, invoke ore.datastoreSummary with the name of the datastore; ore.datastore lists the datastores available in your schema.
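
For example, continuing with the “myModels” datastore created above, these calls might look like the following sketch (output omitted):

  ore.datastore()                           # list datastores in the current schema
  ore.datastoreSummary(name = "myModels")   # list the R objects saved in "myModels"
  ore.delete(name = "myModels")             # remove the datastore when no longer needed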

The datastore also makes it easy to access R and ORE objects within ORE embedded R execution functions: simply pass the name of the datastore as an embedded R function argument and, within the function, invoke ore.load with that name. We'll see an example of this using the SQL API below. In addition, by maintaining persisted R objects in the database, ORE facilitates application deployment while leveraging existing database backup, recovery, and security mechanisms.
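
As a preview, here is a rough sketch of that pattern using the R API for embedded R execution (ore.tableApply); the datastore name and columns mirror the SQL example that follows, and ore.connect = TRUE gives the database-spawned R engine the connection that ore.save needs:

  res <- ore.tableApply(
    ONTIME_S[, c("ARRDELAY", "DISTANCE", "DEPDELAY")],
    function(dat, datastore_name) {
      mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
      ore.save(mod, name = datastore_name, overwrite = TRUE)
      TRUE
    },
    datastore_name = "myDatastore",
    ore.connect = TRUE)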

Consider the following ORE embedded R execution example using the SQL API.  We build a simple linear model and store it in a datastore with the name "myDatastore". This R script will be stored in the database R script repository with the name "BuildModel-1". To invoke this R script, we use the rqTableEval function, providing the input data from ONTIME_S, passing the parameters that include the name of the datastore, requesting XML output for the result, and specifying the R script by name. 

begin
  sys.rqScriptCreate('BuildModel-1',
 'function(dat,datastore_name) {
    mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
    ore.save(mod,name=datastore_name, overwrite=TRUE)
    TRUE
}');
end;
/

select *
from table(rqTableEval(
  cursor(select ARRDELAY,DISTANCE,DEPDELAY from ONTIME_S),
  cursor(select 1 as "ore.connect", 'myDatastore' as "datastore_name"
         from dual),
  'XML',
  'BuildModel-1'));

To score using this model, we create an R script named "ScoreData-1" that loads the model from the named datastore, invokes predict, and binds the predictions with the predictors to be returned as the result. To invoke this script, we again use the rqTableEval function, providing a cursor that specifies the data to be scored, a cursor of parameters, a description of the result in the form of a SQL query, and the R script name.

begin
  sys.rqScriptCreate('ScoreData-1',
 'function(dat, datastore_name) {
     ore.load(datastore_name)
     pred <- predict(mod, newdata=dat)
     pred[as.integer(rownames(pred))] <- pred
     cbind(dat, PRED = pred)
     }');
end;
/

select *
from table(rqTableEval(
    cursor(select ARRDELAY, DISTANCE, DEPDELAY from ONTIME_S
           where YEAR = 2003 and MONTH = 5 and DAYOFMONTH = 2),
    cursor(select 1 as "ore.connect",
                 'myDatastore' as "datastore_name" from dual),
    'select ARRDELAY, DISTANCE, DEPDELAY, 1 PRED from ontime_s',
    'ScoreData-1'))
order by 1, 2, 3;

Datastore functionality supports the following interface: ore.save, ore.load, ore.datastore, ore.datastoreSummary, and ore.delete. See the online documentation for details. A presentation on the ORE transparency layer that provides additional examples is available here.

Wednesday Feb 06, 2013

Oracle R Enterprise 1.3 gives predictive analytics an in-database performance boost

Recently released Oracle R Enterprise 1.3 adds packages to R that enable even more in-database analytics. These packages provide horizontal, commonly used techniques that are blazingly fast in-database for large data. With Oracle R Enterprise 1.3, Oracle makes R even better and usable in enterprise settings. (You can download ORE 1.3 here and documentation here.)

When it comes to predictive analytics, scoring (predicting outcomes using a data mining model) is often a time-critical operation. Scoring can be done online (real-time), e.g., while a customer is browsing a webpage or using a mobile app, where on-the-spot recommendations can be made based on current actions. Scoring can also be done offline (batch), e.g., predicting which of your 100 million customers will respond to each of a dozen offers, where applications leverage the results to identify which customers should be targeted with a particular ad campaign or special offer.

In this blog post, we explore where using Oracle R Enterprise pays huge dividends. When working with small data, R can be sufficient, even when pulling data from a database. However, depending on the algorithm, the benefits of in-database computation can be seen with as few as a few thousand rows. With tens of thousands of rows, the time savings make an interactive session truly interactive; with hundreds of thousands of rows, they become a real productivity gain; and with millions (or billions) of rows, a competitive advantage! In addition to performance benefits, ORE integrates R with the database, enabling you to leave data in place.

We’ll look at a few proof points across Oracle R Enterprise features, including:

  • OREdm – a new package that provides R access to several in-database Oracle Data Mining algorithms (Attribute Importance, Decision Tree, Generalized Linear Models, K-Means, Naïve Bayes, Support Vector Machine).
  • OREpredict – a new package that enables scoring models built using select standard R algorithms in the database (glm, negbin, hclust, kmeans, lm, multinom, nnet, rpart).
  • Embedded R Execution – an ORE feature that allows running R under database control and boosts real performance of CRAN predictive analytics packages by providing faster access to data than occurs between the database and client, as well as leveraging a more powerful database machine with greater RAM and CPU resources.

OREdm

Pulling data out of a database for any analytical tool impedes interactive data analysis due to access latency, whether the data is pulled directly from the database or staged in flat files through an IT process. Such latencies can quickly become intolerable. On the R front, you’ll also need to consider whether the data will fit in memory. If flat files are involved, consideration must be given to how the files will be stored, backed up, and secured.

Of course, model building and data scoring execution time is only part of the story. Consider scenario A, the “build combined script,” where data is extracted from the database and an R model is built and persisted for later use. In the corresponding scenario B, the “score combined script,” data is pulled from the database, a previously built model is loaded, the data is scored, and the scores are written back to the database. This is a typical pattern for, e.g., enterprise dashboards or applications supporting campaign management or next-best-offer generation. In-database execution provides significant performance benefits, even for the relatively small data sets included below, and readers should be able to reproduce such results at these scales. We’ve also included a Big Data example by replicating the 123.5 million row ONTIME data set to 1 billion rows. Consider the following examples, starting with a rough sketch of the two combined scripts.
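
As a sketch (table, column, and file names are illustrative), the two combined scripts in open-source R look roughly like this; the in-database versions replace lm and predict with ore.lm and in-database scoring on ore.frames, so the data never leaves the database:

  # Scenario A: "build combined script" -- pull data, build the model, persist it
  dat <- ore.pull(MYDATA)                     # MYDATA is an illustrative ore.frame
  mod <- lm(Y ~ X1 + X2 + X3, data = dat)
  save(mod, file = "mod.RData")

  # Scenario B: "score combined script" -- pull data, load the model, score, write back
  dat <- ore.pull(MYDATA)
  load("mod.RData")
  scores <- predict(mod, newdata = dat)
  ore.create(cbind(dat, PRED = scores), table = "MYSCORES")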

Linear Models: We compared R lm and the ORE ore.lm in-database algorithm on the combined scripts. On data sets ranging from 500K to 1.5M rows with 3 predictors, in-database analytics showed an average 2x-3x performance improvement for model building, and nearly a 4x performance improvement for scoring. Notice in Figure 1 that the slope of the trend is significantly less for ore.lm than for lm, indicating greater scalability for ore.lm.

Figure 1. Overall lm and ore.lm execution time for model building (A) and data scoring (B)

Figure 2 provides a more detailed view, comparing data pull and model build time in the build detail, and data pull, data scoring, and score writing in the score detail. For model building, notice that while data pull is a significant part of lm’s total build time, the actual build time is still greater than that of ore.lm. A similar statement can be made in the case of scoring.

Figure 2. Execution time components for lm and ore.lm (excluding model write and load)

Naïve Bayes from the e1071 package: On 20-predictor data sets ranging from 50K to 150K rows, in-database ore.odmNB improved data scoring performance by a factor of 118x to 418x, while the full scenario B execution time yielded a 13x performance improvement, as depicted in Figure 3B. Using a non-parallel execution of ore.odmNB, we see the cross-over point where ore.odmNB overtakes R; more importantly, the slope of the trend points to the greater scalability of ORE, as depicted in Figure 3A for the full scenario A execution time.


Figure 3. Overall naiveBayes and ore.odmNB execution time for model building (A) and data scoring (B)
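
The calls being compared look roughly like this sketch (the data set, target, and object names are illustrative; dat is an in-memory data.frame and DAT is the corresponding ore.frame):

  library(e1071)
  nb.r <- naiveBayes(TARGET ~ ., data = dat)     # open-source R, runs in client memory
  prd.r <- predict(nb.r, newdata = dat)

  nb.ore <- ore.odmNB(TARGET ~ ., DAT)           # in-database ODM Naive Bayes
  prd.ore <- predict(nb.ore, DAT)                # scoring also runs in the database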

K-Means clustering: Using 6 numeric columns from the ONTIME airline data set ranging from 1 million to 1 billion rows, we compare in-database ore.odmKMeans with R kmeans run through embedded R execution with ore.tableApply. At 100 million rows, ore.odmKMeans demonstrates better performance than kmeans, and it remains scalable at 1 billion rows. The performance results depicted in Figure 4 use a log-log plot. The legend shows the function invoked and corresponding parameters, using a subset of the ONTIME data set, d. While ore.odmKMeans scales linearly with the number of rows, R kmeans does not. Further, R kmeans did not complete at 1 billion rows.

Figure 4: K-Means clustering model building on Big Data
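
The two approaches being compared look roughly like the following sketch, where d is an ore.frame over the six numeric ONTIME columns and the number of centers is illustrative:

  # In-database clustering with ODM K-Means
  km.ore <- ore.odmKMeans(~ ., d, num.centers = 3)

  # Open-source kmeans executed at the database server via embedded R execution
  km.r <- ore.tableApply(d, function(dat) kmeans(na.omit(dat), centers = 3))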

OREpredict

With OREpredict, R users can also benefit from in-database scoring of R models. This becomes evident not only when considering the full “round trip” of pulling data from the database, scoring in R, and writing data back to the database, but also for the scoring itself.

Consider an lm model built using a dataset with 4-predictors and 1 million to 5 million rows. Pulling data from the database, scoring, and writing the results back to the database shows a pure R-based approach taking 4x - 9x longer than in-database scoring using ore.predict with that same R model. Notice in Figure 5 that the slope of the trend is dramatically less for ore.predict than predict, indicating greater scalability. When considering the scoring time only, ore.predict was 20x faster than predict in R for 5M rows. In ORE 1.3, ore.predict is recommended and will provide speedup over R for numeric predictors.

Figure 5. Overall lm execution time using R predict vs. ORE ore.predict
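
A minimal sketch of this pattern, assuming r.mod is an lm model already built with open-source R and DAT is an illustrative ore.frame containing the model's predictors:

  prd <- ore.predict(r.mod, DAT)     # scoring executes inside Oracle Database
  res <- cbind(DAT, PRED = prd)      # the result is still an ore.frame; no data is pulled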

For rpart, we see a similar result. On a 20-predictor, 1 million to 5 million row data set, ore.predict resulted in 6x – 7x faster execution. In Figure 6, we again see that the slope of the trend is dramatically less for ore.predict than for predict, indicating greater scalability. When considering the scoring time only, ore.predict was 123x faster than predict in R for 5 million rows.

Figure 6. Overall rpart Execution Time using R predict vs. ORE ore.predict

This scenario is summarized in Figure 7. In the client R engine, we have the ORE packages installed. There, we invoke the pure R-based script, which requires pulling data from the database. We also invoke the ORE-based script that keeps the data in the database.

Figure 7. Summary of OREpredict performance gains

To use a real-world data set, we again consider the ONTIME airline data set with 123.5 million rows. We build lm models with varying numbers of coefficients, derived by converting categorical data to multiple columns. The variable p corresponds to the number of coefficients resulting from the transformed formula and depends on the number of distinct values in each column. For example, DAYOFWEEK has 7 values, so with DEPDELAY, p=9. In Figure 8, you see that using an lm model with embedded R for a single row (e.g., one-off or real-time scoring) has much more overhead (as expected, given that an R engine is being started) compared to ore.predict, which shows subsecond response time through 40 coefficients (0.54 seconds), and 1.1 seconds at 106 coefficients. Here are the formulas describing the columns included in the analysis:

  • ARRDELAY ~ DEPDELAY (p=2)
  • ARRDELAY ~ DEPDELAY + DAYOFWEEK (p=8)
  • ARRDELAY ~ DEPDELAY + DAYOFWEEK + MONTH (p=19)
  • ARRDELAY ~ DEPDELAY + DAYOFWEEK + MONTH + YEAR (p=40)
  • ARRDELAY ~ DEPDELAY + DAYOFWEEK + MONTH + YEAR (p=106)


Figure 8. Comparing performance of ore.predict with Embedded R Execution for lm

Compare this with scoring the entire ONTIME table of 123.5 million rows. We see that ore.predict outperforms embedded R until about 80 coefficients, when embedded R becomes the preferred choice. 

Data Movement between R and Database: Embedded R Execution

One advantage of R is its community and CRAN packages. The goal for Oracle R Enterprise with CRAN packages is to enable reuse of these packages while:

  • Leveraging the parallelization and efficient data processing capabilities of Oracle Database
  • Minimizing data transfer and communication overhead between R and the database
  • Leveraging R as a programming language for writing custom analytics

There are three ways in which we’ll explore the performance of pulling data (each is sketched in code after the list):

1) Using ore.pull at a separate client R engine to pull data from the database

2) Using Embedded R Execution and ore.pull within an embedded R script from a database-spawned R engine

3) Using Embedded R Execution functions for data-parallelism and task-parallelism to pass database data to the embedded R script via function parameter
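
Here is a rough sketch of the three approaches, assuming ONTIME_S is available as an ore.frame in the client session (the table name used inside the embedded function is likewise illustrative):

  # 1) Client-side pull: data moves from the database to the client R engine
  dat <- ore.pull(ONTIME_S)

  # 2) Server-side pull inside an embedded R function; ore.connect = TRUE gives the
  #    database-spawned R engine its own connection
  res <- ore.doEval(function() {
           ore.sync(table = "ONTIME_S")
           nrow(ore.pull(ore.get("ONTIME_S")))
         }, ore.connect = TRUE)

  # 3) The embedded R execution framework passes the data directly as the
  #    function's first argument
  res <- ore.tableApply(ONTIME_S, function(dat) nrow(dat))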

With ORE Embedded R Execution (ERE), the database delivers data-parallelism and task-parallelism, and reduces data access latency due to optimized data transfers into R. Essentially, R runs under the control of the database. As illustrated in Figure 9, loading data at the database server is 12x faster than loading data from the database to a separate R client. Embedded R Execution also provides a 13x advantage when using ore.pull invoked at the database server within an R closure (function) compared with a separate R client. The data load from database to R client is depicted as 1x – the baseline for comparison with embedded R execution data loading.

Figure 9. Summary of Embedded R Execution data load performance gains

Data transfer rates are displayed in Figure 10, for a table with 11 columns and 5 million to 15 million rows of data. Loading data via ORE embedded R execution using server-side ore.pull or through the framework with, e.g., ore.tableApply (one of the embedded R execution functions) is dramatically faster than a non-local client load via ore.pull. The numbers shown reflect MB/sec data transfer rates, so a bigger bar is better!

Figure 10. Data load and write execution time with 11 columns

While this is impressive, let’s expand our data up to 1 billion rows. To create our 1 billion row data set (1.112 billion rows), we duplicated the 123.5 million row ONTIME data set 9 times, replacing rows with year 1987 with years 2010 through 2033, and selected 6 integer columns (YEAR, MONTH, DAYOFMONTH, ARRDELAY, DEPDELAY, DISTANCE) with a bitmap index on the columns (YEAR, MONTH, DAYOFMONTH). The full data set weighs in at ~53 GB.

In Figure 11, we see linear scalability for loading data into the client R engine. Times range from 2.8 seconds for 1 million rows to 2700 seconds for 1 billion rows. While your typical user may not need to load 1 billion rows into R memory, this graph demonstrates the feasibility of doing so.

Figure 11. Client Load of Data via ore.pull for Big Data

In Figure 12, we look at how the degree of parallelism (DOP) affects data load times involving ore.rowApply. This test addresses the question of how fast ORE can load 1 billion rows, e.g., when scoring data. The degree of parallelism corresponds to the number of R engines that are spawned for concurrent execution at the database server. The number of chunks the data is divided into is 1 for a single degree of parallelism, and 10 times the DOP for the remaining tests. For a DOP of 160, the data was divided into 1600 chunks, i.e., 160 R engines were spawned, each processing 10 chunks. The graph on the left shows that execution times for the 1 billion row data set improve through a DOP of 160. As expected, at some point the overhead of spawning additional R engines and partitioning the data outweighs the benefit. At its best time, processing 1 billion rows took 43 seconds.

Figure 12. Loading 1 billion rows via ore.rowApply at varying degrees of parallelism

In the second graph of Figure 12, we contrast execution time for the “sweet spot” identified in the previous graph with varying numbers of rows. Using this DOP of 160, with 1600 chunks of data, we see very little increase in execution time through 100 million rows (between 6.4 and 8.5 seconds). While 1 billion rows took significantly longer, it still required only 43 seconds.

We can also consider data writes at this scale. In Figure 13, we depict linear scalability from 1 million through 1 billion rows using the ore.create function to create database tables from R data. Actual times ranged from 2.6 seconds to roughly 2600 seconds.

Figure 13. Data Write using ore.create for Big Data
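
The write path itself is a single call; a minimal sketch, with an illustrative data.frame and table name:

  ore.create(scores.df, table = "ONTIME_SCORES")   # creates a database table from R data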

ORE supports data-parallelism to enable, e.g., building predictive models in parallel on partitions of the data. Consider a marketing firm that micro-segments customers and builds predictive models on each segment. ORE embedded R execution automatically partitions the data, spawns R engines according to the degree of parallelism specified, and executes the specified user R function on each partition. To address how efficiently ore.groupApply can process data, Figure 14 shows the total execution time to process the 123.5M rows from the ONTIME data with varying numbers of columns. The figure shows that ore.groupApply scales linearly as the number of columns increases. Three partitioning columns were selected based on their number of distinct values: TAILNUM (12,861), DEST (352), and UNIQUECARRIER (29). For UNIQUECARRIER, the run with all columns (a total of 29 columns) could not be completed, since partitioning into only 29 categories resulted in per-group data too large for a single R engine.

Figure 14. Processing time for 123.5M rows via ore.groupApply
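
A sketch of the micro-segmentation pattern with ore.groupApply, building one lm model per destination airport (the formula and partitioning column are illustrative):

  mods <- ore.groupApply(
    ONTIME_S,
    INDEX = ONTIME_S$DEST,                    # one data partition per DEST value
    function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat),
    parallel = TRUE)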

ORE also supports row-parallelism, where the same embedded R function can be invoked on chunks of rows. As with ore.groupApply, depending on the specified degree of parallelism, a different chunk of rows is submitted to a dynamically spawned, database server-side R engine. Figure 15 depicts near linear execution time to process the 123.5M rows from ONTIME with varying numbers of columns. The chunk size can be specified; however, testing three chunk sizes (10K, 50K, and 100K rows) showed no significant difference in overall execution time, hence a single line is graphed.

Figure 15. Processing time for 123.5M rows via ore.rowApply for chunk sizes 10k-100k
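
A sketch of chunked, row-parallel scoring with ore.rowApply, reusing the "myDatastore" model saved in the earlier datastore post (the chunk size and degree of parallelism are illustrative):

  res <- ore.rowApply(
    ONTIME_S[, c("ARRDELAY", "DISTANCE", "DEPDELAY")],
    function(dat, datastore_name) {
      ore.load(datastore_name)                # restore the lm model saved as "mod"
      cbind(dat, PRED = predict(mod, newdata = dat))
    },
    datastore_name = "myDatastore",
    rows = 100000,                            # rows per chunk
    parallel = 160,                           # R engines spawned at the database server
    ore.connect = TRUE)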

All tests were performed on an Exadata X3-8. Except as noted, the client R session and database were on the same machine, so network latency for data reads and writes was minimal. Over a LAN or WAN, the benefits of in-database execution and ORE will be even more dramatic.

Friday Apr 13, 2012

Oracle R Enterprise 1.1 Download Available

Oracle just released the latest update to Oracle R Enterprise, version 1.1. This release includes the Oracle R Distribution (based on open source R, version 2.13.2), an improved server installation, and much more.  The key new features include:

  • Extended Server Support: New support for Windows 32 and 64-bit server components, as well as continuing support for Linux 64-bit server components
  • Improved Installation: Linux 64-bit server installation now provides robust status updates and prerequisite checks
  • Performance Improvements: Improved performance for embedded R script execution calculations

In addition, the updated ROracle package, which is used with Oracle R Enterprise, now reads date data by conversion to character strings.

We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop. As always, we welcome comments and questions on the Oracle R Forum.



Friday Feb 17, 2012

Introduction to the ORE Transparency Layer

This Oracle R Enterprise (ORE) transparency layer tutorial is the first in a series to help users get started using ORE. Oracle R Enterprise  is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

Oracle R Enterprise (ORE) implements a transparency layer on top of the R engine that allows R computations to be executed in Oracle Database from the R environment. A mapping is established between a special R object called an ORE frame and a corresponding database table or view, allowing a wide range of R functionality to be executed on these mapped objects.  The overloaded functions in the ORE packages generate SQL statements in the background, export the expensive computations to the database for execution, and return results to the R environment.  

Here's a quick overview and an example using the airline on-time performance data from Research and Innovative Technology Administration (RITA), which  coordinates the U.S. Department of Transportation (DOT) research programs. The data consists of 123 million records of U.S. domestic commercial flights between 1987 and 2008. 

Configuring the R Environment

We begin by configuring the local R environment by executing a few simple commands.  These commands may be saved in the .Rprofile file and executed during the initialization of each R session or typed directly into the R session.

Load the ORE library and connect to Oracle Database:

   >  library(ORE)
   >  ore.connect("USER", "SID", "HOST", "PASSWORD")

Syncing with the database syncs the metadata in the database schema with the R environment:

   >  ore.sync("SCHEMA")
   >  ore.ls()

Attaching the schema provides access to its views and tables so they can be manipulated from the local R session:

   >  ore.attach("SCHEMA")

Data Exploration

Travelers in the western United States may be interested in the mean arrival delay for the San Francisco, Los Angeles, and Seattle airports. Executing this R code produces a boxplot representing the delay profile for these selected airports:

   > delay <- ontime$ARRDELAY[ontime$DEST %in% c("SEA", "SFO", "LAX")]
   > dest <- ontime$DEST[ontime$DEST %in% c("SEA", "SFO", "LAX")]
   > bd <- split(delay, dest)
   > boxplot(bd, notch = TRUE, col = "gold", cex = 0.5,
             outline = FALSE, horizontal = TRUE, yaxt = "n",
             main = "Arrival Delays by Destination",
             ylab = "Destination Airport", xlab = "Delay (minutes)")
   > labels <- levels(dest)
   > text(par("usr")[1] - 3, 1:length(labels), srt = 0, adj = 1,
          labels = labels, xpd = TRUE, cex = 0.75)


The delay profile shows that, on average, arrival delays are greater in San Francisco than in Seattle and Los Angeles.



With this information, we proceed with fitting a linear model where arrival delay is modeled as a linear function of departure delay and destination. Fitting models in ORE requires minimal modification to traditional R syntax. Simply replace R's linear modeling function lm with ore.lm, and the remaining syntax is transparent:

   > mod.ore <- ore.lm(ARRDELAY ~ DEST + DEPDELAY, data = ontime)
   > summary(mod.ore)

Many R functions compute and store more information than they report by default. Users can easily save these results to an object and extract the components they need. This allows the output of one function to be used as the input to another - a very powerful feature of the R programming environment.  We will take advantage of this functionality by generating predictions from the linear model built in the database:

   > newdat <- ontime[, c("ARRDELAY", "DEST", "DEPDELAY")]
   > newdat.sub <- subset(newdat, !(is.na(ARRDELAY) | is.na(DEST) | is.na(DEPDELAY)))
   > prd.ore <- predict(mod.ore, newdata = newdat.sub)
   > res.ore <- cbind(newdat.sub, PRED = prd.ore)


To complete our session, we may detach the schema if desired, and terminate the database connection:

    > ore.detach("SCHEMA")
    > ore.disconnect()

This is a small subset of the functionality available in the ORE transparency layer. The R functions overloaded in the transparency layer include:

    data type assignment and verification
    distribution testing
    mathematical transformations
    statistical summaries
    arithmetic operators
    comparison operators
    logical operators
    set operations
    string manipulations
    data manipulation
    graphics
    matrix operations
    gamma and Bessel functions
    hypothesis testing
    regression 

 Summary

Using ORE objects, users transparently remain in the R language, requiring minimal modifications to their existing R scripts.  Although supported, it's not necessary to pull data from the database into R.  This eliminates the need to manipulate memory-bound R objects on the user's desktop system.  R users may access open-source R packages containing many standard and cutting-edge routines for data analysis.

To learn more about ORE offerings, including statistics and modeling features and advanced topics like the Oracle R Connector for Hadoop (ORCH), view the documentation and training materials on our product page. To encourage collaboration, we provide a discussion forum for ORE topics.






Thursday Feb 02, 2012

Announcing Oracle R Enterprise 1.0

Analyzing huge data sets presents a challenging opportunity for IT decision makers, driven by the need to balance the maintenance and support of existing IT infrastructure with the need to analyze rapidly growing data stores. In many cases, processing this data requires a fresh approach because traditional techniques fail when applied to massive data sets. To extract immediate value from big data, we need tools that efficiently access, organize, analyze, and maintain a variety of data types.

Oracle R Enterprise (ORE), a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition, emerges as the clear solution to these challenges. ORE integrates the popular open-source R statistical programming environment with Oracle Database 11g, Oracle Exadata and the Oracle Big Data Appliance, delivering enterprise-level analytics based on R scripts and parallelized, in-database modeling.

How do R and Oracle R Enterprise work together?

The powerful R programming environment enables the creation of sophisticated graphics, statistical analyses, and simulations. It contains a vast set of built-in functions that may be extended to build custom statistical packages. The R engine is limited in capacity and performance for large data, but with Oracle R Enterprise, R users bypass these constraints by leveraging the database as the analytics engine directly from their R session.

The components that support Oracle R Enterprise include:

1. The Oracle R Enterprise transparency layer - a collection of R packages with functions to connect to Oracle Database and use R functionality in Oracle Database. This enables R users to work with data too large to fit into the memory of a user's desktop system, and leverage the scalable Oracle Database as a computational engine.

2. The Oracle statistics engine - a collection of statistical functions and procedures corresponding to commonly-used statistical libraries. The statistics engine packages also execute in Oracle Database.

3. SQL extensions supporting embedded R execution through the database on the database server. R users can execute R closures (functions) using an R or SQL API, while taking advantage of data parallelism. Using the SQL API for embedded R execution, sophisticated R graphics and results can be exposed in OBIEE dashboards and BI Publisher documents.

4. Oracle R Connector for Hadoop (ORCH) - an R package that interfaces with the Hadoop Distributed File System (HDFS) and enables executing MapReduce jobs. ORCH enables R users to work directly with an Oracle Hadoop cluster, executing computations from the R environment, written in the R language and working on data resident in HDFS, Oracle Database, or local files.

Using a simple R workflow, R users can seamlessly utilize the parallel processing architecture of ORE and ORCH for scalability and better performance. Analytics and reporting tasks are moved to the Oracle Database, eliminating long approval chains for data movement and dramatically increasing processing speed. R users are not required to learn SQL because the R-to-SQL translation is shipped to the database and processed behind the scenes. The significant benefits to IT include improved data security, data maintenance and audit compliance practices.

We’re proud to announce Oracle R Enterprise 1.0 and look forward to your comments. To learn more about ORE, visit our product page.

About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.
