Friday Jan 03, 2014

Invoking R scripts via Oracle Database: Theme and Variation, Part 2

In part 1 of Invoking R scripts via Oracle Database: Theme and Variation, we introduced features of Oracle R Enterprise embedded R execution, focusing on the functions ore.doEval and rqEval. In this blog post, we’ll cover the next in our theme and variation series involving ore.tableApply and rqTableEval.

The variation on embedded R execution for ore.tableApply involves passing an ore.frame to the function such that the first parameter of your embedded R function receives a data.frame. The rqTableEval function in SQL allows users to specify a data cursor to be delivered to your embedded R function as a data.frame.

Let’s look at a few examples.


R API

In the following example, we’re using ore.tableApply to build a Naïve Bayes model on the iris data set. Naïve Bayes is found in the e1071 package, which must be installed on both the client and database server machine R engines.

library(e1071)
mod <- ore.tableApply(
ore.push(iris),
function(dat) {
library(e1071)
dat$Species <- as.factor(dat$Species)
naiveBayes(Species ~ ., dat)
})
class(mod)
mod

A few points to highlight:
• To use the CRAN package e1071 on the client, we first load the library.
• The iris data set is pushed to the database to create an ore.frame as the first argument to ore.tableApply. This would normally refer to an ore.frame that refers to a table that exists in Oracle Database. If not obvious, note that we could have previously assigned dat <- ore.push(iris) and passed dat as the argument as well.
• The embedded R function is supplied as the second argument to ore.tableApply as a function object. Recall from Part 1 that we could have alternatively assigned this function to a variable and passed the variable as an argument, or stored the function in the R script repository and passed the argument FUN.NAME with the assigned function name.
• The user-defined embedded R function takes dat as its first argument which will contain a data.frame derived from the ore.frame supplied.
• The model itself is returned from the function.
• The result of the ore.tableApply execution will be an ore.object.

SQL API

We can invoke the function through the SQL API by storing the function in the R script repository. Recall that the call to sys.rqScriptCreate must be wrapped in a BEGIN-END PL/SQL block.


begin
sys.rqScriptCreate('myNaiveBayesModel',
'function(dat) {
library(e1071)
dat$Species <- as.factor(dat$Species)
naiveBayes(Species ~ ., dat)
}');
end;
/

Invoking the function myNaiveBayesModel occurs in a SQL SELECT statement as shown below. The first argument to rqTableEval specifies a cursor that retrieves the IRIS table. Note that the IRIS table could have been created earlier using ore.create(iris,"IRIS"). The second argument, NULL, indicates that no arguments are supplied to the function.

The function returns an R object of type naiveBayes, but as a serialized object that is chunked into a table. This likely is not useful to most users.


select *
from table(rqTableEval(cursor(select * from IRIS), NULL, NULL, 'myNaiveBayesModel'));

If we want to keep the model in a more usable form, we can store it in an ORE datastore in Oracle Database. For this, we require a change to the user-defined R function and the SQL invocation.


begin
sys.rqScriptCreate('myNaiveBayesModel',
'function(dat) {
library(e1071)
dat$Species <- as.factor(dat$Species)
mod <- naiveBayes(Species ~ ., dat)
ore.save(mod, name="myNaiveBayesDatastore")
TRUE

}');
end;
/
select *
from table(rqTableEval(cursor(select * from IRIS), cursor(select 1 as "ore.connect" from dual), 'XML', 'myNaiveBayesModel'));

Highlighted in red, we’ve stored the model in the datastore named ‘myNaiveBayesDatastore’. We’ve also returned TRUE to have a simple value that can show up as the result of the function execution. In the SQL query, we changed the third parameter to ‘XML’ to return an XML string containing “TRUE”. The name of the datastore could be passed as an argument as follows:


begin
sys.rqScriptCreate('myNaiveBayesModel',
'function(dat, datastoreName) {
library(e1071)
dat$Species <- as.factor(dat$Species)
mod <- naiveBayes(Species ~ ., dat)
ore.save(mod, name=datastoreName)
TRUE
}');
end;
/
select *
from table(rqTableEval(
cursor(select * from IRIS),
cursor(select 'myNaiveBayesDatastore' "datastoreName", 1 as "ore.connect" from dual),
'XML',
'myNaiveBayesModel'));

Memory considerations with ore.tableApply and rqTableEval

The input data provided as the first argument to a user-defined R function invoked using ore.tableApply or rqTableEval is physically being moved from Oracle Database to the database server R engine. It’s important to realize that R’s memory limitations still apply. If your database server machine has 32 GB RAM and your data table is 64 GB, ORE will not be able to load the data into the R function’s dat argument.
You may see errors like:


Error : vector memory exhausted (limit reached)

or

ORA-28579: network error during callback from external procedure agent

See the blog post on Managing Memory Limits and Configuring Exadata for Embedded R Execution where we discuss setting memory limits for the database server R engine. This can be necessary to load reasonably sized data tables.

Parallelism

As with ore.doEval / rqEval, user-defined R functions invoked using ore.tableApply / rqTableEval are not executed in parallel, i.e., a single R engine is used to execute the user-defined R function.

Invoking certain ORE advanced analytics functions

In the current ORE release, some advanced analytics functions, like ore.lm or ore.glm, which use the embedded R execution framework, cannot be used within other embedded R calls such as ore.doEval / rqEval and ore.tableApply / rqTableEval.

You can expect to see an error like the following:


ORA-28580: recursive external procedures are not supported

In the next post in this series, I’ll discuss ore.groupApply and the corresponding definitions required for SQL execution, since there is no rqGroupApply function. I’ll also cover the relationship of various “group apply” constructs to map-reduce paradigm.

Monday Jun 10, 2013

Bringing R to the Enterprise - new white paper available

Check out this new white paper entitled "Bringing R to the Enterprise -  A Familiar R Environment with Enterprise-Caliber Performance, Scalability, and Security."

In this white paper, we begin with "Beyond the Laptop" exploring the ability to run R code in the database, working with CRAN packages at the database server, operationalizing R analytics, and leveraging Hadoop from the comfort of the R language and environment.

Excerpt: "Oracle Advanced Analytics and Oracle R Connector for Hadoop combine the advantages of R with the power and scalability of Oracle Database and Hadoop. R programs and libraries can be used in conjunction with these database assets to process large amounts of data in a secure environment. Customers can build statistical models and execute them against local data stores as well as run R commands and scripts against data stored in a secure corporate database."

The white paper continues with three use cases involving Oracle Database and Hadoop: analyzing credit risk, detecting fraud, and preventing customer churn.  The conclusion: providing analytics for the enterprise based on the R environment is here!


Tuesday Oct 02, 2012

Oracle R Enterprise Tutorial Series on Oracle Learning Library

Oracle Server Technologies Curriculum has just released the Oracle R Enterprise Tutorial Series, which is publicly available on Oracle Learning Library (OLL). This 8 part interactive lecture series with review sessions covers Oracle R Enterprise 1.1 and an introduction to Oracle R Connector for Hadoop 1.1:
  • Introducing Oracle R Enterprise
  • Getting Started with ORE
  • R Language Basics
  • Producing Graphs in R
  • The ORE Transparency Layer
  • ORE Embedded R Scripts: R Interface
  • ORE Embedded R Scripts: SQL Interface
  • Using the Oracle R Connector for Hadoop

We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop.  As always, we welcome comments and questions on the Oracle R Forum.

Friday Apr 13, 2012

Oracle R Enterprise 1.1 Download Available

Oracle just released the latest update to Oracle R Enterprise, version 1.1. This release includes the Oracle R Distribution (based on open source R, version 2.13.2), an improved server installation, and much more.  The key new features include:

  • Extended Server Support: New support for Windows 32 and 64-bit server components, as well as continuing support for Linux 64-bit server components
  • Improved Installation: Linux 64-bit server installation now provides robust status updates and prerequisite checks
  • Performance Improvements: Improved performance for embedded R script execution calculations

In addition, the updated ROracle package, which is used with Oracle R Enterprise, now reads date data by conversion to character strings.

We encourage you download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop.  As always, we welcome comments and questions on the Oracle R Forum.



Wednesday Apr 04, 2012

New Release of ROracle posted to CRAN

Oracle recently updated ROracle to version 1.1-2 on CRAN with enhancements and bug fixes. The major enhancements include the introduction of support for Oracle Wallet Manager and datetime and interval types. 

Oracle Wallet support in ROracle allows users to manage public key security from the client R session. Oracle Wallet allows passwords to be stored and read by Oracle Database, allowing safe storage of database login credentials. In addition, we added support for datetime and interval types when selecting data, which expands ROracle's support for date data. 

See the ROracle NEWS for the complete list of updates.

We encourage ROracle users to post questions and provide feedback on the Oracle R Forum.

In addition to being a high performance database interface to Oracle Database from R for general use, ROracle supports database access for Oracle R Enterprise.

Monday Apr 02, 2012

Introduction to ORE Embedded R Script Execution

This Oracle R Enterprise (ORE) tutorial, on embedded R execution, is the third in a series to help users get started using ORE. See these links for the first tutorial on the transparency layer and second tutorial on the statistics engine. Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

Embedded R Execution refers to the ability to execute an R script at the database server, which provides several benefits: spawning multiple R engines in parallel for data-parallel operations, more efficient data transfer between Oracle Database and the R engine, leverage a likely more powerful server with more CPUs and greater RAM, schedule automated jobs, and take advantage of open source R packages at the database server. Data aggregates are computed in parallel, significantly reducing computation time, without requiring sophisticated configuration steps.

ORE provides two interfaces for embedded R execution: one for R and one for SQL. The R interface enables interactive execution at the database server from the client R engine, e.g., your laptop. It also has transparency aspects for passing R objects and returning R objects. In the R interface, the ore.doEval schedules execution of the R code with the database-embedded R engine and returns the results back to the desktop for continued analysis. User-defined R functions can run in parallel, either on each row, sets of rows, or on each group of rows given a grouping column. The first two cases are covered by ore.rowApply, the second by the ore.groupApply function. ore.indexApply provides parallel simulations capability by invoking the script the number of times specified by the user.  The R interface returns results to the client as R objects that can be passed as arguments to R functions. 

The SQL interface enables interactive execution from any SQL interface, like SQL*Plus or SQL Developer, but it also enables R scripts to be included in production database-based systems. To enable execution of an R script in the SQL interface, ORE provides variants of ore.doEval, ore.groupApply and ore.indexApply in SQL.  These functions are rqEval, rqTableEval, rqRowEval and rqGroupEval. The SQL interface allows for storing results directly in the database.

 R Interface Function (ore.*)
 SQL Interface Function (rq*)
 Purpose
 ore.doEval  rqEval  Invoke stand-alone R script
 ore.tableApply  rqTableEval  Invoke R script with full table input
 ore.rowApply  rqRowEval  Invoke R script one row at a time, or multiple rows in "chunks"
 ore.groupApply  rqGroupEval  Invoke R script on data indexed by grouping column
 ore.indexApply
N/A
 Invoke R script N times

In addition, the SQL interface enables R results to be stored in a database table for subsequent use in another invocation (think data mining model building and scoring). It enables returning structured R results in a table. Results can also be returned as XML. The XML interface enables both structured data, such as data frames, R objects, and graphs to be returned.  The XML capability allows R graphs and structured results to be displayed in Oracle BI Publisher documents and OBIEE dashboards.

Embedded R Execution: R Interface

The following example uses the function ore.groupApply, one of several embedded R execution functions, to illustrate how R users can achieve data parallelism through the database. This example also illustrates that embedded R execution enables the use of open source packages. Here we see the use of the R package biglm.

We specify a column on which to partition the data. Each partition of the data is provided to the function through the first argument, in this case the function variable dat. There is no need to send data from the database to R - the R function is sent to the database, which processes them in parallel. Output results may be stored directly in the database, or may be downloaded to R. Only when we want to see the results of these models do we need to retrieve them into R memory and perform, for example, the summary function.

modList <- ore.groupApply(
   X=ONTIME,
   INDEX=ONTIME$DEST,
   function(dat) {
     library(biglm)
     biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
   });

modList_local <- ore.pull(modList)

> summary(modList_local$BOS)
Large data regression model: biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
Sample size =  3928 
               Coef    (95%     CI)     SE      p
(Intercept)  0.0638 -0.7418  0.8693 0.4028 0.8742
DISTANCE    -0.0014 -0.0021 -0.0006 0.0004 0.0002
DEPDELAY     1.0552  1.0373  1.0731 0.0090 0.0000

Embedded R Execution: SQL Interface 

Whereas the previous example showed how to use embedded R execution from the R environment, we can also invoke R scripts from SQL. This next example illustrates returning a data frame from results computed in Oracle Database. We first create an R script in the database R script repository. The script is defined as a function that creates a vector of 10 elements, and returns a data frame with those elements in one column and those elements divided by 100 in a second column.

Once the script is created, we can invoke it through SQL. One of the SQL embedded R executions table functions available is rqEval. The first argument is NULL since we have no parameters to pass to the function. The second argument describes the structure of the result. Any valid SQL query that captures the name and type of resulting columns will suffice. The third argument is the name of the script to execute.

begin
  sys.rqScriptCreate('Example1',
'function() {
   ID <- 1:10
   res <- data.frame(ID = ID, RES = ID / 100)
   res}');
end;
/
select *
  from table(rqEval(NULL,
                    'select 1 id, 1 res from dual',
                    'Example1'));

The result is a data frame:



Embedded R scripts may generate any valid R object, including graphs. In addition, embedded R execution enables returning results from an R script as an XML
string. Consider the following example that creates a vector from the integers 1 to 10, plots 100 random normal points in a graph, and then returns the vector. After creating the script in the database R script repository, we invoke the script using rqEval, but instead of specifying the form of the result in a SQL query, we specify XML.

begin
  sys.rqScriptCreate('Example6',
 'function(){
            res <- 1:10
            plot( 1:100, rnorm(100), pch = 21,
                  bg = "red", cex = 2 )
            res
            }');
end;
/
select value
from   table(rqEval( NULL,'XML','Example6'));

While the actual graph looks like the following, the output from this query will be an XML string.



In the execution results shown below, the VALUE column returned is a string that contains first the structured data in XML format. Notice the numbers 1 through
10 set off by the <value> tags. This is followed by the image in PNG base 64 representation. This type of output can be consumed by Oracle Business Intelligence Publisher (BIP) to produce documents with R-generated graphs and structured content.  Oracle BIP templates can also be used to expose R-generated content in Oracle Business Intelligence Enterprise Edition (OBIEE) web browser-based dashboards.

You can see additional examples using embedded R execution in action in the Oracle Enterprise Training, session 4, Embedded R Script Execution. These example will run as written in R 2.13.2 after installing Oracle R Enterprise. We'll be posting more examples using embedded R script execution in the coming months. In the meantime, questions are always welcome on the Oracle R Forum.


Thursday Feb 23, 2012

Introduction to the ORE Statistics Engine

This Oracle R Enterprise (ORE) statistics engine tutorial is the second in a series to help users get started using ORE. (See the first tutorial on the ORE transparency layer here). Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

The ORE statistics engine is a database library consisting of native database statistical functions and new functionality added specifically for ORE. ORE intercepts R functions and pushes their execution to Oracle Database to perform computationally expensive transformations and computations. This allows R users to use the R client directly against data stored in Oracle Database, greatly increasing R's scalability and performance.

A variety of useful and common routines are available in the ORE statistics engine:

Significance Tests

Chi-square, McNemar, Bowker
Simple and weighted kappas
Cochran-Mantel-Haenzel correlation
Cramer's V
Binomial, KS, t, F, Wilcox

Distribution Functions
Beta distribution
Binomial distribution
Cauchy distribution
Chi-square distribution
Exponential distribution
F-distribution
Gamma distribution
Geometric distribution
Log Normal distribution
Logistic distribution
Negative Binomial distribution
Normal distribution
Poisson distribution
Sign Rank distribution
Student t distribution
Uniform distribution
Weibull distribution
Density Function
Probability Function
Quantile
distribution

Other Functions

Gamma function
Natural logarithm of the Gamma function
Digamma function
Trigamma function
Error function
Complementary error function

Base SAS Equivalents
Freq, Summary, Sort
Rank, Corr, Univariate

These R functions are overridden, such that when presented with ore.frame data, the function generates a SQL query that is transparently submitted to Oracle Database. This is the case for much of the base R and stats functionality, providing the benefit of employing the database as the computational engine and seamlessly breaking through R's memory barrier.

In this post, we introduce a simple yet typical data analysis using functions from the ORE transparency layer. We begin configuring the local R environment by executing a few simple commands.

Load the ORE library and connect to Oracle Database:

> library(ORE)
>
ore.connect("USER", "SID", "HOST", "PASSWORD")

Invoking ore.sync synchronizes the the metadata in the database schema with the R environment:

> ore.sync("SCHEMA")
>
ore.ls()


Attaching the database schema provides access to views and tables so they can be manipulated from a local R session:

>
ore.attach("SCHEMA")

We use the ONTIME_S data set typically installed in Oracle Database when ORE is installed. ONTIME_S is a subset of the airline on-time performance data (from Research and Innovative Technology Administration (RITA), which coordinates the U.S. Department of Transportation (DOT) research programs.  We're providing a relatively large sample data set (220K rows), but these examples could be performed in ORE on the full data set, which contains 123 millions rows and requires 12 GB disk space . This data set is significantly larger than R can process on it's own.

ONTIME_S is a database-resident table with metadata on the R side, represented by an ore.frame object.

> class(ONTIME_S)
[1] "ore.frame" attr(,"package")
[1] "OREbase"

We focus on two columns of data: one numeric column, DEPDELAY (actual departure delay in minutes), and a categorical column, ORIGIN (airport of origin).

A typical place to begin, is, of course, looking at the structure of selected variables. We call the function summary to obtain summary statistics on the variable measuring departure delay.

> with(ONTIME_S, summary(DEPDELAY))
      Min.   1st Qu  Median  Mean  3rd Qu. Max.     NA's 
   -75.000  -2.000  0.000   8.042   6.00  1438.00  3860.00 

To investigate the skewness of the departure delay variable, we use the quantile function to calculate quantiles at 10, 20, 30,....90%.

> with(ONTIME_S, quantile(DEPDELAY, seq(0.1, 0.9, by = 0.1), na.rm = TRUE)) 10% 20% 30% 40% 50% 60% 70% 80% 90% -4  -2  -1  0  0   1  4  10  26

This shows us that the 10% quantile (-4) is 4 units away from the median, while the 90% quantile (26) is 26 units from the median.  For a symmetric distribution, the two quantiles would be about the same distance from the median.  A measure of the data spread is the interquartile range, the difference between the 25% and 75% quantile. To allow computations on the data, we remove missing values by setting the na.rm parameter to TRUE, as we did above for the quantile function.
> with(ONTIME_S, IQR(DEPDELAY, na.rm = TRUE))
[1] 8

Other measures of data spread available are the variance and standard deviation.

> with(ONTIME_S, var(DEPDELAY, na.rm = TRUE)) [1] 771.7207 > with(ONTIME_S, sd(DEPDELAY, na.rm = TRUE)) [1] 27.77986

Using in-database aggregation summaries, we can investigate the relationship between departure delay and origin a bit further. We use aggregate to calculate the mean departure delay for each airport of origin.  Results for the first five airports are displayed using the function head.

>
ONTIME.agg <- aggregate(ONTIME_S$DEPDELAY,
                     by = list(ONTIME_S$ORIGIN),
                     FUN = mean)
>
head(ONTIME.agg, 5)
  Group.1    x
1     ABE   216
2     ABI    29
3     ABQ  1392
4     ABY   9
5     ACK   2


Now that we have gained a basic impression and some insights into the ONTIME_S data, we may choose to view the data graphically.  For example, we may want to get a visual impression of the distribution of departure delay.  We use the hist function, which displays a histogram skewed on positive side, presumably because flights rarely leave early.

> with(ONTIME_S, hist(DEPDELAY, breaks = 100, col = rainbow(7), cex.lab = 0.8,
      main = "Distribution of Departure Delay", xlab = "Departure Delay (in minutes)"))

After analyzing the data through exploratory methods in ORE, we proceed to a possible next step: confirmatory statistics. Let's compute a Student's t-test using  the origin and departure delay variables we examined earlier.  The goal is to decide whether average departure delay of one airport is different from the average delay of another.

> with(ONTIME_S, t.test(DEPDELAY[ORIGIN == "SEA"], DEPDELAY[ORIGIN == "SFO"], conf.level = 0.95))

    Welch Two Sample t-test

data:  DEPDELAY[ORIGIN == "SEA"] and DEPDELAY[ORIGIN == "SFO"]
t = -1.8406, df = 7571.893, p-value = 0.06572
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.21038010  0.06960489
sample estimates:
mean of x  mean of y
 8.475415  9.545802


The output tells us that we calculated the unpaired t-test and gives us the value of the t-distributed statistic.

We may also use a Kolmogorov–Smirnov test to determine if the departure delay variable is from the same distribution as the cumulative distribution.

> with(ONTIME_S, ks.test(DEPDELAY, "pnorm", mean(DEPDELAY), sd(DEPDELAY)))

    One-sample Kolmogorov-Smirnov test

data:  DEPDELAY
D = 0.3497, p-value < 2.2e-16
alternative hypothesis: two-sided


At this point we could continue our data exploration by performing additional distribution tests, or proceed with the rich set of modeling and functionality ORE offers.  

ORE enables R users transparent access to data stored in Oracle Database while leveraging Oracle Database as a compute engine for scalability and high performance. We've only scratched the surface on ORE's statistical features - stay tuned for posts highlighting more advanced features of the statistical engine.

The Oracle R Enterprise User's Guide contains a number of examples demonstrating the functionality available in ORE. To view the documentation and training materials, visit our product pagePlease feel free to visit our discussion forum  and ask questions or provide comments about how we can help you and your team!






 

Friday Feb 17, 2012

Introduction to the ORE Transparency Layer

This Oracle R Enterprise (ORE) transparency layer tutorial is the first in a series to help users get started using ORE. Oracle R Enterprise  is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

Oracle R Enterprise (ORE) implements a transparency layer on top of the R engine that allows R computations to be executed in Oracle Database from the R environment. A mapping is established between a special R object called an ORE frame and a corresponding database table or view, allowing a wide range of R functionality to be executed on these mapped objects.  The overloaded functions in the ORE packages generate SQL statements in the background, export the expensive computations to the database for execution, and return results to the R environment.  

Here's a quick overview and an example using the airline on-time performance data from Research and Innovative Technology Administration (RITA), which  coordinates the U.S. Department of Transportation (DOT) research programs. The data consists of 123 million records of U.S. domestic commercial flights between 1987 and 2008. 

Configuring the R Environment

We begin by configuring the local R environment by executing a few simple commands.  These commands may be saved in the .Rprofile file and executed during the initialization of each R session or typed directly into the R session.

Load the ORE library and connect to Oracle Database:

   >  library(ORE)
   >  ore.connect("USER", "SID", "HOST", "PASSWORD")

Syncing with the database syncs the metadata in the database schema with the R environment:

   >  ore.sync("SCHEMA")
   >  ore.ls()

Attaching the database provides access to views and tables so they can be manipulated from a local R session:

Data Exploration
   >  ore.attach("SCHEMA")


Travelers in the western United States may be interested in the mean arrival delay for the San Fransisco, Los Angeles and Seattle airports.
Executing this R code produces a boxplot representing the delay profile for these selected airports:

   > delay <- ontime$ARRDELAY[ontime$DEST %in% c("SEA", "SFO", "LAX"), ]
   > dest <- ontime$DEST[ontime$DEST %in% c("SEA", "SFO", "LAX"), ]
   > bd <- split(delay, dest)
   > boxplot(bd, notch = TRUE, col = "gold", cex = 0.5,
            outline = FALSE, horizontal = TRUE, yaxt = "n",
            main = "Arrival Delays by Destination",
            ylab = "Destination Airport", xlab = "Delay (minutes)")
   > labels <- levels(dest)
   > text(par("usr")[1] - 3, 1:length(labels), srt = 0, adj = 1, labels = labels, xpd = TRUE, cex = 0.75)


The delay profile shows that, on average, arrival delays are greater in San Fransisco than Seattle and Los



With this information, we proceed with fitting a linear model where arrival delay is modeled as a linear function of departure delay and destination. Fitting models in ORE requires minimal modification to traditional R syntax. Simply replace R's linear modeling function lm with ore.lm, and the remaining syntax is transparent:

   > mod.ore <-ore.lm(ARRDELAY ~ DEST + DEPDELAY, data=ontime)
   > summary(mod.ore)

Many R functions compute and store more information than they report by default. Users can easily save these results to an object and extract the components they need. This allows the output of one function to be used as the input to another - a very powerful feature of the R programming environment.  We will take advantage of this functionality by generating predictions from the linear model built in the database:

   > newdat <- ontime[,c("ARRDELAY","DEST", "DEPDELAY")]
   > newdat.sub <- subset(newdat,!(is.na(ARRDELAY) | is.na(DEST) | is.na(DEPDELAY)))
   > prd.ore <- predict(mod.ore, newdata=newdat)
   > res.ore <- cbind(newdat, PRED = prd.ore)


To complete our session, we may detach the schema if desired, and terminate the database connection:

    > ore.detach ("RQUSER")
    > ore.disconnect()

This a small subset of the functionality available in the ORE transparency layer.  The R functions overloaded in the transparency layer include:

    data type assignment and verification
    distribution testing
    mathematical transformations
    statistical summaries
    arithmetic operators
    comparison operators
    logical operators
    set operations
    string manipulations
    data manipulation
    graphics
    matrix operations
    gamma and bessel functions
    hypothesis testing
    regression 

 Summary

Using ORE objects, users transparently remain in the R language, requiring minimal modifications to their existing R scripts.  Although supported, it's not necessary to pull data from the database into R.  This eliminates the need to manipulate memory-bound R objects on the user's desktop system.  R users may access open-source R packages containing many standard and cutting-edge routines for data analysis.

To learn more about ORE offerings, including statistics and modeling features and and advanced topics like the Oracle R Connector for Hadoop (ORCH), view the documentation and training materials on our our product page.  To encourage collaboration, we provide a discussion forum for ORE topics.






Thursday Feb 02, 2012

Announcing Oracle R Enterprise 1.0

Analyzing huge data sets presents a challenging opportunity for IT decision makers, driven by the balance between the maintenance and support of existing IT infrastructure with the need to analyze rapidly growing data stores. In many cases, processing this data requires a fresh approach because traditional techniques fail when applied to massive data sets. To extract immediate value from big data, we desire tools that efficiently access, organize, analyze and maintain a variety of data types.

Oracle R Enterprise (ORE), a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition, emerges as the clear solution to these challenges. ORE integrates the popular open-source R statistical programming environment with Oracle Database 11g, Oracle Exadata and the Oracle Big Data Appliance, delivering enterprise-level analytics based on R scripts and parallelized, in-database modeling.

How do R and Oracle R Enterprise work together?

The powerful R programming environment enables the creation of sophisticated graphics, statistical analyses, and simulations. It contains a vast set of built-in functions which may be extended to build custom statistical packages. The R engine is limited by capacity and performance for large data, but with Oracle R Enterprise, R bypasses these constraints by leveraging the database as the analytics engine directly from their R session.

The components that support Oracle R Enterprise include:

1. The Oracle R Enterprise transparency layer - a collection of R packages with functions to connect to Oracle Database and use R functionality in Oracle Database. This enables R users to work with data too large to fit into the memory of a user's desktop system, and leverage the scalable Oracle Database as a computational engine.

2. The Oracle statistics engine - a collection of statistical functions and procedures corresponding to commonly-used statistical libraries. The statistics engine packages also execute in Oracle Database.

3. SQL extensions supporting embedded R execution through the database on the database server. R users can execute R closures (functions) using an R or SQL API, while taking advantage of data parallelism. Using the SQL API for embedded R execution, sophisticated R graphics and results can be exposed in OBIEE dashboards and BI Publisher documents.

4. Oracle R Connector for Hadoop (ORCH) - an R package that interfaces with the Hadoop Distributed File System (HDFS) and enables executing MapReduce jobs. ORCH enables R users to work directly with an Oracle Hadoop cluster, executing computations from the R environment, written in the R language and working on data resident in HDFS, Oracle Database, or local files.

Using a simple R workflow, R users can seamlessly utilize the parallel processing architecture of ORE and ORCH for scalability and better performance. Analytics and reporting tasks are moved to the Oracle Database, eliminating long approval chains for data movement and dramatically increasing processing speed. R users are not required to learn SQL because the R-to-SQL translation is shipped to the database and processed behind the scenes. The significant benefits to IT include improved data security, data maintenance and audit compliance practices.

We’re proud to announce Oracle R Enterprise 1.0 and look forward to your comments. To learn more about ORE, visit our product page.

About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today