Thursday Aug 21, 2014

Oracle R Distribution 3.1.1 Released

Oracle R Distribution version 3.1.1 has been released to Oracle's public yum today. R-3.1.1 (code name "Sock it to Me") is an update to R-3.1.0 that consists mainly of bug fixes. It also 
includes enhancements related to accessing package help files, improved accuracy when importing data with large integers, and better integration with RStudio graphics. The full list of new features and bug fixes is listed in the NEWS file.

To install Oracle R Distribution using
yum, follow the instructions in the Oracle R Enterprise Installation and Administration Guide.

Installing using
yum will resolve any operating system dependencies automatically. As such, we recommend using yum to install Oracle R Distribution. However, if yum is not available, you can install Oracle R Distribution RPMs directly using RPM commands.

For Oracle Linux 5, the Oracle R Distribution RPMs are available in the Enterprise Linux Add-Ons repository


For Oracle Linux 6, the Oracle R Distribution RPMs are available in the Oracle Linux Add-Ons repository:


For example, this command installs the R 3.1.1 RPM on Oracle Linux x86-64 version 6:

  rpm -i R-3.1.1-1.el6.x86_64.rpm

To complete the Oracle R Distribution 3.1.1 installation, repeat this command for each of the 6 RPMs, resolving dependencies as required. 

Oracle R Distribution 3.1.1 is not yet officially certified with Oracle R Enterprise. Refer to 
Table 1-2 in the Oracle R Enterprise Installation Guide for supported configurations of Oracle R Enterprise components, or check this blog for updates. The Oracle R Distribution 3.1.1 binaries for Windows, AIX, Solaris SPARC and Solaris x86 will be available on OSS, Oracle's Open Source Software portal, in the coming weeks.

Monday Aug 18, 2014

Real-time Big Data Analytics is a reality for StubHub with Oracle Advanced Analytics

What can you use for a comprehensive platform for real-time analytics?
How can you process big data volumes for near-real-time recommendations and dramatically reduce fraud?

Learn in this video what Stubhub achieved with Oracle R Enterprise from the Oracle Advanced Analytics option to Oracle Database, and read more on their story here.

Advanced analytics solutions that impact the bottom line of a business are challenging due to the range of skills and individuals involved in realizing such solutions. While we hear a lot about the role of the data scientist, that role is but one piece of the puzzle. Advanced analytics solutions also have an operationalization aspect that also requires close proximity to where the transactional activity occurs.

The data scientist needs access to the right data with which to model the business problem. This involves IT for data collection, management, and administration, as well as ensuring zero downtime (a website needs to be up 24x7). This also involves working with the data scientist to keep predictive models refreshed with the latest scripts.

Integrating advanced analytics solutions into enterprise apps involves not just generating predictions, but supporting the whole life-cycle from data collection, to model building, model assessment, and then outcome assessment and feedback to the model building process again. Application and web interface designers need to take into account how end users will see and use the advanced analytics results, e.g., supporting operations staff that need to handle the potentially fraudulent transactions.

As just described, advanced analytics projects can be "complicated" from just a human perspective. The extent to which software can simplify the interactions among users and systems will increase the likelihood of project success. The ability to quickly operationalize advanced analytics projects and demonstrate measurable value, means the difference between a successful project and just a nice research report.

By standardizing on Oracle Database and SQL invocation of R, along with in-database modeling as found in Oracle Advanced Analytics, expedient model deployment and zero downtime for refreshing models becomes a reality. Meanwhile, data scientists are also able to explore leading edge techniques available in open source. The Oracle solution propels the entire organization forward to realize the value of advanced analytics.

Thursday Aug 14, 2014

Selecting the most predictive variables – returning Attribute Importance results as a database table

Attribute Importance (AI) is a technique of Oracle Advanced Analytics (OAA) that ranks the relative importance of predictors given a categorical or numeric target for classification or regression models, respectively. OAA AI uses the minimum description length algorithm and produces importance scores such that predictors with positive scores help predict the target, while zero or negative do not, and may even contribute noise to a model, making it less accurate. OAA AI, however, considers predictors only pairwise with the target, so any interactions among predictors are addressed. OAA AI is a good first assessment of which predictors should be included in a classification or regression model, enabling what is sometimes called feature selection or variable selection.

In my series on Oracle R Enterprise Embedded R Execution, I explored how structured table results could be returned from embedded R calls. In a subsequent post, I explored how to return select results from a principal components analysis (PCA) model as a table. In this post, I describe how you can work with results from an Attribute Importance model from ORE embedded R execution via an R function. This R function takes a table name and target variable name as input, places the predictor rankings in an named ORE datastore also specified as input, and returns a data.frame with the predictor variable name, rank, importance value.

The function below implements this functionality. Notice that we dynamically sync the named table and get its ore.frame proxy object. From here, we invoke ore.odmAI using the dynamically generated formula using the targetName argument. We pull out the importance component of the result, explicitly assign the column variable to the row names, and then reorder the columns. Next, we nullify the row names since these are now redundant with column variable.

The next three lines assign the result to a datastore. This is technically not necessary since the result is returned by this function, but if a user wanted to access this result without recomputing it, the user could simply retrieve the datastore object using another embedded R function. This is left as an exercise for the reader to load the named datastore and return the contents as an ore.frame in R or database table in SQL.

Lastly, the resulting data.frame is returned.

rankPredictors <- function(tableName,targetName,dsName) {
  dat <- ore.get(tableName)
  formulaStr <- paste(targetName,".",sep="~")
  res <- ore.odmAI(as.formula(formulaStr),dat)
  res <- res$importance
  res$variable <- rownames(res)
  res <- res[,c("variable","rank","importance")]
  row.names(res) <- NULL
  resName <- paste(tableName,targetName,"AI",sep=".")

To test this funtion, we invoke it explicitly with suitable arguments.

res <- rankPredictors ("IRIS","Species","/DS/Test1")

Here, you see the results.

> res
    variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

The contents of the datastore can be accessed as well.

> ore.datastore(pattern="/DS") object.count size description
1 /DS/Test1 1 355 2014-08-14 16:38:46 <na>
> ore.datastoreSummary(name="/DS/Test1") class size length row.count col.count
1 IRIS.Species.AI data.frame 355 3 4 3
> ore.load("/DS/Test1")
[1] "IRIS.Species.AI"
> IRIS.Species.AI
    variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

With the confidence that our R function is behaving correctly, we load it into the R Script Repository in Oracle Database.


To test that the function behaves properly with embedded R execution, we invoke it first from R using ore.doEval, passing the desired parameters and returning the result as an ore.frame. This last part is enabled through the specification of the FUN.VALUE argument. Since we are using a datastore and the transparency layer, ore.connect is set to TRUE.


Notice we get the same result as above.

    variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

Again, we can view the datastore contents for the execution above. Notice our use of the “/” notation to organize our datastore content. While we can name datastores with any arbitrary string, this approach can help structure the retrieval of datastore contents.


We have a single datastore matching our IRIS data set followed by the summary with the IRIS.Species.AI object, which is an R data.frame with 3 columns and 4 rows.

> ore.datastore(pattern="/AttributeImportance/IRIS") object.count size description
1 /AttributeImportance/IRIS/Species 1 355 2014-08-14 16:55:40
> ore.datastoreSummary(name="/AttributeImportance/IRIS/Species") class size length row.count col.count
1 IRIS.Species.AI data.frame 355 3 4 3

To execute this R script from SQL, use the ORE SQL API.

select * from table(rqEval(
  cursor(select 1 "ore.connect",
      'IRIS' "tableName",
      'Species' "targetName",
      '/AttributeImportance/IRIS/Species' "dsName"
      from dual),
  'select cast(''a'' as varchar2(50)) "variable",
  1 "rank",
  1 "importance"
  from dual',

In summary, we’ve explored how to use ORE embedded R execution to extract model elements from an in-database algorithm and present it as an R data.frame, ore.frame, and SQL table.

The process used above can also serve as a template for working on your own embedded R execution projects:

+ Interactively develop an R script that does what you need and wrap it in a function
+ Validate that the R function behaves as expected
+ Store the function in the R Script Repository
+ Validate that the R interface to embedded R execution produces the desired results
+ Generate SQL query that invokes the R function
+ Validate that the SQL interface to embedded R execution produces the desired resultsv


The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.


« August 2014 »