Monday Mar 30, 2015

Oracle Open World 2015 Call for Proposals!

It's that time of year again...submit your session proposals for Oracle OpenWorld 2015!

Oracle customers and partners are encouraged to submit proposals to present at the Oracle OpenWorld 2015 conference, October 25 - 29, 2015, held at the Moscone Center in San Francisco.

Details and submission guidelines are available on the Oracle OpenWorld Call for Proposals web site. The deadline for submissions is Wednesday, April 29, 11:59 p.m. PDT.

We look forward to checking out your sessions on Oracle Advanced Analytics, including Oracle R Enterprise and Oracle Data Mining, and Oracle R Advanced Analytics for Hadoop. Tell how these tools have enhanced the way you do business!

Monday Mar 23, 2015

Oracle R Distribution 3.1.1 Available for Download on all Platforms

The Oracle R Distribution 3.1.1 binaries for Windows, AIX, Solaris SPARC and Solaris x86 are now available on OSS, Oracle's Open Source Software portal. Oracle R Distribution 3.1.1 is an update to R version 3.1.0, and it includes many improvements, including upgrades to the package help system and improved accuracy importing data with large integers. The complete list of changes is in the NEWS file.

To install Oracle R Distribution,
follow the instructions for your platform in the Oracle R Enterprise Installation and Administration Guide.

Thursday Feb 12, 2015

Pain Point #6: “We need to build 10s of thousands of models fast to meet business objectives”

The last pain point in this series on Addressing Analytic Pain Points, involves one aspect of what I call massive predictive modeling. Increasingly, enterprise customers are building a greater number of models. In past decades, producing a handful of production models per year may have been considered a significant accomplishment. With the advent of powerful computing platforms, parallel and distributed algorithms, as well as the wealth of data – Big Data – we see enterprises building hundreds and thousands of models in targeted ways.

For example, consider the utility sector with data being collected from household smart meters. Whether water, gas, or electricity, utility companies can make more precise demand projections by modeling individual customer consumption behavior. Aggregating this behavior across all households can provide more accurate forecasts, since individual household patterns are considered, not just generalizations about all households, or even different household segments.

The concerns associated with this form of massive predictive modeling include: (i) dealing effectively with Big Data from the hardware, software, network, storage and Cloud, (ii) algorithm and infrastructure scalability and performance, (iii) production deployment, and (iv) model storage, backup, recovery and security. Some of these I’ve explored under previous pain points blog posts.

Oracle Advanced Analytics (OAA) and Oracle R Advanced Analytics for Hadoop (ORAAH) both provide support for massive predictive modeling. From the Oracle R Enterprise component of OAA, users leverage embedded R execution to run user-defined R functions in parallel, both from R and from SQL. OAA provides the infrastructure to allow R users to focus on their core R functionality while allowing Oracle Database to handle spawning of R engines, partitioning data and providing data to their R function across parallel R engines, aggregating results, etc. Data parallelism is enabled using the “groupApply” and “rowApply” functions, while task parallelism is enabled using the “indexApply” function. The Oracle Data Mining component of OAA provides "on-the-fly" models, also called "predictive queries," where the model is automatically built on partitions of the data and scoring using those partitioned models is similarly automated.

ORAAH enables the writing of mapper and reducer functions in R where corresponding ORE functionality can be achieved on the Hadoop cluster. For example, to emulate “groupApply”, users write the mapper to partition the data and the reducer to build a model on the resulting data. To emulate “rowApply”, users can simply use the mapper to perform, e.g., data scoring and passing the model to the environment of the mapper. No reducer is required.

Monday Jan 19, 2015

Pain Point #5: “Our company is concerned about data security, backup and recovery”

So far in this series on Addressing Analytic Pain Points, I’ve focused on the issues of data access, performance, scalability, application complexity, and production deployment. However, there are also fundamental needs for enterprise advanced analytics solutions that revolve around data security, backup, and recovery.

Traditional non-database analytics tools typically rely on flat files. If data originated in an RDBMS, that data must first be extracted. Once extracted, who has access to these flat files? Who is using this data and when? What operations are being performed? Security needs for data may be somewhat obvious, but what about the predictive models themselves? In some sense, these may be more valuable than the raw data since these models contain patterns and insights that help make the enterprise competitive, if not the dominant player. Are these models secure? Do we know who is using them, when, and with what operations? In short, what audit capabilities are available?

While security is a hot topic for most enterprises, it is essential to have a well-defined backup process in place. Enterprises normally have well-established database backup procedures that database administrators (DBAs) rigorously follow. If data and models are stored in flat files, perhaps in a distributed environment, one must ask what procedures exist and with what guarantees. Are the data files taxing file system backup mechanisms already in place – or not being backed up at all?

On the other hand, recovery involves using those backups to restore the database to a consistent state, reapplying any changes since the last backup. Again, enterprises normally have well-established database recovery procedures that are used by DBAs. If separate backup and recovery mechanisms are used for data, models, and scores, it may be difficult, if not impossible, to reconstruct a consistent view of an application or system that uses advanced analytics. If separate mechanisms are in place, they are likely more complex than necessary.

For Oracle Advanced Analytics (OAA), data is secured via Oracle Database, which wins security awards and is highly regarded for its ability to provide secure data for confidentiality, integrity, availability, authentication, authorization, and non-repudiation. Oracle Database logs and monitors user activity. Users can work independently or jointly in a shared environment with data access controlled by standard database privileges. The data itself can be encrypted and data redaction is supported.

OAA models are secured in one of two ways: (i) models produced in the kernel of the database are treated as first-class database objects with corresponding access privileges (create, update, delete, execute), and (ii) models produced through the R interface can be stored in the R datastore, which exists as a database table in the user's schema with its own access privileges. In either case, users must log into their Oracle Database schema/account, which provides the needed degree of confidentiality, integrity, availability, authentication, authorization, and non-repudiation.

Enterprise Oracle DBAs already follow rigorous backup and recovery procedures. The ability to reuse these procedures in conjunction with advanced analytics solutions is a major simplification and helps to ensure the integrity of data, models, and results.

Tuesday Dec 23, 2014

Pain Point #4: “Recoding R (or other) models into SQL, C, or Java takes time and is error prone”

In the previous post in this series Addressing Analytic Pain Points, I focused on some issues surrounding production deployment of advanced analytics solutions. One specific aspect of production deployment involves how to get predictive model results (e.g., scores) from R or leading vendor tools into applications that are based on programming languages such as SQL, C, or Java. In certain environments, one way to integrate predictive models involves recoding them into one of these languages. Recoding involves identifying the minimal information needed for scoring, i.e., making predictions, and implementing that in a language that is compatible with the target environment. For example, consider a linear regression model with coefficients. It can be fairly straightforward to write a SQL statement or a function in C or Java to produce a score using these coefficients. This translated model can then be integrated with production applications or systems.

While recoding has been a technique used for decades, it suffers from several drawbacks: latency, quality, and robustness. Latency refers to the time delay between the data scientist developing the solution and leveraging that solution in production. Customers recount historic horror stories where the process from analyst to software developers to application deployment took months. Quality comes into play on two levels: the coding and testing quality of the software produced, and the freshness of the model itself. In fast changing environments, models may become “stale” within days or weeks. As a result, latency can impact quality. In addition, while a stripped down implementation of the scoring function is possible, it may not account for all cases considered by the original algorithm implementer. As such, robustness, i.e., the ability to handle greater variation in the input data, may suffer.

One way to address this pain point is to make it easy to leverage predictive models immediately (especially open source R and in-database Oracle Advanced Analytics models), thereby eliminating the need to recode models. Since enterprise applications normally know how to interact with databases via SQL, as soon as a model is produced, it can be placed into production via SQL access. In the case of R models, these can be accessed using Oracle R Enterprise embedded R execution in parallel via ore.rowApply and, for select models, the ore.predict capability performs automatic translation of native R models for execution inside the database. In the case of native SQL Oracle Advanced Analytics interface algorithms, as found in Oracle Data Mining and exposed through an R interface in Oracle R Enterprise, users can perform scoring directly in Oracle Database. This capability minimizes or even eliminates latency, dramatically increases quality, and leverages the robustness of the original algorithm implementations.

Sunday Dec 14, 2014

Pain Point #3: “Putting R (or other) models and results into production is ad hoc and complex”

Continuing in our series Addressing Analytic Pain Points, another concern for data scientists and analysts, as well as enterprise management, is how to leverage analytic results in production systems. These production systems can include (i) dashboards used by management to make business decisions, (ii) call center applications where representatives see personalized recommendations for the customer they’re speaking to or how likely that customer is to churn, (iii) real-time recommender systems for customer retail web applications, (iv) automated network intrusion detection systems, and (v) semiconductor manufacturing alert systems that monitor product quality and equipment parameters via sensors – to name a few.

When a data scientist or analyst begins examining a data-based business problem, one of the first steps is to acquire the available data relevant to that problem. In many enterprises, this involves having it extracted from a data warehouse and operational systems, or acquiring supplemental data from third parties. They then explore the data, prepare it with various transformations, build models using a variety of algorithms and settings, evaluate the results, and after choosing a “best” approach, produce results such as predictions or insights that can be used by the enterprise.

If the end goal is to produce a slide deck or report, aside from those final documents, the work is done. However, reaping financial benefits from advanced analytics often needs to go beyond PowerPoint! It involves automating the process described above: extract and prepare the data, build and select the “best” model, generate predictions or highlight model details such as descriptive rules, and utilize them in production systems.

One of the biggest challenges enterprises face involves realizing the promised benefits in production that the data scientist achieved in the lab. How do you take that cleverly crafted R script, for example, and put all the necessary “plumbing” around it to enable not only the execution of the R script, but the movement of data and delivery of results where they are needed, parallel and distributed script execution across compute nodes, and execution scheduling.

As a production deployment, care needs to taken to safeguard against potential failures in the process. Further, more “moving parts” result in greater complexity. Since the plumbing is often custom implemented for each deployment, this plumbing needs to be reinvented and thoroughly tested for each project. Unfortunately, code and process reuse is seldom realized across an enterprise even for similar projects, which results in duplication of effort.

Oracle Advanced Analytics (Oracle R Enterprise and Oracle Data Mining) with Oracle Database provides an environment that eliminates the need for a separately managed analytics server, the corresponding movement of data and results between such a server and the database, and the need for custom plumbing. Users can store their R and SQL scripts directly in Oracle Database and invoke them through standard database mechanisms. For example, R scripts can be invoked via SQL, and SQL scripts can be scheduled for execution through Oracle Database’s DMBS_SCHEDULER package. Parallel and distributed execution of R scripts is supported through embedded R execution, while the database kernel supports parallel and distributed execution of SQL statements and in-database data mining algorithms. In addition, using the Oracle Advanced Analytics GUI, Oracle Data Miner, users can convert “drag and drop” analytic workflows to SQL scripts for ease of deployment in Oracle Database.

By making solution deployment a well-defined and routine part of the production process and reducing complexity through fewer moving parts and built-in capabilities, enterprises are able to realize and then extend the value they get from predictive analytics faster and with greater confidence.

Wednesday Nov 19, 2014

Pain Point #2: “I can’t analyze or mine all of my data – it has to be sampled”

Continuing in our series Addressing Analytic Pain Points, another concern for enterprise data scientists and analysts is having to compromise accuracy due to sampling. While sampling is an important technique for data analysis, it’s one thing to sample because you choose to; it’s quite another if you are forced to sample or to use a much smaller sample than is useful. A combination of memory, compute power, and algorithm design normally contributes to this.

In some cases, data simply cannot fit in memory. As a result, users must either process data in batches (adding to code or process complexity), or limit the data they use through sampling. In some environments, sampling itself introduces a catch 22 problem: the data is too big to fit in memory so it needs to be sampled, but to sample it with the current tool, I need to fit the data in memory! As a result, sampling large volume data may require processing it in batches, involving extra coding.

As data volumes increase, computing statistics and predictive analytics models on a data sample can significantly reduce accuracy. For example, to find all the unique values for a given variable, a sample may miss values, especially those that occur infrequently. In addition, for environments like open source R, it is not enough for data to fit in memory; sufficient memory must be left over to perform the computation. This results from R’s call-by-value semantics.

Even when data fits in memory, local machines, such as laptops, may have insufficient CPU power to process larger data sets. Insufficient computing resources means that performance suffers and users must wait for results - perhaps minutes, hours, or longer. This wastes the valuable (and expensive) time of the data scientist or analyst. Having multiple fast cores for parallel computations, as normally present on database server machines, can significantly reduce execution time.

So let’s say we can fit the data in memory with sufficient memory left over, and we have ample compute resources. It may still be the case that performance is slow, or worse, the computation effectively “never” completes. A computation that would take days or weeks to complete on the full data set may be deemed as “never” completing by the user or business, especially where the results are time-sensitive. To address this problem, algorithm design must be addressed. Serial, non-threaded algorithms, especially with quadratic or worse order run time do not readily scale. Algorithms need to be redesigned to work in a parallel and even distributed manner to handle large data volumes.

Oracle Advanced Analytics
provides a range of statistical computations and predictive algorithms implemented in a parallel, distributed manner to enable processing much larger volume data. By virtue of executing in Oracle Database, client-side memory limitations can be eliminated. For example, with Oracle R Enterprise, R users operate on database tables using proxy objects – of type ore.frame, a subclass of data.frame – such that data.frame functions are transparently converted to SQL and executed in Oracle Database. This eliminates data movement from the database to the client machine. Users can also leverage the Oracle Data Miner graphical interface or SQL directly. When high performance hardware, such as Oracle Exadata, is used, there are powerful resources available to execute operations efficiently on big data. On Hadoop, Oracle R Advanced Analytics for Hadoop – a part of the Big Data Connectors often deployed on Oracle Big Data Appliance – also provides a range of pre-package parallel, distributed algorithms for scalability and performance across the Hadoop cluster.

Friday Oct 24, 2014

Pain Point #1: “It takes too long to get my data or to get the ‘right’ data”

This is the first in a series on Addressing Analytic Pain Points: “It takes too long to get my data or to get the ‘right’ data.”

Analytics users can be characterized along multiple dimensions. One such dimension is how they get access to or receive data. For example, some receive data via flat files. Since we’re talking about “enterprise” users, this often means data stored in RDBMSs where users request data extracts from a DBA or more generally the IT department. Turnaround time can be hours to days, or even weeks, depending on the organization. If the data scientist needs more or different data, the cycle repeats – often leading to frustration on both sides and delays in generating results.

Others users are granted access to databases directly using programmatic access tools like ODBC, JDBC, their corresponding R variants, or ROracle. These users may be given read-only access to a range of data tables, possibly in a sandbox schema. Here, analytics users don’t have to go back to their DBA or IT as to obtain extracts, but they still need to pull the data from the database to their client environment, e.g., a laptop, and push results back to the database. If significant volumes of data are involved, the time required for pulling data can hinder productivity. (Of course, this assumes the client has enough RAM to load the needed data sets, but that’s a topic for the next blog post.)

To address the first type of user, since much of the data in question resides in databases, empowering users with a self service model mitigates the vicious cycle described above. When the available data are readily accessible to analytics users, they can see and select what they need at will. An Oracle Database solution addresses this data access pain point by providing schema access, possibly in a sandbox with read-only table access, for the analytics user.

Even so, this approach just turns the first type of user into the second mentioned above. An Oracle Database solution further addresses this pain point by either minimizing or eliminating data movement as much as possible. Most analytics engines bring data to the computation, requiring extracts and in some cases even proprietary formats before being able to perform analytics. This takes time. Often, data movement can dwarf the time required to perform the actual computation. From the perspective of the analytics user, this is wasted time because it is just a perfunctory step on the way to getting the desired results. By bringing computation to the data, using Oracle Advanced Analytics (Oracle R Enterprise and Oracle Data Mining), the time normally required to move data is eliminated. Consider the time savings of being able to prepare data, compute statistics, or build predictive models and score data directly in the database. Using Oracle Advanced Analytics, either from R via Oracle R Enterprise, SQL via Oracle Data Mining, or the graphical interface Oracle Data Miner, users can leverage Oracle Database as a high performance computational engine.

We should also note that Oracle Database has the high performance Oracle Call Interface (OCI) library for programmatic data access. For R users, Oracle provides the package ROracle that is optimized using OCI for fast data access. While ROracle performance may be much faster than other methods (ODBC- and JDBC-based), the time is still greater than zero and there are other problems that I’ll address in the next pain point.

Addressing Analytic Pain Points

If you’re an enterprise data scientist, data analyst, or statistician, and perform analytics using R or another third party analytics engine, you’ve likely encountered one or more of these pain points:

Pain Point #1: “It takes too long to get my data or to get the ‘right’ data”
Pain Point #2: “I can’t analyze or mine all of my data – it has to be sampled”
Pain Point #3: “Putting R (or other) models and results into production is ad hoc and complex”
Pain Point #4: “Recoding R (or other) models into SQL, C, or Java takes time and is error prone”
Pain Point #5: “Our company is concerned about data security, backup and recovery”
Pain Point #6: “We need to build 10s of thousands of models fast to meet business objectives”

Some pain points are related to the scale of data, yet others are felt regardless of data size. In this blog series, I’ll explore each of these pain points, how they affect analytics users and their organizations, and how Oracle Advanced Analytics addresses them.

Monday Sep 22, 2014

Oracle R Enterprise 1.4.1 Released

Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, makes the open source R statistical programming language and environment ready for the enterprise and big data. Designed for problems involving large data volumes, Oracle R Enterprise integrates R with Oracle Database.

R users can execute R commands and scripts for statistical and graphical analyses on data stored in Oracle Database. R users can develop, refine, and deploy R scripts that leverage the parallelism and scalability of the database to automate data analysis. Data analysts and data scientists can use open source R packages and develop and operationalize R scripts for analytical applications in one step – from R or SQL.

With the new release of Oracle R Enterprise 1.4.1, Oracle enables support for Multitenant Container Database (CDB) in Oracle Database 12c and pluggable databases (PDB). With support for CDB / PDB, enterprises can take advantage of new ways of organizing their data: easily taking entire databases offline and easily bringing them back online when needed. Enterprises, such as pharmaceutical companies, that collect vast quantities of data across multiple experiments for individual projects immediately benefit from this capability.

This point release also includes the following enhancements:

• Certified for use with R 3.1.1 and Oracle R Distribution 3.1.1.

• Simplified and enhanced script for install, upgrade, uninstall of ORE Server and the creation and configuratioon of ORE users.

• New supporting packages: arules and statmod.

• ore.glm accepts offset terms in model formula and can fit negative binomial and tweedie families of GLM.

• ore.sync argument, query, creates ore.frame object from SELECT statement without creating view. This allows users to effectively access a view of the data without the CREATE VIEW privilege.

• Global option for serialization, ore.envAsEmptyenv, specifies whether referenced environment objects in an R object, e.g., in an lm model, should be replaced with an empty environment during serialization to the ORE R datastore. This is used by (i) ore.push, which for a list object accepts envAsEmptyenv as an optional argument, (ii), which has envAsEmptyenv as a named argument, and (iii) ore.doEval and the other embedded R execution functions, which accept ore.envAsEmptyenv as a control argument.

Oracle R Enterprise 1.4.1
can be downloaded from OTN here.

Wednesday Sep 17, 2014

Seismic Data Repository: on-the-fly data analysis and visualization using Oracle R Enterprise

RN-KrasnoyarskNIPIneft Establishes Seismic Information Repository for One of the World’s Largest Oil and Gas Companies. Read the complete customer story here, excerpts follow.

RN-KrasnoyarskNIPIneft (KrasNIPI) is a research and development subsidiary of Rosneft Oil Companya, top oil and gas company in Russia and worldwide. KrasNIPI provides high-quality information from seismic surveys to Rosneft—delivering key information that oil and gas companies seek to lower costs, environmental impacts, and risks while exploring for resources to satisfy growing energy needs. KrasNIPI’s primary activities include preparing the information base used for the exploration of hydrocarbons, development and construction of oil and gas fields, processing and interpretation of 2-D and 3-D seismic data, and seismic data warehousing.

Part of the solution involved on-the-fly data analysis and visualization for remote users with only a thin client—such as a web browser (without additional plug-ins and extensions). This was made possible by using Oracle R Enterprise (a component of Oracle Advanced Analytics) to support applications requiring extensive analytical processing.

We store vast amounts of seismic data, process this information with sophisticated math algorithms, and deliver it to remote users under tight deadlines. We deployed Oracle Database together with Oracle Spatial and Graph, Oracle Fusion Middleware MapViewer on Oracle WebLogic Server, and Oracle R Enterprise to keep these complex business processes running smoothly. The result exceeded our most optimistic expectations.”
                              – Artem Khodyaev, Chief Engineer
                                                              Corporate Center of Seismic Information Repository

Thursday Aug 21, 2014

Oracle R Distribution 3.1.1 Released

Oracle R Distribution version 3.1.1 has been released to Oracle's public yum today. R-3.1.1 (code name "Sock it to Me") is an update to R-3.1.0 that consists mainly of bug fixes. It also 
includes enhancements related to accessing package help files, improved accuracy when importing data with large integers, and better integration with RStudio graphics. The full list of new features and bug fixes is listed in the NEWS file.

To install Oracle R Distribution using
yum, follow the instructions in the Oracle R Enterprise Installation and Administration Guide.

Installing using
yum will resolve any operating system dependencies automatically. As such, we recommend using yum to install Oracle R Distribution. However, if yum is not available, you can install Oracle R Distribution RPMs directly using RPM commands.

For Oracle Linux 5, the Oracle R Distribution RPMs are available in the Enterprise Linux Add-Ons repository


For Oracle Linux 6, the Oracle R Distribution RPMs are available in the Oracle Linux Add-Ons repository:


For example, this command installs the R 3.1.1 RPM on Oracle Linux x86-64 version 6:

  rpm -i R-3.1.1-1.el6.x86_64.rpm

To complete the Oracle R Distribution 3.1.1 installation, repeat this command for each of the 6 RPMs, resolving dependencies as required. 

Oracle R Distribution 3.1.1 is the certified with Oracle R Enterprise 1.4.x. Refer to 
Table 1-2 in the Oracle R Enterprise Installation Guide for supported configurations of Oracle R Enterprise components, or check this blog for updates. The Oracle R Distribution 3.1.1 binaries for Windows, AIX, Solaris SPARC and Solaris x86 are also available on OSS, Oracle's Open Source Software portal.

Thursday Aug 14, 2014

Selecting the most predictive variables – returning Attribute Importance results as a database table

Attribute Importance (AI) is a technique of Oracle Advanced Analytics (OAA) that ranks the relative importance of predictors given a categorical or numeric target for classification or regression models, respectively. OAA AI uses the minimum description length algorithm and produces importance scores such that predictors with positive scores help predict the target, while zero or negative do not, and may even contribute noise to a model, making it less accurate. OAA AI, however, considers predictors only pairwise with the target, so any interactions among predictors are addressed. OAA AI is a good first assessment of which predictors should be included in a classification or regression model, enabling what is sometimes called feature selection or variable selection.

In my series on Oracle R Enterprise Embedded R Execution, I explored how structured table results could be returned from embedded R calls. In a subsequent post, I explored how to return select results from a principal components analysis (PCA) model as a table. In this post, I describe how you can work with results from an Attribute Importance model from ORE embedded R execution via an R function. This R function takes a table name and target variable name as input, places the predictor rankings in an named ORE datastore also specified as input, and returns a data.frame with the predictor variable name, rank, importance value.

The function below implements this functionality. Notice that we dynamically sync the named table and get its ore.frame proxy object. From here, we invoke ore.odmAI using the dynamically generated formula using the targetName argument. We pull out the importance component of the result, explicitly assign the column variable to the row names, and then reorder the columns. Next, we nullify the row names since these are now redundant with column variable.

The next three lines assign the result to a datastore. This is technically not necessary since the result is returned by this function, but if a user wanted to access this result without recomputing it, the user could simply retrieve the datastore object using another embedded R function. This is left as an exercise for the reader to load the named datastore and return the contents as an ore.frame in R or database table in SQL.

Lastly, the resulting data.frame is returned.

rankPredictors <- function(tableName,targetName,dsName) {
  dat <- ore.get(tableName)
  formulaStr <- paste(targetName,".",sep="~")
  res <- ore.odmAI(as.formula(formulaStr),dat)
  res <- res$importance
  res$variable <- rownames(res)
  res <- res[,c("variable","rank","importance")]
  row.names(res) <- NULL
  resName <- paste(tableName,targetName,"AI",sep=".")

To test this funtion, we invoke it explicitly with suitable arguments.

res <- rankPredictors ("IRIS","Species","/DS/Test1")

Here, you see the results.

> res
    variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

The contents of the datastore can be accessed as well.

> ore.datastore(pattern="/DS") object.count size description
1 /DS/Test1 1 355 2014-08-14 16:38:46 <na>
> ore.datastoreSummary(name="/DS/Test1") class size length row.count col.count
1 IRIS.Species.AI data.frame 355 3 4 3
> ore.load("/DS/Test1")
[1] "IRIS.Species.AI"
> IRIS.Species.AI
    variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

With the confidence that our R function is behaving correctly, we load it into the R Script Repository in Oracle Database.


To test that the function behaves properly with embedded R execution, we invoke it first from R using ore.doEval, passing the desired parameters and returning the result as an ore.frame. This last part is enabled through the specification of the FUN.VALUE argument. Since we are using a datastore and the transparency layer, ore.connect is set to TRUE.


Notice we get the same result as above.

    variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

Again, we can view the datastore contents for the execution above. Notice our use of the “/” notation to organize our datastore content. While we can name datastores with any arbitrary string, this approach can help structure the retrieval of datastore contents.


We have a single datastore matching our IRIS data set followed by the summary with the IRIS.Species.AI object, which is an R data.frame with 3 columns and 4 rows.

> ore.datastore(pattern="/AttributeImportance/IRIS") object.count size description
1 /AttributeImportance/IRIS/Species 1 355 2014-08-14 16:55:40
> ore.datastoreSummary(name="/AttributeImportance/IRIS/Species") class size length row.count col.count
1 IRIS.Species.AI data.frame 355 3 4 3

To execute this R script from SQL, use the ORE SQL API.

select * from table(rqEval(
  cursor(select 1 "ore.connect",
      'IRIS' "tableName",
      'Species' "targetName",
      '/AttributeImportance/IRIS/Species' "dsName"
      from dual),
  'select cast(''a'' as varchar2(50)) "variable",
  1 "rank",
  1 "importance"
  from dual',

In summary, we’ve explored how to use ORE embedded R execution to extract model elements from an in-database algorithm and present it as an R data.frame, ore.frame, and SQL table.

The process used above can also serve as a template for working on your own embedded R execution projects:

+ Interactively develop an R script that does what you need and wrap it in a function
+ Validate that the R function behaves as expected
+ Store the function in the R Script Repository
+ Validate that the R interface to embedded R execution produces the desired results
+ Generate SQL query that invokes the R function
+ Validate that the SQL interface to embedded R execution produces the desired resultsv

Wednesday Jul 30, 2014

For CMOs: Take Your Company’s Data to a New Level for Marketing Insights

This guest post from Phyllis Zimbler Miller, ‎Digital Marketer, comments on uses of predictive analytics for marketing insights that could benefit from in-database scalability and ease of production deployment with Oracle R Enterprise.

Does your company have tons of data, such as for how many seconds people watch each short video on your site before clicking away, and you are not yet leveraging this data to benefit your company’s bottom line?

Missed opportunities can be overcome by utilizing predictive analytics

Predictive analytics uses statistical and machine learning techniques that analyze current and historical facts to make predictions about events. For example, your company could take data you’ve already collected and, utilizing statistical analysis software, gain insights into the behavior of your target audiences.

Previously, running the software to analyze this data could take many hours or even days. Today, with advanced software and hardware options, this analysis can take minutes.

Customer segmentation and customer satisfaction based on data analysis

Using predictive analytics you could, for example, begin to evaluate which prospective customers in which part of the country tend to watch which videos on your site longer than the other videos on your site. This evaluation can then be used by your marketing people to craft regional messages that can better resonate with people in those regions.

In terms of data analysis for customer satisfaction, imagine an online entertainment streaming service using data analysis to determine at what point people stop watching a particular film or TV episode. Presumably this information could then be used, among other things, to improve the individual recommendations for site members.

Or imagine an online game company using data analysis of player actions for customer satisfaction insights. Although certain actions may not be against the rules, these actions might artificially increase a player’s ranking against other players, which would interfere with the game satisfaction of others. The company could use data analysis to look for players “gaming” the system and take appropriate action.

Customer retention opportunities from data analysis

Perhaps one of the most important opportunities for analysis of data your company may already have is for customer retention efforts. Let’s say you have a subscription model business. You perform data analysis and discover that your biggest drop-offs are at the 3-month and 6-month points.

First, your marketing department comes up with incentives offered to customers right before those drop-off points – incentives that require extending the customer’s subscription.

Then you use data analysis to evaluate whether there is a statistical difference in the drop-offs after the incentives have been instituted.

Next you try different incentives for those drop-off points and analyze that data. Which incentives seem to better improve customer retention?

Companies with large volume data

Your company may already be using Oracle Database. If your company’s database has a huge amount of data, Oracle has an enterprise solution to improve the efficiency and scalability of running the R statistical programming language, which can be effectively used in many cases for this type of predictive analytics.

Oracle R Enterprise offers scalability, performance, and ease of production deployment. Using Oracle R Enterprise, your company’s data analysis procedures can overcome R memory constraints and, utilizing parallel distributed algorithms, considerably reduce execution time.

Regardless of the amount of data your company has, you still need to consider how to get your advanced analytics into production quickly and easily. The ability to integrate R scripts with production database applications using SQL eliminates delays in moving from development to production use.

And the quicker and easier you can analyze your data, the sooner you can benefit from valuable insights into customer segmentation, satisfaction, and retention in addition to many other customer/marketing applications.

Friday Jul 25, 2014

Addressing Data Order Between R and Relational Databases

Almost all data in R is a vector or is based upon vectors (vectors themselves, matrices, data frames, lists, and so forth).  The elements of a vector in R have an explicit order, and each element can be individually indexed.  R's in-memory processing relies on this order of elements for many computations, e.g., computing quantiles and summaries for time series objects.

By design, query results in relational algebra are unordered.  Repeating the same query multiple times is not guaranteed to return results in the same order. Similarly, database-backed relational data also do not guarantee row order.  However, an explicit order on database tables and views can be defined by using an ORDER BY clause in the SQL SELECT statement.  Ordering is usually achieved by having a unique identifier, either a single or multi-column key specified in the ORDER BY clause.

To bridge between ordered R data frames and normally unordered data in a relational database such as Oracle Database, Oracle R Enterprise provides the ability to create ordered and unordered ore.frame objects.  Oracle R Enterprise supports ordering an ore.frame by assigning row names using the function row.names.

Ordering Using Row Names

Oracle R Enterprise supports ordering using row names.  For example, suppose that the ore.frame object NARROW, which is a proxy object for the corresponding database table, is not indexed.  The following example illustrates using the row.names function to create a unique identifier for each row. When retrieving row names for unordered ore.frame objects, an error is returned:

R> row.names(NARROW)
Error: ORE object has no unique key

If an ore.frame is unordered, row indexing is not permitted, since there is no unique ordering.  For example, an attempt to retrieve the 5th row from the NARROW data returns an error:

Error: ORE object has no unique key

The R function row.names can also be used to assign row names explicitly and thus create a unique row identifier.  We'll do this using the variable "ID" on the NARROW data:

R> row.names(NARROW) <- NARROW$ID

R> row.names(head(NARROW[ ,1:3]))
[1] 101501 101502 101503 101504 101505 101506 

We can now index to a specific row number using integer indexing:

R> NARROW[5,]     
101505 101505 <NA> 34  NeverM United States of America Masters Sales 5 1

Similarly, to index a range of row numbers, use:

R> NARROW[2:3,]

101502 101502 <NA> 27 NeverM United States of America Bach. Sales 3 0

101503 101503 <NA> 20 NeverM United States of America HS-grad Cleric. 2 0

To index a specific row by row name, use character indexing:

R> NARROW["101502",]


101502 101502 <NA> 27 NeverM United States of America Bach. Sales 3 0

Ordering Using Keys

You can also use the primary key of a database table to order an ore.frame object.  When you execute ore.connect in an R session, Oracle R Enterprise creates a connection to a schema in an Oracle Database instance. To gain access to the data in the database tables in the schema, you can explicitly call the ore.sync function. That function creates an ore.frame object that is a proxy for a table in a schema.  With the schema argument, you can specify the schema for which you want to create an R environment to proxy objects.  With the use.keys argument, you can specify whether you can to use primary keys in the table to order the ore.frame object.

To return the NARROW data to it's unordered state, remove the previously created row names:

R> row.names(NARROW) <- NULL

Using a SQL statement, alter the NARROW table to add a composite primary key:

R> ore.exec("alter table NARROW add constraint NARROW primary key (\"ID\")")

Synchronize the table to obtain the updated key using the
ore.sync command and setting the use.keys argument to TRUE.

R> ore.sync(table = "NARROW", use.keys = TRUE)

The row names of the ordered NARROW data are now the primary key column values:

R> head(NARROW[, 1:3])
1 101501   <NA>  41
2 101502   <NA>  27
3 101503   <NA>  20
4 101504   <NA>  45
5 101505   <NA>  34
6 101506   <NA>  38

If your database table already contains a key, there is no need to create the key again. 
Simply execute ore.sync with use.keys set to TRUE when you want to use the primary key:

R> ore.sync(table = "TABLE_NAME", use.keys = TRUE)

Ordering database tables and views is known to reduce performance because it requires sorting. 
As most operations in R do not require ordering, the performance hit due to sorting is unnecessary, and you should generally set use.keys to FALSE in ore.sync. Only when ordering is necessary, for operations such sampling data or running the diff command to compare objects, should keys be used.

Options for Ordering

Oracle R Enterprise contains options that relate to the ordering of an ore.frame object. The ore.warn.order global option specifies whether you want Oracle R Enterprise to display a warning message if you use an unordered ore.frame object in a function whose results are order dependent. If you know what to expect from operations involving aggregates, group summaries, or embedded R computations,  then you might want to turn the warnings off so they do not appear in the output:

R> options("ore.warn.order")

[1] TRUE

R> options("ore.warn.order" = FALSE)

R> options("ore.warn.order")



Note that in some circumstances unordered data may appear to have a repeatable order, however since it was never guaranteed in the first place it is possible to change in future runs. Additionally, parallel queries can significantly impact the order of the result versus the sequential execution.

Thursday Jul 24, 2014

Are you experiencing analytics pain points?

At the user!2014 conference at UCLA in early July, which was a stimulating and well-attended conference, I spoke about Oracle’s R Technologies during the sponsor talks. One of my slides focused on examples of analytics pain points we often hear from customers and prospects. For example,

“It takes too long to get my data or to get the ‘right’ data”
“I can’t analyze or mine all of my data – it has to be sampled”
“Putting R models and results into production is ad hoc and complex”
“Recoding R models into SQL, C, or Java takes time and is error prone”
“Our company is concerned about data security, backup and recovery”
“We need to build 10s of thousands of models fast to meet business objectives”

After the talk, several people approached me remarking how these are exactly the problems they encounter in their organizations. One person even asked, if I’d interviewed her for my talk since she is experiencing every one of these pain points.

Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, addresses these pain points. Let’s take a look one by one.

If it takes too long to get your data, perhaps because your moving it from the database where it resides to your external analytics server or laptop, the ideal solution is don’t move it. Analyze it where it is. This is exactly what Oracle R Enterprise allows you to do using the transparency layer and in-database predictive analytics capabilities. With Oracle R Enterprise, R functions normally performed on data.frames are translated to SQL for execution in the database, taking advantage of query optimization, indexes, parallel-distributed execution, etc. With the advent of Oracle Data In-Memory option, this has even more advantages, but that’s a topic for another post. The second part of this pain point is getting access to the “right” data. Allowing your data scientist to have a sandbox with access to the range of data necessary to perform his/her work avoids the delay of requesting flat file extracts via the DBA, only to realize that more or different data is required. The cycle time in getting the “right” data impedes progress, not to mention annoying some key individuals in your organization. We’ll come back to the security aspects later.

Increasingly, data scientists want to avoid sampling data when analyzing data or building predictive models. Minimally, they at least want to use much more data than may fit in typical analytics servers. Oracle R Enterprise provides an R interface to powerful in-database analytic functions and data mining algorithms. These algorithms are designed to work in a parallel distributed manner whether the data fits in memory or not. In other cases, sampling is desired, if not required, but this results in the chicken-and-egg problem: The data need to be sampled since they won’t fit in memory, but the data are too big to fit in memory to sample! Users have developed home grown techniques to chunk the data and combine partial samples; however, they shouldn’t have to. When sampling is desired/required, with Oracle R Enterprise, we are able to leverage row indexing and in-database sampling to extract only database table rows that are in the sample, using standard R syntax or Oracle R Enterprise-based sampling functions.

Our next pain point involves production deployment. Many good predictive models have been laid waste for lack of integration with or complexity introduced by production environments. Enterprise applications and dashboards often speak SQL and know how to access data. However, to craft a solution that extracts data, invokes an R script in an external R engine, and places batch results back in the database requires a lot of manual coding, often leveraging ad hoc cron jobs. Oracle R Enterprise enables the execution of R scripts on the database server machine, in local R engines under the control of Oracle Database. This can be done from R and SQL. Using the SQL API, R scripts can be invoked to return results in the form of table data, images, and XML. In addition, data can be moved to these R engines more efficiently, and the powerful database hardware, such as Exadata machines, can be leveraged for data-parallel and task-parallel R script execution.

When users don’t have access to a tight integration between R and SQL as noted above, another pain point involves using R only to build the models and relying on developers to recode the scoring procedures in a programming language that fits with the production environment, e.g., SQL, C, or Java. This has multiple downsides: it takes time to recode, manual recoding is error prone, and the resulting code requires significant testing. When the model is refreshed, the process repeats.

The pain points discussed so far also suffer from concerns about security, backup, and recovery. If data is being moved around in flat files, what security protocols or access controls are placed on those flat files? How can access be audited? Oracle R Enterprise enables analytics users to leverage an Oracle Database secured environment for data access. Moving on, if R scripts, models, and other R objects are stored and managed as flat files, how are these backed up? How are they synced with the deployed application? By storing all these artifacts in Oracle Database via Oracle R Enterprise, backup is a normal part of DBA operation with established protocols. The R Script Repository and Datastore simplify backup. Crafting ad hoc solutions involving third party analytic servers, there is the issue of recovery, or resilience to failures. Fewer moving parts mean lower complexity. Programming for failure contingencies in a distributed application adds significant complexity to an application. Allowing Oracle Database to control the execution of R scripts in database server side R engines reduces complexity and frees application developers and data scientists to focus on the more creative aspects of their work.

Lastly, users of advanced analytics software – data scientists, analysts, statisticians – are increasing pushing the barrier of scalability. Not just in volume of data processed, but in the number and frequency of their computations and analyses, e.g., predictive model building. Where only a few models are involved, it may be tractable to manage a few files to store predictive models on disk (although as noted above, this has its own complications). When you need to build thousands of models or hundreds of thousands of models, managing these models becomes a challenge in its own right.

In summary, customers are facing a wide range of pain points in their analytics activities. Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, addresses these pain points allowing data scientists, analysts, and statisticians, as well as the IT staff who supports them, to be more productive, while promoting and enabling new uses of advanced analytics.

Thursday Jun 05, 2014

Convert ddply {plyr} to Oracle R Enterprise, or use with Embedded R Execution

The plyr package contains a set of tools for partitioning a problem into smaller sub-problems that can be more easily processed. One function within {plyr} is ddply, which allows you to specify subsets of a data.frame and then apply a function to each subset. The result is gathered into a single data.frame. Such a capability is very convenient. The function ddply also has a parallel option that if TRUE, will apply the function in parallel, using the backend provided by foreach.

This type of functionality is available through Oracle R Enterprise using the ore.groupApply function. In this blog post, we show a few examples from Sean Anderson's "A quick introduction to plyr" to illustrate the correpsonding functionality using ore.groupApply.

To get started, we'll create a demo data set and load the plyr package.

d <- data.frame(year = rep(2000:2014, each = 3),
        count = round(runif(45, 0, 20)))

This first example takes the data frame, partitions it by year, and calculates the coefficient of variation of the count, returning a data frame.

# Example 1
res <- ddply(d, "year", function(x) {
  mean.count <- mean(x$count)
  sd.count <- sd(x$count)
  cv <- sd.count/mean.count
  data.frame(cv.count = cv)

To illustrate the equivalent functionality in Oracle R Enterprise, using embedded R execution, we use the ore.groupApply function on the same data, but pushed to the database, creating an ore.frame. The function ore.push creates a temporary table in the database, returning a proxy object, the ore.frame.

D <- ore.push(d)
res <- ore.groupApply (D, D$year, function(x) {
  mean.count <- mean(x$count)
  sd.count <- sd(x$count)
  cv <- sd.count/mean.count
  data.frame(year=x$year[1], cv.count = cv)
  }, FUN.VALUE=data.frame(year=1, cv.count=1))

You'll notice the similarities in the first three arguments. With ore.groupApply, we augment the function to return the specific data.frame we want. We also specify the argument FUN.VALUE, which describes the resulting data.frame. From our previous blog posts, you may recall that by default, ore.groupApply returns an ore.list containing the results of each function invocation. To get a data.frame, we specify the structure of the result.

The results in both cases are the same, however the ore.groupApply result is an ore.frame. In this case the data stays in the database until it's actually required. This can result in significant memory and time savings whe data is large.

R> class(res)
[1] "ore.frame"
[1] "OREbase"
R> head(res)
   year cv.count
1 2000 0.3984848
2 2001 0.6062178
3 2002 0.2309401
4 2003 0.5773503
5 2004 0.3069680
6 2005 0.3431743

To make the ore.groupApply execute in parallel, you can specify the argument parallel with either TRUE, to use default database parallelism, or to a specific number, which serves as a hint to the database as to how many parallel R engines should be used.

The next ddply example uses the summarise function, which creates a new data.frame. In ore.groupApply, the year column is passed in with the data. Since no automatic creation of columns takes place, we explicitly set the year column in the data.frame result to the value of the first row, since all rows received by the function have the same year.

# Example 2
ddply(d, "year", summarise, mean.count = mean(count))

res <- ore.groupApply (D, D$year, function(x) {
  mean.count <- mean(x$count)
  data.frame(year=x$year[1], mean.count = mean.count)
  }, FUN.VALUE=data.frame(year=1, mean.count=1))

R> head(res)
   year mean.count
1 2000 7.666667
2 2001 13.333333
3 2002 15.000000
4 2003 3.000000
5 2004 12.333333
6 2005 14.666667

Example 3 uses the transform function with ddply, which modifies the existing data.frame. With ore.groupApply, we again construct the data.frame explicilty, which is returned as an ore.frame.

# Example 3

ddply(d, "year", transform, total.count = sum(count))

res <- ore.groupApply (D, D$year, function(x) {
  total.count <- sum(x$count)
  data.frame(year=x$year[1], count=x$count, total.count = total.count)
  }, FUN.VALUE=data.frame(year=1, count=1, total.count=1))

> head(res)
   year count total.count
1 2000 5 23
2 2000 7 23
3 2000 11 23
4 2001 18 40
5 2001 4 40
6 2001 18 40

In Example 4, the mutate function with ddply enables you to define new columns that build on columns just defined. Since the construction of the data.frame using ore.groupApply is explicit, you always have complete control over when and how to use columns.

# Example 4

ddply(d, "year", mutate, mu = mean(count), sigma = sd(count),
      cv = sigma/mu)

res <- ore.groupApply (D, D$year, function(x) {
  mu <- mean(x$count)
  sigma <- sd(x$count)
  cv <- sigma/mu
  data.frame(year=x$year[1], count=x$count, mu=mu, sigma=sigma, cv=cv)
  }, FUN.VALUE=data.frame(year=1, count=1, mu=1,sigma=1,cv=1))

R> head(res)
   year count mu sigma cv
1 2000 5 7.666667 3.055050 0.3984848
2 2000 7 7.666667 3.055050 0.3984848
3 2000 11 7.666667 3.055050 0.3984848
4 2001 18 13.333333 8.082904 0.6062178
5 2001 4 13.333333 8.082904 0.6062178
6 2001 18 13.333333 8.082904 0.6062178

In Example 5, ddply is used to partition data on multiple columns before constructing the result. Realizing this with ore.groupApply involves creating an index column out of the concatenation of the columns used for partitioning. This example also allows us to illustrate using the ORE transparency layer to subset the data.

# Example 5

baseball.dat <- subset(baseball, year > 2000) # data from the plyr package
x <- ddply(baseball.dat, c("year", "team"), summarize,
           homeruns = sum(hr))

We first push the data set to the database to get an ore.frame. We then add the composite column and perform the subset, using the transparency layer. Since the results from database execution are unordered, we will explicitly sort these results and view the first 6 rows.

BB.DAT <- ore.push(baseball)
BB.DAT$index <- with(BB.DAT, paste(year, team, sep="+"))
BB.DAT2 <- subset(BB.DAT, year > 2000)
X <- ore.groupApply (BB.DAT2, BB.DAT2$index, function(x) {
  data.frame(year=x$year[1], team=x$team[1], homeruns=sum(x$hr))
  }, FUN.VALUE=data.frame(year=1, team="A", homeruns=1), parallel=FALSE)
res <- ore.sort(X, by=c("year","team"))

R> head(res)
   year team homeruns
1 2001 ANA 4
2 2001 ARI 155
3 2001 ATL 63
4 2001 BAL 58
5 2001 BOS 77
6 2001 CHA 63

Our next example is derived from the ggplot function documentation. This illustrates the use of ddply within using the ggplot2 package. We first create a data.frame with demo data and use ddply to create some statistics for each group (gp). We then use ggplot to produce the graph. We can take this same code, push the data.frame df to the database and invoke this on the database server. The graph will be returned to the client window, as depicted below.

# Example 6 with ggplot2

df <- data.frame(gp = factor(rep(letters[1:3], each = 10)),
                 y = rnorm(30))
# Compute sample mean and standard deviation in each group
ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))

# Set up a skeleton ggplot object and add layers:
ggplot() +
  geom_point(data = df, aes(x = gp, y = y)) +
  geom_point(data = ds, aes(x = gp, y = mean),
             colour = 'red', size = 3) +
  geom_errorbar(data = ds, aes(x = gp, y = mean,
                               ymin = mean - sd, ymax = mean + sd),
             colour = 'red', width = 0.4)

DF <- ore.push(df)
ore.tableApply(DF, function(df) {
  ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))
  ggplot() +
    geom_point(data = df, aes(x = gp, y = y)) +
    geom_point(data = ds, aes(x = gp, y = mean),
               colour = 'red', size = 3) +
    geom_errorbar(data = ds, aes(x = gp, y = mean,
                                 ymin = mean - sd, ymax = mean + sd),
                  colour = 'red', width = 0.4)

But let's take this one step further. Suppose we wanted to produce multiple graphs, partitioned on some index column. We replicate the data three times and add some noise to the y values, just to make the graphs a little different. We also create an index column to form our three partitions. Note that we've also specified that this should be executed in parallel, allowing Oracle Database to control and manage the server-side R engines. The result of ore.groupApply is an ore.list that contains the three graphs. Each graph can be viewed by printing the list element.

df2 <- rbind(df,df,df)
df2$y <- df2$y + rnorm(nrow(df2))
df2$index <- c(rep(1,300), rep(2,300), rep(3,300))
DF2 <- ore.push(df2)
res <- ore.groupApply(DF2, DF2$index, function(df) {
  df <- df[,1:2]
  ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))
  ggplot() +
    geom_point(data = df, aes(x = gp, y = y)) +
    geom_point(data = ds, aes(x = gp, y = mean),
               colour = 'red', size = 3) +
    geom_errorbar(data = ds, aes(x = gp, y = mean,
                                 ymin = mean - sd, ymax = mean + sd),
                  colour = 'red', width = 0.4)
  }, parallel=TRUE)

To recap, we've illustrated how various uses of ddply from the plyr package can be realized in ore.groupApply, which affords the user explicit control over the contents of the data.frame result in a straightforward manner. We've also highlighted how ddply can be used within an ore.groupApply call.

Friday May 30, 2014

Financial institutions build predictive models using Oracle R Enterprise to speed model deployment

See the Oracle press release, Financial Institutions Leverage Metadata Driven Modeling Capability Built on the Oracle R Enterprise Platform to Accelerate Model Deployment and Streamline Governance for a description where a "unified environment for analytics data management and model lifecycle management brings the power and flexibility of the open source R statistical platform, delivered via the in-database Oracle R Enterprise engine to support open standards compliance."

Through its integration with Oracle R Enterprise, Oracle Financial Services Analytical Applications provides "productivity, management, and governance benefits to financial institutions, including the ability to:

  • Centrally manage and control models in a single, enterprise model repository, allowing for consistent management and application of security and IT governance policies across enterprise assets

  • Reuse models and rapidly integrate with applications by exposing models as services

  • Accelerate development with seeded models and common modeling and statistical techniques available out-of-the-box

  • Cut risk and speed model deployment by testing and tuning models with production data while working within a safe sandbox

  • Support compliance with regulatory requirements by carrying out comprehensive stress testing, which captures the effects of adverse risk events that are not estimated by standard statistical and business models. This approach supplements the modeling process and supports compliance with the Pillar I and the Internal Capital Adequacy Assessment Process stress testing requirements of the Basel II Accord

  • Improve performance by deploying and running models co-resident with data. Oracle R Enterprise engines run in database, virtually eliminating the need to move data to and from client machines, thereby reducing latency and improving security"

Thursday May 01, 2014

"Darden uses analytics to understand customer restaurants"

See the InformationWeek article Darden Uses Analytics To Understand Restaurant Customers highlighting Darden's use of Oracle Advanced Analytics:

"Check-Level Analytics is one of the tools Darden plans to use to boost sales and customer loyalty, while pulling together data from across all its operations to show how they can work together better. ... it brought in some new tools, including Oracle Data Miner and Oracle R Enterprise, both included in the Oracle Advanced Analytics option to Oracle Database, to spot correlations and meaningful patterns in the data." 

Darden's data analytics project earned them the 5th spot on this year's InformationWeek Elite 100 ranking.

Sunday Apr 27, 2014

Step-by-step: Returning R statistical results as a Database Table

R provides a rich set of statistical functions that we may want to use directly from SQL. Many of these results can be readily expressed as structured table data for use with other SQL tables, or for use by SQL-enabled applications, e.g., dashboards or other statistical tools.

In this blog post, we illustrate in a sequence of five simple steps  how to go from an R function to a SQL-enabled result. Taken from recent "proof of concept" customer engagement, our example involves using the function princomp, which performs a principal components analysis on a given numeric data matrix and returns the results as an object of class princomp. The customer actively uses this R function to produce loadings used in subsequent computations and analysis. The loadings is a matrix whose columns contain the eigenvectors).

The current process of pulling data from their Oracle Database, starting an R  engine, invoking the R script, and placing the results back in the database was proving non-performant and unnecessarily complex. The goal was to leverage Oracle R Enterprise to streamline this process and allow the results to be immediately accessible
through SQL.

As a best practice, here is a process that can get you from start to finish:

Step 1: Invoke from command line, understand results

If you're using a particular R function, chances are you are familiar with its content. However, you may not be familiar with its structure. We'll use an example from the R princomp documentation that uses the USArrests data set. We see that the class of the result is of type princomp, and the model prints the call and standard deviations of the components. To understand the underlying structure, we invoke the function str and see there are seven elements in the list, one of which is the matrix loadings.

mod <- princomp(USArrests, cor = TRUE)


R> mod <- princomp(USArrests, cor = TRUE)
R> class(mod)
[1] "princomp"
R> mod
princomp(x = USArrests, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

4 variables and 50 observations.

R> str(mod)
List of 7
$ sdev : Named num [1:4] 1.575 0.995 0.597 0.416
..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
$ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
.. ..$ : chr [1:4] "Comp.1" "Comap.2" "Comp.3" "Comp.4"
$ center : Named num [1:4] 7.79 170.76 65.54 21.23
..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
$ scale : Named num [1:4] 4.31 82.5 14.33 9.27
..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
$ n.obs : int 50
$ scores : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:50] "1" "2" "3" "4" ...
.. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
$ call : language princomp(x = dat, cor = TRUE)
- attr(*, "class")= chr "princomp"

Step 2: Wrap script in a function, and invoke from ore.tableApply

Since we want to invoke princomp on database data, we first push the demo data, USArrests, to the database to create an ore.frame. Other data we wish to use will also be in database tables.

We'll use ore.tableApply (for the reasons cited in the previous blog post)  providing the ore.frame as the first argument and simply returning within our function the model produced by princomp. We'll then look at its class, retrieve the result from the database, and check its class and structure once again.

Notice that we are able to obtain the exact same result we received using our local R engine as with the database R engine through embedded R execution.

dat <- ore.push(USArrests)
computePrincomp <- function(dat) princomp(dat, cor=TRUE)
res <- ore.tableApply(dat, computePrincomp)

res.local <- ore.pull(res)


R> dat <- ore.push(USArrests)
R> computePrincomp <- function(dat) princomp(dat, cor=TRUE)
R> res <- ore.tableApply(dat, dat, computePrincomp)
R> class(res)
[1] "ore.object"
[1] "OREembed"
R> res.local <- ore.pull(res)
R> class(res.local)
[1] "princomp"

R> str(res.local)
List of 7
$ sdev : Named num [1:4] 1.575 0.995 0.597 0.416
..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
$ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
.. ..$ : chr [1:4] "Comp.1" "Comap.2" "Comp.3" "Comp.4"
$ center : Named num [1:4] 7.79 170.76 65.54 21.23
..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
$ scale : Named num [1:4] 4.31 82.5 14.33 9.27
..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
$ n.obs : int 50
$ scores : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:50] "1" "2" "3" "4" ...
.. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
$ call : language princomp(x = dat, cor = TRUE)
- attr(*, "class")= chr "princomp"

R> res.local
princomp(x = dat, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

4 variables and 50 observations.
R> res
princomp(x = dat, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

4 variables and 50 observations.

Step 3: Determine what results we really need

Since we are only interested in the loadings and any result we return needs to be a data.frame to turn it into a database row set (table result), we build the model, transform the loadings object into a data.frame, and return the data.frame as the function result. We then view the class of the result and its values.

Since we do this from the R API, we can simply print res to display the returned data.frame, as the print does an implicit ore.pull.

returnLoadings <- function(dat) {
                    mod <- princomp(dat, cor=TRUE)
                    dd <- dim(mod$loadings)
                    ldgs <-$loadings[1:dd[1],1:dd[2]])
                    ldgs$variables <- row.names(ldgs)
res <- ore.tableApply(dat, returnLoadings)



R> res <- ore.tableApply(dat, returnLoadings)
R> class(res)
[1] "ore.object"
[1] "OREembed"
R> res
             Comp.1     Comp.2     Comp.3     Comp.4 variables
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780 Murder
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748 Assault
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773 UrbanPop
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432 Rape

Step 4: Load script into the R Script Repository in the database

We're at the point of being able to load the script into the R Script Repository before invoking it from SQL. We can create the function from R or from SQL. In R,

ore.scriptCreate('princomp.loadings', returnLoadings)

or from SQL,

      'function(dat) {
        mod <- princomp(dat, cor=TRUE)
        dd <- dim(mod$loadings)
        ldgs <-$loadings[1:dd[1],1:dd[2]])
        ldgs$variables <- row.names(ldgs)

Step 5: invoke from SQL SELECT statement

Finally, we're able to invoke the function from SQL using the rqTableEval table function. We pass in a cursor with the data from our USARRESTS table. We have no parameters, so the next argument is NULL. To get the results as a table, we specify a SELECT string that defines the structure of the result. Note that the column names must be identical to what is returned in the R data.frame. The last parameter is the name of the function we want to invoke from the R script repository.

Invoking this, we see the result as a table from the SELECT statement.

select *
from table(rqTableEval( cursor(select * from USARRESTS),
                       'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual',


SQL> select *
from table(rqTableEval( cursor(select * from USARRESTS),NULL,
          'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual','princomp.loadings'));
2 3
    Comp.1     Comp.2     Comp.3     Comp.4  variables
---------- ---------- ---------- ---------- ------------
-.53589947  .418180865 -.34123273  .649227804 Murder
-.58318363  .187985604 -.26814843 -.74340748  Assault
-.27819087 -.87280619  -.37801579  .133877731 UrbanPop
-.54343209 -.16731864   .817777908 .089024323 Rape

As you see above, we have the loadings result returned as a SQL table.

In this example, we walked through the steps of moving from invoking an R function to obtain a specific result to producing that same result from SQL by invoking an R script at the database server under the control of Oracle Database.


The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.


« March 2015