Tuesday Jun 30, 2015

R Consortium Launched!

The Linux Foundation announces the R Consortium to support R users globally.

The R Consortium works with and provides support to the R Foundation and other organizations developing, maintaining and distributing R software and provides a unifying framework for the R user community.

“Data science is pushing the boundaries of what is possible in business, science, and technology, where the R language and ecosystem is a major enabling force,” said Neil Mendelson, Vice President, Big Data and Advanced Analytics, Oracle. “The R Consortium is an important enabling body to support and help grow the R user community, which increasingly includes enterprise data scientists.”

R is a key enabling technology for data science as evidenced by its dramatic rise in adoption over the past several years. We look forward to contributing to R's continued success through the R Consortium.

Monday Mar 30, 2015

Oracle Open World 2015 Call for Proposals!

It's that time of year again...submit your session proposals for Oracle OpenWorld 2015!

Oracle customers and partners are encouraged to submit proposals to present at the Oracle OpenWorld 2015 conference, October 25 - 29, 2015, held at the Moscone Center in San Francisco.

Details and submission guidelines are available on the Oracle OpenWorld Call for Proposals web site. The deadline for submissions is Wednesday, April 29, 11:59 p.m. PDT.

We look forward to checking out your sessions on Oracle Advanced Analytics, including Oracle R Enterprise and Oracle Data Mining, and Oracle R Advanced Analytics for Hadoop. Tell us how these tools have enhanced the way you do business!

Thursday Feb 12, 2015

Pain Point #6: “We need to build 10s of thousands of models fast to meet business objectives”

The last pain point in this series on Addressing Analytic Pain Points involves one aspect of what I call massive predictive modeling. Increasingly, enterprise customers are building a greater number of models. In past decades, producing a handful of production models per year may have been considered a significant accomplishment. With the advent of powerful computing platforms, parallel and distributed algorithms, and the wealth of data – Big Data – we see enterprises building hundreds and thousands of models in targeted ways.

For example, consider the utility sector with data being collected from household smart meters. Whether water, gas, or electricity, utility companies can make more precise demand projections by modeling individual customer consumption behavior. Aggregating this behavior across all households can provide more accurate forecasts, since individual household patterns are considered, not just generalizations about all households, or even different household segments.

The concerns associated with this form of massive predictive modeling include: (i) dealing effectively with Big Data across hardware, software, network, storage, and Cloud, (ii) algorithm and infrastructure scalability and performance, (iii) production deployment, and (iv) model storage, backup, recovery, and security. Some of these I’ve explored in previous pain point posts.

Oracle Advanced Analytics (OAA) and Oracle R Advanced Analytics for Hadoop (ORAAH) both provide support for massive predictive modeling. From the Oracle R Enterprise component of OAA, users leverage embedded R execution to run user-defined R functions in parallel, both from R and from SQL. OAA provides the infrastructure to allow R users to focus on their core R functionality while allowing Oracle Database to handle spawning of R engines, partitioning data and providing data to their R function across parallel R engines, aggregating results, etc. Data parallelism is enabled using the “groupApply” and “rowApply” functions, while task parallelism is enabled using the “indexApply” function. The Oracle Data Mining component of OAA provides "on-the-fly" models, also called "predictive queries," where the model is automatically built on partitions of the data and scoring using those partitioned models is similarly automated.
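
As a hedged sketch of what this looks like from R (assuming an established ORE connection and a hypothetical METER_READINGS table with CUST_ID, USAGE, and TEMP columns), ore.groupApply can partition the data by customer and build one model per partition, with Oracle Database managing the parallel R engines:

# Hypothetical sketch: one consumption model per customer via ore.groupApply.
# METER_READINGS, CUST_ID, USAGE, and TEMP are illustrative names only.
MR <- METER_READINGS            # ore.frame proxy for the database table
models <- ore.groupApply(MR, MR$CUST_ID,
  function(dat) {
    # each invocation receives one customer's rows as a local data.frame
    lm(USAGE ~ TEMP, data = dat)
  },
  parallel = TRUE)              # let Oracle Database spawn parallel R engines
length(models)                  # ore.list with one model per customer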

ORAAH enables writing mapper and reducer functions in R so that corresponding ORE functionality can be achieved on a Hadoop cluster. For example, to emulate “groupApply”, users write the mapper to partition the data and the reducer to build a model on the resulting data. To emulate “rowApply”, users can simply use the mapper to perform, e.g., data scoring, passing the model via the mapper’s environment; no reducer is required.
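
A hedged sketch of this mapper/reducer pattern follows, assuming the ORCH (ORAAH) package and an HDFS-resident data set; the data path, column names, and exact argument names are assumptions and should be checked against your ORAAH version:

library(ORCH)
dfs.dat <- hdfs.attach("/user/oracle/meter_readings")   # illustrative HDFS path
res <- hadoop.run(dfs.dat,
  mapper = function(key, val) {
    # emit rows keyed by customer id so each reducer sees one customer's partition
    orch.keyvals(val$CUST_ID, val)
  },
  reducer = function(key, vals) {
    # build one model per customer partition; return its coefficients as a row
    mod <- lm(USAGE ~ TEMP, data = vals)
    orch.keyval(key, data.frame(intercept = coef(mod)[1], slope = coef(mod)[2]))
  })
hdfs.get(res)   # retrieve the per-customer coefficients to the R client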

Friday Oct 24, 2014

Pain Point #1: “It takes too long to get my data or to get the ‘right’ data”

This is the first in a series on Addressing Analytic Pain Points: “It takes too long to get my data or to get the ‘right’ data.”

Analytics users can be characterized along multiple dimensions. One such dimension is how they get access to or receive data. For example, some receive data via flat files. Since we’re talking about “enterprise” users, this often means data stored in RDBMSs where users request data extracts from a DBA or more generally the IT department. Turnaround time can be hours to days, or even weeks, depending on the organization. If the data scientist needs more or different data, the cycle repeats – often leading to frustration on both sides and delays in generating results.

Other users are granted access to databases directly using programmatic access tools like ODBC, JDBC, their corresponding R variants, or ROracle. These users may be given read-only access to a range of data tables, possibly in a sandbox schema. Here, analytics users don’t have to go back to their DBA or IT to obtain extracts, but they still need to pull the data from the database to their client environment, e.g., a laptop, and push results back to the database. If significant volumes of data are involved, the time required for pulling data can hinder productivity. (Of course, this assumes the client has enough RAM to load the needed data sets, but that’s a topic for the next blog post.)

To address the first type of user, since much of the data in question resides in databases, empowering users with a self-service model mitigates the vicious cycle described above. When the available data are readily accessible to analytics users, they can see and select what they need at will. An Oracle Database solution addresses this data access pain point by providing schema access, possibly in a sandbox with read-only table access, for the analytics user.

Even so, this approach just turns the first type of user into the second mentioned above. An Oracle Database solution further addresses this pain point by minimizing or eliminating data movement. Most analytics engines bring data to the computation, requiring extracts and in some cases even proprietary formats before being able to perform analytics. This takes time. Often, data movement can dwarf the time required to perform the actual computation. From the perspective of the analytics user, this is wasted time because it is just a perfunctory step on the way to getting the desired results. By bringing computation to the data, using Oracle Advanced Analytics (Oracle R Enterprise and Oracle Data Mining), the time normally required to move data is eliminated. Consider the time savings of being able to prepare data, compute statistics, or build predictive models and score data directly in the database. Using Oracle Advanced Analytics, either from R via Oracle R Enterprise, SQL via Oracle Data Mining, or the graphical interface Oracle Data Miner, users can leverage Oracle Database as a high performance computational engine.
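
To make this concrete, here is a minimal sketch of the transparency layer at work; the connection details and SALES table are illustrative assumptions. Standard R syntax is translated to SQL and executed by Oracle Database, and only small results return to the client:

library(ORE)
ore.connect(user = "rquser", sid = "orcl", host = "dbhost",
            password = "rquser", all = TRUE)
ore.ls()                       # tables visible as ore.frame proxy objects
class(SALES)                   # "ore.frame": a proxy, not local data

# These calls run as SQL in the database; no rows are pulled to the client
summary(SALES$AMOUNT)
agg <- aggregate(SALES$AMOUNT, by = list(SALES$REGION), FUN = sum)
head(agg)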

We should also note that Oracle Database has the high performance Oracle Call Interface (OCI) library for programmatic data access. For R users, Oracle provides the package ROracle that is optimized using OCI for fast data access. While ROracle performance may be much faster than other methods (ODBC- and JDBC-based), the time is still greater than zero and there are other problems that I’ll address in the next pain point.
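
For comparison, a typical ROracle pull looks like the sketch below (connection parameters are placeholders); even with OCI-optimized fetching, the full result set still travels to the client:

library(ROracle)
drv <- dbDriver("Oracle")
con <- dbConnect(drv, username = "rquser", password = "rquser",
                 dbname = "dbhost:1521/orcl")   # placeholder connect string
# the entire result set is fetched into client memory as a data.frame
dat <- dbGetQuery(con, "SELECT cust_id, amount FROM sales")
dim(dat)
dbDisconnect(con)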

Addressing Analytic Pain Points

If you’re an enterprise data scientist, data analyst, or statistician, and perform analytics using R or another third party analytics engine, you’ve likely encountered one or more of these pain points:

Pain Point #1: “It takes too long to get my data or to get the ‘right’ data”
Pain Point #2: “I can’t analyze or mine all of my data – it has to be sampled”
Pain Point #3: “Putting R (or other) models and results into production is ad hoc and complex”
Pain Point #4: “Recoding R (or other) models into SQL, C, or Java takes time and is error prone”
Pain Point #5: “Our company is concerned about data security, backup and recovery”
Pain Point #6: “We need to build 10s of thousands of models fast to meet business objectives”

Some pain points are related to the scale of data, yet others are felt regardless of data size. In this blog series, I’ll explore each of these pain points, how they affect analytics users and their organizations, and how Oracle Advanced Analytics addresses them.

Monday Sep 22, 2014

Oracle R Enterprise 1.4.1 Released

Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, makes the open source R statistical programming language and environment ready for the enterprise and big data. Designed for problems involving large data volumes, Oracle R Enterprise integrates R with Oracle Database.

R users can execute R commands and scripts for statistical and graphical analyses on data stored in Oracle Database. R users can develop, refine, and deploy R scripts that leverage the parallelism and scalability of the database to automate data analysis. Data analysts and data scientists can use open source R packages and develop and operationalize R scripts for analytical applications in one step – from R or SQL.

With the new release of Oracle R Enterprise 1.4.1, Oracle enables support for Multitenant Container Database (CDB) in Oracle Database 12c and pluggable databases (PDB). With support for CDB / PDB, enterprises can take advantage of new ways of organizing their data: easily taking entire databases offline and easily bringing them back online when needed. Enterprises, such as pharmaceutical companies, that collect vast quantities of data across multiple experiments for individual projects immediately benefit from this capability.

This point release also includes the following enhancements:

• Certified for use with R 3.1.1 and Oracle R Distribution 3.1.1.

• Simplified and enhanced scripts for installing, upgrading, and uninstalling ORE Server, and for creating and configuring ORE users.

• New supporting packages: arules and statmod.

• ore.glm accepts offset terms in the model formula and can fit negative binomial and Tweedie families of GLM.

• The ore.sync argument query creates an ore.frame object from a SELECT statement without creating a view. This allows users to effectively access a view of the data without the CREATE VIEW privilege (see the sketch after this list).

• Global option for serialization, ore.envAsEmptyenv, specifies whether referenced environment objects in an R object, e.g., in an lm model, should be replaced with an empty environment during serialization to the ORE R datastore. This is used by (i) ore.push, which for a list object accepts envAsEmptyenv as an optional argument, (ii) ore.save, which has envAsEmptyenv as a named argument, and (iii) ore.doEval and the other embedded R execution functions, which accept ore.envAsEmptyenv as a control argument.
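
A brief, hedged sketch of the two features called out above; the proxy name, query text, and model are illustrative only:

# Create an ore.frame from a SELECT statement without requiring the
# CREATE VIEW privilege; SALES_2014 and the query are illustrative.
ore.sync(query = c(SALES_2014 =
  "SELECT cust_id, amount FROM sales WHERE year = 2014"))
ore.attach()
head(SALES_2014)

# Replace referenced environments with empty environments when serializing
# an R object to a datastore, keeping saved models compact.
mod <- lm(mpg ~ wt, data = mtcars)
ore.save(mod, name = "myModels", overwrite = TRUE, envAsEmptyenv = TRUE)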

Oracle R Enterprise 1.4.1
can be downloaded from OTN here.

Wednesday Sep 17, 2014

Seismic Data Repository: on-the-fly data analysis and visualization using Oracle R Enterprise

RN-KrasnoyarskNIPIneft Establishes Seismic Information Repository for One of the World’s Largest Oil and Gas Companies. Read the complete customer story here, excerpts follow.

RN-KrasnoyarskNIPIneft (KrasNIPI) is a research and development subsidiary of Rosneft Oil Company, a top oil and gas company in Russia and worldwide. KrasNIPI provides high-quality information from seismic surveys to Rosneft—delivering key information that oil and gas companies seek to lower costs, environmental impacts, and risks while exploring for resources to satisfy growing energy needs. KrasNIPI’s primary activities include preparing the information base used for the exploration of hydrocarbons, development and construction of oil and gas fields, processing and interpretation of 2-D and 3-D seismic data, and seismic data warehousing.

Part of the solution involved on-the-fly data analysis and visualization for remote users with only a thin client—such as a web browser (without additional plug-ins and extensions). This was made possible by using Oracle R Enterprise (a component of Oracle Advanced Analytics) to support applications requiring extensive analytical processing.

“We store vast amounts of seismic data, process this information with sophisticated math algorithms, and deliver it to remote users under tight deadlines. We deployed Oracle Database together with Oracle Spatial and Graph, Oracle Fusion Middleware MapViewer on Oracle WebLogic Server, and Oracle R Enterprise to keep these complex business processes running smoothly. The result exceeded our most optimistic expectations.”
                              – Artem Khodyaev, Chief Engineer
                                                              Corporate Center of Seismic Information Repository
                                                              RN-KrasnoyarskNIPIneft

Monday Aug 18, 2014

Real-time Big Data Analytics is a reality for StubHub with Oracle Advanced Analytics

What can you use for a comprehensive platform for real-time analytics?
How can you process big data volumes for near-real-time recommendations and dramatically reduce fraud?

Learn in this video what Stubhub achieved with Oracle R Enterprise from the Oracle Advanced Analytics option to Oracle Database, and read more on their story here.

Advanced analytics solutions that impact the bottom line of a business are challenging due to the range of skills and individuals involved in realizing such solutions. While we hear a lot about the role of the data scientist, that role is but one piece of the puzzle. Advanced analytics solutions also have an operationalization aspect that requires close proximity to where the transactional activity occurs.

The data scientist needs access to the right data with which to model the business problem. This involves IT for data collection, management, and administration, as well as ensuring zero downtime (a website needs to be up 24x7). This also involves working with the data scientist to keep predictive models refreshed with the latest scripts.

Integrating advanced analytics solutions into enterprise apps involves not just generating predictions, but supporting the whole life-cycle from data collection, to model building, model assessment, and then outcome assessment and feedback to the model building process again. Application and web interface designers need to take into account how end users will see and use the advanced analytics results, e.g., supporting operations staff that need to handle the potentially fraudulent transactions.

As just described, advanced analytics projects can be "complicated" from just a human perspective. The extent to which software can simplify the interactions among users and systems increases the likelihood of project success. The ability to quickly operationalize advanced analytics projects and demonstrate measurable value means the difference between a successful project and just a nice research report.

By standardizing on Oracle Database and SQL invocation of R, along with in-database modeling as found in Oracle Advanced Analytics, expedient model deployment and zero downtime for refreshing models becomes a reality. Meanwhile, data scientists are also able to explore leading edge techniques available in open source. The Oracle solution propels the entire organization forward to realize the value of advanced analytics.

Thursday Aug 14, 2014

Selecting the most predictive variables – returning Attribute Importance results as a database table

Attribute Importance (AI) is a technique of Oracle Advanced Analytics (OAA) that ranks the relative importance of predictors given a categorical or numeric target for classification or regression models, respectively. OAA AI uses the minimum description length algorithm and produces importance scores such that predictors with positive scores help predict the target, while those with zero or negative scores do not, and may even contribute noise to a model, making it less accurate. OAA AI, however, considers predictors only pairwise with the target, so any interactions among predictors are not addressed. OAA AI is a good first assessment of which predictors should be included in a classification or regression model, enabling what is sometimes called feature selection or variable selection.

In my series on Oracle R Enterprise Embedded R Execution, I explored how structured table results could be returned from embedded R calls. In a subsequent post, I explored how to return select results from a principal components analysis (PCA) model as a table. In this post, I describe how you can work with results from an Attribute Importance model from ORE embedded R execution via an R function. This R function takes a table name and target variable name as input, places the predictor rankings in a named ORE datastore, also specified as input, and returns a data.frame with the predictor variable name, rank, and importance value.

The function below implements this functionality. Notice that we dynamically sync the named table and get its ore.frame proxy object. From here, we invoke ore.odmAI using the formula dynamically generated from the targetName argument. We pull out the importance component of the result, explicitly assign the row names to the column variable, and then reorder the columns. Next, we nullify the row names since these are now redundant with the variable column.

The next three lines assign the result to a datastore. This is technically not necessary since the result is returned by the function, but if a user wanted to access this result without recomputing it, the user could simply retrieve the datastore object using another embedded R function. Loading the named datastore and returning its contents as an ore.frame in R or a database table in SQL is left as an exercise for the reader.

Lastly, the resulting data.frame is returned.

rankPredictors <- function(tableName,targetName,dsName) {
  ore.sync(table=tableName)
  ore.attach()
  dat <- ore.get(tableName)
  formulaStr <- paste(targetName,".",sep="~")
  res <- ore.odmAI(as.formula(formulaStr),dat)
  res <- res$importance
  res$variable <- rownames(res)
  res <- res[,c("variable","rank","importance")]
  row.names(res) <- NULL
  resName <- paste(tableName,targetName,"AI",sep=".")
  assign(resName,res)
  ore.save(list=c(resName),name=dsName,overwrite=TRUE)
  res
}

To test this function, we invoke it explicitly with suitable arguments.

res <- rankPredictors ("IRIS","Species","/DS/Test1")
res

Here, you see the results.

> res
    variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

The contents of the datastore can be accessed as well.

ore.datastore(pattern="/DS")
ore.datastoreSummary(name="/DS/Test1")
ore.load("/DS/Test1")
IRIS.Species.AI
> ore.datastore(pattern="/DS")
  datastore.name object.count size       creation.date description
1      /DS/Test1            1  355 2014-08-14 16:38:46        <NA>
> ore.datastoreSummary(name="/DS/Test1")
      object.name      class size length row.count col.count
1 IRIS.Species.AI data.frame  355      3         4         3
> ore.load("/DS/Test1")
[1] "IRIS.Species.AI"
> IRIS.Species.AI
    variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

With the confidence that our R function is behaving correctly, we load it into the R Script Repository in Oracle Database.

ore.scriptDrop("rankPredictors")
ore.scriptCreate("rankPredictors",rankPredictors)

To test that the function behaves properly with embedded R execution, we invoke it first from R using ore.doEval, passing the desired parameters and returning the result as an ore.frame. This last part is enabled through the specification of the FUN.VALUE argument. Since we are using a datastore and the transparency layer, ore.connect is set to TRUE.

ore.doEval(
FUN.NAME="rankPredictors",
  tableName="IRIS",
  target="Species",
  dsName="/AttributeImportance/IRIS/Species",
  FUN.VALUE=data.frame(variable=character(0)
      ,rank=numeric(0)
      ,importance=numeric(0)),
  ore.connect=TRUE
)

Notice we get the same result as above.

    variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

Again, we can view the datastore contents for the execution above. Notice our use of the “/” notation to organize our datastore content. While we can name datastores with any arbitrary string, this approach can help structure the retrieval of datastore contents.

ore.datastore(pattern="/AttributeImportance/IRIS")
ore.datastoreSummary(name="/AttributeImportance/IRIS/Species")

We have a single datastore matching our IRIS data set followed by the summary with the IRIS.Species.AI object, which is an R data.frame with 3 columns and 4 rows.

> ore.datastore(pattern="/AttributeImportance/IRIS")
                     datastore.name object.count size       creation.date description
1 /AttributeImportance/IRIS/Species            1  355 2014-08-14 16:55:40
> ore.datastoreSummary(name="/AttributeImportance/IRIS/Species")
      object.name      class size length row.count col.count
1 IRIS.Species.AI data.frame  355      3         4         3

To execute this R script from SQL, use the ORE SQL API.

select * from table(rqEval(
  cursor(select 1 "ore.connect",
      'IRIS' "tableName",
      'Species' "targetName",
      '/AttributeImportance/IRIS/Species' "dsName"
      from dual),
  'select cast(''a'' as varchar2(50)) "variable",
  1 "rank",
  1 "importance"
  from dual',
  'rankPredictors'));

In summary, we’ve explored how to use ORE embedded R execution to extract model elements from an in-database algorithm and present them as an R data.frame, an ore.frame, and a SQL table.

The process used above can also serve as a template for working on your own embedded R execution projects:

+ Interactively develop an R script that does what you need and wrap it in a function
+ Validate that the R function behaves as expected
+ Store the function in the R Script Repository
+ Validate that the R interface to embedded R execution produces the desired results
+ Generate SQL query that invokes the R function
+ Validate that the SQL interface to embedded R execution produces the desired results

Thursday Jul 24, 2014

Are you experiencing analytics pain points?

At the useR! 2014 conference at UCLA in early July, a stimulating and well-attended event, I spoke about Oracle’s R Technologies during the sponsor talks. One of my slides focused on examples of analytics pain points we often hear from customers and prospects. For example,

“It takes too long to get my data or to get the ‘right’ data”
“I can’t analyze or mine all of my data – it has to be sampled”
“Putting R models and results into production is ad hoc and complex”
“Recoding R models into SQL, C, or Java takes time and is error prone”
“Our company is concerned about data security, backup and recovery”
“We need to build 10s of thousands of models fast to meet business objectives”

After the talk, several people approached me remarking how these are exactly the problems they encounter in their organizations. One person even asked if I’d interviewed her for my talk, since she is experiencing every one of these pain points.

Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, addresses these pain points. Let’s take a look one by one.

If it takes too long to get your data, perhaps because you’re moving it from the database where it resides to an external analytics server or laptop, the ideal solution is not to move it. Analyze it where it is. This is exactly what Oracle R Enterprise allows you to do using the transparency layer and in-database predictive analytics capabilities. With Oracle R Enterprise, R functions normally performed on data.frames are translated to SQL for execution in the database, taking advantage of query optimization, indexes, parallel-distributed execution, etc. With the advent of the Oracle Database In-Memory option, this has even more advantages, but that’s a topic for another post. The second part of this pain point is getting access to the “right” data. Allowing your data scientist to have a sandbox with access to the range of data necessary to perform his/her work avoids the delay of requesting flat file extracts via the DBA, only to realize that more or different data is required. The cycle time in getting the “right” data impedes progress, not to mention annoying some key individuals in your organization. We’ll come back to the security aspects later.
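
As a small, hedged illustration (the CHURN_DATA table and its columns are assumptions), data preparation with standard R syntax on an ore.frame executes as SQL inside Oracle Database, so no rows move to the client:

CD <- CHURN_DATA                       # ore.frame proxy for a database table
CD$TENURE_YRS <- CD$TENURE_MONTHS / 12 # derived column, computed in-database
recent <- CD[CD$LAST_ACTIVE > 2013, c("CUST_ID", "TENURE_YRS", "CHURNED")]
table(recent$CHURNED)                  # in-database crosstab; only the small result returns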

Increasingly, data scientists want to avoid sampling data when analyzing data or building predictive models. At a minimum, they want to use much more data than may fit in typical analytics servers. Oracle R Enterprise provides an R interface to powerful in-database analytic functions and data mining algorithms. These algorithms are designed to work in a parallel, distributed manner whether the data fits in memory or not. In other cases, sampling is desired, if not required, but this results in a chicken-and-egg problem: the data need to be sampled since they won’t fit in memory, but the data are too big to fit in memory to sample! Users have developed home-grown techniques to chunk the data and combine partial samples; however, they shouldn’t have to. When sampling is desired or required, Oracle R Enterprise can leverage row indexing and in-database sampling to extract only the database table rows that are in the sample, using standard R syntax or Oracle R Enterprise-based sampling functions.
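
A hedged sketch of in-database sampling via row indexing follows (BIG_TABLE is an illustrative ore.frame proxy name); only the sampled rows are ever materialized on the client:

set.seed(42)
N    <- nrow(BIG_TABLE)        # row count computed in the database
idx  <- sample(N, 10000)       # choose the sample row numbers locally
SAMP <- BIG_TABLE[idx, ]       # still an ore.frame; no data movement yet
dat  <- ore.pull(SAMP)         # bring only the sample to local R memory
dim(dat)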

Our next pain point involves production deployment. Many good predictive models have been laid waste for lack of integration with or complexity introduced by production environments. Enterprise applications and dashboards often speak SQL and know how to access data. However, to craft a solution that extracts data, invokes an R script in an external R engine, and places batch results back in the database requires a lot of manual coding, often leveraging ad hoc cron jobs. Oracle R Enterprise enables the execution of R scripts on the database server machine, in local R engines under the control of Oracle Database. This can be done from R and SQL. Using the SQL API, R scripts can be invoked to return results in the form of table data, images, and XML. In addition, data can be moved to these R engines more efficiently, and the powerful database hardware, such as Exadata machines, can be leveraged for data-parallel and task-parallel R script execution.

When users don’t have access to a tight integration between R and SQL as noted above, another pain point involves using R only to build the models and relying on developers to recode the scoring procedures in a programming language that fits with the production environment, e.g., SQL, C, or Java. This has multiple downsides: it takes time to recode, manual recoding is error prone, and the resulting code requires significant testing. When the model is refreshed, the process repeats.

The pain points discussed so far also suffer from concerns about security, backup, and recovery. If data is being moved around in flat files, what security protocols or access controls are placed on those flat files? How can access be audited? Oracle R Enterprise enables analytics users to leverage an Oracle Database secured environment for data access. Moving on, if R scripts, models, and other R objects are stored and managed as flat files, how are these backed up? How are they synced with the deployed application? By storing all these artifacts in Oracle Database via Oracle R Enterprise, backup is a normal part of DBA operation with established protocols. The R Script Repository and Datastore simplify backup. When crafting ad hoc solutions involving third party analytic servers, there is also the issue of recovery, or resilience to failures. Fewer moving parts mean lower complexity. Programming for failure contingencies in a distributed application adds significant complexity to an application. Allowing Oracle Database to control the execution of R scripts in database server side R engines reduces complexity and frees application developers and data scientists to focus on the more creative aspects of their work.
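
As a brief sketch (the datastore, script, and column names are illustrative), R artifacts persist in Oracle Database rather than as flat files, so they are covered by standard database backup and recovery:

# Save a model to an ORE datastore, backed up along with the database
mod <- lm(mpg ~ wt + hp, data = mtcars)
ore.save(mod, name = "prodModels", overwrite = TRUE)
ore.datastore(pattern = "prodModels")

# Store the scoring logic in the R Script Repository for use from R or SQL
ore.scriptCreate("scoreIt", function(dat, dsName) {
  ore.load(dsName)           # restores 'mod' from the datastore
  data.frame(pred = predict(mod, newdata = dat))
})
# (when invoked via embedded R execution, set ore.connect=TRUE to use the datastore)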

Lastly, users of advanced analytics software – data scientists, analysts, statisticians – are increasingly pushing the barriers of scalability. Not just in the volume of data processed, but in the number and frequency of their computations and analyses, e.g., predictive model building. Where only a few models are involved, it may be tractable to manage a few files to store predictive models on disk (although, as noted above, this has its own complications). When you need to build thousands or hundreds of thousands of models, managing these models becomes a challenge in its own right.

In summary, customers are facing a wide range of pain points in their analytics activities. Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, addresses these pain points allowing data scientists, analysts, and statisticians, as well as the IT staff who supports them, to be more productive, while promoting and enabling new uses of advanced analytics.

Thursday Jun 05, 2014

Convert ddply {plyr} to Oracle R Enterprise, or use with Embedded R Execution

The plyr package contains a set of tools for partitioning a problem into smaller sub-problems that can be more easily processed. One function within {plyr} is ddply, which allows you to specify subsets of a data.frame and then apply a function to each subset. The result is gathered into a single data.frame. Such a capability is very convenient. The function ddply also has a parallel option that, if TRUE, will apply the function in parallel using the backend provided by foreach.

This type of functionality is available through Oracle R Enterprise using the ore.groupApply function. In this blog post, we show a few examples from Sean Anderson's "A quick introduction to plyr" to illustrate the corresponding functionality using ore.groupApply.

To get started, we'll create a demo data set and load the plyr package.


set.seed(1)
d <- data.frame(year = rep(2000:2014, each = 3),
        count = round(runif(45, 0, 20)))
dim(d)
library(plyr)

This first example takes the data frame, partitions it by year, and calculates the coefficient of variation of the count, returning a data frame.


# Example 1
res <- ddply(d, "year", function(x) {
  mean.count <- mean(x$count)
  sd.count <- sd(x$count)
  cv <- sd.count/mean.count
  data.frame(cv.count = cv)
  })

To illustrate the equivalent functionality in Oracle R Enterprise, using embedded R execution, we use the ore.groupApply function on the same data, but pushed to the database, creating an ore.frame. The function ore.push creates a temporary table in the database, returning a proxy object, the ore.frame.


D <- ore.push(d)
res <- ore.groupApply (D, D$year, function(x) {
  mean.count <- mean(x$count)
  sd.count <- sd(x$count)
  cv <- sd.count/mean.count
  data.frame(year=x$year[1], cv.count = cv)
  }, FUN.VALUE=data.frame(year=1, cv.count=1))

You'll notice the similarities in the first three arguments. With ore.groupApply, we augment the function to return the specific data.frame we want. We also specify the argument FUN.VALUE, which describes the resulting data.frame. From our previous blog posts, you may recall that by default, ore.groupApply returns an ore.list containing the results of each function invocation. To get a data.frame, we specify the structure of the result.

The results in both cases are the same; however, the ore.groupApply result is an ore.frame. In this case, the data stays in the database until it's actually required. This can result in significant memory and time savings when data is large.


R> class(res)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> head(res)
   year cv.count
1 2000 0.3984848
2 2001 0.6062178
3 2002 0.2309401
4 2003 0.5773503
5 2004 0.3069680
6 2005 0.3431743

To make the ore.groupApply execute in parallel, you can specify the argument parallel with either TRUE, to use default database parallelism, or to a specific number, which serves as a hint to the database as to how many parallel R engines should be used.
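
For example, reusing the data and function from Example 1, a hint of four parallel engines could be given as follows (a minimal sketch):

res <- ore.groupApply (D, D$year, function(x) {
  mean.count <- mean(x$count)
  sd.count <- sd(x$count)
  cv <- sd.count/mean.count
  data.frame(year=x$year[1], cv.count = cv)
  }, FUN.VALUE=data.frame(year=1, cv.count=1), parallel=4)  # hint: up to 4 R engines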

The next ddply example uses the summarise function, which creates a new data.frame. In ore.groupApply, the year column is passed in with the data. Since no automatic creation of columns takes place, we explicitly set the year column in the data.frame result to the value of the first row, since all rows received by the function have the same year.


# Example 2
ddply(d, "year", summarise, mean.count = mean(count))

res <- ore.groupApply (D, D$year, function(x) {
  mean.count <- mean(x$count)
  data.frame(year=x$year[1], mean.count = mean.count)
  }, FUN.VALUE=data.frame(year=1, mean.count=1))

R> head(res)
   year mean.count
1 2000 7.666667
2 2001 13.333333
3 2002 15.000000
4 2003 3.000000
5 2004 12.333333
6 2005 14.666667

Example 3 uses the transform function with ddply, which modifies the existing data.frame. With ore.groupApply, we again construct the data.frame explicitly, which is returned as an ore.frame.


# Example 3

ddply(d, "year", transform, total.count = sum(count))

res <- ore.groupApply (D, D$year, function(x) {
  total.count <- sum(x$count)
  data.frame(year=x$year[1], count=x$count, total.count = total.count)
  }, FUN.VALUE=data.frame(year=1, count=1, total.count=1))

> head(res)
   year count total.count
1 2000 5 23
2 2000 7 23
3 2000 11 23
4 2001 18 40
5 2001 4 40
6 2001 18 40

In Example 4, the mutate function with ddply enables you to define new columns that build on columns just defined. Since the construction of the data.frame using ore.groupApply is explicit, you always have complete control over when and how to use columns.


# Example 4

ddply(d, "year", mutate, mu = mean(count), sigma = sd(count),
      cv = sigma/mu)

res <- ore.groupApply (D, D$year, function(x) {
  mu <- mean(x$count)
  sigma <- sd(x$count)
  cv <- sigma/mu
  data.frame(year=x$year[1], count=x$count, mu=mu, sigma=sigma, cv=cv)
  }, FUN.VALUE=data.frame(year=1, count=1, mu=1,sigma=1,cv=1))

R> head(res)
   year count mu sigma cv
1 2000 5 7.666667 3.055050 0.3984848
2 2000 7 7.666667 3.055050 0.3984848
3 2000 11 7.666667 3.055050 0.3984848
4 2001 18 13.333333 8.082904 0.6062178
5 2001 4 13.333333 8.082904 0.6062178
6 2001 18 13.333333 8.082904 0.6062178

In Example 5, ddply is used to partition data on multiple columns before constructing the result. Realizing this with ore.groupApply involves creating an index column out of the concatenation of the columns used for partitioning. This example also allows us to illustrate using the ORE transparency layer to subset the data.


# Example 5

baseball.dat <- subset(baseball, year > 2000) # data from the plyr package
x <- ddply(baseball.dat, c("year", "team"), summarize,
           homeruns = sum(hr))

We first push the data set to the database to get an ore.frame. We then add the composite column and perform the subset, using the transparency layer. Since the results from database execution are unordered, we will explicitly sort these results and view the first 6 rows.


BB.DAT <- ore.push(baseball)
BB.DAT$index <- with(BB.DAT, paste(year, team, sep="+"))
BB.DAT2 <- subset(BB.DAT, year > 2000)
X <- ore.groupApply (BB.DAT2, BB.DAT2$index, function(x) {
  data.frame(year=x$year[1], team=x$team[1], homeruns=sum(x$hr))
  }, FUN.VALUE=data.frame(year=1, team="A", homeruns=1), parallel=FALSE)
res <- ore.sort(X, by=c("year","team"))

R> head(res)
   year team homeruns
1 2001 ANA 4
2 2001 ARI 155
3 2001 ATL 63
4 2001 BAL 58
5 2001 BOS 77
6 2001 CHA 63

Our next example is derived from the ggplot function documentation. This illustrates the use of ddply in conjunction with the ggplot2 package. We first create a data.frame with demo data and use ddply to create some statistics for each group (gp). We then use ggplot to produce the graph. We can take this same code, push the data.frame df to the database, and invoke it on the database server. The graph will be returned to the client window, as depicted below.


# Example 6 with ggplot2

library(ggplot2)
df <- data.frame(gp = factor(rep(letters[1:3], each = 10)),
                 y = rnorm(30))
# Compute sample mean and standard deviation in each group
library(plyr)
ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))

# Set up a skeleton ggplot object and add layers:
ggplot() +
  geom_point(data = df, aes(x = gp, y = y)) +
  geom_point(data = ds, aes(x = gp, y = mean),
             colour = 'red', size = 3) +
  geom_errorbar(data = ds, aes(x = gp, y = mean,
                               ymin = mean - sd, ymax = mean + sd),
             colour = 'red', width = 0.4)

DF <- ore.push(df)
ore.tableApply(DF, function(df) {
  library(ggplot2)
  library(plyr)
  ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))
  ggplot() +
    geom_point(data = df, aes(x = gp, y = y)) +
    geom_point(data = ds, aes(x = gp, y = mean),
               colour = 'red', size = 3) +
    geom_errorbar(data = ds, aes(x = gp, y = mean,
                                 ymin = mean - sd, ymax = mean + sd),
                  colour = 'red', width = 0.4)
})

But let's take this one step further. Suppose we wanted to produce multiple graphs, partitioned on some index column. We replicate the data three times and add some noise to the y values, just to make the graphs a little different. We also create an index column to form our three partitions. Note that we've also specified that this should be executed in parallel, allowing Oracle Database to control and manage the server-side R engines. The result of ore.groupApply is an ore.list that contains the three graphs. Each graph can be viewed by printing the list element.

df2 <- rbind(df,df,df)
df2$y <- df2$y + rnorm(nrow(df2))
df2$index <- c(rep(1,30), rep(2,30), rep(3,30))
DF2 <- ore.push(df2)
res <- ore.groupApply(DF2, DF2$index, function(df) {
  df <- df[,1:2]
  library(ggplot2)
  library(plyr)
  ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))
  ggplot() +
    geom_point(data = df, aes(x = gp, y = y)) +
    geom_point(data = ds, aes(x = gp, y = mean),
               colour = 'red', size = 3) +
    geom_errorbar(data = ds, aes(x = gp, y = mean,
                                 ymin = mean - sd, ymax = mean + sd),
                  colour = 'red', width = 0.4)
  }, parallel=TRUE)
res[[1]]
res[[2]]
res[[3]]

To recap, we've illustrated how various uses of ddply from the plyr package can be realized in ore.groupApply, which affords the user explicit control over the contents of the data.frame result in a straightforward manner. We've also highlighted how ddply can be used within an ore.groupApply call.

Friday May 30, 2014

Financial institutions build predictive models using Oracle R Enterprise to speed model deployment

See the Oracle press release, Financial Institutions Leverage Metadata Driven Modeling Capability Built on the Oracle R Enterprise Platform to Accelerate Model Deployment and Streamline Governance for a description where a "unified environment for analytics data management and model lifecycle management brings the power and flexibility of the open source R statistical platform, delivered via the in-database Oracle R Enterprise engine to support open standards compliance."

Through its integration with Oracle R Enterprise, Oracle Financial Services Analytical Applications provides "productivity, management, and governance benefits to financial institutions, including the ability to:


  • Centrally manage and control models in a single, enterprise model repository, allowing for consistent management and application of security and IT governance policies across enterprise assets

  • Reuse models and rapidly integrate with applications by exposing models as services

  • Accelerate development with seeded models and common modeling and statistical techniques available out-of-the-box

  • Cut risk and speed model deployment by testing and tuning models with production data while working within a safe sandbox

  • Support compliance with regulatory requirements by carrying out comprehensive stress testing, which captures the effects of adverse risk events that are not estimated by standard statistical and business models. This approach supplements the modeling process and supports compliance with the Pillar I and the Internal Capital Adequacy Assessment Process stress testing requirements of the Basel II Accord

  • Improve performance by deploying and running models co-resident with data. Oracle R Enterprise engines run in database, virtually eliminating the need to move data to and from client machines, thereby reducing latency and improving security"

Monday May 19, 2014

Model cross-validation with ore.CV()

In this blog post we illustrate how to use Oracle R Enterprise for performing cross-validation of regression and classification models. We describe a new utility R function ore.CV that leverages features of Oracle R Enterprise and is available for download and use.

Predictive models are usually built on given data and verified on held-aside or unseen data. Cross-validation is a model improvement technique that avoids the limitations of a single train-and-test experiment by building and testing multiple models via repeated sampling from the available data. Its purpose is to offer better insight into how well the model would generalize to new data and to avoid over-fitting and deriving wrong conclusions from misleading peculiarities of the seen data.

In a k-fold cross-validation the data is partitioned into k (roughly) equal size subsets. One of the subsets is retained for testing and the remaining k-1 subsets are used for training. The process is repeated k times with each of the k subsets serving exactly once as testing data. Thus, all observations in the original data set are used for both training and testing.

The choice of k depends, in practice, on the size n of the data set. For large data, k=3 could be sufficient. For very small data, in the extreme case where k=n, leave-one-out cross-validation (LOOCV) uses a single observation from the original sample as testing data and the remaining observations as training data. Common choices are k=10 or k=5.
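
To make the fold mechanics concrete, here is a minimal sketch of k-fold cross-validation in plain R, using lm on mtcars purely for illustration; ore.CV automates this pattern for the supported model types against in-database data:

# Minimal k-fold cross-validation sketch in plain R (illustration only)
set.seed(1)
k     <- 5
dat   <- mtcars
folds <- sample(rep(1:k, length.out = nrow(dat)))   # assign each row to a fold
rmse  <- sapply(1:k, function(i) {
  train <- dat[folds != i, ]                         # k-1 folds for training
  test  <- dat[folds == i, ]                         # held-out fold for testing
  mod   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(mod, newdata = test)
  sqrt(mean((test$mpg - pred)^2))                    # RMSE on the held-out fold
})
mean(rmse)                                           # average error across folds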

For a select set of algorithms and cases, the function ore.CV performs cross-validation for models generated by ORE regression and classification functions using in-database data. ORE embedded R execution is leveraged to also support cross-validation for models built with vanilla R functions.

Usage

ore.CV(funType, function, formula, dataset, nFolds=<nb.folds>, fun.args=NULL, pred.args=NULL, pckg.lst=NULL)
  • funType - "regression" or "classification"
  • function - ORE predictive modeling functions for regression & classification or R function (regression only)
  • formula - object of class "formula"
  • dataset - name of the ore.frame
  • nFolds - number of folds
  • fun.args - list of supplementary arguments for 'function'
  • pred.args - list of supplementary arguments for 'predict'. Must be consistent with the model object/model generator 'function'.
  • pckg.lst - list of packages to be loaded by the DB R engine for embedded execution.
The set of functions supported for ORE includes:
  • ore.lm
  • ore.stepwise
  • ore.neural
  • ore.glm
  • ore.odmDT
  • ore.odmSVM
  • ore.odmGLM
  • ore.odmNB
The set of functions supported for R includes:
  • lm
  • glm
  • svm
Note: The 'ggplot' and 'reshape' packages are required on the R client side for data post-processing and plotting (classification CV).

Examples

In the following examples, we illustrate various ways to invoke ore.CV using some datasets we have seen in previous posts. The datasets can be created as ore.frame objects using:
 
IRIS <- ore.push(iris)
LONGLEY <- ore.push(longley)
library(rpart)
KYPHOSIS <- ore.push(kyphosis)
library(PASWR)
TITANIC3 <- ore.push(titanic3)
MTCARS <- ore.push(mtcars)
(A) Cross-validation for models generated with ORE functions.
 
# Basic specification
ore.CV("regression","ore.lm",Sepal.Length~.-Species,"IRIS",nFolds=5)
ore.CV("regression","ore.neural",Employed~GNP+Population+Year,
            "LONGLEY",nFolds=5)

#Specification of function arguments
ore.CV("regression","ore.stepwise",Employed~.,"LONGLEY",nFolds=5,
            fun.args= list(add.p=0.15,drop.p=0.15))
ore.CV("regression","ore.odmSVM",Employed~GNP+Population+Year,
             "LONGLEY",nFolds=5, fun.args="regression")

#Specification of function arguments and prediction arguments
ore.CV("classification","ore.glm",Kyphosis~.,"KYPHOSIS",nFolds=5,
             fun.args=list(family=binomial()),pred.args=list(type="response"))
ore.CV("classification","ore.odmGLM",Kyphosis~.,"KYPHOSIS",nFolds=5,
            fun.args= list(type="logistic"),pred.args=list(type="response"))
 
(B) Cross-validation for models generated with R functions via the ORE embedded execution mechanism.

ore.CV("regression","lm",mpg~cyl+disp+hp+drat+wt+qsec,"MTCARS",nFolds=3)
ore.CV("regression","svm",Sepal.Length~.-Species,"IRIS",nFolds=5,
             fun.args=list(type="eps-regression"), pckg.lst=c("e1071")) 


Restrictions

  • The signature of the model generator ‘function’ must be of the following type: function(formula, data, ...). For example, functions like ore.stepwise, ore.odmGLM, and lm are supported, but the R step(object, scope, ...) function for AIC model selection via the stepwise algorithm does not satisfy this requirement.
  • The model validation process requires the prediction function to return a (1-dimensional) vector with the predicted values. If the (default) returned object is different, the requirement must be met by providing an appropriate argument through ‘pred.args’. For example, for classification with ore.glm or ore.odmGLM the user should specify pred.args=list(type="response").
  • Cross-validation of classification models via embedded R execution of vanilla R functions is not supported yet.
  • Remark: Cross-validation is not a technique intended for large data, as the cost of training and testing multiple models can become prohibitive. Moreover, with large data sets, it is possible to produce effective sampled train and test data sets. The current ore.CV does not impose any restrictions on the size of the input, and users working with large data should use good judgment when choosing the model generator and the number of folds.

    Output

    The function ore.CV provides output on several levels: datastores to contain model results, plots, and text output.

    Datastores

    The results of each cross-validation run are saved into a datastore named dsCV_funTyp_data_Target_function_nFxx where funTyp, function, nF(=nFolds) have been described above and Target is the left-hand-side of the formula. For example, if one runs the ore.neural, ore.glm, and ore.odmNB-based cross-validation examples from above, the following three datastores are produced:
    
    R> ds <- ore.datastore(pattern="dsCV")
    R> print(ds)
                                            datastore.name object.count    size       creation.date description
    1   dsCV_classification_KYPHOSIS_Kyphosis_ore.glm_nF5            10 4480326 2014-04-30 18:19:55        <NA>
    2 dsCV_classification_TITANIC3_survived_ore.odmNB_nF5            10  592083 2014-04-30 18:21:35        <NA>
    3      dsCV_regression_LONGLEY_Employed_ore.neural_nF5           10  497204 2014-04-30 18:16:35        <NA>
    
    Each datastore contains the models and prediction tables for every fold. Every prediction table has 3 columns: the fold index together with the target variable/class and the predicted values. If we consider the example from above and examine the most recent datastore (the Naive Bayes classification CV), we would see:
    
    R> ds.last <- ds$datastore.name[which.max(as.numeric(ds$creation.date))]
    R> ore.datastoreSummary(name=ds.last)
    object.name class size length row.count col.count
    1 model.fold1 ore.odmNB 66138 9 NA NA
    2 model.fold2 ore.odmNB 88475 9 NA NA
    3 model.fold3 ore.odmNB 110598 9 NA NA
    4 model.fold4 ore.odmNB 133051 9 NA NA
    5 model.fold5 ore.odmNB 155366 9 NA NA
    6 test.fold1 ore.frame 7691 3 261 3
    7 test.fold2 ore.frame 7691 3 262 3
    8 test.fold3 ore.frame 7691 3 262 3
    9 test.fold4 ore.frame 7691 3 262 3
    10 test.fold5 ore.frame 7691 3 262 3
    
    

    Plots

    The following plots are generated automatically by ore.CV and saved in an automatically generated OUTPUT directory:

  • Regression: ore.CV plots predicted vs. target values and produces boxplots of root mean square error (RMSE) and relative error (RERR) per fold. The example below is based on 5-fold cross-validation with the ore.lm regression model for Sepal.Length ~.-Species using the ore.frame IRIS dataset.
  • Classification: ore.CV outputs a multi-plot figure for classification metrics like Precision, Recall, and F-measure. Each metric is captured per target class (side-by-side barplots) and per fold (groups of barplots). The example below is based on 5-fold CV of the ore.odmSVM classification model for Species ~. using the ore.frame IRIS dataset.
  • Text output
    For classification problems, the confusion tables for each fold are saved in an output file residing in the OUTPUT directory, together with a summary table displaying the precision, recall, and F-measure metrics for every fold and predicted class.
    file.show("OUTDIR/tbl_CV_classification_IRIS_Species_ore.odmSVM_nF5")
    
    Confusion table for fold 1 :           
                 setosa versicolor virginica
      setosa          9          0         0
      versicolor      0         12         1
      virginica       0          1         7
    Confusion table for fold 2 :            
                 setosa versicolor virginica
      setosa          9          0         0
      versicolor      0          8         1
      virginica       0          2        10
    Confusion table for fold 3 :           
                 setosa versicolor virginica
      setosa         11          0         0
      versicolor      0         10         2
      virginica       0          0         7
    Confusion table for fold 4 :            
                 setosa versicolor virginica
      setosa          9          0         0
      versicolor      0         10         0
      virginica       0          2         9
    Confusion table for fold 5 :            
                 setosa versicolor virginica
      setosa         12          0         0
      versicolor      0          5         1
      virginica       0          0        12
    Accuracy, Recall & F-measure table per {class,fold}
       fold      class TP  m  n Precision Recall F_meas
    1     1     setosa  9  9  9     1.000  1.000  1.000
    2     1 versicolor 12 13 13     0.923  0.923  0.923
    3     1  virginica  7  8  8     0.875  0.875  0.875
    4     2     setosa  9  9  9     1.000  1.000  1.000
    5     2 versicolor  8  9 10     0.889  0.800  0.842
    6     2  virginica 10 12 11     0.833  0.909  0.870
    7     3     setosa 11 11 11     1.000  1.000  1.000
    8     3 versicolor 10 12 10     0.833  1.000  0.909
    9     3  virginica  7  7  9     1.000  0.778  0.875
    10    4     setosa  9  9  9     1.000  1.000  1.000
    11    4 versicolor 10 10 12     1.000  0.833  0.909
    12    4  virginica  9 11  9     0.818  1.000  0.900
    13    5     setosa 12 12 12     1.000  1.000  1.000
    14    5 versicolor  5  6  5     0.833  1.000  0.909
    15    5  virginica 12 12 13     1.000  0.923  0.960
    
              
    What's next
    Several extensions of ore.CV are possible involving sampling, parallel model training and testing, support for vanilla R classifiers, post-processing and output. More material for future posts.

    Sunday Apr 27, 2014

    Step-by-step: Returning R statistical results as a Database Table


    R provides a rich set of statistical functions that we may want to use directly from SQL. Many of these results can be readily expressed as structured table data for use with other SQL tables, or for use by SQL-enabled applications, e.g., dashboards or other statistical tools.

    In this blog post, we illustrate in a sequence of five simple steps how to go from an R function to a SQL-enabled result. Taken from a recent "proof of concept" customer engagement, our example involves using the function princomp, which performs a principal components analysis on a given numeric data matrix and returns the results as an object of class princomp. The customer actively uses this R function to produce loadings used in subsequent computations and analysis. The loadings element is a matrix whose columns contain the eigenvectors.

    The current process of pulling data from their Oracle Database, starting an R engine, invoking the R script, and placing the results back in the database was proving non-performant and unnecessarily complex. The goal was to leverage Oracle R Enterprise to streamline this process and allow the results to be immediately accessible through SQL.

    As a best practice, here is a process that can get you from start to finish:

    Step 1: Invoke from command line, understand results

    If you're using a particular R function, chances are you are familiar with its content. However, you may not be familiar with its structure. We'll use an example from the R princomp documentation that uses the USArrests data set. We see that the class of the result is of type princomp, and the model prints the call and standard deviations of the components. To understand the underlying structure, we invoke the function str and see there are seven elements in the list, one of which is the matrix loadings.

    mod <- princomp(USArrests, cor = TRUE)
    class(mod)
    mod
    str(mod)


    Results:

    R> mod <- princomp(USArrests, cor = TRUE)
    R> class(mod)
    [1] "princomp"
    R> mod
    Call:
    princomp(x = USArrests, cor = TRUE)

    Standard deviations:
       Comp.1    Comp.2    Comp.3    Comp.4
    1.5748783 0.9948694 0.5971291 0.4164494

    4 variables and 50 observations.

    R> str(mod)
    List of 7
    $ sdev : Named num [1:4] 1.575 0.995 0.597 0.416
    ..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    .. ..$ : chr [1:4] "Comp.1" "Comap.2" "Comp.3" "Comp.4"
    $ center : Named num [1:4] 7.79 170.76 65.54 21.23
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ scale : Named num [1:4] 4.31 82.5 14.33 9.27
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ n.obs : int 50
    $ scores : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:50] "1" "2" "3" "4" ...
    .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ call : language princomp(x = dat, cor = TRUE)
    - attr(*, "class")= chr "princomp"


    Step 2: Wrap script in a function, and invoke from ore.tableApply

    Since we want to invoke princomp on database data, we first push the demo data, USArrests, to the database to create an ore.frame. Other data we wish to use will also be in database tables.

    We'll use ore.tableApply (for the reasons cited in the previous blog post), providing the ore.frame as the first argument and simply returning, within our function, the model produced by princomp. We'll then look at its class, retrieve the result from the database, and check its class and structure once again.

    Notice that we are able to obtain the exact same result we received using our local R engine as with the database R engine through embedded R execution.

    dat <- ore.push(USArrests)
    computePrincomp <- function(dat) princomp(dat, cor=TRUE)
    res <- ore.tableApply(dat, computePrincomp)


    class(res)
    res.local <- ore.pull(res)
    class(res.local)
    str(res.local)
    res.local
    res


    Results:

    R> dat <- ore.push(USArrests)
    R> computePrincomp <- function(dat) princomp(dat, cor=TRUE)
R> res <- ore.tableApply(dat, computePrincomp)
    R> class(res)
    [1] "ore.object"
    attr(,"package")
    [1] "OREembed"
    R> res.local <- ore.pull(res)
    R> class(res.local)
    [1] "princomp"


    R> str(res.local)
    List of 7
    $ sdev : Named num [1:4] 1.575 0.995 0.597 0.416
    ..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    .. ..$ : chr [1:4] "Comp.1" "Comap.2" "Comp.3" "Comp.4"
    $ center : Named num [1:4] 7.79 170.76 65.54 21.23
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ scale : Named num [1:4] 4.31 82.5 14.33 9.27
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ n.obs : int 50
    $ scores : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:50] "1" "2" "3" "4" ...
    .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ call : language princomp(x = dat, cor = TRUE)
    - attr(*, "class")= chr "princomp"

    R> res.local
    Call:
    princomp(x = dat, cor = TRUE)

    Standard deviations:
       Comp.1    Comp.2    Comp.3    Comp.4
    1.5748783 0.9948694 0.5971291 0.4164494

    4 variables and 50 observations.
    R> res
    Call:
    princomp(x = dat, cor = TRUE)

    Standard deviations:
       Comp.1    Comp.2    Comp.3    Comp.4
    1.5748783 0.9948694 0.5971291 0.4164494


    4 variables and 50 observations.


    Step 3: Determine what results we really need

    Since we are only interested in the loadings and any result we return needs to be a data.frame to turn it into a database row set (table result), we build the model, transform the loadings object into a data.frame, and return the data.frame as the function result. We then view the class of the result and its values.

    Since we do this from the R API, we can simply print res to display the returned data.frame, as the print does an implicit ore.pull.

    returnLoadings <- function(dat) {
                        mod <- princomp(dat, cor=TRUE)
                        dd <- dim(mod$loadings)
                        ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
                        ldgs$variables <- row.names(ldgs)
                        ldgs
                      }
    res <- ore.tableApply(dat, returnLoadings)
    class(res)
    res

In preparation for Step 5, we also create the table USARRESTS in the database from the R data set:

ore.create(USArrests, table="USARRESTS")


    Results:

    R> res <- ore.tableApply(dat, returnLoadings)
    R> class(res)
    [1] "ore.object"
    attr(,"package")
    [1] "OREembed"
    R> res
                 Comp.1     Comp.2     Comp.3     Comp.4 variables
    Murder   -0.5358995  0.4181809 -0.3412327  0.64922780 Murder
    Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748 Assault
    UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773 UrbanPop
    Rape     -0.5434321 -0.1673186  0.8177779  0.08902432 Rape


    Step 4: Load script into the R Script Repository in the database

    We're at the point of being able to load the script into the R Script Repository before invoking it from SQL. We can create the function from R or from SQL. In R,

    ore.scriptCreate('princomp.loadings', returnLoadings)

    or from SQL,

    begin
    --sys.rqScriptDrop('princomp.loadings');
    sys.rqScriptCreate('princomp.loadings',
          'function(dat) {
            mod <- princomp(dat, cor=TRUE)
            dd <- dim(mod$loadings)
            ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
            ldgs$variables <- row.names(ldgs)
            ldgs
          }');
    end;
    /


Step 5: Invoke from SQL SELECT statement

    Finally, we're able to invoke the function from SQL using the rqTableEval table function. We pass in a cursor with the data from our USARRESTS table. We have no parameters, so the next argument is NULL. To get the results as a table, we specify a SELECT string that defines the structure of the result. Note that the column names must be identical to what is returned in the R data.frame. The last parameter is the name of the function we want to invoke from the R script repository.

    Invoking this, we see the result as a table from the SELECT statement.

    select *
    from table(rqTableEval( cursor(select * from USARRESTS),
                            NULL,
                           'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual',
                            'princomp.loadings'));


    Results:

    SQL> select *
    from table(rqTableEval( cursor(select * from USARRESTS),NULL,
              'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual','princomp.loadings'));
    Comp.1     Comp.2     Comp.3     Comp.4 variables
---------- ---------- ---------- ---------- ------------
-.53589947 .418180865 -.34123273 .649227804 Murder
-.58318363 .187985604 -.26814843 -.74340748 Assault
-.27819087 -.87280619 -.37801579 .133877731 UrbanPop
-.54343209 -.16731864 .817777908 .089024323 Rape

    As you see above, we have the loadings result returned as a SQL table.


    In this example, we walked through the steps of moving from invoking an R function to obtain a specific result to producing that same result from SQL by invoking an R script at the database server under the control of Oracle Database.

    Wednesday Apr 16, 2014

    Oracle's Strategy for Advanced Analytics

At Oracle our goal is to enable you to get timely insight from all of your data. We continuously enhance Oracle Database to allow workloads that have traditionally required extracting data from the database to run in-place. We do this to narrow the gap between the insights that can be obtained and the available data, because any data movement introduces latency, complexity due to more moving parts, the ensuing need for data reconciliation and governance, and increased cost. The Oracle tool set considers the needs of all types of enterprise users: those preferring GUI-based access to analytics with smart defaults and heuristics out of the box, those choosing to work interactively and quantitatively with data using R, and those preferring SQL and focusing on operationalization of models.

    Oracle recognized the need to support data analysts, statisticians, and data scientists with a widely used and rapidly growing statistical programming language. Oracle chose R - recognizing it as the new de facto standard for computational statistics and advanced analytics. Oracle supports R in at least 3 ways:


    • R as the language of interaction with the database

    • R as the language in which analytics can be written and executed in the database as a high performance computing platform

    • R as the language in which several native high performance analytics have been written that execute in database


Additionally, you may of course choose to use any of the CRAN algorithms in R scripts executed at the database server, leveraging several forms of data parallelism.
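As a minimal, hedged sketch of this (the package, data, and object names are illustrative, and the CRAN package must be installed in the database server's R installation), an arbitrary CRAN algorithm can be run through embedded R execution:

# Sketch: run a CRAN algorithm (here, svm from the e1071 package) at the
# database server via embedded R execution; e1071 is assumed to be installed
# server-side, and IRIS is simply pushed demo data.
IRIS <- ore.push(iris)
res <- ore.tableApply(IRIS,
         function(dat) {
           library(e1071)
           dat$Species <- as.factor(dat$Species)  # factors arrive as character
           svm(Species ~ ., data = dat)
         })
class(res)            # ore.object wrapping the svm model
mod <- ore.pull(res)  # pull the model to the client if needed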

With Oracle R Distribution, Oracle released the first and only supported commercial distribution of R from an established company. In 2012, Oracle embarked on the Hadoop journey, acknowledging the alternative data management options emerging in open source for unstructured or not-yet-structured data. In keeping with our strategy of delivering analytics close to where data is stored, Oracle extended Advanced Analytics capabilities to execute on HDFS-resident data in Hadoop environments. R has been integrated into Hadoop in exactly the same manner as it has been with the database.

Realizing that data is stored in both database and non-database environments, Oracle gives users options for where to store their data (Oracle Database, HDFS, or Spark RDD), where to perform computations (in-database or on the Hadoop cluster), and where to store results (Oracle Database or HDFS). Users can write R scripts that can be leveraged across database and Hadoop environments. Oracle Database, as a preferred location for storing R scripts, data, and result objects, provides a real-time scoring and deployment platform. It also makes it easy to create a model factory environment with authorization, roles, and privileges, combined with auditing, backup, recovery, and security.

    Oracle provides a common infrastructure that supports both in-database and custom R algorithms. Oracle also provides an integrated GUI for business users. Oracle provides both R-based access and GUI-based access to in-database analytics. A major part of Oracle's strategy is to maintain agility in our portfolio of supported techniques - being responsive to customer needs.

    Thursday Mar 20, 2014

    ROracle 1-1.11 released - binaries for Windows and other platforms available on OTN


    We are pleased to announce the latest update of the open source ROracle package, version 1-1.11, with enhancements and bug fixes. ROracle provides high performance and scalable interaction from R with Oracle Database. In addition to availability on CRAN, ROracle binaries specific to Windows and other platforms can be downloaded from the Oracle Technology Network. Users of ROracle, please take our brief survey. We want to hear from you!

    Latest enhancements in version 1-1.11 of ROracle:

    • Performance enhancements for RAW data types and large result sets
    • Ability to cache the result set in memory to reduce memory consumption on successive reads
    • Added session mode to connect as SYSDBA or using external authentication
• Bug 17383542: enhanced dbWriteTable() and dbRemoveTable() to work on a global schema
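For readers new to ROracle, here is a minimal, hedged sketch of basic usage; the connection details and table names are placeholders, not taken from any environment described above:

# Connect, query, write a data.frame back as a table, and disconnect.
library(ROracle)
drv <- dbDriver("Oracle")
con <- dbConnect(drv, username = "scott", password = "tiger", dbname = "orcl")
df  <- dbGetQuery(con, "select * from MY_SOURCE_TABLE where ROWNUM <= 100")
dbWriteTable(con, "MY_COPY", df)   # create table MY_COPY from the data.frame
dbDisconnect(con)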

    Users of ROracle are quite pleased with the performance and functionality:


    "In my position as a quantitative researcher, I regularly analyze database data up to a gigabyte in size on client-side R engines. I switched to ROracle from RJDBC because the performance of ROracle is vastly superior, especially when writing large tables. I've also come to depend on ROracle for transactional support, pulling data to my R client, and general scalability. I have been very satisfied with the support from Oracle -- their response has been prompt, friendly and knowledgeable."

               -- Antonio Daggett, Quantitative Researcher in Finance Industry


    "Having used ROracle for over a year now with our Oracle Database data, I've come to rely on ROracle for high performance read/write of large data sets (greater than 100 GB), and SQL execution with transactional support for building predictive models in R. We tried RODBC but found ROracle to be faster, much more stable, and scalable."

               -- Dr. Robert Musk, Senior Forest Biometrician, Forestry Tasmania


    See the ROracle NEWS for the complete list of updates.

    We encourage ROracle users to post questions and provide feedback on the Oracle R Technology Forum.

    In addition to being a high performance database interface to Oracle Database from R for general use, ROracle supports database access for Oracle R Enterprise.

    Thursday Feb 13, 2014

    Monitoring progress of embedded R functions

    When you run R functions in the database, especially functions involving multiple R engines in parallel, you can monitor their progress using the Oracle R Enterprise datastore as a central location for progress notifications, or any intermediate status or results. In the following example, based on ore.groupApply, we illustrate instrumenting a simple function that builds a linear model to predict flight arrival delay based on a few other variables.

In the function modelBuildWithStatus, the function verifies that rows remain for building the model after eliminating incomplete cases from the data supplied in argument dat. If the data is not empty, the function builds a model and reports “success”; otherwise, it reports “no_data.” It’s likely that the user would like to use this model in some way or save it in a datastore for future use, but for this example, we just build the model and discard it, validating that a model can be built on the data.


    modelBuildWithStatus <-
      function(dat) {
        dat <- dat[complete.cases(dat),]
        if (nrow(dat)>0L) {
          mod <- lm(ARRDELAY ~ DISTANCE + AIRTIME + DEPDELAY, dat);
          "success"
        } else
          "no_data"
        }

    When we invoke this using ore.groupApply, the goal is to build one model per “unique carrier” or airline. Using an ORE 1.4 feature, we specify the degree of parallelism using the parallel argument, setting it to 2.


    res <- ore.groupApply(ONTIME_S[, c("UNIQUECARRIER","DISTANCE", "ARRDELAY", "DEPDELAY", "AIRTIME")],
            ONTIME_S$UNIQUECARRIER,
            modelBuildWithStatus,
            parallel=2L)

    res.local<-ore.pull(res)
    res.local[unlist(res.local)=="no_data"]

The result tells us the status of each execution. Below, we print the unique carriers that had no data.


    R> res.local<-ore.pull(res)
    R> res.local[unlist(res.local)=="no_data"]
    $EA
    [1] "no_data"

    $`ML(1)`
    [1] "no_data"

    $`PA(1)`
    [1] "no_data"

    $PI
    [1] "no_data"

    $PS
    [1] "no_data"

To monitor the progress of each execution, we can identify the group of data being processed in each function invocation using the value from the UNIQUECARRIER column. For this particular data set, we use the first two characters of the carrier’s symbol appended to “group.” to form a unique object name for storing in the datastore identified by job.name. (If we don’t do this, some carrier values would form invalid object names.) Note that since the UNIQUECARRIER column contains the same value for every row in a group, we need only the first value.

    The general idea for monitoring progress is to save an object in the datastore named for each execution of the function on a group. We can then list the contents of the named datastore and compute a percentage complete, which is discussed later in this post. For the “success” case, we assign the value “SUCCESS” to the variable named by the string in nm that we created earlier. Using ore.save, this uniquely named object is stored in the datastore with the name in job.name. We use the append=TRUE flag to indicate that the various function executions will be sharing the same named datastore.
    If there is no data left in dat, we assign “NO DATA” to the variable named in nm and save that. Notice in both cases, we’re still returning “success” or “no data” so these come back in the list returned by ore.groupApply. However, we can return other values instead, e.g., the model produced.


    modelBuildWithMonitoring <-
      function(dat, job.name) {
      nm <- paste("group.", substr(as.character(dat$UNIQUECARRIER[1L]),1,2), sep="")
      dat <- dat[complete.cases(dat),]
      if (nrow(dat)>0L) {
        mod <- lm(ARRDELAY ~ DISTANCE + AIRTIME + DEPDELAY, dat);
        assign(nm, "SUCCESS")
        ore.save(list=nm, name=job.name, append=TRUE)
        "success"
      } else {
        assign(nm, "NO DATA")
        ore.save(list=nm, name=job.name, append=TRUE)
        "no data"
      }
    }

When we use this function in ore.groupApply, we provide the job.name and ore.connect arguments as well. The variable ore.connect must be set to TRUE in order to use the datastore. As ore.groupApply executes, objects named for each carrier are incrementally added to the datastore named by job.name. First, delete the datastore named “job1”, if it exists.


    ore.delete(name="job1")

    res <- ore.groupApply(ONTIME_S[, c("UNIQUECARRIER","DISTANCE", "ARRDELAY", "DEPDELAY", "AIRTIME")],
            ONTIME_S$UNIQUECARRIER,
            modelBuildWithMonitoring,
            job.name="job1", parallel=2L, ore.connect=TRUE)

    To see the progress during execution, we can use the following function, which takes a job name and the cardinality of the INDEX column to determine the percent complete. This function is invoked in a separate R engine connected to the same schema. If the job name is found, we print the percent complete, otherwise stop with an error message.


    check.progress <- function(job.name, total.groups) {
      if ( job.name %in% ore.datastore()$datastore.name )
        print(sprintf("%.1f%%", nrow(ore.datastoreSummary(name=job.name))/total.groups*100L))
      else
        stop(paste("Job", job.name, " does not exist"))
    }

    To invoke this, compute the total number of groups and provide this and the job name to the function check.progress.
    total.groups <- length(unique(ONTIME_S$UNIQUECARRIER))
    check.progress("job1",total.groups)

    However, we really want a loop to report on the progress automatically. One simple approach is to set up a while loop with a sleep delay. When we reach 100%, stop. To be self-contained, we include a simplification of the function above as a local function.


    check.progress.loop <- function(job.name, total.groups, sleep.time=2) {
      check.progress <- function(job.name, total.groups) {
        if ( job.name %in% ore.datastore()$datastore.name )
          print(sprintf("%.1f%%", nrow(ore.datastoreSummary(name=job.name))/total.groups*100L))
        else
          paste("Job", job.name, " does not exist")
      }
      while(1) {
        try(x <- check.progress(job.name,total.groups))
        Sys.sleep(sleep.time)
        if(x=="100.0%") break
      }
    }

    As before, this function is invoked in a separate R engine connected to the same schema.


    check.progress.loop("job1",total.groups)

    Looking at the results, we can see the progress reported at one second intervals. Since the models build quickly, it doesn’t take long to reach 100%. For functions that take longer to execute or where there are more groups to process, you may choose a longer sleep time. Following this, we look at the datastore “job1” using ore.datastore and its contents using ore.datastoreSummary.


    R> check.progress.loop("job1",total.groups,sleep.time=1)
    [1] "6.9%"
    [1] "96.6%"
    [1] "100.0%"

    R> ore.datastore(name="job1")
  datastore.name object.count size       creation.date description
1           job1           29 1073 2014-02-13 22:03:20
    R> ore.datastoreSummary(name="job1")
   object.name     class size length row.count col.count
1     group.9E character   37      1        NA        NA
2     group.AA character   37      1        NA        NA
3     group.AQ character   37      1        NA        NA
4     group.AS character   37      1        NA        NA
5     group.B6 character   37      1        NA        NA
6     group.CO character   37      1        NA        NA
7     group.DH character   37      1        NA        NA
8     group.DL character   37      1        NA        NA
9     group.EA character   37      1        NA        NA
10    group.EV character   37      1        NA        NA
11    group.F9 character   37      1        NA        NA
12    group.FL character   37      1        NA        NA
13    group.HA character   37      1        NA        NA
14    group.HP character   37      1        NA        NA
15    group.ML character   37      1        NA        NA
16    group.MQ character   37      1        NA        NA
17    group.NW character   37      1        NA        NA
18    group.OH character   37      1        NA        NA
19    group.OO character   37      1        NA        NA
20    group.PA character   37      1        NA        NA
21    group.PI character   37      1        NA        NA
22    group.PS character   37      1        NA        NA
23    group.TW character   37      1        NA        NA
24    group.TZ character   37      1        NA        NA
25    group.UA character   37      1        NA        NA
26    group.US character   37      1        NA        NA
27    group.WN character   37      1        NA        NA
28    group.XE character   37      1        NA        NA
29    group.YV character   37      1        NA        NA

    The same basic technique can be used to note progress in any long running or complex embedded R function, e.g., in ore.tableApply or ore.doEval. At various points in the function, sequence-named objects can be added to a datastore. Moreover, the contents of those objects can contain incremental or partial results, or even debug output.
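For instance, here is a minimal sketch, with hypothetical checkpoint names and a datastore called “job2”, of instrumenting ore.doEval so that a second R session can follow progress through the datastore:

# Save a sequence-named status object after each long-running step so that
# ore.datastoreSummary(name="job2") reveals how far the function has gotten.
res <- ore.doEval(function(job.name) {
         for (i in 1:5) {
           Sys.sleep(1)                      # stand-in for a long-running step
           status <- sprintf("step %d of 5 done at %s", i, Sys.time())
           nm <- sprintf("status.%02d", i)   # sequence-named object
           assign(nm, status)
           ore.save(list = nm, name = job.name, append = TRUE)
         }
         "done"
       },
       job.name = "job2", ore.connect = TRUE)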

    While we’ve focused on the R API for embedded R execution, the same functions could be invoked using the SQL API. However, monitoring would still be done from an interactive R engine.

    Tuesday Feb 04, 2014

    Invoking R scripts via Oracle Database: Theme and Variation, Part 6

    How can I use "group apply" to partition data over multiple columns for parallel execution?
    How can I use R for statistical computations and return results as a database table?

    In this blog post of our theme and variation series, we answer these two questions through several examples, highlighting both R and SQL interfaces.

    So far in this blog series on Oracle R Enterprise embedded R execution we've covered:

    Part 1: ore.doEval / rqEval
    Part 2: ore.tableApply / rqTableEval
    Part 3: ore.groupApply / “rqGroupApply”
    Part 4: ore.rowApply / rqRowEval
    Part 5: ore.indexApply

    Using ore.groupApply for partitioning data on multiple columns

    While the “group apply” functionality is quite powerful as it is, users sometimes want to partition data on multiple columns. Since ore.groupApply currently takes only a single column for the INDEX argument, users can create a new column that is the concatenation of the columns of interest, and provide this column to the INDEX argument. We’ll illustrate this first using the R API, and then the SQL API.

    R API

    We adapt an example from Part 3 to illustrate partitioning data on multiple columns. Instead of building a C5.0 model, we’ll use the same CHURN_TRAIN data set, but build an rpart model since it will produce rules on the partitions of data we’ve chosen for the example, namely, voice_mail_plan and international_plan. To understand the number of rows we can expect in each partition, we’ll use the R table function. We then add a new column that pastes together the two columns of interest to create a new column called “vmp_ip”.


    library(C50)
    data(churn)

    ore.create(churnTrain, "CHURN_TRAIN")

    table(CHURN_TRAIN$international_plan, CHURN_TRAIN$voice_mail_plan)
    CT <- CHURN_TRAIN
    CT$vmp_ip <- paste(CT$voice_mail_plan,CT$international_plan,sep="-")
    head(CT)

Each invocation of the function “my.rpartFunction” receives data from one of the partitions identified by vmp_ip. Since the source partition columns are constant within each partition, we remove them by setting them to NULL. The character vectors are converted to factors, and the model is built to predict churn and saved in an appropriately named datastore. Instead of returning TRUE as in the previous example, we return a list containing the specific partition column values, the distribution of churn values, and the model itself.


    ore.scriptDrop("my.rpartFunction")
    ore.scriptCreate("my.rpartFunction",
      function(dat,datastorePrefix) {
        library(rpart)
        vmp <- dat[1,"voice_mail_plan"]
        ip <- dat[1,"international_plan"]
        datastoreName <- paste(datastorePrefix,vmp,ip,sep="_")
        dat$voice_mail_plan <- NULL
        dat$international_plan <- NULL
        dat$state <- as.factor(dat$state)
        dat$churn <- as.factor(dat$churn)
        dat$area_code <- as.factor(dat$area_code)
        mod <- rpart(churn ~ ., data = dat)
        ore.save(mod, name=datastoreName, overwrite=TRUE)
        list(voice_mail_plan=vmp,
            international_plan=ip,
            churn.table=table(dat$churn),
            rpart.model = mod)
      })

    After loading the rpart library and setting the datastore prefix, we invoke ore.groupApply using the derived column vmp_ip as the input to argument INDEX. After building the models, we’ll look at the first entry in the list returned. Using ore.load, we can load the model for the case where the customer neither has the voice mail plan, nor the international plan.


    library(rpart)

    datastorePrefix="my.rpartModel"

    res <- ore.groupApply( CT, INDEX=CT$vmp_ip,
          FUN.NAME="my.rpartFunction",
          datastorePrefix=datastorePrefix,
          ore.connect=TRUE)
    res[[1]]
    ore.load(name=paste(datastorePrefix,"no","no",sep="_"))
mod

SQL API

    To invoke this from the SQL API, we use the same approach as covered in Part 3. While we could create the table CT from the ore.frame used above, instead the following illustrates creating the derived column in SQL and explicitly defining a VIEW.


    CREATE OR REPLACE VIEW CT AS
      SELECT t.*, "voice_mail_plan" || '-' || "international_plan" as "vmp_ip"
      FROM CHURN_TRAIN t;

    Next, we create a PL/SQL PACKAGE and FUNCTION for the invocation.


    CREATE OR REPLACE PACKAGE churnPkg AS
      TYPE cur IS REF CURSOR RETURN CT%ROWTYPE;
    END churnPkg;
    /
    CREATE OR REPLACE FUNCTION churnGroupEval(
      inp_cur churnPkg.cur,
      par_cur SYS_REFCURSOR,
      out_qry VARCHAR2,
      grp_col VARCHAR2,
      exp_txt CLOB)
    RETURN SYS.AnyDataSet
    PIPELINED PARALLEL_ENABLE (PARTITION inp_cur BY HASH ("vmp_ip"))
    CLUSTER inp_cur BY ("vmp_ip")
    USING rqGroupEvalImpl;
    /

    Then, we can invoke the R function by name in the SELECT statement as follows:


    select *
    from table(churnGroupEval(
      cursor(select * from CT),
      cursor(select 1 as "ore.connect",' my.rpartModel2' as "datastorePrefix" from dual),
      'XML', 'state', 'my.rpartFunction'));

    As another variation on this theme, suppose that you didn’t want to include all the columns from the source data set. To achieve this, you could create a view and define the PACKAGE from the view. However, you could also define a record that contains the specific columns of interest. This is a standard PL/SQL specification that can be used in combination with “group apply”.


    CREATE OR REPLACE PACKAGE churnPkg2 AS
      TYPE rec IS RECORD ("vmp_ip" varchar2(8),
        "churn" varchar2(4),
        "state" varchar2(4),
        "account_length" NUMBER(38));
      TYPE cur IS REF CURSOR RETURN rec;
    END churnPkg2;
    /

If you don’t want to or cannot create a view, this allows you to specify the exact columns required for model building. Reducing the number of input columns can improve performance, since only required data will be passed to the server-side R engine. Notice that we could have used this approach in the example above, since the function removes the source partition columns anyway.

    How to return results from R statistical functions as database table data

R provides a wide range of statistical and advanced analytics functions. While Oracle Database contains a wide range of statistical functions in SQL, R further extends this set. In this next topic, we illustrate how to return statistical results as a SQL table for use with other SQL queries or to feed SQL-based applications.

    As our example, we’ll use the R principal components function princomp. Our goal is to return the loadings of the PCA model as a database table. For our data set, we’ll use the USArrests data set provided with R. We can view the results of princomp in the mod variable, which has class “princomp”. We then push this data to Oracle Database, getting an ore.frame object.


    mod <- princomp(USArrests, cor = TRUE)
    class(mod)
    mod
    dat <- ore.push(USArrests)

    R> mod <- princomp(USArrests, cor = TRUE)
    R> class(mod)
    [1] "princomp"
    R> mod
    Call:
    princomp(x = USArrests, cor = TRUE)

    Standard deviations:
       Comp.1    Comp.2    Comp.3    Comp.4
    1.5748783 0.9948694 0.5971291 0.4164494

    4 variables and 50 observations.
    R> dat <- ore.push(USArrests)

    In the first case considered, we use ore.tableApply to return simply the princomp object. When we do this we’re getting back a serialized object of type ore.object, but the actual princomp object still resides in the database. We can pull this object from the database to get a local princomp object, but this type of result cannot be directly returned as a SQL table because we need an object of class data.frame (which we’ll address later).


    res <- ore.tableApply(dat,
          function(dat) {
            princomp(dat, cor=TRUE)
          })
    class(res)
    res.local <- ore.pull(res)
    class(res.local)
    str(res.local)
    res.local
    res

    In the following output, we see the result is an ore.object that we pull from the database to get a princomp object. We examine the structure of the object and focus on the loadings element. In the example, we print res.local and res. Since res is an ore.object, it automatically gets pulled to the client before printing it.


    R> res <- ore.tableApply(dat,
    + function(dat) {
    + princomp(dat, cor=TRUE)
    + })
    R> class(res)
    [1] "ore.object"
    attr(,"package")
    [1] "OREembed"
    R> res.local <- ore.pull(res)
    R> class(res.local)
    [1] "princomp"
    R> str(res.local)
    List of 7
    $ sdev : Named num [1:4] 1.575 0.995 0.597 0.416
    ..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ center : Named num [1:4] 7.79 170.76 65.54 21.23
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ scale : Named num [1:4] 4.31 82.5 14.33 9.27
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ n.obs : int 50
    $ scores : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:50] "1" "2" "3" "4" ...
    .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ call : language princomp(x = dat, cor = TRUE)
    - attr(*, "class")= chr "princomp"
    R> res.local
    Call:
    princomp(x = dat, cor = TRUE)

    Standard deviations:
       Comp.1    Comp.2    Comp.3    Comp.4
    1.5748783 0.9948694 0.5971291 0.4164494

    4 variables and 50 observations.
    R> res
    Call:
    princomp(x = dat, cor = TRUE)

    Standard deviations:
       Comp.1    Comp.2    Comp.3    Comp.4
    1.5748783 0.9948694 0.5971291 0.4164494

    4 variables and 50 observations.

In this next case, we focus on the loadings component of the princomp object, which contains the matrix of variable loadings, that is, a matrix whose columns contain the eigenvectors. This is of class "loadings", still not a data.frame. To convert the loadings component to a data.frame, we determine the dimensions of the matrix and then construct a data.frame by accessing the cells of the loadings object. To get the variable associated with each row, we assign the row names of the loadings to the column named variables. Finally, we return the loadings data.frame.


    res <- ore.tableApply(dat,
          function(dat) {
            mod <- princomp(dat, cor=TRUE)
            dd <- dim(mod$loadings)
            ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
            ldgs$variables <- row.names(ldgs)
            ldgs
          })
    class(res)
    res

    In the output below, notice that we still have an ore.object being returned, but it’s in the form of a data.frame.


    R> res <- ore.tableApply(dat,
    + function(dat) {
    + mod <- princomp(dat, cor=TRUE)
    + dd <- dim(mod$loadings)
    + ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
    + ldgs$variables <- row.names(ldgs)
    + ldgs
    + })
    R> class(res)
    [1] "ore.object"
    attr(,"package")
    [1] "OREembed"
    R> res
             Comp.1     Comp.2     Comp.3      Comp.4 variables
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780 Murder
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748 Assault
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773 UrbanPop
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432 Rape

    We can address this last issue by specifying the FUN.VALUE argument to get an ore.frame result (left as an exercise to the reader). But our main goal is to enable returning the loadings from SQL as a database table. For that, we create the function in the R script repository and construct the appropriate SQL query. In preparation for the next example, we’ll create the table USARRESTS using the R data set.


    ore.create(USArrests,table="USARRESTS")
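Before switching to SQL, here is one hedged take on the FUN.VALUE exercise mentioned above, describing the result structure so that ore.tableApply returns an ore.frame rather than a serialized ore.object:

# By supplying FUN.VALUE, the loadings come back as an ore.frame.
res <- ore.tableApply(dat,
         function(dat) {
           mod <- princomp(dat, cor=TRUE)
           dd <- dim(mod$loadings)
           ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
           ldgs$variables <- row.names(ldgs)
           ldgs
         },
         FUN.VALUE=data.frame(Comp.1=numeric(0),
                              Comp.2=numeric(0),
                              Comp.3=numeric(0),
                              Comp.4=numeric(0),
                              variables=character(0)))
class(res)   # now "ore.frame" rather than "ore.object"
head(res)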

    Now, we’ll switch to SQL. We’re introducing the functions sys.rqScriptDrop and sys.rqScriptCreate, which are used within a BEGIN END PL/SQL block, to store the R function ‘princomp.loadings’.


    begin
    --sys.rqScriptDrop('princomp.loadings');
    sys.rqScriptCreate('princomp.loadings',
          'function(dat) {
            mod <- princomp(dat, cor=TRUE)
            dd <- dim(mod$loadings)
            ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
            ldgs$variables <- row.names(ldgs)
            ldgs
          }');
    end;
    /

    The SELECT statement provides input data by selecting all data from USARRESTS. There are no arguments to pass, so the next parameter is NULL. The SELECT string describes the format of the result. Notice that the column names must match in name (including case) and type. The last parameter is the name of the function stored in the R script repository.


    select *
    from table(rqTableEval( cursor(select * from USARRESTS),NULL,
              'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual','princomp.loadings'));

    SQL> select *
    from table(rqTableEval( cursor(select * from USARRESTS),NULL,
              'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual','princomp.loadings'));
        Comp.1     Comp.2     Comp.3     Comp.4 variables
    ---------- ---------- ---------- ---------- ------------
    -.53589947 .418180865 -.34123273 .649227804 Murder
    -.58318363 .187985604 -.26814843 -.74340748 Assault
    -.27819087 -.87280619 -.37801579 .133877731 UrbanPop
    -.54343209 -.16731864 .817777908 .089024323 Rape

    If you have interesting embedded R scenarios to share with the ORE community, please consider posting a comment.

    Monday Jan 20, 2014

    Invoking R scripts via Oracle Database: Theme and Variation, Part 5


    In the first four parts of Invoking R scripts via Oracle Database: Theme and Variation, we introduced features of Oracle R Enterprise embedded R execution involving the functions ore.doEval / rqEval, ore.tableApply / rqTableEval, ore.groupApply / “rqGroupApply”, and ore.rowApply / rqRowEval. In this blog post, we cover ore.indexApply. Note that there is no corresponding rqIndexEval – more on that later. The “index apply” function is also one of the parallel-enabled embedded R execution functions. It supports task-parallel execution, where one or more R engines perform the same or different calculations, or task. A number, associated with the index of the execution, is provided to the function. Any required data is expected to be explicitly generated or loaded within the function.

    This functionality is valuable in a variety of settings, e.g., simulations, for taking advantage of high-performance computing hardware like Exadata.

    As for “group apply” and “row apply”, Oracle Database handles the management and control of potentially multiple R engines at the database server machine, with only the index passed to the function as the first argument. Oracle Database ensures that each R function execution completes, otherwise the ORE function returns an error. Output formats as supported by the other embedded R functions are possible for ore.indexApply, for example, returning an ore.list or combining data.frame data into an ore.frame.

    The variation on embedded R execution for ore.indexApply involves passing as an argument the number of times the user-defined R function should be executed.

    Let’s look at a simple example.

    The following code specifies to execute the function five times in parallel.


    res <- ore.indexApply(5,
          function(index) {
            paste("IndexApply:",index)
          },
        parallel=TRUE)
    class(res)
    res

    Notice that the class of the result is an ore.list, and when we print res, we have 5 character vectors, each with the index that was passed to the user-defined function. As with other parallel embedded R functions, the number of concurrently executing R engines can be limited by specifying the degree of parallelism of the database. As we’ll see in ORE 1.4, the parallel argument can specify a preferred number of parallel R engines, as an upper bound.


    > class(res)
    [1] "ore.list"
    attr(,"package")
    [1] "OREbase"
    > res
    $`1`
    [1] "IndexApply: 1"

    $`2`
    [1] "IndexApply: 2"

    $`3`
    [1] "IndexApply: 3"

    $`4`
    [1] "IndexApply: 4"

    $`5`
    [1] "IndexApply: 5"

    Column-parallel use case

    If we wanted to parallelize R’s summary function, we could compute the summary statistics on each column in parallel and combine them into a final result. The following example does exactly that. While we could generalize this example, we focus on the iris data set and computing summary statistics on the first four numeric columns. Since iris comes standard with R, there’s no need to load data from any other source, we simply access it. The first argument to ore.indexApply is 4, the number of columns we wish to summarize in parallel. The function takes one argument, index, which will be a value between 1 and 4, and will be used to select the column to summarize. We massage the result of summary into a data.frame and add the column name to the result. Note that the function returns a single row: the summary statistics for the column.


    res <- NULL
    res <- ore.indexApply(4,
          function(index) {
            ss <- summary(iris[,index])
            attr.names <- attr(ss,"names")
            stats <- data.frame(matrix(ss,1,length(ss)))
            names(stats) <- attr.names
            stats$col <- names(iris)[index]
            stats
          },
          parallel=TRUE)
    res

    The result comes back as an ore.list object:


    > res
    $`1`
  Min. 1st Qu. Median  Mean 3rd Qu. Max.          col
1  4.3     5.1    5.8 5.843     6.4  7.9 Sepal.Length

$`2`
  Min. 1st Qu. Median  Mean 3rd Qu. Max.         col
1    2     2.8      3 3.057     3.3  4.4 Sepal.Width

$`3`
  Min. 1st Qu. Median  Mean 3rd Qu. Max.          col
1    1     1.6   4.35 3.758     5.1  6.9 Petal.Length

$`4`
  Min. 1st Qu. Median  Mean 3rd Qu. Max.         col
1  0.1     0.3    1.3 1.199     1.8  2.5 Petal.Width

    This is good, but it would be better if the result was returned as an ore.frame, especially since all the columns are the same. To enable this, we’ll do a slight variation on the result by specifying FUN.VALUE with the structure of the result defined.


    res <- ore.indexApply(4,
          function(index) {
            ss <- summary(iris[,index])
            attr.names <- attr(ss,"names")
            stats <- data.frame(matrix(ss,1,length(ss)))
            names(stats) <- attr.names
            stats$col <- names(iris)[index]
            stats
          },
          FUN.VALUE=data.frame(Min.=numeric(0),
            "1st Qu."=numeric(0),
            Median=numeric(0),
            Mean=numeric(0),
            "3rd Qu."=numeric(0),
            Max.=numeric(0),
            col=character(0)),
          parallel=TRUE)
    res

    Now, the result comes back as an ore.frame.


    > res
  Min. X1st.Qu. Median  Mean X3rd.Qu. Max.          col
1  0.1      0.3   1.30 1.199      1.8  2.5  Petal.Width
2  1.0      1.6   4.35 3.758      5.1  6.9 Petal.Length
3  4.3      5.1   5.80 5.843      6.4  7.9 Sepal.Length
4  2.0      2.8   3.00 3.057      3.3  4.4  Sepal.Width

Simulation use case

The ore.indexApply function can be used in simulations as well. In this next example, we take multiple samples from a random normal distribution with the goal of comparing the distributions of the resulting summary statistics. For this, we build upon the example above. We provide parameters such as the sample size, the mean and standard deviation of the random numbers, and the number of simulations we want to perform. Each simulation occurs in a separate R engine, in parallel, up to the degree of parallelism allowed by the database.

We specify num.simulations as the first parameter to ore.indexApply. The user-defined function receives the index as its first argument, along with three additional arguments: the sample size, mean, and standard deviation. The function sets the random seed based on the index so that each invocation generates a different set of random numbers, then uses rnorm to produce sample.size random normal values. We invoke summary on the vector of random numbers and prepare a data.frame result to be returned. We use FUN.VALUE to get an ore.frame as the final result.


    res <- NULL
    sample.size = 1000
    mean.val = 100
    std.dev.val = 10
    num.simulations = 1000

    res <- ore.indexApply(num.simulations,
          function(index, sample.size=1000, mean=0, std.dev=1) {
            set.seed(index)
            x <- rnorm(sample.size, mean, std.dev)
            ss <- summary(x)
            attr.names <- attr(ss,"names")
            stats <- data.frame(matrix(ss,1,length(ss)))
            names(stats) <- attr.names
            stats$index <- index
            stats
          },
          FUN.VALUE=data.frame(Min.=numeric(0),
            "1st Qu."=numeric(0),
            Median=numeric(0),
            Mean=numeric(0),
            "3rd Qu."=numeric(0),
            Max.=numeric(0),
            index=numeric(0)),
          parallel=TRUE,
          sample.size=sample.size,
          mean=mean.val, std.dev=std.dev.val)
    res
    boxplot(ore.pull(res[,1:6]),
      main=sprintf("Boxplot of %d rnorm samples size %d, mean=%d, sd=%d",
            num.simulations, sample.size, mean.val, std.dev.val))

    To get the distribution of samples, we invoke boxplot on the data.frame after pulling the result to the client.

    Here are a couple of plots showing results for different parameters:


    In both cases, we run 10,000 samples. The first graph uses a sample size of 10 and the second uses a sample size of 1000. From these results, it is clear that a larger sample size significantly reduces the variance in each of the summary statistics - confirming our Statistics 101 understanding.
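As a rough, hedged sketch (not part of the original example), the same effect can be checked numerically by comparing the spread of the simulated sample means for two sample sizes:

# Helper returning the simulated sample means for a given sample size.
sim.means <- function(sample.size, num.simulations=1000) {
  res <- ore.indexApply(num.simulations,
           function(index, sample.size, mu, sdev) {
             set.seed(index)
             mean(rnorm(sample.size, mu, sdev))
           },
           parallel=TRUE,
           sample.size=sample.size, mu=100, sdev=10)
  unlist(ore.pull(res))
}
sd(sim.means(10))     # larger spread of the sample mean
sd(sim.means(1000))   # much smaller spread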

    Error reporting

    As introduced above, Oracle Database ensures that each embedded R user-defined function execution completes, otherwise the ORE function returns an error. Of course, any side-effects of the user-defined function need to be manually cleaned up. Operations that produce files, create tables in the database, or result in completed database transactions through ROracle will remain intact. The ORE embedded R infrastructure will report errors as produced by the function as illustrated in the following example.

The code specifies invoking 4 parallel R engines. If the index has value 3, the function attempts to load the non-existent package "abc123" (which produces an error); otherwise it returns the index value.


    R> ore.indexApply(4,
    + function(index) {
    + if (index==3) {library(abc123)}
    + else {return(index)}
    + }
    + )
    Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch, :
    ORA-12801: error signaled in parallel query server P000
    ORA-20000: RQuery error
    Error in library(abc123) : there is no package called 'abc123'
    ORA-06512: at "RQSYS.RQGROUPEVALIMPL", line 121
    ORA-06512: at "RQSYS.RQGROUPEVALIMPL", line 118

Notice that the first reported error is ORA-12801: error signaled in parallel query server. Then ORA-20000: RQuery error indicates the error as returned by the R engine. Also interesting to note is that the ORA-06512 errors reveal the underlying implementation of ore.indexApply, "RQSYS.RQGROUPEVALIMPL", which leads us to the next topic.

    No rqIndexEval?

    “Index apply” is really a variation of “group apply” where the INDEX column is a numeric vector that is pushed to the database. With n distinct numbers, one number is provided to each function as its index. As a result, there is no corresponding rqIndexEval in the SQL API. The user would have to create a similar package and function as was illustrated in the blog post on “group apply.”
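A hedged R-side sketch of this equivalence: push a small table containing an index column and group on it, so that each function invocation receives exactly one index value, just as ore.indexApply would provide:

# Emulate "index apply" with "group apply" over a pushed index column.
idx <- ore.push(data.frame(index = 1:5))
res <- ore.groupApply(idx, INDEX = idx$index,
         function(dat) paste("IndexApply:", dat$index[1]),
         parallel = TRUE)
ore.pull(res)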

    Thursday Jan 09, 2014

    Invoking R scripts via Oracle Database: Theme and Variation, Part 4

    In the first three parts of Invoking R scripts via Oracle Database: Theme and Variation, we introduced features of Oracle R Enterprise embedded R execution involving the functions ore.doEval / rqEval, ore.tableApply / rqTableEval, and ore.groupApply / “rqGroupApply”. In this blog post, we’ll cover the next in our theme and variation series involving ore.rowApply and rqRowEval. The “row apply” function is also one of the parallel-enabled embedded R execution functions. It supports data-parallel execution, where one or more R engines perform the same R function, or task, on disjoint chunks of data. This functionality is essential to enable scalable model scoring/predictions on large data sets and for taking advantage of high-performance computing hardware like Exadata.

As for ore.groupApply, Oracle Database handles the management and control of potentially multiple R engines at the database server machine, automatically chunking and passing data to parallel executing R engines. Oracle Database ensures that R function executions for all chunks of rows complete, or the ORE function returns an error. The result from the execution of each user-defined embedded R function is gathered in an ore.list. This list remains in the database until the user requires the result. However, we’ll also show how data.frame results from each execution can be combined into a single ore.frame. This feature works for return values of other embedded R functions as well.

    The variation on embedded R execution for ore.rowApply involves passing not only an ore.frame to the function such that the first parameter of your embedded R function receives a data.frame, but also the number of rows that should be passed to each invocation of the user-defined R function. The last chunk, of course, may have fewer rows than specified.

    Let’s look at an example. We’re going to use the C50 package to score churn data (i.e., predict which customers are likely to churn) using the C5.0 decision tree models we built in the previous blog post with ore.groupApply. (Well, almost. We need to rebuild the models to take into account the full data set levels.) The goal is to score the customers in parallel leveraging the power of a high performance computing platform, such as Exadata.


    library(C50)
    data(churn)

    ore.create(churnTest, "CHURN_TEST")

    myFunction <- function(dat, xlevels, datastorePrefix) {
      library(C50)
      state <- dat[1,"state"]
      datastoreName <- paste(datastorePrefix,state,sep="_")
      dat$state <- NULL
      for (j in names(xlevels))
        dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
      ore.load(name=datastoreName)
      res <- data.frame(pred=predict(mod,dat, type="class"),
            actual=dat$churn,
            state=state)
      res
    }

    xlevels <- ore.getXlevels(~ ., CHURN_TEST[,-1])
    scoreList <- ore.groupApply(
      CHURN_TEST,
      INDEX=CHURN_TEST$state,
      myFunction,
      datastorePrefix="myC5.0model3",xlevels=xlevels, ore.connect=TRUE)
    score.MA <- ore.pull(scoreList$MA)
    table(score.MA$actual, score.MA$pred)

    A few points to highlight:

    • Instead of computing the levels using the as.factor function inside the user-defined function, we’ll use ore.getXlevels, which returns the levels for each factor column. We don’t need this for the state column, so we exclude it (“-1”). In the previous post we noted that factor data is passed as character columns in the data.frame. Computing the levels first can ensure that all possible levels are provided during model building, even if there are no rows with some of the level values.
    • When building models where some levels were missing (due to using as.factor on each partition of data), scoring can fail if the test data has unknown level values. For this reason, the models built in Part 3 need to be rebuilt using the approach above with ore.getXlevels. This is left as an exercise for the reader.
    • Assign the function to the variable “myFunction” to facilitate reuse (see below).
    • We construct the datastore name to be the same as when we were building the models, i.e., appending the state value to the datastore prefix separated by an ‘_’.
    • The for loop iterates over the levels passed in as xlevels, creating a factor using the provided levels and assigning it back to the data.frame.
    • Loading the datastore by name, we have access to the variable mod, which contains the model for the particular state.
    • The result is constructed as a data.frame with the prediction and the actual values.
    • Three arguments are passed: the datastore prefix, the levels that were pre-computed, and that we need to connect to the database because we’re using a datastore.
    • The results are stored as a list of ore.frames. We can pull the scores for MA and compute a confusion matrix using table.

    This is fine. However, we likely don’t want to have a list of separate ore.frames as the result. We’d prefer to have a single ore.frame with all the results. This can be accomplished using the FUN.VALUE argument. Whenever a data.frame is the result of the user-defined R function, and if the structure of that data.frame is the same across all invocations of the group apply or row apply, you can combine them into a single result by defining the structure as follows:

    scores <- ore.groupApply(
      CHURN_TEST,
      INDEX=CHURN_TEST$state,
      myFunction,
      datastorePrefix="myC5.0model3",xlevels=xlevels, ore.connect=TRUE,
      FUN.VALUE=data.frame(pred=character(0),
            actual=character(0),
            state=character(0)));
    head(scores)
    scores.local <- ore.pull(scores)
    table(scores.local[scores.local$state=="MA",c("actual","pred")])

    scores.MA <- scores[scores$state=="MA",c("actual","pred")]
    table(scores.MA$actual, scores.MA$pred)

    A few important points to highlight:

    • FUN.VALUE is set to a data.frame that describes the format of the result. By providing this argument, you will get back a single ore.frame, not an ore.list object.
    • The group apply completes instantaneously because it is only defining the ore.frame, not actually performing the scoring. Not until the values are needed does the result get computed. We invoke head on the ore.frame in scores to highlight this.
    • We can pull the scores to the client to invoke table as before, but subselecting for state MA. However, we can also do this computation in the database using the transparency layer. First, we filter the rows for MA in scores.MA, and then invoke table on the two columns. Note: ORE requires passing the two columns explicitly to the overloaded function table.
    • To do this in parallel, add the argument parallel=TRUE to the ore.groupApply call.

    Wait! What happened to ore.rowApply?

    Above, we showed how to score with multiple models using ore.groupApply. But what if we had customers from a single state that we wanted to score in parallel? We can use ore.rowApply and rqRowEval to invoke a function on chunks of data (rows) at a time, from 1 to the total number of rows. (Note that values closer to the latter will have no benefit from parallelism, obviously.)


    scores <- ore.rowApply(
      CHURN_TEST[CHURN_TEST$state=="MA",],
      myFunction,
      datastorePrefix="myC5.0model3",xlevels=xlevels,
      ore.connect=TRUE, parallel=TRUE,
      FUN.VALUE=data.frame(pred=character(0),
            actual=character(0),
            state=character(0)),
      rows=200)
    scores
    table(scores$actual, scores$pred)

    A few points to highlight:

    • Since we want to perform the scoring in parallel by state, we filter the rows for MA. This will ensure that all rows processed can use the same predictive model.
• We set the rows argument to 200. CHURN_TEST has 1667 rows, so this results in nine executions of myFunction, with the first eight receiving 200 rows each and the last receiving 67 rows.
    • We also set parallel=TRUE above since we want the scoring performed in parallel.
    • The invocation of ore.rowApply returns immediately. Not until we print scores do we incur the cost of executing the underlying query. However, also note that each time we access scores, for example in the following call to table, we incur the cost of executing the query. If the result will be used many times in subsequent operations, you may want to create a table with the result using ore.create.
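For example, a one-line sketch of that last point (the table name is illustrative):

# Materialize the scores once so later queries reuse the stored table
# instead of re-running the scoring.
ore.create(scores, table = "CHURN_SCORES")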

    In SQL, we can do the same, but we’ll need to store the function in the R script repository (perhaps called "myScoringFunction") and also store xlevels in a datastore (perhaps called "myXLevels"). While we can pass complex objects in the R interface to embedded R functions, we cannot do that in SQL. Instead, we must pass the name of a datastore. Since the xlevels are in a datastore, the user-defined R function needs to be modified to take this other datastore name and load that datastore to have access to xlevels. This set of changes is left to the reader as an exercise.
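One possible sketch of that exercise, using the hypothetical names mentioned above (“myScoringFunction” and “myXLevels”), modifies the function to accept the xlevels datastore name and load it:

# The xlevels object is saved once beforehand: ore.save(xlevels, name="myXLevels")
myScoringFunction <- function(dat, xlevelsDatastore, datastorePrefix) {
  library(C50)
  ore.load(name = xlevelsDatastore)        # restores the object 'xlevels'
  state <- dat[1, "state"]
  datastoreName <- paste(datastorePrefix, state, sep = "_")
  dat$state <- NULL
  for (j in names(xlevels))
    dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
  ore.load(name = datastoreName)           # restores the model object 'mod'
  data.frame(pred   = predict(mod, dat, type = "class"),
             actual = dat$churn,
             state  = state)
}
ore.scriptCreate("myScoringFunction", myScoringFunction)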


    select * from table(rqRowEval(
      cursor(select /*+ parallel(t, 4) */ *
            from CHURN_TEST t
            where "state" = 'MA'),
      cursor(select 1 as "ore.connect",
            'myC5.0model3' as "datastorePrefix",
            'myXLevels' as "xlevelsDatastore"
            from dual),
      'select ''aaa'' "pred",''aaa'' "actual" , ''aa'' "state" from dual',
        200, 'myScoringFunction'));

    A few points to highlight:

    • The input cursor specifies a parallel hint on the input data cursor and filtering data for MA as well.
    • Several arguments are being passed, including the new argument to our function myXLevels.
    • The output form is specified in the SQL string. Care must be taken to ensure that the column names, ordering, and the length of character strings match the returned data.frame.

    Map Reduce

    The “row apply” functionality can be thought of in terms of the map-reduce paradigm where the mapper performs the scoring and outputs a data.frame value (no key required). There is no reducer, or the reducer is simply a pass-through.

    Memory and performance considerations

    Unlike with group apply, the rows argument in row apply ensures an upper bound on the number of rows (and hence memory requirement). The value of rows should be chosen to balance memory and parallel performance. The usual measures can be taken regarding setting memory limits on the R engine – as noted in Part 2.

    There may be instances where setting rows = 1 makes sense. For example, if the computation per row is intensive (i.e., takes a long time), sending one row per R engine may be appropriate. Experiment with a range of values for rows to determine the best value for your particular scenario.

    About

    The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.
