Thursday Aug 14, 2014

Selecting the most predictive variables – returning Attribute Importance results as a database table

Attribute Importance (AI) is a technique of Oracle Advanced Analytics (OAA) that ranks the relative importance of predictors given a categorical or numeric target for classification or regression models, respectively. OAA AI uses the minimum description length algorithm and produces importance scores such that predictors with positive scores help predict the target, while those with zero or negative scores do not, and may even contribute noise to a model, making it less accurate. OAA AI, however, considers predictors only pairwise with the target, so interactions among predictors are not addressed. OAA AI is a good first assessment of which predictors should be included in a classification or regression model, enabling what is sometimes called feature selection or variable selection.

In my series on Oracle R Enterprise Embedded R Execution, I explored how structured table results could be returned from embedded R calls. In a subsequent post, I explored how to return select results from a principal components analysis (PCA) model as a table. In this post, I describe how you can work with results from an Attribute Importance model from ORE embedded R execution via an R function. This R function takes a table name and target variable name as input, places the predictor rankings in a named ORE datastore also specified as input, and returns a data.frame with the predictor variable name, rank, and importance value.

The function below implements this functionality. Notice that we dynamically sync the named table and get its ore.frame proxy object. From here, we invoke ore.odmAI using the dynamically generated formula using the targetName argument. We pull out the importance component of the result, explicitly assign the column variable to the row names, and then reorder the columns. Next, we nullify the row names since these are now redundant with column variable.

The next three lines assign the result to a datastore. This is technically not necessary since the result is returned by this function, but if a user wanted to access this result without recomputing it, the user could simply retrieve the datastore object using another embedded R function. Loading the named datastore and returning its contents as an ore.frame in R or a database table in SQL is left as an exercise for the reader.

Lastly, the resulting data.frame is returned.

rankPredictors <- function(tableName,targetName,dsName) {
  # sync the named table and get its ore.frame proxy
  ore.sync(table=tableName)
  ore.attach()
  dat <- ore.get(tableName)
  # build a formula such as "Species~." and run Attribute Importance
  formulaStr <- paste(targetName,".",sep="~")
  res <- ore.odmAI(as.formula(formulaStr),dat)
  # keep the importance component; move the row names into a 'variable' column
  res <- res$importance
  res$variable <- rownames(res)
  res <- res[,c("variable","rank","importance")]
  row.names(res) <- NULL
  # save the result in the named datastore for later retrieval
  resName <- paste(tableName,targetName,"AI",sep=".")
  assign(resName,res)
  ore.save(list=c(resName),name=dsName,overwrite=TRUE)
  res
}

To test this function, we invoke it explicitly with suitable arguments.

res <- rankPredictors ("IRIS","Species","/DS/Test1")
res

Here, you see the results.

> res
    variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

The contents of the datastore can be accessed as well.

ore.datastore(pattern="/DS")
ore.datastoreSummary(name="/DS/Test1")
ore.load("/DS/Test1")
IRIS.Species.AI
> ore.datastore(pattern="/DS")
  datastore.name object.count size       creation.date description
1      /DS/Test1            1  355 2014-08-14 16:38:46        <NA>
> ore.datastoreSummary(name="/DS/Test1")
      object.name      class size length row.count col.count
1 IRIS.Species.AI data.frame  355      3         4         3
> ore.load("/DS/Test1")
[1] "IRIS.Species.AI"
> IRIS.Species.AI
    variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077
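
As an aside, the exercise mentioned earlier of loading the named datastore from another embedded R function and returning its contents as an ore.frame can be sketched as follows. This is one possible approach, not part of the original function: the inner function and its arguments are illustrative, and it assumes ore.load restores the saved object into the R engine's global environment.

ore.doEval(
  function(dsName, objName) {
    ore.load(name=dsName)             # restore the object saved by rankPredictors
    get(objName, envir=globalenv())   # assumption: restored objects land in the global environment
  },
  dsName="/DS/Test1",
  objName="IRIS.Species.AI",
  FUN.VALUE=data.frame(variable=character(0),
                       rank=numeric(0),
                       importance=numeric(0)),
  ore.connect=TRUE
)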

With the confidence that our R function is behaving correctly, we load it into the R Script Repository in Oracle Database.

ore.scriptDrop("rankPredictors")
ore.scriptCreate("rankPredictors",rankPredictors)

To test that the function behaves properly with embedded R execution, we invoke it first from R using ore.doEval, passing the desired parameters and returning the result as an ore.frame. This last part is enabled through the specification of the FUN.VALUE argument. Since we are using a datastore and the transparency layer, ore.connect is set to TRUE.

ore.doEval(
  FUN.NAME="rankPredictors",
  tableName="IRIS",
  targetName="Species",
  dsName="/AttributeImportance/IRIS/Species",
  FUN.VALUE=data.frame(variable=character(0)
      ,rank=numeric(0)
      ,importance=numeric(0)),
  ore.connect=TRUE
)

Notice we get the same result as above.

    variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

Again, we can view the datastore contents for the execution above. Notice our use of the “/” notation to organize our datastore content. While we can name datastores with any arbitrary string, this approach can help structure the retrieval of datastore contents.

ore.datastore(pattern="/AttributeImportance/IRIS")
ore.datastoreSummary(name="/AttributeImportance/IRIS/Species")

We have a single datastore matching our IRIS data set followed by the summary with the IRIS.Species.AI object, which is an R data.frame with 3 columns and 4 rows.

> ore.datastore(pattern="/AttributeImportance/IRIS")
                     datastore.name object.count size       creation.date description
1 /AttributeImportance/IRIS/Species            1  355 2014-08-14 16:55:40        <NA>
> ore.datastoreSummary(name="/AttributeImportance/IRIS/Species")
      object.name      class size length row.count col.count
1 IRIS.Species.AI data.frame  355      3         4         3

To execute this R script from SQL, use the ORE SQL API.

select * from table(rqEval(
  cursor(select 1 "ore.connect",
      'IRIS' "tableName",
      'Species' "targetName",
      '/AttributeImportance/IRIS/Species' "dsName"
      from dual),
  'select cast(''a'' as varchar2(50)) "variable",
  1 "rank",
  1 "importance"
  from dual',
  'rankPredictors'));

In summary, we’ve explored how to use ORE embedded R execution to extract model elements from an in-database algorithm and present them as an R data.frame, ore.frame, and SQL table.

The process used above can also serve as a template for working on your own embedded R execution projects:

+ Interactively develop an R script that does what you need and wrap it in a function
+ Validate that the R function behaves as expected
+ Store the function in the R Script Repository
+ Validate that the R interface to embedded R execution produces the desired results
+ Generate a SQL query that invokes the R function
+ Validate that the SQL interface to embedded R execution produces the desired results

Wednesday Jul 30, 2014

For CMOs: Take Your Company’s Data to a New Level for Marketing Insights

This guest post from Phyllis Zimbler Miller, Digital Marketer, comments on uses of predictive analytics for marketing insights that could benefit from in-database scalability and ease of production deployment with Oracle R Enterprise.

Does your company have tons of data, such as for how many seconds people watch each short video on your site before clicking away, and you are not yet leveraging this data to benefit your company’s bottom line?

Missed opportunities can be overcome by utilizing predictive analytics

Predictive analytics uses statistical and machine learning techniques that analyze current and historical facts to make predictions about events. For example, your company could take data you’ve already collected and, utilizing statistical analysis software, gain insights into the behavior of your target audiences.

Previously, running the software to analyze this data could take many hours or even days. Today, with advanced software and hardware options, this analysis can take minutes.

Customer segmentation and customer satisfaction based on data analysis

Using predictive analytics you could, for example, begin to evaluate which prospective customers in which part of the country tend to watch which videos on your site longer than the other videos on your site. This evaluation can then be used by your marketing people to craft regional messages that can better resonate with people in those regions.

In terms of data analysis for customer satisfaction, imagine an online entertainment streaming service using data analysis to determine at what point people stop watching a particular film or TV episode. Presumably this information could then be used, among other things, to improve the individual recommendations for site members.

Or imagine an online game company using data analysis of player actions for customer satisfaction insights. Although certain actions may not be against the rules, these actions might artificially increase a player’s ranking against other players, which would interfere with the game satisfaction of others. The company could use data analysis to look for players “gaming” the system and take appropriate action.

Customer retention opportunities from data analysis

Perhaps one of the most important opportunities for analysis of data your company may already have is for customer retention efforts. Let’s say you have a subscription model business. You perform data analysis and discover that your biggest drop-offs are at the 3-month and 6-month points.

First, your marketing department comes up with incentives offered to customers right before those drop-off points – incentives that require extending the customer’s subscription.

Then you use data analysis to evaluate whether there is a statistical difference in the drop-offs after the incentives have been instituted.

Next you try different incentives for those drop-off points and analyze that data. Which incentives seem to better improve customer retention?

Companies with large volume data

Your company may already be using Oracle Database. If your company’s database has a huge amount of data, Oracle has an enterprise solution to improve the efficiency and scalability of running the R statistical programming language, which can be effectively used in many cases for this type of predictive analytics.

Oracle R Enterprise offers scalability, performance, and ease of production deployment. Using Oracle R Enterprise, your company’s data analysis procedures can overcome R memory constraints and, utilizing parallel distributed algorithms, considerably reduce execution time.

Regardless of the amount of data your company has, you still need to consider how to get your advanced analytics into production quickly and easily. The ability to integrate R scripts with production database applications using SQL eliminates delays in moving from development to production use.

And the quicker and easier you can analyze your data, the sooner you can benefit from valuable insights into customer segmentation, satisfaction, and retention in addition to many other customer/marketing applications.

Friday Jul 25, 2014

Addressing Data Order Between R and Relational Databases

Almost all data in R is a vector or is based upon vectors (vectors themselves, matrices, data frames, lists, and so forth).  The elements of a vector in R have an explicit order, and each element can be individually indexed.  R's in-memory processing relies on this order of elements for many computations, e.g., computing quantiles and summaries for time series objects.
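
For example, a short R session shows that vector elements are positionally addressable and that order-dependent computations rely on that layout (a simple illustration, not from the original post):

x <- c(5, 2, 9, 4)
x[2]       # 2: the second element, addressed by position
diff(x)    # -3  7 -5: order-dependent, as in time series computations
cumsum(x)  # 5  7 16 20: also depends on element order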

By design, query results in relational algebra are unordered.  Repeating the same query multiple times is not guaranteed to return results in the same order. Similarly, database-backed relational data also do not guarantee row order.  However, an explicit order on database tables and views can be defined by using an ORDER BY clause in the SQL SELECT statement.  Ordering is usually achieved by having a unique identifier, either a single or multi-column key specified in the ORDER BY clause.
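
For instance, the following SQL illustrates the difference (a minimal sketch, assuming a table NARROW with a unique ID column, as used below):

SELECT * FROM NARROW;              -- row order may differ from one execution to the next
SELECT * FROM NARROW ORDER BY ID;  -- an ORDER BY on a unique key yields a repeatable order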

To bridge between ordered R data frames and normally unordered data in a relational database such as Oracle Database, Oracle R Enterprise provides the ability to create ordered and unordered ore.frame objects.  Oracle R Enterprise supports ordering an ore.frame by assigning row names using the function row.names.

Ordering Using Row Names

Oracle R Enterprise supports ordering using row names.  For example, suppose that the ore.frame object NARROW, which is a proxy object for the corresponding database table, is unordered.  The following examples illustrate using the row.names function to create a unique identifier for each row. First, note that retrieving row names for an unordered ore.frame object returns an error:

R> row.names(NARROW)
Error: ORE object has no unique key

If an ore.frame is unordered, row indexing is not permitted, since there is no unique ordering.  For example, an attempt to retrieve the 5th row from the NARROW data returns an error:

R> NARROW[5,]
Error: ORE object has no unique key


The R function row.names can also be used to assign row names explicitly and thus create a unique row identifier.  We'll do this using the variable "ID" on the NARROW data:

R> row.names(NARROW) <- NARROW$ID


R> row.names(head(NARROW[ ,1:3]))
[1] 101501 101502 101503 101504 101505 101506 

We can now index to a specific row number using integer indexing:

R> NARROW[5,]     
           ID GENDER AGE MARITAL_STATUS                   COUNTRY EDUCATION OCCUPATION YRS_RESIDENCE CLASS
101505 101505   <NA>  34         NeverM  United States of America   Masters      Sales             5     1

Similarly, to index a range of row numbers, use:

R> NARROW[2:3,]
           ID GENDER AGE MARITAL_STATUS                   COUNTRY EDUCATION OCCUPATION YRS_RESIDENCE CLASS
101502 101502   <NA>  27         NeverM  United States of America     Bach.      Sales             3     0
101503 101503   <NA>  20         NeverM  United States of America   HS-grad    Cleric.             2     0


To index a specific row by row name, use character indexing:

R> NARROW["101502",]

  ID GENDER AGE MARITAL_STATUS COUNTRY EDUCATION OCCUPATION YRS_RESIDENCE CLASS

101502 101502 <NA> 27 NeverM United States of America Bach. Sales 3 0

Ordering Using Keys

You can also use the primary key of a database table to order an ore.frame object.  When you execute ore.connect in an R session, Oracle R Enterprise creates a connection to a schema in an Oracle Database instance. To gain access to the data in the database tables in that schema, you explicitly call the ore.sync function, which creates an ore.frame object that is a proxy for a table in the schema.  With the schema argument, you can specify the schema for which you want to create proxy objects in an R environment.  With the use.keys argument, you can specify whether you want to use primary keys in the table to order the ore.frame object.
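
For example, the following call creates proxy ore.frame objects for all tables in a schema, using primary keys, where defined, for ordering (a sketch; RQUSER is a hypothetical schema name):

R> ore.sync(schema = "RQUSER", use.keys = TRUE)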

To return the NARROW data to its unordered state, remove the previously created row names:

R> row.names(NARROW) <- NULL


Using a SQL statement, alter the NARROW table to add a primary key on the ID column:


R> ore.exec("alter table NARROW add constraint NARROW primary key (\"ID\")")


Synchronize the table to obtain the updated key by executing the ore.sync command with the use.keys argument set to TRUE.

R> ore.sync(table = "NARROW", use.keys = TRUE)


The row names of the ordered NARROW data are now the primary key column values:


R> head(NARROW[, 1:3])
   ID GENDER AGE
1 101501   <NA>  41
2 101502   <NA>  27
3 101503   <NA>  20
4 101504   <NA>  45
5 101505   <NA>  34
6 101506   <NA>  38

If your database table already contains a key, there is no need to create the key again. 
Simply execute ore.sync with use.keys set to TRUE when you want to use the primary key:


R> ore.sync(table = "TABLE_NAME", use.keys = TRUE)


Ordering database tables and views is known to reduce performance because it requires sorting.
As most operations in R do not require ordering, the performance hit due to sorting is unnecessary, and you should generally set use.keys to FALSE in ore.sync. Only when ordering is necessary, for operations such as sampling data or running the diff command to compare objects, should keys be used.
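
For example, when ordering is not needed:

R> ore.sync(table = "NARROW", use.keys = FALSE)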

Options for Ordering

Oracle R Enterprise contains options that relate to the ordering of an ore.frame object. The ore.warn.order global option specifies whether you want Oracle R Enterprise to display a warning message if you use an unordered ore.frame object in a function whose results are order dependent. If you know what to expect from operations involving aggregates, group summaries, or embedded R computations,  then you might want to turn the warnings off so they do not appear in the output:

R> options("ore.warn.order")
$ore.warn.order
[1] TRUE

R> options("ore.warn.order" = FALSE)

R> options("ore.warn.order")
$ore.warn.order
[1] FALSE

Note that in some circumstances unordered data may appear to have a repeatable order; however, since order was never guaranteed in the first place, it may change in future runs. Additionally, parallel query execution can significantly change the order of the result compared with sequential execution.

Thursday Jul 24, 2014

Are you experiencing analytics pain points?

At the user!2014 conference at UCLA in early July, which was a stimulating and well-attended conference, I spoke about Oracle’s R Technologies during the sponsor talks. One of my slides focused on examples of analytics pain points we often hear from customers and prospects. For example,

“It takes too long to get my data or to get the ‘right’ data”
“I can’t analyze or mine all of my data – it has to be sampled”
“Putting R models and results into production is ad hoc and complex”
“Recoding R models into SQL, C, or Java takes time and is error prone”
“Our company is concerned about data security, backup and recovery”
“We need to build 10s of thousands of models fast to meet business objectives”

After the talk, several people approached me remarking how these are exactly the problems they encounter in their organizations. One person even asked if I’d interviewed her for my talk, since she is experiencing every one of these pain points.

Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, addresses these pain points. Let’s take a look one by one.

If it takes too long to get your data, perhaps because you’re moving it from the database where it resides to an external analytics server or laptop, the ideal solution is: don’t move it. Analyze it where it is. This is exactly what Oracle R Enterprise allows you to do using the transparency layer and in-database predictive analytics capabilities. With Oracle R Enterprise, R functions normally performed on data.frames are translated to SQL for execution in the database, taking advantage of query optimization, indexes, parallel-distributed execution, etc. With the advent of the Oracle Database In-Memory option, this has even more advantages, but that’s a topic for another post. The second part of this pain point is getting access to the “right” data. Allowing your data scientist to have a sandbox with access to the range of data necessary to perform his/her work avoids the delay of requesting flat file extracts via the DBA, only to realize that more or different data is required. The cycle time in getting the “right” data impedes progress, not to mention annoying some key individuals in your organization. We’ll come back to the security aspects later.

Increasingly, data scientists want to avoid sampling when analyzing data or building predictive models. At a minimum, they want to use much more data than fits in typical analytics servers. Oracle R Enterprise provides an R interface to powerful in-database analytic functions and data mining algorithms. These algorithms are designed to work in a parallel, distributed manner whether the data fits in memory or not. In other cases, sampling is desired, if not required, but this results in a chicken-and-egg problem: the data need to be sampled since they won’t fit in memory, but the data are too big to fit in memory to sample! Users have developed home-grown techniques to chunk the data and combine partial samples; however, they shouldn’t have to. When sampling is desired or required, Oracle R Enterprise can leverage row indexing and in-database sampling to extract only the database table rows that are in the sample, using standard R syntax or Oracle R Enterprise-based sampling functions.

Our next pain point involves production deployment. Many good predictive models have been laid waste for lack of integration with or complexity introduced by production environments. Enterprise applications and dashboards often speak SQL and know how to access data. However, to craft a solution that extracts data, invokes an R script in an external R engine, and places batch results back in the database requires a lot of manual coding, often leveraging ad hoc cron jobs. Oracle R Enterprise enables the execution of R scripts on the database server machine, in local R engines under the control of Oracle Database. This can be done from R and SQL. Using the SQL API, R scripts can be invoked to return results in the form of table data, images, and XML. In addition, data can be moved to these R engines more efficiently, and the powerful database hardware, such as Exadata machines, can be leveraged for data-parallel and task-parallel R script execution.

When users don’t have access to a tight integration between R and SQL as noted above, another pain point involves using R only to build the models and relying on developers to recode the scoring procedures in a programming language that fits with the production environment, e.g., SQL, C, or Java. This has multiple downsides: it takes time to recode, manual recoding is error prone, and the resulting code requires significant testing. When the model is refreshed, the process repeats.

The pain points discussed so far also suffer from concerns about security, backup, and recovery. If data is being moved around in flat files, what security protocols or access controls are placed on those flat files? How can access be audited? Oracle R Enterprise enables analytics users to leverage an Oracle Database secured environment for data access. Moving on, if R scripts, models, and other R objects are stored and managed as flat files, how are these backed up? How are they synced with the deployed application? By storing all these artifacts in Oracle Database via Oracle R Enterprise, backup is a normal part of DBA operation with established protocols. The R Script Repository and Datastore simplify backup. With ad hoc solutions involving third-party analytic servers, there is also the issue of recovery, or resilience to failures. Fewer moving parts mean lower complexity. Programming for failure contingencies in a distributed application adds significant complexity to an application. Allowing Oracle Database to control the execution of R scripts in database server-side R engines reduces complexity and frees application developers and data scientists to focus on the more creative aspects of their work.

Lastly, users of advanced analytics software – data scientists, analysts, statisticians – are increasingly pushing the barrier of scalability: not just in the volume of data processed, but in the number and frequency of their computations and analyses, e.g., predictive model building. Where only a few models are involved, it may be tractable to manage a few files to store predictive models on disk (although, as noted above, this has its own complications). When you need to build thousands, or hundreds of thousands, of models, managing those models becomes a challenge in its own right.

In summary, customers are facing a wide range of pain points in their analytics activities. Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, addresses these pain points allowing data scientists, analysts, and statisticians, as well as the IT staff who supports them, to be more productive, while promoting and enabling new uses of advanced analytics.

Tuesday Jul 22, 2014

StubHub Taps into Big Data for Insight into Millions of Customers’ Ticket-Buying Patterns, Fraud Detection, and Optimized Ticket Prices

What can you use for a comprehensive platform for real-time analytics?
How do you drive company growth to leverage actions of millions of customers?
How can you process big data volumes for near-real-time recommendations and dramatically reduce fraud?

These questions, among others, posed challenges for StubHub. Read what StubHub achieved with Oracle R Enterprise, part of the Oracle Advanced Analytics option to Oracle Database.

Mike Barber, Senior Manager of Data Science at StubHub said:

“Big data is having a tremendous impact on how we run our business. Oracle Database and its various options—including Oracle Advanced Analytics—combine high-performance data-mining functions with the open source R language to enable predictive analytics, data mining, text mining, statistical analysis, advanced numerical computations, and interactive graphics—all inside the database.”

Yadong Chen, Principal Architect, Data Systems at StubHub said:

“We considered solutions from several other vendors, but Oracle Database was a natural choice for us because it enabled us to run analytics at the data source. This capability, together with the integration of open source R with the database, ensured scalability and enabled near-real-time analytics capabilities."

Read the full press release here.

Monday Jul 14, 2014

Using Embedded R Execution: Imputing Missing Data While Preserving Data Structure

This guest post from Matt Fritz, Data Scientist, demonstrates a method for imputing missing values in data using Embedded R Execution with Oracle R Enterprise.

Missing data is a common issue in analyses and is mitigated by imputation. Several techniques handle this process within Oracle R Enterprise; however, some bias the data or produce outputs as data objects that are less accessible than others. This post illustrates ways to effectively impute data while specifying the exact data structure of the output, keeping the result directly usable within Oracle R Enterprise.

Let’s first introduce missing values into the WorldPhones data set and create it as a table in Oracle R Enterprise:


  WorldPhones[c(2,6),c(1,2,4)] <- NA
  WorldPhones <- as.data.frame(WorldPhones)
  ore.create(WorldPhones, table = 'PHONES')
  class(PHONES)

  > class(PHONES)
  [1] "ore.frame"

  > PHONES
        N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
  1951  45939  21574 2876   1815    1646     89      555
  1956     NA     NA 4708     NA    2366   1411      733
  1957  64721  32510 5230   2695    2526   1546      773
  1958  68484  35218 6662   2845    2691   1663      836
  1959  71799  37598 6856   3000    2868   1769      911
  1960     NA     NA 8220     NA    3054   1905     1008
  1961  79831  43173 9053   3338    3224   2005     1076

The easiest way to handle missing data is by substituting these values with a constant, such as zero. We are ready to recode the missing values and can use either the Transparency Layer or Embedded R Execution. The Transparency Layer will convert the base R code below into SQL and run the generated SQL inside the database:

  newPHONES <- PHONES
  newPHONES$N.Amer <- ifelse(is.na(newPHONES$N.Amer),0,newPHONES$N.Amer)
  newPHONES$Europe <- ifelse(is.na(newPHONES$Europe),0,newPHONES$Europe)
  newPHONES$S.Amer <- ifelse(is.na(newPHONES$S.Amer),0,newPHONES$S.Amer)
  newPHONES

  > newPHONES
        N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
  1951   45939  21574 2876   1815    1646     89      555
  1956       0      0 4708      0    2366   1411      733
  1957   64721  32510 5230   2695    2526   1546      773
  1958   68484  35218 6662   2845    2691   1663      836
  1959   71799  37598 6856   3000    2868   1769      911
  1960       0      0 8220      0    3054   1905     1008
  1961   79831  43173 9053   3338    3224   2005     1076


This process can also be executed in Embedded R Execution – which spawns an R engine on the database server under the control of Oracle Database – by using a custom R function, such as: 

  function(x) ifelse(is.na(x),0,x) 

One way to call this custom function is with ore.doEval. This method requires code to be written as if it were to be executed on the client; however, the ore.doEval wrapper moves the code to the R Script Repository of Oracle R Enterprise in the database and then leverages the database server’s superior processing capacity: 

  newPHONES <- ore.doEval(
     function() {
        ore.sync(table="PHONES")
        ore.attach()
        data.frame(apply(ore.pull(PHONES)
           ,2
           ,function(x) ifelse(is.na(x),0,x)))}
     ,ore.connect=TRUE)

Note that we explicitly pull the data from the database using Oracle R Enterprise’s Transparency Layer on the database table PHONES. We must connect to the database to obtain the ore.frame that corresponds to the PHONES table. This is accomplished through the ore.sync function. The ore.attach function allows us to reference the ore.frame by its table name.

The second way is via ore.tableApply, which applies a function on an entire input table within Oracle R Enterprise. The same result is created as with ore.doEval and although both operations are successful, the output’s structure defaults to an ORE object instead of a data frame: 

  newPHONES <- ore.tableApply(PHONES
                    ,function(y) {
                       apply(y
                            ,2
                            ,function(x) ifelse(is.na(x),0,x))})
  class(newPHONES)

  > class(newPHONES)
   [1] "ore.object"

Since we cannot work with this object the same way as data frames or matrices, we must pull the ORE object onto the client in order to deserialize the object into an R matrix:

  newphones <- ore.pull(newPHONES)
  class(newphones)

  > class(newphones)
   [1] "matrix"

  > head(newphones)
     N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
  1   45939  21574 2876   1815    1646     89      555
  2       0      0 4708      0    2366   1411      733
  3   64721  32510 5230   2695    2526   1546      773
  4   68484  35218 6662   2845    2691   1663      836
  5   71799  37598 6856   3000    2868   1769      911
  6       0      0 8220      0    3054   1905     1008

In this example, it is preferred that the output be structured as a data frame so that we can continue to work within Oracle R Enterprise versus the client. The FUN.VALUE feature within Embedded R provides this flexibility by defining the output data’s structure. For example, the output can be explicitly expressed as a data frame of numeric columns that have identical names to the input. 


  newPHONES <- ore.tableApply(PHONES,
                   function(y) {
                     data.frame(apply(y,
                           2,
                           function(x) ifelse(is.na(x),0,x)))},
                              FUN.VALUE=data.frame(setNames(replicate(7,
                                                          numeric(0),
                                                          simplify=F),
                                                          colnames(PHONES))))


  class(newPHONES)

  > class(newPHONES)
   [1] "ore.frame"

We can now continue to work with the newPHONES output within Oracle R Enterprise just as we would a data frame.

While these methods are technically sufficient, they are not practical for this type of data set. As this is panel data ranging from 1951 to 1961, simply recoding missing values to zero appears to strongly bias the data. Perhaps we prefer to calculate the average of each missing observation’s pre- and post-period values. Embedded R allows for a simple solution by utilizing the open-source zoo package.

  newPHONES <-  ore.tableApply(PHONES,
                    function(y) {
                      library(zoo)
                        data.frame(
                        apply(y, 2, function(x) (na.locf(x) + rev(na.locf(rev(x))))/2))},
                           FUN.VALUE=data.frame(setNames(replicate(7,
                                                      numeric(0),
                                                      simplify=F),
                                                      colnames(PHONES))))
  newPHONES


  > newPHONES
     N.Amer  Europe Asia S.Amer Oceania Africa Mid.Amer
  1  45939 21574.0 2876   1815    1646     89      555
  2  55330 27042.0 4708   2255    2366   1411      733
  3  64721 32510.0 5230   2695    2526   1546      773
  4  68484 35218.0 6662   2845    2691   1663      836
  5  71799 37598.0 6856   3000    2868   1769      911
  6  75815 40385.5 8220   3169    3054   1905     1008
  7  79831 43173.0 9053   3338    3224   2005     1076

These imputed values seem much more reasonable and the output’s structure acts just like a data frame within Oracle R Enterprise.

To recap, handling missing values plays an important role in data analysis and several imputation methods can be leveraged via the Transparency Layer or Embedded R. Further, Embedded R’s FUN.VALUE feature explicitly defines the output’s structure and allows for results to be immediately analyzed within Oracle R Enterprise.

The FUN.VALUE feature requires more tuning when the output comprises both numeric and character columns. Check back for a later post that explains how to define a data frame of 'mixed class'.

Thursday Jun 05, 2014

Convert ddply {plyr} to Oracle R Enterprise, or use with Embedded R Execution

The plyr package contains a set of tools for partitioning a problem into smaller sub-problems that can be more easily processed. One function within {plyr} is ddply, which allows you to specify subsets of a data.frame and then apply a function to each subset. The result is gathered into a single data.frame. Such a capability is very convenient. The function ddply also has a parallel option that, if TRUE, applies the function in parallel, using the backend provided by foreach.

This type of functionality is available through Oracle R Enterprise using the ore.groupApply function. In this blog post, we show a few examples from Sean Anderson's "A quick introduction to plyr" to illustrate the corresponding functionality using ore.groupApply.

To get started, we'll create a demo data set and load the plyr package.


set.seed(1)
d <- data.frame(year = rep(2000:2014, each = 3),
        count = round(runif(45, 0, 20)))
dim(d)
library(plyr)

This first example takes the data frame, partitions it by year, and calculates the coefficient of variation of the count, returning a data frame.


# Example 1
res <- ddply(d, "year", function(x) {
  mean.count <- mean(x$count)
  sd.count <- sd(x$count)
  cv <- sd.count/mean.count
  data.frame(cv.count = cv)
  })

To illustrate the equivalent functionality in Oracle R Enterprise, using embedded R execution, we use the ore.groupApply function on the same data, but pushed to the database, creating an ore.frame. The function ore.push creates a temporary table in the database, returning a proxy object, the ore.frame.


D <- ore.push(d)
res <- ore.groupApply (D, D$year, function(x) {
  mean.count <- mean(x$count)
  sd.count <- sd(x$count)
  cv <- sd.count/mean.count
  data.frame(year=x$year[1], cv.count = cv)
  }, FUN.VALUE=data.frame(year=1, cv.count=1))

You'll notice the similarities in the first three arguments. With ore.groupApply, we augment the function to return the specific data.frame we want. We also specify the argument FUN.VALUE, which describes the resulting data.frame. From our previous blog posts, you may recall that by default, ore.groupApply returns an ore.list containing the results of each function invocation. To get a data.frame, we specify the structure of the result.

The results in both cases are the same; however, the ore.groupApply result is an ore.frame. In this case, the data stays in the database until it's actually required. This can result in significant memory and time savings when data is large.


R> class(res)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> head(res)
  year  cv.count
1 2000 0.3984848
2 2001 0.6062178
3 2002 0.2309401
4 2003 0.5773503
5 2004 0.3069680
6 2005 0.3431743
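
For comparison, omitting FUN.VALUE returns the default ore.list, whose elements can be retrieved individually (a small sketch using the same data):

resList <- ore.groupApply (D, D$year, function(x) {
  data.frame(year=x$year[1], cv.count = sd(x$count)/mean(x$count))
  })
class(resList)   # "ore.list"
resList[[1]]     # the data.frame computed for the first group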

To make ore.groupApply execute in parallel, you can set the argument parallel either to TRUE, to use default database parallelism, or to a specific number, which serves as a hint to the database as to how many parallel R engines should be used.
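
For example, a hint of four parallel R engines could be given as follows (a sketch; whether parallelism is actually used remains up to the database):

res <- ore.groupApply (D, D$year, function(x) {
  data.frame(year=x$year[1], cv.count = sd(x$count)/mean(x$count))
  }, FUN.VALUE=data.frame(year=1, cv.count=1), parallel=4)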

The next ddply example uses the summarise function, which creates a new data.frame. In ore.groupApply, the year column is passed in with the data. Since no automatic creation of columns takes place, we explicitly set the year column in the data.frame result to the value of the first row, since all rows received by the function have the same year.


# Example 2
ddply(d, "year", summarise, mean.count = mean(count))

res <- ore.groupApply (D, D$year, function(x) {
  mean.count <- mean(x$count)
  data.frame(year=x$year[1], mean.count = mean.count)
  }, FUN.VALUE=data.frame(year=1, mean.count=1))

R> head(res)
  year mean.count
1 2000   7.666667
2 2001  13.333333
3 2002  15.000000
4 2003   3.000000
5 2004  12.333333
6 2005  14.666667

Example 3 uses the transform function with ddply, which modifies the existing data.frame. With ore.groupApply, we again construct the data.frame explicitly, which is returned as an ore.frame.


# Example 3

ddply(d, "year", transform, total.count = sum(count))

res <- ore.groupApply (D, D$year, function(x) {
  total.count <- sum(x$count)
  data.frame(year=x$year[1], count=x$count, total.count = total.count)
  }, FUN.VALUE=data.frame(year=1, count=1, total.count=1))

> head(res)
  year count total.count
1 2000     5          23
2 2000     7          23
3 2000    11          23
4 2001    18          40
5 2001     4          40
6 2001    18          40

In Example 4, the mutate function with ddply enables you to define new columns that build on columns just defined. Since the construction of the data.frame using ore.groupApply is explicit, you always have complete control over when and how to use columns.


# Example 4

ddply(d, "year", mutate, mu = mean(count), sigma = sd(count),
      cv = sigma/mu)

res <- ore.groupApply (D, D$year, function(x) {
  mu <- mean(x$count)
  sigma <- sd(x$count)
  cv <- sigma/mu
  data.frame(year=x$year[1], count=x$count, mu=mu, sigma=sigma, cv=cv)
  }, FUN.VALUE=data.frame(year=1, count=1, mu=1,sigma=1,cv=1))

R> head(res)
  year count        mu    sigma        cv
1 2000     5  7.666667 3.055050 0.3984848
2 2000     7  7.666667 3.055050 0.3984848
3 2000    11  7.666667 3.055050 0.3984848
4 2001    18 13.333333 8.082904 0.6062178
5 2001     4 13.333333 8.082904 0.6062178
6 2001    18 13.333333 8.082904 0.6062178

In Example 5, ddply is used to partition data on multiple columns before constructing the result. Realizing this with ore.groupApply involves creating an index column out of the concatenation of the columns used for partitioning. This example also allows us to illustrate using the ORE transparency layer to subset the data.


# Example 5

baseball.dat <- subset(baseball, year > 2000) # data from the plyr package
x <- ddply(baseball.dat, c("year", "team"), summarize,
           homeruns = sum(hr))

We first push the data set to the database to get an ore.frame. We then add the composite column and perform the subset, using the transparency layer. Since the results from database execution are unordered, we will explicitly sort these results and view the first 6 rows.


BB.DAT <- ore.push(baseball)
BB.DAT$index <- with(BB.DAT, paste(year, team, sep="+"))
BB.DAT2 <- subset(BB.DAT, year > 2000)
X <- ore.groupApply (BB.DAT2, BB.DAT2$index, function(x) {
  data.frame(year=x$year[1], team=x$team[1], homeruns=sum(x$hr))
  }, FUN.VALUE=data.frame(year=1, team="A", homeruns=1), parallel=FALSE)
res <- ore.sort(X, by=c("year","team"))

R> head(res)
  year team homeruns
1 2001  ANA        4
2 2001  ARI      155
3 2001  ATL       63
4 2001  BAL       58
5 2001  BOS       77
6 2001  CHA       63

Our next example is derived from the ggplot function documentation. This illustrates the use of ddply together with the ggplot2 package. We first create a data.frame with demo data and use ddply to create some statistics for each group (gp). We then use ggplot to produce the graph. We can take this same code, push the data.frame df to the database, and invoke it on the database server. The graph will be returned to the client window, as depicted below.


# Example 6 with ggplot2

library(ggplot2)
df <- data.frame(gp = factor(rep(letters[1:3], each = 10)),
                 y = rnorm(30))
# Compute sample mean and standard deviation in each group
library(plyr)
ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))

# Set up a skeleton ggplot object and add layers:
ggplot() +
  geom_point(data = df, aes(x = gp, y = y)) +
  geom_point(data = ds, aes(x = gp, y = mean),
             colour = 'red', size = 3) +
  geom_errorbar(data = ds, aes(x = gp, y = mean,
                               ymin = mean - sd, ymax = mean + sd),
             colour = 'red', width = 0.4)

DF <- ore.push(df)
ore.tableApply(DF, function(df) {
  library(ggplot2)
  library(plyr)
  ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))
  ggplot() +
    geom_point(data = df, aes(x = gp, y = y)) +
    geom_point(data = ds, aes(x = gp, y = mean),
               colour = 'red', size = 3) +
    geom_errorbar(data = ds, aes(x = gp, y = mean,
                                 ymin = mean - sd, ymax = mean + sd),
                  colour = 'red', width = 0.4)
})

But let's take this one step further. Suppose we wanted to produce multiple graphs, partitioned on some index column. We replicate the data three times and add some noise to the y values, just to make the graphs a little different. We also create an index column to form our three partitions. Note that we've also specified that this should be executed in parallel, allowing Oracle Database to control and manage the server-side R engines. The result of ore.groupApply is an ore.list that contains the three graphs. Each graph can be viewed by printing the list element.

df2 <- rbind(df,df,df)
df2$y <- df2$y + rnorm(nrow(df2))
df2$index <- c(rep(1,30), rep(2,30), rep(3,30))
DF2 <- ore.push(df2)
res <- ore.groupApply(DF2, DF2$index, function(df) {
  df <- df[,1:2]
  library(ggplot2)
  library(plyr)
  ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))
  ggplot() +
    geom_point(data = df, aes(x = gp, y = y)) +
    geom_point(data = ds, aes(x = gp, y = mean),
               colour = 'red', size = 3) +
    geom_errorbar(data = ds, aes(x = gp, y = mean,
                                 ymin = mean - sd, ymax = mean + sd),
                  colour = 'red', width = 0.4)
  }, parallel=TRUE)
res[[1]]
res[[2]]
res[[3]]

To recap, we've illustrated how various uses of ddply from the plyr package can be realized in ore.groupApply, which affords the user explicit control over the contents of the data.frame result in a straightforward manner. We've also highlighted how ddply can be used within an ore.groupApply call.

Friday May 30, 2014

Financial institutions build predictive models using Oracle R Enterprise to speed model deployment

See the Oracle press release, Financial Institutions Leverage Metadata Driven Modeling Capability Built on the Oracle R Enterprise Platform to Accelerate Model Deployment and Streamline Governance, for a description of how a "unified environment for analytics data management and model lifecycle management brings the power and flexibility of the open source R statistical platform, delivered via the in-database Oracle R Enterprise engine to support open standards compliance."

Through its integration with Oracle R Enterprise, Oracle Financial Services Analytical Applications provides "productivity, management, and governance benefits to financial institutions, including the ability to:


  • Centrally manage and control models in a single, enterprise model repository, allowing for consistent management and application of security and IT governance policies across enterprise assets

  • Reuse models and rapidly integrate with applications by exposing models as services

  • Accelerate development with seeded models and common modeling and statistical techniques available out-of-the-box

  • Cut risk and speed model deployment by testing and tuning models with production data while working within a safe sandbox

  • Support compliance with regulatory requirements by carrying out comprehensive stress testing, which captures the effects of adverse risk events that are not estimated by standard statistical and business models. This approach supplements the modeling process and supports compliance with the Pillar I and the Internal Capital Adequacy Assessment Process stress testing requirements of the Basel II Accord

  • Improve performance by deploying and running models co-resident with data. Oracle R Enterprise engines run in database, virtually eliminating the need to move data to and from client machines, thereby reducing latency and improving security"

Tuesday May 27, 2014

R Package Installation with Oracle R Enterprise


Programming languages give developers the opportunity to write reusable functions and to bundle those functions into logical deployable entities. In R, these are called packages. R has thousands of such packages provided by an almost equally large group of third-party contributors. To allow others to benefit from these packages, users can share packages on the CRAN system for use by the vast R development community worldwide.

R's package system along with the CRAN framework provides a process for authoring, documenting and distributing packages to millions of users. In this post, we'll illustrate the various ways in which such R packages can be installed for use with R and together with Oracle R Enterprise. In the following, the same instructions apply when using either open source R or Oracle R Distribution.

In this post, we cover the following package installation scenarios:


R command line
Linux shell command line
Use with Oracle R Enterprise
Installation on Exadata or RAC
Installing all packages in a CRAN Task View
Troubleshooting common errors


1. R Package Installation Basics

R package installation basics are outlined in Chapter 6 of the R Installation and Administration Guide. There are two ways to install packages from the command line: from the R command line and from the shell command line. For this first example on Oracle Linux using Oracle R Distribution, we’ll install the arules package as root so that packages will be installed in the default R system-wide location where all users can access it, /usr/lib64/R/library.

Within R, using the install.packages function always attempts to install the latest version of the requested package available on CRAN:

R> install.packages("arules")


If the arules package depends upon other packages that are not already installed locally, the R installer automatically downloads and installs those required packages. This is a huge benefit that frees users from the task of identifying and resolving those dependencies.


You can also install R packages from the shell command line. This is useful for some packages when an internet connection is not available or for installing packages not uploaded to CRAN. To install packages this way, first locate the package on CRAN and then download the package source to your local machine. For example:

$ wget http://cran.r-project.org/src/contrib/arules_1.1-2.tar.gz

Then, install the package using the command R CMD INSTALL:

$ R CMD INSTALL arules_1.1-2.tar.gz


A major difference between installing R packages using the R package installer at the R command line and shell command line is that package dependencies must be resolved manually at the shell command line.
Package dependencies are listed in the Depends section of the package’s CRAN site. If dependencies are not identified and installed prior to the package’s installation, you will see an error similar to:

ERROR: dependency ‘xxx’ is not available for package ‘yyy’


As a best practice and to save time, always refer to the package’s CRAN site to understand the package dependencies prior to attempting an installation.
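
For instance, if the Matrix package that arules depends on were not already installed, you would download and install it first. This is a hedged example: the Matrix version shown is illustrative, and Matrix normally ships with R as a recommended package, so in practice it is usually already present.

$ wget http://cran.r-project.org/src/contrib/Matrix_1.1-3.tar.gz
$ R CMD INSTALL Matrix_1.1-3.tar.gz
$ R CMD INSTALL arules_1.1-2.tar.gz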


If you don’t run R as root, you won’t have permission to write packages into the default system-wide location and you will be prompted to create a personal library accessible by your userid. You can accept the personal library path chosen by R, or specify the library location by passing parameters to the install.packages function. For example, to create an R package repository in your home directory:

R> install.packages("arules", lib="/home/username/Rpackages")

or


$ R CMD INSTALL arules_1.1-2.tar.gz --library=/home/username/Rpackages


Refer to the install.packages help file in R or execute R CMD INSTALL --help at the shell command line for a full list of command line options.


To set the library location and avoid having to specify it at every package install, create the R startup environment file .Renviron in your home directory if it does not already exist, and add the following line to it:

R_LIBS_USER = "/home/username/Rpackages"


2. Setting the Repository

Each time you install an R package from the R command line, you are asked which CRAN mirror, or server, R should use. To set the repository and avoid having to specify this during every package installation, create the R startup command file .Rprofile in your home directory and add the following R code to it:

cat("Setting Seattle repository")
r = getOption("repos") 
r["CRAN"] = "http://cran.fhcrc.org/"
options(repos = r)
rm(r)

This code snippet sets the R package repository to the Seattle CRAN mirror at the start of each R session.


3. Installing R Packages for use with Oracle R Enterprise

Embedded R execution with Oracle R Enterprise allows the use of CRAN or other third-party R packages in user-defined R functions executed on the Oracle Database server. The steps for installing and configuring packages for use with Oracle R Enterprise are the same as for open source R. The database-side R engine just needs to know where to find the R packages.

The Oracle R Enterprise installation is performed by user oracle, which typically does not have write permission to the default site-wide library, /usr/lib64/R/library. On Linux and UNIX platforms, the Oracle R Enterprise Server installation provides the ORE script, which is executed from the operating system shell to install R packages and to start R. The ORE script is a wrapper for the default R script, a shell wrapper for the R executable. It can be used to start R, run batch scripts, and build or install R packages. Unlike the default R script, the ORE script installs packages to a location writable by user oracle and accessible by all ORE users: $ORACLE_HOME/R/library.

To install a package on the database server so that it can be used by any R user and for use in embedded R execution, an Oracle DBA would typically download the package source from CRAN using wget. If the package depends on any packages that are not in the R distribution in use, download the sources for those packages as well.

For a single Oracle Database instance, replace the R script with ORE to install the packages in the same location as the Oracle R Enterprise packages.

$ wget http://cran.r-project.org/src/contrib/arules_1.1-2.tar.gz
$ ORE CMD INSTALL arules_1.1-2.tar.gz


Behind the scenes, the ORE script performs the equivalent of setting R_LIBS_USER to the value of $ORACLE_HOME/R/library, and all R packages installed with the ORE script are installed to this location. For installing a package on multiple database servers, such as those in an Oracle Real Application Clusters (Oracle RAC) or a multinode Oracle Exadata Database Machine environment, use the ORE script in conjunction with the Exadata Distributed Command Line Interface (DCLI) utility.

$ dcli -g nodes -l oracle ORE CMD INSTALL arules_1.1-2.tar.gz

The DCLI -g flag designates a file containing a list of nodes to install on, and the -l flag specifies the user id to use when executing the commands. For more information on using DCLI with Oracle R Enterprise, see Chapter 5 in the Oracle R Enterprise Installation Guide.

If you are using an Oracle R Enterprise client, install the package the same as any R package, bearing in mind that you must install the same version of the package on both the client and server machines to avoid incompatibilities.
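
One quick way to check for a mismatch is to compare the versions reported on each side (a sketch, not from the original post; it requires an ORE connection and the arules package installed on both client and server):

R> client.version <- as.character(packageVersion("arules"))
R> server.version <- ore.pull(ore.doEval(function() as.character(packageVersion("arules"))))
R> identical(client.version, server.version)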


4. CRAN Task Views

CRAN also maintains a set of Task Views that identify packages associated with a particular task or methodology. Task Views are helpful in guiding users through the huge set of available R packages. They are actively maintained by volunteers who include detailed annotations for routines and packages. If you find one of the task views is a perfect match, you can install every package in that view using the ctv package - an R package for automating package installation.

To use the ctv package to install a task view, first, install and load the ctv package.

R> install.packages("ctv")

R> library(ctv)


Then query the names of the available task views and install the view you choose.


R> available.views()
R> install.views("TimeSeries")


5. Using and Managing R packages

To use a package, start up R and load packages one at a time with the library command.

Load the arules package in your R session.

R> library(arules)

Verify the version of arules installed.

R> packageVersion("arules")

[1] '1.1.2'


Verify the version of arules installed on the database server using embedded R execution.


R> ore.doEval(function() packageVersion("arules"))


View the help file for the apriori function in the arules package:


R> ?apriori


Over time, your package repository will contain more and more packages, especially if you are using the system-wide repository where others are adding additional packages. It’s good to know the entire set of R packages accessible in your environment. To list all available packages in your local R session, use the installed.packages command:

R> myLocalPackages <- row.names(installed.packages())

R> myLocalPackages


To access the list of available packages on the ORE database server from the ORE client, use the following embedded R syntax:

R> myServerPackages <- ore.doEval(function() row.names(installed.packages()))
R> myServerPackages


6. Troubleshooting Common Problems

Installing Older Versions of R packages

If you immediately upgrade to the latest version of R, you will have no problem installing the most recent versions of R packages. However, if your version of R is older, some of the more recent package releases will not work and install.packages will generate a message such as:

Warning message:
In install.packages("arules") : package ‘arules’ is not available


This is when you have to go to the Old sources link on the CRAN page for the arules package and determine which version is compatible with your version of R.


Begin by determining what version of R you are using:


$ R --version

Oracle Distribution of R version 3.0.1 (--) -- "Good Sport"
Copyright (C) The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)


Given that R-3.0.1 was released May 16, 2013, any version of the arules package released after this date may work. Scanning the arules archive, we might try installing version 1.1-1, released in January of 2014:

$ wget http://cran.r-project.org/src/contrib/Archive/arules/arules_1.1-1.tar.gz
$ R CMD INSTALL arules_1.1-1.tar.gz

For use with ORE:

$ ORE CMD INSTALL arules_1.1-1.tar.gz

The "package not available" error can also be thrown if the package you’re trying to install lives elsewhere, either another R package site, or it’s been removed from CRAN. A quick Google search usually leads to more information on the package’s location and status.


Oracle R Enterprise is not in the R library path

On Linux hosts, after installing the ORE server components, starting R, and attempting to load the ORE packages, you may receive the error:


R> library(ORE)
Error in library(ORE) : there is no package called ‘ORE’

If you know the ORE packages have been installed and you receive this error, this is the result of not starting R with the ORE script. To resolve this problem, exit R and restart using the ORE script. After restarting R and running the command to load the ORE packages, you should not receive any errors.

$ ORE
R> library(ORE)

On Windows servers, the solution is to make the location of the ORE packages visible to R by adding them to the R library paths. To accomplish this, exit R, then add the following line to the .Rprofile file. On Windows, the .Rprofile file is located in the R\etc directory, C:\Program Files\R\R-<version>\etc:

.libPaths("<path to $ORACLE_HOME>/R/library")

The above line tells R to include the R directory in the Oracle home as part of its library search path. When you start R, the path above will be included, and future R package installations will also be saved to $ORACLE_HOME/R/library. This path should be writable by the oracle user, or by the userid of the DBA tasked with installing R packages.
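
To confirm that R picked up the new location after restarting, list the library search paths and check that the Oracle home library appears:

R> .libPaths()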

Binary package compiled with different version of R

By default, R will install pre-compiled versions of packages if they are found. If the version of R under which the package was compiled does not match your installed version of R, you will get a message such as:

Warning message: package ‘xxx’ was built under R version 3.0.0

The solution is to download the package source and build it for your version of R.

$ wget
http://cran.r-project.org/src/contrib/Archive/arules/arules_1.1-1.tar.gz
$ R CMD INSTALL arules_1.1-1.tar.gz

For use with ORE:

$ ORE CMD INSTALL arules_1.1-1.tar.gz

Unable to execute files in /tmp directory

By default, R uses the /tmp directory to install packages. On security-conscious machines, the /tmp directory is often marked as "noexec" in the /etc/fstab file. This means that no file under /tmp can ever be executed, and users who attempt to install an R package will receive an error:

ERROR: 'configure' exists but is not executable -- see the 'R Installation and Administration Manual’

The solution is to set the TMP and TMPDIR environment variables to a location which R will use as the compilation directory. For example:

$ mkdir <some path>/tmp
$ export TMPDIR=<some path>/tmp
$ export TMP=<some path>/tmp

This error typically appears on Linux client machines and not database servers, as Oracle Database writes to the value of the TMP environment variable for several tasks, including holding temporary files during database installation.


7. Creating your own R package

Creating your own package and submitting to CRAN is for advanced users, but it is not difficult. The procedure to follow, along with details of R's package system, is detailed in the Writing R Extensions manual.

Monday May 19, 2014

Model cross-validation with ore.CV()

In this blog post we illustrate how to use Oracle R Enterprise for performing cross-validation of regression and classification models. We describe a new utility R function ore.CV that leverages features of Oracle R Enterprise and is available for download and use.

Predictive models are usually built on given data and verified on held-aside or unseen data. Cross-validation is a model improvement technique that avoids the limitations of a single train-and-test experiment by building and testing multiple models via repeated sampling from the available data. Its purpose is to offer better insight into how well the model would generalize to new data and to avoid over-fitting and drawing wrong conclusions from misleading peculiarities of the seen data.

In a k-fold cross-validation the data is partitioned into k (roughly) equal size subsets. One of the subsets is retained for testing and the remaining k-1 subsets are used for training. The process is repeated k times with each of the k subsets serving exactly once as testing data. Thus, all observations in the original data set are used for both training and testing.

The choice of k depends, in practice, on the size n of the data set. For large data, k=3 could be sufficient. For very small data, the extreme case where k=n, leave-one-out cross-validation (LOOCV), would use a single observation from the original sample as testing data and the remaining observations as training data. Common choices are k=10 or k=5.
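
As a small plain-R illustration of the fold assignment (this is not part of ore.CV, just the underlying idea), each of the n rows is assigned to one of k roughly equal-size subsets, and fold i serves exactly once as the test set:

# assign each of n rows to one of k folds
n <- 150; k <- 5
set.seed(42)
fold <- sample(rep(1:k, length.out = n))
table(fold)   # roughly equal-size folds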

For a select set of algorithms and cases, the function ore.CV performs cross-validation for models generated by ORE regression and classification functions using in-database data. ORE embedded R execution is leveraged to also support cross-validation for models built with vanilla R functions.

Usage

ore.CV(funType, function, formula, dataset, nFolds=<nb.folds>, fun.args=NULL, pred.args=NULL, pckg.lst=NULL)
  • funType - "regression" or "classification"
  • function - ORE predictive modeling functions for regression & classification or R function (regression only)
  • formula - object of class "formula"
  • dataset - name of the ore.frame
  • nFolds - number of folds
  • fun.args - list of supplementary arguments for 'function'
  • pred.args - list of supplementary arguments for 'predict'. Must be consistent with the model object/model generator 'function'.
  • pckg.lst - list of packages to be loaded by the DB R engine for embedded execution.
The set of functions supported for ORE include:
  • ore.lm
  • ore.stepwise
  • ore.neural
  • ore.glm
  • ore.odmDT
  • ore.odmSVM
  • ore.odmGLM
  • ore.odmNB
The set of functions supported for R include:
  • lm
  • glm
  • svm
Note: The 'ggplot2' and 'reshape' packages are required on the R client side for data post-processing and plotting (classification CV).

Examples

In the following examples, we illustrate various ways to invoke ore.CV using some datasets we have seen in previous posts. The datasets can be created as ore.frame objects using:
 
IRIS <- ore.push(iris)
LONGLEY <- ore.push(longley)
library(rpart)
KYPHOSIS <- ore.push(kyphosis)
library(PASWR)
TITANIC3 <- ore.push(titanic3)
MTCARS <- ore.push(mtcars)
(A) Cross-validation for models generated with ORE functions.
 
# Basic specification
ore.CV("regression","ore.lm",Sepal.Length~.-Species,"IRIS",nFolds=5)
ore.CV("regression","ore.neural",Employed~GNP+Population+Year,
            "LONGLEY",nFolds=5)

#Specification of function arguments
ore.CV("regression","ore.stepwise",Employed~.,"LONGLEY",nFolds=5,
            fun.args= list(add.p=0.15,drop.p=0.15))
ore.CV("regression","ore.odmSVM",Employed~GNP+Population+Year,
             "LONGLEY",nFolds=5, fun.args="regression")

#Specification of function arguments and prediction arguments
ore.CV("classification","ore.glm",Kyphosis~.,"KYPHOSIS",nFolds=5,
             fun.args=list(family=binomial()),pred.args=list(type="response"))
ore.CV("classification","ore.odmGLM",Kyphosis~.,"KYPHOSIS",nFolds=5,
            fun.args= list(type="logistic"),pred.args=list(type="response"))
 
(B) Cross-validation for models generated with R functions via the ORE embedded execution mechanism.

ore.CV("regression","lm",mpg~cyl+disp+hp+drat+wt+qsec,"MTCARS",nFolds=3)
ore.CV("regression","svm",Sepal.Length~.-Species,"IRIS",nFolds=5,
             fun.args=list(type="eps-regression"), pckg.lst=c("e1071")) 


Restrictions

  • The signature of the model generator ‘function’ must be of the following type: function(formula, data, ...). For example, functions like ore.stepwise, ore.odmGLM, and lm are supported, but the R step(object, scope, ...) function for AIC model selection via the stepwise algorithm does not satisfy this requirement.
  • The model validation process requires the prediction function to return a (1-dimensional) vector with the predicted values. If the (default) returned object is different the requirement must be met by providing an appropriate argument through ‘pred.args’. For example, for classification with ore.glm or ore.odmGLM the user should specify pred.args=list(type="response").
  • Cross-validation of classification models via embedded R execution of vanilla R functions is not supported yet.
  • Remark: Cross-validation is not a technique intended for large data, as the cost of multiple model training and testing can become prohibitive. Moreover, with large data sets, a sampled train-and-test split can be produced effectively. The current ore.CV does not impose any restrictions on the size of the input, and users working with large data should use good judgment when choosing the model generator and the number of folds.

    Output

    The function ore.CV provides output on several levels: datastores to contain model results, plots, and text output.

    Datastores

    The results of each cross-validation run are saved into a datastore named dsCV_funTyp_data_Target_function_nFxx, where funTyp, function, and nF (=nFolds) are as described above, data is the name of the dataset, and Target is the left-hand side of the formula. For example, if one runs the ore.neural, ore.glm, and ore.odmNB-based cross-validation examples from above, the following three datastores are produced:
    
    R> ds <- ore.datastore(pattern="dsCV")
    R> print(ds)
    datastore.name object.count size creation.date description
    1 dsCV_classification_KYPHOSIS_Kyphosis_ore.glm_nF5 10 4480326 2014-04-30 18:19:55 <NA>
    2 dsCV_classification_TITANIC3_survived_ore.odmNB_nF5 10 592083 2014-04-30 18:21:35 <NA>
    3 dsCV_regression_LONGLEY_Employed_ore.neural_nF5 10 497204 2014-04-30 18:16:35 <NA>
    
    Each datastore contains the models and prediction tables for every fold. Every prediction table has 3 columns: the fold index together with the target variable/class and the predicted values. If we consider the example from above and examine the most recent datastore (the Naive Bayes classification CV), we would see:
    
    R> ds.last <- ds$datastore.name[which.max(as.numeric(ds$creation.date))]
    R> ore.datastoreSummary(name=ds.last)
    object.name class size length row.count col.count
    1 model.fold1 ore.odmNB 66138 9 NA NA
    2 model.fold2 ore.odmNB 88475 9 NA NA
    3 model.fold3 ore.odmNB 110598 9 NA NA
    4 model.fold4 ore.odmNB 133051 9 NA NA
    5 model.fold5 ore.odmNB 155366 9 NA NA
    6 test.fold1 ore.frame 7691 3 261 3
    7 test.fold2 ore.frame 7691 3 262 3
    8 test.fold3 ore.frame 7691 3 262 3
    9 test.fold4 ore.frame 7691 3 262 3
    10 test.fold5 ore.frame 7691 3 262 3
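
    To work with these objects directly, the datastore can be restored into the current R session. The sketch below assumes the object names listed above; ore.load returns the names of the restored objects and creates them in the workspace.

    R> ore.load(name=ds.last)
    R> class(model.fold1)   # one of the per-fold ore.odmNB models
    R> head(test.fold1)     # fold index, actual class, and predicted class for fold 1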
    
    

    Plots

    The following plots are generated automatically by ore.CV and saved in an automatically generated OUTPUT directory:

  • Regression: ore.CV compares predicted vs. target values, and produces root mean square error (RMSE) and relative error (RERR) boxplots per fold. The example below is based on 5-fold cross-validation with the ore.lm regression model for Sepal.Length ~ . -Species using the ore.frame IRIS dataset.
  • Classification: ore.CV outputs a multi-plot figure for classification metrics like Precision, Recall, and F-measure. Each metric is captured per target class (side-by-side barplots) and fold (groups of barplots). The example below is based on the 5-fold CV of the ore.odmSVM classification model for Species ~ . using the ore.frame IRIS dataset.
  • Text output
    For classification problems, the confusion tables for each fold are saved in an output file residing in the OUTPUT directory together with a summary table displaying the precision, recall and F-measure metrics for every fold and predicted class.
    file.show("OUTDIR/tbl_CV_classification_IRIS_Species_ore.odmSVM_nF5")
    
    Confusion table for fold 1 :           
                 setosa versicolor virginica
      setosa          9          0         0
      versicolor      0         12         1
      virginica       0          1         7
    Confusion table for fold 2 :            
                 setosa versicolor virginica
      setosa          9          0         0
      versicolor      0          8         1
      virginica       0          2        10
    Confusion table for fold 3 :           
                 setosa versicolor virginica
      setosa         11          0         0
      versicolor      0         10         2
      virginica       0          0         7
    Confusion table for fold 4 :            
                 setosa versicolor virginica
      setosa          9          0         0
      versicolor      0         10         0
      virginica       0          2         9
    Confusion table for fold 5 :            
                 setosa versicolor virginica
      setosa         12          0         0
      versicolor      0          5         1
      virginica       0          0        12
    Accuracy, Recall & F-measure table per {class,fold}
       fold      class TP  m  n Precision Recall F_meas
    1     1     setosa  9  9  9     1.000  1.000  1.000
    2     1 versicolor 12 13 13     0.923  0.923  0.923
    3     1  virginica  7  8  8     0.875  0.875  0.875
    4     2     setosa  9  9  9     1.000  1.000  1.000
    5     2 versicolor  8  9 10     0.889  0.800  0.842
    6     2  virginica 10 12 11     0.833  0.909  0.870
    7     3     setosa 11 11 11     1.000  1.000  1.000
    8     3 versicolor 10 12 10     0.833  1.000  0.909
    9     3  virginica  7  7  9     1.000  0.778  0.875
    10    4     setosa  9  9  9     1.000  1.000  1.000
    11    4 versicolor 10 10 12     1.000  0.833  0.909
    12    4  virginica  9 11  9     0.818  1.000  0.900
    13    5     setosa 12 12 12     1.000  1.000  1.000
    14    5 versicolor  5  6  5     0.833  1.000  0.909
    15    5  virginica 12 12 13     1.000  0.923  0.960
    
              
    What's next
    Several extensions of ore.CV are possible involving sampling, parallel model training and testing, support for vanilla R classifiers, post-processing and output. More material for future posts.

    Thursday May 01, 2014

    "Darden uses analytics to understand customer restaurants"

    See the InformationWeek article Darden Uses Analytics To Understand Restaurant Customers highlighting Darden's use of Oracle Advanced Analytics:


    "Check-Level Analytics is one of the tools Darden plans to use to boost sales and customer loyalty, while pulling together data from across all its operations to show how they can work together better. ... it brought in some new tools, including Oracle Data Miner and Oracle R Enterprise, both included in the Oracle Advanced Analytics option to Oracle Database, to spot correlations and meaningful patterns in the data." 

    Darden's data analytics project earned them the 5th spot on this year's InformationWeek Elite 100 ranking.

    Sunday Apr 27, 2014

    Step-by-step: Returning R statistical results as a Database Table


    R provides a rich set of statistical functions that we may want to use directly from SQL. Many of these results can be readily expressed as structured table data for use with other SQL tables, or for use by SQL-enabled applications, e.g., dashboards or other statistical tools.

    In this blog post, we illustrate in a sequence of five simple steps how to go from an R function to a SQL-enabled result. Taken from a recent "proof of concept" customer engagement, our example involves using the function princomp, which performs a principal components analysis on a given numeric data matrix and returns the results as an object of class princomp. The customer actively uses this R function to produce loadings used in subsequent computations and analysis. The loadings component is a matrix whose columns contain the eigenvectors.

    The current process of pulling data from their Oracle Database, starting an R engine, invoking the R script, and placing the results back in the database was proving non-performant and unnecessarily complex. The goal was to leverage Oracle R Enterprise to streamline this process and allow the results to be immediately accessible through SQL.

    As a best practice, here is a process that can get you from start to finish:

    Step 1: Invoke from command line, understand results

    If you're using a particular R function, chances are you are familiar with the results it produces. However, you may not be familiar with their structure. We'll use an example from the R princomp documentation that uses the USArrests data set. We see that the class of the result is of type princomp, and the model prints the call and standard deviations of the components. To understand the underlying structure, we invoke the function str and see there are seven elements in the list, one of which is the matrix loadings.

    mod <- princomp(USArrests, cor = TRUE)
    class(mod)
    mod
    str(mod)


    Results:

    R> mod <- princomp(USArrests, cor = TRUE)
    R> class(mod)
    [1] "princomp"
    R> mod
    Call:
    princomp(x = USArrests, cor = TRUE)

    Standard deviations:
       Comp.1    Comp.2    Comp.3    Comp.4
    1.5748783 0.9948694 0.5971291 0.4164494

    4 variables and 50 observations.

    R> str(mod)
    List of 7
    $ sdev : Named num [1:4] 1.575 0.995 0.597 0.416
    ..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ center : Named num [1:4] 7.79 170.76 65.54 21.23
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ scale : Named num [1:4] 4.31 82.5 14.33 9.27
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ n.obs : int 50
    $ scores : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:50] "1" "2" "3" "4" ...
    .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ call : language princomp(x = dat, cor = TRUE)
    - attr(*, "class")= chr "princomp"


    Step 2: Wrap script in a function, and invoke from ore.tableApply

    Since we want to invoke princomp on database data, we first push the demo data, USArrests, to the database to create an ore.frame. Other data we wish to use will also be in database tables.

    We'll use ore.tableApply (for the reasons cited in the previous blog post)  providing the ore.frame as the first argument and simply returning within our function the model produced by princomp. We'll then look at its class, retrieve the result from the database, and check its class and structure once again.

    Notice that we are able to obtain the exact same result we received using our local R engine as with the database R engine through embedded R execution.

    dat <- ore.push(USArrests)
    computePrincomp <- function(dat) princomp(dat, cor=TRUE)
    res <- ore.tableApply(dat, computePrincomp)


    class(res)
    res.local <- ore.pull(res)
    class(res.local)
    str(res.local)
    res.local
    res


    Results:

    R> dat <- ore.push(USArrests)
    R> computePrincomp <- function(dat) princomp(dat, cor=TRUE)
    R> res <- ore.tableApply(dat, computePrincomp)
    R> class(res)
    [1] "ore.object"
    attr(,"package")
    [1] "OREembed"
    R> res.local <- ore.pull(res)
    R> class(res.local)
    [1] "princomp"


    R> str(res.local)
    List of 7
    $ sdev : Named num [1:4] 1.575 0.995 0.597 0.416
    ..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ center : Named num [1:4] 7.79 170.76 65.54 21.23
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ scale : Named num [1:4] 4.31 82.5 14.33 9.27
    ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    $ n.obs : int 50
    $ scores : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
    ..- attr(*, "dimnames")=List of 2
    .. ..$ : chr [1:50] "1" "2" "3" "4" ...
    .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
    $ call : language princomp(x = dat, cor = TRUE)
    - attr(*, "class")= chr "princomp"

    R> res.local
    Call:
    princomp(x = dat, cor = TRUE)

    Standard deviations:
       Comp.1    Comp.2    Comp.3    Comp.4
    1.5748783 0.9948694 0.5971291 0.4164494

    4 variables and 50 observations.
    R> res
    Call:
    princomp(x = dat, cor = TRUE)

    Standard deviations:
       Comp.1    Comp.2    Comp.3    Comp.4
    1.5748783 0.9948694 0.5971291 0.4164494


    4 variables and 50 observations.


    Step 3: Determine what results we really need

    Since we are only interested in the loadings and any result we return needs to be a data.frame to turn it into a database row set (table result), we build the model, transform the loadings object into a data.frame, and return the data.frame as the function result. We then view the class of the result and its values.

    Since we do this from the R API, we can simply print res to display the returned data.frame, as the print does an implicit ore.pull.

    returnLoadings <- function(dat) {
                        mod <- princomp(dat, cor=TRUE)
                        dd <- dim(mod$loadings)
                        ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
                        ldgs$variables <- row.names(ldgs)
                        ldgs
                      }
    res <- ore.tableApply(dat, returnLoadings)
    class(res)
    res

    # Persist the demo data as a database table so it can be queried from SQL in Step 5
    ore.create(USArrests, table="USARRESTS")


    Results:

    R> res <- ore.tableApply(dat, returnLoadings)
    R> class(res)
    [1] "ore.object"
    attr(,"package")
    [1] "OREembed"
    R> res
                 Comp.1     Comp.2     Comp.3     Comp.4 variables
    Murder   -0.5358995  0.4181809 -0.3412327  0.64922780 Murder
    Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748 Assault
    UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773 UrbanPop
    Rape     -0.5434321 -0.1673186  0.8177779  0.08902432 Rape


    Step 4: Load script into the R Script Repository in the database

    We're at the point of being able to load the script into the R Script Repository before invoking it from SQL. We can create the function from R or from SQL. In R,

    ore.scriptCreate('princomp.loadings', returnLoadings)

    or from SQL,

    begin
    --sys.rqScriptDrop('princomp.loadings');
    sys.rqScriptCreate('princomp.loadings',
          'function(dat) {
            mod <- princomp(dat, cor=TRUE)
            dd <- dim(mod$loadings)
            ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
            ldgs$variables <- row.names(ldgs)
            ldgs
          }');
    end;
    /


    Step 5: invoke from SQL SELECT statement

    Finally, we're able to invoke the function from SQL using the rqTableEval table function. We pass in a cursor with the data from our USARRESTS table. We have no parameters, so the next argument is NULL. To get the results as a table, we specify a SELECT string that defines the structure of the result. Note that the column names must be identical to what is returned in the R data.frame. The last parameter is the name of the function we want to invoke from the R script repository.

    Invoking this, we see the result as a table from the SELECT statement.

    select *
    from table(rqTableEval( cursor(select * from USARRESTS),
                            NULL,
                           'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual',
                            'princomp.loadings'));


    Results:

    SQL> select *
    from table(rqTableEval( cursor(select * from USARRESTS),NULL,
              'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual','princomp.loadings'));
    2 3
        Comp.1     Comp.2     Comp.3     Comp.4  variables
    ---------- ---------- ---------- ---------- ------------
    -.53589947  .418180865 -.34123273  .649227804 Murder
    -.58318363  .187985604 -.26814843 -.74340748  Assault
    -.27819087 -.87280619  -.37801579  .133877731 UrbanPop
    -.54343209 -.16731864   .817777908 .089024323 Rape

    As you see above, we have the loadings result returned as a SQL table.


    In this example, we walked through the steps of moving from invoking an R function to obtain a specific result to producing that same result from SQL by invoking an R script at the database server under the control of Oracle Database.

    Friday Apr 25, 2014

    ore.doEval and ore.tableApply: Which one is right for me?


    When beginning to use Oracle R Enterprise, users quickly grasp the techniques and benefits of using embedded R to run scripts in database-side R engines, and gain a solid sense of the available functions for executing R scripts through Oracle Database. However, various embedded R functions are closely related, and a few tips can help in learning which functions are most appropriate for the problem you wish to solve. In this post, we'll demonstrate best practices for two of the non-parallel embedded R functions, ore.doEval and ore.tableApply.

    As with all embedded R functions, both ore.doEval and ore.tableApply invoke R scripts at the database server in an R engine. The difference is that ore.doEval does not take a data table as an input parameter, as it's designed simply to execute the function provided. In contrast, ore.tableApply accepts a data table (i.e., ore.frame) as input to be delivered to the embedded R function. The functions ore.doEval and ore.tableApply can be made equivalent simply by passing the name of the database table and pulling the table data within the ore.doEval function.

    In the following examples, we show embedded R run times for ore.doEval and ore.tableApply using simple functions that build linear models to predict flight arrival delay based on distance traveled and departure delay.

    Model 1: Although ore.doEval does not explicitly accept data from a dedicated input argument, it's possible to retrieve data from the database using ore.sync and ore.pull within the function:

    R> system.time(mod1 <- ore.doEval(function(){
               ore.sync(table="ONTIME_S")
               dat <- ore.pull(ore.get("ONTIME_S"))
               lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)},
               ore.connect = TRUE))

        user  system elapsed
        0.008   0.000   4.941

    Model 2: Data can also be passed to a function in the R interface of embedded R execution, as shown here with ore.doEval, when connected to the database schema where the data table resides.

    R> system.time(mod2 <- ore.doEval(function(dat){
               lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)},
    dat = ONTIME_S))

       user  system elapsed
       3.196   0.128   9.476


    Model 3: The ore.tableApply function is designed to accept a database table, that is, an ore.frame, as the first input argument.

    R> system.time(mod3 <- ore.tableApply(ONTIME_S, function(dat){
                           lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)}))

       user  system elapsed
       0.001   0.000   3.870

    As the elapsed timings show, ore.tableApply (Model 3) is faster than both ore.doEval implementations (Model 1 and Model 2). Results may vary depending on data size and the operation being performed.  The ONTIME_S airline data used in these examples contains 220,000 rows and 26 columns, and the tests were executed on a Linux 5.8 server with 12 GB RAM and a single processor.

    In summary, ore.doEval takes a function parameter, but can be programmed to source data from Oracle Database or another external source. If your processing is driven by a database table, ore.tableApply is preferable because it's optimized for data transfer from Oracle Database to R. For both approaches, the data must be able to fit in the Database R Engine’s available memory.

    Unlike other embedded R functions, ore.doEval and ore.tableApply run serially by executing a single R process with the entire data in memory.  Other embedded R functions are enabled for parallel execution, and each has a distinct use case: row-wise "chunked" computations can be executed using ore.rowApply. The function ore.groupApply can be applied to grouped data for data sets that have natural partitioning. Lastly, the function ore.indexApply supports task-based execution, where one or more R engines perform the same or different calculations, or tasks, a specified number of times.
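
    For example, a minimal sketch of ore.groupApply using the ONTIME_S data from above builds one linear model per carrier, with the groups processed in database-side R engines; UNIQUECARRIER is assumed here to be a column of ONTIME_S identifying the airline.

    R> modList <- ore.groupApply(ONTIME_S, ONTIME_S$UNIQUECARRIER,
                      function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat),
                      parallel = TRUE)
    R> modList.local <- ore.pull(modList)   # bring the list of per-carrier models to the client
    R> summary(modList.local[[1]])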

    In this post, we haven't yet addressed the functions rqEval and rqTableEval, the SQL equivalents of ore.doEval and ore.tableApply. One distinction between the R and SQL embedded R execution interfaces is that you can pass a data.frame or ore.frame as an argument in the R interface, as illustrated with ore.doEval above; however, the SQL interface takes only scalar arguments as input, as with the rqEval function. The rqTableEval function accepts a full table using a dedicated input argument, and this input may be a cursor or the result of a query.


    Wednesday Apr 16, 2014

    Oracle's Strategy for Advanced Analytics

    At Oracle our goal is to enable you to get timely insight from all of your data. We continuously enhance Oracle Database to allow workloads that have traditionally required extracting data from the database to run in-place. We do this to narrow the gap that exists between insights that can be obtained and available data - because any data movement introduces latencies, complexity due to more moving parts, the ensuing need for data reconciliation and governance, as well as increased cost. The Oracle tool set considers the needs of all types of enterprise users - users preferring GUI based access to analytics with smart defaults and heuristics out of the box, users choosing to work interactively and quantitatively with data using R, and users preferring SQL and focusing on operationalization of models.

    Oracle recognized the need to support data analysts, statisticians, and data scientists with a widely used and rapidly growing statistical programming language. Oracle chose R - recognizing it as the new de facto standard for computational statistics and advanced analytics. Oracle supports R in at least 3 ways:


    • R as the language of interaction with the database

    • R as the language in which analytics can be written and executed in the database as a high performance computing platform

    • R as the language in which several native high performance analytics have been written that execute in database


    Additionally, of course, you may choose to leverage any of the CRAN algorithms to execute R scripts at the database server leveraging several forms of data parallelism.

    Providing the first and only supported commercial distribution of R from an established company, Oracle released Oracle R Distribution. In 2012 Oracle embarked on the Hadoop journey acknowledging alternative data management options emerging in the open source for management of unstructured or not-yet-structured data. In keeping with our strategy of delivering analytics close to where data is stored, Oracle extended Advanced Analytics capabilities to execute on HDFS resident data in Hadoop environments. R has been integrated into Hadoop in exactly the same manner as it has been with the database.

    Realizing that data is stored in both database and non-database environments, Oracle provides users options for storing their data (in Oracle Database, HDFS, and Spark RDD), where to perform computations (in-database or the Hadoop cluster), and where results should be stored (Oracle Database or HDFS). Users can write R scripts that can be leveraged across database and Hadoop environments. Oracle Database, as a preferred location for storing R scripts, data, and result objects, provides a real-time scoring and deployment platform. It is also easy to create a model factory environment with authorization, roles, and privileges, combined with auditing, backup, recovery, and security.

    Oracle provides a common infrastructure that supports both in-database and custom R algorithms. Oracle also provides an integrated GUI for business users. Oracle provides both R-based access and GUI-based access to in-database analytics. A major part of Oracle's strategy is to maintain agility in our portfolio of supported techniques - being responsive to customer needs.

    Thursday Mar 27, 2014

    Why choose Oracle for Advanced Analytics?

    If you're an enterprise company, chances are you have your data in an Oracle database. You chose Oracle for its global reputation at providing the best software products (and now engineered systems) to support your organization. Oracle Database is known for stellar performance and scalability, and Oracle delivers world class support.

    If your data is already in Oracle Database or moving in that direction, leverage the high performance computing environment of the database to analyze your data. Traditionally it was common practice to move data to separate analytic servers for the explicit purpose of model building. This is no longer necessary nor is it scalable as your organization seeks to deliver value from Big Data. Oracle database now has several state of the art algorithms that execute in a parallel and distributed architecture directly in-database and augmented by custom algorithms in the R statistical programming language. Leveraging Oracle database for Advanced Analytics has benefits including:


    • Eliminates data movement to analytic servers

    • Enables analysis of all data not just samples

    • Puts your database infrastructure to even greater use

    • Eliminates impedance mismatch in the form of model translation when operationalizing models

    • All aspects of modeling and deployment are optionally available via SQL, making integration into other IT software easier

    • Leverage CRAN algorithms directly in the database

    Customers such as Stubhub, dunnhumby, CERN OpenLab, Financiera Uno, Turkcell, and others leverage Oracle Advanced Analytics to scale their applications, simplify their analytics architecture, and reduce time to market of predictive models from weeks to hours or even minutes.

    Oracle leverages its own advanced analytics products, for example, by using Oracle Advanced Analytics in a wide range of Oracle Applications and internal deployments, ranging from:


    • Human Capital Management with Predictive Workforce to produce employee turnover, performance prediction, and "what if" analysis

    • Customer Relationship Management with Sales Prediction Engine to predict sales opportunities, what to sell, how much, and when

    • Supply Chain Management with Spend Classification to flag non-compliance or anomalies in expense submissions

    • Retail Analytics with Oracle Retail Customer Analytics to perform shopping cart analysis and next best offers

    • Oracle Financial Services Analytic Applications to enable quantitative analysts in credit risk management divisions to author rules/models directly in R


    Oracle wants you to be successful with advanced analytics. Working closely with customers to integrate Oracle Advanced Analytics as an integral process of their analytics strategy, customers are able to put their advanced analytics into production much faster.

    Thursday Mar 20, 2014

    ROracle 1-1.11 released - binaries for Windows and other platforms available on OTN


    We are pleased to announce the latest update of the open source ROracle package, version 1-1.11, with enhancements and bug fixes. ROracle provides high performance and scalable interaction from R with Oracle Database. In addition to availability on CRAN, ROracle binaries specific to Windows and other platforms can be downloaded from the Oracle Technology Network. Users of ROracle, please take our brief survey. We want to hear from you!

    Latest enhancements in version 1-1.11 of ROracle:

    • Performance enhancements for RAW data types and large result sets
    • Ability to cache the result set in memory to reduce memory consumption on successive reads
    • Added session mode to connect as SYSDBA or using external authentication
    • bug 17383542: Enhanced dbWriteTable() & dbRemoveTable() to work on global schema
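
    As a reminder of basic usage, here is a minimal connection sketch; the credentials and connect identifier below are placeholders for your own environment:

    library(ROracle)
    drv <- dbDriver("Oracle")
    con <- dbConnect(drv, username = "scott", password = "tiger", dbname = "orcl")
    dbGetQuery(con, "select count(*) from user_tables")
    dbDisconnect(con)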

    Users of ROracle are quite pleased with the performance and functionality:


    "In my position as a quantitative researcher, I regularly analyze database data up to a gigabyte in size on client-side R engines. I switched to ROracle from RJDBC because the performance of ROracle is vastly superior, especially when writing large tables. I've also come to depend on ROracle for transactional support, pulling data to my R client, and general scalability. I have been very satisfied with the support from Oracle -- their response has been prompt, friendly and knowledgeable."

               -- Antonio Daggett, Quantitative Researcher in Finance Industry


    "Having used ROracle for over a year now with our Oracle Database data, I've come to rely on ROracle for high performance read/write of large data sets (greater than 100 GB), and SQL execution with transactional support for building predictive models in R. We tried RODBC but found ROracle to be faster, much more stable, and scalable."

               -- Dr. Robert Musk, Senior Forest Biometrician, Forestry Tasmania


    See the ROracle NEWS for the complete list of updates.

    We encourage ROracle users to post questions and provide feedback on the Oracle R Technology Forum.

    In addition to being a high performance database interface to Oracle Database from R for general use, ROracle supports database access for Oracle R Enterprise.

    Monday Mar 17, 2014

    Oracle R Enterprise Upgrade Steps

    We've recently announced that Oracle R Enterprise 1.4 is available on all platforms. To upgrade Oracle R Enterprise to the latest version:

      1. *Install the version of R that is required for the new version of Oracle R Enterprise.
             See the Oracle R Enterprise supported platforms matrix for the latest requirements.
      2. Update Oracle R Enterprise Server on the database server by running the install.sh script
             and follow the prompts for the upgrade path.
      3. Update the Oracle R Enterprise Supporting packages on the database server.
      4. Update the Oracle R Enterprise Client and Supporting packages on the client.

    For RAC/Exadata installations, upgrade items 1, 2 and 3 must be performed on all compute nodes.

    *If you've changed the R installation directory between releases, manually update the location of the R_HOME directory in the Oracle R Enterprise configuration table.  The sys.rqconfigset procedure edits settings in a configuration table called sys.rq_config. Use of this function requires the sys privilege. You can view the contents of this table to verify various environment settings for Oracle R Enterprise. Among the settings stored in sys.rq_config is the R installation directory, or R_HOME. The following query shows sample values stored in sys.rq_config for a Linux server:

    SQL> select * from sys.rq_config;
    NAME        VALUE
    ------------------------------------------------------------------
    R_HOME       /usr/lib64/R
    MIN_VSIZE    32M
    MAX_VSIZE    4G
    R_LIBS_USER  /u01/app/oracle/product/12.0.1/dbhome_1/R/library
    VERSION      1.4
    MIN_NSIZE    2M
    MAX_NSIZE    20M

    7 rows selected.

    To point to the correct R_HOME:

    SQL> exec sys.rqconfigset('R_HOME', '<path to current R installation directory>')

    All Oracle R Enterprise downloads are available on the Oracle Technology Network. Refer to the instructions in section 8.3 of the Oracle R Enterprise Installations Guide for detailed steps on upgrading Oracle R Enterprise, and don't hesitate to post questions to the Oracle R forum.



    Saturday Mar 15, 2014

    Oracle R Enterprise 1.4 Released

    We’re pleased to announce that Oracle R Enterprise (ORE) 1.4 is now available for download on all supported platforms. In addition to numerous bug fixes, ORE 1.4 introduces an enhanced high performance computing infrastructure, new and enhanced parallel distributed predictive algorithms for both scalability and performance, added support for production deployment, and compatibility with the latest R versions.  These updates enable IT administrators to easily migrate the ORE database schema to speed production deployment, and statisticians and analysts have access to a larger set of analytics techniques for more powerful predictive models.

    Here are the highlights for the new and upgraded features in ORE 1.4:

    Upgraded R version compatibility


    ORE 1.4 is certified with R-3.0.1 - both open source R and Oracle R Distribution. See the server support matrix for the complete list of supported R versions. R-3.0.1 brings improved performance and big-vector support to R, and compatibility with more than 5000 community-contributed R packages.

    High Performance Computing Enhancements

    • Ability to specify degree of parallelism (DOP) for parallel-enabled functions (ore.groupApply, ore.rowApply, and ore.indexApply)
    • An additional global option, ore.parallel, to set the number of parallel threads used in embedded R execution (see the sketch below)
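
    A minimal sketch of the new global option (the value chosen here is arbitrary):

    options(ore.parallel = 4)    # default number of parallel R engines for embedded R execution
    getOption("ore.parallel")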

    Data Transformations and Analytics

    • ore.neural now provides a highly flexible network architecture with a wide range of activation functions, supporting 1000s of formula-derived columns, in addition to being a parallel and distributed implementation capable of supporting billion row data sets
    • ore.glm now also prevents selection of less optimal coefficient methods with parallel distributed in-database execution
    • Support for weights in regression models
    • New ore.esm enables time series analysis, supporting both simple and double exponential smoothing for scalable in-database execution
    • Execute standard R functions for Principal Component Analysis (princomp), ANOVA (anova), and factor analysis (factanal) on database data (see the sketch below)
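
    For example, a minimal sketch of in-database PCA, assuming an established ORE connection and that the ORE 1.4 overload of princomp accepts an ore.frame directly:

    IRIS <- ore.push(iris[, 1:4])   # numeric columns only
    pc <- princomp(IRIS)            # principal components computed on the ore.frame
    pc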

    Oracle Data Mining Model Algorithm Functions

    Newly exposed in-database Oracle Data Mining algorithms:

    • ore.odmAssocRules function for building Oracle Data Mining association models using the apriori algorithm
    • ore.odmNMF function for building Oracle Data Mining feature extraction models using the Non-Negative Matrix Factorization (NMF) algorithm
    • ore.odmOC function for building Oracle Data Mining clustering models using the Orthogonal Partitioning Cluster (O-Cluster) algorithm

    Production Deployment

    • New migration utility eases production deployment from development environments
    • "Snapshotting" of production environments for debugging in test systems

    For a complete list of new features, see the Oracle R Enterprise User's Guide. To learn more about Oracle R Enterprise, check out the white paper entitled, "Bringing R to the Enterprise -  A Familiar R Environment with Enterprise-Caliber Performance, Scalability, and Security.", visit Oracle R Enterprise on Oracle's Technology Network, or review the variety of use cases on the Oracle R blog.

    Monday Mar 10, 2014

    Oracle R Distribution 3.0.1 Benchmarks

    Oracle R Distribution, Oracle's distribution of Open Source R, improves performance by dynamically linking to optimized, multi-threaded BLAS libraries. Unlike open source R, Oracle R Distribution uses all available cores and processors when dynamically linked against optimized BLAS, resulting in increased performance. Thus, the more cores available to Oracle R Distribution, the higher performance for many operations.

    How is this possible? Standard R's internal BLAS library was created when multi-core machines were not widely used, so it is single-threaded, i.e., operates on a single core. However, the BLAS API in R allows linking to different, multi-threaded BLAS libraries that allow linear algebra computations to use all cores and therefore run much faster. Oracle R Distribution simplifies the linking process by loading the high performance math library after it's added to PATH or LD_LIBRARY_PATH, depending on the operating system. Then you are set to use optimized math libraries - the Intel Math Kernel Library (MKL), AMD Core Math Library (ACML), or the Sun Performance Library on Solaris.

    The benchmarks in this section demonstrate the performance of Oracle R Distribution 3.0.1 with and without dynamically loaded MKL. The R-25 benchmark script developed by the R community consists of fifteen tests. They are split into three groups (matrix calculation, matrix functions and "programmation") with trimmed means for each group, and each test is run three times. For this comparison, we report the mean for the three test runs. The benchmarks show that using Oracle R Distribution with dynamically loaded MKL libraries on a 30-core machine is significantly faster than the single core time.
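
    As a rough check on your own machine (this is not the R-25 script, just an illustration), time a BLAS-bound operation such as a large matrix cross-product; with MKL or ACML loaded, all available cores are used:

    n <- 4000
    x <- matrix(rnorm(n * n), n, n)   # random 4000 x 4000 matrix
    system.time(crossprod(x))         # BLAS-bound operation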


    Oracle R Distribution 3.0.1 Benchmarks


    This benchmark was executed on a 3-node cluster, with 24 cores at 3.07GHz per CPU and 47 GB RAM, using Linux 5.5.


    In-Database Scalability and Parallelism with Oracle R Enterprise


    Oracle R Enterprise, the set of big data analytics R packages provided by Oracle, provides scalable, parallel in-database data manipulation and algorithms for analyzing very large data sets. Oracle R Enterprise functions implement parallel, out-of-core algorithms that overcome R's limitations of being memory-bound and single-threaded by executing requested R calculations on data in Oracle Database, using the database itself as the computational engine. Oracle R Enterprise allows users to further leverage Oracle's engineered systems, like Exadata, Big Data Appliance, and Exalytics, for enterprise-wide analytics, as well as reporting tools like Oracle Business Intelligence Enterprise Edition dashboards and BI Publisher documents. The combination of Oracle Database and R delivers an enterprise-ready, integrated environment for advanced analytics.

    At the time of this post, Oracle R Distribution 3.0.1 is certified with Oracle R Enterprise 1.4.  See the Install Guide's Oracle R Enterprise support matrix  for a list of current Oracle R Distribution supported configurations and platforms, and this link for instructions on enabling high performance library support for Oracle R Distribution on a Windows or Linux client.

    Tuesday Feb 18, 2014

    Low-Rank Matrix Factorization in Oracle R Advanced Analytics for Hadoop

    This guest post from Arun Kumar, a graduate student in the Department of Computer Sciences at the University of Wisconsin-Madison, describes work done during his internship in the Oracle Advanced Analytics group.

    Oracle R Advanced Analytics For Hadoop (ORAAH), a component of Oracle’s Big Data Connectors software suite, is a collection of statistical and predictive techniques implemented on Hadoop infrastructure. In this post, we introduce and explain techniques for a popular machine learning task that has diverse applications ranging from predicting ratings in recommendation systems to feature extraction in text mining, namely matrix completion and factorization. Training, scoring, and prediction phases for matrix completion and factorization are available in ORAAH. The models generated can also be transparently loaded into R for ad-hoc inspection. In this blog post, we describe implementation specifics of these two techniques as available in ORAAH.

    Motivation

    Consider an e-commerce company that displays products to potential customers on its webpage and collects data about views, purchases, ratings (e.g., 1 to 5 stars), etc. Increasingly, such online retailers are using machine learning techniques to predict in advance which products a customer is likely to rate highly and recommend such products to the customers in the hope that they might purchase them. Users build a statistical model based on the past history of ratings by all customers on all products. One popular model to generate predictions from such a hyper-sparse matrix is the latent factor model, also known as the low-rank matrix factorization model (LMF).

    The setup is the following – we are given a large dataset of past ratings (potentially in the billions), say, with the schema (Customer ID, Product ID, Rating). Here, Customer ID refers to a distinct customer, Product ID refers to a distinct product, and Rating refers to a rating value, e.g., 1 to 5. Conceptually, this dataset represents a large matrix D with m rows (number of customers) and n columns (number of products), where the entries are the available ratings. Notice that this matrix is likely to be extremely sparse, i.e., many ratings could be missing since most customers typically rate only a few products. Thus, the task here is matrix completion – we need to predict the missing ratings so that it can be used for downstream processing such as displaying the top recommendations for each customer.

    The LMF model assumes that the ratings matrix can be approximately generated as a product of two factor matrices, L and R, which are much smaller than D (lower rank). The idea is that the product L * R will approximately reconstruct the existing ratings and also automatically predict the missing ratings in D. More precisely, for each available rating (i,j,v) in D, we have (L x R) [i,j] ≈ v, while for each missing rating (i',j') in D, the predicted rating is (L x R) [i',j']. The model has a parameter r, which dictates the rank of the factor matrices, i.e., L is m x r, while R is r x n.
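
    As a small plain-R illustration of the low-rank idea (this is not ORAAH code), the product of an m x r matrix and an r x n matrix yields a full m x n matrix of reconstructed and predicted ratings:

    set.seed(1)
    L <- matrix(rnorm(4 * 2), nrow = 4)   # m = 4 customers, rank r = 2
    R <- matrix(rnorm(2 * 3), nrow = 2)   # rank r = 2, n = 3 products
    D.hat <- L %*% R                      # m x n matrix of approximate ratings
    D.hat[1, 3]                           # predicted rating for customer 1, product 3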

    Matrix Completion in ORAAH

    LMF can be invoked out-of-the-box using the routine orch.lmf. An execution based on the above example is shown below. The dataset of ratings is in a CSV file on HDFS with the schema above (named “retail_ratings” here).


    input <- hdfs.attach("retail_ratings")
    fit <- orch.lmf(input)

    # Export the model into R memory
    lr <- orch.export.fit(fit)

    # Compute the prediction for the point (100, 50)

    # First column of lr$L contains the userid
    userid <- lr$L[,1] == 100 # find row corresponding to user id 100
    # 'rank' is the factorization rank r that was used when fitting the model
    L <- lr$L[, 2:(rank+1)]

    #First column contains the itemid
    itemid <- lr$R[,1] == 50 # find row corresponding to item id 50
    R <- lr$R[, 2:(rank+1)]

    # dot product as sum of terms obtained through component wise multiplication
    pred <- sum(L[userid,] * R[itemid,])

    The factor matrices can be transparently loaded into R for further inspection and for ad-hoc predictions of specific customer ratings using R. The algorithm we use for training the LMF model is called Incremental Gradient Descent (IGD), which has been shown to be one of the fastest algorithms for this task [1, 2].

    The entire set of arguments for the function orch.lmf along with a brief description of each and their default values is given in the table below. The latin parameter configures the degree of parallelism for executing IGD for LMF on Hadoop [2]. ORAAH sets this automatically based on the dimensions of the problem and the memory available to each Mapper. Each Mapper fits its partition of the model in memory, and the multiple partitions run in parallel to learn different parts of the model. The last five parameters configure IGD and need to be tuned by the user to a given dataset since they can impact the quality of the model obtained.

    ORAAH also provides routines for predicting ratings as well as for evaluating the model (computing the error of the model on a given labeled dataset) on a large scale over HDFS-resident datasets. The routine for prediction of ratings is predict, and for evaluating is orch.evaluate. Use help(orch.lmf) for online documentation, and demo(orch_lmf_jellyfish) for a fully working example including model fit, evaluation, and prediction.

    Other Matrix Factorization Tasks

    While LMF is primarily used for matrix completion tasks, it can also be used for other matrix factorization tasks that arise in text mining, computer vision, and bio-informatics, e.g., dimension reduction and feature extraction. In these applications, the input data matrix need not necessarily be sparse. Although many zeros might be present, they are not treated as missing values. The goal here is simply to obtain a low-rank factorization D ≈ L x R as accurately as possible, i.e., the product L x R should recover all entries in D, including the zeros. Typically, such applications use a Non-Negative Matrix Factorization (NMF) approach due to non-negativity constraints on the factor matrix entries. However, many of these applications often do not need non-negativity in the factor matrices. Using NMF algorithms for such applications leads to poorer-quality solutions. Our implementation of matrix factorization for such NMF-style tasks can be invoked out-of-the-box in ORAAH using the routine orch.nmf, which has the same set of arguments as LMF.

    Experimental Results & Comparison with Apache Mahout

    We now present an empirical evaluation of the performance, quality, and scalability of the ORAAH LMF tool based on IGD and compare it to the most widely used off-the-shelf tool for LMF on Hadoop – an implementation of the ALS algorithm from Apache Mahout [3].

    All our experiments are run on an Oracle Big Data Appliance Hadoop cluster with nine nodes, each with Intel Xeon X5675 12-core 3.07GHz processors, 48 GB RAM, and 20 TB disk. We use 256MB HDFS blocks and 10 reducers for MapReduce jobs.

    We use two standard public datasets for recommendation tasks – MovieLens10M (referred to as MLens) and Netflix – for the performance and quality comparisons (insert URL). To study scalability aspects, we use several synthetic datasets of different sizes by changing the number of rows, number of columns, and/or number of ratings. The table below presents the data set statistics.


    Results: Performance and Quality

    We first present end-to-end overview of the performance and quality achieved by our implementation and Mahout on MLens and Netflix. The rank parameter was set at 50 (a typical choice for such tasks) and the other parameters for both tools were chosen using a grid search. The quality of the factor matrices was determined using the standard measure of root mean square error (RMSE) [2]. We use a 70%-15%-15% Wold holdout of the datasets, i.e., 70% for training, 15% for testing, and 15% for validation of generalization error. The training was performed until 0.1% convergence, i.e., until the fractional decrease in the training RMSE after every iteration reached 0.1%. The table below presents the results.

    1. ORAAH LMF has a faster performance than Mahout LMF on the overall training runtime on both datasets – 1.8x faster on MLens and 2.3x faster on Netflix.
    2. The per-iteration runtime of ORAAH LMF is much lower than that of Mahout LMF – between 4.4x and 5.4x.
    3. Although ORAAH LMF runs more iterations than Mahout LMF, the huge difference in the per-iteration runtimes make the overall runtime smaller for ORAAH LMF.
    4. The training quality (training RMSE) achieved is comparable across both tools on both datasets. Similarly, the generalization quality is also comparable. Thus, ORAAH LMF can offer state-of-the-art quality along with faster performance.

    Results: Scalability

    The ability to scale along all possible dimensions of the data is key to big data analytics. Both ORAAH LMF and Mahout LMF are able to scale to billions of ratings by parallelizing and distributing computations on Hadoop. But we now show that unlike Mahout LMF, ORAAH LMF is also able to scale to hundreds of millions of customers (m) and products (n), and also scales well with the rank parameter (r, which affects the size of the factor matrices). The figure below presents the scalability results along these three dimensions – m, n, and r.

    1. Figures (A) and (B) plot the results for the Syn-row and Syn-col datasets, respectively (r = 2). ORAAH LMF scales linearly with both number of rows (m) and number of columns (n), while Mahout LMF does not show up on either plot because it crashes at all these values of m. In fact, we verified that Mahout LMF does not scale beyond even m = 20 M! The situation is similar with n. This is because Mahout LMF assumes that the factor matrices L and R fit entirely in the memory of each Mapper. In contrast, ORAAH LMF uses a clever partitioning scheme on all matrices ([2]) and can thus scale seamlessly on all dataset dimensions.
    2. Figure (C) shows the impact of the rank parameter r. ORAAH LMF scales linearly with r and the per-iteration runtime roughly doubles between r = 20 and r = 100. However, the per-iteration runtime of Mahout LMF varies quadratically with r, and in fact, increases by a factor of 40x between r = 20 and r = 100! Thus, ORAAH LMF is also able to scale better with r.
    3. Finally, on the tera-scale dataset Syn-tera with 1 billion rows, 10 million columns, and 20 billion ratings, ORAAH LMF (for r = 2) finishes an iteration in just under 2 hours!

    Acknowledgements

    The matrix factorization features in ORAAH were implemented and benchmarked by Arun Kumar during his summer internship at Oracle under the guidance of Vaishnavi Sashikanth. He is pursuing his PhD in computer science from the University of Wisconsin-Madison. This work is the result of a collaboration between Oracle and the research group of Dr. Christopher Ré, who is now at Stanford University. Anand Srinivasan helped integrate these features into ORAAH.

    References

    [1] Towards a Unified Architecture for in-RDBMS Analytics. Xixuan Feng, Arun Kumar, Benjamin Recht, and Christopher Ré. ACM SIGMOD 2012.

    [2] Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion. Benjamin Recht and Christopher Ré. Mathematical Programming Computation 2013.

    [3] Apache Mahout. http://mahout.apache.org/.

    About

    The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.
