Monday Aug 18, 2014

Real-time Big Data Analytics is a reality for StubHub with Oracle Advanced Analytics

What can you use for a comprehensive platform for real-time analytics?
How can you process big data volumes for near-real-time recommendations and dramatically reduce fraud?

Learn in this video what StubHub achieved with Oracle R Enterprise, part of the Oracle Advanced Analytics option to Oracle Database, and read more on their story here.

Advanced analytics solutions that impact the bottom line of a business are challenging because of the range of skills and people involved in realizing them. While we hear a lot about the role of the data scientist, that role is but one piece of the puzzle. Advanced analytics solutions also have an operationalization aspect, which requires close proximity to where the transactional activity occurs.

The data scientist needs access to the right data with which to model the business problem. This involves IT for data collection, management, and administration, as well as ensuring zero downtime (a website needs to be up 24x7). This also involves working with the data scientist to keep predictive models refreshed with the latest scripts.

Integrating advanced analytics solutions into enterprise apps involves not just generating predictions, but supporting the whole life-cycle from data collection, to model building, model assessment, and then outcome assessment and feedback to the model building process again. Application and web interface designers need to take into account how end users will see and use the advanced analytics results, e.g., supporting operations staff that need to handle the potentially fraudulent transactions.

As just described, advanced analytics projects can be "complicated" from a human perspective alone. The extent to which software can simplify the interactions among users and systems will increase the likelihood of project success. The ability to quickly operationalize advanced analytics projects and demonstrate measurable value means the difference between a successful project and just a nice research report.

By standardizing on Oracle Database and SQL invocation of R, along with in-database modeling as found in Oracle Advanced Analytics, expedient model deployment and zero downtime for refreshing models become a reality. Meanwhile, data scientists are also able to explore leading-edge techniques available in open source. The Oracle solution propels the entire organization forward to realize the value of advanced analytics.

Tuesday Jul 22, 2014

StubHub Taps into Big Data for Insight into Millions of Customers’ Ticket-Buying Patterns, Fraud Detection, and Optimized Ticket Prices

What can you use for a comprehensive platform for real-time analytics?
How do you drive company growth to leverage actions of millions of customers?
How can you process big data volumes for near-real-time recommendations and dramatically reduce fraud?

These questions, among others, were the challenges StubHub set out to address. Read what StubHub achieved with Oracle R Enterprise, part of the Oracle Advanced Analytics option to Oracle Database.

Mike Barber, Senior Manager of Data Science at StubHub, said:

“Big data is having a tremendous impact on how we run our business. Oracle Database and its various options—including Oracle Advanced Analytics—combine high-performance data-mining functions with the open source R language to enable predictive analytics, data mining, text mining, statistical analysis, advanced numerical computations, and interactive graphics—all inside the database.”

Yadong Chen, Principal Architect, Data Systems at StubHub, said:

“We considered solutions from several other vendors, but Oracle Database was a natural choice for us because it enabled us to run analytics at the data source. This capability, together with the integration of open source R with the database, ensured scalability and enabled near-real-time analytics capabilities.”

Read the full press release here.

Monday Jul 14, 2014

Using Embedded R Execution: Imputing Missing Data While Preserving Data Structure

This guest post from Matt Fritz, Data Scientist, demonstrates a method for imputing missing values in data using Embedded R Execution with Oracle R Enterprise.

Missing data is a common issue in analyses and is often mitigated by imputation. Several techniques handle this within Oracle R Enterprise; however, some bias the data, and some return results as objects that are less convenient to work with than others. This post illustrates ways to impute data effectively while specifying the exact structure of the output, so that the output remains usable within Oracle R Enterprise.

Let’s first introduce missing values into the WorldPhones data set and create the corresponding table in Oracle R Enterprise:

  WorldPhones[c(2,6),c(1,2,4)] <- NA
  # WorldPhones is a matrix; ore.create expects a data frame
  WorldPhones <- as.data.frame(WorldPhones)
  ore.create(WorldPhones, table = 'PHONES')

  > class(PHONES)
  [1] "ore.frame"

        N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
  1951  45939  21574 2876   1815    1646     89      555
  1956     NA     NA 4708     NA    2366   1411      733
  1957  64721  32510 5230   2695    2526   1546      773
  1958  68484  35218 6662   2845    2691   1663      836
  1959  71799  37598 6856   3000    2868   1769      911
  1960     NA     NA 8220     NA    3054   1905     1008
  1961  79831  43173 9053   3338    3224   2005     1076

The easiest way to handle missing data is by substituting these values with a constant, such as zero. We are ready to recode the missing values and can use either the Transparency Layer or Embedded R Execution. The Transparency Layer will convert the base R code below into SQL and run the generated SQL inside the database:

  newPHONES <- PHONES
  newPHONES$N.Amer <- ifelse(is.na(newPHONES$N.Amer),0,newPHONES$N.Amer)
  newPHONES$Europe <- ifelse(is.na(newPHONES$Europe),0,newPHONES$Europe)
  newPHONES$S.Amer <- ifelse(is.na(newPHONES$S.Amer),0,newPHONES$S.Amer)

  > newPHONES
        N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
  1951  45939  21574 2876   1815    1646     89      555
  1956      0      0 4708      0    2366   1411      733
  1957  64721  32510 5230   2695    2526   1546      773
  1958  68484  35218 6662   2845    2691   1663      836
  1959  71799  37598 6856   3000    2868   1769      911
  1960      0      0 8220      0    3054   1905     1008
  1961  79831  43173 9053   3338    3224   2005     1076
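
The recoding logic itself can be checked on a local copy of the data, without a database connection. The following is a minimal sketch in base R only; the names phones_local and recoded are illustrative and do not appear in the original post, and nothing here runs inside Oracle Database.

```r
# Local check of constant imputation; no Oracle R Enterprise connection is used.
# WorldPhones is a matrix that ships with base R (package datasets).
phones_local <- as.data.frame(WorldPhones)
phones_local[c(2, 6), c(1, 2, 4)] <- NA   # same cells set to missing as in the post

# Replace every NA with 0, column by column, keeping the data.frame structure.
recoded <- phones_local
recoded[] <- lapply(recoded, function(x) ifelse(is.na(x), 0, x))

print(recoded)
```

Assigning into recoded[] keeps the row names and column names of the data frame, which is why the years remain visible as row labels.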

This process can also be executed in Embedded R Execution – which spawns an R engine on the database server under the control of Oracle Database – by using a custom R function, such as: 

  function(x) ifelse(is.na(x),0,x)

One way to call this custom function is with ore.doEval. This method requires code to be written as if it were to be executed on the client; however, the ore.doEval wrapper moves the code to the R Script Repository of Oracle R Enterprise in the database and then leverages the database server’s superior processing capacity: 

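  newPHONES <- ore.doEval(
     function() {
           ore.sync(table = "PHONES")   # obtain the ore.frame for the PHONES table
           ore.attach()                 # allow PHONES to be referenced by name
           apply(ore.pull(PHONES),2,function(x) ifelse(is.na(x),0,x))},
     ore.connect = TRUE)                # connect to the database from the server-side R engine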

Note that we explicitly pull the data from the database using Oracle R Enterprise’s Transparency Layer on the database table PHONES. We must connect to the database to obtain the ore.frame that corresponds to the PHONES table. This is accomplished through the ore.sync function. The ore.attach function allows us to reference the ore.frame by its table name.

The second way is via ore.tableApply, which applies a function on an entire input table within Oracle R Enterprise. The same result is created as with ore.doEval and although both operations are successful, the output’s structure defaults to an ORE object instead of a data frame: 

  newPHONES <- ore.tableApply(PHONES,
                    function(y) {
                            apply(y,2,function(x) ifelse(is.na(x),0,x))})

  > class(newPHONES)
   [1] "ore.object"

Since we cannot work with this object the same way as data frames or matrices, we must pull the ORE object onto the client in order to deserialize the object into an R matrix:

  newphones <- ore.pull(newPHONES)

  > class(newphones)
   [1] "matrix"

  > head(newphones)
     N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
  1  45939  21574 2876   1815    1646     89      555
  2      0      0 4708      0    2366   1411      733
  3  64721  32510 5230   2695    2526   1546      773
  4  68484  35218 6662   2845    2691   1663      836
  5  71799  37598 6856   3000    2868   1769      911
  6      0      0 8220      0    3054   1905     1008

In this example, we prefer the output to be structured as a data frame so that we can continue to work within Oracle R Enterprise rather than on the client. The FUN.VALUE argument of the Embedded R Execution functions provides this flexibility by defining the structure of the output data. For example, the output can be explicitly expressed as a data frame of numeric columns that have identical names to the input.

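  newPHONES <- ore.tableApply(PHONES,
                   function(y) {
                           as.data.frame(apply(y,2,function(x) ifelse(is.na(x),0,x)))},
                   # template: zero-row data frame of numeric columns named like the input
                   FUN.VALUE = data.frame(setNames(replicate(7,numeric(0),simplify=FALSE),
                                                   colnames(PHONES))))
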
  > class(newPHONES)
   [1] "ore.frame"
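
To make the shape of such a FUN.VALUE template concrete, here is a small sketch that builds a zero-row data frame of numeric columns in plain R. The column names are hard-coded as an assumption standing in for colnames(PHONES), and the names col_names and template are illustrative only.

```r
# Build a zero-row data frame whose columns are all numeric.
# col_names is hard-coded here; with a live connection one would
# presumably take the names from the input table instead.
col_names <- c("N.Amer", "Europe", "Asia", "S.Amer",
               "Oceania", "Africa", "Mid.Amer")

template <- data.frame(
  setNames(replicate(length(col_names), numeric(0), simplify = FALSE),
           col_names)
)

str(template)   # shows 0 observations of 7 numeric variables
```

The template carries no data; it only describes the column names and types that the returned result is expected to have.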

We can now continue to work with the newPHONES output within Oracle R Enterprise just as we would a data frame.

While these methods are technically sufficient, they are not practical for this type of data set. As this is panel data ranging from 1951 to 1961, simply recoding missing values to zero appears to strongly bias the data. Perhaps we prefer to calculate the average of each missing observation’s pre- and post-period values. Embedded R allows for a simple solution by utilizing the open-source zoo package.

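  newPHONES <- ore.tableApply(PHONES,
                    function(y) {
                        library(zoo)   # na.locf comes from the zoo package
                        as.data.frame(apply(y, 2, function(x) (na.locf(x) + rev(na.locf(rev(x))))/2))},
                    FUN.VALUE = data.frame(setNames(replicate(7,numeric(0),simplify=FALSE),
                                                    colnames(PHONES))))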

  > newPHONES
     N.Amer  Europe Asia S.Amer Oceania Africa Mid.Amer
  1  45939 21574.0 2876   1815    1646     89      555
  2  55330 27042.0 4708   2255    2366   1411      733
  3  64721 32510.0 5230   2695    2526   1546      773
  4  68484 35218.0 6662   2845    2691   1663      836
  5  71799 37598.0 6856   3000    2868   1769      911
  6  75815 40385.5 8220   3169    3054   1905     1008
  7  79831 43173.0 9053   3338    3224   2005     1076

These imputed values seem much more reasonable and the output’s structure acts just like a data frame within Oracle R Enterprise.
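
The averaging of the previous and next observed values can also be reproduced locally as a sanity check. The sketch below uses base R only and swaps in a hand-written forward fill in place of zoo's na.locf; the helper names fill_forward and impute_neighbors are hypothetical, and the forward fill assumes the first and last values of each column are not missing.

```r
# Carry the last observed value forward (assumes x[1] is not NA).
fill_forward <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))   # position of the latest non-NA value
  x[idx]
}

# Average of the forward-filled and backward-filled series: observed values
# are unchanged, and each gap becomes the mean of its two neighbours.
impute_neighbors <- function(x) {
  (fill_forward(x) + rev(fill_forward(rev(x)))) / 2
}

phones_gaps <- WorldPhones                  # matrix from base R (package datasets)
phones_gaps[c(2, 6), c(1, 2, 4)] <- NA      # same cells set to missing as in the post

imputed <- apply(phones_gaps, 2, impute_neighbors)
dimnames(imputed) <- dimnames(phones_gaps)  # restore the year labels on the rows

print(imputed)
```

For a value that is present, the forward and backward fills agree, so the average leaves it untouched; only the gaps are replaced.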

To recap, handling missing values plays an important role in data analysis and several imputation methods can be leveraged via the Transparency Layer or Embedded R. Further, Embedded R’s FUN.VALUE feature explicitly defines the output’s structure and allows for results to be immediately analyzed within Oracle R Enterprise.

The FUN.VALUE feature requires more tuning when the output comprises both numeric and character columns. Check back for a later post that explains how to define a data frame of ‘mixed class’.

