Monday Jan 04, 2016

ORE Random Forest

Random Forest is a popular ensemble learning technique for classification and regression, developed by Leo Breiman and Adele Cutler. By combining the ideas of “bagging” and random selection of variables, the algorithm produces a collection of decision trees with controlled variance, while avoiding overfitting – a common problem for decision trees. By constructing many trees, classification predictions are made by selecting the mode of classes predicted, while regression predictions are computed using the mean from the individual tree predictions.

Although the Random Forest algorithm provides high accuracy, performance and scalability can be issues for larger data sets. Oracle R Enterprise 1.5 introduces Random Forest for classification with three enhancements:

  •  ore.randomForest uses the ore.frame proxy for database tables so that data remain in the database server
  •  ore.randomForest executes in parallel for model building and scoring while using Oracle R Distribution or R’s randomForest package 4.6-10
  •  randomForest in Oracle R Distribution significantly reduces memory requirements of R’s algorithm, providing only the functionality required for use by ore.randomForest

Performance

Consider the model build performance of randomForest for 500 trees (the default) and three data set sizes (10K, 100K, and 1M rows). The formula is

'DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH'

using samples of the popular ONTIME domestic flight dataset.

With ORE’s parallel, distributed implementation, ore.randomForest is an order of magnitude faster than the commonly used randomForest package. While the first plot uses the original execution times, the second uses a log scale to facilitate interpreting scalability.

Memory vs. Speed
ore.randomForest is designed for speed, relying on ORE embedded R execution for parallelism to achieve the order of magnitude speedup. However, the data set is loaded into memory for each parallel R engine, so high degrees of parallelism (DOP) result in correspondingly higher memory use. Since Oracle R Distribution's randomForest improves memory usage over R's randomForest (approximately 7X less), larger data sets can be accommodated. Users can specify the DOP using the ore.parallel global option.

API

The ore.randomForest API:

ore.randomForest(formula, data, ntree=500, mtry = NULL,
                replace = TRUE, classwt = NULL, cutoff = NULL,
                sampsize = if(replace) nrow(data) else ceiling(0.632*nrow(data)),
                nodesize = 1L, maxnodes = NULL, confusion.matrix = FALSE,
                na.action = na.fail, ...)

To highlight two of the arguments, confusion.matrix is a logical value indicating whether to calculate the confusion matrix. Note that this confusion matrix is not based on OOB (out-of-bag) data; it is the result of applying the built random forest model to the entire training data.


The argument groups specifies the number of groups into which the total number of trees is divided during the model build. The default is equal to the value of the option 'ore.parallel'. If system memory is limited, it is recommended to set this argument large enough that the number of trees in each group is small, to avoid exceeding available memory.
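
A minimal sketch of these two arguments (assuming df is an ore.frame with no missing values in the model columns, as prepared in the example further below; the name of the confusion matrix component on the returned object is an assumption and may differ in your release):

mod <- ore.randomForest(DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER,
                        df, ntree=100,
                        confusion.matrix=TRUE,  # confusion matrix over the full training data
                        groups=20)              # build the 100 trees in 20 groups of 5
mod$confusion.matrix                            # component name assumed; see the class documentation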

Scoring with ore.randomForest follows other ORE scoring functions:

predict(object, newdata,
        type = c("response", "prob", "vote", "all"),
        norm.votes = TRUE,
        supplemental.cols = NULL,
        cache.model = TRUE, ...)

The arguments include:

  •  type: scoring output content – 'response', 'prob', 'vote', or 'all', corresponding to predicted values, a matrix of class probabilities, a matrix of vote counts, or both the vote matrix and predicted values, respectively.
  •  norm.votes: a logical value indicating whether the vote counts in the output vote matrix should be normalized. The argument is ignored if 'type' is 'response' or 'prob'.
  •  supplemental.cols: additional columns from the 'newdata' data set to include in the prediction result. This can be particularly useful for including a key column that can be related back to the original data set.
  •  cache.model: a logical value indicating whether the entire random forest model is cached in memory during prediction. While the default is TRUE, setting it to FALSE may be beneficial if memory is an issue.

Example


options(ore.parallel=8)
df <- ONTIME_S[,c("DAYOFWEEK","DEPDELAY","DISTANCE",
             "UNIQUECARRIER","DAYOFMONTH","MONTH")]
df <- df[complete.cases(df),]
mod <- ore.randomForest(DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH,
                        df, ntree=100, groups=20)
ans <- predict(mod, df, type="all", supplemental.cols="DAYOFWEEK")
head(ans)



R> options(ore.parallel=8)
R> df <- ONTIME_S[,c("DAYOFWEEK","DEPDELAY","DISTANCE",
            "UNIQUECARRIER","DAYOFMONTH","MONTH")]
R> df <- df[complete.cases(df),]
R> mod <- ore.randomForest(DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH,
+                 df, ntree=100,groups=20)
R> ans <- predict(mod, df, type="all", supplemental.cols="DAYOFWEEK")

R> head(ans)
     1    2    3    4    5    6    7 prediction DAYOFWEEK
1 0.09 0.01 0.06 0.04 0.70 0.05 0.05 5          5
2 0.06 0.01 0.02 0.03 0.01 0.38 0.49 7          6
3 0.11 0.03 0.16 0.02 0.06 0.57 0.05 6          6
4 0.09 0.04 0.15 0.03 0.02 0.62 0.05 6          6
5 0.04 0.04 0.04 0.01 0.06 0.72 0.09 6          6
6 0.35 0.11 0.14 0.27 0.05 0.08 0.00 1          1

Thursday Dec 24, 2015

Oracle R Enterprise 1.5 Released

We’re pleased to announce that Oracle R Enterprise (ORE) 1.5 is now available for download on all supported platforms with Oracle R Distribution 3.2.0 / R-3.2.0. ORE 1.5 introduces parallel distributed implementations of Random Forest, Singular Value Decomposition (SVD), and Principal Component Analysis (PCA) that operate on ore.frame objects. Performance enhancements are included for ore.summary summary statistics.

In addition, ORE 1.5 enhances embedded R execution with CLOB/BLOB data column support to enable larger text and non-character data to be transferred between Oracle Database and R. CLOB/BLOB support is also enabled for functions ore.push and ore.pull. The ORE R Script Repository now supports finer grained R script access control across schemas. Similarly, the ORE Datastore enables finer grained R object access control across schemas. For ore.groupApply in embedded R execution, ORE 1.5 now supports multi-column partitioning of data using the INDEX argument. Multiple bug fixes are also included in this release.

Here are the highlights for the new and upgraded features in ORE 1.5:

Upgraded R version compatibility

ORE 1.5 is certified with R-3.2.0 - both open source R and Oracle R Distribution. See the server support matrix for the complete list of supported R versions. R-3.2.0 brings improved performance, support for big in-memory data objects, and compatibility with more than 7000 community-contributed R packages.
For supporting packages, ORE 1.5 includes one new package, randomForest, with upgrades to other packages:

arules 1.1-9
Cairo 1.5-8
DBI 0.3-1
png 0.1-7
ROracle 1.2-1
statmod 1.4.21
randomForest 4.6-10

Parallel and distributed algorithms

While the Random Forest algorithm provides high accuracy, performance and scalability can be issues for large data sets. ORE 1.5 introduces Random Forest in Oracle R Distribution with two enhancements: first, a revision to reduce memory requirements of the open source randomForest algorithm; and second, the function ore.randomForest that executes in parallel for model building and scoring while using the underlying randomForest function either from Oracle R Distribution or R’s randomForest package 4.6-10. ore.randomForest uses ore.frame objects allowing data to remain in the database server.

The functions svd and prcomp have been overloaded to execute in parallel and accept ore.frame objects. Users now get in-database execution of this functionality to improve scalability and performance – no data movement.
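
For example, a minimal sketch of the overloaded functions (DAT below is an assumed name for a numeric ore.frame already in the database; the calls mirror the base R functions):

pca <- prcomp(DAT, scale. = TRUE)   # in-database PCA on the ore.frame - no data movement
summary(pca)                        # standard deviations and proportion of variance
sv <- svd(DAT)                      # in-database singular value decomposition
sv$d                                # singular values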

Performance enhancements

ore.summary performance enhancements support executions that are 30x faster than in previous releases.

Capability enhancements

The ore.grant and ore.revoke functions enable users to grant, and later revoke, read access for other users to their R scripts in the R script repository and to individual datastores.
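
A hypothetical usage sketch follows; the argument names and type values shown are assumptions, so consult the Oracle R Enterprise 1.5 reference for the exact ore.grant and ore.revoke signatures:

ore.grant(name = "myScoringScript", type = "rqscript", user = "SCOTT")   # share an R script
ore.grant(name = "myDatastore", type = "datastore", user = "SCOTT")      # share a datastore
ore.revoke(name = "myScoringScript", type = "rqscript", user = "SCOTT")  # withdraw access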

The database data types CLOB and BLOB are now supported for input and output in embedded R execution invocations, as well as for the functions ore.pull and ore.push.

For the embedded R execution function ore.groupApply, users can now specify multiple columns for automatically partitioning data via the INDEX argument, as shown in the sketch below.
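
For instance, a minimal sketch (assuming the ONTIME_S ore.frame is available with ORIGIN and DEST columns, and that the per-group function simply counts rows):

res <- ore.groupApply(ONTIME_S,
                      INDEX = ONTIME_S[, c("ORIGIN","DEST")],  # multi-column partitioning
                      function(dat) {
                        # each invocation receives one ORIGIN/DEST partition as a data.frame
                        data.frame(ORIGIN = dat$ORIGIN[1],
                                   DEST   = dat$DEST[1],
                                   N      = nrow(dat))
                      },
                      parallel = TRUE)
head(ore.pull(res))   # pull a few of the per-partition results to the client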

For a complete list of new features, see the Oracle R Enterprise User's Guide. To learn more about Oracle R Enterprise, visit Oracle R Enterprise on Oracle's Technology Network, or review the variety of use cases on the Oracle R Technologies blog.

Thursday Nov 19, 2015

Using RStudio Shiny with ORE for interactive analysis and visualization - Part I

Shiny, by RStudio, is a popular web application framework for R. It can be used, for example, for building flexible interactive analyses and visualization solutions without requiring web development skills and knowledge of Javascript, HTML, CSS, etc. An overview of its capabilities with numerous examples is available on RStudio's Shiny web site.

In this blog we illustrate a simple Shiny application for processing and visualizing data stored in Oracle Database for the special case where this data is wide and shallow. In this use case, analysts wish to interactively compute and visualize correlations between many variables (up to 40k) for genomics-related data where the number of observations is small, typically around 1000 cases. A similar use case was discussed in a recent blog Consolidating wide and shallow data with ORE Datastore, which addressed the performance aspects of reading and saving wide data from/to an Oracle R Enterprise (ORE) datastore. ORE allows users to store any type of R objects, including wide data.frames, directly in an Oracle Database using ORE datastores.
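
For example, saving and restoring a wide data.frame through a datastore is a two-call pattern; a minimal sketch (the object and datastore names below are illustrative):

wide.df <- as.data.frame(matrix(rnorm(1000 * 5000), nrow = 1000))  # 1000 cases x 5000 variables
ore.save(wide.df, name = "wide_ds", overwrite = TRUE)  # persist the R object in the database
rm(wide.df)
ore.load(name = "wide_ds")                             # restores wide.df into the R session
dim(wide.df)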


Our shiny application, invoked at the top level via the shinyApp() command below, has two components: a user-interface (ui) definition myui and a server function myserver.

R> library("shiny")
R> shinyApp(ui = myui, server = myserver)

The functionality is very basic. The user specifies the dataset to be analyzed, the sets of variables to correlate against each other, the correlation method (Pearson, Spearman, etc.), and the treatment of missing data, as handled by the 'method' and 'use' arguments, respectively, of R's cor() function. The datasets are all wide (ncols > 1000) and have already been saved in an ORE datastore.
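
For reference, the two cor() arguments driven by the ui map directly onto base R; a small standalone illustration (random data standing in for one of the wide datasets):

set.seed(42)
x <- matrix(rnorm(1000 * 6), ncol = 6)   # 1000 observations, 6 variables
x[sample(length(x), 50)] <- NA           # introduce some missing values
# 'method' selects the correlation type, 'use' the treatment of missing data
cor(x[, 1], x[, 2:6], method = "spearman", use = "pairwise.complete.obs")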

Partial code for the myui user interface is shown below. The sidebarPanel section handles the input, and the correlation values are displayed graphically by plotOutput() within the mainPanel section. The 'corrPlot' argument corresponds to the plot object generated by the server component.


R> myui <- fluidPage(...,   
  sidebarLayout(      
    sidebarPanel(
      selectInput("TblName",
                  label   = "Data Sets",
                  choices = c(...,...,...),                   
                  selected = (...)),
      radioButtons("corrtyp",
                   label = "Correlation Type",
                   choices = list("Pearson"="pearson",
                                  "Spearman"="spearman",
                                  "Kendall"="kendall"),
                   selected = "pearson"),
      selectInput("use",
                  label   = "Use",
                  choices = c("everything","all.obs","complete.obs",
                  
"na.or.complete","pairwise.complete.obs"),                   

                  selected = "everything"),      
      textInput("xvars",label=...,value=...),
      textInput("yvars",label=...,value=...) ,
      submitButton("Submit")
    ),  
    mainPanel( plotOutput("corrPlot") )       
))


The server component consists of two functions:


  • The server function myserver(), passed as argument during the top level invocation of shinyApp(). myserver returns the output$corrPlot object (see the ui component) generated by shiny's rendering function renderPlot(). The plot object p is generated within renderPlot() by calling ore.doEval() for the embedded R execution of gener_corr_plot(). The ui input selections are passed to gener_corr_plot() through ore.doEval().


  • R> myserver <- function(input, output) {   
      output$corrPlot <- renderPlot({
        p<- ore.pull(ore.doEval(
          FUN.NAME="gener_corr_plot",
          TblName=input$TblName,
          corrtyp=input$corrtyp,
          use=input$use,
          xvartxt=input$xvars,
          yvartxt=input$yvars,
          ds.name="...",
          ore.connect=TRUE))
        print(p)
      })    
    }


  • A core function gener_corr_plot(), which combines loading data from the ORE datastore, invoking the R cor() function with all arguments specified in the ui and generating the plot for the resulting correlation values.


  • R> gener_corr_plot <- function(TblName,corrtyp="pearson",use="everything",
                                xvars=...,yvars=...,ds.name=...) {
           library("ggplot2")
           ore.load(name=ds.name,list=c(TblName))
           res <- cor(get(TblName)[,filtering_based_on_xvars],
                      get(TblName)[,filtering_based_on_yvars],
                      method=corrtyp,use=use)
           p <- qplot(y=c(res)....)
    }

    The result of invoking shinyApp() is illustrated in the figure below. The data picked for this display is DOROTHEA, a drug discovery dataset from the UCI Machine Learning Repository, and the plot shows the Pearson correlation between variable 1 and the first 2000 variables of the dataset. The variables are labeled here generically, by their index number.



    For the type of data considered by the customer (wide tables with 10k-40k columns and ~ 1k rows) a complete iteration (data loading, correlation calculations between one variable and all others & plot rendering) takes from a few seconds up to a few dozen seconds, depending on the correlation method and the handling of missing values. This is considerably faster than the many minutes reported by customers running their code in memory with data loaded from CSV files.


    Several questions may come to mind when thinking about such Shiny-based applications. For example:

  • When and how to decouple the analysis from data reading?

  • How to build chains of reactive components for more complex applications?

  • What is an efficient rendering mechanism for multiple plotting methods and increasing data sizes?

  • What other options for handling data with more than 1000 columns exist? (nested tables?)

    We will address these in future posts. In this Part 1 we illustrated that it is easy to construct a Shiny-based interactive application for wide data by leveraging ORE's datastore capability and support for embedded R execution. Besides improved performance, this solution offers the security, auditing, backup and recovery capabilities of Oracle Database.



    Oracle R Distribution 3.2.0 Benchmarks


    We recently updated the Oracle R Distribution (ORD) benchmarks using ORD version 3.2.0. ORD is Oracle's free distribution of the open source R environment that adds support for dynamically loading the Intel Math Kernel Library (MKL) installed on your system. MKL provides faster performance by taking advantage of hardware-specific math library implementations.

    The benchmark results demonstrate the performance of Oracle R Distribution 3.2.0 with and without dynamically loaded MKL. We executed the community-based R-Benchmark-25 script, which consists of a set of tests that benefit from faster matrix computations. The tests were run on a 24 core Linux system with 3.07 GHz per CPU and 47 GB RAM. We previously executed the same scripts against ORD 2.15.1 and ORD 3.0.1 on similar hardware.
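
    As a rough illustration of the kind of linear algebra these tests exercise (this is not the R-Benchmark-25 script itself, just a quick probe you can time under ORD with and without MKL loaded):

    set.seed(1)
    n <- 2000
    A <- matrix(rnorm(n * n), n, n)
    system.time(B <- crossprod(A))            # matrix multiplication, offloaded to MKL when available
    system.time(d <- svd(A, nu = 0, nv = 0))  # singular values only
    system.time(C <- chol(crossprod(A)))      # Cholesky factorization of a positive definite matrix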

      Oracle R Distribution 3.2.0 Benchmarks (Time in seconds)


    Speedup = (Slower time / Faster Time) - 1

    The first graph focuses on shorter running tests.  We see significant performance improvements - SVD with ORD + MKL executes 20 times faster using 4 threads, and 30 times faster using 8 threads. For Cholesky Factorization, ORD + MKL is 15 and 27 times faster for 4 and 8 threads, respectively.



    In the second graph, we focus on the longer running tests. Principal components analysis is 30 and almost 38 times faster for ORD + MKL with 4 and 8 threads, respectively, matrix multiplication is 80 and 139 times faster, and linear discriminant analysis is almost 4 times faster.





    By using Oracle R Distribution with MKL, you will see a notable performance boost for many R applications. These improvements happened with the exact same R code, without requiring any linking steps or updating any packages. Oracle R Distribution for R-3.2.0 is available for download on Oracle's Open Source Software portal. Oracle offers support for users of Oracle R Distribution on Windows, Linux, AIX and Solaris 64 bit platforms.




    Friday Oct 23, 2015

    Oracle R Enterprise Performance on Intel® DC P3700 Series SSDs


    Solid-state drives (SSDs) are becoming increasingly popular in enterprise storage systems, providing large caches, permanent storage and low latency. A recent study aimed to characterize the performance of Oracle R Enterprise workloads on the Intel® P3700 SSD versus hard disk drives (HDDs), with IO-WAIT as the key metric of interest. The study showed that Intel® DC P3700 Series SSDs reduced I/O latency for Oracle R Enterprise workloads, most notably when saving objects to Oracle R Enterprise datastore and materializing scoring results.

    The test environment was a 2-node Red Hat Linux 6.5 cluster, each node containing a 2 TB Intel® DC P3700 Series SSD with an attached 2 TB HDD. As the primary objective was to identify applications that could benefit from SSDs as-is, test results show the performance gain without system modification or tuning. The tests for the study consisted of a set of single analytic workloads composed of I/O, computational, and memory intensive tests. The tests were run both serially and in parallel up to 100 jobs, providing a good representation of typical workloads for Oracle R Enterprise customers.

    The figures below show execution time results for datastore save and load, and materializing model scoring results to a database table. The datastore performance is notably improved for the larger test sets (100 million and 1 billion rows). 
    For these workloads, execution time was reduced by an average of 46% with a maximum of 67% compared to the HDD attached to the same cluster. 


    Figure 1: Saving and loading objects to and from the ORE datastore, HDD vs. Intel® P3700 Series SSD



    Figure 2: Model scoring and materializing results, HDD vs. Intel® P3700 Series SSD




    The entire set of test results shows that Intel® DC P3700 Series SSDs provide an average reduction of 30%, with a maximum of 67%, in execution time for I/O-heavy Oracle R Enterprise workloads. These results could potentially be improved by working with hardware engineers to tune the host kernel and other settings to optimize SSD performance. Intel® DC P3700 Series SSDs can increase storage I/O by reducing latency and increasing throughput, and it is recommended to explore system tuning options with your engineering team to achieve the best result.


    Thursday Aug 06, 2015

    Oracle R Advanced Analytics for Hadoop on the Fast Lane: Spark-based Logistic Regression and MLP Neural Networks

    This is the first in a series of blogs exploring the capabilities of the newly released Oracle R Advanced Analytics for Hadoop 2.5.0, part of Oracle Big Data Connectors, which includes two new algorithm implementations that can take advantage of an Apache Spark cluster for significant performance gains in model build and scoring time. These algorithms are a redesigned version of the Multi-Layer Perceptron Neural Networks (orch.neural) and a brand new implementation of a Logistic Regression model (orch.glm2).

    Through large scale benchmarks we are going to see the improvements in performance that the new custom algorithms bring to enterprise Data Science when running on top of a Hadoop Cluster with an available Apache Spark infrastructure.

    In this first part, we are going to compare only model build performance and feasibility of the new algorithms against the same algorithms running on Map-Reduce, and we are not going to be concerned with model quality or precision. Model scoring, quality and precision are going to be part of a future Blog.

    Documentation for the new components can be found in the product itself (with help and sample code), and also on the ORAAH 2.5.0 Documentation tab on OTN.

    Hardware and Software used for testing


    As a test bed, we are using an Oracle Big Data Appliance X3-2 cluster with 6 nodes. Each node has two 6-core Intel® Xeon® X5675 (3.07 GHz) processors, for a total of 12 cores (24 threads), and 96 GB of RAM.

    The BDA nodes run Oracle Enterprise Linux release 6.5, Cloudera Hadoop Distribution 5.3.0 (that includes Apache Spark release 1.2.0). 

    Each node also is running Oracle R Distribution release 3.1.1 and Oracle R Advanced Analytics for Hadoop release 2.5.0.

    Dataset used for Testing


    For the test we are going to use a classic dataset that consists of arrival and departure information for all major airports in the USA. The data is available online in different formats. The most commonly used version contains 123 million records and has been used for many benchmarks, originally cited by the American Statistical Association for their Data Expo 2009. We have augmented the data available in that file by downloading additional months of data from the official Bureau of Transportation Statistics website. Our starting point is this new dataset, which contains 159 million records and has information up to September 2014.

    For smaller tests, we created subsets of this dataset with 1, 10 and 100 million records. We also created a 1 billion-record dataset by appending the 159 million-record data repeatedly until we reached 1 billion records.

    Connecting to a Spark Cluster


    In release 2.5.0 we are introducing a new set of R commands that will allow the Data Scientist to request the creation of a Spark Context, in either YARN or Standalone modes.

    For this release, the Spark Context is exclusively used for accelerating the creation of Levels and Factor variables, the Model Matrix, the final solution to the Logistic and Neural Networks models themselves, and Scoring (in the case of GLM).

    The new commands are highlighted below:
    spark.connect(master, name = NULL, memory = NULL,  dfs.namenode = NULL)

    spark.connect() requires loading the ORCH library first to read the configuration of the Hadoop Cluster.

    The “master” variable can be specified as either “yarn-client” to use YARN for Resource Allocation, or the direct reference to a Spark Master service and port, in which case it will use Spark in Standalone Mode. 
    The “name” variable is optional, and it helps centralize logging of the session on the Spark Master. By default, the application name showing on the Spark Master is “ORCH”.
    The “memory” field indicates the amount of memory per Spark Worker to dedicate to this Spark Context.

    Finally, the dfs.namenode points to the Apache HDFS Namenode Server, in order to exchange information with HDFS.

    In summary, to establish a Spark connection, one could do:
    > spark.connect("yarn-client", memory="2g", dfs.namenode=”my.namenode.server.com")
    Conversely, to disconnect the Session after the work is done, you can use spark.disconnect() without options.

    The command spark.connected() checks the status of the current Session and contains the information of the connection to the Server. It is automatically called by the new algorithms to check for a valid connection.

    ORAAH 2.5.0 introduces support for loading data to Spark cache from an HDFS file via the function hdfs.toRDD(). ORAAH dfs.id objects were also extended to support both data residing in HDFS and in Spark memory, and allow the user to cache the HDFS data to an RDD object for use with the new algorithms.
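
    For example, a minimal sketch (assuming an HDFS CSV dataset at the path below and an active Spark connection):

    > dfs <- hdfs.attach("/user/oracle/ontime_1bi")  # dfs.id handle to the HDFS data
    > rdd <- hdfs.toRDD(dfs)                         # cache the data in Spark memory as an RDD
    > # rdd can now be passed to the new algorithms in place of the HDFS handle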

    For all the examples used in this blog, we set up the R session as follows:

    Oracle Distribution of R version 3.1.1 (--) -- "Sock it to Me"
    Copyright (C) The R Foundation for Statistical Computing
    Platform: x86_64-unknown-linux-gnu (64-bit)

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.

    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.

    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.

    You are using Oracle's distribution of R. Please contact
    Oracle Support for any problems you encounter with this
    distribution.

    [Workspace loaded from ~/.RData]

    > library(ORCH)
    Loading required package: OREbase

    Attaching package: ‘OREbase’

    The following objects are masked from ‘package:base’:

    cbind, data.frame, eval, interaction, order, paste, pmax, pmin, rbind, table

    Loading required package: OREstats
    Loading required package: MASS
    Loading required package: ORCHcore
    Loading required package: rJava
    Oracle R Connector for Hadoop 2.5.0 (rev. 307)
    Info: using native C base64 encoding implementation
    Info: Hadoop distribution is Cloudera's CDH v5.3.0
    Info: using auto-detected ORCH HAL v4.2
    Info: HDFS workdir is set to "/user/oracle"
    Warning: mapReduce checks are skipped due to "ORCH_MAPRED_CHECK"=FALSE
    Warning: HDFS checks are skipped due to "ORCH_HDFS_CHECK"=FALSE
    Info: Hadoop 2.5.0-cdh5.3.0 is up
    Info: Sqoop 1.4.5-cdh5.3.0 is up
    Info: OLH 3.3.0 is up
    Info: loaded ORCH core Java library "orch-core-2.5.0-mr2.jar"
    Loading required package: ORCHstats
    >
    > # Spark Context Creation
    > spark.connect(master="spark://my.spark.server:7077", memory="24G",dfs.namenode="my.dfs.namenode.server")
    >


    In this case, we are requesting the usage of 24 GB of RAM per node in Standalone Mode. Since our BDA has 6 nodes, the total RAM assigned to our Spark Context is 144 GB, which can be verified in the Spark Master screen shown below.


    GLM – Logistic Regression


    In this release, because of a totally overhauled computation engine, we created a new function called orch.glm2() that executes the Logistic Regression model exclusively on Apache Spark. The input data expected by the algorithm is an ORAAH dfs.id object, which means an HDFS CSV dataset, a HIVE table made compatible by using the hdfs.fromHive() command, or an HDFS CSV dataset that has been cached into Apache Spark as an RDD object using the command hdfs.toRDD().

    A simple example of the new algorithm running on the ONTIME dataset with 1 billion records is shown below. The objective of the test model is the prediction of cancelled flights. The new model requires factor variables to be indicated with F() in the formula, and the default (and only family available in this release) is binomial().

    The R code and the output below assumes that the connection to the Spark Cluster is already done.

    > # Attaches the HDFS file for use within R
    > ont1bi <- hdfs.attach("/user/oracle/ontime_1bi")
    > # Checks the size of the Dataset
    > hdfs.dim(ont1bi)
     [1] 1000000000         30
    > # Testing the GLM Logistic Regression Model on Spark
    > # Formula definition: Cancelled flights (0 or 1) based on other attributes

    > form_oraah_glm2 <- CANCELLED ~ DISTANCE + ORIGIN + DEST + F(YEAR) + F(MONTH) +
    +   F(DAYOFMONTH) + F(DAYOFWEEK)

    > # ORAAH GLM2 Computation from HDFS data (computing factor levels on its own)

    > system.time(m_spark_glm <- orch.glm2(formula=form_oraah_glm2, ont1bi))
     ORCH GLM: processed 6 factor variables, 25.806 sec
     ORCH GLM: created model matrix, 100128 partitions, 32.871 sec
     ORCH GLM: iter  1,  deviance   1.38433414089348300E+09,  elapsed time 9.582 sec
     ORCH GLM: iter  2,  deviance   3.39315388583931150E+08,  elapsed time 9.213 sec
     ORCH GLM: iter  3,  deviance   2.06855738812683250E+08,  elapsed time 9.218 sec
     ORCH GLM: iter  4,  deviance   1.75868100359263200E+08,  elapsed time 9.104 sec
     ORCH GLM: iter  5,  deviance   1.70023181759611580E+08,  elapsed time 9.132 sec
     ORCH GLM: iter  6,  deviance   1.69476890425481350E+08,  elapsed time 9.124 sec
     ORCH GLM: iter  7,  deviance   1.69467586045954760E+08,  elapsed time 9.077 sec
     ORCH GLM: iter  8,  deviance   1.69467574351380850E+08,  elapsed time 9.164 sec
    user  system elapsed
    84.107   5.606 143.591 

    > # Shows the general features of the GLM Model
    > summary(m_spark_glm)
                   Length Class  Mode   
    coefficients   846    -none- numeric
    deviance         1    -none- numeric
    solutionStatus   1    -none- character
    nIterations      1    -none- numeric
    formula          1    -none- character
    factorLevels     6    -none- list  

    A sample benchmark against the same models running on Map-Reduce is illustrated below. The Map-Reduce models used the call orch.glm(formula, dfs.id, family=binomial()) and used as.factor() in the formula.


    We can see that the Spark-based GLM2 is capable of a large performance advantage over the model executing in Map-Reduce.

    Later in this Blog we are going to see the performance of the Spark-based GLM Logistic Regression on 1 billion records.

    Linear Model with Neural Networks


    For the MLP Neural Networks model, the same algorithm was adapted to execute using Spark caching. The exact same code and function call recognize whether there is a connection to a Spark Context and, if so, execute the computations using it.

    In this case, the code for both the Map-Reduce and the Spark-based executions is exactly the same, with the exception of the spark.connect() call that is required for the Spark-based version to kick in.

    The objective of the Test Model is the prediction of Arrival Delays of Flights in minutes, so the model class is a Regression Model. The R code used to run the benchmarks and the output is below, and it assumes that the connection to the Spark Cluster is already done.

    > # Attaches the HDFS file for use within R
    > ont1bi <- hdfs.attach("/user/oracle/ontime_1bi")

    > # Checks the size of the Dataset
    > hdfs.dim(ont1bi)
     [1] 1000000000         30

    > # Testing Neural Model on Spark
    > # Formula definition: Arrival Delay based on other attributes

    > form_oraah_neu <- ARRDELAY ~ DISTANCE + ORIGIN + DEST + as.factor(MONTH) +
    +   as.factor(YEAR) + as.factor(DAYOFMONTH) + as.factor(DAYOFWEEK)

    > # Compute Factor Levels from HDFS data
    > system.time(xlev <- orch.getXlevels(form_oraah_neu, dfs.dat = ont1bi))
        user  system elapsed
    17.717   1.348  50.495

    > # Compute and Cache the Model Matrix from HDFS data, passing factor levels
    > system.time(Mod_Mat <- orch.prepare.model.matrix(form_oraah_neu, dfs.dat = ont1bi,xlev=xlev))
       user  system elapsed
    17.933   1.524  95.624

    > # Compute Neural Model from RDD cached Model Matrix
    > system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat, xlev=xlev, trace=T))
    Unconstrained Nonlinear Optimization
    L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
    Iter           Objective Value   Grad Norm        Step   Evals
      1   5.08900838381104858E+11   2.988E+12   4.186E-16       2
      2   5.08899723803646790E+11   2.987E+12   1.000E-04       3
      3   5.08788839748061768E+11   2.958E+12   1.000E-02       3
      4   5.07751213455999573E+11   2.662E+12   1.000E-01       4
      5   5.05395855303159180E+11   1.820E+12   3.162E-01       1
      6   5.03327619811536194E+11   2.517E+09   1.000E+00       1
      7   5.03327608118144775E+11   2.517E+09   1.000E+00       6
      8   4.98952182330299011E+11   1.270E+12   1.000E+00       1
      9   4.95737805642779968E+11   1.504E+12   1.000E+00       1
     10   4.93293224063758362E+11   8.360E+11   1.000E+00       1
     11   4.92873433106044373E+11   1.989E+11   1.000E+00       1
     12   4.92843500119498352E+11   9.659E+09   1.000E+00       1
     13   4.92843044802041565E+11   6.888E+08   1.000E+00       1
    Solution status             Optimal (objMinProgress)
    Number of L-BFGS iterations 13
    Number of objective evals   27
    Objective value             4.92843e+11
    Gradient norm               6.88777e+08
       user  system elapsed
    43.635   4.186  63.319 

    > # Checks the general information of the Neural Network Model
    > mod_neu
    Number of input units      845
    Number of output units     1
    Number of hidden layers    0
    Objective value            4.928430E+11
    Solution status            Optimal (objMinProgress)
    Output layer               number of neurons 1, activation 'linear'
    Optimization solver        L-BFGS
    Scale Hessian inverse      1
    Number of L-BFGS updates   20
    > mod_neu$nObjEvaluations
    [1] 27
    > mod_neu$nWeights
    [1] 846

    A sample benchmark against the same models running on Map-Reduce is illustrated below. The Map-Reduce models used the exact same orch.neural() calls as the Spark-based ones, with only the Spark connection as a difference.

    We can clearly see that the larger the dataset, the larger the difference in speeds of the Spark-based computation compared to the Map-Reduce ones, reducing the times from many hours to a few minutes.

    This new performance makes it possible to run much larger problems and to test several models on 1 billion records, something that previously took half a day just to run one model.

    Logistic and Deep Neural Networks with 1 billion records


    To prove that it is now feasible not only to run Logistic and Linear Model Neural Networks on large scale datasets, but also complex Multi-Layer Neural Network Models, we decided to test the same 1 billion record dataset against several different architectures.

    These tests were done to check for performance and feasibility of these types of models, and not for comparison of precision or quality, which will be part of a different Blog.

    The default activation function for all Multi-Layer Neural Network models was used, which is the bipolar sigmoid function; the default output layer activation, the linear function, was also used.

    As a reminder, the number of weights we need to compute for a Neural Networks is as follows:

    The generic formulation for the number of weights to be computed is then:
    Total number of weights = sum, over all layers from the first hidden layer to the output layer, of [(number of inputs into the layer + 1) * (number of neurons in the layer)]
    In the simple example, we had [(3 inputs + 1 bias) * 2 neurons] + [(2 neurons + 1 bias) * 1 output] = 8 + 3 = 11 weights

    In our tests for the Simple Neural Network model (Linear Model), using the same formula, we can see that we were computing 846 weights, because it is using 845 inputs plus the Bias.

    Thus, to calculate the number of weights necessary for the Deep Multi-layer Neural Networks that we are about to Test below, we have the following:

    MLP 3 Layers (50,25,10) => [(845+1)*50]+[(50+1)*25]+[(25+1)*10]+[(10+1)*1] = 43,846 weights

    MLP 4 Layers (100,50,25,10) => [(845+1)*100]+[(100+1)*50]+[(50+1)*25]+[(25+1)*10]+[(10+1)*1] = 91,196 weights

    MLP 5 Layers (200,100,50,25,10) => [(845+1)*200]+[(200+1)*100]+[(100+1)*50]+[(50+1)*25]+ [(25+1)*10]+[(10+1)*1] = 195,896 weights
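
    These totals can be verified with a small helper function in plain R (the 845 inputs and the single linear output unit are taken from the models above):

    nn.weights <- function(n.inputs, hidden, n.outputs = 1) {
      sizes <- c(n.inputs, hidden, n.outputs)
      # (inputs into each layer + 1 bias) * neurons in that layer, summed over all layers
      sum((head(sizes, -1) + 1) * tail(sizes, -1))
    }
    nn.weights(845, c(50, 25, 10))            # 43846
    nn.weights(845, c(100, 50, 25, 10))       # 91196
    nn.weights(845, c(200, 100, 50, 25, 10))  # 195896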

    The times required to compute the GLM Logistic Regression model that predicts flight cancellations on 1 billion records are included just as an illustration of the performance of the new Spark-based algorithms.

    The Neural Network Models are all predicting Arrival Delay of Flights, so they are either Linear Models (the first one, with no Hidden Layers) or Non-linear Models using the bipolar sigmoid activation function (the Multi-Layer ones).


    This demonstrates that the capability of building very complex and deep networks is available with ORAAH, and it makes it possible to build networks with hundreds of thousands or millions of weights for more complex problems.

    Not only that, but a Logistic Model can be computed on 1 billion records in less than 2 and a half minutes, and a Linear Neural Model in almost 3 minutes.

    The R Output Listing of the Logistic Regression computation and of the MLP Neural Networks are below.

    > # Spark Context Creation
    > spark.connect(master="spark://my.spark.server:7077", memory="24G",dfs.namenode="my.dfs.namenode.server")

    > # Attaches the HDFS file for use with ORAAH
    > ont1bi <- hdfs.attach("/user/oracle/ontime_1bi")

    > # Checks the size of the Dataset
    > hdfs.dim(ont1bi)
    [1] 1000000000         30

    GLM - Logistic Regression


    > # Testing GLM Logistic Regression on Spark
    > # Formula definition: Cancellation of Flights in relation to other attributes
    > form_oraah_glm2 <- CANCELLED ~ DISTANCE + ORIGIN + DEST + F(YEAR) + F(MONTH) +
    +   F(DAYOFMONTH) + F(DAYOFWEEK)

    > # ORAAH GLM2 Computation from RDD cached data (computing factor levels on its own)
    > system.time(m_spark_glm <- orch.glm2(formula=form_oraah_glm2, ont1bi))
     ORCH GLM: processed 6 factor variables, 25.806 sec
     ORCH GLM: created model matrix, 100128 partitions, 32.871 sec
     ORCH GLM: iter  1,  deviance   1.38433414089348300E+09,  elapsed time 9.582 sec
     ORCH GLM: iter  2,  deviance   3.39315388583931150E+08,  elapsed time 9.213 sec
     ORCH GLM: iter  3,  deviance   2.06855738812683250E+08,  elapsed time 9.218 sec
     ORCH GLM: iter  4,  deviance   1.75868100359263200E+08,  elapsed time 9.104 sec
     ORCH GLM: iter  5,  deviance   1.70023181759611580E+08,  elapsed time 9.132 sec
     ORCH GLM: iter  6,  deviance   1.69476890425481350E+08,  elapsed time 9.124 sec
     ORCH GLM: iter  7,  deviance   1.69467586045954760E+08,  elapsed time 9.077 sec
     ORCH GLM: iter  8,  deviance   1.69467574351380850E+08,  elapsed time 9.164 sec
    user  system elapsed
    84.107   5.606 143.591

    > # Checks the general information of the GLM Model
    > summary(m_spark_glm)
                   Length Class  Mode   
    coefficients   846    -none- numeric
    deviance         1    -none- numeric
    solutionStatus   1    -none- character
    nIterations      1    -none- numeric
    formula          1    -none- character
    factorLevels     6    -none- list 

    Neural Networks - Initial Steps


    For the Neural Models, we have to add the times for computing the Factor Levels plus the time for creating the Model Matrix to the Total elapsed time of the Model computation itself.

    > # Testing Neural Model on Spark
    > # Formula definition
    > form_oraah_neu <- ARRDELAY ~ DISTANCE + ORIGIN + DEST + as.factor(MONTH) +
    +   as.factor(YEAR) + as.factor(DAYOFMONTH) + as.factor(DAYOFWEEK)
    >
    > # Compute Factor Levels from HDFS data
    > system.time(xlev <- orch.getXlevels(form_oraah_neu, dfs.dat = ont1bi))
      user  system elapsed
    12.598   1.431  48.765

    >
    > # Compute and Cache the Model Matrix from cached RDD data
    > system.time(Mod_Mat <- orch.prepare.model.matrix(form_oraah_neu, dfs.dat = ont1bi,xlev=xlev))
      user  system elapsed
      9.032   0.960  92.953

    Neural Networks Model with 3 Layers of Neurons


    > # Compute DEEP Neural Model from RDD cached Model Matrix (passing xlevels)
    > # Three Layers, with 50, 25 and 10 neurons respectively.

    > system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat,
    +                                    xlev=xlev, hiddenSizes=c(50,25,10),trace=T))
    Unconstrained Nonlinear Optimization
    L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
    Iter           Objective Value   Grad Norm        Step   Evals
      0   5.12100202340115967E+11   5.816E+09   1.000E+00       1
      1   4.94849165811250305E+11   2.730E+08   1.719E-10       1
      2   4.94849149028958862E+11   2.729E+08   1.000E-04       3
      3   4.94848409777413513E+11   2.702E+08   1.000E-02       3
      4   4.94841423640935242E+11   2.437E+08   1.000E-01       4
      5   4.94825372589270386E+11   1.677E+08   3.162E-01       1
      6   4.94810879175052673E+11   1.538E+07   1.000E+00       1
      7   4.94810854064597107E+11   1.431E+07   1.000E+00       1
    Solution status             Optimal (objMinProgress)
    Number of L-BFGS iterations 7
    Number of objective evals   15
    Objective value             4.94811e+11
    Gradient norm               1.43127e+07
        user   system  elapsed
      91.024    8.476 1975.947

    >
    > # Checks the general information of the Neural Network Model
    > mod_neu
    Number of input units      845
    Number of output units     1
    Number of hidden layers    3
    Objective value            4.948109E+11
    Solution status            Optimal (objMinProgress)
    Hidden layer [1]           number of neurons 50, activation 'bSigmoid'
    Hidden layer [2]           number of neurons 25, activation 'bSigmoid'
    Hidden layer [3]           number of neurons 10, activation 'bSigmoid'
    Output layer               number of neurons 1, activation 'linear'
    Optimization solver        L-BFGS
    Scale Hessian inverse      1
    Number of L-BFGS updates   20
    > mod_neu$nObjEvaluations
    [1] 15
    > mod_neu$nWeights
    [1] 43846
    >

    Neural Networks Model with 4 Layers of Neurons


    > # Compute DEEP Neural Model from RDD cached Model Matrix (passing xlevels)
    > # Four Layers, with 100, 50, 25 and 10 neurons respectively.

    > system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat,
    +                                    xlev=xlev, hiddenSizes=c(100,50,25,10),trace=T))
    Unconstrained Nonlinear Optimization
    L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
    Iter           Objective Value   Grad Norm        Step   Evals
       0   5.15274440087001343E+11   7.092E+10   1.000E+00       1
       1   5.10168177067538818E+11   2.939E+10   1.410E-11       1
       2   5.10086354184862549E+11   5.467E+09   1.000E-02       2
       3   5.10063808510261475E+11   5.463E+09   1.000E-01       4
       4   5.07663007172408386E+11   5.014E+09   3.162E-01       1
       5   4.97115989230861267E+11   2.124E+09   1.000E+00       1
       6   4.94859162124700928E+11   3.085E+08   1.000E+00       1
       7   4.94810727630636963E+11   2.117E+07   1.000E+00       1
       8   4.94810490064279846E+11   7.036E+06   1.000E+00       1
    Solution status             Optimal (objMinProgress)
    Number of L-BFGS iterations 8
    Number of objective evals   13
    Objective value             4.9481e+11
    Gradient norm               7.0363e+06
        user   system  elapsed
    166.169   19.697 6467.703
    >
    > # Checks the general information of the Neural Network Model
    > mod_neu
    Number of input units      845
    Number of output units     1
    Number of hidden layers    4
    Objective value            4.948105E+11
    Solution status            Optimal (objMinProgress)
    Hidden layer [1]           number of neurons 100, activation 'bSigmoid'
    Hidden layer [2]           number of neurons 50, activation 'bSigmoid'
    Hidden layer [3]           number of neurons 25, activation 'bSigmoid'
    Hidden layer [4]           number of neurons 10, activation 'bSigmoid'
    Output layer               number of neurons 1, activation 'linear'
    Optimization solver        L-BFGS
    Scale Hessian inverse      1
    Number of L-BFGS updates   20
    > mod_neu$nObjEvaluations
    [1] 13
    > mod_neu$nWeights
    [1] 91196

    Neural Networks Model with 5 Layers of Neurons


    > # Compute DEEP Neural Model from RDD cached Model Matrix (passing xlevels)
    > # Five Layers, with 200, 100, 50, 25 and 10 neurons respectively.

    > system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat,
    +                                    xlev=xlev, hiddenSizes=c(200,100,50,25,10),trace=T))

    Unconstrained Nonlinear Optimization
    L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
    Iter           Objective Value   Grad Norm        Step   Evals
      0   5.14697806831633850E+11   6.238E+09   1.000E+00       1
    ......
    ......

     6   4.94837221890043518E+11   2.293E+08   1.000E+00  1                                                                   
     7   4.94810299190365112E+11   9.268E+06   1.000E+00       1
     8   4.94810277908935242E+11   8.855E+06   1.000E+00       1
    Solution status             Optimal (objMinProgress)
    Number of L-BFGS iterations 8
    Number of objective evals   16
    Objective value             4.9481e+11
    Gradient norm               8.85457e+06
         user    system   elapsed
      498.002    90.940 30473.421
    >
    > # Checks the general information of the Neural Network Model
    > mod_neu
    Number of input units      845
    Number of output units     1
    Number of hidden layers    5
    Objective value            4.948103E+11
    Solution status            Optimal (objMinProgress)
    Hidden layer [1]           number of neurons 200, activation 'bSigmoid'
    Hidden layer [2]           number of neurons 100, activation 'bSigmoid'
    Hidden layer [3]           number of neurons 50, activation 'bSigmoid'
    Hidden layer [4]           number of neurons 25, activation 'bSigmoid'
    Hidden layer [5]           number of neurons 10, activation 'bSigmoid'
    Output layer               number of neurons 1, activation 'linear'
    Optimization solver        L-BFGS
    Scale Hessian inverse      1
    Number of L-BFGS updates   20
    > mod_neu$nObjEvaluations
    [1] 16
    > mod_neu$nWeights
    [1] 195896

    Best Practices on logging level for using Apache Spark with ORAAH


    It is important to note that Apache Spark’s log is verbose by default, so after a few tests with different settings it might be useful to turn down the level of logging, something a system administrator will typically do by editing the file $SPARK_HOME/etc/log4j.properties (see the examples below).

    By default, that file is going to look something like this:

    # cat $SPARK_HOME/etc/log4j.properties
    # Set everything to be logged to the console
    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

    # Settings to quiet third party logs that are too verbose
    log4j.logger.org.eclipse.jetty=INFO
    log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
    log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

    A typical full log provides the information below, but it can also produce too much logging when running the models themselves, so it is most useful for the first tests and diagnostics.

    > # Creates the Spark Context. Because the Memory setting is not specified,
    > # the default of 1 GB of RAM per Spark Worker is used
    > spark.connect("yarn-client", dfs.namenode="my.hdfs.namenode")
    15/02/18 13:05:44 WARN SparkConf:
    SPARK_JAVA_OPTS was detected (set to '-Djava.library.path=/usr/lib64/R/lib'). This is deprecated in Spark 1.0+.
    Please instead use:
    - ./spark-submit with conf/spark-defaults.conf to set defaults for an application
    - ./spark-submit with --driver-java-options to set -X options for a driver
    - spark.executor.extraJavaOptions to set -X options for executors
    -
     SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
    15/02/18 13:05:44 WARN SparkConf: Setting 'spark.executor.extraJavaOptions' to '- Djava.library.path=/usr/lib64/R/lib' as a work-around.
    15/02/18 13:05:44 WARN SparkConf: Setting 'spark.driver.extraJavaOptions' to '- Djava.library.path=/usr/lib64/R/lib' as a work-around
    15/02/18 13:05:44 INFO SecurityManager: Changing view acls to: oracle
    15/02/18 13:05:44 INFO SecurityManager: Changing modify acls to: oracle
    15/02/18 13:05:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(oracle); users with modify permissions: Set(oracle)
    15/02/18 13:05:44 INFO Slf4jLogger: Slf4jLogger started
    15/02/18 13:05:44 INFO Remoting: Starting remoting
    15/02/18 13:05:45 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@my.spark.master:35936]
    15/02/18 13:05:45 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@my.spark.master:35936]
    15/02/18 13:05:46 INFO SparkContext: Added JAR /u01/app/oracle/product/12.2.0/dbhome_1/R/library/ORCHcore/java/orch-core-2.4.1-mr2.jar at http://1.1.1.1:11491/jars/orch-core-2.4.1-mr2.jar with timestamp 1424264746100
    15/02/18 13:05:46 INFO SparkContext: Added JAR /u01/app/oracle/product/12.2.0/dbhome_1/R/library/ORCHcore/java/orch-bdanalytics-core-2.4.1- mr2.jar at http://1.1.1.1:11491/jars/orch-bdanalytics-core-2.4.1-mr2.jar with timestamp 1424264746101
    15/02/18 13:05:46 INFO RMProxy: Connecting to ResourceManager at my.hdfs.namenode /10.153.107.85:8032
    Utils: Successfully started service 'sparkDriver' on port 35936. SparkEnv: Registering MapOutputTracker
    SparkEnv: Registering BlockManagerMaster
    DiskBlockManager: Created local directory at /tmp/spark-local-
    MemoryStore: MemoryStore started with capacity 265.1 MB
    HttpFileServer: HTTP File server directory is /tmp/spark-7c65075f-850c-
    HttpServer: Starting HTTP Server
    Utils: Successfully started service 'HTTP file server' on port 11491. Utils: Successfully started service 'SparkUI' on port 4040.
    SparkUI: Started SparkUI at http://my.hdfs.namenode:4040
    15/02/18 13:05:46 INFO Client: Requesting a new application from cluster with 1 NodeManagers
    15/02/18 13:05:46 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
    15/02/18 13:05:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    15/02/18 13:05:46 INFO Client: Uploading resource file:/opt/cloudera/parcels/CDH-5.3.1- 1.cdh5.3.1.p0.5/lib/spark/lib/spark-assembly.jar -> hdfs://my.hdfs.namenode:8020/user/oracle/.sparkStaging/application_1423701785613_0009/spark- assembly.jar
    15/02/18 13:05:47 INFO Client: Setting up the launch environment for our AM container
    15/02/18 13:05:47 INFO SecurityManager: Changing view acls to: oracle
    15/02/18 13:05:47 INFO SecurityManager: Changing modify acls to: oracle
    15/02/18 13:05:46 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB
    15/02/18 13:05:46 INFO Client: Setting up container launch context for our AM
    15/02/18 13:05:46 INFO Client: Preparing resources for our AM container
    15/02/18 13:05:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(oracle); users with modify permissions: Set(oracle)
    15/02/18 13:05:47 INFO Client: Submitting application 9 to ResourceManager
    15/02/18 13:05:47 INFO YarnClientImpl: Submitted application application_1423701785613_0009
    15/02/18 13:05:48 INFO Client: Application report for application_1423701785613_0009 (state: ACCEPTED)
    15/02/18 13:05:48 INFO Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: root.oracle
         start time: 1424264747559
         final status: UNDEFINED
         tracking URL: http://my.hdfs.namenode:8088/proxy/application_1423701785613_0009/
         user: oracle
    15/02/18 13:05:49 INFO Client: Application report for application_1423701785613_0009 (state: ACCEPTED)
    15/02/18 13:05:50 INFO Client: Application report for application_1423701785613_0009 (state: ACCEPTED)

    Please note that all those warnings are expected, and may vary depending on the release of Spark used.

    With the console logging options in the log4j.properties settings lowered from INFO to WARN, the request for a Spark Context would return the following:

    # cat $SPARK_HOME/etc/log4j.properties

    # Set everything to be logged to the console
    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

    # Settings to quiet third party logs that are too verbose
    log4j.logger.org.eclipse.jetty=WARN
    log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
    log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

    Now the R Log is going to show only a few details about the Spark Connection.

    > # Creates the Spark Context. Because the Memory setting is not specified,
    > # the default of 1 GB of RAM per Spark Worker is used
    > spark.connect(master="yarn-client", dfs.namenode="my.hdfs.server")
    15/04/09 23:32:11 WARN SparkConf:
    SPARK_JAVA_OPTS was detected (set to '-Djava.library.path=/usr/lib64/R/lib').
    This is deprecated in Spark 1.0+.

    Please instead use:
    - ./spark-submit with conf/spark-defaults.conf to set defaults for an application
    - ./spark-submit with --driver-java-options to set -X options for a driver
    - spark.executor.extraJavaOptions to set -X options for executors
    - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)

    15/04/09 23:32:11 WARN SparkConf: Setting 'spark.executor.extraJavaOptions' to '-Djava.library.path=/usr/lib64/R/lib' as a work-around.
    15/04/09 23:32:11 WARN SparkConf: Setting 'spark.driver.extraJavaOptions' to '-Djava.library.path=/usr/lib64/R/lib' as a work-around.
    15/04/09 23:32:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

    Finally, with the Console logging option in the log4j.properties file set to ERROR instead of INFO or WARN, the request for a Spark Context would return nothing in case of success:

    # cat $SPARK_HOME/etc/log4j.properties

    # Set everything to be logged to the console
    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

    # Settings to quiet third party logs that are too verbose
    log4j.logger.org.eclipse.jetty=ERROR
    log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
    log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

    This time no message is returned to the R session, because we requested feedback only in case of an error:

    > # Creates the Spark Context. Because the Memory setting is not specified,
    > # the default of 1 GB of RAM per Spark Worker is used
    > spark.connect(master="yarn-client", dfs.namenode="my.hdfs.server")
    >

    In summary, it is practical to start any project with full logging, but it is a good idea to bring the logging level down to WARN or ERROR once the environment has been verified to work correctly and the settings are stable.

    Wednesday Aug 05, 2015

    ROracle 1.2-1 released

    We are pleased to announce the latest update of the open source ROracle package, version 1.2-1, with enhancements and bug fixes. ROracle provides high performance and scalable interaction between R and Oracle Database. In addition to availability on CRAN, ROracle binaries specific to Windows and other platforms can be downloaded from the Oracle Technology Network. Users of ROracle, please take our brief survey. Your feedback is important and we want to hear from you!

    Latest enhancements in version 1.2-1 include:

    • Support for NATIVE, UTF8 and LATIN1 encoded data in query and results

    • enhancement 20603162 – CLOB/BLOB enhancement; see the man page for the attributes ora.type, ora.encoding, ora.maxlength, and ora.fractional_seconds_precision (a hedged example follows this list).

    • bug 15937661 – mapping of dbWriteTable BLOB, CLOB, NCLOB, NCHAR and NVARCHAR columns; data frame columns are mapped to the corresponding Oracle Database types.

    • bug 16017358 – proper handling of a NULL extproc context when passed in ORE embedded R execution

    • bug 16907374 - ROracle creates time stamp column for R Date with dbWriteTable

    • ROracle now displays NCHAR, NVARCHAR2 and NCLOB data types defined for
    columns in the server using dbColumnInfo and dbGetInfo
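
    As referenced above, here is a hedged sketch of writing a CLOB column with an explicit encoding using the per-column attributes described in the ROracle man pages. The connection details and table name are placeholders, and exact behavior may vary by release.

    library(ROracle)
    drv <- dbDriver("Oracle")
    con <- dbConnect(drv, username = "scott", password = "tiger",
                     dbname = "mydb")                  # placeholder credentials
    df <- data.frame(ID  = 1:2,
                     DOC = c("first document", "second document"),
                     stringsAsFactors = FALSE)
    attr(df$DOC, "ora.type")     <- "CLOB"             # store DOC as a CLOB
    attr(df$DOC, "ora.encoding") <- "UTF-8"            # write the character data as UTF-8
    dbWriteTable(con, "DOCS", df)
    dbDisconnect(con)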


    In addition, enhancements in the previous release of ROracle, version 1.1-12, include:

    • Add bulk_write parameter to specify number of rows to bind at a time to improve performance for dbWriteTable and DML operations

    • Date, Timestamp, Timestamp with time zone, and Timestamp with local time zone data are maintained in the R and Oracle session time zones. The Oracle session time zone (environment variable ORA_SDTZ) and R's time zone (environment variable TZ) must match; otherwise an error is reported when operating on any of these column types

    • bug 16198839 - Allow selecting data from time stamp with time zone and time stamp with local time zone without reporting error 1805

    • bug 18316008 – increases the bind limit from 4,000 bytes to 2 GB for inserting data into BLOB, CLOB, and 32K VARCHAR and RAW data types; describe lengths are changed to NA for all types except CHAR, VARCHAR2, and RAW

    • and other performance improvements and bug fixes

    See the ROracle NEWS for the complete list of updates.

    We encourage ROracle users to post questions and provide feedback on the Oracle R Technology Forum and the ROracle survey.

    ROracle is not only a high performance interface to Oracle Database from R for direct use; it also provides database access for Oracle R Enterprise, part of the Oracle Advanced Analytics option to Oracle Database.

    Monday Jul 13, 2015

    BIWA Summit 2016 "Call for Speakers" is open!

    Oracle BIWA Summit is an annual conference that provides attendees a concentrated three days of content focused on Big Data and Analytics. Once again, it will be held at the Oracle Headquarters Conference Center in Redwood Shores, CA. As part of the organizing committee, I invite you to submit session proposals, especially those involving Oracle's R technologies.

    BIWA Summit attendees want to hear about your use of Oracle technology. Proposals will be accepted through Monday evening November 2, 2015, at midnight EST.

    To submit your abstract, click here.

    This year's tracks include:


    Oracle BIWA Summit 2016 is organized and managed by the Oracle BIWA SIG, the Oracle Spatial SIG, and the Oracle Northern California User Group. The event attracts top BI, data warehousing, analytics, Spatial, IoT and Big Data experts.

    The three-day event includes keynotes from industry experts, educational sessions, hands-on labs, and networking events.

    Hot topics include:


    • Database, data warehouse and cloud, Big Data architecture

    • Deep dives and hands-on labs on existing Oracle BI, data warehouse, and analytics products

    • Updates on the latest Oracle products and technologies (e.g. Big Data Discovery, Oracle Visual Analyzer, Oracle Big Data SQL)

    • Novel and interesting use cases on everything – Spatial, Graph, Text, Data Mining, IoT, ETL, Security, Cloud

    • Working with Big Data (e.g., Hadoop, “Internet of Things,” SQL, R, Sentiment Analysis)

    • Oracle Business Intelligence (OBIEE), Oracle Big Data Discovery, Oracle Spatial, and Oracle Advanced Analytics—Better Together

    I look forward to seeing you there!

    Friday Jun 12, 2015

    Variable Selection with ORE varclus - Part 2

    In our previous post we talked about variable selection and introduced a technique based on hierarchical divisive clustering and implemented using the Oracle R Enterprise embedded execution capabilities. In this post we illustrate how to visualize the clustering solution, discuss stopping criteria and highlight some performance aspects.

    Plots


    The clustering efficiency can be assessed, from a high-level perspective, through a visual representation of metrics related to variability. The plot.clusters() function, provided as an example in varclus_lib.R, takes the datastore name, the iteration number (nclust corresponds to the number of clusters after the final iteration), and an output directory, and generates a png output file with two plots.

    R> plot.clusters(dsname="datstr.MYDATA",nclust=6,
                       outdir="out.varclus.MYDATA")

    unix> ls -1 out.varclus.MYDATA
    out.MYDATA.clusters
    out.MYDATA.log
    plot.datstr.MYDATA.ncl6.png

    The upper plot focuses on the last iteration. The x axis represents the cluster id (1 to 6 for six clusters after the 6th and final iteration). The variation explained and the proportion of variation explained (Variation.Explained and Proportion.Explained from 'Cluster Summary') are rendered by the blue curve, with units on the left y axis, and the red curve, with units on the right y axis. Clusters 1, 2, 3, 4, and 6 are well represented by their first principal component. Cluster 5 contains variation that is not well captured by a single component (only 47.8% is explained, as already mentioned in Part 1). This can also be seen from the r2.own values for the variables of Cluster 5 (VAR20, VAR26,...,VAR29), which range between 0.24 and 0.62, indicating that they are not well correlated with the first principal component score. For this kind of situation, domain expertise is needed to evaluate the results and decide the course of action: does it make sense to keep VAR20, VAR26,...,VAR29 clustered together with VAR27 as the representative variable, or should Cluster 5 be further split by lowering eig2.threshold (below the corresponding Secnd.Eigenval value from the 'Clusters Summary' section)?

    The bottom plot illustrates the entire clustering sequence (all iterations). The x axis represents the iteration number or, equivalently, the number of clusters after that iteration. The total variation explained and the proportion of total variation explained (Tot.Var.Explained and Prop.Var.Explained from 'Grand Summary') are rendered by the blue curve, with units on the left y axis, and the red curve, with units on the right y axis. One can see how Prop.Var.Explained tends to flatten below 90% (86.3% for the last iteration).




    For the case above, a single cluster was 'weak' and there were no ambiguities about where to start examining the results or searching for issues. Below is the same output for a different problem with 120 variables and 29 final clusters. For this case, the proportion of variation explained by the first component (red curve, upper plot) shows several 'weak' clusters: 23, 28, 27, 4, 7, and 19. Prop.Var.Explained is below 60% for these clusters. Which one should be examined first? A good choice could be Cluster 7 because it plays a more important role, as measured by the absolute value of Variation.Explained. Here again, domain knowledge is required to examine these clusters and decide whether, and for how long, the splitting process should continue.





    Stopping criteria & number of variables


    As illustrated in the previous section, the number of final clusters can be raised or reduced by lowering or raising the eig2.threshold parameter. For problems with many variables, the user may want to stop the iterations early, at a lower number of clusters, and inspect the clustering results and history before convergence to gain a better understanding of the variable selection process. Early stopping is achieved through the maxclust argument, as discussed in the previous post, and it can also be used when the user wants or needs to keep the number of selected variables below an upper limit.

    Performance


    The clustering runtime is entirely dominated by the cost of the PCA analysis. The first split is the most expensive because PCA runs on the entire data; the subsequent splits execute faster and faster as the PCAs handle clusters with fewer and fewer variables. For the 39-variable, 55k-row case presented above, the entire run (splitting into 6 clusters, post-processing from the datastore, and generation of the output) took ~10s. The 120-variable, 55k-row case required ~54s. For a larger case with 666 variables and 64k rows, execution completed in 112s and generated 128 clusters. These numbers were obtained on an Intel Xeon 2.9GHz OL6 machine. The customer ran cases with more than 600 variables and O[1e6] rows in 5-10 minutes.

    Monday Apr 06, 2015

    Using rJava in Embedded R Execution

    Integration with high performance programming languages is one way to tackle big data with R. Portions of the R code are moved from R to another language to avoid bottlenecks and perform expensive procedures. The goal is to balance R’s elegant handling of data with the heavy duty computing capabilities of other languages.

    Outsourcing R work to another language can easily be hidden inside R functions, so proficiency in the target language is not required for the users of those functions. The rJava package by Simon Urbanek is one such example - it hands work from R to Java, much like R's native .C/.Call interface. rJava allows users to create objects, call methods, and access fields of Java objects from R.

    Oracle R Enterprise (ORE) provides an additional boost to rJava when used in embedded R script execution on the database server machine. Embedded R Execution allows R scripts to take advantage of a likely more powerful database server machine - more memory and CPUs, and greater CPU power. Through embedded R, ORE enables R to leverage database support for data parallel and task parallel execution of R scripts and also operationalize R scripts in database applications.  The net result is the ability to analyze larger data sets in parallel from a single R or SQL interface, depending on your preference.

    In this post, we demonstrate a basic example of configuring and deploying rJava in base R and embedded R execution.

    1. Install Java

    To start, you need Java. If you are not using a pre-configured engineered system like Exadata or the Big Data Appliance, you can download the Java Runtime Environment (JRE) and Java Development Kit (JDK) here.

    To verify the JRE is installed on your system, execute the command:

    $ java -version
    java version "1.7.0_67"


    If the JRE is installed on the system, the version number is returned. The equivalent check for JDK is:

    $ javac -version
    javac 1.7.0_67


    A "command not recognized" error indicates either Java is not present or you need to add Java to your PATH and CLASSPATH environment variables.

    2. Configure Java Parameters for R

    R provides the javareconf utility to configure Java support in R.  To prepare the R environment for Java, execute this command:

    $ sudo R CMD javareconf

    or

    $ R CMD javareconf -e

    3.  Install rJava Package

    rJava release versions can be obtained from CRAN.  Assuming an internet connection is available, the install.packages command in an R session will do the trick.

    > install.packages("rJava")
    ..
    ..
    * installing *source* package ‘rJava’ ...
    ** package ‘rJava’ successfully unpacked and MD5 sums checked
    checking for gcc... gcc -m64 -std=gnu99
    ..
    ..
    ** testing if installed package can be loaded
    * DONE (rJava)


    4. Configure the Environment Variable CLASSPATH

    The CLASSPATH environment variable must contain the directories with the jar and class files.  The class files in this example will be created in /home/oracle/tmp.

      export CLASSPATH=$ORACLE_HOME/jlib:/home/oracle/tmp

    Alternatively, use the rJava function .jaddClassPath to define the path to the class files.
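    For example (a minimal sketch using the example directory from this post):

    R> library(rJava)
    R> .jinit()
    R> .jaddClassPath("/home/oracle/tmp")   # directory that will hold HelloWorld.class
    R> .jclassPath()                        # verify the entries on the rJava class path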


    5. Create and Compile Java Program

    For this test, we create a simple Hello, World! example. Create the file HelloWorld.java in /home/oracle/tmp with the contents:

      public class HelloWorld {
          public String SayHello(String str) {
              String a = "Hello,";
              return a.concat(str);
          }
      }


    Compile the Java code.

    $ javac HelloWorld.java


    6.  Call Java from R


    In R, execute the following commands to call the rJava package and initialize the Java Virtual Machine (JVM).

    R> library(rJava)
    R> .jinit()


    Instantiate the class HelloWorld in R. In other words, tell R to look at the compiled HelloWorld program.

    R> obj <- .jnew("HelloWorld")

    Call the function directly.

    R> str <- "World!"
    R> .jcall(obj, "S", "SayHello", str)
    [1] "Hello,World!"


    7.  Call Java In Embedded R Execution


    Oracle R Enterprise uses external procedures in Oracle Database to support embedded R execution. In the default configuration, the external procedure agent is spawned directly by Oracle Database. The path to the JVM shared library, libjvm.so, must be added to the environment variable LD_LIBRARY_PATH so it is found in the shell where Oracle is started. This is defined in two places: in the OS shell and in the external procedures configuration file, extproc.ora.

    In the OS shell:

    $ locate libjvm.so

    /usr/java/jdk1.7.0_45/jre/lib/amd64/server/libjvm.so

    $ export LD_LIBRARY_PATH=/usr/java/jdk1.7.0_45/jre/lib/amd64/server:$LD_LIBRARY_PATH


    In extproc.ora:

    $ cd $ORACLE_HOME/hs/admin


    Edit the file extproc.ora to add the path to libjvm.so in LD_LIBRARY_PATH:

    SET EXTPROC_DLLS=ANY
    SET LD_LIBRARY_PATH=/usr/java/jdk1.7.0_45/jre/lib/amd64/server
    export LD_LIBRARY_PATH


    You will need to bounce the database instance after updating extproc.ora.

    Now load rJava in embedded R:

    > library(ORE)
    > ore.connect(user     = 'oreuser',
                 password = 'password',
                 sid      = 'sid',
                 host     = 'hostname',
                 all      = TRUE)


    > TEST <- ore.doEval(function(str) {
                           library(rJava)
                           .jinit()
                           obj <- .jnew("HelloWorld")
                           val <- .jcall(obj, "S", "SayHello", str)
                           return(as.data.frame(val))
                         },
                         str = 'World!',
                        FUN.VALUE = data.frame(VAL = character())
      )

    > print(TEST)
                  VAL
    1 Hello,      World!


    If you receive this error, LD_LIBRARY_PATH is not set correctly in extproc.ora:

    Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch,  :
      Error in try({ : ORA-20000: RQuery error
    Error : package or namespace load failed for ‘rJava’
    ORA-06512: at "RQSYS.RQEVALIMPL", line 104
    ORA-06512: at "RQSYS.RQEVALIMPL", line 101


    Once you've mastered this simple example, you can move on to your own use case. If you get stuck, the rJava package has very good documentation. Start with the information on the rJava CRAN page. Then, from an R session with the rJava package loaded, execute the command help(package="rJava") to list the available functions.

    After that, the source code of R packages that use rJava is a useful source of further inspiration – look at the reverse dependencies list for rJava on CRAN. In particular, the helloJavaWorld package is a tutorial on how to include Java code in an R package.



    Monday Mar 30, 2015

    Oracle Open World 2015 Call for Proposals!

    It's that time of year again...submit your session proposals for Oracle OpenWorld 2015!

    Oracle customers and partners are encouraged to submit proposals to present at the Oracle OpenWorld 2015 conference, October 25 - 29, 2015, held at the Moscone Center in San Francisco.

    Details and submission guidelines are available on the Oracle OpenWorld Call for Proposals web site. The deadline for submissions is Wednesday, April 29, 11:59 p.m. PDT.

    We look forward to checking out your sessions on Oracle Advanced Analytics, including Oracle R Enterprise and Oracle Data Mining, and Oracle R Advanced Analytics for Hadoop. Tell us how these tools have enhanced the way you do business!

    Monday Mar 23, 2015

    Oracle R Distribution 3.1.1 Available for Download on all Platforms

    The Oracle R Distribution 3.1.1 binaries for Windows, AIX, Solaris SPARC and Solaris x86 are now available on OSS, Oracle's Open Source Software portal. Oracle R Distribution 3.1.1 is an update to R version 3.1.0, and it includes many improvements, including upgrades to the package help system and improved accuracy importing data with large integers. The complete list of changes is in the NEWS file.

    To install Oracle R Distribution,
    follow the instructions for your platform in the Oracle R Enterprise Installation and Administration Guide.

    Thursday Feb 12, 2015

    Pain Point #6: “We need to build 10s of thousands of models fast to meet business objectives”

    The last pain point in this series on Addressing Analytic Pain Points involves one aspect of what I call massive predictive modeling. Increasingly, enterprise customers are building a greater number of models. In past decades, producing a handful of production models per year may have been considered a significant accomplishment. With the advent of powerful computing platforms, parallel and distributed algorithms, as well as the wealth of data – Big Data – we see enterprises building hundreds and thousands of models in targeted ways.

    For example, consider the utility sector with data being collected from household smart meters. Whether water, gas, or electricity, utility companies can make more precise demand projections by modeling individual customer consumption behavior. Aggregating this behavior across all households can provide more accurate forecasts, since individual household patterns are considered, not just generalizations about all households, or even different household segments.

    The concerns associated with this form of massive predictive modeling include: (i) dealing effectively with Big Data from the hardware, software, network, storage and Cloud, (ii) algorithm and infrastructure scalability and performance, (iii) production deployment, and (iv) model storage, backup, recovery and security. Some of these I’ve explored under previous pain points blog posts.

    Oracle Advanced Analytics (OAA) and Oracle R Advanced Analytics for Hadoop (ORAAH) both provide support for massive predictive modeling. From the Oracle R Enterprise component of OAA, users leverage embedded R execution to run user-defined R functions in parallel, both from R and from SQL. OAA provides the infrastructure to allow R users to focus on their core R functionality while allowing Oracle Database to handle spawning of R engines, partitioning data and providing data to their R function across parallel R engines, aggregating results, etc. Data parallelism is enabled using the “groupApply” and “rowApply” functions, while task parallelism is enabled using the “indexApply” function. The Oracle Data Mining component of OAA provides "on-the-fly" models, also called "predictive queries," where the model is automatically built on partitions of the data and scoring using those partitioned models is similarly automated.
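
    As a hedged illustration of this data-parallel pattern, the sketch below uses ore.groupApply to build one linear model per carrier on the ONTIME_S sample airline data shipped with ORE; substitute your own ore.frame and formula. An ORE connection is assumed to be established.

    # Build one lm model per carrier; results come back as an ore.list of models
    modList <- ore.groupApply(
        ONTIME_S[, c("UNIQUECARRIER", "DEPDELAY", "DISTANCE")],
        INDEX = ONTIME_S$UNIQUECARRIER,
        function(dat) lm(DEPDELAY ~ DISTANCE, data = dat),
        parallel = TRUE)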

    ORAAH enables the writing of mapper and reducer functions in R where corresponding ORE functionality can be achieved on the Hadoop cluster. For example, to emulate “groupApply”, users write the mapper to partition the data and the reducer to build a model on the resulting data. To emulate “rowApply”, users can simply use the mapper to perform, e.g., data scoring and passing the model to the environment of the mapper. No reducer is required.
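
    A rough, hedged sketch of that pattern with ORAAH follows. The hadoop.run, orch.keyval, and hdfs.attach functions come from the ORCH package, but the argument details and data preparation shown here are assumptions that vary by release; treat this as an outline rather than working code for your installation.

    # Assume the airline data already resides in HDFS and is referenced by an
    # HDFS object, e.g. ontime.dfs <- hdfs.attach("ontime")
    res <- hadoop.run(ontime.dfs,
             mapper  = function(key, val) {
                 # partition the incoming rows by carrier
                 orch.keyval(val$UNIQUECARRIER, val)
             },
             reducer = function(key, vals) {
                 # build one model per carrier partition
                 orch.keyval(key, lm(DEPDELAY ~ DISTANCE, data = vals))
             })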

    Monday Jan 19, 2015

    Pain Point #5: “Our company is concerned about data security, backup and recovery”

    So far in this series on Addressing Analytic Pain Points, I’ve focused on the issues of data access, performance, scalability, application complexity, and production deployment. However, there are also fundamental needs for enterprise advanced analytics solutions that revolve around data security, backup, and recovery.

    Traditional non-database analytics tools typically rely on flat files. If data originated in an RDBMS, that data must first be extracted. Once extracted, who has access to these flat files? Who is using this data and when? What operations are being performed? Security needs for data may be somewhat obvious, but what about the predictive models themselves? In some sense, these may be more valuable than the raw data since these models contain patterns and insights that help make the enterprise competitive, if not the dominant player. Are these models secure? Do we know who is using them, when, and with what operations? In short, what audit capabilities are available?

    While security is a hot topic for most enterprises, it is essential to have a well-defined backup process in place. Enterprises normally have well-established database backup procedures that database administrators (DBAs) rigorously follow. If data and models are stored in flat files, perhaps in a distributed environment, one must ask what procedures exist and with what guarantees. Are the data files taxing file system backup mechanisms already in place – or not being backed up at all?

    On the other hand, recovery involves using those backups to restore the database to a consistent state, reapplying any changes since the last backup. Again, enterprises normally have well-established database recovery procedures that are used by DBAs. If separate backup and recovery mechanisms are used for data, models, and scores, it may be difficult, if not impossible, to reconstruct a consistent view of an application or system that uses advanced analytics. If separate mechanisms are in place, they are likely more complex than necessary.

    For Oracle Advanced Analytics (OAA), data is secured via Oracle Database, which wins security awards and is highly regarded for its ability to provide secure data for confidentiality, integrity, availability, authentication, authorization, and non-repudiation. Oracle Database logs and monitors user activity. Users can work independently or jointly in a shared environment with data access controlled by standard database privileges. The data itself can be encrypted and data redaction is supported.

    OAA models are secured in one of two ways: (i) models produced in the kernel of the database are treated as first-class database objects with corresponding access privileges (create, update, delete, execute), and (ii) models produced through the R interface can be stored in the R datastore, which exists as a database table in the user's schema with its own access privileges. In either case, users must log into their Oracle Database schema/account, which provides the needed degree of confidentiality, integrity, availability, authentication, authorization, and non-repudiation.

    Enterprise Oracle DBAs already follow rigorous backup and recovery procedures. The ability to reuse these procedures in conjunction with advanced analytics solutions is a major simplification and helps to ensure the integrity of data, models, and results.

    Tuesday Dec 23, 2014

    Pain Point #4: “Recoding R (or other) models into SQL, C, or Java takes time and is error prone”

    In the previous post in this series Addressing Analytic Pain Points, I focused on some issues surrounding production deployment of advanced analytics solutions. One specific aspect of production deployment involves how to get predictive model results (e.g., scores) from R or leading vendor tools into applications that are based on programming languages such as SQL, C, or Java. In certain environments, one way to integrate predictive models involves recoding them into one of these languages. Recoding involves identifying the minimal information needed for scoring, i.e., making predictions, and implementing that in a language that is compatible with the target environment. For example, consider a linear regression model with coefficients. It can be fairly straightforward to write a SQL statement or a function in C or Java to produce a score using these coefficients. This translated model can then be integrated with production applications or systems.

    While recoding has been a technique used for decades, it suffers from several drawbacks: latency, quality, and robustness. Latency refers to the time delay between the data scientist developing the solution and leveraging that solution in production. Customers recount historic horror stories where the process from analyst to software developers to application deployment took months. Quality comes into play on two levels: the coding and testing quality of the software produced, and the freshness of the model itself. In fast changing environments, models may become “stale” within days or weeks. As a result, latency can impact quality. In addition, while a stripped down implementation of the scoring function is possible, it may not account for all cases considered by the original algorithm implementer. As such, robustness, i.e., the ability to handle greater variation in the input data, may suffer.

    One way to address this pain point is to make it easy to leverage predictive models immediately (especially open source R and in-database Oracle Advanced Analytics models), thereby eliminating the need to recode models. Since enterprise applications normally know how to interact with databases via SQL, as soon as a model is produced, it can be placed into production via SQL access. In the case of R models, these can be accessed using Oracle R Enterprise embedded R execution in parallel via ore.rowApply and, for select models, the ore.predict capability performs automatic translation of native R models for execution inside the database. In the case of native SQL Oracle Advanced Analytics interface algorithms, as found in Oracle Data Mining and exposed through an R interface in Oracle R Enterprise, users can perform scoring directly in Oracle Database. This capability minimizes or even eliminates latency, dramatically increases quality, and leverages the robustness of the original algorithm implementations.
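
    As a hedged sketch of the second path, the example below fits a native R lm model on a small local sample and then scores an ore.frame in-database with ore.predict. ONTIME_S is the ORE sample airline data set, the column names come from that data, and an ORE connection is assumed.

    dat <- ore.pull(head(ONTIME_S, 10000))          # small local sample for the fit
    fit <- lm(ARRDELAY ~ DEPDELAY + DISTANCE, data = dat)
    scores <- ore.predict(fit, newdata = ONTIME_S)  # scoring runs inside Oracle Database
    head(scores)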

    Sunday Dec 14, 2014

    Pain Point #3: “Putting R (or other) models and results into production is ad hoc and complex”

    Continuing in our series Addressing Analytic Pain Points, another concern for data scientists and analysts, as well as enterprise management, is how to leverage analytic results in production systems. These production systems can include (i) dashboards used by management to make business decisions, (ii) call center applications where representatives see personalized recommendations for the customer they’re speaking to or how likely that customer is to churn, (iii) real-time recommender systems for customer retail web applications, (iv) automated network intrusion detection systems, and (v) semiconductor manufacturing alert systems that monitor product quality and equipment parameters via sensors – to name a few.

    When a data scientist or analyst begins examining a data-based business problem, one of the first steps is to acquire the available data relevant to that problem. In many enterprises, this involves having it extracted from a data warehouse and operational systems, or acquiring supplemental data from third parties. They then explore the data, prepare it with various transformations, build models using a variety of algorithms and settings, evaluate the results, and after choosing a “best” approach, produce results such as predictions or insights that can be used by the enterprise.

    If the end goal is to produce a slide deck or report, aside from those final documents, the work is done. However, reaping financial benefits from advanced analytics often needs to go beyond PowerPoint! It involves automating the process described above: extract and prepare the data, build and select the “best” model, generate predictions or highlight model details such as descriptive rules, and utilize them in production systems.

    One of the biggest challenges enterprises face is realizing in production the benefits the data scientist achieved in the lab. How do you take that cleverly crafted R script, for example, and put all the necessary “plumbing” around it to enable not only the execution of the R script, but also the movement of data and delivery of results where they are needed, parallel and distributed script execution across compute nodes, and execution scheduling?

    In a production deployment, care needs to be taken to safeguard against potential failures in the process. Further, more “moving parts” result in greater complexity. Since the plumbing is often custom implemented for each deployment, it needs to be reinvented and thoroughly tested for each project. Unfortunately, code and process reuse is seldom realized across an enterprise even for similar projects, which results in duplication of effort.

    Oracle Advanced Analytics (Oracle R Enterprise and Oracle Data Mining) with Oracle Database provides an environment that eliminates the need for a separately managed analytics server, the corresponding movement of data and results between such a server and the database, and the need for custom plumbing. Users can store their R and SQL scripts directly in Oracle Database and invoke them through standard database mechanisms. For example, R scripts can be invoked via SQL, and SQL scripts can be scheduled for execution through Oracle Database’s DBMS_SCHEDULER package. Parallel and distributed execution of R scripts is supported through embedded R execution, while the database kernel supports parallel and distributed execution of SQL statements and in-database data mining algorithms. In addition, using the Oracle Advanced Analytics GUI, Oracle Data Miner, users can convert “drag and drop” analytic workflows to SQL scripts for ease of deployment in Oracle Database.
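
    As a hedged sketch of the script repository mechanism, the example below stores a trivial function under an invented name and invokes it by name through embedded R execution; the same stored script can also be called from SQL through the rqEval table function.

    # Store an R function in the database R script repository under a chosen name
    ore.scriptCreate("myExampleScript",
                     function() data.frame(id = 1:10, val = runif(10)))

    # Invoke the stored script by name via embedded R execution
    res <- ore.doEval(FUN.NAME = "myExampleScript")
    ore.pull(res)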

    By making solution deployment a well-defined and routine part of the production process and reducing complexity through fewer moving parts and built-in capabilities, enterprises are able to realize and then extend the value they get from predictive analytics faster and with greater confidence.

    Wednesday Nov 19, 2014

    Pain Point #2: “I can’t analyze or mine all of my data – it has to be sampled”

    Continuing in our series Addressing Analytic Pain Points, another concern for enterprise data scientists and analysts is having to compromise accuracy due to sampling. While sampling is an important technique for data analysis, it’s one thing to sample because you choose to; it’s quite another if you are forced to sample or to use a much smaller sample than is useful. A combination of memory, compute power, and algorithm design normally contributes to this.

    In some cases, data simply cannot fit in memory. As a result, users must either process data in batches (adding to code or process complexity), or limit the data they use through sampling. In some environments, sampling itself introduces a catch 22 problem: the data is too big to fit in memory so it needs to be sampled, but to sample it with the current tool, I need to fit the data in memory! As a result, sampling large volume data may require processing it in batches, involving extra coding.

    As data volumes increase, computing statistics and predictive analytics models on a data sample can significantly reduce accuracy. For example, to find all the unique values for a given variable, a sample may miss values, especially those that occur infrequently. In addition, for environments like open source R, it is not enough for data to fit in memory; sufficient memory must be left over to perform the computation. This results from R’s call-by-value semantics.

    Even when data fits in memory, local machines, such as laptops, may have insufficient CPU power to process larger data sets. Insufficient computing resources means that performance suffers and users must wait for results - perhaps minutes, hours, or longer. This wastes the valuable (and expensive) time of the data scientist or analyst. Having multiple fast cores for parallel computations, as normally present on database server machines, can significantly reduce execution time.

    So let’s say we can fit the data in memory with sufficient memory left over, and we have ample compute resources. It may still be the case that performance is slow, or worse, the computation effectively “never” completes. A computation that would take days or weeks to complete on the full data set may be deemed as “never” completing by the user or business, especially where the results are time-sensitive. To address this problem, algorithm design must be addressed. Serial, non-threaded algorithms, especially with quadratic or worse order run time do not readily scale. Algorithms need to be redesigned to work in a parallel and even distributed manner to handle large data volumes.

    Oracle Advanced Analytics provides a range of statistical computations and predictive algorithms implemented in a parallel, distributed manner to enable processing of much larger data volumes. By virtue of executing in Oracle Database, client-side memory limitations can be eliminated. For example, with Oracle R Enterprise, R users operate on database tables using proxy objects – of type ore.frame, a subclass of data.frame – such that data.frame functions are transparently converted to SQL and executed in Oracle Database. This eliminates data movement from the database to the client machine. Users can also leverage the Oracle Data Miner graphical interface or SQL directly. When high performance hardware, such as Oracle Exadata, is used, there are powerful resources available to execute operations efficiently on big data. On Hadoop, Oracle R Advanced Analytics for Hadoop – a part of the Big Data Connectors often deployed on Oracle Big Data Appliance – also provides a range of pre-packaged parallel, distributed algorithms for scalability and performance across the Hadoop cluster.
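
    As a hedged illustration of the transparency layer, the sketch below runs a data.frame-style aggregation on the ONTIME_S sample ore.frame; the call is translated to SQL and executed in Oracle Database, so the full table never moves to the client. An ORE connection is assumed.

    class(ONTIME_S)                       # "ore.frame" -- a proxy for a database table
    delay.by.dest <- aggregate(ONTIME_S$ARRDELAY,
                               by  = list(DEST = ONTIME_S$DEST),
                               FUN = mean)
    head(delay.by.dest)                   # only the small result is brought to the client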

    Friday Oct 24, 2014

    Pain Point #1: “It takes too long to get my data or to get the ‘right’ data”

    This is the first in a series on Addressing Analytic Pain Points: “It takes too long to get my data or to get the ‘right’ data.”

    Analytics users can be characterized along multiple dimensions. One such dimension is how they get access to or receive data. For example, some receive data via flat files. Since we’re talking about “enterprise” users, this often means data stored in RDBMSs where users request data extracts from a DBA or more generally the IT department. Turnaround time can be hours to days, or even weeks, depending on the organization. If the data scientist needs more or different data, the cycle repeats – often leading to frustration on both sides and delays in generating results.

    Other users are granted access to databases directly using programmatic access tools like ODBC, JDBC, their corresponding R variants, or ROracle. These users may be given read-only access to a range of data tables, possibly in a sandbox schema. Here, analytics users don’t have to go back to their DBA or IT to obtain extracts, but they still need to pull the data from the database to their client environment, e.g., a laptop, and push results back to the database. If significant volumes of data are involved, the time required for pulling data can hinder productivity. (Of course, this assumes the client has enough RAM to load the needed data sets, but that’s a topic for the next blog post.)

    To address the first type of user, since much of the data in question resides in databases, empowering users with a self service model mitigates the vicious cycle described above. When the available data are readily accessible to analytics users, they can see and select what they need at will. An Oracle Database solution addresses this data access pain point by providing schema access, possibly in a sandbox with read-only table access, for the analytics user.

    Even so, this approach just turns the first type of user into the second mentioned above. An Oracle Database solution further addresses this pain point by either minimizing or eliminating data movement as much as possible. Most analytics engines bring data to the computation, requiring extracts and in some cases even proprietary formats before being able to perform analytics. This takes time. Often, data movement can dwarf the time required to perform the actual computation. From the perspective of the analytics user, this is wasted time because it is just a perfunctory step on the way to getting the desired results. By bringing computation to the data, using Oracle Advanced Analytics (Oracle R Enterprise and Oracle Data Mining), the time normally required to move data is eliminated. Consider the time savings of being able to prepare data, compute statistics, or build predictive models and score data directly in the database. Using Oracle Advanced Analytics, either from R via Oracle R Enterprise, SQL via Oracle Data Mining, or the graphical interface Oracle Data Miner, users can leverage Oracle Database as a high performance computational engine.

    We should also note that Oracle Database has the high performance Oracle Call Interface (OCI) library for programmatic data access. For R users, Oracle provides the package ROracle that is optimized using OCI for fast data access. While ROracle performance may be much faster than other methods (ODBC- and JDBC-based), the time is still greater than zero and there are other problems that I’ll address in the next pain point.
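
    For reference, a minimal hedged sketch of programmatic access with ROracle is shown below; the connection details and table name are placeholders.

    library(ROracle)
    drv <- dbDriver("Oracle")
    con <- dbConnect(drv, username = "analyst", password = "secret",
                     dbname = "mydb")
    res <- dbSendQuery(con, "SELECT * FROM ontime WHERE month = 1")
    dat <- fetch(res, n = 100000)         # fetch rows in chunks to control client memory
    dbClearResult(res)
    dbDisconnect(con)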

    Addressing Analytic Pain Points

    If you’re an enterprise data scientist, data analyst, or statistician, and perform analytics using R or another third party analytics engine, you’ve likely encountered one or more of these pain points:

    Pain Point #1: “It takes too long to get my data or to get the ‘right’ data”
    Pain Point #2: “I can’t analyze or mine all of my data – it has to be sampled”
    Pain Point #3: “Putting R (or other) models and results into production is ad hoc and complex”
    Pain Point #4: “Recoding R (or other) models into SQL, C, or Java takes time and is error prone”
    Pain Point #5: “Our company is concerned about data security, backup and recovery”
    Pain Point #6: “We need to build 10s of thousands of models fast to meet business objectives”

    Some pain points are related to the scale of data, yet others are felt regardless of data size. In this blog series, I’ll explore each of these pain points, how they affect analytics users and their organizations, and how Oracle Advanced Analytics addresses them.

    Monday Sep 22, 2014

    Oracle R Enterprise 1.4.1 Released

    Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, makes the open source R statistical programming language and environment ready for the enterprise and big data. Designed for problems involving large data volumes, Oracle R Enterprise integrates R with Oracle Database.

    R users can execute R commands and scripts for statistical and graphical analyses on data stored in Oracle Database. R users can develop, refine, and deploy R scripts that leverage the parallelism and scalability of the database to automate data analysis. Data analysts and data scientists can use open source R packages and develop and operationalize R scripts for analytical applications in one step – from R or SQL.

    With the new release of Oracle R Enterprise 1.4.1, Oracle enables support for Multitenant Container Database (CDB) in Oracle Database 12c and pluggable databases (PDB). With support for CDB / PDB, enterprises can take advantage of new ways of organizing their data: easily taking entire databases offline and easily bringing them back online when needed. Enterprises, such as pharmaceutical companies, that collect vast quantities of data across multiple experiments for individual projects immediately benefit from this capability.

    This point release also includes the following enhancements:

    • Certified for use with R 3.1.1 and Oracle R Distribution 3.1.1.

    • Simplified and enhanced script for install, upgrade, and uninstall of ORE Server and for the creation and configuration of ORE users.

    • New supporting packages: arules and statmod.

    • ore.glm accepts offset terms in the model formula and can fit negative binomial and Tweedie families of GLM.

    • The ore.sync argument query creates an ore.frame object from a SELECT statement without creating a view. This allows users to effectively access a view of the data without the CREATE VIEW privilege (a hedged sketch follows this list).

    • Global option for serialization, ore.envAsEmptyenv, specifies whether referenced environment objects in an R object, e.g., in an lm model, should be replaced with an empty environment during serialization to the ORE R datastore. This is used by (i) ore.push, which for a list object accepts envAsEmptyenv as an optional argument, (ii) ore.save, which has envAsEmptyenv as a named argument, and (iii) ore.doEval and the other embedded R execution functions, which accept ore.envAsEmptyenv as a control argument.
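
    As referenced above, a hedged sketch of the query argument to ore.sync is shown below; the query text and object name are invented for the example, and an ORE connection is assumed.

    # Expose a SELECT statement as an ore.frame without creating a database view
    ore.sync(query = c(BIG_DELAYS =
        "SELECT UNIQUECARRIER, ARRDELAY, DEST FROM ONTIME_S WHERE ARRDELAY > 60"))
    ore.attach()        # make the synced object visible on the R search path
    class(BIG_DELAYS)   # "ore.frame", backed by the query rather than a view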

    Oracle R Enterprise 1.4.1
    can be downloaded from OTN here.

    About

    The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.
