Thursday Aug 06, 2015

Oracle R Advanced Analytics for Hadoop on the Fast Lane: Spark-based Logistic Regression and MLP Neural Networks

This is the first in a series of blogs exploring the capabilities of the newly released Oracle R Advanced Analytics for Hadoop (ORAAH) 2.5.0, part of Oracle Big Data Connectors. This release includes two new algorithm implementations that can take advantage of an Apache Spark cluster for significant performance gains in model build and scoring time: a redesigned version of the Multi-Layer Perceptron Neural Networks (orch.neural) and a brand-new implementation of a Logistic Regression model (orch.glm2).

Through large-scale benchmarks, we are going to see the performance improvements that the new custom algorithms bring to enterprise Data Science when running on top of a Hadoop cluster with an available Apache Spark infrastructure.

In this first part, we are going to compare only the model build performance and feasibility of the new algorithms against the same algorithms running on Map-Reduce; we are not going to be concerned with model quality or precision. Model scoring, quality and precision will be covered in a future Blog.

Documentation on the new components can be found in the product itself (with help and sample code), and also on the ORAAH 2.5.0 Documentation Tab on OTN.

Hardware and Software used for testing


As a test bed, we are using an Oracle Big Data Appliance X3-2 cluster with 6 nodes. Each node has two Intel® Xeon® X5675 6-core CPUs (3.07 GHz), for a total of 12 cores (24 threads), and 96 GB of RAM.

The BDA nodes run Oracle Enterprise Linux release 6.5 and Cloudera Hadoop Distribution (CDH) 5.3.0, which includes Apache Spark release 1.2.0.

Each node also runs Oracle R Distribution release 3.1.1 and Oracle R Advanced Analytics for Hadoop release 2.5.0.

Dataset used for Testing


For the tests, we are going to use a classic dataset consisting of arrival and departure information for all major airports in the USA.  The data is available online in different formats.  The most commonly used version contains 123 million records and has been the basis of many benchmarks, originally cited by the American Statistical Association for its Data Expo 2009.  We augmented the data in that file by downloading additional months of data from the official Bureau of Transportation Statistics website.  Our starting point is this new dataset, which contains 159 million records and covers flights up to September 2014.

For smaller tests, we created subsets of this dataset with 1, 10 and 100 million records.  We also created a 1-billion-record dataset by appending the 159-million-record data to itself over and over until we reached 1 billion records.
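The replication step can be sketched in plain R. The snippet below is illustrative only: it uses a tiny 3-line temporary file as a stand-in for the 159-million-record CSV, and the file names are hypothetical.

```r
# Hypothetical sketch of assembling the large file: append the base CSV
# to a copy of itself until the target row count is exceeded.
# A tiny 3-line stand-in is used here in place of the 159M-record file.
src  <- tempfile(fileext = ".csv")
dest <- tempfile(fileext = ".csv")
writeLines(c("row1", "row2", "row3"), src)

file.copy(src, dest, overwrite = TRUE)
# 7 copies of 159M records exceed 1 billion, so six appends suffice
for (i in 1:6) file.append(dest, src)

length(readLines(dest))   # 21 lines: 7 copies of the 3-line stand-in
```

On the real data, the final file would then be trimmed to exactly 1 billion records.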

Connecting to a Spark Cluster


In release 2.5.0 we are introducing a new set of R commands that allow the Data Scientist to request the creation of a Spark Context, in either YARN or Standalone mode.

For this release, the Spark Context is used exclusively to accelerate the creation of levels and factor variables, the Model Matrix, the final solution of the Logistic and Neural Networks models themselves, and scoring (in the case of GLM).

The new commands are highlighted below:
spark.connect(master, name = NULL, memory = NULL,  dfs.namenode = NULL)

spark.connect() requires loading the ORCH library first to read the configuration of the Hadoop Cluster.

The “master” argument can be specified either as “yarn-client”, to use YARN for resource allocation, or as a direct reference to a Spark Master service and port, in which case Spark runs in Standalone mode.
The “name” argument is optional, and it helps centralize logging of the session on the Spark Master.  By default, the application name shown on the Spark Master is “ORCH”.
The “memory” argument indicates the amount of memory per Spark Worker to dedicate to this Spark Context.

Finally, “dfs.namenode” points to the Apache HDFS NameNode server, so that information can be exchanged with HDFS.

In summary, to establish a Spark connection, one could do:
> spark.connect("yarn-client", memory="2g", dfs.namenode="my.namenode.server.com")
Conversely, to disconnect the session after the work is done, use spark.disconnect(), with no arguments.

The command spark.connected() checks the status of the current session and contains the connection information for the server. It is called automatically by the new algorithms to check for a valid connection.
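Putting these commands together, a typical session lifecycle looks like the sketch below. The server names are placeholders, and running it requires an ORAAH installation plus a reachable Spark cluster:

```r
library(ORCH)   # must be loaded first so the Hadoop configuration is read

# Establish the Spark Context (YARN mode shown; memory is per Spark Worker)
spark.connect("yarn-client", name = "my-analysis", memory = "2g",
              dfs.namenode = "my.namenode.server.com")

# Verify the connection status at any point
spark.connected()

# ... attach data and build models here ...

# Release the Spark resources when done
spark.disconnect()
```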

ORAAH 2.5.0 introduces support for loading data to Spark cache from an HDFS file via the function hdfs.toRDD(). ORAAH dfs.id objects were also extended to support both data residing in HDFS and in Spark memory, and allow the user to cache the HDFS data to an RDD object for use with the new algorithms.
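For example, an HDFS dataset can be cached into Spark memory ahead of model building. The sketch below uses the path from this Blog's examples; the exact return value of hdfs.toRDD() should be confirmed in the product documentation:

```r
# Attach an HDFS CSV dataset as an ORAAH dfs.id object
dat <- hdfs.attach("/user/oracle/ontime_1bi")

# Cache the HDFS data into Spark as an RDD; the resulting dfs.id
# object can then be passed to the new algorithms directly
dat_rdd <- hdfs.toRDD(dat)
```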

For all the examples used in this Blog, the R session was started as follows:

Oracle Distribution of R version 3.1.1 (--) -- "Sock it to Me"
Copyright (C) The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

You are using Oracle's distribution of R. Please contact
Oracle Support for any problems you encounter with this
distribution.

[Workspace loaded from ~/.RData]

> library(ORCH)
Loading required package: OREbase

Attaching package: ‘OREbase’

The following objects are masked from ‘package:base’:

cbind, data.frame, eval, interaction, order, paste, pmax, pmin, rbind, table

Loading required package: OREstats
Loading required package: MASS
Loading required package: ORCHcore
Loading required package: rJava
Oracle R Connector for Hadoop 2.5.0 (rev. 307)
Info: using native C base64 encoding implementation
Info: Hadoop distribution is Cloudera's CDH v5.3.0
Info: using auto-detected ORCH HAL v4.2
Info: HDFS workdir is set to "/user/oracle"
Warning: mapReduce checks are skipped due to "ORCH_MAPRED_CHECK"=FALSE
Warning: HDFS checks are skipped due to "ORCH_HDFS_CHECK"=FALSE
Info: Hadoop 2.5.0-cdh5.3.0 is up
Info: Sqoop 1.4.5-cdh5.3.0 is up
Info: OLH 3.3.0 is up
Info: loaded ORCH core Java library "orch-core-2.5.0-mr2.jar"
Loading required package: ORCHstats
>
> # Spark Context Creation
> spark.connect(master="spark://my.spark.server:7077", memory="24G",dfs.namenode="my.dfs.namenode.server")
>


In this case, we are requesting 24 GB of RAM per node in Standalone mode. Since our BDA has 6 nodes, the total RAM assigned to our Spark Context is 144 GB, which can be verified on the Spark Master screen.


GLM – Logistic Regression


In this release, thanks to a totally overhauled computation engine, we created a new function called orch.glm2() that executes the Logistic Regression model exclusively on Apache Spark.  The input data expected by the algorithm is an ORAAH dfs.id object: an HDFS CSV dataset, a Hive table made compatible via the hdfs.fromHive() command, or an HDFS CSV dataset that has been cached into Apache Spark as an RDD object using the command hdfs.toRDD().

A simple example of the new algorithm running on the ONTIME dataset with 1 billion records is shown below. The objective of the test model is the prediction of cancelled flights. The new model requires factor variables to be indicated with F() in the formula, and the default (and only family available in this release) is binomial().

The R code and the output below assume that the connection to the Spark cluster has already been established.

> # Attaches the HDFS file for use within R
> ont1bi <- hdfs.attach("/user/oracle/ontime_1bi")
> # Checks the size of the Dataset
> hdfs.dim(ont1bi)
 [1] 1000000000         30
> # Testing the GLM Logistic Regression Model on Spark
> # Formula definition: Cancelled flights (0 or 1) based on other attributes

> form_oraah_glm2 <- CANCELLED ~ DISTANCE + ORIGIN + DEST + F(YEAR) + F(MONTH) +
+   F(DAYOFMONTH) + F(DAYOFWEEK)

> # ORAAH GLM2 Computation from HDFS data (computing factor levels on its own)

> system.time(m_spark_glm <- orch.glm2(formula=form_oraah_glm2, ont1bi))
 ORCH GLM: processed 6 factor variables, 25.806 sec
 ORCH GLM: created model matrix, 100128 partitions, 32.871 sec
 ORCH GLM: iter  1,  deviance   1.38433414089348300E+09,  elapsed time 9.582 sec
 ORCH GLM: iter  2,  deviance   3.39315388583931150E+08,  elapsed time 9.213 sec
 ORCH GLM: iter  3,  deviance   2.06855738812683250E+08,  elapsed time 9.218 sec
 ORCH GLM: iter  4,  deviance   1.75868100359263200E+08,  elapsed time 9.104 sec
 ORCH GLM: iter  5,  deviance   1.70023181759611580E+08,  elapsed time 9.132 sec
 ORCH GLM: iter  6,  deviance   1.69476890425481350E+08,  elapsed time 9.124 sec
 ORCH GLM: iter  7,  deviance   1.69467586045954760E+08,  elapsed time 9.077 sec
 ORCH GLM: iter  8,  deviance   1.69467574351380850E+08,  elapsed time 9.164 sec
user  system elapsed
84.107   5.606 143.591 

> # Shows the general features of the GLM Model
> summary(m_spark_glm)
               Length Class  Mode   
coefficients   846    -none- numeric
deviance         1    -none- numeric
solutionStatus   1    -none- character
nIterations      1    -none- numeric
formula          1    -none- character
factorLevels     6    -none- list  

A sample benchmark against the same model running on Map-Reduce is illustrated below.  The Map-Reduce model used the call orch.glm(formula, dfs.id, family=binomial()), with as.factor() in the formula.


We can see that the Spark-based GLM2 delivers a large performance advantage over the model executing on Map-Reduce.

Later in this Blog we are going to see the performance of the Spark-based GLM Logistic Regression on 1 billion records.

Linear Model with Neural Networks


For the MLP Neural Networks model, the same algorithm was adapted to execute using Spark caching.  The exact same code and function call will detect whether there is a connection to a Spark Context and, if so, will execute the computations using it.

In this case, the code for both the Map-Reduce and the Spark-based executions is exactly the same, with the exception of the spark.connect() call that is required for the Spark-based version to kick in.
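In other words, the only difference between the two runs is whether a Spark Context exists. A minimal sketch, with placeholder server names and assuming the formula and data objects are already defined:

```r
# Without a Spark connection, orch.neural() runs on Map-Reduce
mod_mr <- orch.neural(formula = form_oraah_neu, dfs.dat = ont1bi)

# After connecting, the exact same call executes on Spark
spark.connect("yarn-client", memory = "2g",
              dfs.namenode = "my.namenode.server.com")
mod_spark <- orch.neural(formula = form_oraah_neu, dfs.dat = ont1bi)
```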

The objective of the test model is the prediction of flight arrival delays in minutes, so the model class is a Regression Model. The R code used to run the benchmarks and its output are below; it assumes that the connection to the Spark cluster is already established.

> # Attaches the HDFS file for use within R
> ont1bi <- hdfs.attach("/user/oracle/ontime_1bi")

> # Checks the size of the Dataset
> hdfs.dim(ont1bi)
 [1] 1000000000         30

> # Testing Neural Model on Spark
> # Formula definition: Arrival Delay based on other attributes

> form_oraah_neu <- ARRDELAY ~ DISTANCE + ORIGIN + DEST + as.factor(MONTH) +
+   as.factor(YEAR) + as.factor(DAYOFMONTH) + as.factor(DAYOFWEEK)

> # Compute Factor Levels from HDFS data
> system.time(xlev <- orch.getXlevels(form_oraah_neu, dfs.dat = ont1bi))
    user  system elapsed
17.717   1.348  50.495

> # Compute and Cache the Model Matrix from HDFS data, passing factor levels
> system.time(Mod_Mat <- orch.prepare.model.matrix(form_oraah_neu, dfs.dat = ont1bi,xlev=xlev))
   user  system elapsed
17.933   1.524  95.624

> # Compute Neural Model from RDD cached Model Matrix
> system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat, xlev=xlev, trace=T))
Unconstrained Nonlinear Optimization
L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
Iter           Objective Value   Grad Norm        Step   Evals
  1   5.08900838381104858E+11   2.988E+12   4.186E-16       2
  2   5.08899723803646790E+11   2.987E+12   1.000E-04       3
  3   5.08788839748061768E+11   2.958E+12   1.000E-02       3
  4   5.07751213455999573E+11   2.662E+12   1.000E-01       4
  5   5.05395855303159180E+11   1.820E+12   3.162E-01       1
  6   5.03327619811536194E+11   2.517E+09   1.000E+00       1
  7   5.03327608118144775E+11   2.517E+09   1.000E+00       6
  8   4.98952182330299011E+11   1.270E+12   1.000E+00       1
  9   4.95737805642779968E+11   1.504E+12   1.000E+00       1
 10   4.93293224063758362E+11   8.360E+11   1.000E+00       1
 11   4.92873433106044373E+11   1.989E+11   1.000E+00       1
 12   4.92843500119498352E+11   9.659E+09   1.000E+00       1
 13   4.92843044802041565E+11   6.888E+08   1.000E+00       1
Solution status             Optimal (objMinProgress)
Number of L-BFGS iterations 13
Number of objective evals   27
Objective value             4.92843e+11
Gradient norm               6.88777e+08
   user  system elapsed
43.635   4.186  63.319 

> # Checks the general information of the Neural Network Model
> mod_neu
Number of input units      845
Number of output units     1
Number of hidden layers    0
Objective value            4.928430E+11
Solution status            Optimal (objMinProgress)
Output layer               number of neurons 1, activation 'linear'
Optimization solver        L-BFGS
Scale Hessian inverse      1
Number of L-BFGS updates   20
> mod_neu$nObjEvaluations
[1] 27
> mod_neu$nWeights
[1] 846

A sample benchmark against the same models running on Map-Reduce is illustrated below.  The Map-Reduce models used the exact same orch.neural() calls as the Spark-based ones, with only the Spark connection as the difference.

We can clearly see that the larger the dataset, the larger the speed advantage of the Spark-based computation over Map-Reduce, reducing run times from many hours to a few minutes.

This new level of performance makes it possible to run much larger problems and to test several models on 1 billion records, something that previously took half a day for a single model.

Logistic and Deep Neural Networks with 1 billion records


To prove that it is now feasible to run not only Logistic and Linear Model Neural Networks on large-scale datasets, but also complex Multi-Layer Neural Network models, we decided to test the same 1-billion-record dataset against several different architectures.

These tests were done to check the performance and feasibility of these types of models, not to compare precision or quality, which will be part of a different Blog.

The default activation function for all Multi-Layer Neural Network models, the bipolar sigmoid function, was used, as was the default output-layer activation, the linear function.

As a reminder, the number of weights we need to compute for a Neural Networks is as follows:

The generic formulation for the number of weights to be computed is:
Total number of weights = SUM over all layers, from the first hidden layer to the output layer, of [(number of inputs into the layer + 1) * number of neurons in the layer]
In the simple example, we had [(3 inputs + 1 bias) * 2 neurons] + [(2 neurons + 1 bias) * 1 output] = 8 + 3 = 11 weights

In our tests, the simple Neural Network model (the Linear Model) computed 846 weights by the same formula, because it uses 845 inputs plus the bias.

Thus, the number of weights for the Deep Multi-Layer Neural Networks that we are about to test is as follows:

MLP 3 Layers (50,25,10) => [(845+1)*50]+[(50+1)*25]+[(25+1)*10]+[(10+1)*1] = 43,846 weights

MLP 4 Layers (100,50,25,10) => [(845+1)*100]+[(100+1)*50]+[(50+1)*25]+[(25+1)*10]+[(10+1)*1] = 91,196 weights

MLP 5 Layers (200,100,50,25,10) => [(845+1)*200]+[(200+1)*100]+[(100+1)*50]+[(50+1)*25]+ [(25+1)*10]+[(10+1)*1] = 195,896 weights
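This bookkeeping can be checked with a few lines of plain R. The helper function below is ours, not part of ORAAH; it applies the generic formulation above to any stack of fully-connected layers:

```r
# Count the weights of a fully-connected network: each layer contributes
# (number of inputs into the layer + 1 bias) * number of neurons.
# The final entry of the size vector is the single linear output unit.
count_weights <- function(n_inputs, hidden_sizes, n_outputs = 1) {
  sizes <- c(n_inputs, hidden_sizes, n_outputs)
  sum((head(sizes, -1) + 1) * tail(sizes, -1))
}

count_weights(3, 2)                          # the simple example: 11 weights
count_weights(845, c(50, 25, 10))            # 43846
count_weights(845, c(100, 50, 25, 10))       # 91196
count_weights(845, c(200, 100, 50, 25, 10))  # 195896
```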

The time required to compute the GLM Logistic Regression model that predicts flight cancellations on 1 billion records is included simply to illustrate the performance of the new Spark-based algorithms.

The Neural Network Models are all predicting Arrival Delay of Flights, so they are either Linear Models (the first one, with no Hidden Layers) or Non-linear Models using the bipolar sigmoid activation function (the Multi-Layer ones).


This demonstrates that the capability of building very complex and deep networks is available with ORAAH, making it possible to build networks with hundreds of thousands or millions of weights for more complex problems.

Not only that, but a Logistic Model can be computed on 1 billion records in less than 2 and a half minutes, and a Linear Neural Model in almost 3 minutes.

The R Output Listing of the Logistic Regression computation and of the MLP Neural Networks are below.

> # Spark Context Creation
> spark.connect(master="spark://my.spark.server:7077", memory="24G",dfs.namenode="my.dfs.namenode.server")

> # Attaches the HDFS file for use with ORAAH
> ont1bi <- hdfs.attach("/user/oracle/ontime_1bi")

> # Checks the size of the Dataset
> hdfs.dim(ont1bi)
[1] 1000000000         30

GLM - Logistic Regression


> # Testing GLM Logistic Regression on Spark
> # Formula definition: Cancellation of Flights in relation to other attributes
> form_oraah_glm2 <- CANCELLED ~ DISTANCE + ORIGIN + DEST + F(YEAR) + F(MONTH) +
+   F(DAYOFMONTH) + F(DAYOFWEEK)

> # ORAAH GLM2 Computation from RDD cached data (computing factor levels on its own)
> system.time(m_spark_glm <- orch.glm2(formula=form_oraah_glm2, ont1bi))
 ORCH GLM: processed 6 factor variables, 25.806 sec
 ORCH GLM: created model matrix, 100128 partitions, 32.871 sec
 ORCH GLM: iter  1,  deviance   1.38433414089348300E+09,  elapsed time 9.582 sec
 ORCH GLM: iter  2,  deviance   3.39315388583931150E+08,  elapsed time 9.213 sec
 ORCH GLM: iter  3,  deviance   2.06855738812683250E+08,  elapsed time 9.218 sec
 ORCH GLM: iter  4,  deviance   1.75868100359263200E+08,  elapsed time 9.104 sec
 ORCH GLM: iter  5,  deviance   1.70023181759611580E+08,  elapsed time 9.132 sec
 ORCH GLM: iter  6,  deviance   1.69476890425481350E+08,  elapsed time 9.124 sec
 ORCH GLM: iter  7,  deviance   1.69467586045954760E+08,  elapsed time 9.077 sec
 ORCH GLM: iter  8,  deviance   1.69467574351380850E+08,  elapsed time 9.164 sec
user  system elapsed
84.107   5.606 143.591

> # Checks the general information of the GLM Model
> summary(m_spark_glm)
               Length Class  Mode   
coefficients   846    -none- numeric
deviance         1    -none- numeric
solutionStatus   1    -none- character
nIterations      1    -none- numeric
formula          1    -none- character
factorLevels     6    -none- list 

Neural Networks - Initial Steps


For the Neural Models, the total time is the sum of the time to compute the factor levels, the time to create the Model Matrix, and the elapsed time of the model computation itself.
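Using the elapsed times reported in the listings in this section, the total build time for, say, the 3-layer network can be added up directly:

```r
# Elapsed times in seconds, taken from the listings in this section
elapsed <- c(factor_levels = 48.765,    # orch.getXlevels()
             model_matrix  = 92.953,    # orch.prepare.model.matrix()
             model_build   = 1975.947)  # orch.neural(), 3 hidden layers

total_sec <- sum(elapsed)
round(total_sec / 60, 1)   # total build time in minutes: 35.3
```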

> # Testing Neural Model on Spark
> # Formula definition
> form_oraah_neu <- ARRDELAY ~ DISTANCE + ORIGIN + DEST + as.factor(MONTH) +
+   as.factor(YEAR) + as.factor(DAYOFMONTH) + as.factor(DAYOFWEEK)
>
> # Compute Factor Levels from HDFS data
> system.time(xlev <- orch.getXlevels(form_oraah_neu, dfs.dat = ont1bi))
  user  system elapsed
12.598   1.431  48.765

>
> # Compute and Cache the Model Matrix from cached RDD data
> system.time(Mod_Mat <- orch.prepare.model.matrix(form_oraah_neu, dfs.dat = ont1bi,xlev=xlev))
  user  system elapsed
  9.032   0.960  92.953

Neural Networks Model with 3 Layers of Neurons


> # Compute DEEP Neural Model from RDD cached Model Matrix (passing xlevels)
> # Three Layers, with 50, 25 and 10 neurons respectively.

> system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat,
+                                    xlev=xlev, hiddenSizes=c(50,25,10),trace=T))
Unconstrained Nonlinear Optimization
L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
Iter           Objective Value   Grad Norm        Step   Evals
  0   5.12100202340115967E+11   5.816E+09   1.000E+00       1
  1   4.94849165811250305E+11   2.730E+08   1.719E-10       1
  2   4.94849149028958862E+11   2.729E+08   1.000E-04       3
  3   4.94848409777413513E+11   2.702E+08   1.000E-02       3
  4   4.94841423640935242E+11   2.437E+08   1.000E-01       4
  5   4.94825372589270386E+11   1.677E+08   3.162E-01       1
  6   4.94810879175052673E+11   1.538E+07   1.000E+00       1
  7   4.94810854064597107E+11   1.431E+07   1.000E+00       1
Solution status             Optimal (objMinProgress)
Number of L-BFGS iterations 7
Number of objective evals   15
Objective value             4.94811e+11
Gradient norm               1.43127e+07
    user   system  elapsed
  91.024    8.476 1975.947

>
> # Checks the general information of the Neural Network Model
> mod_neu
Number of input units      845
Number of output units     1
Number of hidden layers    3
Objective value            4.948109E+11
Solution status            Optimal (objMinProgress)
Hidden layer [1]           number of neurons 50, activation 'bSigmoid'
Hidden layer [2]           number of neurons 25, activation 'bSigmoid'
Hidden layer [3]           number of neurons 10, activation 'bSigmoid'
Output layer               number of neurons 1, activation 'linear'
Optimization solver        L-BFGS
Scale Hessian inverse      1
Number of L-BFGS updates   20
> mod_neu$nObjEvaluations
[1] 15
> mod_neu$nWeights
[1] 43846
>

Neural Networks Model with 4 Layers of Neurons


> # Compute DEEP Neural Model from RDD cached Model Matrix (passing xlevels)
> # Four Layers, with 100, 50, 25 and 10 neurons respectively.

> system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat,
+                                    xlev=xlev, hiddenSizes=c(100,50,25,10),trace=T))
Unconstrained Nonlinear Optimization
L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
Iter           Objective Value   Grad Norm        Step   Evals
   0   5.15274440087001343E+11   7.092E+10   1.000E+00       1
   1   5.10168177067538818E+11   2.939E+10   1.410E-11       1
   2   5.10086354184862549E+11   5.467E+09   1.000E-02       2
   3   5.10063808510261475E+11   5.463E+09   1.000E-01       4
   4   5.07663007172408386E+11   5.014E+09   3.162E-01       1
   5   4.97115989230861267E+11   2.124E+09   1.000E+00       1
   6   4.94859162124700928E+11   3.085E+08   1.000E+00       1
   7   4.94810727630636963E+11   2.117E+07   1.000E+00       1
   8   4.94810490064279846E+11   7.036E+06   1.000E+00       1
Solution status             Optimal (objMinProgress)
Number of L-BFGS iterations 8
Number of objective evals   13
Objective value             4.9481e+11
Gradient norm               7.0363e+06
    user   system  elapsed
166.169   19.697 6467.703
>
> # Checks the general information of the Neural Network Model
> mod_neu
Number of input units      845
Number of output units     1
Number of hidden layers    4
Objective value            4.948105E+11
Solution status            Optimal (objMinProgress)
Hidden layer [1]           number of neurons 100, activation 'bSigmoid'
Hidden layer [2]           number of neurons 50, activation 'bSigmoid'
Hidden layer [3]           number of neurons 25, activation 'bSigmoid'
Hidden layer [4]           number of neurons 10, activation 'bSigmoid'
Output layer               number of neurons 1, activation 'linear'
Optimization solver        L-BFGS
Scale Hessian inverse      1
Number of L-BFGS updates   20
> mod_neu$nObjEvaluations
[1] 13
> mod_neu$nWeights
[1] 91196

Neural Networks Model with 5 Layers of Neurons


> # Compute DEEP Neural Model from RDD cached Model Matrix (passing xlevels)
> # Five Layers, with 200, 100, 50, 25 and 10 neurons respectively.

> system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat,
+                                    xlev=xlev, hiddenSizes=c(200,100,50,25,10),trace=T))

Unconstrained Nonlinear Optimization
L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
Iter           Objective Value   Grad Norm        Step   Evals
  0   5.14697806831633850E+11   6.238E+09   1.000E+00       1
......
......

  6   4.94837221890043518E+11   2.293E+08   1.000E+00       1
 7   4.94810299190365112E+11   9.268E+06   1.000E+00       1
 8   4.94810277908935242E+11   8.855E+06   1.000E+00       1
Solution status             Optimal (objMinProgress)
Number of L-BFGS iterations 8
Number of objective evals   16
Objective value             4.9481e+11
Gradient norm               8.85457e+06
     user    system   elapsed
  498.002    90.940 30473.421
>
> # Checks the general information of the Neural Network Model
> mod_neu
Number of input units      845
Number of output units     1
Number of hidden layers    5
Objective value            4.948103E+11
Solution status            Optimal (objMinProgress)
Hidden layer [1]           number of neurons 200, activation 'bSigmoid'
Hidden layer [2]           number of neurons 100, activation 'bSigmoid'
Hidden layer [3]           number of neurons 50, activation 'bSigmoid'
Hidden layer [4]           number of neurons 25, activation 'bSigmoid'
Hidden layer [5]           number of neurons 10, activation 'bSigmoid'
Output layer               number of neurons 1, activation 'linear'
Optimization solver        L-BFGS
Scale Hessian inverse      1
Number of L-BFGS updates   20
> mod_neu$nObjEvaluations
[1] 16
> mod_neu$nWeights
[1] 195896

Best Practices on logging level for using Apache Spark with ORAAH


It is important to note that Apache Spark’s logging is verbose by default, so after a few tests with different settings it might be useful to turn down the logging level, something a System Administrator will typically do by editing the file $SPARK_HOME/etc/log4j.properties.

By default, that file is going to look something like this:

# cat $SPARK_HOME/etc/log4j.properties

# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=INFO
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

A typical full log provides the information below, but it can produce too much output when running the models themselves, so it is most useful for initial tests and diagnostics.

> # Creates the Spark Context. Because the Memory setting is not specified,
> # the default of 1 GB of RAM per Spark Worker is used
> spark.connect("yarn-client", dfs.namenode="my.hdfs.namenode")
15/02/18 13:05:44 WARN SparkConf:
SPARK_JAVA_OPTS was detected (set to '-Djava.library.path=/usr/lib64/R/lib'). This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with conf/spark-defaults.conf to set defaults for an application
- ./spark-submit with --driver-java-options to set -X options for a driver
- spark.executor.extraJavaOptions to set -X options for executors
- SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
15/02/18 13:05:44 WARN SparkConf: Setting 'spark.executor.extraJavaOptions' to '-Djava.library.path=/usr/lib64/R/lib' as a work-around.
15/02/18 13:05:44 WARN SparkConf: Setting 'spark.driver.extraJavaOptions' to '-Djava.library.path=/usr/lib64/R/lib' as a work-around.
15/02/18 13:05:44 INFO SecurityManager: Changing view acls to: oracle
15/02/18 13:05:44 INFO SecurityManager: Changing modify acls to: oracle
15/02/18 13:05:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(oracle); users with modify permissions: Set(oracle)
15/02/18 13:05:44 INFO Slf4jLogger: Slf4jLogger started
15/02/18 13:05:44 INFO Remoting: Starting remoting
15/02/18 13:05:45 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@my.spark.master:35936]
15/02/18 13:05:45 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@my.spark.master:35936]
15/02/18 13:05:46 INFO SparkContext: Added JAR /u01/app/oracle/product/12.2.0/dbhome_1/R/library/ORCHcore/java/orch-core-2.4.1-mr2.jar at http://1.1.1.1:11491/jars/orch-core-2.4.1-mr2.jar with timestamp 1424264746100
15/02/18 13:05:46 INFO SparkContext: Added JAR /u01/app/oracle/product/12.2.0/dbhome_1/R/library/ORCHcore/java/orch-bdanalytics-core-2.4.1- mr2.jar at http://1.1.1.1:11491/jars/orch-bdanalytics-core-2.4.1-mr2.jar with timestamp 1424264746101
15/02/18 13:05:46 INFO RMProxy: Connecting to ResourceManager at my.hdfs.namenode /10.153.107.85:8032
Utils: Successfully started service 'sparkDriver' on port 35936.
SparkEnv: Registering MapOutputTracker
SparkEnv: Registering BlockManagerMaster
DiskBlockManager: Created local directory at /tmp/spark-local-
MemoryStore: MemoryStore started with capacity 265.1 MB
HttpFileServer: HTTP File server directory is /tmp/spark-7c65075f-850c-
HttpServer: Starting HTTP Server
Utils: Successfully started service 'HTTP file server' on port 11491.
Utils: Successfully started service 'SparkUI' on port 4040.
SparkUI: Started SparkUI at http://my.hdfs.namenode:4040
15/02/18 13:05:46 INFO Client: Requesting a new application from cluster with 1 NodeManagers
15/02/18 13:05:46 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/02/18 13:05:46 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB
15/02/18 13:05:46 INFO Client: Setting up container launch context for our AM
15/02/18 13:05:46 INFO Client: Preparing resources for our AM container
15/02/18 13:05:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/18 13:05:46 INFO Client: Uploading resource file:/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/spark/lib/spark-assembly.jar -> hdfs://my.hdfs.namenode:8020/user/oracle/.sparkStaging/application_1423701785613_0009/spark-assembly.jar
15/02/18 13:05:47 INFO Client: Setting up the launch environment for our AM container
15/02/18 13:05:47 INFO SecurityManager: Changing view acls to: oracle
15/02/18 13:05:47 INFO SecurityManager: Changing modify acls to: oracle
15/02/18 13:05:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(oracle); users with modify permissions: Set(oracle)
15/02/18 13:05:47 INFO Client: Submitting application 9 to ResourceManager
15/02/18 13:05:47 INFO YarnClientImpl: Submitted application application_1423701785613_0009
15/02/18 13:05:48 INFO Client: Application report for application_1423701785613_0009 (state: ACCEPTED)
15/02/18 13:05:48 INFO Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: root.oracle
     start time: 1424264747559
     final status: UNDEFINED
     tracking URL: http://my.hdfs.namenode:8088/proxy/application_1423701785613_0009/
     user: oracle
15/02/18 13:05:49 INFO Client: Application report for application_1423701785613_0009 (state: ACCEPTED)
15/02/18 13:05:50 INFO Client: Application report for application_1423701785613_0009 (state: ACCEPTED)

Please note that all those warnings are expected, and may vary depending on the release of Spark used.

With the console logging level in the log4j.properties settings lowered from INFO to WARN, the request for a Spark Context returns the following:

# cat $SPARK_HOME/etc/log4j.properties

# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

Now the R Log is going to show only a few details about the Spark Connection.

> # Creates the Spark Context. Because the Memory setting is not specified,
> # the default of 1 GB of RAM per Spark Worker is used
> spark.connect(master="yarn-client", dfs.namenode="my.hdfs.server")
15/04/09 23:32:11 WARN SparkConf:
SPARK_JAVA_OPTS was detected (set to '-Djava.library.path=/usr/lib64/R/lib').
This is deprecated in Spark 1.0+.

Please instead use:
- ./spark-submit with conf/spark-defaults.conf to set defaults for an application
- ./spark-submit with --driver-java-options to set -X options for a driver
- spark.executor.extraJavaOptions to set -X options for executors
- SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)

15/04/09 23:32:11 WARN SparkConf: Setting 'spark.executor.extraJavaOptions' to '-Djava.library.path=/usr/lib64/R/lib' as a work-around.
15/04/09 23:32:11 WARN SparkConf: Setting 'spark.driver.extraJavaOptions' to '-Djava.library.path=/usr/lib64/R/lib' as a work-around.
15/04/09 23:32:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
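As the comment in the session above notes, spark.connect() falls back to 1 GB of RAM per Spark worker when no memory setting is given. The sketch below shows an explicit allocation instead; the `memory` argument name is an assumption based on the ORAAH 2.5.0 in-product help, so verify it with `?spark.connect` in your installation before relying on it.

```
# Hypothetical sketch: request 4 GB per Spark worker instead of the 1 GB default.
# The 'memory' argument name is an assumption -- check ?spark.connect.
library(ORCH)
spark.connect(master = "yarn-client",
              dfs.namenode = "my.hdfs.server",
              memory = "4G")

# When finished, release the Spark Context:
spark.disconnect()
```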

Finally, with the Console logging option in the log4j.properties file set to ERROR instead of INFO or WARN, the request for a Spark Context would return nothing in case of success:

# cat $SPARK_HOME/etc/log4j.properties

# Set everything to be logged to the console
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

This time no message is returned to the R session, because we requested feedback only in case of an error:

> # Creates the Spark Context. Because the Memory setting is not specified,
> # the default of 1 GB of RAM per Spark Worker is used
> spark.connect(master="yarn-client", dfs.namenode="my.hdfs.server")
>

In summary, it is practical to start any project with full logging, but it is a good idea to lower the logging level to WARN or ERROR once the environment has been verified to work and the settings are stable.

Monday Sep 17, 2012

Podcast interview with Michael Kane

In this podcast interview, Michael Kane, Data Scientist and Associate Researcher at Yale University, discusses the R statistical programming language, the computational challenges associated with big data, and two data analysis projects he conducted: one on the stock market "flash crash" of May 6, 2010, and another tracking the spread of bird flu (H5N1) along transportation routes. Michael also worked with Oracle on Oracle R Enterprise, a component of the Advanced Analytics option to Oracle Database Enterprise Edition. In the closing segment of the interview, Michael comments on the relationship between the data analyst and the database administrator, and how Oracle R Enterprise provides secure data management, transparent access to data, and improved performance to facilitate this relationship.

Listen now...

Friday Feb 03, 2012

What is R?

For many in the Oracle community, the addition of R through Oracle R Enterprise could leave them wondering "What is R?"

R has been receiving a lot of attention recently, although it’s been around for over 15 years. R is an open-source language and environment for statistical computing and data visualization, supporting data manipulation and transformations, as well as sophisticated graphical displays. It's being taught in colleges and universities in courses on statistics and advanced analytics - even replacing more traditional statistical software tools. Corporate data analysts and statisticians often know R and use it in their daily work, either writing their own R functionality, or leveraging the more than 3400 open source packages. The Comprehensive R Archive Network (CRAN) open source packages support a wide range of statistical and data analysis capabilities. They also focus on analytics specific to individual fields, such as bioinformatics, finance, econometrics, medical image analysis, and others (see CRAN Task Views).

So why do statisticians and data analysts use R?

Well, R is a statistics language similar to SAS or SPSS. It’s a powerful, extensible environment, and as noted above, it has a wide range of statistics and data visualization capabilities. It’s easy to install and use, and it’s free – downloadable from the CRAN R project website.

In contrast, statisticians and data analysts typically don’t know SQL and are not familiar with database tasks. R provides statisticians and data analysts access to a wide range of analytical capabilities in a natural statistical language, allowing them to remain highly productive. For example, writing R functions is simple and can be done quickly. Functions can be made to return R objects that can be easily passed to and manipulated by other R functions. By comparison, traditional statistical tools can make the implementation of functions cumbersome, such that programmers resort to macro-oriented programming constructs instead.
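For instance, a small R function can build and return a model object that flows straight into other function calls, with no macro machinery involved:

```
# Define a function that fits a linear model and returns the fitted object.
fit_mpg <- function(df) {
  lm(mpg ~ wt + hp, data = df)
}

# The returned R object passes naturally into other functions:
model <- fit_mpg(mtcars)
summary(model)                 # coefficients, standard errors, R-squared
predict(model, head(mtcars))   # predictions for the first few rows
```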

So why do we need anything else?

R was conceived as a single user tool that is not multi-threaded.  The client and server components are bundled together as a single executable, much like Excel.

R is limited by the memory and processing power of the machine where it runs, but in addition, being single threaded, it cannot automatically leverage the CPU capacity on a user’s multi-processor laptop without special packages and programming.

However, there is another issue that limits R’s scalability…

R’s approach to passing data between function invocations results in data duplication – this chews up memory faster. So inherently, R is not good for big data, or depending on the machine and tasks, even gigabyte-sized data sets.
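This duplication is easy to observe with base R's tracemem(), which reports whenever an object's memory is copied:

```
x <- rnorm(1e7)          # roughly 80 MB of doubles
tracemem(x)              # start reporting copies of x

double_it <- function(v) {
  v[1] <- 0              # modifying the argument forces a full copy
  v * 2                  # and the arithmetic allocates yet another vector
}

y <- double_it(x)        # tracemem reports the copy made inside the call
```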

This is where Oracle R Enterprise comes in. As we'll continue to discuss in this blog, Oracle R Enterprise lifts this memory and computational constraint found in R today by executing requested R calculations on data in the database, using the database itself as the computational engine. Oracle R Enterprise allows users to further leverage Oracle's engineered systems, like Exadata, Big Data Appliance, and Exalytics, for enterprise-wide analytics, as well as reporting tools like Oracle Business Intelligence Enterprise Edition dashboards and BI Publisher documents.





Tuesday Jan 17, 2012

Welcome to Oracle R Enterprise!

Welcome to the Oracle R Enterprise blog - brought to you by the Oracle Advanced Analytics group. We'll be sharing best practices, tips, and tricks for applying Oracle R Enterprise and Oracle R Connector for Hadoop in both traditional and new "big data" environments. Oracle R Enterprise and Oracle Data Mining are the two components of the new Oracle Advanced Analytics Option to Oracle Database.

Here's a brief introduction to Oracle's R offerings: Oracle R Distribution, Oracle R Enterprise, and Oracle R Connector for Hadoop.

Oracle R Distribution provides an Oracle-supported distribution of open source R — enhanced with Intel’s MKL libraries for high performance mathematical computations on x86 hardware. The Oracle R Distribution facilitates enterprise acceptance of R, since the lack of a major corporate sponsor has made some companies concerned about fully adopting R.

Oracle R Enterprise (ORE) integrates the open-source R statistical environment and language with Oracle Database 11g, and the Oracle engineered solutions of Oracle Exadata and Oracle Big Data Appliance. ORE delivers enterprise-level advanced analytics based on the R environment, leveraging the database as an analytical compute engine. This allows R users like data analysts and statisticians to use the R client directly against data stored in Oracle Database 11g—vastly increasing scalability, performance, and security.

As an embedded component of the RDBMS, ORE eliminates R’s memory constraints since it can work on data directly in the database. R users can also execute R scripts in Oracle Database to support enterprise production applications. R's data.frame results and sophisticated graphics can be delivered through Oracle BI Publisher documents and OBIEE dashboards. Since it’s R, users are also able to leverage the latest contributed open source packages.

For data mining, R users not only can build models using any of the algorithms in the CRAN machine learning task view, but also leverage in-database implementations for predictions (e.g., stepwise regression, GLM, SVM), attribute selection, clustering, feature extraction via non-negative matrix factorization, association rules, and anomaly detection.

Oracle R Connector for Hadoop, one of the connectors available for Oracle Big Data Appliance, allows R users to work with the Hadoop Distributed File System (HDFS) and execute MapReduce programs on the Big Data Appliance Hadoop Cluster. R users write mapper and reducer functions in the R language, and invoke MapReduce jobs from the R environment.
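A word-count-style sketch of that mapper/reducer pattern is shown below, using the hadoop.run() call and orch.keyval() helper from the ORCH package. The function names follow the Oracle R Connector for Hadoop documentation, but treat the exact signatures, the HDFS path, and the CARRIER column as illustrative assumptions to verify against your release.

```
library(ORCH)

# Attach an existing HDFS data set (hypothetical path and schema).
input <- hdfs.attach("/user/oracle/flights")

# Count records per carrier with a MapReduce job written in R.
res <- hadoop.run(input,
  mapper = function(key, val) {
    # Emit one (carrier, 1) pair per input record.
    orch.keyval(val$CARRIER, 1)
  },
  reducer = function(key, vals) {
    # Sum the counts emitted for each carrier.
    orch.keyval(key, sum(unlist(vals)))
  })

hdfs.get(res)   # pull the aggregated result back into the R session
```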

We'll be exploring these components and their application in future posts.


 

About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.
