Monday Jan 04, 2016

ORE Random Forest

Random Forest is a popular ensemble learning technique for classification and regression, developed by Leo Breiman and Adele Cutler. By combining the ideas of “bagging” and random selection of variables, the algorithm produces a collection of decision trees with controlled variance, while avoiding overfitting – a common problem for decision trees. By constructing many trees, classification predictions are made by selecting the mode of classes predicted, while regression predictions are computed using the mean from the individual tree predictions.

Although the Random Forest algorithm provides high accuracy, performance and scalability can be issues for larger data sets. Oracle R Enterprise 1.5 introduces Random Forest for classification with three enhancements:

  •  ore.randomForest uses the ore.frame proxy for database tables so that data remain in the database server
  •  ore.randomForest executes in parallel for model building and scoring while using Oracle R Distribution or R’s randomForest package 4.6-10
  •  randomForest in Oracle R Distribution significantly reduces memory requirements of R’s algorithm, providing only the functionality required for use by ore.randomForest

Performance

Consider the model build performance of randomForest for 500 trees (the default) and three data set sizes (10K, 100K, and 1M rows). The formula is

‘DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH’

using samples of the popular ONTIME domestic flight dataset.

With ORE’s parallel, distributed implementation, ore.randomForest is an order of magnitude faster than the commonly used randomForest package. While the first plot uses the original execution times, the second uses a log scale to facilitate interpreting scalability.

Memory vs. Speed
ore.randomForest
is designed for speed, relying on ORE embedded R execution for parallelism to achieve the order of magnitude speedup. However, the data set is loaded into memory for each parallel R engine, so high degrees of parallelism (DOP) will result in the corresponding use of memory. Since Oracle R Distribution’s randomForest improves memory usage over R's randomForest (approximately 7X less), larger data sets can be accommodated. Users can specify the DOP using the ore.parallel global option.

API

The ore.randomForest API:

ore.randomForest(formula, data, ntree=500, mtry = NULL,
                replace = TRUE, classwt = NULL, cutoff = NULL,
                sampsize = if(replace) nrow(data) else ceiling(0.632*nrow(data)),
                nodesize = 1L, maxnodes = NULL, confusion.matrix = FALSE,
                na.action = na.fail, ...)

To highlight two of the arguments, confusion_matrix is a logical value indicating whether to calculate the confusion matrix. Note that this confusion matrix is not based on OOB (out-of-bag), it is the result of applying the built random forest model to the entire training data.


Argument groups is the number of tree groups that the total number of trees are divided into during model build. The default is equal to the value of the option 'ore.parallel'. If system memory is limited, it is recommended to set this argument to a value large enough so that the number of trees in each group is small to avoid exceeding memory availability.

Scoring with ore.randomForest follows other ORE scoring functions:

predict(object, newdata,
        type = c("response", "prob", "vote", "all"),
        norm.votes = TRUE,
        supplemental.cols = NULL,
        cache.model = TRUE, ...)

The arguments include:

  •  type: scoring output content – 'response', 'prob', 'votes', or 'all'. Corresponding to predicted values, matrix of class probabilities, matrix of vote counts, or both the vote matrix and predicted values, respectively.
  •  norm.votes: a logical value indicating whether the vote counts in the output vote matrix should be normalized. The argument is ignored if 'type' is 'response' or 'prob'.
  •  supplemental.cols: additional columns from the 'newdata' data set to include in the prediction result. This can be particularly useful for including a key column that can be related back to the original data set.
    cache.model: a logical value indicating whether the entire random forest model is cached in memory during prediction. While the default is TRUE, setting it to FALSE may be beneficial if memory is an issue.

Example


options(ore.parallel=8)
df <- ONTIME_S[,c("DAYOFWEEK","DEPDELAY","DISTANCE",
             "UNIQUECARRIER","DAYOFMONTH","MONTH")]
df <- df[complete.cases(df),]
mod <- ore.randomForest(DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH,                 df, ntree=100,groups=20)
ans <- predict(mod, df, type="all", supplemental.cols="DAYOFWEEK")
head(ans)



R> options(ore.parallel=8)
R> df <- ONTIME_S[,c("DAYOFWEEK","DEPDELAY","DISTANCE",
            "UNIQUECARRIER","DAYOFMONTH","MONTH")]
R> df <- dd[complete.cases(dd),]
R> mod <- ore.randomForest(DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH,
+                 df, ntree=100,groups=20)
R> ans <- predict(mod, df, type="all", supplemental.cols="DAYOFWEEK")

R> head(ans)
     1    2    3    4    5    6    7 prediction DAYOFWEEK
1 0.09 0.01 0.06 0.04 0.70 0.05 0.05 5          5
2 0.06 0.01 0.02 0.03 0.01 0.38 0.49 7          6
3 0.11 0.03 0.16 0.02 0.06 0.57 0.05 6          6
4 0.09 0.04 0.15 0.03 0.02 0.62 0.05 6          6
5 0.04 0.04 0.04 0.01 0.06 0.72 0.09 6          6
6 0.35 0.11 0.14 0.27 0.05 0.08 0.00 1          1

Friday Dec 06, 2013

Oracle R Distribution 3.0.1 now available for Windows 64-bit

We are excited to introduce support for Oracle R Distribution 3.0.1 on Windows 64-bit versions. Previous releases are available on Solaris x86, Solaris SPARC, AIX and Linux 64-bit platforms. Oracle R Distribution (ORD) continues to support these platforms and now expands support to Windows 64-bit platforms.

ORD is Oracle's free distribution of the open source R environment that adds support for dynamically loading the Intel Math Kernel Library (MKL) installed on your system. MKL provides faster performance by taking advantage of hardware-specific math library implementations. The net effect is optimized processing speed, especially on multi-core systems.

To enable MKL support on your ORD Windows client:

1. Add the location of libOrdBlasLoader.dll and mkl_rt.dll to the PATH system environment variable on the client.

In a typical ORD 3.0.1 installation, libOrdBlasLoader.dll is located in the R HOME directory:

C:\Program Files\R\R-3.0.1\bin\x64

In a full MKL 11.1 installation, mkl_rt.dll is located in the Intel MKL Composer XE directory:

C:\Program Files (x86)\Intel\Composer XE 2013 SP

2. Start R and execute the function Sys.BlasLapack:

    R> Sys.BlasLapack()
     $vendor
     [1] "Intel Math Kernel Library (Intel MKL)"

     $nthreads
     [1] -1

The vendor value returned indicates the presence of MKL instead of R's internal BLAS. The value for the number of threads to utilize, nthreads = -1, indicates all available cores are used by default. To modify the number of threads used, set the system environment variable MKL_NUM_THREADS = n, where n is the number of physical cores in the system you wish to use.

To install MKL on your Windows client, you must have an MKL license.

Oracle R Distribution will be certified with a future release of Oracle R Enterprise, and is available now from Oracle's free and Open Source Software portal. Questions and comments are welcome on the Oracle R Forum.

Friday Jan 18, 2013

Oracle R Distribution Performance Benchmark

Oracle R Distribution Performance Benchmarks

Oracle R Distribution provides dramatic performance gains with MKL

Using the recognized R benchmark R-benchmark-25.R test script, we compared the performance of Oracle R Distribution with and without the dynamically loaded high performance Math Kernel Library (MKL) from Intel. The benchmark results show Oracle R Distribution is significantly faster with the dynamically loaded high performance library. R users can immediately gain performance enhancements over open source R, analyzing data on 64-bit architectures and leveraging parallel processing within specific R functions that invoke computations performed by these high performance libraries.

The Community-developed test consists of matrix calculations and functions, program control, matrix multiplication, Cholesky Factorization, Singular Value Decomposition (SVD), Principal Component Analysis (PCA), and Linear Discriminant Analysis. Such computations form a core component of many real-world problems, often taking the majority of compute time. The ability to speed up these computations means faster results for faster decision making.

While the benchmark results reported were conducted using Intel MKL, Oracle R Distribution also supports AMD Core Math Library (ACML) and Solaris Sun Performance Library.

Oracle R Distribution 2.15.1 x64 Benchmark Results (time in seconds)


 ORD with internal BLAS/LAPACK
1 thread
 ORD + MKL
1 thread
 ORD + MKL
2 threads
 ORD + MKL
4 threads
 ORD + MKL
8 threads
 Performance gain ORD + MKL
4 threads
 Performance gain ORD + MKL
8 threads
 Matrix Calculations
 11.2  1.9  1.3  1.1  0.9  9.2x  11.4x
 Matrix Functions
 7.2  1.1 0.6
 0.4  0.4  17.0x  17.0x
 Program Control
 1.4  1.3  1.5  1.4  0.8  0.0x  0.8x
 Matrix Multiply
 517.6  21.2  10.9  5.8  3.1  88.2x  166.0x
 Cholesky Factorization
 25  3.9  2.1  1.3  0.8  18.2x  29.4x
 Singular Value Decomposition
 103.5  15.1  7.8  4.9  3.4  20.1x  40.9x
 Principal Component Analysis
 490.1  42.7  24.9  15.9  11.7  29.8x  40.9x
 Linear Discriminant Analysis
 419.8  120.9  110.8  94.1  88.0  3.5x  3.8x

This benchmark was executed on a 3-node cluster, with 24 cores at 3.07GHz per CPU and 47 GB RAM, using Linux 5.5.

In the first graph, we see significant performance improvements. For example, SVD with ORD plus MKL executes 20 times faster using 4 threads, and 29 times faster using 8 threads. For Cholesky Factorization, ORD plus MKL is 18 and 30 times faster for 4 and 8 threads, respectively.

In the second graph,we focus on the three longer running tests. Matrix multiplication is 88 and 166 times faster for 4 and 8 threads, respectively. PCA is 30 and 50 times faster, and LDA is over 3 times faster.

This level of performance improvement can significantly reduce application execution time and make interactive, dynamically generated results readily achievable. Note that ORD plus MKL not only impacts performance on the client side, but also when used in combination with R scripts executed using Oracle R Enterprise Embedded R Execution. Such R scripts, executing at the database server machine, reap these performance gains as well. 

Monday Feb 20, 2012

Announcing Oracle R Distribution

Oracle has released the Oracle R Distribution, an Oracle-supported distribution of open source R. This is provided as a free download from Oracle. Support for Oracle R Distribution is provided to customers of the Oracle Advanced Analytics option and Oracle Big Data Appliance. The Oracle R Distribution facilitates enterprise acceptance of R, since the lack of a major corporate sponsor has made some companies concerned about fully adopting R. With the Oracle R Distribution, Oracle plans to contribute bug fixes and relevant enhancements to open source R.

Oracle has already taken responsibility for and contributed modifications to ROracle - an Oracle database interface (DBI) driver for R based on OCI. As ROracle is LGPL and used for Oracle Database connectivity from R, we are committed to ensuring this is the best package for Oracle connectivity.
About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.

Search

Archives
« July 2016
SunMonTueWedThuFriSat
     
1
2
3
4
5
6
7
8
9
10
11
12
14
15
16
17
18
19
20
21
22
23
24
25
27
28
29
30
31
      
Today