By Mark Hornick-Oracle on Jan 04, 2016
Random Forest is a popular ensemble learning technique for classification and regression, developed by Leo Breiman and Adele Cutler. By combining the ideas of “bagging” and random selection of variables, the algorithm produces a collection of decision trees with controlled variance, while avoiding overfitting – a common problem for decision trees. Once many trees have been constructed, classification predictions are made by selecting the mode of the classes predicted by the individual trees, while regression predictions are computed as the mean of the individual tree predictions.
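The aggregation step can be illustrated with a minimal base R sketch. The per-tree predictions below are made-up values, not output from any model, and the helper name majority_class is hypothetical; the sketch only shows how a mode (classification) and a mean (regression) combine individual tree predictions.

```r
# Hypothetical predictions from five trees for three observations
# (rows = observations, columns = trees); values are illustrative only.
class_votes <- matrix(c("a", "b", "a", "a", "b",
                        "b", "b", "a", "b", "b",
                        "a", "a", "a", "b", "a"),
                      nrow = 3, byrow = TRUE)

# Classification: the forest predicts the most frequent class (the mode).
majority_class <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]  # ties resolve to the first class in sort order
}
class_pred <- apply(class_votes, 1, majority_class)

# Regression: the forest predicts the mean of the per-tree predictions.
reg_preds <- matrix(c(10, 12, 11, 9, 13,
                      20, 22, 21, 19, 18),
                    nrow = 2, byrow = TRUE)
reg_pred <- rowMeans(reg_preds)

print(class_pred)
print(reg_pred)
```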
Although the Random Forest algorithm provides high accuracy, performance and scalability can be issues for larger data sets. Oracle R Enterprise 1.5 introduces Random Forest for classification with three enhancements:
• ore.randomForest uses the ore.frame proxy for database tables so that data remain in the database server
• ore.randomForest executes in parallel for model building and scoring while using Oracle R Distribution or R’s randomForest package 4.6-10
• randomForest in Oracle R Distribution significantly reduces memory requirements of R’s algorithm, providing only the functionality required for use by ore.randomForest
Consider the model build performance of randomForest for 500 trees (the default) and three data set sizes (10K, 100K, and 1M rows). The formula is
using samples of the popular ONTIME domestic flight dataset.
With ORE’s parallel, distributed implementation, ore.randomForest is an order of magnitude faster than the commonly used randomForest package. While the first plot uses the original execution times, the second uses a log scale to facilitate interpreting scalability.
Memory vs. Speed
ore.randomForest is designed for speed, relying on ORE embedded R execution for parallelism to achieve the order-of-magnitude speedup. However, the data set is loaded into memory for each parallel R engine, so memory use grows with the degree of parallelism (DOP). Since Oracle R Distribution’s randomForest uses approximately 7X less memory than R's randomForest, larger data sets can be accommodated. Users can specify the DOP using the ore.parallel global option.
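The memory trade-off can be made concrete with a back-of-the-envelope sketch in base R. The data size and memory budget below are hypothetical, and the estimate assumes that only the per-engine data copies matter (model and R engine overhead are ignored), so treat it as a rough planning aid rather than a description of ORE internals.

```r
# Rough memory estimate: each parallel R engine holds its own copy of the data.
# All figures below are hypothetical and chosen only for illustration.
data_size_gb <- 2          # assumed in-memory size of the training data
dop <- c(1, 2, 4, 8, 16)   # candidate degrees of parallelism

# Data memory grows linearly with DOP (model and R engine overhead are ignored).
estimate <- data.frame(dop = dop, data_memory_gb = dop * data_size_gb)
print(estimate)

# Choose the largest DOP whose data copies fit in an assumed memory budget.
memory_budget_gb <- 20
chosen_dop <- max(dop[dop * data_size_gb <= memory_budget_gb])

# ore.parallel is the global option named in the text; setting it here only
# records the value in the R session and does not require a database connection.
options(ore.parallel = chosen_dop)
print(getOption("ore.parallel"))
```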
The ore.randomForest API:
ore.randomForest(formula, data, ntree = 500, mtry = NULL,
                 replace = TRUE, classwt = NULL, cutoff = NULL,
                 sampsize = if(replace) nrow(data) else ceiling(0.632*nrow(data)),
                 nodesize = 1L, maxnodes = NULL, confusion.matrix = FALSE,
                 na.action = na.fail, ...)
To highlight two of the arguments: confusion.matrix is a logical value indicating whether to calculate the confusion matrix. Note that this confusion matrix is not based on out-of-bag (OOB) samples; it is the result of applying the built random forest model to the entire training data.
The groups argument is the number of tree groups into which the total number of trees is divided during model build. The default is equal to the value of the option 'ore.parallel'. If system memory is limited, set this argument to a value large enough that each group contains only a small number of trees, to avoid exceeding available memory.
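To see how the choice of groups affects the number of trees built per group, here is a small base R sketch. The even split and the helper name trees_per_group are assumptions made for illustration; the text does not describe how ore.randomForest actually assigns trees to groups.

```r
# Split ntree trees into `groups` batches as evenly as possible.
# The even split is an assumption for illustration only.
trees_per_group <- function(ntree, groups) {
  base <- ntree %/% groups
  extra <- ntree %% groups
  # the first `extra` groups receive one additional tree
  base + as.integer(seq_len(groups) <= extra)
}

print(trees_per_group(100, 20))  # ntree and groups used in the example in this post
print(trees_per_group(500, 8))   # default ntree with a hypothetical ore.parallel of 8
```

Under this assumption, raising groups lowers the number of trees each group holds at once, which is the lever the text recommends when memory is tight.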
Scoring with ore.randomForest follows other ORE scoring functions:
predict(object, newdata,
        type = c("response", "prob", "vote", "all"),
        norm.votes = TRUE,
        supplemental.cols = NULL,
        cache.model = TRUE, ...)
The arguments include:
• type: scoring output content – 'response', 'prob', 'vote', or 'all', corresponding to predicted values, matrix of class probabilities, matrix of vote counts, or both the vote matrix and predicted values, respectively.
• norm.votes: a logical value indicating whether the vote counts in the output vote matrix should be normalized. The argument is ignored if 'type' is 'response' or 'prob'.
• supplemental.cols: additional columns from the 'newdata' data set to include in the prediction result. This can be particularly useful for including a key column that can be related back to the original data set.
• cache.model: a logical value indicating whether the entire random forest model is cached in memory during prediction. While the default is TRUE, setting it to FALSE may be beneficial if memory is an issue.
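The relationship between vote counts, normalized votes, and the predicted response can be sketched in base R. The vote counts below are invented for illustration, and the sketch assumes that normalization means dividing each row by its total (as in R's randomForest); it does not call ore.randomForest.

```r
# Hypothetical vote counts from a 100-tree forest for two observations
# and three classes; the numbers are illustrative only.
votes <- matrix(c(70, 20, 10,
                  15, 30, 55),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("A", "B", "C")))

# Counterpart of norm.votes = TRUE (assumed): each row is divided by its total,
# so the entries become fractions of trees voting for each class.
norm_votes <- votes / rowSums(votes)

# Counterpart of type = "response" (assumed): the class with the most votes.
response <- colnames(votes)[max.col(votes, ties.method = "first")]

print(norm_votes)
print(response)
```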
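# select the response and the predictor columns referenced by the model formula below
df <- ONTIME_S[,c("DAYOFWEEK","DEPDELAY","DISTANCE",
                  "UNIQUECARRIER","DAYOFMONTH","MONTH")]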
df <- df[complete.cases(df),]
mod <- ore.randomForest(DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH, df, ntree=100,groups=20)
ans <- predict(mod, df, type="all", supplemental.cols="DAYOFWEEK")
R> df <- ONTIME_S[,c("DAYOFWEEK","DEPDELAY","DISTANCE",
+ "UNIQUECARRIER","DAYOFMONTH","MONTH")]
R> df <- df[complete.cases(df),]
R> mod <- ore.randomForest(DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH,
+ df, ntree=100,groups=20)
R> ans <- predict(mod, df, type="all", supplemental.cols="DAYOFWEEK")
     1    2    3    4    5    6    7 prediction DAYOFWEEK
1 0.09 0.01 0.06 0.04 0.70 0.05 0.05          5         5
2 0.06 0.01 0.02 0.03 0.01 0.38 0.49          7         6
3 0.11 0.03 0.16 0.02 0.06 0.57 0.05          6         6
4 0.09 0.04 0.15 0.03 0.02 0.62 0.05          6         6
5 0.04 0.04 0.04 0.01 0.06 0.72 0.09          6         6
6 0.35 0.11 0.14 0.27 0.05 0.08 0.00          1         1
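Because supplemental.cols carries the actual DAYOFWEEK alongside each prediction, the two columns can be compared directly. The base R sketch below transcribes only the six rows printed above and cross-tabulates them; since the model was scored on its own training data and only six rows are shown, the result illustrates the mechanics and says nothing about model quality.

```r
# The six rows printed above, transcribed by hand: predicted and actual DAYOFWEEK.
shown <- data.frame(
  prediction = c(5, 7, 6, 6, 6, 1),
  DAYOFWEEK  = c(5, 6, 6, 6, 6, 1)
)

# Cross-tabulate actual against predicted classes for these rows only.
levels_seen <- sort(unique(c(shown$prediction, shown$DAYOFWEEK)))
confusion <- table(actual    = factor(shown$DAYOFWEEK,  levels = levels_seen),
                   predicted = factor(shown$prediction, levels = levels_seen))
print(confusion)

# Share of the six displayed rows where prediction and actual value agree.
agreement <- mean(shown$prediction == shown$DAYOFWEEK)
print(agreement)
```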