Performance
Consider the model build performance of randomForest for 500 trees (the default) and three data set sizes (10K, 100K, and 1M rows). The formula is
‘DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH’
using samples of the popular ONTIME domestic flight dataset.
With ORE’s parallel, distributed implementation, ore.randomForest is an order of magnitude faster than the commonly used randomForest package. While the first plot uses the original execution times, the second uses a log scale to facilitate interpreting scalability.
Memory vs. Speed
ore.randomForest is designed for speed, relying on ORE embedded R execution for parallelism to achieve the order of magnitude speedup. However, the data set is loaded into memory for each parallel R engine, so high degrees of parallelism (DOP) will result in the corresponding use of memory. Since Oracle R Distribution’s randomForest improves memory usage over R's randomForest (approximately 7X less), larger data sets can be accommodated. Users can specify the DOP using the ore.parallel global option.
API
The ore.randomForest API:
ore.randomForest(formula, data, ntree=500, mtry = NULL,
replace = TRUE, classwt = NULL, cutoff = NULL,
sampsize = if(replace) nrow(data) else ceiling(0.632*nrow(data)),
nodesize = 1L, maxnodes = NULL, confusion.matrix = FALSE,
na.action = na.fail, ...)
To highlight two of the arguments, confusion_matrix is a logical value indicating whether to calculate the confusion matrix. Note that this confusion matrix is not based on OOB (out-of-bag), it is the result of applying the built random forest model to the entire training data.
Argument groups is the number of tree groups that the total number of trees are divided into during model build. The default is equal to the value of the option 'ore.parallel'. If system memory is limited, it is recommended to set this argument to a value large enough so that the number of trees in each group is small to avoid exceeding memory availability.
Scoring with ore.randomForest follows other ORE scoring functions:
predict(object, newdata,
type = c("response", "prob", "vote", "all"),
norm.votes = TRUE,
supplemental.cols = NULL,
cache.model = TRUE, ...)
The arguments include:
• type: scoring output content – 'response', 'prob', 'votes', or 'all'. Corresponding to predicted values, matrix of class probabilities, matrix of vote counts, or both the vote matrix and predicted values, respectively.
• norm.votes: a logical value indicating whether the vote counts in the output vote matrix should be normalized. The argument is ignored if 'type' is 'response' or 'prob'.
• supplemental.cols: additional columns from the 'newdata' data set to include in the prediction result. This can be particularly useful for including a key column that can be related back to the original data set.
cache.model: a logical value indicating whether the entire random forest model is cached in memory during prediction. While the default is TRUE, setting it to FALSE may be beneficial if memory is an issue.
Example
options(ore.parallel=8)
df <- ONTIME_S[,c("DAYOFWEEK","DEPDELAY","DISTANCE",
"UNIQUECARRIER","DAYOFMONTH","MONTH")]
df <- df[complete.cases(df),]
mod <- ore.randomForest(DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH, df, ntree=100,groups=20)
ans <- predict(mod, df, type="all", supplemental.cols="DAYOFWEEK")
head(ans)
R> options(ore.parallel=8)
R> df <- ONTIME_S[,c("DAYOFWEEK","DEPDELAY","DISTANCE",
"UNIQUECARRIER","DAYOFMONTH","MONTH")]
R> df <- dd[complete.cases(dd),]
R> mod <- ore.randomForest(DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH,
+ df, ntree=100,groups=20)
R> ans <- predict(mod, df, type="all", supplemental.cols="DAYOFWEEK")
R> head(ans)
1 2 3 4 5 6 7 prediction DAYOFWEEK
1 0.09 0.01 0.06 0.04 0.70 0.05 0.05 5 5
2 0.06 0.01 0.02 0.03 0.01 0.38 0.49 7 6
3 0.11 0.03 0.16 0.02 0.06 0.57 0.05 6 6
4 0.09 0.04 0.15 0.03 0.02 0.62 0.05 6 6
5 0.04 0.04 0.04 0.01 0.06 0.72 0.09 6 6
6 0.35 0.11 0.14 0.27 0.05 0.08 0.00 1 1