## Friday Feb 05, 2016

### Using SVD for Dimensionality Reduction

SVD, or Singular Value Decomposition, is one of several techniques that can be used to reduce the dimensionality, i.e., the number of columns, of a data set. Why would we want to reduce the number of dimensions? In predictive analytics, more columns normally means more time required to build models and score data. If some columns have no predictive value, this means wasted time, or worse, those columns contribute noise to the model and reduce model quality or predictive accuracy.

Dimensionality reduction can be achieved by simply dropping columns, for example, those that may show up as collinear with others or identified as not being particularly predictive of the target as determined by an attribute importance ranking technique. But it can also be achieved by deriving new columns based on linear combinations of the original columns. In both cases, the resulting transformed data set can be provided to machine learning algorithms to yield faster model build times, faster scoring times, and more accurate models.

While SVD can be used for dimensionality reduction, it is often used in digital signal processing for noise reduction, image compression, and other areas.

SVD is an algorithm that factors an m x n matrix, M, of real or complex values into three component matrices, where the factorization has the form USV*. U is an m x p matrix. S is a p x p diagonal matrix. V is an n x p matrix, with V* being the transpose of V, a p x n matrix, or the conjugate transpose if M contains complex values. The value p is called the rank. The diagonal entries of S are referred to as the singular values of M. The columns of U are typically called the left-singular vectors of M, and the columns of V are called the right-singular vectors of M.

Consider the following visual representation of these matrices:

One of the features of SVD is that given the decomposition of M into U, S, and V, one can reconstruct the original matrix M, or an approximation of it. The singular values in the diagonal matrix S can be used to understand the amount of variance explained by each of the singular vectors. In R, this can be achieved using the computation:

`cumsum(S^2/sum(S^2))`

When plotted, this provides a visual understanding of the variance captured by the model. The figure below indicates that the first singular vector accounts for 96.5% of the variance, the second with the first accounts for over 99.5%, and so on.

As such, we can use this information to limit the number of vectors to the amount of variance we wish to capture. Reducing the number of vectors can help eliminate noise in the original data set when that data set is reconstructed using the subcomponents of U, S, and V.

ORE’s parallel, distributed SVD

With Oracle R Enterprise’s parallel distributed implementation of R’s svd function, only the S and V components are returned. More specifically, the diagonal singular values are returned of S as the vector d. If we store the result of invoking svd on matrix dat in svd.mod, U can be derived from these using M as follows:

``````svd.mod <- svd(dat)
U <- dat %*%  svd.mod\$v  %*% diag(1./svd.mod\$d)``````

So, how do we achieve dimensionality reduction using SVD? We can use the first k columns of V and S and achieve U’ with fewer columns.

`U.reduced <-dat %*% svd.mod\$v[,1:k,drop=FALSE] %*% diag((svd.mod\$d)[1:k,drop=FALSE])`

This reduced U can now be used as a proxy for matrix dat with fewer columns.

The function dimReduce introduced below accepts a matrix x, the number of columns desired k, and a request for any supplemental columns to return with the transformed matrix.

``````dimReduce <- function(x, k=floor(ncol(x)/2), supplemental.cols=NULL) {
colIdxs <- which(colnames(x) %in% supplemental.cols)
colNames <- names(x[,-colIdxs])
sol <- svd(x[,-colIdxs])
sol.U <- as.matrix(x[,-colIdxs]) %*% (sol\$v)[,1:k,drop=FALSE] %*%
diag((sol\$d)[1:k,drop=FALSE])
sol.U = sol.U@data
res <- cbind(sol.U,x[,colIdxs,drop=FALSE])
names(res) <- c(names(sol.U@data),names(x[,colIdxs]))
res
}``````

We will now use this function to reduce the iris data set.

To prepare the iris data set, we first add a unique identifier, create the database table IRIS2 in the database, and then assign row names to enable row indexing. We could also make ID the primary key using ore.exec with the ALTER TABLE statement. Refreshing the ore.frame proxy object using ore.sync reflects the change in primary key.

```dat <- iris
dat\$ID <- seq_len(nrow(dat))
ore.drop("IRIS2")
ore.create(dat,table="IRIS2")
row.names(IRIS2) <- IRIS2\$ID
# ore.exec("alter table IRIS2 add constraint IRIS2 primary key (\"ID\")")
# ore.sync(table = "IRIS2", use.keys = TRUE)
IRIS2[1:5,]```

Using the function defined above, dimReduce, we produce IRIS2.reduced with supplemental columns of ID and Species. This allows us to easily generate a confusion matrix later. You will find that IRIS2.reduced has 4 columns.

```IRIS2.reduced <- dimReduce(IRIS2, 2, supplemental.cols=c("ID","Species"))
dim(IRIS2.reduced)  # 150 4```

Next, we will build an rpart model to predict Species using first the original iris data set, and then the reduced data set so we can compare the confusion matrices of each. Note that to use R's rpart for model building, the data set IRIS2.reduced is pulled to the client.

```library(rpart)
m1 <- rpart(Species~.,iris)
res1 <- predict(m1,iris,type="class")
table(res1,iris\$Species)
#res1         setosa versicolor virginica
#  setosa         50          0         0
#  versicolor      0         49         5
#  virginica       0          1        45

dat2 <- ore.pull(IRIS2.reduced)
m2 <- rpart(Species~.-ID,dat2)
res2 <- predict(m2,dat2,type="class")
table(res2,iris\$Species)
# res2         setosa versicolor virginica
#  setosa         50          0         0
#  versicolor      0         47         0
#  virginica       0          3        50```

Notice that the resulting models are comparable, but that the model that used IRIS2.reduced actually has better overall accuracy, making just 3 mistakes instead of 6. Of course, a more accurate assessment of error would be to use cross validation, however, this is left as an exercise for the reader.

We can build a similar model using the in-database decision tree algorithm, via ore.odmDT, and get the same results on this particular data set.

```m2.1 <- ore.odmDT(Species~.-ID, IRIS2.reduced)
res2.1 <- predict(m2.1,IRIS2.reduced,type="class",supplemental.cols = "Species")
table(res2.1\$PREDICTION, res2.1\$Species)
# res2         setosa versicolor virginica
#  setosa         50          0         0
#  versicolor      0         47         0
#  virginica       0          3        50```

A more interesting example is based on the digit-recognizer data which can be located on the Kaggle website here. In this example, we first use Support Vector Machine as the algorithm with default parameters on split train and test samples of the original training data. This allows us to get an objective assessment of model accuracy. Then, we preprocess the train and test sets using the in-database SVD algorithm and reduce the original 785 predictors to 40. The reduced number of variables specified is subject to experimentation. Degree of parallelism for SVD was set to 4.

The results highlight that reducing data dimensionality can improve overall model accuracy, and that overall execution time can be significantly faster. Specifically, using ore.odmSVM for model building saw a 43% time reduction and a 4.2% increase in accuracy by preprocessing the train and test data using SVD.

However, it should be noted that not all algorithms are necessarily aided by dimensionality reduction with SVD. In a second test on the same data using ore.odmRandomForest with 25 trees and defaults for other settings, accuracy of 95.3% was achieved using the original train and test sets. With the SVD reduced train and test sets, accuracy was 93.7%. While the model building time was reduced by 80% and scoring time reduced by 54%, if we factor in the SVD execution time, however, using the straight random forest algorithm does better by a factor of two.

Details

For this scenario, we modify the dimReduce function introduced above and add another function dimReduceApply. In dimReduce, we save the model in an ORE Datastore so that the same model can be used to transform the test data set for scoring. In dimReduceApply, that same model is loaded for use in constructing the reduced U matrix.

```dimReduce <- function(x, k=floor(ncol(x)/2), supplemental.cols=NULL, dsname="svd.model") {
colIdxs <- which(colnames(x) %in% supplemental.cols)
if (length(colIdxs) > 0) {
sol <- svd(x[,-colIdxs])
sol.U <- as.matrix(x[,-colIdxs]) %*% (sol\$v)[,1:k,drop=FALSE] %*%
diag((sol\$d)[1:k,drop=FALSE])
res <- cbind(sol.U@data,x[,colIdxs,drop=FALSE])
# names(res) <- c(names(sol.U@data),names(x[,colIdxs]))
res
} else {
sol <- svd(x)
sol.U <- as.matrix(x) %*% (sol\$v)[,1:k,drop=FALSE] %*%
diag((sol\$d)[1:k,drop=FALSE])
res <- sol.U@data
}
ore.save(sol, name=dsname, overwrite=TRUE)
res
}

dimReduceApply <- function(x, k=floor(ncol(x)/2), supplemental.cols=NULL, dsname="svd.model") {
colIdxs <- which(colnames(x) %in% supplemental.cols)
if (length(colIdxs) > 0) {
sol.U <- as.matrix(x[,-colIdxs]) %*% (sol\$v)[,1:k,drop=FALSE] %*%
diag((sol\$d)[1:k,drop=FALSE])
res <- cbind(sol.U@data,x[,colIdxs,drop=FALSE])
# names(res) <- c(names(sol.U@data),names(x[,colIdxs]))
res
} else {
sol.U <- as.matrix(x) %*% (sol\$v)[,1:k,drop=FALSE] %*%
diag((sol\$d)[1:k,drop=FALSE])
res <- sol.U@data
}
res
}```

Here is the script used for the digit data:

```# load data from file
dim(train)       # 42000   786

train\$ID <- 1:nrow(train)             # assign row id
ore.drop(table="DIGIT_TRAIN")
ore.create(train,table="DIGIT_TRAIN") # create as table in the database
dim(DIGIT_TRAIN)  # 42000   786

# Split the original training data into train and
# test sets to evaluate model accuracy
set.seed(0)
dt <- DIGIT_TRAIN
ind <- sample(1:nrow(dt),nrow(dt)*.6)
group <- as.integer(1:nrow(dt) %in% ind)

row.names(dt) <- dt\$ID
sample.train <- dt[group==TRUE,]
sample.test  <- dt[group==FALSE,]
dim(sample.train)  # 25200   786
dim(sample.test)   # 16800   786
# Create train table in database
ore.create(sample.train, table="DIGIT_SAMPLE_TRAIN")
# Create test table in database
ore.create(sample.test,  table="DIGIT_SAMPLE_TEST")

# Add persistent primary key for row indexing
# Note: could be done using row.names(DIGIT_SAMPLE_TRAIN) <- DIGIT_SAMPLE_TRAIN\$ID
DIGIT_SAMPLE_TRAIN primary key (\"ID\")")
DIGIT_SAMPLE_TEST  primary key (\"ID\")")
ore.sync(table = c("DIGIT_SAMPLE_TRAIN","DIGIT_SAMPLE_TRAIN"), use.keys = TRUE)

# SVM model
m1.svm <- ore.odmSVM(label~.-ID, DIGIT_SAMPLE_TRAIN, type="classification")
pred.svm <- predict(m1.svm, DIGIT_SAMPLE_TEST,
supplemental.cols=c("ID","label"),type="class")
cm <- with(pred.svm, table(label,PREDICTION))

library(caret)
confusionMatrix(cm)
# Confusion Matrix and Statistics
#
# PREDICTION
# label    0    1    2    3    4    5    6    7    8    9
#     0 1633    0    4    2    3    9   16    2    7    0
#     1    0 1855   12    3    2    5    4    2   23    3
#     2    9   11 1445   22   26    8   22   30   46   10
#     3    8    9   57 1513    2   57   16   16   41   15
#     4    5    9   10    0 1508    0   10    4   14   85
#     5   24   12   14   52   28 1314   26    6   49   34
#     6   10    2    7    1    8   26 1603    0    6    0
#     7   10    8   27    4   21    8    1 1616    4   70
#     8   12   45   14   40    7   47   13   10 1377   30
#     9   12   10    6   19   41   15    2   54   15 1447
#
# Overall Statistics
#
# Accuracy : 0.9114
# 95% CI : (0.907, 0.9156)
# No Information Rate : 0.1167
# P-Value [Acc > NIR] : < 2.2e-16
#...

options(ore.parallel=4)
sample.train.reduced <- dimReduce(DIGIT_SAMPLE_TRAIN, 40, supplemental.cols=c("ID","label"))
sample.test.reduced <- dimReduceApply(DIGIT_SAMPLE_TEST, 40, supplemental.cols=c("ID","label"))
ore.drop(table="DIGIT_SAMPLE_TRAIN_REDUCED")
ore.create(sample.train.reduced,table="DIGIT_SAMPLE_TRAIN_REDUCED")
ore.drop(table="DIGIT_SAMPLE_TEST_REDUCED")
ore.create(sample.test.reduced,table="DIGIT_SAMPLE_TEST_REDUCED")

m2.svm <- ore.odmSVM(label~.-ID,
DIGIT_SAMPLE_TRAIN_REDUCED, type="classification")
pred2.svm <- predict(m2.svm, DIGIT_SAMPLE_TEST_REDUCED,
supplemental.cols=c("label"),type="class")
cm <- with(pred2.svm, table(label,PREDICTION))
confusionMatrix(cm)
# Confusion Matrix and Statistics
#
# PREDICTION
# label    0    1    2    3    4    5    6    7    8    9
#     0 1652    0    3    3    2    7    4    1    3    1
#     1    0 1887    8    2    2    1    1    3    3    2
#     2    3    4 1526   11   20    3    7   21   27    7
#     3    0    3   29 1595    3   38    4   16   34   12
#     4    0    4    8    0 1555    2   11    5    9   51
#     5    5    6    2   31    6 1464   13    6   10   16
#     6    2    1    5    0    5   18 1627    0    5    0
#     7    2    6   22    7   10    2    0 1666    8   46
#     8    3    9    9   34    7   21    9    7 1483   13
#     9    5    2    8   17   30   10    3   31   20 1495
#
# Overall Statistics
#
# Accuracy : 0.9494
# 95% CI : (0.946, 0.9527)
# No Information Rate : 0.1144
# P-Value [Acc > NIR] : < 2.2e-16
#...

# CASE 2 with Random Forest
m2.rf <- ore.randomForest(label~.-ID, DIGIT_SAMPLE_TRAIN,ntree=25)
pred2.rf <- predict(m2.rf, DIGIT_SAMPLE_TEST, supplemental.cols=c("label"),type="response")
cm <- with(pred2.rf, table(label,prediction))
confusionMatrix(cm)
# Confusion Matrix and Statistics
#
# prediction
# label    0    1    2    3    4    5    6    7    8    9
#     0 1655    0    1    1    2    0    7    0    9    1
#     1    0 1876   12    8    2    1    1    2    6    1
#     2    7    4 1552   14   10    2    5   22   10    3
#     3    9    5   33 1604    1   21    4   16   27   14
#     4    1    4    3    0 1577    1    9    3    3   44
#     5    9    6    2   46    3 1455   18    1    9   10
#     6   13    2    3    0    6   14 1621    0    3    1
#     7    1    6   31    5   16    3    0 1675    3   29
#     8    3    7   15   31   11   20    8    4 1476   20
#     9    9    2    7   23   32    5    1   15   12 1515
#
# Overall Statistics
#
# Accuracy : 0.9527
# 95% CI : (0.9494, 0.9559)
# No Information Rate : 0.1138
# P-Value [Acc > NIR] : < 2.2e-16
#...

m1.rf <- ore.randomForest(label~.-ID, DIGIT_SAMPLE_TRAIN_REDUCED,ntree=25)
pred1.rf <- predict(m1.rf, DIGIT_SAMPLE_TEST_REDUCED,
supplemental.cols=c("label"),type="response")
cm <- with(pred1.rf, table(label,prediction))
confusionMatrix(cm)
# Confusion Matrix and Statistics
#
# prediction
# label    0    1    2    3    4    5    6    7    8    9
#     0 1630    0    4    5    2    8   16    3    5    3
#     1    0 1874   17    4    0    5    2    2    4    1
#     2   15    2 1528   17   10    5   10   21   16    5
#     3    7    1   32 1601    4   25   10    8   34   12
#     4    2    6    6    3 1543    2   17    4    4   58
#     5    9    1    5   45   12 1443   11    3   15   15
#     6   21    3    8    0    5   15 1604    0    7    0
#     7    5   11   33    7   17    6    1 1649    2   38
#     8    5   13   27   57   14   27    9   12 1404   27
#     9   10    2    6   22   52    8    5   41   12 1463
#
# Overall Statistics
#
# Accuracy : 0.9368
# 95% CI : (0.9331, 0.9405)
# No Information Rate : 0.1139
# P-Value [Acc > NIR] : < 2.2e-16
#...```
Execution Times

The following numbers reflect the execution times for select operations of the above script. Hardware was a Lenovo Thinkpad with Intel i5 processor and 16 GB RAM.

## Monday Jan 04, 2016

### ORE Random Forest

Random Forest is a popular ensemble learning technique for classification and regression, developed by Leo Breiman and Adele Cutler. By combining the ideas of “bagging” and random selection of variables, the algorithm produces a collection of decision trees with controlled variance, while avoiding overfitting – a common problem for decision trees. By constructing many trees, classification predictions are made by selecting the mode of classes predicted, while regression predictions are computed using the mean from the individual tree predictions.

Although the Random Forest algorithm provides high accuracy, performance and scalability can be issues for larger data sets. Oracle R Enterprise 1.5 introduces Random Forest for classification with three enhancements:

•  ore.randomForest uses the ore.frame proxy for database tables so that data remain in the database server
•  ore.randomForest executes in parallel for model building and scoring while using Oracle R Distribution or R’s randomForest package 4.6-10
•  randomForest in Oracle R Distribution significantly reduces memory requirements of R’s algorithm, providing only the functionality required for use by ore.randomForest

Performance

Consider the model build performance of randomForest for 500 trees (the default) and three data set sizes (10K, 100K, and 1M rows). The formula is

‘DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH’

using samples of the popular ONTIME domestic flight dataset.

With ORE’s parallel, distributed implementation, ore.randomForest is an order of magnitude faster than the commonly used randomForest package. While the first plot uses the original execution times, the second uses a log scale to facilitate interpreting scalability.

Memory vs. Speed
ore.randomForest
is designed for speed, relying on ORE embedded R execution for parallelism to achieve the order of magnitude speedup. However, the data set is loaded into memory for each parallel R engine, so high degrees of parallelism (DOP) will result in the corresponding use of memory. Since Oracle R Distribution’s randomForest improves memory usage over R's randomForest (approximately 7X less), larger data sets can be accommodated. Users can specify the DOP using the ore.parallel global option.

API

The ore.randomForest API:

ore.randomForest(formula, data, ntree=500, mtry = NULL,
replace = TRUE, classwt = NULL, cutoff = NULL,
sampsize = if(replace) nrow(data) else ceiling(0.632*nrow(data)),
nodesize = 1L, maxnodes = NULL, confusion.matrix = FALSE,
na.action = na.fail, ...)

To highlight two of the arguments, confusion_matrix is a logical value indicating whether to calculate the confusion matrix. Note that this confusion matrix is not based on OOB (out-of-bag), it is the result of applying the built random forest model to the entire training data.

Argument groups is the number of tree groups that the total number of trees are divided into during model build. The default is equal to the value of the option 'ore.parallel'. If system memory is limited, it is recommended to set this argument to a value large enough so that the number of trees in each group is small to avoid exceeding memory availability.

Scoring with ore.randomForest follows other ORE scoring functions:

predict(object, newdata,
type = c("response", "prob", "vote", "all"),
supplemental.cols = NULL,
cache.model = TRUE, ...)

The arguments include:

•  type: scoring output content – 'response', 'prob', 'votes', or 'all'. Corresponding to predicted values, matrix of class probabilities, matrix of vote counts, or both the vote matrix and predicted values, respectively.
•  norm.votes: a logical value indicating whether the vote counts in the output vote matrix should be normalized. The argument is ignored if 'type' is 'response' or 'prob'.
•  supplemental.cols: additional columns from the 'newdata' data set to include in the prediction result. This can be particularly useful for including a key column that can be related back to the original data set.
cache.model: a logical value indicating whether the entire random forest model is cached in memory during prediction. While the default is TRUE, setting it to FALSE may be beneficial if memory is an issue.

Example

options(ore.parallel=8)
df <- ONTIME_S[,c("DAYOFWEEK","DEPDELAY","DISTANCE",
"UNIQUECARRIER","DAYOFMONTH","MONTH")]
df <- df[complete.cases(df),]
mod <- ore.randomForest(DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH,                 df, ntree=100,groups=20)
ans <- predict(mod, df, type="all", supplemental.cols="DAYOFWEEK")

R> options(ore.parallel=8)
R> df <- ONTIME_S[,c("DAYOFWEEK","DEPDELAY","DISTANCE",
"UNIQUECARRIER","DAYOFMONTH","MONTH")]
R> df <- dd[complete.cases(dd),]
R> mod <- ore.randomForest(DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH,
+                 df, ntree=100,groups=20)
R> ans <- predict(mod, df, type="all", supplemental.cols="DAYOFWEEK")

1    2    3    4    5    6    7 prediction DAYOFWEEK
1 0.09 0.01 0.06 0.04 0.70 0.05 0.05 5          5
2 0.06 0.01 0.02 0.03 0.01 0.38 0.49 7          6
3 0.11 0.03 0.16 0.02 0.06 0.57 0.05 6          6
4 0.09 0.04 0.15 0.03 0.02 0.62 0.05 6          6
5 0.04 0.04 0.04 0.01 0.06 0.72 0.09 6          6
6 0.35 0.11 0.14 0.27 0.05 0.08 0.00 1          1

## Thursday Sep 10, 2015

### Consolidating wide and shallow data with ORE Datastore

Clinical trial data are often characterized by a relatively small set of participants (100s or 1000s) while the data collected and analyzed on each may be significantly larger (1000s or 10,000s). Genomic data alone can easily reach the higher end of this range. In talking with industry leaders, one of the problems pharmaceutical companies and research hospitals encounter is effectively managing such data. Storing data in flat files on myriad servers, perhaps even “closeted” when no longer actively needed, poses problems for data accessibility, backup, recovery, and security. While Oracle Database provides support for wide data using nested tables in a number of contexts, to take advantage of R native functions that handle wide data using data.frames, Oracle R Enterprise allows you to store wide data.frames directly in Oracle Database using Oracle R Enterprise datastores.

With Oracle R Enterprise (ORE), a component of the Oracle Advanced Analytics option, the ORE datastore supports storing arbitrary R objects, including data.frames, in Oracle Database. In particular, users can load wide data from a file into R and store the resulting data.frame directly the R datastore. From there, users can repeatedly load the data at much faster speeds than they can from flat files.

The following benchmark results illustrate the performance of saving and loading data.frames of various dimensions. These tests were performed on an Oracle Exadata 5-2 half rack, ORE 1.4.1, ROracle 1.2-1, and R 3.2.0. Logging is turned off on the datastore table (see performance tip below). The data.frame consists of numeric data.

Comparing Alternatives

When it comes to accessing data and saving data for use with R, there are several options, including: CSV file, .Rdata file, and the ORE datastore. Each comes with its own advantages.

CSV

“Comma separated value” or CSV files are generally portable, provide a common representation for exporting/importing data, and can be readily loaded into a range of applications. However, flat files need to be managed and often have inadequate security, auditing, backup, and recovery. As we’ll see, CSV files provide significantly slower read and write times compared to .Rdata and ORE datastore.

.Rdata

R’s native .Rdata flat file representation is generally efficient for reading/writing R objects since the objects are in serialized form, i.e., not converted to a textual representation as CSV data are. However, .Rdata flat files also need to be managed and often have inadequate security, auditing, backup, and recovery. While faster than CSV read and write times, .Rdata is slower than ORE datastore. Being an R-specific format, access is limited to the R environment, which may or may not be a concern.

ORE Datastore

ORE’s datastore capability allows users to organize and manage all data in a single location – the Oracle Database. This centralized repository provides Oracle Database quality security, auditing, backup, and recovery. The ORE datastore, as you’ll see below, provides read and write performance that is significantly better than CSV and .Rdata. Of course, as with .Rdata being accessed through R, accessing the datastore is through Oracle Database.

Let’s look at a few benchmark comparisons.

First, consider the execution time for loading data using each of these approaches. For 2000 columns, we see that ore.load() is 124X faster than read.csv(), and over 3 times faster than R’s load() function for 5000 rows. At 20,000 rows, ore.load() is 198X faster than read.csv() and almost 4 times faster than load().

Considering the time to save data, ore.save() is over 11X faster than write.csv() and over 8X faster than save() at 2000 rows, with that benefit continuing through 20000 rows.

Looking at this across even wider data.frames, e.g., adding results for 4000 and 16000 columns, we see a similar performance benefit for the ORE datastore over save/load and write.csv/read.csv.

If you are looking to consolidate data while gaining performance benefits along with security, backup, and recovery, the Oracle R Enterprise datastore may be a preferred choice.

Example using ORE Datastore

The ORE datastore functions ore.save() and ore.load() are similar to the corresponding R save() and load() functions.

In the following example, we read a CSV data file, save it in the ORE datastore using ore.save() and associated it with the name “MyDatastore”. Although not shown, multiple objects can be listed in the initial arguments. Note that any R objects can be included here, not just data.frames.

From there, we list the contents of the datastore and see that “MyDatastore” is listed with the number of objects stored and the overall size. Next we can ask for a summary of the contents of “MyDatastore”, which includes the data.frame ‘dat’.

Next we remove ‘dat’ and load the contents of the datastore, reconstituting ‘dat’ as a usable data.frame object. Lastly, we delete the datastore and see that the ORE datastore is empty.

>
> dim(dat)
[1] 300 2000
>
> ore.save(dat, name="MyDatastore")
> ore.datastore()
datastore.name object.count size creation.date description
1 MyDatastore 1 4841036 2015-09-01 12:07:38
>
> ore.datastoreSummary("MyDatastore")
object.name class size length row.count col.count
1 dat data.frame 4841036 2000 300 2000
>
> rm(dat)
[1] "dat"
>
> ore.delete("MyDatastore")
[1] "MyDatastore"
>
> ore.datastore()
[1] datastore.name object.count size creation.date description
<0 rows> (or 0-length row.names)
>

Performance Tip

The performance of saving R objects to the datastore can be increased by temporarily turning off logging on the table that serves as the datastore in the user’s schema: RQ\$DATASTOREINVENTORY. This can be accomplished using the following SQL, which can also be invoked from R:

SQL> alter table RQ\$DATASTOREINVENTORY NOLOGGING;

ORE> ore.exec(“alter table RQ\$DATASTOREINVENTORY NOLOGGING”)

While turning off logging speeds up inserts and index creation, it avoids writing the redo log and as such has implications for database recovery. It can be used in combination with explicit backups before and after loading data.

## Friday Apr 17, 2015

### The Intersection of “Data Capital” and Advanced Analytics

We’ve heard about the Three Laws of Data Capital from Paul Sonderegger at Oracle: data comes from activity, data tends to make more data, and platforms tend to win. Advanced analytics enables enterprises to take full advantage of the data their activity produces, ranging from IoT sensors and PoS transactions to social media and image/video. Traditional BI tools produce summary data from data – producing more data, but traditional BI tools provide a view of the past – what did happen. Advanced analytics also produces more data from data, but this data is transformative, generating previously unknown insights and providing a view of future behavior or outcomes – what will likely happen. Oracle provides a platform for advanced analytics today through Oracle Advanced Analytics on Oracle Database, and Oracle R Advanced Analytics for Hadoop on Big Data Appliance to support investing data.

Enterprises need to put their data to work to realize a return on their investment in data capture, cleansing, and maintenance. Investing data through advanced analytics algorithms has shown repeatedly to dramatically increase ROI. For examples, see customer quotes and videos from StubHub, dunnhumby, CERN, among others. Too often, data centers are perceived as imposing a “tax” instead of yielding a “dividend.” If you cannot extract new insights from your data and use data to perform such revenue enhancing actions such as predicting customer behavior, understanding root causes, and reducing fraud, the costs to maintain large volumes of historical data may feel like a tax. How do enterprises convert data centers to dividend-yielding assets?

One approach is to reduce “transaction costs.” Typically, these transaction costs involve the cost for moving data into environments where predictive models can be produced or sampling data to be small enough to fit existing hardware and software architectures. Then, there is the cost for putting those models into production. Transaction costs result in multi-step efforts that are labor intensive and make enterprises postpone investing their data and deriving value. Oracle has long recognized the origins of these high transaction costs and produced tools and a platform to eliminate or dramatically lower these costs.

Further, consider the data scientist or analyst as the “data capital manager,” the person or persons striving to extract the maximum yield from data assets. To achieve high dividends with low transaction costs, the data capital manager needs to be supported with tools and a platform that automates activities – making them more productive – and ultimately more heroic within the enterprise – doing more with less because it’s faster and easier. Oracle removes a lot of the grunt work from the advanced analytics process: data is readily accessible, data manipulation and model building / data scoring is scalable, and deployment is immediate. To learn more about how to increase dividends from your data capital, see Oracle Advanced Analytics and Oracle R Advanced Analytics for Hadoop.

## Thursday Feb 12, 2015

### Pain Point #6: “We need to build 10s of thousands of models fast to meet business objectives”

The last pain point in this series on Addressing Analytic Pain Points, involves one aspect of what I call massive predictive modeling. Increasingly, enterprise customers are building a greater number of models. In past decades, producing a handful of production models per year may have been considered a significant accomplishment. With the advent of powerful computing platforms, parallel and distributed algorithms, as well as the wealth of data – Big Data – we see enterprises building hundreds and thousands of models in targeted ways.

For example, consider the utility sector with data being collected from household smart meters. Whether water, gas, or electricity, utility companies can make more precise demand projections by modeling individual customer consumption behavior. Aggregating this behavior across all households can provide more accurate forecasts, since individual household patterns are considered, not just generalizations about all households, or even different household segments.

The concerns associated with this form of massive predictive modeling include: (i) dealing effectively with Big Data from the hardware, software, network, storage and Cloud, (ii) algorithm and infrastructure scalability and performance, (iii) production deployment, and (iv) model storage, backup, recovery and security. Some of these I’ve explored under previous pain points blog posts.

Oracle Advanced Analytics (OAA) and Oracle R Advanced Analytics for Hadoop (ORAAH) both provide support for massive predictive modeling. From the Oracle R Enterprise component of OAA, users leverage embedded R execution to run user-defined R functions in parallel, both from R and from SQL. OAA provides the infrastructure to allow R users to focus on their core R functionality while allowing Oracle Database to handle spawning of R engines, partitioning data and providing data to their R function across parallel R engines, aggregating results, etc. Data parallelism is enabled using the “groupApply” and “rowApply” functions, while task parallelism is enabled using the “indexApply” function. The Oracle Data Mining component of OAA provides "on-the-fly" models, also called "predictive queries," where the model is automatically built on partitions of the data and scoring using those partitioned models is similarly automated.

ORAAH enables the writing of mapper and reducer functions in R where corresponding ORE functionality can be achieved on the Hadoop cluster. For example, to emulate “groupApply”, users write the mapper to partition the data and the reducer to build a model on the resulting data. To emulate “rowApply”, users can simply use the mapper to perform, e.g., data scoring and passing the model to the environment of the mapper. No reducer is required.

## Monday Jan 19, 2015

### Pain Point #5: “Our company is concerned about data security, backup and recovery”

So far in this series on Addressing Analytic Pain Points, I’ve focused on the issues of data access, performance, scalability, application complexity, and production deployment. However, there are also fundamental needs for enterprise advanced analytics solutions that revolve around data security, backup, and recovery.

Traditional non-database analytics tools typically rely on flat files. If data originated in an RDBMS, that data must first be extracted. Once extracted, who has access to these flat files? Who is using this data and when? What operations are being performed? Security needs for data may be somewhat obvious, but what about the predictive models themselves? In some sense, these may be more valuable than the raw data since these models contain patterns and insights that help make the enterprise competitive, if not the dominant player. Are these models secure? Do we know who is using them, when, and with what operations? In short, what audit capabilities are available?

While security is a hot topic for most enterprises, it is essential to have a well-defined backup process in place. Enterprises normally have well-established database backup procedures that database administrators (DBAs) rigorously follow. If data and models are stored in flat files, perhaps in a distributed environment, one must ask what procedures exist and with what guarantees. Are the data files taxing file system backup mechanisms already in place – or not being backed up at all?

On the other hand, recovery involves using those backups to restore the database to a consistent state, reapplying any changes since the last backup. Again, enterprises normally have well-established database recovery procedures that are used by DBAs. If separate backup and recovery mechanisms are used for data, models, and scores, it may be difficult, if not impossible, to reconstruct a consistent view of an application or system that uses advanced analytics. If separate mechanisms are in place, they are likely more complex than necessary.

For Oracle Advanced Analytics (OAA), data is secured via Oracle Database, which wins security awards and is highly regarded for its ability to provide secure data for confidentiality, integrity, availability, authentication, authorization, and non-repudiation. Oracle Database logs and monitors user activity. Users can work independently or jointly in a shared environment with data access controlled by standard database privileges. The data itself can be encrypted and data redaction is supported.

OAA models are secured in one of two ways: (i) models produced in the kernel of the database are treated as first-class database objects with corresponding access privileges (create, update, delete, execute), and (ii) models produced through the R interface can be stored in the R datastore, which exists as a database table in the user's schema with its own access privileges. In either case, users must log into their Oracle Database schema/account, which provides the needed degree of confidentiality, integrity, availability, authentication, authorization, and non-repudiation.

Enterprise Oracle DBAs already follow rigorous backup and recovery procedures. The ability to reuse these procedures in conjunction with advanced analytics solutions is a major simplification and helps to ensure the integrity of data, models, and results.

## Tuesday Dec 23, 2014

### Pain Point #4: “Recoding R (or other) models into SQL, C, or Java takes time and is error prone”

In the previous post in this series Addressing Analytic Pain Points, I focused on some issues surrounding production deployment of advanced analytics solutions. One specific aspect of production deployment involves how to get predictive model results (e.g., scores) from R or leading vendor tools into applications that are based on programming languages such as SQL, C, or Java. In certain environments, one way to integrate predictive models involves recoding them into one of these languages. Recoding involves identifying the minimal information needed for scoring, i.e., making predictions, and implementing that in a language that is compatible with the target environment. For example, consider a linear regression model with coefficients. It can be fairly straightforward to write a SQL statement or a function in C or Java to produce a score using these coefficients. This translated model can then be integrated with production applications or systems.

While recoding has been a technique used for decades, it suffers from several drawbacks: latency, quality, and robustness. Latency refers to the time delay between the data scientist developing the solution and leveraging that solution in production. Customers recount historic horror stories where the process from analyst to software developers to application deployment took months. Quality comes into play on two levels: the coding and testing quality of the software produced, and the freshness of the model itself. In fast changing environments, models may become “stale” within days or weeks. As a result, latency can impact quality. In addition, while a stripped down implementation of the scoring function is possible, it may not account for all cases considered by the original algorithm implementer. As such, robustness, i.e., the ability to handle greater variation in the input data, may suffer.

One way to address this pain point is to make it easy to leverage predictive models immediately (especially open source R and in-database Oracle Advanced Analytics models), thereby eliminating the need to recode models. Since enterprise applications normally know how to interact with databases via SQL, as soon as a model is produced, it can be placed into production via SQL access. In the case of R models, these can be accessed using Oracle R Enterprise embedded R execution in parallel via ore.rowApply and, for select models, the ore.predict capability performs automatic translation of native R models for execution inside the database. In the case of native SQL Oracle Advanced Analytics interface algorithms, as found in Oracle Data Mining and exposed through an R interface in Oracle R Enterprise, users can perform scoring directly in Oracle Database. This capability minimizes or even eliminates latency, dramatically increases quality, and leverages the robustness of the original algorithm implementations.

## Sunday Dec 14, 2014

### Pain Point #3: “Putting R (or other) models and results into production is ad hoc and complex”

Continuing in our series Addressing Analytic Pain Points, another concern for data scientists and analysts, as well as enterprise management, is how to leverage analytic results in production systems. These production systems can include (i) dashboards used by management to make business decisions, (ii) call center applications where representatives see personalized recommendations for the customer they’re speaking to or how likely that customer is to churn, (iii) real-time recommender systems for customer retail web applications, (iv) automated network intrusion detection systems, and (v) semiconductor manufacturing alert systems that monitor product quality and equipment parameters via sensors – to name a few.

When a data scientist or analyst begins examining a data-based business problem, one of the first steps is to acquire the available data relevant to that problem. In many enterprises, this involves having it extracted from a data warehouse and operational systems, or acquiring supplemental data from third parties. They then explore the data, prepare it with various transformations, build models using a variety of algorithms and settings, evaluate the results, and after choosing a “best” approach, produce results such as predictions or insights that can be used by the enterprise.

If the end goal is to produce a slide deck or report, aside from those final documents, the work is done. However, reaping financial benefits from advanced analytics often needs to go beyond PowerPoint! It involves automating the process described above: extract and prepare the data, build and select the “best” model, generate predictions or highlight model details such as descriptive rules, and utilize them in production systems.

One of the biggest challenges enterprises face involves realizing the promised benefits in production that the data scientist achieved in the lab. How do you take that cleverly crafted R script, for example, and put all the necessary “plumbing” around it to enable not only the execution of the R script, but the movement of data and delivery of results where they are needed, parallel and distributed script execution across compute nodes, and execution scheduling.

As a production deployment, care needs to taken to safeguard against potential failures in the process. Further, more “moving parts” result in greater complexity. Since the plumbing is often custom implemented for each deployment, this plumbing needs to be reinvented and thoroughly tested for each project. Unfortunately, code and process reuse is seldom realized across an enterprise even for similar projects, which results in duplication of effort.

Oracle Advanced Analytics (Oracle R Enterprise and Oracle Data Mining) with Oracle Database provides an environment that eliminates the need for a separately managed analytics server, the corresponding movement of data and results between such a server and the database, and the need for custom plumbing. Users can store their R and SQL scripts directly in Oracle Database and invoke them through standard database mechanisms. For example, R scripts can be invoked via SQL, and SQL scripts can be scheduled for execution through Oracle Database’s DMBS_SCHEDULER package. Parallel and distributed execution of R scripts is supported through embedded R execution, while the database kernel supports parallel and distributed execution of SQL statements and in-database data mining algorithms. In addition, using the Oracle Advanced Analytics GUI, Oracle Data Miner, users can convert “drag and drop” analytic workflows to SQL scripts for ease of deployment in Oracle Database.

By making solution deployment a well-defined and routine part of the production process and reducing complexity through fewer moving parts and built-in capabilities, enterprises are able to realize and then extend the value they get from predictive analytics faster and with greater confidence.

## Wednesday Nov 19, 2014

### Pain Point #2: “I can’t analyze or mine all of my data – it has to be sampled”

Continuing in our series Addressing Analytic Pain Points, another concern for enterprise data scientists and analysts is having to compromise accuracy due to sampling. While sampling is an important technique for data analysis, it’s one thing to sample because you choose to; it’s quite another if you are forced to sample or to use a much smaller sample than is useful. A combination of memory, compute power, and algorithm design normally contributes to this.

In some cases, data simply cannot fit in memory. As a result, users must either process data in batches (adding to code or process complexity), or limit the data they use through sampling. In some environments, sampling itself introduces a catch 22 problem: the data is too big to fit in memory so it needs to be sampled, but to sample it with the current tool, I need to fit the data in memory! As a result, sampling large volume data may require processing it in batches, involving extra coding.

As data volumes increase, computing statistics and predictive analytics models on a data sample can significantly reduce accuracy. For example, to find all the unique values for a given variable, a sample may miss values, especially those that occur infrequently. In addition, for environments like open source R, it is not enough for data to fit in memory; sufficient memory must be left over to perform the computation. This results from R’s call-by-value semantics.

Even when data fits in memory, local machines, such as laptops, may have insufficient CPU power to process larger data sets. Insufficient computing resources means that performance suffers and users must wait for results - perhaps minutes, hours, or longer. This wastes the valuable (and expensive) time of the data scientist or analyst. Having multiple fast cores for parallel computations, as normally present on database server machines, can significantly reduce execution time.

So let’s say we can fit the data in memory with sufficient memory left over, and we have ample compute resources. It may still be the case that performance is slow, or worse, the computation effectively “never” completes. A computation that would take days or weeks to complete on the full data set may be deemed as “never” completing by the user or business, especially where the results are time-sensitive. To address this problem, algorithm design must be addressed. Serial, non-threaded algorithms, especially with quadratic or worse order run time do not readily scale. Algorithms need to be redesigned to work in a parallel and even distributed manner to handle large data volumes.

provides a range of statistical computations and predictive algorithms implemented in a parallel, distributed manner to enable processing much larger volume data. By virtue of executing in Oracle Database, client-side memory limitations can be eliminated. For example, with Oracle R Enterprise, R users operate on database tables using proxy objects – of type ore.frame, a subclass of data.frame – such that data.frame functions are transparently converted to SQL and executed in Oracle Database. This eliminates data movement from the database to the client machine. Users can also leverage the Oracle Data Miner graphical interface or SQL directly. When high performance hardware, such as Oracle Exadata, is used, there are powerful resources available to execute operations efficiently on big data. On Hadoop, Oracle R Advanced Analytics for Hadoop – a part of the Big Data Connectors often deployed on Oracle Big Data Appliance – also provides a range of pre-package parallel, distributed algorithms for scalability and performance across the Hadoop cluster.

## Friday Oct 24, 2014

### Pain Point #1: “It takes too long to get my data or to get the ‘right’ data”

This is the first in a series on Addressing Analytic Pain Points: “It takes too long to get my data or to get the ‘right’ data.”

Analytics users can be characterized along multiple dimensions. One such dimension is how they get access to or receive data. For example, some receive data via flat files. Since we’re talking about “enterprise” users, this often means data stored in RDBMSs where users request data extracts from a DBA or more generally the IT department. Turnaround time can be hours to days, or even weeks, depending on the organization. If the data scientist needs more or different data, the cycle repeats – often leading to frustration on both sides and delays in generating results.

Others users are granted access to databases directly using programmatic access tools like ODBC, JDBC, their corresponding R variants, or ROracle. These users may be given read-only access to a range of data tables, possibly in a sandbox schema. Here, analytics users don’t have to go back to their DBA or IT as to obtain extracts, but they still need to pull the data from the database to their client environment, e.g., a laptop, and push results back to the database. If significant volumes of data are involved, the time required for pulling data can hinder productivity. (Of course, this assumes the client has enough RAM to load the needed data sets, but that’s a topic for the next blog post.)

To address the first type of user, since much of the data in question resides in databases, empowering users with a self service model mitigates the vicious cycle described above. When the available data are readily accessible to analytics users, they can see and select what they need at will. An Oracle Database solution addresses this data access pain point by providing schema access, possibly in a sandbox with read-only table access, for the analytics user.

Even so, this approach just turns the first type of user into the second mentioned above. An Oracle Database solution further addresses this pain point by either minimizing or eliminating data movement as much as possible. Most analytics engines bring data to the computation, requiring extracts and in some cases even proprietary formats before being able to perform analytics. This takes time. Often, data movement can dwarf the time required to perform the actual computation. From the perspective of the analytics user, this is wasted time because it is just a perfunctory step on the way to getting the desired results. By bringing computation to the data, using Oracle Advanced Analytics (Oracle R Enterprise and Oracle Data Mining), the time normally required to move data is eliminated. Consider the time savings of being able to prepare data, compute statistics, or build predictive models and score data directly in the database. Using Oracle Advanced Analytics, either from R via Oracle R Enterprise, SQL via Oracle Data Mining, or the graphical interface Oracle Data Miner, users can leverage Oracle Database as a high performance computational engine.

We should also note that Oracle Database has the high performance Oracle Call Interface (OCI) library for programmatic data access. For R users, Oracle provides the package ROracle that is optimized using OCI for fast data access. While ROracle performance may be much faster than other methods (ODBC- and JDBC-based), the time is still greater than zero and there are other problems that I’ll address in the next pain point.

If you’re an enterprise data scientist, data analyst, or statistician, and perform analytics using R or another third party analytics engine, you’ve likely encountered one or more of these pain points:

Pain Point #1: “It takes too long to get my data or to get the ‘right’ data”
Pain Point #2: “I can’t analyze or mine all of my data – it has to be sampled”
Pain Point #3: “Putting R (or other) models and results into production is ad hoc and complex”
Pain Point #4: “Recoding R (or other) models into SQL, C, or Java takes time and is error prone”
Pain Point #5: “Our company is concerned about data security, backup and recovery”
Pain Point #6: “We need to build 10s of thousands of models fast to meet business objectives”

Some pain points are related to the scale of data, yet others are felt regardless of data size. In this blog series, I’ll explore each of these pain points, how they affect analytics users and their organizations, and how Oracle Advanced Analytics addresses them.

## Monday Sep 22, 2014

### Oracle R Enterprise 1.4.1 Released

Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, makes the open source R statistical programming language and environment ready for the enterprise and big data. Designed for problems involving large data volumes, Oracle R Enterprise integrates R with Oracle Database.

R users can execute R commands and scripts for statistical and graphical analyses on data stored in Oracle Database. R users can develop, refine, and deploy R scripts that leverage the parallelism and scalability of the database to automate data analysis. Data analysts and data scientists can use open source R packages and develop and operationalize R scripts for analytical applications in one step – from R or SQL.

With the new release of Oracle R Enterprise 1.4.1, Oracle enables support for Multitenant Container Database (CDB) in Oracle Database 12c and pluggable databases (PDB). With support for CDB / PDB, enterprises can take advantage of new ways of organizing their data: easily taking entire databases offline and easily bringing them back online when needed. Enterprises, such as pharmaceutical companies, that collect vast quantities of data across multiple experiments for individual projects immediately benefit from this capability.

This point release also includes the following enhancements:

• Certified for use with R 3.1.1 and Oracle R Distribution 3.1.1.

• Simplified and enhanced script for install, upgrade, uninstall of ORE Server and the creation and configuratioon of ORE users.

• New supporting packages: arules and statmod.

• ore.glm accepts offset terms in model formula and can fit negative binomial and tweedie families of GLM.

• ore.sync argument, query, creates ore.frame object from SELECT statement without creating view. This allows users to effectively access a view of the data without the CREATE VIEW privilege.

• Global option for serialization, ore.envAsEmptyenv, specifies whether referenced environment objects in an R object, e.g., in an lm model, should be replaced with an empty environment during serialization to the ORE R datastore. This is used by (i) ore.push, which for a list object accepts envAsEmptyenv as an optional argument, (ii) ore.save, which has envAsEmptyenv as a named argument, and (iii) ore.doEval and the other embedded R execution functions, which accept ore.envAsEmptyenv as a control argument.

Oracle R Enterprise 1.4.1

## Wednesday Sep 17, 2014

### Seismic Data Repository: on-the-fly data analysis and visualization using Oracle R Enterprise

RN-KrasnoyarskNIPIneft Establishes Seismic Information Repository for One of the World’s Largest Oil and Gas Companies. Read the complete customer story here, excerpts follow.

RN-KrasnoyarskNIPIneft (KrasNIPI) is a research and development subsidiary of Rosneft Oil Companya, top oil and gas company in Russia and worldwide. KrasNIPI provides high-quality information from seismic surveys to Rosneft—delivering key information that oil and gas companies seek to lower costs, environmental impacts, and risks while exploring for resources to satisfy growing energy needs. KrasNIPI’s primary activities include preparing the information base used for the exploration of hydrocarbons, development and construction of oil and gas fields, processing and interpretation of 2-D and 3-D seismic data, and seismic data warehousing.

Part of the solution involved on-the-fly data analysis and visualization for remote users with only a thin client—such as a web browser (without additional plug-ins and extensions). This was made possible by using Oracle R Enterprise (a component of Oracle Advanced Analytics) to support applications requiring extensive analytical processing.

We store vast amounts of seismic data, process this information with sophisticated math algorithms, and deliver it to remote users under tight deadlines. We deployed Oracle Database together with Oracle Spatial and Graph, Oracle Fusion Middleware MapViewer on Oracle WebLogic Server, and Oracle R Enterprise to keep these complex business processes running smoothly. The result exceeded our most optimistic expectations.”
– Artem Khodyaev, Chief Engineer
Corporate Center of Seismic Information Repository
RN-KrasnoyarskNIPIneft

## Thursday Aug 14, 2014

### Selecting the most predictive variables – returning Attribute Importance results as a database table

Attribute Importance (AI) is a technique of Oracle Advanced Analytics (OAA) that ranks the relative importance of predictors given a categorical or numeric target for classification or regression models, respectively. OAA AI uses the minimum description length algorithm and produces importance scores such that predictors with positive scores help predict the target, while zero or negative do not, and may even contribute noise to a model, making it less accurate. OAA AI, however, considers predictors only pairwise with the target, so any interactions among predictors are addressed. OAA AI is a good first assessment of which predictors should be included in a classification or regression model, enabling what is sometimes called feature selection or variable selection.

In my series on Oracle R Enterprise Embedded R Execution, I explored how structured table results could be returned from embedded R calls. In a subsequent post, I explored how to return select results from a principal components analysis (PCA) model as a table. In this post, I describe how you can work with results from an Attribute Importance model from ORE embedded R execution via an R function. This R function takes a table name and target variable name as input, places the predictor rankings in an named ORE datastore also specified as input, and returns a data.frame with the predictor variable name, rank, importance value.

The function below implements this functionality. Notice that we dynamically sync the named table and get its ore.frame proxy object. From here, we invoke ore.odmAI using the dynamically generated formula using the targetName argument. We pull out the importance component of the result, explicitly assign the column variable to the row names, and then reorder the columns. Next, we nullify the row names since these are now redundant with column variable.

The next three lines assign the result to a datastore. This is technically not necessary since the result is returned by this function, but if a user wanted to access this result without recomputing it, the user could simply retrieve the datastore object using another embedded R function. This is left as an exercise for the reader to load the named datastore and return the contents as an ore.frame in R or database table in SQL.

Lastly, the resulting data.frame is returned.

rankPredictors <- function(tableName,targetName,dsName) {
ore.sync(table=tableName)
ore.attach()
dat <- ore.get(tableName)
formulaStr <- paste(targetName,".",sep="~")
res <- ore.odmAI(as.formula(formulaStr),dat)
res <- res\$importance
res\$variable <- rownames(res)
res <- res[,c("variable","rank","importance")]
row.names(res) <- NULL
resName <- paste(tableName,targetName,"AI",sep=".")
assign(resName,res)
ore.save(list=c(resName),name=dsName,overwrite=TRUE)
res
}

To test this funtion, we invoke it explicitly with suitable arguments.

res <- rankPredictors ("IRIS","Species","/DS/Test1")
res

Here, you see the results.

> res
variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

The contents of the datastore can be accessed as well.

ore.datastore(pattern="/DS")
ore.datastoreSummary(name="/DS/Test1")
IRIS.Species.AI
> ore.datastore(pattern="/DS")
datastore.name object.count size    creation.date description
1 /DS/Test1 1 355 2014-08-14 16:38:46 <na>
> ore.datastoreSummary(name="/DS/Test1")
object.name class size length row.count col.count
1 IRIS.Species.AI data.frame 355 3 4 3
[1] "IRIS.Species.AI"
> IRIS.Species.AI
variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

With the confidence that our R function is behaving correctly, we load it into the R Script Repository in Oracle Database.

ore.scriptDrop("rankPredictors")
ore.scriptCreate("rankPredictors",rankPredictors)

To test that the function behaves properly with embedded R execution, we invoke it first from R using ore.doEval, passing the desired parameters and returning the result as an ore.frame. This last part is enabled through the specification of the FUN.VALUE argument. Since we are using a datastore and the transparency layer, ore.connect is set to TRUE.

ore.doEval(
FUN.NAME="rankPredictors",
tableName="IRIS",
target="Species",
dsName="/AttributeImportance/IRIS/Species",
FUN.VALUE=data.frame(variable=character(0)
,rank=numeric(0)
,importance=numeric(0)),
ore.connect=TRUE
)

Notice we get the same result as above.

variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

Again, we can view the datastore contents for the execution above. Notice our use of the “/” notation to organize our datastore content. While we can name datastores with any arbitrary string, this approach can help structure the retrieval of datastore contents.

ore.datastore(pattern="/AttributeImportance/IRIS")
ore.datastoreSummary(name="/AttributeImportance/IRIS/Species")

We have a single datastore matching our IRIS data set followed by the summary with the IRIS.Species.AI object, which is an R data.frame with 3 columns and 4 rows.

> ore.datastore(pattern="/AttributeImportance/IRIS")
datastore.name object.count size creation.date description
1 /AttributeImportance/IRIS/Species 1 355 2014-08-14 16:55:40
> ore.datastoreSummary(name="/AttributeImportance/IRIS/Species")
object.name class size length row.count col.count
1 IRIS.Species.AI data.frame 355 3 4 3

To execute this R script from SQL, use the ORE SQL API.

select * from table(rqEval(
cursor(select 1 "ore.connect",
'IRIS' "tableName",
'Species' "targetName",
'/AttributeImportance/IRIS/Species' "dsName"
from dual),
'select cast(''a'' as varchar2(50)) "variable",
1 "rank",
1 "importance"
from dual',
'rankPredictors'));

In summary, we’ve explored how to use ORE embedded R execution to extract model elements from an in-database algorithm and present it as an R data.frame, ore.frame, and SQL table.

The process used above can also serve as a template for working on your own embedded R execution projects:

+ Interactively develop an R script that does what you need and wrap it in a function
+ Validate that the R function behaves as expected
+ Store the function in the R Script Repository
+ Validate that the R interface to embedded R execution produces the desired results
+ Generate SQL query that invokes the R function
+ Validate that the SQL interface to embedded R execution produces the desired resultsv

## Thursday Jul 24, 2014

### Are you experiencing analytics pain points?

At the user!2014 conference at UCLA in early July, which was a stimulating and well-attended conference, I spoke about Oracle’s R Technologies during the sponsor talks. One of my slides focused on examples of analytics pain points we often hear from customers and prospects. For example,

“It takes too long to get my data or to get the ‘right’ data”
“I can’t analyze or mine all of my data – it has to be sampled”
“Putting R models and results into production is ad hoc and complex”
“Recoding R models into SQL, C, or Java takes time and is error prone”
“Our company is concerned about data security, backup and recovery”
“We need to build 10s of thousands of models fast to meet business objectives”

After the talk, several people approached me remarking how these are exactly the problems they encounter in their organizations. One person even asked, if I’d interviewed her for my talk since she is experiencing every one of these pain points.

Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, addresses these pain points. Let’s take a look one by one.

Increasingly, data scientists want to avoid sampling data when analyzing data or building predictive models. Minimally, they at least want to use much more data than may fit in typical analytics servers. Oracle R Enterprise provides an R interface to powerful in-database analytic functions and data mining algorithms. These algorithms are designed to work in a parallel distributed manner whether the data fits in memory or not. In other cases, sampling is desired, if not required, but this results in the chicken-and-egg problem: The data need to be sampled since they won’t fit in memory, but the data are too big to fit in memory to sample! Users have developed home grown techniques to chunk the data and combine partial samples; however, they shouldn’t have to. When sampling is desired/required, with Oracle R Enterprise, we are able to leverage row indexing and in-database sampling to extract only database table rows that are in the sample, using standard R syntax or Oracle R Enterprise-based sampling functions.

Our next pain point involves production deployment. Many good predictive models have been laid waste for lack of integration with or complexity introduced by production environments. Enterprise applications and dashboards often speak SQL and know how to access data. However, to craft a solution that extracts data, invokes an R script in an external R engine, and places batch results back in the database requires a lot of manual coding, often leveraging ad hoc cron jobs. Oracle R Enterprise enables the execution of R scripts on the database server machine, in local R engines under the control of Oracle Database. This can be done from R and SQL. Using the SQL API, R scripts can be invoked to return results in the form of table data, images, and XML. In addition, data can be moved to these R engines more efficiently, and the powerful database hardware, such as Exadata machines, can be leveraged for data-parallel and task-parallel R script execution.

When users don’t have access to a tight integration between R and SQL as noted above, another pain point involves using R only to build the models and relying on developers to recode the scoring procedures in a programming language that fits with the production environment, e.g., SQL, C, or Java. This has multiple downsides: it takes time to recode, manual recoding is error prone, and the resulting code requires significant testing. When the model is refreshed, the process repeats.

The pain points discussed so far also suffer from concerns about security, backup, and recovery. If data is being moved around in flat files, what security protocols or access controls are placed on those flat files? How can access be audited? Oracle R Enterprise enables analytics users to leverage an Oracle Database secured environment for data access. Moving on, if R scripts, models, and other R objects are stored and managed as flat files, how are these backed up? How are they synced with the deployed application? By storing all these artifacts in Oracle Database via Oracle R Enterprise, backup is a normal part of DBA operation with established protocols. The R Script Repository and Datastore simplify backup. Crafting ad hoc solutions involving third party analytic servers, there is the issue of recovery, or resilience to failures. Fewer moving parts mean lower complexity. Programming for failure contingencies in a distributed application adds significant complexity to an application. Allowing Oracle Database to control the execution of R scripts in database server side R engines reduces complexity and frees application developers and data scientists to focus on the more creative aspects of their work.

Lastly, users of advanced analytics software – data scientists, analysts, statisticians – are increasing pushing the barrier of scalability. Not just in volume of data processed, but in the number and frequency of their computations and analyses, e.g., predictive model building. Where only a few models are involved, it may be tractable to manage a few files to store predictive models on disk (although as noted above, this has its own complications). When you need to build thousands of models or hundreds of thousands of models, managing these models becomes a challenge in its own right.

In summary, customers are facing a wide range of pain points in their analytics activities. Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, addresses these pain points allowing data scientists, analysts, and statisticians, as well as the IT staff who supports them, to be more productive, while promoting and enabling new uses of advanced analytics.

## Thursday May 01, 2014

### "Darden uses analytics to understand customer restaurants"

See the InformationWeek article Darden Uses Analytics To Understand Restaurant Customers highlighting Darden's use of Oracle Advanced Analytics:

"Check-Level Analytics is one of the tools Darden plans to use to boost sales and customer loyalty, while pulling together data from across all its operations to show how they can work together better. ... it brought in some new tools, including Oracle Data Miner and Oracle R Enterprise, both included in the Oracle Advanced Analytics option to Oracle Database, to spot correlations and meaningful patterns in the data."

Darden's data analytics project earned them the 5th spot on this year's InformationWeek Elite 100 ranking.

## Sunday Dec 08, 2013

### Explore Oracle's R Technologies at BIWA Summit 2014

It’s getting to be that time of year again. The Oracle BIWA Summit '14 will be taking place January 14-16 at Oracle HQ Conference Center, Redwood Shores, CA. Check out the detailed agenda.

BIWA Summit provides a wide range of sessions on Business Intelligence, Warehousing, and Analytics, including: novel and interesting use cases of Oracle Big Data, Exadata, Advanced Analytics/Data Mining, OBIEE, Spatial, Endeca and more! You’ll also have opportunities to get hands on experience with products in the Hands-on Labs, great customer case studies and talks by Oracle Technical Professionals and Partners.  Meet with technical experts on the technology you want and need to use.

Click HERE to read detailed abstracts and speaker profiles.  Use the SPECIAL DISCOUNT code ORACLE12c and registration is only \$199 for the 2.5 day technically focused Oracle user group event.

On the topic of Oracle’s R technologies, don't miss:

• Introduction to Oracle's R Technologies
• Applying Oracle's R Technologies to Big Data Problems
• Hands-on Lab: Learn to use Oracle R Enterprise
• OBIEE + OAA Integration Paths : interactive OAA in SampleApp Dashboards
• Blazing Business Analytics: Analytic Options to the Oracle Database
• Best Practices for In-Database Analytics

We look forward to meeting you there!

## Wednesday Feb 06, 2013

### Oracle R Enterprise 1.3 gives predictive analytics an in-database performance boost

Recently released Oracle R Enterprise 1.3 adds packages to R that enable even more in-database analytics. These packages provide horizontal, commonly used techniques that are blazingly fast in-database for large data. With Oracle R Enterprise 1.3, Oracle makes R even better and usable in enterprise settings. (You can download ORE 1.3 here and documentation here.)

When it comes to predictive analytics, scoring (predicting outcomes using a data mining model) is often a time critical operation. Scoring can be done online (real-time), e.g., while a customer is browsing a webpage or using a mobile app, where on-the-spot recommendations can be made based on current actions. Scoring can also be done offline (batch), e.g., predict which of your 100 million customers will respond to each of a dozen offers, e.g., where applications leverage results to identify which customers should be targeted with a particular ad campaign or special offer.

In this blog post, we explore where using Oracle R Enterprise pays huge dividends. When working with small data, R can be sufficient, even when pulling data from a database. However, depending on the algorithm, benefits of in-database computation can be seen in a few thousand rows. The time difference with 10s of thousands of rows makes an interactive session more interactive, whereas 100s of thousands of rows becomes a real productivity gain, and on millions (or billions) of rows, becomes a competitive advantage! In addition to performance benefits, ORE integrates R into the database enabling you to leave data in place.

We’ll look at a few proof points across Oracle R Enterprise features, including:

• OREdm – a new package that provides R access to several in-database Oracle Data Mining algorithms (Attribute Importance, Decision Tree, Generalized Linear Models, K-Means, Naïve Bayes, Support Vector Machine).
• OREpredict – a new package that enables scoring models built using select standard R algorithms in the database (glm, negbin, hclust, kmeans, lm, multinom, nnet, rpart).
• Embedded R Execution – an ORE feature that allows running R under database control and boosts real performance of CRAN predictive analytics packages by providing faster access to data than occurs between the database and client, as well as leveraging a more powerful database machine with greater RAM and CPU resources.

OREdm

Pulling data out of a database for any analytical tool impedes interactive data analysis due to access latency, either directly when pulling data out of the database or indirectly via an IT process that involves requesting data to be staged in flat files. Such latencies can quickly become intolerable. On the R front, you’ll also need to consider whether the data will fit in memory. If flat files are involved, consideration needs to be given to how files will be stored, backed up, and secured.

Of course, model building and data scoring execution time is only part of the story. Consider a scenario A, the “build combined script,” where data is extracted from the database, and an R model built and persisted for later use. In the corresponding scenario B, the “score combined script”, data is pulled from the database, a previously built model loaded, data scored, and the scores written to the database. This is a typical scenario for use in, e.g., enterprise dashboards or within an application supporting campaign management or next-best-offer generation. In-database execution provides significant performance benefits, even for relatively small data sets as included below. Readers should be able to reproduce such results at these scales. We’ve also included a Big Data example by replicating the 123.5 million row ONTIME data set to 1 billion rows. Consider the following examples:

Linear Models: We compared R lm and ORE ore.lm in-database algorithm on the combined scripts. On datasets ranging from 500K to 1.5M rows with 3-predictors, in-database analytics showed an average 2x-3x performance improvement for build, and nearly 4x performance improvement for scoring. Notice in Figure 1 that the trend is significantly less for ore.lm than lm, indicating greater scalability for ore.lm.

Figure 1. Overall lm and ore.lm execution time for model building (A) and data scoring (B)

Figure 2 provides a more detailed view comparing data pull and model build time for build detail, followed by data pull, data scoring, and score writing for score detail. For model building, notice that while data pull is a significant part of lm’s total build time, the actual build time is still greater than ore.lm. A similar statement can be made in the case of scoring.

Figure 2. Execution time components for lm and ore.lm (excluding model write and load)

Naïve Bayes from the e1071 package: On 20-predictor datasets ranging from 50k to 150k rows, in-database ore.odmNB improved data scoring performance by a factor of 118x to 418x, while the full scenario B execution time yielded a 13x performance improvement, as depicted in Figure 3B. Using a non-parallel execution of ore.odmNB, we see the cross-over point where ore.odmNB overtakes R, but more importantly, the slope of the trend points to the greater scalability of ORE, as depicted in Figure 3A for the full scenario A execution time.

Figure 3. Overall naiveBayes and ore.odmNB execution time for model building (A) and data scoring (B)

K-Means clustering: Using 6 numeric columns from the ONTIME airline data set ranging from 1 million to 1 billion rows, we compare in-database ore.odmKMeans with R kmeans through embedded R execution with ore.tableApply. At 100 million rows, ore.odmKMeans demonstrates better performance than kmeans , and scalability at 1 billion rows. The performance results depicted in Figure 4 uses a log-log plot. The legend shows the function invoked and corresponding parameters, using subset of ONTIME data set d. While ore.odmKMeans scales linearly with number of rows, R kmeans does not. Further, R kmeans did not complete at 1 billion rows.

Figure 4: K-Means clustering model building on Big Data

OREpredict

With OREpredict, R users can also benefit from in-database scoring of R models. This becomes evident not only when considering the full “round trip” of pulling data from the database, scoring in R, and writing data back to the database, but also for the scoring itself.

Consider an lm model built using a dataset with 4-predictors and 1 million to 5 million rows. Pulling data from the database, scoring, and writing the results back to the database shows a pure R-based approach taking 4x - 9x longer than in-database scoring using ore.predict with that same R model. Notice in Figure 5 that the slope of the trend is dramatically less for ore.predict than predict, indicating greater scalability. When considering the scoring time only, ore.predict was 20x faster than predict in R for 5M rows. In ORE 1.3, ore.predict is recommended and will provide speedup over R for numeric predictors.

Figure 5. Overall lm execution time using R predict vs. ORE ore.predict

For rpart, we see a similar result. On a 20-predictor, 1 million to 5 million row data set, ore.predict resulted in a 6x – 7x faster execution. In Figure 5, we again see that the slope of the trend is dramatically less for ore.predict than predict, indicating greater scalability. When considering the scoring time only, ore.predict was 123x faster than predict in R for 5 million rows.

Figure 6. Overall rpart Execution Time using R predict vs. ORE ore.predict

This scenario is summarized in Figure 7. In the client R engine, we have the ORE packages installed. There, we invoke the pure R-based script, which requires pulling data from the database. We also invoke the ORE-based script that keeps the data in the database.

Figure 7. Summary of OREpredict performance gains

To use a real world data set, we again consider the ONTIME airline data set with 123.5 million rows. We will build lm models with varying number of coefficients derived by converting categorical data to multiple columns. The variable p corresponds to the number of coefficients resulting from the transformed formula and is dependent on the number of distinct values in the column. For example, DAYOFWEEK has 7 values, so with DEPDELAY, p=9. In Figure 8, you see that using an lm model with embedded R for a single row (e.g., one-off or real-time scoring), has much more overhead (as expected given that an R engine is being started) compared to ore.predict, which shows subsecond response time through 40 coefficients at 0.54 seconds, and the 106 coefficients at 1.1 seconds. Here are the formulas describing the columns included in the analysis:

• ARRDELY ~ DEPDELAY (p=2)
• ARRDELY ~ DEPDELAY + DAYOFWEEK (p=8)
• ARRDELY ~ DEPDELAY + DAYOFWEEK + MONTH (p=19)
• ARRDELY ~ DEPDELAY + DAYOFWEEK + MONTH + YEAR (p=40)
• ARRDELY ~ DEPDELAY + DAYOFWEEK + MONTH + YEAR (p=106)

Figure 8. Comparing performance of ore.predict with Embedded R Execution for lm

Compare this with scoring the entire ONTIME table of 123.5 million rows. We see that ore.predict outperforms embedded R until about 80 coefficients, when embedded R becomes the preferred choice.

Data Movement between R and Database: Embedded R Execution

One advantage of R is its community and CRAN packages. The goal for Oracle R Enterprise with CRAN packages is to enable reuse of these packages while:

• Leveraging the parallelization and efficient data processing capabilities of Oracle Database
• Minimizing data transfer and communication overhead between R and the database
• Leveraging R as a programming language for writing custom analytics

There are three ways in which we’ll explore the performance of pulling data.

1) Using ore.pull at a separate client R engine to pull data from the database

2) Using Embedded R Execution and ore.pull within an embedded R script from a database-spawned R engine

3) Using Embedded R Execution functions for data-parallelism and task-parallelism to pass database data to the embedded R script via function parameter

With ORE Embedded R Execution (ERE), the database delivers data-parallelism and task-parallelism, and reduces data access latency due to optimized data transfers into R. Essentially, R runs under the control of the database. As illustrated in Figure 9, loading data at the database server is 12x faster than loading data from the database to a separate R client. Embedded R Execution also provides a 13x advantage when using ore.pull invoked at the database server within an R closure (function) compared with a separate R client. The data load from database to R client is depicted as 1x – the baseline for comparison with embedded R execution data loading.

Figure 9. Summary of Embedded R Execution data load performance gains

Data transfer rates are displayed in Figure 10, for a table with 11 columns and 5 million to 15 million rows of data. Loading data via ORE embedded R execution using server-side ore.pull or through the framework with, e.g., ore.tableApply (one of the embedded R execution functions) is dramatically faster than a non-local client load via ore.pull. The numbers shown reflect MB/sec data transfer rates, so a bigger bar is better!

Figure 10. Data load and write execution time with 11 columns

While this is impressive, let’s expand our data up to 1 billion rows. To create our 1 billion row data set (1.112 billion rows), we duplicated the 123.5 million row ONTIME dataset 9 times, replacing rows with year 1987 with years 2010 through 2033, and selecting 6 integer columns (YEAR, MONTH, DAYOFMONTH, ARRDELAY, DEPDELAY, DISTANCE) with bitmap index of columns (YEAR, MONTH, DAYOFMONTH). The full data set weighs in at ~53 GB.

In Figure 11, we see linear scalability for loading data into the client R engine. Times range from 2.8 seconds for 1 million rows, to 2700 seconds for 1 billion rows. While your typical user may not need to load 1 billion rows into R memory, this graph demonstrates the feasibility to do so.

Figure 11. Client Load of Data via ore.pull for Big Data

In Figure12, we look at how degree of parallelism (DOP) affects data load times involving ore.rowApply. This test addresses the question of how fast ORE can load 1 billion, e.g., when scoring data. The degree of parallelism corresponds to the number of R engines that are spawned for concurrent execution at the database server. The number of chunks the data is divided into is 1 for a single degree of parallelism, and 10 times the DOP for the remaining tests. For DOP of 160, the data was divided into 1600 chunks, i.e., 160 R engines were spawned, each processing 10 chunks. The graph on the left depicts that execution times improve for the 1 billion row data set through DOP of 160. As expected, at some point, the overhead of spawning additional R engines and partitioning the data outweighs the benefit. At its best time, processing 1 billion rows took 43 seconds.

Figure 12. Client Load of Data via ore.pull for Big Data

In the second graph of Figure 12, we contrast execution time for the “sweet spot” identified in the previous graph with varying number of rows. Using this DOP of 160, with 1600 chunks of data, we see that through 100 million rows, there is very little increase in execution time (between 6.4 and 8.5 seconds in actual time). While 1 billion rows took significantly more, it took only 43 seconds.

We can also consider data write at this scale. In Figure 13, we also depict linear scalability from 1 million through 1 billion rows using the ore.create function to creating database tables from R data. Actual times ranged from 2.6 seconds to roughly 2600 seconds.

Figure 13. Data Write using ore.create for Big Data

ORE supports data-parallelism to enable, e.g., building predictive models in parallel on partitions of the data. Consider a marketing firm that micro-segments customers and builds predictive models on each segment. ORE embedded R execution automatically partitions the data, spawns R engines according to the degree of parallelism specified, and executes the specified user R function. To address how efficiently ore.groupApply can process data, Figure 14 shows the total execution time to process the 123.5M rows from the ONTIME data with varying number of columns. The figure shows that ore.groupApply scales linearly as the number of columns increases. Three columns were selected based on their number of distinct values: TAILNUM 12861, DEST 352, and UNIQUECARRIER 29. For UNIQUECARRIER, all columns (total of 29 columns) could not be completed since 29 categories resulted in data too large for a single R engine.

Figure 14. Processing time for 123.5M rows via ore.groupApply

ORE also supports row-parallelism, where the same embedded R function can be invoked on chunks of rows. As with ore.groupApply, depending on the specified degree of parallelism, a different chunk of rows will be submitted to a dynamically spawned database server-side R engine. Figure 15 depicts a near linear execution time to process the 123.5M rows from ONTIME with varying number of columns. The chunk size can be specified, however, testing 3 chunk sizes (10k, 50k, and 100k rows) showed no significant difference in overall execution time, hence a single line is graphed.

Figure 15. Processing time for 123.5M rows via ore.rowApply for chunk sizes 10k-100k

All tests were performed on an Exadata X3-8. Except as noted, the client R session and database were actually on the same machine, so network latency for data read and write were minimum. Over a LAN or WAN, the benefits of in-database execution and ORE will be even more dramatic.

# Oracle Announces Availability of Oracle Advanced Analytics for Big Data

## Oracle Integrates R Statistical Programming Language into Oracle Database 11g

REDWOOD SHORES, Calif. - February 8, 2012

### News Facts

Oracle today announced the availability of  Oracle Advanced Analytics, a new option for Oracle Database 11g that bundles Oracle R Enterprise together with Oracle Data Mining.
Oracle R Enterprise delivers enterprise class performance for users of the R statistical programming language, increasing the scale of data that can be analyzed by orders of magnitude using Oracle Database 11g.
R has attracted over two million users since its introduction in 1995, and Oracle R Enterprise dramatically advances capability for R users. Their existing R development skills, tools, and scripts can now also run transparently, and scale against data stored in Oracle Database 11g.
Customer testing of Oracle R Enterprise for Big Data analytics on Oracle Exadata has shown up to 100x increase in performance in comparison to their current environment.
Oracle Data Mining, now part of Oracle Advanced Analytics, helps enable customers to easily build and deploy predictive analytic applications that help deliver new insights into business performance.
Oracle Advanced Analytics, in conjunction with Oracle Big Data Appliance, Oracle Exadata Database Machine and Oracle Exalytics In-Memory Machine, delivers the industry’s most integrated and comprehensive platform for Big Data analytics.

### Comprehensive In-Database Platform for Advanced Analytics

Oracle Advanced Analytics brings analytic algorithms to data stored in Oracle Database 11g and Oracle Exadata as opposed to the traditional approach of extracting data to laptops or specialized servers.
With Oracle Advanced Analytics, customers have a comprehensive platform for real-time analytic applications that deliver insight into key business subjects such as churn prediction, product recommendations, and fraud alerting.
By providing direct and controlled access to data stored in Oracle Database 11g, customers can accelerate data analyst productivity while maintaining data security throughout the enterprise.
Powered by decades of Oracle Database innovation, Oracle R Enterprise helps enable analysts to run a variety of sophisticated numerical techniques on billion row data sets in a matter of seconds making iterative, speed of thought, and high-quality numerical analysis on Big Data practical.
Oracle R Enterprise drastically reduces the time to deploy models by eliminating the need to translate the models to other languages before they can be deployed in production.
Oracle R Enterprise integrates the extensive set of Oracle Database data mining algorithms, analytics, and access to Oracle OLAP cubes into the R language for transparent use by R users.
Oracle Data Mining provides an extensive set of in-database data mining algorithms that solve a wide range of business problems. These predictive models can be deployed in Oracle Database 11g and use Oracle Exadata Smart Scan to rapidly score huge volumes of data.
The tight integration between R, Oracle Database 11g, and Hadoop enables R users to write one R script that can run in three different environments: a laptop running open source R, Hadoop running with Oracle Big Data Connectors, and Oracle Database 11g.
Oracle provides single vendor support for the entire Big Data platform spanning the hardware stack, operating system, open source R, Oracle R Enterprise and Oracle Database 11g.
To enable easy enterprise-wide Big Data analysis, results from Oracle Advanced Analytics can be viewed from Oracle Business Intelligence Foundation Suite and Oracle Exalytics In-Memory Machine.

### Supporting Quotes

“Oracle is committed to meeting the challenges of Big Data analytics. By building upon the analytical depth of Oracle SQL, Oracle Data Mining and the R environment, Oracle is delivering a scalable and secure Big Data platform to help our customers solve the toughest analytics problems,” said Andrew Mendelsohn, senior vice president, Oracle Server Technologies.
“We work with leading edge customers who rely on us to deliver better BI from their Oracle Databases. The new Oracle R Enterprise functionality allows us to perform deep analytics on Big Data stored in Oracle Databases. By leveraging R and its library of open source contributed CRAN packages combined with the power and scalability of Oracle Database 11g, we can now do that,” said Mark Rittman, co-founder, Rittman Mead.

### Supporting Resources

Oracle engineers hardware and software to work together in the cloud and in your data center. For more information about Oracle (NASDAQ: ORCL), visit http://www.oracle.com.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

### Contact Info

 Eloy Ontiveros Oracle +1.650.607.6458 eloy.ontiveros@oracle.com Joan Levy Blanc & Otus for Oracle +1.415.856.5110 jlevy@blancandotus.com