X

Building "partition models" with Oracle R Enterprise

By: Mark Hornick | Director, Advanced Analytics and Machine Learning

There are many approaches for improving model accuracy - anything from enriching or cleansing the data you start with to optimizing algorithm parameters or creating ensemble models. One technique that Oracle R Enterprise users sometimes employ is to partition data based on the distinct values of one or more columns and build a model for each partition. By building a model on each partition, forming a kind of ensemble model, better accuracy is possible. The embedded R execution function ore.groupApply enables users to do this in a data-parallel manner; however, this is requires managing the models explicitly.

New in Oracle R Enterprise 1.5.1 is a feature that further automates this process. Oracle R Enterprise allows users to easily specify "partition models" by providing a single argument that excepts one or more columns on which to partition the data. The result is a single model that is used and managed as one model.

When the model is built, the abstraction of a single model is provided to the user. This single model can then be used for scoring new data. Oracle R Enterprise automatically partitions the new data and selects the appropriate component model to be used for scoring.

The same capability is available through the SQL API of Oracle Data Mining.

Let's look at an example. We'll use the data set stored in the database table 'WINE' originally from the UCI repository  https://archive.ics.uci.edu/ml/datasets/Wine+Quality, consisting of over 6000 data samples for both red and white wines. We will illustrate the use of partitioned models to first build a default Support Vector Machine model, then a partition model separating red wines from white wines.

First, using the ORE transparency layer, we randomly split the data into train and test sets. Note that sampling occurs in the database and the resulting samples remain in the database as well, producing ore.frame proxy objects.

> n.rows <- nrow(WINE)
> row.names(WINE) <- WINE$color
> set.seed(seed=6218945)
> random.sample <- sample(1:n.rows, ceiling(n.rows/2)) 
> WINE.train <- WINE[random.sample,]
> WINE.test  <- WINE[setdiff(1:n.rows,random.sample),]

Using the Oracle Advanced Analytics in-database Support Vector Machine (SVM) algorithm, we build a classification model that predicts quality based on the remaining columns, excluding pH and fixed.acidity as a result of using ore.odmAI for attribute importance / feature selection (not shown). Next we predict using the model on the test data set, appending the quality column to the result to facilitate generating a confusion matrix. .

> mod.svm   <- ore.odmSVM(quality~.-pH-fixed.acidity, WINE.train, 
+                         "classification",kernel.function="linear")
> pred.svm  <- predict (mod.svm, WINE.test,"quality")

First, let's example the prediction result. It contains probabilities for each class, by default, along with the actual value and prediction. Looking at the confusion matrix, we see the model only predicts three classes, predominantly focused on quality 5 and 6.

> head(pred.svm,3)
             '3'        '4'       '5'        '6'       '7'        '8'        '9'
red   0.12740669 0.12740718 0.2355671 0.12740755 0.1274072 0.12740682 0.12739745
red.1 0.12511451 0.12511496 0.2493204 0.12511526 0.1251149 0.12511455 0.12510538
red.2 0.08893389 0.08893407 0.4664034 0.08893388 0.0889339 0.08893366 0.08892724
      quality PREDICTION
red         6          5
red.1       6          5
red.2       5          5
> with(pred.svm, table(quality,PREDICTION, dnn = c("Actual","Predicted")))
      Predicted
Actual   5   6   9
     3  14   2   0
     4  74  24   0
     5 923 152   0
     6 789 631   3
     7 130 405   0
     8  20  79   0
     9   0   2   0

Let's now build a partitioned model, separating data based on the color column. All the other parameters remain the same for both model building and predicting.

> mod.svm2   <- ore.odmSVM(quality~.-pH-fixed.acidity, WINE.train, 
+                         "classification",kernel.function="linear",
+                         odm.settings=list(odms_partition_columns = "color"))
> pred.svm2  <- predict (mod.svm2, WINE.test,"quality")

Looking at the first few predictions, we see that there are no scores produced for quality 9. Looking further into the data, this is because no reds received a rating of 9.

> head(pred.svm2,3)
               '3'       '4'       '5'       '6'       '7'       '8' '9' quality
red   1.936908e-04 0.1390664 0.2136602 0.3689467 0.1390664 0.1390666  NA       6
red.1 1.310798e-05 0.1366415 0.3563299 0.2337322 0.1366414 0.1366418  NA       6
red.2 2.127579e-02 0.1236327 0.4279944 0.1798320 0.1236323 0.1236328  NA       5
      PREDICTION
red            6
red.1          5
red.2          5

See the confusion matrix from the partition model where the predictions have improved modestly, but more classes are being predicted. Clearly, more work can be done on tuning the model, but this serves to illustrate the concept and use of partition models.

> with(pred.svm2, table(quality,PREDICTION, dnn = c("Actual","Predicted")))
      Predicted
Actual   3   5   6   7   8   9
     3   0  13   3   0   0   0
     4   0  73  24   0   0   1
     5   1 865 208   0   1   0
     6   1 694 725   2   1   0
     7   0 103 426   5   1   0
     8   0  20  78   1   0   0
     9   0   1   1   0   0   0

With a partition model, we can also view the component model using the partition function. Further, we can get summary details on each component model by accessing it using the [] operator and invoking summary.

> partitions(mod.svm2)
  PARTITION_NAME color
1            red   red
2          white white
> summary(mod.svm2["red"])
$red

Call:
ore.odmSVM(formula = quality ~ . - pH - fixed.acidity, data = WINE.train, 
    type = "classification", kernel.function = "linear", odm.settings = list(odms_partition_columns = "color"))

Settings: 
                                               value
clas.weights.balanced                            OFF
odms.max.partitions                             1000
odms.missing.value.treatment odms.missing.value.auto
odms.partition.columns                       "color"
odms.sampling                  odms.sampling.disable
prep.auto                                         ON
active.learning                            al.enable
conv.tolerance                                 1e-04
kernel.function                               linear

Coefficients: 
   PARTITION_NAME class             variable value      estimate
1             red     3          (Intercept)       -6.527165e+00
2             red     3              alcohol       -7.049019e-01
3             red     3            chlorides       -4.485416e-02
4             red     3          citric.acid       -2.022373e+00
5             red     3              density       -1.086982e+00
6             red     3  free.sulfur.dioxide       -4.736011e-01
7             red     3       residual.sugar        9.530107e-01
8             red     3            sulphates        3.817229e-01
9             red     3 total.sulfur.dioxide       -1.438305e+00
10            red     3     volatile.acidity        3.716756e-01
11            red     4          (Intercept)       -1.000002e+00
12            red     4              alcohol       -2.586759e-07
13            red     4            chlorides        1.060210e-07
14            red     4          citric.acid       -4.288424e-07
15            red     4              density       -2.802913e-07
16            red     4  free.sulfur.dioxide        6.349260e-08
17            red     4       residual.sugar        2.162969e-07
18            red     4            sulphates        2.098022e-07
19            red     4 total.sulfur.dioxide       -2.514968e-07
20            red     4     volatile.acidity        4.040972e-07
21            red     5          (Intercept)       -3.970451e-01
22            red     5              alcohol       -7.772918e-01
23            red     5            chlorides        2.753498e-01
24            red     5          citric.acid       -1.069587e-01
25            red     5              density        1.424669e-01
26            red     5  free.sulfur.dioxide       -1.786699e-01
27            red     5       residual.sugar       -1.787814e-01
28            red     5            sulphates       -5.966614e-01
29            red     5 total.sulfur.dioxide        4.356998e-01
30            red     5     volatile.acidity        7.460307e-02
31            red     6          (Intercept)       -3.563915e-01
32            red     6              alcohol        5.199011e-01
33            red     6            chlorides       -6.433083e-02
34            red     6          citric.acid       -8.938135e-02
35            red     6              density       -3.060354e-02
36            red     6  free.sulfur.dioxide        1.633804e-01
37            red     6       residual.sugar       -3.666848e-02
38            red     6            sulphates        2.400912e-01
39            red     6 total.sulfur.dioxide       -2.590342e-01
40            red     6     volatile.acidity       -2.599212e-01
41            red     7          (Intercept)       -1.000002e+00
42            red     7              alcohol        1.288288e-06
43            red     7            chlorides       -4.242262e-07
44            red     7          citric.acid        1.310518e-07
45            red     7              density        1.236209e-07
46            red     7  free.sulfur.dioxide        4.925050e-07
47            red     7       residual.sugar        3.419610e-07
48            red     7            sulphates        7.295076e-07
49            red     7 total.sulfur.dioxide       -1.046192e-06
50            red     7     volatile.acidity       -5.349807e-07
51            red     8          (Intercept)       -1.000000e+00
52            red     8              alcohol        6.933444e-08
53            red     8            chlorides       -8.491114e-08
54            red     8          citric.acid        4.561492e-08
55            red     8              density        1.115950e-08
56            red     8  free.sulfur.dioxide       -8.925505e-08
57            red     8       residual.sugar        3.791063e-08
58            red     8            sulphates        5.243467e-08
59            red     8 total.sulfur.dioxide        3.168575e-08
60            red     8     volatile.acidity        3.733357e-08

The ORE partition model feature makes it is easy for users to specify, build and score on partitions of their data, potentially resulting in improved model accuracy. Many of the ore.odm* algorithms support the "partition model" feature.

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha