All Things Database: Education, Best Practices, Use Cases & More

Using Oracle Machine Learning for Python - building predictive models

Mark Hornick
Senior Director, Data Science and Machine Learning
This is a syndicated post.

In my previous post, we introduced some basic features of OML4Py. In this post, I'll illustrate the use of some in-database machine learning algorithms.

Creating in-database predictive models

In this example, we use the Random Forest algorithm with an optional cost matrix to predict Species from the IRIS table. First, we get a DataFrame proxy object for the IRIS table, then use the overloaded split function to obtain training and test data sets. These are also proxy objects - the data remains in the database. We then create and display a demo cost matrix (just for illustration) before building a Random Forest model. Note that the cost matrix is itself a named database table, RF_COST.

import oml
import pandas as pd

IRIS = oml.sync(table = "IRIS")        # obtain proxy object
IRIS_TRAIN, IRIS_TEST = IRIS.split()   # split data using default ratio
TRAIN_x = IRIS_TRAIN.drop('Species')   # predictor columns
TRAIN_y = IRIS_TRAIN['Species']        # target column

# Create a cost matrix table in the database
cost_matrix = [['setosa', 'setosa', 0],
  ['setosa', 'virginica', 0.2],
  ['setosa', 'versicolor', 0.8],
  ['virginica', 'virginica', 0],
  ['virginica', 'setosa', 0.5],
  ['virginica', 'versicolor', 0.5],
  ['versicolor', 'versicolor', 0],
  ['versicolor', 'setosa', 0.4],
  ['versicolor', 'virginica', 0.6]]

COST = oml.create(pd.DataFrame(cost_matrix,
                               columns = ['ACTUAL_TARGET_VALUE', 
                                          'PREDICTED_TARGET_VALUE', 
                                          'COST']),
                  "RF_COST")
COST.pull().pivot(index='ACTUAL_TARGET_VALUE',         
                  columns='PREDICTED_TARGET_VALUE',values='COST')

With the output:

PREDICTED_TARGET_VALUE   setosa   versicolor   virginica
ACTUAL_TARGET_VALUE
setosa                      0.0          0.8         0.2
versicolor                  0.4          0.0         0.6
virginica                   0.5          0.5         0.0
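
To see how such a cost matrix is actually used, here is a small sketch of cost-sensitive prediction - plain NumPy rather than the OML4Py API, with made-up class probabilities: instead of taking the most probable class, we pick the class with the minimum expected cost.

```python
import numpy as np

# Cost matrix from above: rows = actual class, columns = predicted class,
# in the order setosa, versicolor, virginica.
C = np.array([[0.0, 0.8, 0.2],
              [0.4, 0.0, 0.6],
              [0.5, 0.5, 0.0]])
labels = ['setosa', 'versicolor', 'virginica']

# Hypothetical class probabilities for a single row of data.
p = np.array([0.30, 0.40, 0.30])

# Expected cost of each candidate prediction: sum over actual classes
# of P(actual) * cost(actual, predicted).
expected_cost = p @ C                         # [0.31, 0.39, 0.30]

print(labels[int(np.argmax(p))])              # most probable: versicolor
print(labels[int(np.argmin(expected_cost))])  # cheapest: virginica
```

Here the cost matrix flips the decision: versicolor is the most probable class, but predicting virginica has the lower expected cost (0.30 vs. 0.39).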

To build our Random Forest model, we create an rf object, specifying a maximum tree depth of 4, and invoke fit on our training data with the cost matrix. To display model details, we print the object. Note that the Random Forest model also reports attribute importance.

from oml import rf

rf_mod = rf(tree_term_max_depth = '4')
rf_mod = rf_mod.fit(TRAIN_x, TRAIN_y, cost_matrix = COST)

rf_mod   # Show model details

With the output:

Algorithm Name: Random Forest

Mining Function: CLASSIFICATION

Target: Species

Settings:
                    setting name            setting value
0                      ALGO_NAME       ALGO_RANDOM_FOREST
1           CLAS_COST_TABLE_NAME      "PYQUSER"."RF_COST"
2              CLAS_MAX_SUP_BINS                       32
3          CLAS_WEIGHTS_BALANCED                      OFF
4                   ODMS_DETAILS              ODMS_ENABLE
5   ODMS_MISSING_VALUE_TREATMENT  ODMS_MISSING_VALUE_AUTO
6               ODMS_RANDOM_SEED                        0
7                  ODMS_SAMPLING    ODMS_SAMPLING_DISABLE
8                      PREP_AUTO                       ON
9                 RFOR_NUM_TREES                       20
10           RFOR_SAMPLING_RATIO                       .5
11          TREE_IMPURITY_METRIC       TREE_IMPURITY_GINI
12           TREE_TERM_MAX_DEPTH                        4
13         TREE_TERM_MINPCT_NODE                      .05
14        TREE_TERM_MINPCT_SPLIT                       .1
15         TREE_TERM_MINREC_NODE                       10
16        TREE_TERM_MINREC_SPLIT                       20

Computed Settings:
  setting name setting value
0    RFOR_MTRY             2

Global Statistics:
   AVG_DEPTH  AVG_NODECOUNT  MAX_DEPTH  MAX_NODECOUNT  MIN_DEPTH  \
0       3.25            5.6        4.0            6.0        4.0   

   MIN_NODECOUNT  NUM_ROWS  
0            4.0     102.0  

Attributes:
SEPAL_LENGTH
SEPAL_WIDTH
PETAL_LENGTH
PETAL_WIDTH

Partition: NO

Importance:

  ATTRIBUTE_NAME ATTRIBUTE_SUBNAME  ATTRIBUTE_IMPORTANCE
0   PETAL_LENGTH              None              0.584287
1    PETAL_WIDTH              None              0.512901
2   SEPAL_LENGTH              None              0.136424
3    SEPAL_WIDTH              None              0.038918
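
As a quick sanity check on those importance values, we can normalize them in a local pandas Series - this is a client-side sketch with the values copied from the output above, not an OML4Py call:

```python
import pandas as pd

# Importance values copied from the model output above.
imp = pd.Series({'PETAL_LENGTH': 0.584287,
                 'PETAL_WIDTH':  0.512901,
                 'SEPAL_LENGTH': 0.136424,
                 'SEPAL_WIDTH':  0.038918})

share = imp / imp.sum()   # normalize to relative contributions
petal_share = share['PETAL_LENGTH'] + share['PETAL_WIDTH']
print(round(petal_share, 2))   # the petal measurements dominate: 0.86
```

The two petal measurements account for roughly 86% of the total importance, which matches the well-known fact that petal dimensions separate the iris species far better than sepal dimensions.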

Using the in-database Random Forest model to make predictions

The rf_mod object is itself a proxy object for its corresponding in-database model. We predict using the IRIS_TEST proxy object and optionally specify supplemental columns from the original data. Supplemental columns are useful for comparing actual target values against predicted values, or for including unique keys for subsequent row identification or joins with other tables. You may recall that row order is not guaranteed in results from a relational database unless the user specifies a sort (i.e., ORDER BY). Why? Sorting is often unnecessary and can be an expensive operation, depending on data size. The predicted results in pred are also a DataFrame proxy object, since the result could be large (e.g., millions or billions of rows).

pred = rf_mod.predict(IRIS_TEST.drop('Species'), 
                      supplemental_cols = IRIS_TEST[:, ['SEPAL_LENGTH', 
                                                        'SEPAL_WIDTH',
                                                        'PETAL_LENGTH', 
                                                        'Species']])
print("Shape:",pred.shape)
pred.head(10)

With the output:

Shape: (48, 5)

   SEPAL_LENGTH  SEPAL_WIDTH  PETAL_LENGTH Species PREDICTION
0           4.9          3.0           1.4  setosa     setosa
1           4.9          3.1           1.5  setosa     setosa
2           4.8          3.4           1.6  setosa     setosa
3           5.8          4.0           1.2  setosa     setosa
4           5.1          3.5           1.4  setosa     setosa
5           5.0          3.4           1.6  setosa     setosa
6           5.2          3.5           1.5  setosa     setosa
7           5.4          3.4           1.5  setosa     setosa
8           4.9          3.1           1.5  setosa     setosa
9           5.0          3.2           1.2  setosa     setosa
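
Once results are pulled to the client (e.g., via pred.pull()), the supplemental Species column makes accuracy a one-liner in pandas. A minimal local sketch, using made-up rows rather than the actual pulled result:

```python
import pandas as pd

# Hypothetical local sample: the supplemental Species column lines up
# actual values against predictions row by row.
local = pd.DataFrame({
    'Species':    ['setosa', 'versicolor', 'versicolor', 'virginica'],
    'PREDICTION': ['setosa', 'versicolor', 'virginica',  'virginica'],
})

# Fraction of rows where the prediction matches the actual class.
accuracy = (local['Species'] == local['PREDICTION']).mean()
print(accuracy)  # 0.75
```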

Next, let's compute a confusion matrix for the model predictions to assess model accuracy. Using the overloaded crosstab function, we compute the confusion matrix between the actual values in Species and the predicted result in the PREDICTION column directly in the database.

res_ct = pred.crosstab('Species','PREDICTION',pivot=True)
print("Type:",type(res_ct))
print("Columns:",res_ct.columns)
res_ct.sort_values(by='Species')

With the output:

Type: <class 'oml.core.frame.DataFrame'>
Columns: ['Species', 'count_(setosa)', 'count_(versicolor)', 'count_(virginica)']
     Species  count_(setosa)  count_(versicolor)  count_(virginica)
0      setosa              16                   0                  0
1  versicolor               0                  13                  2
2   virginica               0                   0                 17
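
From this confusion matrix we can read off overall accuracy locally. Rebuilding the printed table in pandas (values copied from the output above):

```python
import pandas as pd

# The confusion matrix printed above, rebuilt from its values.
cm = pd.DataFrame({'count_(setosa)':     [16, 0, 0],
                   'count_(versicolor)': [0, 13, 0],
                   'count_(virginica)':  [0, 2, 17]},
                  index=['setosa', 'versicolor', 'virginica'])

correct = cm.values.diagonal().sum()   # 16 + 13 + 17 = 46
total = cm.values.sum()                # all 48 test rows
print(round(correct / total, 4))       # overall accuracy: 0.9583
```

The only errors are two versicolor rows predicted as virginica - consistent with the well-known overlap between those two species.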

In my next post, we'll look at the embedded Python execution features of OML4Py.
