Learn about Oracle Machine Learning for Oracle Database and Big Data, on-premises and Oracle Cloud

  • June 2, 2021

Metrics for Classification Using OML4Py Part I

Jie Liu
Data Scientist

Metrics for Classification

Think about the following scenario. As a seasoned data scientist, you have spent a lot of time and effort tackling a challenging dataset. Finally, you built a good model, with creative feature engineering and smart data cleaning. The screen displays a great AUC score (0.9, unbelievable!). How exciting!

Then the stakeholder comes in and throws out a question: how does such a high AUC score help me target customers? There is sometimes a gap between data science terminology and business needs, and AUC is a perfect example. How do you explain AUC to people unfamiliar with machine learning? It is not that easy. The good news is that we can use multiple metrics to evaluate model performance. Depending on business needs, some metrics fit particular user scenarios better and help to bridge the gap between the data scientist and the line of business.

Open source packages provide abundant APIs to compute various kinds of model metrics. When the dataset is large, however, scaling up the computation is not easy. If the dataset resides in Oracle Database, we can leverage Oracle Machine Learning tools to compute such metrics without moving the data out of the database.

The newly released Oracle Machine Learning for Python (OML4Py) API brings benefits similar to those in OML4R: a transparency layer, in-database algorithms, and embedded Python execution. New in OML4Py is automated machine learning.
In this blog, we demonstrate how to compute metrics in a scalable way using OML4Py and Oracle SQL. We compute the statistics needed for each metric using SQL queries and then generate plots.

Data Overview

We use the Oracle-produced customer insurance lifetime value dataset for our demonstration. The use case involves an insurance company targeting customers likely to buy insurance, based on each customer's lifetime value, demographics, and financial features. The following is a glimpse into this dataset with a subset of the columns.

Based on the column names, we can see that the dataset contains user demographic features such as state, region, gender, and marital status, and financial features such as income and credit card limits.

The main business problem here is to find out which customers are likely to buy an insurance policy. In the dataset, the ground truth of whether a customer has purchased an insurance policy is given by the column BUY_INSURANCE. This is a typical binary classification problem, and we can use all the features (columns) provided in this dataset to build a model.

Model Building

In this blog, we pick a basic logistic regression model as an example for later discussion. In OML4Py, we can use the following code to build the model.
First, we load the data from the table CUSTOMER_INSURANCE_LTV in Oracle Database and create a table with a subset of its columns.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import oml
CUST_DF = oml.sync(schema = 'JIE', table = 'CUSTOMER_INSURANCE_LTV')

We recode the BUY_INSURANCE column to numerical values for convenience of later computation. We use 1 for positive label ‘Yes’ and 0 for negative label ‘No’. This can be done using the replace function in OML4Py.

CUST_DF = CUST_DF.replace(old = ['Yes'], new = [1.0], default = 0.0, columns = ['BUY_INSURANCE'])
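For readers more familiar with pandas, the in-database recode above behaves like the following local sketch. The toy DataFrame is hypothetical, standing in for the BUY_INSURANCE column; OML4Py's replace performs the equivalent operation in-database without pulling the data to the client.

```python
import pandas as pd

# Toy stand-in for the BUY_INSURANCE column (not the actual table).
df = pd.DataFrame({'BUY_INSURANCE': ['Yes', 'No', 'Yes']})

# Map 'Yes' to 1.0 and everything else to the default 0.0.
df['BUY_INSURANCE'] = (df['BUY_INSURANCE'] == 'Yes').astype(float)
print(df['BUY_INSURANCE'].tolist())  # [1.0, 0.0, 1.0]
```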

After that, we materialize the OML data frame into a table for later use.

try:
    oml.drop(table = 'CUST_SUBSET_TBL')
except:
    print("No such table")
CUST_SUBSET_DF = CUST_DF.materialize(table = 'CUST_SUBSET_TBL')

Split the data into train and test datasets. We will build the model on the train dataset and evaluate the model on the test dataset. The default split ratio is 70% training data, 30% test data.

dat = CUST_SUBSET_DF.split(seed = 1)
train_x = dat[0].drop(['BUY_INSURANCE'])
train_y = dat[0]['BUY_INSURANCE']
test_x = dat[1]
test_y = dat[1]['BUY_INSURANCE']

We materialize the test dataset for later use. This is convenient for running SQL queries against the materialized table to compute metrics, which will be shown later.

try:
    oml.drop(table = 'CUST_TEST_TBL')
except:
    print("No such table")
_ = dat[1].materialize(table = 'CUST_TEST_TBL')

Now we build the model.

try:
    oml.drop(model = 'GLM_METRIC_MDL')
except:
    print("No such model")
setting = dict()
glm_mod = oml.glm("classification", **setting)
glm_mod.fit(train_x, train_y, case_id = 'CUSTOMER_ID', model_name = 'GLM_METRIC_MDL')

Metrics for Model Evaluation

To validate the model, we need to choose which metric to use. The choice of metric depends on the type of machine learning task. For classification, we can use metrics like accuracy, AUC (area under the curve), precision, recall, the precision-recall curve, F1-score, etc. For regression, we can choose metrics like mean squared error (MSE), mean absolute error (MAE), and R squared, which will be covered in another blog. In this blog, we focus on metrics for classification.


Accuracy

Accuracy measures the ratio of the number of correct predictions to the total number of data points in the test dataset. In OML4Py, after a model is built, we can simply call

glm_mod.score(test_x, test_y)

In this case, the accuracy is 0.715193.
The problem with accuracy is that it cannot provide the full picture, especially when the classes are unbalanced. Suppose only 1% of customers bought an insurance policy. If the model predicted all customers as negative, the accuracy would still be 99%. We need a fuller picture of how the model behaves.
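The imbalance problem can be made concrete with a toy example (hypothetical labels, not the blog's dataset): with 1% positives, a model that predicts "no" for everyone still scores 99% accuracy while missing every buyer.

```python
# 1% positive class: one buyer among 100 customers.
y_true = [1] * 1 + [0] * 99
# A useless model that predicts "no" for everyone.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99, despite missing every positive case
```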

Confusion Matrix

A confusion matrix gives all possible scenarios after we apply the model predictions.

  1. True positive: a customer who buys insurance is predicted as buying insurance (PREDICTION = 1, BUY_INSURANCE = 1)
  2. False positive: a customer who does not buy insurance is predicted as buying insurance (PREDICTION = 1, BUY_INSURANCE = 0)
  3. True negative: a customer who does not buy insurance is predicted as not buying insurance (PREDICTION = 0, BUY_INSURANCE = 0)
  4. False negative: a customer who buys insurance is predicted as not buying insurance (PREDICTION = 0, BUY_INSURANCE = 1)

Note that for a model that generates a probability score only, the confusion matrix needs a threshold for the prediction.

To obtain the confusion matrix and all other statistics, we can aggregate the prediction results as follows:

GLM_RES_DF = glm_mod.predict(test_x, 
                        supplemental_cols = test_x[['CUSTOMER_ID', 'BUY_INSURANCE']])
GLM_RES_PROB = glm_mod.predict_proba(test_x, supplemental_cols = test_x['CUSTOMER_ID'])
GLM_RES_DF = GLM_RES_DF.merge(GLM_RES_PROB, how = "inner", on = 'CUSTOMER_ID', suffixes = ["", ""])

The table obtained is as follows

Note that here the model produces PROBABILITY_OF_1 and PROBABILITY_OF_0, which are the predicted probabilities of the two cases. The final prediction is based on these probability scores: if PROBABILITY_OF_1 > 0.5, the prediction is 1; otherwise, the prediction is 0.
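The thresholding step can be sketched locally with NumPy. The probabilities below are hypothetical values standing in for the model's PROBABILITY_OF_1 column, just to show the rule.

```python
import numpy as np

# Hypothetical probabilities standing in for PROBABILITY_OF_1:
prob_of_1 = np.array([0.8, 0.3, 0.55, 0.1])

# Predict 1 when PROBABILITY_OF_1 > 0.5, otherwise 0.
prediction = (prob_of_1 > 0.5).astype(int)
print(prediction)  # [1 0 1 0]
```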

CONF_MAT = GLM_RES_DF.crosstab('BUY_INSURANCE', 'PREDICTION', pivot = True)
conf_mat = CONF_MAT.pull()
TP = conf_mat[conf_mat['BUY_INSURANCE'] == 1]['count_(1)'].values[0]
TN = conf_mat[conf_mat['BUY_INSURANCE'] == 0]['count_(0)'].values[0]
FP = conf_mat[conf_mat['BUY_INSURANCE'] == 0]['count_(1)'].values[0]
FN = conf_mat[conf_mat['BUY_INSURANCE'] == 1]['count_(0)'].values[0]
TOTAL = TP + TN + FP + FN
Accuracy = (TP+TN)/TOTAL
TN_P = np.round(TN/TOTAL*100, 1)
FP_P = np.round(FP/TOTAL*100, 1)
FN_P = np.round(FN/TOTAL*100, 1)
TP_P = np.round(TP/TOTAL*100, 1)
acc_p = np.round(Accuracy*100, 1)
print("\tPREDICTED 0\tPREDICTED 1\nACTUAL 0\t"+
      "True Negative: "+str(TN)+" ("+str(TN_P)+"%)\t"+
      "False Positive: "+str(FP)+" ("+str(FP_P)+"%)\nACTUAL 1\t"+
      "False Negative: "+str(FN)+" ("+str(FN_P)+"%)\t"+
      "True Positive: "+str(TP)+" ("+str(TP_P)+"%)\n"+
      "Accuracy: "+str(acc_p)+"%")

Note that the formatting code inside print generates a Zeppelin-style table as below:

Now we have a clearer picture of how the model behaves. The model keeps false positives low, but at the cost of a large number of false negatives. But what would be the impact on the business?

To understand more about this prediction performance, we can look further at precision and recall.

Precision, Recall, F1-score

Precision is, among the examples predicted as positive, the percentage of cases that are truly positive. In our scenario, it is the percentage of customers predicted as buying insurance who really do buy insurance. If we target the customers predicted as buying insurance, precision corresponds to the fraction of our budget that is spent effectively. Conversely, 1 - precision tells us how much of the campaign budget is wasted.

Recall is, among the examples that are truly positive, the percentage that are correctly predicted. In our case, it is the percentage of customers who buy insurance that the model correctly identifies. In this use case, it translates to how many real buyers the marketing campaign can reach.

Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
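These formulas can be computed directly from confusion-matrix counts. The counts below are hypothetical, chosen only to reproduce the rounded precision and recall reported in this blog, not the actual confusion matrix.

```python
def precision_recall(TP, FP, FN):
    """Precision and recall from confusion-matrix counts."""
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    return precision, recall

# Hypothetical counts chosen to reproduce the rounded values below:
p, r = precision_recall(TP=43, FP=57, FN=348)
print(round(p, 2), round(r, 2))  # 0.43 0.11
```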

In this case, we have precision = 0.43 and recall = 0.11. The precision tells us that among the targeted customers, 43% of them will eventually buy insurance. The recall tells us that, targeting customers this way, we reach only 11% of the total customers who want to buy insurance.

In most cases, the business prefers one measure over the other. Here, we have a relatively high precision, which is actually great for our use case. In an advertising campaign, we sometimes do not expect the ad to reach every customer who would convert; instead, we try to avoid wasting the budget. Although we reach only 11% of the converting customers, we know that 43% of the budget is spent effectively.

Sometimes one wants to evaluate the performance as a whole and strike a balance between these two measures. The F1-score can be used in this scenario.
The F1-score is defined as the harmonic mean of precision and recall:

F1 = 2 / (1/Precision + 1/Recall)

It takes a value between 0 and 1. When precision and recall are both high, the F1-score approaches 1. If either precision or recall is very low, the F1-score becomes very low. In our case, the F1-score is 0.17, which is low because we have a low recall.
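As a quick sketch, the F1 formula applied to the rounded precision and recall above gives a value close to the 0.17 reported here (the exact blog figure comes from the unrounded counts):

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 / (1 / precision + 1 / recall)

# Rounded precision and recall from above; result is roughly 0.17-0.18.
f1 = f1_score(0.43, 0.11)
print(f1)
```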

Note that these metrics depend on the threshold used for prediction. If we lower the threshold, we get higher recall and lower precision. How are precision and recall affected as the threshold is adjusted? The precision-recall curve provides the full picture. We will continue our journey in Part II of this blog with a discussion of precision and recall with visualizations.
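The trade-off can be sketched with a small threshold sweep. The scores and labels below are hypothetical, not the blog's data; the point is simply that lowering the threshold raises recall and lowers precision.

```python
import numpy as np

# Hypothetical probability scores and true labels:
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   0,   1,   0,   0])

def precision_recall_at(thr):
    """Precision and recall when predicting 1 for scores above thr."""
    pred = scores > thr
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    return tp / (tp + fp), tp / (tp + fn)

for thr in (0.25, 0.5, 0.75):
    p, r = precision_recall_at(thr)
    print(f"threshold={thr}: precision={p:.2f}, recall={r:.2f}")
```

Sweeping the threshold over all observed scores and plotting the (recall, precision) pairs yields the precision-recall curve discussed in Part II.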
