In part I, we discussed popular metrics such as accuracy, the confusion matrix, precision, recall, and the F1 score. The common characteristic of those metrics is that they rely on a given threshold for producing the ultimate prediction. In most cases, a classification model originally produces a probability score. In order to arrive at a prediction, one needs to come up with a threshold: a case is predicted as positive when the probability score is greater than the threshold, and negative otherwise. Therefore, if we increase the threshold, we will typically have higher precision and lower recall. How are precision and recall affected as we adjust the threshold? The precision-recall curve can provide a fuller picture.
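As a toy illustration (made-up scores and labels, not the blog's dataset), here is how a threshold turns probability scores into predictions, and how precision and recall respond when we move it:

```python
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]   # model's positive-class probabilities
labels = [1,   1,   0,   1,   0,   0]      # ground truth

def precision_recall(scores, labels, threshold):
    # Predict positive when the probability score exceeds the threshold
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A high threshold keeps precision high but misses positives (low recall);
# lowering it recovers recall at the cost of precision.
print(precision_recall(scores, labels, 0.85))  # (1.0, 0.333...)
print(precision_recall(scores, labels, 0.25))  # (0.6, 1.0)
```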
The precision-recall curve plots recall on the x-axis and precision on the y-axis. We plot all the possible pairs of recall and precision by varying the threshold from the lowest possible value to the highest, then connect the points to form a curve. The idea is to get an overview of how precision and recall behave as the threshold varies across a wide range.
In this case, we need to compute a precision-recall pair for each threshold we can choose. It looks like we need to write a loop in Python. Actually, we have a better choice here: because all the data is stored in the database, we can leverage SQL window functions to compute the pairs efficiently.
SQL window functions allow the user to compute quantities from multiple rows that are related to the current row. In this particular use case, they let us go through each row and compute the precision and recall using all rows whose prediction score is greater than or equal to the score of the current row. The query can be written concisely and runs fast.
We first generate the following view based on our positive probability and counts.
CREATE OR REPLACE VIEW PR_SUMMARY_V AS
WITH
pos_prob_and_counts AS (
  SELECT PREDICTION_PROBABILITY(GLM_METRIC_MDL, 1 USING *) pos_prob,
         BUY_INSURANCE
  FROM CUST_TEST_TBL
)
SELECT BUY_INSURANCE,
       pos_prob,
       SUM(BUY_INSURANCE) OVER (ORDER BY pos_prob DESC) pos_acc,
       SUM(BUY_INSURANCE) OVER (ORDER BY pos_prob DESC) / COUNT(1) OVER (ORDER BY pos_prob DESC) precision,
       SUM(BUY_INSURANCE) OVER (ORDER BY pos_prob DESC) / SUM(BUY_INSURANCE) OVER () recall
FROM pos_prob_and_counts
ORDER BY pos_prob DESC
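For intuition, the window-function logic in this view can be mirrored in pandas on a toy table (illustrative data only; in the blog's workflow the computation stays in the database):

```python
import pandas as pd

# Toy stand-in for pos_prob_and_counts: one row per test case
df = pd.DataFrame({
    'POS_PROB': [0.9, 0.8, 0.7, 0.4, 0.3, 0.2],
    'BUY_INSURANCE': [1, 1, 0, 1, 0, 0],
})

# Sort by score descending, like ORDER BY pos_prob DESC
df = df.sort_values('POS_PROB', ascending=False).reset_index(drop=True)

# SUM(...) OVER (ORDER BY pos_prob DESC) is a running sum after sorting;
# COUNT(1) OVER (...) is the running row count (index + 1 here).
df['POS_ACC'] = df['BUY_INSURANCE'].cumsum()
df['PRECISION'] = df['POS_ACC'] / (df.index + 1)
df['RECALL'] = df['POS_ACC'] / df['BUY_INSURANCE'].sum()

print(df[['POS_PROB', 'PRECISION', 'RECALL']])
```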
We can then generate the plot using native Python with data retrieved from this view using OML4Py:
import oml
import matplotlib.pyplot as plt

# Sync the view as an OML DataFrame proxy, then pull it into pandas
PR_DF = oml.sync(schema='JIE', view='PR_SUMMARY_V')
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=[10, 8])
pr_df = PR_DF.pull()
pr_df = pr_df.sort_values('POS_PROB', ascending=False)
x = pr_df['RECALL'].values
y = pr_df['PRECISION'].values
ax.set_title('Precision-Recall curve')
ax.plot(x[1:], y[1:], color='blue', lw=2, label='P-R curve')
ax.legend(loc="upper right")
ax.set_xlabel('Recall', size=13)
ax.set_ylabel('Precision', size=13)
plt.show()
In the precision-recall curve, we plot the (recall, precision) points as the threshold moves from the highest possible value to the lowest. The highest threshold corresponds to the leftmost point and the lowest to the rightmost. At the leftmost, we see a spike with precision equal to 1, which then falls quickly. This is because we start from the highest threshold, which usually achieves high precision, since most of the negative cases tend to have low probability scores. The leftmost point also has the lowest recall, since we only predict the cases with the highest probability scores as positive and leave the majority of the other positive cases out.
As we decrease the threshold, we tend to predict more cases as positive, which naturally leads to lower precision and higher recall, so the curve tends to be decreasing. In our case, the precision drops quickly and settles down after the point with recall = 0.1. This indicates that our model is not that good at precision, which matches what we observed from the confusion matrix. The bright side is that our model can maintain its precision over a wide range of thresholds.
In the precision-recall curve, we also plot a horizontal reference line at the ratio of positive examples in the dataset. This represents the precision that random guessing would achieve. We can see that the precision we obtained is well above this reference line, which indicates the predictive power of the model.
Besides precision and recall, lift is a widely used metric particularly designed for marketing campaigns. The idea: if we target customers at random, how many customers who want to buy insurance can we reach? With a random approach, the proportion of buyers among the targeted customers is just the ratio of positive cases in the original dataset. Then we ask: after using the model, how much benefit can we get compared to selecting customers randomly?
To that end, lift provides two types of plots. One is called a lift chart, which looks a bit like an ROC curve but is different. It plots the top percentage of customers targeted on the x-axis and the recall obtained by targeting them on the y-axis. For instance, the point at 10% shows how much recall we can achieve if we rank the customers by probability score and target the top 10%. The straight line represents the scenario in which we target customers randomly, so the lift curve should lie above it.
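The lift-chart computation can be sketched in a few lines of Python (illustrative scores and labels, not the blog's dataset): rank customers by score, then record the recall reached at each targeting depth.

```python
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

# Rank customers by score descending, then compute cumulative recall
# at each targeting depth (fraction of customers contacted).
ranked = [y for _, y in sorted(zip(scores, labels), reverse=True)]
total_pos = sum(ranked)
recall_at_depth = []
hit = 0
for k, y in enumerate(ranked, start=1):
    hit += y
    recall_at_depth.append((k / len(ranked), hit / total_pos))

# Targeting the top 30% reaches 2 of the 4 buyers (50% recall),
# versus 30% expected from random targeting.
print(recall_at_depth[2])
```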
The other plot is called waterfall analysis. The idea is to segment customers based on the probability score, generating a series of segments such as customers with probability score in [0, 10%], [10%, 20%], and so on. Then, for each segment, we compute the percentage of customers who want to buy insurance and plot it as a histogram. This provides a detailed view of how much benefit we can get when targeting each customer segment.
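The segmentation step can be sketched as follows (toy data and a hypothetical helper, not the blog's actual decile table): rank customers by score, split them into equal-sized bins, and compute the buyer rate per bin.

```python
import math

scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

def decile_pos_rates(scores, labels, n_bins=5):
    # Rank customers by score (descending) and split into equal-sized bins
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    bin_size = math.ceil(len(order) / n_bins)
    rates = []
    for b in range(n_bins):
        idx = order[b * bin_size:(b + 1) * bin_size]
        # Fraction of actual buyers within this segment
        rates.append(sum(labels[i] for i in idx) / len(idx))
    return rates

# The top segments should be richest in buyers if the model ranks well
print(decile_pos_rates(scores, labels))  # [1.0, 0.5, 0.5, 0.0, 0.0]
```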
The following Python code generates the lift chart and the waterfall analysis.
import matplotlib.pyplot as plt

# LIFT_DF (cumulative coverage and positive rate per rank) and decile_df
# (positive rate per probability-score decile) are assumed to have been
# prepared earlier.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=[20, 7])
lift_df = LIFT_DF.pull()
x = [i * 100 for i in lift_df['COVERED'].values]
y = [i * 100 for i in lift_df['POS_RATE'].values]
axes[0].set_title('Lift Chart')
axes[0].plot(x, y, color='blue', lw=2, label='Lift curve')
axes[0].plot([0, 100], [0, 100], lw=2, linestyle='--', color='grey', label='Random guess')
axes[0].legend(loc="lower right")
axes[0].set_xlabel('Top Customer Targeted %', size=13)
axes[0].set_ylabel('Recall %', size=13)
axes[1].set_title('Waterfall Analysis')
axes[1].bar(decile_df['DECILE'], decile_df['POS_RATE'] * 100, color='blue', alpha=0.6, label='Model')
axes[1].axhline(LIFT_DF['BUY_INSURANCE'].sum() * 100 / LIFT_DF.shape[0], color='grey', linestyle='--', label='Avg TARGET')
axes[1].legend(loc="upper right")
axes[1].set_xlabel('Decile', size=13)
axes[1].set_ylabel('Actual Customers Targeted %', size=13)
plt.show()
In this use case, the lift chart and waterfall chart are probably a better fit for helping stakeholders understand the impact of the model. From the charts, we can see the benefit of applying our model to target customers. When a company has limited resources, it can focus only on the customer segments whose percentage of actual customers lies above the dashed line. The dashed line marks the overall percentage of actual customers, i.e., the rate achievable by targeting customers randomly, so targeting the segments above it reaches more actual customers.
ROC and AUC are widely used metrics among data scientists. Although they may not be easy to explain to stakeholders, they remain valuable for checking model performance, for the following reasons:
ROC and AUC do not require a particular threshold. This is convenient because we do not need to come up with a threshold, as we do for precision and recall.
ROC and AUC measure how well the model ranks the data. This is useful in our use case because we do not actually care how accurate the probability scores are; what we need is to prioritize the customers to target.
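The ranking interpretation can be made concrete: AUC equals the probability that a randomly chosen positive case is scored above a randomly chosen negative case (ties counted as one half). A small pure-Python sketch on made-up scores:

```python
from itertools import product

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# AUC = P(score of a random positive > score of a random negative),
# counting ties as 1/2
pairs = list(product(pos, neg))
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)
print(auc)  # 8/9 for this toy data: 8 of the 9 pairs are ranked correctly
```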
For scalability, we implemented the AUC computation by calling a SQL query using oml.cursor. This computation is efficient and fast because it uses Oracle SQL window functions. In an experiment on an expanded dataset with 11 million rows, the open source approach, which requires pulling the data into memory, took 27 minutes, while the following in-database approach took only 52 seconds! Notice that we invoke SQL from within our Python function using the database cursor object, which gives us access to the database through the execute function.
def auc_score(table_name, prob, target):
    import oml
    cr = oml.cursor()
    query_template = """
    WITH
    pos_prob_and_counts AS (
      SELECT <PROB1> pos_prob,
             DECODE(<TARGET>, 1, 1, 0) pos_cnt
      FROM <TABLE>
    ),
    tpf_fpf AS (
      SELECT pos_cnt,
             SUM(pos_cnt) OVER (ORDER BY pos_prob DESC) / SUM(pos_cnt) OVER () tpf,
             SUM(1 - pos_cnt) OVER (ORDER BY pos_prob DESC) / SUM(1 - pos_cnt) OVER () fpf
      FROM pos_prob_and_counts
    ),
    trapezoid_areas AS (
      SELECT 0.5 * (fpf - LAG(fpf, 1, 0) OVER (ORDER BY fpf, tpf)) * (tpf + LAG(tpf, 1, 0) OVER (ORDER BY fpf, tpf)) area
      FROM tpf_fpf
      WHERE pos_cnt = 1
         OR (tpf = 1 AND fpf = 1)
    )
    SELECT SUM(area) auc
    FROM trapezoid_areas"""
    query = query_template.replace('<PROB1>', prob)
    query = query.replace('<TARGET>', target)
    query = query.replace('<TABLE>', table_name)
    _ = cr.execute(query)
    auc = cr.fetchall()
    cr.close()
    return auc[0][0]
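As a quick sanity check of the trapezoidal-rule idea behind the trapezoid_areas step, the same computation can be reproduced in plain Python on toy data (a sketch with made-up scores, not the in-database implementation); its result matches the pairwise-ranking definition of AUC.

```python
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]

# Build cumulative (fpf, tpf) points as in the tpf_fpf CTE: sort by score
# descending and accumulate true/false positive fractions
ranked = sorted(zip(scores, labels), reverse=True)
n_pos = sum(y for _, y in ranked)
n_neg = len(ranked) - n_pos
points = [(0.0, 0.0)]
tp = fp = 0
for _, y in ranked:
    tp += y
    fp += 1 - y
    points.append((fp / n_neg, tp / n_pos))

# Trapezoidal rule over consecutive ROC points, as in trapezoid_areas
auc = sum(0.5 * (x2 - x1) * (y2 + y1)
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(auc)  # 8/9 for this toy data
```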
To show the ROC curve, we can prepare a view by adapting the SQL query used in the function auc_score.
CREATE OR REPLACE VIEW ROC_DATA_V AS
WITH
pos_prob_and_counts AS (
  SELECT PREDICTION_PROBABILITY(GLM_METRIC_MDL, 1 USING *) pos_prob,
         DECODE(BUY_INSURANCE, 1, 1, 0) pos_cnt
  FROM CUST_TEST_TBL
)
SELECT pos_cnt,
       SUM(pos_cnt) OVER (ORDER BY pos_prob DESC) / SUM(pos_cnt) OVER () tpf,
       SUM(1 - pos_cnt) OVER (ORDER BY pos_prob DESC) / SUM(1 - pos_cnt) OVER () fpf
FROM pos_prob_and_counts
Then we can generate the plot using the following Python code.
import oml
import matplotlib.pyplot as plt

# Sync the view created above as an OML DataFrame proxy
ROC_DF = oml.sync(schema='JIE', view='ROC_DATA_V')

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=[10, 8])
x = ROC_DF['FPF'].pull()
y = ROC_DF['TPF'].pull()
ax.set_title('ROC curve')
ax.plot(x, y, color='blue', lw=2, label='ROC curve')
ax.plot([0, 1], [0, 1], lw=2, linestyle='--', color='grey', label='Random guess')
ax.legend(loc="lower right")
ax.set_xlabel('False Positive Fraction', size=13)
ax.set_ylabel('True Positive Fraction', size=13)
plt.show()
In this blog, we discussed popular metrics for classification and presented a way of computing those metrics using Oracle SQL and OML4Py. We discussed the pros and cons of using those metrics and focused on how to interpret each metric and explain the impact of the predictions generated by the model.