Being able to apply machine learning models to relevant business problems is the entire rationale for developing them in the first place. Yet, as we saw in Part 1, most of the content available online decidedly glosses over this most important phase of the machine learning engineering lifecycle. Model building and evaluation are just a small part of the overall process, yet they take the lion’s share of the publicity.
Bridging the gap between a machine learning model and a measurable business outcome is no simple task, and it often falls into a gray zone between different domains of expertise. You can’t rely on data scientists, machine learning engineers, and business executives to speak the same language, yet this is exactly what needs to happen for these projects to deliver value.
Previously, I introduced the Expected Value Framework as a means of creating this connection. The framework is really nothing more than cost-benefit analysis applied to the probabilistic predictions of a machine learning model.
Part 1 outlined the first use case for the Expected Value Framework: how will we use a model in a business context? Specifically, if our model makes predictions about the probability of a potential customer buying our product given a targeted marketing campaign, at what probability threshold do we “draw the line in the sand,” so to speak, and say “any potential customer with a predicted probability greater than X will be targeted?” This use case is centered on individual decisions about each potential customer, but what if we want to take an aggregated point of view and evaluate and compare different models’ efficacy overall?
In Part 2, let’s look at the second use case for the Expected Value Framework: comparing models. Before we continue, let’s discuss a common tool in machine learning classification modeling called the confusion matrix (which is kind of a misnomer – it’s not that hard to understand).
Observe Predictions vs. Actual Outcomes with the Confusion Matrix

| | Actual Purchase | Actual Nonpurchase |
|---|---|---|
| Predicted Purchase | True Positive (TP) | False Positive (FP) |
| Predicted Nonpurchase | False Negative (FN) | True Negative (TN) |
A confusion matrix is a means of comparing the predictions our model makes with the actual outcome associated with an observation – in this case, will a potential customer purchase or not purchase?
Remember, machine learning models designed to predict a classification (purchase/not purchase) for a given observation (potential customer) in our future data set are built using data we collected in the past. This is called supervised learning because we can build a model with past data and compare the predictions our model makes to the actual past outcomes – did a customer purchase from us or not? We should already know the answer to that if we did our due diligence during data collection!
There are four potential outcomes when comparing our predictions to our actuals: 1) our model correctly predicted a purchaser as a purchaser (True Positive), 2) our model correctly predicted a nonpurchaser as a nonpurchaser (True Negative), 3) our model predicted a purchaser when, in fact, they were a nonpurchaser (False Positive), and 4) our model predicted a nonpurchaser when, in fact, they were a purchaser (False Negative). This is summarized in the 2 x 2 matrix above.
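The four outcomes above can be tallied directly from paired predictions and actuals. Here is a minimal Python sketch using a small made-up set of labels (1 = purchaser, 0 = nonpurchaser), purely for illustration:

```python
# Tally confusion-matrix cells from predicted vs. actual labels
# (1 = purchaser, 0 = nonpurchaser); the labels here are made up.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
```

Each prospect lands in exactly one of the four cells, so the counts always sum to the total number of observations.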
Just as we applied the Expected Value Framework to individual potential customers in Part 1, by calculating the probability and value for each potential outcome, we can also apply it to each individual cell of our confusion matrix and calculate the Expected Value aggregating the four potential outcomes. This allows us to get a picture of the model’s aggregate Expected Value so we can say something like:
“If I apply this model to a new set of data on prospective customers and target my marketing efforts towards only those prospects predicted as purchasers, I can expect to make, on average, $XX profit per customer.”
That’s a nice tool in the machine learning engineer’s tool chest!
Applying the Expected Value Framework
Our equation will look something like this:
E[X] = P(TP,p) * V(TP,p) + P(FN,p) * V(FN,p) +
P(FP,n) * V(FP,n) + P(TN,n) * V(TN,n)
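In code, this equation is just a probability-weighted sum over the four cells. A sketch, where the joint probabilities and dollar values are hypothetical placeholders rather than figures from this article:

```python
# Expected Value as a probability-weighted sum over the four outcomes.
# Joint probabilities P(outcome, class) and dollar values below are
# hypothetical placeholders for illustration; they must sum to 1.
probs  = {"TP": 0.09, "FN": 0.05, "FP": 0.13, "TN": 0.73}
values = {"TP": 305.0, "FN": 0.0, "FP": -15.0, "TN": 0.0}

expected_value = sum(probs[cell] * values[cell] for cell in probs)
```

Swapping in a different model's probabilities while holding the values fixed is exactly how the framework lets us compare models head to head.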
Now, we know in the real world that the proportions of purchasers to nonpurchasers are not evenly balanced. Say we were to compare two models using a measure called Accuracy, which in a classification model setting has a specific definition:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

So, accuracy in this sense lumps together our model’s correct predictions (i.e. True Positives and True Negatives) in the numerator with all of the model’s predictions in the denominator.
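A quick illustration of why accuracy hides the distinction between error types: with the two error counts swapped between models, accuracy is identical, even though the dollar consequences may differ wildly. The counts below are illustrative:

```python
# Accuracy = (TP + TN) / (TP + TN + FP + FN); counts are illustrative.
tp, tn = 1000, 8000

acc_model_a = (tp + tn) / (tp + tn + 1500 + 500)   # FP=1500, FN=500
acc_model_b = (tp + tn) / (tp + tn + 500 + 1500)   # FP=500,  FN=1500

# Identical accuracy, yet model A wastes 1500 solicitations while
# model B misses 1500 real purchasers: very different costs.
```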
Yet, from the perspective of value in the Expected Value Framework, treating False Positives and False Negatives as interchangeable doesn’t always make sense, because the two errors are not typically equally costly. Say our model predicts a customer is going to leave us, so we appropriate funds in our marketing campaign to target them with some incentive to stay, when in fact they were never at risk of leaving. Say we have another customer our model predicts will remain a customer, so we do nothing, and they leave us for a competitor. Our model produced a False Positive and a False Negative, respectively.
Clearly, the costs of these two mistakes aren’t equal: a wasted incentive in the first case, a lost customer in the second…but using a measure like accuracy, we’d never know.
We can take our new Expected Value equation and factor out the population probabilities of each potential outcome to isolate the influence of what we call Class Priors, or the probability of a customer being a purchaser/nonpurchaser in general in our population of customers. To do so, we must revisit one of the rules of probability (I’m sure you all remember those and were paying astute attention in your statistics class, right?).
What is the probability of two events occurring, or the Joint Probability? We can express this as:
P(a,b) = P(b) * P(a|b)
This says the probability of both “a” and “b” occurring is equal to the probability of “b” times the probability of “a” given (that’s what the “|” symbol means in probability notation) “b” has already occurred. We can apply this rule to our Expected Value equation and substitute the probabilities of “a” and “b” for the probabilities of purchase and nonpurchase. For example, we can take one cell of our confusion matrix and decompose it using this rule:
P(TP,p) = P(TP|p) * P(p)
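Numerically, the decomposition just multiplies a conditional probability by a marginal one. Using the counts from the worked example later in this article (1000 true positives, 1500 actual purchasers, 11,000 prospects):

```python
# P(TP, p) = P(TP|p) * P(p), with counts from the worked example:
# 1000 true positives among 1500 actual purchasers, 11,000 prospects.
p_prior      = 1500 / 11000            # P(p), the class prior
p_tp_given_p = 1000 / 1500             # P(TP|p), the conditional
p_tp_joint   = p_tp_given_p * p_prior  # P(TP, p)
# Identical to computing the joint probability directly: 1000/11000
```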
By transforming one cell of our original Expected Value equation using this rule, we’ve essentially partitioned out the class prior. If we apply this rule to our entire equation (i.e. all four cells of our 2 x 2 confusion matrix), we can factor out the class priors altogether. This makes our final equation look something like this mess:
E[X] = P(p) * [P(TP|p) * V(TP,p) + P(FN|p) * V(FN,p)] +
P(n) * [P(FP|n) * V(FP,n) + P(TN|n) * V(TN,n)]
Essentially, what we’ve done is split our equation into the Expected Value of positive/purchasers and the Expected Value of negative/nonpurchasers each weighted by the probability of those existing in our population in the first place. This will adjust our overall Expected Value estimate to account for the potential of imbalanced purchasers/nonpurchasers existing in our population of potential customers – which in the real world is most certainly the case.
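The factored equation translates directly into a small helper function. A sketch, with dictionary keys chosen here purely for illustration:

```python
def expected_value(priors, conditionals, values):
    """Class-prior-factored Expected Value.

    priors:       {"p": P(p), "n": P(n)}
    conditionals: {"TP": P(TP|p), "FN": P(FN|p),
                   "FP": P(FP|n), "TN": P(TN|n)}
    values:       dollar value of each confusion-matrix cell
    """
    # Expected value among actual purchasers...
    pos = conditionals["TP"] * values["TP"] + conditionals["FN"] * values["FN"]
    # ...and among actual nonpurchasers...
    neg = conditionals["FP"] * values["FP"] + conditionals["TN"] * values["TN"]
    # ...each weighted by how common that class is in the population.
    return priors["p"] * pos + priors["n"] * neg
```

Because the class priors appear only as the two outer weights, you can re-run the same model against populations with different purchaser/nonpurchaser mixes without re-estimating anything else.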
OK, admittedly things are starting to get a little hairy with the math for the tastes of most readers. Let’s start to plug in some numbers and hopefully that will make things easier to understand!
Explain the Key Takeaway for Business Executives
Let’s revisit our confusion matrix and plug in some arbitrary values and their totals:

| | Actual Purchase | Actual Nonpurchase | TOTAL |
|---|---|---|---|
| Predicted Purchase | 1000 | 1500 | 2500 |
| Predicted Nonpurchase | 500 | 8000 | 8500 |
| TOTAL | 1500 | 9500 | 11,000 |
From the above, we can see that the class priors for purchase and nonpurchase can be calculated by simply dividing the total for each group by the total overall, or P(p) = 1500/11000 = .14 and P(n) = 9500/11000 = .86. We can also calculate the probabilities of each cell in our confusion matrix given the probability of purchase/nonpurchase in our population (for the interested, we call these conditional probabilities):

| | Actual Purchase | Actual Nonpurchase |
|---|---|---|
| Predicted Purchase | 1000/1500 = .67 | 1500/9500 = .16 |
| Predicted Nonpurchase | 500/1500 = .33 | 8000/9500 = .84 |
| TOTAL | 1500 | 9500 |
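All of these probabilities can be derived from the raw counts in a few lines:

```python
# Class priors and conditional probabilities from the confusion-matrix
# counts above (TP=1000, FP=1500, FN=500, TN=8000).
tp, fp, fn, tn = 1000, 1500, 500, 8000
total = tp + fp + fn + tn        # 11,000 prospects

p_pos = (tp + fn) / total        # P(p)    = 1500/11000, about .14
p_neg = (fp + tn) / total        # P(n)    = 9500/11000, about .86
p_tp  = tp / (tp + fn)           # P(TP|p) = 1000/1500,  about .67
p_fn  = fn / (tp + fn)           # P(FN|p) =  500/1500,  about .33
p_fp  = fp / (fp + tn)           # P(FP|n) = 1500/9500,  about .16
p_tn  = tn / (fp + tn)           # P(TN|n) = 8000/9500,  about .84
```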
Now we have our class priors, P(p) and P(n), and our conditional probabilities, P(TP|p), P(FN|p), etc. All we need now are the values associated with each cell in our confusion matrix and our Expected Value equation will have all the data it needs! If we reuse the values we used in Part 1, we end up with something like this:

| | Actual Purchase | Actual Nonpurchase |
|---|---|---|
| Predicted Purchase | $305 | -$15 |
| Predicted Nonpurchase | $0 | $0 |
Recall that we sell our product for $550 with associated costs per product of $245, giving us a profit of $305 for each sale to a solicited prospective customer predicted to purchase by our model. Conversely, if we solicit a prospect that ends up not purchasing (i.e. a False Positive), we are only out the cost of the marketing efforts, which we estimated to be $15. If our model predicted a prospect to be a nonpurchaser, we don’t lose anything regardless of whether or not they would have purchased (remember, in our example, for ease of understanding, we assume a prospect can only purchase our product if they were solicited to).
Let’s plug all of this into our Expected Value equation:
E[X] = .14 * [.67 * 305 + .33 * 0] +
.86 * [.16 * (-15) + .84 * 0]
Solving for Expected Value, we get $26.55. Let’s circle back to the original statement we wanted to make about our model:
“If I apply this model to a new set of data on prospective customers and target my marketing efforts towards only those prospects predicted as purchasers, I can expect to make, on average, $26.55 profit per customer.”
That sounds like something an executive could wrap her head around, doesn’t it?
We’ve effectively translated the outcome of a machine learning model into the language of dollars. We’ve also created a way of comparing different models to see which one yields the greatest profit.
However, we don’t always know the values (costs/benefits) associated with a particular outcome like we do in this example. What do we do then? Find out in Part 3!
To learn more about machine learning and data science, visit Oracle’s Data Science page.