You’ve probably been to a supermarket that printed coupons for you at checkout. Or listened to a playlist that your streaming service generated for you. Or gone shopping online and seen a list of products labeled “you might be interested in….” that did indeed contain some stuff that you were interested in.
Recommendation engines take data about you, similar consumers, and available products, and use that to figure out what you might be interested in and therefore deliver those coupons, playlists, and suggestions.
Recommendation engines can be extremely complex. For example, Netflix ran a $1M competition from 2006 to 2009 to improve their movie recommendation engine performance. Over 5,000 teams participated. The winning team combined results from 107 different algorithms or techniques to deliver the 10 percent improvement and claim the prize.
So, there are many different ways to build a recommendation engine and most will combine multiple techniques or approaches. In this article, I want to cover just one approach, association rules, which are fairly easy to understand and require minimal skills in mathematics. If you can work with simple percentages, there’s nothing more complex than that below.
Conceptually association rules is a very simple technique. The end result is one or more statements of the form “if this happened, then the following is likely to happen." In a rule, the "if" portion is called the antecedent, and the "then" portion is called the consequent. Remember those two terms because they are going to come up in the descriptions below. Let’s start with food shopping because association rules are very often used to analyze the contents of your shopping cart.
As you make your shopping list, you probably buy a mix of pantry staples as well as ingredients for a specific meal or dish that you plan to prepare. Imagine you plan to make tomato sauce for pizza or a pasta dish. You’re probably going to buy tomatoes, onions, garlic, maybe olive oil or fresh basil. You’re far from the only person making tomato sauce and many others will have similar sets of ingredients.
If we looked at all the various shopping baskets that people purchased, we could start to see some things in common. “If somebody buys canned tomatoes, then they are more likely to buy dried pasta (or onions or garlic or pizza dough or …)”. Armed with this knowledge, a supermarket could print you a coupon at checkout for something you didn’t purchase, hoping that you would come back. Or a manufacturer might offer you a coupon for their pre-made tomato sauce for those nights when you don’t want to make it from scratch.
Although tomatoes might imply garlic and/or basil, the reverse may not be true. For example, somebody buying garlic and basil could be looking to make pesto, in which case they’d be more likely to buy pine nuts than tomatoes. But with the right analysis, it would be possible to find the rules governing which products were more likely to be associated with each other. Hence the name “association rules”.
Let’s illustrate this process with some real numbers. And to do so we’ll move from buying groceries to watching movies.
The starting point for this algorithm is a collection of transactions. They could be traditional purchase transactions, but could also include events like “put a product in an online shopping cart,” “clicked on a web ad” or, in this case, “watched a movie.”
I’ll use this very abbreviated data set of movie watching habits of five people. I’ve anonymized them to hide their identities (not that this approach always works). Here you see each person and the list of movies they have watched, here represented by numbers from 1-5.
User |
Movies Watched |
A |
1, 2, 4 |
B |
1, 3 |
C |
1, 4 |
D |
2, 3, 4 |
E |
3, 4 |
As you work your way down that table the first thing to stand out is that the first and third users both watched movies 1 and 4. From this data, the rule there would be “if somebody watches movie number 1, then they are likely to watch movie number 4.” You’ll need to understand the two terms I snuck in above: movie 1 is the antecedent and movie 4 is the consequent. Let’s look at this rule in more detail.
How useful is this rule? There are 2 users out of 5 who demonstrate watched both movies 1 and 4. So we can say that this rule has support of 40% (2 out of 5 users). How confident are we that it’s a reliable indicator? Three users watched movie number 1, but only 2 of them also watched number 3. The confidence in this rule is 67 percent.
Note that if you reverse the order rule (or swap the antecedent and consequent if you prefer) we can also say that “if somebody watched movie number 4 then they are likely to watch movie number 1.” However, while the support is also 40 percent, the confidence changes and is now only 50 percent (check the table above to see how that came about). This is the same process as in the example with tomato sauce and pesto above.
What do these metrics mean? With just 5 users and 5 movies it might be hard to see, but imagine this is a subset of many millions of users and thousands of movies. If the support is very low, it basically means that this rule will not apply to many customers. For example it might mean that people who watch some obscure 70s documentary will also watch an equally obscure 80s film. In the movie recommendation space, this would translate to a niche rule that might not get used very often, but could be quite valuable to that very small subset of customers. However, if you were using rules to find the optimal placement of products on the shelves in a supermarket, lots of low support rules would lead to a very fragmented set of displays. In this kind of application, you might set a threshold for support and discard rules that didn’t meet that minimum.
Confidence is a little easier to understand. If there’s a rule linking two movies but with very low confidence, then it simply means that most of the time they watch the first movie, they won’t actually watch the second one. For the purpose of making recommendations or predictions, you’d much rather have a rule that you were confident about. You could also use a minimum threshold for confidence and ignore or discard rules below a certain threshold.
Take another look at the first rule from above: if somebody watches movie 1 they will also watch movie 4. The confidence here is 67 percent which is pretty good. But take a look at the rest of the table. Four out of 5 users watched movie number 4 anyway. If we know nothing else about their other movie preferences, we know that there’s an 80 percent chance of them watching movie 4. So despite that confidence of 67 percent that first rule we found is actually not useful: somebody who has watched movie 1 is actually less likely to watch movie 4 than somebody picked at random. Fortunately, there’s a way to take this into account. It’s called “lift”.
Lift is used to measure the performance of the rule when compared against the entire data set. In the example above, we would want to compare the probability of “watching movie 1 and movie 4” with the probability of “watching movie 4” occurring in the dataset as a whole. As you might expect, there’s a formula for lift:
Lift is equal to the probability of the consequent given the antecedent (that’s just the confidence for that rule) divided by probability of that consequent occurring in the entire data set (which is the support for the consequent), or more concisely:
Lift = confidence / support(consequent)
In this example, the probability of movie 4, given that movie 1 was watched, is just the confidence of that first rule: 67 percent or 0.67. The probability of some random person in the entire dataset (of just 5 users in this simple example) watching movie 4 is 80 percent or 0.8. Dividing 0.67 by 0.8 gives a lift of approximately 0.84.
In general, if you have a lift of less than 1, it shows a rule that is less predictive than just picking a user at random which is the case with this rule as I explained in the first paragraph of this section. If you have a lift of around 1, then it’s indicating two independent events, e.g., watching one movie does not influence the likelihood of watching another. Values of lift that are greater than 1 show that the antecedent does influence finding the consequent. In other words, here is a rule that is useful.
I’ll finish with a few more tips and extensions to the simple example above.
First, how do you test the accuracy of your rules? You can just use the same approach that I previously outlined for classification: build your rules with a subset of the available transactions, and then test the performance of those rules against the remainder. And of course, you should monitor their performance if they are used to make recommendations to actual users.
We worked with a simple rule above: “if user watched movie 1 then they are likely to watch movie 4.” This is referred to as a rule of length 2, because it incorporates two elements. Of course, we could have more complex rules: “if a user watched movies 1, 2 and 3, they then are likely to watch movie 4” is a rule of length 4. Or if you want to go back to grocery shopping for a similar rule, somebody who buys tomatoes, garlic, and pasta is likely to want some parmesan cheese to go with their spaghetti dinner.
Some streaming sites ask users to rate the things they watch on a scale of 1-5 or 1-10. If we had that information we couldn’t use the numeric value directly; we’d want to “bin” the answers. For example, we might say that a score of 7-10 was considered “high” and so on. A rule then might then incorporate “watched movie A and rated it high”.
The concept of binning applies to the last example, which takes us well away from movies and shopping to machines because these rules potentially have wider uses.
Imagine you’re responsible for maintaining some machine that breaks from time to time. You have lots of sensor data and other information about its operation, and you’ve captured several failures. You could in principle treat failure as a consequent and search for the antecedents. You’d have to bin the sensor data in some way (flow rate of 7.3 to 11.4 goes into this bin etc.). In principle you could use association rules to find the conditions that are associated with failure and take corrective action, also referred to as root cause analysis. Replace mechanical failure with diagnosis and you could even use some form of association rules with medical data.
Visit some of our previous articles for high level overviews of machine learning, a look at decision trees, or k-means clustering. If you’d like to find out more about Oracle’s ML technologies you can look at our support for R as well as Advanced Analytics inside Oracle Database. If you're ready to get started with machine learning, try Oracle Cloud for free and build your own data lake to test out some of these techniques.