Previously, we discussed what machine learning is and how it can be used. But within machine learning, there are several techniques you can use to analyze your data. Today I’m going to walk you through some common ones so you have a good foundation for understanding what’s going on in that much-hyped machine learning world.
If you are a data scientist, remember that this series is for the non-expert.
But first, let’s talk about terminology. I’ll use three different terms which I’ve seen used interchangeably (and sometimes not accurately): techniques, algorithms and models. Let me explain each one.
A technique is a way of solving a problem. For example, classification (which we’ll see later on) is a technique for grouping things that are similar. To actually do classification on some data, a data scientist would have to employ a specific algorithm like decision trees (though there are many other classification algorithms to choose from).
Finally, having applied an algorithm to some data, the end result would be a trained model which you can use on new data or situations with some expectation of accuracy. It should all be clearer after these examples, so read on.
If you’re looking for a great conversation starter at the next party you go to, you could always start with “You know, machine learning is not so new; why, the concept of regression was first described by Francis Galton, Charles Darwin’s half-cousin, all the way back in 1875.” Of course, it will probably be the last party you get an invite to for a while.
But the concept is simple enough. Francis Galton was looking at the sizes of sweet peas over many generations. We know that if you selectively breed peas for size, you can get larger ones.
But if you let nature take its course, you see a variety of sizes. Eventually, even bigger peas will produce smaller offspring and “regress to the mean”. Basically, there’s a typical size for a pea and although things vary, they don’t “stay varied” (as long as you don’t selectively breed).
The same principle applies to monkeys picking stocks. On more than one occasion there have been stock-picking competitions (WSJ has done them, for example) where a monkey will beat the pros. Great headline. But what happens next year or the year after that? Chances are that monkey, which is just an entertaining way of illustrating “random,” will not do so well. Put another way, its performance will eventually regress to the mean.
What this means is that in this simple situation, you can predict what the next result will be (with some kind of error). The next generation of pea will be the average size, with some variability or uncertainty (accommodating smaller and larger peas). Of course, in the real world things are a little more complicated than that.
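If you'd like to see regression to the mean happen, here is a small simulation of the stock-picking monkeys. It is only a sketch under my own assumptions: every monkey's yearly score is a purely random number with an average of 0, and the numbers of monkeys and trials are arbitrary.

```python
import random
import statistics

def simulate_winners(trials=2000, monkeys=100, seed=0):
    """Give every monkey a purely random score in two consecutive years.

    Returns the average year-1 score of each year-1 winner, and the
    average score those same winners achieved in year 2.
    """
    rng = random.Random(seed)
    year1_scores, year2_scores = [], []
    for _ in range(trials):
        year1 = [rng.gauss(0, 1) for _ in range(monkeys)]
        year2 = [rng.gauss(0, 1) for _ in range(monkeys)]
        # Find the monkey with the best year-1 score.
        winner = max(range(monkeys), key=lambda i: year1[i])
        year1_scores.append(year1[winner])
        year2_scores.append(year2[winner])
    return statistics.mean(year1_scores), statistics.mean(year2_scores)

winner_year1, winner_year2 = simulate_winners()
print(f"winner's average score in year 1: {winner_year1:.2f}")
print(f"same monkey's average score in year 2: {winner_year2:.2f}")
```

The year-1 winners look brilliant, but on average their year-2 scores land close to 0, which is the mean for every monkey in this made-up setup.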
In the image above, we don’t have a single mean value like pea size. We’ve got a straight line with a slope and two values to work with, not just one. Instead of variability around a single value, here we’ve got variability in a two-dimensional plane based on an underlying line.
You can see all the various data points in blue, and that red line is the line that best fits all that data. Based on that red line, you could make a prediction about what would happen if, say, the next data point were at 70 on the X axis. (That prediction would not be a single definitive value, but rather a projected value with some degree of uncertainty, just like for the pea sizes we looked at earlier.)
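For the curious, here is roughly how a best-fit line and a prediction at 70 can be calculated. Treat it as a minimal sketch: the data points are invented for illustration (they are not the ones in the chart), and I'm assuming the usual "least squares" definition of best fit.

```python
import math

def fit_line(xs, ys):
    """Least-squares straight line: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented example data (the "blue points").
xs = [10, 20, 30, 40, 50, 60]
ys = [25, 41, 62, 79, 102, 118]

slope, intercept = fit_line(xs, ys)   # the "red line"
predicted = slope * 70 + intercept    # prediction for X = 70

# Typical distance of the points from the line: a rough measure of
# how uncertain the prediction is.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
spread = math.sqrt(sum(r ** 2 for r in residuals) / (len(xs) - 2))

print(f"line: y = {slope:.2f} * x + {intercept:.2f}")
print(f"prediction at x = 70: {predicted:.1f}, give or take about {spread:.1f}")
```

The "give or take" figure is the important part: the line gives you a projected value, and the scatter of the points around the line tells you how much to trust it.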
Regression algorithms are used to make predictions about numbers. For example, with more data, we can:
The straight line in the graph is an example of linear regression, but looking at those three examples above, I’d be surprised if any of them fit well to a straight line. And in fact, the underlying line behind your data doesn’t have to be straight. It could be an exponential, a sine wave or some arbitrary curve. And there are algorithms and techniques to find the best fit to the underlying data no matter what shape the underlying line is.
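As one example of a non-straight fit, here is a sketch that fits an exponential curve. It uses a common trick, which is my choice rather than the only option: take the logarithm of the Y values, fit a straight line to those, and convert back. The data is generated from a known curve so you can check that the fit recovers it.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by fitting a straight line to (x, log y).

    Only works when every y is positive.
    """
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    b = (sum((x - mean_x) * (lg - mean_log) for x, lg in zip(xs, logs))
         / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_log - b * mean_x)
    return a, b

# Invented data that follows y = 2 * exp(0.5 * x) exactly.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * math.exp(0.5 * x) for x in xs]

a, b = fit_exponential(xs, ys)
print(f"fitted curve: y = {a:.2f} * exp({b:.2f} * x)")
```

With noisy real-world data, fitting in "log space" like this weights the errors differently from fitting the curve directly, so it is a convenient approximation rather than the last word.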
Furthermore, I’ve given you a two-dimensional diagram there. If you were trying to predict house prices, for example, you’d include many more factors than just two: size, number of rooms, school scores, recent sales, size of garden, age of house and more.
Finally, perhaps my favorite example of regression is this approach to measuring the quality of Bordeaux wine.
Let’s move on to classification. And now I want you to pretend you're back in preschool and I'll play the role of teacher trying hard to teach a room of children about fruit (presumably fruit-hating children if they've got to this age without knowing what a banana is).
While you kids don't know about fruit, the good news for you is that I do. You don’t have to guess (at least initially). I’m going to show you lots of pieces of fruit and tell you what each one is. And so, like children in a preschool, you will learn how to classify fruit. You’ll look at things like size, color, taste, firmness, smell, shape and whatever else strikes your fancy as you attempt to figure out what it is that makes an apple, an apple, as opposed to a banana.
Once I've gone through 70 percent to 80 percent of the basket, we can move on to the next stage. I’ll show you a fruit (that I have already identified) and ask you “What is it?” Based on the learning you’ve done, you should be able to classify that new fruit correctly.
In fact, by testing you on fruit that I’ve already classified correctly, I can see how well you’ve learned. If you do a good job, then you’re ready for some real work, which in a non-kindergarten situation would mean deploying that trained model into production. If, of course, the results of the test weren’t good enough, that would mean the model wasn’t ready. Perhaps we need to start again with more data, better data, or a different algorithm.
We call this approach “supervised learning” because you learn from examples where the right answer is already known: we have a whole basket that has been correctly classified, so we can both teach you with it and test whether you get the right answers.
That idea of using part of the basket for training and the rest for testing is also important. That's how techniques like this make sure that the training worked or, alternatively, that the training didn't work and a new approach is needed.
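Here is the preschool lesson as a small sketch. Everything in it is an assumption for illustration: the fruit measurements are made up (and deliberately easy to tell apart), each fruit is described by just length and weight, and the algorithm is "nearest neighbor", which labels a new fruit the same as the most similar fruit it saw during training.

```python
import math
import random

# Hypothetical typical (length in cm, weight in g) for each fruit.
TYPICAL = {
    "apple": (7, 150),
    "banana": (18, 100),
    "strawberry": (3, 15),
    "orange": (8, 220),
}

def make_basket(per_fruit=20, seed=1):
    """Build a labelled basket: every fruit varies a little around its typical size."""
    rng = random.Random(seed)
    basket = []
    for label, (length, weight) in TYPICAL.items():
        for _ in range(per_fruit):
            features = (length + rng.uniform(-1, 1), weight + rng.uniform(-10, 10))
            basket.append((features, label))
    return basket

def split_basket(basket, train_fraction=0.75):
    """Keep the first 75% of each kind of fruit for training, the rest for testing."""
    train_set, test_set = [], []
    for label in TYPICAL:
        items = [item for item in basket if item[1] == label]
        cut = int(len(items) * train_fraction)
        train_set.extend(items[:cut])
        test_set.extend(items[cut:])
    return train_set, test_set

def classify(features, train_set):
    """Nearest neighbor: copy the label of the most similar training fruit."""
    nearest = min(train_set, key=lambda item: math.dist(features, item[0]))
    return nearest[1]

train_set, test_set = split_basket(make_basket())
correct = sum(classify(features, train_set) == label for features, label in test_set)
accuracy = correct / len(test_set)
print(f"trained on {len(train_set)} fruit, tested on {len(test_set)}, accuracy {accuracy:.0%}")
```

The test fruit never take part in training, so the accuracy figure tells you how the model copes with fruit it has not seen before. Real data is rarely this tidy, which is why real accuracy scores are rarely perfect.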
Note that the basket of fruit we worked with had only four kinds of fruit: apples, bananas, strawberries (you can't see them in the picture, but I assure you they are there) and oranges. So, if you were presented with a cherry to classify, it would be somewhat unpredictable. It would depend on what the algorithm found to be important in differentiating the others. The point here, of course, is that if you want to recognize cherries then the model should be trained on them.
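To make the cherry problem concrete, here is a tiny sketch. The measurements are invented and the model is a deliberately simple one (pick whichever known fruit is closest in length and weight), so read it as an illustration rather than how any particular product works.

```python
import math

# Hypothetical (length in cm, weight in g) for the four fruit the model knows.
known_fruit = {
    "apple": (7, 150),
    "banana": (18, 100),
    "strawberry": (3, 15),
    "orange": (8, 220),
}

def classify(features):
    """Return the known fruit whose measurements are closest to the new one."""
    return min(known_fruit, key=lambda label: math.dist(features, known_fruit[label]))

cherry = (2, 8)  # invented measurements for a cherry
guess = classify(cherry)
print(guess)
```

The model cannot answer "cherry" because it has never heard of one. It confidently picks the nearest thing it does know, and a different algorithm or a different set of measurements could just as confidently pick something else.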
Here's an example of a chart showing a data set that has been grouped into two different classes. We've got a few outliers in this diagram, a few colored points that are on the wrong side of the line. I show this to emphasize the point that these algorithms aren't magic and may not get everything right. It could also be the case that with different approaches or algorithms, we could do a better job classifying these data points and identifying them correctly.
Summarizing the previous entry, classification enables you to find membership in a known class. Examples of known classes? Let’s go back to customer segmentation. I know who my high-value customers are today. What did they look like some time ago? By using them as a training class, I could train a model to spot a valuable customer earlier.
Another example is customer churn. We know who’s left us. Let’s train a model on that class and then see if we do a better job of spotting other churners before they churn. This kind of approach is what triggers those unexpected offers from companies who think you are about to leave them.
Insurance companies pay out on claims and they've got a historical set of claims that they have already classified into "good claims" and ones that need "further investigation". Train a classification algorithm on all those old claims, and perhaps you can do a better job of spotting dubious claims when they come in.
One additional point.
In all these cases, it’s important to have lots of data available to train on. The more data you have, the better the training (more accurate, wider range of situations etc.). One of the reasons (of course there are others) for building a data lake is to have easy access to more data for machine learning algorithms.
Alert readers should have noticed that this is the same bowl of fruit used in the classification example. Yes, this was done on purpose. Same fruit, but a different approach.
This time we’re going to do clustering, which is an example of unsupervised learning. You're back in preschool and the same teacher is standing in front of you with the same basket of fruit.
But this time, as I hand the stuff out, I'm not going to tell you "This is a banana." Instead I'm effectively going to say, “Do these things have any kind of natural grouping?” (Which is a complex concept for a preschooler, but work with me for a moment.)
You’ll look at them and their various characteristics, and you might end up with several piles of fruit that look like “squidgy red things”, “curved yellow things”, “small green things” and “larger red or green things”.
To clarify, what you did (in your role as preschoolers/machine learning algorithm) is group the fruits in that way. What the teacher (or the human supervising the machine learning process) did was to come up with meaningful names for those different piles. This is likely the same process used to do the customer segmentation mentioned in the previous blog. Having found logical groupings of customers, somebody came up with a shorthand way to name or describe each grouping.
Here’s a real-world cluster diagram. With these data points you can see five separate clusters. Those little arrows represent part of the process of calculating the clusters and their boundaries. Basically, you pick arbitrary centers, work out which points are closest to which center, and then move each center to the actual middle of the points assigned to it. You repeat those last two steps until you’ve got close enough (the movements of the centers are sufficiently small).
This approach is very common for customer segmentation. You could also use it to evaluate credit risk, or even things like the similarity between written documents. Basically, if you look at a mass of data and don’t know how to logically group it, then clustering is a good place to start.
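The center-moving process described above can be sketched in a few lines. It reads like the widely used k-means procedure, though treating it as k-means is my assumption. The nine points and three starting centers are invented and far simpler than the diagram's five clusters.

```python
import math

def k_means(points, centers, tolerance=1e-6, max_rounds=100):
    """Repeat 'assign points to nearest center, then move centers' until settled."""
    centers = list(centers)
    for _ in range(max_rounds):
        # Step 1: put every point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 2: move each center to the average position of its points.
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        # Stop once the centers have (almost) stopped moving.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tolerance:
            break
    return centers, clusters

# Invented data: three obvious groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]
starting_centers = [(0, 0), (5, 5), (10, 10)]  # arbitrary starting guesses

centers, clusters = k_means(points, starting_centers)
for center, members in zip(centers, clusters):
    print(f"center ({center[0]:.2f}, {center[1]:.2f}): {members}")
```

Notice that nobody told the code what the groups mean. It only finds them; naming the piles is still a job for a human.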
Sometimes you’re not trying to group like things together. Maybe you don’t much care about all the things that blend in with the flock. What you’re looking for is something unusual, something different, something that stands out in some way.
This approach is called anomaly detection. You can use this to find things that are different, even if you can’t say up front how they are different. It’s fairly easy to spot the outliers here, but in the real world, those outliers might be harder to find.
One health provider used anomaly detection to look at claims for medical services and found a dentist billing at the extraordinarily high rate of 85 fillings per hour. That's about 42 seconds per filling to get the numbing shot, drill the bad stuff out and put the filling in.
Clearly that's suspicious and needs further investigation. Just by looking at masses of data (and there were millions of records) it would not have been obvious that you were looking for something like that.
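Here is a sketch of how an outlier like that could surface. The billing figures are invented apart from the 85, the dentist names are placeholders, and the rule used (flag anything far from the median, measured in units of the typical deviation from the median) is just one simple approach among many.

```python
import statistics

def find_outliers(rates, threshold=5.0):
    """Flag entries that sit unusually far from the median value."""
    values = list(rates.values())
    median = statistics.median(values)
    # Typical distance from the median (the "median absolute deviation").
    typical_gap = statistics.median([abs(v - median) for v in values])
    return [name for name, v in rates.items() if abs(v - median) > threshold * typical_gap]

# Invented fillings-per-hour figures for ten hypothetical dentists.
fillings_per_hour = {
    "dentist_A": 2.5, "dentist_B": 3.0, "dentist_C": 2.0, "dentist_D": 3.5,
    "dentist_E": 4.0, "dentist_F": 2.8, "dentist_G": 3.2, "dentist_H": 85.0,
    "dentist_I": 3.1, "dentist_J": 2.6,
}

flagged = find_outliers(fillings_per_hour)
seconds_per_filling = 3600 / 85

print(flagged)
print(f"85 fillings per hour is one every {seconds_per_filling:.0f} seconds")
```

The code never needed to be told that 85 is too many. It only noticed that one value looks nothing like the rest, and left the judgment about whether that matters to a person.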
Of course, it might also throw up the fact that one doctor only ever billed on Thursdays. Anomalous, yes. Relevant, probably not. Anomaly detection can throw up the outliers for you to evaluate to see if they need further investigation.
Finding a dentist billing for too much work is a relatively simple anomaly. If you knew to look at billing rates (which will not always be the case), you could find this kind of issue using other techniques. But anomaly detection could also apply to more complex scenarios. Perhaps you are responsible for some mechanical equipment where things like pressure, flow rate and temperature are normally in sync with each other: one goes up, they all go up; one goes down, they all go down. Anomaly detection could identify the situation where two of those variables go up and the other one goes down. That would be really hard to spot with any other technique.
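The equipment example can be illustrated with a very small sketch. To be clear about the assumptions: the readings are invented, and the "in sync" rule is written by hand here, whereas a real anomaly detection algorithm would learn what normal looks like from the data.

```python
def out_of_sync_steps(readings):
    """Return the time steps where some readings rose while others fell."""
    flagged = []
    for step in range(1, len(readings)):
        changes = [now - before for before, now in zip(readings[step - 1], readings[step])]
        rising = any(change > 0 for change in changes)
        falling = any(change < 0 for change in changes)
        if rising and falling:
            flagged.append(step)
    return flagged

# Invented readings over time: (pressure, flow rate, temperature).
readings = [
    (100, 50, 70),
    (102, 52, 71),   # all up
    (105, 55, 73),   # all up
    (103, 53, 72),   # all down
    (106, 56, 70),   # pressure and flow up, temperature down
    (104, 54, 69),   # all down
]

suspicious = out_of_sync_steps(readings)
print(suspicious)
```

None of the individual values at the flagged step is unusual on its own. It is only the combination that stands out, which is what makes this kind of anomaly hard to catch by watching one gauge at a time.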
All right, I think that’s enough to think about and process for this week. But be sure to subscribe to the Oracle blog, because the fun hasn’t come to an end yet. Next, we’re going to post about three more machine learning techniques that people are especially excited about.