Previously, we discussed what machine learning is and how it can be used. But within machine learning, there are several techniques you can use to analyze your data. Today I’m going to walk you through some common ones so you have a good foundation for understanding what’s going on in that much-hyped machine learning world.
If you are a data scientist, remember that this series is for the non-expert.
But first, let’s talk about terminology. I’ll use three different terms that I’ve seen used interchangeably (and sometimes inaccurately): techniques, algorithms, and models. Let me explain each one.
A technique is a way of solving a problem. For example, classification (which we’ll see later on) is a technique for grouping things that are similar. To actually do classification on some data, a data scientist would have to employ a specific algorithm like decision trees (though there are many other classification algorithms to choose from).
Finally, having applied an algorithm to some data, the end result would be a trained model which you can use on new data or situations with some expectation of accuracy. It should all be clearer after these examples, so read on.
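To make the technique/algorithm/model distinction concrete, here is a minimal sketch in plain Python. It is a toy, not how a data scientist would really work: the "algorithm" is a one-level decision tree (a decision stump) that learns a single threshold, and the data (made-up pea sizes for two hypothetical groups) exists only for illustration.

```python
# Technique: classification. Algorithm: a one-level decision tree
# ("decision stump"). The trained model is just the learned threshold.

def train_stump(values, labels):
    """Learn a threshold on one numeric feature that best separates two classes."""
    best = None
    for t in sorted(set(values)):
        # Rule being scored: predict class 1 when value >= t, else class 0.
        correct = sum(1 for v, y in zip(values, labels) if (v >= t) == (y == 1))
        if best is None or correct > best[1]:
            best = (t, correct)
    return best[0]  # the threshold IS the "trained model"

def predict(model, value):
    """Apply the trained model to new data."""
    return 1 if value >= model else 0

# Made-up sizes: 0 = small variety, 1 = large variety
values = [1.2, 1.4, 1.3, 4.5, 4.7, 5.0]
labels = [0, 0, 0, 1, 1, 1]

model = train_stump(values, labels)
print(predict(model, 1.1), predict(model, 4.9))  # → 0 1
```

The point is the separation of concerns: the technique names the goal (sorting things into groups), the algorithm is the recipe for learning from data, and the model is the reusable artifact that comes out the other end.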
If you’re looking for a great conversation starter at the next party you go to, you could always open with “You know, machine learning is not so new; why, the concept of regression was first described by Francis Galton, Charles Darwin’s half-cousin, all the way back in 1875.” Of course, it will probably be the last party you get an invite to for a while.
But the concept is simple enough. Francis Galton was looking at the sizes of sweet peas over many generations. We know that if you selectively breed peas for size, you can get larger ones.
But if you let nature take its course, you see a variety of sizes. Eventually, even bigger peas will produce smaller offspring and “regress to the mean”. Basically, there’s a typical size for a pea and although things vary, they don’t “stay varied” (as long as you don’t selectively breed).
The same principle applies to monkeys picking stocks. On more than one occasion there have been stock-picking competitions (WSJ has done them, for example) where a monkey will beat the pros. Great headline. But what happens next year or the year after that? Chances are that monkey, which is just an entertaining way of illustrating “random,” will not do so well. Put another way, its performance will eventually regress to the mean.
What this means is that in this simple situation, you can predict what the next result will be (with some kind of error). The next generation of the pea will be the average size, with some variability or uncertainty (accommodating smaller and larger peas). Of course, in the real world, things are a little more complicated than that.
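That "predict the average, give or take some spread" idea fits in a few lines of Python. The pea sizes below are made-up numbers purely for illustration; the sketch just uses the sample mean as the prediction and the standard deviation as a rough measure of the uncertainty around it.

```python
import statistics

# Hypothetical pea diameters (mm) from one generation -- made-up data.
sizes = [7.1, 6.8, 7.4, 7.0, 6.9, 7.3, 7.2, 6.7, 7.0, 7.1]

# The simplest possible prediction: the next pea is expected to be
# about the mean, give or take roughly one standard deviation.
prediction = statistics.mean(sizes)
uncertainty = statistics.stdev(sizes)

print(f"predicted size: {prediction:.2f} mm +/- {uncertainty:.2f} mm")
# → predicted size: 7.05 mm +/- 0.22 mm
```

Real regression techniques go further, of course: instead of predicting a single fixed average, they predict a value that depends on other inputs. But the core move is the same one Galton saw in his peas.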
Guest author Peter Jeffcock is an Oracle Senior Principal Product Marketing Director focused on Big Data and Cloud Technology.