Following on from our previous article about the decision tree algorithm, today we're exploring clustering within the realm of machine learning and big data. Specifically, let’s look at the commonly used k-means algorithm. If you want a refresh on clustering (and other techniques), take a look at some of our other articles about machine learning.
Clustering as a general technique is something that humans do. But unlike decision trees, I don’t think anybody really uses k-means as a technique outside of the realm of data science. But let’s pretend for a second that you really wanted to do just that. What would it look like?
Imagine it’s the end of the long summer vacation and your child is heading back off to college. The day before leaving, said child delivers a request: “Hey, I forgot to do my laundry; can you wash it so it’s clean for college?” You head down to the bedroom and are greeted (that’s not really the right word here, but you get the idea) with a pile of dirty clothes in the middle of the floor that’s nearly as tall as you. What do you do (apart from burning it all)?
Of course, you know that some of these clothes are washed in hot water, some in cold, that some can be put in the dryer and some need air drying, and so on.
To make this a clustering scenario, we have to assume that your child, who is going to be doing all the work here, has no clue about how to group things for laundry. Not a stretch in this scenario. Let’s treat this as a learning opportunity and put them to work.
Tell them that you want them to group all the clothes into three piles using a very specific approach. Ask them to pick three different items of clothing at random that will be the starting points for those three separate piles (clusters in fact). Then, go through that massive initial pile and look at each item of clothing in turn. Ask them to compare the attributes of “water temperature” and “drying temperature” and “color” with each of the three starting items and then place the new item into the best pile.
The definition of “best pile” is based not on the whole pile, but purely on that starting item (technically that starting item is known as a centroid, but we’ll get to that later). The items won’t be identical, so pick the best match. If you want to be really detailed about this, ask them to place each item closer to or farther away from the starting item, based on how similar they are (the more similar, the closer together).
Now you’ve got three piles, which means your child is ready to use the washing machine. Not so fast, we’re just getting started. This is a very iterative process, and the next step would be to determine the new centroids of each of those piles and repeat the process. Those new centers would be calculated and may not correspond to an actual item of clothing. And we don’t really know if three piles is the optimal number, so once your child has finished iterating with three piles, they should go and try four, five, and more.
Clustering in general, and perhaps this algorithm in particular, is not a good technique for sorting laundry. Let’s look at a more realistic example using k-means clustering and start working with data, not dirty socks and laundry labels.
I’ll reuse the same data table that we had for the decision trees example. But this time, instead of trying to predict customer churn, we’re going to use clustering to see what different customer segments we can find.
| Customer ID | Age | Income | Gender | Churned |
|-------------|-----|--------|--------|---------|
| 1008764     | 34  | 47,200 | F      | Yes     |
I’m initially going to work with just two columns or attributes: age and income. This will allow us to focus on the method without the complexity of the data. Having two attributes enables us to work with a two-dimensional plane, and easily plot data points.
Is the data normalized? Let’s say that ages range up to 100 and incomes up to 200,000. We’ll scale the age range 0..100 to 0..1, and similarly salary 0..200,000 to 0..1. (Note that if your data has outliers like one person with an income of 800,000, there are techniques to deal with that; I just won’t cover them here).
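Here's a minimal sketch of that min-max scaling step, assuming the fixed ranges mentioned above (ages 0..100, incomes 0..200,000); the function names are my own, not from any particular library:

```python
# Min-max scaling: map a raw value into the 0..1 range, given the
# expected minimum and maximum for that attribute.
def min_max_scale(value, min_val, max_val):
    return (value - min_val) / (max_val - min_val)

# Scale one (age, income) row using the ranges from the text.
def scale_row(age, income):
    return (min_max_scale(age, 0, 100),
            min_max_scale(income, 0, 200_000))

# The customer from the table above: age 34, income 47,200.
print(scale_row(34, 47_200))  # -> (0.34, 0.236)
```

After scaling, a one-unit distance along the age axis means the same thing as a one-unit distance along the income axis, which matters once we start measuring distances between points.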
The first thing we do is pick two centroids, which means that we’re going to start with K=2 (now you know what the K in k-means represents). These represent the proposed centers of the two clusters we are going to uncover. This is quite a simple choice: pick two rows at random and use those values. If you’re building a mental representation of this process in your head, then you’ve got a piece of paper with two axes, one labeled “Age” and the other labeled “Income.” That chart has two color coded points marked on it, representing those two centroids.
Now we can get to work. We start with the first row and determine the Euclidean distance (which is a fancy way of saying we’ll use Pythagoras’s theorem) between the point in question and both centroids. It will be closer to one of them. If you’re visualizing this in your head, you can just plot the new point on the chart, and color code it to match the closer centroid (diagram below). Repeat this process for every row that you want to work with. You’ll end up with a chart with all points plotted and color coded, and two clusters will be apparent.
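That distance-and-compare step can be sketched in a few lines of Python. This is just an illustration with made-up centroid coordinates, assuming both attributes are already scaled to 0..1:

```python
import math

# Euclidean distance between two 2-D points (age, income) --
# this is just Pythagoras's theorem.
def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Assign a point to whichever centroid is closer; returns the
# index of the nearest centroid.
def nearest(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: distance(point, centroids[i]))

centroids = [(0.34, 0.24), (0.71, 0.55)]   # two rows picked at random
print(nearest((0.30, 0.20), centroids))    # -> 0 (closer to the first)
```

Running `nearest` over every row gives each point a cluster label, which is exactly the color coding described above.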
Once all the points have been allocated to a cluster we have to go back and calculate the centers of both clusters (a centroid doesn’t have to be in the center of the cluster, and in the early iterations could be some distance away from the eventual center). Then go back and repeat the calculations for each and every point (including those initial two rows that were your starting centroids). Some of those points will actually be nearer to the other cluster and will effectively swap clusters. This, of course, will change the centers of your clusters, so when you’ve completed all the rows for a second time, just recalculate the new centroids and start again.
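The whole assign-recompute-repeat loop fits in a short function. This is a bare-bones sketch of the algorithm as described, not production code (real implementations handle empty clusters, multiple restarts, and iteration caps more carefully):

```python
import random

def distance_sq(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

# One k-means run: assign every point to its nearest centroid, then
# recompute each centroid as the mean of its cluster, and repeat
# until the assignments stop changing.
def k_means(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)       # start from k random rows
    assignments = None
    while True:
        new_assignments = [min(range(k),
                               key=lambda i: distance_sq(p, centroids[i]))
                           for p in points]
        if new_assignments == assignments:  # converged: no point moved
            return centroids, assignments
        assignments = new_assignments
        for i in range(k):                  # recompute centers as means
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:                     # keep old centroid if cluster empties
                centroids[i] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))

pts = [(0.1, 0.10), (0.15, 0.12), (0.90, 0.90), (0.88, 0.85)]
centroids, labels = k_means(pts, k=2)
print(labels)  # the first two points share a label, as do the last two
```

Note that the recomputed centroids are averages, so, just like the laundry example, they usually don't correspond to any actual row in the data.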
In the diagram below you can see two clusters with the eight-point star showing the location of the new centroids which will be the basis for the next iteration. The arrows show the “movement” of the centroid from the previous iteration.
As with all iterative processes you need to figure out when to stop. Two obvious stopping points would be: the centroids stop moving between iterations (equivalently, no point changes cluster), or you reach some fixed maximum number of iterations.
Is K=2 or two clusters optimal for this data set? Good question. At this stage you don’t know. But you can start finding out by repeating the whole process above with K=3 (i.e. start with 3 random centroids). And then go to four, and five and so on.
Fortunately, this process doesn’t go on forever. There’s one more calculation to do: for each value of K, measure the average distance between all the data points in a cluster and the centroid for that cluster. As you add more clusters you will tend to get smaller clusters, more tightly grouped together. So that “average within-cluster distance to centroid” will decrease as K increases (as the number of clusters increases). Basically this metric gives you a concrete measure of how “good” a cluster you’ve got, with lower numbers meaning a more tightly grouped cluster with members that are more similar to each other.
If you increase K to equal the total number of data points you have (giving you K clusters, each with one member) then that distance will be zero but that’s not going to be useful. The best way to find the optimal value of K is to look at how that “average within-cluster distance to centroid” decreases as K increases and find the “elbow”. In this chart I’d say that the elbow is at K=3, though you might prefer to use K=4. There doesn’t look to be much point going with five or more clusters because the incremental improvement is minimal.
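The metric itself is straightforward to compute once a k-means run has produced centroids and assignments. A minimal sketch (the function name is my own; scikit-learn users would recognize a squared variant of this as "inertia"):

```python
import math

# Average within-cluster distance to centroid: for every point, take
# the distance to its own cluster's centroid, then average over all
# points. Compute this for K = 1, 2, 3, ... and look for the elbow.
def avg_within_cluster_distance(points, centroids, assignments):
    total = sum(math.hypot(p[0] - centroids[a][0], p[1] - centroids[a][1])
                for p, a in zip(points, assignments))
    return total / len(points)

# Degenerate case from the text: K equal to the number of points
# makes every point its own centroid, so the metric hits zero.
pts = [(0.1, 0.1), (0.9, 0.9)]
print(avg_within_cluster_distance(pts, pts, [0, 1]))  # -> 0.0
```

Plotting this value against K and eyeballing where the curve flattens out is the "elbow" judgment call described above; it's a heuristic, not a formula.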
In this example I used just two attributes to build the clusters. This has the advantage of being easy to visualize on a chart. What would those clusters represent? In this simple example with just ages and incomes, it might not be very illuminating, but maybe I’d find that I have a segment of younger customers with relatively higher incomes (and presumably higher disposable incomes) that I could target with promotions for suitable luxury goods. In general, you’d probably get more value from segments defined with more attributes.
With three attributes things are a little harder to visualize in 2D. Below there’s a cluster diagram for cars using fuel economy (in miles per gallon), peak power (horsepower) and engine displacement (in cubic inches) for a selection of cars from the early 1980s.
Real use cases for k-means clustering might employ hundreds or thousands of attributes, and while those are hard to visualize on a single chart, they are easily computed. They are also likely to be much more useful. In the customer segmentation example starting about halfway down this machine learning article, somebody obviously used many more attributes than just age and income.
If you’d like to find out more about Oracle’s machine learning technologies you can look at our support for R as well as Advanced Analytics inside Oracle Database. If you're ready to get started with machine learning, try Oracle Cloud for free and build your own data lake to test out some of these techniques.