Among clustering algorithms, K-means is the most popular. It is fast, scalable, and easy to interpret. As a result, it is often the default first choice when data scientists want to cluster data and gain insight into the inner structure of a dataset. A good introduction to this method can be found on the datascience.com blog.
However, K-means clustering has one key parameter that must be selected appropriately: K, the number of clusters for the algorithm to generate, which is left for the user to choose. For a given dataset without any prior knowledge, we cannot know for sure how many clusters naturally exist in the data, and choosing the wrong K often leads to an undesirable result.
There are many methods proposed to solve this problem, such as the elbow method, silhouette analysis, and the gap statistic. In practice, the elbow method is the most intuitive to understand and the easiest to implement. In this blog, we will show how to implement the elbow method for choosing the parameter K using Oracle R Enterprise, for both performance and scalability.
The elbow method works as follows. Assuming the best K lies within a range [1, n], search for it by running K-means for each K = 1, 2, ..., n. For each K-means result, calculate the mean distance between the data points and their cluster centroids; for short, we call this the mean in-cluster distance. Naturally, this number decreases as K increases, because the more centers there are, the smaller the mean in-cluster distance becomes. In the extreme case where the number of clusters equals the number of data points, the distances are all zero.
To select the best K, we plot the mean in-cluster distance for each K. As K increases from 1, but before it reaches the optimal value, the distance decreases quickly: the number of centers is still too low, so each new center produces a large drop in the mean distance. Beyond the optimal K, the decrease slows down, since the correct cluster structure has already been discovered and each newly added center falls inside a cluster that has already formed, reducing the mean in-cluster distance only slightly. The resulting curve looks like an L shape, and the best K lies at the turning point, or "elbow", of the L. The plot usually looks like the one below.
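As a quick illustration of the idea, independent of the in-database workflow shown later, here is a minimal sketch using open source kmeans on synthetic data; the built-in tot.withinss statistic plays the role of the in-cluster distance measure (the data and range of K here are our own toy choices):

```r
# Synthetic 2-D data with three natural clusters
set.seed(42)
x <- rbind(matrix(rnorm(200, mean = 0),  ncol = 2),
           matrix(rnorm(200, mean = 5),  ncol = 2),
           matrix(rnorm(200, mean = 10), ncol = 2))

# Total within-cluster sum of squares for K = 1..10
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 5)$tot.withinss)

# The curve bends sharply around K = 3, the true number of clusters
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```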
To illustrate this method, we use a real-world dataset from NYC taxi data, which is available online.
The following R code reads the NYC yellow taxi data from .csv and saves it as a table in Oracle Database by Oracle R Enterprise.
ore.connect(...)
nyc.data <- read.csv(file="yellow_tripdata_2009-01.csv", header=T)
colnames(nyc.data) <- toupper(colnames(nyc.data))
ore.create(nyc.data, table="NYC_YELLOW_TAXI_1G")
Let us take a look at the data frame:
> dim(NYC_YELLOW_TAXI_1G)
[1] 14092413 18
> head(NYC_YELLOW_TAXI_1G)
VENDOR_NAME TRIP_PICKUP_DATETIME TRIP_DROPOFF_DATETIME PASSENGER_COUNT
1 VTS 2009-01-26 23:04:00 2009-01-26 23:05:00 1
2 CMT 2009-01-23 10:11:33 2009-01-23 10:36:24 1
3 CMT 2009-01-23 11:40:37 2009-01-23 12:07:09 1
4 VTS 2009-01-26 16:32:00 2009-01-26 16:32:00 2
5 CMT 2009-01-23 06:20:49 2009-01-23 06:23:44 1
6 DDS 2009-01-06 07:50:21 2009-01-06 08:06:08 4
TRIP_DISTANCE START_LON START_LAT RATE_CODE STORE_AND_FORWARD END_LON END_LAT
1 0.71 -73.97717 40.74292 NA NA -73.98488 40.73889
2 3.70 -73.98622 40.76235 NA NA -74.00987 40.72123
3 7.50 -73.99476 40.68473 NA NA -73.98017 40.75152
4 0.99 -73.95994 40.77095 NA NA -73.94618 40.77272
5 0.60 -73.97818 40.75410 NA NA -73.97782 40.76318
6 2.10 -73.95949 40.77119 NA NA -73.97552 40.79190
PAYMENT_TYPE FARE_AMT SURCHARGE MTA_TAX TIP_AMT TOLLS_AMT TOTAL_AMT
1 CASH 4.1 0.5 NA 0 0 4.6
2 Cash 14.9 0.0 NA 0 0 14.9
3 Cash 21.7 0.0 NA 0 0 21.7
4 Credit 4.9 1.0 NA 1 0 6.9
5 Cash 4.1 0.0 NA 0 0 4.1
6 CASH 9.7 0.0 NA 0 0 9.7
It would be interesting to generate clusters of the pickup and drop-off locations, i.e., the longitude and latitude of both the start and end points of taxi trips. This would reveal the usual patterns of trips across NYC. For example, perhaps trips starting from midtown and ending in downtown Manhattan are especially frequent; those particular start and end points would then form clusters. So we want to run K-means clustering on the four coordinate features: START_LON, START_LAT, END_LON, END_LAT.
The remaining problem is to determine the number of clusters. We will show how to generate the elbow plot using Oracle R Enterprise and Oracle Data Mining methods.
Before we start clustering, we need to make sure the data is clean. Data usually contain outliers, points extremely far from the center of the distribution. The clustering algorithm then tends to form small clusters consisting of outliers and lump everything else together, which is not useful. Let us take a look at the distribution of the coordinate features.
> summary(NYC_YELLOW_TAXI_1G[, c("START_LON", "START_LAT", "END_LON", "END_LAT")])
START_LON START_LAT END_LON END_LAT
Min. :-775.45 Min. : -7.335 Min. :-784.3000 Min. : -7.335
1st Qu.: -73.99 1st Qu.: 40.736 1st Qu.: -73.9910 1st Qu.: 40.736
Median : -73.98 Median : 40.754 Median : -73.9798 Median : 40.754
Mean : -72.85 Mean : 40.136 Mean : -72.8713 Mean : 40.146
3rd Qu.: -73.97 3rd Qu.: 40.768 3rd Qu.: -73.9641 3rd Qu.: 40.769
Max. :3555.91 Max. :935.525 Max. : 0.1014 Max. :1809.958
Note that NYC is located at approximately (40, -73). But the max value of the coordinates lies above 1000! Something must be wrong here, and we need to remove those outliers. The following function removes rows whose value for a given feature is an outlier, based on quantiles (the 1.5 × IQR rule).
remove_outliers <- function(df, feature, na.rm = TRUE, ...) {
  qnt <- quantile(df[, feature], probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(df[, feature], na.rm = na.rm)
  df <- df[df[, feature] > (qnt[1] - H), ]
  df <- df[df[, feature] < (qnt[2] + H), ]
  return(df)
}
We remove the outliers by applying the function above to each coordinate, and then save the result as a new table. Note that all of this computation occurs in the database, leveraging Oracle Database as a high performance compute engine through Oracle R Enterprise's Transparency Layer.
NYC_TRIMMED_1G <- remove_outliers(NYC_YELLOW_TAXI_1G, 'START_LON')
NYC_TRIMMED_1G <- remove_outliers(NYC_TRIMMED_1G, 'START_LAT')
NYC_TRIMMED_1G <- remove_outliers(NYC_TRIMMED_1G, 'END_LON')
NYC_TRIMMED_1G <- remove_outliers(NYC_TRIMMED_1G, 'END_LAT')
ore.drop(table = 'NYC_TRIMMED_1G')
ore.create(NYC_TRIMMED_1G, table = 'NYC_TRIMMED_1G')
ore.sync(table = 'NYC_TRIMMED_1G')
Let us take a look at the range of the coordinates after the filtering.
> summary(NYC_TRIMMED_1G[,c("START_LON", "START_LAT", "END_LON", "END_LAT")])
START_LON START_LAT END_LON END_LAT
Min. :-74.03 Min. :40.69 Min. :-74.03 Min. :40.69
1st Qu.:-73.99 1st Qu.:40.74 1st Qu.:-73.99 1st Qu.:40.74
Median :-73.98 Median :40.75 Median :-73.98 Median :40.76
Mean :-73.98 Mean :40.75 Mean :-73.98 Mean :40.75
3rd Qu.:-73.97 3rd Qu.:40.77 3rd Qu.:-73.97 3rd Qu.:40.77
Max. :-73.93 Max. :40.81 Max. :-73.93 Max. :40.82
The range now looks much better and we are good to go!
Although R has a built-in implementation of K-means, it is not always the best choice. When a dataset is large, building each clustering model can take a significant amount of time. The default R kmeans is in-memory and single-threaded, so it may not scale to larger data; when memory is limited, it may not be able to build a model at all.
In contrast, the algorithms provided by Oracle R Enterprise have a significant advantage: their parallel, distributed, in-database implementations address both scalability and performance. Moreover, we can build multiple models in parallel, which further improves overall execution time. We will leave that topic for a future blog.
Using the NYC yellow taxi dataset as an example, let us measure the time taken by both the open source and the in-database algorithms. Suppose we run K-means with K = 4.
Let us first look at the result using open source kmeans. Since the data is stored in Oracle Database, we first need to pull it from the database into memory. The hardware environment we are using is an Exadata server with two nodes; because there is sufficient memory, it is safe to do the pull. If the environment did not have enough memory, the open source solution would not even be feasible.
Let us see how long it takes to load the data.
nyc.df <- ore.pull(NYC_TRIMMED_1G)
This takes 258 seconds. Then we can train the model on the data in memory:
feature.list <- c("START_LON", "START_LAT", "END_LON", "END_LAT")
k.model <- kmeans(nyc.df[,feature.list], 4, nstart = 1)
Training takes 105 seconds, so the two steps together take 363 seconds.
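Timings like these can be captured with R's system.time(); a sketch, assuming the same connection and feature.list as above (exact numbers will vary with hardware):

```r
# Wall-clock time for the data pull and for model training
pull.time  <- system.time(nyc.df <- ore.pull(NYC_TRIMMED_1G))
train.time <- system.time(
  k.model <- kmeans(nyc.df[, feature.list], centers = 4, nstart = 1)
)

# Total elapsed time for the end-to-end open source approach
print(pull.time["elapsed"] + train.time["elapsed"])
```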
Using ore.odmKMeans, we can start the training inside the database right away.
model.km <- ore.odmKMeans(~., NYC_TRIMMED_1G[, feature.list] , num.centers=4, iterations = 10)
This takes 76 seconds. The in-database algorithm is thus almost five times faster than the end-to-end open source approach (76s vs. 363s).
We write two R functions to generate the elbow plot. The following function calculates the mean in-cluster distance for a given K, which is passed as num.centers. DF is the ore.frame that holds the data, and the features used for clustering are provided by feature.list. We use the in-database method ore.odmKMeans to do the K-means clustering.
elbow_ss <- function(DF, feature.list, num.centers) {
  model.km <- ore.odmKMeans(~., DF[, feature.list],
                            num.centers = num.centers,
                            auto.data.prep = FALSE)
  km.res <- predict(model.km, DF, type = "class", feature.list)
  cluster.ids <- ore.pull(unique(km.res$CLUSTER_ID))
  total.ss <- c()
  for (id in cluster.ids) {
    cur.cluster <- km.res[km.res$CLUSTER_ID == id, feature.list]
    cur.center <- model.km$centers2[rownames(model.km$centers2) == id, ]
    # Squared difference between each point and its cluster centroid,
    # accumulated one coordinate at a time
    for (col in feature.list) {
      cur.cluster[, col] <- (cur.cluster[, col] - cur.center[, col])^2
    }
    cur.cluster$DIST_SQ <- rowSums(cur.cluster)
    total.ss <- c(total.ss, sum(sqrt(cur.cluster$DIST_SQ)))
  }
  mean(total.ss)
}
Another function iterates through the number of clusters from 1 to max.cluster, collects the mean in-cluster distances, and generates the plot.
elbow_stats <- function(DF, feature.list, max.cluster) {
  stopifnot(max.cluster > 0)
  elbow.data <- rep(0, max.cluster)
  for (i in seq_len(max.cluster)) {
    elbow.data[i] <- elbow_ss(DF, feature.list, i)
  }
  plot(seq_len(max.cluster), elbow.data,
       ylab = 'Average in-cluster sum of centroid distance',
       xlab = 'Number of clusters',
       main = 'K-means clustering "elbow" method evaluation results')
}
Let us use the two functions above to generate the plot.
feature.list <- c("START_LON", "START_LAT", "END_LON", "END_LAT")
elbow_stats(NYC_TRIMMED_1G, feature.list, max.cluster=10)
After running the code, we obtain the following graph. As the number of clusters K increases from 1 to 10, the in-cluster centroid distance drops quickly over the first few steps, then slows down and hits a plateau from K = 4 onward. The point K = 4 is the elbow of this graph, which suggests choosing K = 4 for clustering this dataset.
How do we know K = 4 is a reasonable choice? Let us visualize the clustering result. Here the feature space is four-dimensional, which is not easy to visualize directly. One approach is to plot the start locations of the trips, and then mark each cluster centroid's start and end locations to get an idea of the patterns.
The figure shows that the start points form a rough outline of NYC, with most of the dots concentrated in Manhattan. Notice the rectangular gap in the upper part of the shape: Central Park. We also plot the centroids, which consist of both start and end points, with black arrows pointing from start to end. Note that the green cluster has its start and end points very close together.
The arrows indicate the general direction of the traffic. The upper part of Manhattan is particularly interesting: two clusters capture trips between the blocks around Central Park and midtown, meaning there are clusters of trips going back and forth. The red cluster captures trips from downtown Manhattan to Brooklyn or Queens; interestingly, the flow is not the other way around. The green cluster corresponds to people traveling within downtown.
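A sketch of how such a plot could be produced with the model trained above; the column names match the dataset, but the sample size and graphics choices here are our own assumptions:

```r
# Predict cluster assignments, keeping the coordinates as supplemental
# columns, and pull a sample of rows into memory for plotting
res <- ore.pull(head(predict(model.km, NYC_TRIMMED_1G,
                             type = "class", feature.list), 100000))

# Start locations, colored by assigned cluster
plot(res$START_LON, res$START_LAT, pch = ".",
     col = as.integer(as.factor(res$CLUSTER_ID)),
     xlab = "Longitude", ylab = "Latitude")

# Black arrows from each cluster centroid's start to its end location
ctr <- model.km$centers2
arrows(ctr[, "START_LON"], ctr[, "START_LAT"],
       ctr[, "END_LON"], ctr[, "END_LAT"], lwd = 2)
```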
In this blog, we demonstrated how to use Oracle R Enterprise and the in-database K-means clustering algorithm to implement the elbow method for choosing the best number of clusters, using the NYC yellow taxi data as an example. After finding the best value of K, we generated a plot to visualize the results, which revealed some interesting traffic patterns.