X

Best practices, news, tips and tricks - learn about Oracle's R Technologies for Oracle Database and Big Data

Selecting the Best Number of Clusters at Scale using Oracle R Enterprise

Jie Liu
Data Scientist

Among clustering algorithms, K-means is the most popular. It is fast, scalable and easy to interpret. Therefore, it is almost the default first choice when data scientists want to cluster data and get insight into the inner structure of a dataset. A good introduction to this method can be found in the datascience.com blog.
However, there is one key parameter for K-means clustering which needs to be selected appropriately. That is K, the number of clusters for the algorithm to generate, which is left for the user to choose. For a given dataset without any prior knowledge, we cannot know for sure how many clusters naturally exists in the dataset. Choosing the wrong K often leads to an undesirable result.
There are many methods proposed to solve this problem, such as the Elbow method, Silhouette, and the gap statistic. Practically, the elbow method is the most intuitive to understand and easiest to implement. In this blog, we will show how to implement the elbow method to choose the parameter K using Oracle R Enterprise for both performance and scalability.

Elbow method

The elbow method works as follows. Assuming the best K lies within a range [1, n], search for the best K by running K-means over each K = 1, 2, ..., n. Based on each K-means result, calculate the mean distance between data points and their cluster centroid. For short, we call it mean in-cluster distance. Naturally, if we increase K, this number will decrease. This is because the more centers, the smaller mean in-cluster distance there will be. Imagine if we have the number of clusters equal to the number of data points in the dataset, the distances will be all zero.
To select the best K, we need to plot the mean in-cluster distance for each K. As K increases from 1, before reaching the optimal K, the decrease speed is relatively fast because the number of centers are too low from the very beginning and each new center will incur a large decrease in the mean distance. But after the optimal K, the decrease is slow since the correct cluster structure is already discovered and any newly added center will appear in a certain cluster already formed. That will not decrease the mean in-cluster distance too much. The entire curve looks like an L shape and the best K lies in the turning point or the elbow of the L shape. The plot usually looks like below.

NYC Yellow Taxi Data

To illustrate this method, we use a real-world dataset from NYC taxi data, which is available online.
The following R code reads the NYC yellow taxi data from .csv and saves it as a table in Oracle Database by Oracle R Enterprise.

ore.connect(…)
nyc.data <- read.csv(file="yellow_tripdata_2009-01.csv", header=T) 
colnames(nyc.data) <- toupper(colnames(nyc.data))
ore.create(nyc.data, table="NYC_YELLOW_TAXI_1G")

Let us take a look at the data frame:

> dim(NYC_YELLOW_TAXI_1G)
[1] 14092413       18 
> head(NYC_YELLOW_TAXI_1G)
  VENDOR_NAME TRIP_PICKUP_DATETIME TRIP_DROPOFF_DATETIME PASSENGER_COUNT
1         VTS  2009-01-26 23:04:00   2009-01-26 23:05:00               1
2         CMT  2009-01-23 10:11:33   2009-01-23 10:36:24               1
3         CMT  2009-01-23 11:40:37   2009-01-23 12:07:09               1
4         VTS  2009-01-26 16:32:00   2009-01-26 16:32:00               2
5         CMT  2009-01-23 06:20:49   2009-01-23 06:23:44               1
6         DDS  2009-01-06 07:50:21   2009-01-06 08:06:08               4
  TRIP_DISTANCE START_LON START_LAT RATE_CODE STORE_AND_FORWARD   END_LON  END_LAT
1          0.71 -73.97717  40.74292        NA                NA -73.98488 40.73889
2          3.70 -73.98622  40.76235        NA                NA -74.00987 40.72123
3          7.50 -73.99476  40.68473        NA                NA -73.98017 40.75152
4          0.99 -73.95994  40.77095        NA                NA -73.94618 40.77272
5          0.60 -73.97818  40.75410        NA                NA -73.97782 40.76318
6          2.10 -73.95949  40.77119        NA                NA -73.97552 40.79190
  PAYMENT_TYPE FARE_AMT SURCHARGE MTA_TAX TIP_AMT TOLLS_AMT TOTAL_AMT
1         CASH      4.1       0.5      NA       0         0       4.6
2         Cash     14.9       0.0      NA       0         0      14.9
3         Cash     21.7       0.0      NA       0         0      21.7
4       Credit      4.9       1.0      NA       1         0       6.9
5         Cash      4.1       0.0      NA       0         0       4.1
6         CASH      9.7       0.0      NA       0         0       9.7

It will be interesting if we can generate clusters of the pickup and drop off locations, i.e., the longitude and latitude of both start and end points of taxi trips. This will display the usual patterns of trips across NYC. For example, perhaps there are more frequent trips starting from midtown to downtown Manhattan. Naturally, those particular start and end points will form clusters. So we want to do K-means clustering on the four coordinate features: START_LON, START_LAT, END_LON, END_LAT.
The remaining problem is to determine the number of clusters. We will show how to generate the elbow plot using Oracle R Enterprise and Oracle Data Mining methods.

Preprocessing

Before we start to do clustering, we need to make sure the data is clean. Usually the data contain outliers, which are extremely far from the center of the distribution. The clustering algorithm tends to form small clusters that consist of outliers and cluster everything else together, which is not useful. Let us take a look at the distribution of the coordinate features.

> summary(NYC_YELLOW_TAXI_1G[, c("START_LON", "START_LAT", "END_LON", "END_LAT")])
 START_LON         START_LAT          END_LON             END_LAT        
 Min.   :-775.45   Min.   : -7.335   Min.   :-784.3000   Min.   :  -7.335  
 1st Qu.: -73.99   1st Qu.: 40.736   1st Qu.: -73.9910   1st Qu.:  40.736  
 Median : -73.98   Median : 40.754   Median : -73.9798   Median :  40.754  
 Mean   : -72.85   Mean   : 40.136   Mean   : -72.8713   Mean   :  40.146  
 3rd Qu.: -73.97   3rd Qu.: 40.768   3rd Qu.: -73.9641   3rd Qu.:  40.769  
 Max.   :3555.91   Max.   :935.525   Max.   :   0.1014   Max.   :1809.958  

Note that NYC is approximately located around (40, -73). But the max value of the coordinates lies above 1000! Something must be wrong here. We need to remove those outliers. The following function removes rows with outliers of a given feature based on quantiles.

remove_outliers <- function(df, feature, na.rm = TRUE, ...) {
  qnt <- quantile(df[,feature], probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(df[,feature], na.rm = na.rm)
  df <- df[ df[,feature] > (qnt[1] - H), ]
  df <- df[ df[,feature] < (qnt[2] + H), ]
  return(df)
}

We remove the outliers by applying the function above to each coordinate, and then save the result as a new table. Note that all of this computation occurs in the database, leveraging Oracle Database as a high performance compute engine through Oracle R Enterprise's Transparency Layer.

NYC_TRIMMED_1G <-remove_outliers(NYC_YELLOW_TAXI_1G, 'START_LON')
NYC_TRIMMED_1G <-remove_outliers(NYC_TRIMMED_1G, 'START_LAT')
NYC_TRIMMED_1G <-remove_outliers(NYC_TRIMMED_1G, 'END_LON')
NYC_TRIMMED_1G <-remove_outliers(NYC_TRIMMED_1G, 'END_LAT')
ore.drop(table = 'NYC_TRIMMED_1G')
ore.create(NYC_TRIMMED_1G, table = 'NYC_TRIMMED_1G')
ore.sync(table = 'NYC_TRIMMED_1G')

Let us take a look at the range of the coordinates after the filtering.

> summary(NYC_TRIMMED_1G[,c("START_LON", "START_LAT", "END_LON", "END_LAT")])
   START_LON        START_LAT        END_LON          END_LAT     
 Min.   :-74.03   Min.   :40.69   Min.   :-74.03   Min.   :40.69  
 1st Qu.:-73.99   1st Qu.:40.74   1st Qu.:-73.99   1st Qu.:40.74  
 Median :-73.98   Median :40.75   Median :-73.98   Median :40.76  
 Mean   :-73.98   Mean   :40.75   Mean   :-73.98   Mean   :40.75  
 3rd Qu.:-73.97   3rd Qu.:40.77   3rd Qu.:-73.97   3rd Qu.:40.77  
 Max.   :-73.93   Max.   :40.81   Max.   :-73.93   Max.   :40.82  

The range now looks much better and we are good to go!

Why Oracle R Enterprise In-database method

Although there exists an implementation of K-means in R, sometimes it is not the desirable choice. If a data set is large, building each clustering model can take a significant amount of time. The default R kmeans is in-memory and single threaded. It may not scale to bigger data. Especially when the memory is limited, it cannot even build a model.
In contrast, algorithms provided by Oracle R Enterprise have a huge advantage. Using parallel, distributed, in-database algorithm implementations can well address the issues of scalability and performance. Moreover, we can build multiple models in parallel, which also improves overall execution performance. We will leave that topic for a future blog.
Using the NYC Yellow Taxi dataset as an example, let us measure the time taken for both open source and in-database algorithms. Suppose we do a K-means with K =4.
Let us first look at the result using open source kmeans. Assume we start from the data stored in Oracle database. We need to pull the data from the database into memory first. Here, the hardware environment we are using is an Exadata server with two nodes. Because there is sufficient memory, it is safe to do the pull. Suppose the environment does not have enough memory, then it is even impossible to use the open source solution.
Let us see how long it takes to load the data.

nyc.df <- ore.pull(NYC_TRIMMED_1G)

This takes 258s. Then we can train the model on the data in memory:

feature.list <- c("START_LON", "START_LAT", "END_LON", "END_LAT")
k.model <- kmeans(nyc.df[,feature.list], 4, nstart = 1)       

Training takes 105s. The two steps cost 363s.
Using ore.odmKeans, we can kick off the training inside database right away.

model.km <- ore.odmKMeans(~., NYC_TRIMMED_1G[, feature.list] , num.centers=4, iterations = 10)

This takes 76s. We can see that the in-database algorithm is nearly 4 times faster.

Implementation using Oracle R Enterprise

We write two R functions to generate the elbow plot. The following function calculates the mean in-cluster distance for a given K, which is set to be num.centers. DF is the name of the ore.frame that holds the data. The features used for clustering is provided by feature.list. We use the in-database method ore.odmKmeans to do the K-means clustering.

elbow_ss <- function(DF, feature.list, num.centers){
  model.km <- ore.odmKMeans(~., DF[, feature.list] , 
                            num.centers=num.centers, 
                            auto.data.prep = FALSE)
  km.res <- predict(model.km, DF, type="class", feature.list)  
  cluster.ids <- ore.pull(unique(km.res$CLUSTER_ID))  
  total.ss <- c()  
  for(id in cluster.ids){
    cur.cluster <- km.res[km.res$CLUSTER_ID == id, feature.list]    
    cur.center = model.km$centers2[rownames(model.km$centers2) == id,]    
    for(col in feature.list){
      cur.cluster[,  col] <- (cur.cluster[,col] - cur.center[,col])^2
    }    
    cur.cluster$DIST_SQ <- rowSums(cur.cluster)
    total.ss <- c(total.ss, sum(sqrt(cur.cluster$DIST_SQ)))
  }  
  mean(total.ss)
}

Another function iterates through the number of clusters from 1 to max.cluster and saves the mean in-cluster distances and generate the plot.

elbow_stats <- function(DF, feature.list, max.cluster){
  stopifnot(max.cluster > 0)
  elbow.data <- rep(0, max.cluster)
  for(i in seq_len(max.cluster)){    
    elbow.data[i] <- elbow_ss(DF, feature.list, i) 
  }
  plot(seq_len(max.cluster), elbow.data, 
       ylab = 'Average in-cluster sum of centroid distance', 
       xlab = 'Number of clusters',
       main = 'K-means clustering "elbow" method evaluation results')
}

Let us use the two functions above to generate the plot.

feature.list <- c("START_LON", "START_LAT", "END_LON", "END_LAT")
elbow_stats(NYC_TRIMMED_1G, feature.list, max.cluster=10)

After we run the code, the following graph is obtained. As the number of clusters K increases from 1 to 10, the in-cluster centroid sum is decreasing fast at the first few steps and then slows down and hits a plateau from K = 4. The point where K=4 is the elbow of this graph. This suggests to choose K=4 for clustering this dataset.

Visualization

How can we know K= 4 is a reasonable choice? Let us visualize the clustering result. In this case, the dimension of features is four, which is not easy to visualize directly. One way is to plot the start locations and then mark the start location and end location from the centroid of the clusters to have an idea of the patterns.


The figure shows that the start points form a rough shape of NYC, with most of the dots concentrated in Manhattan. Notice the square gap in the upper side of the shape: central park. We also plot the centroids, which consist of both the start and end points, with black arrows from the start to the end. Note that the clusters with the color of green has both points very close.
The arrows indicates the general direction of the traffic. What is interesting is the upper side of Manhattan. Two clusters indicate the trips between blocks around central park to midtown, meaning that there are two clusters of trips going back and forth. The red cluster indicates trips from downtown Manhattan to Brooklyn or Queens. It is interesting that this direction is not the other way around. The green cluster is related to people traveling within downtown.

Conclusion

In this blog, we demonstrated how to use Oracle R Enterprise and the in-database K-means clustering algorithm to implement the elbow method to choose the best number of clusters. We used data from NYC yellow taxi as an example. After the best value for K is found, we then generated a plot to visualize the results and some interesting patterns are found.

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha
Oracle

Integrated Cloud Applications & Platform Services