In our previous post
we talked about variable selection and introduced a technique based on hierarchical divisive clustering and implemented using the Oracle R Enterprise
embedded execution capabilities. In this post we illustrate how to visualize the clustering solution, discuss stopping criteria and highlight some performance aspects.
The clustering efficiency can be assessed, from a high level perspective, through a visual representation of metrics related to variability. The plot.clusters()
function provided as example in varclus_lib.R
takes the datastore name, the iteration number (nclust
corresponds to the number of clusters after the final iteration) and an output directory to generate a png output file with two plots.
unix> ls -1 out.varclus.MYDATA
The upper plot focuses on the last iteration. The x axis represents the cluster id (1 to 6 for six clusters after the 6-th and final iteration). The variation explained and proportion of variation explained (Variation.Explained and Proportion.Explained from 'Cluster Summary') are rendered by the blue curve with units on the left y axis and the red curve with units on the right y axis). Clusters 1,2,3,4,6 are well represented by their first principal component. Cluster 5, contains variation which is not well captured by a single component (only 47.8% is explained, as alraedy mentioned in Part 1). This can be seen also from the r2.own values for the variables of Cluster 5, VAR20, VAR26,...,VAR29 , between 0.24 and 0.62 indicating that their are not well correlated with the 1st principal component score. For this kind of situation, domain expertise will be needed to evaluate the results and decide the course of action : does it make sense to have VAR20, VAR26,...,VAR29 clustered together/keep VAR27 as representative variable or should Cluster 5 be further split by lowering eig2.threshold (below the corresponding Secnd.Eigenval value from the 'Clusters Summary' section) ?
The bottom plot illustrates the entire clustering sequence (all iterations) The x axis represents the iteration number or the numbers of clusters after that iteration. The total variation explained and proportion of total variation explained (Tot.Var.Explained and Prop.Var.Explained from 'Grand Summary' are rendered by the blue curve with units on the left y axis and the red curve with units on the right y axis). One can see how Prop.Var.Explained tends to flatten below 90% (86.3% for the last iteration).
For the case above a single cluster was 'weak' and there were no ambiguities about where to start examining the results or search for issues. Below is the same output for a different problem with 120 variables and 29 final clusters. For this case, the proportion of variation explained by the 1st component (red curve, upper plot) shows several 'weak' clusters : 23, 28, 27, 4, 7, 19. The Prop.Var.Explained is below 60% for these clusters. Which one should be examined first ? A good choice could be Cluster 7 because it plays a more important role as measured by the absolute value of Variation.Explained. Here again, domain knowledge will be required to examine these clusters and decide if and for how long how one should continue the splitting process.
Stopping criteria & number of variables
As illustrated in the previous section, the number of final clusters can be raised or reduced by lowering or increasing the eig2.trshld
parameter. For problems with many variables the user may want to stop the iterations at lower values and inspect the clustering results & history before convergence to gain a better understanding of the variable selection process. Early stopping is achieved through the maxclust
argument, as discussed in the previous post
, and can be used also if the user wants/has to keep the number of selected variables below an upper limit.
The clustering runtime is entirely dominated by the cost of the PCA analysis. The 1st split is the most expensive as PCA is run for the entire data; the subsequent splits are executed faster and faster as the PCAs handle clusters with less and less variables. For the 39 variables & 55k rows case presented it took ~10s for the entire run (splitting into 6 clusters, post-processing from datastore, generation). The 120 variables & 55k rows case required ~54s. For a larger case with 666 variables & 64k rows the execution completed in 112s and generated 128 clusters. These numbers were obtained on a Intel Xeon 2.9Ghz OL6 machine.The customer ran cases with more than 600 variable & O[1e6] rows in 5-10 mins.