Variable selection, also known as feature or attribute selection, is an important technique for data mining and predictive analytics. It is used when the number of variables is large, and it has received special attention in application areas where this number is very large (genomics, combinatorial chemistry, text mining, etc.). The underlying hypothesis is that the data can contain many variables which are either irrelevant or redundant. Solutions are therefore sought for selecting subsets of these variables which can predict the output with an accuracy comparable to that of the complete input set.
Variable selection serves multiple purposes: (1) it provides faster and more cost-effective model generation; (2) it simplifies model interpretation, as the model is based on a (much) smaller and more effective set of predictors; (3) it supports better generalization, because eliminating irrelevant features can reduce model over-fitting.
There are many approaches to feature selection, differentiated by search techniques, validation methods or optimality considerations. In this post we describe a solution based on hierarchical, divisive variable clustering, which generates disjoint groups of variables such that each group can be interpreted as essentially uni-dimensional and represented by a single variable from the original set.
This solution was developed and implemented during a proof of concept (POC) with a customer from the banking sector. The data consisted of tables with several hundred variables and on the order of 1e5 to 1e6 observations. The customer wanted to build an analysis flow operating with a much smaller number of 'relevant' attributes from the original set, chosen to best capture the variability expressed in the data.
The procedure is iterative and starts from a single cluster containing all the original variables. This cluster is divided into two clusters, and each variable is assigned to one or the other of the two child clusters. At every iteration one particular cluster is selected for division, and the procedure continues until there are no more suitable candidates for division or until the user-specified maximum of n clusters has been generated (and n representative variables identified).
The selection criterion for division is related to the variation contained in the candidate cluster, more precisely to how this variation is distributed among its principal components. PCA is performed on the initial (starting) cluster and on every cluster resulting from a division. A large 2nd eigenvalue means that the variation is distributed between at least two principal axes or components. We do not look beyond the 2nd eigenvalue: the cluster's variables are divided into two groups depending on how they are associated with the first two axes of variability. The division process continues until every cluster has variables associated with only one principal component, i.e. until every cluster has a 2nd PCA eigenvalue less than a specified threshold. At each iteration, the cluster picked for splitting is the one with the largest 2nd eigenvalue.
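The selection rule described above can be sketched in a few lines of base R. This is a minimal illustration on simulated data, not the ore.varclus() code: the helper names second.eigenvalue() and pick.cluster.to.split(), the toy variables and the use of eigen() on the correlation matrix are all assumptions made for the example.

```r
# Hypothetical helpers illustrating the split criterion; not part of ORE varclus.
set.seed(1)
n  <- 500
f1 <- rnorm(n); f2 <- rnorm(n); f3 <- rnorm(n)   # three independent latent factors
noisy <- function(f) f + 0.1 * rnorm(n)

# Toy data: a1..a3 share one factor; b1..b3 and b4..b5 follow two different factors.
X <- data.frame(a1 = noisy(f1), a2 = noisy(f1), a3 = noisy(f1),
                b1 = noisy(f2), b2 = noisy(f2), b3 = noisy(f2),
                b4 = noisy(f3), b5 = noisy(f3))

# 2nd eigenvalue of the correlation matrix of the variables in one cluster.
second.eigenvalue <- function(X, vars) {
  if (length(vars) < 2) return(0)
  eigen(cor(X[, vars, drop = FALSE]), symmetric = TRUE, only.values = TRUE)$values[2]
}

# Name of the cluster with the largest 2nd eigenvalue,
# or NA when no cluster reaches the threshold (nothing left to split).
pick.cluster.to.split <- function(X, clusters, eigv2.threshold = 1) {
  ev2 <- vapply(clusters, function(v) second.eigenvalue(X, v), numeric(1))
  if (max(ev2) < eigv2.threshold) return(NA_character_)
  names(ev2)[which.max(ev2)]
}

clusters <- list(A = c("a1", "a2", "a3"),
                 B = c("b1", "b2", "b3", "b4", "b5"))
next.split <- pick.cluster.to.split(X, clusters)
print(next.split)
```

In this toy setup cluster B mixes two latent factors, so its 2nd eigenvalue is well above the threshold of 1 and it is the one selected, while cluster A is essentially uni-dimensional.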
The assignment of variables to clusters is based on the matrix of factor loadings, i.e. the correlations between the original variables and the PCA factors. In practice the factor loadings matrix is not used directly; a rotated matrix is used instead, which improves separability. Details on the principle of factor rotations and the various types of rotations can be found in Choosing the Right Type of Rotation in PCA and EFA and Factor Rotations in Factor Analyses. The rotations are performed with the function GPFoblq() from the GPArotation package, a prerequisite for ORE varclus.
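To make the assignment step concrete, here is a small base R sketch of dividing one cluster in two from rotated loadings. It is an illustration under stated assumptions: varimax() from the stats package (an orthogonal rotation) is swapped in for the oblique GPFoblq() rotation so that the example runs without extra packages, the data are simulated, and divide.cluster() is a hypothetical helper, not a function of ORE varclus.

```r
# Sketch only: varimax() from base R stands in for GPArotation::GPFoblq().
set.seed(1)
n  <- 500
f2 <- rnorm(n); f3 <- rnorm(n)                   # two independent latent factors
noisy <- function(f) f + 0.1 * rnorm(n)
X <- data.frame(b1 = noisy(f2), b2 = noisy(f2), b3 = noisy(f2),
                b4 = noisy(f3), b5 = noisy(f3))

# Hypothetical helper: divide one cluster (at least 2 variables) into two groups.
divide.cluster <- function(X, vars) {
  e <- eigen(cor(X[, vars, drop = FALSE]), symmetric = TRUE)
  # Loadings on the first two components: eigenvectors scaled by sqrt(eigenvalue).
  L <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))
  rownames(L) <- vars
  # Rotate the loadings towards simple structure.
  R <- unclass(varimax(L, normalize = FALSE)$loadings)
  # Assign each variable to the rotated component it loads on most strongly.
  grp <- ifelse(R[, 1]^2 >= R[, 2]^2, 1L, 2L)
  split(vars, grp)
}

groups <- divide.cluster(X, names(X))
print(groups)
```

With this toy data the two groups recover the variables driven by each latent factor; which group is labelled 1 or 2 depends on the sign and order of the components and carries no meaning.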
The next sections will describe how to run the variable clustering algorithm and interpret the results.
The ORE varclus scripts
The present version of ORE varclus is implemented in a function, ore.varclus(), to be run in embedded execution mode. The driver script example, varclus_run.R, illustrates how to call this function with ore.doEval:
R> clust.log <- ore.doEval(FUN.NAME="ore.varclus"
                           ,data.name="MYDATA"
                           ,maxclust=200
                           ,pca="princomp"
                           ,eigv2.threshold=1.
                           ,dsname="datstr.MYDATA"
                           ,ore.connect=TRUE)
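The arguments passed to ore.varclus() in this call are:
data.name: the name of the input dataset (here MYDATA)
maxclust: the maximum number of clusters to generate before the procedure stops
pca: the PCA routine to use (here princomp)
eigv2.threshold: the threshold on the 2nd eigenvalue; a cluster whose 2nd eigenvalue falls below it is no longer a candidate for division
dsname: the name of the datastore in which the results are saved
The last argument, ore.connect=TRUE, is a control argument of ore.doEval itself, requesting a database connection inside the embedded R session.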
ore.varclus() is implemented in the varclus_lib.R script. The script also contains examples of post-processing functions illustrating how to selectively extract results from the datastore and generate reports and plots. The current version of ore.varclus() supports only numerical attributes. Details on the usage of the post-processing functions are provided in the next section.
The output
Datastores
We illustrate the output of ORE varclus for a particular dataset (MYDATA) containing 39 numeric variables and 54k observations.
ore.varclus() saves the history of the entire cluster generation in a datastore specified via the dsname argument:
  datastore.name object.count  size       creation.date description
1  datstr.MYDATA           13 30873 2015-05-28 01:03:42        <NA>

     object.name      class size length row.count col.count
 1 Grand.Summary data.frame  562      5         6         5
 2 clusters.ncl1       list 2790      1        NA        NA
 3 clusters.ncl2       list 3301      2        NA        NA
 4 clusters.ncl3       list 3811      3        NA        NA
 5 clusters.ncl4       list 4322      4        NA        NA
 6 clusters.ncl5       list 4833      5        NA        NA
 7 clusters.ncl6       list 5344      6        NA        NA
 8  summary.ncl1       list  527      2        NA        NA
 9  summary.ncl2       list  677      2        NA        NA
10  summary.ncl3       list  791      2        NA        NA
11  summary.ncl4       list  922      2        NA        NA
12  summary.ncl5       list 1069      2        NA        NA
13  summary.ncl6       list 1232      2        NA        NA
For this dataset the algorithm generated 6 clusters after 6 iterations with a threshold eigv2.threshold=1.00. The datastore contains several types of objects: clusters.nclX, summary.nclX and Grand.Summary. The suffix X indicates the iteration step. For example, clusters.ncl4 does not mean the 4th cluster; it is a list of objects (numbers and tables) related to the 4 clusters generated during the 4th iteration. summary.ncl4 contains summary information about each of the 4 clusters generated during the 4th iteration. Grand.Summary provides the same metrics, aggregated over all clusters, for each iteration. More details are provided below.
The user can load and inspect each clusters.nclX or summary.nclX individually, for example to track how variables are assigned to clusters during the iterative process. Saving the results on a per-iteration basis becomes practical when the starting variables number in the hundreds and many clusters are generated.
Text based output
varclus_lib.R contains a function, write.clusters.to.file(), which concatenates the information from one or several iterations and writes it as formatted text for visual inspection. In the example below, the results from the last two steps (5 and 6), specified via the clust.steps argument, are written to the file named via the fout argument.
R> fclust <- "out.varclus.MYDATA/out.MYDATA.clusters"
R> write.clusters.to.file(fout=fclust,
dsname="datstr.MYDATA",clust.steps=c(5,6))
The output now contains the information from summary.ncl5, clusters.ncl5, summary.ncl6, clusters.ncl6 and Grand.Summary, in that order. Below we show only the output corresponding to the 6th iteration, which contains the final results.
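The output starts with data collected from summary.ncl6 and displayed as two sections, 'Clusters Summary' and 'Inter-Clusters Correlation'. The columns of 'Clusters Summary' are:
Cluster: the cluster index
Members: the number of variables in the cluster
Variation.Explained: the variation explained by the 1st principal component of the cluster (in the output below it equals Members multiplied by Proportion.Explained)
Proportion.Explained: the proportion of the cluster's variation explained by its 1st principal component
Secnd.Eigenval: the 2nd PCA eigenvalue of the cluster
Represent.Var: the variable chosen to represent the cluster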
The 'Inter-Clusters Correlation' matrix is the correlation matrix between the scores of the data on the 1st principal component of every cluster. It measures how uncorrelated the clusters are when each is represented by its 1st principal component.
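A matrix of this kind can be sketched in base R by computing the 1st principal component scores of each cluster and correlating them. The example below uses simulated data and a hypothetical helper, first.pc.scores(); princomp() with cor=TRUE is an assumption chosen to mirror the pca="princomp" setting, and the actual ore.varclus() computation may differ in detail.

```r
# Sketch on simulated data; the cluster names and variables are made up.
set.seed(1)
n  <- 500
f1 <- rnorm(n); f2 <- rnorm(n)                   # two independent latent factors
noisy <- function(f) f + 0.1 * rnorm(n)
X <- data.frame(a1 = noisy(f1), a2 = noisy(f1), a3 = noisy(f1),
                b1 = noisy(f2), b2 = noisy(f2))
clusters <- list(Clust.1 = c("a1", "a2", "a3"), Clust.2 = c("b1", "b2"))

# Scores of the observations on the 1st principal component of one cluster.
first.pc.scores <- function(X, vars) {
  princomp(X[, vars, drop = FALSE], cor = TRUE)$scores[, 1]
}

# One column of scores per cluster, then their correlation matrix.
scores <- vapply(clusters, function(v) first.pc.scores(X, v), numeric(nrow(X)))
inter.clusters.correl <- cor(scores)
print(round(inter.clusters.correl, 3))
```

Off-diagonal values near zero indicate that the clusters carry largely independent information; a large value suggests that two clusters are still strongly related.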
----------------------------------------------------------------------------------------
Clustering step 6
----------------------------------------------------------------------------------------
Clusters Summary :
Cluster Members Variation.Explained Proportion.Explained Secnd.Eigenval Represent.Var
1 1 13 11.522574 0.8863518 7.856187e-01 VAR25
2 2 6 5.398123 0.8996871 3.874496e-01 VAR13
3 3 6 5.851600 0.9752667 1.282750e-01 VAR9
4 4 3 2.999979 0.9999929 2.112009e-05 VAR10
5 5 5 2.390534 0.4781069 8.526650e-01 VAR27
6 6 6 5.492897 0.9154828 4.951499e-01 VAR14
Inter-Clusters Correlation :
Clust.1 Clust.2 Clust.3 Clust.4 Clust.5 Clust.6
Clust.1 1.000000000 0.031429267 0.0915034534 -0.0045104029 -0.0341091948 0.0284033464
Clust.2 0.031429267 1.000000000 0.0017441189 -0.0014435672 -0.0130659191 0.8048780461
Clust.3 0.091503453 0.001744119 1.0000000000 0.0007563413 -0.0080611117 -0.0002118345
Clust.4 -0.004510403 -0.001443567 0.0007563413 1.0000000000 -0.0008410022 -0.0022667776
Clust.5 -0.034109195 -0.013065919 -0.0080611117 -0.0008410022 1.0000000000 -0.0107850694
Clust.6 0.028403346 0.804878046 -0.0002118345 -0.0022667776 -0.0107850694 1.0000000000
Cluster 1
Comp.1 Comp.2 r2.own r2.next r2.ratio var.idx
VAR25 -0.3396562963 0.021849138 0.9711084 0.010593134 0.02920095 25
VAR38 -0.3398365257 0.021560264 0.9710107 0.010590140 0.02929962 38
VAR23 -0.3460431639 0.011946665 0.9689027 0.010689408 0.03143329 23
VAR36 -0.3462378084 0.011635813 0.9688015 0.010685952 0.03153546 36
VAR37 -0.3542777932 -0.001166427 0.9647680 0.010895771 0.03562009 37
VAR24 -0.3543088809 -0.001225793 0.9647155 0.010898262 0.03567326 24
VAR22 -0.3688379400 -0.026782777 0.9484384 0.011098450 0.05214028 22
VAR35 -0.3689127408 -0.026900129 0.9484077 0.011093779 0.05217103 35
VAR30 -0.0082726659 0.478137910 0.8723316 0.006303141 0.12847817 30
VAR32 0.0007818601 0.489061629 0.8642301 0.006116234 0.13660543 32
VAR31 0.0042646500 0.493099400 0.8605441 0.005992662 0.14029666 31
VAR33 0.0076560545 0.497131056 0.8573146 0.005934929 0.14353729 33
VAR34 -0.0802417381 0.198756967 0.3620001 0.007534643 0.64284346 34
Cluster 2
Comp.1 Comp.2 r2.own r2.next r2.ratio var.idx
VAR13 -0.50390550 -0.03826113 0.9510065 0.6838419 0.1549652 13
VAR3 -0.50384385 -0.03814382 0.9509912 0.6838322 0.1550089 3
VAR18 -0.52832332 -0.09384185 0.9394948 0.6750884 0.1862204 18
VAR11 -0.31655455 0.33594147 0.9387738 0.5500716 0.1360798 11
VAR16 -0.34554284 0.26587848 0.9174539 0.5351907 0.1775913 16
VAR39 -0.02733522 -0.90110241 0.7004025 0.3805168 0.4836249 39
Cluster 3
Comp.1 Comp.2 r2.own r2.next r2.ratio var.idx
VAR9 -4.436290e-01 0.010645774 0.9944599 0.0111098555 0.005602316 9
VAR8 -4.440656e-01 0.009606151 0.9944375 0.0113484256 0.005626315 8
VAR7 -4.355970e-01 0.028881014 0.9931890 0.0110602004 0.006887179 7
VAR6 -4.544373e-01 -0.016395561 0.9914545 0.0114996393 0.008644956 6
VAR21 -4.579777e-01 -0.027336302 0.9865562 0.0004552779 0.013449888 21
VAR5 1.566362e-06 0.998972842 0.8915032 0.0093737140 0.109523464 5
Cluster 4
Comp.1 Comp.2 r2.own r2.next r2.ratio var.idx
VAR10 7.067763e-01 0.0004592019 0.9999964 1.899033e-05 3.585911e-06 10
VAR1 7.074371e-01 -0.0004753728 0.9999964 1.838949e-05 3.605506e-06 1
VAR15 2.093320e-11 0.9999997816 0.9999859 2.350467e-05 1.408043e-05 15
Cluster 5
Comp.1 Comp.2 r2.own r2.next r2.ratio var.idx
VAR27 -0.556396037 -0.031563215 0.6199740 0.0001684573 0.3800900 27
VAR29 -0.532122723 -0.041330455 0.5586173 0.0001938785 0.4414683 29
VAR28 -0.506440510 -0.002599593 0.5327290 0.0001494172 0.4673408 28
VAR26 -0.389716922 0.198849850 0.4396647 0.0001887849 0.5604411 26
VAR20 0.003446542 0.979209797 0.2395493 0.0076757755 0.7663329 20
Cluster 6
Comp.1 Comp.2 r2.own r2.next r2.ratio var.idx
VAR14 -0.0007028647 0.5771114183 0.9164991 0.7063442 0.2843495 14
VAR4 -0.0007144334 0.5770967589 0.9164893 0.7063325 0.2843714 4
VAR12 -0.5779762250 -0.0004781436 0.9164238 0.4914497 0.1643420 12
VAR2 -0.5779925997 -0.0004993306 0.9164086 0.4914361 0.1643676 2
VAR17 -0.5760772611 0.0009732350 0.9150015 0.4900150 0.1666686 17
VAR19 0.0014223072 0.5778410825 0.9120741 0.7019736 0.2950272 19
---------------------------------------------------------------------------------------
Grand Summary
---------------------------------------------------------------------------------------
Nb.of.Clusters Tot.Var.Explained Prop.Var.Explained Min.Prop.Explained Max.2nd.Eigval
1 1 11.79856 0.3025272 0.3025272 9.787173
2 2 21.47617 0.5506711 0.4309593 5.778829
3 3 27.22407 0.6980530 0.5491522 2.999950
4 4 30.22396 0.7749735 0.6406729 2.389400
5 5 32.60496 0.8360246 0.4781069 1.205769
6 6 33.65571 0.8629668 0.4781069 0.852665
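The sections 'Cluster 1' ... 'Cluster 6' contain results collected from the clusters.ncl6 list from the datastore. Each cluster is described by a table where the rows are the variables and the columns correspond to:
Comp.1, Comp.2: the coefficients of the variable on the first two components of the cluster
r2.own: the squared correlation of the variable with its own cluster
r2.next: the squared correlation of the variable with the nearest other cluster
r2.ratio: the ratio (1 - r2.own)/(1 - r2.next), which the values in the tables reproduce; smaller values indicate a variable that fits its own cluster well and is well separated from the others
var.idx: the index of the variable in the original dataset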
For example, from 'Clusters Summary', the first cluster (index 1) has 13 variables and is best represented by variable VAR25 which, from inspecting the 'Cluster 1' section, shows the highest r2.own = 0.9711084.
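The relation between the r2 columns can be checked directly on the numbers printed above. The sketch below copies a few rows of the 'Cluster 1' table and recomputes r2.ratio as (1 - r2.own)/(1 - r2.next); that formula is an assumption inferred from the printed values rather than taken from the ore.varclus() source, and picking the representative as the variable with the highest r2.own simply follows the example in the text.

```r
# A few rows copied from the 'Cluster 1' table of the 6th iteration.
cluster1 <- data.frame(
  variable = c("VAR25", "VAR38", "VAR30", "VAR34"),
  r2.own   = c(0.9711084, 0.9710107, 0.8723316, 0.3620001),
  r2.next  = c(0.010593134, 0.010590140, 0.006303141, 0.007534643),
  r2.ratio = c(0.02920095, 0.02929962, 0.12847817, 0.64284346),
  stringsAsFactors = FALSE
)

# Assumed definition of the ratio, checked against the printed column.
recomputed <- (1 - cluster1$r2.own) / (1 - cluster1$r2.next)
max.abs.diff <- max(abs(recomputed - cluster1$r2.ratio))
print(max.abs.diff)

# Representative variable: the one with the highest r2.own.
representative <- cluster1$variable[which.max(cluster1$r2.own)]
print(representative)   # "VAR25"
```

The recomputed ratios agree with the printed ones up to the rounding of the displayed digits.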
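The section 'Grand Summary' displays the results from the Grand.Summary table in the datastore. The rows correspond to the clustering iterations and the columns are defined as:
Nb.of.Clusters: the number of clusters at that iteration
Tot.Var.Explained: the sum of Variation.Explained over all clusters
Prop.Var.Explained: Tot.Var.Explained divided by the total number of variables (39 here)
Min.Prop.Explained: the smallest Proportion.Explained among the clusters
Max.2nd.Eigval: the largest Secnd.Eigenval among the clusters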
For example, for the final clusters (Nb.of.Clusters = 6), Min.Prop.Explained is 0.4781069. This corresponds to Cluster 5 (see the Proportion.Explained value in 'Clusters Summary'). It means that the variation in Cluster 5 is poorly captured by the first principal component (only 47.8%).
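The last row of 'Grand Summary' can be reproduced from the 'Clusters Summary' table of the 6th iteration. The sketch below copies the printed values and aggregates them; the aggregation rules (sum, sum divided by the number of variables, minimum, maximum) are inferred from the printed numbers and are an assumption about how ore.varclus() builds the table.

```r
# 'Clusters Summary' of the 6th iteration, copied from the output above.
clusters.summary <- data.frame(
  Cluster              = 1:6,
  Members              = c(13, 6, 6, 3, 5, 6),
  Variation.Explained  = c(11.522574, 5.398123, 5.851600, 2.999979, 2.390534, 5.492897),
  Proportion.Explained = c(0.8863518, 0.8996871, 0.9752667, 0.9999929, 0.4781069, 0.9154828),
  Secnd.Eigenval       = c(7.856187e-01, 3.874496e-01, 1.282750e-01,
                           2.112009e-05, 8.526650e-01, 4.951499e-01)
)

# Aggregate the per-cluster metrics into one 'Grand Summary' row.
grand <- with(clusters.summary, data.frame(
  Nb.of.Clusters     = length(Cluster),
  Tot.Var.Explained  = sum(Variation.Explained),
  Prop.Var.Explained = sum(Variation.Explained) / sum(Members),
  Min.Prop.Explained = min(Proportion.Explained),
  Max.2nd.Eigval     = max(Secnd.Eigenval)
))
print(grand)
```

The result matches row 6 of 'Grand Summary' up to display rounding. Since Max.2nd.Eigval is now below the threshold of 1, no cluster remains a candidate for division.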
As previously indicated, the representative variables, one per final cluster, are collected in the Represent.Var column from the 'Clusters Summary' section in the output text file. They can be retrieved from the summary.ncl6 object in the datastore as shown below:
R> ore.load(list=c("summary.ncl6"),name=datstr.name)
[1] "summary.ncl6"
R> names(summary.ncl6)
[1] "clusters.summary" "inter.clusters.correl"
R> names(summary.ncl6$clusters.summary)
[1] "Cluster" "Members" "Variation.Explained" "Proportion.Explained" "Secnd.Eigenval"
[6] "Represent.Var"
R> summary.ncl6$clusters.summary$Represent.Var
[1] "VAR25" "VAR13" "VAR9" "VAR10" "VAR27" "VAR14"
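Once the representative variables are known, reducing a table to those columns is a one-line subset. The sketch below uses a small simulated local data frame, MYDATA.local, as a stand-in for the real table (a hypothetical name, with made-up values); only the variable names come from the output above.

```r
# Stand-in for the real table: 5 rows, columns VAR1..VAR39 (simulated values).
set.seed(1)
MYDATA.local <- as.data.frame(matrix(rnorm(5 * 39), nrow = 5,
                                     dimnames = list(NULL, paste0("VAR", 1:39))))

# Representative variables as returned by summary.ncl6$clusters.summary$Represent.Var.
represent.var <- c("VAR25", "VAR13", "VAR9", "VAR10", "VAR27", "VAR14")

# Keep only the representative columns, one per final cluster.
reduced <- MYDATA.local[, represent.var, drop = FALSE]
print(dim(reduced))
```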
In our next post we'll look at plots, performance and future developments for ORE varclus.