Variable selection, also known as feature or attribute selection, is an important technique for data mining and predictive analytics. It is used when the number of variables is large, and it has received particular attention in application areas where this number is very large (genomics, combinatorial chemistry, text mining, etc.). The underlying hypothesis is that the data can contain many variables which are either irrelevant or redundant. Solutions are therefore sought for selecting subsets of these variables which can predict the output with an accuracy comparable to that of the complete input set.

Variable selection serves multiple purposes: (1) it provides faster and more cost-effective model generation; (2) it simplifies model interpretation, as the model is based on a (much) smaller and more effective set of predictors; (3) it supports better generalization, because eliminating irrelevant features can reduce model over-fitting.

There are many approaches to feature selection, differentiated by search technique, validation method or optimality considerations. In this post we describe a solution based on hierarchical, divisive variable clustering, which generates disjoint groups of variables such that each group can be interpreted as essentially uni-dimensional and represented by a single variable from the original set.

This solution was developed and implemented during a proof of concept (POC) with a customer from the banking sector. The data consisted of tables with several hundred variables and on the order of 1e5 to 1e6 observations. The customer wanted to build an analysis flow operating on a much smaller number of 'relevant' attributes from the original set, chosen to best capture the variability expressed in the data.

The procedure is iterative and starts from a single cluster containing all the original variables. This cluster is divided into two, and each variable is assigned to one of the two child clusters. At every iteration one cluster is selected for division, and the procedure continues until there are no more suitable candidates for division or until the user-specified number n of clusters has been generated (and n representative variables identified).

The selection criterion for division is related to the variation contained in the candidate cluster, more precisely to how this variation is distributed among its principal components. PCA is performed on the initial (starting) cluster and on every cluster resulting from a division. A large 2nd eigenvalue means that the variation is distributed over at least two principal axes or components. We do not look beyond the 2nd eigenvalue: the cluster's variables are divided into two groups depending on how they are associated with the first two axes of variability. The division process continues until every cluster has variables associated with only one principal component, i.e. until every cluster has a 2nd PCA eigenvalue less than a specified threshold. At each iteration, the cluster picked for splitting is the one with the largest 2nd eigenvalue.
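The selection rule can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code of *ore.varclus()*: the helper names `second.eigenvalue()` and `pick.cluster.to.split()` are hypothetical, the PCA is assumed to be done on the correlation matrix, and the data are synthetic.

```r
# Second-largest eigenvalue of the correlation matrix of a set of variables.
# (Assumption: PCA is done on the correlation matrix.)
second.eigenvalue <- function(dat, vars) {
  if (length(vars) < 2) return(0)  # a single variable cannot be split
  ev <- eigen(cor(dat[, vars, drop = FALSE]), symmetric = TRUE,
              only.values = TRUE)$values
  ev[2]
}

# Index of the cluster to split next: the one with the largest 2nd eigenvalue.
# Returns NA when every cluster is already below the threshold.
pick.cluster.to.split <- function(dat, clusters, eigv2.threshold = 1) {
  eig2 <- vapply(clusters, function(v) second.eigenvalue(dat, v), numeric(1))
  best <- which.max(eig2)
  if (eig2[best] < eigv2.threshold) NA_integer_ else best
}

# Synthetic data: two independent latent factors, three noisy copies of each.
set.seed(1)
n  <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
dat <- data.frame(a1 = f1 + rnorm(n, sd = 0.3), a2 = f1 + rnorm(n, sd = 0.3),
                  a3 = f1 + rnorm(n, sd = 0.3), b1 = f2 + rnorm(n, sd = 0.3),
                  b2 = f2 + rnorm(n, sd = 0.3), b3 = f2 + rnorm(n, sd = 0.3))

# One cluster holding everything is two-dimensional, so it is selected.
first.pick <- pick.cluster.to.split(dat, list(names(dat)))

# Once the two groups are separated, nothing is left to split.
second.pick <- pick.cluster.to.split(dat, list(c("a1", "a2", "a3"),
                                               c("b1", "b2", "b3")))
print(first.pick)
print(second.pick)
```

In a full loop, the selected cluster would be divided in two and the selection repeated until `NA` is returned or the requested number of clusters is reached.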

The assignment of variables to clusters is based on the matrix of factor loadings, that is, the correlations between the original variables and the PCA factors. In practice the factor loadings matrix is not used directly; a rotated matrix is used instead, which improves separability. Details on the principle of factor rotations and the various types of rotations can be found in *Choosing the Right Type of Rotation in PCA and EFA* and *Factor Rotations in Factor Analyses*. The rotations are performed with the function *GPFoblq()* from the GPArotation package, a prerequisite for ORE varclus.
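To make the role of the rotation concrete, here is a minimal base R sketch of one division step. It is not the ORE varclus implementation: the function name `divide.cluster()` is hypothetical, the data are synthetic, and the orthogonal `varimax()` rotation from the stats package is swapped in for the oblique *GPFoblq()* rotation so that the example runs without extra packages.

```r
# Divide one cluster in two using the rotated loadings of its first two
# principal components. (Assumption: correlation-based PCA, varimax rotation.)
divide.cluster <- function(dat, vars) {
  pc <- prcomp(dat[, vars, drop = FALSE], center = TRUE, scale. = TRUE)
  # Loadings: eigenvectors scaled by the component standard deviations,
  # i.e. the correlations between the variables and the first two components.
  load2 <- pc$rotation[, 1:2] %*% diag(pc$sdev[1:2])
  rotated <- unclass(varimax(load2)$loadings)
  # Each variable goes to the rotated component it is most associated with.
  grp <- apply(abs(rotated), 1, which.max)
  list(vars[grp == 1], vars[grp == 2])
}

# Synthetic data: two independent latent factors, three noisy copies of each.
set.seed(1)
n  <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
dat <- data.frame(a1 = f1 + rnorm(n, sd = 0.3), a2 = f1 + rnorm(n, sd = 0.3),
                  a3 = f1 + rnorm(n, sd = 0.3), b1 = f2 + rnorm(n, sd = 0.3),
                  b2 = f2 + rnorm(n, sd = 0.3), b3 = f2 + rnorm(n, sd = 0.3))

parts <- divide.cluster(dat, names(dat))
print(parts)
```

In this synthetic case the two leading eigenvalues are almost equal, so the unrotated components are an arbitrary mix of the two underlying factors; the rotation is what lines each component up with one group of variables.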

The next sections will describe how to run the variable clustering algorithm and interpret the results.

## The ORE varclus scripts

The present version of ORE varclus is implemented in a function, *ore.varclus()*, to be run in embedded execution mode. The driver script example, varclus_run.R, illustrates how to call this function with ore.doEval:

    R> clust.log <- ore.doEval(FUN.NAME="ore.varclus"
                              ,data.name="MYDATA"
                              ,maxclust=200
                              ,pca="princomp"
                              ,eigv2.threshold=1.
                              ,dsname="datstr.MYDATA"
                              ,ore.connect=TRUE)

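The arguments passed to *ore.varclus()* in this call are *data.name* (the input table), *maxclust* (the maximum number of clusters to generate), *pca* (the PCA routine, here princomp), *eigv2.threshold* (the threshold on the 2nd eigenvalue below which a cluster is no longer divided) and *dsname* (the datastore in which the results are saved).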

*ore.varclus()* is implemented in the varclus_lib.R script. The script also contains examples of post-processing functions illustrating how to selectively extract results from the datastore and generate reports and plots. The current version of *ore.varclus()* supports only numerical attributes. Details on the usage of the post-processing functions are provided in the next section.

## The output

### Datastores

We illustrate the output of ORE varclus for a particular dataset (MYDATA) containing 39 numeric variables and 54k observations. *ore.varclus()* saves the history of the entire cluster generation in a datastore specified via the *dsname* argument:

      datastore.name object.count  size       creation.date description
    1  datstr.MYDATA           13 30873 2015-05-28 01:03:42        <NA>

         object.name      class size length row.count col.count
    1  Grand.Summary data.frame  562      5         6         5
    2  clusters.ncl1       list 2790      1        NA        NA
    3  clusters.ncl2       list 3301      2        NA        NA
    4  clusters.ncl3       list 3811      3        NA        NA
    5  clusters.ncl4       list 4322      4        NA        NA
    6  clusters.ncl5       list 4833      5        NA        NA
    7  clusters.ncl6       list 5344      6        NA        NA
    8   summary.ncl1       list  527      2        NA        NA
    9   summary.ncl2       list  677      2        NA        NA
    10  summary.ncl3       list  791      2        NA        NA
    11  summary.ncl4       list  922      2        NA        NA
    12  summary.ncl5       list 1069      2        NA        NA
    13  summary.ncl6       list 1232      2        NA        NA

For this dataset the algorithm generated 6 clusters after 6 iterations with a threshold *eigv2.threshold*=1.00. The datastore contains several types of objects: *clusters.nclX*, *summary.nclX* and *Grand.Summary*. The suffix *X* indicates the iteration step. For example, *clusters.ncl4* does not mean the 4th cluster; it is a list of objects (numbers and tables) related to the 4 clusters generated during the 4th iteration. *summary.ncl4* contains summarizing information about each of the 4 clusters generated during the 4th iteration. *Grand.Summary* provides the same metrics but aggregated over all clusters, per iteration. More details are provided below.

The user can load and inspect each *clusters.nclX* or *summary.nclX* individually, for example to track how variables are assigned to clusters during the iterative process. Saving the results on a per-iteration basis becomes practical when the number of starting variables runs into the hundreds and many clusters are generated.

### Text based output

*varclus_lib.R* contains a function *write.clusters.to.file()* which concatenates the information from one or several iterations and writes it as formatted text for visual inspection. In the example below, the results from the last two steps (5 and 6), specified via the *clust.steps* argument, are written to the file named via the *fout* argument.

    R> fclust <- "out.varclus.MYDATA/out.MYDATA.clusters"
    R> write.clusters.to.file(fout=fclust,
                              dsname="datstr.MYDATA",clust.steps=c(5,6))

The output now contains the information from *summary.ncl5, clusters.ncl5, summary.ncl6, clusters.ncl6*, and *Grand.Summary*, in that order. Below we show only the output corresponding to the 6th iteration, which contains the final results.

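The output starts with data collected from *summary.ncl6* and displayed as two sections, 'Clusters Summary' and 'Inter-Clusters Correlation'. The columns of 'Clusters Summary' are:

- *Cluster*: the cluster index
- *Members*: the number of variables in the cluster
- *Variation.Explained*: the variation explained by the cluster's 1st principal component
- *Proportion.Explained*: *Variation.Explained* divided by *Members*
- *Secnd.Eigenval*: the 2nd PCA eigenvalue of the cluster
- *Represent.Var*: the variable selected to represent the cluster

The 'Inter-Clusters Correlation' matrix is the correlation matrix between the scores of the data on the 1st principal component of every cluster. It measures how uncorrelated the clusters are when each is represented by its 1st principal component.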
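The inter-cluster correlation described here can be reproduced in base R along the following lines. This is a sketch under assumptions, not the code used by *ore.varclus()*: the helper names `first.pc.scores()` and `inter.cluster.correlation()` are hypothetical, the PCA is assumed to be correlation-based, and the data are synthetic.

```r
# Scores of the observations on the 1st principal component of one cluster.
first.pc.scores <- function(dat, vars) {
  prcomp(dat[, vars, drop = FALSE], center = TRUE, scale. = TRUE)$x[, 1]
}

# Correlation matrix between the 1st-component scores of all clusters.
inter.cluster.correlation <- function(dat, clusters) {
  scores <- sapply(clusters, function(v) first.pc.scores(dat, v))
  colnames(scores) <- paste0("Clust.", seq_along(clusters))
  cor(scores)
}

# Synthetic data: two independent latent factors, three noisy copies of each.
set.seed(1)
n  <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
dat <- data.frame(a1 = f1 + rnorm(n, sd = 0.3), a2 = f1 + rnorm(n, sd = 0.3),
                  a3 = f1 + rnorm(n, sd = 0.3), b1 = f2 + rnorm(n, sd = 0.3),
                  b2 = f2 + rnorm(n, sd = 0.3), b3 = f2 + rnorm(n, sd = 0.3))

ic <- inter.cluster.correlation(dat, list(c("a1", "a2", "a3"),
                                          c("b1", "b2", "b3")))
print(round(ic, 3))
```

Off-diagonal entries close to zero indicate clusters that carry largely independent information; an entry with a large absolute value flags two clusters whose representative variables may still be redundant.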

    ----------------------------------------------------------------------------------------
    Clustering step 6
    ----------------------------------------------------------------------------------------
    Clusters Summary :

      Cluster Members Variation.Explained Proportion.Explained Secnd.Eigenval Represent.Var
    1       1      13           11.522574            0.8863518   7.856187e-01         VAR25
    2       2       6            5.398123            0.8996871   3.874496e-01         VAR13
    3       3       6            5.851600            0.9752667   1.282750e-01          VAR9
    4       4       3            2.999979            0.9999929   2.112009e-05         VAR10
    5       5       5            2.390534            0.4781069   8.526650e-01         VAR27
    6       6       6            5.492897            0.9154828   4.951499e-01         VAR14

    Inter-Clusters Correlation :

                 Clust.1      Clust.2       Clust.3       Clust.4       Clust.5       Clust.6
    Clust.1  1.000000000  0.031429267  0.0915034534 -0.0045104029 -0.0341091948  0.0284033464
    Clust.2  0.031429267  1.000000000  0.0017441189 -0.0014435672 -0.0130659191  0.8048780461
    Clust.3  0.091503453  0.001744119  1.0000000000  0.0007563413 -0.0080611117 -0.0002118345
    Clust.4 -0.004510403 -0.001443567  0.0007563413  1.0000000000 -0.0008410022 -0.0022667776
    Clust.5 -0.034109195 -0.013065919 -0.0080611117 -0.0008410022  1.0000000000 -0.0107850694
    Clust.6  0.028403346  0.804878046 -0.0002118345 -0.0022667776 -0.0107850694  1.0000000000

    Cluster 1

                 Comp.1       Comp.2    r2.own     r2.next   r2.ratio var.idx
    VAR25 -0.3396562963  0.021849138 0.9711084 0.010593134 0.02920095      25
    VAR38 -0.3398365257  0.021560264 0.9710107 0.010590140 0.02929962      38
    VAR23 -0.3460431639  0.011946665 0.9689027 0.010689408 0.03143329      23
    VAR36 -0.3462378084  0.011635813 0.9688015 0.010685952 0.03153546      36
    VAR37 -0.3542777932 -0.001166427 0.9647680 0.010895771 0.03562009      37
    VAR24 -0.3543088809 -0.001225793 0.9647155 0.010898262 0.03567326      24
    VAR22 -0.3688379400 -0.026782777 0.9484384 0.011098450 0.05214028      22
    VAR35 -0.3689127408 -0.026900129 0.9484077 0.011093779 0.05217103      35
    VAR30 -0.0082726659  0.478137910 0.8723316 0.006303141 0.12847817      30
    VAR32  0.0007818601  0.489061629 0.8642301 0.006116234 0.13660543      32
    VAR31  0.0042646500  0.493099400 0.8605441 0.005992662 0.14029666      31
    VAR33  0.0076560545  0.497131056 0.8573146 0.005934929 0.14353729      33
    VAR34 -0.0802417381  0.198756967 0.3620001 0.007534643 0.64284346      34

    Cluster 2

               Comp.1      Comp.2    r2.own   r2.next  r2.ratio var.idx
    VAR13 -0.50390550 -0.03826113 0.9510065 0.6838419 0.1549652      13
    VAR3  -0.50384385 -0.03814382 0.9509912 0.6838322 0.1550089       3
    VAR18 -0.52832332 -0.09384185 0.9394948 0.6750884 0.1862204      18
    VAR11 -0.31655455  0.33594147 0.9387738 0.5500716 0.1360798      11
    VAR16 -0.34554284  0.26587848 0.9174539 0.5351907 0.1775913      16
    VAR39 -0.02733522 -0.90110241 0.7004025 0.3805168 0.4836249      39

    Cluster 3

                 Comp.1       Comp.2    r2.own      r2.next    r2.ratio var.idx
    VAR9  -4.436290e-01  0.010645774 0.9944599 0.0111098555 0.005602316       9
    VAR8  -4.440656e-01  0.009606151 0.9944375 0.0113484256 0.005626315       8
    VAR7  -4.355970e-01  0.028881014 0.9931890 0.0110602004 0.006887179       7
    VAR6  -4.544373e-01 -0.016395561 0.9914545 0.0114996393 0.008644956       6
    VAR21 -4.579777e-01 -0.027336302 0.9865562 0.0004552779 0.013449888      21
    VAR5   1.566362e-06  0.998972842 0.8915032 0.0093737140 0.109523464       5

    Cluster 4

                Comp.1        Comp.2    r2.own      r2.next     r2.ratio var.idx
    VAR10 7.067763e-01  0.0004592019 0.9999964 1.899033e-05 3.585911e-06      10
    VAR1  7.074371e-01 -0.0004753728 0.9999964 1.838949e-05 3.605506e-06       1
    VAR15 2.093320e-11  0.9999997816 0.9999859 2.350467e-05 1.408043e-05      15

    Cluster 5

                Comp.1       Comp.2    r2.own      r2.next  r2.ratio var.idx
    VAR27 -0.556396037 -0.031563215 0.6199740 0.0001684573 0.3800900      27
    VAR29 -0.532122723 -0.041330455 0.5586173 0.0001938785 0.4414683      29
    VAR28 -0.506440510 -0.002599593 0.5327290 0.0001494172 0.4673408      28
    VAR26 -0.389716922  0.198849850 0.4396647 0.0001887849 0.5604411      26
    VAR20  0.003446542  0.979209797 0.2395493 0.0076757755 0.7663329      20

    Cluster 6

                 Comp.1        Comp.2    r2.own   r2.next  r2.ratio var.idx
    VAR14 -0.0007028647  0.5771114183 0.9164991 0.7063442 0.2843495      14
    VAR4  -0.0007144334  0.5770967589 0.9164893 0.7063325 0.2843714       4
    VAR12 -0.5779762250 -0.0004781436 0.9164238 0.4914497 0.1643420      12
    VAR2  -0.5779925997 -0.0004993306 0.9164086 0.4914361 0.1643676       2
    VAR17 -0.5760772611  0.0009732350 0.9150015 0.4900150 0.1666686      17
    VAR19  0.0014223072  0.5778410825 0.9120741 0.7019736 0.2950272      19

    ---------------------------------------------------------------------------------------
    Grand Summary
    ---------------------------------------------------------------------------------------

      Nb.of.Clusters Tot.Var.Explained Prop.Var.Explained Min.Prop.Explained Max.2nd.Eigval
    1              1          11.79856          0.3025272          0.3025272       9.787173
    2              2          21.47617          0.5506711          0.4309593       5.778829
    3              3          27.22407          0.6980530          0.5491522       2.999950
    4              4          30.22396          0.7749735          0.6406729       2.389400
    5              5          32.60496          0.8360246          0.4781069       1.205769
    6              6          33.65571          0.8629668          0.4781069       0.852665

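The sections 'Cluster 1' ... 'Cluster 6' contain results collected from the *clusters.ncl6* list in the datastore. Each cluster is described by a table where the rows are the variables and the columns correspond to:

- *Comp.1*, *Comp.2*: the coefficients of the variable on the first two components of the cluster
- *r2.own*: the squared correlation of the variable with the component of its own cluster
- *r2.next*: the squared correlation of the variable with the component of the nearest other cluster
- *r2.ratio*: the ratio (1 - *r2.own*) / (1 - *r2.next*); small values indicate a variable well captured by its own cluster and well separated from the others
- *var.idx*: the index of the variable in the original dataset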

For example, from 'Clusters Summary', the first cluster (index 1) has 13 variables and is best represented by variable VAR25 which, on inspection of the 'Cluster 1' section, shows the highest *r2.own* = 0.9711084.
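The *r2.ratio* values printed in the cluster tables are consistent with the ratio (1 - *r2.own*) / (1 - *r2.next*). The short check below recomputes it for three rows copied from the 'Cluster 1' and 'Cluster 2' tables; the helper name `r2.ratio()` is hypothetical and the formula is inferred from the printed numbers rather than taken from the *ore.varclus()* source.

```r
# Ratio of unexplained variation: own cluster versus nearest other cluster.
r2.ratio <- function(r2.own, r2.next) (1 - r2.own) / (1 - r2.next)

# Rows copied from the 'Cluster 1' (VAR25, VAR34) and 'Cluster 2' (VAR13) tables.
tab <- data.frame(var      = c("VAR25", "VAR34", "VAR13"),
                  r2.own   = c(0.9711084, 0.3620001, 0.9510065),
                  r2.next  = c(0.010593134, 0.007534643, 0.6838419),
                  r2.ratio = c(0.02920095, 0.64284346, 0.1549652))

tab$recomputed <- r2.ratio(tab$r2.own, tab$r2.next)
print(tab)
```

VAR34 illustrates the other end of the scale: with *r2.own* = 0.362 it is only loosely tied to Cluster 1, which shows up as the largest *r2.ratio* in that cluster.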

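The section 'Grand Summary' displays the results from the *Grand.Summary* table in the datastore. The rows correspond to the clustering iterations and the columns are defined as:

- *Nb.of.Clusters*: the number of clusters at that iteration
- *Tot.Var.Explained*: the sum of *Variation.Explained* over all clusters
- *Prop.Var.Explained*: *Tot.Var.Explained* divided by the total number of variables
- *Min.Prop.Explained*: the smallest *Proportion.Explained* among the clusters
- *Max.2nd.Eigval*: the largest 2nd eigenvalue among the clusters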

For example, for the final clusters (*Nb.of.Clusters* = 6), *Min.Prop.Explained* is 0.4781069. This corresponds to Cluster 5 (see the *Proportion.Explained* value in 'Clusters Summary'). It means that the variation in Cluster 5 is poorly captured by the first principal component (only 47.8%).
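The last row of 'Grand Summary' can be recomputed from the 'Clusters Summary' table of step 6, which makes the relation between the two explicit. The snippet below is a plain-arithmetic sketch using the numbers printed above; the aggregation rules are inferred from those numbers, not read from the *ore.varclus()* code.

```r
# 'Clusters Summary' for step 6, copied from the text output.
clusters.summary <- data.frame(
  Cluster             = 1:6,
  Members             = c(13, 6, 6, 3, 5, 6),
  Variation.Explained = c(11.522574, 5.398123, 5.851600,
                          2.999979, 2.390534, 5.492897),
  Secnd.Eigenval      = c(7.856187e-01, 3.874496e-01, 1.282750e-01,
                          2.112009e-05, 8.526650e-01, 4.951499e-01))

# Per-cluster proportion of variation explained by the 1st component.
prop <- clusters.summary$Variation.Explained / clusters.summary$Members

# Aggregate over clusters to obtain one 'Grand Summary' row.
grand <- data.frame(
  Nb.of.Clusters     = nrow(clusters.summary),
  Tot.Var.Explained  = sum(clusters.summary$Variation.Explained),
  Prop.Var.Explained = sum(clusters.summary$Variation.Explained) /
                       sum(clusters.summary$Members),
  Min.Prop.Explained = min(prop),
  Max.2nd.Eigval     = max(clusters.summary$Secnd.Eigenval))
print(grand)
```

The result agrees with row 6 of 'Grand Summary' up to rounding, and `which.min(prop)` points at Cluster 5, the cluster responsible for *Min.Prop.Explained*.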

As previously indicated, the representative variables, one per final cluster, are collected in the *Represent.Var* column from the 'Clusters Summary' section in the output text file. They can be retrieved from the *summary.ncl6* object in the datastore as shown below:

    R> datstr.name <- "datstr.MYDATA"
    R> ore.load(list=c("summary.ncl6"),name=datstr.name)
    [1] "summary.ncl6"
    R> names(summary.ncl6)
    [1] "clusters.summary"      "inter.clusters.correl"
    R> names(summary.ncl6$clusters.summary)
    [1] "Cluster"              "Members"              "Variation.Explained"  "Proportion.Explained" "Secnd.Eigenval"
    [6] "Represent.Var"
    R> summary.ncl6$clusters.summary$Represent.Var
    [1] "VAR25" "VAR13" "VAR9"  "VAR10" "VAR27" "VAR14"

In our next post we'll look at plots, performance and future developments for ORE varclus.