
Best practices, news, tips and tricks - learn about Oracle's R Technologies for Oracle Database and Big Data

Variable Selection with ORE varclus - Part 1

Guest Author

Variable selection, also known as feature or attribute selection, is an important technique for data mining and predictive analytics. It is used when the number of variables is large, and it has received special attention in application areas where this number is very large (genomics, combinatorial chemistry, text mining, etc.). The underlying hypothesis is that the data can contain many variables that are either irrelevant or redundant. Solutions are therefore sought for selecting subsets of these variables that can predict the output with an accuracy comparable to that of the complete input set.
Variable selection serves multiple purposes: (1) it makes model generation faster and more cost-effective; (2) it simplifies model interpretation, since the model is based on a (much) smaller and more effective set of predictors; (3) it supports better generalization, because eliminating irrelevant features can reduce model over-fitting.
There are many approaches to feature selection, differentiated by search techniques, validation methods or optimality considerations. In this post we describe a solution based on hierarchical, divisive variable clustering, which generates disjoint groups of variables such that each group can be interpreted as essentially uni-dimensional and represented by a single variable from the original set.
This solution was developed and implemented during a proof of concept (POC) with a customer from the banking sector. The data consisted of tables with several hundred variables and on the order of 1e5 to 1e6 observations. The customer wanted to build an analysis flow operating on a much smaller number of 'relevant' attributes, taken from the original set, that would best capture the variability expressed in the data.
The procedure is iterative and starts from a single cluster containing all the original variables. This cluster is divided into two clusters, and each variable is assigned to one of the two child clusters. At every iteration one particular cluster is selected for division, and the procedure continues until there are no more suitable candidates for division or until a user-specified number n of clusters has been generated (and n representative variables identified).
The selection criterion for division is related to the variation contained in the candidate cluster, more precisely to how this variation is distributed among its principal components. PCA is performed on the initial (starting) cluster and on every cluster resulting from a division. A large 2nd eigenvalue means that the variation is distributed over at least two principal axes (components). We do not look beyond the 2nd eigenvalue: the cluster's variables are divided into two groups depending on how they are associated with the first two axes of variability. The division process continues until every cluster has variables associated with only one principal component, i.e. until every cluster has a 2nd PCA eigenvalue below a specified threshold. During the iterative process, the cluster picked for splitting is the one with the largest 2nd eigenvalue.
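The selection rule can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code of ore.varclus(): the helper names second.eigenvalue() and pick.cluster.to.split(), the representation of clusters as a list of column-name vectors, and the toy data are all hypothetical.

```r
# Second eigenvalue of the correlation matrix of a set of variables.
second.eigenvalue <- function(X) {
  if (ncol(X) < 2) return(0)  # a single variable has no second component
  eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values[2]
}

# Index of the cluster to split next, or NA if no cluster reaches the threshold.
# `clusters` is a list of character vectors of column names (hypothetical layout).
pick.cluster.to.split <- function(data, clusters, threshold = 1) {
  eig2 <- vapply(clusters,
                 function(vars) second.eigenvalue(data[, vars, drop = FALSE]),
                 numeric(1))
  best <- unname(which.max(eig2))
  if (eig2[best] < threshold) NA_integer_ else best
}

# Toy data: a1..a3 share one latent factor; b1,b2 and c1,c2 follow two others.
set.seed(1)
n  <- 500
f1 <- rnorm(n); f2 <- rnorm(n); f3 <- rnorm(n)
noisy <- function(f) f + 0.1 * rnorm(n)
toy <- data.frame(a1 = noisy(f1), a2 = noisy(f1), a3 = noisy(f1),
                  b1 = noisy(f2), b2 = noisy(f2),
                  c1 = noisy(f3), c2 = noisy(f3))

clusters <- list(c("a1", "a2", "a3"), c("b1", "b2", "c1", "c2"))
pick.cluster.to.split(toy, clusters)
```

The first toy cluster is essentially uni-dimensional, so its 2nd eigenvalue is close to zero. The second mixes two independent factors, so it is the one selected for division.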
The assignment of variables to clusters is based on the matrix of factor loadings, i.e. the correlations between the original variables and the PCA factors. The factor loadings matrix is not used directly; a rotated matrix, which improves separability, is used instead. Details on the principle of factor rotations and the various types of rotations can be found in 'Choosing the Right Type of Rotation in PCA and EFA' and 'Factor Rotations in Factor Analyses'.
The rotations are performed with the function GPFoblq() from the GPArotation package, a pre-requisite for ORE varclus.
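To make the idea of assigning variables from rotated loadings concrete, here is a minimal sketch in base R. It is not the ore.varclus() implementation: base R's promax() (an oblique rotation from the stats package) stands in for GPFoblq(), the unit-length eigenvectors of the correlation matrix are used as the loadings, and the rule "assign each variable to the component with the largest absolute rotated loading" is an assumption made for illustration. The toy data are hypothetical.

```r
# Toy data: two correlated latent factors, four variables on the first
# and three on the second.
set.seed(42)
n  <- 500
f1 <- rnorm(n)
f2 <- 0.4 * f1 + sqrt(1 - 0.4^2) * rnorm(n)
noisy <- function(f) f + 0.2 * rnorm(n)
X <- cbind(a1 = noisy(f1), a2 = noisy(f1), a3 = noisy(f1), a4 = noisy(f1),
           b1 = noisy(f2), b2 = noisy(f2), b3 = noisy(f2))

# Loadings on the first two principal components of the correlation matrix.
eig <- eigen(cor(X), symmetric = TRUE)
raw <- eig$vectors[, 1:2]
dimnames(raw) <- list(colnames(X), c("Comp.1", "Comp.2"))

# Oblique rotation (stand-in for GPFoblq), then assign each variable to the
# rotated component on which it loads most strongly.
rotated <- unclass(promax(raw)$loadings)
group   <- apply(abs(rotated), 1, which.max)
split(names(group), group)
```

Before rotation, most variables load on both components because the first component acts as a general factor. After rotation each variable loads mainly on one axis, which is what makes the two-way split unambiguous.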

The next sections will describe how to run the variable clustering algorithm and interpret the results.

The ORE varclus scripts


The present version of ORE varclus is implemented in a function, ore.varclus(), to be run in embedded execution mode. The driver script example, varclus_run.R, illustrates how to call this function with ore.doEval:
R> clust.log <- ore.doEval(FUN.NAME="ore.varclus"
                 ,data.name="MYDATA"
                 ,maxclust=200
                 ,pca="princomp"
                 ,eigv2.threshold=1.
                 ,dsname="datstr.MYDATA"
                 ,ore.connect=TRUE)

The arguments passed to ore.varclus() are :

ore.varclus() is implemented in the varclus_lib.R script. The script also contains examples of post-processing functions illustrating how to selectively extract results from the datastore and generate reports and plots. The current version of ore.varclus() supports only numerical attributes. Details on the usage of the post-processing functions are provided in the next section.

The output

Datastores

We illustrate the output of ORE varclus for a particular dataset (MYDATA) containing 39 numeric variables and 54k observations. ore.varclus() saves the history of the entire cluster generation in a datastore specified via the dsname argument:
  datastore.name object.count  size       creation.date description
1 datstr.MYDATA            13 30873 2015-05-28 01:03:42        <NA>

     object.name      class size length row.count col.count
1  Grand.Summary data.frame  562      5         6         5
2  clusters.ncl1       list 2790      1        NA        NA
3  clusters.ncl2       list 3301      2        NA        NA
4  clusters.ncl3       list 3811      3        NA        NA
5  clusters.ncl4       list 4322      4        NA        NA
6  clusters.ncl5       list 4833      5        NA        NA
7  clusters.ncl6       list 5344      6        NA        NA
8   summary.ncl1       list  527      2        NA        NA
9   summary.ncl2       list  677      2        NA        NA
10  summary.ncl3       list  791      2        NA        NA
11  summary.ncl4       list  922      2        NA        NA
12  summary.ncl5       list 1069      2        NA        NA
13  summary.ncl6       list 1232      2        NA        NA    
For this dataset the algorithm generated 6 clusters after 6 iterations with a 2nd-eigenvalue threshold of 1.00 (eigv2.threshold=1.00). The datastore contains several types of objects: clusters.nclX, summary.nclX and Grand.Summary. The suffix X indicates the iteration step. For example, clusters.ncl4 does not mean the 4th cluster; it is a list of objects (numbers and tables) related to the 4 clusters generated during the 4th iteration. summary.ncl4 contains summary information about each of the 4 clusters generated during the 4th iteration. Grand.Summary provides the same metrics, but aggregated over all clusters for each iteration. More details are provided below.
The user can load and inspect each clusters.nclX or summary.nclX individually, for example to track how variables are assigned to clusters during the iterative process. Saving the results on a per-iteration basis becomes practical when the number of starting variables runs into the several hundreds and many clusters are generated.

Text based output

varclus_lib.R contains a function write.clusters.to.file() which concatenates all the information from one or several iterations and dumps it as formatted text for visual inspection. In the example below the results from the last two steps (5 and 6), specified via the clust.steps argument, are written to the file named via the fout argument.
R> fclust <- "out.varclus.MYDATA/out.MYDATA.clusters"
R> write.clusters.to.file(fout=fclust,
                          dsname="datstr.MYDATA",clust.steps=c(5,6))
The output now contains the information from summary.ncl5, clusters.ncl5, summary.ncl6, clusters.ncl6, and Grand.Summary, in that order. Below we show only the output corresponding to the 6th iteration, which contains the final results.

The output starts with data collected from summary.ncl6 and displayed as two sections 'Clusters Summary' and 'Inter-Clusters Correlation'. The columns of  'Clusters Summary' are:

The 'Inter-Clusters Correlation' matrix is the correlation matrix between the scores of the data on the 1st principal component of every cluster. It measures how uncorrelated the clusters are when each is represented by its 1st principal component.
----------------------------------------------------------------------------------------
Clustering step 6
----------------------------------------------------------------------------------------
Clusters Summary :

  Cluster Members Variation.Explained Proportion.Explained Secnd.Eigenval Represent.Var
1       1      13           11.522574            0.8863518   7.856187e-01         VAR25
2       2       6            5.398123            0.8996871   3.874496e-01         VAR13
3       3       6            5.851600            0.9752667   1.282750e-01          VAR9
4       4       3            2.999979            0.9999929   2.112009e-05         VAR10
5       5       5            2.390534            0.4781069   8.526650e-01         VAR27
6       6       6            5.492897            0.9154828   4.951499e-01         VAR14

Inter-Clusters Correlation :

             Clust.1      Clust.2       Clust.3       Clust.4       Clust.5       Clust.6
Clust.1  1.000000000  0.031429267  0.0915034534 -0.0045104029 -0.0341091948  0.0284033464
Clust.2  0.031429267  1.000000000  0.0017441189 -0.0014435672 -0.0130659191  0.8048780461
Clust.3  0.091503453  0.001744119  1.0000000000  0.0007563413 -0.0080611117 -0.0002118345
Clust.4 -0.004510403 -0.001443567  0.0007563413  1.0000000000 -0.0008410022 -0.0022667776
Clust.5 -0.034109195 -0.013065919 -0.0080611117 -0.0008410022  1.0000000000 -0.0107850694
Clust.6  0.028403346  0.804878046 -0.0002118345 -0.0022667776 -0.0107850694  1.0000000000

Cluster 1
             Comp.1       Comp.2    r2.own     r2.next   r2.ratio var.idx
VAR25 -0.3396562963  0.021849138 0.9711084 0.010593134 0.02920095      25
VAR38 -0.3398365257  0.021560264 0.9710107 0.010590140 0.02929962      38
VAR23 -0.3460431639  0.011946665 0.9689027 0.010689408 0.03143329      23
VAR36 -0.3462378084  0.011635813 0.9688015 0.010685952 0.03153546      36
VAR37 -0.3542777932 -0.001166427 0.9647680 0.010895771 0.03562009      37
VAR24 -0.3543088809 -0.001225793 0.9647155 0.010898262 0.03567326      24
VAR22 -0.3688379400 -0.026782777 0.9484384 0.011098450 0.05214028      22
VAR35 -0.3689127408 -0.026900129 0.9484077 0.011093779 0.05217103      35
VAR30 -0.0082726659  0.478137910 0.8723316 0.006303141 0.12847817      30
VAR32  0.0007818601  0.489061629 0.8642301 0.006116234 0.13660543      32
VAR31  0.0042646500  0.493099400 0.8605441 0.005992662 0.14029666      31
VAR33  0.0076560545  0.497131056 0.8573146 0.005934929 0.14353729      33
VAR34 -0.0802417381  0.198756967 0.3620001 0.007534643 0.64284346      34

Cluster 2
           Comp.1      Comp.2    r2.own   r2.next  r2.ratio var.idx
VAR13 -0.50390550 -0.03826113 0.9510065 0.6838419 0.1549652      13
VAR3  -0.50384385 -0.03814382 0.9509912 0.6838322 0.1550089       3
VAR18 -0.52832332 -0.09384185 0.9394948 0.6750884 0.1862204      18
VAR11 -0.31655455  0.33594147 0.9387738 0.5500716 0.1360798      11
VAR16 -0.34554284  0.26587848 0.9174539 0.5351907 0.1775913      16
VAR39 -0.02733522 -0.90110241 0.7004025 0.3805168 0.4836249      39

Cluster 3
             Comp.1       Comp.2    r2.own      r2.next    r2.ratio var.idx
VAR9  -4.436290e-01  0.010645774 0.9944599 0.0111098555 0.005602316       9
VAR8  -4.440656e-01  0.009606151 0.9944375 0.0113484256 0.005626315       8
VAR7  -4.355970e-01  0.028881014 0.9931890 0.0110602004 0.006887179       7
VAR6  -4.544373e-01 -0.016395561 0.9914545 0.0114996393 0.008644956       6
VAR21 -4.579777e-01 -0.027336302 0.9865562 0.0004552779 0.013449888      21
VAR5   1.566362e-06  0.998972842 0.8915032 0.0093737140 0.109523464       5

Cluster 4
            Comp.1        Comp.2    r2.own      r2.next     r2.ratio var.idx
VAR10 7.067763e-01  0.0004592019 0.9999964 1.899033e-05 3.585911e-06      10
VAR1  7.074371e-01 -0.0004753728 0.9999964 1.838949e-05 3.605506e-06       1
VAR15 2.093320e-11  0.9999997816 0.9999859 2.350467e-05 1.408043e-05      15

Cluster 5
            Comp.1       Comp.2    r2.own      r2.next  r2.ratio var.idx
VAR27 -0.556396037 -0.031563215 0.6199740 0.0001684573 0.3800900      27
VAR29 -0.532122723 -0.041330455 0.5586173 0.0001938785 0.4414683      29
VAR28 -0.506440510 -0.002599593 0.5327290 0.0001494172 0.4673408      28
VAR26 -0.389716922  0.198849850 0.4396647 0.0001887849 0.5604411      26
VAR20  0.003446542  0.979209797 0.2395493 0.0076757755 0.7663329      20

Cluster 6
             Comp.1        Comp.2    r2.own   r2.next  r2.ratio var.idx
VAR14 -0.0007028647  0.5771114183 0.9164991 0.7063442 0.2843495      14
VAR4  -0.0007144334  0.5770967589 0.9164893 0.7063325 0.2843714       4
VAR12 -0.5779762250 -0.0004781436 0.9164238 0.4914497 0.1643420      12
VAR2  -0.5779925997 -0.0004993306 0.9164086 0.4914361 0.1643676       2
VAR17 -0.5760772611  0.0009732350 0.9150015 0.4900150 0.1666686      17
VAR19  0.0014223072  0.5778410825 0.9120741 0.7019736 0.2950272      19

---------------------------------------------------------------------------------------
Grand Summary
---------------------------------------------------------------------------------------
  Nb.of.Clusters Tot.Var.Explained Prop.Var.Explained Min.Prop.Explained Max.2nd.Eigval
1              1          11.79856          0.3025272          0.3025272       9.787173
2              2          21.47617          0.5506711          0.4309593       5.778829
3              3          27.22407          0.6980530          0.5491522       2.999950
4              4          30.22396          0.7749735          0.6406729       2.389400
5              5          32.60496          0.8360246          0.4781069       1.205769
6              6          33.65571          0.8629668          0.4781069       0.852665

The sections 'Cluster 1' ... 'Cluster 6' contain results collected from the clusters.ncl6 list from the datastore. Each cluster is described by a table where the rows are the variables and the columns correspond to:


For example, from 'Clusters Summary', the first cluster (index 1) has 13 variables and is best represented by variable VAR25 which, as can be seen in the 'Cluster 1' section, has the highest r2.own = 0.9711084.
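The r2 columns can be cross-checked with a short calculation. The sketch below assumes that r2.ratio follows the usual variable-clustering convention (1 - r2.own) / (1 - r2.next). This is an assumption, but it reproduces the printed values for the three 'Cluster 1' rows re-entered here.

```r
# Three rows copied from the 'Cluster 1' table above.
cluster1 <- data.frame(
  variable = c("VAR25", "VAR38", "VAR34"),
  r2.own   = c(0.9711084, 0.9710107, 0.3620001),
  r2.next  = c(0.010593134, 0.010590140, 0.007534643),
  r2.ratio = c(0.02920095, 0.02929962, 0.64284346),
  stringsAsFactors = FALSE)

# Assumed definition of the ratio; compare with the printed column.
recomputed <- (1 - cluster1$r2.own) / (1 - cluster1$r2.next)
cbind(cluster1, recomputed)

# Among these rows, the variable with the highest r2.own.
cluster1$variable[which.max(cluster1$r2.own)]
```

A small r2.ratio means the variable is well explained by its own cluster and poorly explained by the nearest other cluster. In this table the variable with the highest r2.own, VAR25, also has the smallest r2.ratio.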

The section 'Grand Summary' displays the results from the Grand.Summary table in the datastore. The rows correspond to the clustering iterations and the columns are defined as:

For example, for the final clusters (Nb.of.Clusters = 6), Min.Prop.Explained is 0.4781069. This corresponds to Cluster 5 (see its Proportion.Explained value in 'Clusters Summary'). It means that the variation in Cluster 5 is poorly captured by the first principal component (only 47.8%).
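The last row of 'Grand Summary' can be rebuilt from the 'Clusters Summary' table of the 6th iteration. The sketch below relies on relationships inferred from the printed numbers rather than from a stated definition. It assumes the total is the sum of Variation.Explained, the overall proportion is that total divided by the number of variables (39), and the remaining two columns are the minimum Proportion.Explained and the maximum Secnd.Eigenval.

```r
# 'Clusters Summary' of clustering step 6, copied from the output above.
clusters.summary <- data.frame(
  Cluster              = 1:6,
  Members              = c(13, 6, 6, 3, 5, 6),
  Variation.Explained  = c(11.522574, 5.398123, 5.851600, 2.999979, 2.390534, 5.492897),
  Proportion.Explained = c(0.8863518, 0.8996871, 0.9752667, 0.9999929, 0.4781069, 0.9154828),
  Secnd.Eigenval       = c(7.856187e-01, 3.874496e-01, 1.282750e-01,
                           2.112009e-05, 8.526650e-01, 4.951499e-01))

# Per cluster: proportion explained = variation explained / number of members.
per.cluster <- with(clusters.summary, Variation.Explained / Members)

# Aggregates corresponding to the Nb.of.Clusters = 6 row of 'Grand Summary'.
grand <- with(clusters.summary, data.frame(
  Nb.of.Clusters     = length(Cluster),
  Tot.Var.Explained  = sum(Variation.Explained),
  Prop.Var.Explained = sum(Variation.Explained) / sum(Members),
  Min.Prop.Explained = min(Proportion.Explained),
  Max.2nd.Eigval     = max(Secnd.Eigenval)))
grand
```

The recomputed values agree with the printed row (33.65571, 0.8629668, 0.4781069, 0.852665) up to rounding. The same calculation also identifies Cluster 5 as the source of both the minimum proportion and the maximum 2nd eigenvalue.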

As previously indicated, the representative variables, one per final cluster, are collected in the Represent.Var column from the 'Clusters Summary' section in the output text file. They can be retrieved from the summary.ncl6 object in the datastore as shown below:
R> ore.load(list=c("summary.ncl6"),name="datstr.MYDATA")
[1] "summary.ncl6"
R> names(summary.ncl6)
[1] "clusters.summary"      "inter.clusters.correl"
R> names(summary.ncl6$clusters.summary)
[1] "Cluster"  "Members"  "Variation.Explained"  "Proportion.Explained" "Secnd.Eigenval"     
[6] "Represent.Var"      
R> summary.ncl6$clusters.summary$Represent.Var
[1] "VAR25" "VAR13" "VAR9"  "VAR10" "VAR27" "VAR14"
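Once the representative variables are known, reducing a table to those columns is a plain column selection. The sketch below uses a small in-memory data frame with hypothetical random values in place of MYDATA, so it runs without a database connection.

```r
# Representative variables reported for the 6 final clusters.
rep.vars <- c("VAR25", "VAR13", "VAR9", "VAR10", "VAR27", "VAR14")

# Stand-in for MYDATA: 5 rows of random values for VAR1..VAR39 (hypothetical).
set.seed(7)
toy <- as.data.frame(matrix(rnorm(5 * 39), nrow = 5,
                            dimnames = list(NULL, paste0("VAR", 1:39))))

# Keep only the representative columns, in the order reported above.
reduced <- toy[, rep.vars, drop = FALSE]
dim(reduced)
```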

In our next post we'll look at plots, performance and future developments for ORE varclus.

