Tuesday Jun 30, 2015

R Consortium Launched!

The Linux Foundation announces the R Consortium to support R users globally.

The R Consortium works with and provides support to the R Foundation and other organizations developing, maintaining and distributing R software and provides a unifying framework for the R user community.

“Data science is pushing the boundaries of what is possible in business, science, and technology, where the R language and ecosystem is a major enabling force,” said Neil Mendelson, Vice President, Big Data and Advanced Analytics, Oracle. “The R Consortium is an important enabling body to support and help grow the R user community, which increasingly includes enterprise data scientists.”

R is a key enabling technology for data science as evidenced by its dramatic rise in adoption over the past several years. We look forward to contributing to R's continued success through the R Consortium.

Thursday Jun 04, 2015

Variable Selection with ORE varclus - Part 1



Variable selection, also known as feature or attribute selection, is an important technique for data mining and predictive analytics. It is used when the number of variables is large and has received special attention in application areas where this number is very large (genomics, combinatorial chemistry, text mining, etc.). The underlying hypothesis for variable selection is that the data can contain many variables that are either irrelevant or redundant. Solutions are therefore sought for selecting subsets of these variables that can predict the output with an accuracy comparable to that of the complete input set.

Variable selection serves multiple purposes: (1) it provides faster and more cost-effective model generation; (2) it simplifies model interpretation, because the model is based on a (much) smaller and more effective set of predictors; (3) it supports better generalization, because eliminating irrelevant features can reduce model over-fitting.

There are many approaches for feature selection differentiated by search techniques, validation methods or optimality considerations. In this blog we will describe a solution based on hierarchical and divisive variable clustering which generates disjoint groups of variables such that each group can be interpreted essentially as uni-dimensional and represented by a single variable from the original set.
This solution was developed and implemented during a POC with a customer from the banking sector. The data consisted of tables with several hundred variables and O[1e5-1e6] observations. The customer wanted to build an analysis flow operating with a much smaller number of 'relevant' attributes, from the original set, which would best capture the variability expressed in the data.

The procedure is iterative and starts from a single cluster containing all original variables. This cluster is divided into two clusters, and each variable is assigned to one of the two child clusters. At every iteration one particular cluster is selected for division. The procedure continues until there are no more suitable candidates for division, or until the user-specified maximum of n clusters has been generated (and n representative variables have been identified).

The selection criterion for division is related to the variation contained in the candidate cluster, more precisely to how this variation is distributed among its principal components. PCA is performed on the initial (starting) cluster and on every cluster resulting from a division. If the 2nd eigenvalue is large, the variation is distributed between at least two principal axes or components. We do not look beyond the 2nd eigenvalue: the cluster's variables are divided into two groups depending on how they are associated with the first two axes of variability. The division process continues until every cluster has variables associated with only one principal component, i.e. until every cluster has a 2nd PCA eigenvalue less than a specified threshold. During the iterative process, the cluster picked for splitting is the one with the largest 2nd eigenvalue.
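The splitting rule can be illustrated with a few lines of base R. This is only a sketch under stated assumptions, not the ore.varclus() source: the data are simulated, the helper second.eigenvalue() and the cluster names are hypothetical, and PCA is assumed to be done on the correlation matrix.

```r
# Illustrative sketch only; this is not the ore.varclus() source.
set.seed(1)
n  <- 500
f1 <- rnorm(n)                      # two independent latent factors
f2 <- rnorm(n)
X <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.3), a2 = f1 + rnorm(n, sd = 0.3),
  a3 = f1 + rnorm(n, sd = 0.3), a4 = f1 + rnorm(n, sd = 0.3),
  b1 = f2 + rnorm(n, sd = 0.3), b2 = f2 + rnorm(n, sd = 0.3)
)

# 2nd eigenvalue of the correlation matrix of a group of variables
# (hypothetical helper; a single variable cannot be split further)
second.eigenvalue <- function(vars, data) {
  if (length(vars) < 2) return(0)
  eigen(cor(data[, vars, drop = FALSE]), symmetric = TRUE,
        only.values = TRUE)$values[2]
}

# A hypothetical intermediate partition: one cluster mixing both
# factors and one cluster driven by a single factor
clusters <- list(mixed = c("a1", "a2", "b1", "b2"),
                 pure  = c("a3", "a4"))

eig2 <- sapply(clusters, second.eigenvalue, data = X)
eigv2.threshold <- 1

# Split the cluster with the largest 2nd eigenvalue, but only if that
# eigenvalue exceeds the threshold; otherwise stop
to.split <- if (max(eig2) > eigv2.threshold) {
  names(which.max(eig2))
} else {
  NA_character_
}

print(round(eig2, 3))
print(to.split)
```

Here the "mixed" cluster carries two directions of variation, so its 2nd eigenvalue is well above the threshold and it is the one selected for division.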

The assignment of variables to clusters is based on the matrix of factor loadings, i.e. the correlations between the original variables and the PCA factors. In practice the factor loadings matrix is not used directly; a rotated matrix, which improves separability, is used instead. Details on the principle of factor rotations and the various types of rotations can be found in "Choosing the Right Type of Rotation in PCA and EFA" and "Factor Rotations in Factor Analyses".
The rotations are performed with the function GPFoblq() from the GPArotation package, a prerequisite for ORE varclus.
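A minimal sketch of the assignment step follows. To stay dependency-free it swaps in base R's varimax() (an orthogonal rotation) for the oblique GPFoblq() rotation that ORE varclus actually uses, so it illustrates the idea rather than reproducing the implementation. The simulated data and the rule "assign each variable to the rotated component with the larger squared loading" are assumptions made for illustration.

```r
# Illustrative sketch only; varimax() stands in for GPArotation::GPFoblq().
set.seed(2)
n  <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.3), a2 = f1 + rnorm(n, sd = 0.3),
  a3 = f1 + rnorm(n, sd = 0.3),
  b1 = f2 + rnorm(n, sd = 0.3), b2 = f2 + rnorm(n, sd = 0.3)
)

# Loadings on the first two principal components of the correlation
# matrix: eigenvectors scaled by the square roots of the eigenvalues
e <- eigen(cor(X), symmetric = TRUE)
L <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))
rownames(L) <- names(X)

# Rotate the loadings to make the two groups easier to separate
R <- unclass(varimax(L)$loadings)

# Assign each variable to the rotated component it loads on most strongly
group <- apply(R^2, 1, which.max)
print(split(names(group), group))
```

With the real function, the rotated matrix would instead be taken from the loadings component of the object returned by GPFoblq().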

The next sections will describe how to run the variable clustering algorithm and interpret the results.

The ORE varclus scripts


The present version of ORE varclus is implemented in a function, ore.varclus(), to be run in embedded execution mode. The driver script example, varclus_run.R, illustrates how to call this function with ore.doEval:

R> clust.log <- ore.doEval(FUN.NAME="ore.varclus"
                 ,data.name="MYDATA"
                 ,maxclust=200
                 ,pca="princomp"
                 ,eigv2.threshold=1.
                 ,dsname="datstr.MYDATA"
                 ,ore.connect=TRUE)

The arguments passed to ore.varclus() are :


ore.varclus() is implemented in the varclus_lib.R script. The script also contains examples of post-processing functions illustrating how to selectively extract results from the datastore and generate reports and plots. The current version of ore.varclus() supports only numerical attributes. Details on the usage of the post-processing functions are provided in the next section.

The output


Datastores

We illustrate the output of ORE varclus for a particular dataset (MYDATA) containing 39 numeric variables and 54k observations. ore.varclus() saves the history of the entire cluster generation in a datastore specified via the dsname argument:

  datastore.name object.count  size       creation.date description
1 datstr.MYDATA            13 30873 2015-05-28 01:03:42        <NA>

     object.name      class size length row.count col.count
1  Grand.Summary data.frame  562      5         6         5
2  clusters.ncl1       list 2790      1        NA        NA
3  clusters.ncl2       list 3301      2        NA        NA
4  clusters.ncl3       list 3811      3        NA        NA
5  clusters.ncl4       list 4322      4        NA        NA
6  clusters.ncl5       list 4833      5        NA        NA
7  clusters.ncl6       list 5344      6        NA        NA
8   summary.ncl1       list  527      2        NA        NA
9   summary.ncl2       list  677      2        NA        NA
10  summary.ncl3       list  791      2        NA        NA
11  summary.ncl4       list  922      2        NA        NA
12  summary.ncl5       list 1069      2        NA        NA
13  summary.ncl6       list 1232      2        NA        NA    

For this dataset the algorithm generated 6 clusters after 6 iterations with a threshold eigv2.threshold=1.00. The datastore contains several types of objects: clusters.nclX, summary.nclX and Grand.Summary. The suffix X indicates the iteration step. For example, clusters.ncl4 is not the 4th cluster; it is a list of objects (numbers and tables) describing the 4 clusters generated during the 4th iteration. summary.ncl4 contains summary information about each of the 4 clusters generated during the 4th iteration. Grand.Summary provides the same metrics aggregated over all clusters, one row per iteration. More details are provided below.

The user can load and inspect each clusters.nclX or summary.nclX individually, for example to track how variables are assigned to clusters during the iterative process. Saving the results on a per-iteration basis becomes practical when the number of starting variables runs into the hundreds and many clusters are generated.

Text based output


varclus_lib.R contains a function, write.clusters.to.file(), which concatenates all the information from one or more iterations and writes it as formatted text for visual inspection. In the example below, the results from the last two steps (5 and 6), specified via the clust.steps argument, are written to the file named via the fout argument.

R> fclust <- "out.varclus.MYDATA/out.MYDATA.clusters"
R> write.clusters.to.file(fout=fclust,
                          dsname="datstr.MYDATA",clust.steps=c(5,6))

The output now contains the information from summary.ncl5, clusters.ncl5, summary.ncl6, clusters.ncl6, and Grand.Summary, in that order. Below we show only the output corresponding to the 6th iteration, which contains the final results.

The output starts with data collected from summary.ncl6 and displayed as two sections 'Clusters Summary' and 'Inter-Clusters Correlation'. The columns of  'Clusters Summary' are:


The 'Inter-Clusters Correlation' matrix is the correlation matrix between the scores of (data on) the 1st principal component of every cluster. It measures how weakly correlated the clusters are when each is represented by its 1st principal component.
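As a rough illustration of how such a matrix can be obtained, the sketch below computes the 1st principal component scores of each cluster with princomp() and correlates them. The simulated data, the cluster memberships and the object names are hypothetical; this is not the code used by ore.varclus().

```r
# Illustrative sketch only; data and cluster memberships are made up.
set.seed(3)
n  <- 400
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.3), a2 = f1 + rnorm(n, sd = 0.3),
  a3 = f1 + rnorm(n, sd = 0.3),
  b1 = f2 + rnorm(n, sd = 0.3), b2 = f2 + rnorm(n, sd = 0.3)
)
clusters <- list(Clust.1 = c("a1", "a2", "a3"),
                 Clust.2 = c("b1", "b2"))

# Scores on the 1st principal component of each cluster,
# one column per cluster
pc1.scores <- sapply(clusters, function(vars) {
  princomp(X[, vars], cor = TRUE)$scores[, 1]
})

# Correlation between the cluster components
inter.clusters.correl <- cor(pc1.scores)
print(round(inter.clusters.correl, 3))
```

Off-diagonal values close to zero indicate clusters that carry largely independent information; a large value, such as the one between Clust.2 and Clust.6 in the output from this post, flags clusters that remain strongly related.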

----------------------------------------------------------------------------------------
Clustering step 6
----------------------------------------------------------------------------------------
Clusters Summary :

  Cluster Members Variation.Explained Proportion.Explained Secnd.Eigenval Represent.Var
1       1      13           11.522574            0.8863518   7.856187e-01         VAR25
2       2       6            5.398123            0.8996871   3.874496e-01         VAR13
3       3       6            5.851600            0.9752667   1.282750e-01          VAR9
4       4       3            2.999979            0.9999929   2.112009e-05         VAR10
5       5       5            2.390534            0.4781069   8.526650e-01         VAR27
6       6       6            5.492897            0.9154828   4.951499e-01         VAR14

Inter-Clusters Correlation :

             Clust.1      Clust.2       Clust.3       Clust.4       Clust.5       Clust.6
Clust.1  1.000000000  0.031429267  0.0915034534 -0.0045104029 -0.0341091948  0.0284033464
Clust.2  0.031429267  1.000000000  0.0017441189 -0.0014435672 -0.0130659191  0.8048780461
Clust.3  0.091503453  0.001744119  1.0000000000  0.0007563413 -0.0080611117 -0.0002118345
Clust.4 -0.004510403 -0.001443567  0.0007563413  1.0000000000 -0.0008410022 -0.0022667776
Clust.5 -0.034109195 -0.013065919 -0.0080611117 -0.0008410022  1.0000000000 -0.0107850694
Clust.6  0.028403346  0.804878046 -0.0002118345 -0.0022667776 -0.0107850694  1.0000000000

Cluster 1
             Comp.1       Comp.2    r2.own     r2.next   r2.ratio var.idx
VAR25 -0.3396562963  0.021849138 0.9711084 0.010593134 0.02920095      25
VAR38 -0.3398365257  0.021560264 0.9710107 0.010590140 0.02929962      38
VAR23 -0.3460431639  0.011946665 0.9689027 0.010689408 0.03143329      23
VAR36 -0.3462378084  0.011635813 0.9688015 0.010685952 0.03153546      36
VAR37 -0.3542777932 -0.001166427 0.9647680 0.010895771 0.03562009      37
VAR24 -0.3543088809 -0.001225793 0.9647155 0.010898262 0.03567326      24
VAR22 -0.3688379400 -0.026782777 0.9484384 0.011098450 0.05214028      22
VAR35 -0.3689127408 -0.026900129 0.9484077 0.011093779 0.05217103      35
VAR30 -0.0082726659  0.478137910 0.8723316 0.006303141 0.12847817      30
VAR32  0.0007818601  0.489061629 0.8642301 0.006116234 0.13660543      32
VAR31  0.0042646500  0.493099400 0.8605441 0.005992662 0.14029666      31
VAR33  0.0076560545  0.497131056 0.8573146 0.005934929 0.14353729      33
VAR34 -0.0802417381  0.198756967 0.3620001 0.007534643 0.64284346      34

Cluster 2
           Comp.1      Comp.2    r2.own   r2.next  r2.ratio var.idx
VAR13 -0.50390550 -0.03826113 0.9510065 0.6838419 0.1549652      13
VAR3  -0.50384385 -0.03814382 0.9509912 0.6838322 0.1550089       3
VAR18 -0.52832332 -0.09384185 0.9394948 0.6750884 0.1862204      18
VAR11 -0.31655455  0.33594147 0.9387738 0.5500716 0.1360798      11
VAR16 -0.34554284  0.26587848 0.9174539 0.5351907 0.1775913      16
VAR39 -0.02733522 -0.90110241 0.7004025 0.3805168 0.4836249      39

Cluster 3
             Comp.1       Comp.2    r2.own      r2.next    r2.ratio var.idx
VAR9  -4.436290e-01  0.010645774 0.9944599 0.0111098555 0.005602316       9
VAR8  -4.440656e-01  0.009606151 0.9944375 0.0113484256 0.005626315       8
VAR7  -4.355970e-01  0.028881014 0.9931890 0.0110602004 0.006887179       7
VAR6  -4.544373e-01 -0.016395561 0.9914545 0.0114996393 0.008644956       6
VAR21 -4.579777e-01 -0.027336302 0.9865562 0.0004552779 0.013449888      21
VAR5   1.566362e-06  0.998972842 0.8915032 0.0093737140 0.109523464       5

Cluster 4
            Comp.1        Comp.2    r2.own      r2.next     r2.ratio var.idx
VAR10 7.067763e-01  0.0004592019 0.9999964 1.899033e-05 3.585911e-06      10
VAR1  7.074371e-01 -0.0004753728 0.9999964 1.838949e-05 3.605506e-06       1
VAR15 2.093320e-11  0.9999997816 0.9999859 2.350467e-05 1.408043e-05      15

Cluster 5
            Comp.1       Comp.2    r2.own      r2.next  r2.ratio var.idx
VAR27 -0.556396037 -0.031563215 0.6199740 0.0001684573 0.3800900      27
VAR29 -0.532122723 -0.041330455 0.5586173 0.0001938785 0.4414683      29
VAR28 -0.506440510 -0.002599593 0.5327290 0.0001494172 0.4673408      28
VAR26 -0.389716922  0.198849850 0.4396647 0.0001887849 0.5604411      26
VAR20  0.003446542  0.979209797 0.2395493 0.0076757755 0.7663329      20

Cluster 6
             Comp.1        Comp.2    r2.own   r2.next  r2.ratio var.idx
VAR14 -0.0007028647  0.5771114183 0.9164991 0.7063442 0.2843495      14
VAR4  -0.0007144334  0.5770967589 0.9164893 0.7063325 0.2843714       4
VAR12 -0.5779762250 -0.0004781436 0.9164238 0.4914497 0.1643420      12
VAR2  -0.5779925997 -0.0004993306 0.9164086 0.4914361 0.1643676       2
VAR17 -0.5760772611  0.0009732350 0.9150015 0.4900150 0.1666686      17
VAR19  0.0014223072  0.5778410825 0.9120741 0.7019736 0.2950272      19

---------------------------------------------------------------------------------------
Grand Summary
---------------------------------------------------------------------------------------
  Nb.of.Clusters Tot.Var.Explained Prop.Var.Explained Min.Prop.Explained Max.2nd.Eigval
1              1          11.79856          0.3025272          0.3025272       9.787173
2              2          21.47617          0.5506711          0.4309593       5.778829
3              3          27.22407          0.6980530          0.5491522       2.999950
4              4          30.22396          0.7749735          0.6406729       2.389400
5              5          32.60496          0.8360246          0.4781069       1.205769
6              6          33.65571          0.8629668          0.4781069       0.852665

The sections 'Cluster 1' ... 'Cluster 6' contain results collected from the clusters.ncl6 list from the datastore. Each cluster is described by a table where the rows are the variables and the columns correspond to:



For example, from 'Clusters Summary', the first cluster (index 1) has 13 variables and is best represented by variable VAR25 which, on inspecting the 'Cluster 1' section, shows the highest r2.own = 0.9711084.
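The printed values are consistent with r2.ratio = (1 - r2.own) / (1 - r2.next), the ratio familiar from SAS PROC VARCLUS. This relationship is inferred from the numbers in the output, not taken from the ore.varclus() source, so treat it as an assumption. The short check below uses three rows copied from the 'Cluster 1' table.

```r
# Three rows copied from the 'Cluster 1' section of the output
cluster1 <- data.frame(
  var      = c("VAR25", "VAR38", "VAR34"),
  r2.own   = c(0.9711084, 0.9710107, 0.3620001),
  r2.next  = c(0.010593134, 0.010590140, 0.007534643),
  r2.ratio = c(0.02920095, 0.02929962, 0.64284346),
  stringsAsFactors = FALSE
)

# Assumed definition: (1 - r2.own) / (1 - r2.next)
recomputed <- (1 - cluster1$r2.own) / (1 - cluster1$r2.next)
print(cbind(cluster1, recomputed))

# The representative variable is the one with the highest r2.own
represent.var <- cluster1$var[which.max(cluster1$r2.own)]
print(represent.var)
```

A small ratio means the variable is well explained by its own cluster and poorly explained by the nearest other cluster. VAR34, with a ratio of about 0.64, is the weakest member of Cluster 1.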

The section 'Grand Summary' displays the results from the Grand.Summary table in the datastore. The rows correspond to the clustering iterations and the columns are defined as:



For example, for the final clusters (Nb.of.Clusters = 6) Min.Prop.Explained is 0.4781069. This corresponds to Cluster 5 (see the Proportion.Explained value in 'Clusters Summary'). It means that the variation in Cluster 5 is poorly captured by the first principal component (only 47.8%).
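The last row of 'Grand Summary' can be reproduced from the 'Clusters Summary' table of step 6. The relationships used below (a sum, a division by the total number of variables, a minimum and a maximum) are inferred from the printed numbers rather than from the ore.varclus() source, so this is a consistency check under that assumption.

```r
# 'Clusters Summary' for step 6, copied from the output above
clusters.summary <- data.frame(
  Cluster              = 1:6,
  Members              = c(13, 6, 6, 3, 5, 6),
  Variation.Explained  = c(11.522574, 5.398123, 5.851600,
                           2.999979, 2.390534, 5.492897),
  Proportion.Explained = c(0.8863518, 0.8996871, 0.9752667,
                           0.9999929, 0.4781069, 0.9154828),
  Secnd.Eigenval       = c(7.856187e-01, 3.874496e-01, 1.282750e-01,
                           2.112009e-05, 8.526650e-01, 4.951499e-01)
)

n.vars <- sum(clusters.summary$Members)   # 39 variables in MYDATA

grand.row <- data.frame(
  Nb.of.Clusters     = nrow(clusters.summary),
  Tot.Var.Explained  = sum(clusters.summary$Variation.Explained),
  Prop.Var.Explained = sum(clusters.summary$Variation.Explained) / n.vars,
  Min.Prop.Explained = min(clusters.summary$Proportion.Explained),
  Max.2nd.Eigval     = max(clusters.summary$Secnd.Eigenval)
)
print(grand.row)
```

The recomputed row agrees with the printed one (33.65571, 0.8629668, 0.4781069, 0.852665) up to display rounding.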

As previously indicated, the representative variables, one per final cluster, are collected in the Represent.Var column from the 'Clusters Summary' section in the output text file. They can be retrieved from the summary.ncl6 object in the datastore as shown below:

R> ore.load(list=c("summary.ncl6"),name=datstr.name)
[1] "summary.ncl6"
R> names(summary.ncl6)
[1] "clusters.summary"      "inter.clusters.correl"
R> names(summary.ncl6$clusters.summary)
[1] "Cluster"  "Members"  "Variation.Explained"  "Proportion.Explained" "Secnd.Eigenval"     
[6] "Represent.Var"      
R> summary.ncl6$clusters.summary$Represent.Var
[1] "VAR25" "VAR13" "VAR9"  "VAR10" "VAR27" "VAR14"

In our next post we'll look at plots, performance and future developments for ORE varclus.



Saturday Mar 15, 2014

Oracle R Enterprise 1.4 Released

We’re pleased to announce that Oracle R Enterprise (ORE) 1.4 is now available for download on all supported platforms. In addition to numerous bug fixes, ORE 1.4 introduces an enhanced high performance computing infrastructure, new and enhanced parallel distributed predictive algorithms for both scalability and performance, added support for production deployment, and compatibility with the latest R versions.  These updates enable IT administrators to easily migrate the ORE database schema to speed production deployment, and statisticians and analysts have access to a larger set of analytics techniques for more powerful predictive models.

Here are the highlights for the new and upgraded features in ORE 1.4:

Upgraded R version compatibility


ORE 1.4 is certified with R-3.0.1 - both open source R and Oracle R Distribution. See the server support matrix for the complete list of supported R versions. R-3.0.1 brings improved performance and big-vector support to R, and compatibility with more than 5000 community-contributed R packages.

High Performance Computing Enhancements

Ability to specify degree of parallelism (DOP) for parallel-enabled functions (ore.groupApply, ore.rowApply, and ore.indexApply)
An additional global option, ore.parallel, to set the number of parallel threads used in embedded R execution

Data Transformations and Analytics

ore.neural now provides a highly flexible network architecture with a wide range of activation functions, supporting 1000s of formula-derived columns, in addition to being a parallel and distributed implementation capable of supporting billion row data sets
ore.glm now also prevents selection of less optimal coefficient methods with parallel distributed in-database execution
Support for weights in regression models
New ore.esm enables time series analysis, supporting both simple and double exponential smoothing for scalable in-database execution
Execute standard R functions for Principal Component Analysis (princomp), ANOVA (anova), and factor analysis (factanal) on database data

Oracle Data Mining Model Algorithm Functions

Newly exposed in-database Oracle Data Mining algorithms:

ore.odmAssocRules function for building Oracle Data Mining association models using the apriori algorithm
ore.odmNMF function for building Oracle Data Mining feature extraction models using the Non-Negative Matrix Factorization (NMF) algorithm
ore.odmOC function for building Oracle Data Mining clustering models using the Orthogonal Partitioning Cluster (O-Cluster) algorithm

Production Deployment

New migration utility eases production deployment from development environments
"Snapshotting" of production environments for debugging in test systems

For a complete list of new features, see the Oracle R Enterprise User's Guide. To learn more about Oracle R Enterprise, check out the white paper entitled "Bringing R to the Enterprise - A Familiar R Environment with Enterprise-Caliber Performance, Scalability, and Security", visit Oracle R Enterprise on Oracle's Technology Network, or review the variety of use cases on the Oracle R blog.

Sunday Dec 08, 2013

Explore Oracle's R Technologies at BIWA Summit 2014

It’s getting to be that time of year again. The Oracle BIWA Summit '14 will be taking place January 14-16 at Oracle HQ Conference Center, Redwood Shores, CA. Check out the detailed agenda.

BIWA Summit provides a wide range of sessions on Business Intelligence, Warehousing, and Analytics, including novel and interesting use cases of Oracle Big Data, Exadata, Advanced Analytics/Data Mining, OBIEE, Spatial, Endeca and more! You’ll also have opportunities to get hands-on experience with products in the Hands-on Labs, and to attend great customer case studies and talks by Oracle Technical Professionals and Partners. Meet with technical experts on the technology you want and need to use.

Click HERE to read detailed abstracts and speaker profiles.  Use the SPECIAL DISCOUNT code ORACLE12c and registration is only $199 for the 2.5 day technically focused Oracle user group event.

On the topic of Oracle’s R technologies, don't miss:

  • Introduction to Oracle's R Technologies
  • Applying Oracle's R Technologies to Big Data Problems
  • Hands-on Lab: Learn to use Oracle R Enterprise
  • OBIEE + OAA Integration Paths : interactive OAA in SampleApp Dashboards
  • Blazing Business Analytics: Analytic Options to the Oracle Database
  • Best Practices for In-Database Analytics

We look forward to meeting you there!

Friday Dec 06, 2013

Oracle R Distribution 3.0.1 now available for Windows 64-bit

We are excited to introduce support for Oracle R Distribution 3.0.1 on Windows 64-bit versions. Previous releases are available on Solaris x86, Solaris SPARC, AIX and Linux 64-bit platforms. Oracle R Distribution (ORD) continues to support these platforms and now expands support to Windows 64-bit platforms.

ORD is Oracle's free distribution of the open source R environment that adds support for dynamically loading the Intel Math Kernel Library (MKL) installed on your system. MKL provides faster performance by taking advantage of hardware-specific math library implementations. The net effect is optimized processing speed, especially on multi-core systems.

To enable MKL support on your ORD Windows client:

1. Add the location of libOrdBlasLoader.dll and mkl_rt.dll to the PATH system environment variable on the client.

In a typical ORD 3.0.1 installation, libOrdBlasLoader.dll is located in the R HOME directory:

C:\Program Files\R\R-3.0.1\bin\x64

In a full MKL 11.1 installation, mkl_rt.dll is located in the Intel MKL Composer XE directory:

C:\Program Files (x86)\Intel\Composer XE 2013 SP

2. Start R and execute the function Sys.BlasLapack:

    R> Sys.BlasLapack()
     $vendor
     [1] "Intel Math Kernel Library (Intel MKL)"

     $nthreads
     [1] -1

The vendor value returned indicates the presence of MKL instead of R's internal BLAS. The value for the number of threads to utilize, nthreads = -1, indicates all available cores are used by default. To modify the number of threads used, set the system environment variable MKL_NUM_THREADS = n, where n is the number of physical cores in the system you wish to use.

To install MKL on your Windows client, you must have an MKL license.

Oracle R Distribution will be certified with a future release of Oracle R Enterprise, and is available now from Oracle's free and Open Source Software portal. Questions and comments are welcome on the Oracle R Forum.

Monday Jul 29, 2013

Oracle R Distribution for R-3.0.1 released

We're pleased to announce that the Oracle R Distribution 3.0.1 Linux RPMs are now available on Oracle's public yum. R-3.0.1, code-named "Good Sport", is the second release in the R-3.0.x series. This new series in R doesn't announce new features, but indicates that the code base has developed to a new level of maturity.

However, there are some significant improvements in the 3.0 series worth mentioning.  R-3.0.0 introduces the use of large vectors in R, and eliminates some restrictions in the core R engine by allowing R to use the memory available on 64-bit systems more efficiently. Prior to this release, objects had a hard-coded limit of 2^31-1 elements, or roughly 2.1 billion elements.  Objects exceeding this limit were treated as missing (NA) and R sometimes returned a warning, regardless of available memory on the system. Starting in R-3.0.0, objects can exceed this limit, which is a significant improvement. Here's the relevant statement from the R-devel NEWS file:

 There is a subtle change in behaviour for numeric index 
 values 2^31 and larger. These never used to be legitimate 
 and so were treated as NA, sometimes with a warning. They 
 are now legal for long vectors so there is no longer a 
 warning, and x[2^31] <- y will now extend the vector on a 
 64-bit platform and give an error on a 32-bit one. 
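The 2^31-1 limit mentioned above is the largest value an R integer can hold. The lines below are a small illustration using only base R; they do not allocate a long vector, which would require several gigabytes of memory.

```r
# Largest representable R integer: 2^31 - 1
int.max <- .Machine$integer.max
print(int.max)                      # 2147483647
print(int.max == 2^31 - 1)          # TRUE

# Adding 1L overflows the integer type and yields NA (with a warning)
overflowed <- suppressWarnings(int.max + 1L)
print(is.na(overflowed))            # TRUE

# Index values of 2^31 and above are therefore stored as doubles
print(is.double(2^31))              # TRUE
```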

R-3.0.1 adds to these updates by improving serialization for big objects and fixing a variety of bugs.

Older open source R packages will need to be re-installed after upgrading from ORD 2.15.x to ORD 3.0.1, which is accomplished by running:

R> update.packages(checkBuilt = TRUE) 

This command upgrades open source packages if a more recent version exists on CRAN or if the installed package was built with an older version of R.

Oracle R Distribution 3.0.1 will be compatible with future versions of Oracle R Enterprise.  As of this posting, we recommend using ORD 2.15.3 with Oracle R Enterprise 1.3.1.  When installing ORD for use with ORE 1.3.1, be sure to use the command yum install R-2.15.3, otherwise R-3.0.1 will be installed by default.
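Before running the update, it can be useful to see which installed packages were built under an older R. The sketch below is one way to list them using the "Built" column returned by installed.packages(); the cut-off version "3.0.0" and the variable names are chosen here for illustration.

```r
# List installed packages that were built with an R version older than 3.0.0
ip    <- installed.packages()
built <- package_version(ip[, "Built"])   # R version each package was built with
stale <- unname(ip[built < "3.0.0", "Package"])

if (length(stale) > 0) {
  print(stale)
} else {
  message("No packages built with R older than 3.0.0 were found.")
}
```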

 
ORD 3.0.1 binaries for AIX, Solaris x86, and Solaris SPARC platforms will be available from Oracle's free and Open Source portal soon. Please check back for updates.

About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.
