Monday Jul 13, 2015

BIWASummit 2016 "Call for Speakers" is open!

Oracle BIWA Summit is an annual conference that provides attendees a concentrated three days of content focused on Big Data and Analytics. Once again, it will be held at the Oracle Headquarters Conference Center in Redwood Shores, CA. As part of the organizing committee, I invite you to submit session proposals, especially those involving Oracle's R technologies.

BIWA Summit attendees want to hear about your use of Oracle technology. Proposals will be accepted through Monday evening November 2, 2015, at midnight EST.

To submit your abstract, click here.

This year's tracks include:


Oracle BIWA Summit 2016 is organized and managed by the Oracle BIWA SIG, the Oracle Spatial SIG, and the Oracle Northern California User Group. The event attracts top BI, data warehousing, analytics, Spatial, IoT and Big Data experts.

The three-day event includes keynotes from industry experts, educational sessions, hands-on labs, and networking events.

Hot topics include:


  • Database, data warehouse and cloud, Big Data architecture

  • Deep dives and hands-on labs on existing Oracle BI, data warehouse, and analytics products

  • Updates on the latest Oracle products and technologies (e.g. Big Data Discovery, Oracle Visual Analyzer, Oracle Big Data SQL)

  • Novel and interesting use cases on everything – Spatial, Graph, Text, Data Mining, IoT, ETL, Security, Cloud

  • Working with Big Data (e.g., Hadoop, "Internet of Things," SQL, R, Sentiment Analysis)

  • Oracle Business Intelligence (OBIEE), Oracle Big Data Discovery, Oracle Spatial, and Oracle Advanced Analytics—Better Together

I look forward to seeing you there!

Tuesday Jun 30, 2015

R Consortium Launched!

The Linux Foundation announces the R Consortium to support R users globally.

The R Consortium works with and provides support to the R Foundation and other organizations developing, maintaining and distributing R software and provides a unifying framework for the R user community.

“Data science is pushing the boundaries of what is possible in business, science, and technology, where the R language and ecosystem is a major enabling force,” said Neil Mendelson, Vice President, Big Data and Advanced Analytics, Oracle. “The R Consortium is an important enabling body to support and help grow the R user community, which increasingly includes enterprise data scientists.”

R is a key enabling technology for data science as evidenced by its dramatic rise in adoption over the past several years. We look forward to contributing to R's continued success through the R Consortium.

Friday Jun 12, 2015

Variable Selection with ORE varclus - Part 2

In our previous post we talked about variable selection and introduced a technique based on hierarchical divisive clustering and implemented using the Oracle R Enterprise embedded execution capabilities. In this post we illustrate how to visualize the clustering solution, discuss stopping criteria and highlight some performance aspects.

Plots


The clustering efficiency can be assessed, from a high-level perspective, through a visual representation of metrics related to variability. The plot.clusters() function, provided as an example in varclus_lib.R, takes the datastore name, the iteration number (nclust, which here corresponds to the number of clusters after the final iteration) and an output directory, and generates a png output file with two plots.

R> plot.clusters(dsname="datstr.MYDATA",nclust=6,
                   outdir="out.varclus.MYDATA")

unix> ls -1 out.varclus.MYDATA
out.MYDATA.clusters
out.MYDATA.log
plot.datstr.MYDATA.ncl6.png

The upper plot focuses on the last iteration. The x axis represents the cluster id (1 to 6 for six clusters after the 6th and final iteration). The variation explained and the proportion of variation explained (Variation.Explained and Proportion.Explained from 'Clusters Summary') are rendered by the blue curve (units on the left y axis) and the red curve (units on the right y axis). Clusters 1, 2, 3, 4 and 6 are well represented by their first principal component. Cluster 5 contains variation which is not well captured by a single component (only 47.8% is explained, as already mentioned in Part 1). This can also be seen from the r2.own values for the variables of Cluster 5 (VAR20, VAR26,...,VAR29), which lie between 0.24 and 0.62, indicating that they are not well correlated with the 1st principal component score. For this kind of situation, domain expertise will be needed to evaluate the results and decide the course of action: does it make sense to keep VAR20, VAR26,...,VAR29 clustered together with VAR27 as the representative variable, or should Cluster 5 be further split by lowering eigv2.threshold (below the corresponding Secnd.Eigenval value from the 'Clusters Summary' section)?

The bottom plot illustrates the entire clustering sequence (all iterations). The x axis represents the iteration number or, equivalently, the number of clusters after that iteration. The total variation explained and the proportion of total variation explained (Tot.Var.Explained and Prop.Var.Explained from 'Grand Summary') are rendered by the blue curve (units on the left y axis) and the red curve (units on the right y axis). One can see how Prop.Var.Explained tends to flatten below 90% (86.3% for the last iteration).




For the case above, a single cluster was 'weak' and there was no ambiguity about where to start examining the results or searching for issues. Below is the same output for a different problem with 120 variables and 29 final clusters. For this case, the proportion of variation explained by the 1st component (red curve, upper plot) shows several 'weak' clusters: 23, 28, 27, 4, 7, 19. The Prop.Var.Explained is below 60% for these clusters. Which one should be examined first? A good choice could be Cluster 7 because it plays a more important role as measured by the absolute value of Variation.Explained. Here again, domain knowledge will be required to examine these clusters and decide if, and for how long, one should continue the splitting process.





Stopping criteria & number of variables


As illustrated in the previous section, the number of final clusters can be raised or reduced by lowering or increasing the eigv2.threshold parameter. For problems with many variables the user may want to stop the iterations early and inspect the clustering results and history before convergence, to gain a better understanding of the variable selection process. Early stopping is achieved through the maxclust argument, as discussed in the previous post, and can also be used if the user wants or needs to keep the number of selected variables below an upper limit.
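
As a minimal sketch, the driver call from Part 1 can be reused for an early-stopped run; MYDATA and the datastore name below are placeholders, and only maxclust changes:

R> clust.log <- ore.doEval(FUN.NAME="ore.varclus"
                 ,data.name="MYDATA"              # proxy name of the input table
                 ,maxclust=10                     # stop once 10 clusters (and 10
                                                  # representative variables) are generated
                 ,pca="princomp"
                 ,eigv2.threshold=1.
                 ,dsname="datstr.MYDATA.ncl10"    # separate datastore for this run
                 ,ore.connect=TRUE)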

Performance


The clustering runtime is entirely dominated by the cost of the PCA analysis. The 1st split is the most expensive, as PCA is run on the entire data; the subsequent splits execute faster and faster as the PCAs handle clusters with fewer and fewer variables. For the 39 variables & 55k rows case presented, the entire run (splitting into 6 clusters, post-processing from the datastore, output generation) took ~10s. The 120 variables & 55k rows case required ~54s. For a larger case with 666 variables & 64k rows the execution completed in 112s and generated 128 clusters. These numbers were obtained on an Intel Xeon 2.9GHz OL6 machine. The customer ran cases with more than 600 variables & O[1e6] rows in 5-10 minutes.

Thursday Jun 04, 2015

Variable Selection with ORE varclus - Part 1



Variable selection, also known as feature or attribute selection, is an important technique for data mining and predictive analytics. It is used when the number of variables is large, and it has received special attention from application areas where this number is very large (genomics, combinatorial chemistry, text mining, etc.). The underlying hypothesis for variable selection is that the data can contain many variables which are either irrelevant or redundant. Solutions are therefore sought for selecting subsets of these variables which can predict the output with an accuracy comparable to that of the complete input set.

Variable selection serves multiple purposes: (1) it provides faster and more cost-effective model generation; (2) it simplifies model interpretation, since the model is based on a (much) smaller and more effective set of predictors; (3) it supports better generalization, because the elimination of irrelevant features can reduce model over-fitting.

There are many approaches to feature selection, differentiated by search techniques, validation methods or optimality considerations. In this blog we will describe a solution based on hierarchical, divisive variable clustering which generates disjoint groups of variables such that each group can be interpreted as essentially uni-dimensional and represented by a single variable from the original set.
This solution was developed and implemented during a POC with a customer from the banking sector. The data consisted of tables with several hundred variables and O[1e5-1e6] observations. The customer wanted to build an analysis flow operating with a much smaller number of 'relevant' attributes from the original set, which would best capture the variability expressed in the data.

The procedure is iterative and starts from a single cluster containing all of the original variables. This cluster is divided into two clusters and the variables are assigned to one or the other of the two child clusters. At every iteration one particular cluster is selected for division, and the procedure continues until there are no more suitable candidates for division or until the user decides to stop the procedure once n clusters have been generated (and n representative variables have been identified).

The selection criterion for division is related to the variation contained in the candidate cluster, more precisely to how this variation is distributed among its principal components. PCA is performed on the initial (starting) cluster and on every cluster resulting from divisions. If the 2nd eigenvalue is large, it means that the variation is distributed between at least two principal axes or components. We do not look beyond the 2nd eigenvalue; we divide that cluster's variables into two groups depending on how they are associated with the first two axes of variability. The division process continues until every cluster has variables associated with only one principal component, i.e., until every cluster has a 2nd PCA eigenvalue less than a specified threshold. During the iterative process, the cluster picked for splitting is the one having the largest 2nd eigenvalue.
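
As a hedged sketch of that split test (this is not the actual ore.varclus code, and it assumes PCA on the correlation matrix, which is consistent with an eigenvalue threshold of 1), the quantity driving the decision can be computed as:

R> second.eigenvalue <- function(dat, vars) {
     # PCA on the candidate cluster's variables; sdev^2 are the eigenvalues
     pc <- princomp(dat[, vars], cor=TRUE)
     pc$sdev[2]^2
   }

At each iteration the cluster with the largest such value is split, and the iterations stop once every cluster's value falls below eigv2.threshold.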

The assignment of variables to clusters is based on the matrix of factor loadings, i.e., the correlations between the original variables and the PCA factors. Actually, the factor loadings matrix is not used directly; a rotated matrix, which improves separability, is used instead. Details on the principle of factor rotations and the various types of rotations can be found in Choosing the Right Type of Rotation in PCA and EFA and Factor Rotations in Factor Analyses.
The rotations are performed with the function GPFoblq() from the GPArotation package, a prerequisite for ORE varclus.
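
As a small illustration of this rotation step (synthetic data stands in for one cluster's variables; this is not the ore.varclus code itself), the loadings of the first two components can be rotated with GPFoblq() and each variable assigned to the component on which it loads most heavily:

R> library(GPArotation)
R> X   <- data.frame(matrix(rnorm(500*6), ncol=6))         # synthetic stand-in, 6 variables
R> L   <- unclass(loadings(princomp(X, cor=TRUE)))[, 1:2]  # loadings on the first 2 components
R> rot <- GPFoblq(L, method="quartimin")                   # oblique rotation, improves separability
R> apply(abs(rot$loadings), 1, which.max)                  # dominant rotated component per variable:
                                                           # this defines the two groups of the split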

The next sections will describe how to run the variable clustering algorithm and interpret the results.

The ORE varclus scripts


The present version of ORE varclus is implemented in a function, ore.varclus(), to be run in embedded execution mode. The driver script example, varclus_run.R, illustrates how to call this function with ore.doEval:

R> clust.log <- ore.doEval(FUN.NAME="ore.varclus"
                 ,data.name="MYDATA"
                 ,maxclust=200
                 ,pca="princomp"
                 ,eigv2.threshold=1.
                 ,dsname="datstr.MYDATA"                        
                 ,ore.connect=TRUE)

The arguments passed to ore.varclus() are:


ore.varclus() is implemented in the varclus_lib.R script. The script also contains examples of post-processing functions illustrating how to selectively extract results from the datastore and generate reports and plots. The current version of ore.varclus() supports only numerical attributes. Details on the usage of the post-processing functions are provided in the next section.

The output


Datastores

We illustrate the output of ORE varclus for a particular dataset (MYDATA) containing 39 numeric variables and 54k observations. ore.varclus() saves the history of the entire cluster generation in a datastore specified via the dsname argument:

  datastore.name object.count  size       creation.date description
1 datstr.MYDATA            13 30873 2015-05-28 01:03:42        <NA>

     object.name      class size length row.count col.count
1  Grand.Summary data.frame  562      5         6         5
2  clusters.ncl1       list 2790      1        NA        NA
3  clusters.ncl2       list 3301      2        NA        NA
4  clusters.ncl3       list 3811      3        NA        NA
5  clusters.ncl4       list 4322      4        NA        NA
6  clusters.ncl5       list 4833      5        NA        NA
7  clusters.ncl6       list 5344      6        NA        NA
8   summary.ncl1       list  527      2        NA        NA
9   summary.ncl2       list  677      2        NA        NA
10  summary.ncl3       list  791      2        NA        NA
11  summary.ncl4       list  922      2        NA        NA
12  summary.ncl5       list 1069      2        NA        NA
13  summary.ncl6       list 1232      2        NA        NA    

For this dataset the algorithm generated 6 clusters after 6 iterations with a threshold eigv2.threshold=1.00. The datastore contains several types of objects: clusters.nclX, summary.nclX and Grand.Summary. The suffix X indicates the iteration step. For example, clusters.ncl4 does not mean the 4th cluster; it is a list of objects (numbers and tables) related to the 4 clusters generated during the 4th iteration. summary.ncl4 contains summarizing information about each of the 4 clusters generated during the 4th iteration. Grand.Summary provides the same metrics but aggregated for all clusters per iteration. More details are provided below.

The user can load and inspect each clusters.nclX or summary.nclX individually to track, for example, how variables are assigned to clusters during the iterative process. Saving the results on a per-iteration basis becomes practical when the number of starting variables runs into the hundreds and many clusters are generated.
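
For example, assuming the same datastore as above, the objects from the 4th iteration can be pulled into the R session with ore.load(); their layout mirrors the summary.ncl6 example shown later in this post:

R> ore.load(list=c("summary.ncl4","clusters.ncl4"), name="datstr.MYDATA")
R> summary.ncl4$clusters.summary      # per-cluster metrics after the 4th iteration
R> names(clusters.ncl4)               # per-cluster detail tables for that iteration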

Text based output


varclus_lib.R contains a function write.clusters.to.file() which concatenates all the information from either a single iteration or multiple iterations and dumps it as formatted text for visual inspection. In the example below, the results from the last two steps (5 and 6), specified via the clust.steps argument, are written to the file named via the fout argument.

R> fclust <- "out.varclus.MYDATA/out.MYDATA.clusters"
R> write.clusters.to.file(fout=fclust,
                          dsname="datstr.MYDATA",clust.steps=c(5,6))

The output now contains the information from summary.ncl5, clusters.ncl5, summary.ncl6, clusters.ncl6, and Grand.Summary, in that order. Below we show only the output corresponding to the 6th iteration, which contains the final results.

The output starts with data collected from summary.ncl6, displayed as two sections, 'Clusters Summary' and 'Inter-Clusters Correlation'. The columns of 'Clusters Summary' are:


The 'Inter-Clusters Correlation' matrix is the correlation matrix between the scores of the data on the 1st principal component of every cluster. It measures how uncorrelated the clusters are when each is represented by its 1st principal component.

----------------------------------------------------------------------------------------
Clustering step 6
----------------------------------------------------------------------------------------
Clusters Summary :

  Cluster Members Variation.Explained Proportion.Explained Secnd.Eigenval Represent.Var
1       1      13           11.522574            0.8863518   7.856187e-01         VAR25
2       2       6            5.398123            0.8996871   3.874496e-01         VAR13
3       3       6            5.851600            0.9752667   1.282750e-01          VAR9
4       4       3            2.999979            0.9999929   2.112009e-05         VAR10
5       5       5            2.390534            0.4781069   8.526650e-01         VAR27
6       6       6            5.492897            0.9154828   4.951499e-01         VAR14

Inter-Clusters Correlation :

             Clust.1      Clust.2       Clust.3       Clust.4       Clust.5       Clust.6
Clust.1  1.000000000  0.031429267  0.0915034534 -0.0045104029 -0.0341091948  0.0284033464
Clust.2  0.031429267  1.000000000  0.0017441189 -0.0014435672 -0.0130659191  0.8048780461
Clust.3  0.091503453  0.001744119  1.0000000000  0.0007563413 -0.0080611117 -0.0002118345
Clust.4 -0.004510403 -0.001443567  0.0007563413  1.0000000000 -0.0008410022 -0.0022667776
Clust.5 -0.034109195 -0.013065919 -0.0080611117 -0.0008410022  1.0000000000 -0.0107850694
Clust.6  0.028403346  0.804878046 -0.0002118345 -0.0022667776 -0.0107850694  1.0000000000

Cluster 1
             Comp.1       Comp.2    r2.own     r2.next   r2.ratio var.idx
VAR25 -0.3396562963  0.021849138 0.9711084 0.010593134 0.02920095      25
VAR38 -0.3398365257  0.021560264 0.9710107 0.010590140 0.02929962      38
VAR23 -0.3460431639  0.011946665 0.9689027 0.010689408 0.03143329      23
VAR36 -0.3462378084  0.011635813 0.9688015 0.010685952 0.03153546      36
VAR37 -0.3542777932 -0.001166427 0.9647680 0.010895771 0.03562009      37
VAR24 -0.3543088809 -0.001225793 0.9647155 0.010898262 0.03567326      24
VAR22 -0.3688379400 -0.026782777 0.9484384 0.011098450 0.05214028      22
VAR35 -0.3689127408 -0.026900129 0.9484077 0.011093779 0.05217103      35
VAR30 -0.0082726659  0.478137910 0.8723316 0.006303141 0.12847817      30
VAR32  0.0007818601  0.489061629 0.8642301 0.006116234 0.13660543      32
VAR31  0.0042646500  0.493099400 0.8605441 0.005992662 0.14029666      31
VAR33  0.0076560545  0.497131056 0.8573146 0.005934929 0.14353729      33
VAR34 -0.0802417381  0.198756967 0.3620001 0.007534643 0.64284346      34

Cluster 2
           Comp.1      Comp.2    r2.own   r2.next  r2.ratio var.idx
VAR13 -0.50390550 -0.03826113 0.9510065 0.6838419 0.1549652      13
VAR3  -0.50384385 -0.03814382 0.9509912 0.6838322 0.1550089       3
VAR18 -0.52832332 -0.09384185 0.9394948 0.6750884 0.1862204      18
VAR11 -0.31655455  0.33594147 0.9387738 0.5500716 0.1360798      11
VAR16 -0.34554284  0.26587848 0.9174539 0.5351907 0.1775913      16
VAR39 -0.02733522 -0.90110241 0.7004025 0.3805168 0.4836249      39

Cluster 3
             Comp.1       Comp.2    r2.own      r2.next    r2.ratio var.idx
VAR9  -4.436290e-01  0.010645774 0.9944599 0.0111098555 0.005602316       9
VAR8  -4.440656e-01  0.009606151 0.9944375 0.0113484256 0.005626315       8
VAR7  -4.355970e-01  0.028881014 0.9931890 0.0110602004 0.006887179       7
VAR6  -4.544373e-01 -0.016395561 0.9914545 0.0114996393 0.008644956       6
VAR21 -4.579777e-01 -0.027336302 0.9865562 0.0004552779 0.013449888      21
VAR5   1.566362e-06  0.998972842 0.8915032 0.0093737140 0.109523464       5

Cluster 4
            Comp.1        Comp.2    r2.own      r2.next     r2.ratio var.idx
VAR10 7.067763e-01  0.0004592019 0.9999964 1.899033e-05 3.585911e-06      10
VAR1  7.074371e-01 -0.0004753728 0.9999964 1.838949e-05 3.605506e-06       1
VAR15 2.093320e-11  0.9999997816 0.9999859 2.350467e-05 1.408043e-05      15

Cluster 5
            Comp.1       Comp.2    r2.own      r2.next  r2.ratio var.idx
VAR27 -0.556396037 -0.031563215 0.6199740 0.0001684573 0.3800900      27
VAR29 -0.532122723 -0.041330455 0.5586173 0.0001938785 0.4414683      29
VAR28 -0.506440510 -0.002599593 0.5327290 0.0001494172 0.4673408      28
VAR26 -0.389716922  0.198849850 0.4396647 0.0001887849 0.5604411      26
VAR20  0.003446542  0.979209797 0.2395493 0.0076757755 0.7663329      20

Cluster 6
             Comp.1        Comp.2    r2.own   r2.next  r2.ratio var.idx
VAR14 -0.0007028647  0.5771114183 0.9164991 0.7063442 0.2843495      14
VAR4  -0.0007144334  0.5770967589 0.9164893 0.7063325 0.2843714       4
VAR12 -0.5779762250 -0.0004781436 0.9164238 0.4914497 0.1643420      12
VAR2  -0.5779925997 -0.0004993306 0.9164086 0.4914361 0.1643676       2
VAR17 -0.5760772611  0.0009732350 0.9150015 0.4900150 0.1666686      17
VAR19  0.0014223072  0.5778410825 0.9120741 0.7019736 0.2950272      19

---------------------------------------------------------------------------------------
Grand Summary
---------------------------------------------------------------------------------------
  Nb.of.Clusters Tot.Var.Explained Prop.Var.Explained Min.Prop.Explained Max.2nd.Eigval
1              1          11.79856          0.3025272          0.3025272       9.787173
2              2          21.47617          0.5506711          0.4309593       5.778829
3              3          27.22407          0.6980530          0.5491522       2.999950
4              4          30.22396          0.7749735          0.6406729       2.389400
5              5          32.60496          0.8360246          0.4781069       1.205769
6              6          33.65571          0.8629668          0.4781069       0.852665

The sections 'Cluster 1' ... 'Cluster 6' contain results collected from the clusters.ncl6 list from the datastore. Each cluster is described by a table where the rows are the variables and the columns correspond to:



For example, from 'Clusters Summary', the first cluster (index 1) has 13 variables and is best represented by variable VAR25 which, as an inspection of the 'Cluster 1' section shows, has the highest r2.own = 0.9711084.

The section 'Grand Summary' displays the results from the Grand.Summary table in the datastore. The rows correspond to the clustering iterations and the columns are defined as:



For example, for the final clusters (Nb.of.Clusters = 6) Min.Prop.Explained is 0.4781069. This corresponds to Cluster 5 - see the Proportion.Explained value in 'Clusters Summary'. It means that the variation in Cluster 5 is poorly captured by the first principal component (only 47.8%).

As previously indicated, the representative variables, one per final cluster, are collected in the Represent.Var column from the 'Clusters Summary' section in the output text file. They can be retrieved from the summary.ncl6 object in the datastore as shown below:

R> ore.load(list=c("summary.ncl6"),name=datstr.name)
[1] "summary.ncl6"
R> names(summary.ncl6)
[1] "clusters.summary"      "inter.clusters.correl"
R> names(summary.ncl6$clusters.summary)
[1] "Cluster"  "Members"  "Variation.Explained"  "Proportion.Explained" "Secnd.Eigenval"     
[6] "Represent.Var"      
R> summary.ncl6$clusters.summary$Represent.Var
[1] "VAR25" "VAR13" "VAR9"  "VAR10" "VAR27" "VAR14"

In our next post we'll look at plots, performance and future developments for ORE varclus.



Wednesday May 06, 2015

Experience using ORAAH on a customer business problem: some basic issues & solutions

We illustrate in this blog a few simple, practical solutions for problems which can arise when developing ORAAH mapreduce applications for the Oracle BDA. These problems were actually encountered during a recent POC engagement. The customer, an important player in the medical technologies market, was interested in building an analysis flow consisting of a sequence of data manipulation and transformation steps followed by multiple model generation. The data preparation included multiple types of merging, filtering, and variable generation based on complex search patterns, and represented, by far, the most time-consuming component of the flow. The original implementation on the customer's hardware required multiple days per flow to complete. Our ORAAH mapreduce based implementation running on an X5-2 Starter Rack BDA reduced that time to between 4-20 minutes, depending on which flow was tested.

The points which will be addressed in this blog are related to the fact that the data preparation was structured as a chain of tasks where each task performed transformations on HDFS data generated by one or multiple upstream tasks. More precisely, we will consider the following:


  • Merging of HDFS data from multiple sources

  • Re-balancing and parts reduction for HDFS data

  • Getting unique levels for categorical variables from HDFS data

  • Partitioning the data for distributed mapreduce execution


'Merging data' from above is to be understood as row binding of multiple tables. Re-balancing and parts reduction addresses the fact that HDFS data (generated by upstream jobs) may consist of very unequal parts (chunks); this would lead to performance losses when the data is further processed by other mapreduce jobs. The 3rd and 4th items are related: getting the unique levels of categorical variables was useful for the data partitioning process, namely for deciding how to generate the key-value pairs within the mapper functions.

1. Merging of HDFS data from multiple sources


The practical case here is that of a data transformation task for which the input consists of several, similarly structured HDFS data sets. As a reminder, data in HDFS is stored as a collection of flat files/chunks (part-00000, part-00001, etc.) under an HDFS directory, and the hdfs.* functions access the directory, not the 'part-xxxxx' chunks. Also, the hadoop.run()/hadoop.exec() functions work with a single input data object (an HDFS object identifier representing a directory in HDFS); the R rbind, cbind, merge, etc. operations cannot be invoked within mapreduce to bind two or more large tables.

For the case under consideration, each input (dataA_dfs, dataB_dfs, etc) consists of a different number of files/chunks


R> hdfs.ls("dataA_dfs")
[1] "__ORCHMETA__" "part-00000" "part-00001" .... "part-00071"
R> hdfs.ls("dataB_dfs")
[1] "__ORCHMETA__" "part-00000" "part-00001" .... "part-00035"


corresponding to the number of reducers used by the upstream mapreduce jobs which generated this data. As these multiple chunks from various HDFS directories need to be processed as a single input, they need to be moved into a single HDFS directory. The merge_hdfs_data() function below does just that: it creates a new HDFS directory and copies all the part-xxxxx files from each source directory, renumbering the resulting parts as it goes:

R> merge_hdfs_data <- function(SrcDirs,TrgtDir) {
  #cat(sprintf("merge_hdfs_files : Creating %s ...\n",TrgtDir))
  hdfs.mkdir(TrgtDir,overwrite=TRUE)
  i <- 0
  for (srcD in SrcDirs) {
    fparts <- hdfs.ls(get(srcD),pattern="part")
    srcd <- (hdfs.describe(get(srcD)))[1,2]
    for (fpart in fparts) {
      #cat(sprintf("merge_hdfs_files : Copying %s/%s to %s ...\n",
      #            srcD,fpart,TrgtDir))
      i <- i+1
      hdfs.cp(paste(srcd,fpart,sep="/"),sprintf("%s/part-%05d",TrgtDir,i))
    }
  }
}


Merging of the dataA_dfs and dataB_dfs directories into a new data_merged_dfs directory is achieved through:

R> merge_hdfs_data(c("dataA_dfs","dataB_dfs"),"data_merged_dfs")

2. Data re-balancing / Reduction of the number of parts


Data stored in HDFS can suffer from two key problems that affect performance: too many small files, and files with very different numbers of records, especially those with very few records. The merged data produced by the function above consists of a number of files equal to the sum of all files from all input HDFS directories. Since the upstream mapreduce jobs generating the inputs were run with a high number of reducers (for faster execution), the resulting total number of files got large (100+). This created an impractical constraint for the subsequent analysis, as one cannot run a mapreduce application with a number of mappers less than the number of parts (the reverse is possible: HDFS parts are splittable for processing by multiple mappers). Moreover, if the parts have very different numbers of records, the performance of the application suffers since different mappers handle very different volumes of data.

The rebalance_data() function below represents a simple way of addressing these issues. Every mapper splits its portion of the data into a user-defined number of parts (nparts) containing roughly the same number of records. A key is associated with each part. In this implementation the number of reducers is set to the number of parts. After shuffling, each reducer collects the records corresponding to one particular key and writes them to the output. The overall output consists of nparts parts of roughly equal size. A basic mechanism for preserving the data types is illustrated (see the map.output and reduce.output constructs below).

R> rebalance_data <- function(HdfsData,nmap,nparts)
{
  mapper_func <- function(k,v) {
    nlin <- nrow(v)
    if(nlin>0) {
      idx.seq <- seq(1,nlin)
      kk <- ceiling(idx.seq/(nlin/nparts))
      orch.keyvals(kk,v)
    }
  }
  reducer_func <- function(k,v) {
    if (nrow(v) > 0) { orch.keyvals(k=NULL,v) }
  }
  dtypes.out <- sapply(hdfs.meta(HdfsData)$types,
                       function(x) ifelse(x=="character","\"a\"",
                                          ifelse(x=="logical","FALSE","0")))
  val.str <- paste0(hdfs.meta(HdfsData)$names,"=",dtypes.out,collapse=",")
  meta.map.str <- sprintf("data.frame(key=0,%s)",val.str)
  meta.red.str <- sprintf("data.frame(key=NA,%s)",val.str)

  config <- new("mapred.config",
                job.name      = "rebalance_data",
                map.output    = eval(parse(text=meta.map.str)),
                reduce.output = eval(parse(text=meta.red.str)),
                map.tasks     = nmap,
                reduce.tasks  = nparts,
                reduce.split  = 1e5)
  res <- hadoop.run(data = HdfsData,
                    mapper = mapper_func,
                    reducer = reducer_func,
                    config = config,
                    cleanup = TRUE
  )
  res
}

Before using this function, the data associated with the new data_merged_dfs directory needs to be attached to the ORAAH framework:

R> data_merged_dfs <- hdfs.attach("data_merged_dfs")

The invocation below uses 144 mappers for splitting the data into 4 parts:

R> x <- rebalance_data(data_merged_dfs,nmap=144,nparts=4)


The user may also want to save the resulting object permanently, under a convenient, recognizable name such as 'data_rebalanced_dfs'. The path to the temporary object x is retrieved with the hdfs.describe() command and provided as the first argument to the hdfs.cp() command.

R> tmp_dfs_name <- hdfs.describe(x)[1,2]
R> hdfs.cp(tmp_dfs_name,"data_rebalanced_dfs",overwrite=TRUE)

The choice of the number of parts is up to the user. It is better to have a few parts, to avoid constraining from below the number of mappers for the downstream runs, but one should also consider other factors like the read/write performance related to the size of the data sets, the HDFS block size, etc., which are not the topic of the present blog.

3. Getting unique levels


Determining the unique levels of categorical variables in a dataset is of basic interest for any data exploration procedure. If the data is distributed in HDFS, this determination requires an appropriate solution. For the application under consideration here, getting the unique levels serves another purpose; the unique levels are used to generate data splits better suited for distributed execution by the downstream mapreduce jobs. More details are available in the next section.

Depending on the categorical variables in question and the data characteristics, the determination of unique levels may require different solutions. The implementation below is a generic solution providing these levels for multiple variables bundled together in the input argument 'cols'. The mappers associate a key with each variable and collect the unique levels for each of these variables. The resulting arrays of values are packed in a text-stream-friendly format and provided as the value argument to orch.keyvals() - in this way complex data types can be safely passed between the mappers and reducers (via text-based Hadoop streams). The reducers unpack the strings, retrieve all the values associated with a particular key (variable) and re-calculate the unique levels, now accounting for all values of that variable.

R> get_unique_levels <- function(x, cols, nmap, nred) {
  mapper <- function(k, v) {
    for (col in cols) {
      uvals <- unique(v[[col]])
      orch.keyvals(col, orch.pack(uvals))
    }
  }
  reducer <- function(k, v) {
    lvals <- orch.unpack(v$val)
    uvals <- unique(unlist(lvals))
    orch.keyval(k, orch.pack(uvals))
  }
  config <- new("mapred.config",
                job.name      = "get_unique_levls",
                map.output    = data.frame(key="a",val="packed"),
                reduce.output = data.frame(key="a",val="packed"),
                map.tasks     = nmap,
                reduce.tasks  = nred)
  res <- hadoop.run(data = x,
                    mapper = mapper,
                    reducer = reducer,
                    config = config,
                    export = orch.export(cols=cols))
  resl <- (lapply((hdfs.get(res))$val,function(x){orch.unpack(x)}))[[1]]
}

This implementation works fine provided that the number of levels of the categorical variables is much smaller than the number of records in the entire data set. If some categorical variables have many levels, not far from the order of the total number of records, each mapper may return a large number of levels and each reducer may have to handle multiple large objects. An efficient solution for this case requires a different approach. However, if the column associated with one of these variables can fit in memory, a direct, very crude calculation like the one below can run faster than the former implementation. Here the mappers extract the column with the values of the variable in question, the column is pulled into an in-memory object, and unique() is called to determine the unique levels.

R> get_unique_levels_sngl <- function(HdfsData,col,nmap)
{
  mapper_fun <- function(k,v) { orch.keyvals(key=NULL,v[[col]]) }
  config <- new("mapred.config",
                job.name      = "extract_col",
                map.output    = data.frame(key=NA,VAL=0),
                map.tasks     = nmap)
    x <- hadoop.run(data=HdfsData,
                    mapper=mapper_fun,
                    config=config,
                    export=orch.export(col=col),
                    cleanup=TRUE)
  xl <- hdfs.get(x)
  res <- unique(xl$VAL)
}

R> customers <- get_unique_levels_sngl(data_rebalanced_dfs,"CID",nmap=32)

We thus obtained the unique levels of the categorical variable CID (customer id) from our data_rebalanced_dfs data.

4. Partitioning the data for mapreduce execution


Let's suppose that the user wants to execute some specific data manipulations at the CID level: aggregations, variable transformations, new variable generation, etc. Associating a key with every customer (CID level) would be a bad idea since there are many customers - our hypothesis was that the number of CID levels is not orders of magnitude below the total number of records. This would lead to an excessive number of reducers with a terrible impact on performance. In such a case it would be better, for example, to bag customers into groups and distribute the execution at the group level. The user may want to set the number of these groups, ngrp, to something commensurate with the number of BDA cores available for parallelizing the task.

The example below illustrates how to do that at a basic level. The groups are generated within the encapsulating function myMRjob, before the hadoop.run() execution - the var.grps data frame has two columns: the CID levels and the group number (from 1 to ngrp) with which they are associated. This table is passed to the hadoop execution environment via orch.export() within hadoop.run(). The mapper_fun function extracts the group number as the key and inserts the multiple key-value pairs into the output buffer. Each reducer then gets a complete set of records for every customer associated with a particular key (group) and can proceed with the transformations/manipulations within a loop-over-customers or whatever programming construct is appropriate. Each reducer handles a roughly equal number of customers because this is how the groups were generated. However, the number of records per customer is not constant and may introduce some imbalance.

R> myMRjob <- function(HdfsData,var,ngrp,nmap,nred)
{
  mapper_fun <- function(k,v) {
    ....
    fltr <- <some_row_filtering>
    cID <- which(names(v) %in% "CUSTOMID")
    kk <- var.grps[match(v[fltr,cID],var.grps$CUSTOMID),2]
    orch.keyvals(kk,v[fltr,,drop=FALSE])
  }
  reducer_fun <- function(k,v) { ... }
  config <- new("mapred.config", map.tasks = nmap, reduce.tasks = nred,....)

  var.grps <- data.frame(CUSTOMID=var,
    GRP=rep(1:ngrp,sapply(split(var,ceiling(seq_along(var)/(length(var)/ngrp))),length)))

  res <- hadoop.run(data = HdfsData,
                    mapper = mapper_fun,
                    reducer = reducer_fun,
                    config = config,
                    export = orch.export(var.grps=var.grps,ngrp=ngrp),
                    cleanup = TRUE
  )
  res
}

x <- myMRjob(HdfsData=data_rebalanced_dfs, var=customers, ngrp=..,nmap=..,nred=..)

Improved data partitioning solutions could be sought for cases where there are strong imbalances in the number of records per customer or where great variations are noticed between the reducer jobs' completion times. This kind of optimization will be addressed in a later blog.

Friday Apr 17, 2015

The Intersection of “Data Capital” and Advanced Analytics

We’ve heard about the Three Laws of Data Capital from Paul Sonderegger at Oracle: data comes from activity, data tends to make more data, and platforms tend to win. Advanced analytics enables enterprises to take full advantage of the data their activity produces, ranging from IoT sensors and PoS transactions to social media and image/video. Traditional BI tools produce summary data from data, producing more data, but they provide a view of the past: what did happen. Advanced analytics also produces more data from data, but this data is transformative, generating previously unknown insights and providing a view of future behavior or outcomes: what will likely happen. Oracle provides a platform for advanced analytics today through Oracle Advanced Analytics on Oracle Database, and Oracle R Advanced Analytics for Hadoop on Big Data Appliance, to support investing data.

Enterprises need to put their data to work to realize a return on their investment in data capture, cleansing, and maintenance. Investing data through advanced analytics algorithms has repeatedly been shown to dramatically increase ROI. For examples, see the customer quotes and videos from StubHub, dunnhumby, and CERN, among others. Too often, data centers are perceived as imposing a “tax” instead of yielding a “dividend.” If you cannot extract new insights from your data and use it to perform revenue-enhancing actions such as predicting customer behavior, understanding root causes, and reducing fraud, the costs to maintain large volumes of historical data may feel like a tax. How do enterprises convert data centers to dividend-yielding assets?

One approach is to reduce “transaction costs.” Typically, these transaction costs involve the cost for moving data into environments where predictive models can be produced or sampling data to be small enough to fit existing hardware and software architectures. Then, there is the cost for putting those models into production. Transaction costs result in multi-step efforts that are labor intensive and make enterprises postpone investing their data and deriving value. Oracle has long recognized the origins of these high transaction costs and produced tools and a platform to eliminate or dramatically lower these costs.

Further, consider the data scientist or analyst as the “data capital manager,” the person or persons striving to extract the maximum yield from data assets. To achieve high dividends with low transaction costs, the data capital manager needs to be supported with tools and a platform that automates activities – making them more productive – and ultimately more heroic within the enterprise – doing more with less because it’s faster and easier. Oracle removes a lot of the grunt work from the advanced analytics process: data is readily accessible, data manipulation and model building / data scoring is scalable, and deployment is immediate. To learn more about how to increase dividends from your data capital, see Oracle Advanced Analytics and Oracle R Advanced Analytics for Hadoop.

Monday Apr 06, 2015

Using rJava in Embedded R Execution

Integration with high performance programming languages is one way to tackle big data with R. Portions of the R code are moved from R to another language to avoid bottlenecks and perform expensive procedures. The goal is to balance R’s elegant handling of data with the heavy duty computing capabilities of other languages.

Outsourcing R computations to another language can easily be hidden inside R functions, so proficiency in the target language is not a prerequisite for the users of these functions. The rJava package by Simon Urbanek is one such example - it outsources work from R to Java very much like R's native .C/.Call interface. rJava allows users to create objects, call methods and access fields of Java objects from R.

Oracle R Enterprise (ORE) provides an additional boost to rJava when used in embedded R script execution on the database server machine. Embedded R Execution allows R scripts to take advantage of a likely more powerful database server machine - more memory and CPUs, and greater CPU power. Through embedded R, ORE enables R to leverage database support for data parallel and task parallel execution of R scripts and also operationalize R scripts in database applications.  The net result is the ability to analyze larger data sets in parallel from a single R or SQL interface, depending on your preference.

In this post, we demonstrate a basic example of configuring and deploying rJava in base R and embedded R execution.

1. Install Java

To start, you need Java. If you are not using a pre-configured engineered system like Exadata or the Big Data Appliance, you can download the Java Runtime Environment (JRE) and Java Development Kit (JDK) here.

To verify the JRE is installed on your system, execute the command:

$ java -version
java version "1.7.0_67"


If the JRE is installed on the system, the version number is returned. The equivalent check for JDK is:

$ javac -version
javac 1.7.0_67


A "command not recognized" error indicates either Java is not present or you need to add Java to your PATH and CLASSPATH environment variables.

2. Configure Java Parameters for R

R provides the javareconf utility to configure Java support in R.  To prepare the R environment for Java, execute this command:

$ sudo R CMD javareconf

or

$ R CMD javareconf -e

3.  Install rJava Package

rJava release versions can be obtained from CRAN.  Assuming an internet connection is available, the install.packages command in an R session will do the trick.

> install.packages("rJava")
..
..
* installing *source* package ‘rJava’ ...
** package ‘rJava’ successfully unpacked and MD5 sums checked
checking for gcc... gcc -m64 -std=gnu99
..
..
** testing if installed package can be loaded
* DONE (rJava)


4. Configure the Environment Variable CLASSPATH

The CLASSPATH environment variable must contain the directories with the jar and class files.  The class files in this example will be created in /home/oracle/tmp.

  export CLASSPATH=$ORACLE_HOME/jlib:/home/oracle/tmp

Alternatively, use the rJava function .jaddClassPath to define the path to the class files.
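
For example, once the JVM has been initialized (see step 6 below), the class directory can be added at runtime; the path below is the same example directory used above:

R> .jaddClassPath("/home/oracle/tmp")
R> .jclassPath()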


5. Create and Compile Java Program

For this test, we create a simple "Hello, World!" example. Create the file HelloWorld.java in /home/oracle/tmp with the contents:

  public class HelloWorld {
      public String SayHello(String str) {
          String a = "Hello,";
          return a.concat(str);
      }
  }


Compile the Java code.

$ javac HelloWorld.java


6.  Call Java from R


In R, execute the following commands to load the rJava package and initialize the Java Virtual Machine (JVM).

R> library(rJava)
R> .jinit()


Instantiate the class HelloWorld in R. In other words, tell R to look at the compiled HelloWorld program.

R> obj <- .jnew("HelloWorld")

Call the SayHello method directly, passing the string argument:

R> str <- "World!"
R> .jcall(obj, "S", "SayHello", str)


7.  Call Java In Embedded R Execution


Oracle R Enterprise uses external procedures in Oracle Database to support embedded R execution. By default, the external procedure agent is spawned directly by Oracle Database. The path to the JVM shared library, libjvm.so, must be added to the environment variable LD_LIBRARY_PATH so it is found in the shell where Oracle is started. This is defined in two places: in the OS shell and in the external procedures configuration file, extproc.ora.

In the OS shell:

$ locate libjvm.so

/usr/java/jdk1.7.0_45/jre/lib/amd64/server

$ export LD_LIBRARY_PATH=/usr/java/jdk1.7.0_45/jre/lib/amd64/server:$LD_LIBRARY_PATH


In extproc.ora:

$ cd $ORACLE_HOME/hs/admin


Edit the file extproc.ora to add the path to libjvm.so in LD_LIBRARY_PATH:

SET EXTPROC_DLLS=ANY
SET LD_LIBRARY_PATH=/usr/java/jdk1.7.0_45/jre/lib/amd64/server
export LD_LIBRARY_PATH


You will need to bounce the database instance after updating extproc.ora.

Now load rJava in embedded R:

> library(ORE)
> ore.connect(user     = 'oreuser',
             password = 'password',
             sid      = 'sid',
             host     = 'hostname',
             all      = TRUE)


> TEST <- ore.doEval(function(str) {
                       library(rJava)
                       .jinit()
                       obj <- .jnew("HelloWorld")
                       val <- .jcall(obj, "S", "SayHello", str)
                       return(as.data.frame(val))
                     },
                     str = 'World!',
                    FUN.VALUE = data.frame(VAL = character())
  )

> print(TEST)
              VAL
1 Hello,      World!


If you receive this error, LD_LIBRARY_PATH is not set correctly in extproc.ora:

Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch,  :
  Error in try({ : ORA-20000: RQuery error
Error : package or namespace load failed for ‘rJava’
ORA-06512: at "RQSYS.RQEVALIMPL", line 104
ORA-06512: at "RQSYS.RQEVALIMPL", line 101


Once you've mastered this simple example, you can move to your own use case. If you get stuck, the rJava package has very good documentation. Start with the information on the rJava CRAN page. Then, from an R session with the rJava package loaded, execute the command help(package="rJava") to list the available functions.

After that, the source code of R packages which use rJava are a useful source of further inspiration – look at the reverse dependencies list for rJava in CRAN. In particular, the helloJavaWorld package is a tutorial for how to include Java code in an R package.



Monday Mar 30, 2015

Oracle Open World 2015 Call for Proposals!

It's that time of year again...submit your session proposals for Oracle OpenWorld 2015!

Oracle customers and partners are encouraged to submit proposals to present at the Oracle OpenWorld 2015 conference, October 25 - 29, 2015, held at the Moscone Center in San Francisco.

Details and submission guidelines are available on the Oracle OpenWorld Call for Proposals web site. The deadline for submissions is Wednesday, April 29, 11:59 p.m. PDT.

We look forward to checking out your sessions on Oracle Advanced Analytics, including Oracle R Enterprise and Oracle Data Mining, and Oracle R Advanced Analytics for Hadoop. Tell us how these tools have enhanced the way you do business!

Monday Mar 23, 2015

Oracle R Distribution 3.1.1 Available for Download on all Platforms

The Oracle R Distribution 3.1.1 binaries for Windows, AIX, Solaris SPARC and Solaris x86 are now available on OSS, Oracle's Open Source Software portal. Oracle R Distribution 3.1.1 is an update to R version 3.1.0 that includes many improvements, among them upgrades to the package help system and improved accuracy when importing data with large integers. The complete list of changes is in the NEWS file.

To install Oracle R Distribution, follow the instructions for your platform in the Oracle R Enterprise Installation and Administration Guide.

Thursday Feb 12, 2015

Pain Point #6: “We need to build 10s of thousands of models fast to meet business objectives”

The last pain point in this series on Addressing Analytic Pain Points involves one aspect of what I call massive predictive modeling. Increasingly, enterprise customers are building a greater number of models. In past decades, producing a handful of production models per year may have been considered a significant accomplishment. With the advent of powerful computing platforms, parallel and distributed algorithms, as well as the wealth of data – Big Data – we see enterprises building hundreds and thousands of models in targeted ways.

For example, consider the utility sector with data being collected from household smart meters. Whether water, gas, or electricity, utility companies can make more precise demand projections by modeling individual customer consumption behavior. Aggregating this behavior across all households can provide more accurate forecasts, since individual household patterns are considered, not just generalizations about all households, or even different household segments.

The concerns associated with this form of massive predictive modeling include: (i) dealing effectively with Big Data from the hardware, software, network, storage and Cloud, (ii) algorithm and infrastructure scalability and performance, (iii) production deployment, and (iv) model storage, backup, recovery and security. Some of these I’ve explored under previous pain points blog posts.

Oracle Advanced Analytics (OAA) and Oracle R Advanced Analytics for Hadoop (ORAAH) both provide support for massive predictive modeling. From the Oracle R Enterprise component of OAA, users leverage embedded R execution to run user-defined R functions in parallel, both from R and from SQL. OAA provides the infrastructure to allow R users to focus on their core R functionality while allowing Oracle Database to handle spawning of R engines, partitioning data and providing data to their R function across parallel R engines, aggregating results, etc. Data parallelism is enabled using the “groupApply” and “rowApply” functions, while task parallelism is enabled using the “indexApply” function. The Oracle Data Mining component of OAA provides "on-the-fly" models, also called "predictive queries," where the model is automatically built on partitions of the data and scoring using those partitioned models is similarly automated.
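
As a hedged sketch of data-parallel model building with embedded R execution (the CHURN_DATA table and its columns are hypothetical, and the connection details are placeholders), "groupApply" builds one model per partition, with Oracle Database spawning and feeding the parallel R engines:

R> library(ORE)
R> ore.connect(user="oreuser", password="password", sid="sid",
               host="hostname", all=TRUE)
R> models <- ore.groupApply(
     CHURN_DATA,                                 # ore.frame proxy for a database table
     INDEX = CHURN_DATA$REGION,                  # one partition, and one R engine, per region
     function(dat) {
       # each invocation receives one region's rows as a local data.frame
       glm(CHURNED ~ TENURE + MONTHLY_SPEND, data=dat, family=binomial())
     },
     parallel = TRUE)                            # degree of parallelism managed by the database
R> class(models)                                 # an ore.list of fitted models, one per region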

ORAAH enables the writing of mapper and reducer functions in R where corresponding ORE functionality can be achieved on the Hadoop cluster. For example, to emulate “groupApply”, users write the mapper to partition the data and the reducer to build a model on the resulting data. To emulate “rowApply”, users can simply use the mapper to perform, e.g., data scoring, passing the model into the mapper's environment. No reducer is required.
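
A hedged ORAAH sketch of that “groupApply” emulation could look like the following; the HDFS object, column names and output metadata are hypothetical, with the mapper keying rows by segment and the reducer fitting one model per segment:

R> mapper_fun <- function(k, v) {
     orch.keyvals(v$SEGMENT, v)                  # partition the rows by the SEGMENT column
   }
R> reducer_fun <- function(k, v) {
     fit <- lm(SPEND ~ TENURE + AGE, data=v)     # one model per segment
     orch.keyval(k, orch.pack(fit))              # serialize the model for the text stream
   }
R> config <- new("mapred.config",
                 job.name      = "model_per_segment",
                 map.output    = data.frame(key="a", SEGMENT="a", SPEND=0, TENURE=0, AGE=0),
                 reduce.output = data.frame(key="a", val="packed"))
R> res <- hadoop.run(data=segments_dfs,          # HDFS object id of the input data
                     mapper=mapper_fun,
                     reducer=reducer_fun,
                     config=config)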

Monday Jan 19, 2015

Pain Point #5: “Our company is concerned about data security, backup and recovery”

So far in this series on Addressing Analytic Pain Points, I’ve focused on the issues of data access, performance, scalability, application complexity, and production deployment. However, there are also fundamental needs for enterprise advanced analytics solutions that revolve around data security, backup, and recovery.

Traditional non-database analytics tools typically rely on flat files. If data originated in an RDBMS, that data must first be extracted. Once extracted, who has access to these flat files? Who is using this data and when? What operations are being performed? Security needs for data may be somewhat obvious, but what about the predictive models themselves? In some sense, these may be more valuable than the raw data since these models contain patterns and insights that help make the enterprise competitive, if not the dominant player. Are these models secure? Do we know who is using them, when, and with what operations? In short, what audit capabilities are available?

While security is a hot topic for most enterprises, it is essential to have a well-defined backup process in place. Enterprises normally have well-established database backup procedures that database administrators (DBAs) rigorously follow. If data and models are stored in flat files, perhaps in a distributed environment, one must ask what procedures exist and with what guarantees. Are the data files taxing file system backup mechanisms already in place – or not being backed up at all?

On the other hand, recovery involves using those backups to restore the database to a consistent state, reapplying any changes since the last backup. Again, enterprises normally have well-established database recovery procedures that are used by DBAs. If separate backup and recovery mechanisms are used for data, models, and scores, it may be difficult, if not impossible, to reconstruct a consistent view of an application or system that uses advanced analytics. If separate mechanisms are in place, they are likely more complex than necessary.

For Oracle Advanced Analytics (OAA), data is secured via Oracle Database, which wins security awards and is highly regarded for its ability to provide secure data for confidentiality, integrity, availability, authentication, authorization, and non-repudiation. Oracle Database logs and monitors user activity. Users can work independently or jointly in a shared environment with data access controlled by standard database privileges. The data itself can be encrypted and data redaction is supported.

OAA models are secured in one of two ways: (i) models produced in the kernel of the database are treated as first-class database objects with corresponding access privileges (create, update, delete, execute), and (ii) models produced through the R interface can be stored in the R datastore, which exists as a database table in the user's schema with its own access privileges. In either case, users must log into their Oracle Database schema/account, which provides the needed degree of confidentiality, integrity, availability, authentication, authorization, and non-repudiation.
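
For instance, a model built in R can be persisted in a datastore in the user's schema with ore.save() and retrieved later with ore.load(); the model and datastore names below are illustrative:

R> fit <- lm(Sepal.Length ~ Petal.Length, data=iris)   # any R object can be stored
R> ore.save(fit, name="my_models_ds",
            description="scoring models, Q1")          # persisted as a table in the user's schema
R> ore.datastore()                                     # list datastores visible to this user
R> ore.load(name="my_models_ds")                       # restores 'fit' into the R session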

Enterprise Oracle DBAs already follow rigorous backup and recovery procedures. The ability to reuse these procedures in conjunction with advanced analytics solutions is a major simplification and helps to ensure the integrity of data, models, and results.

Tuesday Dec 23, 2014

Pain Point #4: “Recoding R (or other) models into SQL, C, or Java takes time and is error prone”

In the previous post in this series Addressing Analytic Pain Points, I focused on some issues surrounding production deployment of advanced analytics solutions. One specific aspect of production deployment involves how to get predictive model results (e.g., scores) from R or leading vendor tools into applications that are based on programming languages such as SQL, C, or Java. In certain environments, one way to integrate predictive models involves recoding them into one of these languages. Recoding involves identifying the minimal information needed for scoring, i.e., making predictions, and implementing that in a language that is compatible with the target environment. For example, consider a linear regression model with coefficients. It can be fairly straightforward to write a SQL statement or a function in C or Java to produce a score using these coefficients. This translated model can then be integrated with production applications or systems.

While recoding has been a technique used for decades, it suffers from several drawbacks: latency, quality, and robustness. Latency refers to the time delay between the data scientist developing the solution and leveraging that solution in production. Customers recount historic horror stories where the process from analyst to software developers to application deployment took months. Quality comes into play on two levels: the coding and testing quality of the software produced, and the freshness of the model itself. In fast changing environments, models may become “stale” within days or weeks. As a result, latency can impact quality. In addition, while a stripped down implementation of the scoring function is possible, it may not account for all cases considered by the original algorithm implementer. As such, robustness, i.e., the ability to handle greater variation in the input data, may suffer.

One way to address this pain point is to make it easy to leverage predictive models immediately (especially open source R and in-database Oracle Advanced Analytics models), thereby eliminating the need to recode them. Since enterprise applications normally know how to interact with databases via SQL, as soon as a model is produced, it can be placed into production via SQL access. R models can be invoked for scoring in parallel using Oracle R Enterprise embedded R execution via ore.rowApply, and, for select model types, ore.predict automatically translates native R models for execution inside the database. For the native SQL Oracle Advanced Analytics algorithms, as found in Oracle Data Mining and exposed through an R interface in Oracle R Enterprise, scoring runs directly in Oracle Database. This approach minimizes or even eliminates latency, dramatically increases quality, and leverages the robustness of the original algorithm implementations.
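
Here is a minimal sketch of both paths, assuming an active Oracle R Enterprise connection; the data is just mtcars pushed to the database for illustration.

  library(ORE)

  dat <- ore.push(mtcars)                    # small example table as an ore.frame
  fit <- lm(mpg ~ wt + hp, data = mtcars)    # native R model built on the client

  # In-database scoring: ore.predict translates the lm model and scores inside Oracle Database
  pred <- ore.predict(fit, newdata = dat)
  head(pred)

  # Arbitrary R scoring logic can also run in parallel over chunks of rows
  res <- ore.rowApply(dat,
                      function(chunk, mod) cbind(chunk, pred = predict(mod, newdata = chunk)),
                      mod = fit, rows = 10)
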

Sunday Dec 14, 2014

Pain Point #3: “Putting R (or other) models and results into production is ad hoc and complex”

Continuing in our series Addressing Analytic Pain Points, another concern for data scientists and analysts, as well as enterprise management, is how to leverage analytic results in production systems. These production systems can include (i) dashboards used by management to make business decisions, (ii) call center applications where representatives see personalized recommendations for the customer they’re speaking to or how likely that customer is to churn, (iii) real-time recommender systems for customer retail web applications, (iv) automated network intrusion detection systems, and (v) semiconductor manufacturing alert systems that monitor product quality and equipment parameters via sensors – to name a few.

When a data scientist or analyst begins examining a data-based business problem, one of the first steps is to acquire the available data relevant to that problem. In many enterprises, this involves having it extracted from a data warehouse and operational systems, or acquiring supplemental data from third parties. They then explore the data, prepare it with various transformations, build models using a variety of algorithms and settings, evaluate the results, and after choosing a “best” approach, produce results such as predictions or insights that can be used by the enterprise.

If the end goal is to produce a slide deck or report, aside from those final documents, the work is done. However, reaping financial benefits from advanced analytics often needs to go beyond PowerPoint! It involves automating the process described above: extract and prepare the data, build and select the “best” model, generate predictions or highlight model details such as descriptive rules, and utilize them in production systems.

One of the biggest challenges enterprises face involves realizing the promised benefits in production that the data scientist achieved in the lab. How do you take that cleverly crafted R script, for example, and put all the necessary “plumbing” around it to enable not only the execution of the R script, but also the movement of data, the delivery of results where they are needed, parallel and distributed script execution across compute nodes, and execution scheduling?

As a production deployment, care needs to be taken to safeguard against potential failures in the process. Further, more “moving parts” result in greater complexity. Since the plumbing is often custom-implemented for each deployment, this plumbing needs to be reinvented and thoroughly tested for each project. Unfortunately, code and process reuse is seldom realized across an enterprise, even for similar projects, which results in duplication of effort.

Oracle Advanced Analytics (Oracle R Enterprise and Oracle Data Mining) with Oracle Database provides an environment that eliminates the need for a separately managed analytics server, the corresponding movement of data and results between such a server and the database, and the need for custom plumbing. Users can store their R and SQL scripts directly in Oracle Database and invoke them through standard database mechanisms. For example, R scripts can be invoked via SQL, and SQL scripts can be scheduled for execution through Oracle Database’s DBMS_SCHEDULER package. Parallel and distributed execution of R scripts is supported through embedded R execution, while the database kernel supports parallel and distributed execution of SQL statements and in-database data mining algorithms. In addition, using the Oracle Advanced Analytics GUI, Oracle Data Miner, users can convert “drag and drop” analytic workflows to SQL scripts for ease of deployment in Oracle Database.
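
For example, a minimal sketch of storing and invoking an R script, assuming an active Oracle R Enterprise connection and the required script-repository privileges; the script name is hypothetical.

  library(ORE)

  # Store an R function in the database R script repository
  ore.scriptCreate("buildModel",
                   function() {
                     mod <- lm(mpg ~ wt + hp, data = mtcars)
                     summary(mod)$r.squared
                   })

  # Invoke the stored script through embedded R execution; the same script can be
  # invoked from SQL (e.g., via the rqEval table function) and scheduled with DBMS_SCHEDULER
  ore.doEval(FUN.NAME = "buildModel")
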

By making solution deployment a well-defined and routine part of the production process and reducing complexity through fewer moving parts and built-in capabilities, enterprises are able to realize and then extend the value they get from predictive analytics faster and with greater confidence.

Wednesday Nov 19, 2014

Pain Point #2: “I can’t analyze or mine all of my data – it has to be sampled”

Continuing in our series Addressing Analytic Pain Points, another concern for enterprise data scientists and analysts is having to compromise accuracy due to sampling. While sampling is an important technique for data analysis, it’s one thing to sample because you choose to; it’s quite another if you are forced to sample or to use a much smaller sample than is useful. A combination of memory, compute power, and algorithm design normally contributes to this.

In some cases, data simply cannot fit in memory. As a result, users must either process data in batches (adding to code or process complexity), or limit the data they use through sampling. In some environments, sampling itself introduces a catch-22: the data is too big to fit in memory, so it needs to be sampled, yet sampling it with the current tool requires first loading the data into memory! As a result, sampling large volume data may require processing it in batches, involving extra coding.

As data volumes increase, computing statistics and predictive analytics models on a data sample can significantly reduce accuracy. For example, to find all the unique values for a given variable, a sample may miss values, especially those that occur infrequently. In addition, for environments like open source R, it is not enough for data to fit in memory; sufficient memory must be left over to perform the computation. This results from R’s call-by-value semantics.

Even when data fits in memory, local machines, such as laptops, may have insufficient CPU power to process larger data sets. Insufficient computing resources means that performance suffers and users must wait for results - perhaps minutes, hours, or longer. This wastes the valuable (and expensive) time of the data scientist or analyst. Having multiple fast cores for parallel computations, as normally present on database server machines, can significantly reduce execution time.

So let’s say we can fit the data in memory with sufficient memory left over, and we have ample compute resources. It may still be the case that performance is slow, or worse, the computation effectively “never” completes. A computation that would take days or weeks to complete on the full data set may be deemed as “never” completing by the user or business, especially where the results are time-sensitive. To address this problem, algorithm design must be reconsidered. Serial, non-threaded algorithms, especially those with quadratic or worse run time, do not readily scale. Algorithms need to be redesigned to work in a parallel and even distributed manner to handle large data volumes.

Oracle Advanced Analytics provides a range of statistical computations and predictive algorithms implemented in a parallel, distributed manner to enable processing much larger data volumes. By virtue of executing in Oracle Database, client-side memory limitations can be eliminated. For example, with Oracle R Enterprise, R users operate on database tables using proxy objects – of type ore.frame, a subclass of data.frame – such that data.frame functions are transparently converted to SQL and executed in Oracle Database. This eliminates data movement from the database to the client machine. Users can also leverage the Oracle Data Miner graphical interface or SQL directly. When high performance hardware, such as Oracle Exadata, is used, there are powerful resources available to execute operations efficiently on big data. On Hadoop, Oracle R Advanced Analytics for Hadoop – a part of the Big Data Connectors often deployed on Oracle Big Data Appliance – also provides a range of pre-packaged parallel, distributed algorithms for scalability and performance across the Hadoop cluster.
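
A minimal sketch of this transparency, assuming an active Oracle R Enterprise connection; the table name MYDATA and its AMOUNT column are hypothetical.

  library(ORE)
  ore.sync(table = "MYDATA")
  ore.attach()

  class(MYDATA)          # "ore.frame" -- a proxy object; no rows are pulled to the client
  dim(MYDATA)            # computed in the database

  # data.frame-style operations are transparently translated to SQL and run in the database
  mean(MYDATA$AMOUNT)
  summary(MYDATA$AMOUNT)
  head(MYDATA[MYDATA$AMOUNT > 100, ])   # row filtering becomes a WHERE clause
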

Friday Oct 24, 2014

Pain Point #1: “It takes too long to get my data or to get the ‘right’ data”

This is the first in a series on Addressing Analytic Pain Points: “It takes too long to get my data or to get the ‘right’ data.”

Analytics users can be characterized along multiple dimensions. One such dimension is how they get access to or receive data. For example, some receive data via flat files. Since we’re talking about “enterprise” users, this often means data stored in RDBMSs where users request data extracts from a DBA or more generally the IT department. Turnaround time can be hours to days, or even weeks, depending on the organization. If the data scientist needs more or different data, the cycle repeats – often leading to frustration on both sides and delays in generating results.

Other users are granted access to databases directly using programmatic access tools like ODBC, JDBC, their corresponding R variants, or ROracle. These users may be given read-only access to a range of data tables, possibly in a sandbox schema. Here, analytics users don’t have to go back to their DBA or IT to obtain extracts, but they still need to pull the data from the database to their client environment, e.g., a laptop, and push results back to the database. If significant volumes of data are involved, the time required for pulling data can hinder productivity. (Of course, this assumes the client has enough RAM to load the needed data sets, but that’s a topic for the next blog post.)
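
A minimal sketch of that pull-based pattern using ROracle; the connection details, table name, and columns are hypothetical placeholders.

  library(ROracle)

  drv <- dbDriver("Oracle")
  con <- dbConnect(drv, username = "analyst", password = "secret",
                   dbname = "//dbhost:1521/orcl")

  # Every analysis starts by pulling rows across the network into client memory
  dat <- dbGetQuery(con, "SELECT * FROM sales_history WHERE amount > 0")
  nrow(dat)        # the full result set now resides in client RAM

  dbDisconnect(con)
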

To address the first type of user, since much of the data in question resides in databases, empowering users with a self-service model mitigates the vicious cycle described above. When the available data are readily accessible to analytics users, they can see and select what they need at will. An Oracle Database solution addresses this data access pain point by providing schema access, possibly in a sandbox with read-only table access, for the analytics user.

Even so, this approach just turns the first type of user into the second mentioned above. An Oracle Database solution further addresses this pain point by minimizing or, where possible, eliminating data movement. Most analytics engines bring data to the computation, requiring extracts and, in some cases, even proprietary formats before being able to perform analytics. This takes time. Often, data movement can dwarf the time required to perform the actual computation. From the perspective of the analytics user, this is wasted time because it is just a perfunctory step on the way to getting the desired results. By bringing computation to the data, using Oracle Advanced Analytics (Oracle R Enterprise and Oracle Data Mining), the time normally required to move data is eliminated. Consider the time savings of being able to prepare data, compute statistics, or build predictive models and score data directly in the database. Using Oracle Advanced Analytics, either from R via Oracle R Enterprise, SQL via Oracle Data Mining, or the graphical interface Oracle Data Miner, users can leverage Oracle Database as a high performance computational engine.
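
As a minimal sketch of this in-database workflow, assuming an active Oracle R Enterprise connection; the SALES_HISTORY table and its columns are hypothetical.

  library(ORE)
  ore.sync(table = "SALES_HISTORY")
  ore.attach()

  # Data preparation and statistics execute in Oracle Database via the transparency layer
  recent <- SALES_HISTORY[SALES_HISTORY$AMOUNT > 0, ]
  summary(recent$AMOUNT)

  # Build and score a regression model in the database; only small results leave it
  fit  <- ore.lm(AMOUNT ~ QUANTITY + DISCOUNT, data = recent)
  pred <- predict(fit, newdata = recent)
  head(pred)
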

We should also note that Oracle Database has the high performance Oracle Call Interface (OCI) library for programmatic data access. For R users, Oracle provides the package ROracle that is optimized using OCI for fast data access. While ROracle performance may be much faster than other methods (ODBC- and JDBC-based), the time is still greater than zero and there are other problems that I’ll address in the next pain point.

Addressing Analytic Pain Points

If you’re an enterprise data scientist, data analyst, or statistician, and perform analytics using R or another third party analytics engine, you’ve likely encountered one or more of these pain points:

Pain Point #1: “It takes too long to get my data or to get the ‘right’ data”
Pain Point #2: “I can’t analyze or mine all of my data – it has to be sampled”
Pain Point #3: “Putting R (or other) models and results into production is ad hoc and complex”
Pain Point #4: “Recoding R (or other) models into SQL, C, or Java takes time and is error prone”
Pain Point #5: “Our company is concerned about data security, backup and recovery”
Pain Point #6: “We need to build 10s of thousands of models fast to meet business objectives”

Some pain points are related to the scale of data, yet others are felt regardless of data size. In this blog series, I’ll explore each of these pain points, how they affect analytics users and their organizations, and how Oracle Advanced Analytics addresses them.

Monday Sep 22, 2014

Oracle R Enterprise 1.4.1 Released

Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, makes the open source R statistical programming language and environment ready for the enterprise and big data. Designed for problems involving large data volumes, Oracle R Enterprise integrates R with Oracle Database.

R users can execute R commands and scripts for statistical and graphical analyses on data stored in Oracle Database. R users can develop, refine, and deploy R scripts that leverage the parallelism and scalability of the database to automate data analysis. Data analysts and data scientists can use open source R packages and develop and operationalize R scripts for analytical applications in one step – from R or SQL.

With the new release of Oracle R Enterprise 1.4.1, Oracle enables support for Multitenant Container Database (CDB) in Oracle Database 12c and pluggable databases (PDB). With support for CDB / PDB, enterprises can take advantage of new ways of organizing their data: easily taking entire databases offline and easily bringing them back online when needed. Enterprises, such as pharmaceutical companies, that collect vast quantities of data across multiple experiments for individual projects immediately benefit from this capability.

This point release also includes the following enhancements:

• Certified for use with R 3.1.1 and Oracle R Distribution 3.1.1.

• Simplified and enhanced script for install, upgrade, uninstall of ORE Server and the creation and configuration of ORE users.

• New supporting packages: arules and statmod.

• ore.glm accepts offset terms in the model formula and can fit negative binomial and Tweedie families of GLM.

• The ore.sync argument query creates an ore.frame object from a SELECT statement without creating a database view. This allows users to effectively access a view of the data without the CREATE VIEW privilege (see the sketch after this list).

• Global option for serialization, ore.envAsEmptyenv, specifies whether referenced environment objects in an R object, e.g., in an lm model, should be replaced with an empty environment during serialization to the ORE R datastore. This is used by (i) ore.push, which for a list object accepts envAsEmptyenv as an optional argument, (ii) ore.save, which has envAsEmptyenv as a named argument, and (iii) ore.doEval and the other embedded R execution functions, which accept ore.envAsEmptyenv as a control argument.
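
As referenced above, a minimal sketch of the new query argument to ore.sync, assuming an active Oracle R Enterprise connection; the table and columns in the SELECT statement are hypothetical.

  ore.sync(query = c("RECENT_SALES" =
                     "SELECT region, amount FROM sales_history WHERE amount > 0"))
  ore.attach()

  class(RECENT_SALES)   # "ore.frame" backed by the query -- no database view was created
  head(RECENT_SALES)
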

Oracle R Enterprise 1.4.1 can be downloaded from OTN here.

Wednesday Sep 17, 2014

Seismic Data Repository: on-the-fly data analysis and visualization using Oracle R Enterprise

RN-KrasnoyarskNIPIneft Establishes Seismic Information Repository for One of the World’s Largest Oil and Gas Companies. Read the complete customer story here, excerpts follow.

RN-KrasnoyarskNIPIneft (KrasNIPI) is a research and development subsidiary of Rosneft Oil Company, a top oil and gas company in Russia and worldwide. KrasNIPI provides high-quality information from seismic surveys to Rosneft—delivering key information that oil and gas companies seek to lower costs, environmental impacts, and risks while exploring for resources to satisfy growing energy needs. KrasNIPI’s primary activities include preparing the information base used for the exploration of hydrocarbons, development and construction of oil and gas fields, processing and interpretation of 2-D and 3-D seismic data, and seismic data warehousing.

Part of the solution involved on-the-fly data analysis and visualization for remote users with only a thin client—such as a web browser (without additional plug-ins and extensions). This was made possible by using Oracle R Enterprise (a component of Oracle Advanced Analytics) to support applications requiring extensive analytical processing.

“We store vast amounts of seismic data, process this information with sophisticated math algorithms, and deliver it to remote users under tight deadlines. We deployed Oracle Database together with Oracle Spatial and Graph, Oracle Fusion Middleware MapViewer on Oracle WebLogic Server, and Oracle R Enterprise to keep these complex business processes running smoothly. The result exceeded our most optimistic expectations.”
                              – Artem Khodyaev, Chief Engineer, Corporate Center of Seismic Information Repository, RN-KrasnoyarskNIPIneft

Thursday Aug 21, 2014

Oracle R Distribution 3.1.1 Released


Oracle R Distribution version 3.1.1 has been released to Oracle's public yum today. R-3.1.1 (code name "Sock it to Me") is an update to R-3.1.0 that consists mainly of bug fixes. It also 
includes enhancements related to accessing package help files, improved accuracy when importing data with large integers, and better integration with RStudio graphics. The full list of new features and bug fixes is listed in the NEWS file.

To install Oracle R Distribution using yum, follow the instructions in the Oracle R Enterprise Installation and Administration Guide.

Installing using yum will resolve any operating system dependencies automatically. As such, we recommend using yum to install Oracle R Distribution. However, if yum is not available, you can install Oracle R Distribution RPMs directly using RPM commands.

For Oracle Linux 5, the Oracle R Distribution RPMs are available in the Enterprise Linux Add-Ons repository:

  R-3.1.1-1.el5.x86_64.rpm
  R-core-3.1.1-1.el5.x86_64.rpm
  R-devel-3.1.1-1.el5.x86_64.rpm
  libRmath-3.1.1-1.el5.x86_64.rpm
  libRmath-devel-3.1.1-1.el5.x86_64.rpm
  libRmath-static-3.1.1-1.el5.x86_64.rpm

For Oracle Linux 6, the Oracle R Distribution RPMs are available in the Oracle Linux Add-Ons repository:

  R-3.1.1-1.el6.x86_64.rpm
  R-core-3.1.1-1.el6.x86_64.rpm
  R-devel-3.1.1-1.el6.x86_64.rpm
  libRmath-3.1.1-1.el6.x86_64.rpm
  libRmath-devel-3.1.1-1.el6.x86_64.rpm
  libRmath-static-3.1.1-1.el6.x86_64.rpm

For example, this command installs the R 3.1.1 RPM on Oracle Linux x86-64 version 6:

  rpm -i R-3.1.1-1.el6.x86_64.rpm

To complete the Oracle R Distribution 3.1.1 installation, repeat this command for each of the 6 RPMs, resolving dependencies as required. 

Oracle R Distribution 3.1.1 is certified with Oracle R Enterprise 1.4.x. Refer to Table 1-2 in the Oracle R Enterprise Installation Guide for supported configurations of Oracle R Enterprise components, or check this blog for updates. The Oracle R Distribution 3.1.1 binaries for Windows, AIX, Solaris SPARC and Solaris x86 are also available on OSS, Oracle's Open Source Software portal.


Monday Aug 18, 2014

Real-time Big Data Analytics is a reality for StubHub with Oracle Advanced Analytics

What can you use for a comprehensive platform for real-time analytics?
How can you process big data volumes for near-real-time recommendations and dramatically reduce fraud?

Learn in this video what StubHub achieved with Oracle R Enterprise from the Oracle Advanced Analytics option to Oracle Database, and read more on their story here.

Advanced analytics solutions that impact the bottom line of a business are challenging due to the range of skills and individuals involved in realizing such solutions. While we hear a lot about the role of the data scientist, that role is but one piece of the puzzle. Advanced analytics solutions also have an operationalization aspect that requires close proximity to where the transactional activity occurs.

The data scientist needs access to the right data with which to model the business problem. This involves IT for data collection, management, and administration, as well as ensuring zero downtime (a website needs to be up 24x7). This also involves working with the data scientist to keep predictive models refreshed with the latest scripts.

Integrating advanced analytics solutions into enterprise apps involves not just generating predictions, but supporting the whole life-cycle from data collection, to model building, model assessment, and then outcome assessment and feedback to the model building process again. Application and web interface designers need to take into account how end users will see and use the advanced analytics results, e.g., supporting operations staff that need to handle the potentially fraudulent transactions.

As just described, advanced analytics projects can be "complicated" from just a human perspective. The extent to which software can simplify the interactions among users and systems will increase the likelihood of project success. The ability to quickly operationalize advanced analytics projects and demonstrate measurable value means the difference between a successful project and just a nice research report.

By standardizing on Oracle Database and SQL invocation of R, along with in-database modeling as found in Oracle Advanced Analytics, expedient model deployment and zero downtime for refreshing models becomes a reality. Meanwhile, data scientists are also able to explore leading edge techniques available in open source. The Oracle solution propels the entire organization forward to realize the value of advanced analytics.

About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.
