Using Embedded R Execution: Imputing Missing Data While Preserving Data Structure

This guest post from Matt Fritz, Data Scientist, demonstrates a method for imputing missing values in data using Embedded R Execution with Oracle R Enterprise.

Missing data is a common issue among analyses and is mitigated by imputation. Several techniques handle this process within Oracle R Enterprise; however, some bias the data or generate outputs as data objects that are less accessible than others. This post illustrates ways to effectively impute data while specifying the exact data structure of the output keeping the output’s structure functional in Oracle R Enterprise.

Let’s first create missing data in the WorldPhones data set and create it in Oracle R Enterprise:


  WorldPhones[c(2,6),c(1,2,4)] <- NA
  WorldPhones <- as.data.frame(WorldPhones)
  ore.create(WorldPhones, table = 'PHONES')
  class(PHONES)

  > class(PHONES)
  [1] "ore.frame"

  > PHONES
        N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
  1951  45939  21574 2876   1815    1646     89      555
  1956     NA     NA 4708     NA    2366   1411      733
  1957  64721  32510 5230   2695    2526   1546      773
  1958  68484  35218 6662   2845    2691   1663      836
  1959  71799  37598 6856   3000    2868   1769      911
  1960     NA     NA 8220     NA    3054   1905     1008
  1961  79831  43173 9053   3338    3224   2005     1076

The easiest way to handle missing data is by substituting these values with a constant, such as zero. We are ready to recode the missing values and can use either the Transparency Layer or Embedded R Execution. The Transparency Layer will convert the base R code below into SQL and run the generated SQL inside the database:

  newPHONES <- PHONES
  newPHONES$N.Amer <- ifelse(is.na(newPHONES$N.Amer),0,newPHONES$N.Amer)
  newPHONES$Europe <- ifelse(is.na(newPHONES$Europe),0,newPHONES$Europe)
  newPHONES$S.Amer <- ifelse(is.na(newPHONES$S.Amer),0,newPHONES$S.Amer)
  newPHONES

  > newPHONES
        N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
  1951  45939  21574 2876   1815    1646     89    555
  1956  0      0     4708   0       2366   1411    733
  1957  64721  32510 5230   2695    2526   1546    773
  1958  68484  35218 6662   2845    2691   1663    836
  1959  71799  37598 6856   3000    2868   1769    911
  1960  0      0     8220   0       3054   1905    1008
  1961  79831  43173 9053   3338    3224   2005    1076


This process can also be executed in Embedded R Execution – which spawns an R engine on the database server under the control of Oracle Database – by using a custom R function, such as: 

  function(x) ifelse(is.na(x),0,x) 

One way to call this custom function is with ore.doEval. This method requires code to be written as if it were to be executed on the client; however, the ore.doEval wrapper moves the code to the R Script Repository of Oracle R Enterprise in the database and then leverages the database server’s superior processing capacity: 

  newPHONE <- ore.doEval(
     function() {
        ore.sync(table="PHONES")
        ore.attach()
        data.frame(apply(ore.pull(PHONES)
           ,2
           ,function(x) ifelse(is.na(x),0,x)))} 
           ,ore.connect=TRUE)

Note that we explicitly pull the data from the database using Oracle R Enterprise’s Transparency Layer on the database table PHONES. We must connect to the database to obtain the ore.frame that corresponds to the PHONES table. This is accomplished through the ore.sync function. The ore.attach function allows us to reference the ore.frame by its table name.

The second way is via ore.tableApply, which applies a function on an entire input table within Oracle R Enterprise. The same result is created as with ore.doEval and although both operations are successful, the output’s structure defaults to an ORE object instead of a data frame: 

  newPHONES <- 
ore.tableApply(PHONES
                    ,function(y) {
                       apply(y
                            ,2
                            ,function(x) ifelse(is.na(x),0,x))})
  class(newPHONES)

  > class(newPHONES)
   [1] "ore.object"

Since we cannot work with this object the same way as data frames or matrices, we must pull the ORE object onto the client in order to deserialize the object into an R matrix:

  newphones <- ore.pull(newPHONES)
  class(newphones)

  > class(newphones)
   [1] "matrix"

  > head(newphones)
     N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
  1  45939  21574 2876   1815    1646     89      555
  2  0      0     4708   0       2366   1411      733
  3  64721  32510 5230   2695    2526   1546      773
  4  68484  35218 6662   2845    2691   1663      836
  5  71799  37598 6856   3000    2868   1769      911
  6  0      0     8220   0       3054   1905     100

In this example, it is preferred that the output be structured as a data frame so that we can continue to work within Oracle R Enterprise versus the client. The FUN.VALUE feature within Embedded R provides this flexibility by defining the output data’s structure. For example, the output can be explicitly expressed as a data frame of numeric columns that have identical names to the input. 


  newPHONES <- ore.tableApply(PHONES,
                   function(y) {
                     data.frame(apply(y,
                           2,
                           function(x) ifelse(is.na(x),0,x)))},
                              FUN.VALUE=data.frame(setNames(replicate(7,
                                                          numeric(0),
                                                          simplify=F),
                                                          colnames(PHONES))))


  class(newPHONES)

  > class(newPHONES)
   [1] "ore.frame"

We can now continue to work with the newPHONES output within Oracle R Enterprise just as we would a data frame.

While these methods are technically sufficient, they are not practical for this type of data set. As this is panel data ranging from 1951 to 1961, simply recoding missing values to zero appears to strongly bias the data. Perhaps we prefer to calculate the average of each missing observation’s pre- and post-period values. Embedded R allows for a simple solution by utilizing the open-source zoo package.

  newPHONES <-  ore.tableApply(PHONES,
                    function(y) {
                      library(zoo)
                        data.frame(
                        apply(y, 2, function(x) (na.locf(x) + rev(na.locf(rev(x))))/2))},
                           FUN.VALUE=data.frame(setNames(replicate(7,
                                                      numeric(0),
                                                      simplify=F),
                                                      colnames(PHONES))))
  newPHONES


  > newPHONES
     N.Amer  Europe Asia S.Amer Oceania Africa Mid.Amer
  1  45939 21574.0 2876   1815    1646     89      555
  2  55330 27042.0 4708   2255    2366   1411      733
  3  64721 32510.0 5230   2695    2526   1546      773
  4  68484 35218.0 6662   2845    2691   1663      836
  5  71799 37598.0 6856   3000    2868   1769      911
  6  75815 40385.5 8220   3169    3054   1905     1008
  7  79831 43173.0 9053   3338    3224   2005     1076

These imputed values seem much more reasonable and the output’s structure acts just like a data frame within Oracle R Enterprise.

To recap, handling missing values plays an important role in data analysis and several imputation methods can be leveraged via the Transparency Layer or Embedded R. Further, Embedded R’s FUN.VALUE feature explicitly defines the output’s structure and allows for results to be immediately analyzed within Oracle R Enterprise.

The FUN.VALUE feature requires more tuning when the output comprises both numeric and character columns. Check back for a later post that explains how to define a data frame of ‘mixed class'.

Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.

Search

Archives
« August 2015
SunMonTueWedThuFriSat
      
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
     
Today