
Best practices, news, tips and tricks - learn about Oracle's R Technologies for Oracle Database and Big Data

  • June 25, 2018

Returning Tables from Embedded R Execution .... Simplified

Jie Liu
Data Scientist

In this tips and tricks blog, we share techniques from our own use of Oracle R Enterprise on data science projects that you may find useful in your own projects. This time, we focus on automating the creation of the data frame schema needed to return tables from embedded R execution runs.

Embedded R Execution

ORE embedded R execution provides a powerful and convenient way to execute custom R scripts at the database server, from either R or SQL. It also enables running those scripts in a data-parallel or task-parallel manner. With embedded R execution, the user also gets the opportunity to call any 3rd party R package and spawn one or more R engines to run their user-defined R function in parallel. A more detailed explanation can be found in this blog post Introduction to ORE Embedded R Script Execution.

The ORE embedded R execution functions, e.g., ore.tableApply and ore.rowApply, provide flexible choices of return objects, such as R objects, models, etc. One of the most popular choices is an ore.frame when called from R, or a table when called from SQL.

In order to get an ORE frame as a return object, we need to supply the schema (column names and types) to the embedded R function in the argument FUN.VALUE.

For instance, consider the dataset iris.

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1            5.1         3.5          1.4         0.2    setosa
2            4.9         3.0          1.4         0.2    setosa
3            4.7         3.2          1.3         0.2    setosa
4            4.6         3.1          1.5         0.2    setosa
5            5.0         3.6          1.4         0.2    setosa
6            5.4         3.9          1.7         0.4    setosa
...
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica

We want to build a linear model for each iris species that predicts Sepal.Length from the remaining measurements, and return the coefficients associated with each species as a data.frame. To this end, the ore.groupApply function can build a linear regression model on the data for each species, and do so in parallel. Consider the following code.

grpAp2 <-
  ore.groupApply(IRIS, IRIS$Species,
                 function(df) {
                     if (nrow(df) == 0) {
                         species <- character()
                         cf <- numeric()
                         names(cf) <- character()
                     } else {
                         species <- as.character(df$Species[1])
                         cf <- coef(lm(Sepal.Length ~ .,
                                       data = df[1:4]))
                     }
                    data.frame(Species = species,
                                CoefName = names(cf),
                                CoefValue = unname(cf),
                                stringsAsFactors = FALSE)
                 },
                 FUN.VALUE =
                 data.frame(Species = character(),
                            CoefName = character(),
                            CoefValue = numeric(),
                            stringsAsFactors = FALSE),
                 parallel = TRUE)

To get back an ore.frame from ore.groupApply, we must supply a FUN.VALUE argument that describes the schema of the returned ore.frame; here, the schema contains 3 columns: Species, CoefName, and CoefValue. This is required if we want the output of ore.groupApply to be an ore.frame; otherwise it will be a list of the component data.frames. The schema also requires specifying the type of each column, such as character(), numeric(), etc. This is fine when the data frame has only a few columns, but in practice there are many use cases where the returned ore.frame has a large number of columns. Even when the number of columns is modest, specifying the type of each column takes some effort.

In most cases, the user can run the user-defined function locally first and derive the schema from the output data. However, this means repeating a similar process every time, and it requires pulling the data from the database, which may not be efficient.
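As a sketch of this manual approach in plain R (build_coefs here is a local stand-in for the function passed to ore.groupApply; no database is involved):

```r
# Local stand-in for the user-defined function used with ore.groupApply
build_coefs <- function(df) {
  cf <- coef(lm(Sepal.Length ~ ., data = df[1:4]))
  data.frame(Species = as.character(df$Species[1]),
             CoefName = names(cf),
             CoefValue = unname(cf),
             stringsAsFactors = FALSE)
}

# Run locally on one group, then keep a zero-row slice: the column names
# and types survive, the rows do not -- exactly the shape FUN.VALUE needs
schema <- build_coefs(iris[iris$Species == "setosa", ])[0, ]
str(schema)
```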

To automate this process and let the process run at the database server machine, we introduce a convenience function ore.getApplySchema to illustrate working with embedded R execution. The idea is to supply a representative subset data frame as an input to the function. Then the schema can be produced from the output and returned to the user. The user can then use this schema when calling the embedded R execution function to return an ore.frame. While it is not difficult to realize this process by writing a few lines of R code, we wanted to provide a general-purpose convenience function to avoid writing such code and illustrate working with embedded R execution.

Our ore.getApplySchema function below supports rowApply, tableApply, and groupApply; it could be expanded to work with doEval and indexApply as well. Note that the type of the embedded R execution function must be specified because the input and the code to retrieve the schema differ across functions.

ore.getApplySchema <- function(ODF, function_name, ..., col = NULL,
                               row.num = NULL, type = 'rowApply',
                               ore.connect.flag = FALSE) {
  # Check whether the script exists in the R script repository
  res <- ore.scriptList(name = function_name)
  if (nrow(res) == 0) {
    stop('Function does not exist in the repository!')
  }
  rownames(ODF) <- ODF[, 1]
  # Optionally restrict the input to a representative subset of rows
  if (is.null(row.num))
    INPUT <- ODF
  else
    INPUT <- ODF[row.num, ]
  switch(type,
         rowApply = {
           res <- ore.rowApply(INPUT,
                               FUN.NAME = function_name,
                               ...,
                               ore.connect = ore.connect.flag)
         },
         tableApply = {
           res <- ore.tableApply(INPUT,
                                 FUN.NAME = function_name,
                                 ...,
                                 ore.connect = ore.connect.flag)
         },
         groupApply = {
           if (is.null(col))
             stop("group apply requires the col information!")
           res <- ore.groupApply(INPUT,
                                 INPUT[, col],
                                 FUN.NAME = function_name,
                                 ...,
                                 ore.connect = ore.connect.flag)
         }
  )
  # A zero-row slice of the result preserves column names and types
  if (type == 'tableApply')
    schema <- ore.pull(res)[0, ]
  else
    schema <- res[[1]][0, ]
  return(schema)
}

For the IRIS data, we first write the user-defined function separately. Note that the user-defined function should return a data frame whose schema conforms to the desired ORE schema. In the following code, we add lines to handle the case of an empty input data frame.

build_model <- function(df) {
  if (nrow(df) == 0) {
    species <- character()
    cf <- numeric()
    names(cf) <- character()
  } else {
    species <- as.character(df$SPECIES[1])
    cf <- coef(lm(SEPAL.LENGTH ~ .,
                  data = df[1:4]))
  }
  data.frame(SPECIES = species,
             COEFNAME = names(cf),
             COEFVALUE = unname(cf),
             stringsAsFactors = FALSE)
}

Then call our function as follows.

schema <- ore.getApplySchema(IRIS, 'build_model', col = 'SPECIES', row.num = NULL, type = 'groupApply')
[1] SPECIES   COEFNAME  COEFVALUE
<0 rows> (or 0-length row.names)
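With this schema in hand, the embedded R call that returns an ore.frame becomes short (a sketch, assuming the IRIS ore.frame uses the upper-case column names that build_model expects):

```r
# Build one model per species in parallel; FUN.VALUE = schema tells ORE
# to assemble the per-group results into a single ore.frame
res.odf <- ore.groupApply(IRIS, IRIS$SPECIES,
                          FUN.NAME = "build_model",
                          FUN.VALUE = schema,
                          parallel = TRUE)
```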

Next, we use another example for a more detailed demonstration of this function, in two different scenarios: one in the R environment and the other using the SQL interface to embedded R execution.

Example in Oracle R Enterprise

To illustrate working with significantly more columns, consider an analysis on the adult dataset from the UCI data repository. The dataset contains demographic information about people. We first load the data and view a few rows:

adult.df <- read.csv(file = "/scratch/data/adult.csv", header=F)
colnames(adult.df) <- c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status", 
"occupation", "relationship", "race", "sex", "capital_gain", "capital_loss",
"hours_per_week", "native_country", "label")

head(adult.df)
  age         workclass fnlwgt  education education_num
1  50  Self-emp-not-inc  83311  Bachelors            13
2  38           Private 215646    HS-grad             9
3  53           Private 234721       11th             7
4  28           Private 338409  Bachelors            13
5  37           Private 284582    Masters            14
6  49           Private 160187        9th             5
          marital_status         occupation   relationship   race     sex
1     Married-civ-spouse    Exec-managerial        Husband  White    Male
2               Divorced  Handlers-cleaners  Not-in-family  White    Male
3     Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
4     Married-civ-spouse     Prof-specialty           Wife  Black  Female
5     Married-civ-spouse    Exec-managerial           Wife  White  Female
6  Married-spouse-absent      Other-service  Not-in-family  Black  Female
  capital_gain capital_loss hours_per_week native_country  label
1            0            0             13  United-States  <=50K
2            0            0             40  United-States  <=50K
3            0            0             40  United-States  <=50K
4            0            0             40           Cuba  <=50K
5            0            0             40  United-States  <=50K
6            0            0             16        Jamaica  <=50K

Since the dataset contains many categorical variables, we want to create dummy variables (one-hot encoding) for them. This is useful because many R models, such as xgboost or glmnet, require converting categorical variables into numerical vectors.
For instance, the categorical variable marital_status contains 6 levels: Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, and Separated. One-hot encoding generates a vector of 6 binary values (0, 1), with one column per level. For a person with the marital status 'Separated', the dummy columns are (0, 0, 0, 0, 0, 1).
There are 8 categorical variables, so the total number of dummy columns equals the number of distinct levels across all categorical features. As we can imagine, the resulting ORE frame will have many columns, which would be tedious to write out explicitly.
We apply the function dummy.data.frame from the 'dummies' package to the categorical columns:

library(dummies)
factor.features <- c("workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country")
output.df <- dummy.data.frame(adult.df, names=factor.features, sep="_")
colnames(output.df)
  [1] "age"                                        "workclass_ ?"                               "workclass_ Federal-gov"                    
  [4] "workclass_ Local-gov"                       "workclass_ Never-worked"                    "workclass_ Private"                        
  [7] "workclass_ Self-emp-inc"                    "workclass_ Self-emp-not-inc"                "workclass_ State-gov"                      
 [10] "workclass_ Without-pay"                     "fnlwgt"                                     "education_ 10th"                           
 [13] "education_ 11th"                            "education_ 12th"                            "education_ 1st-4th"                        
 [16] "education_ 5th-6th"                         "education_ 7th-8th"                         "education_ 9th"                            
 [19] "education_ Assoc-acdm"                      "education_ Assoc-voc"                       "education_ Bachelors"                      
 [22] "education_ Doctorate"                       "education_ HS-grad"                         "education_ Masters"                        
 [25] "education_ Preschool"                       "education_ Prof-school"                     "education_ Some-college"                   
 [28] "education_num"                              "marital_status_ Divorced"                   "marital_status_ Married-AF-spouse"         
 [31] "marital_status_ Married-civ-spouse"         "marital_status_ Married-spouse-absent"      "marital_status_ Never-married"             
 [34] "marital_status_ Separated"             
...

In total, there are 109 columns after the dummy variables are created.
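If the dummies package is not available, a similar expansion can be sketched with base R's model.matrix (shown only for illustration; the column naming differs from dummy.data.frame):

```r
# One-hot encode a factor with base R: "~ 0 + status" drops the
# intercept so every level gets its own indicator column
status <- factor(c("Divorced", "Separated", "Married-civ-spouse"))
mm <- model.matrix(~ 0 + status)
dim(mm)       # one row per observation, one column per level
colnames(mm)  # "statusDivorced" "statusMarried-civ-spouse" "statusSeparated"
```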
Suppose the data set is inside an Oracle database. We use tableApply to call the dummies package, but it is cumbersome to supply a data frame schema explicitly when we call tableApply.
In this case, we can use the function ore.getApplySchema to retrieve the schema.
First, let us load the dataset into Oracle Database.

adult.df <- read.csv(file = "/scratch/data/adult.csv", header=F)
colnames(adult.df) <- toupper(c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status", 
"occupation", "relationship", "race", "sex", "capital_gain", "capital_loss",
"hours_per_week", "native_country", "label"))
ore.drop(table = 'ADULT')
ore.create(adult.df, table= 'ADULT') 

We write the function to generate the dummy variables:

convert_dummies <- function(adult.df){
    library(dummies)
    # do not forget to convert the column names to upper case in order to avoid adding extra "" when using the columns in queries.
  factor.features <- toupper(c("workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"))  
  output.df <- dummy.data.frame(adult.df, names=factor.features, sep="_")
  return(output.df)
}

Note that the generated column names come from the original values of the features, which contain spaces and '-' characters that may not be accepted as Oracle table column names. We added extra code to reformat the column names.
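The reformatting step itself is not shown in convert_dummies above; a minimal sketch of one possible cleanup (a hypothetical helper, not part of the original code) could be:

```r
# Hypothetical helper: strip spaces, dots and '-' from the generated
# names and upper-case them so they are valid Oracle column names
clean_names <- function(x) toupper(gsub("[ .-]", "", x))
clean_names("workclass_ Self-emp-not-inc")  # e.g. "WORKCLASS_SELFEMPNOTINC"
```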
Upload this function to the R script repository:

ore.scriptCreate(name = "convert_dummies", convert_dummies, overwrite = TRUE)

The code to call the function is as follows.

schema <- ore.getApplySchema(ADULT, "convert_dummies", col = NULL, row.num = NULL, type = 'tableApply')

After we run the function, the output schema is returned. Let us take a look:

> schema
  [1] AGE                                  WORKCLASS_?                          WORKCLASS_FEDERALGOV                
  [4] WORKCLASS_LOCALGOV                   WORKCLASS_NEVERWORKED                WORKCLASS_PRIVATE                   
  [7] WORKCLASS_SELFEMPINC                 WORKCLASS_SELFEMPNOTINC              WORKCLASS_STATEGOV                  
 [10] WORKCLASS_WITHOUTPAY                 FNLWGT                               EDUCATION_10TH                      
 [13] EDUCATION_11TH                       EDUCATION_12TH                       EDUCATION_1ST4TH                    
 [16] EDUCATION_5TH6TH                     EDUCATION_7TH8TH                     EDUCATION_9TH                       
 [19] EDUCATION_ASSOCACDM                  EDUCATION_ASSOCVOC                   EDUCATION_BACHELORS         
...

Looks like we retrieved all the new columns in the schema!
Now we can use the schema to actually run tableApply.

res.odf <- ore.tableApply(ADULT,
                      FUN.NAME = "convert_dummies", 
                      FUN.VALUE = schema)

The result res.odf is an ore.frame containing all the dummy variable columns. Let us inspect the output:

> names(res.odf)
  [1] "AGE"                            "WORKCLASS_?"                    "WORKCLASS_FEDERALGOV"           "WORKCLASS_LOCALGOV"            
  [5] "WORKCLASS_NEVERWORKED"          "WORKCLASS_PRIVATE"              "WORKCLASS_SELFEMPINC"           "WORKCLASS_SELFEMPNOTINC"       
  [9] "WORKCLASS_STATEGOV"             "WORKCLASS_WITHOUTPAY"           "FNLWGT"                         "EDUCATION_10TH"                
 [13] "EDUCATION_11TH"                 "EDUCATION_12TH"                 "EDUCATION_1ST4TH"               "EDUCATION_5TH6TH"              
 [17] "EDUCATION_7TH8TH"               "EDUCATION_9TH"                  "EDUCATION_ASSOCACDM"            "EDUCATION_ASSOCVOC"            
 [21] "EDUCATION_BACHELORS"            "EDUCATION_DOCTORATE"            "EDUCATION_HSGRAD"               "EDUCATION_MASTERS"         
...
> head(res.odf)
  AGE WORKCLASS_? WORKCLASS_FEDERALGOV WORKCLASS_LOCALGOV WORKCLASS_NEVERWORKED WORKCLASS_PRIVATE WORKCLASS_SELFEMPINC WORKCLASS_SELFEMPNOTINC
1  23           0                    0                  0                     0                 1                    0                       0
2  40           0                    0                  0                     0                 1                    0                       0
3  41           0                    0                  0                     0                 0                    0                       1
4  24           0                    0                  0                     0                 0                    0                       0
5  20           1                    0                  0                     0                 0                    0                       0
6  38           0                    0                  0                     0                 1                    0                       0
  WORKCLASS_STATEGOV WORKCLASS_WITHOUTPAY FNLWGT EDUCATION_10TH EDUCATION_11TH EDUCATION_12TH EDUCATION_1ST4TH EDUCATION_5TH6TH
1                  0                    0 115458              0              0              0                0                0
2                  0                    0 347890              0              0              0                0                0
3                  0                    0 196001              0              0              0                0                0
4                  1                    0 273905              0              0              0                0                0
5                  0                    0 119156              0              0              0                0                0
6                  0                    0 179488              0              0              0                0                0
  EDUCATION_7TH8TH EDUCATION_9TH EDUCATION_ASSOCACDM EDUCATION_ASSOCVOC EDUCATION_BACHELORS EDUCATION_DOCTORATE EDUCATION_HSGRAD
1                0             0                   0                  0                   0                   0                1
2                0             0                   0                  0                   1                   0                0
3                0             0                   0                  0                   0                   0                1
4                0             0                   1                  0                   0                   0                0
5                0             0                   0                  0                   0                   0                0
6                0             0                   0                  0                   0                   0                0
  EDUCATION_MASTERS EDUCATION_PRESCHOOL EDUCATION_PROFSCHOOL EDUCATION_SOMECOLLEGE EDUCATION_NUM MARITAL_STATUS_DIVORCED
1                 0                   0                    0                     0             9                       0
2                 0                   0                    0                     0            13                       0
3                 0                   0                    0                     0             9                       0

It works! Using this function, we avoid writing the schema explicitly, which speeds up applying embedded R execution.

Example in Oracle SQL Embedded R Execution

Embedded R execution can be initiated from Oracle SQL as well. Consider the following use case: a data scientist calls our convenience function to produce the schema of the output, then hands it over to an analyst who mainly uses SQL and calls the R function via the SQL version of tableApply. How can we facilitate this process?
First, on the R side, make sure to upload the convert_dummies function to the R script repository:

ore.scriptCreate(name = "convert_dummies", convert_dummies, overwrite = TRUE)

The main difficulty here is how to specify, in SQL, the output schema of the table with all dummy variables. Our solution is to save the schema into a table in Oracle Database and then call rqTableEval. The entire process can be automated from the R side by adding a few lines to the convenience function ore.getApplySchema.

ore.getApplySchema <- function(ODF, function_name, ..., col = NULL,
                               row.num = NULL, type = 'rowApply',
                               ore.connect.flag = FALSE,
                               sql = FALSE, schema.table = NULL) {
  # Check whether the script exists in the R script repository
  res <- ore.scriptList(name = function_name)
  if (nrow(res) == 0) {
    stop('Function does not exist in the repository!')
  }
  rownames(ODF) <- ODF[, 1]
  if (is.null(row.num))
    INPUT <- ODF
  else
    INPUT <- ODF[row.num, ]
  switch(type,
         rowApply = {
           res <- ore.rowApply(INPUT,
                               FUN.NAME = function_name,
                               ...,
                               ore.connect = ore.connect.flag)
         },
         tableApply = {
           res <- ore.tableApply(INPUT,
                                 FUN.NAME = function_name,
                                 ...,
                                 ore.connect = ore.connect.flag)
         },
         groupApply = {
           if (is.null(col))
             stop("group apply requires the col information!")
           res <- ore.groupApply(INPUT,
                                 INPUT[, col],
                                 FUN.NAME = function_name,
                                 ...,
                                 ore.connect = ore.connect.flag)
         }
  )
  if (type == 'tableApply')
    schema <- ore.pull(res)[0, ]
  else
    schema <- res[[1]][0, ]
  # For SQL use, persist the empty schema as a table and return the
  # query that rqTableEval can use to describe the output format
  if (sql == TRUE) {
    stopifnot(!is.null(schema.table))
    ore.drop(table = schema.table)
    ore.create(schema, table = schema.table)
    qry <- paste0("SELECT * FROM ", schema.table)
    return(qry)
  }
  return(schema)
}

In this use case, we can call the function in the following way.

qry <- ore.getApplySchema(ADULT, "convert_dummies", col = NULL, row.num = NULL, type = 'tableApply', sql = TRUE, schema.table = 'ADULT_SCHEMA')

The function saves the schema into a database table named 'ADULT_SCHEMA'. The returned value qry is then the query string 'SELECT * FROM ADULT_SCHEMA', which specifies the output schema when calling rqTableEval.
From the SQL side, we can also create a table ADULT_RESULT to store the result. This avoids the extra work of materializing the ORE frame into an Oracle Database table from R.

CREATE TABLE ADULT_RESULT AS
SELECT *
  FROM table(rqTableEval(
    cursor(SELECT * FROM ADULT), NULL, 
    'SELECT * FROM ADULT_SCHEMA', 'convert_dummies'));

Let us check the results:

SELECT * FROM ADULT_RESULT;

The output shows the result is returned as a table with all dummy columns.

Conclusion

We provided a convenience function that automatically generates the result data frame schema for use in embedded R execution when returning a table, and illustrated how to use it from both R and SQL. This function automates the schema-creation process and lets the user focus on more important data processing tasks.
