ore.doEval and ore.tableApply: Which one is right for me?

When beginning to use Oracle R Enterprise, users quickly grasp the techniques and benefits of using embedded R to run scripts in database-side R engines, and gain a solid sense of the available functions for executing R scripts through Oracle Database. However, various embedded R functions are closely related, and a few tips can help in learning which functions are most appropriate for the problem you wish to solve. In this post, we'll demonstrate best practices for two of the non-parallel embedded R functions, ore.doEval and ore.tableApply.

As with all embedded R functions, both ore.doEval and ore.tableApply invoke R scripts at the database server in an R engine. The difference is that ore.doEval does not take a data table as an input parameter as it's designed
simply to execute the function provided.  In contrast, ore.tableApply accepts a data table (i.e., ore.frame) as input to be delivered to the embedded R function. The functions ore.doEval and ore.tableApply can be made equivalent simply by passing the name of the database table and pulling the table data within the ore.doEval function.   

In the following examples, we show embedded R run times for ore.doEval and ore.tableApply using simple functions that build linear models to predict flight arrival delay based on distance traveled and departure delay.

Model 1: Although ore.doEval does not explicitly accept data from a dedicated input argument, it's possible to retrieve data from the database using ore.sync
and ore.pull within the function:

R> system.time(mod1 <- ore.doEval(function(){
           dat <- ore.pull(ore.get("ONTIME_S"))
           lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)},
           ore.connect = TRUE))

    user  system elapsed
    0.008   0.000   4.941

Model 2: Data can also be passed to a function in the R interface of embedded R exectuion, as shown here with ore.doEval, when connected to the database schema where the data table resides.

R> system.time(mod2 <- ore.doEval(function(dat){
           lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)},
dat = ONTIME_S))

   user  system elapsed
   3.196   0.128   9.476

Model 3: The ore.tableApply function is designed to accept a database table, that is, an ore.frame, as the first input argument.

R> system.time(mod3 <- ore.tableApply(ONTIME_Sfunction(dat){
                       lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat = dat)}))

   user  system elapsed
   0.001   0.000   3.870

As the elapsed timings show, ore.tableApply (Model 3) is faster than both ore.doEval implementations (Model 1 and Model 2). Results may vary depending on data size and the operation being performed.  The ONTIME_S airline data used in these examples contains 220,000 rows and 26 columns, and the tests were executed on a Linux 5.8 server with 12 GB RAM and a single processor.

In summary, ore.doEval takes a function parameter, but can be programmed to source data from Oracle Database or another external source. If your processing is driven by a database table, ore.tableApply is preferable because it's optimized for data transfer from Oracle Database to R. For both approaches, the data must be able to fit in the Database R Engine’s available memory.

Unlike other embedded R functions, ore.doEval and ore.tableApply run serially by executing a single R process with the entire data in memory.  Other embedded R functions are enabled for parallel execution, and each has a distinct use case: row-wise "chunked" computations can be executed using ore.rowApply. The function ore.groupApply can be applied to grouped data for data sets that have natural partitioning. Lastly, the function ore.indexApply supports task-based execution, where one or more R engines perform the same or different calculations, or tasks, a specified number of times.

In this post, we haven't yet addressed the functions rqEval and rqTableEval, the SQL equivalents for ore.doEval and ore.tableApply.  One distinction between the R and SQL embedded R execution interfaces is that you can pass a data.frame or ore.frame as an argument in the R interface, as illustrated with ore.doEval above, however, the SQL interface takes only scalar arguments as input, for example, as with the
rqEval function. The rqTableEval function accepts a full table using a dedicated input argument, and this input may be a cursor or the result of a query.


Post a Comment:
  • HTML Syntax: NOT allowed

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.


« November 2015