Step-by-step: Returning R statistical results as a Database Table


R provides a rich set of statistical functions that we may want to use directly from SQL. Many of these results can be readily expressed as structured table data for use with other SQL tables, or for use by SQL-enabled applications, e.g., dashboards or other statistical tools.

In this blog post, we illustrate in a sequence of five simple steps  how to go from an R function to a SQL-enabled result. Taken from recent "proof of concept" customer engagement, our example involves using the function princomp, which performs a principal components analysis on a given numeric data matrix and returns the results as an object of class princomp. The customer actively uses this R function to produce loadings used in subsequent computations and analysis. The loadings is a matrix whose columns contain the eigenvectors).

The current process of pulling data from their Oracle Database, starting an R  engine, invoking the R script, and placing the results back in the database was proving non-performant and unnecessarily complex. The goal was to leverage Oracle R Enterprise to streamline this process and allow the results to be immediately accessible
through SQL.

As a best practice, here is a process that can get you from start to finish:

Step 1: Invoke from command line, understand results

If you're using a particular R function, chances are you are familiar with its content. However, you may not be familiar with its structure. We'll use an example from the R princomp documentation that uses the USArrests data set. We see that the class of the result is of type princomp, and the model prints the call and standard deviations of the components. To understand the underlying structure, we invoke the function str and see there are seven elements in the list, one of which is the matrix loadings.

mod <- princomp(USArrests, cor = TRUE)
class(mod)
mod
str(mod)


Results:

R> mod <- princomp(USArrests, cor = TRUE)
R> class(mod)
[1] "princomp"
R> mod
Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

4 variables and 50 observations.

R> str(mod)
List of 7
$ sdev : Named num [1:4] 1.575 0.995 0.597 0.416
..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
$ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
.. ..$ : chr [1:4] "Comp.1" "Comap.2" "Comp.3" "Comp.4"
$ center : Named num [1:4] 7.79 170.76 65.54 21.23
..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
$ scale : Named num [1:4] 4.31 82.5 14.33 9.27
..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
$ n.obs : int 50
$ scores : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:50] "1" "2" "3" "4" ...
.. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
$ call : language princomp(x = dat, cor = TRUE)
- attr(*, "class")= chr "princomp"


Step 2: Wrap script in a function, and invoke from ore.tableApply

Since we want to invoke princomp on database data, we first push the demo data, USArrests, to the database to create an ore.frame. Other data we wish to use will also be in database tables.

We'll use ore.tableApply (for the reasons cited in the previous blog post)  providing the ore.frame as the first argument and simply returning within our function the model produced by princomp. We'll then look at its class, retrieve the result from the database, and check its class and structure once again.

Notice that we are able to obtain the exact same result we received using our local R engine as with the database R engine through embedded R execution.

dat <- ore.push(USArrests)
computePrincomp <- function(dat) princomp(dat, cor=TRUE)
res <- ore.tableApply(dat, computePrincomp)


class(res)
res.local <- ore.pull(res)
class(res.local)
str(res.local)
res.local
res


Results:

R> dat <- ore.push(USArrests)
R> computePrincomp <- function(dat) princomp(dat, cor=TRUE)
R> res <- ore.tableApply(dat, dat, computePrincomp)
R> class(res)
[1] "ore.object"
attr(,"package")
[1] "OREembed"
R> res.local <- ore.pull(res)
R> class(res.local)
[1] "princomp"


R> str(res.local)
List of 7
$ sdev : Named num [1:4] 1.575 0.995 0.597 0.416
..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
$ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
.. ..$ : chr [1:4] "Comp.1" "Comap.2" "Comp.3" "Comp.4"
$ center : Named num [1:4] 7.79 170.76 65.54 21.23
..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
$ scale : Named num [1:4] 4.31 82.5 14.33 9.27
..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
$ n.obs : int 50
$ scores : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:50] "1" "2" "3" "4" ...
.. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
$ call : language princomp(x = dat, cor = TRUE)
- attr(*, "class")= chr "princomp"

R> res.local
Call:
princomp(x = dat, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

4 variables and 50 observations.
R> res
Call:
princomp(x = dat, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4
1.5748783 0.9948694 0.5971291 0.4164494


4 variables and 50 observations.


Step 3: Determine what results we really need

Since we are only interested in the loadings and any result we return needs to be a data.frame to turn it into a database row set (table result), we build the model, transform the loadings object into a data.frame, and return the data.frame as the function result. We then view the class of the result and its values.

Since we do this from the R API, we can simply print res to display the returned data.frame, as the print does an implicit ore.pull.

returnLoadings <- function(dat) {
                    mod <- princomp(dat, cor=TRUE)
                    dd <- dim(mod$loadings)
                    ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
                    ldgs$variables <- row.names(ldgs)
                    ldgs
                  }
res <- ore.tableApply(dat, returnLoadings)
class(res)
res

ore.create(USArrests,table="USARRESTS")


Results:

R> res <- ore.tableApply(dat, returnLoadings)
R> class(res)
[1] "ore.object"
attr(,"package")
[1] "OREembed"
R> res
             Comp.1     Comp.2     Comp.3     Comp.4 variables
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780 Murder
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748 Assault
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773 UrbanPop
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432 Rape


Step 4: Load script into the R Script Repository in the database

We're at the point of being able to load the script into the R Script Repository before invoking it from SQL. We can create the function from R or from SQL. In R,

ore.scriptCreate('princomp.loadings', returnLoadings)

or from SQL,

begin
--sys.rqScriptDrop('princomp.loadings');
sys.rqScriptCreate('princomp.loadings',
      'function(dat) {
        mod <- princomp(dat, cor=TRUE)
        dd <- dim(mod$loadings)
        ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
        ldgs$variables <- row.names(ldgs)
        ldgs
      }');
end;
/


Step 5: invoke from SQL SELECT statement

Finally, we're able to invoke the function from SQL using the rqTableEval table function. We pass in a cursor with the data from our USARRESTS table. We have no parameters, so the next argument is NULL. To get the results as a table, we specify a SELECT string that defines the structure of the result. Note that the column names must be identical to what is returned in the R data.frame. The last parameter is the name of the function we want to invoke from the R script repository.

Invoking this, we see the result as a table from the SELECT statement.

select *
from table(rqTableEval( cursor(select * from USARRESTS),
                        NULL,
                       'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual',
                        'princomp.loadings'));


Results:

SQL> select *
from table(rqTableEval( cursor(select * from USARRESTS),NULL,
          'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual','princomp.loadings'));
2 3
    Comp.1     Comp.2     Comp.3     Comp.4  variables
---------- ---------- ---------- ---------- ------------
-.53589947  .418180865 -.34123273  .649227804 Murder
-.58318363  .187985604 -.26814843 -.74340748  Assault
-.27819087 -.87280619  -.37801579  .133877731 UrbanPop
-.54343209 -.16731864   .817777908 .089024323 Rape

As you see above, we have the loadings result returned as a SQL table.


In this example, we walked through the steps of moving from invoking an R function to obtain a specific result to producing that same result from SQL by invoking an R script at the database server under the control of Oracle Database.

Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.

Search

Archives
« May 2015
SunMonTueWedThuFriSat
     
1
2
3
4
5
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
      
Today