Thursday Sep 10, 2015

Consolidating wide and shallow data with ORE Datastore

Clinical trial data are often characterized by a relatively small set of participants (100s or 1000s) while the data collected and analyzed on each may be significantly larger (1000s or 10,000s). Genomic data alone can easily reach the higher end of this range. In talking with industry leaders, one of the problems pharmaceutical companies and research hospitals encounter is effectively managing such data. Storing data in flat files on myriad servers, perhaps even “closeted” when no longer actively needed, poses problems for data accessibility, backup, recovery, and security. While Oracle Database provides support for wide data using nested tables in a number of contexts, to take advantage of R native functions that handle wide data using data.frames, Oracle R Enterprise allows you to store wide data.frames directly in Oracle Database using Oracle R Enterprise datastores.

With Oracle R Enterprise (ORE), a component of the Oracle Advanced Analytics option, the ORE datastore supports storing arbitrary R objects, including data.frames, in Oracle Database. In particular, users can load wide data from a file into R and store the resulting data.frame directly in the ORE datastore. From there, users can repeatedly reload the data at much faster speeds than they can from flat files.

The following benchmark results illustrate the performance of saving and loading data.frames of various dimensions. These tests were performed on an Oracle Exadata X5-2 half rack with ORE 1.4.1, ROracle 1.2-1, and R 3.2.0. Logging was turned off on the datastore table (see the performance tip below), and the data.frames consist of numeric data.

Comparing Alternatives

When it comes to accessing and saving data for use with R, there are several options, including CSV files, .Rdata files, and the ORE datastore. Each comes with its own advantages.


CSV

“Comma-separated value” or CSV files are generally portable, provide a common representation for exporting/importing data, and can be readily loaded into a range of applications. However, flat files need to be managed and often have inadequate security, auditing, backup, and recovery. As we’ll see, CSV files also provide significantly slower read and write times compared to .Rdata and the ORE datastore.


.Rdata

R’s native .Rdata flat file representation is generally efficient for reading/writing R objects since the objects are stored in serialized form, i.e., not converted to a textual representation as CSV data are. However, .Rdata flat files also need to be managed and often have inadequate security, auditing, backup, and recovery. While faster than CSV for reads and writes, .Rdata is slower than the ORE datastore. Being an R-specific format, access is limited to the R environment, which may or may not be a concern.

ORE Datastore

ORE’s datastore capability allows users to organize and manage all data in a single location – the Oracle Database. This centralized repository provides Oracle Database quality security, auditing, backup, and recovery. The ORE datastore, as you’ll see below, provides read and write performance that is significantly better than CSV and .Rdata. Of course, just as .Rdata files are accessed through R, the datastore is accessed through Oracle Database.

Let’s look at a few benchmark comparisons.

First, consider the execution time for loading data using each of these approaches. For 2000 columns and 5000 rows, ore.load() is 124X faster than read.csv() and over 3X faster than R’s load() function. At 20,000 rows, ore.load() is 198X faster than read.csv() and almost 4X faster than load().

Considering the time to save data, is over 11X faster than write.csv() and over 8X faster than save() at 2000 rows, with that benefit continuing through 20,000 rows.

Looking at this across even wider data.frames, e.g., adding results for 4000 and 16000 columns, we see a similar performance benefit for the ORE datastore over save/load and write.csv/read.csv.
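
For readers who want to reproduce the comparison, the outline below is a minimal sketch: it assumes an established ORE connection, and the connection details, file names, and datastore name are illustrative.

library(ORE)
ore.connect(user = "rquser", sid = "orcl", host = "dbhost",
            password = "password", all = TRUE)   # illustrative connection

# A wide, shallow numeric data.frame, e.g., 5000 rows x 2000 columns
dat <- data.frame(matrix(rnorm(5000 * 2000), nrow = 5000))

# Write times for each approach
system.time(write.csv(dat, "df.csv", row.names = FALSE))
system.time(save(dat, file = "df.RData"))
system.time(, name = "MyDatastore", overwrite = TRUE))

# Read times for each approach
system.time(dat1 <- read.csv("df.csv"))
system.time(load("df.RData"))
system.time(ore.load("MyDatastore"))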

If you are looking to consolidate data while gaining performance benefits along with security, backup, and recovery, the Oracle R Enterprise datastore may be a preferred choice.

Example using ORE Datastore

The ORE datastore functions and ore.load() are similar to the corresponding R save() and load() functions.

In the following example, we read a CSV data file, then save the resulting data.frame in the ORE datastore using, associating it with the name “MyDatastore”. Although not shown, multiple objects can be listed in’s initial arguments. Note that any R objects can be included here, not just data.frames.

From there, we list the contents of the datastore and see that “MyDatastore” is listed with the number of objects stored and the overall size. Next we can ask for a summary of the contents of “MyDatastore”, which includes the data.frame ‘dat’.

Next we remove ‘dat’ and load the contents of the datastore, reconstituting ‘dat’ as a usable data.frame object. Lastly, we delete the datastore and see that the ORE datastore is empty.

> dat <- read.csv("df.dat")
> dim(dat)
[1] 300 2000
>, name = "MyDatastore")
> ore.datastore() object.count    size            description
1 MyDatastore            1 4841036 2015-09-01 12:07:38
> ore.datastoreSummary(name = "MyDatastore") class      size length row.count col.count
1         dat data.frame 4841036   2000       300      2000
> rm(dat)
> ore.load("MyDatastore")
[1] "dat"
> ore.delete("MyDatastore")
[1] "MyDatastore"
> ore.datastore()
[1] object.count size description
<0 rows> (or 0-length row.names)

Performance Tip

The performance of saving R objects to the datastore can be increased by temporarily turning off logging on the table that serves as the datastore in the user’s schema, RQ$DATASTOREINVENTORY. This can be accomplished using the following SQL, which can also be invoked from R:
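
A minimal sketch of the statements involved, assuming the standard ALTER TABLE ... NOLOGGING clause and ORE's ore.exec() to issue SQL from R:

# Assumes an established ORE connection (ore.connect)

# Temporarily disable logging on the datastore table before bulk saves
ore.exec("ALTER TABLE RQ$DATASTOREINVENTORY NOLOGGING")

# ... calls here ...

# Re-enable logging when done
ore.exec("ALTER TABLE RQ$DATASTOREINVENTORY LOGGING")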

While turning off logging speeds up inserts and index creation, it does so by skipping redo log writes and therefore has implications for database recovery. It can be used in combination with explicit backups taken before and after loading data.

Monday Feb 18, 2013

Saving R Objects in Oracle Database using Oracle R Enterprise 1.3 Datastore

R allows users to save R objects to disk. The whole workspace of R objects can be saved to a file and reloaded across R sessions, which allows users to return to their previous R environment even after quitting R, or to avoid recreating objects needed in the future. One such type of object is the predictive model, which can be built in one session and saved for scoring in another R session, or even in multiple, possibly parallel, R sessions.

R provides the save() and load() functions, where objects in memory are serialized and unserialized, respectively. Figure 1 depicts an example where two R objects, a linear model and a data.frame, are saved to a file and then reloaded. When objects are restored, they have the same names as when they were saved.

Figure 1: Using R save() and load() functions
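
A minimal sketch of this round trip, using illustrative object and file names:

# Build two objects: a linear model and a data.frame
dat <- mtcars
lm.mod <- lm(mpg ~ cyl + disp + hp + wt + gear, data = dat)

# Serialize both objects to a file
save(lm.mod, dat, file = "myObjects.RData")

# Later, possibly in a new R session: restore the objects
# under their original names
rm(lm.mod, dat)
load("myObjects.RData")
ls()   # "dat" "lm.mod"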

Oracle R Enterprise (ORE) 1.3 supports object persistence using an R datastore in the database. Now, ORE proxy objects, as well as any R objects, can be saved and restored across R sessions as a named entity in Oracle Database. Serializing ORE proxy objects, such as ore.frames, with R’s save() does not work across sessions, since the temporary tables or other database objects they reference are not preserved across R sessions. If these references are not maintained, the restored proxy objects are incomplete and inoperative.

Figure 2 shows an example similar to the previous one. The main difference is that we use and ore.load(), providing the name of the datastore from which we want to retrieve ORE objects.

Figure 2: Using ORE datastore functions and ore.load

Each schema has its own datastore table where R objects are saved. By managing saved objects in Oracle Database, ORE provides referential integrity of saved objects, such that otherwise temporary database objects are retained while referenced by a datastore and are auto-deleted at the end of the R session once no longer referenced. This applies, for example, to tables created via ore.push or Oracle Data Mining models produced using the OREdm package.

Here’s a simple example:

DAT1 <- ore.push(iris)
ore.lm.mod <- ore.lm(Sepal.Length ~ ., DAT1)
lm.mod <- lm(mpg ~ cyl + disp + hp + wt + gear, mtcars)
nb.mod <- ore.odmNB(YEAR ~ ARRDELAY + DEPDELAY + log(DISTANCE), ONTIME_S), lm.mod, nb.mod, name = "myModels")

We’re creating four objects: an ore.frame “DAT1” consisting of the iris data set, an ore.lm model that uses DAT1, a standard R lm model using the mtcars data set, and an ODM naïve Bayes model using ONTIME_S. We then invoke for the three models, using the datastore name “myModels”.

The R objects are saved in the datastore, and any referenced data tables or ODM models are kept in the database; otherwise, these would be treated as temporary database objects and dropped when the ORE session ends.

To load these objects, invoke:

ore.load(name = "myModels")

 To see the content of a datastore, you can invoke ore.datastoreSummary with the name of the datastore.
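
For example, continuing with the datastore created above:

# List datastores available in the current schema
ore.datastore()

# Summarize the contents of the "myModels" datastore
ore.datastoreSummary(name = "myModels")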

The datastore also makes it easy to access R and ORE objects within ORE embedded R execution functions. Simply pass the name of the datastore as an embedded R function argument; within the function, ore.load is invoked with that datastore name. We'll see an example of this using the SQL API below. In addition, by maintaining persisted R objects in the database, ORE facilitates application deployment while leveraging existing database backup, recovery, and security mechanisms.

Consider the following ORE embedded R execution example using the SQL API. We build a simple linear model and store it in a datastore named "myDatastore". The R script is stored in the database R script repository with the name "BuildModel-1". To invoke it, we use the rqTableEval function, providing the input data from ONTIME_S, passing parameters that include the name of the datastore, requesting XML output for the result, and specifying the R script by name.

begin
  sys.rqScriptCreate('BuildModel-1',
 'function(dat, datastore_name) {
    mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat), name = datastore_name, overwrite = TRUE)
    TRUE
  }');
end;
/

select *
from table(rqTableEval(
  cursor(select ARRDELAY, DISTANCE, DEPDELAY from ontime_s),
  cursor(select 1 as "ore.connect", 'myDatastore' as "datastore_name"
         from dual),
  'XML',
  'BuildModel-1'));

To score using this model, we create an R script named "ScoreData-1" that loads the model from the named datastore, invokes predict, and binds the predictions with the predictors to be returned as the result. To invoke this script, we again use the rqTableEval function, which takes as parameters a cursor specifying the data to be scored, a cursor of parameters, a description of the result as a SQL query, and the R script name.

begin
  sys.rqScriptCreate('ScoreData-1',
 'function(dat, datastore_name) {
    ore.load(datastore_name)
    pred <- predict(mod, newdata = dat)
    pred[as.integer(rownames(pred))] <- pred
    cbind(dat, PRED = pred)
  }');
end;
/

select *
from table(rqTableEval(
    cursor(select ARRDELAY, DISTANCE, DEPDELAY
           from ontime_s
           where YEAR = 2003 and MONTH = 5 and DAYOFMONTH = 2),
    cursor(select 1 as "ore.connect",
           'myDatastore' as "datastore_name" from dual),
    'select ARRDELAY, DISTANCE, DEPDELAY, 1 PRED from ontime_s',
    'ScoreData-1'))
order by 1, 2, 3;

Datastore functionality supports the following interface:, ore.load, ore.datastore, ore.datastoreSummary, and ore.delete. See the online documentation for details. A presentation on the ORE transparency layer that provides additional examples is available here.

