

Consolidating wide and shallow data with ORE Datastore

Mark Hornick
Director, Advanced Analytics and Machine Learning

Clinical trial data are often characterized by a relatively small set of participants (100s or 1000s), while the data collected and analyzed on each participant may be significantly larger (1000s or 10,000s of variables). Genomic data alone can easily reach the higher end of this range. In talking with industry leaders, one of the problems pharmaceutical companies and research hospitals encounter is effectively managing such data. Storing data in flat files on myriad servers, perhaps even “closeted” when no longer actively needed, poses problems for data accessibility, backup, recovery, and security. While Oracle Database supports wide data using nested tables in a number of contexts, Oracle R Enterprise lets you take advantage of native R functions that handle wide data as data.frames by storing those data.frames directly in Oracle Database using Oracle R Enterprise datastores.

With Oracle R Enterprise (ORE), a component of the Oracle Advanced Analytics option, the ORE datastore supports storing arbitrary R objects, including data.frames, in Oracle Database. In particular, users can load wide data from a file into R and store the resulting data.frame directly in the ORE datastore. From there, users can repeatedly reload the data at much faster speeds than they can from flat files.

The following benchmark results illustrate the performance of saving and loading data.frames of various dimensions. These tests were performed on an Oracle Exadata 5-2 half rack with ORE 1.4.1, ROracle 1.2-1, and R 3.2.0. Logging was turned off on the datastore table (see the performance tip below). Each data.frame consists of numeric data.
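
The exact benchmark code was not published with this post, but a wide, shallow numeric data.frame of the kind tested can be generated along the following lines; the dimensions and file name here are illustrative assumptions:

# Generate a wide, shallow data.frame of random numeric data
# (illustrative dimensions; not the original benchmark script).
n.rows <- 5000
n.cols <- 2000
dat <- as.data.frame(matrix(rnorm(n.rows * n.cols), nrow = n.rows, ncol = n.cols))

# Write it out as CSV so the flat-file timings can be measured too.
write.csv(dat, file = "df.dat", row.names = FALSE)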

Comparing Alternatives
When it comes to accessing and saving data for use with R, there are several options, including CSV files, .Rdata files, and the ORE datastore. Each comes with its own advantages.

CSV
“Comma separated value” or CSV files are generally portable, provide a common representation for exporting/importing data, and can be readily loaded into a range of applications. However, flat files need to be managed and often have inadequate security, auditing, backup, and recovery. As we’ll see, CSV files provide significantly slower read and write times compared to .Rdata and ORE datastore.

.Rdata
R’s native .Rdata flat file representation is generally efficient for reading/writing R objects since the objects are in serialized form, i.e., not converted to a textual representation as CSV data are. However, .Rdata flat files also need to be managed and often have inadequate security, auditing, backup, and recovery. While faster than CSV read and write times, .Rdata is slower than ORE datastore. Being an R-specific format, access is limited to the R environment, which may or may not be a concern.

ORE Datastore
ORE’s datastore capability allows users to organize and manage all data in a single location – the Oracle Database. This centralized repository provides Oracle Database quality security, auditing, backup, and recovery. The ORE datastore, as you’ll see below, provides read and write performance that is significantly better than CSV and .Rdata. Of course, just as .Rdata files are accessed through R, the datastore is accessed through Oracle Database.

Let’s look at a few benchmark comparisons.
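
As a rough sketch of how such timings can be collected with system.time() – this assumes an ORE connection already established via ore.connect(), and the file and datastore names are illustrative, not the original benchmark harness:

library(ORE)
# ore.connect(...) must have succeeded before the datastore calls below.

# Time each read path; the 'elapsed' column is wall-clock time.
csv.time   <- system.time(dat <- read.csv("df.dat"))

save(dat, file = "df.Rdata")
rdata.time <- system.time(load("df.Rdata"))

ore.save(dat, name = "MyDatastore")
ds.time    <- system.time(ore.load("MyDatastore"))

rbind(csv = csv.time, Rdata = rdata.time, datastore = ds.time)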

First, consider the execution time for loading data using each of these approaches. For 2000 columns and 5,000 rows, ore.load() is 124X faster than read.csv() and over 3X faster than R’s load() function. At 20,000 rows, ore.load() is 198X faster than read.csv() and almost 4X faster than load().

[Figure: data load times for read.csv(), load(), and ore.load() at 2000 columns, 5,000 and 20,000 rows]

Considering the time to save data, ore.save() is over 11X faster than write.csv() and over 8X faster than save() at 2,000 rows, with that benefit continuing through 20,000 rows.

[Figure: data save times for write.csv(), save(), and ore.save()]

Looking at this across even wider data.frames, e.g., adding results for 4,000 and 16,000 columns, we see a similar performance benefit for the ORE datastore over save()/load() and write.csv()/read.csv().

[Figure: save and load times across wider data.frames, up to 16,000 columns]

If you are looking to consolidate data while gaining performance benefits along with security, backup, and recovery, the Oracle R Enterprise datastore may be a preferred choice.

Example using ORE Datastore
The ORE datastore functions ore.save() and ore.load() are similar to the corresponding R save() and load() functions.

In the following example, we read a CSV data file, save it in the ORE datastore using ore.save(), and associate it with the name “MyDatastore”. Although not shown, multiple objects can be listed in the initial arguments. Note that any R objects can be included here, not just data.frames.

From there, we list the contents of the datastore and see that “MyDatastore” is listed with the number of objects stored and the overall size. Next we can ask for a summary of the contents of “MyDatastore”, which includes the data.frame ‘dat’.

Next we remove ‘dat’ and load the contents of the datastore, reconstituting ‘dat’ as a usable data.frame object. Lastly, we delete the datastore and see that the ORE datastore is empty.

> dat <- read.csv("df.dat")
> dim(dat)
[1] 300 2000
>
> ore.save(dat, name="MyDatastore")
> ore.datastore()
  datastore.name object.count    size       creation.date description
1    MyDatastore            1 4841036 2015-09-01 12:07:38
>
> ore.datastoreSummary("MyDatastore")
  object.name      class    size length row.count col.count
1         dat data.frame 4841036   2000       300      2000
>
> rm(dat)
> ore.load("MyDatastore")
[1] "dat"
>
> ore.delete("MyDatastore")
[1] "MyDatastore"
>
> ore.datastore()
[1] datastore.name object.count size creation.date description
<0 rows> (or 0-length row.names)

Performance Tip
The performance of saving R objects to the datastore can be increased by temporarily turning off logging on the table that serves as the datastore in the user’s schema: RQ$DATASTOREINVENTORY. This can be accomplished using the following SQL, which can also be invoked from R:
SQL> alter table RQ$DATASTOREINVENTORY NOLOGGING;
ORE> ore.exec("alter table RQ$DATASTOREINVENTORY NOLOGGING")

While turning off logging speeds up inserts and index creation, it bypasses the redo log and therefore has implications for database recovery. Use it in combination with explicit backups taken before and after loading data.
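
When the bulk save is complete, logging can be turned back on with the corresponding LOGGING clause, e.g., from R:

ORE> ore.exec("alter table RQ$DATASTOREINVENTORY LOGGING")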
