R to Oracle Database Connectivity: Use ROracle for both Performance and Scalability

R users have a few choices of how to connect to their Oracle Database. The most commonly seen include: RODBC, RJDBC, and ROracle. However, these three packages have significantly different performance and scalability characteristics which can greatly impact your application development. In this blog, we’ll discuss these options and highlight performance benchmark results on a wide range of data sets.

If you use ROracle, we'd like to hear about your experience. Please take this brief survey.

By way of introduction, RODBC is an R package that implements ODBC database connectivity. There are two groups of functions: the largely internal odbc* functions implement low-level access to the corresponding ODBC functions having a similar name, and the higher level sql* functions that support read, save, copy, and manipulation of data between R data.frame objects and database tables. Here is an example using RODBC:

library(RODBC)

con <- odbcConnect("DD1", uid="rquser", pwd="rquser", rows_at_time = 500)

sqlSave(con, test_table, "TEST_TABLE")

sqlQuery(con, "select count(*) from TEST_TABLE")

d <- sqlQuery(con, "select * from TEST_TABLE")

close(con)

The R package RJDBC is an implementation of the R DBI package – database interface – that uses JDBC as the back-end connection to the database. Any database that supports a JDBC driver can be used in connection with RJDBC. Here is an example using RJDBC:

library(RJDBC)

drv <- JDBC("oracle.jdbc.OracleDriver",

classPath="…tklocal/instantclient_11_2/ojdbc5.jar", " ")
con <- dbConnect(drv, "
jdbc:oracle:thin:@myHost:1521:db", "rquser", "rquser")

dbWriteTable(con, "TEST_TABLE", test_table)

dbGetQuery(con, "select count(*) from TEST_TABLE")

d <- dbReadTable(con, "TEST_TABLE")
dbDisconnect(con)

The ROracle package is an implementation of the R DBI package that uses Oracle OCI for high performance and scalability with Oracle Databases. It requires Oracle Instant Client or Oracle Database Client to be installed on the client machine. Here is an example using ROracle:

library(ROracle)

drv <- dbDriver("Oracle")

con <- dbConnect(drv, "rquser", "rquser")

dbWriteTable(con,”TEST_TABLE”, test_table)

dbGetQuery(con, "select count(*) from TEST_TABLE")

d <- dbReadTable(con, "TEST_TABLE")

dbDisconnect(con)

Notice that since both RJDBC and ROracle implement the DBI interface, their code is the same except for the driver and connection details.

To compare these interfaces, we prepared tests along several dimensions:

  • Number of rows – 1K, 10K, 100K, and 1M
  • Number of columns – 10, 100, 1000
  • Data types – NUMBER, BINARY_DOUBLE, TIMESTAMP, and VARCHAR; Numeric data is randomly generated, all character data is 10 characters long.
  • Interface: RODBC 1.3-6 (with Data Direct 7.0 driver), RJDBC 0.2-1 (with rJava 0.9-4 with increased memory limit in JRIBootstrap.java),
    and ROracle 1.1-10 (with Oracle Database Client 11.2.0.4)
  • Types of operations: select *, create table, connect

Loading database data to an R data.frame

Where an in-database capability as provided by Oracle R Enterprise is not available, typical usage is to pull data to the R client for subsequent processing. In Figure 1, we compare the execution time to pull 10, 100, and 1000 columns of data from 1K, 10, 100K, and 1M rows for BINARY_DOUBLE data on a log-log scale. Notice that RJDBC does not scale to 100 columns x 1M rows, or above 1000 cols x 100K records. While RODBC and ROracle both scale to these volumes, ROracle is consistently faster than RODBC: up to 2.5X faster. For RJDBC, ROracle is up to 79X faster.

Figure 1: Comparison of RJDBC, RODBC, and ROracle for BINARY_DOUBLE for Select *

In Figure 2, we provide the range of results for RODBC, ROracle, and RJDBC across all data types. Notice that only ROracle provides the full range of scalability while providing superior performance in general.

ROracle is virtually always faster than RODBC: NUMBER data up to 2.5X faster, VARCHAR2 data up to 142X faster, and time stamp data up to 214X faster. RODBC fails to process 1000 columns at 1M rows.

For RJDBC, ROracle is up to 13X faster on NUMBER data, 79X faster on binary double data, 3X for VARCHAR2 data (excluding the 25X over the smallest data set). Note that RODBC and RJDBC have a limit of 255 characters on the length the VARCHAR2 columns. TIMESTAMP data is the one area where RJDBC initially shines, but then fails to scale to larger data sets.

Figure 2: Comparing the three interfaces for select * from <table>

Data set sizes represented in megabytes are captured in Table 1 for all data types. With only minor variation, the data sizes are the same across data types.

Table 1: Dataset sizes in megabytes

Creating database tables from an R data.frame

Data or results created in R may need to be written to a database table. In Figure 3, we compare the execution time to create tables with 10, 100, and 1000 columns of data with 1K, 10, 100K, and 1M rows for BINARY_DOUBLE. Notice that in all three cases, RJDBC is slowest and does not scale. RJDBC does not support the NUMBER or BINARY_DOUBLE data types, but uses FLOAT(126) instead. ROracle scaled across the remaining data types, while RODBC and RJDBC were not tested.

ROracle is 61faster than RODBC for 10 columns x 10K rows, with a median of 5X faster across all data sets. ROracle is 630X faster on 10 columns x 10K rows, with a median of 135X faster across all data sets. RJDBC did not scale to the 1M row data sets.

Figure 3: Comparison of RJDBC, RODBC, and ROracle for BINARY_DOUBLE create table

Connecting to Oracle Database

Depending on the application any sub-second response time may be sufficient. However, as depicted in Figure 4, ROracle introduces minimal time to establish a database connection. ROracle is nearly 10X faster than RJDBC and 1.6X faster than RODBC.

Figure 4: Database connection times for ROracle, RODBC, and RJDBC

In summary, for maximal performance and scalability, ROracle can support a wide range of application needs. RJDBC has significant limitations in both performance and scalability. RODBC can be more difficult to configure on various platforms and while it largely scales to the datasets tested here, its performance lags behind ROracle.

If you use ROracle, we'd like to hear about your experience. Please take this brief survey.

All tests were performed on a 16 processor machine with 4 core Intel Xeon E5540 CPUs @ 2.53 GHz and 74 GB RAM. Oracle Database was version 11.2.0.4. For JDBC, the following was modified before installing rJava.

rJava_0.9-4.tar.gz\rJava_0.9-4.tar\rJava\jri\bootstrap\JRIBootstrap.java was modified to use 2GB :

try {

System.out.println(jl.toString()+" -cp "+System.getProperty("java.class.path")+" -Xmx2g -Dstage=2 Boot");

Process p = Runtime.getRuntime().exec(new String[] {

jl.toString(), "-cp", System.getProperty("java.class.path"),"-Xmx2g", "-Dstage=2", "Boot" });

System.out.println("Started stage 2 ("+p+"), waiting for it to finish...");

System.exit(p.waitFor());

} catch (Exception re) {}

Comments:

How did you create the database table from an R data.frame (using ROracle)?
Did you define the table structure or is it possible to have ROracle create the table on the fly based on the data types in the data frame?

Posted by guest on March 04, 2014 at 02:16 PM PST #

Yes, it's possible to create the table on the fly. Here's a simple example:

1. Create data frame in R:

R> df <- data.frame(X = 1:10, Y = letters[1:10])
R> df
X Y
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j

2. Load the ROracle library, create connection and write data frame df to
database table DF:

R> library(ROracle)

R> drv <- dbDriver("Oracle")

R> con <- dbConnect(drv, "rquser", "rquser")

R> dbWriteTable(con, "DF", df)
[1] TRUE

R> dbListTables(con)
[1] "DF"

Posted by Sherry LaMonica on March 04, 2014 at 04:58 PM PST #

Hello, I can't find binary package file of ROracle from CRAN. When will it be available?

Posted by raccoonjin on March 24, 2014 at 04:47 PM PDT #

Post a Comment:
  • HTML Syntax: NOT allowed
About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today