Addressing Data Order Between R and Relational Databases

Almost all data in R is a vector or is based upon vectors (vectors themselves, matrices, data frames, lists, and so forth).  The elements of a vector in R have an explicit order, and each element can be individually indexed.  R's in-memory processing relies on this order of elements for many computations, e.g., computing quantiles and summaries for time series objects.

By design, query results in relational algebra are unordered.  Repeating the same query multiple times is not guaranteed to return results in the same order. Similarly, database-backed relational data also do not guarantee row order.  However, an explicit order on database tables and views can be defined by using an ORDER BY clause in the SQL SELECT statement.  Ordering is usually achieved by having a unique identifier, either a single or multi-column key specified in the ORDER BY clause.

To bridge between ordered R data frames and normally unordered data in a relational database such as Oracle Database, Oracle R Enterprise provides the ability to create ordered and unordered ore.frame objects.  Oracle R Enterprise supports ordering an ore.frame by assigning row names using the function row.names.

Ordering Using Row Names

Oracle R Enterprise supports ordering using row names.  For example, suppose that the ore.frame object NARROW, which is a proxy object for the corresponding database table, is not indexed.  The following example illustrates using the row.names function to create a unique identifier for each row. When retrieving row names for unordered ore.frame objects, an error is returned:

R> row.names(NARROW)
Error: ORE object has no unique key

If an ore.frame is unordered, row indexing is not permitted, since there is no unique ordering.  For example, an attempt to retrieve the 5th row from the NARROW data returns an error:

R> NARROW[5,]
Error: ORE object has no unique key


The R function row.names can also be used to assign row names explicitly and thus create a unique row identifier.  We'll do this using the variable "ID" on the NARROW data:

R> row.names(NARROW) <- NARROW$ID


R> row.names(head(NARROW[ ,1:3]))
[1] 101501 101502 101503 101504 101505 101506 

We can now index to a specific row number using integer indexing:

R> NARROW[5,]     
  ID GENDER AGE MARITAL_STATUS COUNTRY EDUCATION OCCUPATION YRS_RESIDENCE CLASS
101505 101505 <NA> 34  NeverM United States of America Masters Sales 5 1

Similarly, to index a range of row numbers, use:

R> NARROW[2:3,]

ID GENDER AGE MARITAL_STATUS COUNTRY EDUCATION OCCUPATION YRS_RESIDENCE CLASS
101502 101502 <NA> 27 NeverM United States of America Bach. Sales 3 0

101503 101503 <NA> 20 NeverM United States of America HS-grad Cleric. 2 0


To index a specific row by row name, use character indexing:

R> NARROW["101502",]

  ID GENDER AGE MARITAL_STATUS COUNTRY EDUCATION OCCUPATION YRS_RESIDENCE CLASS

101502 101502 <NA> 27 NeverM United States of America Bach. Sales 3 0

Ordering Using Keys

You can also use the primary key of a database table to order an ore.frame object.  When you execute ore.connect in an R session, Oracle R Enterprise creates a connection to a schema in an Oracle Database instance. To gain access to the data in the database tables in the schema, you can explicitly call the ore.sync function. That function creates an ore.frame object that is a proxy for a table in a schema.  With the schema argument, you can specify the schema for which you want to create an R environment to proxy objects.  With the use.keys argument, you can specify whether you can to use primary keys in the table to order the ore.frame object.

To return the NARROW data to it's unordered state, remove the previously created row names:

R> row.names(NARROW) <- NULL


Using a SQL statement, alter the NARROW table to add a composite primary key:


R> ore.exec("alter table NARROW add constraint NARROW primary key (\"ID\")")


Synchronize the table to obtain the updated key using the
ore.sync command and setting the use.keys argument to TRUE.

R> ore.sync(table = "NARROW", use.keys = TRUE)


The row names of the ordered NARROW data are now the primary key column values:


R> head(NARROW[, 1:3])
   ID GENDER AGE
1 101501   <NA>  41
2 101502   <NA>  27
3 101503   <NA>  20
4 101504   <NA>  45
5 101505   <NA>  34
6 101506   <NA>  38

If your database table already contains a key, there is no need to create the key again. 
Simply execute ore.sync with use.keys set to TRUE when you want to use the primary key:


R> ore.sync(table = "TABLE_NAME", use.keys = TRUE)


Ordering database tables and views is known to reduce performance because it requires sorting. 
As most operations in R do not require ordering, the performance hit due to sorting is unnecessary, and you should generally set use.keys to FALSE in ore.sync. Only when ordering is necessary, for operations such sampling data or running the diff command to compare objects, should keys be used.

Options for Ordering

Oracle R Enterprise contains options that relate to the ordering of an ore.frame object. The ore.warn.order global option specifies whether you want Oracle R Enterprise to display a warning message if you use an unordered ore.frame object in a function whose results are order dependent. If you know what to expect from operations involving aggregates, group summaries, or embedded R computations,  then you might want to turn the warnings off so they do not appear in the output:

R> options("ore.warn.order")
$ore.warn.order

[1] TRUE


R> options("ore.warn.order" = FALSE)


R> options("ore.warn.order")

$ore.warn.order

[1] FALSE

Note that in some circumstances unordered data may appear to have a repeatable order, however since it was never guaranteed in the first place it is possible to change in future runs. Additionally, parallel queries can significantly impact the order of the result versus the sequential execution.




























Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.

Search

Archives
« June 2015
SunMonTueWedThuFriSat
 
1
2
3
5
6
7
8
9
10
11
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30    
       
Today