
Loading "wide data" in Oracle Database using Oracle R Enterprise

Mark Hornick
Director, Advanced Analytics and Machine Learning

Data sets come in many shapes and sizes. Some are tall and thin, others are short and wide. Some take on the form of dense data, a.k.a., single-record case, where each row represents one entity, such as a customer or vehicle. Others take on the form of sparse data, a.k.a., transactional data, where each row typically consists of an identifier, variable name, and value, and a single "case" is represented by multiple rows sharing the same identifier.
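For example, here is a small, hypothetical illustration of the same data in both shapes:

# Dense / single-record case: one row per entity (here, a vehicle)
dense <- data.frame(id=1:2, mpg=c(21.0, 22.8), hp=c(110, 93))

# Sparse / transactional form: one row per (identifier, variable, value),
# so a single case spans multiple rows sharing the same id
trans <- data.frame(id=c(1, 1, 2, 2),
                    variable=c("mpg", "hp", "mpg", "hp"),
                    value=c(21.0, 110, 22.8, 93))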

R provides a variety of ways to morph data from one form into another, e.g., using the reshape2 package. What does this have to do with Oracle R Enterprise and Oracle Database?

As data science practitioners, we often receive data in flat files and want to load that data into Oracle Database as a table. However, if the data is wide, i.e., has more than 1000 columns, you cannot simply create a database table from the data.frame, since Oracle Database tables are limited to 1000 columns. In my previous post, Consolidating wide and shallow data with ORE Datastore, I discussed how you could use the Oracle R Enterprise datastore feature to store wide data directly; however, that does not allow direct manipulation in Oracle Database, since the data is kept as a serialized R object inside a database table.
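For reference, the datastore approach from that post looks roughly like this; a sketch, assuming widedata is the wide data.frame built in the example below, and "wide_data_ds" is a hypothetical datastore name:

library(ORE)
ore.connect(...)   # connect to Oracle Database, as in the example below

# Serialize the wide data.frame into an ORE datastore; it is stored
# as an R object inside a database table, not as a queryable table
ore.save(widedata, name="wide_data_ds", overwrite=TRUE)

# Later, restore the object into an R session
ore.load(name="wide_data_ds")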

While Oracle Database has the concept of nested columns, and the Oracle Advanced Analytics in-database machine learning algorithms can use nested columns to build models and score data with tens of thousands of predictors, ORE does not support nested columns. But we can use ORE to convert the data into transactional format and easily populate a table that can be used to create nested columns within Oracle Database using SQL, or to further manipulate the data using ORE or SQL. 

An Example

To illustrate, we'll first create an R data.frame with 10000 rows and 9000 numeric columns, plus a target and an id column (9002 columns in all), and write it to a file:

rr <- 10000   # number of rows
cc <- 9000    # number of numeric predictor columns
widematrix <- matrix(rnorm(rr*cc), nrow=rr, ncol=cc)
widedata <- data.frame(widematrix)
widedata$target <- rnorm(rr)*10   # add a target column
widedata$id <- 1:rr               # add a case identifier
dim(widedata)
write.table(widedata, file="WIDE_DATA.csv", row.names=FALSE)

Then, we'll read that file back and pivot the data into transactional format using the melt function from the reshape2 package. Finally, we create that table in Oracle Database:

widedata <- read.table("WIDE_DATA.csv", header=TRUE)
dim(widedata)
library(reshape2)
transdata <- melt(widedata, id.vars="id")
head(transdata)
dim(transdata)

library(ORE)
ore.connect(...)

ore.drop(table="TRANS_DATA")
ore.create(transdata, table="TRANS_DATA")
dim(TRANS_DATA)

Running this, we see that reading the 90,020,000 cells of data took about 22 minutes in open source R on an Exadata X2 machine. Performing the pivot is very fast, under 2 seconds. Creating the 90-million-row table in Oracle Database took under 7 minutes!

> dim(widedata)
[1] 10000  9002
> write.table(widedata, file="WIDE_DATA.csv")
> system.time(widedata <- read.table("WIDE_DATA.csv", header=TRUE))
    user   system  elapsed 
1332.363    9.708 1343.965 
> dim(widedata)
[1] 10000  9002
> system.time(transdata <- melt(widedata, id.vars="id"))
   user  system elapsed 
  1.231   0.001   1.236 
> dim(transdata)
[1] 90010000        3
> ore.drop(table="TRANS_DATA")
> system.time(ore.create(transdata, table="TRANS_DATA"))
   user  system elapsed 
 62.080   5.648 413.627 
> dim(TRANS_DATA)
[1] 90010000        3

Now, you may not have enough RAM on your machine to do this all at once, so there is another option: use R and ROracle to batch the file reads and table writes.

First, we'll create an empty table, TRANS_DATA_NEW, with the same structure as our result table above. Then, we connect to Oracle Database using ROracle directly, which gives us the ability to append to an existing table from within a for loop. To preserve forward progress, we commit after each iteration; this isn't strictly necessary, but be sure to commit at the end.

library(ROracle)

# Create an empty table with the same structure as TRANS_DATA
ore.drop(table="TRANS_DATA_NEW")
ore.create(head(TRANS_DATA,0), table="TRANS_DATA_NEW")

# Connect with ROracle directly so we can append to the table
con <- dbConnect(dbDriver("Oracle"), username="rquser", password="pwdoaa",
                 dbname="(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=host)(PORT=port))(CONNECT_DATA=(SERVICE_NAME=svcname)))")

# Read the column names once from the header row
header <- names(read.table("WIDE_DATA.csv", 
                           header=TRUE, nrows=1))

batchsize <- 100 # rows per batch; rr rows in total, as above
for(i in 0:(rr/batchsize-1)) {
  # Read the next batch of rows, skipping the header and prior batches
  widedata <- read.table("WIDE_DATA.csv", header=FALSE, 
                         skip=i*batchsize+1, nrows=batchsize)
  names(widedata) <- header
  # Pivot the batch to transactional format and append it to the table
  transdata <- melt(widedata, id.vars="id")
  dbWriteTable(con, "TRANS_DATA_NEW", transdata,
               row.names=FALSE, append=TRUE)
  dbCommit(con)
}

Once the data is in Oracle Database in transactional format, it's easy to use SQL to construct nested columns for use with the Oracle Advanced Analytics SQL API. Here's a link to an example involving setting up data for association rules mining.
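For instance, here is a minimal sketch of that step, assuming the Oracle Data Mining nested types DM_NESTED_NUMERICAL/DM_NESTED_NUMERICALS, a hypothetical target table name TRANS_DATA_NESTED, and the id, variable, and value columns produced by melt; identifier case and quoting may need adjusting to match how your transactional table was actually created. It is issued here through the ROracle connection opened above:

# Collect each id's (variable, value) pairs into a single nested column
dbGetQuery(con, "
  CREATE TABLE trans_data_nested AS
  SELECT id,
         CAST(COLLECT(DM_NESTED_NUMERICAL(variable, value))
              AS DM_NESTED_NUMERICALS) AS attributes
  FROM trans_data
  GROUP BY id")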
