Convert ddply {plyr} to Oracle R Enterprise, or use with Embedded R Execution

The plyr package contains a set of tools for partitioning a problem into smaller sub-problems that can be more easily processed. One function within {plyr} is ddply, which allows you to specify subsets of a data.frame and then apply a function to each subset. The result is gathered into a single data.frame. Such a capability is very convenient. The function ddply also has a parallel option that if TRUE, will apply the function in parallel, using the backend provided by foreach.

This type of functionality is available through Oracle R Enterprise using the ore.groupApply function. In this blog post, we show a few examples from Sean Anderson's "A quick introduction to plyr" to illustrate the correpsonding functionality using ore.groupApply.

To get started, we'll create a demo data set and load the plyr package.


set.seed(1)
d <- data.frame(year = rep(2000:2014, each = 3),
        count = round(runif(45, 0, 20)))
dim(d)
library(plyr)

This first example takes the data frame, partitions it by year, and calculates the coefficient of variation of the count, returning a data frame.


# Example 1
res <- ddply(d, "year", function(x) {
  mean.count <- mean(x$count)
  sd.count <- sd(x$count)
  cv <- sd.count/mean.count
  data.frame(cv.count = cv)
  })

To illustrate the equivalent functionality in Oracle R Enterprise, using embedded R execution, we use the ore.groupApply function on the same data, but pushed to the database, creating an ore.frame. The function ore.push creates a temporary table in the database, returning a proxy object, the ore.frame.


D <- ore.push(d)
res <- ore.groupApply (D, D$year, function(x) {
  mean.count <- mean(x$count)
  sd.count <- sd(x$count)
  cv <- sd.count/mean.count
  data.frame(year=x$year[1], cv.count = cv)
  }, FUN.VALUE=data.frame(year=1, cv.count=1))

You'll notice the similarities in the first three arguments. With ore.groupApply, we augment the function to return the specific data.frame we want. We also specify the argument FUN.VALUE, which describes the resulting data.frame. From our previous blog posts, you may recall that by default, ore.groupApply returns an ore.list containing the results of each function invocation. To get a data.frame, we specify the structure of the result.

The results in both cases are the same, however the ore.groupApply result is an ore.frame. In this case the data stays in the database until it's actually required. This can result in significant memory and time savings whe data is large.


R> class(res)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> head(res)
   year cv.count
1 2000 0.3984848
2 2001 0.6062178
3 2002 0.2309401
4 2003 0.5773503
5 2004 0.3069680
6 2005 0.3431743

To make the ore.groupApply execute in parallel, you can specify the argument parallel with either TRUE, to use default database parallelism, or to a specific number, which serves as a hint to the database as to how many parallel R engines should be used.

The next ddply example uses the summarise function, which creates a new data.frame. In ore.groupApply, the year column is passed in with the data. Since no automatic creation of columns takes place, we explicitly set the year column in the data.frame result to the value of the first row, since all rows received by the function have the same year.


# Example 2
ddply(d, "year", summarise, mean.count = mean(count))

res <- ore.groupApply (D, D$year, function(x) {
  mean.count <- mean(x$count)
  data.frame(year=x$year[1], mean.count = mean.count)
  }, FUN.VALUE=data.frame(year=1, mean.count=1))

R> head(res)
   year mean.count
1 2000 7.666667
2 2001 13.333333
3 2002 15.000000
4 2003 3.000000
5 2004 12.333333
6 2005 14.666667

Example 3 uses the transform function with ddply, which modifies the existing data.frame. With ore.groupApply, we again construct the data.frame explicilty, which is returned as an ore.frame.


# Example 3

ddply(d, "year", transform, total.count = sum(count))

res <- ore.groupApply (D, D$year, function(x) {
  total.count <- sum(x$count)
  data.frame(year=x$year[1], count=x$count, total.count = total.count)
  }, FUN.VALUE=data.frame(year=1, count=1, total.count=1))

> head(res)
   year count total.count
1 2000 5 23
2 2000 7 23
3 2000 11 23
4 2001 18 40
5 2001 4 40
6 2001 18 40

In Example 4, the mutate function with ddply enables you to define new columns that build on columns just defined. Since the construction of the data.frame using ore.groupApply is explicit, you always have complete control over when and how to use columns.


# Example 4

ddply(d, "year", mutate, mu = mean(count), sigma = sd(count),
      cv = sigma/mu)

res <- ore.groupApply (D, D$year, function(x) {
  mu <- mean(x$count)
  sigma <- sd(x$count)
  cv <- sigma/mu
  data.frame(year=x$year[1], count=x$count, mu=mu, sigma=sigma, cv=cv)
  }, FUN.VALUE=data.frame(year=1, count=1, mu=1,sigma=1,cv=1))

R> head(res)
   year count mu sigma cv
1 2000 5 7.666667 3.055050 0.3984848
2 2000 7 7.666667 3.055050 0.3984848
3 2000 11 7.666667 3.055050 0.3984848
4 2001 18 13.333333 8.082904 0.6062178
5 2001 4 13.333333 8.082904 0.6062178
6 2001 18 13.333333 8.082904 0.6062178

In Example 5, ddply is used to partition data on multiple columns before constructing the result. Realizing this with ore.groupApply involves creating an index column out of the concatenation of the columns used for partitioning. This example also allows us to illustrate using the ORE transparency layer to subset the data.


# Example 5

baseball.dat <- subset(baseball, year > 2000) # data from the plyr package
x <- ddply(baseball.dat, c("year", "team"), summarize,
           homeruns = sum(hr))

We first push the data set to the database to get an ore.frame. We then add the composite column and perform the subset, using the transparency layer. Since the results from database execution are unordered, we will explicitly sort these results and view the first 6 rows.


BB.DAT <- ore.push(baseball)
BB.DAT$index <- with(BB.DAT, paste(year, team, sep="+"))
BB.DAT2 <- subset(BB.DAT, year > 2000)
X <- ore.groupApply (BB.DAT2, BB.DAT2$index, function(x) {
  data.frame(year=x$year[1], team=x$team[1], homeruns=sum(x$hr))
  }, FUN.VALUE=data.frame(year=1, team="A", homeruns=1), parallel=FALSE)
res <- ore.sort(X, by=c("year","team"))

R> head(res)
   year team homeruns
1 2001 ANA 4
2 2001 ARI 155
3 2001 ATL 63
4 2001 BAL 58
5 2001 BOS 77
6 2001 CHA 63

Our next example is derived from the ggplot function documentation. This illustrates the use of ddply within using the ggplot2 package. We first create a data.frame with demo data and use ddply to create some statistics for each group (gp). We then use ggplot to produce the graph. We can take this same code, push the data.frame df to the database and invoke this on the database server. The graph will be returned to the client window, as depicted below.


# Example 6 with ggplot2

library(ggplot2)
df <- data.frame(gp = factor(rep(letters[1:3], each = 10)),
                 y = rnorm(30))
# Compute sample mean and standard deviation in each group
library(plyr)
ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))

# Set up a skeleton ggplot object and add layers:
ggplot() +
  geom_point(data = df, aes(x = gp, y = y)) +
  geom_point(data = ds, aes(x = gp, y = mean),
             colour = 'red', size = 3) +
  geom_errorbar(data = ds, aes(x = gp, y = mean,
                               ymin = mean - sd, ymax = mean + sd),
             colour = 'red', width = 0.4)

DF <- ore.push(df)
ore.tableApply(DF, function(df) {
  library(ggplot2)
  library(plyr)
  ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))
  ggplot() +
    geom_point(data = df, aes(x = gp, y = y)) +
    geom_point(data = ds, aes(x = gp, y = mean),
               colour = 'red', size = 3) +
    geom_errorbar(data = ds, aes(x = gp, y = mean,
                                 ymin = mean - sd, ymax = mean + sd),
                  colour = 'red', width = 0.4)
})

But let's take this one step further. Suppose we wanted to produce multiple graphs, partitioned on some index column. We replicate the data three times and add some noise to the y values, just to make the graphs a little different. We also create an index column to form our three partitions. Note that we've also specified that this should be executed in parallel, allowing Oracle Database to control and manage the server-side R engines. The result of ore.groupApply is an ore.list that contains the three graphs. Each graph can be viewed by printing the list element.

df2 <- rbind(df,df,df)
df2$y <- df2$y + rnorm(nrow(df2))
df2$index <- c(rep(1,300), rep(2,300), rep(3,300))
DF2 <- ore.push(df2)
res <- ore.groupApply(DF2, DF2$index, function(df) {
  df <- df[,1:2]
  library(ggplot2)
  library(plyr)
  ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))
  ggplot() +
    geom_point(data = df, aes(x = gp, y = y)) +
    geom_point(data = ds, aes(x = gp, y = mean),
               colour = 'red', size = 3) +
    geom_errorbar(data = ds, aes(x = gp, y = mean,
                                 ymin = mean - sd, ymax = mean + sd),
                  colour = 'red', width = 0.4)
  }, parallel=TRUE)
res[[1]]
res[[2]]
res[[3]]

To recap, we've illustrated how various uses of ddply from the plyr package can be realized in ore.groupApply, which affords the user explicit control over the contents of the data.frame result in a straightforward manner. We've also highlighted how ddply can be used within an ore.groupApply call.

Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.

Search

Archives
« July 2015
SunMonTueWedThuFriSat
   
1
2
3
4
5
6
7
8
9
10
11
12
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
 
       
Today