Monday Apr 15, 2013

Is the size of your lm model causing you headaches?

If you build an R lm model with a relatively large number of rows, you may be surprised by just how large that lm model is and what impact it has on your environment and application.

Why might you care about size? The most obvious is that the size of R objects impacts the amount of RAM available for further R processing or loading of more data. However, it also has implications for how much space is required to save that model or the time required to move it around the network. For example, you may want to move the model from the database server R engine to the client R engine when using Oracle R Enterprise Embedded R Execution. If the model is too large, you may encounter latency when trying to retrieve the model or even receive the following error:

Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch,  :
  ORA-20000: RQuery error
Error : serialization is too large to store in a raw vector

If you get this error, there are at least a few options:

  • Perform summary component access, like coefficients, inside the embedded R function and return only what is needed
  • Save the model in a database R datastore and manipulate that model at the database server to avoid pulling it to the client
  • Reduce the size of the model by eliminating large and unneeded components

In this blog post, we focus on the third approach and look at the size of lm model components, what you can do to control lm model size, and the implications for doing so. With vanilla R, objects are the "memory" serving as the repository for repeatability. As a result, models tend to be populated with the data used to build them to ensure model build repeatability.

When working with database tables, this "memory" is not needed because governance mechanisms are already in place to ensure either data does not change or logs are available to know what changes took place. Hence it is unnecessary to store the data used to build the model into the model object.

An lm model consists of several components, for example:

coefficients, residuals, effects, fitted.values, rank, qr, df.residual, call, terms, xlevels, model, na.action

Some of these components may appear deceptively small using R’s object.size function. The following script builds an lm model to help reveal what R reports for the size of various components. The examples use a sample of the ONTIME airline arrival and departure delays data set for domestic flights. The ONTIME_S data set is an ore.frame proxy object for data stored in an Oracle database and consists of 219932 rows and 26 columns. The R data.frame ontime_s is this same data pulled to the client R engine using ore.pull and is ~39.4MB.

Note: The results reported below use R 2.15.2 on Windows. Serialization of some components in the lm model has been improved in R 3.0.0, but the implications are the same.

f.lm.1 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat) <- f.lm.1(ontime_s)


54807720 bytes

Using the object.size function on the resulting model, the size is about 55MB. If only scoring data with this model, it seems like a lot of bloat for the few coefficients assumed needed for scoring. Also, to move this object over a network will not be instantaneous. But is this the true size of the model?

A better way to determine just how big an object is, and what space is actually required to store the model or time to move it across a network, is the R serialize function.


[1] 65826324

Notice that the size reported by object.size is different from that of serialize – a difference of 11MB or ~20% greater.

What is taking up so much space? Let’s invoke object.size on each component of this lm model:

lapply(, object.size)

424 bytes


13769600 bytes


3442760 bytes


48 bytes


13769600 bytes


56 bytes


17213536 bytes


48 bytes


287504 bytes


192 bytes


1008 bytes


4432 bytes


6317192 bytes

The components residuals, fitted.values, qr, model, and even na.action are large. Do we need all these components?

The lm function provides arguments to control some aspects of model size. This can be done, for example, by specifying model=FALSE and qr=FALSE. However, as we saw above, there are other components that contribute heavily to model size.

f.lm.2 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY,
                           data = dat, model=FALSE, qr=FALSE) <- f.lm.2(ontime_s)


[1] 51650410


31277216 bytes

The resulting serialized model size is down to about ~52MB, which is not significantly smaller than the full model.The difference with the result reported by object.size is now ~20MB, or 39% smaller.

Does removing these components have any effect on the usefulness of an lm model? We’ll explore this using four commonly used functions: coef, summary, anova, and predict. If we try to invoke summary on, the following error results:


Error in qr.lm(object) : lm object does not have a proper 'qr' component.

Rank zero or should not have used lm(.., qr=FALSE).

The same error results when we try to run anova. Unfortunately, the predict function also fails with the error above. The qr component is necessary for these functions. Function coef returns without error.



0.225378249 -0.001217511 0.962528054

If only coefficients are required, these settings may be acceptable. However, as we’ve seen, removing the model and qr components, while each is large, still leaves a large model. The really large components appear to be the effects, residuals, and fitted.values. We can explicitly nullify them to remove them from the model.

f.lm.3 <- function(dat) {
data = dat, model=FALSE, qr=FALSE)
mod$effects <- mod$residuals <- mod$fitted.values <- NULL
} <- f.lm.3(ontime_s)


[1] 24089000


294968 bytes

Thinking the model size should be small, we might be surprised to see the results above. The function object.size reports ~295KB, but serializing the model shows 24MB, a difference of 23.8MB or 98.8%. What happened? We’ll get to that in a moment. First, let’s explore what effect nullifying these additional components has on the model.

To answer this, we’ll turn on model and qr, and focus on effects, residuals, and fitted.values. If we nullify effects, the anova results are invalid, but the other results are fine. If we nullify residuals, summary cannot produce residual and coefficient statistics, but it also produces an odd F-statistic with a warning:

Warning message:

In : applied to non-(list or vector) of type 'NULL'

The function anova produces invalid F values and residual statistics, clarifying with a warning:

Warning message:

In anova.lm(mod) :

ANOVA F-tests on an essentially perfect fit are unreliable

Otherwise, both predict and coef work fine.

If we nullify fitted.values, summary produces an invalid F-statistics issuing the warning:

Warning message:

In mean.default(f) : argument is not numeric or logical: returning NA

However, there are no adverse effects on results on the other three functions.

Depending on what we need from our model, some of these components could be eliminated. But let’s continue looking at each remaining component, not with object.size, but serialize. Below, we use lapply to compute the serialized length of each model component. This reveals that the terms component is actually the largest component, despite object.size reporting only 4432 bytes above.

as.matrix(lapply(, function(x) length(serialize(x,NULL))))


coefficients 130

rank 26

assign 34

df.residual 26

na.action 84056

xlevels 55

call 275

terms 24004509

If we nullify the terms component, the model becomes quite compact. (By the way, if we simply nullify terms, summary, anova, and predict all fail.) Why is the terms component so large? It turns out it has an environment object as an attribute. The environment contains the variable dat, which contains the original data with 219932 rows and 26 columns. R’s serialize function includes this object and hence the reason the model is so large. The function object.size ignores these objects.

attr($terms, ".Environment")  
<environment: 0x1d6778f8>
ls(envir = attr($terms, ".Environment"))        
[1] "dat"
d <- get("dat",envir=envir)
[1] 219932 26
length(serialize(attr($terms, ".Environment"), NULL))
[1] 38959319
object.size(attr($terms, ".Environment"))
56 bytes

If we remove this object from the environment, the serialized object size also becomes small.

rm(list=ls(envir = attr($terms, ".Environment")),
envir = attr($terms, ".Environment"))  
ls(envir = attr($terms, ".Environment"))
length(serialize(, NULL))
[1] 85500

lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE,
    qr = FALSE)

(Intercept)     DISTANCE     DEPDELAY 
   0.225378    -0.001218     0.962528 

Is the associated environment essential to the model? If not, we could empty it to significantly reduce model size. We'll rebuild the model using the function f.lm.full

f.lm.full <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat) <- f.lm.full(ontime_s)
ls(envir=attr($terms, ".Environment"))
[1] "dat"
[1] 65826324

We'll create the model removing some components as defined in function:

line-height: 115%; font-family: "Courier New";">f.lm.small <- function(dat) {
  f.lm <- function(dat) {
  mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model=FALSE)   
  mod$fitted.values <- NULL
  mod <- f.lm(dat)
  # empty the env associated with local function
  e <- attr(mod$terms, ".Environment")
  # set parent env to .GlobalEnv so serialization doesn’t include contents
parent.env(e) <- .GlobalEnv    
  rm(list=ls(envir=e), envir=e) # remove all objects from this environment
} <- f.lm.small(ontime_s)
ls(envir=attr($terms, ".Environment")) 
length(serialize(, NULL))
[1] 16219251

We can use the same function with embedded R execution. <- ore.pull(ore.tableApply(ONTIME_S, f.lm.small))
ls(envir=attr($terms, ".Environment"))
length(serialize(, NULL))
[1] 16219251
as.matrix(lapply(, function(x) length(serialize(x,NULL))))    
coefficients  130   
residuals     4624354
effects       3442434
rank          26    
fitted.values 4624354
assign        34    
qr            8067072
df.residual   26    
na.action     84056 
xlevels       55    
call          245   
terms         938   

Making this change does not affect the workings of the model for coef, summary, anova, or predict. For example, summary produces expected results:


lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE)

     Min       1Q   Median       3Q      Max
-1462.45    -6.97    -1.36     5.07   925.08

              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  2.254e-01  5.197e-02   4.336 1.45e-05 ***
DISTANCE    -1.218e-03  5.803e-05 -20.979  < 2e-16 ***
DEPDELAY     9.625e-01  1.151e-03 836.289  < 2e-16 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.73 on 215144 degrees of freedom
  (4785 observations deleted due to missingness)
Multiple R-squared: 0.7647,     Adjusted R-squared: 0.7647
F-statistic: 3.497e+05 on 2 and 215144 DF,  p-value: < 2.2e-16

Using the model for prediction also produces expected results.

lm.pred <- function(dat, mod) {
prd <- predict(mod, newdata=dat)
prd[as.integer(rownames(prd))] <- prd
cbind(dat, PRED = prd)

dat.test <- with(ontime_s, ontime_s[YEAR == 2003 & MONTH == 5,
163267        0      748       -2 -2.61037575
163268       -8      361        0 -0.21414306
163269       -5      484        0 -0.36389686
163270       -3      299        3  2.74892676
163271        6      857       -6 -6.59319662
163272      -21      659       -8 -8.27718564
163273       -2     1448        0 -1.53757703
163274        5      238        9  8.59836323
163275       -5      744        0 -0.68044960
163276       -3      199        0 -0.01690635

As shown above, an lm model can become quite large. At least for some applications, several of these components may be unnecessary, allowing the user to significantly reduce the size of the model and space required for saving or time for transporting the model. Relying on Oracle Database to store the data instead of the R model object further allows for significant reduction in model size.

Thursday Feb 23, 2012

Introduction to the ORE Statistics Engine

This Oracle R Enterprise (ORE) statistics engine tutorial is the second in a series to help users get started using ORE. (See the first tutorial on the ORE transparency layer here). Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

The ORE statistics engine is a database library consisting of native database statistical functions and new functionality added specifically for ORE. ORE intercepts R functions and pushes their execution to Oracle Database to perform computationally expensive transformations and computations. This allows R users to use the R client directly against data stored in Oracle Database, greatly increasing R's scalability and performance.

A variety of useful and common routines are available in the ORE statistics engine:

Significance Tests

Chi-square, McNemar, Bowker
Simple and weighted kappas
Cochran-Mantel-Haenzel correlation
Cramer's V
Binomial, KS, t, F, Wilcox

Distribution Functions
Beta distribution
Binomial distribution
Cauchy distribution
Chi-square distribution
Exponential distribution
Gamma distribution
Geometric distribution
Log Normal distribution
Logistic distribution
Negative Binomial distribution
Normal distribution
Poisson distribution
Sign Rank distribution
Student t distribution
Uniform distribution
Weibull distribution
Density Function
Probability Function

Other Functions

Gamma function
Natural logarithm of the Gamma function
Digamma function
Trigamma function
Error function
Complementary error function

Base SAS Equivalents
Freq, Summary, Sort
Rank, Corr, Univariate

These R functions are overridden, such that when presented with ore.frame data, the function generates a SQL query that is transparently submitted to Oracle Database. This is the case for much of the base R and stats functionality, providing the benefit of employing the database as the computational engine and seamlessly breaking through R's memory barrier.

In this post, we introduce a simple yet typical data analysis using functions from the ORE transparency layer. We begin configuring the local R environment by executing a few simple commands.

Load the ORE library and connect to Oracle Database:

> library(ORE)
ore.connect("USER", "SID", "HOST", "PASSWORD")

Invoking ore.sync synchronizes the the metadata in the database schema with the R environment:

> ore.sync("SCHEMA")

Attaching the database schema provides access to views and tables so they can be manipulated from a local R session:


We use the ONTIME_S data set typically installed in Oracle Database when ORE is installed. ONTIME_S is a subset of the airline on-time performance data (from Research and Innovative Technology Administration (RITA), which coordinates the U.S. Department of Transportation (DOT) research programs.  We're providing a relatively large sample data set (220K rows), but these examples could be performed in ORE on the full data set, which contains 123 millions rows and requires 12 GB disk space . This data set is significantly larger than R can process on it's own.

ONTIME_S is a database-resident table with metadata on the R side, represented by an ore.frame object.

> class(ONTIME_S)
[1] "ore.frame" attr(,"package")
[1] "OREbase"

We focus on two columns of data: one numeric column, DEPDELAY (actual departure delay in minutes), and a categorical column, ORIGIN (airport of origin).

A typical place to begin, is, of course, looking at the structure of selected variables. We call the function summary to obtain summary statistics on the variable measuring departure delay.

> with(ONTIME_S, summary(DEPDELAY))
      Min.   1st Qu  Median  Mean  3rd Qu. Max.     NA's 
   -75.000  -2.000  0.000   8.042   6.00  1438.00  3860.00 

To investigate the skewness of the departure delay variable, we use the quantile function to calculate quantiles at 10, 20, 30,....90%.

> with(ONTIME_S, quantile(DEPDELAY, seq(0.1, 0.9, by = 0.1), na.rm = TRUE)) 10% 20% 30% 40% 50% 60% 70% 80% 90% -4  -2  -1  0  0   1  4  10  26

This shows us that the 10% quantile (-4) is 4 units away from the median, while the 90% quantile (26) is 26 units from the median.  For a symmetric distribution, the two quantiles would be about the same distance from the median.  A measure of the data spread is the interquartile range, the difference between the 25% and 75% quantile. To allow computations on the data, we remove missing values by setting the na.rm parameter to TRUE, as we did above for the quantile function.
> with(ONTIME_S, IQR(DEPDELAY, na.rm = TRUE))
[1] 8

Other measures of data spread available are the variance and standard deviation.

> with(ONTIME_S, var(DEPDELAY, na.rm = TRUE)) [1] 771.7207 > with(ONTIME_S, sd(DEPDELAY, na.rm = TRUE)) [1] 27.77986

Using in-database aggregation summaries, we can investigate the relationship between departure delay and origin a bit further. We use aggregate to calculate the mean departure delay for each airport of origin.  Results for the first five airports are displayed using the function head.

ONTIME.agg <- aggregate(ONTIME_S$DEPDELAY,
                     by = list(ONTIME_S$ORIGIN),
                     FUN = mean)
head(ONTIME.agg, 5)
  Group.1    x
1     ABE   216
2     ABI    29
3     ABQ  1392
4     ABY   9
5     ACK   2

Now that we have gained a basic impression and some insights into the ONTIME_S data, we may choose to view the data graphically.  For example, we may want to get a visual impression of the distribution of departure delay.  We use the hist function, which displays a histogram skewed on positive side, presumably because flights rarely leave early.

> with(ONTIME_S, hist(DEPDELAY, breaks = 100, col = rainbow(7), cex.lab = 0.8,
      main = "Distribution of Departure Delay", xlab = "Departure Delay (in minutes)"))

After analyzing the data through exploratory methods in ORE, we proceed to a possible next step: confirmatory statistics. Let's compute a Student's t-test using  the origin and departure delay variables we examined earlier.  The goal is to decide whether average departure delay of one airport is different from the average delay of another.

> with(ONTIME_S, t.test(DEPDELAY[ORIGIN == "SEA"], DEPDELAY[ORIGIN == "SFO"], conf.level = 0.95))

    Welch Two Sample t-test

t = -1.8406, df = 7571.893, p-value = 0.06572
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.21038010  0.06960489
sample estimates:
mean of x  mean of y
 8.475415  9.545802

The output tells us that we calculated the unpaired t-test and gives us the value of the t-distributed statistic.

We may also use a Kolmogorov–Smirnov test to determine if the departure delay variable is from the same distribution as the cumulative distribution.

> with(ONTIME_S, ks.test(DEPDELAY, "pnorm", mean(DEPDELAY), sd(DEPDELAY)))

    One-sample Kolmogorov-Smirnov test

D = 0.3497, p-value < 2.2e-16
alternative hypothesis: two-sided

At this point we could continue our data exploration by performing additional distribution tests, or proceed with the rich set of modeling and functionality ORE offers.  

ORE enables R users transparent access to data stored in Oracle Database while leveraging Oracle Database as a compute engine for scalability and high performance. We've only scratched the surface on ORE's statistical features - stay tuned for posts highlighting more advanced features of the statistical engine.

The Oracle R Enterprise User's Guide contains a number of examples demonstrating the functionality available in ORE. To view the documentation and training materials, visit our product pagePlease feel free to visit our discussion forum  and ask questions or provide comments about how we can help you and your team!


Tuesday Feb 14, 2012

Oracle R Enterprise: Getting Started

Following last week's press release, we wanted to post a series of  demonstrations using Oracle R Enterprise.  Stay tuned to learn more about Oracle R Enterprise (ORE), a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

The R programming language and environment was originally designed to hold data in memory, providing fast and efficient calculations by not requiring the user's program to access information stored on the hard drive. Modern data set size has surpassed the rate which RAM has increased. Consequently, R users will often encounter errors similar to the following:

     "cannot allocate vector of length xxx"

This error occurs because R requires the operating system to provide a block of memory large enough to contain the contents of the data file, and the operating system responds that not enough memory is available. The maximum amount of memory that can be accessed by 32-bit R is 3GB. On 64-bit versions of R, larger objects may be created - theoretically up to 8TB.  However, the Operating System imposes limitations on the resources available to a single process, and using such large objects may be unacceptably slow.

R Programmers with big data sets work around memory limitations in a variety of ways. Some opt to analyze data samples, and some divide the data into manageable batches, run jobs sequentially on a single processor, and then combine the results. This is both costly and time-consuming. For R users who like the flexibility of the R language and the support of the R community, the option to analyze and model large data sets in R is an exciting enhancement.

The Oracle R Enterprise framework allows R users to operate on tables and views directly from R in Oracle Database. Instead of loading large data files into memory, the R engine processing is moved to the database, requiring minimal resources on the user's system, regardless of the size of the data.

In this introductory series, we'll cover everything you need to know to get started with Oracle R Enterprise, including:

Part 1: The ORE transparency layer - a collection of R packages with functions to connect to Oracle Database and use R functionality in Oracle Database. This enables R users to work with data too large to fit into the memory of a user's desktop system, and leverage the scalable Oracle Database as a computational engine.

Part 2. The ORE statistics engine - a collection of statistical functions and procedures corresponding to commonly-used statistical libraries. The statistics engine packages also execute in Oracle Database.

PART 3: ORE SQL extensions supporting embedded R execution through the database on the database server. R users can execute R closures (functions) using an R or SQL API, while taking advantage of data parallelism. Using the SQL API for embedded R execution, sophisticated R graphics and results can be exposed in Oracle Business Intelligence EE dashboards and Oracle BI Publisher documents.

PART 4: Oracle R Connector for Hadoop (ORCH) - an R package that interfaces with the Hadoop Distributed File System (HDFS) and enables executing MapReduce jobs. ORCH enables R users to work directly with an Oracle Hadoop cluster, executing computations from the R environment, written in the R language and working on data resident in HDFS, Oracle Database, or local files.

But we won't stop there - expect to see posts discussing many new features in 2012, including expanded platform support and an extended set of analytics routines. Please come back frequently for updates that can help your organization mature in its implementation of in-database analytics.

Friday Feb 03, 2012

What is R?

For many in the Oracle community, the addition of R through Oracle R Enterprise could leave them wondering "What is R?"

R has been receiving a lot of attention recently, although it’s been around for over 15 years. R is an open-source language and environment for statistical computing and data visualization, supporting data manipulation and transformations, as well as sophisticated graphical displays. It's being taught in colleges and universities in courses on statistics and advanced analytics - even replacing more traditional statistical software tools. Corporate data analysts and statisticians often know R and use it in their daily work, either writing their own R functionality, or leveraging the more than 3400 open source packages. The Comprehensive R Archive Network (CRAN) open source packages support a wide range of statistical and data analysis capabilities. They also focus on analytics specific to individual fields, such as bioinformatics, finance, econometrics, medical image analysis, and others (see CRAN Task Views).

So why do statisticians and data analysts use R?

Well, R is a statistics language similar to SAS or SPSS. It’s a powerful, extensible environment, and as noted above, it has a wide range of statistics and data visualization capabilities. It’s easy to install and use, and it’s free – downloadable from the CRAN R project website.

In contrast, statisticians and data analysts typically don’t know SQL and are not familiar with database tasks. R provides statisticians and data analysts access a wide range of analytical capabilities in a natural statistical language, allowing them to remain highly productive. For example, writing R functions is simple and can be done quickly. Functions can be made to return R objects that can be easily passed to and manipulated by other R functions. By comparison, traditional statistical tools can make the implementation of functions cumbersome, such that programmers resort to macro-oriented programming constructs instead.

So why do we need anything else?

R was conceived as a single user tool that is not multi-threaded.  The client and server components are bundled together as a single executable, much like Excel.

R is limited by the memory and processing power of the machine where it runs, but in addition, being single threaded, it cannot automatically leverage the CPU capacity on a user’s multi-processor laptop without special packages and programming.

However, there is another issue that limits R’s scalability…

R’s approach to passing data between function invocations results in data duplication – this chews up memory faster. So inherently, R is not good for big data, or depending on the machine and tasks, even gigabyte-sized data sets.

This is where Oracle R Enterprise comes in. As we'll continue to discuss in this blog, Oracle R Enterprise lifts this memory and computational constraint found in R today by executing requested R calculations on data in the database, using the database itself as the computational engine. Oracle R Enterprise allows users to further leverage Oracle's engineered systems, like Exadata, Big Data Appliance, and Exalytics, for enterprise-wide analytics, as well as reporting tools like Oracle Business Intelligence Enterprise Edition dashboards and BI Publisher documents.


The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.


« April 2014