Thursday Feb 23, 2012

Introduction to the ORE Statistics Engine

This Oracle R Enterprise (ORE) statistics engine tutorial is the second in a series to help users get started using ORE. (See the first tutorial on the ORE transparency layer here). Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

The ORE statistics engine is a database library consisting of native database statistical functions and new functionality added specifically for ORE. ORE intercepts R functions and pushes their execution to Oracle Database to perform computationally expensive transformations and computations. This allows R users to use the R client directly against data stored in Oracle Database, greatly increasing R's scalability and performance.

A variety of useful and common routines are available in the ORE statistics engine:

Significance Tests

Chi-square, McNemar, Bowker
Simple and weighted kappas
Cochran-Mantel-Haenzel correlation
Cramer's V
Binomial, KS, t, F, Wilcox

Distribution Functions
Beta distribution
Binomial distribution
Cauchy distribution
Chi-square distribution
Exponential distribution
Gamma distribution
Geometric distribution
Log Normal distribution
Logistic distribution
Negative Binomial distribution
Normal distribution
Poisson distribution
Sign Rank distribution
Student t distribution
Uniform distribution
Weibull distribution
Density Function
Probability Function

Other Functions

Gamma function
Natural logarithm of the Gamma function
Digamma function
Trigamma function
Error function
Complementary error function

Base SAS Equivalents
Freq, Summary, Sort
Rank, Corr, Univariate

These R functions are overridden, such that when presented with ore.frame data, the function generates a SQL query that is transparently submitted to Oracle Database. This is the case for much of the base R and stats functionality, providing the benefit of employing the database as the computational engine and seamlessly breaking through R's memory barrier.

In this post, we introduce a simple yet typical data analysis using functions from the ORE transparency layer. We begin configuring the local R environment by executing a few simple commands.

Load the ORE library and connect to Oracle Database:

> library(ORE)
ore.connect("USER", "SID", "HOST", "PASSWORD")

Invoking ore.sync synchronizes the the metadata in the database schema with the R environment:

> ore.sync("SCHEMA")

Attaching the database schema provides access to views and tables so they can be manipulated from a local R session:


We use the ONTIME_S data set typically installed in Oracle Database when ORE is installed. ONTIME_S is a subset of the airline on-time performance data (from Research and Innovative Technology Administration (RITA), which coordinates the U.S. Department of Transportation (DOT) research programs.  We're providing a relatively large sample data set (220K rows), but these examples could be performed in ORE on the full data set, which contains 123 millions rows and requires 12 GB disk space . This data set is significantly larger than R can process on it's own.

ONTIME_S is a database-resident table with metadata on the R side, represented by an ore.frame object.

> class(ONTIME_S)
[1] "ore.frame" attr(,"package")
[1] "OREbase"

We focus on two columns of data: one numeric column, DEPDELAY (actual departure delay in minutes), and a categorical column, ORIGIN (airport of origin).

A typical place to begin, is, of course, looking at the structure of selected variables. We call the function summary to obtain summary statistics on the variable measuring departure delay.

> with(ONTIME_S, summary(DEPDELAY))
      Min.   1st Qu  Median  Mean  3rd Qu. Max.     NA's 
   -75.000  -2.000  0.000   8.042   6.00  1438.00  3860.00 

To investigate the skewness of the departure delay variable, we use the quantile function to calculate quantiles at 10, 20, 30,....90%.

> with(ONTIME_S, quantile(DEPDELAY, seq(0.1, 0.9, by = 0.1), na.rm = TRUE)) 10% 20% 30% 40% 50% 60% 70% 80% 90% -4  -2  -1  0  0   1  4  10  26

This shows us that the 10% quantile (-4) is 4 units away from the median, while the 90% quantile (26) is 26 units from the median.  For a symmetric distribution, the two quantiles would be about the same distance from the median.  A measure of the data spread is the interquartile range, the difference between the 25% and 75% quantile. To allow computations on the data, we remove missing values by setting the na.rm parameter to TRUE, as we did above for the quantile function.
> with(ONTIME_S, IQR(DEPDELAY, na.rm = TRUE))
[1] 8

Other measures of data spread available are the variance and standard deviation.

> with(ONTIME_S, var(DEPDELAY, na.rm = TRUE)) [1] 771.7207 > with(ONTIME_S, sd(DEPDELAY, na.rm = TRUE)) [1] 27.77986

Using in-database aggregation summaries, we can investigate the relationship between departure delay and origin a bit further. We use aggregate to calculate the mean departure delay for each airport of origin.  Results for the first five airports are displayed using the function head.

ONTIME.agg <- aggregate(ONTIME_S$DEPDELAY,
                     by = list(ONTIME_S$ORIGIN),
                     FUN = mean)
head(ONTIME.agg, 5)
  Group.1    x
1     ABE   216
2     ABI    29
3     ABQ  1392
4     ABY   9
5     ACK   2

Now that we have gained a basic impression and some insights into the ONTIME_S data, we may choose to view the data graphically.  For example, we may want to get a visual impression of the distribution of departure delay.  We use the hist function, which displays a histogram skewed on positive side, presumably because flights rarely leave early.

> with(ONTIME_S, hist(DEPDELAY, breaks = 100, col = rainbow(7), cex.lab = 0.8,
      main = "Distribution of Departure Delay", xlab = "Departure Delay (in minutes)"))

After analyzing the data through exploratory methods in ORE, we proceed to a possible next step: confirmatory statistics. Let's compute a Student's t-test using  the origin and departure delay variables we examined earlier.  The goal is to decide whether average departure delay of one airport is different from the average delay of another.

> with(ONTIME_S, t.test(DEPDELAY[ORIGIN == "SEA"], DEPDELAY[ORIGIN == "SFO"], conf.level = 0.95))

    Welch Two Sample t-test

t = -1.8406, df = 7571.893, p-value = 0.06572
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.21038010  0.06960489
sample estimates:
mean of x  mean of y
 8.475415  9.545802

The output tells us that we calculated the unpaired t-test and gives us the value of the t-distributed statistic.

We may also use a Kolmogorov–Smirnov test to determine if the departure delay variable is from the same distribution as the cumulative distribution.

> with(ONTIME_S, ks.test(DEPDELAY, "pnorm", mean(DEPDELAY), sd(DEPDELAY)))

    One-sample Kolmogorov-Smirnov test

D = 0.3497, p-value < 2.2e-16
alternative hypothesis: two-sided

At this point we could continue our data exploration by performing additional distribution tests, or proceed with the rich set of modeling and functionality ORE offers.  

ORE enables R users transparent access to data stored in Oracle Database while leveraging Oracle Database as a compute engine for scalability and high performance. We've only scratched the surface on ORE's statistical features - stay tuned for posts highlighting more advanced features of the statistical engine.

The Oracle R Enterprise User's Guide contains a number of examples demonstrating the functionality available in ORE. To view the documentation and training materials, visit our product pagePlease feel free to visit our discussion forum  and ask questions or provide comments about how we can help you and your team!


The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.


« August 2016