Introduction to the ORE Statistics Engine
By Sherry LaMonica on Feb 23, 2012
This Oracle R Enterprise (ORE) statistics engine tutorial is the second in a series to help users get started using ORE. (See the first tutorial on the ORE transparency layer here). Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.
The ORE statistics engine is a database library consisting of native database statistical functions and new functionality added specifically for ORE. ORE intercepts R functions and pushes their execution to Oracle Database to perform computationally expensive transformations and computations. This allows R users to use the R client directly against data stored in Oracle Database, greatly increasing R's scalability and performance.
A variety of useful and common routines are available in the ORE statistics engine:
Chi-square, McNemar, Bowker
Simple and weighted kappas
Binomial, KS, t, F, Wilcox
Log Normal distribution
Negative Binomial distribution
Sign Rank distribution
Student t distribution
Natural logarithm of the Gamma function
Complementary error function
Base SAS Equivalents
Freq, Summary, Sort
Rank, Corr, Univariate
These R functions are overridden, such that when presented with ore.frame data, the function generates a SQL query that is transparently submitted to Oracle Database. This is the case for much of the base R and stats functionality, providing the benefit of employing the database as the computational engine and seamlessly breaking through R's memory barrier.
In this post, we introduce a simple yet typical data analysis using functions from the ORE transparency layer. We begin by configuring the local R environment with a few simple commands.
Load the ORE library and connect to Oracle Database:
> library(ORE)
> ore.connect("USER", "SID", "HOST", "PASSWORD")
Invoking ore.sync synchronizes the metadata in the database schema with the R environment:
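A minimal sketch of that call, assuming we want the default schema of the connected user:

```r
# Synchronize ORE metadata with the tables and views
# visible in the connected schema
ore.sync()
```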
Attaching the database schema provides access to views and tables so they can be manipulated from a local R session:
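Again assuming the connected user's own schema, attaching looks like this; ore.ls then lists the ore.frame proxies now on the search path:

```r
# Attach the schema so its tables and views appear
# in the R search path as ore.frame objects
ore.attach()

# List the objects now visible, e.g. ONTIME_S
ore.ls()
```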
We use the ONTIME_S data set, typically installed in Oracle Database when ORE is installed. ONTIME_S is a subset of the airline on-time performance data from the Research and Innovative Technology Administration (RITA), which coordinates the U.S. Department of Transportation (DOT) research programs. We're providing a relatively large sample data set (220K rows), but these examples could be performed in ORE on the full data set, which contains 123 million rows and requires 12 GB of disk space. This data set is significantly larger than R can process on its own.
ONTIME_S is a database-resident table with metadata on the R side, represented by an ore.frame object.
 "ore.frame" attr(,"package")
A typical place to begin is, of course, looking at the structure of selected variables. We call the summary function to obtain summary statistics for the variable measuring departure delay.
> with(ONTIME_S, summary(DEPDELAY))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's
  -75.000    -2.000     0.000     8.042     6.000  1438.000  3860.000
To investigate the skewness of the departure delay variable, we use the quantile function to calculate quantiles at 10%, 20%, 30%, ..., 90%.
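A sketch of that call; in ORE the computation is pushed to the database, and na.rm = TRUE excludes the missing delay values:

```r
# Deciles of departure delay, computed in-database
with(ONTIME_S,
     quantile(DEPDELAY, probs = seq(0.1, 0.9, by = 0.1), na.rm = TRUE))
```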
This shows us that the 10% quantile (-4) is 4 units away from the median (0), while the 90% quantile (26) is 26 units from the median. For a symmetric distribution, the two quantiles would be about the same distance from the median. A measure of the data's spread is the interquartile range, the difference between the 25% and 75% quantiles. To allow computations on the data, we remove missing values by setting the na.rm parameter to TRUE, as we did above for the quantile function.
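The interquartile range can be computed directly with base R's IQR function, which ORE overloads for ore.frame data:

```r
# Interquartile range (75% quantile minus 25% quantile);
# na.rm = TRUE removes missing values before the computation
with(ONTIME_S, IQR(DEPDELAY, na.rm = TRUE))
```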
Other measures of data spread available are the variance and standard deviation.
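These follow the same pattern, again with missing values removed:

```r
# Variance and standard deviation of departure delay,
# computed in Oracle Database
with(ONTIME_S, var(DEPDELAY, na.rm = TRUE))
with(ONTIME_S, sd(DEPDELAY, na.rm = TRUE))
```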
Using in-database aggregation summaries, we can investigate the relationship between departure delay and origin a bit further.
We use aggregate to calculate the mean departure delay for each airport of origin. Results for the first five airports are displayed using the function head.
> ONTIME.agg <- aggregate(ONTIME_S$DEPDELAY,
by = list(ONTIME_S$ORIGIN),
FUN = mean)
> head(ONTIME.agg, 5)
  Group.1    x
1     ABE  216
2     ABI   29
3     ABQ 1392
4     ABY    9
5     ACK    2
Now that we have gained a basic impression and some insights into the ONTIME_S data, we may choose to view the data graphically. For example, we may want to get a visual impression of the distribution of departure delay. We use the hist function, which displays a histogram skewed to the positive side, presumably because flights rarely leave early.
> with(ONTIME_S, hist(DEPDELAY, breaks = 100, col = rainbow(7), cex.lab = 0.8,
main = "Distribution of Departure Delay", xlab = "Departure Delay (in minutes)"))
After analyzing the data through exploratory methods in ORE, we proceed to a possible next step: confirmatory statistics.
Let's compute a Student's t-test using the origin and departure delay variables we examined earlier. The goal is to decide whether average departure delay of one airport is different from the average delay of another.
> with(ONTIME_S, t.test(DEPDELAY[ORIGIN == "SEA"], DEPDELAY[ORIGIN == "SFO"], conf.level = 0.95))
Welch Two Sample t-test
data: DEPDELAY[ORIGIN == "SEA"] and DEPDELAY[ORIGIN == "SFO"]
t = -1.8406, df = 7571.893, p-value = 0.06572
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean of x mean of y
The output tells us that we calculated the unpaired t-test and gives us the value of the t-distributed statistic.
We may also use a Kolmogorov-Smirnov test to determine whether the departure delay variable follows a normal distribution with the same mean and standard deviation.
> with(ONTIME_S, ks.test(DEPDELAY, "pnorm", mean(DEPDELAY), sd(DEPDELAY)))
One-sample Kolmogorov-Smirnov test
D = 0.3497, p-value < 2.2e-16
alternative hypothesis: two-sided
At this point we could continue our data exploration by performing additional distribution tests, or proceed with the rich set of modeling functionality ORE offers.
The Oracle R Enterprise User's Guide contains a number of examples demonstrating the functionality available in ORE. To view the documentation and training materials, visit our product page. Please feel free to visit our discussion forum and ask questions or provide comments about how we can help you and your team!