### Introduction to the ORE Statistics Engine

#### By Sherry Lamonica-Oracle on Feb 23, 2012

*This Oracle R Enterprise (ORE) statistics engine tutorial is the second in a series to help users get started using ORE. (See the first tutorial on the ORE transparency layer here). Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition. *

The *ORE* statistics engine is a database library consisting of native database statistical functions and new functionality added specifically for *ORE*. *ORE* intercepts* R* functions and pushes their execution to Oracle Database to perform computationally expensive transformations and computations. This allows R users to use the *R* client directly against data stored in Oracle Database, greatly increasing R's scalability and performance.

A variety of useful and common routines are available in the *ORE* statistics engine:** Significance Tests**

Chi-square, McNemar, Bowker

Simple and weighted kappas

Cochran-Mantel-Haenzel correlation

Cramer's V

Binomial, KS, t, F, Wilcox

**Distribution Functions**

Beta distribution

Binomial distribution

Cauchy distribution

Chi-square distribution

Exponential distribution

F-distribution

Gamma distribution

Geometric distribution

Log Normal distribution

Logistic distribution

Negative Binomial distribution

Normal distribution

Poisson distribution

Sign Rank distribution

Student t distribution

Uniform distribution

Weibull distribution

Density Function

Probability Function

Quantile distribution

Other Functions

Other Functions

Gamma function

Natural logarithm of the Gamma function

Digamma function

Trigamma function

Error function

Complementary error function

**Base SAS Equivalents**

Freq, Summary, Sort

Rank, Corr, Univariate

These R functions are overridden, such that when presented with ore.frame data, the function generates a SQL query that is transparently submitted to Oracle Database. This is the case for much of the base *R* and stats functionality, providing the benefit of employing the database as the computational engine and seamlessly breaking through *R's* memory barrier.

In this post, we introduce a simple yet typical data analysis using
functions from the ORE transparency layer. We begin configuring the local *R* environment by executing a few simple commands.

Load the *ORE* library and connect to Oracle Database:

> library(ORE)

> ore.connect("USER", "SID", "HOST", "PASSWORD")

Invoking ore.sync synchronizes the the metadata in the database schema with the R environment:

> ore.sync("SCHEMA")

> ore.ls()

Attaching the database schema provides access to views and tables so they can be manipulated from a local *R* session:

> ore.attach("SCHEMA")

We use the ONTIME_S data set typically installed in Oracle Database when *ORE* is installed. ONTIME_S is a subset of the airline on-time performance data (from Research and Innovative Technology Administration (RITA), which coordinates the U.S. Department of Transportation (DOT) research programs. We're providing a relatively large sample data set (220K rows), but these examples could be performed in ORE on the full data set, which contains 123 millions rows and requires 12 GB disk space . This data set is significantly larger than R can process on it's own.

ONTIME_S is a database-resident table with metadata on the *R* side, represented by an ore.frame object.

> class(ONTIME_S)

[1] "ore.frame"
attr(,"package")

[1] "OREbase"

A typical place to begin, is, of course, looking at the structure of selected variables. We call the function summary to obtain summary statistics on the variable measuring departure delay.

> with(ONTIME_S, summary(DEPDELAY)) Min. 1st Qu Median Mean 3rd Qu. Max. NA's -75.000 -2.000 0.000 8.042 6.00 1438.00 3860.00To investigate the skewness of the departure delay variable, we use the quantile function to calculate quantiles at 10, 20, 30,....90%.

> with(ONTIME_S, quantile(DEPDELAY, seq(0.1, 0.9, by = 0.1), na.rm = TRUE)) 10% 20% 30% 40% 50% 60% 70% 80% 90% -4 -2 -1 0 0 1 4 10 26

This shows us that the 10% quantile (-4) is 4 units away from the median, while the 90% quantile (26) is 26 units from the median. For a symmetric distribution, the two quantiles would be about the same distance from the median. A measure of the data spread is the interquartile range, the difference between the 25% and 75% quantile. To allow computations on the data, we remove missing values by setting the*na.rm parameter to TRUE, as we did above for the quantile function.*

> with(ONTIME_S, IQR(DEPDELAY, na.rm = TRUE)) [1] 8

Other measures of data spread available are the variance and standard deviation.

> with(ONTIME_S, var(DEPDELAY, na.rm = TRUE)) [1] 771.7207 > with(ONTIME_S, sd(DEPDELAY, na.rm = TRUE)) [1] 27.77986

Using in-database aggregation summaries, we can investigate the relationship between departure delay and origin a bit further.
We use aggregate to calculate the mean departure delay for each airport of origin. Results for the first five airports are displayed using the function head.

> ONTIME.agg <- aggregate(ONTIME_S$DEPDELAY,

by = list(ONTIME_S$ORIGIN),

FUN = mean)

> head(ONTIME.agg, 5)

Group.1 x

1 ABE 216

2 ABI 29

3 ABQ 1392

4 ABY 9

5 ACK 2

Now that we have gained a basic impression and some insights into the ONTIME_S data, we may choose to view the data graphically. For example, we may want to get a visual impression of the distribution of departure delay. We use the hist function, which displays a histogram skewed on positive side, presumably because flights rarely leave early.

> with(ONTIME_S, hist(DEPDELAY, breaks = 100, col = rainbow(7), cex.lab = 0.8,

main = "Distribution of Departure Delay", xlab = "Departure Delay (in minutes)"))

After analyzing the data through exploratory methods in ORE, we proceed to a possible next step: confirmatory statistics.
Let's compute a Student's t-test using the origin and departure delay variables we examined earlier. The goal is to decide whether average departure delay of one airport is different from the average delay of another.

> with(ONTIME_S, t.test(DEPDELAY[ORIGIN == "SEA"], DEPDELAY[ORIGIN == "SFO"], conf.level = 0.95))

Welch Two Sample t-test

data: DEPDELAY[ORIGIN == "SEA"] and DEPDELAY[ORIGIN == "SFO"]

t = -1.8406, df = 7571.893, p-value = 0.06572

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-2.21038010 0.06960489

sample estimates:

mean of x mean of y

8.475415 9.545802

The output tells us that we calculated the unpaired t-test and gives us the value of the t-distributed statistic.

We may also use a Kolmogorov–Smirnov test to determine if the departure delay variable is from the same distribution as the cumulative distribution.

> with(ONTIME_S, ks.test(DEPDELAY, "pnorm", mean(DEPDELAY), sd(DEPDELAY)))

One-sample Kolmogorov-Smirnov test

data: DEPDELAY

D = 0.3497, p-value < 2.2e-16

alternative hypothesis: two-sided

At this point we could continue our data exploration by performing additional distribution tests, or proceed with the rich set of modeling and functionality *ORE* offers.

The Oracle R Enterprise User's Guide contains a number of examples demonstrating the functionality available in ORE. To view the documentation and training materials, visit our product page. Please feel free to visit our discussion forum and ask questions or provide comments about how we can help you and your team!