Wednesday Feb 26, 2014

Quadratic data in Oracle R Enterprise and Oracle Data Mining

I was working with some data which was stored in an Oracle database on a SPARC T4 server. I thought that the data had a quadratic component and I wanted to analyze the data using SQL Developer and Oracle Data Mining, a component of the Oracle Advanced Analytics Option. When I reviewed the initial analysis, I wasn't getting results that I had expected, and the fit of the model wasn't very good. I decided to feed some simple, synthetic quad data into Oracle Data Miner to ensure that I was using the tool properly.

Oracle R Enterprise was used as the tool to create and view the synthetic data.

From an R session that has the Oracle R Enterprise package installed, it is easy to access an Oracle Database:

require(ORE)
## Loading required package: ORE
## Loading required package: OREbase
## 
## Attaching package: 'OREbase'
## 
## The following object(s) are masked from 'package:base':
## 
##     cbind, data.frame, eval, interaction, order, paste, pmax,
##     pmin, rbind, table
## 
## Loading required package: OREstats
## Loading required package: MASS
## Loading required package: OREgraphics
## Loading required package: OREeda
## Loading required package: OREdm
## Loading required package: lattice
## Loading required package: OREpredict
## Loading required package: ORExml
ore.connect("SCOTT", "orcl", "sparc-T4", "TIGER", 1521)
## Loading required package: ROracle
## Loading required package: DBI

The following R function, quad1(), is used to calculate "y=ax^2 + bx + c",

where:
 - the data frame that is passed in has a column of x values.
 - a is in coefficients[feature, 1]
 - b is in coefficients[feature, 2]
 - c is in coefficients[feature, 3]

The function will simply calculate points along a parabolic line and is more complicated than it needs to be. I will leave it in this complicated format so that I can extend it to work with more interesting functions, such as a parabolic surface, later.  

quad1 <- function(df, coefficients) {
    feature <- 1

    coefficients[feature, 1] * df[, feature] * df[, feature] +
      coefficients[feature, 2] * df[, feature] +
      coefficients[feature, 3]
}

The following function, genData(), creates random "x" data points and uses func() to calculate the y values that correspond to the random x values.

genData <- function(nObservations, func, coefficients, nFeatures, scale) {
    dframe <- data.frame(x1 = rep(1, nObservations))
    
    for (feature in seq(nFeatures)) {
        name <- paste("x", feature, sep = "")
        dframe[name] <- runif(nObservations, -scale[feature], scale[feature])
    }
    dframe["y"] <- func(dframe, coefficients)
    return(dframe)
}

The following function, quadGraph(), is used for graphing. The points in dframe are displayed in a scatter plot. The coefficients for the known synthetic data is passed in and the corresponding line is sketched in blue. (Obviously, if you aren't working with synthetic data, it is unlikely that you will know the "true" coefficients.) The R model that is the best estimate of the data based on regression is passed in and sketched in blue.

quadGraph <- function(dframe, coefficients = NULL, model = NULL, ...) {
    with(dframe, plot(x1, y))
    title(main = "Quadratic Fit")
    legend("topright", inset = 0.05, c("True", "Model"), lwd = c(2.5, 2.5), 
        col = c("blue", "red"))
    xRange <- range(dframe[, "x1"])
    smoothX <- seq(xRange[1], xRange[2], length.out = 50)
    trueY <- quad1(data.frame(smoothX), coefficients)
    lines(smoothX, trueY, col = "blue")
    
    new = data.frame(x1 = smoothX)
    y_estimated <- predict(model, new)
    lines(smoothX, y_estimated, col = "red")
}

Here are the settings that will be used.

nFeatures <- 1  # one feature can sketch a line, 2 a surface, ...
nObservations <- 20  # How many rows of data to create for modeling
degree <- 2  # 2 is quadratic, 3 is cubic, etc
set.seed(2)  # I'll get the same coefficients every time I run 
coefficients <- matrix(rnorm(nFeatures * (degree + 1)), nFeatures, degree + 1)
scale <- (10^rpois(nFeatures, 2)) * rnorm(nFeatures, 3)

Here, synthetic data is created that matches the quadratic function and the random coefficients.

modelData <- genData(nObservations, quad1, coefficients, nFeatures, scale)

We can make this exercise at least slightly more realistic by adding some irreducible error for the regression algorithm to deal with. Add noise.

yRange <- range(modelData[, "y"])
yJitter <- (yRange[2] - yRange[1])/10
modelData["y"] <- modelData["y"] + rnorm(nObservations, 0, yJitter)

Great. At this point I have good quadratic synthetic data which can be analyzed. Feed the synthetic data to the Oracle Database.

oreDF <- ore.push(modelData)
tableName <- paste("QuadraticSample_", nObservations, "_", nFeatures, sep = "")
ore.drop(table = tableName)
ore.create(oreDF, table = tableName)

The Oracle R Enterprise function to fit the linear model works as expected.

m = ore.lm(y ~ x1 + I(x1 * x1), dat = oreDF)
summary(m)
## 
## Call:
## ore.lm(formula = y ~ x1 + I(x1 * x1), data = oreDF)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.149 -0.911 -0.156  0.888  1.894 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.3264     0.4308    3.08   0.0068 ** 
## x1           -0.0640     0.1354   -0.47   0.6428    
## I(x1 * x1)   -0.8392     0.0662  -12.68  4.3e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 1.28 on 17 degrees of freedom
## Multiple R-squared: 0.912,	Adjusted R-squared: 0.901 
## F-statistic: 87.7 on 2 and 17 DF,  p-value: 1.1e-09
coefficients
##         [,1]   [,2]  [,3]
## [1,] -0.8969 0.1848 1.588

Notice that the "true" coefficients, that were used to create the synthetic data are close to the values from the regression. For example, the true "a" is stored in coefficients[1,1] = -0.8969 and is close to the model's I(x1 * x1) = -0.8392. Not bad given that the model was created from only 20 data points.

quadGraph(modelData, coefficients, m)

The 20 data points, which were calculated from the "true" equation, but with noisy irreducible error added, are shown in the graph. The model, estimated by ore.lm() from the 20 noisy data points, is close to true.

plot of chunk unnamed-chunk-10

At this point, my job is either complete, or ready to start, depending on your perspective. I'm happy that ore.lm() does a nice job of fitting, so maybe I'm done. But if you remember that my initial goal was to validate that SQL Developer and Oracle Data Miner work with quadratic data, my job has just begun. Now that I have known good quadratic synthetic data in the database, I'll use SQL Developer and the Oracle Data Mining to verify that everything is fine.

One more step in R. Create a second Oracle Database table that will be used to test the regression model. 

testData <- genData(nObservations, quad1, coefficients, nFeatures, scale)
oreTestData <- ore.push(testData)
tableName <- paste("QuadraticTest_", nObservations, "_", nFeatures, sep = "")
ore.drop(table = tableName)
ore.create(oreTestData, table = tableName)  

Here is the SQL Developer workflow that will be used. The synthetic data is in the Oracle Database table "QuadraticSample_20_1". The "Regress Build" node will run linear regression on the synthetic data. The test data which was generated using R in the previous paragraph, is stored in a Oracle Database table named "QuadraticTest_20_1". The Apply node will use the regression model that has been created and use the "x1" values from the test data, storing the y values in an Oracle Database table named "QUADTESTRESULTS". 

SQL Developer Data Mining Work Flow

So how did it work? A PhD in statistics would quickly tell you, "not well", and might look at you like you're an idiot if you don't know that a Model F Value Statistic of 3.25 isn't good. My more pedestrian approach is to plot the results of applying the model to the test data. 

The predictive confidence of the model that was created is poor:

Predictive Confidence

Pull the test result data into R for viewing:

ore.sync()
ore.attach()
testResults <- ore.pull(QUADTESTRESULTS)
## Warning: ORE object has no unique key - using random order
colnames(testResults)[1] <- "y"
with(testResults, plot(x1, y))
title(main = "Results of Applying Model to Test Data")

 Hmm, that doesn't look parabolic to me:

Linear fit, not quadratic

Now that I'm quite sure that SQL Developer and Oracle Data Mining isn't giving an expected fit, check through the advanced settings: 

Advanced Setting

There it is!!

Feature Selection

 Set the feature generation to use quadratic candidates, re-run the model.

The predictive confidence has improved. 

Bring the new results back into R:

Data viewed from R looks parabolic

Also, your statistician friends will be happy because the new model has a Model F Value Statistic of 124. Exciting, right? 

Now, off to work on parabolic surfaces...

Monday Feb 10, 2014

Finding datasets for South African Heart Disease

I've been working with R recently and have been looking for interesting data sets, for example:

http://statweb.stanford.edu/~tibs/ElemStatLearn/data.html

The following command is documented but fails because R can't track through 2 302 status codes and a "301 Moved Permanently":

> read.table("http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data", sep=",",head=T,row.names=1)
Error in file(file, "rt") : cannot open the connection


This worked:

> read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data", sep=",",head=T,row.names=1)

Even better:

http://cran.r-project.org/web/packages/ElemStatLearn/ElemStatLearn.pdf

> install.packages("ElemStatLearn")
> help(package="ElemStatLearn")
> library( package = ElemStatLearn )
> str(SAheart)
> summary(SAheart)
> data(SAheart)

R attributes and regexpr

I've been working with R recently.

Here is an example of using the match.length attribute that is returned from regexpr:

Three strings in a vector, first and third include an embedded string:

data=c("<a href=\"ch4.html\">Chapter 1</a>",
       "no quoted string is embedded in this string",
       "<a   href=\"appendix.html\">Appendix</a>")

Use regexpr to locate the embedded strings:

> locations <- regexpr("\"(.*?)\"", data)

Matches are in the first string (at 9 with length 10) and third string (at 11 with length 15):

> locations
[1]  9 -1 11
attr(,"match.length")
[1] 10 -1 15
attr(,"useBytes")
[1] TRUE

Vector from the attribute:

> attr(locations,"match.length")
[1] 10 -1 15

Use substr and the attribute vector to extract the strings:

> quoted_strings=substr( data, 
                         locations, 
                         locations+attr(locations,"match.length")-1 )    
> quoted_strings
[1] "\"ch4.html\""      ""                  "\"appendix.html\""

Maybe you'd like to remove the embedded quote characters from your strings:

> gsub("\"", "", quoted_strings)
[1] "ch4.html"      ""              "appendix.html"

An alternative is to use regmatches:

> regmatches(data,locations)
[1] "\"ch4.html\""      "\"appendix.html\""
About

user12620111

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today