I was working with some data which was stored in an Oracle database on a SPARC T4 server. I thought that the data had a quadratic component and I wanted to analyze the data using SQL Developer and Oracle Data Mining, a component of the Oracle Advanced Analytics Option. When I reviewed the initial analysis, I wasn't getting results that I had expected, and the fit of the model wasn't very good. I decided to feed some simple, synthetic quad data into Oracle Data Miner to ensure that I was using the tool properly.
Oracle R Enterprise was used as the tool to create and view the synthetic data.
From an R session that has the Oracle R Enterprise package installed, it is easy to access an Oracle Database:
require(ORE)
## Loading required package: ORE
## Loading required package: OREbase
##
## Attaching package: 'OREbase'
##
## The following object(s) are masked from 'package:base':
##
## cbind, data.frame, eval, interaction, order, paste, pmax,
## pmin, rbind, table
##
## Loading required package: OREstats
## Loading required package: MASS
## Loading required package: OREgraphics
## Loading required package: OREeda
## Loading required package: OREdm
## Loading required package: lattice
## Loading required package: OREpredict
## Loading required package: ORExml
ore.connect("SCOTT", "orcl", "sparc-T4", "TIGER", 1521)
## Loading required package: ROracle
## Loading required package: DBI
The following R function, quad1(), is used to calculate "y=ax^2 + bx + c",
where:
- the data frame that is passed in has a column of x values.
- a is in coefficients[feature, 1]
- b is in coefficients[feature, 2]
- c is in coefficients[feature, 3]
The function will simply calculate points along a parabolic line and is more complicated than it needs to be. I will leave it in this complicated format so that I can extend it to work with more interesting functions, such as a parabolic surface, later.
quad1 <- function(df, coefficients) {
feature <- 1
coefficients[feature, 1] * df[, feature] * df[, feature] +
coefficients[feature, 2] * df[, feature] +
coefficients[feature, 3]
}
The following function, genData(), creates random "x" data points and uses func() to calculate the y values that correspond to the random x values.
genData <- function(nObservations, func, coefficients, nFeatures, scale) {
dframe <- data.frame(x1 = rep(1, nObservations))
for (feature in seq(nFeatures)) {
name <- paste("x", feature, sep = "")
dframe[name] <- runif(nObservations, -scale[feature], scale[feature])
}
dframe["y"] <- func(dframe, coefficients)return(dframe)
}
The following function, quadGraph(), is used for graphing. The points in dframe are displayed in a scatter plot. The coefficients for the known synthetic data is passed in and the corresponding line is sketched in blue. (Obviously, if you aren't working with synthetic data, it is unlikely that you will know the "true" coefficients.) The R model that is the best estimate of the data based on regression is passed in and sketched in blue.
quadGraph <- function(dframe, coefficients = NULL, model = NULL, ...) {with(dframe, plot(x1, y))title(main = "Quadratic Fit")legend("topright", inset = 0.05, c("True", "Model"), lwd = c(2.5, 2.5),
col = c("blue", "red"))
xRange <- range(dframe[, "x1"])
smoothX <- seq(xRange[1], xRange[2], length.out = 50)
trueY <- quad1(data.frame(smoothX), coefficients)lines(smoothX, trueY, col = "blue")
new = data.frame(x1 = smoothX)
y_estimated <- predict(model, new)lines(smoothX, y_estimated, col = "red")
}
Here are the settings that will be used.
nFeatures <- 1 # one feature can sketch a line, 2 a surface, ...
nObservations <- 20 # How many rows of data to create for modeling
degree <- 2 # 2 is quadratic, 3 is cubic, etc
set.seed(2) # I'll get the same coefficients every time I run
coefficients <- matrix(rnorm(nFeatures * (degree + 1)), nFeatures, degree + 1)
scale <- (10^rpois(nFeatures, 2)) * rnorm(nFeatures, 3)
Here, synthetic data is created that matches the quadratic function and the random coefficients.
modelData <- genData(nObservations, quad1, coefficients, nFeatures, scale)
We can make this exercise at least slightly more realistic by adding some irreducible error for the regression algorithm to deal with. Add noise.
yRange <- range(modelData[, "y"])
yJitter <- (yRange[2] - yRange[1])/10
modelData["y"] <- modelData["y"] + rnorm(nObservations, 0, yJitter)
Great. At this point I have good quadratic synthetic data which can be analyzed. Feed the synthetic data to the Oracle Database.
oreDF <- ore.push(modelData)
tableName <- paste("QuadraticSample_", nObservations, "_", nFeatures, sep = "")
ore.drop(table = tableName)
ore.create(oreDF, table = tableName)
The Oracle R Enterprise function to fit the linear model works as expected.
m = ore.lm(y ~ x1 + I(x1 * x1), dat = oreDF)
summary(m)
##
## Call:
## ore.lm(formula = y ~ x1 + I(x1 * x1), data = oreDF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.149 -0.911 -0.156 0.888 1.894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3264 0.4308 3.08 0.0068 **
## x1 -0.0640 0.1354 -0.47 0.6428
## I(x1 * x1) -0.8392 0.0662 -12.68 4.3e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.28 on 17 degrees of freedom
## Multiple R-squared: 0.912,Adjusted R-squared: 0.901
## F-statistic: 87.7 on 2 and 17 DF, p-value: 1.1e-09
coefficients
## [,1] [,2] [,3]
## [1,] -0.8969 0.1848 1.588
Notice that the "true" coefficients, that were used to create the synthetic data are close to the values from the regression. For example, the true "a" is stored in coefficients[1,1] = -0.8969 and is close to the model's I(x1 * x1) = -0.8392. Not bad given that the model was created from only 20 data points.
quadGraph(modelData, coefficients, m)
The 20 data points, which were calculated from the "true" equation, but with noisy irreducible error added, are shown in the graph. The model, estimated by ore.lm() from the 20 noisy data points, is close to true.
At this point, my job is either complete, or ready to start, depending on your perspective. I'm happy that ore.lm() does a nice job of fitting, so maybe I'm done. But if you remember that my initial goal was to validate that SQL Developer and Oracle Data Miner work with quadratic data, my job has just begun. Now that I have known good quadratic synthetic data in the database, I'll use SQL Developer and the Oracle Data Mining to verify that everything is fine.
One more step in R. Create a second Oracle Database table that will be used to test the regression model.
testData <- genData(nObservations, quad1, coefficients, nFeatures, scale)
oreTestData <- ore.push(testData)
tableName <- paste("QuadraticTest_", nObservations, "_", nFeatures, sep = "")
ore.drop(table = tableName)
ore.create(oreTestData, table = tableName)
Here is the SQL Developer workflow that will be used. The synthetic data is in the Oracle Database table "QuadraticSample_20_1". The "Regress Build" node will run linear regression on the synthetic data. The test data which was generated using R in the previous paragraph, is stored in a Oracle Database table named "QuadraticTest_20_1". The Apply node will use the regression model that has been created and use the "x1" values from the test data, storing the y values in an Oracle Database table named "QUADTESTRESULTS".
So how did it work? A PhD in statistics would quickly tell you, "not well", and might look at you like you're an idiot if you don't know that a Model F Value Statistic of 3.25 isn't good. My more pedestrian approach is to plot the results of applying the model to the test data.
The predictive confidence of the model that was created is poor:
Pull the test result data into R for viewing:
ore.sync()
ore.attach()
testResults <- ore.pull(QUADTESTRESULTS)
## Warning: ORE object has no unique key - using random order
colnames(testResults)[1] <- "y"
with(testResults, plot(x1, y))
title(main = "Results of Applying Model to Test Data")
Hmm, that doesn't look parabolic to me:
Now that I'm quite sure that SQL Developer and Oracle Data Mining isn't giving an expected fit, check through the advanced settings:
There it is!!
Set the feature generation to use quadratic candidates, re-run the model.
The predictive confidence has improved.
Bring the new results back into R:
Also, your statistician friends will be happy because the new model has a Model F Value Statistic of 124. Exciting, right?
Now, off to work on parabolic surfaces...