SVD, or Singular Value Decomposition, is one of several techniques that can be used to reduce the dimensionality, i.e., the number of columns, of a data set. Why would we want to reduce the number of dimensions? In predictive analytics, more columns normally means more time required to build models and score data. If some columns have no predictive value, this means wasted time, or worse, those columns contribute noise to the model and reduce model quality or predictive accuracy.
Dimensionality reduction can be achieved by simply dropping columns, for example, those that may show up as collinear with others or identified as not being particularly predictive of the target as determined by an attribute importance ranking technique. But it can also be achieved by deriving new columns based on linear combinations of the original columns. In both cases, the resulting transformed data set can be provided to machine learning algorithms to yield faster model build times, faster scoring times, and more accurate models.
While SVD can be used for dimensionality reduction, it is often used in digital signal processing for noise reduction, image compression, and other areas.
SVD is an algorithm that factors an m x n matrix, M, of real or complex values into three component matrices, where the factorization has the form USV*. U is an m x p matrix. S is a p x p diagonal matrix. V is an n x p matrix, with V* being the transpose of V, a p x n matrix, or the conjugate transpose if M contains complex values. The value p is called the rank. The diagonal entries of S are referred to as the singular values of M. The columns of U are typically called the left-singular vectors of M, and the columns of V are called the right-singular vectors of M.
Consider the following visual representation of these matrices:
One of the features of SVD is that given the decomposition of M into U, S, and V, one can reconstruct the original matrix M, or an approximation of it. The singular values in the diagonal matrix S can be used to understand the amount of variance explained by each of the singular vectors. In R, this can be achieved using the computation:
cumsum(S^2/sum(S^2))
When plotted, this provides a visual understanding of the variance captured by the model. The figure below indicates that the first singular vector accounts for 96.5% of the variance, the second with the first accounts for over 99.5%, and so on.
As such, we can use this information to limit the number of vectors to the amount of variance we wish to capture. Reducing the number of vectors can help eliminate noise in the original data set when that data set is reconstructed using the subcomponents of U, S, and V.
OREâ€™s parallel, distributed SVD
With Oracle R Enterpriseâ€™s parallel distributed implementation of Râ€™s svd function, only the S and V components are returned. More specifically, the diagonal singular values are returned of S as the vector d. If we store the result of invoking svd on matrix dat in svd.mod, U can be derived from these using M as follows:
svd.mod <- svd(dat)
U <- dat %*% svd.mod$v %*% diag(1./svd.mod$d)
U.reduced <-dat %*% svd.mod$v[,1:k,drop=FALSE] %*% diag((svd.mod$d)[1:k,drop=FALSE])
dimReduce <- function(x, k=floor(ncol(x)/2), supplemental.cols=NULL) {
colIdxs <- which(colnames(x) %in% supplemental.cols)
colNames <- names(x[,-colIdxs])
sol <- svd(x[,-colIdxs])
sol.U <- as.matrix(x[,-colIdxs]) %*% (sol$v)[,1:k,drop=FALSE] %*%
diag((sol$d)[1:k,drop=FALSE])
sol.U = sol.U@data
res <- cbind(sol.U,x[,colIdxs,drop=FALSE])
names(res) <- c(names(sol.U@data),names(x[,colIdxs]))
res
}
dat <- iris
dat$ID <- seq_len(nrow(dat))
ore.drop("IRIS2")
ore.create(dat,table="IRIS2")
row.names(IRIS2) <- IRIS2$ID
# ore.exec("alter table IRIS2 add constraint IRIS2 primary key (\"ID\")")
# ore.sync(table = "IRIS2", use.keys = TRUE)
IRIS2[1:5,]
IRIS2.reduced <- dimReduce(IRIS2, 2, supplemental.cols=c("ID","Species"))
dim(IRIS2.reduced) # 150 4
library(rpart)
m1 <- rpart(Species~.,iris)
res1 <- predict(m1,iris,type="class")
table(res1,iris$Species)
#res1 setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 49 5
# virginica 0 1 45
dat2 <- ore.pull(IRIS2.reduced)
m2 <- rpart(Species~.-ID,dat2)
res2 <- predict(m2,dat2,type="class")
table(res2,iris$Species)
# res2 setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 47 0
# virginica 0 3 50
m2.1 <- ore.odmDT(Species~.-ID, IRIS2.reduced)
res2.1 <- predict(m2.1,IRIS2.reduced,type="class",supplemental.cols = "Species")
table(res2.1$PREDICTION, res2.1$Species)
# res2 setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 47 0
# virginica 0 3 50
dimReduce <- function(x, k=floor(ncol(x)/2), supplemental.cols=NULL, dsname="svd.model") {
colIdxs <- which(colnames(x) %in% supplemental.cols)
if (length(colIdxs) > 0) {
sol <- svd(x[,-colIdxs])
sol.U <- as.matrix(x[,-colIdxs]) %*% (sol$v)[,1:k,drop=FALSE] %*%
diag((sol$d)[1:k,drop=FALSE])
res <- cbind(sol.U@data,x[,colIdxs,drop=FALSE])
# names(res) <- c(names(sol.U@data),names(x[,colIdxs]))
res
} else {
sol <- svd(x)
sol.U <- as.matrix(x) %*% (sol$v)[,1:k,drop=FALSE] %*%
diag((sol$d)[1:k,drop=FALSE])
res <- sol.U@data
}
ore.save(sol, name=dsname, overwrite=TRUE)
res
}
dimReduceApply <- function(x, k=floor(ncol(x)/2), supplemental.cols=NULL, dsname="svd.model") {
colIdxs <- which(colnames(x) %in% supplemental.cols)
ore.load(dsname)
if (length(colIdxs) > 0) {
sol.U <- as.matrix(x[,-colIdxs]) %*% (sol$v)[,1:k,drop=FALSE] %*%
diag((sol$d)[1:k,drop=FALSE])
res <- cbind(sol.U@data,x[,colIdxs,drop=FALSE])
# names(res) <- c(names(sol.U@data),names(x[,colIdxs]))
res
} else {
sol.U <- as.matrix(x) %*% (sol$v)[,1:k,drop=FALSE] %*%
diag((sol$d)[1:k,drop=FALSE])
res <- sol.U@data
}
res
}
# load data from fileExecution Times
train <- read.csv("D:/datasets/digit-recognizer-train.csv")
dim(train) # 42000 786
train$ID <- 1:nrow(train) # assign row id
ore.drop(table="DIGIT_TRAIN")
ore.create(train,table="DIGIT_TRAIN") # create as table in the database
dim(DIGIT_TRAIN) # 42000 786
# Split the original training data into train and
# test sets to evaluate model accuracy
set.seed(0)
dt <- DIGIT_TRAIN
ind <- sample(1:nrow(dt),nrow(dt)*.6)
group <- as.integer(1:nrow(dt) %in% ind)
row.names(dt) <- dt$ID
sample.train <- dt[group==TRUE,]
sample.test <- dt[group==FALSE,]
dim(sample.train) # 25200 786
dim(sample.test) # 16800 786
# Create train table in database
ore.create(sample.train, table="DIGIT_SAMPLE_TRAIN")
# Create test table in database
ore.create(sample.test, table="DIGIT_SAMPLE_TEST")
# Add persistent primary key for row indexing
# Note: could be done using row.names(DIGIT_SAMPLE_TRAIN) <- DIGIT_SAMPLE_TRAIN$ID
ore.exec("alter table DIGIT_SAMPLE_TRAIN add constraint
DIGIT_SAMPLE_TRAIN primary key (\"ID\")")
ore.exec("alter table DIGIT_SAMPLE_TEST add constraint
DIGIT_SAMPLE_TEST primary key (\"ID\")")
ore.sync(table = c("DIGIT_SAMPLE_TRAIN","DIGIT_SAMPLE_TRAIN"), use.keys = TRUE)
# SVM model
m1.svm <- ore.odmSVM(label~.-ID, DIGIT_SAMPLE_TRAIN, type="classification")
pred.svm <- predict(m1.svm, DIGIT_SAMPLE_TEST,
supplemental.cols=c("ID","label"),type="class")
cm <- with(pred.svm, table(label,PREDICTION))
library(caret)
confusionMatrix(cm)
# Confusion Matrix and Statistics
#
# PREDICTION
# label 0 1 2 3 4 5 6 7 8 9
# 0 1633 0 4 2 3 9 16 2 7 0
# 1 0 1855 12 3 2 5 4 2 23 3
# 2 9 11 1445 22 26 8 22 30 46 10
# 3 8 9 57 1513 2 57 16 16 41 15
# 4 5 9 10 0 1508 0 10 4 14 85
# 5 24 12 14 52 28 1314 26 6 49 34
# 6 10 2 7 1 8 26 1603 0 6 0
# 7 10 8 27 4 21 8 1 1616 4 70
# 8 12 45 14 40 7 47 13 10 1377 30
# 9 12 10 6 19 41 15 2 54 15 1447
#
# Overall Statistics
#
# Accuracy : 0.9114
# 95% CI : (0.907, 0.9156)
# No Information Rate : 0.1167
# P-Value [Acc > NIR] : < 2.2e-16
#...
options(ore.parallel=4)
sample.train.reduced <- dimReduce(DIGIT_SAMPLE_TRAIN, 40, supplemental.cols=c("ID","label"))
sample.test.reduced <- dimReduceApply(DIGIT_SAMPLE_TEST, 40, supplemental.cols=c("ID","label"))
ore.drop(table="DIGIT_SAMPLE_TRAIN_REDUCED")
ore.create(sample.train.reduced,table="DIGIT_SAMPLE_TRAIN_REDUCED")
ore.drop(table="DIGIT_SAMPLE_TEST_REDUCED")
ore.create(sample.test.reduced,table="DIGIT_SAMPLE_TEST_REDUCED")
m2.svm <- ore.odmSVM(label~.-ID,
DIGIT_SAMPLE_TRAIN_REDUCED, type="classification")
pred2.svm <- predict(m2.svm, DIGIT_SAMPLE_TEST_REDUCED,
supplemental.cols=c("label"),type="class")
cm <- with(pred2.svm, table(label,PREDICTION))
confusionMatrix(cm)
# Confusion Matrix and Statistics
#
# PREDICTION
# label 0 1 2 3 4 5 6 7 8 9
# 0 1652 0 3 3 2 7 4 1 3 1
# 1 0 1887 8 2 2 1 1 3 3 2
# 2 3 4 1526 11 20 3 7 21 27 7
# 3 0 3 29 1595 3 38 4 16 34 12
# 4 0 4 8 0 1555 2 11 5 9 51
# 5 5 6 2 31 6 1464 13 6 10 16
# 6 2 1 5 0 5 18 1627 0 5 0
# 7 2 6 22 7 10 2 0 1666 8 46
# 8 3 9 9 34 7 21 9 7 1483 13
# 9 5 2 8 17 30 10 3 31 20 1495
#
# Overall Statistics
#
# Accuracy : 0.9494
# 95% CI : (0.946, 0.9527)
# No Information Rate : 0.1144
# P-Value [Acc > NIR] : < 2.2e-16
#...
# CASE 2 with Random Forest
m2.rf <- ore.randomForest(label~.-ID, DIGIT_SAMPLE_TRAIN,ntree=25)
pred2.rf <- predict(m2.rf, DIGIT_SAMPLE_TEST, supplemental.cols=c("label"),type="response")
cm <- with(pred2.rf, table(label,prediction))
confusionMatrix(cm)
# Confusion Matrix and Statistics
#
# prediction
# label 0 1 2 3 4 5 6 7 8 9
# 0 1655 0 1 1 2 0 7 0 9 1
# 1 0 1876 12 8 2 1 1 2 6 1
# 2 7 4 1552 14 10 2 5 22 10 3
# 3 9 5 33 1604 1 21 4 16 27 14
# 4 1 4 3 0 1577 1 9 3 3 44
# 5 9 6 2 46 3 1455 18 1 9 10
# 6 13 2 3 0 6 14 1621 0 3 1
# 7 1 6 31 5 16 3 0 1675 3 29
# 8 3 7 15 31 11 20 8 4 1476 20
# 9 9 2 7 23 32 5 1 15 12 1515
#
# Overall Statistics
#
# Accuracy : 0.9527
# 95% CI : (0.9494, 0.9559)
# No Information Rate : 0.1138
# P-Value [Acc > NIR] : < 2.2e-16
#...
m1.rf <- ore.randomForest(label~.-ID, DIGIT_SAMPLE_TRAIN_REDUCED,ntree=25)
pred1.rf <- predict(m1.rf, DIGIT_SAMPLE_TEST_REDUCED,
supplemental.cols=c("label"),type="response")
cm <- with(pred1.rf, table(label,prediction))
confusionMatrix(cm)
# Confusion Matrix and Statistics
#
# prediction
# label 0 1 2 3 4 5 6 7 8 9
# 0 1630 0 4 5 2 8 16 3 5 3
# 1 0 1874 17 4 0 5 2 2 4 1
# 2 15 2 1528 17 10 5 10 21 16 5
# 3 7 1 32 1601 4 25 10 8 34 12
# 4 2 6 6 3 1543 2 17 4 4 58
# 5 9 1 5 45 12 1443 11 3 15 15
# 6 21 3 8 0 5 15 1604 0 7 0
# 7 5 11 33 7 17 6 1 1649 2 38
# 8 5 13 27 57 14 27 9 12 1404 27
# 9 10 2 6 22 52 8 5 41 12 1463
#
# Overall Statistics
#
# Accuracy : 0.9368
# 95% CI : (0.9331, 0.9405)
# No Information Rate : 0.1139
# P-Value [Acc > NIR] : < 2.2e-16
#...