Tuesday Feb 18, 2014

Low-Rank Matrix Factorization in Oracle R Advanced Analytics for Hadoop

This guest post from Arun Kumar, a graduate student in the Department of Computer Sciences at the University of Wisconsin-Madison, describes work done during his internship in the Oracle Advanced Analytics group.

Oracle R Advanced Analytics For Hadoop (ORAAH), a component of Oracle’s Big Data Connectors software suite, is a collection of statistical and predictive techniques implemented on Hadoop infrastructure. In this post, we introduce and explain techniques for a popular machine learning task with diverse applications ranging from predicting ratings in recommendation systems to feature extraction in text mining: matrix completion and factorization. Training, scoring, and prediction phases for matrix completion and factorization are available in ORAAH. The models generated can also be transparently loaded into R for ad-hoc inspection. In this blog post, we describe the implementation specifics of these two techniques as available in ORAAH.

Motivation

Consider an e-commerce company that displays products to potential customers on its webpage and collects data about views, purchases, ratings (e.g., 1 to 5 stars), etc. Increasingly, such online retailers use machine learning techniques to predict in advance which products a customer is likely to rate highly and then recommend those products to the customer in the hope that they might be purchased. These predictions come from a statistical model built on the past history of ratings by all customers on all products. One popular model for generating predictions from such a hyper-sparse ratings matrix is the latent factor model, also known as the low-rank matrix factorization model (LMF).

The setup is the following – we are given a large dataset of past ratings (potentially billions of them), say, with the schema (Customer ID, Product ID, Rating). Here, Customer ID refers to a distinct customer, Product ID refers to a distinct product, and Rating refers to a rating value, e.g., 1 to 5. Conceptually, this dataset represents a large matrix D with m rows (number of customers) and n columns (number of products), where the entries are the available ratings. Notice that this matrix is likely to be extremely sparse, i.e., many ratings could be missing, since most customers typically rate only a few products. Thus, the task here is matrix completion – we need to predict the missing ratings so that they can be used for downstream processing, such as displaying the top recommendations for each customer.
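
To make the setup concrete, here is a tiny illustration in plain R (not ORAAH) of the same data in its (Customer ID, Product ID, Rating) form and in its matrix view; the values are made up:


# Toy ratings dataset in (Customer ID, Product ID, Rating) form
ratings <- data.frame(cust   = c(1, 1, 2, 3),
                      prod   = c(1, 3, 2, 1),
                      rating = c(5, 3, 4, 2))

# The corresponding m x n ratings matrix D (m = 3 customers, n = 3 products);
# the NA entries are the missing ratings that matrix completion must predict
D <- matrix(NA, nrow = 3, ncol = 3)
D[cbind(ratings$cust, ratings$prod)] <- ratings$rating
D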

The LMF model assumes that the ratings matrix can be approximately generated as a product of two factor matrices, L and R, which are much smaller than D (lower rank). The idea is that the product L * R will approximately reconstruct the existing ratings and also automatically predict the missing ratings in D. More precisely, for each available rating (i,j,v) in D, we have (L x R) [i,j] ≈ v, while for each missing rating (i',j') in D, the predicted rating is (L x R) [i',j']. The model has a parameter r, which dictates the rank of the factor matrices, i.e., L is m x r, while R is r x n.
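
For completeness, such a factorization is typically fit by minimizing the squared reconstruction error over the available ratings. The standard (unregularized) objective is sketched below; ORAAH’s exact objective, including any regularization terms, may differ.

$$\min_{L \in \mathbb{R}^{m \times r},\; R \in \mathbb{R}^{r \times n}} \;\sum_{(i,j,v) \in D} \left( (L R)_{ij} - v \right)^{2}$$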

Matrix Completion in ORAAH

LMF can be invoked out-of-the-box using the routine orch.lmf. An execution based on the above example is shown below. The dataset of ratings is in a CSV file on HDFS with the schema above (named “retail_ratings” here).


input <- hdfs.attach("retail_ratings")
fit <- orch.lmf(input)

# Export the model (factor matrices L and R) into R memory
lr <- orch.export.fit(fit)

# Compute the prediction for the point (100, 50)

# Factorization rank: lr$L has one ID column followed by r factor columns
rank <- ncol(lr$L) - 1

# First column of lr$L contains the userid
userid <- lr$L[,1] == 100 # find row corresponding to user id 100
L <- lr$L[, 2:(rank+1)]

# First column of lr$R contains the itemid
itemid <- lr$R[,1] == 50 # find row corresponding to item id 50
R <- lr$R[, 2:(rank+1)]

# Dot product, computed as the sum of component-wise products
pred <- sum(L[userid,] * R[itemid,])

The factor matrices can be transparently loaded into R for further inspection and for ad-hoc predictions of specific customer ratings using R. The algorithm we use for training the LMF model is called Incremental Gradient Descent (IGD), which has been shown to be one of the fastest algorithms for this task [1, 2].

The entire set of arguments for the function orch.lmf, along with a brief description and default value for each, is given in the table below. The latin parameter configures the degree of parallelism for executing IGD for LMF on Hadoop [2]. ORAAH sets this automatically based on the dimensions of the problem and the memory available to each Mapper. Each Mapper fits its partition of the model in memory, and the multiple partitions run in parallel to learn different parts of the model. The last five parameters configure IGD and need to be tuned by the user for a given dataset, since they can impact the quality of the model obtained.

ORAAH also provides routines for predicting ratings as well as for evaluating the model (computing the error of the model on a given labeled dataset) at large scale over HDFS-resident datasets. The routine for predicting ratings is predict, and the routine for evaluating a model is orch.evaluate. Use help(orch.lmf) for online documentation, and demo(orch_lmf_jellyfish) for a fully working example including model fitting, evaluation, and prediction.
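
As a rough sketch of that workflow, using the fit object created above (the test-set file name and the argument order are assumptions for illustration; see help(orch.lmf) and the demo for the authoritative signatures):


# Score and evaluate over an HDFS-resident labeled dataset (sketch only)
test <- hdfs.attach("retail_ratings_test")   # hypothetical held-out ratings file on HDFS
pred <- predict(fit, test)                   # large-scale prediction of ratings
err  <- orch.evaluate(fit, test)             # error (e.g., RMSE) of the model on labeled data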

Other Matrix Factorization Tasks

While LMF is primarily used for matrix completion tasks, it can also be used for other matrix factorization tasks that arise in text mining, computer vision, and bioinformatics, e.g., dimension reduction and feature extraction. In these applications, the input data matrix need not necessarily be sparse. Although many zeros might be present, they are not treated as missing values. The goal here is simply to obtain a low-rank factorization D ≈ L x R as accurately as possible, i.e., the product L x R should recover all entries in D, including the zeros. Typically, such applications use a Non-Negative Matrix Factorization (NMF) approach due to non-negativity constraints on the factor matrix entries. However, many of these applications do not actually need non-negativity in the factor matrices, and using NMF algorithms for them can lead to poorer-quality solutions. Our implementation of matrix factorization for such NMF-style tasks can be invoked out-of-the-box in ORAAH using the routine orch.nmf, which has the same set of arguments as LMF.
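
A minimal invocation mirrors the LMF example above (a sketch; whether orch.export.fit applies to NMF fits in the same way is an assumption):


nmf.fit <- orch.nmf(input)            # input: the HDFS-attached data from the earlier example
nmf.lr  <- orch.export.fit(nmf.fit)   # factor matrices L and R in R memory (assumed)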

Experimental Results & Comparison with Apache Mahout

We now present an empirical evaluation of the performance, quality, and scalability of the ORAAH LMF tool based on IGD and compare it to the most widely used off-the-shelf tool for LMF on Hadoop – an implementation of the ALS algorithm from Apache Mahout [3].

All our experiments are run on an Oracle Big Data Appliance Hadoop cluster with nine nodes, each with Intel Xeon X5675 12-core 3.07GHz processors, 48 GB RAM, and 20 TB disk. We use 256MB HDFS blocks and 10 reducers for MapReduce jobs.

We use two standard public datasets for recommendation tasks – MovieLens10M (referred to as MLens) and Netflix – for the performance and quality comparisons. To study scalability, we use several synthetic datasets of different sizes obtained by varying the number of rows, number of columns, and/or number of ratings. The table below presents the dataset statistics.


Results: Performance and Quality

We first present an end-to-end overview of the performance and quality achieved by our implementation and by Mahout on MLens and Netflix. The rank parameter was set to 50 (a typical choice for such tasks) and the other parameters for both tools were chosen using a grid search. The quality of the factor matrices was measured using the standard root mean square error (RMSE) [2]. We use a 70%-15%-15% holdout split of the datasets, i.e., 70% for training, 15% for testing, and 15% for validation of the generalization error. Training was run to 0.1% convergence, i.e., until the fractional decrease in the training RMSE after an iteration fell below 0.1%. The table below presents the results.

1. ORAAH LMF is faster than Mahout LMF in overall training runtime on both datasets – 1.8x faster on MLens and 2.3x faster on Netflix.
2. The per-iteration runtime of ORAAH LMF is much lower than that of Mahout LMF – between 4.4x and 5.4x.
3. Although ORAAH LMF runs more iterations than Mahout LMF, the huge difference in per-iteration runtime makes the overall runtime smaller for ORAAH LMF.
4. The training quality (training RMSE) achieved is comparable across both tools on both datasets. Similarly, the generalization quality is also comparable. Thus, ORAAH LMF can offer state-of-the-art quality along with faster performance.
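
For reference, the RMSE measure and the 0.1% stopping rule used in these experiments amount to the following (plain R, for illustration only):


# Root mean square error over pairs of actual and predicted ratings
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# 0.1% convergence: stop once the fractional decrease in training RMSE
# between consecutive iterations falls below 0.001
converged <- function(prev.rmse, curr.rmse) (prev.rmse - curr.rmse) / prev.rmse < 0.001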

Results: Scalability

The ability to scale along all possible dimensions of the data is key to big data analytics. Both ORAAH LMF and Mahout LMF are able to scale to billions of ratings by parallelizing and distributing computations on Hadoop. But we now show that, unlike Mahout LMF, ORAAH LMF is also able to scale to hundreds of millions of customers (m) and products (n), and also scales well with the rank parameter (r), which affects the size of the factor matrices. The figure below presents the scalability results along these three dimensions – m, n, and r.

1. Figures (A) and (B) plot the results for the Syn-row and Syn-col datasets, respectively (r = 2). ORAAH LMF scales linearly with both number of rows (m) and number of columns (n), while Mahout LMF does not show up on either plot because it crashes at all these values of m. In fact, we verified that Mahout LMF does not scale beyond even m = 20 M! The situation is similar with n. This is because Mahout LMF assumes that the factor matrices L and R fit entirely in the memory of each Mapper. In contrast, ORAAH LMF uses a clever partitioning scheme on all matrices ([2]) and can thus scale seamlessly on all dataset dimensions.
2. Figure (C) shows the impact of the rank parameter r. ORAAH LMF scales linearly with r and the per-iteration runtime roughly doubles between r = 20 and r = 100. However, the per-iteration runtime of Mahout LMF varies quadratically with r, and in fact, increases by a factor of 40x between r = 20 and r = 100! Thus, ORAAH LMF is also able to scale better with r.
3. Finally, on the tera-scale dataset Syn-tera with 1 billion rows, 10 million columns, and 20 billion ratings, ORAAH LMF (for r = 2) finishes an iteration in just under 2 hours!

Acknowledgements

The matrix factorization features in ORAAH were implemented and benchmarked by Arun Kumar during his summer internship at Oracle under the guidance of Vaishnavi Sashikanth. He is pursuing his PhD in computer science at the University of Wisconsin-Madison. This work is the result of a collaboration between Oracle and the research group of Dr. Christopher Ré, who is now at Stanford University. Anand Srinivasan helped integrate these features into ORAAH.

References

[1] Towards a Unified Architecture for in-RDBMS Analytics. Xixuan Feng, Arun Kumar, Benjamin Recht, and Christopher Ré. ACM SIGMOD 2012.

[2] Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion. Benjamin Recht and Christopher Ré. Mathematical Programming Computation 2013.

[3] Apache Mahout. http://mahout.apache.org/.

Thursday Feb 13, 2014

Monitoring progress of embedded R functions

When you run R functions in the database, especially functions involving multiple R engines in parallel, you can monitor their progress using the Oracle R Enterprise datastore as a central location for progress notifications, or any intermediate status or results. In the following example, based on ore.groupApply, we illustrate instrumenting a simple function that builds a linear model to predict flight arrival delay based on a few other variables.

The function modelBuildWithStatus first removes incomplete cases from the data supplied in argument dat and verifies that rows remain for building the model. If the data is not empty, the function builds a model and reports “success”; otherwise, it reports “no_data”. In practice, the user would likely use this model in some way or save it in a datastore for future use, but for this example, we just build the model and discard it, validating that a model can be built on the data.


modelBuildWithStatus <-
  function(dat) {
    dat <- dat[complete.cases(dat),]
    if (nrow(dat) > 0L) {
      mod <- lm(ARRDELAY ~ DISTANCE + AIRTIME + DEPDELAY, dat)
      "success"
    } else
      "no_data"
  }

When we invoke this using ore.groupApply, the goal is to build one model per “unique carrier” or airline. Using an ORE 1.4 feature, we specify the degree of parallelism using the parallel argument, setting it to 2.


res <- ore.groupApply(ONTIME_S[, c("UNIQUECARRIER","DISTANCE", "ARRDELAY", "DEPDELAY", "AIRTIME")],
        ONTIME_S$UNIQUECARRIER,
        modelBuildWithStatus,
        parallel=2L)

res.local<-ore.pull(res)
res.local[unlist(res.local)=="no_data"]

The result tells us the status of each execution. Below, we print the unique carriers that had no data.


R> res.local<-ore.pull(res)
R> res.local[unlist(res.local)=="no_data"]
$EA
[1] "no_data"

$`ML(1)`
[1] "no_data"

$`PA(1)`
[1] "no_data"

$PI
[1] "no_data"

$PS
[1] "no_data"

To monitor the progress of each execution, we can identify the group of data being processed in each function invocation using the value from the UNIQUECARRIER column. For this particular data set, we append the first two characters of the carrier’s symbol to “group.” to form a unique object name for storing in the datastore identified by job.name. (If we don’t do this, some carrier values, such as “ML(1)”, would form invalid object names.) Note that since the UNIQUECARRIER column contains the same value for every row in a group, we need only the first value.

The general idea for monitoring progress is to save an object in the datastore named for each execution of the function on a group. We can then list the contents of the named datastore and compute a percentage complete, which is discussed later in this post. For the “success” case, we assign the value “SUCCESS” to the variable named by the string in nm that we created earlier. Using ore.save, this uniquely named object is stored in the datastore with the name in job.name. We use the append=TRUE flag to indicate that the various function executions will be sharing the same named datastore.
If there is no data left in dat, we assign “NO DATA” to the variable named in nm and save that. Notice that in both cases we still return “success” or “no data”, so these values come back in the list returned by ore.groupApply. However, we could return other values instead, e.g., the model produced.


modelBuildWithMonitoring <-
  function(dat, job.name) {
  nm <- paste("group.", substr(as.character(dat$UNIQUECARRIER[1L]),1,2), sep="")
  dat <- dat[complete.cases(dat),]
  if (nrow(dat)>0L) {
    mod <- lm(ARRDELAY ~ DISTANCE + AIRTIME + DEPDELAY, dat);
    assign(nm, "SUCCESS")
    ore.save(list=nm, name=job.name, append=TRUE)
    "success"
  } else {
    assign(nm, "NO DATA")
    ore.save(list=nm, name=job.name, append=TRUE)
    "no data"
  }
}

When we use this function with ore.groupApply, we provide the job.name and ore.connect arguments as well. The argument ore.connect must be set to TRUE in order to use the datastore. As ore.groupApply executes, the datastore named by job.name steadily accumulates objects named for each carrier. First, delete the datastore named “job1”, if it exists.


ore.delete(name="job1")

res <- ore.groupApply(ONTIME_S[, c("UNIQUECARRIER","DISTANCE", "ARRDELAY", "DEPDELAY", "AIRTIME")],
        ONTIME_S$UNIQUECARRIER,
        modelBuildWithMonitoring,
        job.name="job1", parallel=2L, ore.connect=TRUE)

To see the progress during execution, we can use the following function, which takes a job name and the cardinality of the INDEX column to determine the percent complete. This function is invoked in a separate R engine connected to the same schema. If the job name is found, we print the percent complete; otherwise, we stop with an error message.


check.progress <- function(job.name, total.groups) {
  if ( job.name %in% ore.datastore()$datastore.name )
    print(sprintf("%.1f%%", nrow(ore.datastoreSummary(name=job.name))/total.groups*100L))
  else
    stop(paste("Job", job.name, " does not exist"))
}

To invoke this, compute the total number of groups and provide it along with the job name to the function check.progress.


total.groups <- length(unique(ONTIME_S$UNIQUECARRIER))
check.progress("job1", total.groups)

However, we really want a loop to report on the progress automatically. One simple approach is to set up a while loop with a sleep delay. When we reach 100%, stop. To be self-contained, we include a simplification of the function above as a local function.


check.progress.loop <- function(job.name, total.groups, sleep.time=2) {
  check.progress <- function(job.name, total.groups) {
    if ( job.name %in% ore.datastore()$datastore.name )
      print(sprintf("%.1f%%", nrow(ore.datastoreSummary(name=job.name))/total.groups*100L))
    else
      paste("Job", job.name, " does not exist")
  }
  while(1) {
    try(x <- check.progress(job.name,total.groups))
    Sys.sleep(sleep.time)
    if(x=="100.0%") break
  }
}

As before, this function is invoked in a separate R engine connected to the same schema.


check.progress.loop("job1",total.groups)

Looking at the results, we can see the progress reported at one second intervals. Since the models build quickly, it doesn’t take long to reach 100%. For functions that take longer to execute or where there are more groups to process, you may choose a longer sleep time. Following this, we look at the datastore “job1” using ore.datastore and its contents using ore.datastoreSummary.


R> check.progress.loop("job1",total.groups,sleep.time=1)
[1] "6.9%"
[1] "96.6%"
[1] "100.0%"

R> ore.datastore(name="job1")
  datastore.name object.count size       creation.date description
1           job1           29 1073 2014-02-13 22:03:20
R> ore.datastoreSummary(name="job1")
   object.name     class size length row.count col.count
 1    group.9E character   37      1        NA        NA
 2    group.AA character   37      1        NA        NA
 3    group.AQ character   37      1        NA        NA
 4    group.AS character   37      1        NA        NA
 5    group.B6 character   37      1        NA        NA
 6    group.CO character   37      1        NA        NA
 7    group.DH character   37      1        NA        NA
 8    group.DL character   37      1        NA        NA
 9    group.EA character   37      1        NA        NA
10    group.EV character   37      1        NA        NA
11    group.F9 character   37      1        NA        NA
12    group.FL character   37      1        NA        NA
13    group.HA character   37      1        NA        NA
14    group.HP character   37      1        NA        NA
15    group.ML character   37      1        NA        NA
16    group.MQ character   37      1        NA        NA
17    group.NW character   37      1        NA        NA
18    group.OH character   37      1        NA        NA
19    group.OO character   37      1        NA        NA
20    group.PA character   37      1        NA        NA
21    group.PI character   37      1        NA        NA
22    group.PS character   37      1        NA        NA
23    group.TW character   37      1        NA        NA
24    group.TZ character   37      1        NA        NA
25    group.UA character   37      1        NA        NA
26    group.US character   37      1        NA        NA
27    group.WN character   37      1        NA        NA
28    group.XE character   37      1        NA        NA
29    group.YV character   37      1        NA        NA

The same basic technique can be used to note progress in any long running or complex embedded R function, e.g., in ore.tableApply or ore.doEval. At various points in the function, sequence-named objects can be added to a datastore. Moreover, the contents of those objects can contain incremental or partial results, or even debug output.
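
As a sketch of that idea for ore.doEval (the function body, the step object names, and the datastore name "job2" are illustrative; as before, ore.connect must be TRUE so the datastore is accessible):


longRunningTask <- function(job.name) {
  # ... first expensive stage of the computation ...
  step.1 <- "stage 1 complete"              # status, partial result, or debug output
  ore.save(step.1, name=job.name, append=TRUE)

  # ... remaining work ...
  step.2 <- "DONE"
  ore.save(step.2, name=job.name, append=TRUE)
  TRUE
}

res <- ore.doEval(longRunningTask, job.name="job2", ore.connect=TRUE)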

While we’ve focused on the R API for embedded R execution, the same functions could be invoked using the SQL API. However, monitoring would still be done from an interactive R engine.

Tuesday Feb 04, 2014

Invoking R scripts via Oracle Database: Theme and Variation, Part 6

How can I use "group apply" to partition data over multiple columns for parallel execution?
How can I use R for statistical computations and return results as a database table?

In this blog post of our theme and variation series, we answer these two questions through several examples, highlighting both R and SQL interfaces.

So far in this blog series on Oracle R Enterprise embedded R execution we've covered:

Part 1: ore.doEval / rqEval
Part 2: ore.tableApply / rqTableEval
Part 3: ore.groupApply / “rqGroupApply”
Part 4: ore.rowApply / rqRowEval
Part 5: ore.indexApply

Using ore.groupApply for partitioning data on multiple columns

While the “group apply” functionality is quite powerful as it is, users sometimes want to partition data on multiple columns. Since ore.groupApply currently takes only a single column for the INDEX argument, users can create a new column that is the concatenation of the columns of interest, and provide this column to the INDEX argument. We’ll illustrate this first using the R API, and then the SQL API.

R API

We adapt an example from Part 3 to illustrate partitioning data on multiple columns. Instead of building a C5.0 model, we’ll use the same CHURN_TRAIN data set, but build an rpart model since it will produce rules on the partitions of data we’ve chosen for the example, namely, voice_mail_plan and international_plan. To understand the number of rows we can expect in each partition, we’ll use the R table function. We then add a new column that pastes together the two columns of interest to create a new column called “vmp_ip”.


library(C50)
data(churn)

ore.create(churnTrain, "CHURN_TRAIN")

table(CHURN_TRAIN$international_plan, CHURN_TRAIN$voice_mail_plan)
CT <- CHURN_TRAIN
CT$vmp_ip <- paste(CT$voice_mail_plan,CT$international_plan,sep="-")
head(CT)

Each invocation of the function “my.rpartFunction” receives data from one of the partitions identified by vmp_ip. Since the partitioning columns are constant within each partition, we set them to NULL to exclude them from the model. The character vectors are converted to factors, and the model is built to predict churn and saved in an appropriately named datastore. Instead of returning TRUE as in the previous example, we create a list that returns the specific partition column values, the distribution of churn values, and the model itself.


ore.scriptDrop("my.rpartFunction")
ore.scriptCreate("my.rpartFunction",
  function(dat,datastorePrefix) {
    library(rpart)
    vmp <- dat[1,"voice_mail_plan"]
    ip <- dat[1,"international_plan"]
    datastoreName <- paste(datastorePrefix,vmp,ip,sep="_")
    dat$voice_mail_plan <- NULL
    dat$international_plan <- NULL
    dat$state <- as.factor(dat$state)
    dat$churn <- as.factor(dat$churn)
    dat$area_code <- as.factor(dat$area_code)
    mod <- rpart(churn ~ ., data = dat)
    ore.save(mod, name=datastoreName, overwrite=TRUE)
    list(voice_mail_plan=vmp,
        international_plan=ip,
        churn.table=table(dat$churn),
        rpart.model = mod)
  })

After loading the rpart library and setting the datastore prefix, we invoke ore.groupApply using the derived column vmp_ip as the input to argument INDEX. After building the models, we’ll look at the first entry in the list returned. Using ore.load, we can load the model for the case where the customer neither has the voice mail plan, nor the international plan.


library(rpart)

datastorePrefix="my.rpartModel"

res <- ore.groupApply( CT, INDEX=CT$vmp_ip,
      FUN.NAME="my.rpartFunction",
      datastorePrefix=datastorePrefix,
      ore.connect=TRUE)
res[[1]]
ore.load(name=paste(datastorePrefix,"no","no",sep="_"))
mod

SQL API

To invoke this from the SQL API, we use the same approach as covered in Part 3. While we could create the table CT from the ore.frame used above, the following instead illustrates creating the derived column in SQL and explicitly defining a VIEW.


CREATE OR REPLACE VIEW CT AS
  SELECT t.*, "voice_mail_plan" || '-' || "international_plan" as "vmp_ip"
  FROM CHURN_TRAIN t;

Next, we create a PL/SQL PACKAGE and FUNCTION for the invocation.


CREATE OR REPLACE PACKAGE churnPkg AS
  TYPE cur IS REF CURSOR RETURN CT%ROWTYPE;
END churnPkg;
/
CREATE OR REPLACE FUNCTION churnGroupEval(
  inp_cur churnPkg.cur,
  par_cur SYS_REFCURSOR,
  out_qry VARCHAR2,
  grp_col VARCHAR2,
  exp_txt CLOB)
RETURN SYS.AnyDataSet
PIPELINED PARALLEL_ENABLE (PARTITION inp_cur BY HASH ("vmp_ip"))
CLUSTER inp_cur BY ("vmp_ip")
USING rqGroupEvalImpl;
/

Then, we can invoke the R function by name in the SELECT statement as follows:


select *
from table(churnGroupEval(
  cursor(select * from CT),
  cursor(select 1 as "ore.connect", 'my.rpartModel2' as "datastorePrefix" from dual),
  'XML', 'vmp_ip', 'my.rpartFunction'));

As another variation on this theme, suppose that you didn’t want to include all the columns from the source data set. To achieve this, you could create a view and define the PACKAGE from the view. However, you could also define a record that contains the specific columns of interest. This is a standard PL/SQL specification that can be used in combination with “group apply”.


CREATE OR REPLACE PACKAGE churnPkg2 AS
  TYPE rec IS RECORD ("vmp_ip" varchar2(8),
    "churn" varchar2(4),
    "state" varchar2(4),
    "account_length" NUMBER(38));
  TYPE cur IS REF CURSOR RETURN rec;
END churnPkg2;
/

If you don’t want to or cannot create a view, this approach lets you specify the exact columns required for model building. Reducing the number of columns on input can improve performance, since only the required data is passed to the server-side R engine. Notice that we could have used this approach above, since we remove the source partition columns before building the model anyway.

How to return results from R statistical functions as database table data

R provides a wide range of statistical and advanced analytics functions. While Oracle Database contains a wide range of statistical functionality accessible from SQL, R further extends this set. In this next topic, we illustrate how to return statistical results as a SQL table for use in other SQL queries or to feed SQL-based applications.

As our example, we’ll use the R principal components function princomp. Our goal is to return the loadings of the PCA model as a database table. For our data set, we’ll use the USArrests data set provided with R. We can view the results of princomp in the mod variable, which has class “princomp”. We then push this data to Oracle Database, getting an ore.frame object.


mod <- princomp(USArrests, cor = TRUE)
class(mod)
mod
dat <- ore.push(USArrests)

R> mod <- princomp(USArrests, cor = TRUE)
R> class(mod)
[1] "princomp"
R> mod
Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

4 variables and 50 observations.
R> dat <- ore.push(USArrests)

In the first case considered, we use ore.tableApply to return simply the princomp object. When we do this we’re getting back a serialized object of type ore.object, but the actual princomp object still resides in the database. We can pull this object from the database to get a local princomp object, but this type of result cannot be directly returned as a SQL table because we need an object of class data.frame (which we’ll address later).


res <- ore.tableApply(dat,
      function(dat) {
        princomp(dat, cor=TRUE)
      })
class(res)
res.local <- ore.pull(res)
class(res.local)
str(res.local)
res.local
res

In the following output, we see the result is an ore.object that we pull from the database to get a princomp object. We examine the structure of the object and focus on the loadings element. In the example, we print res.local and res. Since res is an ore.object, it automatically gets pulled to the client before printing it.


R> res <- ore.tableApply(dat,
+ function(dat) {
+ princomp(dat, cor=TRUE)
+ })
R> class(res)
[1] "ore.object"
attr(,"package")
[1] "OREembed"
R> res.local <- ore.pull(res)
R> class(res.local)
[1] "princomp"
R> str(res.local)
List of 7
$ sdev : Named num [1:4] 1.575 0.995 0.597 0.416
..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
$ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
.. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
$ center : Named num [1:4] 7.79 170.76 65.54 21.23
..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
$ scale : Named num [1:4] 4.31 82.5 14.33 9.27
..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
$ n.obs : int 50
$ scores : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:50] "1" "2" "3" "4" ...
.. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
$ call : language princomp(x = dat, cor = TRUE)
- attr(*, "class")= chr "princomp"
R> res.local
Call:
princomp(x = dat, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

4 variables and 50 observations.
R> res
Call:
princomp(x = dat, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

4 variables and 50 observations.

In this next case, we focus on the loadings component of the princomp object, which contains the matrix of variable loadings, that is, a matrix whose columns contain the eigenvectors. This is of class "loadings"…still not a data.frame. To convert the loadings component to a data.frame, we determine the dimensions of the matrix and then construct a data.frame by accessing the cells of the loadings object. To get the variables associated with each row, we assign the row names of the loadings to the column variables. Finally, we return the loadings data.frame.


res <- ore.tableApply(dat,
      function(dat) {
        mod <- princomp(dat, cor=TRUE)
        dd <- dim(mod$loadings)
        ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
        ldgs$variables <- row.names(ldgs)
        ldgs
      })
class(res)
res

In the output below, notice that we still have an ore.object being returned, but it’s in the form of a data.frame.


R> res <- ore.tableApply(dat,
+ function(dat) {
+ mod <- princomp(dat, cor=TRUE)
+ dd <- dim(mod$loadings)
+ ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
+ ldgs$variables <- row.names(ldgs)
+ ldgs
+ })
R> class(res)
[1] "ore.object"
attr(,"package")
[1] "OREembed"
R> res
             Comp.1     Comp.2     Comp.3      Comp.4 variables
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780    Murder
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748   Assault
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773  UrbanPop
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432      Rape

We can address this last issue by specifying the FUN.VALUE argument to get an ore.frame result (left as an exercise to the reader). But our main goal is to enable returning the loadings from SQL as a database table. For that, we create the function in the R script repository and construct the appropriate SQL query. In preparation for the next example, we’ll create the table USARRESTS using the R data set.


ore.create(USArrests,table="USARRESTS")

Now, we’ll switch to SQL. We’re introducing the functions sys.rqScriptDrop and sys.rqScriptCreate, which are used within a BEGIN END PL/SQL block, to store the R function ‘princomp.loadings’.


begin
--sys.rqScriptDrop('princomp.loadings');
sys.rqScriptCreate('princomp.loadings',
      'function(dat) {
        mod <- princomp(dat, cor=TRUE)
        dd <- dim(mod$loadings)
        ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
        ldgs$variables <- row.names(ldgs)
        ldgs
      }');
end;
/

The SELECT statement provides input data by selecting all data from USARRESTS. There are no arguments to pass, so the next parameter is NULL. The SELECT string describes the format of the result. Notice that the column names must match in name (including case) and type. The last parameter is the name of the function stored in the R script repository.


select *
from table(rqTableEval( cursor(select * from USARRESTS),NULL,
          'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual','princomp.loadings'));

SQL> select *
from table(rqTableEval( cursor(select * from USARRESTS),NULL,
          'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual','princomp.loadings'));
    Comp.1     Comp.2     Comp.3     Comp.4 variables
---------- ---------- ---------- ---------- ------------
-.53589947 .418180865 -.34123273 .649227804 Murder
-.58318363 .187985604 -.26814843 -.74340748 Assault
-.27819087 -.87280619 -.37801579 .133877731 UrbanPop
-.54343209 -.16731864 .817777908 .089024323 Rape

If you have interesting embedded R scenarios to share with the ORE community, please consider posting a comment.

