Monday Jul 29, 2013

Oracle R Distribution for R-3.0.1 released

We're pleased to announce that the Oracle R Distribution 3.0.1 Linux RPMs are now available on Oracle's public yum server. R-3.0.1, code-named "Good Sport", is the second release in the R-3.0.x series. This new series doesn't introduce sweeping new features; rather, it signals that the code base has developed to a new level of maturity.

However, there are some significant improvements in the 3.0 series worth mentioning. R-3.0.0 introduces support for large vectors in R, eliminating a long-standing restriction in the core R engine and allowing R to use the memory available on 64-bit systems more efficiently. Prior to this release, vectors were hard-limited to 2^31-1 elements, or roughly 2.1 billion. Index values beyond this limit were treated as missing (NA), sometimes with a warning, regardless of the memory available on the system. Starting in R-3.0.0, objects can exceed this limit, which is a significant improvement. Here's the relevant statement from the R-devel NEWS file:

 There is a subtle change in behaviour for numeric index 
 values 2^31 and larger. These never used to be legitimate 
 and so were treated as NA, sometimes with a warning. They 
 are now legal for long vectors so there is no longer a 
 warning, and x[2^31] <- y will now extend the vector on a 
 64-bit platform and give an error on a 32-bit one. 
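
To illustrate, on a 64-bit system with sufficient memory (a vector of 2^31 doubles occupies roughly 16 GB), allocation and indexing past the old limit now succeed. The session below is a sketch of the new behavior, not output from a specific machine:

R> x <- numeric(2^31)    # pre-3.0.0, vectors were capped at 2^31 - 1 elements
R> length(x)             # length() returns a double for a long vector
[1] 2147483648
R> x[2^31] <- 1          # legal on a 64-bit platform; an error on 32-bit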

R-3.0.1 adds to these updates by improving serialization for big objects and fixing a variety of bugs.

Older open source R packages will need to be re-installed after upgrading from ORD 2.15.x to ORD 3.0.1, which is accomplished by running:

R> update.packages(checkBuilt = TRUE) 

This command upgrades open source packages if a more recent version exists on CRAN or if the installed package was built with an older version of R.

Oracle R Distribution 3.0.1 will be compatible with future versions of Oracle R Enterprise.  As of this posting, we recommend using ORD 2.15.3 with Oracle R Enterprise 1.3.1.  When installing ORD for use with ORE 1.3.1, be sure to use the command yum install R-2.15.3; otherwise, R-3.0.1 will be installed by default.

 
ORD 3.0.1 binaries for AIX, Solaris x86, and Solaris SPARC will be available from Oracle's Free and Open Source Software portal soon. Please check back for updates.

Friday Jul 19, 2013

Oracle R Connector for Hadoop 2.2.0 released

Oracle R Connector for Hadoop 2.2.0 is now available for download. The Oracle R Connector for Hadoop 2.x series has introduced numerous enhancements, which are highlighted in this article and summarized as follows:

ORCH 2.0.0

  • Analytic functions: orch.lm, orch.lmf, orch.neural, orch.nmf
  • Oracle Loader for Hadoop (OLH) support
  • CDH 4.2.0
  • ORCHhive transparency layer

ORCH 2.1.0

  • Analytic functions: orch.cor, orch.cov, orch.kmeans, orch.princomp, orch.sample (by percent)
  • Configurable delimiters in text input data files
  • Map-only and reduce-only jobs
  • Keyless map/reduce output
  • "Pristine" data mode for high performance data access
  • HDFS cache of metadata
  • Hadoop Abstraction Layer (HAL)

ORCH 2.2.0

  • Analytic functions: orch.sample (by number of rows)
  • CDH 4.3.0
  • Full online documentation
  • Integer and matrix data types in hdfs.attach, with detection of "pristine" data
  • Out-of-the-box support for "pristine" mode for high I/O performance
  • HDFS cache to improve interactive performance when navigating HDFS directories and file lists
  • HDFS multi-file upload and download performance enhancements
  • HAL for Hortonworks Data Platform 1.2 and Apache Hadoop 1.0

ORCH 2.0.0

In ORCH 2.0.0, we introduced four Hadoop-enabled analytic functions supporting linear regression, low rank matrix factorization, neural networks, and non-negative matrix factorization. These enable R users to immediately apply advanced analytics to HDFS data using the MapReduce paradigm on a Hadoop cluster, without having to design and implement such algorithms themselves.
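
For example, fitting a linear model to data already in HDFS follows the familiar R formula idiom. The sketch below assumes ORCH is installed and connected, that the HDFS path exists, and that the file contains the named columns; none of these details come from the release itself:

# A minimal sketch: linear regression on HDFS data, executed as MapReduce jobs
library(ORCH)
dfs.dat <- hdfs.attach("/user/oracle/stock_data")  # hypothetical delimited file in HDFS
fit <- orch.lm(Close ~ Open + Volume, dfs.dat)     # Hadoop-enabled linear regression
summary(fit)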

While ORCH 1.x supported moving data between the database and HDFS using Sqoop, ORCH 2.0.0 supports the use of Oracle Loader for Hadoop (OLH) to move very large data volumes from HDFS to Oracle Database in an efficient, high-performance manner.

ORCH 2.0.0 supported Cloudera Distribution for Hadoop (CDH) version 4.2.0 and introduced the ORCHhive transparency layer, which leverages the Oracle R Enterprise transparency layer for SQL but instead maps to HiveQL, a SQL-like language for manipulating HDFS data via Hive tables.
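
As a sketch of what this enables (the connection style and table name are assumptions; consult the ORCH documentation for your version), ORE-style R code can be pushed down to Hive:

library(ORCH)
ore.connect(type="HIVE")   # route the transparency layer to Hive rather than Oracle Database
ore.sync(); ore.attach()   # map Hive tables to R proxy objects
# stock_tab is a hypothetical Hive table; this aggregate executes as HiveQL
aggregate(stock_tab$Close, by=list(symbol=stock_tab$symbol), FUN=mean)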

ORCH 2.1.0

In ORCH 2.1.0, we added several more analytic functions, including correlation and covariance, clustering via k-Means, principal component analysis (PCA), and sampling by specifying the percentage of records to return.

ORCH 2.1.0 also brought a variety of features, including configurable delimiters (any ASCII delimiter, not just comma-delimited text files), the ability to specify map-only and reduce-only jobs, and support for NULL keys in mapper and reducer output.
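
As an illustration of a map-only job with keyless output, here is a minimal sketch; it assumes ORCH is loaded and uses the hadoop.run/orch.keyval calling conventions of this release:

# Push a small local data.frame to HDFS and run a mapper with no reducer
dfs.dat <- hdfs.put(cars)
res <- hadoop.run(dfs.dat,
         mapper = function(key, val) {
           # emit each record with a NULL key -- keyless output, new in 2.1.0
           orch.keyval(NULL, val)
         })                  # no reducer argument: a map-only job
hdfs.get(res)                # fetch the mapper output back into R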

To speed the loading of data into Hadoop jobs, ORCH introduced “pristine” mode where the user guarantees that the data meets certain requirements so that ORCH skips a time-consuming data validation step. “Pristine” data requires that numeric columns contain only numeric data, that missing values are either R’s NA or the null string, and that all rows have the same number of columns. This improves performance of hdfs.get on a 1GB file by a factor of 10.

ORCH 2.1.0 introduced caching of ORCH metadata to improve the response time of ORCH functions, making functions such as hdfs.ls, hdfs.describe, and hdfs.mget between 5x and 70x faster.

The Hadoop Abstraction Layer, or HAL, enables ORCH to work on top of various Hadoop versions and variants, including Apache and Hortonworks Hadoop and the Cloudera distributions CDH3 and CDH 4.x, with both MR1 and MR2.

ORCH 2.2.0

In the latest release, ORCH 2.2.0, we’ve augmented orch.sample to allow specifying the number of rows in addition to a percentage of rows. CDH 4.3 is now supported, and ORCH functions provide full online documentation via R's help function or ?. The function hdfs.attach now supports integer and matrix data types and can detect "pristine" data automatically. HDFS bulk directory upload and download speeds were also improved. Through the caching and automatic synchronization of ORCH metadata and file lists, the responsiveness of HDFS-related metadata functions has improved by 3x over ORCH 2.1.0, which also improves performance of the hadoop.run and hadoop.exec functions. These improvements in turn bring a more interactive user experience when working with HDFS from R.

Starting in ORCH 2.2.0, we introduced out-of-the-box tuning optimizations for high performance and expanded HDFS caching to include the caching of file lists, which further improves performance of HDFS-related functions.

The function hdfs.upload now supports uploading multi-file directories in a single invocation, which streamlines the process. When downloading an HDFS directory, hdfs.download is optimized to issue a single HDFS command that downloads the files into one local temporary directory before combining the separate parts into a single file.
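
A short sketch of the bulk transfer calls (both paths are hypothetical, and the argument names follow the ORCH documentation of this period):

# Upload a local multi-file directory as one HDFS object, then download it back
dfs.id <- hdfs.upload("/home/oracle/data/parts", dfs.name="parts")
hdfs.download(dfs.id, filename="/home/oracle/data/parts.csv")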

The Hadoop Abstraction Layer (HAL) was extended to support Hortonworks Data Platform 1.2 and Apache Hadoop 1.0. In addition, ORCH now allows the user to override the Hadoop Abstraction Layer version for use with unofficially supported distributions of Hadoop using system environment variables. This enables testing and certification of ORCH by other Hadoop distribution vendors.

Certification of ORCH on non-officially supported platforms can be done using a separate test kit (available for download upon request: mark.hornick@oracle.com) that includes an extensive set of tests for core ORCH functionality and that can be run using the ORCH built-in testing framework. Running the tests pinpoints the failures and ensures that ORCH is compatible with the target platform.

See the ORCH 2.2.0 Change List and Release Notes for additional details. ORCH 2.2.0 can be downloaded here.


Thursday Jul 18, 2013

Simple and Advanced Time series with Oracle R Enterprise

This guest post from Marcos Arancibia describes how to use Oracle R Enterprise to analyze Time Series data.

In this article, we give an overview of using Time Series Analysis against data stored in Oracle Database, using the Embedded R Execution capability to run time series computations on the Oracle Database server instead of processing them at the client. We will also see how to retrieve the resulting series or forecasts to the client for plotting, forecasting, and diagnostics.

One key thing to keep in mind when using Time Series techniques with data that is stored in Oracle Database is the order of the rows, or records. Because of the parallel capabilities of Oracle Database, when queried for records, one might end up receiving records out of order if an option for order is not specified.
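
For instance, once the ORCLSTOCK table used below exists, a safe pattern is to sort explicitly by the date column whenever the series is materialized on the client (a sketch):

# Row order coming from the database is not guaranteed without an explicit sort
dat <- ore.pull(ORCLSTOCK[,c("date","Close")])
dat <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]  # enforce time order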

Simple Example using Stock Data

Let’s start with a simple Time Series example. First we will need to connect to our Oracle Database using ORE. Then, using the package TTR, we will fetch Oracle stock data from the Yahoo Finance service (via getYahooData), from January 1, 2008 to January 1, 2013, and push it to the database.

# Load the ORE library and connect to Oracle Database
library(ORE)
ore.connect("myuser","mysid","myserver","mypass",port=1521,all=TRUE)
library(TTR)

# Get data in xts format
xts.orcl <- getYahooData("ORCL", 20080101, 20130101)

# Convert it to a data frame and make the date, currently the index, a column
df.orcl <- data.frame(xts.orcl)
df.orcl$date <- (data.frame(date=index(xts.orcl))$date)

# Create/overwrite data in Oracle Database in a table called ORCLSTOCK
ore.drop(table="ORCLSTOCK")
ore.create(df.orcl,table="ORCLSTOCK")

# IMPORTANT STEP!!!
# Ensure indexing is kept by date
rownames(ORCLSTOCK) <- ORCLSTOCK$date

# Ensure the data is in the database
ore.ls()

# Review column names, data statistics and print a sample of the data
> names(ORCLSTOCK)
 [1] "Open"        "High"        "Low"         "Close"       "Volume"
 [6] "Unadj.Close" "Div"         "Split"       "Adj.Div"     "date"

> summary(ORCLSTOCK$Close)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.36   20.53   24.22   24.79   29.70   35.73

> head(ORCLSTOCK)
                        Open     High      Low    Close   Volume
2008-01-02 01:00:00 21.74414 22.00449 21.58022 21.68629 44360179
2008-01-03 01:00:00 21.62843 22.28413 21.62843 22.28413 43600532
2008-01-04 01:00:00 21.95628 22.06235 21.01130 21.24272 46391263
2008-01-07 01:00:00 21.17523 21.67664 21.01130 21.45486 41527032
2008-01-08 01:00:00 21.44522 21.52236 20.38453 20.39417 45155398
2008-01-09 01:00:00 20.57738 20.91487 20.39417 20.83773 49750304
                    Unadj.Close Div Split Adj.Div       date
2008-01-02 01:00:00       22.49  NA    NA      NA 2008-01-02
2008-01-03 01:00:00       23.11  NA    NA      NA 2008-01-03
2008-01-04 01:00:00       22.03  NA    NA      NA 2008-01-04
2008-01-07 01:00:00       22.25  NA    NA      NA 2008-01-07
2008-01-08 01:00:00       21.15  NA    NA      NA 2008-01-08
2008-01-09 01:00:00       21.61  NA    NA      NA 2008-01-09

Pull data from the database for a simple plot

# Pull data from Oracle Database (only the necessary columns)
orcl <- ore.pull(ORCLSTOCK[,c("date","Close","Open","Low","High")])

# Simple plot with base libraries - Closing
plot(orcl$date,orcl$Close,type="l",col="red",xlab="Date",ylab="US$",
     main="Base plot:Daily ORACLE Stock Closing points")

# Simple plot with base libraries - Other Series
plot(orcl$date,orcl$Open,type="l",col="blue",xlab="Date",ylab="US$",
     main="Base plot:Daily ORACLE Stock: Open/High/Low points")
lines(orcl$date,orcl$High,col="green")
lines(orcl$date,orcl$Low,col="orange")
legend("topleft", c("Opening","High","Low"),
       col=c("blue","green","orange"),lwd=2,title="Series",bty="n")

A different plot option, using the package xts

library(xts)

# Pull data from Oracle Database (only the necessary columns)
orcl <- ore.pull(ORCLSTOCK[,c("date","Close","Open","Low","High")])

# Convert data to Time Series format
orcl.xts <- as.xts(orcl,order.by=orcl$date,dateFormat="POSIXct")

# Plot original series
plot(orcl.xts$Close,major.ticks='months',minor.ticks=FALSE,
     main="Time Series plot:Daily ORACLE Stock Closing points",col="red")

Simple Time Series: Moving Average Smoothing

We might be tempted to call functions like a smoothing moving average from open-source CRAN packages directly against Oracle Database tables, but those packages do not know what to do with an “ore.frame”. For that process to work correctly, we can either load the data locally or send the process for remote execution on the database server using Embedded R Execution.

We will also explore the built-in Moving Average process from ore.rollmean() as a third alternative.

ALTERNATIVE 1 - The first example pulls the data from Oracle Database into a ts (time series) object first, for a client-side smoothing process.

library(TTR)

# Pull part of the database table into a local data.frame
sm.orcl <- ore.pull(ORCLSTOCK[,c("date","Close")])

# Convert the "Close" attribute into a Time Series (ts)
ts.orcl <- ts(sm.orcl$Close)

# Use SMA - the Simple Moving Average algorithm from package TTR
ts.sm.orcl <- ts(SMA(ts.orcl,n=30),frequency=365,start=c(2008,1))

# Plot both series together
plot(sm.orcl$date,sm.orcl$Close,type="l",col="red",xlab="Date",ylab="US$",
     main="ORCL Stock Close CLIENT-side Smoothed Series n=30 days")
lines(sm.orcl$date,ts.sm.orcl,col="blue")
legend("topleft", c("Closing","MA(30) of Closing"),
       col=c("red","blue"),lwd=2,title="Series",bty="n")

ALTERNATIVE 2 – In this alternative, we run the moving-average smoothing on the server side, without bringing all the data to the client. Only the result is brought locally for plotting. Remember that the TTR package has to be installed on the server in order to be called.

# Server execution call using ore.tableApply
# Result is an ore.list that remains in the database until needed
sv.orcl.ma30 <-
  ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect = TRUE,
    function(dat) {
      library(TTR)
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
      list(res1 <- ts(ordered$Close,frequency=365,start=c(2008,1)),
           res2 <- ts(SMA(res1,n=30),frequency=365,start=c(2008,1)),
           res3 <- ordered$date)
    })

# Bring the results locally for plotting
local.orcl.ma30 <- ore.pull(sv.orcl.ma30)

# Plot two series side by side
# (the third element of the list is the date)
plot(local.orcl.ma30[[3]],local.orcl.ma30[[1]],type="l",
     col="red",xlab="Date",ylab="US$",
     main="ORCL Stock Close SERVER-side Smoothed Series n=30 days")

# Add smoothed series
lines(local.orcl.ma30[[3]],local.orcl.ma30[[2]],col="blue",type="l")

# Add legend
legend("topleft", c("Closing","Server MA(30) of Closing"),
       col=c("red","blue"),lwd=2,title="Series",bty="n")

ALTERNATIVE 3 – In this alternative, we compute the moving averages using the native in-database ORE function ore.rollmean(), without bringing data to the client. Only the result is brought locally for plotting.

Just one line of code generates the in-database computation of moving averages and creates a new VIRTUAL column in Oracle Database. We will call this new column rollmean30.

We will use the function ore.rollmean(). The option align="right" makes the moving average look only at the past k days (30 in this case), or fewer, depending on the point in time. This creates a small difference between this method and the previous methods at the beginning of the series, since ore.rollmean() calculates the first days from whatever smaller window of data is available, while the other methods discard those points.

# Moving Average done directly in Oracle Database
ORCLSTOCK$rollmean30 <- ore.rollmean(ORCLSTOCK$Close, k = 30, align="right")

# Check that the new variable is in the database
> head(ORCLSTOCK)
                        Open     High      Low    Close   Volume
2008-01-02 01:00:00 21.74414 22.00449 21.58022 21.68629 44360179
2008-01-03 01:00:00 21.62843 22.28413 21.62843 22.28413 43600532
2008-01-04 01:00:00 21.95628 22.06235 21.01130 21.24272 46391263
2008-01-07 01:00:00 21.17523 21.67664 21.01130 21.45486 41527032
2008-01-08 01:00:00 21.44522 21.52236 20.38453 20.39417 45155398
2008-01-09 01:00:00 20.57738 20.91487 20.39417 20.83773 49750304
                    Unadj.Close Div Split Adj.Div       date rollmean30
2008-01-02 01:00:00       22.49  NA    NA      NA 2008-01-02   21.68629
2008-01-03 01:00:00       23.11  NA    NA      NA 2008-01-03   21.98521
2008-01-04 01:00:00       22.03  NA    NA      NA 2008-01-04   21.73771
2008-01-07 01:00:00       22.25  NA    NA      NA 2008-01-07   21.66700
2008-01-08 01:00:00       21.15  NA    NA      NA 2008-01-08   21.41243
2008-01-09 01:00:00       21.61  NA    NA      NA 2008-01-09   21.31665

# Get results locally for plotting
local.orcl <- ore.pull(ORCLSTOCK[,c("date","Close","rollmean30")])
sub.orcl <- subset(local.orcl,local.orcl$date > as.Date("2011-12-16"))

# Plot the two series side by side
# First plot original series
plot(local.orcl$date,local.orcl$Close,type="l",col="red",xlab="Date",ylab="US$",
     main="ORCL Stock Close ORE Computation of Smoothed Series n=30 days")

# Add smoothed series
lines(local.orcl$date,local.orcl$rollmean30,col="blue",type="l")

# Add legend
legend("topleft", c("Closing","ORE MA(30) of Closing"),
       col=c("red","blue"),lwd=2,title="Series",bty="n")

Seasonal Decomposition for Time Series Diagnostics

Now that we have learned how to execute these processes using Embedded R, we can start using other methodologies required for Time Series using the same Server-side computation and local plotting.

It is typical for an analyst to try to understand a Time Series better by looking at some of the basic diagnostics, like the Seasonal Decomposition of Time Series by Loess. These can be computed using the stl() function in the following process:

# Server execution
sv.orcl.dcom <-
  ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect = TRUE,
    function(dat) {
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
      ts.orcl <- ts(ordered$Close,frequency=365,start=c(2008,1))
      res <- stl(ts.orcl,s.window="periodic")
    })

# Get result for plotting
local.orcl.dcom <- ore.pull(sv.orcl.dcom)
plot(local.orcl.dcom, main="Server-side Decomposition of ORCL Time-Series",col="blue")

Another typical set of diagnostic charts includes Autocorrelation and Partial Autocorrelation function plots. These can be achieved by using the acf() command with the proper options in Embedded R Execution, so computations happen at the Oracle Database server:

# Server-side ACF and PACF computation
# Use function acf() and save result as a list
sv.orcl.acf <-
  ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,
    function(dat){
      ts.orcl <- ts(dat$Close,frequency=365,start=c(2008,1))
      list(res1 <- acf(ts.orcl,lag.max=120,type="correlation"),
           res2 <- acf(ts.orcl,lag.max=30,type="partial"))
    })

# Get results for plotting
# ACF and PACF are members of the pulled list
local.orcl.acf <- ore.pull(sv.orcl.acf)
plot(local.orcl.acf[[1]],main="Server-side ACF Analysis for Series ORCL",col="blue",lwd=2)
plot(local.orcl.acf[[2]],main="Server-side PACF Analysis for Series ORCL",col="blue",lwd=5)

Simple Exponential Smoothing

Using the popular package “forecast”, we will use the ses() function to forecast a 90-day horizon into the future (h=90), with a fixed smoothing parameter (alpha=0.1) and simple initialization. The package forecast needs to be installed on the Oracle Database server R engine.

Then, we will bring the resulting model locally for plotting. Remember to load the library “forecast” locally as well, to be able to interpret the ses() output when it’s brought locally.

# Execute ses() call in the server
sv.orcl.ses <-
  ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,
    function(dat) {
      library(forecast)
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
      ts.orcl <- ts(ordered$Close,frequency=365,start=c(2008,1))
      res <- ses(ts.orcl, h=90, alpha=0.1, initial="simple")
    })

# Get SES result locally for plotting
# Since remote object contains a SES model from package forecast,
# load package locally as well
library(forecast)
plot.orcl.ses <- ore.pull(sv.orcl.ses)
plot(plot.orcl.ses,col="blue",fcol="red",
     main="ORCL with Server-side SES - Simple Exponential Smoothing Forecast")

Holt Exponential Smoothing

Using the popular package “forecast”, we will use the holt() function to forecast a 90-day horizon into the future (h=90), requesting confidence intervals of 80% and 95%. Again, the package “forecast” needs to be installed on the Oracle Database server R engine.

Then, we will bring the resulting model locally for plotting. Remember to load the library forecast locally as well, to be able to interpret the holt() output when it’s brought locally.

# Execute holt() call in the server
sv.orcl.ets <-
  ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,
    function(dat) {
      library(forecast)
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
      ts.orcl <- ts(ordered$Close,frequency=365,start=c(2008,1))
      res <- holt(ts.orcl, h=90, level=c(80,95), initial="optimal")
    })

# Get resulting model from the server
# Since remote object contains a Holt Exponential Smoothing
# model from package forecast, load package locally as well
library(forecast)
local.orcl.ets <- ore.pull(sv.orcl.ets)
plot(local.orcl.ets,col="blue",fcol="red",
     main="ORCL Original Series Stock Close with Server-side Holt Forecast")

ARIMA – Auto-Regressive Integrated Moving Average

There are at least two options for fitting an ARIMA model to a Time Series. One option is to use the package “forecast”, which allows for automatic ARIMA fitting (auto.arima) to find the best parameters possible based on the series.

For more advanced users, the arima() function in the “stats” package allows choosing the model parameters explicitly.

# ARIMA model on the server using auto.arima() from package forecast
arimaModel <-
  ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,
    FUN = function(dat){
      # load forecast library to use auto.arima
      library(forecast)
      # sort the data by date
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
      # convert the column into a Time Series with ts(...) and request
      # creation of an automatic ARIMA model with auto.arima(...)
      res <- auto.arima(ts(ordered$Close,frequency=365,start=c(2008,1)),
                        stepwise=TRUE, seasonal=TRUE)
    })

# Alternative using arima() from package “stats”
arimaModel <-
  ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,
    FUN = function(dat){
      # sort the data by date
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
      # convert the column into a Time Series with ts(...) and request
      # a specific ARIMA model using arima(), for example an ARIMA(2,1,2)
      res <- arima(ts(ordered$Close,frequency=365,start=c(2008,1)),
                   order = c(2,1,2))
    })

# Load forecast package locally to use the model
# for plotting and producing forecasts
library(forecast)

# Show remote resulting Time Series model
> arimaModel
Series: ts(ordered$Close, frequency = 365, start = c(2008, 1))
ARIMA(2,1,0)

Coefficients:
          ar1      ar2
      -0.0935  -0.0192
s.e.   0.0282   0.0282

sigma^2 estimated as 0.2323:  log likelihood=-866.77
AIC=1739.55   AICc=1739.57   BIC=1754.96

# Get remote model using ore.pull for local prediction and plotting
local.arimaModel <- ore.pull(arimaModel)

# Generate forecasts for the next 15 days
fore.arimaModel <- forecast(local.arimaModel, h=15)

# Use the following option if you need to remove scientific notation of
# numbers that are too large in charts
options(scipen=10)

# Generate the plot of forecasts, including interval of confidence
# Main title is generated automatically indicating the type of model
# chosen by the Auto ARIMA process
plot(fore.arimaModel,type="l", col="blue", xlab="Date",
     ylab="Closing value (US$)", cex.axis=0.75, font.lab="serif EUC",
     sub="Auto-generated ARIMA for ORCL Stock Closing")

# Generate and print forecasted data points plus standard errors
# of the next 15 days
forecasts <- predict(local.arimaModel, n.ahead = 15)

> forecasts
$pred
Time Series:
Start = c(2011, 165)
End = c(2011, 179)
Frequency = 365
 [1] 33.29677 33.29317 33.29395 33.29395 33.29393 33.29393 33.29393 33.29393 33.29393 33.29393 33.29393
[12] 33.29393 33.29393 33.29393 33.29393

$se
Time Series:
Start = c(2011, 165)
End = c(2011, 179)
Frequency = 365
 [1] 0.4819417 0.6504925 0.7807798 0.8928901 0.9924032 1.0827998 1.1662115 1.2440430 1.3172839 1.3866617
[11] 1.4527300 1.5159216 1.5765824 1.6349941 1.6913898

Monday Jul 08, 2013

Accessing Data from Multiple Schemas using Oracle R Enterprise

The most common Oracle R Enterprise configuration is to connect directly to a database schema that contains the tables you wish to analyze. However, users may occasionally need to access tables that exist in other schemas. Oracle R Enterprise provides several options for accessing tables in another schema. Database tables and views are currently supported, and these form the basis for our recommendations.

Named Schema Access

If you have SELECT TABLE or SELECT ANY TABLE privilege on tables in another schema, you can access these tables after connecting to the database with your own schema credentials. The function ore.sync synchronizes database table and view metadata with the R client environment. For example, by setting schema to “user2”, user1 will see all of user2's tables on which user1 has been granted access:

R> library(ORE)
R> ore.connect(user="user1", sid="sid", host="hostname", password="password")
R> ore.sync(schema = "user2", table="myTable")
R> ore.attach(schema = "user2")
R> ore.ls()
[1] "myTable" 

Here, we combine the schema and table arguments to look at a specific table; the table argument can be omitted to access all tables available in schema "user2" at once:

R> ore.sync(schema = "user2")
R> ore.ls()
  [1] "myTable" "anotherTable"

Accessing a materialized table typically offers the best query performance for operations such as joins. In other cases, such as calculating simple summaries, the performance advantage may be negligible.
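
If the same remote table will be queried or joined repeatedly, one option is to materialize a private copy with ore.create; a sketch, reusing the myTable proxy synchronized above (MYTABLE_COPY is a hypothetical name):

R> ore.sync(schema = "user2", table = "myTable")
R> ore.attach(schema = "user2")
R> ore.create(myTable, table = "MYTABLE_COPY")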

Create Views in Local Schema

Another option is to create views in your own schema that map to the tables or views in another schema. You can restrict users to the view instead of the underlying table, thereby enhancing security, and include in the view only those columns needed. For example, if a user exports the contents of a carefully defined view, they will see only the table columns selected by the view - no unselected columns, unique identifiers, or table keys. Views also simplify the user experience by exposing only those database tables the user can or should access. The only catch when using views is that you must update those views if the underlying tables or views change.

Provided you've been granted the CREATE VIEW privilege and SELECT access on the table, use the ore.exec function to execute, from the R client, the SQL statement that creates the view:

R> library(ORE)
R> ore.connect(user="user1", sid="sid", host="hostname", password="password")
R> ore.exec("create view myView as select * from user2.myTable")
R> ore.sync(table = "myView")
R> ore.ls()
  [1] "myView" 

The code above assumes you already have privileges to access the table or view. If you do not, log in as sysdba or connect to the schema of interest and invoke:

SQL> grant select on MYTABLE to user1;

Oracle Wallet

Password credentials for connecting to databases can be stored in a client-side Oracle Wallet, a container used to encrypt authentication credentials. The contents of the wallet are not readable, eliminating the need to expose schema credentials when connecting to the database. Security risks are reduced because passwords are never exposed in clear text. Oracle R Enterprise 1.3 and later are integrated with Oracle Wallet, providing a secure way for R scripts to avoid storing passwords in the script. For detailed information about creating wallets, see the Oracle Database Advanced Security Administrator's Guide. Steps for using Oracle Wallet with Oracle R Enterprise are provided in the Oracle R Enterprise Installation and Administration Guide.
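
Once a wallet is configured, the connection call no longer carries credentials. A minimal sketch, where "mydb_alias" stands in for your wallet's TNS alias (the alias is hypothetical):

R> library(ORE)
R> # Credentials are resolved from the client-side wallet, not the script
R> ore.connect(conn_string = "mydb_alias", all = TRUE)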

If you have a creative technique for accessing data across schemas or other platforms, please recommend it in the blog comments, along with any opinions you have on these approaches.


