In a variety of machine learning applications, there are often requirements for training multiple models. For example, in the internet of things (IoT) industry, a unique model needs to be built for each household with installed sensors that measure temperature, light or power consumption. Another example can be found in the online advertising industry. To serve personalized online advertisements or recommendations, a huge number of individualized models has to be built and maintained because each online user has a unique browsing history. Moreover, such a model has to be updated in a frequent manner to capture the change of consumer behavior. When the number of models goes high, and even the algorithm is carefully designed and proved to be solid, it could be a challenge to implement in production. Especially for time sensitive applications, multiple model training cannot afford the extra delay caused by iterations through a huge number of models. A good example is the financial industry. In this article, we will show an example of fitting multiple foreign exchange (FX) rate models and demonstrate how we can leverage the powerful parallel computation capability provided by the Oracle R Enterprise (ORE) , component of Oracle Advanced Analytics - an option to Oracle Database Enterprise Edition.
FX Rate Data
The FX rate data can be obtained from Federal Reserve Economic Data . Instead of going online to fetch the data manually, a library in R called ‘quantmod’ provides a convenient way of downloading the data. Here is the code used for this purpose.
library(quantmod)
symbol = "DEXCAUS"
getSymbols(symbol,src="FRED")
The symbol “DEXCAUS” means the FX rate of Canadian dollar to US dollar. In this example, we downloaded foreign exchanges rates for 22 currencies and focused on the time range from 1999 to 2015.
rm(list=ls())
symbols <- c( "DEXBZUS", "DEXCAUS", "DEXCHUS", "DEXDNUS", "DEXHKUS", "DEXINUS", "DEXJPUS", "DEXKOUS", "DEXMXUS", "DEXNOUS", "DEXSDUS", "DEXSFUS", "DEXSIUS", "DEXSLUS", "DEXSZUS", "DEXTAUS", "DEXTHUS", "DEXUSAL", "DEXUSEU", "DEXUSNZ", "DEXUSUK", "DEXVZUS")
for( symbol in symbols){ getSymbols(symbol, src="FRED") }
mergecode<-paste("merge(", paste(symbols,collapse=","),")", collapse="")
merged.df<-eval(parse(text= mergecode))
fxrates.df.raw <- data.frame(date=index(merged.df), coredata(merged.df))
fxrates.df.raw <- fxrates.df.raw[fxrates.df.raw$date>'1999-01-04',]
Non-stationarity
Let us take a first look into the FX rate data. We plot the FX rate of the Canadian dollar to the US dollar:
ggplot(fxrates.df.raw[fxrates.df.raw$date>'2015-01-01',], aes(date, DEXCAUS)) + geom_line()+ labs(x = "day", title = "CA dollar ~ US dollar FX Rate") + theme(plot.title = element_text(hjust = 0.5))
At a first glance, the series does not look to be stationary. To confirm it, we can run an Augmented Dickey–Fuller test to check if it has unit roots, which means that the series can have F(t) = ρF(t-1) + a(t) with ρ = 1. We can use the R library fUnitRoots to do the test. The null hypothesis is that the unit root exists.
library(fUnitRoots)
adfTest(fxrates.df.raw$DEXCAUS)
The result shows as follows:
Title: Augmented Dickey-Fuller Test
Test Results: PARAMETER: Lag Order: 1 STATISTIC: Dickey-Fuller: -0.8459 P VALUE: 0.3467
Since p >> 0.05, we cannot reject the null hypothesis. This suggests that there is unit root in this series and thus it is confirmed that the time series is non-stationary.
FX Rate Prediction
Foreign exchange rate series are known to be difficult to predict. For a time, the predictability is questioned since it seems untied to several economic fundamentals link. Thus, a random walk model is often used as a benchmark. In this article, we will implement a random walk model for demonstration purposes.
A random walk model is formulated as
F(t) = F(t-1) + a(t),
where a(t) is the zero mean random noise.
In R, we can use the following function to fit a random walk model.
arima(data, c(0,1,0))
This basically means that we remove both the MA and AR parts and only retain the integral part, which is exactly the random walk model.
The prediction is often backtested in a moving window fashion. For each time step t, the model is trained using data over [t-L-1, t-1], which is a window with length L. The prediction result is then evaluated by out of sample (OOS) data. Then, we move the window forward for every t and calculate the out of sample error. Here, we only use one sample as OOS data, which means that we use a window of historical data to predict the next day’s FX rate.
There are many ways to evaluate the result of backtesting. Here, we adopted the R squared as a measure of the goodness of fit. The closer the R squared is towards 1, the more accurate the prediction will be.
Combining all the ingradients, we now can write a function in R for making the predictions for one currency
pred.fxrate <- function (data.fxrate) {
data.fxrate <- data.fxrate[order(data.fxrate$date),]
N <- nrow(data.fxrate)
L <- 300
pred <- rep(0, N-L)
country <- data.fxrate$country[1]
for(i in (L+1):N){
model <- arima(data.fxrate$rate[(i-L):i-1],c(0,1,0))
pred[i-L] <- predict(model,1)[[1]][1] }
R.sq <- 1 - sum((pred - data.fxrate$rate[(L+1):N])^2, na.rm =TRUE)/sum((mean(data.fxrate$rate[(L+1):N], na.rm =TRUE) - data.fxrate$rate[(L+1):N])^2, na.rm =TRUE)
pred.df <- as.data.frame(data.fxrate$date[(L+1):N])
pred.df$pred <- pred names(pred.df) <- c("date", "pred")
plot(data.fxrate$date, data.fxrate$rate, type = "l")
lines(pred.df$date, pred.df$pred, col="red")}
Note that the lines that compute the R squared has the option na.rm = TRUE on. It is because the data contains null values.
We can test the function using CA dollar using the data from 2014 to 2017. The R squared is 0.97. Seems that we have a decent model!
As mentioned at the beginning, there are quite a few currencies and we probably do not want to loop through them. A solution is use the “group apply” capability provided by Oracle R Enterprise (ORE). That allows us to store the data as a table in Oracle Database (in many cases, it is the original data location), then run the function we wrote above in parallel for each currency.
First, we need to merge all the FX data together and change the schema as follows.
fxrates.df <- data.frame(date=character(),
country=character(),
rate=double())
date.col <- fxrates.df.raw$date
symbols <- names(fxrates.df.raw)[-1]
n.row <- length(date.col)
for( symbol in symbols){
symbol.data <- as.data.frame(date.col)
symbol.data$country <- rep(symbol, n.row)
symbol.data$return <- fxrates.df.raw[,symbol]
fxrates.df <- rbind(fxrates.df, symbol.data)
}
names(fxrates.df) <- c("date", "country", "rate")
fxrates.df <- fxrates.df[fxrates.df$date > '2014-01-01', ]
fxrates.df <- fxrates.df[order(fxrates.df$date),]
The data frame we obtained looks like:
date country rate
2014-01-02 DEXCAUS 1.0634
2014-01-03 DEXCAUS 1.0612
2014-01-06 DEXCAUS 1.0658
2014-01-07 DEXCAUS 1.0742
2014-01-08 DEXCAUS 1.0802
2014-01-09 DEXCAUS 1.0850
Then, we create the table in Oracle Database with ORE.
ore.drop(table="FX_RATE") # to remove the table if it already exists
ore.create(fxrates.df, table="FX_RATE")
After the table is created, we call the ore.groupApply function on the column ‘country’. That will run the function pred.fxrate on the FX rate of each currency, using at most four parallel executing R engines spawned by Oracle Database.
res <- ore.groupApply(FX_RATE,
FX_RATE$country,
pred.fxrate,
ore.connect=TRUE,
parallel = 4)
Another way to store the result is creating an object in the ORE R Datastore. For instance, we can add the following code into the function pred.fxrate.
R.sq <- 1 - sum((pred - data.fxrate$rate[(L+1):N])^2)/sum((mean(data.fxrate$rate[(L+1):N]) - data.fxrate$rate[(L+1):N])^2)
name <- paste("Rsq_",country,sep="")
assign(name,R.sq)
try(ore.save(list=name, name="Rsquares",append=TRUE))
Then, after running the ore.groupApply function, we can retrieve the objects through ORE functions as below.
Based on the R squared, the results look decent and will be even better if we can access data about other economic fundamentals and build an ensemble model. Due to the scope of this blog, we will leave this exploration to the reader.
Invoke R scripts from SQL side
Another scenario may require storing the result, such as R squared scores, into a structured format as a table in the database. Or we may need to store the generated image in the database. These can also be done by calling the R functions using capabilities provided by Oracle R Enterprise (ORE) on the SQL side.
Let us first look at how we store the R squared scores as a table. Suppose we want to build the model over each currency in SQL. We can first create a SQL function that has the group apply capability. Recall that we have all data stored in FX_RATE. All we need to do is that we create a group apply function and also supply the script that build the model.
CREATE OR REPLACE PACKAGE fxratePkg AS
TYPE cur IS REF CURSOR RETURN FX_RATE%ROWTYPE;
END fxratePkg;
CREATE OR REPLACE FUNCTION fxrateGroupEval(
inp_cur fxratePkg.cur,
par_cur SYS_REFCURSOR,
out_qry VARCHAR2,
grp_col VARCHAR2,
exp_txt CLOB)
RETURN SYS.AnyDataSet
PIPELINED PARALLEL_ENABLE (PARTITION inp_cur BY HASH("country"))
CLUSTER inp_cur BY ("country")
USING rqGroupEvalImpl;
This function is a PL/SQL function that can do the group apply. You can view it as a counterpart of ore.groupApply. Next, we store the script that build the model in the database.
begin
sys.rqScriptDrop('RW_model'); -- call if the model already exists.
sys.rqScriptCreate('RW_model',
'function (data.fxrate) {
data.fxrate <- data.fxrate[order(data.fxrate$date),]
N <- nrow(data.fxrate)
L <- 300
pred <- rep(0, N-L)
country <- data.fxrate$country[1]
for(i in (L+1):N){
model <- arima(data.fxrate$rate[(i-L):i-1],c(0,1,0))
pred[i-L] <- predict(model,1)[[1]][1]
}
R.sq <- 1 - sum((pred - data.fxrate$rate[(L+1):N])^2, na.rm =TRUE)/sum((mean(data.fxrate$rate[(L+1):N], na.rm=TRUE) - data.fxrate$rate[(L+1):N])^2, na.rm =TRUE)
data.frame(CURRENCY=country, RSQ = R.sq)
}');
end;
Note that in order to form a table, we need to create a data frame to store each single result. With both the group apply SQL function and the script stored, we now can call it within SQL.
select *
from table(fxrateGroupEval( cursor(select /*+ parallel(t, 4) */ * from FX_RATE t),
cursor(select 1 as "ore.connect" from dual),
'SELECT ''aaaaaaa'' CURRENCY, 1 RSQ FROM DUAL', 'country', 'RW_model'));
Note that “aaaaaaa” is a way to declare the format of the column, which is a 7 character text column. Moreover, we can even store the plots generated by each FX model. We can modify the function as below.
begin
sys.rqScriptDrop('RW_model_plot');
sys.rqScriptCreate('RW_model_plot',
'function (data.fxrate) {
data.fxrate <- data.fxrate[order(data.fxrate$date),]
N <- nrow(data.fxrate)
L <- 300
pred <- rep(0, N-L)
country <- data.fxrate$country[1]
for(i in (L+1):N){
model <- arima(data.fxrate$rate[(i-L):i-1],c(0,1,0))
pred[i-L] <- predict(model,1)[[1]][1]
}
pred.df <- as.data.frame(data.fxrate$date[(L+1):N])
pred.df$pred <- pred
names(pred.df) <- c("date", "pred")
plot(data.fxrate$date, data.fxrate$rate, type = "l")
lines(pred.df$date, pred.df$pred, col="red")
}');
end;
Then, we can all the new SQL function and generate an table of images.
select * from table(fxrateGroupEval( cursor(select /*+ parallel(t, 4) */ * from FX_RATE t),
cursor(select 1 as "ore.connect" from dual),
'PNG', 'country', 'RW_model_plot'));
The output, if viewed at SQL developer, is as follows.
Note that now the image is generated and stored as a BLOB (binary large object) in the table. We can double click on the BLOB item and view the image in pop-up window (make sure the view as image box is checked).