June 21, 2012

This blog post will show how we used Oracle R Enterprise to tackle a customer’s big calculation problem across a big data set.

Databases are great for managing large amounts of data in a central place with rigorous enterprise-level controls. R is great for doing advanced computations. Sometimes you need to do advanced computations on large amounts of data, subject to rigorous enterprise-level concerns. This blog post shows how Oracle R Enterprise, which brings R together with the Oracle Database, enabled us to do some fairly sophisticated calculations across 1 million accounts (each with many detailed records) in minutes.

A financial services customer of mine needs to calculate the historical internal rate of return (IRR) for its customers’ portfolios. This information is needed for customer statements and their online web application. In the past, they solved this with a home-grown application that pulled trade and account data out of their data warehouse and ran the calculations. But the home-grown application could not do this fast enough, and the code that performed the IRR calculation was a challenge to write and maintain.

Internal Rate of Return is an interesting calculation in that, in most real-world scenarios, it is impractical to calculate exactly; instead, approximation techniques must be used. In this blog post, we will discuss calculating the money weighted rate of return, but in the actual customer proof of concept we used R to calculate both money weighted and time weighted rates of return. You can learn more about the money weighted rate of return here: http://www.wikinvest.com/wiki/Money-weighted_return

We will start with calculating the IRR in standalone/desktop R. In our second post, we will show how to take this desktop R function, deploy it to an Oracle Database, and make it work at real-world scale. The first step was to get some sample data. For a historical IRR calculation, you need balances and cash flows. In our case, the customer provided us with several accounts’ worth of sample data in Microsoft Excel.

The above figure shows part of the spreadsheet of sample data. The data provides balances and cash flows for a sample account (BMV=beginning market value. FLOW=cash flow in/out of account. EMV=ending market value).

Once we had the sample spreadsheet, the next step was to read the Excel data into R. This is something R does well; it offers multiple ways to work with spreadsheet data. For instance, one could save the spreadsheet as a .csv file. In our case, the customer provided a spreadsheet file containing multiple sheets, each holding data for a different sample account. To handle this easily, we took advantage of the RODBC package, which let us read the Excel data sheet-by-sheet without having to create individual .csv files. We wrote ourselves a little helper function called getsheet() around the RODBC package, then loaded all of the sample accounts into a data.frame called SimpleMWRRData.
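A sketch of what such a helper might look like. The getsheet() and SimpleMWRRData names come from the post; the internals, the file name, and the sheet names are our assumptions, and odbcConnectExcel() requires 32-bit R on Windows with the Excel ODBC driver:

```r
library(RODBC)  # provides odbcConnectExcel() and sqlFetch()

# Read one named sheet of an Excel workbook into a data.frame.
getsheet <- function(file, sheet) {
  conn <- odbcConnectExcel(file)
  on.exit(odbcClose(conn))      # always close the connection
  sqlFetch(conn, sheet)
}

# Stack every sample account's sheet into a single data.frame.
# (File and sheet names here are hypothetical.)
sheets <- c("Account1", "Account2", "Account3")
SimpleMWRRData <- do.call(rbind,
  lapply(sheets, function(s) getsheet("SampleAccounts.xls", s)))
```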

At this point, it was time to write the money weighted rate of return (MWRR) function itself. The definition of MWRR is easily found on the internet or, if you are old school, in an investment performance textbook. In the customer proof of concept, we based our calculations on those defined in *The Handbook of Investment Performance: A User’s Guide* by David Spaulding, since this is the reference book used by the customer. (One of the nice things we found during the course of this proof of concept is that, by using R to write our IRR functions, we could easily incorporate the customer’s specific variations and business rules into the calculation.)

The key thing with calculating IRR is the need to solve a complex equation with a numerical approximation technique. For IRR, you need to find the value of the rate of return (r) that sets the Net Present Value of all the flows in and out of the account to zero. With R, we solve this by defining our NPV function:

where bmv is the beginning market value, cf is a vector of cash flows, t is a vector of time (relative to the beginning), emv is the ending market value, and tend is the ending time.
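The original code figure is not reproduced here, but an NPV function consistent with these definitions looks like this (a sketch, not the customer's actual code):

```r
# Net present value of an account's flows at a trial rate of return r.
# bmv: beginning market value; cf: vector of cash flows;
# t: vector of flow times (relative to the beginning);
# emv: ending market value; tend: ending time.
npv <- function(r, bmv, cf, t, emv, tend) {
  bmv * (1 + r)^tend +             # grow the beginning balance to the end
    sum(cf * (1 + r)^(tend - t)) - # grow each flow from its time to the end
    emv                            # subtract what the account actually held
}
```

At the rate of return r that solves the IRR equation, this expression is zero.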

Since solving for r is a one-dimensional optimization problem, we decided to take advantage of R’s optimize() function (http://stat.ethz.ch/R-manual/R-patched/library/stats/html/optimize.html). optimize() can be used to find a minimum or maximum; to find the value of r where our npv function is closest to zero, we wrapped npv inside the abs function and asked optimize() to find the minimum. Here is an example of using optimize():

where low and high are scalars that indicate the range to search for an answer.
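Wrapped up as a function, the pattern looks roughly like this (our reconstruction; the function name find_rate is an assumption):

```r
# Solve for the rate of return r in [low, high] whose NPV is closest to zero.
find_rate <- function(bmv, cf, t, emv, tend, low, high) {
  npv <- function(r)
    bmv * (1 + r)^tend + sum(cf * (1 + r)^(tend - t)) - emv
  # optimize() is a minimizer, so wrap npv in abs() to drive it toward zero.
  optimize(function(r) abs(npv(r)), interval = c(low, high))$minimum
}
```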

To test this out, we need to set values for bmv, cf, t, emv, tend, low, and high. We will set low and high to some reasonable defaults.
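The original input figure is not reproduced, so the numbers below are made up (chosen to land near the result quoted next); the low/high defaults are also our assumptions:

```r
npv <- function(r, bmv, cf, t, emv, tend)
  bmv * (1 + r)^tend + sum(cf * (1 + r)^(tend - t)) - emv

bmv  <- 100000            # beginning market value
cf   <- c(5000, -2000)    # cash flows into/out of the account
t    <- c(0.25, 0.75)     # when each flow occurred (fraction of the period)
emv  <- 100728            # ending market value
tend <- 1                 # end of the measurement period
low  <- -0.99             # default lower bound for the search
high <- 10                # default upper bound for the search

result <- optimize(function(r) abs(npv(r, bmv, cf, t, emv, tend)),
                   interval = c(low, high))
result$minimum            # about -0.022 for these hypothetical inputs
```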

For example, this account had a negative 2.2% money weighted rate of return.

With numerical approximation methods like optimize, sometimes you will not be able to find an answer with your initial set of inputs. To account for this, our approach was to first try to find an answer for r within a narrow range, then if we did not find an answer, try calling optimize() again with a broader range. See the R help page on optimize() for more details about the search range and its algorithm.
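One way to sketch this narrow-then-broad retry; the range endpoints and the "did not find an answer" test (the best point being pinned to an edge of the narrow range) are our assumptions:

```r
npv <- function(r, bmv, cf, t, emv, tend)
  bmv * (1 + r)^tend + sum(cf * (1 + r)^(tend - t)) - emv

SolveRate <- function(bmv, cf, t, emv, tend) {
  f <- function(r) abs(npv(r, bmv, cf, t, emv, tend))
  narrow <- c(-0.5, 0.5)              # try a narrow range first
  ans <- optimize(f, interval = narrow)
  # If the best point sits at an edge of the narrow range, the true answer
  # probably lies outside it, so retry with a much broader range.
  if (min(abs(ans$minimum - narrow)) < 1e-3) {
    ans <- optimize(f, interval = c(-0.99, 100))
  }
  ans$minimum
}
```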

At this point, we can now write a simplified version of our MWRR function. (Our real-world version is more sophisticated in that it calculates rate of returns for 5 different time periods [since inception, last quarter, year-to-date, last year, year before last year] in a single invocation. In our actual customer proof, we also defined time-weighted rate of return calculations. The beauty of R is that it was very easy to add these enhancements and additional calculations to our IRR package.)
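Putting the pieces together, a simplified MWRR function along these lines might look as follows (our reconstruction of the post's SimpleMWRR, not the customer's actual code; search ranges are assumptions):

```r
# Simplified money weighted rate of return for a single account.
SimpleMWRR <- function(bmv, cf, t, emv, tend) {
  npv <- function(r)
    bmv * (1 + r)^tend + sum(cf * (1 + r)^(tend - t)) - emv
  f <- function(r) abs(npv(r))
  narrow <- c(-0.5, 0.5)
  ans <- optimize(f, interval = narrow)
  if (min(abs(ans$minimum - narrow)) < 1e-3) {
    ans <- optimize(f, interval = c(-0.99, 100))  # broaden the search
  }
  ans$minimum
}
```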

To simplify code deployment, we then created a new package of our IRR functions and sample data. For this blog post, we only need to include our SimpleMWRR function and our SimpleMWRRData sample data. We created the shell of the package by calling:
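The call is R's built-in package.skeleton(); the stubs below stand in for the real function and data, which are assumed to already be in the workspace:

```r
# Stubs so the call is self-contained; in practice SimpleMWRR() and
# SimpleMWRRData are the real function and data.frame built earlier.
SimpleMWRR <- function(bmv, cf, t, emv, tend) NULL
SimpleMWRRData <- data.frame()

# Write out the shell of the package: R/, data/, man/, DESCRIPTION, etc.
package.skeleton(name = "IRR",
                 list = c("SimpleMWRR", "SimpleMWRRData"))
```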

To turn this package skeleton into something usable, at a minimum you need to edit the SimpleMWRR.Rd and SimpleMWRRData.Rd files in the \man subdirectory. In those files, you need to at least provide a value for the “title” section.

Once that is done, you can change directory to the IRR directory and type at the command-line:
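The commands look roughly like this (assuming the skeleton directory is named IRR; the version number in the tarball name comes from the package's DESCRIPTION file):

```shell
R CMD build IRR                  # bundle the package into a source tarball
R CMD INSTALL IRR_1.0.tar.gz     # install it into the local R library
```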

The myIRR package for this blog post (which has both SimpleMWRR source and SimpleMWRRData sample data) is downloadable from here: myIRR package

Here is an example of testing our IRR function once it was converted to an installable package:
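Once installed, the package can be loaded and exercised like any other; the inputs below are hypothetical:

```r
library(myIRR)

SimpleMWRR(bmv  = 100000,
           cf   = c(5000, -2000),
           t    = c(0.25, 0.75),
           emv  = 100728,
           tend = 1)
```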

So far, we have shown how to calculate IRR for a single account. The real-world question is: how do you calculate IRR for all of the accounts?

This is the kind of situation where we can leverage the “Split-Apply-Combine” approach (see http://www.cscs.umich.edu/~crshalizi/weblog/815.html). Given that our sample data can fit in memory, one easy approach is to use R’s “by” function. (Other approaches to Split-Apply-Combine such as plyr can also be used. See http://4dpiecharts.com/2011/12/16/a-quick-primer-on-split-apply-combine-problems/).

Here is an example showing the use of “by” to calculate the money weighted rate of return for each account in our sample data set.
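A hedged sketch of that "by" call. The column names and the tiny two-account data set below are made up, since the post's actual data layout is not reproduced; SimpleMWRR here is our reconstruction from earlier:

```r
SimpleMWRR <- function(bmv, cf, t, emv, tend) {
  npv <- function(r)
    bmv * (1 + r)^tend + sum(cf * (1 + r)^(tend - t)) - emv
  optimize(function(r) abs(npv(r)), interval = c(-0.99, 10))$minimum
}

# Two tiny made-up accounts, one row per cash flow (column names are guesses):
SimpleMWRRData <- data.frame(
  ACCOUNT = c("A1", "A1", "A2"),
  BMV     = c(100, 100, 200),
  FLOW    = c(10, -5, 0),
  TIME    = c(0.25, 0.5, 0.5),
  EMV     = c(120, 120, 210),
  TEND    = c(1, 1, 1))

# Split by account, apply SimpleMWRR to each piece, combine the results.
results <- by(SimpleMWRRData, SimpleMWRRData$ACCOUNT, function(acct) {
  SimpleMWRR(bmv  = acct$BMV[1],
             cf   = acct$FLOW,
             t    = acct$TIME,
             emv  = acct$EMV[1],
             tend = acct$TEND[1])
})
results
```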

At this point, you’ve seen the power of R being used to calculate IRR. There were several good things:

- R could easily work with the spreadsheets of sample data we were given
- R’s optimize() function provided a nice way to solve for IRR: it was both fast and saved us from having to code our own iterative approximation algorithm
- R was a convenient language for expressing the customer-specific variations, business rules, and exceptions that often occur in real-world calculations; these could be easily added to our IRR functions
- The Split-Apply-Combine technique let us calculate IRR for many accounts at once

However, there are several challenges yet to be conquered at this point in our story:

- The actual data lives in a database, not in a spreadsheet
- The actual data is much, much bigger: too big to fit into R’s normal memory space, and too big to want to move across the network
- The overall process needs to run fast, much faster than a single processor would allow
- The actual data needs to be kept secure, another reason not to move it out of the database and across the network
- The process of calculating IRR needs to be integrated with other database ETL activities, so that IRRs can be calculated as part of the data warehouse refresh process

In our next blog post in this series, we will show you how Oracle R Enterprise solved these challenges.
