Tuesday Dec 23, 2014

Pain Point #4: “Recoding R (or other) models into SQL, C, or Java takes time and is error prone”

In the previous post in this series Addressing Analytic Pain Points, I focused on some issues surrounding production deployment of advanced analytics solutions. One specific aspect of production deployment involves how to get predictive model results (e.g., scores) from R or leading vendor tools into applications that are based on programming languages such as SQL, C, or Java. In certain environments, one way to integrate predictive models involves recoding them into one of these languages. Recoding involves identifying the minimal information needed for scoring, i.e., making predictions, and implementing that in a language that is compatible with the target environment. For example, consider a linear regression model with coefficients. It can be fairly straightforward to write a SQL statement or a function in C or Java to produce a score using these coefficients. This translated model can then be integrated with production applications or systems.
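To make this concrete, here is a minimal sketch of what such a hand-recoded scoring function might look like, written here in R with hypothetical coefficient values (a SQL, C, or Java version would mirror the same arithmetic):

# Hypothetical coefficients copied by hand from a fitted linear regression model
score.employed <- function(gnp, unemployed, population) {
  b0 <- 88.94; b.gnp <- 0.063; b.unemp <- -0.011; b.pop <- -0.41
  b0 + b.gnp*gnp + b.unemp*unemployed + b.pop*population
}
score.employed(gnp = 234.289, unemployed = 235.6, population = 107.608)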

While recoding has been a technique used for decades, it suffers from several drawbacks: latency, quality, and robustness. Latency refers to the time delay between the data scientist developing the solution and leveraging that solution in production. Customers recount historic horror stories where the process from analyst to software developers to application deployment took months. Quality comes into play on two levels: the coding and testing quality of the software produced, and the freshness of the model itself. In fast changing environments, models may become “stale” within days or weeks. As a result, latency can impact quality. In addition, while a stripped down implementation of the scoring function is possible, it may not account for all cases considered by the original algorithm implementer. As such, robustness, i.e., the ability to handle greater variation in the input data, may suffer.

One way to address this pain point is to make it easy to leverage predictive models immediately (especially open source R and in-database Oracle Advanced Analytics models), thereby eliminating the need to recode models. Since enterprise applications normally know how to interact with databases via SQL, as soon as a model is produced, it can be placed into production via SQL access. In the case of R models, these can be accessed using Oracle R Enterprise embedded R execution in parallel via ore.rowApply and, for select models, the ore.predict capability performs automatic translation of native R models for execution inside the database. In the case of native SQL Oracle Advanced Analytics interface algorithms, as found in Oracle Data Mining and exposed through an R interface in Oracle R Enterprise, users can perform scoring directly in Oracle Database. This capability minimizes or even eliminates latency, dramatically increases quality, and leverages the robustness of the original algorithm implementations.
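As a hedged sketch of those two R paths (assuming an established ORE connection and an ore.frame called MY_DATA containing the model’s input columns):

fit <- lm(Employed ~ GNP + Unemployed + Population, data = longley)   # open source R model
scores <- ore.predict(fit, MY_DATA)        # translated and scored inside the database

res <- ore.rowApply(MY_DATA,               # embedded R execution over chunks of rows
                    function(chunk, mod) cbind(chunk, PRED = predict(mod, chunk)),
                    mod = fit, rows = 10000)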

Sunday Dec 14, 2014

Pain Point #3: “Putting R (or other) models and results into production is ad hoc and complex”

Continuing in our series Addressing Analytic Pain Points, another concern for data scientists and analysts, as well as enterprise management, is how to leverage analytic results in production systems. These production systems can include (i) dashboards used by management to make business decisions, (ii) call center applications where representatives see personalized recommendations for the customer they’re speaking to or how likely that customer is to churn, (iii) real-time recommender systems for customer retail web applications, (iv) automated network intrusion detection systems, and (v) semiconductor manufacturing alert systems that monitor product quality and equipment parameters via sensors – to name a few.

When a data scientist or analyst begins examining a data-based business problem, one of the first steps is to acquire the available data relevant to that problem. In many enterprises, this involves having it extracted from a data warehouse and operational systems, or acquiring supplemental data from third parties. They then explore the data, prepare it with various transformations, build models using a variety of algorithms and settings, evaluate the results, and after choosing a “best” approach, produce results such as predictions or insights that can be used by the enterprise.

If the end goal is to produce a slide deck or report, aside from those final documents, the work is done. However, reaping financial benefits from advanced analytics often needs to go beyond PowerPoint! It involves automating the process described above: extract and prepare the data, build and select the “best” model, generate predictions or highlight model details such as descriptive rules, and utilize them in production systems.

One of the biggest challenges enterprises face involves realizing in production the benefits the data scientist achieved in the lab. How do you take that cleverly crafted R script, for example, and put all the necessary “plumbing” around it to enable not only the execution of the R script, but also the movement of data and delivery of results where they are needed, parallel and distributed script execution across compute nodes, and execution scheduling?

For a production deployment, care needs to be taken to safeguard against potential failures in the process. Further, more “moving parts” result in greater complexity. Since the plumbing is often custom implemented for each deployment, it needs to be reinvented and thoroughly tested for each project. Unfortunately, code and process reuse is seldom realized across an enterprise, even for similar projects, which results in duplication of effort.

Oracle Advanced Analytics (Oracle R Enterprise and Oracle Data Mining) with Oracle Database provides an environment that eliminates the need for a separately managed analytics server, the corresponding movement of data and results between such a server and the database, and the need for custom plumbing. Users can store their R and SQL scripts directly in Oracle Database and invoke them through standard database mechanisms. For example, R scripts can be invoked via SQL, and SQL scripts can be scheduled for execution through Oracle Database’s DBMS_SCHEDULER package. Parallel and distributed execution of R scripts is supported through embedded R execution, while the database kernel supports parallel and distributed execution of SQL statements and in-database data mining algorithms. In addition, using the Oracle Advanced Analytics GUI, Oracle Data Miner, users can convert “drag and drop” analytic workflows to SQL scripts for ease of deployment in Oracle Database.
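As a minimal sketch of that pattern (assuming an established ORE connection; the table and script names are illustrative), an R function can be stored in the database script repository and invoked through embedded R execution – or equally from SQL, where it could be scheduled via DBMS_SCHEDULER:

ore.scriptCreate("build_model", function() {
  dat <- ore.pull(ore.get("CUSTOMER_DATA"))               # illustrative source table
  mod <- lm(LIFETIME_VALUE ~ ., data = dat)               # any R modeling function could be used
  ore.save(mod, name = "model_store", overwrite = TRUE)   # persist the model in a datastore
  TRUE
})
ore.doEval(FUN.NAME = "build_model")                      # run it under database control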

By making solution deployment a well-defined and routine part of the production process and reducing complexity through fewer moving parts and built-in capabilities, enterprises are able to realize and then extend the value they get from predictive analytics faster and with greater confidence.

Monday Aug 12, 2013

Quick! Swap those models – I’ve got a better one

(or, Why in-database analytics enables real-time scoring and can make model deployment easy)

Refreshing predictive models is a standard part of the process when deploying advanced analytics solutions in production environments. In addition, many predictive models need to be used in a real-time setting for scoring customers, whether that is for fraud detection, predicting churn, or recommending the next likely product. One of the problems with using vanilla R is that real-time scoring often requires starting an R engine for each score, or building some ad hoc mechanism for real-time scoring, which can increase application complexity.

In this blog post, we look at how Oracle R Enterprise enables:

  • Building models in-database on database data from R
  • Renaming in-database models for use by a stored procedure
  • Invoking the stored procedure to make predictions from SQL
  • Building a second model and swapping it with the original
  • Moving a model from development environment to production environment

Building the model in R

So let’s start with building a generalized linear model (GLM) in Oracle Database. For illustration purposes, we’ll use the longley data set from R – a macroeconomic data set that provides a well-known example for a highly collinear regression. In R, type ?longley for the full description of the data set.

Using the following R script, we create the database table LONGLEY_TABLE from the longley data.frame and then build the model using the in-database GLM algorithm, predicting the number of people employed from the remaining variables. Then, we view the model details using summary and retrieve the auto-generated fit.name, which corresponds to the name of the Oracle Data Mining (ODM) model in the database. Finally, we use the model to predict on the original data, just to confirm that the model works as expected.

ore.connect("rquser", "my_sid", "my_host", "rquser_pswd", 1521, all=TRUE)
ore.create(longley, table="LONGLEY_TABLE")                  # create the database table from the data.frame
mod.glm <- ore.odmGLM(Employed ~ ., data = LONGLEY_TABLE)   # build the in-database GLM
summary(mod.glm)
mod.glm$fit.name                                            # auto-generated name of the ODM model
predict(mod.glm, LONGLEY_TABLE)                             # confirm the model scores as expected

While a user can refer to the ODM model by its name in fit.name, for example, when working with it in SQL or the Oracle Data Miner GUI, this may not be convenient since it will look something like ORE$23_123. In addition, unless the R object mod.glm is saved in an ORE datastore (an ORE feature corresponding to R’s save and load functions using ore.save and ore.load, but in the database), at the end of the session, this object and corresponding ODM model will be removed.
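A minimal sketch of that datastore approach (the datastore name is illustrative):

ore.save(mod.glm, name = "glm_models", overwrite = TRUE)   # keeps the ODM model past the session
# in a later session, after ore.connect():
ore.load(name = "glm_models")                              # restores the mod.glm proxy object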

In addition, we’ll want to have a common name for the model so that we can swap an existing model with a new model and not have to change higher-level code. To rename an ODM model, we can use the PL/SQL statement shown here, invoked from R using ore.exec. Of course, this could also be done from any SQL interface, e.g., SQL*Plus, SQL Developer, etc., by supplying the explicit SQL.

ore.exec(paste("BEGIN DBMS_DATA_MINING.RENAME_MODEL(model_name => '", mod.glm$fit.name, "', new_model_name => 'MY_GLM_MODEL'); END;",sep=""))

So now, we have the ODM model named MY_GLM_MODEL. Keep in mind that after the model is renamed, it no longer exists under its original name, and the R object mod.glm is invalid – at least from the standpoint of being able to use it in functions like summary or predict.

Scoring data from a SQL procedure

As noted above, users can score in batch from R; they can also score in batch from SQL. But here we’re interested in real-time scoring from the database using the in-database model. This can be done directly in a SQL query by providing the input data in the query itself. This eliminates having to write the data to a database table and then doing a lookup to retrieve it for scoring – making it real-time.

The following SQL does just this. The WITH clause defines the input data, selecting from dual. The SELECT clause uses the model MY_GLM_MODEL to make the prediction using the data defined by data_in.

WITH data_in as (select 2013 "Year",
                        234.289 "GNP",
                        235.6 "Unemployed",
                        107.608 "Population",
                        159 "Armed.Forces",
                        83 "GNP.deflator",
                        60.323 "Employed"
                 from dual)
SELECT PREDICTION(MY_GLM_MODEL USING *) "PRED"
FROM data_in;

While we could invoke the SQL directly, having a stored procedure in the database can give us more flexibility. Here’s the stored procedure version in PL/SQL.

CREATE OR REPLACE
PROCEDURE MY_SCORING_PROC (year_in         IN NUMBER,
                           gnp_in          IN BINARY_DOUBLE,
                           unemployed_in   IN BINARY_DOUBLE,
                           population_in   IN BINARY_DOUBLE,
                           armed_forces_in IN BINARY_DOUBLE,
                           gnp_deflator_in IN BINARY_DOUBLE,
                           employed_in     IN BINARY_DOUBLE,
                           pred_out        OUT NUMBER) AS
BEGIN
  WITH data_in as (select year_in         "Year",
                          gnp_in          "GNP",
                          unemployed_in   "Unemployed",
                          population_in   "Population",
                          armed_forces_in "Armed.Forces",
                          gnp_deflator_in "GNP.deflator",
                          employed_in     "Employed"
                   from dual),
       model_score as (SELECT PREDICTION(MY_GLM_MODEL USING *) "PRED"
                       FROM data_in)
  select PRED into pred_out from model_score;
EXCEPTION
  WHEN OTHERS THEN
    raise_application_error(-20001,
      'An error was encountered - '||SQLCODE||' -ERROR- '||SQLERRM);
END;

To invoke the stored procedure, we can do the following:

SET SERVEROUTPUT ON
DECLARE
  score NUMBER;
BEGIN
  MY_SCORING_PROC(1947, 234.289, 235.6, 107.608, 159, 83, 60.323, score);
  DBMS_OUTPUT.PUT_LINE('Score: ' || score);
END;

Refreshing the model from R

Let’s say the model above has been in production for a while, but has become stale – that is, it’s not predicting as well as it used to due to changing patterns in the data. To refresh it, we build a new model. For illustration purposes, we’re going to use the same data (so an identical model will be produced, except for its name).

mod.glm2 <- ore.odmGLM(Employed ~ ., data = LONGLEY_TABLE)
summary(mod.glm2)
mod.glm2$fit.name

To swap the models, we delete the existing model called MY_GLM_MODEL and rename the new model to MY_GLM_MODEL. Again, we can do this from R using PL/SQL and through ore.exec.

ore.exec(paste("BEGIN DBMS_DATA_MINING.DROP_MODEL('MY_GLM_MODEL'); DBMS_DATA_MINING.RENAME_MODEL(model_name => '",mod.glm2$fit.name,"', new_model_name => 'MY_GLM_MODEL'); END;",sep=""))

We can now re-execute the stored procedure and the new model will be used.

SET SERVEROUTPUT ON
DECLARE
  score NUMBER;
BEGIN
  MY_SCORING_PROC(1947, 234.289, 235.6, 107.608, 159, 83, 60.323, score);
  DBMS_OUTPUT.PUT_LINE('Score: ' || score);
END;

You may have noticed that this approach can introduce a brief period where no model is accessible – between the DROP_MODEL and RENAME_MODEL calls. A better approach involves the use of SYNONYMs. A synonym is an alternative name for a table, view, sequence, procedure, stored function, or other database object, providing both data independence and location transparency. We can use this in conjunction with our stored procedure above. First, create a synonym for the original scoring procedure.

CREATE or REPLACE SYNONYM MY_SCORING_PROC_SYM for MY_SCORING_PROC;

When invoking the procedure from your application, use the name MY_SCORING_PROC_SYM in place of MY_SCORING_PROC. Instead of renaming the model, create a second stored procedure with a different name, e.g., MY_SCORING_PROC_2, that references the name of the newly built model internally.

When it is time to swap the models, repoint the synonym to the new procedure:

CREATE or REPLACE SYNONYM MY_SCORING_PROC_SYM for MY_SCORING_PROC_2;

Another benefit of this approach is that replaced models can still be kept should you need to revert to a previous version. 

Moving an in-database model from one machine to another

In a production deployment, there’s often the need to move a model from the development environment to the production environment. For example, the data scientist may have built the model in a development / sandbox environment and now needs to move it to the production machine(s).

In-database models provide functions EXPORT_MODEL and IMPORT_MODEL as part of the DBMS_DATA_MINING SQL package. See the 11g documentation for details. These calls can be invoked from R, but we’ll show this from SQL just to keep the flow easier to see.

From a SQL prompt, e.g., in SQL*Plus, connect to the schema that contains the model. Create a DIRECTORY object where the exported model file will be stored. List the model names available to this schema, which should include MY_GLM_MODEL. Then, export the model:

CONNECT rquser/rquser_psw

CREATE OR REPLACE DIRECTORY rquserdir AS '/home/MY_DIRECTORY';

-- list the models available to rquser
SELECT name FROM dm_user_models;

-- export the model called MY_GLM_MODEL to a dump file in same schema
EXECUTE DBMS_DATA_MINING.EXPORT_MODEL ('MY_GLM_MODEL_out',
                                       'RQUSERDIR',
                                       'name = ''MY_GLM_MODEL''');
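For completeness, a hedged sketch of issuing the same export from R via ore.exec (assuming the RQUSERDIR directory object created above):

ore.exec(paste("BEGIN DBMS_DATA_MINING.EXPORT_MODEL(",
               "filename => 'MY_GLM_MODEL_out', ",
               "directory => 'RQUSERDIR', ",
               "model_filter => 'name = ''MY_GLM_MODEL'''); END;", sep=""))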

At this point, you have the ODM model named MY_GLM_MODEL in the file MY_GLM_MODEL_out01.dmp stored in the file system under /home/MY_DIRECTORY. This file can now be moved to the production environment and the model loaded into the target schema.

Log into the new schema and invoke IMPORT_MODEL.

CONNECT rquser2/rquser2_psw

EXECUTE DBMS_DATA_MINING.IMPORT_MODEL ('MY_GLM_MODEL_out01.dmp',
                                       'RQUSERDIR', 'name = ''MY_GLM_MODEL''',
                                       'IMPORT', NULL, 'glm_imp_job', 'rquser:rquser2');

Summary

In this post, we’ve highlighted how to build an in-database model in R and use it for scoring through SQL in a production, real-time setting. In addition, we showed how it is possible to swap, or refresh, models in a way that can leave your application code untouched. Finally, we highlighted database functionality that allows you to move in-database models from one database environment to another.

Users should note that all the functionality shown involving SQL, or being invoked through ore.exec, can be easily wrapped in R functions that could ultimately become part of ORE. If any of our readers are interested in giving this a try, we can post your solution here to share with the R and Oracle community. For the truly adventurous, check out the Oracle Database package DBMS_FILE_TRANSFER to consider wrapping the ability to move model files from R as well.
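As an example of the kind of R wrapper suggested above, here is a hedged sketch (the function name and arguments are illustrative, not part of ORE):

swap.model <- function(fit, production.name = "MY_GLM_MODEL") {
  ore.exec(paste("BEGIN DBMS_DATA_MINING.DROP_MODEL('", production.name, "'); END;", sep=""))
  ore.exec(paste("BEGIN DBMS_DATA_MINING.RENAME_MODEL(model_name => '", fit$fit.name,
                 "', new_model_name => '", production.name, "'); END;", sep=""))
  invisible(production.name)
}
swap.model(mod.glm2)   # replaces the production model with the newly built one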

Wednesday Feb 06, 2013

Oracle R Enterprise 1.3 gives predictive analytics an in-database performance boost

Recently released Oracle R Enterprise 1.3 adds packages to R that enable even more in-database analytics. These packages provide horizontal, commonly used techniques that are blazingly fast in-database for large data. With Oracle R Enterprise 1.3, Oracle makes R even better and usable in enterprise settings. (You can download ORE 1.3 here and documentation here.)

When it comes to predictive analytics, scoring (predicting outcomes using a data mining model) is often a time-critical operation. Scoring can be done online (real-time), e.g., while a customer is browsing a webpage or using a mobile app, where on-the-spot recommendations can be made based on current actions. Scoring can also be done offline (batch), e.g., predicting which of your 100 million customers will respond to each of a dozen offers, where applications leverage the results to identify which customers should be targeted with a particular ad campaign or special offer.

In this blog post, we explore where using Oracle R Enterprise pays huge dividends. When working with small data, R can be sufficient, even when pulling data from a database. However, depending on the algorithm, the benefits of in-database computation can be seen at a few thousand rows. With tens of thousands of rows, the time difference makes an interactive session truly interactive; with hundreds of thousands of rows it becomes a real productivity gain; and on millions (or billions) of rows, it becomes a competitive advantage! In addition to performance benefits, ORE integrates R with the database, enabling you to leave data in place.

We’ll look at a few proof points across Oracle R Enterprise features, including:

  • OREdm – a new package that provides R access to several in-database Oracle Data Mining algorithms (Attribute Importance, Decision Tree, Generalized Linear Models, K-Means, Naïve Bayes, Support Vector Machine).
  • OREpredict – a new package that enables scoring models built using select standard R algorithms in the database (glm, negbin, hclust, kmeans, lm, multinom, nnet, rpart).
  • Embedded R Execution – an ORE feature that allows running R under database control and boosts real performance of CRAN predictive analytics packages by providing faster access to data than occurs between the database and client, as well as leveraging a more powerful database machine with greater RAM and CPU resources.

OREdm

Pulling data out of the database for any analytical tool impedes interactive data analysis due to access latency, whether the data is pulled directly from the database or staged in flat files through an IT request. Such latencies can quickly become intolerable. On the R front, you’ll also need to consider whether the data will fit in memory. If flat files are involved, consideration needs to be given to how those files will be stored, backed up, and secured.

Of course, model building and data scoring execution time is only part of the story. Consider scenario A, the “build combined script,” where data is extracted from the database, and an R model is built and persisted for later use. In the corresponding scenario B, the “score combined script,” data is pulled from the database, a previously built model is loaded, the data is scored, and the scores are written back to the database. This is a typical scenario for use in, e.g., enterprise dashboards or within an application supporting campaign management or next-best-offer generation. In-database execution provides significant performance benefits, even for the relatively small data sets included below, and readers should be able to reproduce such results at these scales. We’ve also included a Big Data example by replicating the 123.5 million row ONTIME data set to 1 billion rows. Consider the following examples:
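Before the individual examples, a hedged sketch of what scenario A looks like in each case (ONTIME_S and the column choices are illustrative):

# Scenario A with open source R: pull the data, build the model, persist it
dat <- ore.pull(ONTIME_S)
fit.r <- lm(ARRDELAY ~ DEPDELAY + DISTANCE, data = dat)
saveRDS(fit.r, "fit_r.rds")

# Scenario A in-database: the data stays in place and ore.lm builds the model
fit.ore <- ore.lm(ARRDELAY ~ DEPDELAY + DISTANCE, data = ONTIME_S)
ore.save(fit.ore, name = "models", overwrite = TRUE)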

Linear Models: We compared R lm and ORE ore.lm in-database algorithm on the combined scripts. On datasets ranging from 500K to 1.5M rows with 3-predictors, in-database analytics showed an average 2x-3x performance improvement for build, and nearly 4x performance improvement for scoring. Notice in Figure 1 that the trend is significantly less for ore.lm than lm, indicating greater scalability for ore.lm.

Figure 1. Overall lm and ore.lm execution time for model building (A) and data scoring (B)

Figure 2 provides a more detailed view, comparing data pull and model build time (build detail), followed by data pull, data scoring, and score writing (score detail). For model building, notice that while data pull is a significant part of lm’s total build time, the actual build time is still greater than that of ore.lm. A similar statement can be made in the case of scoring.

Figure 2. Execution time components for lm and ore.lm (excluding model write and load)

Naïve Bayes from the e1071 package: On 20-predictor datasets ranging from 50k to 150k rows, in-database ore.odmNB improved data scoring performance by a factor of 118x to 418x, while the full scenario B execution time yielded a 13x performance improvement, as depicted in Figure 3B. Using a non-parallel execution of ore.odmNB, we see the cross-over point where ore.odmNB overtakes R, but more importantly, the slope of the trend points to the greater scalability of ORE, as depicted in Figure 3A for the full scenario A execution time.


Figure 3. Overall naiveBayes and ore.odmNB execution time for model building (A) and data scoring (B)

K-Means clustering: Using 6 numeric columns from the ONTIME airline data set, ranging from 1 million to 1 billion rows, we compare in-database ore.odmKMeans with R kmeans run through embedded R execution with ore.tableApply. At 100 million rows, ore.odmKMeans demonstrates better performance than kmeans, and it continues to scale at 1 billion rows. The performance results in Figure 4 are depicted on a log-log plot. The legend shows the function invoked and corresponding parameters, using a subset d of the ONTIME data set. While ore.odmKMeans scales linearly with the number of rows, R kmeans does not; further, R kmeans did not complete at 1 billion rows.
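A hedged sketch of the in-database call being timed (the ore.frame name ONTIME and the column choice are illustrative):

km <- ore.odmKMeans(~ ARRDELAY + DEPDELAY + DISTANCE, data = ONTIME, num.centers = 10)
head(predict(km, ONTIME))   # cluster assignments computed in the database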

Figure 4: K-Means clustering model building on Big Data

OREpredict

With OREpredict, R users can also benefit from in-database scoring of R models. This becomes evident not only when considering the full “round trip” of pulling data from the database, scoring in R, and writing data back to the database, but also for the scoring itself.

Consider an lm model built using a data set with 4 predictors and 1 million to 5 million rows. Pulling data from the database, scoring, and writing the results back to the database shows a pure R-based approach taking 4x - 9x longer than in-database scoring using ore.predict with that same R model. Notice in Figure 5 that the slope of the trend is dramatically less for ore.predict than predict, indicating greater scalability. When considering the scoring time only, ore.predict was 20x faster than predict in R for 5M rows. In ORE 1.3, ore.predict is recommended for numeric predictors, where it will provide a speedup over R.
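A hedged sketch of the pattern being measured (the local sample data frame and ore.frame names are illustrative):

fit <- lm(ARRDELAY ~ DEPDELAY + DISTANCE + TAXIOUT + TAXIIN, data = ontime.sample)  # built in open source R
scores <- ore.predict(fit, ONTIME)   # the same R model scored inside the database
head(scores)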

Figure 5. Overall lm execution time using R predict vs. ORE ore.predict

For rpart, we see a similar result. On a 20-predictor, 1 million to 5 million row data set, ore.predict resulted in 6x – 7x faster execution. In Figure 6, we again see that the slope of the trend is dramatically less for ore.predict than predict, indicating greater scalability. When considering the scoring time only, ore.predict was 123x faster than predict in R for 5 million rows.

Figure 6. Overall rpart Execution Time using R predict vs. ORE ore.predict

This scenario is summarized in Figure 7. In the client R engine, we have the ORE packages installed. There, we invoke the pure R-based script, which requires pulling data from the database. We also invoke the ORE-based script that keeps the data in the database.

Figure 7. Summary of OREpredict performance gains

To use a real world data set, we again consider the ONTIME airline data set with 123.5 million rows. We build lm models with a varying number of coefficients, derived by converting categorical data to multiple columns. The variable p corresponds to the number of coefficients resulting from the transformed formula and depends on the number of distinct values in each column. For example, DAYOFWEEK has 7 values, so with DEPDELAY, p=8. In Figure 8, you see that scoring a single row (e.g., one-off or real-time scoring) with an lm model through embedded R has much more overhead (as expected, given that an R engine is being started) than ore.predict, which shows subsecond response time through 40 coefficients (0.54 seconds) and 1.1 seconds at 106 coefficients. Here are the formulas describing the columns included in the analysis:

  • ARRDELAY ~ DEPDELAY (p=2)
  • ARRDELAY ~ DEPDELAY + DAYOFWEEK (p=8)
  • ARRDELAY ~ DEPDELAY + DAYOFWEEK + MONTH (p=19)
  • ARRDELAY ~ DEPDELAY + DAYOFWEEK + MONTH + YEAR (p=40)
  • ARRDELAY ~ DEPDELAY + DAYOFWEEK + MONTH + YEAR (p=106)


Figure 8. Comparing performance of ore.predict with Embedded R Execution for lm

Compare this with scoring the entire ONTIME table of 123.5 million rows. We see that ore.predict outperforms embedded R until about 80 coefficients, when embedded R becomes the preferred choice. 

Data Movement between R and Database: Embedded R Execution

One advantage of R is its community and CRAN packages. The goal for Oracle R Enterprise with CRAN packages is to enable reuse of these packages while:

  • Leveraging the parallelization and efficient data processing capabilities of Oracle Database
  • Minimizing data transfer and communication overhead between R and the database
  • Leveraging R as a programming language for writing custom analytics

There are three ways in which we’ll explore the performance of pulling data.

1) Using ore.pull at a separate client R engine to pull data from the database

2) Using Embedded R Execution and ore.pull within an embedded R script from a database-spawned R engine

3) Using Embedded R Execution functions for data-parallelism and task-parallelism to pass database data to the embedded R script via function parameter

With ORE Embedded R Execution (ERE), the database delivers data-parallelism and task-parallelism, and reduces data access latency due to optimized data transfers into R. Essentially, R runs under the control of the database. As illustrated in Figure 9, loading data at the database server is 12x faster than loading data from the database to a separate R client. Embedded R Execution also provides a 13x advantage when using ore.pull invoked at the database server within an R closure (function) compared with a separate R client. The data load from database to R client is depicted as 1x – the baseline for comparison with embedded R execution data loading.
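A hedged sketch of the three access paths (MY_TABLE is an illustrative ore.frame):

dat <- ore.pull(MY_TABLE)                                  # 1) pull to a separate client R engine

res1 <- ore.doEval(function() {                            # 2) server-side ore.pull inside embedded R
  nrow(ore.pull(ore.get("MY_TABLE")))
})

res2 <- ore.tableApply(MY_TABLE, function(dat) nrow(dat))  # 3) data passed in by the embedded R framework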

Figure 9. Summary of Embedded R Execution data load performance gains

Data transfer rates are displayed in Figure 10, for a table with 11 columns and 5 million to 15 million rows of data. Loading data via ORE embedded R execution using server-side ore.pull or through the framework with, e.g., ore.tableApply (one of the embedded R execution functions) is dramatically faster than a non-local client load via ore.pull. The numbers shown reflect MB/sec data transfer rates, so a bigger bar is better!

Figure 10. Data load and write execution time with 11 columns

While this is impressive, let’s expand our data up to 1 billion rows. To create our 1 billion row data set (1.112 billion rows), we duplicated the 123.5 million row ONTIME data set 9 times, replacing rows with year 1987 with years 2010 through 2033, and selected 6 integer columns (YEAR, MONTH, DAYOFMONTH, ARRDELAY, DEPDELAY, DISTANCE) with a bitmap index on the columns (YEAR, MONTH, DAYOFMONTH). The full data set weighs in at ~53 GB.

In Figure 11, we see linear scalability for loading data into the client R engine. Times range from 2.8 seconds for 1 million rows, to 2700 seconds for 1 billion rows. While your typical user may not need to load 1 billion rows into R memory, this graph demonstrates the feasibility to do so.

Figure 11. Client Load of Data via ore.pull for Big Data

In Figure 12, we look at how the degree of parallelism (DOP) affects data load times involving ore.rowApply. This test addresses the question of how fast ORE can load 1 billion rows, e.g., when scoring data. The degree of parallelism corresponds to the number of R engines spawned for concurrent execution at the database server. The number of chunks the data is divided into is 1 for a single degree of parallelism, and 10 times the DOP for the remaining tests. For a DOP of 160, the data was divided into 1600 chunks, i.e., 160 R engines were spawned, each processing 10 chunks. The graph on the left shows that execution times improve for the 1 billion row data set through a DOP of 160. As expected, at some point the overhead of spawning additional R engines and partitioning the data outweighs the benefit. At its best time, processing 1 billion rows took 43 seconds.

Figure 12. Data load times via ore.rowApply for varying degrees of parallelism (DOP)

In the second graph of Figure 12, we contrast execution time for the “sweet spot” identified in the previous graph with a varying number of rows. Using this DOP of 160, with 1600 chunks of data, we see very little increase in execution time through 100 million rows (between 6.4 and 8.5 seconds in actual time). While 1 billion rows took significantly longer, it was still only 43 seconds.
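A hedged sketch of the kind of ore.rowApply call timed here; the per-chunk function is a stand-in for a scoring step, and the parallel argument as written assumes an ORE release where the degree of parallelism can be requested directly:

res <- ore.rowApply(ONTIME_BILLION,               # illustrative 1 billion row ore.frame
                    function(chunk) nrow(chunk),  # stand-in for per-chunk scoring work
                    rows = 100000,
                    parallel = 160)               # assumed: request a DOP of 160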

We can also consider data writes at this scale. In Figure 13, we depict linear scalability from 1 million through 1 billion rows using the ore.create function to create database tables from R data. Actual times ranged from 2.6 seconds to roughly 2600 seconds.

Figure 13. Data Write using ore.create for Big Data

ORE supports data-parallelism to enable, e.g., building predictive models in parallel on partitions of the data. Consider a marketing firm that micro-segments customers and builds predictive models on each segment. ORE embedded R execution automatically partitions the data, spawns R engines according to the degree of parallelism specified, and executes the specified user R function. To address how efficiently ore.groupApply can process data, Figure 14 shows the total execution time to process the 123.5M rows from the ONTIME data with a varying number of columns. The figure shows that ore.groupApply scales linearly as the number of columns increases. Three grouping columns were selected based on their number of distinct values: TAILNUM 12861, DEST 352, and UNIQUECARRIER 29. For UNIQUECARRIER, the test with all columns (29 in total) could not be completed, since partitioning on only 29 categories resulted in data too large for a single R engine.
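A hedged sketch of per-group model building with ore.groupApply along the lines described (the ore.frame name is illustrative; the parallel argument is assumed):

mods <- ore.groupApply(ONTIME,
                       INDEX = ONTIME$DEST,    # one model per destination airport
                       function(dat) lm(ARRDELAY ~ DEPDELAY, data = dat),
                       parallel = TRUE)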

Figure 14. Processing time for 123.5M rows via ore.groupApply

ORE also supports row-parallelism, where the same embedded R function is invoked on chunks of rows. As with ore.groupApply, depending on the specified degree of parallelism, a different chunk of rows is submitted to a dynamically spawned, database server-side R engine. Figure 15 depicts near-linear execution time to process the 123.5M rows from ONTIME with a varying number of columns. The chunk size can be specified; however, testing three chunk sizes (10k, 50k, and 100k rows) showed no significant difference in overall execution time, hence a single line is graphed.

Figure 15. Processing time for 123.5M rows via ore.rowApply for chunk sizes 10k-100k

All tests were performed on an Exadata X3-8. Except as noted, the client R session and the database were on the same machine, so network latency for data reads and writes was minimal. Over a LAN or WAN, the benefits of in-database execution and ORE will be even more dramatic.

