There are several capabilities that data scientists benefit from when performing Big Data advanced analytics and machine learning with R. These revolve around efficient data access and manipulation, access to parallel and distributed machine learning algorithms, data and task parallel execution, and the ability to deploy results quickly and easily. Data scientists using R want to leverage the R ecosystem as much as possible, whether leveraging the expansive set of open source R packages in their solutions, or leveraging their R scripts directly in production to avoid costly recoding or custom application integration solutions.
Data Access and Manipulation
R generally loads and processes small to medium-sized data sets with sufficient performance. However, as data volumes increase, moving data into a separate analytics engine becomes a non-starter. Moving large volumes of data across a network takes a non-trivial amount of time, but even if the user is willing and able to wait, client machines often have insufficient memory either for the data itself or for the desired R processing. This makes processing such data intractable, if not impossible. Moreover, R functions are normally single threaded and do not benefit from multiple CPUs for parallel processing.
Systems that enable transparent access to and manipulation of data from R have the benefit of short-circuiting costly data movement and client memory requirements, while allowing data scientists to leverage R language constructs and functions. In the case of Oracle R Enterprise (ORE) and Oracle R Advanced Analytics for Hadoop (ORAAH), R users work with proxy objects for Oracle Database and Hive tables. R functions that normally operate on data.frame and other objects are overloaded to work with ore.frame proxy objects. In ORE, R function invocations are translated to Oracle SQL for execution in Oracle Database. In ORAAH, these are translated to HiveQL for execution by Hive map-reduce jobs.
By leveraging Oracle Database, ORE enables scalable, high performance execution of functions for data filtering, summary statistics, and transformations, among others. Since data is not moved into R memory, there are two key benefits: no latency for moving data and no client-side memory limitations. Moreover, since the R functionality is translated to SQL, users benefit from Oracle Database table indexes, data partitioning, and query optimization, in addition to executing on a likely more powerful and memory-rich machine. Using HiveQL, ORAAH provides scalability implicitly using map-reduce, accessing data directly from Hive.
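To make this concrete, here is a hedged sketch of the proxy-object pattern with ORE. It assumes a configured Oracle R Enterprise client and connection; the connection parameters and the ONTIME_S table (an ORE demo data set) stand in for whatever data is available in a given environment.

```r
# Sketch only: requires an Oracle R Enterprise installation and a
# reachable database. Connection details below are placeholders.
library(ORE)
ore.connect(user = "rquser", password = "pwd",
            sid = "orcl", host = "dbhost", all = TRUE)

# ONTIME_S is an ore.frame -- a proxy for a database table. Filtering
# and column selection are translated to SQL; no rows move to R memory.
delayed <- ONTIME_S[ONTIME_S$ARRDELAY > 15, c("DEST", "ARRDELAY")]

# The overloaded aggregate() also executes in-database.
agg <- aggregate(delayed$ARRDELAY,
                 by = list(DEST = delayed$DEST),
                 FUN = mean)

# Only the small aggregated result is pulled to the client here.
head(agg)
```

The key point is that standard R idioms (subsetting, aggregate) operate unchanged on the proxy object, with execution pushed to the database.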
Data access and manipulation is further expanded through the use of Oracle Big Data SQL, where users can reference Hadoop data, e.g., as stored on Oracle Big Data Appliance, as though it were stored in database tables. Those tables are also mapped to ore.frame objects and can be used in ORE functions. Big Data SQL transparently moves query processing to the most effective platform, which minimizes data movement and maximizes performance.
Parallel Machine Learning Algorithms
Machine learning algorithms as found in R, while rich in variety and quality, typically do not leverage multi-threading or parallelism. Aside from specialized packages, scaling to bigger data becomes problematic for R users, both for execution time as well as the need to load the full data set into memory with enough memory left over for computation.
For Oracle Database, custom parallel distributed algorithms are integrated with the Oracle Database kernel and ORE infrastructure. These enable building models and scoring on "big data" by leveraging machines with hundreds of processors and terabytes of memory. With ORAAH, custom parallel distributed algorithms are also provided, but leverage Apache Spark and Hadoop. To further expand the set of algorithms, ORAAH exposes Apache Spark MLlib algorithms using R's familiar formula specification for building models, with integration into ORAAH's model matrix and scoring functionality.
The performance benefits are significant. For example, compared to R's randomForest and lm, ORE's random forest is 20x faster using 40 degrees of parallelism (DOP), and ORE's lm is 110x faster with 64 DOP. ORAAH's glm can be 4x to 15x faster than MLlib's Spark-based algorithms depending on how memory is constrained (48 GB down to 8 GB). This high performance in the face of reduced memory requirements means that more users can share the same hardware resources for concurrent model building.
Data and Task Parallel Execution
Aside from parallel machine learning algorithms, users want to easily specify data-parallel and task-parallel execution. Data-parallel behavior is often referred to as "embarrassingly parallel" since it's extremely easy to achieve: partition the data and invoke a user-defined R function on each partition of data in parallel, then collect the results. Task-parallel behavior takes a user-defined R function and executes it n times, with an index passed to the function corresponding to the execution thread (1..n). This index facilitates setting random seeds for Monte Carlo simulations or selecting behavior to execute within the R function.
ORE’s embedded R execution provides specialized operations that support both data-parallel (ore.groupApply, ore.rowApply) and task-parallel (ore.indexApply) execution where users can specify the degree of parallelism, i.e., the number of parallel R engines desired. ORAAH enables specifying map-reduce jobs where the mapper and reducer are specified as R functions and can readily support the data-parallel and task-parallel behavior. With both ORE and ORAAH, users can leverage CRAN packages in their user-defined R functions.
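A brief sketch of these two patterns with ORE's embedded R execution follows. It assumes an existing ORE connection; the ONTIME_S table, column names, and degree of parallelism are illustrative placeholders.

```r
# Sketch only: requires an Oracle R Enterprise connection.

# Data-parallel: ore.groupApply partitions ONTIME_S by destination and
# builds one linear model per partition, in parallel R engines.
mods <- ore.groupApply(ONTIME_S, ONTIME_S$DEST,
                       function(dat) lm(ARRDELAY ~ DEPDELAY, data = dat),
                       parallel = 4)   # request 4 parallel R engines

# Task-parallel: ore.indexApply runs the function 10 times; the index
# argument seeds each run's random number generator for a Monte Carlo
# simulation.
sims <- ore.indexApply(10,
                       function(i) {
                         set.seed(i)
                         mean(rnorm(1e6))   # toy simulation
                       },
                       parallel = TRUE)
```

Note that the user-defined functions are ordinary R code and may call CRAN packages installed on the server, per the source text.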
Does the performance of CRAN packages improve as well? In general, there is no automated way to parallelize an arbitrary algorithm, e.g., to have an arbitrary CRAN package's functions become multi-threaded and/or execute across multiple servers. Algorithm designers often need to decompose a problem into chunks that can be performed in parallel and then integrate those results, sometimes in an iterative fashion. However, since the R functions are executed at the database server, which is likely a more powerful machine, performance may be significantly improved, especially with data loaded at inter-process communication speeds rather than over Ethernet. There are some high performance libraries that provide primitive or building-block functionality, e.g., matrix operations and algorithms like FFT or SVD, that can transparently boost the performance of those operations. Such libraries include Intel's Math Kernel Library (MKL), which is included with ORE and ORAAH for use with Oracle R Distribution, Oracle's redistribution of open source R.
Production Deployment
Aside from the subjective decision about whether a given data science solution should be put in production, the next biggest hurdle includes the technical obstacles for putting an R-based solution into production and having those results available to production applications.
For many enterprises, one approach has been to recode predictive models using C, SQL, or Java so they can be more readily used by applications. However, this takes time, is error prone, and requires rigorous testing. All too often, models become stale while awaiting deployment. Alternatively, some enterprises will hand-craft the “plumbing” required to spawn an R engine (or engines), load data, execute R scripts, and pass results to applications. This can involve reinventing complex infrastructure for each project, while introducing undesirable complexity and failure conditions.
ORE provides embedded R execution, which allows users to store their R scripts as functions in the Oracle Database R Script Repository and then invoke those functions by name, either from R or SQL. The SQL invocation facilitates production deployment. User-defined R functions that return a data.frame can have that result returned from SQL as a database table. Similarly, user-defined R functions that return images can have those images returned, one row per image, as a database table with a BLOB column containing the PNG images. Since most enterprise applications use SQL already, invoking user-defined R functions directly and getting back values becomes straightforward and natural. Embedded R execution also enables job scheduling via the Oracle Database DBMS_SCHEDULER functionality. For ORAAH, user-defined R functions that invoke ORAAH functionality can also be stored in the Oracle Database R Script Repository for execution by name from R or SQL, also taking advantage of database job scheduling.
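As a hedged illustration of the script-repository workflow, the sketch below stores a user-defined function by name and invokes it from R; the script name, table, and columns are hypothetical, and the SQL form shown in the comment is indicative rather than a tested statement.

```r
# Sketch only: requires an Oracle R Enterprise connection and
# appropriate privileges on the R Script Repository.

# Store a named function in the Oracle Database R Script Repository.
ore.scriptCreate("BuildDelayModel", function(dat) {
  mod <- lm(ARRDELAY ~ DEPDELAY, data = dat)
  data.frame(intercept = coef(mod)[1], slope = coef(mod)[2])
})

# Invoke the stored script by name from R via embedded R execution.
res <- ore.tableApply(ONTIME_S, FUN.NAME = "BuildDelayModel")

# From SQL, the same script can be invoked so applications receive the
# data.frame result as table rows (illustrative form):
#   SELECT * FROM table(rqTableEval(
#     cursor(SELECT ARRDELAY, DEPDELAY FROM ONTIME_S),
#     NULL,
#     'SELECT 1 intercept, 1 slope FROM dual',
#     'BuildDelayModel'));
```

Storing the function once and invoking it by name from SQL is what lets production applications consume R results without recoding the model.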
These capabilities, efficient data access and manipulation, access to parallel and distributed machine learning algorithms, data and task parallel execution, and the ability to deploy results quickly and easily, enable data scientists to perform Big Data advanced analytics and machine learning with R.
Oracle R Enterprise is a component of the Oracle Advanced Analytics option to Oracle Database. Oracle R Advanced Analytics for Hadoop is a component of the Oracle Big Data Connectors software suite for use on Cloudera and Hortonworks, and both Oracle Big Data Appliance and non-Oracle clusters.