By Mark Hornick-Oracle on Jul 24, 2014
At the user!2014 conference at UCLA in early July, which was a stimulating and well-attended conference, I spoke about Oracle’s R Technologies during the sponsor talks. One of my slides focused on examples of analytics pain points we often hear from customers and prospects. For example,
“It takes too long to get my data or to get the ‘right’ data”
“I can’t analyze or mine all of my data – it has to be sampled”
“Putting R models and results into production is ad hoc and complex”
“Recoding R models into SQL, C, or Java takes time and is error prone”
“Our company is concerned about data security, backup and recovery”
“We need to build 10s of thousands of models fast to meet business objectives”
After the talk, several people approached me remarking how these are exactly the problems they encounter in their organizations. One person even asked, if I’d interviewed her for my talk since she is experiencing every one of these pain points.
Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, addresses these pain points. Let’s take a look one by one.
If it takes too long to get your data, perhaps because your moving it from the database where it resides to your external analytics server or laptop, the ideal solution is don’t move it. Analyze it where it is. This is exactly what Oracle R Enterprise allows you to do using the transparency layer and in-database predictive analytics capabilities. With Oracle R Enterprise, R functions normally performed on data.frames are translated to SQL for execution in the database, taking advantage of query optimization, indexes, parallel-distributed execution, etc. With the advent of Oracle Data In-Memory option, this has even more advantages, but that’s a topic for another post. The second part of this pain point is getting access to the “right” data. Allowing your data scientist to have a sandbox with access to the range of data necessary to perform his/her work avoids the delay of requesting flat file extracts via the DBA, only to realize that more or different data is required. The cycle time in getting the “right” data impedes progress, not to mention annoying some key individuals in your organization. We’ll come back to the security aspects later.
Increasingly, data scientists want to avoid sampling data when analyzing data or building predictive models. Minimally, they at least want to use much more data than may fit in typical analytics servers. Oracle R Enterprise provides an R interface to powerful in-database analytic functions and data mining algorithms. These algorithms are designed to work in a parallel distributed manner whether the data fits in memory or not. In other cases, sampling is desired, if not required, but this results in the chicken-and-egg problem: The data need to be sampled since they won’t fit in memory, but the data are too big to fit in memory to sample! Users have developed home grown techniques to chunk the data and combine partial samples; however, they shouldn’t have to. When sampling is desired/required, with Oracle R Enterprise, we are able to leverage row indexing and in-database sampling to extract only database table rows that are in the sample, using standard R syntax or Oracle R Enterprise-based sampling functions.
Our next pain point involves production deployment. Many good predictive models have been laid waste for lack of integration with or complexity introduced by production environments. Enterprise applications and dashboards often speak SQL and know how to access data. However, to craft a solution that extracts data, invokes an R script in an external R engine, and places batch results back in the database requires a lot of manual coding, often leveraging ad hoc cron jobs. Oracle R Enterprise enables the execution of R scripts on the database server machine, in local R engines under the control of Oracle Database. This can be done from R and SQL. Using the SQL API, R scripts can be invoked to return results in the form of table data, images, and XML. In addition, data can be moved to these R engines more efficiently, and the powerful database hardware, such as Exadata machines, can be leveraged for data-parallel and task-parallel R script execution.
When users don’t have access to a tight integration between R and SQL as noted above, another pain point involves using R only to build the models and relying on developers to recode the scoring procedures in a programming language that fits with the production environment, e.g., SQL, C, or Java. This has multiple downsides: it takes time to recode, manual recoding is error prone, and the resulting code requires significant testing. When the model is refreshed, the process repeats.
The pain points discussed so far also suffer from concerns about security, backup, and recovery. If data is being moved around in flat files, what security protocols or access controls are placed on those flat files? How can access be audited? Oracle R Enterprise enables analytics users to leverage an Oracle Database secured environment for data access. Moving on, if R scripts, models, and other R objects are stored and managed as flat files, how are these backed up? How are they synced with the deployed application? By storing all these artifacts in Oracle Database via Oracle R Enterprise, backup is a normal part of DBA operation with established protocols. The R Script Repository and Datastore simplify backup. Crafting ad hoc solutions involving third party analytic servers, there is the issue of recovery, or resilience to failures. Fewer moving parts mean lower complexity. Programming for failure contingencies in a distributed application adds significant complexity to an application. Allowing Oracle Database to control the execution of R scripts in database server side R engines reduces complexity and frees application developers and data scientists to focus on the more creative aspects of their work.
Lastly, users of advanced analytics software – data scientists, analysts, statisticians – are increasing pushing the barrier of scalability. Not just in volume of data processed, but in the number and frequency of their computations and analyses, e.g., predictive model building. Where only a few models are involved, it may be tractable to manage a few files to store predictive models on disk (although as noted above, this has its own complications). When you need to build thousands of models or hundreds of thousands of models, managing these models becomes a challenge in its own right.
In summary, customers are facing a wide range of pain points in their analytics activities. Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, addresses these pain points allowing data scientists, analysts, and statisticians, as well as the IT staff who supports them, to be more productive, while promoting and enabling new uses of advanced analytics.