Monday Sep 17, 2012

Podcast interview with Michael Kane

In this podcast interview with Michael Kane, Data Scientist and Associate Researcher at Yale University, Michael discusses the R statistical programming language, the computational challenges associated with big data, and two data analysis projects he conducted: one on the stock market "flash crash" of May 6, 2010, and one tracking the transportation routes of the H5N1 bird flu. Michael also worked with Oracle on Oracle R Enterprise, a component of the Advanced Analytics option to Oracle Database Enterprise Edition. In the closing segment of the interview, Michael comments on the relationship between the data analyst and the database administrator, and on how Oracle R Enterprise facilitates this relationship by providing secure data management, transparent access to data, and improved performance.

Listen now...

Friday Feb 03, 2012

What is R?

For many in the Oracle community, the addition of R through Oracle R Enterprise could leave them wondering "What is R?"

R has been receiving a lot of attention recently, although it’s been around for over 15 years. R is an open-source language and environment for statistical computing and data visualization, supporting data manipulation and transformations, as well as sophisticated graphical displays. It's being taught in colleges and universities in courses on statistics and advanced analytics - even replacing more traditional statistical software tools. Corporate data analysts and statisticians often know R and use it in their daily work, either writing their own R functionality, or leveraging the more than 3400 open source packages. The Comprehensive R Archive Network (CRAN) open source packages support a wide range of statistical and data analysis capabilities. They also focus on analytics specific to individual fields, such as bioinformatics, finance, econometrics, medical image analysis, and others (see CRAN Task Views).

So why do statisticians and data analysts use R?

Well, R is a statistics language similar to SAS or SPSS. It’s a powerful, extensible environment, and as noted above, it has a wide range of statistics and data visualization capabilities. It’s easy to install and use, and it’s free – downloadable from the CRAN R project website.

Statisticians and data analysts, on the other hand, typically don’t know SQL and are not familiar with database tasks. R gives them access to a wide range of analytical capabilities in a natural statistical language, allowing them to remain highly productive. For example, writing R functions is simple and can be done quickly. Functions can be made to return R objects that can be easily passed to and manipulated by other R functions. By comparison, traditional statistical tools can make the implementation of functions cumbersome, such that programmers resort to macro-oriented programming constructs instead.
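
As a minimal sketch of that point, the base R snippet below defines one function that returns an ordinary R list and a second function that consumes it. The function names summarize_vec and standard_error and the sample data are made up for illustration; they are not part of R or of any Oracle product.

```r
# Hypothetical helper: summarize a numeric vector and return a plain R list.
summarize_vec <- function(x) {
  list(n = length(x), mean = mean(x), sd = sd(x))
}

# Hypothetical second function: takes the object returned above
# and computes the standard error of the mean from it.
standard_error <- function(s) {
  s$sd / sqrt(s$n)
}

# Illustrative data only.
s  <- summarize_vec(c(2, 4, 4, 4, 5, 5, 7, 9))
se <- standard_error(s)

print(s$mean)
print(se)
```

The object returned by the first function is passed straight into the second with no macro layer or intermediate file in between.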

So why do we need anything else?

R was conceived as a single-user tool that is not multi-threaded. The client and server components are bundled together as a single executable, much like Excel.

R is limited by the memory and processing power of the machine where it runs. In addition, being single-threaded, it cannot automatically leverage the CPU capacity of a user’s multi-processor laptop without special packages and programming.

However, there is another issue that limits R’s scalability…

R’s approach to passing data between function invocations results in data duplication, which consumes memory quickly. So R is inherently a poor fit for big data or, depending on the machine and tasks, even gigabyte-sized data sets.
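
The duplication described above can be seen in a small base R sketch. The function name scale_first and the three-element vector are hypothetical; how much copying actually happens internally depends on the R version, so this only shows the visible effect, which is that a function works on its own copy once it modifies its argument.

```r
# Hypothetical function: modifies its argument and returns the result.
scale_first <- function(v) {
  v[1] <- v[1] * 10   # the write makes R work on a copy local to the function
  v
}

x <- c(1, 2, 3)       # illustrative data; imagine millions of rows instead
y <- scale_first(x)

print(x)              # the caller's vector is unchanged
print(y)              # the modified copy is a separate object
```

After the call, both x and y exist in memory. With a three-element vector that is harmless, but with a multi-gigabyte data set each such step can require another full copy.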

This is where Oracle R Enterprise comes in. As we'll continue to discuss in this blog, Oracle R Enterprise lifts the memory and computational constraints found in R today by executing requested R calculations on data in the database, using the database itself as the computational engine. Oracle R Enterprise allows users to further leverage Oracle's engineered systems, like Exadata, Big Data Appliance, and Exalytics, for enterprise-wide analytics, as well as reporting tools like Oracle Business Intelligence Enterprise Edition dashboards and BI Publisher documents.

Tuesday Jan 17, 2012

Welcome to Oracle R Enterprise!

Welcome to the Oracle R Enterprise blog - brought to you by the Oracle Advanced Analytics group. We'll be sharing best practices, tips, and tricks for applying Oracle R Enterprise and Oracle R Connector for Hadoop in both traditional and new "big data" environments. Oracle R Enterprise and Oracle Data Mining are the two components of the new Oracle Advanced Analytics Option to Oracle Database.

Here's a brief introduction to Oracle's R offerings: Oracle R Distribution, Oracle R Enterprise, and Oracle R Connector for Hadoop.

Oracle R Distribution provides an Oracle-supported distribution of open source R — enhanced with Intel’s MKL libraries for high performance mathematical computations on x86 hardware. The Oracle R Distribution facilitates enterprise acceptance of R, since the lack of a major corporate sponsor has made some companies concerned about fully adopting R.

Oracle R Enterprise (ORE) integrates the open-source R statistical environment and language with Oracle Database 11g, and the Oracle engineered solutions of Oracle Exadata and Oracle Big Data Appliance. ORE delivers enterprise-level advanced analytics based on the R environment, leveraging the database as an analytical compute engine. This allows R users such as data analysts and statisticians to use the R client directly against data stored in Oracle Database 11g, vastly increasing scalability, performance, and security.

As an embedded component of the RDBMS, ORE eliminates R’s memory constraints since it can work on data directly in the database. R users can also execute R scripts in Oracle Database to support enterprise production applications. R's data.frame results and sophisticated graphics can be delivered through Oracle BI Publisher documents and OBIEE dashboards. Since it’s R, users are also able to leverage the latest contributed open source packages.

For data mining, R users can not only build models using any of the algorithms in the CRAN machine learning task view, but also leverage in-database implementations for predictions (e.g., stepwise regression, GLM, SVM), attribute selection, clustering, feature extraction via non-negative matrix factorization, association rules, and anomaly detection.

Oracle R Connector for Hadoop, one of the connectors available for Oracle Big Data Appliance, allows R users to work with the Hadoop Distributed File System (HDFS) and execute MapReduce programs on the Big Data Appliance Hadoop Cluster. R users write mapper and reducer functions in the R language, and invoke MapReduce jobs from the R environment.
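
To give a feel for what mapper and reducer functions written in R look like, here is a rough word-count sketch in plain base R. It runs entirely in a local R session and does not call Oracle R Connector for Hadoop or HDFS; the names mapper and reducer, the sample lines, and the in-memory "shuffle" step are all assumptions made for illustration.

```r
# Hypothetical mapper: emit one (key, value) pair per word in a line of text.
mapper <- function(line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]            # drop empty strings
  lapply(words, function(w) list(key = w, val = 1L))
}

# Hypothetical reducer: combine all values that share a key.
reducer <- function(key, vals) {
  list(key = key, val = sum(vals))
}

# Illustrative input; a real job would read its input from HDFS.
lines <- c("R runs in the database", "the database runs R")

# Map phase: apply the mapper to every line and flatten the results.
pairs <- unlist(lapply(lines, mapper), recursive = FALSE)

# Shuffle phase (simulated locally): group the values by key.
keys    <- vapply(pairs, function(p) p$key, character(1))
vals    <- vapply(pairs, function(p) p$val, integer(1))
grouped <- split(vals, keys)

# Reduce phase: call the reducer once per key.
results <- Map(reducer, names(grouped), grouped)
counts  <- vapply(results, function(r) r$val, integer(1))

print(counts)
```

In an actual MapReduce job the framework performs the grouping step across the cluster, and the R user supplies only the two functions.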

We'll be exploring these components and their application in future posts.




