
Best practices, news, tips and tricks - learn about Oracle's R Technologies for Oracle Database and Big Data

Data Science Maturity Model - Summary Table for Enterprise Assessment (Part 12)

This installment of the Data Science Maturity Model (DSMM) blog series contains a summary table of the dimensions and levels. Enterprises embracing data science as a core competency may want to evaluate what level they have achieved relative to each dimension - in some cases, an enterprise may straddle more than one level. As a next step, the enterprise may use this maturity model to identify a level in each dimension to which they aspire, or fashion a new Level 6. Data...

Thursday, June 28, 2018 | Best Practices | Read More

Data Science Maturity Model - Deployment (Part 11)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'deployment': How easily can data science work products be placed into production to meet timely business objectives? Data science comes with the expectation that amazing insights and predictions will transform the business and take the enterprise to a new level of performance. Too often, however, data science projects fail to "lift-off," resulting in significant...

Wednesday, June 27, 2018 | Best Practices | Read More

Data Science Maturity Model - Tools Dimension (Part 10)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'tools': What tools are used within the enterprise for data science? Can data scientists take advantage of open source tools in combination with high performance and scalable production quality infrastructure? A wide range of tools support data science ranging from open source to proprietary, relational database to "big data" platforms, simple analytics to complex machine...

Tuesday, June 26, 2018 | Best Practices | Read More

Returning Tables from Embedded R Execution .... Simplified

In this tips and tricks blog, we share some techniques through our own use of Oracle R Enterprise applied to data science projects that you may find useful in your projects. This time, we focus on the automated process of returning the data frame schema from the output of embedded R execution runs. Embedded R Execution ORE embedded R execution provides a powerful and convenient way to execute custom R scripts at the database server, from either R or SQL. It also enables running...

Monday, June 25, 2018 | Read More

Data Science Maturity Model - Asset Management Dimension (Part 9)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'asset management': How are data science assets managed and controlled? Assets are typically both tangible and intangible things of value. For this discussion, we will consider the array of data science work products as assets and can define 'asset management' at a high level as "any system that monitors and maintains things of value to an entity or group." As we...

Thursday, June 21, 2018 | Best Practices | Read More

Data Science Maturity Model - Scalability Dimension (Part 8)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'scalability': Do the tools scale and perform for data exploration, preparation, modeling, scoring, and deployment? As data, data science projects, and the data science team grow, is the enterprise able to support these adequately? The term 'scalability' can be defined as the "capability of a system, network, or process to handle a growing amount of work, or its potential to be...

Tuesday, June 19, 2018 | Best Practices | Read More

Data Science Maturity Model - Data Access Dimension (Part 7)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'data access': How do data analysts and data scientists request and access data? How is data access controlled, managed, and monitored?   When we consider 'data access,' one definition refers to "software and activities related to storing, retrieving, or acting on data housed in a database or other repository" normally coupled with authorization - who is permitted to access what...

Monday, June 11, 2018 | Read More

R Consortium solicits feedback on R package best practices

With over 12,000 R packages on CRAN alone, the choice of which package to use for a given task is challenging. While summary descriptions, documentation, download counts and word-of-mouth may help direct selection, a standard assessment of package quality can greatly help identify the suitability of a package for a given (non-)commercial need. Providing the R Community of package users an easily recognized “badge” indicating the level of quality achievement will make it...

Monday, June 11, 2018 | News | Read More

Data Science Maturity Model - Data Awareness Dimension (Part 6)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'data awareness': How easily can data scientists learn about enterprise data resources? Generally speaking, the term 'awareness' can be defined as "the state or condition of being aware; having knowledge; consciousness." For data awareness, we might refine this definition as "having knowledge of the data that exist in an enterprise and an understanding of its contents." As the...

Friday, June 8, 2018 | Read More

Data Science Maturity Model - Methodology Dimension (Part 5)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'methodology': What is the enterprise approach or methodology to data science? The most often cited methodology for 'data mining' - a key element of data science - is CRISP-DM. However, the breadth and growth of data science may require expanding beyond the traditional phases introduced by CRISP-DM: Business Understanding, Data Understanding, Data Preparation, Modeling,...

Thursday, June 7, 2018 | Best Practices | Read More

Data Science Maturity Model - Collaboration Dimension (Part 4)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'collaboration': How do data scientists collaborate among themselves and with others in the enterprise, e.g., business analysts, application and dashboard developers, to evolve and hand-off data science work products? Data science projects often involve significant collaboration, defined as "two or more people or organizations working together to realize or achieve a goal."...

Wednesday, June 6, 2018 | Best Practices | Read More

Data Science Maturity Model - Roles Dimension (Part 3)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'roles': What roles are defined and developed in the enterprise to support data science activities? A role can be defined as "a set of connected behaviors, rights, obligations, beliefs, and norms as conceptualized by people in a social situation." As with most any new field, data science within an enterprise can benefit from the introduction of new roles. Following the 'strategy'...

Tuesday, June 5, 2018 | Best Practices | Read More

Data Science Maturity Model - Strategy Dimension (Part 2)

In my previous post, I introduced this series on a Data Science Maturity Model and the dimensions we'll be discussing. The first dimension is 'strategy': What is the enterprise business strategy for data science? A strategy can be defined as "a high-level plan to achieve one or more goals under conditions of uncertainty." With respect to data science, goals may include making better business decisions, making new discoveries, improving customer acquisition / retention...

Monday, June 4, 2018 | Best Practices | Read More

A Data Science Maturity Model for Enterprise Assessment (Part 1)

"Maturity models" aid enterprises in understanding their current and target states. Enterprises that already embrace data science as a core competency, as well as those just getting started, often seek a road map for improving that competency. A data science maturity model is one way of assessing an enterprise and guiding the quest for data science nirvana. As an assessment tool, this Data Science Maturity Model provides a set of dimensions relevant to data science and...

Wednesday, May 30, 2018 | Best Practices | Read More

Deploying Multiple R Scripts in Oracle R Enterprise

In this tips and tricks blog, we share some techniques through our own use of Oracle R Enterprise applied to data science projects that you may find useful in your own projects. Some data science projects may have tens or hundreds of R scripts and R functions written by developers or data scientists. While under ideal circumstances, you would create a package to contain these functions, that may involve more effort than you had in mind. This tradeoff of package vs. no package...

Wednesday, February 28, 2018 | Read More

Scalable scoring with multiple models using Oracle R Enterprise Embedded R Execution

At first glance, scoring data in batch with a machine learning model appears to be a straightforward endeavor: build the model, load the data, score using the model, do something with the results. This “something” can include writing the scores to a table, computing model evaluation/quality metrics, directly feeding a dashboard, etc. However, the task becomes a little more challenging when some of the details are filled in and hardware and software realities come into play....

Monday, February 26, 2018 | Read More

Building R Packages on Solaris: Variable Definitions

The R ecosystem offers numerous packages for performing data analysis. Currently, the CRAN package repository features over 14,000 available packages! A key benefit to using R is the endless support it gets from statisticians, developers, and data science experts around the world. The CRAN repository offers R packages in Linux source format, or as binaries for Windows and macOS. If you are installing R packages on another operating system such as Solaris, you will need to...

Saturday, January 6, 2018 | Read More

Announcing the release of Oracle R Advanced Analytics for Hadoop 2.7.1

We are pleased to announce the general availability of Oracle R Advanced Analytics for Hadoop (ORAAH) 2.7.1, a component of the Oracle Big Data Connectors, which enables big data analytics from R. With ORAAH, Data Scientists and Data Analysts have access to the rich and productive R language for accessing and manipulating data resident across multiple platforms, including HDFS, Hive, Oracle Database, and local files. By leveraging the parallel and distributed Hadoop and Spark...

Tuesday, December 12, 2017 | News | Read More

PageRank-based College football (NCAA) ranking using OAAgraph

NCAA College football is American football played by teams of student athletes fielded by American universities, colleges, and military academies. It is one of the major weekend entertainments in the US. The match results capture most of the Sunday headlines. In particular, one key focus is the rankings of the teams. There are various types of rankings: CFP rankings, AP Poll, Coaches Poll, etc. Those rankings look similar to each other with slight differences. The ranking...

Tuesday, December 5, 2017 | Read More

Text Analytics using a pre-built Wikipedia-based Topic Model

In my previous post, Explicit Semantic Analysis (ESA) for Text Analytics, we explored the basics of the ESA algorithm and how to use it in Oracle R Enterprise to build a model from scratch and use that model to score new text.  While creating your own domain-specific model may be necessary in many situations, others may benefit from a pre-built model based on millions of Wikipedia articles reduced to 200,000 topics. This model is downloadable here with details of how to...

Thursday, November 30, 2017 | Tips and Tricks | Read More

Supporting R through the R Consortium

Oracle has supported the R Consortium since its inception in 2015 (R Consortium Launched!). As a provider of multiple software tools and products that leverage and extend R, joining the R Consortium was a natural way for Oracle to give back to the R community and contribute to the evolution of the R ecosystem. The R Consortium provides vendors a forum within which to suggest needed projects for the R community, and to raise concerns. Through the Infrastructure Steering...

Wednesday, November 29, 2017 | R Technologies | Read More

Explicit Semantic Analysis (ESA) for Text Analytics

New in Oracle R Enterprise 1.5.1 with the Oracle Database 12.2 Oracle Advanced Analytics option is the text analytics algorithm Explicit Semantic Analysis or ESA. Compared to other techniques such as Latent Dirichlet Allocation (LDA) or Term Frequency-Inverse Document Frequency (TF-IDF), ESA offers some unique benefits. Most notably, it improves text document categorization by computing “semantic relatedness” between the documents and a set of topics that are explicitly...

Friday, November 17, 2017 | Read More

Getting started with OAAgraph - vignette

Following up on the introductory post on OAAgraph, here is a vignette that illustrates using some of the OAAgraph package's capabilities. Recall that OAAgraph enables seamless interaction between R users of Oracle R Enterprise (ORE) of the Oracle Advanced Analytics option, Oracle Database, and the Parallel Graph Engine (PGX) of the Oracle Spatial and Graph option. In this post, we highlight a few aspects of OAAgraph: Creating a graph from node and edge tables residing in...

Tuesday, October 31, 2017 | Read More

Working Effectively with Support

When simultaneously learning a new tool and working toward a deliverable deadline, getting timely help with technical problems is critical. If you work with R/Oracle R Distribution, Oracle R Enterprise, or Oracle R Advanced Analytics for Hadoop and need to engage support resources, we recommend doing everything you can to expedite a solution to the problem you are facing. The following tips on working effectively with support will enable more efficient communication, leading...

Tuesday, October 10, 2017 | Read More

Building "partition models" with Oracle R Enterprise

There are many approaches for improving model accuracy - anything from enriching or cleansing the data you start with to optimizing algorithm parameters or creating ensemble models. One technique that Oracle R Enterprise users sometimes employ is to partition data based on the distinct values of one or more columns and build a model for each partition. By building a model on each partition, forming a kind of ensemble model, better accuracy is possible. The embedded R...

Thursday, October 5, 2017 | Tips and Tricks | Read More
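
The partitioning idea described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the Oracle R Enterprise API: the column names (`region`, `x`, `y`), the helper functions, and the choice of a one-variable least-squares line as the per-partition model are all assumptions made for the example.

```python
from collections import defaultdict

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def build_partition_models(rows, key):
    """Split rows on the distinct values of `key` and fit one model per partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append((row["x"], row["y"]))
    return {value: fit_line(points) for value, points in partitions.items()}

def predict(models, row, key):
    """Score a row with the model that belongs to its partition."""
    intercept, slope = models[row[key]]
    return intercept + slope * row["x"]

# Hypothetical data: the two regions follow opposite trends,
# so a single pooled line would fit neither of them well.
rows = [
    {"region": "north", "x": 1.0, "y": 3.0},
    {"region": "north", "x": 2.0, "y": 5.0},
    {"region": "north", "x": 3.0, "y": 7.0},
    {"region": "south", "x": 1.0, "y": 10.0},
    {"region": "south", "x": 2.0, "y": 9.0},
    {"region": "south", "x": 3.0, "y": 8.0},
]
models = build_partition_models(rows, "region")
north_prediction = predict(models, {"region": "north", "x": 4.0}, "region")
south_prediction = predict(models, {"region": "south", "x": 4.0}, "region")
```

Scoring routes each row to the model built on its own partition, which is what makes the collection of models behave like a simple ensemble.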

Integrating custom algorithms with Oracle Advanced Analytics with R

Data scientists and other users of machine learning and predictive analytics technology often have their favorite algorithm for solving particular problems. If they are using a tool like Oracle Advanced Analytics -- with Oracle R Enterprise and Oracle Data Mining -- there's a desire to use these algorithms within that tool's framework. Using ORE's embedded R execution, users can already use 3rd party R packages in combination with Oracle  Database for execution at the...

Tuesday, October 3, 2017 | FAQ | Read More

Building Rcpp on 64-bit Solaris SPARC Systems

One reason R has become so popular is the vast array of add-on packages available at the CRAN and Bioconductor repositories. R's package system along with the CRAN framework provides a process for authoring, documenting and distributing packages to millions of users.  However, users and administrators wanting to build packages requiring C++ on 64-bit Solaris SPARC systems often are unable to compile their packages using Oracle Developer Studio.   R uses $R_HOME/etc/Makeconf a...

Thursday, September 21, 2017 | Read More

Monitoring progress of embedded R functions

When you run R functions in Oracle Database, especially functions involving multiple R engines in parallel, you can monitor their progress using the Oracle R Enterprise datastore as a central location for progress notifications, or any intermediate status or results. In the following example, based on ore.groupApply, we illustrate instrumenting a simple function that builds a linear model to predict flight arrival delay based on a few other variables. In the function modelBuild...

Wednesday, September 20, 2017 | Best Practices | Read More

Contrasting Oracle R Distribution and Oracle R Enterprise

What is the distinction between Oracle R Distribution and Oracle R Enterprise? Oracle R Distribution (ORD) is Oracle's redistribution of open source R, with enhancements for dynamically loading high performance libraries like Intel's Math Kernel Library (MKL) and setting R memory limits on database server-side R engine execution. Oracle provides support for ORD to customers of the Oracle Advanced Analytics option (which includes Oracle R Enterprise and Oracle Data Mining),...

Thursday, September 7, 2017 | FAQ | Read More

Graph Analytics and Machine Learning - A Great Combination

Graphs are everywhere, whether looking at social media such as Facebook (friends of friends), Twitter, and LinkedIn, or customer relationships such as who calls whom or which bank accounts have money transfers between them. Graph algorithms come in two major flavors: computational graph analytics, where we analyze the entire graph to compute metrics or identify graph components, and graph pattern matching, where queries find sub-graphs corresponding to specified patterns. In...

Wednesday, September 6, 2017 | Read More

Introducing a dplyr interface to Oracle R Enterprise

While Oracle R Enterprise already provides seamless access to Oracle Database tables using standard R syntax and functions, new interfaces arise that make it conceptually easier for users to manipulate tabular data. The R package dplyr is one such package in the tidyverse that has gained wide adoption. It provides a grammar for data manipulation while working with data.frame-like objects, both in memory and out of memory. The dplyr package is intended to interface to database...

Tuesday, September 5, 2017 | Read More

Machine Learning on Database: What do you want to do?

Oracle Database provides a wide range of scalable and performant machine learning algorithms from R and SQL. This new algorithm "cheat sheet" serves to guide users of Oracle Advanced Analytics (Oracle R Enterprise and Oracle Data Mining) to the best in-database algorithm for a given task: predict categories (classification) predict numeric values (regression) rank predictors (attribute importance) group or segment cases (clustering) derive new values (feature extraction) identify...

Friday, September 1, 2017 | Best Practices | Read More

Oracle R Enterprise 1.5.1 for Oracle Database is now available

We are pleased to announce that Oracle R Enterprise (ORE) 1.5.1 is now available for download for Oracle Database Enterprise Edition with Oracle R Distribution 3.3.0 / R-3.3.0. Oracle R Enterprise is a component of the Advanced Analytics option to Oracle Database. With ORE 1.5.1, we introduce two new packages: OREdplyr - a transparency layer enhancement - allows ORE users access to many of the popular dplyr functions on ore.frames; and a second package via separate download,...

Thursday, August 31, 2017 | News | Read More

Visualizing Circular Distributions on Big Data

While browsing the chapter on Circular Distributions in Zar’s Biostatistical Analysis, I came across an interesting visualization for circular data. Circular scale data is all around: the days of the week, months of the year, hours of the day, degrees on the compass. Defined technically, circular scale data is a special type of interval scale data where the scale has equal intervals, no true zero point, and there are no rational high or low values, or if there are, they are...

Wednesday, August 23, 2017 | Best Practices | Read More
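
To see why circular scale data needs its own treatment, consider averaging hours of the day. The sketch below, in plain Python and not taken from the post, maps each value to an angle, averages the sine and cosine components, and maps the result back; the function name and the sample values are assumptions for illustration.

```python
import math

def circular_mean(values, period=24.0):
    """Mean of values on a circular scale with the given period (default: hours of the day)."""
    angles = [2 * math.pi * v / period for v in values]
    sin_sum = sum(math.sin(a) for a in angles)
    cos_sum = sum(math.cos(a) for a in angles)
    mean_angle = math.atan2(sin_sum, cos_sum)
    # Map the angle back to the original scale, wrapped into [0, period]
    return (mean_angle * period / (2 * math.pi)) % period

late_night = [23.0, 1.0]                           # 11 pm and 1 am
arithmetic = sum(late_night) / len(late_night)     # 12.0, i.e. noon
circular = circular_mean(late_night)               # lands at midnight, up to rounding

# Distance from midnight measured around the circle, so that a result
# just below 24 counts as close to 0.
distance_from_midnight = min(circular, 24.0 - circular)
```

The arithmetic mean places the "average" of two late-night observations at noon, while the circular mean places it at midnight, up to floating-point rounding.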

Computing Weight of Evidence (WOE) and Information Value (IV)

Weight of evidence (WOE) is a powerful tool for feature representation and evaluation in data science. WOE can provide an interpretable transformation of both categorical and numerical features. For categorical features, the levels within a feature often do not have an ordinal meaning and thus need to be transformed by either one-hot encoding or hashing. Although such transformations convert the feature into vectors and can be fed into machine learning algorithms, the 0-1 valued...

Wednesday, August 2, 2017 | Read More
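
As a rough illustration of the computation named in the title, the following plain-Python sketch computes WOE per level of a categorical feature and the resulting IV. It is not the code from the post. It assumes one common convention, WOE = ln(share of positives / share of negatives) with IV as the sum of (share of positives - share of negatives) x WOE, and it adds a small smoothing constant (an assumption) so that empty cells do not produce a division by zero.

```python
import math
from collections import Counter

def woe_iv(categories, labels, smoothing=0.5):
    """Return ({level: woe}, iv) for a categorical feature and 0/1 labels."""
    pos = Counter(c for c, y in zip(categories, labels) if y == 1)
    neg = Counter(c for c, y in zip(categories, labels) if y == 0)
    levels = sorted(set(categories))
    # Smoothed totals keep the per-level shares summing to 1
    total_pos = sum(pos.values()) + smoothing * len(levels)
    total_neg = sum(neg.values()) + smoothing * len(levels)
    woe, iv = {}, 0.0
    for level in levels:
        p = (pos[level] + smoothing) / total_pos   # share of positives in this level
        q = (neg[level] + smoothing) / total_neg   # share of negatives in this level
        woe[level] = math.log(p / q)
        iv += (p - q) * woe[level]
    return woe, iv

# Hypothetical feature: level "a" is mostly positive, level "b" mostly negative
categories = ["a"] * 4 + ["b"] * 4
labels = [1, 1, 1, 0, 0, 0, 0, 1]
woe, iv = woe_iv(categories, labels)
```

Each level is replaced by a single number whose sign and magnitude show how strongly it leans toward one class, and IV summarizes the feature as a whole. Some references flip the ratio inside the logarithm, which changes the sign of WOE but leaves IV unchanged.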

Oracle R Enterprise and Database Upgrades

After a database upgrade, a set of maintenance steps is required to update the new ORACLE_HOME with the entire set of ORE components.  For example, if the proper migration steps are not followed, ORE embedded R functions will return errors such as:   ORA-28578: protocol error during callback from an external procedure The ORE server installation consists of three components: Oracle Database schema (RQSYS) and schema-related objects. Oracle Database shared libraries for...

Tuesday, August 1, 2017 | Read More

BIWA Summit 2018 with Spatial and Graph Summit - Call for Speakers

(pdf announcement) Oracle Conference Center at Oracle Headquarters Campus, Redwood Shores, CA Share your successes… We want to hear your story. Submit your proposal today for Oracle BIWA Summit 2018, featuring Oracle Spatial and Graph Summit, March 20 - 22, 2018 and share your successes with Oracle technology. The call for speakers is now open through December 3, 2017.  Submit now for possible early acceptance and publication in Oracle BIWA Summit 2018 promotion materials.  Click HE...

Thursday, June 29, 2017 | News | Read More

Oracle R Distribution 3.3.0 Benchmarks

We recently updated the Oracle R Distribution (ORD) benchmarks for version 3.3.0. ORD is based on open source R-3.3.0 and adds support for dynamically loading  linear algebra performance libraries installed on your system. This includes Intel's Math Kernel Library (MKL), AMD's ACML, and Sun Performance Library for Solaris, which enable optimized, multi-threaded math routines to provide relevant R functions maximum performance. The benchmark results demonstrate the performance...

Thursday, June 29, 2017 | Read More

R Consortium "Code Coverage Tool for R" Working Group Achieves First Release

The "Code Coverage Tool for R" project, proposed by Oracle and approved by the R Consortium Infrastructure Steering Committee, started just over a year ago. Project goals included providing an enhanced tool that determines code coverage upon execution of a test suite, and leveraging such a tool more broadly as part of the R ecosystem. What is code coverage? As defined in Wikipedia, “code coverage is a measure used to describe the degree to which the source code of a program is...

Tuesday, June 27, 2017 | Read More

Oracle R Distribution 3.3.0 Released

Oracle R Distribution version 3.3.0 is released on all supported platforms. This release, code-named "Supposedly Educational", contains several significant bug fixes and improvements to R, including:  Support for downloading data from secure https-enabled sites using download.file  Speed improvements for a number of low-level R functions called by higher-level, commonly used functions. These include speedups for vector selection with boolean data, function argument...

Tuesday, June 13, 2017 | Read More

Diabetes Data Analysis in R

Data collected from diabetes patients has been widely investigated in many data science applications. Popular data sets include the PIMA Indians Diabetes Data Set and the Diabetes 130-US hospitals for years 1999-2008 Data Set. Both data sets are aggregated, labeled and relatively straightforward to use for further machine learning tasks. However, in the real world, diabetes data are often collected from healthcare instruments attached to patients. The raw data can be sporadic and...

Tuesday, June 6, 2017 | Read More

Parallel Training of Multiple Foreign Exchange Return Models

In a variety of machine learning applications, there are often requirements for training multiple models. For example, in the internet of things (IoT) industry, a unique model needs to be built for each household with installed sensors that measure temperature, light or power consumption. Another example can be found in the online advertising industry. To serve personalized online advertisements or recommendations, a huge number of individualized models has to be built and...

Friday, May 26, 2017 | R Technologies | Read More

Migrating R models from Development to Production

Users of Oracle R Enterprise (ORE) embedded R execution will often calibrate R models in a development environment and promote the final models to a production database. In most cases, the development and production databases are distinct, and model serialization between databases is not effective if the underlying tables are not identical.  To facilitate the migration process, ORE includes scripts to transport the ORE system schema, RQSYS, and ORE objects such as...

Tuesday, December 20, 2016 | Best Practices | Read More

Key Capabilities for Big Data Analytics using R

There are several capabilities that data scientists benefit from when performing Big Data advanced analytics and machine learning with R. These revolve around efficient data access and manipulation, access to parallel and distributed machine learning algorithms, data and task parallel execution, and ability to deploy results quickly and easily. Data scientists using R want to leverage the R ecosystem as much as possible, whether leveraging the expansive set of open source R...

Thursday, October 6, 2016 | Best Practices | Read More

Oracle's Big Data & Analytics Platform for Data Scientists

Check out this blog post from Oracle's Avi Misra that highlights components of the Oracle Big Data and Analytics platform for data scientists. Oracle’s Big Data & Analytics Platform enables data science and machine learning at scale by taking the best that open-source offers, putting it together as an engineered solution and adding capabilities and features where open-source falls short. Products mentioned include: Oracle Big Data Cloud Service (BDCS), Oracle R Advanced...

Monday, October 3, 2016 | Best Practices | Read More

Early detection of process anomalies with SPRT

Developed by Abraham Wald more than a half century ago, the Sequential Probability Ratio Test (SPRT) is a statistical technique for binary hypothesis testing (helping to decide between two hypotheses, H0 and H1) and is extensively used for system monitoring and early annunciation of signal drifting. SPRT is very popular for quality control and equipment surveillance applications, in industries and areas requiring a highly sensitive, reliable and especially fast detection of...

Wednesday, July 27, 2016 | Best Practices | Read More
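
The sequential test described above can be sketched as follows. This is a minimal plain-Python illustration rather than the implementation from the post. It assumes Gaussian observations with known standard deviation, a simple mean-shift alternative (H0: mean `mu0`, H1: mean `mu1`), and Wald's approximate decision thresholds; the function name and default error rates are assumptions.

```python
import math

def sprt_gaussian(samples, mu0, mu1, sigma, alpha=0.01, beta=0.01):
    """Sequentially test H0: mean == mu0 against H1: mean == mu1.

    alpha is the target false-alarm rate and beta the target missed-alarm rate.
    Returns (decision, number of samples consumed).
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this bound
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this bound
    llr = 0.0                              # cumulative log-likelihood ratio
    for n, x in enumerate(samples, start=1):
        # Log of N(x; mu1, sigma) / N(x; mu0, sigma) for a single observation
        llr += (mu1 - mu0) * (x - (mu0 + mu1) / 2) / sigma ** 2
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "undecided", len(samples)

# Hypothetical signals: one sitting at the drifted mean, one at the nominal mean
drifted = sprt_gaussian([1.0] * 50, mu0=0.0, mu1=1.0, sigma=1.0)
nominal = sprt_gaussian([0.0] * 50, mu0=0.0, mu1=1.0, sigma=1.0)
```

The test stops as soon as the accumulated evidence crosses either bound instead of waiting for a fixed sample size, which is where the early detection comes from.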

Predicting Energy Demand using IoT

The Internet of Things (IoT) presents new opportunities for applying advanced analytics. Sensors are everywhere collecting data – on airplanes, trains, and cars, in semiconductor production machinery and the Large Hadron Collider, and even in our homes. One such sensor is the home energy smart meter, which can report household energy consumption every 15 minutes. This data enables energy companies to not only model each customer’s energy consumption patterns, but also to...

Wednesday, July 13, 2016 | Best Practices | Read More

Real-time model scoring for streaming data - a prototype based on Oracle Stream Explorer and Oracle R Enterprise

Whether applied to manufacturing, financial services, energy, transportation, retail, government, security or other domains, real-time analytics is an umbrella term which covers a broad spectrum of capabilities (data integration, analytics, business intelligence) built on streaming input from multiple channels. Examples of such channels are: sensor data, log data, market data, click streams, social media and monitoring imagery. Key metrics separating real-time analytics from...

Thursday, March 31, 2016 | Read More

R Consortium Announces New Projects

The R Consortium works with and provides support to the R Foundation and other organizations developing, maintaining and distributing R software and provides a unifying framework for the R user community. The R Consortium Infrastructure Steering Committee (ISC) supports projects that help the R community, whether through software development, developing new teaching materials, documenting best practices, promoting R to new audiences, standardizing APIs, or doing research. In...

Wednesday, March 23, 2016 | News | Read More

Using SVD for Dimensionality Reduction

SVD, or Singular Value Decomposition, is one of several techniques that can be used to reduce the dimensionality, i.e., the number of columns, of a data set. Why would we want to reduce the number of dimensions? In predictive analytics, more columns normally means more time required to build models and score data. If some columns have no predictive value, this means wasted time, or worse, those columns contribute noise to the model and reduce model quality or predictive...

Friday, February 5, 2016 | Tips and Tricks | Read More

Learn, Share, and Network! Join us at BIWA Summit, Oracle HQ, January 26-28

Join us at BIWA Summit held at Oracle Headquarters to learn about the latest in Oracle technology, customer experiences, and best practices, while sharing your experiences with colleagues, and networking with technology experts. BIWA Summit 2016, the Oracle Big Data + Analytics User Group Conference is joining forces with the NoCOUG SIG’s YesSQL Summit, Spatial SIG’s Spatial Summit and DWGL for the biggest BIWA Summit ever.  Check out the BIWA Summit’16 agenda. The BIWA Summit...

Tuesday, January 12, 2016 | News | Read More

ORE Random Forest

Random Forest is a popular ensemble learning technique for classification and regression, developed by Leo Breiman and Adele Cutler. By combining the ideas of “bagging” and random selection of variables, the algorithm produces a collection of decision trees with controlled variance, while avoiding overfitting – a common problem for decision trees. By constructing many trees, classification predictions are made by selecting the mode of classes predicted, while regression...

Monday, January 4, 2016 | Best Practices | Read More
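
The classification step mentioned above, selecting the mode of the classes predicted by the individual trees, can be illustrated in isolation. The sketch below is plain Python, not ORE code: the three "trees" are hand-written stand-in rules (assumptions for the example), and no bagging or tree training is shown.

```python
from collections import Counter

def forest_classify(trees, row):
    """Predict a class for `row` as the mode of the individual tree predictions."""
    votes = [tree(row) for tree in trees]
    # most_common(1) returns the class with the highest vote count;
    # ties go to the class that was voted for first
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-ins for trained decision trees
trees = [
    lambda row: "spam" if row["links"] > 3 else "ham",
    lambda row: "spam" if row["caps_ratio"] > 0.5 else "ham",
    lambda row: "spam" if row["links"] > 1 and row["caps_ratio"] > 0.2 else "ham",
]

shouty = forest_classify(trees, {"links": 2, "caps_ratio": 0.6})   # votes: ham, spam, spam
quiet = forest_classify(trees, {"links": 0, "caps_ratio": 0.1})    # votes: ham, ham, ham
```

Because the final class is decided by a vote, a single tree that misclassifies a row is overruled whenever the majority of the other trees get it right.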

Oracle R Enterprise 1.5 Released

We’re pleased to announce that Oracle R Enterprise (ORE) 1.5 is now available for download on all supported platforms with Oracle R Distribution 3.2.0 / R-3.2.0. ORE 1.5 introduces parallel distributed implementations of Random Forest, Singular Value Decomposition (SVD), and Principal Component Analysis (PCA) that operate on ore.frame objects. Performance enhancements are included for ore.summary summary statistics. In addition, ORE 1.5 enhances embedded R execution with...

Thursday, December 24, 2015 | Best Practices | Read More

Using RStudio Shiny with ORE for interactive analysis and visualization

Shiny, by RStudio, is a popular web application framework for R. It can be used, for example, for building flexible interactive analyses and visualization solutions without requiring web development skills and knowledge of JavaScript, HTML, CSS, etc. An overview of its capabilities with numerous examples is available on RStudio's Shiny web site. In this blog we illustrate a simple Shiny application for processing and visualizing data stored in Oracle Database for the special...

Thursday, November 19, 2015 | Best Practices | Read More

Oracle R Distribution 3.2.0 Benchmarks

We recently updated the Oracle R Distribution (ORD) benchmarks on ORD version 3.2.0. ORD is Oracle's free distribution of the open source R environment that adds support for dynamically loading the Intel Math Kernel Library (MKL) installed on your system. MKL provides faster performance by taking advantage of hardware-specific math library implementations. The benchmark results demonstrate the performance of Oracle R Distribution 3.2.0 with and without dynamically loaded MKL. ...

Thursday, November 19, 2015 | Best Practices | Read More

BIWA Summit 2016 - Oracle Big Data + Analytics User Group Conference

BIWA Summit 2016, the Oracle Big Data + Analytics User Group Conference, is joining forces with the NoCOUG SIG’s YesSQL Summit, Spatial SIG’s Spatial Summit and DWGL for the biggest BIWA Summit ever. Check out the BIWA Summit’16 agenda so far. The BIWA Summit’16 sessions and hands-on-labs are excellent opportunities for attendees to learn about Advanced Analytics/Predictive Analytics, R, Spatial Geo-location, Graph/Social Network Analysis, Big Data Appliance and Hadoop, Cloud,...

Monday, November 16, 2015 | News | Read More

Oracle R Enterprise Performance on Intel® DC P3700 Series SSDs

Solid-state drives (SSDs) are becoming increasingly popular in enterprise storage systems, providing large caches, permanent storage and low latency. A recent study aimed to characterize the performance of Oracle R Enterprise workloads on the Intel® P3700 SSD versus hard disk drives (HDDs), with IO-WAIT as the key metric of interest. The study showed that Intel® DC P3700 Series SSDs reduced I/O latency for Oracle R Enterprise workloads, most notably when saving objects to Orac...

Friday, October 23, 2015 | Best Practices | Read More

Consolidating wide and shallow data with ORE Datastore

Clinical trial data are often characterized by a relatively small set of participants (100s or 1000s) while the data collected and analyzed on each may be significantly larger (1000s or 10,000s). Genomic data alone can easily reach the higher end of this range. In talking with industry leaders, one of the problems pharmaceutical companies and research hospitals encounter is effectively managing such data. Storing data in flat files on myriad servers, perhaps even “closeted”...

Thursday, September 10, 2015 | Tips and Tricks | Read More

Oracle R Advanced Analytics for Hadoop on the Fast Lane: Spark-based Logistic Regression and MLP Neural Networks

This is the first in a series of blogs that is going to explore the capabilities of the newly released Oracle R Advanced Analytics for Hadoop 2.5.0, part of Oracle Big Data Connectors, which includes two new algorithm implementations that can take advantage of an Apache Spark cluster for significant performance gains on Model Build and Scoring time. These algorithms are a redesigned version of the Multi-Layer Perceptron Neural Networks (orch.neural) and a brand...

Friday, August 7, 2015 | Best Practices | Read More

ROracle 1.2-1 released

We are pleased to announce the latest update of the open source ROracle package, version 1.2-1, with enhancements and bug fixes. ROracle provides high performance and scalable interaction between R and Oracle Database. In addition to availability on CRAN, ROracle binaries specific to Windows and other platforms can be downloaded from the Oracle Technology Network. Users of ROracle, please take our brief survey. Your feedback is important and we want to hear from you! Latest...

Wednesday, August 5, 2015 | Best Practices | Read More

R Consortium Launched!

The Linux Foundation announces the R Consortium to support R users globally. The R Consortium works with and provides support to the R Foundation and other organizations developing, maintaining and distributing R software and provides a unifying framework for the R user community. “Data science is pushing the boundaries of what is possible in business, science, and technology, where the R language and ecosystem is a major enabling force,” said Neil Mendelson, Vice President,...

Tuesday, June 30, 2015 | News | Read More

Variable Selection with ORE varclus - Part 2

In our previous post we talked about variable selection and introduced a technique based on hierarchical divisive clustering and implemented using the Oracle R Enterprise embedded execution capabilities. In this post we illustrate how to visualize the clustering solution, discuss stopping criteria and highlight some performance aspects. Plots The clustering efficiency can be assessed, from a high level perspective, through a visual representation of metrics related...

Saturday, June 13, 2015 | Best Practices | Read More

Variable Selection with ORE varclus - Part 1

Variable selection, also known as feature or attribute selection, is an important technique for data mining and predictive analytics. It is used when the number of variables is large and has received special attention from application areas where this number is very large (like genomics, combinatorial chemistry, text mining, etc.). The underlying hypothesis for variable selection is that the data can contain many variables which are either irrelevant or redundant. Solutions...

Friday, June 5, 2015 | News | Read More

Experience using ORAAH on a customer business problem: some basic issues & solutions

We illustrate in this blog a few simple, practical solutions for problems which can arise when developing ORAAH mapreduce applications for the Oracle BDA. These problems were actually encountered during a recent POC engagement. The customer, an important player in the medical technologies market, was interested in building an analysis flow consisting of a sequence of data manipulation and transformation steps followed by multiple model generation. The data preparation...

Wednesday, May 6, 2015 | Tips and Tricks | Read More

The Intersection of “Data Capital” and Advanced Analytics

We’ve heard about the Three Laws of Data Capital from Paul Sonderegger at Oracle: data comes from activity, data tends to make more data, and platforms tend to win. Advanced analytics enables enterprises to take full advantage of the data their activity produces, ranging from IoT sensors and PoS transactions to social media and image/video. Traditional BI tools produce summary data from data – producing more data, but traditional BI tools provide a view of the past – what didh...

Friday, April 17, 2015 | Read More

Using rJava in Embedded R Execution

Integration with high performance programming languages is one way to tackle big data with R. Portions of the R code are moved from R to another language to avoid bottlenecks and perform expensive procedures. The goal is to balance R’s elegant handling of data with the heavy duty computing capabilities of other languages. Outsourcing R to another language can easily be hidden in R functions, so proficiency in the target language is not requisite for the users of these...

Monday, April 6, 2015 | Best Practices | Read More

Oracle Open World 2015 Call for Proposals!

It's that time of year again...submit your session proposals for Oracle OpenWorld 2015! Oracle customers and partners are encouraged to submit proposals to present at the Oracle OpenWorld 2015 conference, October 25 - 29, 2015, held at the Moscone Center in San Francisco. Details and submission guidelines are available on the Oracle OpenWorld Call for Proposals web site. The deadline for submissions is Wednesday, April 29, 11:59 p.m. PDT. We look forward to checking out your...

Monday, March 30, 2015 | Best Practices | Read More

Oracle R Distribution 3.1.1 Available for Download on all Platforms

The Oracle R Distribution 3.1.1 binaries for Windows, AIX, Solaris SPARC and Solaris x86 are now available on OSS, Oracle's Open Source Software portal. Oracle R Distribution 3.1.1 is an update to R version 3.1.0, and it includes many improvements, including upgrades to the package help system and improved accuracy importing data with large integers. The complete list of changes is in the NEWS file. To install Oracle R Distribution, follow the instructions for your platform in...

Monday, March 23, 2015 | Best Practices | Read More

Pain Point #6: “We need to build 10s of thousands of models fast to meet business objectives”

The last pain point in this series on Addressing Analytic Pain Points involves one aspect of what I call massive predictive modeling. Increasingly, enterprise customers are building a greater number of models. In past decades, producing a handful of production models per year may have been considered a significant accomplishment. With the advent of powerful computing platforms, parallel and distributed algorithms, as well as the wealth of data – Big Data – we see enterprises...

Thursday, February 12, 2015 | Best Practices | Read More

Pain Point #5: “Our company is concerned about data security, backup and recovery”

So far in this series on Addressing Analytic Pain Points, I’ve focused on the issues of data access, performance, scalability, application complexity, and production deployment. However, there are also fundamental needs for enterprise advanced analytics solutions that revolve around data security, backup, and recovery. Traditional non-database analytics tools typically rely on flat files. If data originated in an RDBMS, that data must first be extracted. Once extracted, who...

Monday, January 19, 2015 | Best Practices | Read More

Pain Point #4: “Recoding R (or other) models into SQL, C, or Java takes time and is error prone”

In the previous post in this series Addressing Analytic Pain Points, I focused on some issues surrounding production deployment of advanced analytics solutions. One specific aspect of production deployment involves how to get predictive model results (e.g., scores) from R or leading vendor tools into applications that are based on programming languages such as SQL, C, or Java. In certain environments, one way to integrate predictive models involves recoding them into one of...

Tuesday, December 23, 2014 | Best Practices | Read More

Pain Point #3: “Putting R (or other) models and results into production is ad hoc and complex”

Continuing in our series Addressing Analytic Pain Points, another concern for data scientists and analysts, as well as enterprise management, is how to leverage analytic results in production systems. These production systems can include (i) dashboards used by management to make business decisions, (ii) call center applications where representatives see personalized recommendations for the customer they’re speaking to or how likely that customer is to churn, (iii)...

Sunday, December 14, 2014 | Best Practices | Read More

Pain Point #2: “I can’t analyze or mine all of my data – it has to be sampled”

Continuing in our series Addressing Analytic Pain Points, another concern for enterprise data scientists and analysts is having to compromise accuracy due to sampling. While sampling is an important technique for data analysis, it’s one thing to sample because you choose to; it’s quite another if you are forced to sample or to use a much smaller sample than is useful. A combination of memory, compute power, and algorithm design normally contributes to this. In some cases, data...

Wednesday, November 19, 2014 | Best Practices | Read More

Pain Point #1: “It takes too long to get my data or to get the ‘right’ data”

This is the first in a series on Addressing Analytic Pain Points: “It takes too long to get my data or to get the ‘right’ data.” Analytics users can be characterized along multiple dimensions. One such dimension is how they get access to or receive data. For example, some receive data via flat files. Since we’re talking about “enterprise” users, this often means data stored in RDBMSs where users request data extracts from a DBA or more generally the IT department. Turnaround...

Friday, October 24, 2014 | Best Practices | Read More

Addressing Analytic Pain Points

If you’re an enterprise data scientist, data analyst, or statistician, and perform analytics using R or another third party analytics engine, you’ve likely encountered one or more of these pain points: Pain Point #1: “It takes too long to get my data or to get the ‘right’ data” Pain Point #2: “I can’t analyze or mine all of my data – it has to be sampled” Pain Point #3: “Putting R (or other) models and results into production is ad hoc and complex” Pain Point #4: “Recoding R (or...

Friday, October 24, 2014 | Best Practices | Read More

Oracle R Enterprise 1.4.1 Released

Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, makes the open source R statistical programming language and environment ready for the enterprise and big data. Designed for problems involving large data volumes, Oracle R Enterprise integrates R with Oracle Database. R users can execute R commands and scripts for statistical and graphical analyses on data stored in Oracle Database. R users can develop, refine, and deploy R scripts...

Monday, September 22, 2014 | Best Practices | Read More

Seismic Data Repository: on-the-fly data analysis and visualization using Oracle R Enterprise

RN-KrasnoyarskNIPIneft Establishes Seismic Information Repository for One of the World’s Largest Oil and Gas Companies. Read the complete customer story here, excerpts follow. RN-KrasnoyarskNIPIneft (KrasNIPI) is a research and development subsidiary of Rosneft Oil Company, a top oil and gas company in Russia and worldwide. KrasNIPI provides high-quality information from seismic surveys to Rosneft, delivering key information that oil and gas companies seek to lower costs,...

Wednesday, September 17, 2014 | Best Practices | Read More

Oracle R Distribution 3.1.1 Released

Oracle R Distribution version 3.1.1 has been released to Oracle's public yum today. R-3.1.1 (code name "Sock it to Me") is an update to R-3.1.0 that consists mainly of bug fixes. It also includes enhancements related to accessing package help files, improved accuracy when importing data with large integers, and better integration with RStudio graphics. The full list of new features and bug fixes is listed in the NEWS file. To install Oracle R Distribution using yum, follow...

Thursday, August 21, 2014 | Best Practices | Read More

Real-time Big Data Analytics is a reality for StubHub with Oracle Advanced Analytics

What can you use for a comprehensive platform for real-time analytics? How can you process big data volumes for near-real-time recommendations and dramatically reduce fraud? Learn in this video what StubHub achieved with Oracle R Enterprise from the Oracle Advanced Analytics option to Oracle Database, and read more on their story here. Advanced analytics solutions that impact the bottom line of a business are challenging due to the range of skills and individuals involved...

Monday, August 18, 2014 | Customers | Read More

Selecting the most predictive variables – returning Attribute Importance results as a database table

Attribute Importance (AI) is a technique of Oracle Advanced Analytics (OAA) that ranks the relative importance of predictors given a categorical or numeric target for classification or regression models, respectively. OAA AI uses the minimum description length algorithm and produces importance scores such that predictors with positive scores help predict the target, while those with zero or negative scores do not, and may even contribute noise to a model, making it less accurate. OAA AI,...

Friday, August 15, 2014 | Best Practices | Read More

For CMOs: Take Your Company’s Data to a New Level for Marketing Insights

This guest post from Phyllis Zimbler Miller, Digital Marketer, comments on uses of predictive analytics for marketing insights that could benefit from in-database scalability and ease of production deployment with Oracle R Enterprise. Does your company have tons of data, such as for how many seconds people watch each short video on your site before clicking away, and you are not yet leveraging this data to benefit your company’s bottom line? Missed opportunities can be...

Wednesday, July 30, 2014 | Best Practices | Read More

Addressing Data Order Between R and Relational Databases

Almost all data in R is a vector or is based upon vectors (vectors themselves, matrices, data frames, lists, and so forth). The elements of a vector in R have an explicit order, and each element can be individually indexed. R's in-memory processing relies on this order of elements for many computations, e.g., computing quantiles and summaries for time series objects. By design, query results in relational algebra are unordered. Repeating the same query multiple times is not...

Friday, July 25, 2014 | Best Practices | Read More

Are you experiencing analytics pain points?

At the useR! 2014 conference at UCLA in early July, which was a stimulating and well-attended conference, I spoke about Oracle’s R Technologies during the sponsor talks. One of my slides focused on examples of analytics pain points we often hear from customers and prospects. For example, “It takes too long to get my data or to get the ‘right’ data” “I can’t analyze or mine all of my data – it has to be sampled” “Putting R models and results into production is ad hoc...

Thursday, July 24, 2014 | Best Practices | Read More

StubHub Taps into Big Data for Insight into Millions of Customers’ Ticket-Buying Patterns, Fraud Detection, and Optimized Ticket Prices

What can you use for a comprehensive platform for real-time analytics? How do you drive company growth to leverage actions of millions of customers? How can you process big data volumes for near-real-time recommendations and dramatically reduce fraud? These questions, and others, were challenges faced by StubHub. Read what StubHub achieved with Oracle R Enterprise from the Oracle Advanced Analytics option to Oracle Database. Mike Barber, Senior Manager of Data Science at...

Tuesday, July 22, 2014 | Customers | Read More

Using Embedded R Execution: Imputing Missing Data While Preserving Data Structure

This guest post from Matt Fritz, Data Scientist, demonstrates a method for imputing missing values in data using Embedded R Execution with Oracle R Enterprise. Missing data is a common issue among analyses and is mitigated by imputation. Several techniques handle this process within Oracle R Enterprise; however, some bias the data or generate outputs as data objects that are less accessible than others. This post illustrates ways to effectively impute data while specifying the...

Monday, July 14, 2014 | Customers | Read More

Convert ddply {plyr} to Oracle R Enterprise, or use with Embedded R Execution

The plyr package contains a set of tools for partitioning a problem into smaller sub-problems that can be more easily processed. One function within {plyr} is ddply, which allows you to specify subsets of a data.frame and then apply a function to each subset. The result is gathered into a single data.frame. Such a capability is very convenient. The function ddply also has a parallel option that if TRUE, will apply the function in parallel, using the backend provided by...

Thursday, June 5, 2014 | Best Practices | Read More

Financial institutions build predictive models using Oracle R Enterprise to speed model deployment

See the Oracle press release, Financial Institutions Leverage Metadata Driven Modeling Capability Built on the Oracle R Enterprise Platform to Accelerate Model Deployment and Streamline Governance for a description where a "unified environment for analytics data management and model lifecycle management brings the power and flexibility of the open source R statistical platform, delivered via the in-database Oracle R Enterprise engine to support open standards compliance." Thro...

Friday, May 30, 2014 | Best Practices | Read More

R Package Installation with Oracle R Enterprise

Programming languages give developers the opportunity to write reusable functions and to bundle those functions into logical deployable entities. In R, these are called packages. R has thousands of such packages provided by an almost equally large group of third-party contributors. To allow others to benefit from these packages, users can share packages on the CRAN system for use by the vast R development community worldwide. R's package system along with the CRAN framework...

Wednesday, May 28, 2014 | FAQ | Read More

Model cross-validation with ore.CV()

In this blog post we illustrate how to use Oracle R Enterprise for performing cross-validation of regression and classification models. We describe a new utility R function ore.CV that leverages features of Oracle R Enterprise and is available for download and use. Predictive models are usually built on given data and verified on held-aside or unseen data. Cross-validation is a model improvement technique that avoids the limitations of a single train-and-test experiment by...

Monday, May 19, 2014 | Tips and Tricks | Read More

Step-by-step: Returning R statistical results as a Database Table

R provides a rich set of statistical functions that we may want to use directly from SQL. Many of these results can be readily expressed as structured table data for use with other SQL tables, or for use by SQL-enabled applications, e.g., dashboards or other statistical tools. In this blog post, we illustrate in a sequence of five simple steps how to go from an R function to a SQL-enabled result. Taken from a recent "proof of concept" customer engagement, our example involves...

Sunday, April 27, 2014 | Best Practices | Read More

ore.doEval and ore.tableApply: Which one is right for me?

When beginning to use Oracle R Enterprise, users quickly grasp the techniques and benefits of using embedded R to run scripts in database-side R engines, and gain a solid sense of the available functions for executing R scripts through Oracle Database. However, various embedded R functions are closely related, and a few tips can help in learning which functions are most appropriate for the problem you wish to solve. In this post, we'll demonstrate best practices for two of...

Friday, April 25, 2014 | Best Practices | Read More

Oracle's Strategy for Advanced Analytics

At Oracle our goal is to enable you to get timely insight from all of your data. We continuously enhance Oracle Database to allow workloads that have traditionally required extracting data from the database to run in-place. We do this to narrow the gap that exists between insights that can be obtained and available data - because any data movement introduces latencies, complexity due to more moving parts, the ensuing need for data reconciliation and governance, as well as...

Wednesday, April 16, 2014 | Read More

Why choose Oracle for Advanced Analytics?

If you're an enterprise company, chances are you have your data in an Oracle database. You chose Oracle for its global reputation for providing the best software products (and now engineered systems) to support your organization. Oracle Database is known for stellar performance and scalability, and Oracle delivers world class support. If your data is already in Oracle Database or moving in that direction, leverage the high performance computing environment of the database to...

Thursday, March 27, 2014 | Best Practices | Read More

ROracle 1-1.11 released - binaries for Windows and other platforms available on OTN

We are pleased to announce the latest update of the open source ROracle package, version 1-1.11, with enhancements and bug fixes. ROracle provides high performance and scalable interaction from R with Oracle Database. In addition to availability on CRAN, ROracle binaries specific to Windows and other platforms can be downloaded from the Oracle Technology Network. Users of ROracle, please take our brief survey. We want to hear from you! Latest enhancements in version 1-1.11 of...

Thursday, March 20, 2014 | Best Practices | Read More

Oracle R Enterprise Upgrade Steps

We've recently announced that Oracle R Enterprise 1.4 is available on all platforms. To upgrade Oracle R Enterprise to the latest version: 1. *Install the version of R that is required for the new version of Oracle R Enterprise. See the Oracle R Enterprise supported platforms matrix for the latest requirements. 2. Update Oracle R Enterprise Server on the database server by running the install.sh script and follow the prompts for the upgrade path. 3....

Monday, March 17, 2014 | Best Practices | Read More

Oracle R Enterprise 1.4 Released

We’re pleased to announce that Oracle R Enterprise (ORE) 1.4 is now available for download on all supported platforms. In addition to numerous bug fixes, ORE 1.4 introduces an enhanced high performance computing infrastructure, new and enhanced parallel distributed predictive algorithms for both scalability and performance, added support for production deployment, and compatibility with the latest R versions.  These updates enable IT administrators to easily migrate the...

Sunday, March 16, 2014 | News | Read More

Oracle R Distribution 3.0.1 Benchmarks

Oracle R Distribution, Oracle's distribution of Open Source R, improves performance by dynamically linking to optimized, multi-threaded BLAS libraries. Unlike open source R, Oracle R Distribution uses all available cores and processors when dynamically linked against optimized BLAS, resulting in increased performance. Thus, the more cores available to Oracle R Distribution, the higher performance for many operations. How is this possible?  Standard R's internal BLAS library wa...

Tuesday, March 11, 2014 | Best Practices | Read More

Low-Rank Matrix Factorization in Oracle R Advanced Analytics for Hadoop

This guest post from Arun Kumar, a graduate student in the Department of Computer Sciences at the University of Wisconsin-Madison, describes work done during his internship in the Oracle Advanced Analytics group. Oracle R Advanced Analytics for Hadoop (ORAAH), a component of Oracle’s Big Data Connectors software suite, is a collection of statistical and predictive techniques implemented on Hadoop infrastructure. In this post, we introduce and explain techniques for a popular...

Tuesday, February 18, 2014 | Best Practices | Read More

Invoking R scripts via Oracle Database: Theme and Variation, Part 6

How can I use "group apply" to partition data over multiple columns for parallel execution? How can I use R for statistical computations and return results as a database table? In this blog post of our theme and variation series, we answer these two questions through several examples, highlighting both R and SQL interfaces. So far in this blog series on Oracle R Enterprise embedded R execution we've covered: • Part 1: ore.doEval / rqEval • Part 2: ore.tableApply / rqTableEval •...

Tuesday, February 4, 2014 | Best Practices | Read More

Invoking R scripts via Oracle Database: Theme and Variation, Part 5

In the first four parts of Invoking R scripts via Oracle Database: Theme and Variation, we introduced features of Oracle R Enterprise embedded R execution involving the functions ore.doEval / rqEval, ore.tableApply / rqTableEval, ore.groupApply / “rqGroupApply”, and ore.rowApply / rqRowEval. In this blog post, we cover ore.indexApply. Note that there is no corresponding rqIndexEval – more on that later. The “index apply” function is also one of the parallel-enabled embedded R...

Monday, January 20, 2014 | Best Practices | Read More