
Best practices, news, tips and tricks - learn about Oracle's R Technologies for Oracle Database and Big Data

Recent Posts

R Package Compilation with C++11 on Linux

When installing R packages on Linux 6, a common issue is that the native GCC compiler is not sufficient for building C++ 2011 standard code. In some cases, the R package developer identifies the C++11 requirement on the package's CRAN page or in the compilation log. Or, the build log may simply return errors indicating that C++11 support is not available:

cc1plus: error: unrecognized command line option "-std=c++11"

To work around this issue, you can install and enable the Linux Developer Toolset, or devtoolset, from the Oracle Linux Software Collection Library. Devtoolset is a back-port of modern development tools to older Linux versions. It provides current versions of the GNU Compiler Collection, GNU Debugger, and other development, debugging, and performance monitoring tools. It does not conflict with the operating system's native GCC compiler tools because the bits are installed to a non-default location, namely /opt/rh.

For example, when installing the open source R package ranger on Linux 6.9, the build fails with:

Error: ranger requires a real C++11 compiler, e.g., gcc >= 4.7 or Clang >= 3.0. You probably have to update your C++ compiler.

Linux 6.9 ships GCC 4.4.7 by default, which does not support C++11 code:

# gcc --version
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-18)
Copyright (C) 2010 Free Software Foundation, Inc.

GCC 4.8.1 was the first feature-complete implementation of the 2011 C++ standard, previously known as C++0x. To resolve this problem, install the devtoolset on Oracle Linux 6 as follows.

1. As root, download the Oracle public yum Linux 6 repository file:

# wget https://public-yum.oracle.com/public-yum-ol6.repo

2. Determine which operating system release you are running so you can enable the matching repositories. For example, with Oracle Linux 6.9:

# cat /etc/oracle-release
Oracle Linux Server release 6.9

3. In the repository file, enable the Linux 6.9 and software collections repositories by setting enabled to 1:

[ol6_u9_base]
name=Oracle Linux $releasever Update 9 installation media copy ($basearch)
baseurl=https://yum.oracle.com/repo/OracleLinux/OL6/9/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1

[ol6_software_collections]
name=Software Collection Library release 3.0 packages for Oracle Linux 6 (x86_64)
baseurl=https://yum.oracle.com/repo/OracleLinux/OL6/SoftwareCollections/x86_64/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1

4. Next, install the devtoolset packages. For Oracle Linux 6, the current version is devtoolset-7:

# yum install devtoolset-7

The installation completes with the output:

Installed:
  devtoolset-7.x86_64 0:7.1-4.el6
Complete!

5. To use devtoolset-7, enable it by executing:

# source /opt/rh/devtoolset-7/enable

Now we have access to gcc 7.3.1, which supports C++11 code:

# gcc --version
gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
Copyright (C) 2017 Free Software Foundation, Inc.

6. The ranger package now compiles successfully:

R> install.packages("ranger")
..
..
* DONE (ranger)
Making 'packages.html' ... done

Note that while these instructions were developed for Linux 6, they also work for Linux 7 whenever newer components in the GNU development tool chain are required.
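If you want to confirm from within R which compiler a package build will pick up, a quick check like the following can help. This is a minimal sketch, assuming devtoolset-7 has been installed and that R was started from a shell in which /opt/rh/devtoolset-7/enable was already sourced:

# Verify that the devtoolset compiler is first on the PATH inherited by R,
# then retry the failing install.
system("which g++")        # with devtoolset enabled, this should point under /opt/rh/devtoolset-7
system("g++ --version")    # should report a 7.x version
install.packages("ranger")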


R Technologies

R to Oracle Database Connectivity: Use ROracle for both Performance and Scalability (2018)

R users have a few choices for connecting to Oracle Database. The most common are RODBC, RJDBC, and ROracle. However, these three packages have significantly different performance and scalability characteristics, which can greatly impact your application development. In this blog, we discuss these options and highlight performance benchmark results on a wide range of data sets. This performance benchmark post is an update to our 2013 blog post and uses Exadata X7-2 with ROracle 1.3-1. Both the performance and scalability of RODBC and RJDBC have improved since our first benchmark, but ROracle remains the overall leader. I'd like to acknowledge the contributions of Rajendra Pingte, who provided the raw data for this blog.

By way of introduction, RODBC is an R package that implements ODBC database connectivity. There are two groups of functions: the largely internal odbc* functions that implement low-level access to the correspondingly named ODBC functions, and the higher-level sql* functions that support reading, saving, copying, and manipulating data between R data.frame objects and database tables. Here is an example using RODBC:

library(RODBC)
con <- odbcConnect("DD1", uid="rquser", pwd="rquser", rows_at_time = 500)
sqlSave(con, test_table, "TEST_TABLE")
sqlQuery(con, "select count(*) from TEST_TABLE")
d <- sqlQuery(con, "select * from TEST_TABLE")
close(con)

The R package RJDBC is an implementation of the R DBI package - database interface - that uses JDBC as the back-end connection to the database. Any database that provides a JDBC driver can be used with RJDBC. Here is an example using RJDBC:

library(RJDBC)
drv <- JDBC(driverClass="oracle.jdbc.OracleDriver", classPath="…/ojdbc8.jar", " ")
con <- dbConnect(drv, "jdbc:oracle:thin:@myHost:1521:db", "rquser", "rqpasswd")
dbWriteTable(con, "TEST_TABLE", test_table)
dbGetQuery(con, "select count(*) from TEST_TABLE")
d <- dbReadTable(con, "TEST_TABLE")
dbDisconnect(con)

The ROracle package is an implementation of the R DBI package that uses Oracle Call Interface (OCI) for high performance and scalability with Oracle Database. It requires Oracle Instant Client or Oracle Database Client to be installed on the client machine. Here is an example using ROracle:

library(ROracle)
drv <- dbDriver("Oracle")
con <- dbConnect(drv, "rquser", "rqpasswd")
dbWriteTable(con, "TEST_TABLE", test_table)
dbGetQuery(con, "select count(*) from TEST_TABLE")
d <- dbReadTable(con, "TEST_TABLE")
dbDisconnect(con)

Notice that since both RJDBC and ROracle implement the DBI interface, their code is similar except for the driver and connection details.

To compare the performance of these interfaces, we prepared tests along several dimensions:

Data tables
- Number of rows: 1K, 10K, 100K, and 1M
- Number of columns: 10, 100, 1000
- Data types: NUMBER, BINARY_DOUBLE, TIMESTAMP, and VARCHAR2
- Numeric data is randomly generated; all character data is 10 characters long.

Packages supporting RODBC, RJDBC, and ROracle:
DBI_1.0.0.tar.gz, RODBCDBI_0.1.1.tar.gz, RODBCext_0.3.1.tar.gz, odbc_1.1.6.tar.gz, RJDBC_0.2-7.1.tar.gz, RODBC_1.3-15.tar.gz, ROracle_1.3-1.tar.gz, rJava_0.9-10.tar.gz

Types of operations: select, create, insert, and connect

Loading database data into an R data.frame

Where an in-database R capability as provided by Oracle R Enterprise is not available, R users typically pull data to the R client for data exploration, preparation, modeling, etc.
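The full benchmark harness is not shown in this post, but as a simple illustration, here is a minimal sketch of how one such data pull could be timed with ROracle. It assumes the TEST_TABLE created in the ROracle example above; the other interfaces can be timed the same way using their respective read calls:

# Time a full-table pull into an R data.frame with ROracle.
library(ROracle)
con <- dbConnect(dbDriver("Oracle"), "rquser", "rqpasswd")
elapsed <- system.time(d <- dbReadTable(con, "TEST_TABLE"))["elapsed"]
print(elapsed)
dbDisconnect(con)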
In Figure 1, we compare the execution time to pull 10, 100, and 1000 columns of data from 1K, 10K, 100K, and 1M rows of BINARY_DOUBLE data on a log-log scale. Notice that ROracle is consistently faster than both RODBC and RJDBC, and that all three interfaces scale largely linearly. The driver is labeled with a ".get" suffix to flag that this result uses dbGetQuery(), as opposed to dbSendQuery() with a fetch; however, both approaches produced comparable results, so the dbSendQuery() results are omitted. Below, we discuss the full set of benchmark results in greater detail.

Figure 1: Comparison of RJDBC, RODBC, and ROracle for BINARY_DOUBLE for select *

In Figure 2, we provide the benchmark results for RODBC, RJDBC, and ROracle across the four data types. Notice that ROracle provides the best performance for all data types except TIMESTAMP. With TIMESTAMP, RJDBC performs consistently better, excluding the smallest data set. We have found that the first call to RJDBC can incur a Java startup cost that goes away on subsequent executions; this also appears for VARCHAR2 as depicted below. ROracle is consistently faster than RODBC: up to 11X faster for BINARY_DOUBLE data, up to 3.4X faster for NUMBER data, up to 5.7X faster for VARCHAR2 data, and up to 4.9X faster for TIMESTAMP data. Compared with RJDBC, ROracle is up to 48X faster on BINARY_DOUBLE data (mean of 17X), up to 10X faster on NUMBER data (mean of 3X), and up to 36X faster on VARCHAR2 data (mean of 6X). As noted above, TIMESTAMP is where RJDBC performs faster than ROracle, because RJDBC treats TIMESTAMP data as a character string rather than as POSIXct. Note that RODBC and RJDBC limit VARCHAR2 columns to 255 characters, whereas ROracle creates VARCHAR2(4000) columns.

Figure 2: Comparison of RJDBC, RODBC, and ROracle for select * from <table>

For reference, the data set sizes (in megabytes) are captured in Table 1 for all data types.

Table 1: Dataset sizes in megabytes

Creating database tables from an R data.frame

Data or results created in R may need to be written to a database table. In Figure 3, we compare the execution time to create tables with 10, 100, and 1000 columns and 1K, 10K, 100K, and 1M rows for each data type, performing row inserts in batches of 1000 rows. In all cases, RODBC had the slowest create times. ROracle is up to 198X faster (mean of 23X, median of 9X) across all entries. Compared with RJDBC, ROracle is up to 5.6X faster (mean and median approaching 2X) across all entries. Note that RJDBC does not support the TIMESTAMP data type for CREATE.

Figure 3: Comparison of RJDBC, RODBC, and ROracle for creating a table

Inserting rows into database tables from an R data.frame

In other cases, data or results created in R may need to be inserted into existing database tables. In Figure 4, we compare the execution time to insert data.frames with 1K, 10K, 100K, and 1M rows into existing tables with 10, 100, and 1000 columns for each data type as before, in batches of 1000 rows. Similar to CREATE, RODBC has the slowest performance. ROracle is up to 243X faster (mean of 34X, median of 12.4X) across all entries. Compared with RJDBC, ROracle is up to 18X faster (mean of 2.6X, median of 2X). As for CREATE, RJDBC does not support the TIMESTAMP data type for INSERT.

Figure 4: Comparison of RJDBC, RODBC, and ROracle for inserting rows into a table

Connecting to Oracle Database

Depending on the application, sub-second response time may be sufficient to meet application database connection requirements.
As depicted in Figure 5, ROracle and RJDBC require minimal time (0.05 seconds) to establish a database connection, compared to RODBC, which is 8X slower.

Figure 5: Database connection times for ROracle, RJDBC, and RODBC

In summary, ROracle supports a wide range of application needs for performance and scalability. While RJDBC and RODBC have improved their overall performance, ROracle remains the best choice for Oracle Database.

All tests were performed on a quarter-rack Exadata X7-2. Oracle Database was version 12.1.0.1 with R 3.5.0. For JDBC, we increased the Java VM heap size with:

options(java.parameters = "-Xmx80g")
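For reference, connection-establishment time like that shown in Figure 5 can be measured with a few lines of R. This is a sketch only, assuming the same credentials as the ROracle example above; it is not the harness used for the figures:

# Time how long it takes ROracle to establish a database connection.
library(ROracle)
drv <- dbDriver("Oracle")
connect.time <- system.time(con <- dbConnect(drv, "rquser", "rqpasswd"))["elapsed"]
print(connect.time)
dbDisconnect(con)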


News

Should R Consortium Recommend CII Best Practices Badge for R Packages? Latest Survey Results

Following our Fall 2017 survey, in which the R Consortium asked about opportunities, concerns, and issues facing the R community, the R Consortium conducted a new survey this past month to solicit feedback on using the Linux Foundation (LF) Core Infrastructure Initiative (CII) Best Practices Badge Program for R packages. The R Consortium will base its recommendation for using the CII on your feedback. Your feedback will also help us and the Linux Foundation evolve the CII with the needs of the R Community, and FLOSS projects in general, in mind.

Introduction

With over 12,000 R packages on CRAN alone, choosing which package to use for a given task is challenging. While summary descriptions, documentation, download counts, and word of mouth may help direct selection, a standard assessment of package quality can greatly help identify the suitability of a package for a given need - commercial, academic, or otherwise. Providing the R Community of package users an easily recognized badge indicating the level of quality achievement would make it easier for users to gauge the quality of a package along several dimensions. In addition, providing R package authors and maintainers a checklist of "best practices" can help guide package development and evolution, as well as help package users know what to look for in a package.

The R Consortium has been exploring the pros and cons of recommending that R package authors, contributors, and maintainers adopt the Linux Foundation (LF) Core Infrastructure Initiative (CII) Best Practices badge. This badge provides a means for Free/Libre and Open Source Software (FLOSS) projects to highlight the extent to which package authors follow best software practices, while enabling individuals and enterprises to quickly assess a package's strengths and weaknesses across a range of dimensions. The CII Best Practices Badge Program is voluntary self-certification: there is no cost to submit a questionnaire and earn a badge. An easy-to-use web application guides users through the process, even automating some of the steps. More information on the CII Best Practices Badge Program is available, including the criteria on GitHub, project statistics, criteria statistics, and videos. The projects page shows participating projects and supports queries (e.g., you can see projects that have a passing badge).

What did we learn?

Will the CII Best Practices Badge Program provide value to the R Community's package developers or package users? 90% of survey respondents say 'yes', with 77% saying it has benefit for both developers and users. Perhaps not surprisingly, 95% of respondents had never heard of the CII before, but 74% would be willing to try it. This is according to 41 respondents, 56% of whom have been developing R packages for 4 years or more, and over 60% of whom have developed two or more packages. Of the six categories covered by the CII - licensing, documentation, change control, software quality, security, and code analysis - over 55% of respondents found all criteria to be somewhat or highly beneficial. Over 80% found the documentation and software quality criteria to be somewhat or highly beneficial.

In an open-ended question, we asked respondents why the CII is good for the R Community. Here is a summary of the responses. The CII:

- helps users discover and select R packages that adhere to software development best practices.
- shows R developers, through the badge criteria, what is possible or desirable for FLOSS, especially if developers do not have a software engineering background.
- provides an additional degree of assurance to the user community around package quality, as well as a way for developers to assert more formally that they follow such best practices.
- gathers and presents lessons learned from other FLOSS projects so developers don't need to re-discover them.
- creates an incentive to adopt a consistent set of practices throughout the R ecosystem.

If you're a package developer, we hope you'll join other package developers and start your own CII Best Practices Badge. The survey will remain open to collect your feedback on the experience. See an extended version of this blog post at the R Consortium blog for more survey result details.


Best Practices

Data Science Maturity Model - Summary Table for Enterprise Assessment (Part 12)

This installment of the Data Science Maturity Model (DSMM) blog series contains a summary table of the dimensions and levels. Enterprises embracing data science as a core competency may want to evaluate what level they have achieved relative to each dimension - in some cases, an enterprise may straddle more than one level. As a next step, the enterprise may use this maturity model to identify a level in each dimension to which they aspire, or fashion a new Level 6.

Strategy - What is the enterprise business strategy for data science?
Level 1: Enterprise has no governing strategy for applying data science
Level 2: Enterprise is exploring the value of data science as a core competency
Level 3: Enterprise recognizes data science as a core competency for competitive advantage
Level 4: Enterprise embraces a data-driven approach to decision making
Level 5: Data are viewed as an essential corporate asset - data capital

Roles - What roles are defined and developed in the enterprise to support data science activities?
Level 1: Traditional data analysts explore and summarize data using deductive techniques
Level 2: Introduction of the 'data scientist' role and corresponding skill sets to begin leveraging advanced, inductive techniques
Level 3: Chief Data Officer (CDO) role is introduced to help manage data as a corporate asset
Level 4: Data scientist career path is codified and standardized across the enterprise
Level 5: Chief Data Science Officer (CDSO) role introduced

Collaboration - How do data scientists collaborate with others in the enterprise, e.g., business analysts, application and dashboard developers, to evolve and hand off data science work products?
Level 1: Data analysts often work in silos, performing work in isolation and storing data and results in local environments
Level 2: Greater collaboration exists between IT and line-of-business organizations
Level 3: Recognized need for greater collaboration among the various players in data science projects
Level 4: Broad use of tools introduced to enable sharing, modifying, tracking, and handing off data science work products
Level 5: Standardized tools introduced across the enterprise to enable seamless collaboration

Methodology - What is the enterprise approach or methodology to data science?
Level 1: Data analytics are focused on business intelligence and data visualization using an ad hoc methodology
Level 2: Data analytics are expanded to include machine learning and predictive analytics for solving business problems, but still using an ad hoc methodology
Level 3: Individual organizations begin to define and regularly apply a data science methodology
Level 4: Basic data science methodology best practices established for data science projects
Level 5: Data science methodology best practices formalized across the enterprise

Data Awareness - How easily can data scientists learn about enterprise data resources?
Level 1: Users of data have no systematic way of learning what data assets are available in the enterprise
Level 2: Data analysts and data scientists seek additional data sources through "key people" contacts
Level 3: Existing enterprise data resources are cataloged and assessed for quality and utility for solving business problems
Level 4: Enterprise introduces metadata management tool(s)
Level 5: Enterprise standardizes on a metadata management tool and institutionalizes its use for all data assets

Data Access - How do data analysts and data scientists request and access data? How is data access controlled, managed, and monitored?
Level 1: Data analysts typically access data via flat files obtained explicitly from IT or other sources
Level 2: Data access available via direct programmatic database access
Level 3: Data scientists have authenticated, programmatic access to large volume data, but database administrators struggle to manage the data access life cycle
Level 4: Data access is more tightly controlled and managed with identity management tools
Level 5: Data access lineage tracking enables unambiguous data derivation and source identification

Scalability - Do the tools scale and perform for data exploration, preparation, modeling, scoring, and deployment? As data, data science projects, and the data science team grow, is the enterprise able to support these adequately?
Level 1: Data volumes are typically "small" and limited by desktop-scale hardware and tools, with analytics performed by individuals using simple workflows
Level 2: Data science projects take on greater complexity and leverage larger data volumes
Level 3: Individual groups adopt varied scalable data science tools and provide greater hardware resources for data scientist use
Level 4: Enterprise standardizes on an integrated suite of scalable data science tools and dedicates sufficient hardware capacity to data science projects
Level 5: Data scientists have on-demand access to elastic compute resources both on premises and in the cloud with highly scalable algorithms and infrastructure

Asset Management - How are data science assets managed and controlled?
Level 1: Analytical work products are owned, organized, and maintained by individual data science players
Level 2: Initial efforts are underway to provide security, backup, and recovery of data science work products
Level 3: Data science work product governance is systematically being addressed
Level 4: Data science work product governance is firmly established at the enterprise level with increasing support for model management
Level 5: Systematic management of all data science work products with full support for model management

Tools - What tools are used within the enterprise for data science objectives? Can data scientists take advantage of open source tools in combination with high performance and scalable production quality infrastructure?
Level 1: An ad hoc array of non-scalable tools is predominantly used for isolated data analysis on desktop machines
Level 2: Enterprise manages data through database management systems and relies on extensive open source libraries along with specialized commercial tools
Level 3: Enterprise seeks scalable tools to support data science projects involving large volume data
Level 4: Enterprise standardizes on a suite of tools to meet data science project objectives
Level 5: Enterprise regularly assesses state-of-the-art algorithms, methodologies, and tools for improving solution accuracy, insights, and performance, along with data scientist productivity

Deployment - How easily can data science work products be placed into production to meet timely business objectives?
Level 1: Data science results have limited reach and hence provide limited business value
Level 2: Production model deployment is seen as valuable, but often involves reinventing infrastructure for each project
Level 3: Enterprise begins leveraging tools that provide simplified, automated model deployment, inclusive of open source software and environments
Level 4: Increased heterogeneity of enterprise systems requires cross-platform model deployment, with a growing need to incorporate models into streaming data applications
Level 5: Enterprise has realized benefits of immediate data science work product (re)deployment across heterogeneous environments

Click here for the Data Science Maturity Model spreadsheet and here for the whitepaper. I hope you found this series useful and welcome hearing from you regarding your experience using this Data Science Maturity Model.


Best Practices

Data Science Maturity Model - Deployment (Part 11)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'deployment': How easily can data science work products be placed into production to meet timely business objectives?

Data science comes with the expectation that amazing insights and predictions will transform the business and take the enterprise to a new level of performance. Too often, however, data science projects fail to "lift off," resulting in significant opportunity cost for the enterprise. A data scientist may produce a predictive model with high accuracy; however, if that model's scores are not effectively put into production, i.e., deployed, or deployment is significantly delayed, the desired gains are not realized. A more general definition of 'deployment' that seems relevant in this discussion is "the action of bringing resources into effective action." The resources in this context are data science work products such as machine learning models, visualizations, statistical analyses, etc. Effective action means delivering these resources in a way that provides business benefit: timely insights presented in interactive dashboards, predictions affecting which actions enterprises will undertake with respect to customers, employees, assets, etc. For data science in general, and machine learning in particular, much of the deployment mechanism - or plumbing - is the same across projects. Yet enterprises often find individual projects re-inventing deployment infrastructure, requiring logic for data access, spawning separate analytic engines, and handling recovery, along with (often missing) rigorous testing. Leveraging tools that provide such plumbing can greatly reduce the overhead and risk of deploying data science projects.

The 5 maturity levels of the 'deployment' dimension are:

Level 1: Data science results have limited reach and hence provide limited business value. At Level 1 enterprises, results from data science projects often take the form of insights documented in slide presentations or textual reports. Data analyses, visualizations, and even predictive models may provide guidance for human decision making, but such results must be manually conveyed on a per-project basis.

Level 2: Production model deployment is seen as valuable, but often involves reinventing infrastructure for each project. In Level 2 enterprises, the realization that machine learning models can and should be leveraged in front-line applications and systems takes hold. Some insights may be explicitly coded into application or dashboard logic; however, the time between model creation and deployment can significantly impact model accuracy. This deployment latency becomes costly when the patterns in the data used for model building diverge from the current data used for scoring. Moreover, manually coding, e.g., predictive model coefficients for scoring in C, Java, or even SQL, for easier integration with existing applications or dashboards takes developer time and can introduce coding errors that only rigorous code reviews and testing can reveal. As a result, enterprises incur costs for data science projects but do not fully realize potential project benefits.

Level 3: Enterprise begins leveraging tools that provide simplified, automated model deployment, inclusive of open source software and environments. As more data science projects are undertaken, the Level 3 enterprise realizes that one-off deployment approaches waste valuable development resources, incur deployment latency that reduces model effectiveness, and increase project risk. In today's internet-enabled world, patterns in data, e.g., customer preferences, can change overnight, requiring enterprises to have greater agility to build, test, and deploy models using the latest data. Enterprises at Level 3 begin to leverage tools that provide the needed infrastructure to support simplified and automated model deployment.

Level 4: Increased heterogeneity of enterprise systems requires cross-platform model deployment, with a growing need to incorporate models into streaming data applications. The Level 4 enterprise has a combination of database, Hadoop, Spark, and other platforms for managing data and computation. Increasingly, the enterprise needs models and scripts produced in one environment to be deployed in another. This increases the need for tools that enable exporting models for use in a scoring engine library that can be easily integrated into applications. Level 4 enterprises seek tools that facilitate script and model deployment in real-time or streaming analytics situations as they begin to use data science results involving fast data.

Level 5: Enterprise has realized benefits of immediate data science work product (re)deployment across heterogeneous environments. The Level 5 enterprise has adopted a standard set of tools to support deployment of data science work products across all necessary environments. Machine learning models and scripts created in one environment can be immediately deployed and refreshed (redeployed) with minimal latency.

In my next post, I'll provide a summary Data Science Maturity Model table and a corresponding spreadsheet to aid enterprises in conducting a DSMM assessment.


Best Practices

Data Science Maturity Model - Tools Dimension (Part 10)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'tools': What tools are used within the enterprise for data science? Can data scientists take advantage of open source tools in combination with high performance and scalable production quality infrastructure?

A wide range of tools support data science, spanning open source to proprietary, relational database to "big data" platforms, and simple analytics to complex machine learning. Tools may support isolated activities or be highly collaborative, and enable anything from modeling in the small to massive predictive modeling with full model management. Orthogonal to each of these is the scale at which these tools can perform. Some tools and algorithm implementations perform well for small or even moderately sized data, but fail or become unusable when presented with larger data volumes. For larger data, special parallel, distributed implementations are necessary to leverage multi-node/processor machines and machine clusters. Seldom will a single tool provide all required functionality, which is usually delivered by a mix of commercial and open source tools. However, enterprises require commercial support for the tools they adopt. As a result, commercial tools that integrate with open source tools and provide support for data- and task-parallel execution along with ease of deployment are highly desired.

The 5 maturity levels of the 'tools' dimension are:

Level 1: An ad hoc array of non-scalable tools is predominantly used for isolated data analysis on desktop machines. Data science players at Level 1 use traditional desktop tools for data analysis, relying heavily on spreadsheet-based tools along with various open source tools for analytics and visualization.

Level 2: Enterprise manages data through database management systems and relies on extensive open source libraries along with specialized commercial tools. Level 2 enterprises, taking data management more seriously, introduce relational database management software tools. Data science projects also benefit from the broader open source package ecosystem for advanced data exploration, statistical analysis, visualization, and predictive analytics / machine learning. However, at Level 2, there is little integration between commercial and open source tools, and performance and scalability are an issue for data science projects.

Level 3: Enterprise seeks scalable tools to support data science projects involving large volume data. Data science projects at Level 3 enterprises are hindered by the performance and scalability of existing software and environments. A concerted effort is made to evaluate and acquire commercial tools with a range of scalable machine learning algorithms and techniques to complement open source techniques and facilitate production deployment. Data science players may begin to explore Big Data platforms to address new sources of high volume data, scalability, and cost reduction. Cloud-based tools are also under review. As data science projects grow in complexity, involving larger team efforts, tools supporting collaboration become a recognized need.

Level 4: Enterprise standardizes on a suite of tools to meet data science project objectives. The Level 4 enterprise understands what data science players and projects need to meet business objectives. Enhanced productivity requires scalable tools that support collaboration and work with data from a wide range of sources. Automation and integration play a major role in enhancing productivity, so tools that avoid paradigm shifts and automate tasks in data exploration, preparation, machine learning, and graph and spatial analytics are particularly valuable. Adopted tools are available or function across multiple platforms, including on-premises and cloud. As machine learning models have become a focal point for data science projects, adopted tools must support full model management.

Level 5: Enterprise regularly assesses state-of-the-art algorithms, methodologies, and tools for improving solution accuracy, insights, and performance, along with data scientist productivity. Level 5 enterprises optimize their data science tool environment. Having understood at Level 4 what is required for effective data science projects and data science player productivity, these enterprises work with tool providers to further enhance those tools to meet business objectives.

In my next post, we'll cover the 'deployment' dimension of the Data Science Maturity Model, the last dimension in this series.


Returning Tables from Embedded R Execution .... Simplified

In this tips and tricks blog, we share some techniques from our own use of Oracle R Enterprise on data science projects that you may find useful in your projects. This time, we focus on automating the process of returning the data frame schema from the output of embedded R execution runs.

Embedded R Execution

ORE embedded R execution provides a powerful and convenient way to execute custom R scripts at the database server, from either R or SQL. It also enables running those scripts in a data-parallel or task-parallel manner. With embedded R execution, the user can also call any third-party R package and spawn one or more R engines to run a user-defined R function in parallel. A more detailed explanation can be found in the blog post Introduction to ORE Embedded R Script Execution.

The ORE embedded R execution functions, e.g., ore.tableApply and ore.rowApply, provide flexible choices of objects to return, such as R objects, models, etc. One of the most popular choices is an ORE frame from R, or a table from SQL. In order to get an ORE frame as a return object, we need to supply the schema (column names and types) to the embedded R function in the argument FUN.VALUE. For instance, consider the dataset iris:

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1            5.1         3.5          1.4         0.2    setosa
2            4.9         3.0          1.4         0.2    setosa
3            4.7         3.2          1.3         0.2    setosa
4            4.6         3.1          1.5         0.2    setosa
5            5.0         3.6          1.4         0.2    setosa
6            5.4         3.9          1.7         0.4    setosa
...
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica

We want to build a model for each iris species, say, predicting the feature Sepal.Length, and return the coefficients associated with each species as a data.frame. To this end, the ore.groupApply function can build a linear regression model on the data for each species, and do so in parallel. Consider the following code:

grpAp2 <- ore.groupApply(IRIS, IRIS$Species,
  function(df) {
    if (nrow(df) == 0) {
      species <- character()
      cf <- numeric()
      names(cf) <- character()
    } else {
      species <- as.character(df$Species[1])
      cf <- coef(lm(Sepal.Length ~ ., data = df[1:4]))
    }
    data.frame(Species = species, CoefName = names(cf),
               CoefValue = unname(cf), stringsAsFactors = FALSE)
  },
  FUN.VALUE = data.frame(Species = character(), CoefName = character(),
                         CoefValue = numeric(), stringsAsFactors = FALSE),
  parallel = TRUE)

To get back an ore.frame from ore.groupApply, we need to supply FUN.VALUE with the schema of the returned ore.frame. This schema contains 3 columns: Species, CoefName, and CoefValue. This is required if we want the output of the ore.groupApply function to be an ore.frame; otherwise it will be a list of the component data.frames. The data frame schema also requires us to specify the type of each column, such as character(), numeric(), etc. This is fine when the data frame has only a few columns, but in practice there are many use cases where the returned ORE frame has a large number of columns. Even when the number of columns is not large, specifying the type of each column takes some effort. In most cases, the user can first run the user-defined function locally and then obtain the schema from the first row of the output data. However, this means repeating a similar process every time, and the user needs to pull the data from the database, which may not be efficient.
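For illustration, here is a minimal sketch of that manual approach (an illustration only, not code from the original post; the helper name build_coefs is ours): run the grouped function locally on one species of the in-memory iris data and keep a zero-row slice of its output as the FUN.VALUE schema.

# Run the user-defined function locally on one group of the local iris data,
# then take a zero-row slice of the result to use as FUN.VALUE.
build_coefs <- function(df) {
  species <- as.character(df$Species[1])
  cf <- coef(lm(Sepal.Length ~ ., data = df[1:4]))
  data.frame(Species = species, CoefName = names(cf),
             CoefValue = unname(cf), stringsAsFactors = FALSE)
}
local.schema <- build_coefs(iris[iris$Species == "setosa", ])[0, ]
str(local.schema)   # zero rows, but column names and types are defined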
To automate this process and run it at the database server machine, we introduce a convenience function, ore.getApplySchema, to illustrate working with embedded R execution. The idea is to supply a representative subset data frame as input to the function. The schema is then produced from the output and returned to the user, who can use it when calling the embedded R execution function to return an ore.frame. While it is not difficult to implement this process with a few lines of R code, we wanted to provide a general-purpose convenience function to avoid writing such code each time and to illustrate working with embedded R execution. Our ore.getApplySchema function below supports rowApply, tableApply, and groupApply; it could be expanded to work with doEval and indexApply as well. Note that the type of the embedded R execution function needs to be specified because the input and the code to retrieve the schema differ across functions.

ore.getApplySchema <- function(ODF, function_name, ..., col = NULL, row.num = NULL,
                               type = 'rowApply', ore.connect.flag = FALSE) {
  # check whether the script has been loaded into the repository
  res <- ore.scriptList(name = function_name)
  if (nrow(res) == 0) {
    stop('Function does not exist in the repository!')
  }
  rownames(ODF) <- ODF[, 1]
  if (is.null(row.num)) INPUT <- ODF else INPUT <- ODF[row.num, ]
  switch(type,
    rowApply = {
      res <- ore.rowApply(INPUT, FUN.NAME = function_name, ...,
                          ore.connect = ore.connect.flag)
    },
    tableApply = {
      res <- ore.tableApply(INPUT, FUN.NAME = function_name, ...,
                            ore.connect = ore.connect.flag)
    },
    groupApply = {
      if (is.null(col)) stop("group apply requires the col information!")
      res <- ore.groupApply(INPUT, INPUT[, col], FUN.NAME = function_name, ...,
                            ore.connect = ore.connect.flag)
    }
  )
  if (type == 'tableApply') {
    schema <- ore.pull(res)[0, ]
  } else {
    schema <- res[[1]][0, ]
  }
  return(schema)
}

For the IRIS data, we first write the user-defined function separately. Note that the user-defined function should return a data frame with a schema that conforms to the desired ORE schema. In the following code, we add lines to handle the case of an empty input data frame.

build_model <- function(df) {
  if (nrow(df) == 0) {
    species <- character()
    cf <- numeric()
    names(cf) <- character()
  } else {
    species <- as.character(df$SPECIES[1])
    cf <- coef(lm(SEPAL.LENGTH ~ ., data = df[1:4]))
  }
  data.frame(SPECIES = species, COEFNAME = names(cf),
             COEFVALUE = unname(cf), stringsAsFactors = FALSE)
}

Then, after adding build_model to the R script repository with ore.scriptCreate (as shown for the next example, since ore.getApplySchema looks up the function by name), call our function as follows:

schema <- ore.getApplySchema(IRIS, 'build_model', col = 'SPECIES', row.num = NULL,
                             type = 'groupApply')
schema
[1] SPECIES   COEFNAME  COEFVALUE
<0 rows> (or 0-length row.names)

Next, we use another example for a more detailed demonstration of this function. We demonstrate two different scenarios: one in the R environment, and the other using the SQL interface to embedded R execution.

Example in Oracle R Enterprise

To illustrate working with significantly more columns, consider an analysis of the adult dataset from the UCI data repository. The dataset contains demographic information about people.
We first load the data and view a few rows:

adult.df <- read.csv(file = "/scratch/data/adult.csv", header=F)
colnames(adult.df) <- c("age", "workclass", "fnlwgt", "education", "education_num",
                        "marital_status", "occupation", "relationship", "race", "sex",
                        "capital_gain", "capital_loss", "hours_per_week",
                        "native_country", "label")
head(adult.df)
  age        workclass fnlwgt education education_num
1  50 Self-emp-not-inc  83311 Bachelors            13
2  38          Private 215646   HS-grad             9
3  53          Private 234721      11th             7
4  28          Private 338409 Bachelors            13
5  37          Private 284582   Masters            14
6  49          Private 160187       9th             5
         marital_status        occupation  relationship  race    sex
1    Married-civ-spouse   Exec-managerial       Husband White   Male
2              Divorced Handlers-cleaners Not-in-family White   Male
3    Married-civ-spouse Handlers-cleaners       Husband Black   Male
4    Married-civ-spouse    Prof-specialty          Wife Black Female
5    Married-civ-spouse   Exec-managerial          Wife White Female
6 Married-spouse-absent     Other-service Not-in-family Black Female
  capital_gain capital_loss hours_per_week native_country label
1            0            0             13  United-States <=50K
2            0            0             40  United-States <=50K
3            0            0             40  United-States <=50K
4            0            0             40           Cuba <=50K
5            0            0             40  United-States <=50K
6            0            0             16        Jamaica <=50K

Since the dataset contains a lot of categorical data, we wish to create dummy variables, also known as one-hot encoding, for those categorical variables. This is useful because many R modeling packages, such as xgboost or glmnet, require converting categorical variables into numeric vectors. For instance, the categorical variable marital_status contains 6 levels: Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, and Separated. The dummy encoding generates a vector of 6 binary (0, 1) values, with one column per level. For persons with the marital status 'Separated', the dummy columns will be (0, 0, 0, 0, 0, 1). There are 8 categorical variables, so the total number of dummy variables equals the number of distinct levels across all categorical features. As we can imagine, the resulting ORE frame will have a lot of columns, which would be tedious to write out explicitly. We apply the function dummy.data.frame from the 'dummies' package to the categorical columns:

library(dummies)
factor.features <- c("workclass", "education", "marital_status", "occupation",
                     "relationship", "race", "sex", "native_country")
output.df <- dummy.data.frame(adult.df, names=factor.features, sep="_")
colnames(output.df)
 [1] "age"                          "workclass_ ?"                 "workclass_ Federal-gov"
 [4] "workclass_ Local-gov"         "workclass_ Never-worked"      "workclass_ Private"
 [7] "workclass_ Self-emp-inc"      "workclass_ Self-emp-not-inc"  "workclass_ State-gov"
[10] "workclass_ Without-pay"       "fnlwgt"                       "education_ 10th"
[13] "education_ 11th"              "education_ 12th"              "education_ 1st-4th"
[16] "education_ 5th-6th"           "education_ 7th-8th"           "education_ 9th"
[19] "education_ Assoc-acdm"        "education_ Assoc-voc"         "education_ Bachelors"
[22] "education_ Doctorate"         "education_ HS-grad"           "education_ Masters"
[25] "education_ Preschool"         "education_ Prof-school"       "education_ Some-college"
[28] "education_num"                "marital_status_ Divorced"     "marital_status_ Married-AF-spouse"
[31] "marital_status_ Married-civ-spouse" "marital_status_ Married-spouse-absent" "marital_status_ Never-married"
[34] "marital_status_ Separated"
...

In total, there are 109 columns after the dummy variables are created. Suppose the data set resides inside an Oracle database. We use tableApply to call the dummies package, but it is cumbersome to supply a data frame schema explicitly when we call tableApply.
In this case, we can use the function ore.getApplySchema to retrieve the schema. First, let us load the dataset into Oracle Database:

adult.df <- read.csv(file = "/scratch/data/adult.csv", header=F)
colnames(adult.df) <- toupper(c("age", "workclass", "fnlwgt", "education", "education_num",
                                "marital_status", "occupation", "relationship", "race", "sex",
                                "capital_gain", "capital_loss", "hours_per_week",
                                "native_country", "label"))
ore.drop(table = 'ADULT')
ore.create(adult.df, table = 'ADULT')

We write the function to generate the dummy variables:

convert_dummies <- function(adult.df) {
  library(dummies)
  # Do not forget to convert the column names to upper case in order to avoid
  # adding extra "" when using the columns in queries.
  factor.features <- toupper(c("workclass", "education", "marital_status", "occupation",
                               "relationship", "race", "sex", "native_country"))
  output.df <- dummy.data.frame(adult.df, names=factor.features, sep="_")
  return(output.df)
}

(Note that the new column names come from the original values of the features, which contain spaces and '-', characters that may not be accepted as Oracle table column names. We added extra code to reformat the column names.)

Upload this function to the R script repository:

ore.scriptCreate(name = "convert_dummies", convert_dummies, overwrite = TRUE)

The code to call the function is as follows:

schema <- ore.getApplySchema(ADULT, "convert_dummies", col = NULL, row.num = NULL,
                             type = 'tableApply')

After we run the function, the output schema is returned. Let us take a look:

> schema
 [1] AGE                     WORKCLASS_?             WORKCLASS_FEDERALGOV
 [4] WORKCLASS_LOCALGOV      WORKCLASS_NEVERWORKED   WORKCLASS_PRIVATE
 [7] WORKCLASS_SELFEMPINC    WORKCLASS_SELFEMPNOTINC WORKCLASS_STATEGOV
[10] WORKCLASS_WITHOUTPAY    FNLWGT                  EDUCATION_10TH
[13] EDUCATION_11TH          EDUCATION_12TH          EDUCATION_1ST4TH
[16] EDUCATION_5TH6TH        EDUCATION_7TH8TH        EDUCATION_9TH
[19] EDUCATION_ASSOCACDM     EDUCATION_ASSOCVOC      EDUCATION_BACHELORS
...

It looks like we retrieved all the new columns in the schema! Now we can use the schema to actually run tableApply:

res.odf <- ore.tableApply(ADULT, FUN.NAME = "convert_dummies", FUN.VALUE = schema)

The result res.odf is an ORE frame containing all of the dummy variable columns. Let us inspect this output:

> names(res.odf)
 [1] "AGE"                     "WORKCLASS_?"             "WORKCLASS_FEDERALGOV"    "WORKCLASS_LOCALGOV"
 [5] "WORKCLASS_NEVERWORKED"   "WORKCLASS_PRIVATE"       "WORKCLASS_SELFEMPINC"    "WORKCLASS_SELFEMPNOTINC"
 [9] "WORKCLASS_STATEGOV"      "WORKCLASS_WITHOUTPAY"    "FNLWGT"                  "EDUCATION_10TH"
[13] "EDUCATION_11TH"          "EDUCATION_12TH"          "EDUCATION_1ST4TH"        "EDUCATION_5TH6TH"
[17] "EDUCATION_7TH8TH"        "EDUCATION_9TH"           "EDUCATION_ASSOCACDM"     "EDUCATION_ASSOCVOC"
[21] "EDUCATION_BACHELORS"     "EDUCATION_DOCTORATE"     "EDUCATION_HSGRAD"        "EDUCATION_MASTERS"
...

> head(res.odf)
  AGE WORKCLASS_? WORKCLASS_FEDERALGOV WORKCLASS_LOCALGOV WORKCLASS_NEVERWORKED WORKCLASS_PRIVATE WORKCLASS_SELFEMPINC WORKCLASS_SELFEMPNOTINC
1  23           0                    0                  0                     0                 1                    0                       0
2  40           0                    0                  0                     0                 1                    0                       0
3  41           0                    0                  0                     0                 0                    0                       1
4  24           0                    0                  0                     0                 0                    0                       0
5  20           1                    0                  0                     0                 0                    0                       0
6  38           0                    0                  0                     0                 1                    0                       0
  WORKCLASS_STATEGOV WORKCLASS_WITHOUTPAY FNLWGT EDUCATION_10TH EDUCATION_11TH EDUCATION_12TH EDUCATION_1ST4TH EDUCATION_5TH6TH
1                  0                    0 115458              0              0              0                0                0
2                  0                    0 347890              0              0              0                0                0
3                  0                    0 196001              0              0              0                0                0
4                  1                    0 273905              0              0              0                0                0
5                  0                    0 119156              0              0              0                0                0
6                  0                    0 179488              0              0              0                0                0
  EDUCATION_7TH8TH EDUCATION_9TH EDUCATION_ASSOCACDM EDUCATION_ASSOCVOC EDUCATION_BACHELORS EDUCATION_DOCTORATE EDUCATION_HSGRAD
1                0             0                   0                  0                   0                   0                1
2                0             0                   0                  0                   1                   0                0
3                0             0                   0                  0                   0                   0                1
4                0             0                   1                  0                   0                   0                0
5                0             0                   0                  0                   0                   0                0
6                0             0                   0                  0                   0                   0                0
  EDUCATION_MASTERS EDUCATION_PRESCHOOL EDUCATION_PROFSCHOOL EDUCATION_SOMECOLLEGE EDUCATION_NUM MARITAL_STATUS_DIVORCED
1                 0                   0                    0                     0             9                       0
2                 0                   0                    0                     0            13                       0
3                 0                   0                    0                     0             9                       0
...

It works! Using this function, we avoid writing the schema explicitly, which speeds up working with embedded R execution.

Example in Oracle SQL Embedded R Execution

Embedded R execution can also be initiated from Oracle SQL. Consider the following use case: a data scientist calls our convenience function to produce the schema of the output, then hands it over to an analyst who mainly works in SQL and calls the R function through the SQL interface to tableApply (rqTableEval). How can we facilitate this process? First, on the R side, make sure to upload the convert_dummies function to the R script repository:

ore.scriptCreate(name = "convert_dummies", convert_dummies, overwrite = TRUE)

The main difficulty is how to specify, in SQL, the output schema of the table with all the dummy variables. Our solution is to save the schema into a table in Oracle Database and then call rqTableEval. The entire process can be automated from the R side by adding a few lines to the convenience function ore.getApplySchema:

ore.getApplySchema <- function(ODF, function_name, ..., col = NULL, row.num = NULL,
                               type = 'rowApply', ore.connect.flag = FALSE,
                               sql = FALSE, schema.table = NULL) {
  # check whether the script has been loaded into the repository
  res <- ore.scriptList(name = function_name)
  if (nrow(res) == 0) {
    stop('Function does not exist in the repository!')
  }
  rownames(ODF) <- ODF[, 1]
  if (is.null(row.num)) INPUT <- ODF else INPUT <- ODF[row.num, ]
  switch(type,
    rowApply = {
      res <- ore.rowApply(INPUT, FUN.NAME = function_name, ...,
                          ore.connect = ore.connect.flag)
    },
    tableApply = {
      res <- ore.tableApply(INPUT, FUN.NAME = function_name, ...,
                            ore.connect = ore.connect.flag)
    },
    groupApply = {
      if (is.null(col)) stop("group apply requires the col information!")
      res <- ore.groupApply(INPUT, INPUT[, col], FUN.NAME = function_name, ...,
                            ore.connect = ore.connect.flag)
    }
  )
  if (type == 'tableApply') {
    schema <- ore.pull(res)[0, ]
  } else {
    schema <- res[[1]][0, ]
  }
  if (sql == TRUE) {
    stopifnot(!is.null(schema.table))
    ore.drop(table = schema.table)
    ore.create(schema, table = schema.table)
    qry <- paste0("SELECT * FROM ", schema.table)
    return(qry)
  }
  return(schema)
}

In this use case, we call the function as follows:

qry <- ore.getApplySchema(ADULT, "convert_dummies", col = NULL, row.num = NULL,
                          type = 'tableApply', sql = TRUE, schema.table = 'ADULT_SCHEMA')

The function saves the schema into a database table named 'ADULT_SCHEMA'. The returned value qry is the query string 'SELECT * FROM ADULT_SCHEMA', which is the schema query used in rqTableEval. From the SQL side, we can also create a table ADULT_RESULT to store the result.
This avoids the extra work of materializing the ORE frame into an Oracle Database table from the R side.

CREATE TABLE ADULT_RESULT AS
SELECT * FROM table(rqTableEval(
  cursor(SELECT * FROM ADULT),
  NULL,
  'SELECT * FROM ADULT_SCHEMA',
  'convert_dummies'));

Let us check the results:

SELECT * FROM ADULT_RESULT;

The output shows the result returned as a table with all the dummy columns.

Conclusion

We provide a convenience function for automatically generating the result data frame schema for use in embedded R execution when returning a table, and we illustrated how to use this function from both R and SQL. This function helps automate the schema-generation process and lets the user focus on other important data processing tasks.


Best Practices

Data Science Maturity Model - Asset Management Dimension (Part 9)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'asset management': How are data science assets managed and controlled?

Assets are typically both tangible and intangible things of value. For this discussion, we consider the array of data science work products as assets, and we can define 'asset management' at a high level as "any system that monitors and maintains things of value to an entity or group." As introduced earlier in this blog series, work products consist of, e.g., data (raw and transformed), data visualization plots and graphs, requirements and design specifications, code written as R / Python / SQL / other scripts directly or in web-based notebooks (e.g., Zeppelin, Jupyter), predictive models, and virtual machine / container images, among others. In this context, asset management should cover the full asset life cycle - from creation to retirement. Throughout the life cycle, the need for asset storage / backup / recovery, metadata-based search and retrieval, security (e.g., privilege-based access control, auditability), versioning, archiving, and lineage must be addressed - basically, governance. Specific to data science is the need for model management, which encompasses, e.g., the model life cycle, governance, repeatability, monitoring, and reporting.

The 5 maturity levels of the 'asset management' dimension are:

Level 1: Analytical work products are owned, organized, and maintained by individual data science players. Data science players at Level 1 enterprises are essentially 'winging it,' taking an ad hoc approach to asset management. Players are responsible for maintaining their data science work products, typically on their local machines, which may or may not be backed up or secure. Asset loss and an inability to reproduce results are not uncommon. Across the enterprise, data science work products are "hidden" on individual machines, with no effective way to search for them.

Level 2: Initial efforts are underway to provide security, backup, and recovery of data science work products. The Level 2 enterprise recognizes the need to manage data science work products. This typically begins with organization-based repositories that provide storage with backup and recovery to reduce asset loss, as well as security to control access.

Level 3: Data science work product governance is systematically being addressed. The Level 3 enterprise begins to see data science work products as an important corporate asset. As such, tools and procedures are introduced to centrally manage assets throughout their life cycle. As the enterprise expands its data science effort with machine learning models, the need for model management also gains visibility. The need to determine which data and processes were used to produce data science work products is gaining recognition, with steps being taken to answer basic questions definitively, e.g., on what is this result based?

Level 4: Data science work product governance is firmly established at the enterprise level with increasing support for model management. The Level 4 enterprise has adopted best practices for data science work product governance. Data science players, as well as the overall enterprise, reap productivity gains by being able to easily locate, execute, reproduce, and enhance project content. The question of "how was this result produced and on what data?" can readily be answered.

Level 5: Systematic management of all data science work products with full support for model management. The Level 5 enterprise surpasses the Level 4 enterprise by introducing tools and procedures that fully support model management. As data science projects are deployed, their outcomes are monitored, with reporting on the value provided to the enterprise. Such outcomes are factored back into the project, forming a closed loop that ensures data science projects continue to provide value based on current, relevant data and trends.

In my next post, we'll cover the 'tools' dimension of the Data Science Maturity Model.


Best Practices

Data Science Maturity Model - Scalability Dimension (Part 8)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'scalability': Do the tools scale and perform for data exploration, preparation, modeling, scoring, and deployment? As data, data science projects, and the data science team grow, is the enterprise able to support these adequately?

The term 'scalability' can be defined as the "capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth." Scalability with respect to data science needs to reflect hardware and software aspects, as well as people and process aspects. This includes several factors: data volume (number of rows, columns, and overall bytes); algorithm design and implementation (parallel, distributed, memory efficient) for data preparation, model building, and scoring; hardware (RAM, CPU, storage); the volume and rate of data science work products produced; the number of data science players and projects; and workflow complexity.

The 5 maturity levels of the 'scalability' dimension are:

Level 1: Data volumes are typically "small" and limited by desktop-scale hardware and tools, with analytics performed by individuals using simple workflows. Level 1 enterprises perform analytics on data that can fit and be manipulated in memory, typically on desktop hardware, and possibly using open source tools. At Level 1, data volumes are such that loading data from flat files or programmatically from databases doesn't introduce problematic latency. Similarly, algorithm efficiency in terms of memory consumption or the ability to take advantage of multiple CPUs isn't a significant issue. Data science work products are produced at a rate that taxes neither individuals nor infrastructure.

Level 2: Data science projects take on greater complexity and leverage larger data volumes. In Level 2 enterprises, data science players are taking on more projects of greater complexity that require more data. This increase in data volume introduces increasingly intolerable latency due to data movement, and highlights inadequate hardware resources and inefficient algorithm implementations. The need to produce more data science work products more frequently also taxes existing hardware resources. The Level 2 enterprise begins exploring scalable tools that process data where it resides instead of relying on data movement, as well as tools that enhance the use of open source tools and packages. Data scientists resort to data sampling to address tool limitations.

Level 3: Individual groups adopt varied scalable data science tools and provide greater hardware resources for data scientist use. The Level 3 enterprise addresses the data science growing pains experienced at Level 2 by adopting tools that minimize latency due to data movement, have parallel, distributed algorithm implementations, and provide infrastructure for leveraging open source tools. These new tools enable data scientists to use more, if not all, of the desired data in their analytics; however, there is no standard suite of tools across the enterprise, and the various tools do not facilitate collaboration. An increase in available hardware resources (on-premises or cloud) for solving bigger and more complex data science problems yields significant productivity gains for the data science team.

Level 4: Enterprise standardizes on an integrated suite of scalable data science tools and dedicates sufficient hardware capacity to data science projects. Having explored and test-driven various data science tools, the Level 4 enterprise standardizes on an integrated suite of scalable tools that enables data science players to realize full-scale data science projects. Data science projects, and data scientists in particular, have sufficient hardware resources (on-premises or cloud) for both development and production.

Level 5: Data scientists have on-demand access to elastic compute resources both on premises and in the cloud with highly scalable algorithms and infrastructure. The Level 5 enterprise focuses on more elastic compute resources for data scientists. As data volumes increase, data science projects benefit from being able to quickly and easily increase or decrease compute resources, which in turn expedites data exploration, data preparation, machine learning model training, and data scoring - whether for individual models or massive predictive modeling involving thousands or even millions of individual models. Elastic compute resources can eliminate the need to dedicate resources for peak demand requirements. Alternatively, cloud-at-customer solutions can provide these benefits while meeting regulatory or data privacy requirements. The combination of scalable algorithms and infrastructure with elastic compute resources enables the enterprise to meet time-sensitive business objectives while minimizing cost.

In my next post, we'll cover the 'asset management' dimension of the Data Science Maturity Model.

Data Science Maturity Model - Data Access Dimension (Part 7)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'data access': How do data analysts and data scientists request and access data? How is data access controlled, managed, and monitored? When we consider 'data access,' one definition refers to "software and activities related to storing, retrieving, or acting on data housed in a database or other repository," normally coupled with authorization - who is permitted to access what - and auditing - who accessed what, when, and from where. As discussed below, data access can be provided with little or no control, such as when handing someone a memory stick, or with strict access control through secure database authentication and computer network authentication. Data access takes into account not only the user side, but also the ability of administrators to effectively manage the data access life cycle - from initial request to revoking privileges and post-use data cleanup.

The 5 maturity levels of the "data access" dimension are:

Level 1: Data analysts typically access data via flat files obtained explicitly from IT or other sources. Data science players at Level 1 enterprises use what has historically been called the 'sneakernet.' If you need data, you walk over to the data owners, get a copy on a hard drive or memory stick, and load it onto your local machine. This, of course, has morphed into emailing requests to data owners and getting back the requested data via email, drop boxes, or FedEx. Providing access to data in this manner is clearly not secure. Further, obtaining the 'right' data is unlikely to occur on the first try, so multiple iterations may be needed with data owners - the data request cycle - which results in delays and even annoys those data owners.

Level 2: Data access available via direct programmatic database access. In Level 2 enterprises, the sneakernet is recognized as insecure and inefficient. Moreover, since much of enterprise data is stored in databases, authorization and programmatic access is more readily enabled. With direct access to databases via convenient APIs (ODBC, R and Python packages, etc.), more data can be made available to data science players, thereby shortening the data request cycle. However, any processing beyond what is possible in the data repository/environment itself, e.g., SQL for relational databases, still requires data to be pulled to the client machine, which can have security implications.

Level 3: Data scientists have authenticated, programmatic access to large volume data, but database administrators struggle to manage the data access life cycle. The Level 3 enterprise is experiencing data access growing pains. Data scientists now have access to large volume data and want to use more if not all of that data in their work. Database administrators are inundated with requests for both broad (multi-schema) and narrow (individual table) data access. Ensuring individuals have proper approvals for accessing the data they need, and possibly implementing data masking, causes data access request backlogs. The Level 3 enterprise has also started to supplement traditional structured database data with new "big data" repositories, e.g., HDFS, NoSQL, etc. These even greater volumes of data include anything from social media data to sensor, image, text, and voice data.

Level 4: Data access is more tightly controlled and managed with identity management tools. While enterprises in some industries, e.g., Finance, will have addressed access control to varying degrees, when addressing data access more broadly, the Level 4 enterprise understands the importance of end-to-end life cycle management of user identities and begins introducing tools to strengthen security and simplify compliance as appropriate. A goal for Level 4 enterprises is to make it easier for data science players to request and receive access to data, while also making it easier for administrators to manage, especially as more big data repositories are introduced. An enterprise-wide self-service access request web application may be used to facilitate requesting and granting data access. Ideally, this would be integrated with the metadata management tool used for data awareness.

Level 5: Data access lineage tracking enables unambiguous data derivation and source identification. The Level 5 enterprise has standardized on identity management and auditing to support secure data access, and now focuses on the question "what is the source of the data that produced this result?" Even in enterprises that leverage an enterprise data warehouse, data may still be replicated to other databases, or various gateways leveraged to give transparent access to remote data. The Level 5 enterprise enables tracking the derivation of data science work products - their lineage - with verification of actual data sources.

In my next post, we'll cover the 'scalability' dimension of the Data Science Maturity Model.

News

R Consortium solicits feedback on R package best practices

With over 12,000 R packages on CRAN alone, the choice of which package to use for a given task is challenging. While summary descriptions, documentation, download counts, and word-of-mouth may help direct selection, a standard assessment of package quality can greatly help identify the suitability of a package for a given (non-)commercial need. Providing the R Community of package users with an easily recognized “badge” indicating the level of quality achievement will make it easier for users to know the quality of a package along several dimensions. In addition, providing R package authors and maintainers a checklist of “best practices” can help guide package development and evolution, as well as help package users know what to look for in a package.

The R Consortium is exploring the benefits of recommending that R package authors, contributors, and maintainers adopt the Linux Foundation (LF) Core Infrastructure Initiative (CII) “best practices” badge. This badge provides a means for Free/Libre and Open Source Software (FLOSS) projects to highlight the extent to which package authors follow best software practices, while enabling individuals and enterprises to quickly assess a package’s strengths and weaknesses across a range of dimensions. The CII Best Practices Badge Program is a voluntary self-certification, at no cost. An easy-to-use web application guides users through the process. More information on the CII Best Practices Badge Program, including the badging criteria, is available on GitHub, along with project statistics and criteria statistics. The projects page shows participating projects and supports queries (e.g., you can see projects that have a passing badge).

As a potential initiative for the R Community, we encourage community feedback on the CII for R packages. Also, consider going through the process for a package you authored or maintain. Your feedback will help us and the Linux Foundation evolve the CII to further benefit the R Community, and FLOSS projects in general. Please provide feedback using this survey.

Data Science Maturity Model - Data Awareness Dimension (Part 6)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'data awareness': How easily can data scientists learn about enterprise data resources? Generally speaking, the term 'awareness' can be defined as "the state or condition of being aware; having knowledge; consciousness." For data awareness, we might refine this definition as "having knowledge of the data that exist in an enterprise and an understanding of its contents." As the image above suggests, enterprises often have many data repositories across organizations and departments. Data may reside in databases, flat files, and spreadsheets, among others, across a range of hardware, operating systems, and file systems - the data landscape. Moreover, data silos form where one part of the enterprise is completely unaware of the existence of data in another, let alone the meaning of that data. Data awareness across an enterprise allows data science players, especially data scientists, to browse and understand data from a metadata perspective. Such metadata may include textual descriptions of, e.g., tables and individual columns, key summary statistics, and data quality metrics, among others. Data awareness is essential not only to increase productivity, but also to inventory data assets and enable an enterprise to move toward "a single version of the truth."

The 5 maturity levels of the "data awareness" dimension are:

Level 1: Users of data have no systematic way of learning what data assets are available in the enterprise. Enterprises at Level 1 are often in the dark when it comes to understanding the data resources that may exist across the enterprise. Data may be siloed in spreadsheets or flat files on employee machines, or stored in departmental or application-specific databases. No map of the data landscape exists to assist in finding data of interest; moreover, the enterprise hasn't awakened to the need for this.

Level 2: Data analysts and data scientists seek additional data sources through "key people" contacts. The Level 2 enterprise has 'awakened' to the need for and benefits of finding the right data. As data analysts and data scientists take on more analytically interesting projects, the search for data ensues on a personal level - individually contacting data owners or others 'in the know' within the enterprise to understand what data exist. A significant amount of time is lost trying to understand what data exist, how to interpret them, and their quality.

Level 3: Existing enterprise data resources are cataloged and assessed for quality and utility for solving business problems. The Level 3 enterprise sees the need for making it easier for data science players to find data and have greater confidence in their quality for solving business problems. Ad hoc metadata catalogs begin to emerge which make it easier to understand what data are available; however, such catalogs are non-standard, not integrated, and dispersed across the enterprise.

Level 4: Enterprise introduces metadata management tool(s). The Level 4 enterprise builds on the progress from Level 3 by introducing metadata management tools where data scientists and others can discover data resources available to solve critical business problems. Since the enterprise is just starting to take metadata seriously, different departments or organizations within an enterprise may use different tools. While an improvement for data scientists, the metadata models across tools are not integrated, so multiple tools may need to be consulted.

Level 5: Enterprise standardizes on a metadata management tool and institutionalizes its use for all data assets. The Level 5 enterprise has fully embraced the value of integrated metadata and facilitates the maintenance and organization of that metadata through effective tools. All data assets are curated for quality and utility, with full metadata descriptions to enable efficient data identification and discovery across the enterprise. Data science players' productivity and project quality increase as they can now easily find available enterprise data.

In my next post, we'll cover the 'data access' dimension of the Data Science Maturity Model.

Best Practices

Data Science Maturity Model - Methodology Dimension (Part 5)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'methodology': What is the enterprise approach or methodology to data science? The most often cited methodology for 'data mining' - a key element of data science - is CRISP-DM. However, the breadth and growth of data science may require expanding beyond the traditional phases introduced by CRISP-DM: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. Indeed, explicit feedback loops or expanded data awareness/access phases may be useful. In addition, enterprise-specific workflows involving data science project players and work products may be necessary to increase productivity and derived value.

The 5 maturity levels of the "methodology" dimension are:

Level 1: Data analytics are focused on business intelligence and data visualization using an ad hoc methodology. For Level 1 enterprises, data analysts and other players typically follow no established methodology, relying instead on their experience, skills, and preferences. The focus is on business intelligence and data visualization through dashboards and reports, relying on traditional deductive query formulation.

Level 2: Data analytics are expanded to include machine learning and predictive analytics for solving business problems, but still using an ad hoc methodology. Like Level 1, Level 2 enterprises typically follow no established methodology, relying instead on player experience, skills, and preferences. However, enterprises at Level 2 supplement traditional roles, such as data analysts who provide business intelligence and data visualization, with data scientists who introduce more advanced data science techniques such as machine learning and predictive analytics. With the introduction of data scientists, there are implicit enhancements to the ad hoc data science methodology.

Level 3: Individual organizations begin to define and regularly apply a data science methodology. Level 3 enterprises are in the experimental stage where individual organizations start to define their own methodological practices or leverage existing ones. Goals include increasing productivity, consistency, and repeatability of data science projects while controlling risk. Data science projects may or may not effectively track the performance of deployed model outcomes.

Level 4: Basic data science methodology best practices established for data science projects. Level 4 enterprises build on the progress from Level 3 by establishing methodology best practices throughout the enterprise. Such best practices are derived from organizational experimentation or adopted from an existing methodology. As a result of establishing best practices, the enterprise sees increased productivity, consistency, and repeatability of data science projects with reduced risk of failure.

Level 5: Data science methodology best practices formalized across the enterprise. Having established best practices for data science in Level 4, the Level 5 enterprise formalizes additional key aspects of data science projects, including project planning, requirements gathering / specification, and design, as well as implementation, deployment, and project assessment.

In my next post, we'll cover the 'data awareness' dimension of the Data Science Maturity Model.

Best Practices

Data Science Maturity Model - Collaboration Dimension (Part 4)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'collaboration': How do data scientists collaborate among themselves and with others in the enterprise, e.g., business analysts, application and dashboard developers, to evolve and hand off data science work products? Data science projects often involve significant collaboration, defined as "two or more people or organizations working together to realize or achieve a goal." Successful data science projects that positively impact an enterprise will often require the involvement of multiple players: data scientists, data / business analysts, business leaders, domain experts, application / dashboard developers, database administrators, and information technology (IT) administrators, just to name a few. Collaboration can be informal or formal; however, in this context, we look to tools that support, encourage, monitor, and guide collaboration among players.

The 5 maturity levels of the "collaboration" dimension are:

Level 1: Data analysts often work in silos, performing work in isolation and storing data and results in local environments. Enterprises at Level 1 often suffer from the 'silo effect', where data analysts in different parts of the enterprise work in isolation, focusing narrowly on the data they have access to, to answer questions for their department or organization. Results produced in one area may not be consistent with those in another, even if the underlying question is the same. These differences may result from using (possibly subtly) different data, or versions of the same data, or taking a different approach to arrive at a given result. These differences can make for interesting cross-organization or enterprise-wide meetings where results are presented.

Level 2: Greater collaboration exists between IT and line-of-business organizations. The Level 2 enterprise seeks greater collaboration between the traditional keepers of data (Information Technology) and the various lines of business with their data analysts and data scientists. Sharing of data and results may still be ad hoc, but greater collaboration helps identify data to solve important business problems and communicate results within the organization or enterprise.

Level 3: Recognized need for greater collaboration among the various players in data science projects. With the introduction of data scientists, and the desire to make greater use of data to solve business problems, Level 3 enterprises see the need for greater collaboration among the various players involved in or affected by data science projects. These include data scientists, business analysts, business leaders, and application/dashboard developers, among others. Collaboration takes the form of sharing, modification, and hand-off of data science work products. Work products consist of, e.g., data (raw and transformed), data visualization plots and graphs, requirements and design specifications, code written as R / Python / SQL / other scripts directly or in web-based notebooks (e.g., Zeppelin, Jupyter), and predictive models. Traditional tools such as source code control systems and object repositories with version control may be used, but inconsistently.

Level 4: Broad use of tools introduced to enable sharing, modifying, tracking, and handing off data science work products. Level 4 enterprises build on the progress from Level 3, introducing tools specifically geared toward enhanced collaboration among data science project players. This includes support for sharing and modifying work products, as well as tracking changes and workflow. The ability to hand off work products within a defined workflow in a seamless and controlled manner is key. Different organizations within the enterprise may experiment with a variety of tools, which typically do not interoperate.

Level 5: Standardized tools introduced across the enterprise to enable seamless collaboration. While the Level 4 enterprise made significant strides in enhancing collaboration, the Level 5 enterprise standardizes on tool(s) to facilitate cross-enterprise collaboration among data science project players.

In my next post, we'll cover the 'methodology' dimension of the Data Science Maturity Model.

Best Practices

Data Science Maturity Model - Roles Dimension (Part 3)

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'roles': What roles are defined and developed in the enterprise to support data science activities? A role can be defined as "a set of connected behaviors, rights, obligations, beliefs, and norms as conceptualized by people in a social situation." As with most any new field, data science within an enterprise can benefit from the introduction of new roles.

Following the 'strategy' dimension, the 5 maturity levels of the "roles" dimension are:

Level 1: Traditional data analysts explore and summarize data using deductive techniques. Enterprises at Level 1 may have persons dedicated to data analysis - data analysts - and draw on the skills of database administrators (DBAs) or business analysts to deliver business intelligence. They likely use a variety of tools that support, for example, spreadsheet analytics, visualization, dashboards, and database query languages, among others. Persons in these roles typically use deductive reasoning in the sense that they formulate queries to answer specific questions.

Level 2: Introduction of 'data scientist' role and corresponding skill sets to begin leveraging advanced, inductive techniques. The Level 2 enterprise recognizes the need for more sophisticated analytics and the value that those trained in data science - the now much admired role of the data scientist - can bring to the enterprise. Once considered unicorns, data scientists are now more numerous as universities offer degrees at both the master's and doctoral levels. Even so, data scientists may have different strengths, including their ability to prepare/wrangle data, write code, use machine learning algorithms, use visualization effectively, and communicate results to both technical and non-technical audiences. As such, a given data science project may require a team of data scientists with complementary skills. Level 2 enterprises can now more confidently explore, develop, and deploy solutions based on machine learning, artificial intelligence, data mining, predictive analytics, and advanced analytics - depending on which term or terms most resonate with your enterprise. At Level 2, data scientists are typically added as needed to individual departments or organizations.

Level 3: Chief Data Officer (CDO) role introduced to help manage data as a corporate asset. Although not necessarily a pure data science role, the Chief Data Officer role is highly beneficial, if not critical, for the data science-focused enterprise. The CDO is responsible for enterprise-wide governance and use of data assets. Along with a CDO, the role of data librarian may also be introduced to support data curation within the enterprise. With the introduction of these roles at Level 3, not only is data science being taken more seriously, but the key input to data science projects - the data - is as well.

Level 4: Data scientist career path codified and standardized across the enterprise. Level 4 enterprises strive for greater uniformity across the enterprise for the data scientist role with respect to job description, skills, and training. In some enterprises, data science activities and/or data scientists may be organized under a common or matrix management structure.

Level 5: Chief Data Science Officer (CDSO) role introduced. Just as the Chief Data Officer role is beneficial for enterprises taking data more seriously, the Level 5 enterprise also recognizes the need for a Chief Data Science Officer. In this role, the CDSO oversees, coordinates, evaluates, and recommends data science projects and the tools and infrastructure needed to help achieve enterprise business objectives.

In my next post, we'll cover the 'collaboration' dimension of the Data Science Maturity Model.

Best Practices

Data Science Maturity Model - Strategy Dimension (Part 2)

In my previous post, I introduced this series on a Data Science Maturity Model and the dimensions we'll be discussing. The first dimension is 'strategy': What is the enterprise business strategy for data science? A strategy can be defined as "a high-level plan to achieve one or more goals under conditions of uncertainty." With respect to data science, goals may include making better business decisions, making new discoveries, improving customer acquisition / retention / satisfaction, reducing costs, and optimizing processes, among others. Depending on the quantity and quality of data available and the way that data are used, the degree of uncertainty facing an enterprise can be significantly reduced or accentuated.

The 5 levels of the 'strategy' dimension are:

Level 1: Enterprise has no governing strategy for applying data science. For enterprises at Level 1, the world of data science may be unfamiliar, but data certainly is not. Data analytics may be a routine part of enterprise activity, but with no overall governing strategy or realization that data is a corporate asset. The enterprise has defined goals, but the extent to which data supports those goals is limited.

Level 2: Enterprise is exploring the value of data science as a core competency. The Level 2 enterprise realizes the potential value of data and the need to leverage that data for greater business advantage. With all the hype and substance around machine learning, artificial intelligence, and advanced analytics, business leaders are investigating the value data science can offer and are actively conducting proofs-of-concept - exploring data science seriously as a core business competency.

Level 3: Enterprise recognizes data science as a core competency for competitive advantage. Having done due diligence, enterprises at Level 3 have committed to pursuing data science as a core competency and the benefits it can bring. Systematic efforts are underway to enhance data science capabilities along the other dimensions of this maturity model.

Level 4: Enterprise embraces a data-driven approach to decision making. Having established a competency in data science, enterprises at Level 4 feel confident to embrace data-driven decision making - backing up or substituting business instincts with measured results and predictive analytics / machine learning. As data and skill sets are refined, business leaders have greater confidence to trust data science results when making key business decisions.

Level 5: Data are viewed as an essential corporate asset - data capital. A capping strategy with respect to data science involves giving data the "reverence" it deserves - recognizing it as a valuable corporate asset - a form of capital. At Level 5, the enterprise allocates adequate resources to conduct data science projects supported by proper management, maintenance, assessment, security, and growth of data assets, and the human resources to systematically achieve strategic goals.

In my next post, we'll cover the 'roles' dimension of the Data Science Maturity Model.

Best Practices

A Data Science Maturity Model for Enterprise Assessment (Part 1)

"Maturity models" aid enterprises in understanding their current and target states. Enterprises that already embrace data science as a core competency, as well as those just getting started, often seek a road map for improving that competency. A data science maturity model is one way of assessing an enterprise and guiding the quest for data science nirvana.   As an assessment tool, this Data Science Maturity Model provides a set of dimensions relevant to data science and 5 maturity levels in each - 1 being the least mature, 5 being the most. Here is my take on important maturity model dimensions with the goal to provide both an assessment tool and potential road map: Strategy - What is the enterprise business strategy for data science? Roles - What roles are defined and developed in the enterprise to support data science activities? Collaboration - How do data scientists collaborate with others in the enterprise, e.g., business analysts, application and dashboard developers, to evolve and hand-off data science work products? Methodology - What is the enterprise approach or methodology to data science? Data Awareness - How easily can data scientists learn about enterprise data resources? Data Access - How do data analysts and data scientists request and access data? How is data access controlled, managed, and monitored? Scalability - Do the tools scale and perform for data exploration, preparation, modeling, scoring, and deployment? Asset Management - How are data science assets managed and controlled? Tools - What tools are used within the enterprise for data science objectives? Can data scientists take advantage of open source tools in combination with high performance and scalable production quality infrastructure? Deployment - How easily can data science work products be placed into production to meet timely business objectives? In this blog series, I'll discuss each of these dimensions and levels by which business leaders and data science players can assess where their enterprise is, identify where they would like to be, and consider how important each dimension is for the business and overall corporate strategy. Such introspection is a step toward identifying architectures, tools, and practices that can help achieve identified data science goals.  

"Maturity models" aid enterprises in understanding their current and target states. Enterprises that already embrace data science as a core competency, as well as those just getting started, often...

Deploying Multiple R Scripts in Oracle R Enterprise

In this tips and tricks blog, we share some techniques from our own use of Oracle R Enterprise applied to data science projects that you may find useful in your own projects. Some data science projects may have tens or hundreds of R scripts and R functions written by developers or data scientists. While under ideal circumstances you would create a package to contain these functions, that may involve more effort than you had in mind. This tradeoff of package vs. no package arose in one of our recent projects, built with ORE and running on a production database. Since the production environment is managed under strict access rules, our team did not have access to the Linux environment to install and reinstall packages at will. This posed a challenge as our codebase contained hundreds of functions. Because this was a live machine learning project, feedback and enhancement requests arrived on a daily basis early on. Thus, we needed to respond quickly, deliver new features or fixes, and deploy the updated model in a timely manner. This makes installing our code as an R package a more heavyweight process, involving administrators. Package installation must be done with system administrator privileges, whereas loading R scripts into the R Script Repository is permitted to users with the RQADMIN privilege.

Since our deployment strategy requires the application to run inside Oracle Database using ORE embedded R execution, we store our top-level function in the database R Script Repository and invoke it by name. If our function invokes other functions also stored in the R Script Repository, we can load each by name using ore.scriptLoad - one invocation per function, which can be a lot. Here is a simple convenience function, ore.scriptLoad2, which allows using a regular expression to load multiple functions and leverages two existing ORE functions: ore.scriptList and ore.scriptLoad.

ore.scriptLoad2 <- function(pattern = NULL, envir = parent.frame()) {
  lst <- ore.scriptList(pattern = pattern)$NAME
  for (n in lst) {
    ore.scriptLoad(name = n, envir = envir)
  }
}

ore.scriptCreate('ore.scriptLoad2', ore.scriptLoad2)

We create the named script ore.scriptLoad2 so we can use it inside our embedded R functions as well. Tip: name your R functions with a common prefix or postfix related to your project to make it easy to grab them all at once. For instance, if all our functions have the prefix ml, we can load every matching function in the following way.

ore.scriptLoad('ore.scriptLoad2')   # loads our convenience function
ore.scriptLoad2(pattern = '^ml')

Note that the parameter pattern accepts a regular expression, which allows the developer to name functions following any predefined pattern. The regular expression used here ensures that ml is the prefix. Similarly, we provide a corresponding drop function so that we can also drop multiple functions from the R Script Repository in a single call. This makes it easy to ensure you have a clean environment.

ore.scriptDrop2 <- function(pattern = NULL) {
  lst <- ore.scriptList(pattern = pattern)$NAME
  for (n in lst) {
    ore.scriptDrop(name = n)
  }
}

Example

Here we show a simple deployment example. Suppose a machine learning application contains only two functions, ml.preproc and ml.train, in source file main.R. To deploy the two functions, we first need to create the scripts in the R Script Repository, as in the following sample code.

library(ORE)
ore.connect(...)

source('main.R')                   # contains ml.preproc and ml.train
ore.scriptDrop2(pattern = '^ml')   # drop the existing old version
funcs <- lsf.str()                 # list all functions in the workspace
funcs <- as.vector(funcs)
sapply(funcs, function(func) {
  # create the script in the R Script Repository
  ore.scriptCreate(func, eval(parse(text = func)))
})

After the R scripts are created, we can load these user-defined R functions within our top-level function when we invoke ORE embedded R execution. For simplicity, we call them in ore.doEval().

ore.doEval(function() {
  ore.scriptLoad('ore.scriptLoad2')   # load our convenience function
  ore.scriptLoad2(pattern = '^ml')
  ml.preproc()
  ml.train()
})

Note that all this code runs from the ORE client and does not require direct command-line access to the database server machine.

Conclusion

In this blog, we addressed the issue of agile deployment of machine learning software based on Oracle R Enterprise. We shared two functions to assist with ORE production deployment involving multiple user-defined R functions. This allows appropriately privileged R users (RQADMIN) to deploy code without system administrator access to the target machine. Using ORE in this manner, code can be updated in batch, reducing system administrator overhead. In practice, this type of deployment proved advantageous for our project. We also look forward to feedback from the community of ORE users.

Scalable scoring with multiple models using Oracle R Enterprise Embedded R Execution

At first glance, scoring data in batch with a machine learning model appears to be a straightforward endeavor: build the model, load the data, score using the model, do something with the results. This “something” can include writing the scores to a table, computing model evaluation/quality metrics, directly feeding a dashboard, etc. However, the task becomes a little more challenging when some of the details are filled in and hardware and software realities come into play. What are some of these details? How much data needs to be scored? Where are the data stored? How many individual models are involved? How large are the models? Does the model scoring software scale as data volumes grow? Does model scoring take advantage of parallelism? How quickly are the results required? Do the results need to be persisted? What are our hardware limitations: RAM, CPUs, disk space?

Note the distinction between parallelism and scalability. One does not necessarily imply the other. While algorithms in Oracle R Enterprise and Oracle R Advanced Analytics for Hadoop provide parallel and scalable algorithms for both model building and scoring, we are focusing here on how third party packages available from sites like CRAN can be used in a scalable and performant manner leveraging the infrastructure provided in Oracle R Enterprise.

At a high level, we might depict the problem as scoring one large data set with each of many models, producing a set of scores per model. Let’s select a few parameters to make the scenario specific. Suppose we have 100 models, each 500 MB in size, and a 20 million row data set with 50 columns to be scored. Often, memory limitations are our first concern, especially if using software not specifically designed to be scalable. Based on this, we’re looking at the following potential memory requirements:

Models: 100 models * 500 MB = 50 GB
Scoring data: 20 M rows * 50 columns * 16 bytes = 16 GB
Scores: 100 models * 20 M rows = 2 B scores + 20 M IDs = 32.2 GB

If we wanted to take a brute force approach, we would load everything into memory and execute serially. Just for the raw data noted above, this will require ~100 GB of RAM, and then there is the requirement to have enough RAM left over to perform the scoring - and don't forget the OS, database, and other software that may be running. While there are various approaches to this problem, in this blog we highlight using Oracle R Enterprise Embedded R Execution to achieve several goals:

scale to as much data as needed
limit memory requirements of the models themselves
create the final table in Oracle Database, thereby avoiding persisting intermediate results

But first, let's explore a few options. First is the brute force approach characterized above:

Load all the data into memory
Load all models into memory
For each model i {
  Score data with model i
  Place scores in single in-memory table for all models
}
Write in-memory score table

We could choose to parallelize this approach on each model, but if we do so maximally (with 100 concurrently executing processes), we’ll require not only 100 times the ~18 GB (500 MB model + 16 GB data + 640 MB for the scores with key) to be loaded into each parallel engine (~1.8 TB), but additional RAM for the processes supporting the scoring in parallel. So while we can speed up the elapsed execution time, it comes at a major cost of memory, which you may not have. (Of course, we could limit the number of concurrent engines to reduce this significantly, but this indicates the worst case scenario.) What options might we consider to get this job done?
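
As a quick sanity check on the arithmetic above (decimal GB, assuming 16 bytes per stored value as stated), a few lines of R reproduce these figures and the ~100 GB brute-force total:

# rough memory estimates for the scenario above (decimal GB)
models_gb <- 100 * 500e6 / 1e9                # 100 models x 500 MB each      = 50
data_gb   <- 20e6 * 50 * 16 / 1e9             # 20M rows x 50 cols x 16 bytes = 16
scores_gb <- (100 * 20e6 + 20e6) * 16 / 1e9   # 2B scores + 20M IDs           ~ 32.3
c(models = models_gb, data = data_gb, scores = scores_gb,
  total = models_gb + data_gb + scores_gb)    # total is roughly 100 GB
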
In the example above, the models take the majority of the space, unless you consider replicating the data in each concurrent engine. Perhaps we load only one model at a time, score with it, and then discard it prior to moving to the next model. Additionally, the scores themselves consume a lot of space, so let’s purge those as well after writing them to a table.

Load all the data into memory
For each model i {
  Load model i
  Score data with model i
  Purge model
  Write scores with ID key to a table
  Purge scores
}
Join all score tables based on ID

This requires 16 GB for the data, 500 MB for the model, and 640 MB for the scores and keys. At ~18 GB, we’re doing better than the ~100 GB in the non-parallel case and ~1.8 TB in the worst case parallel case, but this may still not scale, especially if the data grow. The problem with this approach is that it requires all data to be loaded at once into memory, which by definition does not scale.

A variant of this approach is to process the data in smaller chunks, iterating serially over those chunks. Since we can choose the chunk size (in terms of number of rows), we could require as little as ~2 GB of RAM (processing a handful of rows each time, which would be inefficient). If we read half the data at a time, this could be reduced to ~10 GB. Of course, with this approach, unless the script appends to the existing table, we will have n tables to union, one for each model. The union, however, is not necessarily a costly operation, but can result in duplicating the data to materialize the result.

Until done {
  Load n rows of data into memory
  For each model i {
    Load model i
    Score data with model i
    Write scores with ID key to a table
  }
  Union all tables for model i (unneeded if data appended to single table)
}
Join all score tables based on ID (avoidable if scores updated in single table with ID key)

A parallel variant of this performs the scoring of each chunk in parallel. We’ll need to decide on a degree of parallelism, which is based on the machine specs (RAM and CPUs) and expected resource availability. So let’s see how this could be done using Oracle R Enterprise Embedded R Execution with a simple example you can easily reproduce on your own system.

To illustrate, we'll start with a simple linear model using R's lm function. However, you can substitute your favorite algorithm available from CRAN or elsewhere. Using the Longley data set, we predict the number of people employed, which results in a numeric vector.

> # build test model
> mod <- lm(Employed ~ ., data = longley)
> pred <- predict(mod, longley)
> head(pred)
    1947     1948     1949     1950     1951     1952
60.05566 61.21601 60.12471 61.59711 62.91129 63.88831

To simulate multiple models, we'll create three datastore entries labeled test_1, test_2, and test_3, each containing the same lm model created above. A datastore allows one or more R objects to be stored in the database by name.

> ore.save(mod, name="test_1")
> ore.save(mod, name="test_2")
> ore.save(mod, name="test_3")
> ore.datastore(pattern="test")   # list the contents
  datastore.name object.count size       creation.date description
1         test_1            1 7346 2018-02-21 18:15:05
2         test_2            1 7346 2018-02-21 18:15:05
3         test_3            1 7346 2018-02-21 18:15:06
> ds_models <- c("test_1", "test_2", "test_3")

Next, we'll define our function f, which takes a chunk of rows in argument dat, along with the set of datastore names corresponding to our models used for scoring.
Our objective is to score with each model sequentially on the given chunk of data (while ultimately processing the chunks in parallel), and place the results from each model in a list. We free the model when finished with it and garbage collect the environment just for good measure. To complete the function, the prediction list is converted to a data.frame with an ID added from dat.

> f <- function (dat, ds_models) {
+   pred <- list()
+   for(m in ds_models) {
+     ore.load(m)
+     pred[[m]] <- predict(mod, dat)
+     rm(mod)
+     gc()
+   }
+   res <- as.data.frame(pred)
+   res$ID <- dat$ID
+   res
+ }

Let's test this function, but first we'll add an ID column based on the row name, as the function f expects.

> # provide ID column in data set
> longley2 <- data.frame(ID=rownames(longley), longley)
> head(longley2,3)
    ID GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
1 1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
2 1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
3 1949         88.2 258.054      368.2        161.6    109.773 1949   60.171

Now we're ready to test the function locally - a best practice before moving to Embedded R Execution. We see that the output is a data.frame with the prediction for each model in a column, followed by the ID. This facilitates comparison of scores from the multiple models. Note that we could store the scores in long format using a 3-column table (model_name, ID, value), but this would require roughly 3 times as much space.

> # test function locally
> dat <- head(longley2)
> testRes <- f(dat, ds_models)
> class(testRes)
[1] "data.frame"
> testRes
    test_1   test_2   test_3   ID
1 60.05566 60.05566 60.05566 1947
2 61.21601 61.21601 61.21601 1948
3 60.12471 60.12471 60.12471 1949
4 61.59711 61.59711 61.59711 1950
5 62.91129 62.91129 62.91129 1951
6 63.88831 63.88831 63.88831 1952

We now create an ore.frame, LONGLEY, from our data.frame for use with ore.rowApply. The function ore.push suffices for a demo example. We then invoke ore.rowApply with the ore.frame and the function f. Further, we specify that the data should be processed in chunks of 4 rows, and computed using 2 parallel R engines. We also pass the argument ds_models and specify to establish a connection to the database from the R engine automatically (this is needed to access the datastore). This invocation produces a list of data.frame objects, where each list element is produced from one execution of the R function. Since we have 16 rows, there will be 4 executions, executed 2 at a time in parallel.

> LONGLEY <- ore.push(longley2)
>
> # return list of individual execution results
> ore.rowApply(LONGLEY, f, rows=4, parallel=2,
+              ds_models = ds_models, ore.connect=TRUE)
$`1`
    test_1   test_2   test_3   ID
1 62.91129 62.91129 62.91129 1951
2 63.88831 63.88831 63.88831 1952
3 65.15305 65.15305 65.15305 1953
4 63.77418 63.77418 63.77418 1954

$`2`
    test_1   test_2   test_3   ID
1 60.05566 60.05566 60.05566 1947
2 61.21601 61.21601 61.21601 1948
3 60.12471 60.12471 60.12471 1949
4 61.59711 61.59711 61.59711 1950

$`3`
    test_1   test_2   test_3   ID
1 68.81055 68.81055 68.81055 1959
2 69.64967 69.64967 69.64967 1960
3 68.98907 68.98907 68.98907 1961
4 70.75776 70.75776 70.75776 1962

$`4`
    test_1   test_2   test_3   ID
1 66.00470 66.00470 66.00470 1955
2 67.40161 67.40161 67.40161 1956
3 68.18627 68.18627 68.18627 1957
4 66.55206 66.55206 66.55206 1958

Although we have the scores, they're not in the form we need. We want a single table in the database that contains all these rows. We can do that with a minor addition to the ore.rowApply invocation.
We specify the argument FUN.VALUE, providing a row from our previous run as an example, although we could explicitly construct it using the data.frame function. This FUN.VALUE informs Oracle Database what the new table should look like (column names and data types). Next we store the ore.frame result in variable res, and materialize a database table LONGLEY_SCORES. Note that when using FUN.VALUE, the ore.rowApply invocation constructs the ore.frame specification. Not until the result is accessed, e.g., when ore.create is invoked in this case, does Oracle Database spawn the R engines and execute the function. Each time the ore.frame in res is accessed, the result will be recomputed. So if a result is to be used multiple times, it is best to materialize it as a database table.

> # return ore.frame with all individual results
> res <- ore.rowApply(LONGLEY, f, FUN.VALUE = testRes[1,],
+                     rows=4, parallel=2, ds_models = ds_models, ore.connect=TRUE)
> ore.drop("LONGLEY_SCORES")
> ore.create(res, table="LONGLEY_SCORES")
> dim(LONGLEY_SCORES)
[1] 16  4
> head(LONGLEY_SCORES)
    test_1   test_2   test_3   ID
1 60.05566 60.05566 60.05566 1947
2 61.21601 61.21601 61.21601 1948
3 60.12471 60.12471 60.12471 1949
4 61.59711 61.59711 61.59711 1950
5 62.91129 62.91129 62.91129 1951
6 63.88831 63.88831 63.88831 1952

In this blog post, we've discussed some issues that can arise when trying to perform scalable scoring with multiple models using Oracle R Enterprise Embedded R Execution. In particular, this case arises when leveraging open source package algorithms, such as those from CRAN, that are not designed with parallelism or scalability in mind. The example above provides a template you can adapt and experiment with to meet your own project requirements. When possible, leveraging algorithms designed for parallelism and scalability, such as those found in Oracle R Enterprise and Oracle R Advanced Analytics for Hadoop, can offer superior performance and scalability. The next step for many enterprises is putting this solution in production, perhaps scheduled for regular execution. This can be accomplished using Oracle Database's DBMS_SCHEDULER package and the Oracle R Enterprise SQL interface to Embedded R Execution.
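
As one possible step toward such production use - a sketch only, reusing the R Script Repository approach from the post on deploying multiple R scripts, with the illustrative script name "score_with_models" - the scoring function can be stored in the database by name and referenced via the FUN.NAME argument, so a scheduled job connected through ORE can rebuild the scores table without shipping R source around:

> # register the scoring function in the R Script Repository (name is illustrative)
> try(ore.scriptDrop("score_with_models"), silent = TRUE)   # remove any prior version
> ore.scriptCreate("score_with_models", f)
>
> # a scheduled job connected via ORE could then rebuild the scores table by name
> res <- ore.rowApply(LONGLEY, FUN.NAME = "score_with_models", FUN.VALUE = testRes[1,],
+                     rows=4, parallel=2, ds_models = ds_models, ore.connect=TRUE)
> ore.drop("LONGLEY_SCORES")
> ore.create(res, table="LONGLEY_SCORES")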

Building R Packages on Solaris: Variable Definitions

The R ecosystem offers numerous packages for performing data analysis. Currently, the CRAN package repository features over 14,000 available packages! A key benefit to using R is the endless support it gets from statisticians, developers, and data science experts around the world. The CRAN repository offers R packages in Linux source format, or as binaries for Windows and MacOS. If you are installing R packages on another Operating System such as Solaris, you will need to build the package from source.

As covered in a previous post on building the Rcpp package, building R packages that contain C++ code on Solaris using the Oracle Developer Studio compiler requires some customization of variables. While many R packages will compile out of the box after building R using the gcc compiler, the Oracle Developer Studio compiler requires some juggling of paths to ensure R packages that use C++ templates will compile. The following variable modifications in Makeconf will allow most R packages to build under R-3.3.0 using Oracle Developer Studio 12.5:

CXX = <Oracle Studio 12.5 path>/bin/CC -m64 -std=c++03
CXX1X = <Oracle Studio 12.5 path>/bin/CC -std=c++03
CXX1XFLAGS = -xO3 -m64 -std=c++03
CXX1XSTD = -std=c++11
LDFLAGS = -L/<Oracle Studio 12.5 path>/lib/sparcv9 -lCrun -lCstd -lncurses -lreadline -lbz2 -lstdc++ -lgcc_s -lCrunG3 -lrt -lm -lc
SHLIB_CXXLDFLAGS = -G -L/<Oracle Studio 12.5 path>/lib/sparcv9 -lCrun -lCstd -lncurses -lreadline -lbz2 -lstdc++ -lgcc_s -lCrunG3 -lrt -lm -lc

Note that directly including C++ library linking options in $R_HOME/etc/Makeconf is a global approach. You can override variables in Makeconf with either of two approaches: on a per-user basis, by modifying ~/.R/Makevars, or on a per-package basis, by modifying <package name>/src/Makevars. Makevars is a make file that overrides the default make file generated by R. The advantage of using Makevars is that you can set the flags you need without unnecessarily passing variables to all R packages. Some commonly used flags in Makevars are (a short per-package example appears at the end of this post):

PKG_LIBS: Linker flags.
PKG_CFLAGS & PKG_CXXFLAGS: C and C++ flags. Most commonly used to set define directives with -D.
PKG_CPPFLAGS: Pre-processor flags. Most commonly used to set include directories with -I.

CRAN packages are regularly tested using gcc on Unix systems. The results of these tests can be found on the CRAN Package Check Results site. Recognizing that most R package authors do not have the time and resources to set up a separate testing environment with Oracle Developer Studio, it cannot be guaranteed that every package on CRAN will build, even with these variables defined. If an issue is encountered, engage Oracle Developer Studio support for assistance or use the gcc compiler.
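
For the per-package approach mentioned above, a minimal src/Makevars might look like the following. This is only a sketch: the include path, define directive, and library flags are placeholder values to adapt for your package and Solaris environment.

# <package name>/src/Makevars -- per-package overrides (placeholder values)
PKG_CPPFLAGS = -I/path/to/extra/include
PKG_CXXFLAGS = -DMY_SOLARIS_BUILD
PKG_LIBS = -L/path/to/extra/lib -lsomelib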

News

Announcing the release of Oracle R Advanced Analytics for Hadoop 2.7.1

We are pleased to announce the general availability of Oracle R Advanced Analytics for Hadoop (ORAAH) 2.7.1, a component of the Oracle Big Data Connectors, which enables big data analytics from R. With ORAAH, data scientists and data analysts have access to the rich and productive R language for accessing and manipulating data resident across multiple platforms, including HDFS, Hive, Oracle Database, and local files. By leveraging the parallel and distributed Hadoop and Spark computational infrastructure, users can take advantage of scalability and performance when analyzing big data.

ORAAH 2.7.1 provides several important advantages for big data analytics:

A general computation framework where users invoke parallel, distributed MapReduce jobs from R, writing custom mappers and reducers in R while also leveraging open source CRAN packages. Support for binary RData representation of input data enables R-based MapReduce jobs to match the I/O performance of pure Java-based MapReduce programs.

Parallel and distributed machine learning algorithms that take advantage of all the nodes of your Hadoop cluster for scalable, high performance modeling on big data. Algorithms include linear regression, generalized linear models, neural networks, low rank matrix factorization, non-negative matrix factorization, k-means clustering, principal components analysis, and multivariate analysis. Functions use the expressive R formula object optimized for Spark parallel execution.

R functions that wrap Apache Spark MLlib algorithms within the ORAAH framework using the R formula specification and Distributed Model Matrix data structure. ORAAH's MLlib R functions can be executed either on a Hadoop cluster using YARN to dynamically form a Spark cluster, or on a dedicated standalone Spark cluster. Spark execution can be switched on or off.

ORAAH’s architecture and approach to big data analytics leverages the cluster compute infrastructure for parallel, distributed computation, while shielding the R user from Hadoop’s complexity using a small number of easy-to-use functions.
What's new in ORAAH 2.7.1

Compatible with Oracle R Distribution 3.3.0 and Oracle R Enterprise 1.5.1
Support for Cloudera Distribution of Hadoop (CDH) release 5.12.0, both “classic” MR1 and YARN MR2 APIs
Extended support for Apache Spark: execute select predictive analytic functions on a Hadoop cluster using YARN to dynamically form a Spark cluster, or on a dedicated standalone Spark cluster; switch Spark execution on or off using the new spark.connect() and spark.disconnect() functions
Support for the new OAAgraph package, which provides a tight integration with Oracle’s Parallel Graph AnalytiX (PGX) engine from the Oracle Big Data Spatial and Graph option
Function hdfs.write() adds support for writing Spark DataFrame objects, in addition to the Distributed Model Matrix (DMM), which can be saved in Comma-Separated Value (CSV) format on HDFS
Connect to multi-tenant container databases (CDB) of Oracle Database using a new parameter pdb to specify the service name of the pluggable database (PDB) to which a connection is to be established
Improved performance of ore.create() for Hive and the ability to create tables in a Hive database other than the one the user is connected to
A new parameter append for ore.create() enables appending an ore.frame or data.frame to an existing Hive table
Support for a new environment configuration variable ORCH_CLASSPATH, which sets the CLASSPATH used by ORAAH and resolves an issue with supporting wildcards in paths
New features added to the distributed model matrix and distributed formula to improve performance and support across all ORAAH and Spark MLlib-based functionality
Upgrade to Intel® Math Kernel Library Version 2017 for Intel® 64 architecture
Improved installers and un-installers for both server and client: the installer checks your environment and runs validation checks to ensure prerequisites are met; un-installation has been improved to work with co-existing Oracle R Enterprise and 3rd party installed packages
Bug fixes and updates across the platform improve stability and ease-of-use. For details, see the Change List.

See the ORAAH OTN page to download the software and access the latest documentation.
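
As a small illustration of switching Spark execution on and off as described above - a sketch only; the connection arguments shown are assumptions to adapt to your cluster, so consult the ORAAH documentation for the exact parameters:

library(ORCH)                                 # ORAAH client package
spark.connect("yarn-client", memory = "4g")   # enable Spark execution (settings are illustrative)

# ... invoke ORAAH's Spark MLlib-based functions here ...

spark.disconnect()                            # switch Spark execution back off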


PageRank-based College football (NCAA) ranking using OAAgraph

NCAA College football is American football played by teams of student athletes fielded by American universities, colleges, and military academies. It is one of the major weekend entertainments in the US, and the match results capture many of the Sunday headlines. In particular, one key focus is the rankings of the teams. There are various types of rankings: the CFP rankings, AP Poll, Coaches Poll, etc. These rankings look similar to each other, with slight differences. The ranking methods are not disclosed, but one thing is certain: the rankings are not based on a single algorithm or formula but are generated by votes or polls of sports writers and other non-players. This is largely because the NCAA does not organize a full tournament. Instead, the teams play in isolated 'conferences' such as the Big Ten, ACC, etc. Therefore, many pairs of teams do not even get a chance to play against each other. Moreover, there are teams like Notre Dame that play as 'independents' and thus are not confined to a single conference. A natural question is: can we create our own algorithm to rank college football teams? The answer is yes, and it has been done already; see https://github.com/joebluems/CollegeFootball2015 on GitHub. The author used a customized PageRank algorithm to calculate the PageRank score of each football team, which is then used to generate the rankings. The results look reasonable for the 2015 college football season. Why does PageRank work? Is there a way to improve the development or analytic process? In this blog, we will show how to achieve the same analysis using OAAgraph, an interface that integrates Oracle R Enterprise of the Oracle Advanced Analytics option with the Parallel Graph AnalytiX (PGX) engine of the Oracle Spatial and Graph option.

NCAA College Football Data

Easy-to-read NCAA football outcomes can be found at https://www.sports-reference.com/cfb/years/2017-schedule.html, which offers full information on all college football teams, and the data is also downloadable in CSV format. The following R code reads the NCAA data from .csv files and saves it as a table in Oracle Database using Oracle R Enterprise.

library(ORE)
ore.connect(...)

scores.df <- read.csv('scores.csv', header = TRUE)
scores.df <- scores.df[, c('Winner', 'Pts', 'Loser', 'Pts.1')]
colnames(scores.df) <- c('TEAM1', 'SCORE1', 'TEAM2', 'SCORE2')
scores.df$TEAM1 <- as.character(scores.df$TEAM1)
scores.df$TEAM2 <- as.character(scores.df$TEAM2)
# remove the '(rank) ' prefix that appears before ranked team names
scores.df$TEAM1 <- sapply(scores.df$TEAM1, function(str){ gsub('\\(.*\\) ', '', str)})
scores.df$TEAM2 <- sapply(scores.df$TEAM2, function(str){ gsub('\\(.*\\) ', '', str)})

teams.df <- read.csv('teams.csv', header = FALSE)
colnames(teams.df) <- c('No', 'TEAM')

scores.df <- merge(scores.df, teams.df, by.x = 'TEAM1', by.y = 'TEAM')
colnames(scores.df)[colnames(scores.df) == 'No'] <- 'No1'
#colnames(scores.df)[colnames(scores.df) == 'CNT'] <- 'CNT1'
scores.df <- merge(scores.df, teams.df, by.x = 'TEAM2', by.y = 'TEAM')
colnames(scores.df)[colnames(scores.df) == 'No'] <- 'No2'

Let us take a look at the data frame:

> head(scores.df)
              TEAM2           TEAM1 SCORE1 SCORE2 No1 No2
1 Abilene Christian      New Mexico     38     14 118  11
2 Abilene Christian  Colorado State     38     10  72  11
3         Air Force        Michigan     29     13  29  50
4         Air Force            Army     21      0  43  50
5         Air Force         Wyoming     28     14 123  50
6         Air Force San Diego State     28     24 144  50

Each row of the data frame is a record of one match. TEAM1 and TEAM2 are the team names in the match; here TEAM1 is the winner and TEAM2 the loser. We also record the scores in SCORE1 and SCORE2.
For ease of indexing the teams, we also created IDs for the teams, stored in the columns No1 and No2. With this data frame ready, we can start our analysis.

Generate the Graph with OAAgraph

Now we show how to create a graph using OAAgraph based on the data frame scores.df. Here, we model each team as a node in the graph. The following code creates a table TEAM in Oracle Database containing all nodes, i.e., the team indices.

VID <- teams.df$No
NAME <- teams.df$TEAM
nodes.df <- data.frame(VID, NAME)

ore.drop(table = 'TEAM')
ore.create(nodes.df, table = 'TEAM')

A match between two teams is modeled as a directed edge. The direction is from the loser to the winner, so we can interpret the edge, or relation, as 'beaten by'. The following code creates a table of edges, TEAM_EDGES. Each row contains the edge ID (EID), SVID (the source node ID, i.e., the losing team in the match), DVID (the destination node ID, i.e., the winning team), EL (the label of the edge, 'beaten_by'), and other edge properties such as MARGIN, which is computed as the score difference normalized by the maximum score in the data set. This property can contain any numerical values by design. We will show how this property is used later.

scores.df <- scores.df[!is.na(scores.df$SCORE1) & !is.na(scores.df$SCORE2), ]

# if team1 is beaten by team2, then team1 -> team2
EID <- rownames(scores.df)
SVID <- ifelse(scores.df$SCORE1 > scores.df$SCORE2, scores.df$No2, scores.df$No1)
DVID <- ifelse(scores.df$SCORE1 > scores.df$SCORE2, scores.df$No1, scores.df$No2)
EL <- rep('beaten_by', nrow(scores.df))
MARGIN <- abs(scores.df$SCORE2 - scores.df$SCORE1)*1.0/max(c(scores.df$SCORE1, scores.df$SCORE2, 0))

edges <- data.frame(EID, SVID, DVID, EL, MARGIN)
ore.drop(table = "TEAM_EDGES")
ore.create(edges, table = "TEAM_EDGES")

Note that OREdplyr can also be used to do the data transformation if the data resides in the database originally. After we complete the node and edge tables in Oracle Database, we are ready to create the graph. A single command can be called, supplying both the node and edge tables.

try(oaa.rm(graph), silent = TRUE)
graph <- oaa.graph(TEAM_EDGES, TEAM, "teamGraph")

> graph
Graph Name: teamGraph
Number of Nodes: 209
Number of Edges: 781
Persistent Graph: FALSE
Node Properties: NAME
Edge Properties: MARGIN

PageRank

The reason we model the edges in this way is to utilize PageRank. PageRank was originally used by Google to rank the search results of web pages. It provides a powerful way to generate a ranking, or 'reputation', for a web page based on how many times it is cited or linked by other web pages. A brief introduction to PageRank can be found online. In graph terms, each web page is modeled as a node, and a link from page A to page B is modeled as an edge from A to B. The PageRank of a node will be higher when it is referenced by nodes that themselves have high ranks. The same idea can be applied to college football rankings. We model each college as a node and the match between two colleges as an edge. Here, the edge points from the loser to the winner, so the edge represents 'beaten by'. Therefore, if team A has more incoming edges, i.e., a higher in-degree, that means team A has beaten more teams. If the teams beaten by team A have a higher PageRank, then naturally team A will also have a higher PageRank. In this way, we can use PageRank to rank the football teams. Let us compute PageRank and see what the result looks like. In OAAgraph, a single command computes PageRank.
pagerank(graph, error = 0.0001, damping = 0.2, maxIterations = 1000)

This line of code computes PageRank and attaches the score to each node. To retrieve the PageRank, we can use a PGQL query in R:

cursor <- oaa.cursor(graph, query = "select n.NAME, n.pagerank where (n) order by n.pagerank desc")
oaa.next(cursor, 30)

The query returns the top 30 teams sorted by PageRank. The ranking is:

   n.NAME               n.pagerank
1  Clemson              0.006036588
2  Auburn               0.006022145
3  Central Florida      0.005866008
4  Iowa State           0.005833792
5  Georgia              0.005796430
6  Miami (FL)           0.005606519
7  Oklahoma             0.005581062
8  Fresno State         0.005540282
9  Louisiana State      0.005534469
10 Texas Christian      0.005520499
11 Pittsburgh           0.005513876
12 San Diego State      0.005478629
13 North Texas          0.005453995
14 Toledo               0.005428720
15 Memphis              0.005427890
16 Ohio State           0.005398382
17 Wisconsin            0.005387744
18 Notre Dame           0.005381033
19 Wake Forest          0.005376808
20 Alabama              0.005368288
21 Virginia Tech        0.005360752
22 North Carolina State 0.005338435
23 Boise State          0.005337153
24 South Carolina       0.005333423
25 Florida Atlantic     0.005315308
26 Southern Methodist   0.005298519
27 Central Michigan     0.005277379
28 Penn State           0.005271921
29 Stanford             0.005271275
30 Washington State     0.005264159

By the time this blog was written, the NCAA AP Poll ranking (week 14) was:

RK TEAM
1  Clemson
2  Oklahoma
3  Wisconsin
4  Auburn
5  Alabama
6  Georgia
7  Miami
8  Ohio State
9  Penn State
10 TCU
11 USC
12 UCF
13 Washington
14 Stanford
15 Notre Dame
16 Memphis
17 LSU
18 Oklahoma State
19 Michigan State
20 Northwestern
21 Washington State
22 Virginia Tech
23 South Florida
24 Mississippi State
25 Fresno State

It seems that PageRank ranks similar teams at the top, just as the AP Poll does, but in quite a different order. Some teams are ranked very high by PageRank but very low in the AP Poll, such as Central Florida, Iowa State, and Fresno State. Why is that? Let us take a look at Central Florida. We can run a PGQL query to find all winners/losers against Central Florida:

> cursor <- oaa.cursor(graph,
+   query ="SELECT f.NAME, g.NAME WHERE (f )-[e:beaten_by]->(g WITH NAME = 'Central Florida')")
> oaa.next(cursor, 20)
   f.NAME                g.NAME
1  Connecticut           Central Florida
2  Florida International Central Florida
3  Southern Methodist    Central Florida
4  Cincinnati            Central Florida
5  East Carolina         Central Florida
6  Maryland              Central Florida
7  Memphis               Central Florida
8  Navy                  Central Florida
9  South Florida         Central Florida
10 Temple                Central Florida
11 Austin Peay           Central Florida

> cursor <- oaa.cursor(graph,
+   query ="SELECT f.NAME, g.NAME WHERE (f WITH NAME = 'Central Florida')-[e:beaten_by]->(g )")
> oaa.next(cursor, 20)
Error in oaa.next.default(cursorObj, n) : cursor is empty

The error indicates that the query for the teams that beat Central Florida returns nothing. Actually, Central Florida won all of its first 11 games! That is why it is ranked so high. But in the AP Poll, Central Florida ranks only 12th. Another big difference is Alabama, which arguably should be ranked much higher but lands at only 20th here. One explanation is that PageRank places a lot of emphasis on the win/loss counts. The AP Poll, on the other hand, considers much more than win/loss counts, such as in-game statistics like quarterback rushing yards, interception/turnover counts, etc.

Weighted PageRank – Score Margin

Let us add more data into the calculation and see if we can improve the ranking. One thought is to consider the score margin of a team. If a team tends to win with a large margin, then that team should be ranked higher.
The way we incorporate the score margin is to use weighted PageRank. This algorithm allows a weight to be attached to each edge and ranks nodes higher when they have more incoming edge weight. The code can be written as:

> pagerank(graph, 0.0001, 0.1, 1000, variant = 'weighted', weightPropName = 'MARGIN')
oaa.cursor over: ID, weighted_pagerank
position: 0
size: 209
>
> cursor <- oaa.cursor(graph,
+   query = "select n.NAME, n.weighted_pagerank where (n) order by n.weighted_pagerank desc")
> oaa.next(cursor, 30)
   n.NAME               n.weighted_pagerank
1  Clemson              0.006511112
2  Auburn               0.006478130
3  Georgia              0.006234016
4  Central Florida      0.006195107
5  Notre Dame           0.006013695
6  Ohio State           0.005951788
7  Texas Christian      0.005940757
8  Oklahoma             0.005940566
9  Iowa State           0.005790163
10 Mississippi State    0.005757828
11 Wisconsin            0.005735430
12 Alabama              0.005698574
13 Miami (FL)           0.005693818
14 Penn State           0.005689439
15 Toledo               0.005595128
16 Oregon               0.005588613
17 Florida Atlantic     0.005514597
18 Pittsburgh           0.005501906
19 Fresno State         0.005501542
20 Memphis              0.005467258
21 Louisiana State      0.005453052
22 Virginia Tech        0.005430484
23 Southern California  0.005430218
24 Washington           0.005387747
25 San Diego State      0.005382916
26 Stanford             0.005359146
27 Missouri             0.005350855
28 Louisville           0.005331221
29 North Carolina State 0.005315241
30 Boise State          0.005286847

It looks like the ranking has improved a lot! Alabama is now ranked 12th. Another prominent difference is that Notre Dame ranks very high (5th), compared to 15th in the AP Poll. This is because Notre Dame won quite a few games by large margins: Temple (49-16), Boston College (49-20), Michigan State (38-18), Miami (OH) (52-17), and USC (49-14). Although the ranking is not necessarily closer to the AP Poll overall, we did see that adding weights to the links can impact the ranking through the weighted PageRank algorithm. We believe the ranking can be improved if more match statistics are added.

Adjustment with Number of Losses

One particular flaw with using the PageRank method to rank the teams is that the PageRank algorithm only focuses on the teams that each team has beaten. Recall that PageRank is computed as

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where PR() is the PageRank score, T1 - Tn are the teams beaten by A, and C(Ti) is the number of teams that Ti has lost to. From this formula, we can see that there is no information about the teams that beat team A! All the information used here is about the teams that lost to A. That gives us a biased ranking such that as long as a team beats excellent teams, that team will receive a high ranking. This can be seen from Iowa State. This team is not ranked high in the AP Poll, but received a high score in both the vanilla and weighted variants of PageRank. Let us take a look at this team.

> cursor <- oaa.cursor(graph,
+   query ="SELECT f.NAME, g.NAME WHERE (f )-[e:beaten_by]->(g WITH NAME = 'Iowa State')")
> oaa.next(cursor, 20)
  f.NAME          g.NAME
1 Texas Christian Iowa State
2 Baylor          Iowa State
3 Northern Iowa   Iowa State
4 Oklahoma        Iowa State
5 Akron           Iowa State
6 Kansas          Iowa State
7 Texas Tech      Iowa State
>
> cursor <- oaa.cursor(graph,
+   query ="SELECT f.NAME, g.NAME WHERE (f WITH NAME = 'Iowa State')-[e:beaten_by]->(g )")
> oaa.next(cursor, 20)
  f.NAME     g.NAME
1 Iowa State Oklahoma State
2 Iowa State Iowa
3 Iowa State Kansas State
4 Iowa State Texas
5 Iowa State West Virginia

It looks like Iowa State has 5 losses, which explains why it is not ranked high in the AP Poll. On the other hand, ISU beat highly ranked teams such as TCU, Oklahoma, and Texas Tech.
This significantly boosts the PageRank score. To avoid this defect, we can make an adjustment to the obtained PageRank score by penalizing the teams with losses. The idea is to multiply by a factor that decreases monotonically with the number of losses. Here we use an empirical formula:

PageRank / (a * # of losses + b)

The parameter b avoids a divide-by-zero error when the team has no losses. Both a and b can be chosen by design. Here we choose the parameters a and b such that the ranking looks as close as possible to the AP Poll. Let us first calculate the out-degree of each node:

degree(graph, "out", "nLost")

This value is attached to each node with the property name 'nLost', which represents the number of losses. Then we calculate the PageRank score.

pagerank(graph, error = 0.0001, damping = 0.6, maxIterations = 1000)
cursor <- oaa.cursor(graph,
  query = "select n.NAME, n.pagerank, n.nLost where (n) order by n.pagerank desc")
rank.df <- oaa.next(cursor, 30)

After the PageRank is obtained, we compute the adjusted score:

rank.df$SCORE <- rank.df$n.pagerank/(0.4*rank.df$n.nLost + 0.9)
rank.df[order(-rank.df$SCORE),]

   n.NAME               n.pagerank  n.nLost SCORE
1  Clemson              0.016961654 1       0.013047426
2  Central Florida      0.011323213 0       0.012581347
3  Auburn               0.018990912 2       0.011171124
4  Wisconsin            0.008984433 0       0.009982703
5  Oklahoma             0.012484544 1       0.009603495
6  Miami (FL)           0.012093970 1       0.009303054
7  Georgia              0.011239293 1       0.008645610
8  Alabama              0.010776809 1       0.008289853
9  Louisiana State      0.012340969 3       0.005876652
10 Ohio State           0.009626590 2       0.005662700
11 Notre Dame           0.010977572 3       0.005227415
12 Washington           0.008674728 2       0.005102781
13 Texas Christian      0.008620726 2       0.005071015
14 Southern California  0.008586252 2       0.005050737
15 Iowa State           0.014511512 5       0.005003970
16 Penn State           0.008371257 2       0.004924269
17 San Diego State      0.008303874 2       0.004884632
18 Stanford             0.010089052 3       0.004804311
19 Washington State     0.010070243 3       0.004795354
20 Fresno State         0.008914050 3       0.004244786
21 Boise State          0.008815810 3       0.004198005
22 Michigan State       0.008096342 3       0.003855401
23 Virginia Tech        0.007767759 3       0.003698933
24 Oklahoma State       0.007471362 3       0.003557791
25 Syracuse             0.013993134 8       0.003412959
26 North Carolina State 0.008332988 4       0.003333195
27 Iowa                 0.009211545 5       0.003176395
28 Pittsburgh           0.011743373 7       0.003173885
29 Mississippi State    0.007887910 4       0.003155164
30 Wake Forest          0.008395616 5       0.002895040

The result looks much better. Iowa State is now ranked 15th and Alabama ranks 8th. We believe that we can approach the AP Poll rankings even further by taking more match data into consideration, but that is out of the scope of this blog.

Conclusion

In this blog, we demonstrated how to use OAAgraph to generate rankings for NCAA football teams. The ranking results show that the top teams are close to the AP Poll, with a certain bias due to the limited data. By adding the score margin to the algorithm, we also demonstrated the application of the weighted PageRank algorithm and successfully generated rankings favoring teams with higher score margins. By adjusting the PageRank score with the number of losses, we improved the accuracy of the ranking. Perhaps one day an AI-generated ranking will serve as a primary ranking method for college football!


Tips and Tricks

Text Analytics using a pre-built Wikipedia-based Topic Model

In my previous post, Explicit Semantic Analysis (ESA) for Text Analytics, we explored the basics of the ESA algorithm and how to use it in Oracle R Enterprise to build a model from scratch and use that model to score new text. While creating your own domain-specific model may be necessary in many situations, others may benefit from a pre-built model based on millions of Wikipedia articles reduced to 200,000 topics. This model is downloadable here, with details on how to install it here.

Installing the model

The installation link provided above describes other prerequisites, such as the directory object and tablespaces. Once these are in place, when you load the model using impdp, you should see something like the following:

>impdp rquser/pswd@PDB01 dumpfile=wiki_model12.2.0.1.dmp directory=DATA_PUMP_DIR remap_schema=DMUSER:RQUSER remap_tablespace=TBS_1:ESA_MODELS_1 logfile=wiki_model_import_1.log

Import: Release 12.2.0.1.0 - Production on Tue Nov 28 11:11:25 2017
Copyright (c) 1982, 2017, Oracle and/or its affiliates. All rights reserved.
Connected to: Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
Master table "RQUSER"."SYS_IMPORT_FULL_01" successfully loaded/unloaded
import done in AL32UTF8 character set and AL16UTF16 NCHAR character set
export done in WE8DEC character set and AL16UTF16 NCHAR character set
Warning: possible data loss in character set conversions
Starting "RQUSER"."SYS_IMPORT_FULL_01": rquser/********@PDB01 dumpfile=wiki_model12.2.0.1.dmp directory=DATA_PUMP_DIR remap_schema=DMUSER:RQUSER remap_tablespace=TBS_1:ESA_MODELS_1 logfile=wiki_model_import_1.log
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
. . imported "RQUSER"."DM$PYWIKI_MODEL" 485.9 MB 21346563 rows
. . imported "RQUSER"."DM$P5WIKI_MODEL" 251.7 MB   289705 rows
. . imported "RQUSER"."DM$PRWIKI_MODEL" 5.573 MB   200887 rows
. . imported "RQUSER"."DM$PZWIKI_MODEL" 5.300 MB   196904 rows
. . imported "RQUSER"."DM$PDWIKI_MODEL" 4.415 MB   200886 rows
. . imported "RQUSER"."DM$PPWIKI_MODEL" 6.765 KB        1 rows
. . imported "RQUSER"."DM$PMWIKI_MODEL"     0 KB        0 rows
Processing object type TABLE_EXPORT/TABLE/INDEX/INDEX
Processing object type TABLE_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
Processing object type TABLE_EXPORT/TABLE/POST_INSTANCE/PROCDEPOBJ
Job "RQUSER"."SYS_IMPORT_FULL_01" successfully completed at Tue Nov 28 11:12:50 2017 elapsed 0 00:01:24

To create the needed preference and policy objects, you then execute:

exec ctx_ddl.drop_policy('wiki_txtpol');
exec ctx_ddl.drop_preference('wiki_lexer');
exec ctx_ddl.create_preference('wiki_lexer', 'BASIC_LEXER');
exec ctx_ddl.set_attribute('wiki_lexer', 'INDEX_STEMS', 'ENGLISH');
exec ctx_ddl.create_policy(policy_name => 'wiki_txtpol', lexer => 'wiki_lexer');

Creating a proxy object

Once the model is installed in the database, we need an ORE object as a proxy for the in-database ESA model. You may know that models created using ORE's ore.odm* functions have underlying in-database first-class objects that are accessed by ORE proxy objects. Since the WIKI_MODEL is also an in-database model, it too needs to be accessed using a proxy object. To enable this, execute the following R function, which constructs a model object with class ore.odmESA.

ore.createESA.wiki_model <- function () {
  model.name <- "WIKI_MODEL"
  attr(model.name, "owner")
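Once the proxy object is created, scoring new text with the Wikipedia-based model follows the same predict pattern used with custom-built ESA models in ORE. Below is a minimal sketch; the input column names and sample documents are illustrative assumptions rather than requirements of the pre-built model.

# Minimal sketch: score new documents against the pre-built Wikipedia model.
# The column names and sample text below are illustrative.
wiki.mod <- ore.createESA.wiki_model()

NEW_DOCS <- ore.push(data.frame(ID = 1:2,
                                TEXT = c("NASA launches a new Mars orbiter",
                                         "Global markets react to an interest rate change")))

# Return the top Wikipedia topic (feature) for each document
predict(wiki.mod, NEW_DOCS, type = "class", supplemental.cols = "ID")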


R Technologies

Supporting R through the R Consortium

Oracle has supported the R Consortium since its inception in 2015 (R Consortium Launched!). As a provider of multiple software tools and products that leverage and extend R, joining the R Consortium was a natural way for Oracle to give back to the R community and contribute to the evolution of the R ecosystem. The R Consortium provides vendors a forum within which to suggest needed projects for the R community, and to raise concerns. Through the Infrastructure Steering Committee (ISC), knowledgeable contributors from the R community review proposals and allocate funds to make needed R projects happen. Oracle is an active participant in ISC working groups such as Code Coverage, which released an enhanced version of the covr package and continues to explore ways to encourage the adoption of best practices for R package quality and security. The R Consortium also supports the R Implementation, Optimization, and Tooling (RIOT) Workshop, which held its third event at useR! 2017 in Brussels, Belgium; Oracle is one of the organizers of this event. The workshop goals include sharing experience in developing and promoting alternative R language implementations, promoting the development of different R implementations, and discussing future directions for the R language, among others. These are just two examples where the R Consortium brings together some of the best R talent in the world to address the needs of the R community. The R Consortium helps ensure the continued growth and maintenance of R and its broader ecosystem, while providing visibility for companies supporting and contributing to R. We encourage other companies to consider the value the R Consortium brings to the R community and join the R Consortium.


Explicit Semantic Analysis (ESA) for Text Analytics

New in Oracle R Enterprise 1.5.1 with the Oracle Database 12.2 Oracle Advanced Analytics option is the text analytics algorithm Explicit Semantic Analysis, or ESA. Compared to other techniques such as Latent Dirichlet Allocation (LDA) or Term Frequency-Inverse Document Frequency (TF-IDF), ESA offers some unique benefits. Most notably, it improves text document categorization by computing "semantic relatedness" between the documents and a set of topics that are explicitly defined and described by humans. An example of such a corpus of documents is the set of Wikipedia articles. Each article is equated with a topic - the article title. The ESA algorithm can discover topics related to a document from this set of Wikipedia topics. OAA provides a pre-built Wikipedia model that is based on a select subset of English Wikipedia articles. In a subsequent post (Text Analytics using a pre-built Wikipedia-based Topic Model), we'll explore using this pre-built model. Other encyclopedic sources can be used to improve text categorization in domain-specific contexts.

How does ESA compare with LDA? ESA contrasts with LDA in that ESA uses such a knowledge base, making it possible to assign human-readable labels to concepts or topics. Topics discovered by LDA are latent, i.e., they can often be difficult to interpret, since they are defined by their keywords, not abstract descriptions or labels. While LDA labels can be given meaning by extracting keywords, definitions based solely on keywords tend to be fuzzy, with keywords from different topics overlapping and not yielding a convenient topic name. ESA, on the other hand, uses documents with clearly-labeled topics. Two common use cases for ESA include calculating the semantic similarity between text documents or between mixed data (text and structured data), and explicit topic modeling for a given document or text. Further, with LDA, the topic set varies with changes to the training data, making comparison across different data sets difficult. Changing the training data also changes the topic boundaries such that topics cannot be mapped to an existing knowledge base. The OAA implementation of ESA enables using text columns in combination with optional categorical and numerical columns. Users can also combine multiple knowledge bases, each with its own topic set, which may or may not overlap. Topic overlap does not affect ESA's ability to detect relevant topics.

Consider the following simple example, where the objective is to extract topics from the titles provided. The ESA_TEXT variable is created as an ore.frame with an ID and the title text and serves as our training data, or corpus.

title <- c('Aids in Africa: Planning for a long war',
           'Mars rover maneuvers for rim shot',
           'Mars express confirms presence of water at Mars south pole',
           'NASA announces major Mars rover finding',
           'Drug access, Asia threat in focus at AIDS summit',
           'NASA Mars Odyssey THEMIS image: typical crater',
           'Road blocks for Aids')

ESA_TEXT <- ore.push(data.frame(ID = seq(length(title)),
                                TITLE = title))

Next, we create a text policy, which requires execute privilege on the CTXSYS.CTX_DDL package.
ore.exec("BEGIN ctx_ddl.create_policy('ESA_TXTPOL'); END;") At this point we can build the model esa.mod <- ore.odmESA(~., data = ESA_TEXT,     odm.setting = list(CASE_ID_COLUMN_NAME = "DOC_ID",                        ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",                        ESAS_MIN_ITEMS = 1),     ctx.setting = list(TITLE = c("MIN_DOCUMENTS:1", "MAX_FEATURES:3"))) Note that the argument odm.setting specifies a list object for the Oracle Data Mining (ODM) parameter settings. This argument is available when building a model in Database 12.2 or later. Each list element's name and value refer to the parameter setting as specified for ODM. Parameter CASE_ID_COLUMN_NAME specifies the column name containing the unique identifier associated with each case of the data. Parameter ODMS_TEXT_POLICY_NAME specifies the name of a valid Oracle Text policy used for text mining. The argument ctx.setting is a list to specify Oracle Text attribute-specific settings. Similarly, this argument is applicable to building a model in Database 12.2 or later. In the following execution, we explore a few variants to model building and scoring using ESA models. R> title <- c('Aids in Africa: Planning for a long war', + 'Mars rover maneuvers for rim shot', + 'Mars express confirms presence of water at Mars south pole', + 'NASA announces major Mars rover finding', + 'Drug access, Asia threat in focus at AIDS summit', + 'NASA Mars Odyssey THEMIS image: typical crater', + 'Road blocks for Aids') R> R> # Text contents in character column R> df <- data.frame(ID = seq(length(title)), TITLE = title) R> ESA_TEXT <- ore.push(df) R> R> # Convert TITLE column to CLOB data type R> attr(df$TITLE, "ora.type") <- "clob" R> ESA_TEXT_CLOB <- ore.push(df) R> As above, we create our 'title' ore.frame, but define it with the text first as a character vector, and then as a CLOB data type. This second option can be useful when the data is stored in a table already with the CLOB data type. We first use ore.frame ESA_TEXT_CLOB to build the model below, and then ESA_TEXT. Also, notice the difference in specifying odm.settings. The resulting esa.mod object is of type 'ore.odmESA'. R> ore.exec("BEGIN ctx_ddl.create_policy('ESA_TXTPOL'); END;") R> R> esa.mod <- ore.odmESA(~., data = ESA_TEXT_CLOB, + odm.settings = list(case_id_column_name = "ID", + ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL", + ODMS_TEXT_MIN_DOCUMENTS = 1, + ODMS_TEXT_MAX_FEATURES = 3, + ESAS_MIN_ITEMS = 1, + ESAS_VALUE_THRESHOLD = 0.0001, + ESAS_TOPN_FEATURES = 3)) R> class(esa.mod) [1] "ore.odmESA" "ore.model" R> R> summary(esa.mod) Call: ore.odmESA(formula = ~., data = ESA_TEXT_CLOB, odm.settings = list(case_id_column_name = "ID", ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL", ODMS_TEXT_MIN_DOCUMENTS = 1, ODMS_TEXT_MAX_FEATURES = 3, ESAS_MIN_ITEMS = 1, ESAS_VALUE_THRESHOLD = 1e-04, ESAS_TOPN_FEATURES = 3)) Settings: value min.items 1 topn.features 3 value.threshold 1e-04 odms.missing.value.treatment odms.missing.value.auto odms.sampling odms.sampling.disable odms.text.max.features 3 odms.text.min.documents 1 odms.text.policy.name ESA_TXTPOL prep.auto ON Features: FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT 1 1 TITLE.AIDS 1.0000000 2 2 TITLE.MARS 0.4078615 3 2 TITLE.ROVER 0.9130438 4 3 TITLE.MARS 1.0000000 5 4 TITLE.NASA 0.6742695 6 4 TITLE.ROVER 0.6742695 7 5 TITLE.AIDS 1.0000000 8 6 TITLE.MARS 0.4078615 9 6 TITLE.NASA 0.9130438 10 7 TITLE.AIDS 1.0000000 Using the summary function, we see the call, settings, and features. 
The functions settings and features allow obtaining these results explicitly as well. The features output provides an ID, the corresponding attribute name(s) as multiple rows, an optional attribute value (depending on whether categorical data was provided to the model), and the resulting coefficient. Note that the ATTRIBUTE_NAME column has values that include the text column attribute name as a prefix. This allows different columns to provide text, yet still differentiate coefficients.

R> settings(esa.mod)
                   SETTING_NAME                 SETTING_VALUE SETTING_TYPE
1  ALGO_NAME                    ALGO_EXPLICIT_SEMANTIC_ANALYS INPUT
2  ESAS_MIN_ITEMS               1                             INPUT
3  ESAS_TOPN_FEATURES           3                             INPUT
4  ESAS_VALUE_THRESHOLD         1e-04                         INPUT
5  ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO       DEFAULT
6  ODMS_SAMPLING                ODMS_SAMPLING_DISABLE         DEFAULT
7  ODMS_TEXT_MAX_FEATURES       3                             INPUT
8  ODMS_TEXT_MIN_DOCUMENTS      1                             INPUT
9  ODMS_TEXT_POLICY_NAME        ESA_TXTPOL                    INPUT
10 PREP_AUTO                    ON                            INPUT
R>
R> features(esa.mod)
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1  1          TITLE.AIDS                     1.0000000
2  2          TITLE.MARS                     0.4078615
3  2          TITLE.ROVER                    0.9130438
4  3          TITLE.MARS                     1.0000000
5  4          TITLE.NASA                     0.6742695
6  4          TITLE.ROVER                    0.6742695
7  5          TITLE.AIDS                     1.0000000
8  6          TITLE.MARS                     0.4078615
9  6          TITLE.NASA                     0.9130438
10 7          TITLE.AIDS                     1.0000000

Using the overloaded predict function, we can score the same data, or interchangeably use data with character vector type, as found in ESA_TEXT and depicted here. The feature identifier is predicted, supplemented by the ID column. The feature identifier can be joined with the features listed above to produce the desired text.

R> predict(esa.mod, ESA_TEXT, type = "class", supplemental.cols = "ID")
  ID FEATURE_ID
1 1  1
2 2  2
3 3  3
4 4  4
5 5  1
6 6  6
7 7  1

In this next example, we illustrate using ESA_TEXT (the character column) to build the model, but then predicting using the ESA_TEXT_CLOB data. In this example, we reintroduce the use of the ctx.settings argument. Note that the resulting model features are essentially the same as before.

R> esa.mod2 <- ore.odmESA(~., data = ESA_TEXT,
+ odm.settings = list(case_id_column_name = "ID",
+ ESAS_MIN_ITEMS = 1),
+ ctx.settings = list(TITLE =
+ "TEXT(POLICY_NAME:ESA_TXTPOL)(TOKEN_TYPE:STEM)(MIN_DOCUMENTS:1)(MAX_FEATURES:3)"))
R>
R> summary(esa.mod2)

Call:
ore.odmESA(formula = ~., data = ESA_TEXT, odm.settings = list(case_id_column_name = "ID",
    ESAS_MIN_ITEMS = 1), ctx.settings = list(TITLE = "TEXT(POLICY_NAME:ESA_TXTPOL)(TOKEN_TYPE:STEM)(MIN_DOCUMENTS:1)(MAX_FEATURES:3)"))

Settings:
                             value
min.items                    1
topn.features                1000
value.threshold              .00000001
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                odms.sampling.disable
odms.text.max.features       300000
odms.text.min.documents      3
prep.auto                    ON

Features:
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1  1          TITLE.AIDS                     1.0000000
2  2          TITLE.MARS                     0.4078615
3  2          TITLE.ROVER                    0.9130438
4  3          TITLE.MARS                     1.0000000
5  4          TITLE.MARS                     0.3011997
6  4          TITLE.NASA                     0.6742695
7  4          TITLE.ROVER                    0.6742695
8  5          TITLE.AIDS                     1.0000000
9  6          TITLE.MARS                     0.4078615
10 6          TITLE.NASA                     0.9130438
11 7          TITLE.AIDS                     1.0000000
R>

Here, we predict using ESA_TEXT_CLOB, but could also use ESA_TEXT as input. When the argument type is set to 'class', the feature with the maximum probability is returned. If set to 'raw', the probability for each feature is returned (not shown).
R> predict(esa.mod2, ESA_TEXT_CLOB, type = "class", supplemental.cols = "ID") ID FEATURE_ID 1 1 1 2 2 2 3 3 3 4 4 4 5 5 1 6 6 6 7 7 1 The function feature_compare returns an ore.frame containing a column that measures the relatedness between documents in the provided ore.frame, here ESA_TEXT_CLOB. The columns to compare are in compare.cols, in this case column TITLE. The supplemental column ID allows us to see the similarity between pairs of rows in ESA_TEXT_CLOB. R> feature_compare(esa.mod2, ESA_TEXT_CLOB, compare.cols = "TITLE", supplemental.cols = "ID") ID_A ID_B SIMILARITY 1 1 1 1.0000000 1.1 1 2 0.0000000 1.2 1 3 0.0000000 1.3 1 4 0.0000000 1.4 1 5 1.0000000 1.5 1 6 0.0000000 1.6 1 7 1.0000000 2 2 3 0.6322308 2.1 2 2 1.0000000 2.2 2 1 0.0000000 2.3 2 4 0.8608680 2.4 2 7 0.0000000 2.5 2 6 0.5259416 2.6 2 5 0.0000000 3 3 1 0.0000000 3.1 3 2 0.6322308 3.2 3 3 1.0000000 3.3 3 6 0.6322308 3.4 3 5 0.0000000 ... R> Finally, we can drop the text policy to clean up. R> ore.exec("BEGIN ctx_ddl.drop_policy('ESA_TXTPOL'); END;") In this blog, we introduced the ORE capability of using the new Explicit Semantic Analysis algorithm in the Oracle Advanced Analytics option, and illustrated a few ways in which models can be built, data scored, and similarity among text assessed using a custom-built model. In a subsequent post (Text Analytics using a pre-built Wikipedia-based Topic Model), we'll look at using the pre-built Wikipedia model.


Getting started with OAAgraph - vignette

Following up on the introductory post on OAAgraph, here is a vignette that illustrates using some of the OAAgraph package's capabilities. Recall that OAAgraph enables seamless interaction between R users of Oracle R Enterprise (ORE) of the Oracle Advanced Analytics option, Oracle Database, and the Parallel Graph AnalytiX (PGX) engine of the Oracle Spatial and Graph option. In this post, we highlight a few aspects of OAAgraph:

Creating a graph from node and edge tables residing in Oracle Database

Invoking graph analytics algorithms: countTriangles, degree, pagerank

Using the oaa.cursor object

Creating a graph from a snapshot, the in-memory representation stored in the database

Creating tables from the nodes, edges, and their properties in an oaa.graph proxy object

Cleaning up in-memory graphs and database objects

In this architecture, Oracle Database together with both ORE and PGX resides on the database server machine, while the client R engine loads packages for both ORE and OAAgraph. Note: using OAAgraph requires installation of the Oracle Spatial and Graph PGX engine; see the OAAgraph documentation for details.

Let's begin. First, we load the ORE and OAAgraph packages at the client and connect to ORE using ore.connect, and then to the PGX server using oaa.graphConnect. Provide the same database credentials in oaa.graphConnect to allow the PGX server to access and create database tables in Oracle Database.

library(ORE)
library(OAAgraph)

dbHost <- "myHost"
dbUser <- "myUserID"
dbPassword <- "myPassword"
dbSid <- "myDatabaseSID"
pgxBaseUrl <- "myPGXBaseURL"

ore.connect(host=dbHost, user=dbUser, password=dbPassword, sid=dbSid)
oaa.graphConnect(pgxBaseUrl=pgxBaseUrl, dbHost=dbHost, dbSid=dbSid, dbUser=dbUser, dbPassword=dbPassword)

To keep the example simple, we create a small set of nodes in a data.frame and then create that as a node table in Oracle Database. Note that VID refers to the vertex identifier, which must be numeric.

VID <- c(1, 2, 3, 4, 5)
NP1 <- c("node1", "node2", "node3", "node4", "node5")
NP2 <- c(111.11, 222.22, 333.33, 444.44, 555.55)
NP3 <- c(1, 2, 3, 4, 5)
nodes <- data.frame(VID, NP1, NP2, NP3)

ore.drop(table="MY_NODES")
ore.create(nodes, table = "MY_NODES")

Similarly, we create the edge table in Oracle Database, where EID refers to the required numeric edge identifier, SVID to the source vertex identifier, DVID to the destination vertex identifier, and EL to the edge label. Any other columns become named edge properties; here we chose EP1, but any name may be used.

EID <- c(1, 2, 3, 4, 5)
SVID <- c(1, 3, 3, 2, 4)
DVID <- c(2, 1, 4, 3, 2)
EL <- c("label1", "label2", "label3", "label4", "label5")
EP1 <- c("edge1", "edge2", "edge3", "edge4", "edge5")
edges <- data.frame(EID, SVID, DVID, EP1, EL)

ore.drop(table="MY_EDGES")
ore.create(edges, table = "MY_EDGES")

Using ore.ls, we can then verify that the tables exist as ore.frames, and use the transparency layer of ORE to view their dimensions and summary statistics.

> ore.ls(pattern="MY") # view newly created tables in schema
[1] "MY_EDGES"  "MY_EDGES1" "MY_NODES"  "MY_NODES1"
> class(MY_NODES)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> colnames(MY_NODES)
[1] "VID" "NP1" "NP2" "NP3"
> dim(MY_NODES)
[1] 5 4
> summary(MY_NODES)
      VID        NP1           NP2            NP3
 Min.   :1   node1:1   Min.   :111.1   Min.   :1
 1st Qu.:2   node2:1   1st Qu.:222.2   1st Qu.:2
 Median :3   node3:1   Median :333.3   Median :3
 Mean   :3   node4:1   Mean   :333.3   Mean   :3
 3rd Qu.:4   node5:1   3rd Qu.:444.4   3rd Qu.:4
 Max.   :5             Max.   :555.5   Max.   :5
>

Now we're ready to create a graph in PGX from the database node and edge tables.
We can assign a name to this graph as well, here 'MY_PGX_GRAPH'. Note that in OAAgraph 2.4.2, names will be converted to upper case. The graph name is used in particular when creating snapshots, as depicted below.

graph <- oaa.graph(MY_EDGES, MY_NODES, "MY_PGX_GRAPH")
graph
names(graph, "nodes")
names(graph, "edges")

Upon executing these statements, we see that printing the oaa.graph object displays the name, the number of nodes and edges, whether the graph is persisted (i.e., has a snapshot), along with the node and edge property names. We can also get the node and edge property names using the overloaded names function.

> graph <- oaa.graph(MY_EDGES, MY_NODES, "MY_PGX_GRAPH")
> graph
Graph Name: MY_PGX_GRAPH
Number of Nodes: 5
Number of Edges: 5
Persistent Graph: FALSE
Node Properties: NP1, NP3, NP2
Edge Properties: EP1
> names(graph, "nodes")
[1] "NP1" "NP3" "NP2"
> names(graph, "edges")
[1] "EP1"
>

Let's see the result of the countTriangles function, which counts the number of triangles in the graph, giving an overview of the number of connections between nodes in neighborhoods. Here, we see this simple graph has two such triangles. Note that this algorithm is intended for undirected graphs. To make a graph undirected, use the oaa.undirect function.

> countTriangles(graph, sortVerticesByDegree=FALSE)
[1] 2
>

Next, we'll look at the degree function and its variants. As a result of invoking functions like degree, new properties are added to graph nodes with the name specified in the 'name' argument. These properties can be accessed through the cursor object shown below.

degree(graph, name = "OutDegree")
degree(graph, name = "InDegree", variant = "in")
degree(graph, name = "InOutDegree", variant = "all")

After executing these functions, we have three new node properties:

> names(graph, "nodes")
[1] "OutDegree" "NP1" "NP3" "NP2" "InOutDegree" "InDegree"
>

To access the metrics computed by the degree function, we create a cursor including the names of the degree properties provided above:

cursor <- oaa.cursor(graph, c("OutDegree", "InOutDegree", "InDegree"), "nodes")

Let's view the 5 entries from the cursor:

> oaa.next(cursor, 5)
  OutDegree InOutDegree InDegree
1 1         2           1
2 1         3           2
3 2         3           1
4 1         2           1
5 0         0           0
>

We can also use a Parallel Graph Query Language (PGQL) query to retrieve the same values, to compute values, specify ordering, perform a graph pattern search, and more. See this link to PGQL for details.

cursor <- oaa.cursor(graph,
  query = "select n.id(),n.OutDegree,n.InOutDegree,n.InDegree where (n) order by n.OutDegree desc")

Again, view the 5 entries from the cursor. Note that the node identifier can be accessed using n.id() in the select portion of the PGQL query.

> oaa.next(cursor, 5)
  n.id() n.OutDegree n.InOutDegree n.InDegree
1 3      2           3             1
2 2      1           3             2
3 1      1           2             1
4 4      1           2             1
5 5      0           0             0
>

Similarly, we can compute the pagerank metric on the graph, which produces the default property named 'pagerank'.

pagerankCursor <- pagerank(graph, error=0.085, damping=0.1, maxIterations=100)
oaa.next(pagerankCursor, 5)

Next, we create a cursor over the pagerank property using PGQL and view the results.

cursor <- oaa.cursor(graph,
  query = "select n.pagerank where (n) order by n.pagerank desc")

> oaa.next(cursor, 5)
2 0.22
3 0.20
1 0.19
4 0.19
5 0.18
>

This could be done using the ordering argument to the oaa.cursor function as well.

cursor <- oaa.cursor(graph, "pagerank", ordering="desc")

OAAgraph provides the capability to persist a graph as a snapshot in the database.
Since reconstructing a graph from node and edge tables can take longer than loading a binary representation of a graph, persisting a graph that will be used often can save the user significant load time. To list the available graph snapshots, use the oaa.graphSnapshotList function.

oaa.graphSnapshotList()

To export a binary snapshot of the whole graph into Oracle Database, use the oaa.graphSnapshotPersist function, where we can specify that some or all node or edge properties should be maintained. By setting the argument overwrite to TRUE, an existing snapshot with the same name is replaced; otherwise, an error is returned. After creating the snapshot, view the listing to see the name of the graph appear.

oaa.graphSnapshotPersist(graph, nodeProperties = TRUE, edgeProperties = TRUE, overwrite=TRUE)

> oaa.graphSnapshotList()
[1] "MY_PGX_GRAPH"
>

We can load the snapshot into memory by name, creating a new oaa.graph proxy object, graph2.

graph2 <- oaa.graphSnapshot("MY_PGX_GRAPH")

Viewing this graph, we see it has the same components as our original graph.

> graph2
Graph Name: MY_PGX_GRAPH_2
Number of Nodes: 5
Number of Edges: 5
Persistent Graph: TRUE
Node Properties: OutDegree, NP1, pagerank, NP3, NP2, InOutDegree, InDegree
Edge Properties: EP1
>

To wrap up our vignette, we export the graph into database tables. First, we'll export all nodes and their properties. Using oaa.create, we can overwrite existing tables and specify the number of connections to open to the database. If the number of connections is greater than one, PGX will write the graph to the database in parallel.

oaa.create(graph2, nodeTableName = "RANKED_NODES", nodeProperties = TRUE, overwrite=TRUE, numConnections = 1)

Next, we export both nodes and edges as tables into the database, but only export the pagerank node property:

oaa.create(graph2, nodeTableName = "RANKED_GRAPH_N", nodeProperties = c("NP1", "pagerank"),
           edgeTableName = "RANKED_GRAPH_E", overwrite=TRUE, numConnections = 1)

Lastly, we export only the graph edges and their properties:

oaa.create(graph2, edgeTableName = "RANKED_EDGES", edgeProperties = TRUE)

Now that we're finished with our oaa.graph proxy objects, we free the graphs at the PGX server, invoking oaa.rm on the two graphs.

oaa.rm(graph)
oaa.rm(graph2)

To clean up the tables created above, we use the ORE function ore.drop:

ore.drop("MY_NODES")
ore.drop("MY_EDGES")
ore.drop("RANKED_NODES")
ore.drop("RANKED_GRAPH_N")
ore.drop("RANKED_GRAPH_E")
ore.drop("RANKED_EDGES")

Finally, we can remove the snapshot using oaa.dropSnapshots:

oaa.dropSnapshots("MY_PGX_GRAPH")

In this vignette, we exercised a variety of the capabilities provided with OAAgraph. See the OAAgraph webpage on OTN for more details.


Working Effectively with Support

When simultaneously learning a new tool and working toward a deliverable deadline, getting timely help with technical problems is critical. If you work with R/Oracle R Distribution, Oracle R Enterprise, or Oracle R Advanced Analytics for Hadoop and need to engage support resources, we recommend doing everything you can to expedite a solution to the problem you are facing. The following tips on working effectively with support will enable more efficient communication, leading to faster resolution of most issues.

Preliminary Research

If the question is related to base R, search the R-Help mailing list on CRAN and the R section of Stack Overflow. If the question is specific to Oracle's R offerings, search the Oracle Developer Community R Technologies Forum and My Oracle Support. The r-project posting guide shows several ways to search for R help, lists the common mistakes people make in posting questions, and provides details on the resources available for getting help. Stack Overflow provides some excellent suggestions on posting questions, including doing the work to thoroughly research your question. When it comes time to initiate a Service Request through My Oracle Support, share your research and troubleshooting steps. Describe what you tried and the reasons it didn't resolve your issue. This helps get a specific and relevant response to your question. In addition, provide the following information:

Product Name

When presenting questions related to Oracle's R products, provide the full name of the product being used. A common error is referring to the product as "Oracle R", but there's no product called "Oracle R". Oracle's Advanced Analytics R products include:

Oracle R Distribution: Oracle's redistribution of open source R

Oracle R Enterprise: Proprietary R packages that run against tables residing in Oracle Database

Oracle R Advanced Analytics for Hadoop: Proprietary R packages that run against data residing in Hadoop

Version Information

Always include the version of each component you are using: R/Oracle R Distribution, Oracle R Enterprise, Oracle Database (or appliance, such as ODA, BDA or Exadata), and Operating System.

$ uname -a
Linux 4.1.12-61.1.23.el6uek.x86_64 #2 SMP Tue Dec 20 16:51:41 PST 2016 x86_64 x86_64 x86_64 GNU/Linux

R> version$version.string
[1] "Oracle Distribution of R version 3.2.0  (--)"
R> packageVersion("ORE")
[1] ‘1.5’
R> packageVersion("ORCH")
[1] ‘2.7.0’

Reproducible Example

The key element that helps resolve R-related problems is a minimal reproducible example along with a description of the problem. If data cannot be shared, create a sample data set or use one of R's built-in data sets to reproduce the problem. Include sufficient contextual information so that support understands your goal. If the problem is complicated or contains lengthy code, identify the problem R function and provide a simpler script that reproduces the problem whenever possible.

A simple reproducible example consists of the following items:

A minimal data set necessary to reproduce the error.

The minimal code necessary to reproduce the error, which can be run on the given data set.

Looking at the examples in R help files is often helpful. In general, all the code there fulfills the requirements of a minimal reproducible example: data and minimal code are provided, and everything is executable.
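As an illustration, a minimal reproducible example for an ORE question might look like the sketch below, using one of R's built-in data sets; the model call is illustrative, and the connection details are omitted.

# Minimal reproducible example: built-in data, the shortest code path that
# shows the behavior, and the exact output or error (versions reported separately).
library(ORE)
ore.connect(...)                    # connection details omitted

IRIS <- ore.push(iris)              # built-in data set pushed to the database
mod <- ore.odmGLM(Sepal.Length ~ Petal.Length, data = IRIS)
summary(mod)                        # include the full output or error received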
To view the help file for a given R function, type at the R prompt:

R> help(function name)

For example, one can load the help file for the lm() function using this syntax:

R> help(lm)

Installation Logs

If you are facing an installation problem, in addition to the Operating System, R version, and other relevant product versions, provide the installation log showing the commands executed and the resulting errors. Indicate whether the installation is on the server or client side. In the case of installation, most of the answers already exist in the product's installation guide.

Other Tips

Be as specific as you can in describing your problem. State exactly what happened, including the executed code, where it was executed, and any error messages received. Providing only the error message is often insufficient to resolve the problem. Support will often need to duplicate the issue, even if they have worked with similar code and errors in the past. Above all, carefully consider the information support may need to reproduce the problem on their system prior to submitting your question. The majority of questions regarding Oracle R Distribution, Oracle R Enterprise, and Oracle R Advanced Analytics for Hadoop sent to support require more information upfront. Providing the details in this post will lead to a timely answer. See also MOS Doc ID 166650.1 for a more general set of tips on working effectively with Oracle Support.


Tips and Tricks

Building "partition models" with Oracle R Enterprise

There are many approaches for improving model accuracy - anything from enriching or cleansing the data you start with to optimizing algorithm parameters or creating ensemble models. One technique that Oracle R Enterprise users sometimes employ is to partition data based on the distinct values of one or more columns and build a model for each partition. By building a model on each partition, forming a kind of ensemble model, better accuracy is possible. The embedded R execution function ore.groupApply enables users to do this in a data-parallel manner; however, this requires managing the models explicitly. New in Oracle R Enterprise 1.5.1 is a feature that further automates this process. Oracle R Enterprise allows users to easily specify "partition models" by providing a single argument that accepts one or more columns on which to partition the data. The result is a single model that is used and managed as one model. When the model is built, the abstraction of a single model is provided to the user. This single model can then be used for scoring new data. Oracle R Enterprise automatically partitions the new data and selects the appropriate component model to be used for scoring. The same capability is available through the SQL API of Oracle Data Mining.

Let's look at an example. We'll use the data set stored in the database table 'WINE', originally from the UCI repository https://archive.ics.uci.edu/ml/datasets/Wine+Quality, consisting of over 6000 data samples for both red and white wines. We will illustrate the use of partitioned models by first building a default Support Vector Machine model, then a partition model separating red wines from white wines. First, using the ORE transparency layer, we randomly split the data into train and test sets. Note that sampling occurs in the database and the resulting samples remain in the database as well, producing ore.frame proxy objects.

> n.rows <- nrow(WINE)
> row.names(WINE) <- WINE$color
> set.seed(seed=6218945)
> random.sample <- sample(1:n.rows, ceiling(n.rows/2))
> WINE.train <- WINE[random.sample,]
> WINE.test <- WINE[setdiff(1:n.rows,random.sample),]

Using the Oracle Advanced Analytics in-database Support Vector Machine (SVM) algorithm, we build a classification model that predicts quality based on the remaining columns, excluding pH and fixed.acidity as a result of using ore.odmAI for attribute importance / feature selection (not shown). Next, we predict using the model on the test data set, appending the quality column to the result to facilitate generating a confusion matrix.

> mod.svm <- ore.odmSVM(quality~.-pH-fixed.acidity, WINE.train,
+                       "classification", kernel.function="linear")
> pred.svm <- predict(mod.svm, WINE.test, "quality")

First, let's examine the prediction result. It contains probabilities for each class, by default, along with the actual value and prediction. Looking at the confusion matrix, we see the model only predicts three classes, predominantly focused on quality 5 and 6.
> head(pred.svm,3)
      '3'        '4'        '5'       '6'        '7'       '8'        '9'
red   0.12740669 0.12740718 0.2355671 0.12740755 0.1274072 0.12740682 0.12739745
red.1 0.12511451 0.12511496 0.2493204 0.12511526 0.1251149 0.12511455 0.12510538
red.2 0.08893389 0.08893407 0.4664034 0.08893388 0.0889339 0.08893366 0.08892724
      quality PREDICTION
red   6       5
red.1 6       5
red.2 5       5

> with(pred.svm, table(quality,PREDICTION, dnn = c("Actual","Predicted")))
      Predicted
Actual   5   6   9
     3  14   2   0
     4  74  24   0
     5 923 152   0
     6 789 631   3
     7 130 405   0
     8  20  79   0
     9   0   2   0

Let's now build a partitioned model, separating data based on the color column. All the other parameters remain the same for both model building and predicting.

> mod.svm2 <- ore.odmSVM(quality~.-pH-fixed.acidity, WINE.train,
+                        "classification", kernel.function="linear",
+                        odm.settings=list(odms_partition_columns = "color"))
> pred.svm2 <- predict(mod.svm2, WINE.test, "quality")

Looking at the first few predictions, we see that there are no scores produced for quality 9. Looking further into the data, this is because no reds received a rating of 9.

> head(pred.svm2,3)
      '3'          '4'       '5'       '6'       '7'       '8'       '9' quality
red   1.936908e-04 0.1390664 0.2136602 0.3689467 0.1390664 0.1390666 NA  6
red.1 1.310798e-05 0.1366415 0.3563299 0.2337322 0.1366414 0.1366418 NA  6
red.2 2.127579e-02 0.1236327 0.4279944 0.1798320 0.1236323 0.1236328 NA  5
      PREDICTION
red   6
red.1 5
red.2 5

See the confusion matrix from the partition model, where the predictions have improved modestly, but more classes are being predicted. Clearly, more work can be done on tuning the model, but this serves to illustrate the concept and use of partition models.

> with(pred.svm2, table(quality,PREDICTION, dnn = c("Actual","Predicted")))
      Predicted
Actual   3   5   6   7   8   9
     3   0  13   3   0   0   0
     4   0  73  24   0   0   1
     5   1 865 208   0   1   0
     6   1 694 725   2   1   0
     7   0 103 426   5   1   0
     8   0  20  78   1   0   0
     9   0   1   1   0   0   0

With a partition model, we can also view the component models using the partitions function. Further, we can get summary details on each component model by accessing it using the [] operator and invoking summary.

> partitions(mod.svm2)
  PARTITION_NAME color
1 red            red
2 white          white

> summary(mod.svm2["red"])
$red

Call:
ore.odmSVM(formula = quality ~ .
- pH - fixed.acidity, data = WINE.train, type = "classification", kernel.function = "linear",
    odm.settings = list(odms_partition_columns = "color"))

Settings:
                             value
clas.weights.balanced        OFF
odms.max.partitions          1000
odms.missing.value.treatment odms.missing.value.auto
odms.partition.columns       "color"
odms.sampling                odms.sampling.disable
prep.auto                    ON
active.learning              al.enable
conv.tolerance               1e-04
kernel.function              linear

Coefficients:
   PARTITION_NAME class variable             value estimate
1  red            3     (Intercept)                -6.527165e+00
2  red            3     alcohol                    -7.049019e-01
3  red            3     chlorides                  -4.485416e-02
4  red            3     citric.acid                -2.022373e+00
5  red            3     density                    -1.086982e+00
6  red            3     free.sulfur.dioxide        -4.736011e-01
7  red            3     residual.sugar              9.530107e-01
8  red            3     sulphates                   3.817229e-01
9  red            3     total.sulfur.dioxide       -1.438305e+00
10 red            3     volatile.acidity            3.716756e-01
11 red            4     (Intercept)                -1.000002e+00
12 red            4     alcohol                    -2.586759e-07
13 red            4     chlorides                   1.060210e-07
14 red            4     citric.acid                -4.288424e-07
15 red            4     density                    -2.802913e-07
16 red            4     free.sulfur.dioxide         6.349260e-08
17 red            4     residual.sugar              2.162969e-07
18 red            4     sulphates                   2.098022e-07
19 red            4     total.sulfur.dioxide       -2.514968e-07
20 red            4     volatile.acidity            4.040972e-07
21 red            5     (Intercept)                -3.970451e-01
22 red            5     alcohol                    -7.772918e-01
23 red            5     chlorides                   2.753498e-01
24 red            5     citric.acid                -1.069587e-01
25 red            5     density                     1.424669e-01
26 red            5     free.sulfur.dioxide        -1.786699e-01
27 red            5     residual.sugar             -1.787814e-01
28 red            5     sulphates                  -5.966614e-01
29 red            5     total.sulfur.dioxide        4.356998e-01
30 red            5     volatile.acidity            7.460307e-02
31 red            6     (Intercept)                -3.563915e-01
32 red            6     alcohol                     5.199011e-01
33 red            6     chlorides                  -6.433083e-02
34 red            6     citric.acid                -8.938135e-02
35 red            6     density                    -3.060354e-02
36 red            6     free.sulfur.dioxide         1.633804e-01
37 red            6     residual.sugar             -3.666848e-02
38 red            6     sulphates                   2.400912e-01
39 red            6     total.sulfur.dioxide       -2.590342e-01
40 red            6     volatile.acidity           -2.599212e-01
41 red            7     (Intercept)                -1.000002e+00
42 red            7     alcohol                     1.288288e-06
43 red            7     chlorides                  -4.242262e-07
44 red            7     citric.acid                 1.310518e-07
45 red            7     density                     1.236209e-07
46 red            7     free.sulfur.dioxide         4.925050e-07
47 red            7     residual.sugar              3.419610e-07
48 red            7     sulphates                   7.295076e-07
49 red            7     total.sulfur.dioxide       -1.046192e-06
50 red            7     volatile.acidity           -5.349807e-07
51 red            8     (Intercept)                -1.000000e+00
52 red            8     alcohol                     6.933444e-08
53 red            8     chlorides                  -8.491114e-08
54 red            8     citric.acid                 4.561492e-08
55 red            8     density                     1.115950e-08
56 red            8     free.sulfur.dioxide        -8.925505e-08
57 red            8     residual.sugar              3.791063e-08
58 red            8     sulphates                   5.243467e-08
59 red            8     total.sulfur.dioxide        3.168575e-08
60 red            8     volatile.acidity            3.733357e-08

The ORE partition model feature makes it easy for users to specify, build, and score on partitions of their data, potentially resulting in improved model accuracy. Many of the ore.odm* algorithms support the "partition model" feature.


FAQ

Integrating custom algorithms with Oracle Advanced Analytics with R

Data scientists and other users of machine learning and predictive analytics technology often have their favorite algorithm for solving particular problems. If they are using a tool like Oracle Advanced Analytics -- with Oracle R Enterprise and Oracle Data Mining -- there's a desire to use these algorithms within that tool's framework. Using ORE's embedded R execution, users can already use 3rd party R packages in combination with Oracle Database for execution at the database server machine. New in Oracle R Enterprise 1.5.1 and Oracle Database 12.2 is the ability to integrate 3rd party algorithms with three user-defined functions: model build, score, and model details -- referred to as extensible R algorithm models. Some details:

- Users provide R scripts stored in the ORE R Script Repository that implement functionality for model build, score, and model details (viewing of R models)
- Supported mining functions: classification, regression, clustering, feature_extraction, attribute_importance, and association
- The predict method executes the score.function specified and returns an ore.frame containing the predictions along with any columns specified by the supplemental.cols argument
- Function predict is applicable to classification, regression, clustering, and feature_extraction models
- Registered algorithms can be used with Oracle Data Mining in Oracle Database 12.2 or later

Here's a simple script that defines the functions supporting R's GLM algorithm. The first, glm_build, builds a GLM model given data, formula and family as arguments. The second, glm_score, invokes R's predict function on the model and data provided as arguments. The third, glm_detail, takes a model and returns a data.frame with whatever information the user intends.

IRIS <- ore.push(iris)
ore.scriptCreate("glm_build", function(data, form, family) {
  glm(formula = form, data = data, family = family)})
ore.scriptCreate("glm_score", function(mod, data) {
  res <- predict(mod, newdata = data); data.frame(res)})
ore.scriptCreate("glm_detail", function(mod) {
  data.frame(name=names(mod$coefficients), coef=mod$coefficients)})

To invoke this algorithm from R, we use the function ore.odmRAlg, which expects an ore.frame with the data, the formula for building the model, and the three functions. Note that only the build function is required; scoring and model details are optional. The overloaded summary function returns the model details, and the overloaded predict function scores the data and returns any requested supplemental columns.
ralg.mod <- ore.odmRAlg(IRIS, mining.function = "regression", formula = c(form="Sepal.Length ~ ."), build.function = "glm_build", build.parameter = list(family="gaussian"), score.function = "glm_score", detail.function = "glm_detail", detail.value = data.frame(name="a", coef=1)) summary(ralg.mod) ralg.mod$details predict(ralg.mod, newdata = head(IRIS), supplemental.cols = "Sepal.Length") Here is the output from invoking ore.odmRAlg and the remaining functions: > ralg.mod <- ore.odmRAlg(IRIS, mining.function = "regression", + formula = c(form="Sepal.Length ~ ."), + build.function = "glm_build", build.parameter = list(family="gaussian"), + score.function = "glm_score", + detail.function = "glm_detail", detail.value = data.frame(name="a", coef=1)) > summary(ralg.mod) Call: ore.odmRAlg(data = IRIS, mining.function = "regression", formula = c(form = "Sepal.Length ~ ."), build.function = "glm_build", build.parameter = list(family = "gaussian"), score.function = "glm_score", detail.function = "glm_detail", detail.value = data.frame(name = "a", coef = 1)) Settings: value odms.missing.value.treatment odms.missing.value.auto odms.sampling odms.sampling.disable prep.auto OFF build.function RQUSER.glm_build build.parameter select 'Sepal.Length ~ .' "form", 'gaussian' "family" from dual details.format select cast('a' as varchar2(4000)) "name", 1 "coef" from dual details.function RQUSER.glm_detail score.function RQUSER.glm_score name coef 1 (Intercept) 2.1712663 2 Petal.Length 0.8292439 3 Petal.Width -0.3151552 4 Sepal.Width 0.4958889 5 Speciesversicolor -0.7235620 6 Speciesvirginica -1.0234978 > ralg.mod$details name coef 1 (Intercept) 2.1712663 2 Petal.Length 0.8292439 3 Petal.Width -0.3151552 4 Sepal.Width 0.4958889 5 Speciesversicolor -0.7235620 6 Speciesvirginica -1.0234978 > predict(ralg.mod, newdata = head(IRIS), supplemental.cols = "Sepal.Length") Sepal.Length PREDICTION 1 5.1 5.004788 2 4.9 4.756844 3 4.7 4.773097 4 4.6 4.889357 5 5.0 5.054377 6 5.4 5.388886 In a subsequent post, we'll look at doing the same from SQL.
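One practical note when iterating on these user-defined functions: entries created with ore.scriptCreate persist in the database R Script Repository, so during development you may want to list and drop them before redefining a script. A small housekeeping sketch, assuming the ORE 1.5.1 script repository functions and the script names from the example above:

ore.scriptList()                 # list scripts currently registered in the repository

# drop the example scripts before redefining them
ore.scriptDrop("glm_build")
ore.scriptDrop("glm_score")
ore.scriptDrop("glm_detail")

# re-create a script with a modified body, e.g., a default family argument
ore.scriptCreate("glm_build",
                 function(data, form, family = "gaussian")
                   glm(formula = form, data = data, family = family))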


Building Rcpp on 64-bit Solaris SPARC Systems

One reason R has become so popular is the vast array of add-on packages available at the CRAN and Bioconductor repositories. R's package system along with the CRAN framework provides a process for authoring, documenting and distributing packages to millions of users. However, users and administrators wanting to build packages requiring C++ on 64-bit Solaris SPARC systems are often unable to compile their packages using Oracle Developer Studio.

R uses $R_HOME/etc/Makeconf as a system-wide default for C, C++ and Fortran build variables. The options in Makeconf are defined during configuration prior to build time and are subsequently the build options used to compile R packages. Base R does not require C++ during build time and therefore, when using the Oracle Developer Studio compiler, the resulting build variables are often insufficient for building R packages that use C++ templates. This blog outlines the build flags required for compiling the popular open source R package Rcpp using R-3.3.0 under Oracle Developer Studio 12.5 for Solaris SPARC.

For Rcpp, we focus on the following compiler options:

CC: Program for compiling C programs; default ‘cc’.
CXX: Program for compiling C++ programs; default ‘g++’.
CXXFLAGS: Extra flags to give to the C++ compiler.
CPPFLAGS: Extra flags to give to the C preprocessor and programs that use it (the C compiler).
SHLIB_CXXLDFLAGS: Extra flags for the shared library linker.

By default, the Oracle Developer Studio C++ compiler uses an older STL that is incompatible with modern C++. There are several options to use a modern STL; we used the g++ STL/binary compatible mode for Solaris versions 10, 11 and 12: -std=c++03.

CC = <Oracle Studio 12.5 path>/bin/cc -xc99 -m64
CXX = <Oracle Studio 12.5 path>/bin/CC -m64 -std=c++03
CXXFLAGS = -xO3 -m64 -std=c++03
CXX1X = <Oracle Studio 12.5 path>/bin/CC -m64 -std=c++03

Additionally, the CC -G command does not pass any -l options to ld. If you want the shared library to have a dependency on another shared library, you must pass the necessary -l option on the command line. For example, if you want the shared library to be dependent upon libCrunG3.so.1, you must pass -lCrunG3 on the command line. Rcpp requires the libraries libstdc++, libgcc_s and libCrunG3, so the flags for the shared library linker will be:

SHLIB_CXXLDFLAGS = -G -L/<Oracle Studio 12.5 path>/lib -lstdc++ -lgcc_s -lCrunG3

With these updates to $R_HOME/etc/Makeconf, we can now install Rcpp:

R> install.packages("Rcpp")
trying URL 'http://camoruco.ing.uc.edu.ve/cran/src/contrib/Rcpp_0.12.12.tar.gz'
Content type 'application/x-gzip' length 2421289 bytes (2.3 MB)
==================================================
downloaded 2.3 MB
* installing *source* package ‘Rcpp’ ...
** package ‘Rcpp’ successfully unpacked and MD5 sums checked
..
..
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (Rcpp)

Note that directly including C++ library linking options in $R_HOME/etc/Makeconf is a global approach. You can override variables in Makeconf with either of two approaches: on a per-user basis, by modifying ~/.R/Makevars, or on a per-package basis, by modifying <package name>/src/Makevars.

For example, with package Rcpp, adding PKG_LIBS = -lstdc++ -lgcc_s -lCrunG3 to Rcpp/src/Makevars would satisfy the flags for the shared library linker and thus it would not be necessary to define SHLIB_CXXLDFLAGS globally.
For more information on custom R build configurations, refer to the R manuals on Writing R Extensions and Custom Package Compilation.
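As a concrete illustration of the per-user approach mentioned above, the same compiler and linker settings can be placed in ~/.R/Makevars so they apply whenever that user installs packages, leaving the system-wide Makeconf untouched. This is a sketch only; the path to your Oracle Developer Studio installation will differ:

# ~/.R/Makevars -- per-user overrides picked up when this user installs packages
CXX = <Oracle Studio 12.5 path>/bin/CC -m64 -std=c++03
CXX1X = <Oracle Studio 12.5 path>/bin/CC -m64 -std=c++03
CXXFLAGS = -xO3 -m64 -std=c++03
SHLIB_CXXLDFLAGS = -G -L/<Oracle Studio 12.5 path>/lib -lstdc++ -lgcc_s -lCrunG3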


Best Practices

Monitoring progress of embedded R functions

When you run R functions in Oracle Database, especially functions involving multiple R engines in parallel, you can monitor their progress using the Oracle R Enterprise datastore as a central location for progress notifications, or any intermediate status or results. In the following example, based on ore.groupApply, we illustrate instrumenting a simple function that builds a linear model to predict flight arrival delay based on a few other variables. In the function modelBuildWithStatus, the function verifies that there are rows for building the model after eliminating incomplete cases supplied in argument dat. If not empty, the function builds a model and reports “success”, otherwise, it reports “no data.” It’s likely that the user would like to use this model in some way or save it in a datastore for future use, but for this example, we just build the model and discard it, validating that a model can be built on the data. For this example, we use flights data from the nycflights13 package. We can easily load the flights data.frame into Oracle Database as a table using ore.create. This creates a proxy object in R of class ore.frame that can be used for in-database computations.   library(nycflights13) # ore.connect (...) ore.create(flights, table="FLIGHTS") modelBuildWithStatus <- function(dat) { dat <- dat[complete.cases(dat),] if (nrow(dat)>0L) { mod <- lm(arr_delay ~ distance + air_time + dep_delay, dat); "success" } else "no_data" } When we invoke this using ore.groupApply, the goal is to build one model per “unique carrier” or airline. Using the parallel argument, we specify the degree of parallelism and set it to 2. res <- ore.groupApply(FLIGHTS[, c("carrier","distance", "arr_delay", "dep_delay", "air_time")], FLIGHTS$carrier, modelBuildWithStatus, parallel=2L) res.local<-ore.pull(res) res.local[unlist(res.local)=="no_data"] res.local The result tells us about the status of each execution. Below, we print the carriers that had no data. > res.local $`9E` [1] "success" $AA [1] "success" $AS [1] "success" $B6 [1] "success" ... To monitor the progress of each execution, we can identify the group of data being processed in each function invocation using the value from the carrier column. For this particular data set, we use the first two characters of the carrier’s symbol appended to “group.” to form a unique object name for storing in the datastore identified by job.name. (If we don’t do this, the value will form an invalid object name.) Note that since the carrier column contains uniform data, we need only the first value. The general idea for monitoring progress is to save an object in the datastore named for each execution of the function on a group. We can then list the contents of the named datastore and compute a percentage complete, which is discussed later in this post. For the “success” case, we assign the value “SUCCESS” to the variable named by the string in nm that we created earlier. Using ore.save, this uniquely named object is stored in the datastore with the name in job.name. We use the append=TRUE flag to indicate that the various function executions will be sharing the same named datastore. If there is no data left in dat, we assign “NO DATA” to the variable named in nm and save that. Notice in both cases, we’re still returning “success” or “no data” so these come back in the list returned by ore.groupApply. However, we can return other values instead, e.g., the model produced.   
modelBuildWithMonitoring <- function(dat, job.name) { nm <- paste("group.", substr(as.character(dat$carrier[1L]),1,2), sep="") dat <- dat[complete.cases(dat),] if (nrow(dat)>0L) { mod <- lm(arr_delay ~ distance + air_time + dep_delay, dat); assign(nm, "SUCCESS") ore.save(list=nm, name=job.name, append=TRUE) "success" } else { assign(nm, "NO DATA") ore.save(list=nm, name=job.name, append=TRUE) "no data" } } When we use this function in ore.groupApply, we provide the job.name and ore.connect arguments as well. The variable ore.connect must be set to TRUE in order to use the datastore. As the ore.groupApply executes, the datastore named by job.name will be increasingly getting objects added with the name of the carrier. First, delete the datastore named “job1”, if it exists. ore.delete(name="job1") res <- ore.groupApply(FLIGHTS[, c("carrier","distance", "arr_delay", "dep_delay", "air_time")], FLIGHTS$carrier, modelBuildWithMonitoring, job.name="job1", parallel=2L, ore.connect=TRUE) To see the progress during execution, we can use the following function, which takes a job name and the cardinality of the INDEX column to determine the percent complete. This function is invoked in a separate R engine connected to the same schema. If the job name is found, we print the percent complete, otherwise stop with an error message. check.progress <- function(job.name, total.groups) { if ( job.name %in% ore.datastore()$datastore.name ) print(sprintf("%.1f%%", nrow(ore.datastoreSummary(name=job.name))/total.groups*100L)) else stop(paste("Job", job.name, " does not exist")) } To invoke this, compute the total number of groups and provide this and the job name to the function check.progress. total.groups <- length(unique(FLIGHTS$carrier)) check.progress("job1",total.groups) However, we really want a loop to report on the progress automatically. One simple approach is to set up a while loop with a sleep delay. When we reach 100%, stop. To be self-contained, we include a simplification of the function above as a local function. check.progress.loop <- function(job.name, total.groups, sleep.time=2) { check.progress <- function(job.name, total.groups) { if ( job.name %in% ore.datastore()$datastore.name ) print(sprintf("%.1f%%", nrow(ore.datastoreSummary(name=job.name))/total.groups*100L)) else paste("Job", job.name, " does not exist") } while(1) { try(x <- check.progress(job.name,total.groups)) Sys.sleep(sleep.time) if(x=="100.0%") break } } As before, this function is invoked in a separate R engine connected to the same schema. check.progress.loop("job1",total.groups) Looking at the results, we can see the progress reported at one second intervals. Since the models build quickly, it doesn’t take long to reach 100%. For functions that take longer to execute or where there are more groups to process, you may choose a longer sleep time. Following this, we look at the datastore “job1” using ore.datastore and its contents using ore.datastoreSummary. 
R> check.progress.loop("job1",total.groups,sleep.time=1) [1] "6.9%" [1] "96.6%" [1] "100.0%" > ore.datastore("job1") datastore.name object.count size creation.date description 1 job1 16 592 2017-09-20 12:38:26 > ore.datastoreSummary("job1") object.name class size length row.count col.count 1 group.9E character 37 1 NA NA 2 group.AA character 37 1 NA NA 3 group.AS character 37 1 NA NA 4 group.B6 character 37 1 NA NA 5 group.DL character 37 1 NA NA 6 group.EV character 37 1 NA NA 7 group.F9 character 37 1 NA NA 8 group.FL character 37 1 NA NA 9 group.HA character 37 1 NA NA 10 group.MQ character 37 1 NA NA 11 group.OO character 37 1 NA NA 12 group.UA character 37 1 NA NA 13 group.US character 37 1 NA NA 14 group.VX character 37 1 NA NA 15 group.WN character 37 1 NA NA 16 group.YV character 37 1 NA NA The same basic technique can be used to note progress in any long running or complex embedded R function, e.g., in ore.tableApply or ore.doEval. At various points in the function, sequence-named objects can be added to a datastore. Moreover, the contents of those objects can contain incremental or partial results, or even debug output. While we’ve focused on the R API for embedded R execution, the same functions could be invoked using the SQL API. However, monitoring would still be performed from an interactive R engine.
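The same datastore can hold more than status strings. A hedged variation of modelBuildWithMonitoring that also saves each per-carrier model, so it can be retrieved later with ore.load in any session connected to the same schema, might look like the following (the function name modelBuildAndSave is illustrative):

modelBuildAndSave <- function(dat, job.name) {
  nm <- paste("group.", substr(as.character(dat$carrier[1L]), 1, 2), sep="")
  dat <- dat[complete.cases(dat),]
  if (nrow(dat) > 0L) {
    mod <- lm(arr_delay ~ distance + air_time + dep_delay, dat)
    assign(nm, mod)                              # store the model itself, not a string
    ore.save(list=nm, name=job.name, append=TRUE)
    "success"
  } else "no data"
}

# later, from another R session connected to the same schema:
# ore.load(name="job1", list="group.AA")   # restores the object group.AA
# summary(group.AA)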


FAQ

Contrasting Oracle R Distribution and Oracle R Enterprise

What is the distinction between Oracle R Distribution and Oracle R Enterprise?

Oracle R Distribution (ORD) is Oracle's redistribution of open source R, with enhancements for dynamically loading high performance libraries like Intel's Math Kernel Library (MKL) and for setting R memory limits on database server-side R engine execution. Oracle provides support for ORD to customers of the Oracle Advanced Analytics option (which includes Oracle R Enterprise and Oracle Data Mining), Oracle Enterprise Linux, and Oracle Big Data Appliance. ORD can be used in combination with R packages such as those downloaded from CRAN. Oracle does not, however, provide support for non-Oracle-provided R packages.

Oracle R Enterprise (ORE) is a set of R packages and a supporting library that allow R users to manipulate data stored in Oracle Database tables and views, leveraging Oracle Database as a high performance compute engine. As noted above, ORE is a component of the Oracle Advanced Analytics option to Oracle Database Enterprise Edition. ORE functionality can be divided into three main areas: transparency layer, machine learning, and embedded R execution.

The Transparency Layer allows R users to specify standard R syntax on ore.frame objects - a subclass of data.frame - which serve as proxies for database tables and views. Rather than pull data into R memory, ORE translates R function invocations into Oracle SQL for execution in Oracle Database. This is also referred to as "function pushdown." By eliminating data movement, several things are accomplished: 1) fast access - eliminating the time required to move data, which for bigger data is significant, 2) scalability - eliminating client R engine memory limitations to hold, manipulate, and transform the data, and 3) performance - leveraging Oracle Database parallelism, query optimization, column indexes, data partitioning, etc.

The ORE Machine Learning, or predictive analytics, capability provides a set of in-database, parallel algorithms that are exposed through an R interface. The set of algorithms was highlighted in a recent blog post. These algorithms include features such as algorithm-specific automatic data preparation and support for integrated text mining. The set of algorithms can be supplemented with the use of open source R packages, such as those available on CRAN, via embedded R execution, discussed next.

Embedded R Execution refers to the ability to execute user-defined R functions at the database server, in locally spawned R engines under control of Oracle Database. With the proper database permissions, third party R packages such as those from CRAN can be installed at the database server R engine. Users can store their R functions in the database R Script Repository and invoke them by name, passing data and arguments. Users can also leverage data-parallel and task-parallel execution of their R scripts. In addition to invoking named R scripts from R, users can invoke them from SQL. With SQL invocation, structured results and images can be returned as database tables, and R objects and images can be returned together as an XML string. This enables enterprises to more readily integrate R results into applications and dashboards, as Oracle Database provides the necessary "plumbing" to make this easy. Further, database components such as DBMS_SCHEDULER can be used to schedule R script execution.

For more detailed information on Oracle R Enterprise, see this link.
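As a brief illustration of the function pushdown described above, standard R syntax applied to an ore.frame is translated to SQL and executed in Oracle Database; only the comparatively small result travels back to the client. This is a hedged sketch: the table and column names below (SALES_HISTORY, REGION, AMOUNT) are hypothetical, and the connection details are site-specific.

library(ORE)
# ore.connect(...)                    # connection arguments are site-specific
ore.sync()                            # refresh the list of visible tables
DF <- ore.get("SALES_HISTORY")        # hypothetical table; DF is an ore.frame proxy
class(DF)                             # "ore.frame" - no data has been pulled to R
nrow(DF)                              # row count computed in the database
agg <- aggregate(DF$AMOUNT,           # in-database aggregation via overloaded aggregate()
                 by  = list(region = DF$REGION),
                 FUN = sum)
head(agg)                             # only the aggregated rows are returned to the client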


Graph Analytics and Machine Learning - A Great Combination

Graphs are everywhere, whether looking at social media such as Facebook (friends of friends), Twitter, and LinkedIn, or customer relationships such as who calls whom or which bank accounts have money transfers between them. Graph algorithms come in two major flavors: computational graph analytics, where we analyze the entire graph to compute metrics or identify graph components, and graph pattern matching, where queries find sub-graphs corresponding to specified patterns. In contrast, machine learning algorithms typically train models based on observed data called cases, where each row often corresponds to a single case and the columns correspond to predictors and targets. These models are used to learn patterns in the data for scoring data and making predictions. As depicted above, allowing seamless interaction between graph analytics and machine learning in a single environment or language, such as R, enables data scientists to leverage powerful graph algorithms to supplement the machine learning process with computed graph metrics as predictor variables. The graph analysis can provide additional strong signals, thereby making predictions more accurate. Similarly, machine learning scores or predictions can be used in combination with graph pattern matching or analytics. For example, identifying groups of close customers from their mobile call graph can improve customer churn prediction. If a customer with strong connections to other customers were to churn, this may increase the likelihood of customers in his call graph to also churn. Since a given problem can be approached from different perspectives, it may be beneficial to investigate a given problem using both graph and machine learning algorithms, and then comparing / contrasting the results for greater insight into the problem and solution. New with Oracle R Enterprise 1.5.1 - a component of the Oracle Advanced Analytics option to Oracle Database - is the availability of the R package OAAgraph, which provides a single, unified interface supporting the complementary use of machine learning and graph analytics technologies. OAAgraph leverages the ORE transparency layer and the Parallel Graph Analytics (PGX) engine from the Oracle Spatial and Graph option to Oracle Database. PGX is an in-memory graph analytics engine that provides fast, parallel graph analysis using built-in algorithm packages, graph query / pattern-matching, and custom algorithm compilation. With some thirty-five graph algorithms, PGX exceeds open source tool capabilities. OAAgraph uses ore.frame objects representing a graph's node and edge tables to construct an in-memory graph. While the basic node table includes node identifiers, nodes can also have properties, stored in node table columns. Similarly, relationships among nodes are described as edges - from node identifier to node identifier. Each edge may also have properties stored in edge table columns. Various graph algorithms can now be applied to the graph, and the results such as node or edge metrics, or sub-graphs can be exported again as database tables, for use by ORE machine learning algorithms. In subsequent blog posts, we will explore OAAgraph in more detail.
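Although we leave the OAAgraph API itself to those upcoming posts, the node and edge tables it consumes are ordinary database tables that can be staged from R today with the ORE functions already shown on this blog. The sketch below is purely illustrative; the table names, column names, and toy values are hypothetical.

# nodes: one row per entity, with an identifier plus optional property columns
nodes <- data.frame(VID  = 1:4,
                    NAME = c("Ann", "Bob", "Cai", "Dee"))
# edges: one row per relationship, from node id to node id, plus property columns
edges <- data.frame(EID  = 1:4,
                    SVID = c(1, 1, 2, 3),          # source node id
                    DVID = c(2, 3, 3, 4),          # destination node id
                    CALL_MINUTES = c(12, 3, 45, 7))

ore.drop(table = "CUST_NODES")   # remove any previous copies, if present
ore.drop(table = "CUST_EDGES")
ore.create(nodes, table = "CUST_NODES")
ore.create(edges, table = "CUST_EDGES")
# The resulting CUST_NODES and CUST_EDGES ore.frames are the kind of node and
# edge tables OAAgraph uses to construct the in-memory PGX graph described above.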


Introducing a dplyr interface to Oracle R Enterprise

While Oracle R Enterprise already provides seamless access to Oracle Database tables using standard R syntax and functions, new interfaces arise that make it conceptually easier for users to manipulate tabular data. The R package dplyr is one such package in the tidyverse that has gained wide adoption. It provides a grammar for data manipulation while working with data.frame-like objects, both in memory and out of memory. The dplyr package is intended to interface to database management systems, operating on data.frame or numeric vector objects. New with Oracle R Enterprise 1.5.1, OREdplyr provides much of the dplyr functionality extending the ORE transparency layer. OREdplyr accepts, e.g., ore.frame objects instead of data.frames for in-database execution of dplyr function calls. Like the ORE transparency layer in general, OREdplyr allows users to avoid costly movement of data while scaling to larger data volumes because operations are not constrained by R client memory, the latency of data movement, or single-threaded execution, but leverage Oracle Database as a high performance compute engine. OREdplyr maps closely to dplyr for both functions and arguments, and operates on both ore.frame and ore.numeric objects. Like dplyr, functions support both the non-standard evaluation (NSE) and standard evaluation (SE) interface. Note the NSE functions are good for interactive use, while SE functions are convenient for programming. See this dplyr vignette for details. So what does this interface look like with ORE? Here are just a few examples: library(OREdplyr) library(nycflights13) # contains data sets # Import data to Oracle Database ore.drop("FLIGHTS") # remove database table, if exists ore.create(as.data.frame(flights), table="FLIGHTS") # create table from data.frame dim(FLIGHTS) # get # rows and # columns names(FLIGHTS) # view names of columns head(FLIGHTS) # verify data.frame appears as expected # Basic operations select(FLIGHTS, year, month, day, dep_delay, arr_delay) %>% head() # select columns select(FLIGHTS, -year,-month, -day) %>% head() # exclude columns select(FLIGHTS, tail_num = tailnum) %>% head() # rename columns, but drops others rename(FLIGHTS, tail_num = tailnum) %>% head() # rename columns filter(FLIGHTS, month == 1, day == 1) %>% head() # filter rows filter(FLIGHTS, dep_delay > 240) %>% head() filter(FLIGHTS, month == 1 | month == 2) %>% head() arrange(FLIGHTS, year, month, day) %>% head() # sort rows by specified columns arrange(FLIGHTS, desc(arr_delay)) %>% head() # sort in descending order distinct(FLIGHTS, tailnum) %>% head() # see distinct values distinct(FLIGHTS, origin, dest) %>% head() # see distinct pairs mutate(FLIGHTS, speed = air_time / distance) %>% head() # compute and add new columns mutate(FLIGHTS, # keeps existing columns gain = arr_delay - dep_delay, speed = distance / air_time * 60) %>% head() transmute(FLIGHTS, # only keeps new computed columns gain = arr_delay - dep_delay, gain_per_hour = (arr_delay - dep_delay) / (air_time / 60)) %>% head() summarise(FLIGHTS, # aggregates the specified column values mean_delay = mean(dep_time,na.rm=TRUE), min_delay = min(dep_time,na.rm=TRUE), max_delay = max(dep_time,na.rm=TRUE), sd_delay = sd(dep_time,na.rm=TRUE)) Functions supported include: Data manipulation: select, filter, arrange, rename, mutate, transmute, distinct, slice, desc, select_, filter_, arrange_, rename_, mutate_, transmute_ , distinct_, slice_, inner_join, left_join, right_join, full_join Grouping: group_by, groups, ungroup, group_size, n_groups, 
group_by_
Aggregation: summarise, summarise_, tally, count, count_
Sampling: sample_n, sample_frac
Ranking: row_number, min_rank, dense_rank, percent_rank, cume_dist, ntile, nth, first, last, n_distinct, top_n

With OREdplyr, ORE expands into the tidyverse, giving R users another powerful way to manipulate database data while avoiding costly data movement. This enables R users and data scientists to work with larger data volumes, unconstrained by R client memory, the latency associated with data movement, or single-threaded execution, and to take advantage of the powerful database technology present in Oracle Database.
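Combining the grouping and aggregation functions listed above gives concise in-database summaries. For example, here is a small sketch that ranks carriers by average arrival delay using the FLIGHTS proxy created earlier; it uses only functions from the supported list and the aggregations shown above.

FLIGHTS %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(arr_delay, na.rm = TRUE),
            max_delay  = max(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay)) %>%
  head(10)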


News

Oracle R Enterprise 1.5.1 for Oracle Database is now available

We are pleased to announce that Oracle R Enterprise (ORE) 1.5.1 is now available for download for Oracle Database Enterprise Edition with Oracle R Distribution 3.3.0 / R-3.3.0. Oracle R Enterprise is a component of the Advanced Analytics option to Oracle Database. With ORE 1.5.1, we introduce two new packages: OREdplyr - a transparency layer enhancement - allows ORE users access to many of the popular dplyr functions on ore.frames; and a second package via separate download, OAAgraph 2.4.1, which provides an R interface to the powerful Oracle Spatial and Graph Parallel Graph Engine (PGX) for use in combination with ORE and database tables. New in-database algorithms exposed through R in the OREdm package include: Expectation Maximization (EM), Explicit Semantic Analysis (ESA), and Singular Value Decomposition (SVD). In addition, ORE 1.5.1 enables performing automated text processing for Oracle Data Mining's Support Vector Machine (SVM), Generalized Linear Model (GLM), KMeans, SVD, Non-negative Matrix Factorization (NMF), and ESA models; and the building of "partitioned models" and "extensible R algorithm models" for users to define R functions that plug into the Oracle Advanced Analytics in-database model framework. Here are the highlights for the new and enhanced features in ORE 1.5.1. Upgraded R version compatibility ORE 1.5.1 is certified with R-3.3.0 - both open source R and Oracle R Distribution. See the server support matrix for the complete list of supported R versions. R-3.3.0 brings improved performance and big in-memory data objects, and compatibility with the ever-growing community-contributed R packages. For supporting packages, ORE 1.5.1 has upgraded several packages: arules 1.5-0 cairo 1.5-9 DBI 0.6-1 png 0.1-7 ROracle 1.3-1 statmod 1.4.29 randomForest 4.6-12 OREdplyr The dplyr package is widely used providing a grammar for data manipulation while working with data.frame-like objects, both in memory and out of memory. The dplyr package is also an interfaces to database management systems, operating on data.frame or numeric vector objects. OREdplyr provides a subset of dplyr functionality extending the Oracle R Enterprise transparency layer. OREdplyr functions accept ore.frames instead of data.frames for in-database execution of the corresponding dplyr functions. OREdplyr allows users to avoid costly movement of data while scaling to larger data volumes because operations are not constrained by R Client memory. OAAgraph OAAgraph is a new package that provides a single, unified interface supporting the complementary use of machine learning and graph analytics technologies. Graph analytics use a graph representation of data, where data entities are nodes and relationships are edges. Machine learning produces models that identify patterns in data for both descriptive and predictive analytics. Together, these technologies complement and augment one another. Graph analytics can be used to compute graph metrics and analysis using efficient graph algorithms and representations. These metrics can be added to structured data where machine learning algorithms build models including graph metrics as predictors - producing more accurate results. Similarly, machine learning models can be used to score or classify data. These results can be added to graph nodes where graph algorithms can be used to further explore the graph or compute new metrics leveraging the machine learning result. 
New In-database Algorithms Expectation Maximization (EM) - a popular probability density estimation technique used to implement a distribution-based clustering algorithm. Special features of this algorithm implementation include: automated model search that finds the number of clusters or components up to a stated maximum; protects against overfitting; supports numeric and multinomial distributions; produces models with high quality probability estimates; generates cluster hierarchy, rules, and other statistics; supports both Gaussian and multi-value Bernoulli distributions; and includes heuristics that automatically choose distribution types. Explicit Semantic Analysis (ESA) - designed to improve text categorization, this algorithm computes "semantic relatedness" using cosine similarity between vectors representing the text, collectively interpreted as a space of concepts explicitly defined and described by humans. The name "explicit semantic analysis" contrasts with latent semantic analysis (LSA) because ESA uses a knowledge base that makes it possible to assign human-readable labels to concepts comprising the vector space. Singular Value Decomposition (SVD) - this feature extraction algorithm uses orthogonal linear transformations to capture the underlying variance of data by decomposing a rectangular matrix into three matrices: U, D and V. Matrix D is a diagonal matrix and its singular values reflect the amount of data variance captured by the bases. Special features of this algorithm implementation include: support for narrow data via Tall and Skinny solvers and wide data via stochastic solvers, and providing traditional SVD for more stable results and eigensolvers for faster analysis with sparse data. Automated Text Processing For select algorithms in the OREdm package (Support Vector Machine, Singular Value Decomposition, Non-negative Matrix Factorization, Explicit Semantic Analysis), users can now identify columns that should be treated as text, similar to how Oracle Data Mining enables automated text processing as a precursor to model building and scoring. A new argument, ctx.setting, allows the user to specify Oracle Text attribute-specific settings. This argument is applicable to building models in Database 12.2. The name of each list element refers to a column that should be treated as text while the list value specifies the text transformation. Partitioned Models The OREdm package with Oracle Database 12.2 enables the building of a type of ensemble model where each model consists of multiple sub-models. A sub-model is automatically built for each partition of data, where partitions are determined based on the unique values found in user-specified columns. Partitioned models also automate scoring by allowing users to reference the top-level model only, allowing the proper sub-model to be chosen based on the values of the partitioned column(s) for each row of data to be scored. Extensible R Algorithm Models This feature enables R users to create an Extensible R Algorithm model using the Oracle Data Mining framework in Oracle Database 12.2. This makes such models appear as Oracle Data Mining models that are accessible using the ODM SQL API. Extensible R Algorithm models enable the user to build, score, and view a model from R via user-provided R functions stored in R Script Repository. The ORE overloaded predict method executes the user-specified scoring function for the model, returning an ore.frame with the predictions. 
For a complete list of new features, see the Oracle R Enterprise User's Guide. To learn more about Oracle R Enterprise, visit Oracle R Enterprise on Oracle's Technology Network, or review the variety of use cases on the Oracle R Technologies blog.


Best Practices

Visualizing Circular Distributions on Big Data

While browsing the chapter on Circular Distributions in Zar’s Biostatistical Analysis, I came across an interesting visualization for circular data. Circular scale data is all around: the days of the week, months of the year, hours of the day, degrees on the compass. Defined technically, circular scale data is a special type of interval scale data where the scale has equal intervals, no true zero point, and there is no rational high or low values, or if there are, they are arbitrarily designated. Consider days of the week, each day essentially has the same length, and although we may depict a week starting on Sunday on a typical calendar there is no true zero point. Further, aside from ascribing to Wednesdays a sense of the “hump” of the week we need to “get over,” there are no high or low values. Relating this to R and Oracle R Enterprise, you may want to visualize such data using a scatter diagram that’s shown on the circumference of a circle. I found a package in R that does this called CircStats. In this blog post, we’ll focus on one of the functions, circ.plot, to see if it can be used with an ore.frame, and then to see how it scales to bigger data. While it is possible for open source packages to work with ORE proxy objects, it is not necessarily the case. As we'll see, circ.plot does work with an ore.numeric proxy object, but this function encounters scalability issues, which we'll address. Let's start with a variation of the example provided with circ.plot. It should be noted that circ.plot is more than adequate for a wide range of uses, but the goal of this post is to explore issues around scalability. library(CircStats) data.vm100 <- rvm(100, 0, 3) circ.plot(data.vm100) After loading the library, we generate 100 observations from a von Mises distribution, with a mean of 0 and a concentration of 3. We then plot the data using a scatter diagram, which places each point on the circle, as depicted in the following graph.     However, if the data are too dense, you don’t get a good sense of where most of the data are, especially when points may overlap or be over-plotted. To address this problem, there’s an option to bin and stack the counts around the circle. This gives us a much clearer picture of where the data are concentrated. In the following invocation, we set the stack parameter to TRUE and the number of bins to 150. circ.plot(data.vm100, stack=TRUE, bins=150) So now, let’s see if this function will work on database data that could be part of an ore.frame (ORE's proxy object to a database table), in this case, an ore.numeric vector. The class ore.frame is a subclass of data.frame, where many of the functions are overloaded to have computations performed by Oracle Database. Class ore.numeric implements many of the functions available for class numeric. We can take the same data, and push it to the database to create an ore.frame, i.e., a proxy object corresponding to a table in Oracle Database. In this particular case, without changing a line of code in the circ.plot function, we’re able to transparently supply an ore.numeric vector. data.vm100.ore <- ore.push(data.vm100) circ.plot(data.vm100.ore) circ.plot(data.vm100.ore, stack=TRUE, bins=150) class(data.vm100.ore) [1] "ore.numeric" attr(,"package") [1] "OREbase" The exact same plots as above were generated.  There was one notable difference; the performance of the circ.plot with the ore.numeric vector took longer – taking about 1.4 seconds, compared with 0.03 seconds for R. 
Note both Oracle Database and R are executing locally on my 4 CPU laptop. This isn’t surprising since for such small data the communication to the database takes more time than the actual computation.  > system.time(circ.plot(data.vm100.ore, stack=TRUE, bins=150))    user  system elapsed    1.07    0.02    1.43 > system.time(circ.plot(data.vm100, stack=TRUE, bins=150))    user  system elapsed    0.01    0.01    0.03 Let’s try to scale this to 100K observations and see what happens. As depicted in the following graph, we don’t get much value from this visualization – without stacking. While there appear to be fewer data points around 180 degrees, we don’t get a picture of the actual distribution elsewhere, certainly no notion of concentration. data.vm100k <- rvm(100000, 0, 3) circ.plot(data.vm100k,main="100K Points") For concentration, we need to use the stacked graph, as shown in the following plot. Because of the number of points, we can infer concentration between 90 degrees through 0 to 270 degrees, but the data point go past the edge of the image. This took about 9 seconds plus some lag to actually display the data points. Notice that we used shrink = 2 to reduce the size of the inner circle. system.time(circ.plot(data.vm100k, stack=TRUE, bins=150,main="100K Points",shrink=2))    user  system elapsed    8.86    0.13    9.07   If we set the dotsep argument, which specifies the distance between stacked points (default is 40), to 1000, we begin to see the picture more clearly. Repeating the experiment with the ore.numeric vector, we again get the same graph results (not shown). However, the performance was slower for the binned option. Why is this? > system.time(circ.plot(data.vm100k.ore, main="100K Points"))    user  system elapsed    0.50    0.50    1.15 > system.time(circ.plot(data.vm100k.ore, stack=TRUE, bins=150, main="100K Points", shrink=2))    7.73    0.06   11.73 > system.time(x.pull <- ore.pull(data.vm100k.ore))    user  system elapsed    0.03    0.00    0.12 Let’s look at what is going on inside the circ.plot function. We’ll focus on two issues for scalability. The first (highlighted in red) is a for-loop that bins the data by testing if the value is within the arc segment and counting (that is, summing) the result. This sum produces a count for the bin. Notice that for each bin, the code makes a full scan of the data. The second scalability issue (highlighted in blue) involves a set of nested for-loops that draws a character for each value assigned to the bin. If a bin has a value of 20, then 20 characters are drawn, if it has 1 million, it draws 1 million characters! 
circ.plot <- function (x, main = "", pch = 16, stack = FALSE, bins = 0, cex = 1,     dotsep = 40, shrink = 1) {   x <- x%%(2 * pi)   if (require(MASS)) {     eqscplot(x = cos(seq(0, 2 * pi, length = 1000)),              y = sin(seq(0, 2 * pi, length = 1000)),              axes = FALSE, xlab = "",              ylab = "", main = main, type = "l",              xlim = shrink * c(-1, 1), ylim = shrink * c(-1, 1),              ratio = 1, tol = 0.04)     lines(c(0, 0), c(0.9, 1))     text(0.005, 0.85, "90", cex = 1.5)     lines(c(0, 0), c(-0.9, -1))     text(0.005, -0.825, "270", cex = 1.5)     lines(c(-1, -0.9), c(0, 0))     text(-0.8, 0, "180", cex = 1.5)     lines(c(0.9, 1), c(0, 0))     text(0.82, 0, "0", cex = 1.5)     text(0, 0, "+", cex = 2)     n <- length(x)     z <- cos(x)     y <- sin(x)     if (stack == FALSE)         points(z, y, cex = cex, pch = pch)     else {         bins.count <- c(1:bins)         arc <- (2 * pi)/bins         for (i in 1:bins) {           bins.count[i] <- sum(x <= i * arc & x > (i - 1) * arc)         }         mids <- seq(arc/2, 2 * pi - pi/bins, length = bins)         index <- cex/dotsep         for (i in 1:bins) {             if (bins.count[i] != 0) {               for (j in 0:(bins.count[i] - 1)) {                 r <- 1 + j * index                 z <- r * cos(mids[i])                 y <- r * sin(mids[i])                 points(z, y, cex = cex, pch = pch)               }             }          }       }   }   else {     stop("To use this function you have to install the package MASS (VR)\n")   } } When we profile the function above providing an ore.numeric vector to the function, we find that nearly all of the time is spent in the for-loop in red. Can we eliminate this for-loop and the corresponding full data scans since this greatly limits scalability? To compute bins, we can take advantage of overloaded in-database ORE Transparency Layer functions to compute the needed bin boundaries: round and tabulate.  Here’s a function that computes bins using standard R syntax and leveraging the ORE Transparency Layer. countbins <- function(x, nbins, range_x = range(x, na.rm = TRUE))   {     nbins <- as.integer(nbins)[1L]     if (is.na(nbins) || nbins < 1L)       stop("invalid 'nbins' argument")     min_x <- range_x[1L]     max_x <- range_x[2L]     scale <- (max_x - min_x)/nbins     x <- (0.5 + 1e-8) + (x - min_x)/scale     nbinsp1 <- nbins + 1L     counts <- ore.pull(tabulate(round(x), nbinsp1))     c(head(counts, -2), counts[nbins] + counts[nbinsp1])   } For data stored in Oracle Database, we can produce the graph much faster by replacing the for-loop in red (and a couple of lines preceding it) with the following that uses the countbins function defined above: bins.count <- countbins(x,bins,range_x=c(0,2*pi)) arc <- (2 * pi)/bins This works for a straight R numeric vector as well as for an ore.numeric vector. However, the performance is dramatically faster and the computation scales in ORE. Computing the binned counts now takes about a half second, but the overall execution time is still at 8.5 seconds, with the point generation taking 75% of the time. system.time(test(data.vm100k.ore, stack=TRUE, bins=150, main="100K Points", shrink=2)) BIN.COUNT:    user  system elapsed    0.05    0.00    0.55 POINTS:    user  system elapsed    6.27    0.09    6.38 TOTAL:    user  system elapsed    7.90    0.09    8.51 So we’ll now turn our attention to the for-loops highlighted in blue. The inner loop is plotting a point for each observation in the data. 
While this works quite well for smaller data, as we get bigger data, R can become overwhelmed. What we’re really trying to do is graph a line from the circle for a specific length, based on the count associated with bin. We can reduce from plotting N points – 100,000 points in our example – to at most 150 lines. This is much faster and scales to an arbitrary number of underlying points, especially when used with Oracle R Enterprise in-database computations. We’ll replace the blue lines with the following:       for (i in 1:bins) {         if (bins.count[i] != 0) {           z2 <- bins.count[i]*index*cos(mids[i])           y2 <- bins.count[i]*index*sin(mids[i])           z <- c(cos(mids[i]), z2)           y <- c(sin(mids[i]), y2)           segments(z[1],y[1],z[2],y[2])         }       } Now, performance is dramatically improved, completing in 2.1 seconds with the raw data never leaving the database. Recall that the original R function took ~10 seconds on the 100k observations. system.time(test2(data.vm100k.ore, stack=TRUE, bins=150, main="100K Points", shrink=2)) BIN.COUNT:    user  system elapsed    0.03    0.02    0.53 POINTS:    user  system elapsed    0.01    0.00    0.04 TOTAL:    user  system elapsed    1.54    0.02    2.10 We might say that 10 seconds vs. 2 seconds isn’t really a significant difference. So let’s scale up to 1 million points. > data.vm1M <- rvm(1000000, 0, 3) > system.time(circ.plot(data.vm1M, stack=TRUE, bins=150,main="1M Points",shrink=2, dotsep=1000))    user  system elapsed  229.81    6.55  238.77 > data.vm1M.ore <- ore.push(data.vm1M) > system.time(circ.plot2b(data.vm1M.ore, stack=TRUE, bins=150, main="1M Points", shrink=2)) BIN.COUNT:    user  system elapsed    0.02    0.02    4.54 POINTS:    user  system elapsed       0       0       0    user  system elapsed   10.94    0.02   15.50 Now the difference becomes more pronounce with 239 seconds compared to 16 seconds. Note that the 239 seconds does not reflect the actual drawing time of the plot, just the computation portion. Overall stopwatch time was 323 seconds to complete the plot drawing in RStudio. When performing interactive data analytics, this latency  can impact data scientist productivity, if not just be annoying. But, we still have a problem for visualizing the result.  Because of the size of the counts and the default dotsep value of 40, many lines are beyond the graph boundary. We can further shrink the circle, but the scale of our counts is such that we still don’t get a good sense of the full data distribution. In addition, by using the segments function, there is an odd behavior of lines inside the circle. Because of how the counts are scaled using index <- cex/dotsep, whenever bins.count[i]*index < 1 the line is drawn inside the circle. This could be annoying or interesting depending on your perspective. We’ll leave resolving that as an exercise for the reader. In any case, we can increase dotsep to 200, or even 1000, but we’re losing information since so many of the lines are inside the circle. We can better address this providing the option to take the log of the counts, or even normalize the counts between 0 and 1 and adjust the dotsep argument. At this point, we have a better indication of the distribution. Moreover, we can scale this to millions or billions of points and still have the computation performed quickly. As shown below, the entire graph is produced in just over 3 seconds.   Here is a scalable circ.plot function, circ.plot2. 
circ.plot2 <- function (x, main = "", pch = 16, stack = FALSE, bins = 0, cex = 1,                    dotsep = 40, shrink = 1, scale.method = c("none","log","normalize"),col="blue") {   x <- x%%(2 * pi)   if (require(MASS)) {     eqscplot(x = cos(seq(0, 2 * pi, length = 1000)),              y = sin(seq(0, 2 * pi, length = 1000)),              axes = FALSE, xlab = "",              ylab = "", main = main, type = "l",              xlim = shrink * c(-1, 1), ylim = shrink * c(-1, 1),              ratio = 1, tol = 0.04)     text(0.005, 0.85, "90", cex = cex)     lines(c(0, 0), c(-0.9, -1))     text(0.005, -0.825, "270", cex = cex)     lines(c(-1, -0.9), c(0, 0))     text(-0.8, 0, "180", cex = cex)     lines(c(0.9, 1), c(0, 0))     text(0.82, 0, "0", cex = cex)     text(0, 0, "+", cex = cex)     n <- length(x)     z <- cos(x)     y <- sin(x)     if (stack == FALSE)       points(z, y, cex = cex, pch = pch)     else {       bins.count <- countbins(x,bins,range_x=c(0,2*pi))       arc <- (2 * pi)/bins       min_x <- min(bins.count)       max_x <- max(bins.count)       bins.count = switch(scale.method,                           none      = bins.count,                           log       = log(bins.count),                           normalize = (bins.count - min_x) / (max_x-min_x))       index <- 1/dotsep       for (i in 1:bins) {         if (bins.count[i] != 0) {           z2 <- bins.count[i]*index*cos(mids[i])           y2 <- bins.count[i]*index*sin(mids[i])           z <- c(cos(mids[i]), z2)           y <- c(sin(mids[i]), y2)           segments(z[1],y[1],z[2],y[2],col=col)         }       }     }   }   else {     stop("To use this function you have to install the package MASS (VR)\n")     } } To push the scalability further, the following code produces a bipolar graph constructed with 2 million points, depicted below. data.vm2mBP <- c(rvm(1000000, 0, 15),rvm(1000000,3.14,12)) data.vm2mBP.ore <- ore.push(data.vm2mBP) system.time(circ.plot2(data.vm2mBP.ore, stack=TRUE, bins=100, shrink=5,dotsep=2,scale.method = "log",main="2M Points, log counts",col="red"))    user  system elapsed    0.13    0.00    8.74 To do the same with the original circ.plot function for 1 million points took about 91 seconds for the elapsed execution time, and another 135 seconds to complete rendering the plot. It’s worth noting that the improved circ.plot2 function – when invoked on a local data.frame such as data.vm2mBP – takes under 2 seconds in local execution time, so the algorithm change benefits R as well. In summary, this example highlights the impact of algorithm design on scalability. Using an existing function, CircStats circ.plot, we were able to leverage in-database ORE computations via the Transparency Layer, which leaves data in the database. However, to enable scalability, the R code that performs both computations and visualizations needed to be revised. A demonstrated, such changes can dramatically impact performance.
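To exercise the in-database binning in isolation, the countbins helper defined earlier can be called directly; per the post it accepts either a local numeric vector or an ore.numeric proxy. A small sketch, assuming an ORE connection and the CircStats rvm generator loaded above:

set.seed(42)
x.local <- rvm(1000000, 0, 3) %% (2 * pi)   # wrap angles into [0, 2*pi)
x.ore   <- ore.push(x.local)                # ore.numeric proxy held in the database

counts.ore   <- countbins(x.ore,   150, range_x = c(0, 2 * pi))
counts.local <- countbins(x.local, 150, range_x = c(0, 2 * pi))

sum(counts.ore)                       # should equal the number of observations
all.equal(counts.ore, counts.local)   # same bin counts, in-database vs. local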


Computing Weight of Evidence (WOE) and Information Value (IV)

Weight of evidence (WOE) is a powerful tool for feature representation and evaluation in data science. WOE can provide interpret able transformation to both categorical and numerical features. For categorical features, the levels within a feature often do not have an ordinal meaning and thus need to be transformed by either one-hot encoding or hashing. Although such transformations convert the feature into vectors and can be fed into machine learning algorithms, the 0-1 valued vectors are difficult to interpret as a feature. For example, some ecommerce company may want to predict the conversion rate of a group of consumers. One can extract demographic information of people such as the postcode of their addresses. The postcode can be viewed as a categorical feature and encoded into a one-hot vector. But it is unclear which postcode has inclination for relatively higher or lower conversion rate. One can check such an inclination in the coefficients from a GLM or SVM models, however, this information is not available until the model is trained. In such a case, WOE can provide a score for each postcode and one can clearly see the linkage of a postcode with the conversion rate. Moreover, this linkage or inclination generated by WOE can be used for feature transformation and thus benefit the model training. For numerical features, although there is a natural ordering for different numerical values, sometimes nonlinearity exists and in such cases, a linear model fails to capture that nonlinearity. For instance, the average income for a group of people may increase by time within age range 20-60 (see figure from WSJ), but may drop because of retirement after that. In such use cases, WOE provides scores for each truncated segment (e.g. 30-40, 40-50, ..., 60-70) and that can process the nonlinearity of the data. Moreover, in many fields such as finance (e.g., credit risk analysis), the machine learning model is preferred to be transparent and interpretable step by step. WOE provides a convenient way to show the actual transformation for auditors or supervisors. Another useful byproduct of WOE analysis is Information Value (IV). This measures the importance of a feature. Note that IV depends only on frequency counting and does not need to fit a model to obtain an attribute importance score. Note that ore.odmAI, the in-database attribute importance function that utilizes the minimum description length algorithm, can also be used for ranking attributes. However, those with a background in information theory may prefer the IV calculation. A nice theoretical and practical overview of WOE analysis can be found in this blog post. Computing WOE or IV may be involved and computationally intensive if the data size is large. Generally it needs to do counting for each level of the categorical features and both binning and counting for numerical features. If the data reside inside Oracle Database, it is desirable to compute this score using in-database tools. In this blog, we provide an example to show how to compute WOE using Oracle R Enterprise. Data Overview The data we used here is New York City Taxi data. It is a well-known public dataset. The data covers the transactional and trip information of both green and yellow taxis. General information about this data set can be found in link. There is a nice github repo to show how to download the data (see the file raw_data_urls.txt). 
Since it is a huge data set, we picked the Green Taxi (the taxi that is allowed only to pick up people outside Manhattan) data in December, 2016. The total size of the data is around 107 million records. rm(list=ls()) library(ORE) options(ore.warn.order=FALSE) ore.connect(...) ore.ls() trip.file <- "green_tripdata_2016-12.csv" nyc.data <- read.csv(file = trip.file, header=F, skip =1) nyc.df <- nyc.data[,1:19] headers <- read.csv(file = trip.file, header = F, nrows = 1, as.is = T) colnames(nyc.df) <- as.character(headers[1:19])ore.create(nyc.data, table="NYC_GREEN_TAXI") The code above reads the .csv format data into R and creates a database table in Oracle Database using ORE. Normally, such enterprise data would already be present in Oracle Database, thereby eliminating this step. First, let us have a glimpse of this data. head(NYC_GREEN_TAXI) VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag 1 2 2016-12-01 00:38:14 2016-12-01 00:45:24 N 2 2 2016-12-01 00:32:07 2016-12-01 00:35:01 N 3 2 2016-12-01 00:47:06 2016-12-01 01:00:40 N 4 2 2016-12-01 00:35:11 2016-12-01 00:53:38 N 5 2 2016-12-01 00:24:49 2016-12-01 00:39:30 N 6 2 2016-12-01 00:12:05 2016-12-01 00:15:58 N RatecodeID PULocationID DOLocationID passenger_count trip_distance 1 1 75 262 1 1.30 2 1 193 179 1 1.21 3 1 168 94 2 3.97 4 1 25 61 1 4.29 5 1 223 129 1 2.66 6 1 129 129 2 0.89 fare_amount extra mta_tax tip_amount tolls_amount ehail_fee 1 7.0 0.5 0.5 0.00 0 NA 2 5.5 0.5 0.5 1.70 0 NA 3 14.0 0.5 0.5 3.06 0 NA 4 16.5 0.5 0.5 3.56 0 NA 5 13.0 0.5 0.5 0.00 0 NA 6 5.0 0.5 0.5 0.00 0 NA improvement_surcharge total_amount payment_type trip_type 1 0.3 8.30 2 1 2 0.3 8.50 1 1 3 0.3 18.36 1 1 4 0.3 23.31 1 1 5 0.3 14.30 1 1 6 0.3 6.30 2 1 Even without a data dictionary, the names of the features are self-explanatory. Transactional information can be found in this data set such as total amount, payment type, fare amount, tax, etc. Trip specific information about the pick-up and drop-off time and location, number of passengers and trip distance can also be found. What kind of insight can we get from such a dataset? We all know that one common practice of taking a taxi in the US is that the passenger often tips the driver. The tip amount given has no clearly defined standard and can be controversial. This is also part of the reason why Uber or Lyft have become popular since one does not need to worry about tipping at all. In this dataset, we notice that there are quite a few trips in which the passenger did not tip the driver or there is no record of a tip amount (tip amount = 0), and also there are many cases that the tip is quite high. A fair way to study the tip is through the percentage of the tip as compared to the fare amount. The fare amount is usually calculated by meters and the tip is usually paid in proportion to the fare. Normally, in the US, the tip is around 20%. But we observed quite a lot of high percentage tips (up to 100%). Can we predict whether a higher tip is paid to the taxi driver, given all the trip and transactional data? For this purpose, we define a binary variable "high tip", where high_tip = 1, if the tip percentage exceeds 20% and high_tip =0 otherwise. We can build a classification model to make predictions of this response variable. Data Cleaning The NYC taxi data has great quality in general. But there are some outliers which do not make sense and may have been generated by mistake. For instance, there are quite a few records with zero trip distance or zero total amount. 
We remove such data records before moving to the next step, using the ORE transparency layer functions.

NYC_GREEN_TAXI_STG <- subset(NYC_GREEN_TAXI, passenger_count > 0)
NYC_GREEN_TAXI_STG <- subset(NYC_GREEN_TAXI_STG, trip_distance > 0)
NYC_GREEN_TAXI_STG <- subset(NYC_GREEN_TAXI_STG, fare_amount > 0)
NYC_GREEN_TAXI_STG$tip_percentage <- NYC_GREEN_TAXI_STG$tip_amount*1.0/NYC_GREEN_TAXI_STG$fare_amount
NYC_GREEN_TAXI_STG$high_tip <- ifelse(NYC_GREEN_TAXI_STG$tip_percentage > 0.2, 1, 0)
NYC_GREEN_TAXI_STG$PULocationID <- as.factor(NYC_GREEN_TAXI_STG$PULocationID)
NYC_GREEN_TAXI_STG$DOLocationID <- as.factor(NYC_GREEN_TAXI_STG$DOLocationID)
NYC_GREEN_TAXI_STG$RatecodeID <- as.factor(NYC_GREEN_TAXI_STG$RatecodeID)
NYC_GREEN_TAXI_STG$payment_type <- as.factor(NYC_GREEN_TAXI_STG$payment_type)

One subtlety here is that the tip is often paid in cash, and the amount is then not recorded: when the payment type is cash, the tip amount appears as zero. We only know the tip for sure when the fare is paid by credit card. Consequently, we exclude the records with cash payment. The binary variable high_tip indicates whether a tip above 20% was paid; it is used as the response variable. Next, we split the data into training and test sets.

set.seed(1) # enable repeatable results
N <- nrow(NYC_GREEN_TAXI_STG)
sampleSize <- round(N * 0.7)
ind <- sample(1:N, sampleSize)
group <- as.integer(1:N %in% ind)
row.names(NYC_GREEN_TAXI_STG) <- NYC_GREEN_TAXI_STG$VendorID
NYC.train <- NYC_GREEN_TAXI_STG[group==TRUE,]
NYC.test <- NYC_GREEN_TAXI_STG[group==FALSE,]
ore.create(NYC.train, table = 'NYC_TRAIN')
ore.create(NYC.test, table = 'NYC_TEST')

Feature Engineering

As a first step, we pick the set of features empirically, because we want to verify their importance later using WOE and IV. We select categorical features such as PULocationID, DOLocationID, RatecodeID, and payment_type, and numerical features such as trip_distance, improvement_surcharge, fare_amount, and tolls_amount. Empirically, the tip percentage may have something to do with people from different neighborhoods, the trip distance, or the fare amount. WOE analysis provides both attribute importance and a way to convert the categorical features into numerical values, which reduces the computational load.

Weight of Evidence Analysis

The weight of evidence is defined for each level of a feature. Consider a categorical feature X that contains a level named x_j. The WOE is defined as

WOE(x_j) = log( f(x_j | Y = 1) / f(x_j | Y = 0) )

This is the logarithm of the conditional probability ratio for level x_j. Why is it defined this way? Suppose we want to convert a categorical feature into numeric values. One straightforward way is to calculate the frequency of the response variable within that level. For example, for each location ID, we can calculate the frequency of high percentage tips. This is actually the posterior probability P(Y=1|x_j), where Y is the high-percentage indicator and x_j is the particular location. We can use this frequency score to show the inclination toward the positive or negative direction of the response variable. In this example, to see whether a location ID indicates high tips rather than low tips, we consider the ratio P(Y=1|x_j) / P(Y=0|x_j). In most cases this ratio lies within (0,1), so a logarithmic transform is desirable to flatten the distribution.
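As a small standalone illustration of this definition (toy data, not the taxi set), the WOE score for each level of a categorical feature can be computed directly from the class-conditional frequencies:

# Toy example: WOE per level from frequency counts (hypothetical data)
set.seed(42)
toy <- data.frame(postcode = sample(c("10001", "10002", "10003"), 1000, replace = TRUE),
                  high_tip = rbinom(1000, 1, 0.3))
tab   <- table(toy$postcode, toy$high_tip)   # counts per level and outcome
f.pos <- tab[, "1"] / sum(tab[, "1"])        # f(x_j | Y = 1)
f.neg <- tab[, "0"] / sum(tab[, "0"])        # f(x_j | Y = 0)
log(f.pos / f.neg)                           # WOE score per level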
We can further break down this posterior ratio into a ratio of priors, P(Y=1)/P(Y=0), and a ratio of likelihoods, f(x_j|Y=1)/f(x_j|Y=0). The priors are not related to x_j at all, so it is better to focus on the second part, which leads to the definition of WOE. In other words, weight of evidence focuses on the inclination of the feature level with no influence from the priors. If we used the posterior ratio instead, class imbalance would weigh in and distort the picture: imagine a case where records with high tips dominate the data set, so that every level of the feature has a very high ratio. By removing the prior component, WOE provides a principled way to avoid this problem.

The weight of evidence analysis also provides a measure for evaluating the predictive power of a feature, called Information Value (IV). It is defined as

IV = sum_j ( f(x_j | Y = 1) - f(x_j | Y = 0) ) * WOE(x_j)

The formula can be interpreted as a weighted sum of the differences between the likelihoods of each level of the feature, where the weight is simply the WOE value. For those familiar with information theory, the information value is the symmetrized KL divergence: IV = D( f(x_j|Y=1) || f(x_j|Y=0) ) + D( f(x_j|Y=0) || f(x_j|Y=1) ), where the KL divergence measures the difference between two distributions. The underlying intuition is that the more different the two likelihoods f(x_j|Y=1) and f(x_j|Y=0) look, the more predictive power the feature has.

Weight of Evidence Implementation

There are several packages in R that support WOE analysis. One library we have tried is klaR. However, like most R packages, it requires all the data to be in memory and as a result cannot process large amounts of data. If the data is large and stored in Oracle Database, a good choice is to use Oracle R Enterprise algorithms to calculate WOE values. Although there is no official WOE API, we can use the in-database algorithms available in ORE to help with this task. Note that WOE requires the likelihoods f(x_j|Y=1) and f(x_j|Y=0). Coincidentally, these are also intermediate results of Naive Bayes. In the ORE library, the in-database Naive Bayes function, ore.odmNB, can be called to provide the likelihoods, which appear in the result of the summary function applied to the Naive Bayes model. Conveniently, ore.odmNB also provides automatic binning for continuous numerical features. The binning uses a minimum description length method optimized for predictive power, and it can be turned off through function options if desired.

First, let us see how we calculate the WOE values for categorical features using ORE.

categorical.features <- c('PULocationID', 'DOLocationID', 'RatecodeID', 'payment_type')
categoricalModelFormula <- formula(paste0('high_tip ~' , paste(categorical.features, collapse = ' + ')))
model.categorical <- ore.odmNB(categoricalModelFormula, data = NYC.train, auto.data.prep = FALSE)
summary.cat <- ore.pull(summary(model.categorical))

Similar to most R packages, the output of ore.odmNB is the model the algorithm has learned. We can run summary() to view model details. Let us see what is inside the result of the summary. We omit some of the output to keep it short:

summary.cat$tables
$DOLocationID
...
$PULocationID
...
$RatecodeID
             1            2            3            4
0 9.719172e-01 2.616082e-03 7.895998e-04 5.696033e-04
1 9.928279e-01 2.662321e-03 7.134569e-04 4.618695e-04

$payment_type
             1            2            3            4
0 2.391157e-01 7.536380e-01 4.377760e-03 2.786622e-03
1 9.999850e-01 7.510073e-06 7.510073e-06

We can see that for each feature there is a table of likelihoods. For instance, the RatecodeID column 1 has two values, in row "0" and row "1". These values are P(RatecodeID = 1 | high_tip = 0) and P(RatecodeID = 1 | high_tip = 1). Next, we map each level to the corresponding column of the table and compute the WOE values.

library(dplyr)

# calculate woe for a given level
make.woe <- function(grp) {
  woe <- grp[grp$OUTCOME==1,]$FREQ/grp[grp$OUTCOME==0,]$FREQ
  return(log(woe[[1]]))
}

# generate the woe lookup table for one feature
woe.lookup <- function(df, N.pos, N.neg){
  colnames(df) <- c('OUTCOME', 'LEVEL', 'FREQ')
  df[is.na(df$FREQ), ]$FREQ <- 1/ifelse(df[is.na(df$FREQ), ]$OUTCOME == 1, N.pos+1, N.neg+1)
  df.woe <- df %>% group_by(LEVEL) %>% do(WOE = make.woe(.)) %>% mutate(WOE = WOE[[1]])
  df.woe
}

# generate woe lookup tables for all features, stored in a list
make.lookup.cat <- function(summary.cat, N.pos, N.neg){
  lookup.cat <- list()
  length.tab <- length(summary.cat$tables)
  for(i in 1:length.tab){
    feature <- names(summary.cat$tables[i])
    df <- as.data.frame(summary.cat$tables[i])
    df.woe <- woe.lookup(df, N.pos, N.neg)
    colnames(df.woe)[colnames(df.woe) == 'WOE'] <- paste0(feature, '_WOE')
    lookup.cat[[i]] <- df.woe
  }
  return(lookup.cat)
}

add.woe.cat <- function(train.woe, lookup.cat, summary.cat){
  length.tab <- length(lookup.cat)
  for(i in 1:length.tab){
    feature <- names(summary.cat$tables[i])
    Lookup <- ore.push(lookup.cat[[i]])
    Lookup$LEVEL <- as.character(Lookup$LEVEL)
    train.woe <- merge(train.woe, Lookup, by.x = feature, by.y = 'LEVEL')
    colnames(train.woe)[colnames(train.woe) == 'FREQ'] <- paste0(feature, '.WOE')
  }
  return(train.woe)
}

N.pos <- nrow(NYC.train[NYC.train$high_tip == 1,])
N.neg <- nrow(NYC.train[NYC.train$high_tip == 0,])
lookup.cat <- make.lookup.cat(summary.cat, N.pos, N.neg)
NYC.train.woe <- add.woe.cat(NYC.train, lookup.cat, summary.cat)

In the code snippet, a lookup table is created for each feature: the levels and corresponding WOE values are stored in a data frame. Each lookup table is then joined with the original data frame on the level, and thus we obtain a column with the levels replaced by WOE values. The resulting ORE frame looks like the following:

  PULocationID DOLocationID RatecodeID payment_type PULocationID_WOE DOLocationID_WOE RatecodeID_WOE payment_type_WOE
1           75           75          1            1      -0.06174966       -0.4190850     0.02128671        1.4307927
2          264          264          1            1      -0.69524126       -0.8111242     0.02128671        1.4307927
3          225           17          1            1      -0.45636689        0.0312819     0.02128671        1.4307927
4           49           97          1            1       0.16296406        0.2206145     0.02128671        1.4307927
5           41          142          1            1      -0.19541305        0.6142433     0.02128671        1.4307927
6          112           80          1            2       0.73283221        0.7252893     0.02128671        0.2828431

Next, we convert the numerical features. This is done separately because the numerical features need to be binned first, and the lookup table is slightly different.
The code for this part is

rownames(NYC.train.woe) <- NYC.train.woe$ID
numerical.features <- c('trip_distance', 'fare_amount')
numericalModelFormula <- formula(paste0('high_tip ~' , paste(numerical.features, collapse = ' + ')))
model.numerical <- ore.odmNB(numericalModelFormula, data = NYC.train, auto.data.prep = TRUE)
summary.num <- ore.pull(summary(model.numerical))
length.tab.num <- length(summary.num$tables)

We can see that this summary contains tables like

$fare_amount
    ( ; 3.25]    (3.25; )
0 0.016610587 0.983389413
1 0.005294601 0.994705399

$trip_distance
    ( ; .235]    (.235; )
0 0.026999879 0.973000121
1 0.007540113 0.992459887

We can further use the binning information and the WOE values to transform the numerical columns of the original data set. Since this involves more complicated operations, we use ORE embedded R execution, which allows us to run a function over the rows of the ore.frame in parallel. Since embedded R execution usually does not materialize the data, we first create a table with the result.

# extract upper and lower bounds from the bin labels
extract.bounds <- function(s){
  s <- gsub("\\(", '', s)
  s <- gsub("\\)", '', s)
  s <- gsub("\\]", '', s)
  parts <- strsplit(s, ";")
  left <- parts[[1]][1]
  right <- parts[[1]][2]
  left <- gsub(' ', '', left)
  right <- gsub(' ', '', right)
  if(left == '') left <- -Inf
  if(right == '') right <- Inf
  c(left, right)
}

# look up the bin where a value belongs
bin.woe <- function(DF, feature, df.woe){
  val <- DF[,feature][1]
  freq <- df.woe[df.woe$MIN < val & val <= df.woe$MAX,]$WOE
  DF$bin_temp <- freq[1]
  DF
}

make.woe <- function(grp) {
  woe <- grp[grp$OUTCOME==1,]$FREQ/grp[grp$OUTCOME==0,]$FREQ
  return(log(woe[[1]]))
}

ore.create(NYC.train.woe, table='NYC_TRAIN_WOE')

# generate a list of lookup tables for WOE
make.lookup.num <- function(summary.num, N.pos, N.neg){
  lookup.num <- list()
  length.num <- length(summary.num$tables)
  for(i in 1:length.num){
    df <- as.data.frame(summary.num$tables[i])
    df.woe <- woe.lookup(df, N.pos, N.neg)
    df.woe$LEVEL <- as.character(df.woe$LEVEL)
    bounds <- sapply(df.woe$LEVEL, extract.bounds)
    bounds.df <- as.data.frame(t(bounds))
    colnames(bounds.df) <- c('MIN', 'MAX')
    df.woe <- cbind(df.woe, bounds.df)
    df.woe$MIN <- as.numeric(as.character(df.woe$MIN))
    df.woe$MAX <- as.numeric(as.character(df.woe$MAX))
    lookup.num[[i]] <- df.woe
  }
  return(lookup.num)
}
lookup.num <- make.lookup.num(summary.num, N.pos, N.neg)

add.woe.num <- function(Train.woe, lookup.num, summary.num){
  length.num <- length(lookup.num)
  for(i in 1:length.num){
    feature <- names(summary.num$tables[i])
    Train.woe[,feature] <- as.numeric(as.character(Train.woe[,feature]))
    df.woe <- lookup.num[[i]]
    schema.string <- paste0("extended.schema = data.frame(ID=integer(),", feature, "=numeric(), bin_temp=numeric())")
    eval(parse(text=schema.string))
    Feature.woe <- ore.rowApply(Train.woe[, c('ID', feature)], bin.woe,
                                feature = feature, df.woe = df.woe,
                                FUN.VALUE = extended.schema, rows = 1e5, parallel = TRUE)
    Train.woe <- merge(Train.woe, Feature.woe[, c('ID', 'bin_temp')], by.x = 'ID', by.y = 'ID')
    colnames(Train.woe)[colnames(Train.woe) == 'bin_temp'] <- paste0(feature, '_WOE')
  }
  Train.woe
}
NYC_TRAIN_WOE <- add.woe.num(NYC_TRAIN_WOE, lookup.num, summary.num)

For processing numerical features, ORE has the advantage of automatic binning. In open source R packages such as klaR, WOE calculation is done only for categorical features, so extra work is needed for binning.
Although packages like "woe" provide binning functions, the binning done by ore.odmNB has an advantage for classification because the partitions come from feature splits in a decision tree. This means ore.odmNB provides a more effective binning for the purpose of classification. After creating the WOE value columns for the training set, we can create them for the test set as well. Since we stored the lookup tables, we can reuse them for the test set. The code is as follows:

NYC.test.woe <- NYC_TEST
rownames(NYC.test.woe) <- NYC.test.woe$VendorID
N.train <- nrow(NYC.train)   # offset so that test IDs follow the training IDs
NYC.test.woe$ID <- (N.train+1):(N.train + nrow(NYC.test.woe))
NYC.test.woe <- add.woe.cat(NYC.test.woe, lookup.cat, summary.cat)
ore.create(NYC.test.woe, table='NYC_TEST_WOE')
NYC_TEST_WOE <- add.woe.num(NYC_TEST_WOE, lookup.num, summary.num)

Now we can train the model using the WOE-transformed features and evaluate it on the test set.

all.features <- append(categorical.features, numerical.features)
all.features <- sapply(all.features, function(x){ paste0(x, '_WOE')})
binaryModelFormula <- formula(paste0('high_tip ~', paste(all.features, collapse = '+')))
model.lr <- ore.odmGLM(binaryModelFormula, data = NYC_TRAIN_WOE, type = "logistic")
pred.ore <- predict(model.lr, NYC_TEST_WOE, type = 'prob', supplemental.cols = 'high_tip')

calculate.AUC <- function(pred.ore){
  prob <- pred.ore[,2]
  actualClassLabels <- pred.ore[,1]
  library(ROCR)
  pred <- prediction(prob, actualClassLabels)
  perfROC <- performance(pred, measure="tpr", x.measure="fpr")
  perfAUC <- performance(pred, measure = "auc")
  auc <- perfAUC@y.values[[1]][1]
  auc
}
auc <- ore.tableApply(pred.ore, calculate.AUC)
auc

The AUC (Area Under the Curve) we obtained here is 0.9, a good result. It is interesting to ask what would happen if we did not use WOE to train the model. The answer is that we would need extra preprocessing. Notice that we have two high-cardinality features: PULocationID has 241 levels and DOLocationID has 260 levels. ML algorithms such as ore.randomForest require that a categorical feature have no more than 53 levels, and GLM algorithms are very slow on data with high-cardinality features. In such a case, a reasonable alternative is to combine one-hot encoding with hashing tricks and apply ML algorithms that can handle sparse data, such as glmnet or xgboost. In contrast, WOE provides a convenient way to transform such categorical features into numerical values before training machine learning models, while preserving interpretability compared to hashed features.

Information Values

We can also calculate Information Value using the summary of the ore.odmNB output. The code is as follows.
cond.diff <- function(grp){
  diff <- grp[grp$OUTCOME==1,]$FREQ - grp[grp$OUTCOME==0,]$FREQ
  return(diff[[1]])
}

library(dplyr)
compute.iv <- function(lookup.list, summary, N.pos, N.neg){
  length.tab <- length(summary$tables)
  iv <- rep(0, length.tab)
  for(i in 1:length.tab){
    df.woe <- lookup.list[[i]]
    cond <- as.data.frame(summary$tables[i])
    colnames(cond) <- c('OUTCOME', 'LEVEL', 'FREQ')
    cond[is.na(cond$FREQ), ]$FREQ <- 1/ifelse(cond[is.na(cond$FREQ), ]$OUTCOME == 1, N.pos+1, N.neg+1)
    cond <- cond %>% group_by(LEVEL) %>% do(COND_DIFF = cond.diff(.)) %>% mutate(COND_DIFF = COND_DIFF[[1]])
    df.combined <- merge(df.woe, cond, by = 'LEVEL')
    iv[i] <- sum(df.combined[,2]*df.combined$COND_DIFF)
  }
  iv.df <- as.data.frame(cbind(names(summary$tables), iv))
  colnames(iv.df)[1] <- 'feature'
  iv.df
}
compute.iv(lookup.num, summary.num, N.pos, N.neg)
compute.iv(lookup.cat, summary.cat, N.pos, N.neg)

The result looks like:

        feature                 iv
1   fare_amount 0.0130676330371582
2 trip_distance 0.0252081334107816

> compute.iv(lookup.cat, summary.cat, N.pos, N.neg)
       feature                 iv
1 DOLocationID  0.391818787450526
2 PULocationID  0.325000696270622
3   RatecodeID 0.0413093811488285
4 payment_type  0.919023041303648

Information value provides a clue to the predictive power of the features. The dominant feature here is payment_type. One possible reason is that payment_type records statuses such as "void trip", "dispute", etc., which may indicate that the customer had an unusual experience and thus the tip may not be high. The next two important features are the location features. It is possible that the neighborhood reflects the economic status of the passengers, which shows up in the tip. The result from klaR looks basically similar. The main difference lies in the handling of empty levels. For example, some levels may have zero counts for high_tip = 1. In klaR, the related conditional probability is calculated by simply using a surrogate value supplied by the user. The solution above instead uses a form of Laplace smoothing: a zero likelihood is replaced by 1 / (N + 1), where N is the total number of records in the corresponding class (high_tip = 1 or high_tip = 0).

Conclusion

In this blog, we discussed WOE and IV and illustrated how to compute them while leveraging convenient Oracle R Enterprise in-database algorithm features. This also highlighted other ways in which ORE algorithms can be leveraged. The model successfully predicted high-percentage tips for the NYC green taxi data, as evidenced by the AUC. The relative importance of each feature was also obtained by computing the information value. The ORE-based solution has the advantage of computing the necessary statistics in-database, thereby scaling to large amounts of data, and ore.odmNB performs automatic binning, which is convenient for processing numerical features.
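To make the smoothing rule above concrete, here is a small standalone sketch (toy counts, unrelated to the taxi data) showing how a level with zero positive counts receives a large negative, but finite, WOE score:

# Toy counts: level C never occurs with high_tip = 1 (hypothetical numbers)
N.pos <- 1000; N.neg <- 4000
pos.counts <- c(A = 700, B = 300, C = 0)
neg.counts <- c(A = 1500, B = 1500, C = 1000)
f.pos <- pos.counts / N.pos
f.neg <- neg.counts / N.neg
f.pos["C"] <- 1 / (N.pos + 1)   # smoothed likelihood for the empty level
log(f.pos / f.neg)              # WOE; level C gets a large negative score instead of -Inf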


Oracle R Enterprise and Database Upgrades

After a database upgrade, a set of maintenance steps is required to update the new ORACLE_HOME with the entire set of ORE components. For example, if the proper migration steps are not followed, ORE embedded R functions will return errors such as:

ORA-28578: protocol error during callback from an external procedure

The ORE server installation consists of three components:

Oracle Database schema (RQSYS) and schema-related objects
Oracle Database shared libraries for supporting Oracle R Enterprise clients
Oracle R Enterprise packages and supporting packages installed on the Operating System

After a database upgrade, the RQSYS schema and dependent database components must be migrated to the new ORACLE_HOME. The ORE packages must also be installed to the new database location. The instructions provided in this post migrate Oracle R Enterprise 1.5.0 from the initial database installation to the new database after a database upgrade. In this case, Oracle Database was upgraded from version 12.1.0.2 to version 12.2.0.1. Oracle R Distribution and Oracle R Enterprise are not upgraded, only migrated to the new ORACLE_HOME.

The critical step for ORE maintenance after a database upgrade is to run the server installation script against the new ORACLE_HOME. In this case, the server installation operation can be seen as a patching step, as it will try to fix what is missing; it records the path to the new ORACLE_HOME in ORE's metadata. Simply run server.sh with the --no-user flag to transfer ORE to the new ORACLE_HOME:

$ ./server.sh --no-user

We pass the --no-user flag assuming an ORE user was already created in the original database. As always, back up the RQSYS and ORE user schemas prior to the upgrade. After executing the ORE server script, verify that the ORE configuration points to the new ORACLE_HOME and that the ORE dependent libraries ore.so and librqe.so are in the new ORACLE_HOME. Under sysdba:

SQL> select * from sys.rq_config;

NAME
--------------------------------------------------------------
VALUE
--------------------------------------------------------------
R_HOME
/usr/lib64/R
R_LIBS_USER
/u01/app/oracle/product/12.2.0.1/dbhome_1/R/library
VERSION
1.5
..
..

SQL> select library_name, file_spec from all_libraries where owner = 'RQSYS';

LIBRARY_NAME
-----------------------------------------------------------------
FILE_SPEC
-----------------------------------------------------------------
RQ$LIB
/u01/app/oracle/product/12.2.0.1/dbhome_1/lib/ore.so
RQELIB
/u01/app/oracle/product/12.2.0.1/dbhome_1/lib/librqe.so

Finally, test the Oracle R Enterprise installation against the upgraded ORACLE_HOME by running the product demos.
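As a quick post-migration smoke test (a hedged sketch, not part of the original instructions; the connection details are placeholders for your environment), you can confirm client connectivity and embedded R execution from an R session:

library(ORE)
# placeholder connection details; substitute your own
ore.connect(user = "rquser", sid = "orcl", host = "dbhost", password = "password", all = TRUE)
ore.is.connected()                        # should return TRUE
ore.doEval(function() R.version.string)   # exercises embedded R execution in the database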


News

BIWA Summit 2018 with Spatial and Graph Summit - Call for Speakers

(pdf announcement) Oracle Conference Center at Oracle Headquarters Campus, Redwood Shores, CA Share your successes… We want to hear your story. Submit your proposal today for Oracle BIWA Summit 2018, featuring Oracle Spatial and Graph Summit, March 20 - 22, 2018 and share your successes with Oracle technology. The call for speakers is now open through December 3, 2017.  Submit now for possible early acceptance and publication in Oracle BIWA Summit 2018 promotion materials.  Click HERE  to submit your abstract(s) for Oracle BIWA Summit 2018. Oracle Spatial and Graph Summit will be held in partnership with BIWA Summit.  BIWA Summits are organized and managed by the Oracle Business Intelligence, Data Warehousing and Analytics (BIWA) User Community and the Oracle Spatial and Graph SIG – a Special Interest Group in the Independent Oracle User Group (IOUG). BIWA Summits attract presentations and talks from the top Business Intelligence, Data Warehousing, Advanced Analytics, Spatial and Graph, and Big Data experts. The 3-day BIWA Summit 2017 event involved Keynotes by Industry experts, Educational sessions, Hands-on Labs and networking events. Click HERE to see presentations and content from BIWA Summit 2017. Call for Speaker DEADLINE is December 3, 2017 at midnight Pacific Time. Presentations and Hands-on Labs must be non-commercial. Sales promotions for products or services disguised as proposals will be eliminated.  Speakers whose abstracts are accepted will be expected to submit their presentation as PDF slide deck for posting on the BIWA Summit conference website.  Accompanying technical and use case papers are encouraged, but not required. Complimentary registration to Oracle BIWA Summit 2018 is provided to the primary speaker of each accepted presentation. Note:  Any additional co-presenters need to register for the event separately and provide appropriate registration fees.    Please submit session proposals in one of the following areas: Machine Learning Analytics Big Data Data Warehousing and ETL Cloud Internet of Things Spatial and Graph (Oracle Spatial and Graph Summit) …Anything else “Cool” using Oracle technologies in “novel and interesting” ways Proposals that cover multiple areas are acceptable and highly encouraged.  On your submission, please indicate a primary track and any secondary tracks for consideration.  The content committee strongly encourages technical/how to sessions, strategic guidance sessions, and real world customer end user case studies, all using Oracle technologies. If you submitted a session last year, your login should carry over for 2018. We will be accepting abstracts on a rolling basis, so please submit your abstracts as soon as possible. What To Expect 400+ Attendees | 90+ Speakers | Hands on Labs | Technical Content| Networking New at this year’s BIWA Summit: Strategy track – targeted at the C-level audience, how to assess and plan for new Oracle Technology in meeting enterprise objectives Oracle Global Leaders track – sessions by Oracle’s Global Leader customers on their use of Oracle Technology, and targeted product managers on latest Oracle products and features Grad-student track – sessions on cutting edge university work using Oracle Technology, continuing Oracle Academy’s sponsorship of graduate student participation  Exciting Topics Include:  Database, Data Warehouse, and Cloud, Big Data Architecture Deep Dives on existing Oracle BI, DW and Analytics products and Hands on Labs Updates on the latest Oracle products and technologies e.g. 
Oracle Big Data Discovery, Oracle Visual Analyzer, Oracle Big Data SQL Novel and Interesting Use Cases of Spatial and Graph, Text, Data Mining, ETL, Security, Cloud Working with Big Data:  Hadoop, "Internet of Things", SQL, R, Sentiment Analysis Oracle Business Intelligence (OBIEE), Oracle Spatial and Graph, Oracle Advanced Analytics —All Better Together Example Talks from BIWA Summit 2017: [Visit www.biwasummit.org to see the  Full Agenda from BIWA’17 and to download copies of BIWA’17 presentations and HOLs.] Machine Learning Taking R to new heights for scalability and performance Introducing Oracle Machine Learning Zeppelin Notebooks Oracle's Advanced Analytics 12.2c New Features & Road Map: Bigger, Better, Faster, More! An Post -- Big Data Analytics platform and use of Oracle Advanced Analytics Customer Analytics POC for a global retailer, using Oracle Advanced Analytics Oracle Marketing Advanced Analytics Use of OAA in Propensity to Buy Models Clustering Data with Oracle Data Mining and Oracle Business Intelligence How Option Traders leverage Oracle R Enterprise to maximize trading strategies From Beginning to End - Oracle's Cloud Services and New Customer Acquisition Marketing K12 Student Early Warning System Business Process Optimization Using Reinforcement Learning Advanced Analytics & Graph: Transparently taking advantage of HW innovations in the Cloud Dynamic Traffic Prediction in Road Networks Context Aware GeoSocial Graph Mining Analytics Uncovering Complex Spatial and Graph Relationships: On Database, Big Data, and Cloud Make the most of Oracle DV (DVD / DVCS / BICS) Data Visualization at SoundExchange – A Case Study Custom Maps in Oracle Big Data Discovery with Oracle Spatial and Graph 12c Does Your Data Have a Story? Find out with Oracle Data Visualization Desktop Social Services Reporting, Visualization, and Analytics Using OBIEE Leadership Essentials in Successful Business Intelligence (BI) Programs Big Data Uncovering Complex Spatial and Graph Relationships: On Database, Big Data, and Cloud Why Apache Spark has become the darling in Big Data space? Custom Maps in Oracle Big Data Discovery with Oracle Spatial and Graph 12c A Shortest Path to Using Graph Technologies– Best Practices in Graph Construction, Indexing, Analytics and Visualization Cloud Computing Oracle Big Data Management in the Cloud Oracle Cloud Cookbook for Professionals Uncovering Complex Spatial and Graph Relationships: On Database, Big Data, and Cloud Deploying Oracle Database in the Cloud with Exadata: Technical Deep Dive Employee Onboarding: Onboard – Faster, Smarter & Greener Deploying Spatial Applications in Oracle Public Cloud Analytics in the Oracle Cloud: A Case Study Deploying SAS Retail Analytics in the Oracle Cloud BICS - For Departmental Data Mart or Enterprise Data Warehouse? 
Cloud Transition and Lift and Shift of Oracle BI Applications Data Warehousing and ETL Business Analytics in the Oracle 12.2 Database: Analytic Views Maximizing Join and Sort Performance in Oracle Data Warehouses Turbocharging Data Visualization and Analyses with Oracle In-Memory 12.2 Oracle Data Integrator 12c: Getting Started Analytic Functions in SQL My Favorite Scripts 2017 Internet of Things Introduction to IoT and IoT Platforms The State of Industrial IoT Complex Data Mashups: an Example Use Case from the Transportation Industry Monetizable Value Creation from Industrial-IoT Analytics Spatial and Graph Summit Uncovering Complex Spatial and Graph Relationships: On Database, Big Data, and Cloud A Shortest Path to Using Graph Technologies– Best Practices in Graph Construction, Indexing, Analytics and Visualization Build Recommender Systems, Detect Fraud, and Integrate Deep Learning with Graph Technologies Building a Tax Fraud Detection Platform with Big Data Spatial and Graph technologies Maps, 3-D, Tracking, JSON, and Location Analysis: What’s New with Oracle’s Spatial Technologies Deploying Spatial Applications in Oracle Public Cloud RESTful Spatial services with Oracle Database as a Service and ORDS Custom Maps in Oracle Big Data Discovery with Oracle Spatial and Graph 12c Smart Parking for a Smart City Using Oracle Spatial and Graph at Los Angeles and Munich Airports Analysing the Panama Papers with Oracle Big Data Spatial and Graph Apply Location Intelligence and Spatial Analysis to Big Data with Java  Example Hands-on Labs from BIWA Summit 2017: Using R for Big Data Advanced Analytics and Machine Learning Learn Predictive Analytics in 2 hours!  Oracle Data Miner Hands on Lab Deploy Custom Maps in OBIEE for Free Apply Location Intelligence and Spatial Analysis to Big Data with Java Use Oracle Big Data SQL to Analyze Data Across Oracle Database, Hadoop, and NoSQL Make the most of Oracle DV (DVD / DVCS / BICS) Analyzing a social network using Big Data Spatial and Graph Property Graph Submit your abstract(s) today, good luck and hope to see you there! See last year’s Full Agenda from BIWA’17.


Oracle R Distribution 3.3.0 Benchmarks

We recently updated the Oracle R Distribution (ORD) benchmarks for version 3.3.0. ORD is based on open source R-3.3.0 and adds support for dynamically loading linear algebra performance libraries installed on your system, including Intel's Math Kernel Library (MKL), AMD's ACML, and the Sun Performance Library for Solaris. These libraries provide optimized, multi-threaded math routines that give the relevant R functions maximum performance. The benchmark results demonstrate the performance of Oracle R Distribution 3.3.0 with and without dynamically loaded MKL. We executed the community-based R-Benchmark-25 script, which consists of a set of tests that benefit from faster matrix computations. The tests were run on a 24-core Linux system with 3.07 GHz CPUs and 47 GB RAM. On average, Oracle R Distribution with dynamically loaded MKL and 8 threads is 50 times faster than Oracle R Distribution (and open source R) using R's internal BLAS library (Netlib) with 1 thread. Matrix multiplication is an astounding 94 times faster than single-threaded R, and principal components analysis is 32 times faster. As always, ORD is free to download, use, and share, and is available from Oracle's Open Source Software portal. Oracle R Distribution 3.3.0 will be supported with the upcoming version of Oracle R Enterprise 1.5.1. Installation instructions are in the Oracle R Enterprise Installation and Administration Guide. Look for a future blog post announcing its release.
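For a quick sanity check of the BLAS speedup on your own system (a minimal sketch, not the full R-Benchmark-25 script), timing a large matrix multiplication with and without MKL loaded is usually enough to see the difference:

set.seed(1)
n <- 2000
a <- matrix(rnorm(n * n), n, n)
b <- matrix(rnorm(n * n), n, n)
system.time(a %*% b)   # compare elapsed time using Netlib BLAS vs. dynamically loaded MKL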


R Consortium "Code Coverage Tool for R" Working Group Achieves First Release

The "Code Coverage Tool for R" project, proposed by Oracle and approved by the R Consortium Infrastructure Steering Committee, started just over a year ago. Project goals included providing an enhanced tool that determines code coverage upon execution of a test suite, and leveraging such a tool more broadly as part of the R ecosystem.

What is code coverage?

As defined in Wikipedia, “code coverage is a measure used to describe the degree to which the source code of a program is executed when a particular test suite runs. A program with high code coverage, measured as a percentage, has had more of its source code executed during testing which suggests it has a lower chance of containing undetected software bugs compared to a program with low code coverage.”

Why code coverage?

Code coverage is an essential metric for understanding software quality. For R, developers and users alike should be able to easily see what percent of an R package’s code has been tested and the status of those tests. By knowing code is well-tested, users have greater confidence in selecting CRAN packages. Further, automating test suite execution with code coverage analysis helps ensure new package versions don’t unknowingly break existing tests and user code.

Approach and main features in release

After surveying the available code coverage tools in the R ecosystem, the working group decided to use the covr package, started by Jim Hester in December 2014, as a foundation and continue to build on its success. The working group has enhanced covr to support even more R language aspects and needed functionality, including:

R6 methods support
Support for parallel code coverage
Enabling compilation of R with the Intel compiler ICC
Enhanced documentation / vignettes
A tool for benchmarking and defining a canonical test suite for covr
Cleaning up dependent package license conflicts and changing the covr license to GPL-3

CRAN Process

Today, code coverage is an optional part of R package development. Some package authors/maintainers provide test suites and leverage code coverage to assess code quality. As noted above, code coverage has significant benefits for the R community to help ensure correct and robust software. One of the goals of the Code Coverage project is to incorporate code coverage testing and reporting into the CRAN process. This will involve working with the R Foundation and the R community on the following points:

Encourage package authors and maintainers to develop, maintain, and expand test suites with their packages, and use the enhanced covr package to assess coverage
Enable automatic execution of provided test suites as part of the CRAN process; just as binaries of software packages are made available, test suites would be executed and code coverage computed per package
Display on each package's CRAN web page its code coverage results, e.g., the overall coverage percentage and a detailed report showing coverage per line of source code

Next Steps

The working group will assess additional enhancements for covr that will benefit the R community. In addition, we plan to explore with the R Foundation the inclusion of code coverage results in the CRAN process.

Acknowledgements

The following individuals are members of the Code Coverage Working Group:

Shivank Agrawal
Chris Campbell
Santosh Chaudhari
Karl Forner
Jim Hester
Mark Hornick – Group Lead
Chen Liang
Willem Ligtenberg
Andy Nicholls
Vlad Sharanhovich
Tobias Verbeke
Qin Wang
Hadley Wickham – ISC Sponsor
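For package authors who want to try the coverage workflow described above, a minimal covr session looks like the following (a sketch; the package path is a placeholder):

library(covr)
cov <- package_coverage("path/to/yourPackage")   # runs the package's tests and records coverage
percent_coverage(cov)                            # overall coverage percentage
report(cov)                                      # per-line coverage report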


Oracle R Distribution 3.3.0 Released

Oracle R Distribution version 3.3.0 is released on all supported platforms. This release, code-named "Supposedly Educational", contains several significant bug fixes and improvements to R, including:

Support for downloading data from secure https-enabled sites using download.file
Speed improvements for a number of low-level R functions called by higher-level, commonly used functions, including speedups for vector selection with boolean data, function argument matching, sorting vectors, and finding a single value in a vector with match
A new function, sigma, to calculate the residual standard deviation for a variety of statistical models
A new high-performance radix sort algorithm contributed by Matt Dowle
Support on Windows for packages built using C++11 code

Improvements specific to Oracle's Distribution of R include:

The ability to install to a non-default R_HOME directory on Unix systems
A new RPM, R-core-extra, containing several required libraries not available on Linux 6 systems

R has always depended on several third-party libraries (curl, zlib, bzip2, xz, and pcre). Prior to R-3.3.0, R depended on much older versions of these libraries, and if they were not found on the system, bundled copies were built on the fly. R-3.3.0 depends on much newer versions of these libraries and no longer contains the bundled copies. This means that R-3.3.0 won't build against Linux 6 as is, because the native versions of these libraries are older than those R-3.3.0 requires. The R-core-extra RPM contains the required versions of these libraries and is provided as a convenience for users of Oracle Linux 6. Adding the location of the libraries in R-core-extra to LD_LIBRARY_PATH removes the need to build these libraries separately. Oracle Linux 7 includes the required versions of these libraries natively.

The yum commands to install Oracle R Distribution 3.3.0 on Linux 6 are as follows:

yum install R-3.3.0
yum install R-core-extra

Then set the LD_LIBRARY_PATH environment variable to the location of the R-core-extra RPM. For example, the default location of the R-core-extra RPM is /usr/lib64/R/port/Linux-X64/lib. The following command sets LD_LIBRARY_PATH to this location:

export LD_LIBRARY_PATH=/usr/lib64/R/port/Linux-X64/lib

On Linux 7, the required versions of these libraries are available natively, so setting LD_LIBRARY_PATH is not required. Oracle R Distribution 3.3.0 will be certified with Oracle R Enterprise 1.5.1. Refer to Table 1-3 in the Oracle R Enterprise Installation Guide for supported configurations of Oracle R Enterprise components. To install Oracle R Distribution, follow the instructions for your platform in the Oracle R Enterprise Installation and Administration Guide.
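As a small illustration of one of the new features listed above, sigma() returns the residual standard deviation of a fitted model (a quick sketch using a built-in data set):

fit <- lm(mpg ~ wt + hp, data = mtcars)
sigma(fit)   # the value reported as "Residual standard error" by summary(fit)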


Diabetes Data Analysis in R

Data collected from diabetes patients has been widely investigated by many data science applications. Popular data sets include the PIMA Indians Diabetes Data Set and the Diabetes 130-US hospitals for years 1999-2008 Data Set. Both data sets are aggregated, labeled, and relatively straightforward to use for machine learning tasks. However, in the real world, diabetes data are often collected from healthcare instruments attached to patients. The raw data can be sporadic and messy, and analyzing such data requires more preprocessing. In this blog, we explore an interesting diabetes data set to demonstrate the powerful data manipulation capability of R with Oracle R Enterprise (ORE), a component of Oracle Advanced Analytics, an option to Oracle Database Enterprise Edition. Note that this data analysis is for machine learning study only. We are not medical researchers or physicians in the diabetes domain; our knowledge of this disease comes from the material included with the data set.

Data Overview

The data is from the UCI archive. It was collected from electronic recording devices as well as paper records for 70 diabetes patients. For each patient, there is a file that contains 3-4 months of glucose level measurements and insulin dosages, as well as other special events (exercise, meal consumption, etc.). First, we need to construct a data frame from the 70 separate files. This can be readily accomplished in R as follows; however, if the data were provided as several database tables, the function rbind, overloaded by ORE to work on ore.frame objects, could be used to union these tables.

dd.list <- list(0)
for(i in 1:70) {
  fileName <- sprintf("data-%02d", i)
  dd <- read.csv(fileName, header=FALSE, sep='\t')
  datetime.vec <- paste(dd$V1, dd$V2)
  dd$datetime <- as.POSIXct(strptime(datetime.vec, "%m-%d-%Y %H:%M"))
  colnames(dd) <- c('DATE', 'TIME', 'CODE', 'VALUE', 'DATETIME')
  dd$CODE <- as.factor(dd$CODE)
  dd.list[[i]] <- data.frame(ID=i, dd)
}
dd.df <- do.call("rbind", dd.list)
dd.df <- subset(dd.df, !is.na(dd.df$DATETIME))
dd.df$NO <- row.names(dd.df)
head(dd.df)

ID       DATE  TIME CODE VALUE            DATETIME
 1 04-21-1991  9:09   58   100 1991-04-21 09:09:00
 1 04-21-1991  9:09   33     9 1991-04-21 09:09:00
 1 04-21-1991  9:09   34    13 1991-04-21 09:09:00
 1 04-21-1991 17:08   62   119 1991-04-21 17:08:00
 1 04-21-1991 17:08   33     7 1991-04-21 17:08:00
 1 04-21-1991 22:51   48   123 1991-04-21 22:51:00

We can store the data frame in Oracle Database using ore.create.

library(ORE)
ore.connect(...) # connect to Oracle Database
ore.drop(table="DD")
ore.create(dd.df, table="DD")

The column ID represents the patient ID and DATETIME is the timestamp when the event or measurement occurred. The field CODE stands for the particular type of measurement; the exact mapping can be found in the 'Data-Codes' file. Here is an example of some of the codes.

33 = Regular insulin dose
34 = NPH insulin dose
35 = UltraLente insulin dose
48 = Unspecified blood glucose measurement
58 = Pre-breakfast blood glucose measurement
62 = Pre-supper blood glucose measurement
65 = Hypoglycemic symptoms
66 = Typical meal ingestion
69 = Typical exercise activity

What can we do with this type of data? In the raw data files, data points are recorded in a 'transaction' style and the time interval is irregular. Also, the data set does not have any clear label or indicator. This makes the data difficult to work with, so we need to preprocess it for machine learning tasks. Next, we will show how to leverage the data to carry out the analysis.
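Before preprocessing, a quick exploratory count of how often each CODE occurs helps confirm the table loaded as expected. A minimal sketch using the ORE transparency layer (this assumes the overloaded table() function in ORE; nrow() and subset() on ore.frame objects are used the same way as above):

nrow(DD)                        # total number of records loaded
nrow(subset(DD, CODE == "33"))  # number of regular insulin dose events
table(DD$CODE)                  # per-code counts, computed in the database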
Clustering Analysis

Since the patients may have different levels of symptoms and also vary in treatment (such as insulin dose), we first conduct a clustering analysis to see if there are underlying groups. For now, we ignore the timestamps and simply aggregate at the patient level. We calculate the average value for each code, so that each average code value can be used as a feature. Note that for an event CODE, the VALUE is always zero, since it only indicates that an event happened at that time. In that case, we calculate the average number of occurrences per day. For each patient, we combine this information to form a feature vector. Here, we need to transpose the data frame, that is, turn the CODE values into separate columns. This can be done with a 'pivot table' operation, which can be realized in R using the reshape2 library. We can use reshape2 through ORE embedded R execution. See the code below.

full_code_list <- c(33, 34, 35, 48, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)
event_list <- c(65, 66, 67, 68, 69, 70, 71, 72)

aggregate.code <- function(DD, full_code_list, event_list){
  code_filtered <- subset(DD, CODE %in% full_code_list)
  code_agg <- aggregate(code_filtered$VALUE,
                        by=list(ID=code_filtered$ID, CODE=code_filtered$CODE),
                        FUN=mean, na.rm=TRUE)
  colnames(code_agg)[3] <- 'MEASURE'
  activity_filtered <- subset(DD, CODE %in% event_list)
  activity_count <- aggregate(activity_filtered$VALUE,
                              by=list(ID=activity_filtered$ID, CODE=activity_filtered$CODE),
                              FUN=length)
  colnames(activity_count)[3] <- 'MEASURE'
  total.date <- length(unique(DD$DATE))
  activity_count$MEASURE <- activity_count$MEASURE*1.0/total.date
  feature.agg <- rbind(code_agg, activity_count)
  feature.df <- aggregate(feature.agg$MEASURE,
                          by=list(ID=feature.agg$ID, CODE=feature.agg$CODE), FUN=max)
  library(reshape2)
  cast.df <- dcast(feature.df, ID~CODE)
  for (col in colnames(cast.df)){
    cast.df[,col] <- ifelse(is.na(cast.df[,col]), 0, cast.df[,col])
  }
  ore.drop(table='PIVOTED')
  ore.create(cast.df, table='PIVOTED')
  TRUE
}

ore.tableApply(DD,                  # an ore.frame referencing the prepared database table
  aggregate.code,                   # function defined above to execute on the table
  ore.connect = TRUE,               # allows creating the resulting table in the function
  full_code_list = full_code_list,  # data for function args
  event_list = event_list)
ore.sync(table = "PIVOTED")         # make the new table accessible in the client session

Now we obtain the ore.frame PIVOTED, which contains the average value of each CODE for each patient ID. In fact, the entire operation is done on the database server and the result is stored as a database table. We can have a look at the obtained data set below.

head(PIVOTED[,1:7])
ID        33        34       35       48     57       58
 1  6.593750 16.892086 0.000000 150.1538   0.00 169.7185
 2 10.060847 11.333333 0.000000 201.4022   0.00 207.8438
 3  2.433333  8.000000 8.452055   0.0000 120.50 117.6750
 4  2.304348  8.413793 8.444444   0.0000 142.75 141.5714
 5  2.388889  8.500000 0.000000   0.0000 183.40 147.4324
 6  6.084746 18.000000 0.000000 246.5556   0.00 213.5238

For ease of presentation, we omitted the other CODE values; in total, there are 20 types of CODE. The next job is to cluster this data. Clustering data with high dimensionality is usually not ideal, since the distance between data points tends to be large. It helps to first do a PCA and cluster on the principal components (PCs), which indicate the directions of highest variation in the features. In ORE, the following code carries out the PCA.
Note that the function looks the same as in open source R - it is overloaded in ORE.

dd.pca <- prcomp(PIVOTED[, -1],   # use the ore.frame and exclude the ID column
                 center = TRUE, scale. = TRUE)

We can convert the original data frame to the space of principal components (PCs) using the code below.

dd.pca.ore <- ore.predict(dd.pca, PIVOTED[, -1])

In the principal component space, we use k-means to cluster over the first two PCs.

model.km <- ore.odmKMeans(~., dd.pca.ore[, c("PC1","PC2")], num.centers=5)
km.res <- predict(model.km, dd.pca.ore, type="class", supplemental.cols=c("PC1","PC2"))

To generate the plot, we use ggplot2, which can also be used through ORE embedded R execution. For simplicity here, however, we pull the data from the database and call ggplot2 locally.

km.res.df <- ore.pull(km.res)
ggplot(km.res.df, aes(PC1, PC2)) + geom_point(aes(color=factor(CLUSTER_ID)))

The clusters are illustrated in the plot. A natural question is: what are the essential differences between the clusters? We plot several boxplots of glucose level and insulin dose to get an idea of the distribution of the patients' characteristics.

pivoted <- ore.pull(PIVOTED)
library(ggplot2)
colnames(pivoted) <- append('ID', paste("CODE", colnames(pivoted[,-1]), sep = '_'))
label.df <- cbind(pivoted, km.res.df$CLUSTER_ID)
colnames(label.df)[ncol(label.df)] <- 'CLUSTER_ID'
library(gridExtra)
p1 <- ggplot(label.df, aes(factor(CLUSTER_ID), CODE_35)) + geom_boxplot() +
  labs(x = "CLUSTER", y = 'Regular insulin dose') +
  ggtitle("Boxplot of Selected Features across Patient Groups") +
  theme(plot.title = element_text(hjust = 0.5))
p2 <- ggplot(label.df, aes(factor(CLUSTER_ID), CODE_65)) + geom_boxplot() +
  labs(x = "CLUSTER", y = 'Hypoglycemic')
p3 <- ggplot(label.df, aes(factor(CLUSTER_ID), CODE_69)) + geom_boxplot() +
  labs(x = "CLUSTER", y = 'Typical Exercise')
p4 <- ggplot(label.df, aes(factor(CLUSTER_ID), CODE_62)) + geom_boxplot() +
  labs(x = "CLUSTER", y = 'Pre-supper glucose')
grid.arrange(p1, p2, p3, p4, nrow=4)

From these boxplots, we can see that the differences among the patient clusters involve insulin dose, frequency of hypoglycemic symptoms, frequency of exercise, and pre-supper glucose level. One interesting observation is that the patients with a higher amount of exercise (cluster 8) have relatively lower glucose levels than other groups with a similar dose of insulin (cluster 5), but more hypoglycemic symptoms (which occur because of low glucose levels). Although this clustering analysis sheds some light on differences among patients, we lose a lot of the information associated with the timestamped data. Next, we will see how to leverage the data without aggregation.

Regression Analysis

One important question for a diabetes patient is how to estimate the glucose level in order to control the insulin dose. This topic has been widely studied and is an important direction in systems biology. Previous work had access to time-stamped data at high frequency (every 15 minutes), but in this data set each data point comes at a large and irregular time interval. In this case, we can still do regression on the glucose level by considering all factors within a time window: for a glucose measurement at a particular time, we focus on all the insulin doses and previous events within the 14 hours before that time. Then a regression model can be built considering the insulin doses of all types and the previous events.
To do the regression, we need to create a data frame that contains the target glucose level and all features related to it. For that purpose, we go through each row of the original data frame, find all features within the time window, and collect them for that data point. This is done in the following function.

related.row <- function(row, dd.df){
  range.df <- subset(dd.df, (ID == row$ID &
                             row$DATETIME - DATETIME < 14*3600 &
                             row$DATETIME - DATETIME > 0))
  range.df$TIME_DIFF = row$DATETIME - range.df$DATETIME
  events.df <- subset(range.df, (CODE %in% c(66,67,68,69,70,71)))
  events.df <- events.df[which.max(events.df$DATETIME), ]
  events.df$LAST_EVENT <- ifelse(events.df$CODE %in% c(67), 1, -1)
  col.names <- c('C33', 'T33', 'C34', 'T34', 'C35', 'T35',
                 'LAST_EVENT', 'EVENT_TIME', 'LAST_GLU', 'LAST_TIME', 'TARGET')
  glu.df <- subset(range.df, CODE %in% c(48, 57, 58, 59, 60, 61, 62, 63, 64))
  last.glu <- glu.df[which.max(glu.df$DATETIME), ]
  last.glu$TIME_DIFF <- row$DATETIME - last.glu$DATETIME
  C33 <- subset(range.df, (CODE==33))
  C33 <- C33[which.max(C33$DATETIME), c('VALUE', 'TIME_DIFF')]
  C34 <- subset(range.df, CODE == 34)
  C34 <- C34[which.max(C34$DATETIME), c('VALUE', 'TIME_DIFF')]
  C35 <- subset(range.df, CODE == 35)
  C35 <- C35[which.max(C35$DATETIME), c('VALUE', 'TIME_DIFF')]
  row.list <- list(C33$VALUE, C33$TIME_DIFF, C34$VALUE, C34$TIME_DIFF,
                   C35$VALUE, C35$TIME_DIFF,
                   events.df$LAST_EVENT, events.df$TIME_DIFF,
                   last.glu$VALUE, last.glu$TIME_DIFF, row$VALUE)
  row.result <- lapply(row.list, function(x) { ifelse(length(x)==0, 0, x)})
  new.df <- data.frame(row.result)
  colnames(new.df) <- col.names
  new.df[1,]
}

combine.row <- function(DD.glu, dd.df, related.row) {
  N.glu <- nrow(DD.glu)
  row <- DD.glu[1,]
  train.df <- related.row(row, dd.df)
  for(i in 2:N.glu){
    row <- DD.glu[i,]
    new.row <- related.row(row, dd.df)
    if(all(new.row[1,]==0) != TRUE) train.df <- rbind(train.df, new.row)
  }
  train.df <- train.df[-1,]
  ore.drop(table='TRAIN')
  ore.create(train.df, table='TRAIN')
  TRUE
}

DD.glu <- subset(DD, CODE %in% c(48, 57, 58, 60, 62))
row.names(DD.glu) <- DD.glu$NO
res <- ore.tableApply(DD.glu, combine.row, dd.df = DD, related.row = related.row, ore.connect = TRUE)
ore.sync(table ='TRAIN')

We can have a look at the resulting data frame:

C33       T33 C34      T34 C35      T35 LAST_EVENT EVENT_TIME LAST_GLU LAST_TIME TARGET
  1 12.483333   0 0.000000   8 12.48333          1  12.483333      220 12.516667    118
  4  9.233333   8 9.233333   0  0.00000          1   9.233333      272  9.283333    213
  0  0.000000   0 0.000000   0  0.00000         -1   0.850000      222  7.883333     71
  0  0.000000   0 0.000000   0  0.00000         -1   6.100000      222 13.133333    193
  4  1.883333   4 1.883333   0  0.00000          1   1.733333       70  5.383333    134
  4  5.300000   4 5.300000   0  0.00000          1   5.150000       70  8.800000    281

On an i5-based laptop with 16 GB of memory, this normally takes about 6 minutes. The data set itself has only 29244 rows, so if this performance is not acceptable, we can further optimize the solution by including native SQL in our R code. For data scientists comfortable with SQL, being able to leverage SQL in conjunction with R and ore.frame objects is a significant advantage of ORE. Let us review the process. The operation needs to join each glucose measurement row to the original data set ('DD') and filter on the timestamps within a certain time window. Doing this in R with explicit iteration is expensive. One solution is to use an Oracle SQL query to do the join and then perform the subsequent operations in parallel using groupApply in ORE embedded R execution.
The relational database optimizes table join performance, so we can take advantage of that using the following SQL query.

ore.exec("CREATE TABLE DD_GLU_AGG AS
  SELECT ID, GLU_NO, GLUCOSE, CODE, VALUE,
         EXTRACT(HOUR FROM GLU_TIME - DATETIME) +
         EXTRACT(MINUTE FROM GLU_TIME - DATETIME)/60 AS TIME_DIFF
  FROM (
    SELECT ID, GLU_NO, GLUCOSE, GLU_TIME, CODE, VALUE, DATETIME,
           row_number() OVER (
             PARTITION BY ID, GLU_NO, GLUCOSE, GLU_TIME, CODE
             ORDER BY DATETIME NULLS LAST) AS RANK
    FROM (
      SELECT a.ID, GLU_NO, GLUCOSE, GLU_TIME,
             CASE WHEN b.CODE IN (48, 57, 58, 59, 60, 61, 62, 63, 64) THEN '0'
                  ELSE (CASE WHEN b.CODE IN (66, 68, 69, 70, 71) THEN '-1'
                             ELSE (CASE WHEN b.CODE = 67 THEN '1' ELSE b.CODE END) END) END AS CODE,
             b.VALUE, b.DATETIME
      FROM (
        SELECT ID, NO AS GLU_NO, VALUE AS GLUCOSE, DATETIME AS GLU_TIME
        FROM DD
        WHERE CODE IN (48, 57, 58, 60, 62)
      ) a
      JOIN DD b
        ON (GLU_TIME - b.DATETIME) < INTERVAL '14' HOUR
       AND (GLU_TIME - b.DATETIME) > INTERVAL '0' HOUR
       AND a.ID = b.ID) c
    ORDER BY ID, GLU_NO, GLUCOSE, GLU_TIME, CODE, VALUE, DATETIME) d
  WHERE RANK = 1")
ore.sync(table='DD_GLU_AGG')

This query generates a table that looks like

ID GLU_NO GLUCOSE CODE VALUE TIME_DIFF
 1    105     282    0   183  13.13333
 1    105     282   33    10  13.13333
 1    105     282   34    14  13.13333
 1    105     282   65     0   6.30000
 1    107      91    0   282   9.40000
 1    107      91   33     7  13.70000

The table keeps, for each (ID, GLU_NO) pair, the most recent occurrence of each CODE along with its time difference. Next, we can use groupApply() to do the rest of the job in parallel.

form.row <- function(DD_GLU_AGG){
  C33 <- subset(DD_GLU_AGG, CODE == 33)
  C34 <- subset(DD_GLU_AGG, CODE == 34)
  C35 <- subset(DD_GLU_AGG, CODE == 35)
  events.df <- subset(DD_GLU_AGG, CODE %in% c(1,-1))
  events.df$VALUE <- ifelse(events.df$CODE == -1, -1, 1)
  last.glu <- subset(DD_GLU_AGG, CODE == 0)
  row.list <- list(C33$VALUE, C33$TIME_DIFF, C34$VALUE, C34$TIME_DIFF,
                   C35$VALUE, C35$TIME_DIFF,
                   events.df$VALUE, events.df$TIME_DIFF,
                   last.glu$VALUE, last.glu$TIME_DIFF, DD_GLU_AGG$GLUCOSE[1])
  row.result <- lapply(row.list, function(x) { ifelse(length(x)==0, 0, x)})
  new.df <- data.frame(row.result)
  # C33 is the regular insulin dose; T33 is the time between the injection and the glucose measurement
  col.names <- c('C33', 'T33', 'C34', 'T34', 'C35', 'T35',
                 'LAST_EVENT', 'EVENT_TIME', 'LAST_GLU', 'LAST_TIME', 'TARGET')
  colnames(new.df) <- col.names
  new.df[1,]
}

Train <- ore.groupApply(DD_GLU_AGG,                      # ore.frame proxy for the database table
           DD_GLU_AGG[, c("ID", "GLU_NO")],              # columns for partitioning the data
           form.row,
           FUN.VALUE = data.frame(C33 = integer(0),      # define the resulting table structure
                                  T33 = numeric(0),
                                  C34 = integer(0),
                                  T34 = numeric(0),
                                  C35 = integer(0),
                                  T35 = numeric(0),
                                  LAST_EVENT = integer(0),
                                  EVENT_TIME = numeric(0),
                                  LAST_GLU = numeric(0),
                                  LAST_TIME = numeric(0),
                                  TARGET = numeric(0)),
           parallel = 4)
# materialize the ore.frame
train.df <- ore.pull(Train)

Using this approach, the entire process takes around 1 minute, a 6x performance improvement. The ORE framework allows us to run the in-database queries and R analytics without moving the data off the database server. After the data is prepared, we can go ahead and build regression models. For simplicity, we treat all data points as homogeneous, which means we assume all patients respond to insulin doses in the same way. A regression model is fit to this data using ORE's parallel implementation of lm: ore.lm.
model.formula <- formula( TARGET ~ C33 + T33 + C34 + T34+ C35 + T35 + LAST_EVENT + EVENT_TIME + LAST_GLU + LAST_TIME) model.lm <- ore.lm(model.formula, TRAIN) summary(model.lm) Call: ore.lm(formula = model.formula, data = TRAIN) Residuals: Min 1Q Median 3Q Max -488.69 -59.35 -7.54 46.96 370.25 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 132.316609 2.770239 47.764 < 2e-16 *** C33 1.699390 0.143044 11.880 < 2e-16 *** T33 2.339259 0.244046 9.585 < 2e-16 *** C34 -0.384308 0.068708 -5.593 2.29e-08 *** T34 -0.171930 0.102842 -1.672 0.09460 . C35 0.278242 0.122206 2.277 0.02282 * T35 -0.285301 0.114634 -2.489 0.01283 * LAST_EVENT -0.150663 2.852225 -0.053 0.95787 EVENT_TIME -0.206862 0.302349 -0.684 0.49388 LAST_GLU 0.003885 0.009028 0.430 0.66696 LAST_TIME 0.580310 0.182167 3.186 0.00145 ** --- Residual standard error: 77.27 on 9952 degrees of freedom Multiple R-squared: 0.04094, Adjusted R-squared: 0.03998 F-statistic: 42.48 on 10 and 9952 DF, p-value: < 2.2e-16 We can see that most of the features have significant impact on the response. But the R squared score is low. This is because we use all patients' data and the variance of glucose level given the same condition could be high. Moreover, we can run a decision tree to see the effect of insulin dose and time of injection on the glucose level. For better visualization purpose, we use the conditional inference tree package {party}. train.df <- ore.pull(TRAIN) library(party) model.ct <- ctree(model.formula, data=train.df) plot(model.ct, main="Conditional Inference Tree for GLU") The plot provides boxplots of glucose level in different partitions. It is easy to understand that for larger doses of UltraLente insulin (C35), the related glucose level is lower, similar for NPH insulin (C34). However, the partition related to regular insulin (C33) indicates that the one with a higher dose of regular insulin tends to have a high level of glucose, which seems to be paradoxical. One explanation is that the patient who has severe symptoms tends to take more regular insulin. This reminds us that the data only provide evidence of correlation, not causality. Statistical Test of Hypoglycemic Symptoms One of the symptoms recorded in the data set is the hypoglycemic symptom. This symptom is supposed to occur when the patient has too low of a glucose level. The following plot illustrates the occurrence of this symptom. The blue dots are the glucose level of one patient and the red vertical line marks the occurrence of hypoglycemic symptoms. We can see that most of the symptoms are related to the low glucose level. To verify this fact statistically, we can run a T-test to check if there is a significant difference in the glucose level. Here is the code for this analysis. Basically, the code goes through each hypoglycemic event and finds the nearest glucose measurement. Then compares the group of glucose level associated with hypoglycemic event and the one not. rm(list=ls()) library(ORE) options(ore.warn.order=FALSE) ore.connect(...) 
ore.ls() DD.hypo <- subset(DD, CODE==65) rownames(DD.hypo) <- DD.hypo$NO rownames(DD) <- DD$NO row.ahead <- function(row, dd.df){ range.df <- subset(dd.df, ID == row$ID & (row$DATETIME - DATETIME < 0.25*3600) & (row$DATETIME - DATETIME > - 0.25*3600)) glu.df <- subset(range.df, (CODE %in% c(48, 57, 58, 59, 60, 61, 62, 63, 64))) glu.df$TIME_DIFF = row$DATETIME -glu.df$DATETIME events.df <- subset(range.df, CODE %in% c(66,67,68,69,70, 71)) events.df$TIME_DIFF = row$DATETIME -events.df$DATETIME events.df <- events.df[which.max(events.df$DATETIME), ] events.df$LAST_EVENT = ifelse( events.df$CODE %in% c(67), 1, -1) col.names <- c('LAST_EVENT', 'EVENT_TIME', 'LAST_NO', 'LAST_GLU', 'MEAUSRE_TIME', 'HYPO_TIME') if(nrow(glu.df) ==0){ empty.df <- data.frame(as.list(rep(0, length(col.names) ))) colnames(empty.df) <- col.names return(empty.df) } last.glu <- glu.df[which.min(abs(glu.df$TIME_DIFF)), ] row.list <- list(ifelse(nrow(events.df)==0, 0, events.df$LAST_EVENT), ifelse(nrow(events.df)==0, 0, events.df$TIME_DIFF), ifelse(nrow(last.glu) ==0, 0, last.glu$NO), ifelse(nrow(last.glu) ==0, 0, last.glu$VALUE), ifelse(nrow(last.glu) ==0, 0, last.glu$TIME_DIFF), row$DATETIME ) row.result <- lapply(row.list, function(x) { ifelse(is.na(x), 0, x)}) new.df <- data.frame(row.list) colnames(new.df) <- col.names new.df[1,] } combine.rows <- function(DD.hypo, dd.df, row.ahead){ N.glu <- nrow(DD.hypo) row <- DD.hypo[1,] hypo.df <- row.ahead(row, dd.df) for(i in 2:N.glu){ row <- DD.hypo[i,] new.row <- row.ahead(row,dd.df) if(all(new.row[1,]==0) != TRUE) hypo.df <- rbind(hypo.df, new.row) } hypo.df <- hypo.df[-1,] ore.drop(table='HYPO') ore.create(hypo.df, table='HYPO') TRUE } res <- ore.tableApply(DD.hypo, combine.rows, dd.df = DD, row.ahead = row.ahead, ore.connect = TRUE) ore.sync(table ='HYPO') hypo.df <- ore.pull(HYPO) DD.glu <- subset(DD, (CODE %in% c(48, 57, 58, 59, 60, 61, 62, 63, 64))) dd.glu.df <- ore.pull(DD.glu) boxplot(HYPO$LAST_GLU) dd.glu.df$HYPO = ifelse(dd.glu.df$NO %in% hypo.df$LAST_NO, 1, 0 ) library(ggplot2) dd.glu.df$HYPO <- as.factor(dd.glu.df$HYPO) p <- ggplot(dd.glu.df, aes(HYPO, VALUE)) p + geom_boxplot() t.test(dd.glu.df[dd.glu.df$HYPO==1,]$VALUE,dd.glu.df[dd.glu.df$HYPO==0,]$VALUE) Welch Two Sample t-test data: dd.glu.df[dd.glu.df$HYPO == 1, ]$VALUE and dd.glu.df[dd.glu.df$HYPO == 0, ]$VALUE t = -35.265, df = 251.52, p-value < 2.2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -106.35820 -95.10693 sample estimates: mean of x mean of y 61.18304 161.91560 The p-value is lower than 0.05, so we can reject the null hypothesis that the two groups have the same glucose level. This provides statistical evidence of the relationship between glucose level and the hypoglycemic symptom. Conclusion In this blog, we demonstrated the data wrangling and analysis capability of R and ORE for the diabetes data set. A workable dataset was successfully created from the raw data. Based on the dataset, a clustering and decision tree based analysis and visualization provided important insights into the data, which can be useful for evaluation of the effect of the treatment for diabetes patients

Data collected from diabetes patients has been widely investigated nowadays by many data science applications. Popular data sets include PIMA Indians Diabetes Data Set or Diabetes 130-US hospitals for...

R Technologies

Parallel Training of Multiple Foreign Exchange Return Models

In a variety of machine learning applications, there are often requirements for training multiple models. For example, in the internet of things (IoT) industry, a unique model needs to be built for each household with installed sensors that measure temperature, light or power consumption. Another example can be found in the online advertising industry. To serve personalized online advertisements or recommendations, a huge number of individualized models have to be built and maintained because each online user has a unique browsing history. Moreover, such models have to be updated frequently to capture changes in consumer behavior.

When the number of models grows large, even a carefully designed and proven algorithm can be a challenge to implement in production. Time-sensitive applications, in particular, cannot afford the extra delay caused by iterating through a huge number of models. A good example is the financial industry. In this article, we will show an example of fitting multiple foreign exchange (FX) rate models and demonstrate how we can leverage the powerful parallel computation capability provided by Oracle R Enterprise (ORE), a component of Oracle Advanced Analytics - an option to Oracle Database Enterprise Edition.

FX Rate Data

The FX rate data can be obtained from Federal Reserve Economic Data. Instead of going online to fetch the data manually, the R library 'quantmod' provides a convenient way of downloading the data. Here is the code used for this purpose.

library(quantmod)
symbol = "DEXCAUS"
getSymbols(symbol, src="FRED")

The symbol "DEXCAUS" refers to the FX rate of the Canadian dollar to the US dollar. In this example, we downloaded foreign exchange rates for 22 currencies and focused on the time range from 1999 to 2015.

rm(list=ls())
symbols <- c("DEXBZUS", "DEXCAUS", "DEXCHUS", "DEXDNUS", "DEXHKUS", "DEXINUS",
             "DEXJPUS", "DEXKOUS", "DEXMXUS", "DEXNOUS", "DEXSDUS", "DEXSFUS",
             "DEXSIUS", "DEXSLUS", "DEXSZUS", "DEXTAUS", "DEXTHUS", "DEXUSAL",
             "DEXUSEU", "DEXUSNZ", "DEXUSUK", "DEXVZUS")
for(symbol in symbols){
  getSymbols(symbol, src="FRED")
}
mergecode <- paste("merge(", paste(symbols, collapse=","), ")", collapse="")
merged.df <- eval(parse(text=mergecode))
fxrates.df.raw <- data.frame(date=index(merged.df), coredata(merged.df))
fxrates.df.raw <- fxrates.df.raw[fxrates.df.raw$date > '1999-01-04',]

Non-stationarity

Let us take a first look at the FX rate data. We plot the FX rate of the Canadian dollar to the US dollar:

ggplot(fxrates.df.raw[fxrates.df.raw$date > '2015-01-01',], aes(date, DEXCAUS)) +
  geom_line() +
  labs(x = "day", title = "CA dollar ~ US dollar FX Rate") +
  theme(plot.title = element_text(hjust = 0.5))

At first glance, the series does not appear to be stationary. To confirm this, we can run an Augmented Dickey-Fuller test to check whether it has a unit root, i.e., whether the series follows F(t) = ρF(t-1) + a(t) with ρ = 1. We can use the R library fUnitRoots to do the test. The null hypothesis is that the unit root exists.

library(fUnitRoots)
adfTest(fxrates.df.raw$DEXCAUS)

The result is as follows:

Title:
 Augmented Dickey-Fuller Test

Test Results:
  PARAMETER:
    Lag Order: 1
  STATISTIC:
    Dickey-Fuller: -0.8459
  P VALUE:
    0.3467

Since p >> 0.05, we cannot reject the null hypothesis. This suggests that there is a unit root in this series, confirming that the time series is non-stationary.

FX Rate Prediction

Foreign exchange rate series are known to be difficult to predict.
Their predictability has long been questioned, since exchange rates often appear untied to economic fundamentals. Thus, a random walk model is often used as a benchmark, and in this article we will implement a random walk model for demonstration purposes. A random walk model is formulated as

F(t) = F(t-1) + a(t),

where a(t) is zero-mean random noise. In R, we can use the following function to fit a random walk model.

arima(data, c(0,1,0))

This basically means that we remove both the MA and AR parts and only retain the integrated (I) part, which is exactly the random walk model.

The prediction is often backtested in a moving window fashion. For each time step t, the model is trained using data over [t-L-1, t-1], which is a window with length L. The prediction result is then evaluated on out of sample (OOS) data. Then, we move the window forward for every t and calculate the out of sample error. Here, we only use one sample as OOS data, which means that we use a window of historical data to predict the next day's FX rate.

There are many ways to evaluate the result of backtesting. Here, we adopted the R squared as a measure of the goodness of fit. The closer the R squared is to 1, the more accurate the prediction will be.

Combining all the ingredients, we can now write an R function that makes the predictions for one currency:

pred.fxrate <- function(data.fxrate) {
  data.fxrate <- data.fxrate[order(data.fxrate$date),]
  N <- nrow(data.fxrate)
  L <- 300
  pred <- rep(0, N-L)
  country <- data.fxrate$country[1]
  for(i in (L+1):N){
    model <- arima(data.fxrate$rate[(i-L):i-1], c(0,1,0))
    pred[i-L] <- predict(model,1)[[1]][1]
  }
  R.sq <- 1 - sum((pred - data.fxrate$rate[(L+1):N])^2, na.rm =TRUE) /
              sum((mean(data.fxrate$rate[(L+1):N], na.rm =TRUE) - data.fxrate$rate[(L+1):N])^2, na.rm =TRUE)
  pred.df <- as.data.frame(data.fxrate$date[(L+1):N])
  pred.df$pred <- pred
  names(pred.df) <- c("date", "pred")
  plot(data.fxrate$date, data.fxrate$rate, type = "l")
  lines(pred.df$date, pred.df$pred, col="red")
}

Note that the lines that compute the R squared use the option na.rm = TRUE. This is because the data contain null values.

We can test the function on the Canadian dollar using data from 2014 to 2017. The R squared is 0.97. It seems that we have a decent model!

Parallel Prediction

As mentioned at the beginning, there are quite a few currencies and we probably do not want to loop through them. A solution is to use the "group apply" capability provided by Oracle R Enterprise (ORE). That allows us to store the data as a table in Oracle Database (in many cases, it is the original data location), then run the function we wrote above in parallel for each currency. First, we need to merge all the FX data together and change the schema as follows.
fxrates.df <- data.frame(date=character(),
                         country=character(),
                         rate=double())
date.col <- fxrates.df.raw$date
symbols <- names(fxrates.df.raw)[-1]
n.row <- length(date.col)
for(symbol in symbols){
  symbol.data <- as.data.frame(date.col)
  symbol.data$country <- rep(symbol, n.row)
  symbol.data$return <- fxrates.df.raw[,symbol]
  fxrates.df <- rbind(fxrates.df, symbol.data)
}
names(fxrates.df) <- c("date", "country", "rate")
fxrates.df <- fxrates.df[fxrates.df$date > '2014-01-01', ]
fxrates.df <- fxrates.df[order(fxrates.df$date),]

The data frame we obtained looks like:

date        country  rate
2014-01-02  DEXCAUS  1.0634
2014-01-03  DEXCAUS  1.0612
2014-01-06  DEXCAUS  1.0658
2014-01-07  DEXCAUS  1.0742
2014-01-08  DEXCAUS  1.0802
2014-01-09  DEXCAUS  1.0850

Then, we create the table in Oracle Database with ORE.

ore.drop(table="FX_RATE")   # to remove the table if it already exists
ore.create(fxrates.df, table="FX_RATE")

After the table is created, we call the ore.groupApply function on the column 'country'. That will run the function pred.fxrate on the FX rate of each currency, using at most four parallel R engines spawned by Oracle Database.

res <- ore.groupApply(FX_RATE,
                      FX_RATE$country,
                      pred.fxrate,
                      ore.connect=TRUE,
                      parallel = 4)

Another way to store the results is to create objects in the ORE R datastore. For instance, we can add the following code into the function pred.fxrate.

R.sq <- 1 - sum((pred - data.fxrate$rate[(L+1):N])^2) /
            sum((mean(data.fxrate$rate[(L+1):N]) - data.fxrate$rate[(L+1):N])^2)
name <- paste("Rsq_", country, sep="")
assign(name, R.sq)
try(ore.save(list=name, name="Rsquares", append=TRUE))

Then, after running the ore.groupApply function, we can retrieve the objects through ORE datastore functions; a short sketch is included at the end of this post. Based on the R squared, the results look decent and would be even better if we could access data about other economic fundamentals and build an ensemble model. Due to the scope of this blog, we will leave this exploration to the reader.

Invoke R scripts from the SQL side

Another scenario may require storing the results, such as R squared scores, in a structured format as a table in the database. Or we may need to store the generated images in the database. These can also be done by calling the R functions using capabilities provided by Oracle R Enterprise (ORE) on the SQL side.

Let us first look at how we store the R squared scores as a table. Suppose we want to build the model over each currency in SQL. We can first create a SQL function that has the group apply capability. Recall that we have all data stored in FX_RATE. All we need to do is create a group apply function and supply the script that builds the model.

CREATE OR REPLACE PACKAGE fxratePkg AS
  TYPE cur IS REF CURSOR RETURN FX_RATE%ROWTYPE;
END fxratePkg;

CREATE OR REPLACE FUNCTION fxrateGroupEval(
  inp_cur fxratePkg.cur,
  par_cur SYS_REFCURSOR,
  out_qry VARCHAR2,
  grp_col VARCHAR2,
  exp_txt CLOB)
RETURN SYS.AnyDataSet
PIPELINED PARALLEL_ENABLE (PARTITION inp_cur BY HASH("country"))
CLUSTER inp_cur BY ("country")
USING rqGroupEvalImpl;

This PL/SQL function performs the group apply. You can view it as a counterpart of ore.groupApply. Next, we store the script that builds the model in the database.

begin
sys.rqScriptDrop('RW_model'); -- call if the model already exists
sys.rqScriptCreate('RW_model',
  'function (data.fxrate) {
     data.fxrate <- data.fxrate[order(data.fxrate$date),]
     N <- nrow(data.fxrate)
     L <- 300
     pred <- rep(0, N-L)
     country <- data.fxrate$country[1]
     for(i in (L+1):N){
       model <- arima(data.fxrate$rate[(i-L):i-1], c(0,1,0))
       pred[i-L] <- predict(model,1)[[1]][1]
     }
     R.sq <- 1 - sum((pred - data.fxrate$rate[(L+1):N])^2, na.rm =TRUE) /
                 sum((mean(data.fxrate$rate[(L+1):N], na.rm=TRUE) - data.fxrate$rate[(L+1):N])^2, na.rm =TRUE)
     data.frame(CURRENCY=country, RSQ = R.sq)
   }');
end;

Note that in order to form a table, we need to create a data frame to hold each single result. With both the group apply SQL function and the script stored, we can now call it within SQL.

select * from table(fxrateGroupEval(
  cursor(select /*+ parallel(t, 4) */ * from FX_RATE t),
  cursor(select 1 as "ore.connect" from dual),
  'SELECT ''aaaaaaa'' CURRENCY, 1 RSQ FROM DUAL',
  'country',
  'RW_model'));

Note that "aaaaaaa" is a way to declare the format of the column, in this case a 7-character text column. Moreover, we can even store the plots generated by each FX model. We can modify the function as below.

begin
sys.rqScriptDrop('RW_model_plot');
sys.rqScriptCreate('RW_model_plot',
  'function (data.fxrate) {
     data.fxrate <- data.fxrate[order(data.fxrate$date),]
     N <- nrow(data.fxrate)
     L <- 300
     pred <- rep(0, N-L)
     country <- data.fxrate$country[1]
     for(i in (L+1):N){
       model <- arima(data.fxrate$rate[(i-L):i-1], c(0,1,0))
       pred[i-L] <- predict(model,1)[[1]][1]
     }
     pred.df <- as.data.frame(data.fxrate$date[(L+1):N])
     pred.df$pred <- pred
     names(pred.df) <- c("date", "pred")
     plot(data.fxrate$date, data.fxrate$rate, type = "l")
     lines(pred.df$date, pred.df$pred, col="red")
   }');
end;

Then, we can call the new SQL function and generate a table of images.

select * from table(fxrateGroupEval(
  cursor(select /*+ parallel(t, 4) */ * from FX_RATE t),
  cursor(select 1 as "ore.connect" from dual),
  'PNG',
  'country',
  'RW_model_plot'));

The output, viewed in SQL Developer, is as follows. Note that the image is generated and stored as a BLOB (binary large object) in the table. We can double-click the BLOB item and view the image in a pop-up window (make sure the "view as image" box is checked).

Conclusion

In this blog, we demonstrated the parallel training of multiple FX rate models using a benchmark random walk model. The implementation only takes a few lines of code, and we can see the powerful functionality provided by the in-database technology enabled by Oracle R Enterprise.
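Returning to the R datastore approach mentioned above, here is a minimal sketch of how the saved R squared objects could be retrieved after ore.groupApply completes. It assumes the "Rsquares" datastore and the paste("Rsq_", country, sep="") naming convention used inside pred.fxrate(); the specific object name shown is illustrative.

# Sketch: inspect and load the R squared values saved by pred.fxrate()
ore.datastore(name="Rsquares")          # confirm the datastore exists
ore.datastoreSummary(name="Rsquares")   # list the saved Rsq_* objects
ore.load(name="Rsquares")               # load them into the current R session
Rsq_DEXCAUS                             # e.g., the R squared for the Canadian dollar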

In a variety of machine learning applications, there are often requirements for training multiple models. For example, in the internet of things (IoT) industry, a unique model needs to be built for...

Best Practices

Migrating R models from Development to Production

Users of Oracle R Enterprise (ORE) embedded R execution will often calibrate R models in a development environment and promote the final models to a production database. In most cases, the development and production databases are distinct, and model serialization between databases is not effective if the underlying tables are not identical.  To facilitate the migration process, ORE includes scripts to transport the ORE system schema, RQSYS, and ORE objects such as tables, scripts, and models from one database to another. Migration ScriptsThe ORE migration utility scripts and documentation reside in $ORACLE_HOME/R/migration after the ORE server component is installed. Navigate to the server directory and change to the migration subdirectory: /oreserver_install_dir/server/migrationThe migration subdirectory contains a README and the following subdirectories: exp: Contains a script to migrate RQSYS and all ORE user data to a dump file.imp: Contains a script for importing ORE user data from the dump file created by the script in exp.oreuser: Contains scripts for exporting and importing data for a specific ORE user.Instructions for running the migration scripts are provided in the README.  Note that the current version of the migration scripts require that the source and target environments contain the same versions of Oracle Database and Oracle R Enterprise.Migration ExampleHere's an example that migrates models from a single ORE 1.5.0 schema in a local Oracle 12.1.0.2 database to an ORE 1.5.0 schema in a remote Oracle 12.1.0.2 database.  In this case, the databases reside on different servers.  Create model and predictions:R> ore.create(iris, "IRIS")R> mod <- ore.randomForest(Species~., IRIS)R> modCall: ore.randomForest(formula = Species ~ ., data = IRIS)               Type of random forest: classification                     Number of trees: 500                    Number of groups:  1 No. of variables tried at each split: 2R> pred <- predict(mod, IRIS, type="all", supplemental.cols="Species")R> head(pred)  setosa versicolor virginica prediction Species1  1.000      0.000         0     setosa  setosa2  0.998      0.002         0     setosa  setosa3  1.000      0.000         0     setosa  setosa4  1.000      0.000         0     setosa  setosa5  1.000      0.000         0     setosa  setosa6  1.000      0.000         0     setosa  setosaSave the models to a datastore named myModels:R> ore.save(mod, pred, name = "myModels")Run the migration utility.  Refer to $ORACLE_HOME/R/migration/oreuser/exp/README for syntax details.  In this case, I'm using the Big Data Lite VMwith schema moviedemo and instance orcl.  I'm exporting all ORE data including the model mod and predictions pred to a dump file in /tmp/moviedemodsi.Predictions are included to illustrate capability, however, scoring would likely occur on new data in the production environment.$ cd $ORACLE_HOME/R/migration/oreuser/exp$ perl -I$ORACLE_HOME/R/migration/perl $ORACLE_HOME/bin/ore_dsiexport.pl orcl /tmp/moviedemodsimoviedemodsi.zip MOVIEDEMOChecking connection to orcl......... Enter db user system password:welcome1Connect to db connect_str .....   
PassChecking ORE version .........Pass****Step1 Setup before export ******/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/sqlplus -L -S system/welcome1@"orcl"@/u01/app/oracle/product/12.1.0.2/dbhome_1/R/migration/oreuser/exp/setup.sql system welcome1 orcl/tmp/moviedemodsi MOVIEDEMO/u01/app/oracle/product/12.1.0.2/dbhome_1/R/migration/oreuser/exp/storedproc.sql >/tmp/moviedemodsi/tmpstep1.log******Step2 export schema with data store ********Step3 cleanup ****Export completedump files are in /tmp/moviedemodsi*****Step4 creating zip file moviedemodsi.zip ***  adding: MOVIEDEMO_rqds.dmp (deflated 90%)  adding: MOVIEDEMO_rqdsob.dmp (deflated 89%)  adding: MOVIEDEMO_rqrefdbobj.dmp (deflated 90%)  adding: MOVIEDEMO_rqdsref.dmp (deflated 91%)  adding: MOVIEDEMO_rqdsaccess.dmp (deflated 90%)  adding: MOVIEDEMO_src.dmp (deflated 70%)  adding: MOVIEDEMO_srcdsi.dmp (deflated 91%)  adding: exp_dsi_user.sh (deflated 59%)  adding: imp_dsi_user.sh (deflated 78%)Created moviedemodsi.zip. Use this for file importing ORE data from MOVIEDEMO into target dbIn my target database, I ran the import script, which imports the dumped ORE user data.  As with the export, the script is run as system user, and you will be prompted for the password:$ perl -I$ORACLE_HOME/R/migration/perl ore_dsiimport.pl orcl /home/oracleChecking connection to orcl......... Enter db user system password:welcome1Connect to db connect_str .....   PassChecking ORE version .........Pass****Step2 import of schema with datastoreinventory ********Step3 staging datastore metadata **********************************IMPORTANT **************************Check dstorestg.log for errors. Then run the fllowing scripts to complete the import ********* Run as sysdba : 1. rqdatastoremig.sql <rquser> ******************************************************************Then, after running rqdatastoremig.sql, I can log into my target MOVIEDEMO schema and load the models:> ore.load("myModels")[1] "mod"  "pred"> modCall: ore.randomForest(formula = Species ~ ., data = IRIS)               Type of random forest: classification                     Number of trees: 500                    Number of groups:  1 No. of variables tried at each split: 2> head(pred)    setosa versicolor virginica prediction    Species1    1.000      0.000     0.000     setosa     setosa2    1.000      0.000     0.000     setosa     setosa3    1.000      0.000     0.000     setosa     setosa4    1.000      0.000     0.000     setosa     setosa5    1.000      0.000     0.000     setosa     setosa6    1.000      0.000     0.000     setosa     setosa ORE Migration Utility BenefitsThe ORE migration utility enables data scientists to test and deploy models quickly, reducing the feedback loop time required for retraining and fine tuning models. ORE users don't have to worry about the risk of error in rewriting models because the models are deployed in a language they were trained and tested in. In addition, if the tables in the model training environment are identical to the production, you'll cut out another enormous chunk of testing time.
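After the import finishes, it can be useful to confirm in the target schema that the datastore arrived intact before switching applications over. A minimal sketch, assuming the "myModels" datastore from the example above and that the IRIS table was also recreated in the target database:

# Sanity checks in the target schema after the import completes
ore.datastore()                          # the "myModels" datastore should be listed
ore.datastoreSummary(name = "myModels")  # should show the objects mod and pred
ore.load("myModels")                     # brings mod and pred into the R session
# Optional: re-score to confirm the model behaves as in the source database
head(predict(mod, IRIS, type = "all", supplemental.cols = "Species"))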

Users of Oracle R Enterprise (ORE) embedded R execution will often calibrate R models in a development environment and promote the final models to a production database. In most cases, the development...

Best Practices

Key Capabilities for Big Data Analytics using R

There are several capabilities that data scientists benefit from when performing Big Data advanced analytics and machine learning with R. These revolve around efficient data access and manipulation, access to parallel and distributed machine learning algorithms, data and task parallel execution, and ability to deploy results quickly and easily. Data scientists using R want to leverage the R ecosystem as much as possible, whether leveraging the expansive set of open source R packages in their solutions, or leveraging their R scripts directly in production to avoid costly recoding or custom application integration solutions. Data Access and Manipulation R generally loads and processes small to medium-sized data sets with sufficient performance. However, as data volumes increase, moving data into a separate analytics engine becomes a non-starter. Moving large volume data across a network takes a non-trivial amount of time, but even if the user is willing/able to wait, client machines often have insufficient memory either for the data itself, or for desired R processing. This makes processing such data intractable if not impossible. Moreover, R functions are normally single threaded and do not benefit from multiple CPUs for parallel processing. Systems that enable transparent access to and manipulation of data from R, have the benefit of short-circuiting costly data movement and client memory requirements, while allowing data scientists to leverage the R language constructs and functions. In the case of Oracle R Enterprise (ORE) and Oracle R Advanced Analytics for Hadoop (ORAAH), R users work with proxy objects for Oracle Database and HIVE tables. R functions that normally operate on data.frame and other objects are overloaded to work with ore.frame proxy objects. In ORE, R function invocations are translated to Oracle SQL for execution in Oracle Database. In ORAAH, these are translated to HiveQL for execution by Hive map-reduce jobs. By leveraging Oracle Database, ORE enables scalable, high performance execution of functions for data filtering, summary statistics, and transformations, among others. Since data is not moved into R memory, there are two key benefits: no latency for moving data and no client-side memory limitations. Moreover, since the R functionality is translated to SQL, users benefit from Oracle Database table indexes, data partitioning, and query optimization, in addition to executing on a likely more powerful and memory-rich machine. Using Hive QL, ORAAH provides scalability implicitly using map-reduce, accessing data directly from Hive. Data access and manipulation is further expanded through the use of Oracle Big Data SQL, where users can reference Hadoop data, e.g., as stored on Oracle Big Data Appliance, as though they were database tables. Those tables are also mapped to ore.frame objects and can be used in ORE functions. Big Data SQL transparently moves query processing to the most effective platform, which minimizes data movement and maximizes performance.   Parallel Machine Learning Algorithms Machine learning algorithms as found in R, while rich in variety and quality, typically do not leverage multi-threading or parallelism. Aside from specialized packages, scaling to bigger data becomes problematic for R users, both for execution time as well as the need to load the full data set into memory with enough memory left over for computation. 
For Oracle Database, custom parallel distributed algorithms are integrated with the Oracle Database kernel and ORE infrastructure. These enable building models and scoring on "big data" by being able to leverage machines with 100s of processors and terabytes of memory. With ORAAH, custom parallel distributed algorithms are also provided, but leverage Apache Spark and Hadoop. To further expand the set of algorithms, ORAAH exposes Apache Spark MLlib algorithms using R's familiar formula specification for building models and integration with ORAAH's model matrix and scoring functionality. The performance benefits are significant, for example: compared to R's randomForest and lm, ORE's random forest is 20x faster using 40 degrees of parallelism (DOP), and ORE's lm is 110x faster with 64 DOP. ORAAH's glm can be 4x – 15x faster than MLlib's Spark-based algorithms depending on how memory is constrained (48 GB down to 8 GB). This high performance in the face of reduced memory requirements means that more users can share the same hardware resources for concurrent model building.

Data and Task Parallel Execution

Aside from parallel machine learning algorithms, users want to easily specify data-parallel and task-parallel execution. Data-parallel behavior is often referred to as "embarrassingly parallel" since it's extremely easy to achieve – partition the data and invoke a user-defined R function on each partition of data in parallel, then collect the results. Task-parallel behavior takes a user-defined R function and executes it n times, with an index passed to the function corresponding to the thread being executed (1..n). This index facilitates setting random seeds for Monte Carlo simulations or selecting behavior to execute within the R function. ORE's embedded R execution provides specialized operations that support both data-parallel (ore.groupApply, ore.rowApply) and task-parallel (ore.indexApply) execution where users can specify the degree of parallelism, i.e., the number of parallel R engines desired. ORAAH enables specifying map-reduce jobs where the mapper and reducer are specified as R functions and can readily support the data-parallel and task-parallel behavior. With both ORE and ORAAH, users can leverage CRAN packages in their user-defined R functions.

Does the performance of CRAN packages improve as well? In general, there is no automated way to parallelize an arbitrary algorithm, e.g., have an arbitrary CRAN package's functions become multi-threaded and/or execute across multiple servers. Algorithm designers often need to decompose a problem into chunks that can be performed in parallel and then integrate those results, sometimes in an iterative fashion. However, since the R functions are executed at the database server, which is likely a more powerful machine, performance may be significantly improved, especially for data loaded at inter-process communication speeds as opposed to Ethernet. There are some high performance libraries that provide primitive or building block functionality, e.g., matrix operations and algorithms like FFT or SVD, that can transparently boost the performance of those operations. Such libraries include Intel's Math Kernel Library (MKL), which is included with ORE and ORAAH for use with Oracle R Distribution – Oracle's redistribution of open source R.
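As a concrete illustration of the data-parallel and task-parallel operations just described, here is a minimal sketch. The table name ONTIME_S and its columns are assumptions used only for illustration; any ore.frame with a numeric column and a grouping column would do.

# Data-parallel: one R engine per DEST partition, each fitting a small lm
# (assumed ore.frame ONTIME_S with columns DEST, ARRDELAY, DEPDELAY)
res1 <- ore.groupApply(ONTIME_S[, c("DEST", "ARRDELAY", "DEPDELAY")],
                       ONTIME_S$DEST,
                       function(dat) lm(ARRDELAY ~ DEPDELAY, data = dat),
                       parallel = 4)

# Task-parallel: run the same simulation n times, using the index for the seed
res2 <- ore.indexApply(10,
                       function(i) {
                         set.seed(i)        # index drives the Monte Carlo seed
                         mean(rnorm(1e5))
                       },
                       parallel = 4)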
Production Deployment

Aside from the subjective decision about whether a given data science solution should be put in production, the next biggest hurdle includes the technical obstacles for putting an R-based solution into production and having those results available to production applications. For many enterprises, one approach has been to recode predictive models using C, SQL, or Java so they can be more readily used by applications. However, this takes time, is error prone, and requires rigorous testing. All too often, models become stale while awaiting deployment. Alternatively, some enterprises will hand-craft the "plumbing" required to spawn an R engine (or engines), load data, execute R scripts, and pass results to applications. This can involve reinventing complex infrastructure for each project, while introducing undesirable complexity and failure conditions.

ORE provides embedded R execution, which allows users to store their R scripts as functions in the Oracle Database R Script Repository and then invoke those functions by name, either from R or SQL. The SQL invocation facilitates production deployment. User-defined R functions that return a data.frame can have that result returned from SQL as a database table. Similarly, user-defined R functions that return images can have those images returned, one row per image, as a database table with a BLOB column containing the PNG images. Since most enterprise applications use SQL already, invoking user-defined R functions directly and getting back values becomes straightforward and natural. Embedded R execution also enables the use of job scheduling via the Oracle Database DBMS_SCHEDULER functionality. For ORAAH, user-defined R functions that invoke ORAAH functionality can also be stored in the Oracle Database R Script Repository for execution by name from R or SQL, also taking advantage of database job scheduling.

These capabilities – efficient data access and manipulation, access to parallel and distributed machine learning algorithms, data and task parallel execution, and ability to deploy results quickly and easily – enable data scientists to perform Big Data advanced analytics and machine learning with R. Oracle R Enterprise is a component of the Oracle Advanced Analytics option to Oracle Database. Oracle R Advanced Analytics for Hadoop is a component of the Oracle Big Data Connectors software suite for use on Cloudera and Hortonworks, and both Oracle Big Data Appliance and non-Oracle clusters.
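As a small illustration of the embedded R execution deployment path described above, the sketch below stores a user-defined function in the R Script Repository from R and invokes it by name. The script name "computeStats" and the use of the iris data are illustrative assumptions, not a prescribed workflow.

# Store a user-defined R function in the database R Script Repository and invoke it by name
ore.scriptCreate("computeStats",
                 function(dat) {
                   # return a data.frame so a SQL caller could receive it as a table
                   data.frame(n = nrow(dat), mean.sl = mean(dat$Sepal.Length))
                 })

ore.create(iris, table = "IRIS")                      # ore.frame proxy for the data
res <- ore.tableApply(IRIS, FUN.NAME = "computeStats")
ore.pull(res)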

There are several capabilities that data scientists benefit from when performing Big Data advanced analytics and machine learning with R. These revolve around efficient data access and manipulation,...

Best Practices

Early detection of process anomalies with SPRT

Developed by Abraham Wald more than a half century ago, the Sequential Probability Ratio Test (SPRT) is a statistical technique for binary hypothesis testing (helping to decide between two hypotheses H0 and H1) and is extensively used for system monitoring and early annunciation of signal drifting. SPRT is very popular for quality control and equipment surveillance applications, in industries and areas requiring highly sensitive, reliable and especially fast detection of degradation behavior and/or sensor malfunctions. Examples of SPRT-based applications include process anomaly detection for nuclear power plants, early fault prediction for wind turbines, anomaly surveillance for key satellite components in the aerospace industry, pro-active fault monitoring for enterprise servers, quality control studies in manufacturing, drug and vaccine safety surveillance, construction and administration of computerized adaptive tests (CAT), and many others.

Conventional techniques for signal monitoring rely on simple tests, based, for example, on control chart schemes with thresholds, mean values, etc., and are sensitive only to spikes exceeding some limits or to abrupt changes in the process mean. They generally trigger alarms just before failures or only after the process has drifted significantly. Tighter thresholds lead to high numbers of false alarms and relaxed thresholds result in high numbers of missed alarms. Moreover, these techniques can perform very poorly in the presence of noise. The popularity of SPRT for surveillance applications derives from the following characteristics:

SPRT detects signal discrepancies at the earliest mathematically possible time after the onset of system disturbances and/or sensor degradations. This property results from the fact that SPRT examines the statistical qualities of the monitored signals and catches nuances induced by disturbances well in advance of any measurable changes in the mean values of the signals.

SPRT examines successive observations of a discrete process and is very cheap to compute. Thus, it is perfectly adapted for real time analysis with streaming data flows.

SPRT offers control over false-alarm and missed-alarm rates via user-defined parameters.

The mathematical expressions for SPRT were derived under two assumptions: (1) the samples follow an a-priori known distribution function and (2) the samples are independent and identically distributed (i.i.d.). Let us consider a sequence of values {Yn} = y0, y1 ... yn resulting from a stationary process satisfying the previous assumptions. Let us further consider the case where the signal data obeys a Gaussian distribution with mean µ0 and variance σ0². The normal signal behavior is referred to as the null hypothesis, H0. Alternative hypotheses express abnormal behavior. For example, one can formulate alternate hypotheses H1, H2, H3 and H4 to characterize signals with a larger or smaller mean or variance.
More explicitly, the null and alternate hypotheses could be written as:

H0 : mean µ0, variance σ0²
H1 : mean µ1 > µ0, variance σ0²
H2 : mean µ2 < µ0, variance σ0²
H3 : mean µ0, variance σ3² = V σ0²
H4 : mean µ0, variance σ4² = (1/V) σ0²

The likelihood ratio (LR) is the ratio of probabilities for observing the sequence {Yn} under an alternate hypothesis Hi versus the null hypothesis H0:

LR(n) = P({Yn} | Hi) / P({Yn} | H0)

At each step of the {Yn} sequence, SPRT calculates a test index as the natural log of LR, referred to as LLR, and compares it to two stopping boundaries: if LLR is greater than or equal to the upper boundary log(B) then H0 is rejected and Hi is accepted; if LLR is less than or equal to the lower boundary log(A) then Hi is rejected and H0 is accepted. For as long as LLR remains between these two boundaries there is not enough evidence to reach a conclusion; the sampling of yn continues and LLR(n) is updated for a new comparison. The decision boundaries are derived from the false and missed alarm probabilities via the standard Wald relations:

A = β / (1 - α),  B = (1 - β) / α

with α = probability of accepting Hi when H0 is true (false alarm probability) and β = probability of accepting H0 when Hi is true (missed alarm probability). After a decision is reached (Hi or H0 is accepted), LLR is reset to zero, the sampling y1, y2, … is re-started and a new sequence of calculations and comparisons is performed to validate Hi or H0 anew.

The power of SPRT comes from the simplicity of the LLR(n) expression for some common distributions. For the normal distribution we are considering and under the i.i.d. assumption mentioned above, P({Yn} | H0) and P({Yn} | H1) reduce to products of Gaussian densities, leading to the following expression for LLR (testing H1 against H0):

LLR(n) = [(µ1 - µ0)/σ0²] * Σ_k ( yk - (µ0 + µ1)/2 )

For each new sample yn the update of this expression is trivial, with the cost of one multiplication and two additions:

LLR(n) = LLR(n-1) + [(µ1 - µ0)/σ0²] * ( yn - (µ0 + µ1)/2 )

Similarly, the LLR(n) expression for testing H3 (with σ3² = V σ0²) against H0 is:

LLR(n) = Σ_k [ -(1/2) log V + (1 - 1/V) (yk - µ0)² / (2 σ0²) ]

leading to an equally fast and cheap LLR update as the sampling of {Yn} continues.

Change of mean

The script sprt.R illustrates a simple implementation of these formulas and the SPRT decision process. A random sequence with 1000 points (t=1,…,1000) is generated from a normal distribution with µ = 0 and σ = 1. The signal is modified between t=101 and t=200 with samples drawn from a normal distribution with µ = 1 and σ = 1.

R> vals <- rnorm(1000, mean=0, sd=1)
R> vals[101:200] <- rnorm(100, mean=1, sd=1)

The resulting signal is captured in the plot below, for t=1,…,200. The SPRT LLR equations for a change of mean are used with the following null and alternate hypotheses:

H0 : µ0 = 0 and σ0 = 1
H1 : µ1 = 0.8 and σ1 = 1

The value µ1 = 0.8 is arbitrary, as we are not supposed to know whether the signal mean shifted and by how much. The mysprt(..) function (see the script) is used to calculate LLR(n), perform the comparisons, decide whether to accept the null or alternate hypothesis or keep sampling and updating LLR, and restart the LLR sequence from zero after a decision is taken. The false alarm and missed alarm parameters α and β were both set to 0.05. The plot_sprt() function generates several plots as shown below.

The first plot illustrates LLR(t) for t=1,…,200. Starting on the left, one can see that LLR(1) is in the undecided region log(A) < LLR < log(B). At t=12 LLR falls under log(A), confirming H0. LLR is reset to zero and a new LLR(n) sequence is calculated until, at t=23, H0 is confirmed again. These sequences continue several times until t=101, when the mean shift kicks in. Very quickly, at t=105, LLR rises above log(B) and H1 is confirmed.
LLR is reset and at t=119 H1 is confirmed again. H1 continues to be confirmed over several sequences until we reach t=200, where mysprt() was stopped (arbitrarily; in a real application it would continue running). Over the t=1,…,200 interval, the application of SPRT generated 26 LLR (restart to H1/H0 decision) sequences. The length of a sequence corresponds to the number of samples/time needed to decide in favor of H1 or H0. The next two plots show the length of each successive sequence and the distribution (histogram) of the sequence lengths. The decision time is generally short; for some sequences only 3 samples are necessary to validate H0 or H1, for some others up to 15 samples are required.

The next plot illustrates LLR without restart, that is, an LLR updated while disregarding the log(A) and log(B) boundaries, without stopping and without decisions. We see that as long as the samples come from the un-shifted signal (t=1,…,100), the non-restarted LLR continues to drop. Once the signal changes, LLR reverses the trend and keeps climbing. A posteriori, it is very clear where the signal change occurred: at t~100.

Change of variance

In this section we discuss the SPRT results when detecting a change in variance. For this purpose the original signal is modified between t=501 and t=600 with samples drawn, this time, from a normal distribution with µ = 0 and σ = 1.4.

R> vals[501:600] <- rnorm(100, mean=0, sd=1.4)

The plot below represents the signal for t=401,…,600. We examined this subset of the signal with SPRT by calculating LLR with the following null and alternate hypotheses:

H0 : µ0 = 0 and σ0 = 1
H3 : µ3 = 0 and σ3 = 1.5

α and β were again set to 0.05. Note that we are constructing an alternate hypothesis with a σ3 value higher than the actual σ of the modified signal. The LLR sequences are illustrated below, together with the lengths of each sequence and the sequence length distribution. One can see that the LLR sequences are longer, and quite a few require on the order of 30-50 samples to decide in favor of H1 or H0. The variance-shift detection appears, for this case, to be a tougher problem. But the comparison with the mean-shift case is not apples to apples, and the {µ0, σ0, µi, σi, α, β} parameter space needs to be properly explored before drawing conclusions.

Practical aspects

For real world processes, signal shifts occur in multiple ways, and SPRT should be applied for both mean and variance and for shifts in both directions (positive and negative). SPRT can also be run for multiple shift values and multiple false and missed alarm probabilities. The efficient monitoring of a large number of signals with multiple SPRT tests per signal requires task parallelism capabilities and flexible management of the various types of alarms which can be generated. Often, SPRT is applied not to the raw signals but to the residuals between signals and signal predictions. The 'normal-operation' signal behavior is learned via specific techniques during a generally off-line training stage, after which the system is put in surveillance mode where departure from 'normality' is caught, on-line, by SPRT at the earliest stage of process shifts. Mathematical expressions for LLR can also be derived for distributions other than Gaussian, but the analytical complexity can escalate quickly. Some of these topics, and especially how to put this technique into production using, for example, ORE's Embedded R Execution mechanism, will be expanded on in future blogs.
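To make the mean-shift test concrete, here is a minimal R sketch of a Wald SPRT for a shift in mean, using the standard boundaries log(A) = log(β/(1-α)) and log(B) = log((1-β)/α). It is not the blog's sprt.R script (mysprt and plot_sprt are not reproduced here), just an illustration of the LLR update and restart logic under the stated assumptions.

# Minimal SPRT sketch for a mean shift (H0: mean mu0 vs H1: mean mu1, common sd sigma0)
sprt.mean <- function(y, mu0 = 0, mu1 = 0.8, sigma0 = 1, alpha = 0.05, beta = 0.05) {
  logA <- log(beta / (1 - alpha))   # lower boundary: accept H0
  logB <- log((1 - beta) / alpha)   # upper boundary: accept H1
  k    <- (mu1 - mu0) / sigma0^2    # precomputed constant of the LLR update
  llr  <- 0
  decision <- rep(NA_character_, length(y))
  for (t in seq_along(y)) {
    llr <- llr + k * (y[t] - (mu0 + mu1) / 2)   # one multiplication, two additions
    if (llr >= logB)      { decision[t] <- "H1"; llr <- 0 }
    else if (llr <= logA) { decision[t] <- "H0"; llr <- 0 }
  }
  decision
}

# Example on a simulated signal like the one above
set.seed(1)
vals <- rnorm(1000, mean = 0, sd = 1)
vals[101:200] <- rnorm(100, mean = 1, sd = 1)
table(sprt.mean(vals[1:200]))   # counts of H0/H1 decisions over t = 1..200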

Developed by Abraham Wald more than a half century ago, the Sequential Probability Ratio Test (SPRT) is a statistical technique for binary hypothesis testing (helping to decide between two hypothesis H...

Best Practices

Predicting Energy Demand using IoT

The Internet of Things (IoT) presents new opportunities for applying advanced analytics. Sensors are everywhere collecting data – on airplanes, trains, and cars, in semiconductor production machinery and the Large Hadron Collider, and even in our homes. One such sensor is the home energy smart meter, which can report household energy consumption every 15 minutes. This data enables energy companies to not only model each customer's energy consumption patterns, but also to forecast individual usage. Across all customers, energy companies can compute aggregate demand, which enables more efficient deployment of personnel, redirection or purchase of energy, etc., often a few days or weeks out.

Building one predictive model per customer, when an energy company can have millions of customers, poses some interesting challenges. Consider an energy company with 1 million customers. Over the course of a single year, these smart meters will collect over 35 billion readings. Each customer, however, generates only about 35,000 readings. On most hardware, R can easily build a model on 35,000 readings. Note that if each model requires even only 10 seconds to build, doing this serially will require roughly 116 days to build all models. Since the results are needed a few days or weeks out, a delay of months makes this project a non-starter. If powerful hardware, such as Oracle Exadata, can be leveraged to compute these models in parallel, say with a degree of parallelism of 128, all models can be computed in less than one day.

While users can leverage parallelism enabled by various R packages, there are several factors that need to be taken into account. For example, what happens if certain models fail? Will the models be stored as 1 million separate flat files – one per customer? For flat files, how will backup, recovery, and security be handled? How can these models be used for forecasting customer usage and where will the forecasts be stored? How can these R models be incorporated into a production environment where applications and dashboards normally work with SQL?

Using the Embedded R Execution capability of Oracle R Enterprise, Data Scientists can focus on the task of building a model for a single customer. This model-building function is stored in the R Script Repository in Oracle Database. ORE enables invoking this script from a single function, i.e., ore.groupApply, relying on the database to spawn multiple R engines, load one partition of data from the database into the function produced by the Data Scientist, and then store the resulting model immediately in the R Datastore, again in Oracle Database. This greatly simplifies the process of computing and storing models. Moreover, standard database backup and recovery mechanisms already in place can be used to avoid having to devise separate special practices. Forecasting using these models is handled in an analogous way.

To put these R scripts into production, users can invoke the same R scripts produced by the Data Scientist from SQL, both for the model building and forecasting. The forecasts can be immediately available as a database table that can be read by applications and dashboards, or used in other SQL queries. In addition, the SQL statements that invoke the R functions can be scheduled for periodic execution using the DBMS_SCHEDULER package of Oracle Database. Leveraging the built-in functionality of ORE, Data Scientists, application developers, and administrators do not have to reinvent complex code and testing strategies, often done for each new project.
Instead, they benefit from Oracle's integration of R with Oracle Database - to easily design and implement R-based solutions for use with applications and dashboards, and scale to the enterprise.
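The per-customer pattern described above can be sketched in a few lines of R. The code below is a hypothetical illustration, not the production solution from this post: the ore.frame SMART_METER and its columns CUST_ID, READ_TIME and KWH, the datastore name, and the simple per-customer model are all assumptions.

# Hypothetical sketch: one small model per customer, built in parallel in the database
build.customer.model <- function(dat) {
  dat  <- dat[order(dat$READ_TIME), ]
  fit  <- arima(dat$KWH, order = c(1, 0, 0))          # simple per-customer model
  name <- paste("mdl_", dat$CUST_ID[1], sep = "")
  assign(name, fit)
  try(ore.save(list = name, name = "meterModels", append = TRUE))  # store in the R Datastore
  TRUE
}

res <- ore.groupApply(SMART_METER,
                      SMART_METER$CUST_ID,   # one partition per customer
                      build.customer.model,
                      ore.connect = TRUE,    # needed for ore.save inside the function
                      parallel = 128)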

The Internet of Things (IoT) presents new opportunities for applying advanced analytics. Sensors are everywhere collecting data – on airplanes, trains, and cars, in semiconductor production machinery...

Real-time model scoring for streaming data - a prototype based on Oracle Stream Explorer and Oracle R Enterprise

Whether applied to manufacturing, financial services, energy, transportation, retail, government, security or other domains, real-time analytics is an umbrella term which covers a broad spectrum of capabilities (data integration, analytics, business intelligence) built on streaming input from multiple channels. Examples of such channels are: sensor data, log data, market data, click streams, social media and monitoring imagery.Key metrics separating real-time analytics from more traditional, batch, off-line analytics are latency and availability. At one end of the analytics spectrum are complex, long running batch analyses with slow response time and low availability requirements. At the other end are real-time, lightweight analytic applications with fast response time (O[ms]) and high availability (99.99..%). Another distinction is between the capability for responding to individual events and/or ordered sequences of events versus the capability for handling only event collections in micro batches without preservation of their ordered characteristics. The complexity of the analysis performed on the real-time data is also a big differentiator: capabilities range from simple filtering and aggregations to complex predictive procedures. The level of integration between the model generation and the model scoring functionalities needs also to be considered for real-time applications. Machine learning algorithms specially designed for online model building exist and are offered by some streaming data platforms but their number is small. Practical solutions could be built by combining an off-line model generation platform with a data streaming platform augmented with scoring capabilities. In this blog we describe a new prototype for real time analytics integrating two components : Oracle Stream Explorer (OSX) and Oracle R Enterprise (ORE). Examples of target applications for this type of integration are: equipment monitoring through sensors, anomaly detection and failure prediction for large systems made of a high number of components.The basic architecture is illustrated below: ORE is used for model building, in batch mode, at low frequency, and OSX handles the high frequency streams and pushes data toward a scoring application, performs predictions in real time and returns results to consumer applications connected to the output channels. OSX is a middleware platform for developing streaming data applications. These applications monitor and process large amounts of streaming data in real time, from a multitude of sources like sensors, social media, financial feeds, etc. Readers unfamiliar with OSX should visit Getting Started with Event Processing for OSX. In OSX, streaming data flows into, through, and out of an application. The applications can be created, configured and deployed with pre-built components provided with the platform or built from customized adapters and event beans. The application in this case is a custom scoring application for real time data.  A thorough description of the application building process can be found in the following guide: Developing Applications for Event Processing with Oracle Stream Explorer.In our solution prototype for streaming analytics, the model exchange between ORE and OSX is realized by converting the R models to a PMML representation. 
After that, JPMML - the Java Evaluator API for PMML - is leveraged for reading the model and building a custom OSX scoring application. The end-to-end workflow is represented below, and the subsequent sections of this blog summarize the essential aspects.

Model Generation

As previously stated, the use cases targeted by this OSX-ORE integration prototype consist of systems made of a large number of different components. Each component type is abstracted by a different model. We leverage ORE's Embedded R Execution capability for data and task parallelism to generate a large number of models concurrently. This is accomplished, for example, with ore.groupApply():

res <- ore.groupApply(
  X=...
  INDEX=...
  function(dat,frml) {mdl<-...},
  ...,
  parallel=np)

Model representation in PMML

The model transfer between the model generator and the scoring engine is enabled by conversion to a PMML representation. PMML is an XML-based, mature standard for model exchange. A model in PMML format is represented by a collection of XML elements, or PMML components, which completely describe the modeling flow. For example, the Data Dictionary component contains the definitions for all fields used by the model (attribute types, value ranges, etc.), the Data Transformations component describes the mapping functions between the raw data and its desired form for the modeling algorithms, and the Mining Schema component assigns the active and target variables and enumerates the policies for missing data, outliers, and so on. Besides the specifications for the data mining algorithms together with the accompanying pre- and post-processing steps, PMML can also describe more complex modeling concepts like model composition, model hierarchies, model verification and fields scoping - to find out more about PMML's structure and functionality go to General Structure. PMML representations have been standardized for several classes of data mining algorithms; details are available at the same location.

PMML in R

In R, the conversion/translation to PMML format is enabled through the pmml package. The following algorithms are supported:

ada (ada)
arules (arules)
coxph (survival)
glm (stats)
glmnet (glmnet)
hclust (stats)
kmeans (stats)
ksvm (kernlab)
lm (stats)
multinom (nnet)
naiveBayes (e1071)
nnet (nnet)
randomForest (randomForest)
rfsrc (randomForestSRC)
rpart (rpart)
svm (e1071)

The r2pmml package offers complementary support for gbm (gbm) and train (caret), and a much better (performance-wise) conversion to PMML for randomForest. Check the details at converting_randomforest. The conversion to PMML is done via the pmml() generic function, which dispatches the appropriate method for the supplied model depending on its class.

library(pmml)
mdl <- randomForest(...)
pmmdl <- pmml(mdl)

Exporting the PMML model

In the current prototype, the PMML model is exported to the streaming platform as a physical XML file. A better solution is to leverage R's serialization interface, which supports a rich set of connections through pipes, URLs, sockets, etc. The PMML objects can also be saved in ORE datastores within the database, and specific policies can be implemented to control access and usage.

write(toString(pmmdl), file="..")
serialize(pmmdl, connection)
ore.save(pmmdl, name=dsname, grantable=TRUE)
ore.grant(name=dsname, type="datastore", user=...)
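To make the conversion step concrete, here is a minimal, self-contained sketch using one of the supported algorithms. The iris data set and the output file name are illustrative assumptions, not part of the prototype.

# Illustrative sketch: build a supported model, convert it to PMML and write the XML
# file that a JPMML-based scorer could load
library(randomForest)
library(pmml)

mdl   <- randomForest(Species ~ ., data = iris, ntree = 100)
pmmdl <- pmml(mdl)
write(toString(pmmdl), file = "iris_rf_pmml.xml")

# Reference predictions in R, e.g. to compare against the JPMML scoring output
head(predict(mdl, newdata = iris))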
OSX Applications and the Event Processing Network (EPN)

The OSX workflow, implemented as an OSX application, consists of three logical steps: the PMML model is imported into OSX, a scoring application is created, and scoring is performed on the input streams. In OSX, applications are modeled as data flow graphs named Event Processing Networks (EPN). Data flows into, through, and out of EPNs. When raw data flows into an OSX application it is first converted into events. Events flow through the different stages of the application where they are processed according to the specifics of the application. At the end, events are converted back to data in a format suitable for consumption by downstream applications.

The EPN for our prototype is basic: streaming data flows from the Input Adapters through Input Channels, reaches the Scoring Processor where the prediction is performed, flows through the Output Channel to an Output Adapter and exits the application in the desired form. In our demo application the data is streamed out of a CSV file into the Input Adapter. The top adapters (left and right) on the EPN diagram represent connections to the Stream Explorer User Interface (UI). Their purpose is to demonstrate options for controlling the scoring process (for example, changing the model while the application is still running) and visualizing the predictions.

The JPMML-Evaluator

The Scoring Processor was implemented by leveraging the open source JPMML library, the Java Evaluator API for PMML. The methods of this API allow, among other things, pre-processing the active and target fields according to the DataDictionary and MiningSchema elements, evaluating the model for several classes of algorithms, and post-processing the results according to the Targets element. JPMML offers support for:

Association Rules
Cluster Models
Regression
General Regression
k-Nearest Neighbors
Naïve Bayes
Neural Networks
Tree Models
Support Vector Machines
Ensemble Models

which covers most of the models that can be converted to PMML from R using the pmml() method, except for time series, sequence rules and text models.

The Scoring Processor

The Scoring Processor (see EPN) is implemented as a Java class with methods that automate scoring based on the PMML model. The important steps of this automation are enumerated below.

The PMML schema is loaded from the XML document:

    pmml = pmmlUtil.loadModel(pmmlFileName);

An instance of the Model Evaluator is created. In the example below we assume that we don't know what type of model we are dealing with, so the instantiation is delegated to an instance of a ModelEvaluatorFactory class.

    ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
    ModelEvaluator<?> evaluator = modelEvaluatorFactory.newModelManager(pmml);

This Model Evaluator instance is queried for the field definitions. For the active fields:

    List<FieldName> activeModelFields = evaluator.getActiveFields();

The subsequent data preparation performs several tasks: value conversions between the Java type system and the PMML type system, validation of these values according to the specifications in the Data Field element, and handling of invalid values, missing values and outliers as per the Mining Field element.
    FieldValue activeValue = evaluator.prepare(activeField, inputValue);
    pmmlArguments.put(activeField, activeValue);

The prediction is executed next:

    Map<FieldName, ?> results = evaluator.evaluate(pmmlArguments);

Once this is done, the mapping between the scoring results and other fields to output events is performed. This needs to differentiate between the cases where the target values are Java primitive values or something else.

    FieldName targetName = evaluator.getTargetField();
    Object targetValue = results.get(targetName);
    if (targetValue instanceof Computable){ ….

More details about this approach can be found at JPMML-Evaluator: Preparing arguments for evaluation and Java (JPMML) Prediction using R PMML model. The key aspect is that the JPMML Evaluator API provides the functionality for implementing the Scoring Processor independently of the actual model being used. The active variables, mappings, assignments, and predictor invocations are figured out automatically from the PMML representation.

This approach allows flexibility for the scoring application. Suppose that several PMML models have been generated off-line for the same system component, equipment part, etc. Then, for example, an n-variable logistic model could be replaced by an m-variable decision tree model via the UI control by just pointing a Scoring Processor variable to the new PMML object. Moreover, the substitution can be executed via signal events sent through the UI Application Control (upper left of the EPN) without stopping and restarting the scoring application. This is practical because the real-time data keeps flowing in!

Tested models

The R algorithms listed below were tested, and identical results were obtained for predictions based on the OSX PMML/JPMML scoring application and predictions in R.

lm (stats)
glm (stats)
rpart (rpart)
naiveBayes (e1071)
nnet (nnet)
randomForest (randomForest)

The prototype is new and other algorithms are currently being tested. The details will follow in a subsequent post.

Acknowledgment

The OSX-ORE PMML/JPMML-based prototype for real-time scoring was developed together with Mauricio Arango | A-Team Cloud Solutions Architects. The work was presented at BIWA 2016.


News

R Consortium Announces New Projects

The R Consortium works with and provides support to the R Foundation and other organizations developing, maintaining and distributing R software and provides a unifying framework for the R user community. The R Consortium Infrastructure Steering Committee (ISC) supports projects that help the R community, whether through software development, developing new teaching materials, documenting best practices, promoting R to new audiences, standardizing APIs, or doing research. In the first open call for proposals, Oracle submitted three proposals, each of which has been accepted by the ISC: “R Implementation, Optimization and Tooling Workshops,” which received a grant, and two working groups, “Future-proof native APIs for R” and “Code Coverage Tool for R.” These were officially announced by the R Consortium here.

R Implementation, Optimization and Tooling Workshops
Following the successful first edition of the R Implementation, Optimization and Tooling (RIOT) Workshop collocated with the ECOOP 2015 conference, the second edition of the workshop will be collocated with useR! 2016 and held on July 3rd at Stanford University. Similarly to last year’s event, RIOT 2016 is a one-day workshop dedicated to exploring future directions for development of R language implementations and tools. The goals of the workshop include, but are not limited to, sharing experiences of developing different R language implementations and tools and evaluating their status, exploring possibilities to increase involvement of the R user community in constructing different R implementations, identifying R language development and tooling opportunities, and discussing future directions for the R language. The workshop will consist of a number of short talks and discussions and will bring together developers of R language implementations and tools. See this link for more information.

Code Coverage Tool for R
Code coverage helps to ensure greater software quality by reporting how thoroughly test suites cover the various code paths. Having a tool that supports the breadth of the R language across multiple platforms, and that is used by R package developers and R core teams, helps to improve software quality for the R community. While a few code coverage tools exist for R, this Oracle-proposed ISC project aims to provide an enhanced tool that addresses feature and platform limitations of existing tools via an ISC-established working group. It also aims to promote the use of code coverage more systematically within the R ecosystem.

Future-proof native APIs for R
This project aims to develop a future-proof native API for R. The current native API evolved gradually, adding new functionality incrementally, as opposed to reflecting an overall design with one consistent API, which makes it harder than necessary to understand and use. As the R ecosystem evolves, the native API is becoming a bottleneck, preventing crucial changes to the GNUR runtime, while presenting difficulties for alternative implementations of the R language. The ISC recognizes this as critical to the R ecosystem and will create a working group to facilitate cooperation on this issue. This project's goal is to assess current native API usage, gather community input, and work toward a modern, future-proof, easy to understand, consistent and verifiable API that will make life easier for both users and implementers of the R language.
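As a point of reference for the code coverage discussion above, existing community tooling can already be exercised today. The sketch below uses the covr package, named here only as an example of current tools, not as the ISC project's deliverable; the package path is a placeholder.

# Sketch: measuring R package test coverage with the existing covr package
library(covr)
cov <- package_coverage("path/to/yourPackage")  # runs the package tests and records coverage
cov                                             # per-file and per-line coverage summary
percent_coverage(cov)                           # overall percentage of lines covered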
Oracle is pleased to be a founding member of the R Consortium and to contribute to these and other projects that support the R community and ecosystem.


Tips and Tricks

Using SVD for Dimensionality Reduction

SVD, or Singular Value Decomposition, is one of several techniques that can be used to reduce the dimensionality, i.e., the number of columns, of a data set. Why would we want to reduce the number of dimensions? In predictive analytics, more columns normally means more time required to build models and score data. If some columns have no predictive value, this means wasted time, or worse, those columns contribute noise to the model and reduce model quality or predictive accuracy. Dimensionality reduction can be achieved by simply dropping columns, for example, those that may show up as collinear with others or identified as not being particularly predictive of the target as determined by an attribute importance ranking technique. But it can also be achieved by deriving new columns based on linear combinations of the original columns. In both cases, the resulting transformed data set can be provided to machine learning algorithms to yield faster model build times, faster scoring times, and more accurate models. While SVD can be used for dimensionality reduction, it is often used in digital signal processing for noise reduction, image compression, and other areas. SVD is an algorithm that factors an m x n matrix, M, of real or complex values into three component matrices, where the factorization has the form USV*. U is an m x p matrix. S is a p x p diagonal matrix. V is an n x p matrix, with V* being the transpose of V, a p x n matrix, or the conjugate transpose if M contains complex values. The value p is called the rank. The diagonal entries of S are referred to as the singular values of M. The columns of U are typically called the left-singular vectors of M, and the columns of V are called the right-singular vectors of M. Consider the following visual representation of these matrices: One of the features of SVD is that given the decomposition of M into U, S, and V, one can reconstruct the original matrix M, or an approximation of it. The singular values in the diagonal matrix S can be used to understand the amount of variance explained by each of the singular vectors. In R, this can be achieved using the computation:

cumsum(S^2/sum(S^2))

When plotted, this provides a visual understanding of the variance captured by the model. The figure below indicates that the first singular vector accounts for 96.5% of the variance, the second together with the first accounts for over 99.5%, and so on. As such, we can use this information to limit the number of vectors to the amount of variance we wish to capture. Reducing the number of vectors can help eliminate noise in the original data set when that data set is reconstructed using the subcomponents of U, S, and V.

ORE’s parallel, distributed SVD
With Oracle R Enterprise’s parallel, distributed implementation of R’s svd function, only the S and V components are returned. More specifically, the diagonal singular values of S are returned as the vector d. If we store the result of invoking svd on matrix dat in svd.mod, U can be derived from these and M as follows:

svd.mod <- svd(dat)
U <- dat %*% svd.mod$v %*% diag(1./svd.mod$d)

So, how do we achieve dimensionality reduction using SVD? We can use the first k columns of V and S and achieve U’ with fewer columns.
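As a quick check of these relationships before continuing, base R's svd() on a small in-memory matrix behaves the same way; the following is a minimal, standalone sketch, independent of the ORE implementation (the matrix is random and only for illustration):

set.seed(1)
M <- matrix(rnorm(100 * 5), nrow = 100)   # a small 100 x 5 numeric matrix
s <- svd(M)                               # s$d (singular values), s$u, s$v

cumsum(s$d^2 / sum(s$d^2))                # variance explained, as plotted above

U <- M %*% s$v %*% diag(1 / s$d)          # reconstruct U from M, V and d
all.equal(U, s$u)                         # TRUE up to numerical tolerance

With that verified, the reduced U for the ORE case is computed by keeping only the first k columns: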
U.reduced <- dat %*% svd.mod$v[,1:k,drop=FALSE] %*% diag((svd.mod$d)[1:k,drop=FALSE])

This reduced U can now be used as a proxy for matrix dat with fewer columns. The function dimReduce introduced below accepts a matrix x, the number of columns desired k, and a request for any supplemental columns to return with the transformed matrix.

dimReduce <- function(x, k=floor(ncol(x)/2), supplemental.cols=NULL) {
  colIdxs <- which(colnames(x) %in% supplemental.cols)
  colNames <- names(x[,-colIdxs])
  sol <- svd(x[,-colIdxs])
  sol.U <- as.matrix(x[,-colIdxs]) %*% (sol$v)[,1:k,drop=FALSE] %*% diag((sol$d)[1:k,drop=FALSE])
  sol.U = sol.U@data
  res <- cbind(sol.U,x[,colIdxs,drop=FALSE])
  names(res) <- c(names(sol.U@data),names(x[,colIdxs]))
  res
}

We will now use this function to reduce the iris data set. To prepare the iris data set, we first add a unique identifier, create the database table IRIS2 in the database, and then assign row names to enable row indexing. We could also make ID the primary key using ore.exec with the ALTER TABLE statement. Refreshing the ore.frame proxy object using ore.sync reflects the change in primary key.

dat <- iris
dat$ID <- seq_len(nrow(dat))
ore.drop("IRIS2")
ore.create(dat,table="IRIS2")
row.names(IRIS2) <- IRIS2$ID
# ore.exec("alter table IRIS2 add constraint IRIS2 primary key (\"ID\")")
# ore.sync(table = "IRIS2", use.keys = TRUE)
IRIS2[1:5,]

Using the function defined above, dimReduce, we produce IRIS2.reduced with supplemental columns of ID and Species. This allows us to easily generate a confusion matrix later. You will find that IRIS2.reduced has 4 columns.

IRIS2.reduced <- dimReduce(IRIS2, 2, supplemental.cols=c("ID","Species"))
dim(IRIS2.reduced)  # 150 4

Next, we will build an rpart model to predict Species using first the original iris data set, and then the reduced data set so we can compare the confusion matrices of each. Note that to use R's rpart for model building, the data set IRIS2.reduced is pulled to the client.

library(rpart)
m1 <- rpart(Species~.,iris)
res1 <- predict(m1,iris,type="class")
table(res1,iris$Species)
# res1         setosa versicolor virginica
#   setosa         50          0         0
#   versicolor      0         49         5
#   virginica       0          1        45

dat2 <- ore.pull(IRIS2.reduced)
m2 <- rpart(Species~.-ID,dat2)
res2 <- predict(m2,dat2,type="class")
table(res2,iris$Species)
# res2         setosa versicolor virginica
#   setosa         50          0         0
#   versicolor      0         47         0
#   virginica       0          3        50

Notice that the resulting models are comparable, but that the model that used IRIS2.reduced actually has better overall accuracy, making just 3 mistakes instead of 6. Of course, a more accurate assessment of error would be to use cross validation; however, this is left as an exercise for the reader. We can build a similar model using the in-database decision tree algorithm, via ore.odmDT, and get the same results on this particular data set.

m2.1 <- ore.odmDT(Species~.-ID, IRIS2.reduced)
res2.1 <- predict(m2.1,IRIS2.reduced,type="class",supplemental.cols = "Species")
table(res2.1$PREDICTION, res2.1$Species)
#              setosa versicolor virginica
#   setosa         50          0         0
#   versicolor      0         47         0
#   virginica       0          3        50

A more interesting example is based on the digit-recognizer data, which can be located on the Kaggle website here. In this example, we first use Support Vector Machine as the algorithm with default parameters on split train and test samples of the original training data. This allows us to get an objective assessment of model accuracy. Then, we preprocess the train and test sets using the in-database SVD algorithm and reduce the original 785 predictors to 40.
The reduced number of variables specified is subject to experimentation. Degree of parallelism for SVD was set to 4. The results highlight that reducing data dimensionality can improve overall model accuracy, and that overall execution time can be significantly faster. Specifically, using ore.odmSVM for model building saw a 43% time reduction and a 4.2% increase in accuracy by preprocessing the train and test data using SVD. However, it should be noted that not all algorithms are necessarily aided by dimensionality reduction with SVD. In a second test on the same data using ore.odmRandomForest with 25 trees and defaults for other settings, accuracy of 95.3% was achieved using the original train and test sets. With the SVD-reduced train and test sets, accuracy was 93.7%. While the model building time was reduced by 80% and scoring time by 54%, once we factor in the SVD execution time, the straight random forest algorithm does better by a factor of two.

Details
For this scenario, we modify the dimReduce function introduced above and add another function, dimReduceApply. In dimReduce, we save the model in an ORE datastore so that the same model can be used to transform the test data set for scoring. In dimReduceApply, that same model is loaded for use in constructing the reduced U matrix.

dimReduce <- function(x, k=floor(ncol(x)/2), supplemental.cols=NULL, dsname="svd.model") {
  colIdxs <- which(colnames(x) %in% supplemental.cols)
  if (length(colIdxs) > 0) {
    sol <- svd(x[,-colIdxs])
    sol.U <- as.matrix(x[,-colIdxs]) %*% (sol$v)[,1:k,drop=FALSE] %*% diag((sol$d)[1:k,drop=FALSE])
    res <- cbind(sol.U@data,x[,colIdxs,drop=FALSE])
    # names(res) <- c(names(sol.U@data),names(x[,colIdxs]))
    res
  } else {
    sol <- svd(x)
    sol.U <- as.matrix(x) %*% (sol$v)[,1:k,drop=FALSE] %*% diag((sol$d)[1:k,drop=FALSE])
    res <- sol.U@data
  }
  ore.save(sol, name=dsname, overwrite=TRUE)
  res
}

dimReduceApply <- function(x, k=floor(ncol(x)/2), supplemental.cols=NULL, dsname="svd.model") {
  colIdxs <- which(colnames(x) %in% supplemental.cols)
  ore.load(dsname)
  if (length(colIdxs) > 0) {
    sol.U <- as.matrix(x[,-colIdxs]) %*% (sol$v)[,1:k,drop=FALSE] %*% diag((sol$d)[1:k,drop=FALSE])
    res <- cbind(sol.U@data,x[,colIdxs,drop=FALSE])
    # names(res) <- c(names(sol.U@data),names(x[,colIdxs]))
    res
  } else {
    sol.U <- as.matrix(x) %*% (sol$v)[,1:k,drop=FALSE] %*% diag((sol$d)[1:k,drop=FALSE])
    res <- sol.U@data
  }
  res
}

Here is the script used for the digit data:

# load data from file
train <- read.csv("D:/datasets/digit-recognizer-train.csv")
dim(train)  # 42000 786
train$ID <- 1:nrow(train)  # assign row id
ore.drop(table="DIGIT_TRAIN")
ore.create(train,table="DIGIT_TRAIN")  # create as table in the database
dim(DIGIT_TRAIN)  # 42000 786

# Split the original training data into train and
# test sets to evaluate model accuracy
set.seed(0)
dt <- DIGIT_TRAIN
ind <- sample(1:nrow(dt),nrow(dt)*.6)
group <- as.integer(1:nrow(dt) %in% ind)
row.names(dt) <- dt$ID
sample.train <- dt[group==TRUE,]
sample.test <- dt[group==FALSE,]
dim(sample.train)  # 25200 786
dim(sample.test)   # 16800 786

# Create train table in database
ore.create(sample.train, table="DIGIT_SAMPLE_TRAIN")
# Create test table in database
ore.create(sample.test, table="DIGIT_SAMPLE_TEST")

# Add persistent primary key for row indexing
# Note: could be done using row.names(DIGIT_SAMPLE_TRAIN) <- DIGIT_SAMPLE_TRAIN$ID
ore.exec("alter table DIGIT_SAMPLE_TRAIN add constraint DIGIT_SAMPLE_TRAIN primary key (\"ID\")")
ore.exec("alter table DIGIT_SAMPLE_TEST add constraint DIGIT_SAMPLE_TEST primary key (\"ID\")")
(\"ID\")")ore.sync(table = c("DIGIT_SAMPLE_TRAIN","DIGIT_SAMPLE_TRAIN"), use.keys = TRUE)# SVM modelm1.svm <- ore.odmSVM(label~.-ID, DIGIT_SAMPLE_TRAIN, type="classification")pred.svm <- predict(m1.svm, DIGIT_SAMPLE_TEST, supplemental.cols=c("ID","label"),type="class")cm <- with(pred.svm, table(label,PREDICTION))library(caret)confusionMatrix(cm)# Confusion Matrix and Statistics# # PREDICTION# label 0 1 2 3 4 5 6 7 8 9# 0 1633 0 4 2 3 9 16 2 7 0# 1 0 1855 12 3 2 5 4 2 23 3# 2 9 11 1445 22 26 8 22 30 46 10# 3 8 9 57 1513 2 57 16 16 41 15# 4 5 9 10 0 1508 0 10 4 14 85# 5 24 12 14 52 28 1314 26 6 49 34# 6 10 2 7 1 8 26 1603 0 6 0# 7 10 8 27 4 21 8 1 1616 4 70# 8 12 45 14 40 7 47 13 10 1377 30# 9 12 10 6 19 41 15 2 54 15 1447# # Overall Statistics# # Accuracy : 0.9114 # 95% CI : (0.907, 0.9156)# No Information Rate : 0.1167 # P-Value [Acc > NIR] : < 2.2e-16 #... options(ore.parallel=4)sample.train.reduced <- dimReduce(DIGIT_SAMPLE_TRAIN, 40, supplemental.cols=c("ID","label"))sample.test.reduced <- dimReduceApply(DIGIT_SAMPLE_TEST, 40, supplemental.cols=c("ID","label"))ore.drop(table="DIGIT_SAMPLE_TRAIN_REDUCED")ore.create(sample.train.reduced,table="DIGIT_SAMPLE_TRAIN_REDUCED")ore.drop(table="DIGIT_SAMPLE_TEST_REDUCED")ore.create(sample.test.reduced,table="DIGIT_SAMPLE_TEST_REDUCED")m2.svm <- ore.odmSVM(label~.-ID, DIGIT_SAMPLE_TRAIN_REDUCED, type="classification")pred2.svm <- predict(m2.svm, DIGIT_SAMPLE_TEST_REDUCED, supplemental.cols=c("label"),type="class")cm <- with(pred2.svm, table(label,PREDICTION))confusionMatrix(cm)# Confusion Matrix and Statistics# # PREDICTION# label 0 1 2 3 4 5 6 7 8 9# 0 1652 0 3 3 2 7 4 1 3 1# 1 0 1887 8 2 2 1 1 3 3 2# 2 3 4 1526 11 20 3 7 21 27 7# 3 0 3 29 1595 3 38 4 16 34 12# 4 0 4 8 0 1555 2 11 5 9 51# 5 5 6 2 31 6 1464 13 6 10 16# 6 2 1 5 0 5 18 1627 0 5 0# 7 2 6 22 7 10 2 0 1666 8 46# 8 3 9 9 34 7 21 9 7 1483 13# 9 5 2 8 17 30 10 3 31 20 1495# # Overall Statistics# # Accuracy : 0.9494 # 95% CI : (0.946, 0.9527)# No Information Rate : 0.1144 # P-Value [Acc > NIR] : < 2.2e-16 #... # CASE 2 with Random Forestm2.rf <- ore.randomForest(label~.-ID, DIGIT_SAMPLE_TRAIN,ntree=25)pred2.rf <- predict(m2.rf, DIGIT_SAMPLE_TEST, supplemental.cols=c("label"),type="response")cm <- with(pred2.rf, table(label,prediction))confusionMatrix(cm)# Confusion Matrix and Statistics# # prediction# label 0 1 2 3 4 5 6 7 8 9# 0 1655 0 1 1 2 0 7 0 9 1# 1 0 1876 12 8 2 1 1 2 6 1# 2 7 4 1552 14 10 2 5 22 10 3# 3 9 5 33 1604 1 21 4 16 27 14# 4 1 4 3 0 1577 1 9 3 3 44# 5 9 6 2 46 3 1455 18 1 9 10# 6 13 2 3 0 6 14 1621 0 3 1# 7 1 6 31 5 16 3 0 1675 3 29# 8 3 7 15 31 11 20 8 4 1476 20# 9 9 2 7 23 32 5 1 15 12 1515# # Overall Statistics# # Accuracy : 0.9527 # 95% CI : (0.9494, 0.9559)# No Information Rate : 0.1138 # P-Value [Acc > NIR] : < 2.2e-16 #... m1.rf <- ore.randomForest(label~.-ID, DIGIT_SAMPLE_TRAIN_REDUCED,ntree=25)pred1.rf <- predict(m1.rf, DIGIT_SAMPLE_TEST_REDUCED, supplemental.cols=c("label"),type="response")cm <- with(pred1.rf, table(label,prediction))confusionMatrix(cm)# Confusion Matrix and Statistics# # prediction# label 0 1 2 3 4 5 6 7 8 9# 0 1630 0 4 5 2 8 16 3 5 3# 1 0 1874 17 4 0 5 2 2 4 1# 2 15 2 1528 17 10 5 10 21 16 5# 3 7 1 32 1601 4 25 10 8 34 12# 4 2 6 6 3 1543 2 17 4 4 58# 5 9 1 5 45 12 1443 11 3 15 15# 6 21 3 8 0 5 15 1604 0 7 0# 7 5 11 33 7 17 6 1 1649 2 38# 8 5 13 27 57 14 27 9 12 1404 27# 9 10 2 6 22 52 8 5 41 12 1463# # Overall Statistics# # Accuracy : 0.9368 # 95% CI : (0.9331, 0.9405)# No Information Rate : 0.1139 # P-Value [Acc > NIR] : < 2.2e-16 #... 
Execution Times
The following numbers reflect the execution times for select operations of the above script. Hardware was a Lenovo Thinkpad with Intel i5 processor and 16 GB RAM.


News

Learn, Share, and Network! Join us at BIWA Summit, Oracle HQ, January 26-28

Join us at BIWA Summit held at Oracle Headquarters to learn about the latest in Oracle technology, customer experiences, and best practices, while sharing your experiences with colleagues, and networking with technology experts. BIWA Summit 2016, the Oracle Big Data + Analytics User Group Conference, is joining forces with the NoCOUG SIG’s YesSQL Summit, Spatial SIG’s Spatial Summit and DWGL for the biggest BIWA Summit ever. Check out the BIWA Summit ’16 agenda. The BIWA Summit 2016 sessions and hands-on labs are excellent opportunities for attendees to learn about Advanced Analytics / Predictive Analytics, R, Spatial Geo-location, Graph/Social Network Analysis, Big Data Appliance and Hadoop, Cloud, Big Data Discovery, OBIEE & Business Intelligence, SQL Patterns, SQL Statistical Functions, and more! Check out these R technology-related sessions:
 •  Hands-on lab with Oracle R Enterprise "Scaling R to New Heights with Oracle Database"
 •  Oracle R Enterprise 1.5 - Hot new features!
 •  Large Scale Machine Learning with Big Data SQL, Hadoop and Spark
 •  Improving Predictive Model Development Time with R and Oracle Big Data Discovery
 •  Fiserv Case Study: Using Oracle Advanced Analytics for Fraud Detection in Online Payments
 •  Machine Learning on Streaming Data via Integration of Oracle R Enterprise and Oracle Stream Explorer
 •  Fault Detection using Advanced Analytics at CERN's Large Hadron Collider: Too Hot or Too Cold
 •  Stubhub and Oracle Advanced Analytics
...and more


Best Practices

ORE Random Forest

Random Forest is a popular ensemble learning technique for classification and regression, developed by Leo Breiman and Adele Cutler. By combining the ideas of “bagging” and random selection of variables, the algorithm produces a collection of decision trees with controlled variance, while avoiding overfitting – a common problem for decision trees. By constructing many trees, classification predictions are made by selecting the mode of the classes predicted, while regression predictions are computed using the mean of the individual tree predictions. Although the Random Forest algorithm provides high accuracy, performance and scalability can be issues for larger data sets. Oracle R Enterprise 1.5 introduces Random Forest for classification with three enhancements:
 •  ore.randomForest uses the ore.frame proxy for database tables so that data remain in the database server
 •  ore.randomForest executes in parallel for model building and scoring while using Oracle R Distribution or R’s randomForest package 4.6-10
 •  randomForest in Oracle R Distribution significantly reduces memory requirements of R’s algorithm, providing only the functionality required for use by ore.randomForest

Performance
Consider the model build performance of randomForest for 500 trees (the default) and three data set sizes (10K, 100K, and 1M rows). The formula is ‘DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH’ using samples of the popular ONTIME domestic flight dataset. With ORE’s parallel, distributed implementation, ore.randomForest is an order of magnitude faster than the commonly used randomForest package. While the first plot uses the original execution times, the second uses a log scale to facilitate interpreting scalability.

Memory vs. Speed
ore.randomForest is designed for speed, relying on ORE embedded R execution for parallelism to achieve the order of magnitude speedup. However, the data set is loaded into memory for each parallel R engine, so high degrees of parallelism (DOP) will result in correspondingly higher memory use. Since Oracle R Distribution’s randomForest improves memory usage over R's randomForest (approximately 7X less), larger data sets can be accommodated. Users can specify the DOP using the ore.parallel global option.

API
The ore.randomForest API:

ore.randomForest(formula, data, ntree=500, mtry = NULL,
                 replace = TRUE, classwt = NULL, cutoff = NULL,
                 sampsize = if(replace) nrow(data) else ceiling(0.632*nrow(data)),
                 nodesize = 1L, maxnodes = NULL, confusion.matrix = FALSE,
                 na.action = na.fail, ...)

To highlight two of the arguments: confusion.matrix is a logical value indicating whether to calculate the confusion matrix. Note that this confusion matrix is not based on OOB (out-of-bag); it is the result of applying the built random forest model to the entire training data. The argument groups is the number of tree groups into which the total number of trees is divided during model building. The default is equal to the value of the option 'ore.parallel'. If system memory is limited, it is recommended to set this argument to a value large enough that the number of trees in each group is small, to avoid exceeding memory availability.
Scoring with ore.randomForest follows other ORE scoring functions:

predict(object, newdata,
        type = c("response", "prob", "vote", "all"),
        norm.votes = TRUE,
        supplemental.cols = NULL,
        cache.model = TRUE, ...)

The arguments include:
 •  type: scoring output content – 'response', 'prob', 'vote', or 'all', corresponding to predicted values, a matrix of class probabilities, a matrix of vote counts, or both the vote matrix and predicted values, respectively.
 •  norm.votes: a logical value indicating whether the vote counts in the output vote matrix should be normalized. The argument is ignored if 'type' is 'response' or 'prob'.
 •  supplemental.cols: additional columns from the 'newdata' data set to include in the prediction result. This can be particularly useful for including a key column that can be related back to the original data set.
 •  cache.model: a logical value indicating whether the entire random forest model is cached in memory during prediction. While the default is TRUE, setting it to FALSE may be beneficial if memory is an issue.

Example

options(ore.parallel=8)
df <- ONTIME_S[,c("DAYOFWEEK","DEPDELAY","DISTANCE",
                  "UNIQUECARRIER","DAYOFMONTH","MONTH")]
df <- df[complete.cases(df),]
mod <- ore.randomForest(DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH,
                        df, ntree=100, groups=20)
ans <- predict(mod, df, type="all", supplemental.cols="DAYOFWEEK")
head(ans)

R> options(ore.parallel=8)
R> df <- ONTIME_S[,c("DAYOFWEEK","DEPDELAY","DISTANCE",
                     "UNIQUECARRIER","DAYOFMONTH","MONTH")]
R> df <- df[complete.cases(df),]
R> mod <- ore.randomForest(DAYOFWEEK~DEPDELAY+DISTANCE+UNIQUECARRIER+DAYOFMONTH+MONTH,
+                          df, ntree=100, groups=20)
R> ans <- predict(mod, df, type="all", supplemental.cols="DAYOFWEEK")
R> head(ans)
     1    2    3    4    5    6    7 prediction DAYOFWEEK
1 0.09 0.01 0.06 0.04 0.70 0.05 0.05          5         5
2 0.06 0.01 0.02 0.03 0.01 0.38 0.49          7         6
3 0.11 0.03 0.16 0.02 0.06 0.57 0.05          6         6
4 0.09 0.04 0.15 0.03 0.02 0.62 0.05          6         6
5 0.04 0.04 0.04 0.01 0.06 0.72 0.09          6         6
6 0.35 0.11 0.14 0.27 0.05 0.08 0.00          1         1
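From the prediction result above, a confusion matrix and overall accuracy can be derived with standard R operations. The sketch below assumes ans is the ore.frame proxy returned by predict() as shown; if it is already a local data.frame, the ore.pull() step can be skipped.

# Tabulate predictions vs. actuals from the ore.randomForest scoring result above
res <- ore.pull(ans[, c("prediction", "DAYOFWEEK")])   # bring only these two columns to the client
cm  <- table(predicted = res$prediction, actual = res$DAYOFWEEK)
cm
sum(diag(cm)) / sum(cm)    # overall accuracy on the training data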


Best Practices

Oracle R Enterprise 1.5 Released

We’re pleased to announce that Oracle R Enterprise (ORE) 1.5 is now available for download on all supported platforms with Oracle R Distribution 3.2.0 / R-3.2.0. ORE 1.5 introduces parallel distributed implementations of Random Forest, Singular Value Decomposition (SVD), and Principal Component Analysis (PCA) that operate on ore.frame objects. Performance enhancements are included for ore.summary summary statistics. In addition, ORE 1.5 enhances embedded R execution with CLOB/BLOB data column support to enable larger text and non-character data to be transferred between Oracle Database and R. CLOB/BLOB support is also enabled for the functions ore.push and ore.pull. The ORE R Script Repository now supports finer grained R script access control across schemas. Similarly, the ORE Datastore enables finer grained R object access control across schemas. For ore.groupApply in embedded R execution, ORE 1.5 now supports multi-column partitioning of data using the INDEX argument. Multiple bug fixes are also included in this release. Here are the highlights of the new and upgraded features in ORE 1.5:

Upgraded R version compatibility
ORE 1.5 is certified with R-3.2.0 - both open source R and Oracle R Distribution. See the server support matrix for the complete list of supported R versions. R-3.2.0 brings improved performance and big in-memory data objects, and compatibility with more than 7000 community-contributed R packages. For supporting packages, ORE 1.5 includes one new package, randomForest, with upgrades to other packages:
arules 1.1-9
cairo 1.5-8
DBI 0.3-1
png 0.1-7
ROracle 1.2-1
statmod 1.4.21
randomForest 4.6-10

Parallel and distributed algorithms
While the Random Forest algorithm provides high accuracy, performance and scalability can be issues for large data sets. ORE 1.5 introduces Random Forest in Oracle R Distribution with two enhancements: first, a revision to reduce memory requirements of the open source randomForest algorithm; and second, the function ore.randomForest, which executes in parallel for model building and scoring while using the underlying randomForest function either from Oracle R Distribution or R’s randomForest package 4.6-10. ore.randomForest uses ore.frame objects, allowing data to remain in the database server. The functions svd and prcomp have been overloaded to execute in parallel and accept ore.frame objects. Users now get in-database execution of this functionality to improve scalability and performance – no data movement.

Performance enhancements
ore.summary performance enhancements support executions that are 30x faster than previous releases.

Capability enhancements
The ore.grant and ore.revoke functions enable users to grant other users read access to their R scripts in the R script repository and to individual datastores. The database data types CLOB and BLOB are now supported for the input and output of embedded R execution invocations, as well as for the functions ore.pull and ore.push. For embedded R execution with ore.groupApply, users can now specify multiple columns for automatically partitioning data via the INDEX argument, as sketched below. For a complete list of new features, see the Oracle R Enterprise User's Guide. To learn more about Oracle R Enterprise, visit Oracle R Enterprise on Oracle's Technology Network, or review the variety of use cases on the Oracle R Technologies blog.
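As a small illustration of the multi-column partitioning mentioned above, the following sketch partitions an ore.frame by two columns for embedded R execution. The ONTIME_S proxy and its column names are placeholders taken from other examples on this blog; treat the call as a sketch of the documented ore.groupApply pattern, not a verbatim recipe.

# Sketch: ore.groupApply partitioned by two columns (ORE 1.5 multi-column INDEX)
res <- ore.groupApply(
  X     = ONTIME_S[, c("DAYOFWEEK", "MONTH", "DEPDELAY")],
  INDEX = ONTIME_S[, c("DAYOFWEEK", "MONTH")],   # partition by both columns
  FUN   = function(dat) {
    # each invocation receives one DAYOFWEEK/MONTH partition as a data.frame
    data.frame(DAYOFWEEK      = dat$DAYOFWEEK[1],
               MONTH          = dat$MONTH[1],
               mean_dep_delay = mean(dat$DEPDELAY, na.rm = TRUE))
  },
  parallel = 2)
ore.pull(res)   # materialize the per-partition summaries on the client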


Best Practices

Using RStudio Shiny with ORE for interactive analysis and visualization

Shiny, by RStudio, is a popular web application framework for R. It can be used, for example, for building flexible interactive analysis and visualization solutions without requiring web development skills and knowledge of JavaScript, HTML, CSS, etc. An overview of its capabilities with numerous examples is available on RStudio's Shiny web site. In this blog we illustrate a simple Shiny application for processing and visualizing data stored in Oracle Database for the special case where this data is wide and shallow. In this use case, analysts wish to interactively compute and visualize correlations between many variables (up to 40k) for genomics-related data where the number of observations is small, typically around 1000 cases. A similar use case was discussed in a recent blog, Consolidating wide and shallow data with ORE Datastore, which addressed the performance aspects of reading and saving wide data from/to an Oracle R Enterprise (ORE) datastore. ORE allows users to store any type of R object, including wide data.frames, directly in an Oracle Database using ORE datastores. Our Shiny application, invoked at the top level via the shinyApp() command below, has two components: a user-interface (ui) definition myui and a server function myserver.

R> library("shiny")
R> shinyApp(ui = myui, server = myserver)

The functionality is very basic. The user specifies the dataset to be analyzed, the sets of variables to correlate against each other, the correlation method (Pearson, Spearman, etc.), and the treatment of missing data, as handled by the 'method' and 'use' arguments of R's cor() function, respectively. The datasets are all wide (ncols > 1000) and have already been saved in an ORE datastore. Partial code for the myui user interface is below. The sidebarPanel section handles the input, and the correlation values are displayed graphically by plotOutput() within the mainPanel section. The argument 'corrPlot' corresponds to the function invoked by the server() component.

R> myui <- fluidPage(...,
     sidebarLayout(
       sidebarPanel(
         selectInput("TblName", label = "Data Sets",
                     choices = c(...,...,...), selected = (...),
         radioButtons("corrtyp", label = "Correlation Type",
                      choices = list("Pearson"="pearson",
                                     "Spearman"="spearman",
                                     "Kendall"="kendall"),
                      selected = "pearson"),
         selectInput("use", label = "Use",
                     choices = c("everything","all.obs","complete.obs",
                                 "na.or.complete","pairwise.complete.obs"),
                     selected = "everything"),
         textInput("xvars", label=..., value=...),
         textInput("yvars", label=..., value=...),
         submitButton("Submit")
       ),
       mainPanel(
         plotOutput("corrPlot")
       )
     ))

The server component consists of two functions. The first is the server function myserver(), passed as an argument during the top-level invocation of shinyApp(). myserver returns the output$corrPlot object (see the ui component) generated by Shiny's rendering function renderPlot(). The plot object p is generated within renderPlot() by calling ore.doEval() for the embedded R execution of gener_corr_plot(). The ui input selections are passed to gener_corr_plot() through ore.doEval().
R> myserver <- function(input, output) {
     output$corrPlot <- renderPlot({
       p <- ore.pull(ore.doEval(
              FUN.NAME="gener_corr_plot",
              TblName=input$TblName,
              corrtyp=input$corrtyp,
              use=input$use,
              xvartxt=input$xvars,
              yvartxt=input$yvars,
              ds.name="...",
              ore.connect=TRUE))
       print(p)
     })
   }

The second is a core function, gener_corr_plot(), which combines loading data from the ORE datastore, invoking the R cor() function with all arguments specified in the ui, and generating the plot for the resulting correlation values.

R> gener_corr_plot <- function(TblName, corrtyp="pearson", use="everything",
                               xvars=..., yvars=..., ds.name=...) {
     library("ggplot2")
     ore.load(name=ds.name, list=c(TblName))
     res <- cor(get(TblName)[,filtering_based_on_xvars],
                get(TblName)[,filtering_based_on_yvars],
                method=corrtyp, use=use)
     p <- qplot(y=c(res)....)
   }

The result of invoking shinyApp() is illustrated in the figure below. The data picked for this display is DOROTHEA, a drug discovery dataset from the UCI Machine Learning Repository, and the plot shows the Pearson correlation between variable 1 and the first 2000 variables of the dataset. The variables are labeled here generically, by their index number. For the type of data considered by the customer (wide tables with 10k-40k columns and ~1k rows), a complete iteration (data loading, correlation calculations between one variable and all others, and plot rendering) takes from a few seconds up to a few dozen seconds, depending on the correlation method and the handling of missing values. This is considerably faster than the many minutes reported by customers running their code in memory with data loaded from CSV files. Several questions may come to mind when thinking about such Shiny-based applications. For example: When and how should the analysis be decoupled from data reading? How can chains of reactive components be built for more complex applications? What is an efficient rendering mechanism for multiple plotting methods and increasing data sizes? What other options exist for handling data with more than 1000 columns (nested tables?)? We will address these in future posts. In this Part 1 we illustrated that it is easy to construct a Shiny-based interactive application for wide data by leveraging ORE's datastore capability and support for embedded R execution. Besides improved performance, this solution offers the security, auditing, backup, and recovery capabilities of Oracle Database.
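Because several arguments in the listings above are elided, a complete, self-contained miniature of the same ui/server pattern is sketched below using an in-memory dataset in place of the ORE datastore; mtcars and the single hard-coded target variable are only stand-ins for the wide tables discussed.

library(shiny)
library(ggplot2)

myui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      radioButtons("corrtyp", "Correlation Type",
                   choices = list("Pearson" = "pearson",
                                  "Spearman" = "spearman",
                                  "Kendall" = "kendall")),
      selectInput("use", "Use",
                  choices = c("everything", "complete.obs", "pairwise.complete.obs")),
      submitButton("Submit")
    ),
    mainPanel(plotOutput("corrPlot"))
  )
)

myserver <- function(input, output) {
  output$corrPlot <- renderPlot({
    # correlate mpg against all other mtcars columns using the selected options
    res <- cor(mtcars[, -1], mtcars$mpg, method = input$corrtyp, use = input$use)
    qplot(x = seq_along(res), y = c(res),
          xlab = "variable index", ylab = "correlation")
  })
}

shinyApp(ui = myui, server = myserver)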


Best Practices

Oracle R Distribution 3.2.0 Benchmarks

We recently updated the Oracle R Distribution (ORD) benchmarks on ORD version 3.2.0. ORD is Oracle's free distribution of the open source R environment that adds support for dynamically loading the Intel Math Kernel Library (MKL) installed on your system. MKL provides faster performance by taking advantage of hardware-specific math library implementations. The benchmark results demonstrate the performance of Oracle R Distribution 3.2.0 with and without dynamically loaded MKL. We executed the community-based R-Benchmark-25 script, which consists of a set of tests that benefit from faster matrix computations. The tests were run on a 24-core Linux system with 3.07 GHz per CPU and 47 GB RAM. We previously executed the same scripts against ORD 2.15.1 and ORD 3.0.1 on similar hardware.

Oracle R Distribution 3.2.0 Benchmarks (Time in seconds)
Speedup = (Slower Time / Faster Time) - 1

The first graph focuses on shorter running tests. We see significant performance improvements - SVD with ORD + MKL executes 20 times faster using 4 threads, and 30 times faster using 8 threads. For Cholesky factorization, ORD + MKL is 15 and 27 times faster for 4 and 8 threads, respectively. In the second graph, we focus on the longer running tests. Principal components analysis is 30 and almost 38 times faster for ORD + MKL with 4 and 8 threads, respectively, matrix multiplication is 80 and 139 times faster, and linear discriminant analysis is almost 4 times faster. By using Oracle R Distribution with MKL, you will see a notable performance boost for many R applications. These improvements happened with the exact same R code, without requiring any linking steps or updating any packages. Oracle R Distribution for R-3.2.0 is available for download on Oracle's Open Source Software portal. Oracle offers support for users of Oracle R Distribution on Windows, Linux, AIX and Solaris 64-bit platforms.
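To get a rough, local sense of these gains on your own machine, a few timings in the spirit of the benchmark's matrix tests can be run directly in R. This is only a sketch, not the full R-Benchmark-25 script; matrix size and results will vary with hardware and thread count.

# Rough matrix-computation timings in the spirit of R-benchmark-25
set.seed(1)
n <- 2000
A <- matrix(rnorm(n * n), n, n)
system.time(B <- crossprod(A))        # matrix product t(A) %*% A
system.time(chol(B))                  # Cholesky factorization
system.time(svd(A, nu = 0, nv = 0))   # singular values only
# sessionInfo() reports the R build in use; on recent R versions it also
# lists the BLAS/LAPACK libraries, which shows whether MKL is loaded.
sessionInfo()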


Best Practices

Oracle R Enterprise Performance on Intel® DC P3700 Series SSDs

Solid-state drives (SSDs) are becoming increasingly popular in enterprise storage systems, providing large caches, permanent storage, and low latency. A recent study aimed to characterize the performance of Oracle R Enterprise workloads on the Intel® P3700 SSD versus hard disk drives (HDDs), with IO-WAIT as the key metric of interest. The study showed that Intel® DC P3700 Series SSDs reduced I/O latency for Oracle R Enterprise workloads, most notably when saving objects to the Oracle R Enterprise datastore and materializing scoring results. The test environment was a 2-node Red Hat Linux 6.5 cluster, each node containing a 2 TB Intel® DC P3700 Series SSD with an attached 2 TB HDD. As the primary objective was to identify applications that could benefit from SSDs as-is, the test results show the performance gain without system modification or tuning. The tests for the study consisted of a set of single analytic workloads composed of I/O, computational, and memory intensive tests. The tests were run both serially and in parallel up to 100 jobs, which is a good representation of typical workloads for Oracle R Enterprise customers. The figures below show execution time results for datastore save and load, and for materializing model scoring results to a database table. The datastore performance is notably improved for the larger test sets (100 million and 1 billion rows). For these workloads, execution time was reduced by an average of 46%, with a maximum of 67%, compared to the HDD attached to the same cluster.
Figure 1: Saving and loading objects to and from the ORE datastore, HDD vs. Intel® P3700 Series SSD
Figure 2: Model scoring and materializing results, HDD vs. Intel® P3700 Series SSD
The entire set of test results shows that Intel® DC P3700 Series SSDs provide an average reduction of 30%, with a maximum of 67%, in execution time for I/O-heavy Oracle R Enterprise workloads. These results could potentially be improved by working with hardware engineers to tune the host kernel and other settings to optimize SSD performance. Intel® DC P3700 Series SSDs can increase storage I/O by reducing latency and increasing throughput, and it is recommended to explore system tuning options with your engineering team to achieve the best result.


Tips and Tricks

Consolidating wide and shallow data with ORE Datastore

Clinical trial data are often characterized by a relatively small set of participants (100s or 1000s), while the data collected and analyzed on each may be significantly larger (1000s or 10,000s of variables). Genomic data alone can easily reach the higher end of this range. In talking with industry leaders, one of the problems pharmaceutical companies and research hospitals encounter is effectively managing such data. Storing data in flat files on myriad servers, perhaps even “closeted” when no longer actively needed, poses problems for data accessibility, backup, recovery, and security. While Oracle Database provides support for wide data using nested tables in a number of contexts, to take advantage of R native functions that handle wide data using data.frames, Oracle R Enterprise allows you to store wide data.frames directly in Oracle Database using Oracle R Enterprise datastores. With Oracle R Enterprise (ORE), a component of the Oracle Advanced Analytics option, the ORE datastore supports storing arbitrary R objects, including data.frames, in Oracle Database. In particular, users can load wide data from a file into R and store the resulting data.frame directly in the ORE datastore. From there, users can repeatedly load the data at much faster speeds than they can from flat files. The following benchmark results illustrate the performance of saving and loading data.frames of various dimensions. These tests were performed on an Oracle Exadata 5-2 half rack, ORE 1.4.1, ROracle 1.2-1, and R 3.2.0. Logging is turned off on the datastore table (see the performance tip below). The data.frame consists of numeric data.

Comparing Alternatives
When it comes to accessing and saving data for use with R, there are several options, including: CSV file, .Rdata file, and the ORE datastore. Each comes with its own advantages.

CSV
“Comma separated value” or CSV files are generally portable, provide a common representation for exporting/importing data, and can be readily loaded into a range of applications. However, flat files need to be managed and often have inadequate security, auditing, backup, and recovery. As we’ll see, CSV files provide significantly slower read and write times compared to .Rdata and the ORE datastore.

.Rdata
R’s native .Rdata flat file representation is generally efficient for reading/writing R objects since the objects are in serialized form, i.e., not converted to a textual representation as CSV data are. However, .Rdata flat files also need to be managed and often have inadequate security, auditing, backup, and recovery. While faster than CSV read and write times, .Rdata is slower than the ORE datastore. Being an R-specific format, access is limited to the R environment, which may or may not be a concern.

ORE Datastore
ORE’s datastore capability allows users to organize and manage all data in a single location – the Oracle Database. This centralized repository provides Oracle Database quality security, auditing, backup, and recovery. The ORE datastore, as you’ll see below, provides read and write performance that is significantly better than CSV and .Rdata. Of course, just as .Rdata is accessed through R, the datastore is accessed through Oracle Database.

Let’s look at a few benchmark comparisons. First, consider the execution time for loading data using each of these approaches. For 2000 columns, we see that ore.load() is 124X faster than read.csv(), and over 3 times faster than R’s load() function, for 5000 rows.
At 20,000 rows, ore.load() is 198X faster than read.csv() and almost 4 times faster than load(). Considering the time to save data, ore.save() is over 11X faster than write.csv() and over 8X faster than save() at 2000 rows, with that benefit continuing through 20000 rows. Looking at this across even wider data.frames, e.g., adding results for 4000 and 16000 columns, we see a similar performance benefit for the ORE datastore over save/load and write.csv/read.csv. If you are looking to consolidate data while gaining performance benefits along with security, backup, and recovery, the Oracle R Enterprise datastore may be a preferred choice.

Example using ORE Datastore
The ORE datastore functions ore.save() and ore.load() are similar to the corresponding R save() and load() functions. In the following example, we read a CSV data file, save it in the ORE datastore using ore.save(), and associate it with the name “MyDatastore”. Although not shown, multiple objects can be listed in the initial arguments. Note that any R objects can be included here, not just data.frames. From there, we list the contents of the datastore and see that “MyDatastore” is listed with the number of objects stored and the overall size. Next we ask for a summary of the contents of “MyDatastore”, which includes the data.frame ‘dat’. Next we remove ‘dat’ and load the contents of the datastore, reconstituting ‘dat’ as a usable data.frame object. Lastly, we delete the datastore and see that the ORE datastore is empty.

> dat <- read.csv("df.dat")
> dim(dat)
[1]  300 2000
>
> ore.save(dat, name="MyDatastore")
> ore.datastore()
  datastore.name object.count    size       creation.date description
1    MyDatastore            1 4841036 2015-09-01 12:07:38
>
> ore.datastoreSummary("MyDatastore")
  object.name      class    size length row.count col.count
1         dat data.frame 4841036   2000       300      2000
>
> rm(dat)
> ore.load("MyDatastore")
[1] "dat"
>
> ore.delete("MyDatastore")
[1] "MyDatastore"
>
> ore.datastore()
[1] datastore.name object.count size creation.date description
<0 rows> (or 0-length row.names)
>

Performance Tip
The performance of saving R objects to the datastore can be increased by temporarily turning off logging on the table that serves as the datastore in the user’s schema: RQ$DATASTOREINVENTORY. This can be accomplished using the following SQL, which can also be invoked from R:

SQL> alter table RQ$DATASTOREINVENTORY NOLOGGING;

ORE> ore.exec("alter table RQ$DATASTOREINVENTORY NOLOGGING")

While turning off logging speeds up inserts and index creation, it avoids writing the redo log and as such has implications for database recovery. It can be used in combination with explicit backups before and after loading data.
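For readers who want to reproduce the CSV vs. .Rdata vs. datastore comparison in outline on their own data, a rough sketch follows. The data.frame size and file names are arbitrary choices, and an established ORE connection is assumed for the datastore calls.

# Rough save/load timing comparison: CSV vs. .Rdata vs. ORE datastore
dat <- as.data.frame(matrix(rnorm(300 * 2000), nrow = 300))   # 300 rows x 2000 columns

system.time(write.csv(dat, "df.csv", row.names = FALSE))
system.time(read.csv("df.csv"))

system.time(save(dat, file = "df.Rdata"))
system.time(load("df.Rdata"))

system.time(ore.save(dat, name = "MyDatastore", overwrite = TRUE))
system.time(ore.load("MyDatastore"))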


Best Practices

Oracle R Advanced Analytics for Hadoop on the Fast Lane: Spark-based Logistic Regression and MLP Neural Networks

This is the first in a series of blogs that is going to explore the capabilities of the newly released Oracle R Advanced Analytics for Hadoop (ORAAH) 2.5.0, part of Oracle Big Data Connectors, which includes two new algorithm implementations that can take advantage of an Apache Spark cluster for significant performance gains in model build and scoring time. These algorithms are a redesigned version of the Multi-Layer Perceptron Neural Networks (orch.neural) and a brand new implementation of a Logistic Regression model (orch.glm2). Through large scale benchmarks we are going to see the improvements in performance that the new custom algorithms bring to enterprise data science when running on top of a Hadoop cluster with an available Apache Spark infrastructure. In this first part, we are going to compare only the model build performance and feasibility of the new algorithms against the same algorithms running on Map-Reduce, and we are not going to be concerned with model quality or precision. Model scoring, quality, and precision are going to be part of a future blog. The documentation on the new components can be found in the product itself (with help and sample code), and also on the ORAAH 2.5.0 Documentation tab on OTN.

Hardware and Software used for testing
As a test bed, we are using an Oracle Big Data Appliance X3-2 cluster with 6 nodes. Each node consists of two Intel® Xeon® 6-core X5675 (3.07 GHz) processors, for a total of 12 cores (24 threads), and 96 GB of RAM is available per node. The BDA nodes run Oracle Enterprise Linux release 6.5 and Cloudera Hadoop Distribution 5.3.0 (which includes Apache Spark release 1.2.0). Each node is also running Oracle R Distribution release 3.1.1 and Oracle R Advanced Analytics for Hadoop release 2.5.0.

Dataset used for Testing
For the test we are going to use a classic dataset that consists of arrival and departure information for all major airports in the USA. The data is available online in different formats. The most used one contains 123 million records and has been used for many benchmarks, originally cited by the American Statistical Association for their Data Expo 2009. We have augmented the data available in that file by downloading additional months of data from the official Bureau of Transportation Statistics website. Our starting point is going to be this new dataset, which contains 159 million records and has information up to September 2014. For smaller tests, we created simple subsets of this dataset with 1, 10 and 100 million records. We also created a 1 billion-record dataset by appending the 159 million-record data over and over until we reached 1 billion records.

Connecting to a Spark Cluster
In release 2.5.0 we are introducing a new set of R commands that allow the data scientist to request the creation of a Spark context, in either YARN or Standalone mode. For this release, the Spark context is used exclusively for accelerating the creation of levels and factor variables, the model matrix, the final solution to the Logistic Regression and Neural Networks models themselves, and scoring (in the case of GLM). The new commands are highlighted below:

spark.connect(master, name = NULL, memory = NULL, dfs.namenode = NULL)

spark.connect() requires loading the ORCH library first to read the configuration of the Hadoop cluster. The "master" argument can be specified as either "yarn-client", to use YARN for resource allocation, or a direct reference to a Spark Master service and port, in which case Spark is used in Standalone Mode.
The "name" argument is optional, and it helps centralize logging of the session on the Spark Master. By default, the application name showing on the Spark Master is "ORCH". The "memory" field indicates the amount of memory per Spark Worker to dedicate to this Spark context. Finally, dfs.namenode points to the Apache HDFS NameNode server, in order to exchange information with HDFS. In summary, to establish a Spark connection, one could do:

> spark.connect("yarn-client", memory="2g", dfs.namenode="my.namenode.server.com")

Conversely, to disconnect the session after the work is done, you can use spark.disconnect() without options. The command spark.connected() checks the status of the current session and contains the information of the connection to the server. It is automatically called by the new algorithms to check for a valid connection. ORAAH 2.5.0 introduces support for loading data to the Spark cache from an HDFS file via the function hdfs.toRDD(). ORAAH dfs.id objects were also extended to support both data residing in HDFS and in Spark memory, and allow the user to cache the HDFS data to an RDD object for use with the new algorithms. For all the examples used in this blog, we used the following commands in the R session:

Oracle Distribution of R version 3.1.1 (--) -- "Sock it to Me"
Copyright (C) The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

You are using Oracle's distribution of R. Please contact
Oracle Support for any problems you encounter with this
distribution.

[Workspace loaded from ~/.RData]

> library(ORCH)
Loading required package: OREbase

Attaching package: ‘OREbase’

The following objects are masked from ‘package:base’:

    cbind, data.frame, eval, interaction, order, paste, pmax, pmin, rbind, table

Loading required package: OREstats
Loading required package: MASS
Loading required package: ORCHcore
Loading required package: rJava
Oracle R Connector for Hadoop 2.5.0 (rev. 307)
Info: using native C base64 encoding implementation
Info: Hadoop distribution is Cloudera's CDH v5.3.0
Info: using auto-detected ORCH HAL v4.2
Info: HDFS workdir is set to "/user/oracle"
Warning: mapReduce checks are skipped due to "ORCH_MAPRED_CHECK"=FALSE
Warning: HDFS checks are skipped due to "ORCH_HDFS_CHECK"=FALSE
Info: Hadoop 2.5.0-cdh5.3.0 is up
Info: Sqoop 1.4.5-cdh5.3.0 is up
Info: OLH 3.3.0 is up
Info: loaded ORCH core Java library "orch-core-2.5.0-mr2.jar"
Loading required package: ORCHstats
>
> # Spark Context Creation
> spark.connect(master="spark://my.spark.server:7077", memory="24G", dfs.namenode="my.dfs.namenode.server")
>

In this case, we are requesting the usage of 24 GB of RAM per node in Standalone Mode. Since our BDA has 6 nodes, the total RAM assigned to our Spark context is 144 GB, which can be verified in the Spark Master screen shown below.

GLM – Logistic Regression
In this release, because of a totally overhauled computation engine, we created a new function called orch.glm2() that executes the Logistic Regression model exclusively using Apache Spark as the platform.
The input data expected by the algorithm is an ORAAH dfs.id object, which means an HDFS CSV dataset, a Hive table that was made compatible by using the hdfs.fromHive() command, or an HDFS CSV dataset that has been cached into Apache Spark as an RDD object using the command hdfs.toRDD(). A simple example of the new algorithm running on the ONTIME dataset with 1 billion records is shown below. The objective of the test model is the prediction of cancelled flights. The new model requires the indication of a factor variable as an F() in the formula, and the default (and only family available in this release) is binomial(). The R code and the output below assume that the connection to the Spark cluster is already done.

> # Attaches the HDFS file for use within R
> ont1bi <- hdfs.attach("/user/oracle/ontime_1bi")
> # Checks the size of the Dataset
> hdfs.dim(ont1bi)
[1] 1000000000         30

> # Testing the GLM Logistic Regression Model on Spark
> # Formula definition: Cancelled flights (0 or 1) based on other attributes
> form_oraah_glm2 <- CANCELLED ~ DISTANCE + ORIGIN + DEST + F(YEAR) + F(MONTH) +
+   F(DAYOFMONTH) + F(DAYOFWEEK)

> # ORAAH GLM2 Computation from HDFS data (computing factor levels on its own)
> system.time(m_spark_glm <- orch.glm2(formula=form_oraah_glm2, ont1bi))
 ORCH GLM: processed 6 factor variables, 25.806 sec
 ORCH GLM: created model matrix, 100128 partitions, 32.871 sec
 ORCH GLM: iter  1,  deviance   1.38433414089348300E+09,  elapsed time 9.582 sec
 ORCH GLM: iter  2,  deviance   3.39315388583931150E+08,  elapsed time 9.213 sec
 ORCH GLM: iter  3,  deviance   2.06855738812683250E+08,  elapsed time 9.218 sec
 ORCH GLM: iter  4,  deviance   1.75868100359263200E+08,  elapsed time 9.104 sec
 ORCH GLM: iter  5,  deviance   1.70023181759611580E+08,  elapsed time 9.132 sec
 ORCH GLM: iter  6,  deviance   1.69476890425481350E+08,  elapsed time 9.124 sec
 ORCH GLM: iter  7,  deviance   1.69467586045954760E+08,  elapsed time 9.077 sec
 ORCH GLM: iter  8,  deviance   1.69467574351380850E+08,  elapsed time 9.164 sec
   user  system elapsed
 84.107   5.606 143.591

> # Shows the general features of the GLM Model
> summary(m_spark_glm)
               Length Class  Mode
coefficients   846    -none- numeric
deviance         1    -none- numeric
solutionStatus   1    -none- character
nIterations      1    -none- numeric
formula          1    -none- character
factorLevels     6    -none- list

A sample benchmark against the same models running on Map-Reduce is illustrated below. The Map-Reduce models used the call orch.glm(formula, dfs.id, family=binomial()), and used as.factor() in the formula. We can see that the Spark-based GLM2 is capable of a large performance advantage over the model executing in Map-Reduce. Later in this blog we are going to see the performance of the Spark-based GLM Logistic Regression on 1 billion records.

Linear Model with Neural Networks
For the MLP Neural Networks model, the same algorithm was adapted to execute using Spark caching. The exact same code and function call will recognize whether there is a connection to a Spark context and, if so, will execute the computations using it. In this case, the code for both the Map-Reduce and the Spark-based executions is exactly the same, with the exception of the spark.connect() call that is required for the Spark-based version to kick in. The objective of the test model is the prediction of Arrival Delays of Flights in minutes, so the model class is a Regression Model.
The R code used to run the benchmarks and the output are below, and it is assumed that the connection to the Spark cluster is already done.

> # Attaches the HDFS file for use within R
> ont1bi <- hdfs.attach("/user/oracle/ontime_1bi")
> # Checks the size of the Dataset
> hdfs.dim(ont1bi)
[1] 1000000000         30

> # Testing Neural Model on Spark
> # Formula definition: Arrival Delay based on other attributes
> form_oraah_neu <- ARRDELAY ~ DISTANCE + ORIGIN + DEST + as.factor(MONTH) +
+   as.factor(YEAR) + as.factor(DAYOFMONTH) + as.factor(DAYOFWEEK)

> # Compute Factor Levels from HDFS data
> system.time(xlev <- orch.getXlevels(form_oraah_neu, dfs.dat = ont1bi))
   user  system elapsed
 17.717   1.348  50.495

> # Compute and Cache the Model Matrix from HDFS data, passing factor levels
> system.time(Mod_Mat <- orch.prepare.model.matrix(form_oraah_neu, dfs.dat = ont1bi, xlev=xlev))
   user  system elapsed
 17.933   1.524  95.624

> # Compute Neural Model from RDD cached Model Matrix
> system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat, xlev=xlev, trace=T))
Unconstrained Nonlinear Optimization
L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
Iter           Objective Value   Grad Norm        Step   Evals
  1   5.08900838381104858E+11   2.988E+12   4.186E-16       2
  2   5.08899723803646790E+11   2.987E+12   1.000E-04       3
  3   5.08788839748061768E+11   2.958E+12   1.000E-02       3
  4   5.07751213455999573E+11   2.662E+12   1.000E-01       4
  5   5.05395855303159180E+11   1.820E+12   3.162E-01       1
  6   5.03327619811536194E+11   2.517E+09   1.000E+00       1
  7   5.03327608118144775E+11   2.517E+09   1.000E+00       6
  8   4.98952182330299011E+11   1.270E+12   1.000E+00       1
  9   4.95737805642779968E+11   1.504E+12   1.000E+00       1
 10   4.93293224063758362E+11   8.360E+11   1.000E+00       1
 11   4.92873433106044373E+11   1.989E+11   1.000E+00       1
 12   4.92843500119498352E+11   9.659E+09   1.000E+00       1
 13   4.92843044802041565E+11   6.888E+08   1.000E+00       1
Solution status             Optimal (objMinProgress)
Number of L-BFGS iterations 13
Number of objective evals   27
Objective value             4.92843e+11
Gradient norm               6.88777e+08
   user  system elapsed
 43.635   4.186  63.319

> # Checks the general information of the Neural Network Model
> mod_neu
Number of input units      845
Number of output units     1
Number of hidden layers    0
Objective value            4.928430E+11
Solution status            Optimal (objMinProgress)
Output layer               number of neurons 1, activation 'linear'
Optimization solver        L-BFGS
Scale Hessian inverse      1
Number of L-BFGS updates   20
> mod_neu$nObjEvaluations
[1] 27
> mod_neu$nWeights
[1] 846

A sample benchmark against the same models running on Map-Reduce is illustrated below. The Map-Reduce models used the exact same orch.neural() calls as the Spark-based ones, with only the Spark connection as a difference. We can clearly see that the larger the dataset, the larger the difference in speed of the Spark-based computation compared to the Map-Reduce one, reducing the times from many hours to a few minutes. This new performance makes it possible to run much larger problems and test several models on 1 billion records, something that previously took half a day just to run one model.
Logistic and Deep Neural Networks with 1 billion records

To prove that it is now feasible to run not only Logistic and Linear Model Neural Networks on large scale datasets, but also complex Multi-Layer Neural Network Models, we decided to test the same 1 billion record dataset against several different architectures. These tests were done to check the performance and feasibility of these types of models, and not to compare precision or quality, which will be part of a different Blog.

The default activation function was used for all Multi-Layer Neural Network models, which is the bipolar sigmoid function, and the default output layer activation was also used, which is the linear function.

As a reminder, the number of weights we need to compute for a Neural Network is as follows. The generic formulation for the number of weights to be computed is:

Total Number of weights = SUM over all Layers, from the First Hidden Layer to the Output, of [(Number of inputs into each Layer + 1) * Number of Neurons]

In the simple example, we had [(3 inputs + 1 bias) * 2 neurons] + [(2 neurons + 1 bias) * 1 output] = 8 + 3 = 11 weights. In our tests for the Simple Neural Network model (Linear Model), using the same formula, we can see that we were computing 846 weights, because it uses 845 inputs plus the Bias.

Thus, the number of weights necessary for the Deep Multi-layer Neural Networks that we are about to test below is the following:

MLP 3 Layers (50,25,10) => [(845+1)*50] + [(50+1)*25] + [(25+1)*10] + [(10+1)*1] = 43,846 weights
MLP 4 Layers (100,50,25,10) => [(845+1)*100] + [(100+1)*50] + [(50+1)*25] + [(25+1)*10] + [(10+1)*1] = 91,196 weights
MLP 5 Layers (200,100,50,25,10) => [(845+1)*200] + [(200+1)*100] + [(100+1)*50] + [(50+1)*25] + [(25+1)*10] + [(10+1)*1] = 195,896 weights

The times required to compute the GLM Logistic Regression Model that predicts Flight Cancellations on 1 billion records are included just as an illustration of the performance of the new Spark-based algorithms. The Neural Network Models are all predicting Arrival Delay of Flights, so they are either Linear Models (the first one, with no Hidden Layers) or Non-linear Models using the bipolar sigmoid activation function (the Multi-Layer ones).

This demonstrates that the capability of building Very Complex and Deep Networks is available with ORAAH, and it makes it possible to build networks with hundreds of thousands or millions of weights for more complex problems. Not only that, but a Logistic Model can be computed on 1 billion records in less than two and a half minutes, and a Linear Neural Model in almost 3 minutes.

The R Output Listings of the Logistic Regression computation and of the MLP Neural Networks are below.
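Before the listings, the layer-by-layer formula above can be cross-checked with a few lines of R. This is only an arithmetic check; the helper function below is ours for illustration and is not part of ORAAH:

# Weights of a fully connected feed-forward network: each layer contributes
# (number of inputs into the layer + 1 bias) * number of neurons in the layer.
count.weights <- function(n.inputs, hidden.sizes, n.outputs = 1) {
  layer.sizes <- c(n.inputs, hidden.sizes, n.outputs)
  sum((head(layer.sizes, -1) + 1) * tail(layer.sizes, -1))
}

count.weights(845, NULL)                     # 846    (linear model, no hidden layers)
count.weights(845, c(50, 25, 10))            # 43846
count.weights(845, c(100, 50, 25, 10))       # 91196
count.weights(845, c(200, 100, 50, 25, 10))  # 195896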
> # Spark Context Creation> spark.connect(master="spark://my.spark.server:7077", memory="24G",dfs.namenode="my.dfs.namenode.server")> # Attaches the HDFS file for use with ORAAH> ont1bi <- hdfs.attach("/user/oracle/ontime_1bi")> # Checks the size of the Dataset> hdfs.dim(ont1bi)[1] 1000000000         30 GLM - Logistic Regression > # Testing GLM Logistic Regression on Spark> # Formula definition: Cancellation of Flights in relation to other attributes> form_oraah_glm2 <- CANCELLED ~ DISTANCE + ORIGIN + DEST + F(YEAR) + F(MONTH) ++   F(DAYOFMONTH) + F(DAYOFWEEK)> # ORAAH GLM2 Computation from RDD cached data (computing factor levels on its own)> system.time(m_spark_glm <- orch.glm2(formula=form_oraah_glm2, ont1bi)) ORCH GLM: processed 6 factor variables, 25.806 sec ORCH GLM: created model matrix, 100128 partitions, 32.871 sec ORCH GLM: iter  1,  deviance   1.38433414089348300E+09,  elapsed time 9.582 sec ORCH GLM: iter  2,  deviance   3.39315388583931150E+08,  elapsed time 9.213 sec ORCH GLM: iter  3,  deviance   2.06855738812683250E+08,  elapsed time 9.218 sec ORCH GLM: iter  4,  deviance   1.75868100359263200E+08,  elapsed time 9.104 sec ORCH GLM: iter  5,  deviance   1.70023181759611580E+08,  elapsed time 9.132 sec ORCH GLM: iter  6,  deviance   1.69476890425481350E+08,  elapsed time 9.124 sec ORCH GLM: iter  7,  deviance   1.69467586045954760E+08,  elapsed time 9.077 sec ORCH GLM: iter  8,  deviance   1.69467574351380850E+08,  elapsed time 9.164 secuser  system elapsed84.107   5.606 143.591 > # Checks the general information of the GLM Model> summary(m_spark_glm)               Length Class  Mode    coefficients   846    -none- numeric deviance         1    -none- numeric solutionStatus   1    -none- characternIterations      1    -none- numeric formula          1    -none- characterfactorLevels     6    -none- list  Neural Networks - Initial StepsFor the Neural Models, we have to add the times for computing the Factor Levels plus the time for creating the Model Matrix to the Total elapsed time of the Model computation itself. 
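For instance, using the elapsed times reported in the listings that follow, the end-to-end time for the three-layer network is the sum of the three elapsed measurements (a simple illustration of the bookkeeping, not additional output):

# elapsed seconds: factor levels + model matrix + model computation
48.765 + 92.953 + 1975.947   # about 2118 seconds, i.e. roughly 35 minutes end-to-end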
> # Testing Neural Model on Spark> # Formula definition> form_oraah_neu <- ARRDELAY ~ DISTANCE + ORIGIN + DEST + as.factor(MONTH) ++   as.factor(YEAR) + as.factor(DAYOFMONTH) + as.factor(DAYOFWEEK)>> # Compute Factor Levels from HDFS data> system.time(xlev <- orch.getXlevels(form_oraah_neu, dfs.dat = ont1bi))  user  system elapsed12.598   1.431  48.765>> # Compute and Cache the Model Matrix from cached RDD data> system.time(Mod_Mat <- orch.prepare.model.matrix(form_oraah_neu, dfs.dat = ont1bi,xlev=xlev))  user  system elapsed  9.032   0.960  92.953 < Neural Networks Model with 3 Layers of Neurons > # Compute DEEP Neural Model from RDD cached Model Matrix (passing xlevels)> # Three Layers, with 50, 25 and 10 neurons respectively.> system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat,+                                    xlev=xlev, hiddenSizes=c(50,25,10),trace=T))Unconstrained Nonlinear OptimizationL-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)Iter           Objective Value   Grad Norm        Step   Evals  0   5.12100202340115967E+11   5.816E+09   1.000E+00       1  1   4.94849165811250305E+11   2.730E+08   1.719E-10       1  2   4.94849149028958862E+11   2.729E+08   1.000E-04       3  3   4.94848409777413513E+11   2.702E+08   1.000E-02       3  4   4.94841423640935242E+11   2.437E+08   1.000E-01       4  5   4.94825372589270386E+11   1.677E+08   3.162E-01       1  6   4.94810879175052673E+11   1.538E+07   1.000E+00       1  7   4.94810854064597107E+11   1.431E+07   1.000E+00       1Solution status             Optimal (objMinProgress)Number of L-BFGS iterations 7Number of objective evals   15Objective value             4.94811e+11Gradient norm               1.43127e+07    user   system  elapsed  91.024    8.476 1975.947>> # Checks the general information of the Neural Network Model> mod_neuNumber of input units      845Number of output units     1Number of hidden layers    3Objective value            4.948109E+11Solution status            Optimal (objMinProgress)Hidden layer [1]           number of neurons 50, activation 'bSigmoid'Hidden layer [2]           number of neurons 25, activation 'bSigmoid'Hidden layer [3]           number of neurons 10, activation 'bSigmoid'Output layer               number of neurons 1, activation 'linear'Optimization solver        L-BFGSScale Hessian inverse      1Number of L-BFGS updates   20> mod_neu$nObjEvaluations[1] 15> mod_neu$nWeights[1] 43846> Neural Networks Model with 4 Layers of Neurons > # Compute DEEP Neural Model from RDD cached Model Matrix (passing xlevels)> # Four Layers, with 100, 50, 25 and 10 neurons respectively.> system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat,+                                    xlev=xlev, hiddenSizes=c(100,50,25,10),trace=T))Unconstrained Nonlinear OptimizationL-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)Iter           Objective Value   Grad Norm        Step   Evals   0   5.15274440087001343E+11   7.092E+10   1.000E+00       1   1   5.10168177067538818E+11   2.939E+10   1.410E-11       1   2   5.10086354184862549E+11   5.467E+09   1.000E-02       2   3   5.10063808510261475E+11   5.463E+09   1.000E-01       4   4   5.07663007172408386E+11   5.014E+09   3.162E-01       1   5   4.97115989230861267E+11   2.124E+09   1.000E+00       1   6   4.94859162124700928E+11   3.085E+08   1.000E+00       1   7   4.94810727630636963E+11   2.117E+07   1.000E+00       1   8   4.94810490064279846E+11   7.036E+06   1.000E+00       1Solution status             Optimal 
(objMinProgress)Number of L-BFGS iterations 8Number of objective evals   13Objective value             4.9481e+11Gradient norm               7.0363e+06    user   system  elapsed166.169   19.697 6467.703>> # Checks the general information of the Neural Network Model> mod_neuNumber of input units      845Number of output units     1Number of hidden layers    4Objective value            4.948105E+11Solution status            Optimal (objMinProgress)Hidden layer [1]           number of neurons 100, activation 'bSigmoid'Hidden layer [2]           number of neurons 50, activation 'bSigmoid'Hidden layer [3]           number of neurons 25, activation 'bSigmoid'Hidden layer [4]           number of neurons 10, activation 'bSigmoid'Output layer               number of neurons 1, activation 'linear'Optimization solver        L-BFGSScale Hessian inverse      1Number of L-BFGS updates   20> mod_neu$nObjEvaluations[1] 13> mod_neu$nWeights[1] 91196 Neural Networks Model with 5 Layers of Neurons > # Compute DEEP Neural Model from RDD cached Model Matrix (passing xlevels)> # Five Layers, with 200, 100, 50, 25 and 10 neurons respectively.> system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat,+                                    xlev=xlev, hiddenSizes=c(200,100,50,25,10),trace=T))Unconstrained Nonlinear OptimizationL-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)Iter           Objective Value   Grad Norm        Step   Evals  0   5.14697806831633850E+11   6.238E+09   1.000E+00       1 ............  6   4.94837221890043518E+11   2.293E+08   1.000E+00  1                                                                     7   4.94810299190365112E+11   9.268E+06   1.000E+00       1 8   4.94810277908935242E+11   8.855E+06   1.000E+00       1Solution status             Optimal (objMinProgress)Number of L-BFGS iterations 8Number of objective evals   16Objective value             4.9481e+11Gradient norm               8.85457e+06     user    system   elapsed  498.002    90.940 30473.421>> # Checks the general information of the Neural Network Model> mod_neuNumber of input units      845Number of output units     1Number of hidden layers    5Objective value            4.948103E+11Solution status            Optimal (objMinProgress)Hidden layer [1]           number of neurons 200, activation 'bSigmoid'Hidden layer [2]           number of neurons 100, activation 'bSigmoid'Hidden layer [3]           number of neurons 50, activation 'bSigmoid'Hidden layer [4]           number of neurons 25, activation 'bSigmoid'Hidden layer [5]           number of neurons 10, activation 'bSigmoid'Output layer               number of neurons 1, activation 'linear'Optimization solver        L-BFGSScale Hessian inverse      1Number of L-BFGS updates   20> mod_neu$nObjEvaluations[1] 16> mod_neu$nWeights[1] 195896 Best Practices on logging level for using Apache Spark with ORAAHIt is important to note that Apache Spark’s log is by default verbose, so it might be useful after a few tests with different settings to turn down the level of logging, something a System Administrator typically will do by editing the file $SPARK_HOME/etc/log4j.properties (see Best Practices below).By default, that file is going to look something like this: # cat $SPARK_HOME/etc/log4j.properties # Set everything to be logged to the consolelog4j.rootCategory=INFO, 
consolelog4j.appender.console=org.apache.log4j.ConsoleAppenderlog4j.appender.console.target=System.errlog4j.appender.console.layout=org.apache.log4j.PatternLayoutlog4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n# Settings to quiet third party logs that are too verboselog4j.logger.org.eclipse.jetty=INFOlog4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFOlog4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFOA typical full log will provide the below information, but also might provide too much logging when running the Models themselves, so it will be more useful for the first tests and diagnostics. > # Creates the Spark Context. Because the Memory setting is not specified,> # the defaults of 1 GB of RAM per Spark Worker is used> spark.connect("yarn-client", dfs.namenode="my.hdfs.namenode") 15/02/18 13:05:44 WARN SparkConf:SPARK_JAVA_OPTS was detected (set to '-Djava.library.path=/usr/lib64/R/lib'). This is deprecated in Spark 1.0+. Please instead use:- ./spark-submit with conf/spark-defaults.conf to set defaults for an application- ./spark-submit with --driver-java-options to set -X options for a driver- spark.executor.extraJavaOptions to set -X options for executors- SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker) 15/02/18 13:05:44 WARN SparkConf: Setting 'spark.executor.extraJavaOptions' to '- Djava.library.path=/usr/lib64/R/lib' as a work-around.15/02/18 13:05:44 WARN SparkConf: Setting 'spark.driver.extraJavaOptions' to '- Djava.library.path=/usr/lib64/R/lib' as a work-around15/02/18 13:05:44 INFO SecurityManager: Changing view acls to: oracle15/02/18 13:05:44 INFO SecurityManager: Changing modify acls to: oracle15/02/18 13:05:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(oracle); users with modify permissions: Set(oracle) 15/02/18 13:05:44 INFO Slf4jLogger: Slf4jLogger started15/02/18 13:05:44 INFO Remoting: Starting remoting15/02/18 13:05:45 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@my.spark.master:35936]15/02/18 13:05:45 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@my.spark.master:35936]15/02/18 13:05:46 INFO SparkContext: Added JAR /u01/app/oracle/product/12.2.0/dbhome_1/R/library/ORCHcore/java/orch-core-2.4.1-mr2.jar at http://1.1.1.1:11491/jars/orch-core-2.4.1-mr2.jar with timestamp 142426474610015/02/18 13:05:46 INFO SparkContext: Added JAR /u01/app/oracle/product/12.2.0/dbhome_1/R/library/ORCHcore/java/orch-bdanalytics-core-2.4.1- mr2.jar at http://1.1.1.1:11491/jars/orch-bdanalytics-core-2.4.1-mr2.jar with timestamp 1424264746101 15/02/18 13:05:46 INFO RMProxy: Connecting to ResourceManager at my.hdfs.namenode /10.153.107.85:8032 Utils: Successfully started service 'sparkDriver' on port 35936. SparkEnv: Registering MapOutputTrackerSparkEnv: Registering BlockManagerMasterDiskBlockManager: Created local directory at /tmp/spark-local- MemoryStore: MemoryStore started with capacity 265.1 MBHttpFileServer: HTTP File server directory is /tmp/spark-7c65075f-850c- HttpServer: Starting HTTP ServerUtils: Successfully started service 'HTTP file server' on port 11491. 
Utils: Successfully started service 'SparkUI' on port 4040.SparkUI: Started SparkUI at http://my.hdfs.namenode:4040 15/02/18 13:05:46 INFO Client: Requesting a new application from cluster with 1 NodeManagers 15/02/18 13:05:46 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container) 15/02/18 13:05:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/02/18 13:05:46 INFO Client: Uploading resource file:/opt/cloudera/parcels/CDH-5.3.1- 1.cdh5.3.1.p0.5/lib/spark/lib/spark-assembly.jar -> hdfs://my.hdfs.namenode:8020/user/oracle/.sparkStaging/application_1423701785613_0009/spark- assembly.jar 15/02/18 13:05:47 INFO Client: Setting up the launch environment for our AM container15/02/18 13:05:47 INFO SecurityManager: Changing view acls to: oracle15/02/18 13:05:47 INFO SecurityManager: Changing modify acls to: oracle15/02/18 13:05:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(oracle); users with modify permissions: Set(oracle) 15/02/18 13:05:47 INFO Client: Submitting application 9 to ResourceManager 15/02/18 13:05:47 INFO YarnClientImpl: Submitted application application_1423701785613_0009 15/02/18 13:05:48 INFO Client: Application report for application_1423701785613_0009 (state: ACCEPTED) 13:05:48 INFO Client:client token: N/Adiagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: root.oracle start time: 1424264747559 final status: UNDEFINED tracking URL: http:// my.hdfs.namenode 13:05:46 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB 13:05:46 INFO Client: Setting up container launch context for our AM 13:05:46 INFO Client: Preparing resources for our AM container my.hdfs.namenode:8088/proxy/application_1423701785613_0009/ user: oracle 15/02/18 13:05:49 INFO Client: Application report for application_1423701785613_0009 (state: ACCEPTED)15/02/18 13:05:50 INFO Client: Application report for application_1423701785613_0009 (state: ACCEPTED) Please note that all those warnings are expected, and may vary depending on the release of Spark used. With the Console option in the log4j.properties settings are lowered from INFO to WARN, the request for a Spark Context would return the following: # cat $SPARK_HOME/etc/log4j.properties # Set everything to be logged to the consolelog4j.rootCategory=INFO, consolelog4j.appender.console=org.apache.log4j.ConsoleAppenderlog4j.appender.console.target=System.errlog4j.appender.console.layout=org.apache.log4j.PatternLayoutlog4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n# Settings to quiet third party logs that are too verboselog4j.logger.org.eclipse.jetty=WARNlog4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFOlog4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFONow the R Log is going to show only a few details about the Spark Connection. > # Creates the Spark Context. 
Because the Memory setting is not specified,> # the defaults of 1 GB of RAM per Spark Worker is used> spark.connect(master="yarn-client", dfs.namenode="my.hdfs.server")15/04/09 23:32:11 WARN SparkConf:SPARK_JAVA_OPTS was detected (set to '-Djava.library.path=/usr/lib64/R/lib').This is deprecated in Spark 1.0+.Please instead use:- ./spark-submit with conf/spark-defaults.conf to set defaults for an application- ./spark-submit with --driver-java-options to set -X options for a driver- spark.executor.extraJavaOptions to set -X options for executors- SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker) 15/04/09 23:32:11 WARN SparkConf: Setting 'spark.executor.extraJavaOptions' to '-Djava.library.path=/usr/lib64/R/lib' as a work-around.15/04/09 23:32:11 WARN SparkConf: Setting 'spark.driver.extraJavaOptions' to '-Djava.library.path=/usr/lib64/R/lib' as a work-around.15/04/09 23:32:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicableFinally, with the Console logging option in the log4j.properties file set to ERROR instead of INFO or WARN, the request for a Spark Context would return nothing in case of success: # cat $SPARK_HOME/etc/log4j.properties # Set everything to be logged to the consolelog4j.rootCategory=INFO, consolelog4j.appender.console=org.apache.log4j.ConsoleAppenderlog4j.appender.console.target=System.errlog4j.appender.console.layout=org.apache.log4j.PatternLayoutlog4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n# Settings to quiet third party logs that are too verboselog4j.logger.org.eclipse.jetty=ERRORlog4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFOlog4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFOThis time there is no message returned back to the R Session, because we requested it to only return feedback in case of an error: > # Creates the Spark Context. Because the Memory setting is not specified,> # the defaults of 1 GB of RAM per Spark Worker is used> spark.connect(master="yarn-client", dfs.namenode="my.hdfs.server")>In summary, it is practical to start any Project with the full logging, but it would be a good idea to bring the level of logging down to WARN or ERROR after the environment has been tested to be working OK and the settings are stable.
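As a side note on the connection calls shown in this post: the log comments above mention the default of 1 GB of RAM per Spark Worker. Below is a hedged example of requesting more executor memory explicitly, reusing only arguments that appear earlier in this post (the server name is a placeholder):

spark.connect(master = "yarn-client",
              memory = "24G",
              dfs.namenode = "my.hdfs.server")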

This is the first in a series of blogs that is going to explore the capabilities of the newly released Oracle R Advanced Analytics for Hadoop 2.5.0, part of Oracle Big Data Connectors, which...

Best Practices

ROracle 1.2-1 released

We are pleased to announce the latest update of the open source ROracle package, version 1.2-1, with enhancements and bug fixes. ROracle provides high performance and scalable interaction between R and Oracle Database. In addition to availability on CRAN, ROracle binaries specific to Windows and other platforms can be downloaded from the Oracle Technology Network. Users of ROracle, please take our brief survey. Your feedback is important and we want to hear from you!

Latest enhancements in version 1.2-1 include:
• Support for NATIVE, UTF8 and LATIN1 encoded data in query and results
• enhancement 20603162 - CLOB/BLOB enhancement; see the man page on attributes ora.type, ora.encoding, ora.maxlength, and ora.fractional_seconds_precision
• bug 15937661 - mapping of dbWriteTable BLOB, CLOB, NCLOB, NCHAR and NVARCHAR columns. Data frame mapping to Oracle Database type is provided
• bug 16017358 - proper handling of NULL extproc context when passed in ORE embedded R execution
• bug 16907374 - ROracle creates a time stamp column for R Date with dbWriteTable
• ROracle now displays NCHAR, NVARCHAR2 and NCLOB data types defined for columns in the server using dbColumnInfo and dbGetInfo

In addition, enhancements in the previous release of ROracle, version 1.1-12, include:
• Added the bulk_write parameter to specify the number of rows to bind at a time to improve performance for dbWriteTable and DML operations
• Date, Timestamp, Timestamp with time zone and Timestamp with local time zone data are maintained in R and Oracle's session time zone. The Oracle session time zone environment variable ORA_SDTZ and R's environment variable TZ must be the same for this to work, else an error is reported when operating on any of these column types
• bug 16198839 - Allow selecting data from time stamp with time zone and time stamp with local time zone without reporting error 1805
• bug 18316008 - increases the bind limit from 4000 to 2GB for inserting data into BLOB, CLOB and 32K VARCHAR and RAW data types. Changes describe lengths to NA for all types except for CHAR, VARCHAR2, and RAW
• and other performance improvements and bug fixes

See the ROracle NEWS for the complete list of updates. We encourage ROracle users to post questions and provide feedback on the Oracle R Technology Forum and the ROracle survey. ROracle is not only a high performance interface to Oracle Database from R for direct use; it also supports database access for Oracle R Enterprise, part of the Oracle Advanced Analytics option to Oracle Database.
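For readers new to the package, here is a minimal ROracle session sketch built on the standard DBI-style calls mentioned above (dbWriteTable, dbColumnInfo); the credentials, connect string and table name are hypothetical placeholders:

library(ROracle)
drv <- dbDriver("Oracle")
# Placeholder connection details - replace with your own credentials / connect string.
con <- dbConnect(drv, username = "scott", password = "tiger", dbname = "orcl")

df <- data.frame(ID = 1:3, NOTE = c("a", "b", "c"), WHEN = Sys.Date())
dbWriteTable(con, "TEST_TAB", df, overwrite = TRUE)  # R Date is written as a time stamp column

res <- dbSendQuery(con, "select * from TEST_TAB")
dbColumnInfo(res)     # reports the column types defined in the server
dbClearResult(res)
dbDisconnect(con)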

We are pleased to announce the latest update of the open source ROracle package, version 1.2-1, with enhancements and bug fixes. ROracle provides high performance and scalable interaction between R...

Best Practices

Variable Selection with ORE varclus - Part 2

In our previous post we talked about variable selection and introduced a technique based on hierarchical divisive clustering, implemented using the Oracle R Enterprise embedded execution capabilities. In this post we illustrate how to visualize the clustering solution, discuss stopping criteria and highlight some performance aspects.

Plots

The clustering efficiency can be assessed, from a high level perspective, through a visual representation of metrics related to variability. The plot.clusters() function, provided as an example in varclus_lib.R, takes the datastore name, the iteration number (nclust corresponds to the number of clusters after the final iteration) and an output directory, and generates a png output file with two plots.

R> plot.clusters(dsname="datstr.MYDATA",nclust=6,
                 outdir="out.varclus.MYDATA")
unix> ls -1 out.varclus.MYDATA
out.MYDATA.clusters
out.MYDATA.log
plot.datstr.MYDATA.ncl6.png

The upper plot focuses on the last iteration. The x axis represents the cluster id (1 to 6 for six clusters after the 6th and final iteration). The variation explained and the proportion of variation explained (Variation.Explained and Proportion.Explained from 'Cluster Summary') are rendered by the blue curve with units on the left y axis and the red curve with units on the right y axis. Clusters 1, 2, 3, 4 and 6 are well represented by their first principal component. Cluster 5 contains variation which is not well captured by a single component (only 47.8% is explained, as already mentioned in Part 1). This can also be seen from the r2.own values for the variables of Cluster 5, VAR20, VAR26,...,VAR29, between 0.24 and 0.62, indicating that they are not well correlated with the 1st principal component score. For this kind of situation, domain expertise will be needed to evaluate the results and decide the course of action: does it make sense to keep VAR20, VAR26,...,VAR29 clustered together with VAR27 as the representative variable, or should Cluster 5 be further split by lowering eigv2.threshold (below the corresponding Secnd.Eigenval value from the 'Clusters Summary' section)?

The bottom plot illustrates the entire clustering sequence (all iterations). The x axis represents the iteration number, or the number of clusters after that iteration. The total variation explained and the proportion of total variation explained (Tot.Var.Explained and Prop.Var.Explained from 'Grand Summary') are rendered by the blue curve with units on the left y axis and the red curve with units on the right y axis. One can see how Prop.Var.Explained tends to flatten below 90% (86.3% for the last iteration).

For the case above a single cluster was 'weak' and there were no ambiguities about where to start examining the results or searching for issues. Below is the same output for a different problem with 120 variables and 29 final clusters. For this case, the proportion of variation explained by the 1st component (red curve, upper plot) shows several 'weak' clusters: 23, 28, 27, 4, 7, 19. The Prop.Var.Explained is below 60% for these clusters. Which one should be examined first? A good choice could be Cluster 7 because it plays a more important role as measured by the absolute value of Variation.Explained. Here again, domain knowledge will be required to examine these clusters and decide if, and for how long, one should continue the splitting process.
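If the decision is to split Cluster 5 further, one option is to lower the threshold below Cluster 5's second eigenvalue (~0.85) and regenerate the clustering into a new datastore. This is a hypothetical re-run; the argument names follow the ore.doEval() driver call shown in Part 1, and the new datastore name is a placeholder:

R> clust.log <- ore.doEval(FUN.NAME="ore.varclus"
                 ,data.name="MYDATA"
                 ,maxclust=200
                 ,pca="princomp"
                 ,eigv2.threshold=0.8   # below the Secnd.Eigenval of Cluster 5
                 ,dsname="datstr.MYDATA.eig08"
                 ,ore.connect=TRUE)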
Stopping criteria & number of variables

As illustrated in the previous section, the number of final clusters can be raised or reduced by lowering or increasing the eigv2.threshold parameter. For problems with many variables the user may want to stop the iterations early and inspect the clustering results & history before convergence, to gain a better understanding of the variable selection process. Early stopping is achieved through the maxclust argument, as discussed in the previous post, and can also be used if the user wants or has to keep the number of selected variables below an upper limit.

Performance

The clustering runtime is entirely dominated by the cost of the PCA analysis. The 1st split is the most expensive as PCA is run on the entire data; the subsequent splits execute faster and faster as the PCAs handle clusters with fewer and fewer variables. For the 39 variables & 55k rows case presented, the entire run (splitting into 6 clusters, post-processing from the datastore, and plot generation) took ~10s. The 120 variables & 55k rows case required ~54s. For a larger case with 666 variables & 64k rows the execution completed in 112s and generated 128 clusters. These numbers were obtained on an Intel Xeon 2.9GHz OL6 machine. The customer ran cases with more than 600 variables & O[1e6] rows in 5-10 mins.

In our previous post we talked about variable selection and introduced a technique based on hierarchical divisive clustering and implemented using the Oracle R Enterprise embedded execution...

News

Variable Selection with ORE varclus - Part 1

Variable selection, also known as feature or attribute selection, is an important technique for data mining and predictive analytics. It is used when the number of variables is large and has received special attention from application areas where this number is very large (like genomics, combinatorial chemistry, text mining, etc). The underlying hypothesis for variable selection is that the data can contain many variables which are either irrelevant or redundant. Solutions are therefore sought for selecting subsets of these variables which can predict the output with an accuracy comparable to that of the complete input set.

Variable selection serves multiple purposes: (1) it provides faster and more cost-effective model generation, (2) it simplifies the model interpretation as it is based on a (much) smaller and more effective set of predictors, and (3) it supports better generalization because the elimination of irrelevant features can reduce model over-fitting.

There are many approaches for feature selection, differentiated by search techniques, validation methods or optimality considerations. In this blog we will describe a solution based on hierarchical and divisive variable clustering which generates disjoint groups of variables such that each group can be interpreted as essentially uni-dimensional and represented by a single variable from the original set. This solution was developed and implemented during a POC with a customer from the banking sector. The data consisted of tables with several hundred variables and O[1e5-1e6] observations. The customer wanted to build an analysis flow operating with a much smaller number of 'relevant' attributes from the original set, which would best capture the variability expressed in the data.

The procedure is iterative and starts from a single cluster containing all original variables. This cluster is divided into two clusters and the variables are assigned to one or the other of the two child clusters. At every iteration one particular cluster is selected for division and the procedure continues until there are no more suitable candidates for division, or until the user decides to stop the procedure once n clusters were generated (and n representative variables were identified).

The selection criterion for division is related to the variation contained in the candidate cluster, more precisely to how this variation is distributed among its principal components. PCA is performed on the initial (starting) cluster and on every cluster resulting from divisions. If the 2nd eigenvalue is large it means that the variation is distributed between at least two principal axes or components. We are not looking beyond the 2nd eigenvalue, and we divide that cluster's variables into two groups depending on how they are associated with the first two axes of variability. The division process continues until every cluster has variables associated with only one principal component, i.e. until every cluster has a 2nd PCA eigenvalue less than a specified threshold. During the iterative process, the cluster picked for splitting is the one having the largest 2nd eigenvalue. The assignment of variables to clusters is based on the matrix of factor loadings, or the correlation between the original variables and the PCA factors. Actually the factor loadings matrix is not directly used but a rotated matrix which improves separability.
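To make the split test concrete, below is a minimal in-memory sketch of the criterion just described, assuming dat is a numeric data frame holding the variables of one candidate cluster (with more rows than variables). It is not the actual ore.varclus() code, which runs the equivalent steps through embedded R execution:

library(GPArotation)   # provides GPFoblq() for the oblique rotation

split.candidate <- function(dat, eigv2.threshold = 1.0) {
  pc  <- princomp(dat, cor = TRUE)
  ev2 <- pc$sdev[2]^2                       # 2nd eigenvalue of the correlation matrix
  if (ev2 <= eigv2.threshold) return(NULL)  # cluster is essentially uni-dimensional
  L  <- unclass(loadings(pc))[, 1:2]        # loadings on the first two components
  Lr <- GPFoblq(L)$loadings                 # rotated loadings improve separability
  split(colnames(dat), apply(abs(Lr), 1, which.max))  # assign each variable to axis 1 or 2
}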
Details on the principle of factor rotations and the various types of rotations can be found in Choosing the Right Type of Rotation in PCA and EFA and Factor Rotations in Factor Analyses. The rotations are performed with the function GPFoblq() from the GPArotation package, a pre-requisite for ORE varclus. The next sections describe how to run the variable clustering algorithm and interpret the results.

The ORE varclus scripts

The present version of ORE varclus is implemented in a function, ore.varclus(), to be run in embedded execution mode. The driver script example, varclus_run.R, illustrates how to call this function with ore.doEval:

R> clust.log <- ore.doEval(FUN.NAME="ore.varclus"
                 ,data.name="MYDATA"
                 ,maxclust=200
                 ,pca="princomp"
                 ,eigv2.threshold=1.
                 ,dsname="datstr.MYDATA"
                 ,ore.connect=TRUE)

The arguments passed to ore.varclus() are :

ore.varclus() is implemented in the varclus_lib.R script. The script also contains examples of post-processing functions illustrating how to selectively extract results from the datastore and generate reports and plots. The current version of ore.varclus() supports only numerical attributes. Details on the usage of the post-processing functions are provided in the next section.

The output Datastores

We illustrate the output of ORE varclus for a particular dataset (MYDATA) containing 39 numeric variables and 54k observations. ore.varclus() saves the history of the entire cluster generation in a datastore specified via the dsname argument:

  datastore.name object.count  size       creation.date description
1  datstr.MYDATA           13 30873 2015-05-28 01:03:42        <NA>

      object.name      class size length row.count col.count
1   Grand.Summary data.frame  562      5         6         5
2   clusters.ncl1       list 2790      1        NA        NA
3   clusters.ncl2       list 3301      2        NA        NA
4   clusters.ncl3       list 3811      3        NA        NA
5   clusters.ncl4       list 4322      4        NA        NA
6   clusters.ncl5       list 4833      5        NA        NA
7   clusters.ncl6       list 5344      6        NA        NA
8    summary.ncl1       list  527      2        NA        NA
9    summary.ncl2       list  677      2        NA        NA
10   summary.ncl3       list  791      2        NA        NA
11   summary.ncl4       list  922      2        NA        NA
12   summary.ncl5       list 1069      2        NA        NA
13   summary.ncl6       list 1232      2        NA        NA

For this dataset the algorithm generated 6 clusters after 6 iterations with a threshold eigv2.threshold=1.00. The datastore contains several types of objects: clusters.nclX, summary.nclX and Grand.Summary. The suffix X indicates the iteration step. For example clusters.ncl4 does not mean the 4th cluster; it is a list of objects (numbers and tables) related to the 4 clusters generated during the 4th iteration. summary.ncl4 contains summarizing information about each of the 4 clusters generated during the 4th iteration. Grand.Summary provides the same metrics but aggregated for all clusters per iteration. More details will be provided below. The user can load and inspect each clusters.nclX or summary.nclX individually to track, for example, how variables are assigned to clusters during the iterative process.
Saving the results on a per iteration basis becomes practical when the number of starting variables is several hundreds large and many clusters are generated. Text based output ore.varclus_lib.R contains a function write.clusters.to.file() which allows to concatenate all the information from either one single or multiple iterations and dump it in formatted text for visual inspection. In the example below the results from the last two step (5 and 6) specified via the clust.steps argument is written to the file named via the fout argument. R> fclust <- "out.varclus.MYDATA/out.MYDATA.clusters"R> write.clusters.to.file(fout=fclust,                          dsname="datstr.MYDATA",clust.steps=c(5,6)) The output contains now the info from summary.ncl5, clusters.ncl5, summary.ncl6, clusters.ncl6, and Grand.Summary in that order. Below we show only the output corresponding to the 6th iteration which contains the final results. The output starts with data collected from summary.ncl6 and displayed as two sections 'Clusters Summary' and 'Inter-Clusters Correlation'. The columns of  'Clusters Summary' are: The 'Inter-Clusters Correlation' matrix is the correlation matrix between the scores of (data on) the 1st principal component of every cluster. It is a measure of how much the clusters are uncorrelated when represented by the 1st principal component. ----------------------------------------------------------------------------------------Clustering step 6---------------------------------------------------------------------------------------- Clusters Summary :  Cluster Members Variation.Explained Proportion.Explained Secnd.Eigenval Represent.Var1       1      13           11.522574            0.8863518   7.856187e-01         VAR252       2       6            5.398123            0.8996871   3.874496e-01         VAR133       3       6            5.851600            0.9752667   1.282750e-01          VAR94       4       3            2.999979            0.9999929   2.112009e-05         VAR105       5       5            2.390534            0.4781069   8.526650e-01         VAR276       6       6            5.492897            0.9154828   4.951499e-01         VAR14Inter-Clusters Correlation :             Clust.1      Clust.2       Clust.3       Clust.4       Clust.5       Clust.6Clust.1  1.000000000  0.031429267  0.0915034534 -0.0045104029 -0.0341091948  0.0284033464Clust.2  0.031429267  1.000000000  0.0017441189 -0.0014435672 -0.0130659191  0.8048780461Clust.3  0.091503453  0.001744119  1.0000000000  0.0007563413 -0.0080611117 -0.0002118345Clust.4 -0.004510403 -0.001443567  0.0007563413  1.0000000000 -0.0008410022 -0.0022667776Clust.5 -0.034109195 -0.013065919 -0.0080611117 -0.0008410022  1.0000000000 -0.0107850694Clust.6  0.028403346  0.804878046 -0.0002118345 -0.0022667776 -0.0107850694  1.0000000000Cluster 1             Comp.1       Comp.2    r2.own     r2.next   r2.ratio var.idxVAR25 -0.3396562963  0.021849138 0.9711084 0.010593134 0.02920095      25VAR38 -0.3398365257  0.021560264 0.9710107 0.010590140 0.02929962      38VAR23 -0.3460431639  0.011946665 0.9689027 0.010689408 0.03143329      23VAR36 -0.3462378084  0.011635813 0.9688015 0.010685952 0.03153546      36VAR37 -0.3542777932 -0.001166427 0.9647680 0.010895771 0.03562009      37VAR24 -0.3543088809 -0.001225793 0.9647155 0.010898262 0.03567326      24VAR22 -0.3688379400 -0.026782777 0.9484384 0.011098450 0.05214028      22VAR35 -0.3689127408 -0.026900129 0.9484077 0.011093779 0.05217103      35VAR30 -0.0082726659  0.478137910 0.8723316 0.006303141 
0.12847817      30VAR32  0.0007818601  0.489061629 0.8642301 0.006116234 0.13660543      32VAR31  0.0042646500  0.493099400 0.8605441 0.005992662 0.14029666      31VAR33  0.0076560545  0.497131056 0.8573146 0.005934929 0.14353729      33VAR34 -0.0802417381  0.198756967 0.3620001 0.007534643 0.64284346      34Cluster 2           Comp.1      Comp.2    r2.own   r2.next  r2.ratio var.idxVAR13 -0.50390550 -0.03826113 0.9510065 0.6838419 0.1549652      13VAR3  -0.50384385 -0.03814382 0.9509912 0.6838322 0.1550089       3VAR18 -0.52832332 -0.09384185 0.9394948 0.6750884 0.1862204      18VAR11 -0.31655455  0.33594147 0.9387738 0.5500716 0.1360798      11VAR16 -0.34554284  0.26587848 0.9174539 0.5351907 0.1775913      16VAR39 -0.02733522 -0.90110241 0.7004025 0.3805168 0.4836249      39Cluster 3             Comp.1       Comp.2    r2.own      r2.next    r2.ratio var.idxVAR9  -4.436290e-01  0.010645774 0.9944599 0.0111098555 0.005602316       9VAR8  -4.440656e-01  0.009606151 0.9944375 0.0113484256 0.005626315       8VAR7  -4.355970e-01  0.028881014 0.9931890 0.0110602004 0.006887179       7VAR6  -4.544373e-01 -0.016395561 0.9914545 0.0114996393 0.008644956       6VAR21 -4.579777e-01 -0.027336302 0.9865562 0.0004552779 0.013449888      21VAR5   1.566362e-06  0.998972842 0.8915032 0.0093737140 0.109523464       5Cluster 4            Comp.1        Comp.2    r2.own      r2.next     r2.ratio var.idxVAR10 7.067763e-01  0.0004592019 0.9999964 1.899033e-05 3.585911e-06      10VAR1  7.074371e-01 -0.0004753728 0.9999964 1.838949e-05 3.605506e-06       1VAR15 2.093320e-11  0.9999997816 0.9999859 2.350467e-05 1.408043e-05      15Cluster 5            Comp.1       Comp.2    r2.own      r2.next  r2.ratio var.idxVAR27 -0.556396037 -0.031563215 0.6199740 0.0001684573 0.3800900      27VAR29 -0.532122723 -0.041330455 0.5586173 0.0001938785 0.4414683      29VAR28 -0.506440510 -0.002599593 0.5327290 0.0001494172 0.4673408      28VAR26 -0.389716922  0.198849850 0.4396647 0.0001887849 0.5604411      26VAR20  0.003446542  0.979209797 0.2395493 0.0076757755 0.7663329      20Cluster 6             Comp.1        Comp.2    r2.own   r2.next  r2.ratio var.idxVAR14 -0.0007028647  0.5771114183 0.9164991 0.7063442 0.2843495      14VAR4  -0.0007144334  0.5770967589 0.9164893 0.7063325 0.2843714       4VAR12 -0.5779762250 -0.0004781436 0.9164238 0.4914497 0.1643420      12VAR2  -0.5779925997 -0.0004993306 0.9164086 0.4914361 0.1643676       2VAR17 -0.5760772611  0.0009732350 0.9150015 0.4900150 0.1666686      17VAR19  0.0014223072  0.5778410825 0.9120741 0.7019736 0.2950272      19---------------------------------------------------------------------------------------Grand Summary---------------------------------------------------------------------------------------  Nb.of.Clusters Tot.Var.Explained Prop.Var.Explained Min.Prop.Explained Max.2nd.Eigval1              1          11.79856          0.3025272          0.3025272       9.7871732              2          21.47617          0.5506711          0.4309593       5.7788293              3          27.22407          0.6980530          0.5491522       2.9999504              4          30.22396          0.7749735          0.6406729       2.3894005              5          32.60496          0.8360246          0.4781069       1.2057696              6          33.65571          0.8629668          0.4781069       0.852665The sections 'Cluster 1' ... 'Cluster 6' contain results collected from the clusters.ncl6 list from the datastore. 
Each cluster is described by a table where the rows are the variables and the columns correspond to:

For example, from 'Clusters Summary', the first cluster (index 1) has 13 variables and is best represented by variable VAR25 which, from inspecting the 'Cluster 1' section, shows the highest r2.own = 0.9711084.

The section 'Grand Summary' displays the results from the Grand.Summary table in the datastore. The rows correspond to the clustering iterations and the columns are defined as:

For example, for the final clusters (Nb.of.Clusters = 6), Min.Prop.Explained is 0.4781069. This corresponds to Cluster 5 - see the Proportion.Explained value from 'Clusters Summary'. It means that the variation in Cluster 5 is poorly captured by the first principal component (only 47.8%).

As previously indicated, the representative variables, one per final cluster, are collected in the Represent.Var column from the 'Clusters Summary' section in the output text file. They can be retrieved from the summary.ncl6 object in the datastore as shown below:

R> ore.load(list=c("summary.ncl6"),name=datstr.name)
[1] "summary.ncl6"
R> names(summary.ncl6)
[1] "clusters.summary"      "inter.clusters.correl"
R> names(summary.ncl6$clusters.summary)
[1] "Cluster"              "Members"              "Variation.Explained"  "Proportion.Explained" "Secnd.Eigenval"
[6] "Represent.Var"
R> summary.ncl6$clusters.summary$Represent.Var
[1] "VAR25" "VAR13" "VAR9"  "VAR10" "VAR27" "VAR14"

In our next post we'll look at plots, performance and future developments for ORE varclus.
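As a hypothetical closing illustration, not part of the varclus scripts, the representative variables can be used directly to reduce the original table through the ORE transparency layer, assuming MYDATA is accessible as an ore.frame:

R> vars.sel <- summary.ncl6$clusters.summary$Represent.Var
R> MYDATA.sel <- MYDATA[, vars.sel]   # keep only the 6 representative variables
R> dim(MYDATA.sel)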

Variable selection also known as feature or attribute selection is an important technique for data mining and predictive analytics. It is used when the number of variables is large and has received a...

Tips and Tricks

Experience using ORAAH on a customer business problem: some basic issues & solutions

We illustrate in this blog a few simple, practical solutions for problems which can arise when developing ORAAH mapreduce applications for the Oracle BDA. These problems were actually encountered during a recent POC engagement. The customer, an important player in the medical technologies market, was interested in building an analysis flow consisting of a sequence of data manipulation and transformation steps followed by multiple model generation. The data preparation included multiple types of merging, filtering, and variable generation based on complex search patterns, and represented, by far, the most time consuming component of the flow. The original implementation on the customer's hardware required multiple days per flow to complete. Our ORAAH mapreduce based implementation running on an X5-2 Starter Rack BDA reduced that time to between 4-20 minutes, depending on which flow was tested.

The points which will be addressed in this blog are related to the fact that the data preparation was structured as a chain of tasks where each task performed transformations on HDFS data generated by one or multiple upstream tasks. More precisely, we will consider:

1. Merging of HDFS data from multiple sources
2. Re-balancing and parts reduction for HDFS data
3. Getting unique levels for categorical variables from HDFS data
4. Partitioning the data for distributed mapreduce execution

'Merging data' from above is to be understood as row binding of multiple tables. Re-balancing and parts reduction addresses the fact that the HDFS data (generated by upstream jobs) may consist of very unequal parts (chunks) - this would lead to performance losses when this data is further processed by other mapreduce jobs. The 3rd and 4th items are related. Getting the unique levels of categorical variables was useful for the data partitioning process, namely for how to generate the key-value pairs within the mapper functions.

1. Merging of hdfs data from multiple sources

The practical case here is that of a data transformation task for which the input consists of several, similarly structured HDFS data sets. As a reminder, data in HDFS is stored as a collection of flat files/chunks (part-00000, part-00001, etc) under an HDFS directory, and the hdfs.* functions access the directory, not the 'part-xxxxx' chunks. Also, the hadoop.run()/hadoop.exec() functions work with single input data objects (an HDFS object identifier representing a directory in HDFS); R rbind, cbind, merge, etc operations cannot be invoked within mapreduce to bind two or several large tables.

For the case under consideration, each input (dataA_dfs, dataB_dfs, etc) consists of a different number of files/chunks:

R> hdfs.ls("dataA_dfs")
[1] "__ORCHMETA__" "part-00000" "part-00001" .... "part-00071"
R> hdfs.ls("dataB_dfs")
[1] "__ORCHMETA__" "part-00000" "part-00001" .... "part-00035"

corresponding to the number of reducers used by the upstream mapreduce jobs which generated this data. As these multiple chunks from various HDFS directories need to be processed as a single input data set, they need to be moved into a unique HDFS directory. The 'merge_hdfs_data' function below does just that, by creating a new HDFS directory and copying all the part-xxxxx files from each source directory with proper updating of the resulting parts numbering.
R> merge_hdfs_data <- function(SrcDirs,TrgtDir) {
  #cat(sprintf("merge_hdfs_files : Creating %s ...\n",TrgtDir))
  hdfs.mkdir(TrgtDir,overwrite=TRUE)
  i <- 0
  for (srcD in SrcDirs) {
    fparts <- hdfs.ls(get(srcD),pattern="part")
    srcd <- (hdfs.describe(get(srcD)))[1,2]
    for (fpart in fparts) {
      #cat(sprintf("merge_hdfs_files : Copying %s/%s to %s ...\n",srcD,fpart,TrgtDir))
      i <- i+1
      hdfs.cp(paste(srcd,fpart,sep="/"),sprintf("%s/part-%05d",TrgtDir,i))
    }
  }
}

Merging of the dataA_dfs and dataB_dfs directories into a new data_merged_dfs directory is achieved through:

R> merge_hdfs_data(c("dataA_dfs","dataB_dfs"),"data_merged_dfs")

2. Data re-balancing / Reduction of the number of parts

Data stored in HDFS can suffer from two key problems that will affect performance: too many small files, and files with very different numbers of records, especially those with very few records. The merged data produced by the function above consists of a number of files equal to the sum of all files from all input HDFS directories. Since the upstream mapreduce jobs generating the inputs were run with a high number of reducers (for faster execution), the resulting total number of files got large (100+). This created an impractical constraint for the subsequent analysis as one cannot run a mapreduce application with a number of mappers less than the number of parts (the reverse is true, hdfs parts are splittable for processing by multiple mappers). Moreover, if the parts have very different numbers of records the performance of the application will be affected since different mappers will handle very different volumes of data.

The rebalance_data function below represents a simple way of addressing these issues. Every mapper splits its portion of the data into a user-defined number of parts (nparts) containing quasi the same number of records. A key is associated with each part. In this implementation the number of reducers is set to the number of parts. After shuffling, each reducer will collect the records corresponding to one particular key and write them to the output. The overall output consists of nparts parts of quasi equal size. A basic mechanism for preserving the data types is illustrated (see the map.output and reduce.output constructs below).
R> rebalance_data <- function(HdfsData,nmap,nparts){
  mapper_func <- function(k,v) {
    nlin <- nrow(v)
    if(nlin>0) {
      idx.seq <- seq(1,nlin)
      kk <- ceiling(idx.seq/(nlin/nparts))
      orch.keyvals(kk,v)
    }
  }
  reducer_func <- function(k,v) {
    if (nrow(v) > 0) { orch.keyvals(k=NULL,v) }
  }
  dtypes.out <- sapply(hdfs.meta(HdfsData)$types,
                       function(x) ifelse(x=="character","\"a\"",
                                          ifelse(x=="logical","FALSE","0")))
  val.str <- paste0(hdfs.meta(HdfsData)$names,"=",dtypes.out,collapse=",")
  meta.map.str <- sprintf("data.frame(key=0,%s)",val.str)
  meta.red.str <- sprintf("data.frame(key=NA,%s)",val.str)
  config <- new("mapred.config",
                job.name      = "rebalance_data",
                map.output    = eval(parse(text=meta.map.str)),
                reduce.output = eval(parse(text=meta.red.str)),
                map.tasks     = nmap,
                reduce.tasks  = nparts,
                reduce.split  = 1e5)
  res <- hadoop.run(data = HdfsData,
                    mapper = mapper_func,
                    reducer = reducer_func,
                    config = config,
                    cleanup = TRUE
  )
  res
}

Before using this function, the data associated with the new data_merged_dfs directory needs to be attached to the ORAAH framework:

R> data_merged_dfs <- hdfs.attach("data_merged_dfs")

The invocation below uses 144 mappers for splitting the data into 4 parts:

R> x <- rebalance_data(data_merged_dfs,nmap=144,nparts=4)

The user may also want to save the resulting object permanently, under some convenient/recognizable name like 'data_rebalanced_dfs' for example. The path to the temporary object x is retrieved with the hdfs.describe() command and provided as the first argument to the hdfs.cp() command.

R> tmp_dfs_name <- hdfs.describe(x)[1,2]
R> hdfs.cp(tmp_dfs_name,"data_rebalanced_dfs",overwrite=TRUE)

The choice of the number of parts is up to the user. It is better to have a few parts, to avoid constraining from below the number of mappers for the downstream runs, but one should consider other factors like the read/write performance related to the size of the data sets, the HDFS block size, etc., which are not the topic of the present blog.

3. Getting unique levels

Determining the unique levels of categorical variables in a dataset is of basic interest for any data exploration procedure. If the data is distributed in HDFS, this determination requires an appropriate solution. For the application under consideration here, getting the unique levels serves another purpose; the unique levels are used to generate data splits better suited for distributed execution by the downstream mapreduce jobs. More details are available in the next section. Depending on the categorical variables in question and the data characteristics, the determination of unique levels may require different solutions. The implementation below is a generic solution providing these levels for multiple variables bundled together in the input argument 'cols'. The mappers associate a key with each variable and collect the unique levels for each of these variables. The resulting array of values is packed in a text stream friendly format and provided as the value argument to orch.keyvals() - in this way complex data types can be safely passed between the mappers and reducers (via text-based Hadoop streams).
The reducers unpack the strings, retrieve all the values associated with a particular key (variable) and re-calculate the unique levels, now accounting for all values of that variable.

R> get_unique_levels <- function(x, cols, nmap, nred) {
  mapper <- function(k, v) {
    for (col in cols) {
      uvals <- unique(v[[col]])
      orch.keyvals(col, orch.pack(uvals))
    }
  }
  reducer <- function(k, v) {
    lvals <- orch.unpack(v$val)
    uvals <- unique(unlist(lvals))
    orch.keyval(k, orch.pack(uvals))
  }
  config <- new("mapred.config",
                job.name      = "get_unique_levls",
                map.output    = data.frame(key="a",val="packed"),
                reduce.output = data.frame(key="a",val="packed"),
                map.tasks     = nmap,
                reduce.tasks  = nred)
  res <- hadoop.run(data = x,
                    mapper = mapper,
                    reducer = reducer,
                    config = config,
                    export = orch.export(cols=cols))
  resl <- (lapply((hdfs.get(res))$val,function(x){orch.unpack(x)}))[[1]]
}

This implementation works fine provided that the number of levels of the categorical variables is much smaller than the large number of records of the entire data. If some categorical variables have many levels, not far from the order of the total number of records, each mapper may return a large number of levels and each reducer may have to handle multiple large objects. An efficient solution for this case requires a different approach. However, if the column associated with one of these variables can fit in memory, a direct, very crude calculation like the one below can run faster than the former implementation. Here the mappers extract the column with the values of the variable in question, the column is pulled into an in-memory object, and unique() is called to determine the unique levels.

R> get_unique_levels_sngl <- function(HdfsData,col,nmap){
  mapper_fun <- function(k,v) { orch.keyvals(key=NULL,v[[col]]) }
  config <- new("mapred.config",
                job.name      = "extract_col",
                map.output    = data.frame(key=NA,VAL=0),
                map.tasks     = nmap)
  x <- hadoop.run(data=HdfsData,
                  mapper=mapper_fun,
                  config=config,
                  export=orch.export(col=col),
                  cleanup=TRUE)
  xl <- hdfs.get(x)
  res <- unique(xl$VAL)
}
R> customers <- get_unique_levels_sngl(data_rebalanced_dfs,"CID",nmap=32)

We thus obtained the unique levels of the categorical variable CID (customer id) from our data_rebalanced_dfs data.

4. Partitioning the data for mapreduce execution

Let's suppose that the user wants to execute some specific data manipulations at the CID level, like aggregations, variable transformations or new variable generation, etc. Associating a key with every customer (CID level) would be a bad idea since there are many customers - our hypothesis was that the number of CID levels is not orders of magnitude below the total number of records. This would lead to an excessive number of reducers with a terrible impact on performance. In such a case it would be better, for example, to bag customers into groups and distribute the execution at the group level. The user may want to set the number of these groups, ngrp, to something commensurate with the number of BDA cores available for parallelizing the task. The example below illustrates how to do that at a basic level.
The groups are generated within the encapsulating function myMRjob, before the hadoop.run() execution - the var.grps dataframe has two columns: the CID levels and the group number (from 1 to ngrp) with which they are associated. This table is passed to the hadoop execution environment via orch.export() within hadoop.run(). The mapper_fun function extracts the group number as the key and inserts the multiple key-value pairs into the output buffer. The reducer then gets a complete set of records for every customer associated with a particular key (group) and can proceed with the transformations/manipulations within a loop-over-customers or whatever programming construct would be appropriate. Each reducer would handle a quasi-equal number of customers because this is how the groups were generated. However, the number of records per customer is not constant and may introduce some imbalances.

R> myMRjob <- function(HdfsData,var,ngrp,nmap,nred){
  mapper_fun <- function(k,v) {
    ....
    fltr <- <some_row_filtering>
    cID <- which(names(v) %in% "CUSTOMID")
    kk <- var.grps[match(v[fltr,cID],var.grps$CUSTOMID),2]
    orch.keyvals(kk,v[fltr,,drop=FALSE])
  }
  reducer_fun <- function(k,v) { ... }
  config <- new("mapred.config", map.tasks = nmap, reduce.tasks = nred,....)
  var.grps <- data.frame(CUSTOMID=var,
    GRP=rep(1:ngrp,sapply(split(var,ceiling(seq_along(var)/(length(var)/ngrp))),length)))
  res <- hadoop.run(data = HdfsData,
                    mapper = mapper_fun,
                    reducer = reducer_fun,
                    config = config,
                    export = orch.export(var.grps=var.grps,ngrp=ngrp),
                    cleanup = TRUE
  )
  res
}

x <- myMRjob(HdfsData=data_rebalanced_dfs, var=customers, ngrp=..,nmap=..,nred=..)

Improved data partitioning solutions could be sought for cases where there are strong imbalances in the number of records per customer, or if great variations are noticed between the reducer jobs' completion times. This kind of optimization will be addressed in a later blog.
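To see what the var.grps construction inside myMRjob() produces, here is a standalone toy illustration with hypothetical customer IDs (10 customers split into 3 groups):

customers <- sprintf("C%03d", 1:10)   # hypothetical customer IDs
ngrp <- 3
var.grps <- data.frame(CUSTOMID = customers,
                       GRP = rep(1:ngrp,
                                 sapply(split(customers,
                                              ceiling(seq_along(customers)/(length(customers)/ngrp))),
                                        length)))
var.grps   # C001-C003 -> group 1, C004-C006 -> group 2, C007-C010 -> group 3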


The Intersection of “Data Capital” and Advanced Analytics

We’ve heard about the Three Laws of Data Capital from Paul Sonderegger at Oracle: data comes from activity, data tends to make more data, and platforms tend to win. Advanced analytics enables enterprises to take full advantage of the data their activity produces, ranging from IoT sensors and PoS transactions to social media and image/video. Traditional BI tools produce summary data from data – producing more data – but they provide a view of the past: what did happen. Advanced analytics also produces more data from data, but this data is transformative, generating previously unknown insights and providing a view of future behavior or outcomes – what will likely happen. Oracle provides a platform for advanced analytics today through Oracle Advanced Analytics on Oracle Database, and Oracle R Advanced Analytics for Hadoop on Big Data Appliance, to support investing data.

Enterprises need to put their data to work to realize a return on their investment in data capture, cleansing, and maintenance. Investing data through advanced analytics algorithms has been shown repeatedly to dramatically increase ROI. For examples, see customer quotes and videos from StubHub, dunnhumby, and CERN, among others. Too often, data centers are perceived as imposing a “tax” instead of yielding a “dividend.” If you cannot extract new insights from your data and use data to perform revenue-enhancing actions such as predicting customer behavior, understanding root causes, and reducing fraud, the costs to maintain large volumes of historical data may feel like a tax. How do enterprises convert data centers to dividend-yielding assets?

One approach is to reduce “transaction costs.” Typically, these transaction costs involve the cost of moving data into environments where predictive models can be produced, or of sampling data to be small enough to fit existing hardware and software architectures. Then, there is the cost of putting those models into production. Transaction costs result in multi-step efforts that are labor intensive and make enterprises postpone investing their data and deriving value. Oracle has long recognized the origins of these high transaction costs and produced tools and a platform to eliminate or dramatically lower these costs.

Further, consider the data scientist or analyst as the “data capital manager,” the person or persons striving to extract the maximum yield from data assets. To achieve high dividends with low transaction costs, the data capital manager needs to be supported with tools and a platform that automate activities – making them more productive and ultimately more heroic within the enterprise, doing more with less because it’s faster and easier. Oracle removes a lot of the grunt work from the advanced analytics process: data is readily accessible, data manipulation and model building / data scoring are scalable, and deployment is immediate. To learn more about how to increase dividends from your data capital, see Oracle Advanced Analytics and Oracle R Advanced Analytics for Hadoop.


Best Practices

Using rJava in Embedded R Execution

Integration with high performance programming languages is one way to tackle big data with R. Portions of the R code are moved from R to another language to avoid bottlenecks and perform expensive procedures. The goal is to balance R’s elegant handling of data with the heavy-duty computing capabilities of other languages. Outsourcing R to another language can easily be hidden in R functions, so proficiency in the target language is not a prerequisite for the users of these functions. The rJava package by Simon Urbanek is one such example - it outsources R to Java very much like R's native .C/.Call interface. rJava allows users to create objects, call methods, and access fields of Java objects from R.

Oracle R Enterprise (ORE) provides an additional boost to rJava when used in embedded R script execution on the database server machine. Embedded R Execution allows R scripts to take advantage of a likely more powerful database server machine - more memory and CPUs, and greater CPU power. Through embedded R, ORE enables R to leverage database support for data parallel and task parallel execution of R scripts and also to operationalize R scripts in database applications. The net result is the ability to analyze larger data sets in parallel from a single R or SQL interface, depending on your preference. In this post, we demonstrate a basic example of configuring and deploying rJava in base R and embedded R execution.

1. Install Java

To start, you need Java. If you are not using a pre-configured engineered system like Exadata or the Big Data Appliance, you can download the Java Runtime Environment (JRE) and Java Development Kit (JDK) here. To verify the JRE is installed on your system, execute the command:

$ java -version
java version "1.7.0_67"

If the JRE is installed on the system, the version number is returned. The equivalent check for the JDK is:

$ javac -version
javac 1.7.0_67

A "command not recognized" error indicates either Java is not present or you need to add Java to your PATH and CLASSPATH environment variables.

2. Configure Java Parameters for R

R provides the javareconf utility to configure Java support in R. To prepare the R environment for Java, execute this command in R's home directory:

$ echo $R_HOME
/usr/lib64/R
$ cd /usr/lib64/R
$ sudo R CMD javareconf

or

$ R CMD javareconf -e

3. Install the rJava Package

rJava release versions can be obtained from CRAN. Assuming an internet connection is available, the install.packages command in an R session will do the trick.

> install.packages("rJava")
....
* installing *source* package ‘rJava’ ...
** package ‘rJava’ successfully unpacked and MD5 sums checked
checking for gcc... gcc -m64 -std=gnu99
....
** testing if installed package can be loaded
* DONE (rJava)

4. Configure the Environment Variable CLASSPATH

The CLASSPATH environment variable must contain the directories with the jar and class files. The class files in this example will be created in /home/oracle/tmp.

export CLASSPATH=$ORACLE_HOME/jlib:/home/oracle/tmp

Alternatively, use the rJava function .jaddClassPath to define the path to the class files.

5. Create and Compile the Java Program

For this test, we create a simple "Hello, World!" example. Create the file HelloWorld.java in /home/oracle/tmp with the contents:

public class HelloWorld {
    public String SayHello(String str) {
        String a = "Hello,";
        return a.concat(str);
    }
}

Compile the Java code.

$ javac HelloWorld.java
6. Call Java from R

In R, execute the following commands to load the rJava package and initialize the Java Virtual Machine (JVM).

R> library(rJava)
R> .jinit()

Instantiate the class HelloWorld in R. In other words, tell R to look at the compiled HelloWorld program.

R> obj <- .jnew("HelloWorld")

Call the SayHello method directly, passing the string argument:

R> .jcall(obj, "S", "SayHello", "World!")
[1] "Hello,World!"

7. Call Java in Embedded R Execution

Oracle R Enterprise uses external procedures in Oracle Database to support embedded R execution. The default configuration for external procedures is spawned directly by Oracle Database. The path to the JVM shared library, libjvm.so, must be added to the environment variable LD_LIBRARY_PATH so it is found in the shell where Oracle is started. This is defined in two places: at the OS shell and in the external procedures configuration file, extproc.ora.

In the OS shell:

$ locate libjvm.so
/usr/java/jdk1.7.0_45/jre/lib/amd64/server
$ export LD_LIBRARY_PATH=/usr/java/jdk1.7.0_45/jre/lib/amd64/server:$LD_LIBRARY_PATH

In extproc.ora:

$ cd $ORACLE_HOME/hs/admin

Edit the file extproc.ora to add the path to libjvm.so in LD_LIBRARY_PATH:

SET EXTPROC_DLLS=ANY
SET LD_LIBRARY_PATH=/usr/java/jdk1.7.0_45/jre/lib/amd64/server
export LD_LIBRARY_PATH

You will need to bounce the database instance after updating extproc.ora. Now load rJava in embedded R:

> library(ORE)
> ore.connect(user     = 'oreuser',
              password = 'password',
              sid      = 'sid',
              host     = 'hostname',
              all      = TRUE)
> TEST <- ore.doEval(function(str) {
                       library(rJava)
                       .jinit()
                       obj <- .jnew("HelloWorld")
                       val <- .jcall(obj, "S", "SayHello", str)
                       return(as.data.frame(val))
                     },
                     str = 'World!',
                     FUN.VALUE = data.frame(VAL = character()))
> print(TEST)
               VAL
1 Hello,      World!

If you receive this error, LD_LIBRARY_PATH is not set correctly in extproc.ora:

Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch,  :
  Error in try({ : ORA-20000: RQuery error
Error : package or namespace load failed for ‘rJava’
ORA-06512: at "RQSYS.RQEVALIMPL", line 104
ORA-06512: at "RQSYS.RQEVALIMPL", line 101

Once you've mastered this simple example, you can move on to your own use case. If you get stuck, the rJava package has very good documentation. Start with the information on the rJava CRAN page. Then, from an R session with the rJava package loaded, execute the command help(package="rJava") to list the available functions. After that, the source code of R packages that use rJava is a useful source of further inspiration – look at the reverse dependencies list for rJava on CRAN. In particular, the helloJavaWorld package is a tutorial on how to include Java code in an R package.


Best Practices

Pain Point #6: “We need to build 10s of thousands of models fast to meet business objectives”

The last pain point in this series on Addressing Analytic Pain Points involves one aspect of what I call massive predictive modeling. Increasingly, enterprise customers are building a greater number of models. In past decades, producing a handful of production models per year may have been considered a significant accomplishment. With the advent of powerful computing platforms, parallel and distributed algorithms, as well as the wealth of data – Big Data – we see enterprises building hundreds and thousands of models in targeted ways. For example, consider the utility sector with data being collected from household smart meters. Whether water, gas, or electricity, utility companies can make more precise demand projections by modeling individual customer consumption behavior. Aggregating this behavior across all households can provide more accurate forecasts, since individual household patterns are considered, not just generalizations about all households, or even different household segments.

The concerns associated with this form of massive predictive modeling include: (i) dealing effectively with Big Data from the hardware, software, network, storage, and Cloud perspectives, (ii) algorithm and infrastructure scalability and performance, (iii) production deployment, and (iv) model storage, backup, recovery, and security. Some of these I’ve explored in previous pain point blog posts.

Oracle Advanced Analytics (OAA) and Oracle R Advanced Analytics for Hadoop (ORAAH) both provide support for massive predictive modeling. From the Oracle R Enterprise component of OAA, users leverage embedded R execution to run user-defined R functions in parallel, both from R and from SQL. OAA provides the infrastructure that allows R users to focus on their core R functionality while Oracle Database handles spawning of R engines, partitioning data and providing it to the R function across parallel R engines, aggregating results, etc. Data parallelism is enabled using the “groupApply” and “rowApply” functions, while task parallelism is enabled using the “indexApply” function. The Oracle Data Mining component of OAA provides "on-the-fly" models, also called "predictive queries," where the model is automatically built on partitions of the data and scoring using those partitioned models is similarly automated.

ORAAH enables writing mapper and reducer functions in R so that the corresponding ORE functionality can be achieved on the Hadoop cluster. For example, to emulate “groupApply”, users write the mapper to partition the data and the reducer to build a model on the resulting data. To emulate “rowApply”, users can simply use the mapper to perform, e.g., data scoring, passing the model to the mapper's environment. No reducer is required.
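As an illustration only, here is a minimal sketch of the “groupApply” emulation described above, written with the ORAAH calls that appear elsewhere in this blog (hadoop.run, orch.keyvals, orch.keyval, orch.pack). The partitioning column GROUP_COL and the lm() formula are assumptions for the example, not a prescribed pattern; depending on the ORAAH version, the mapred.config object may also need the map.output/reduce.output schemas shown in the earlier examples.

build_models_by_group <- function(HdfsData, nmap, nred) {
  # mapper: key each record by its partitioning column (data parallelism)
  mapper_fun <- function(k, v) {
    orch.keyvals(v$GROUP_COL, v)
  }
  # reducer: build one model per key and return it packed for later retrieval
  reducer_fun <- function(k, v) {
    fit <- lm(TARGET ~ PRED1 + PRED2, data = v)   # hypothetical formula
    orch.keyval(k, orch.pack(fit))
  }
  config <- new("mapred.config",
                job.name     = "models_by_group",
                map.tasks    = nmap,
                reduce.tasks = nred)
  hadoop.run(data = HdfsData,
             mapper = mapper_fun,
             reducer = reducer_fun,
             config = config)
}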


Best Practices

Pain Point #5: “Our company is concerned about data security, backup and recovery”

So far in this series on Addressing Analytic Pain Points, I’ve focused on the issues of data access, performance, scalability, application complexity, and production deployment. However, there are also fundamental needs for enterprise advanced analytics solutions that revolve around data security, backup, and recovery. Traditional non-database analytics tools typically rely on flat files. If data originated in an RDBMS, that data must first be extracted. Once extracted, who has access to these flat files? Who is using this data and when? What operations are being performed? Security needs for data may be somewhat obvious, but what about the predictive models themselves? In some sense, these may be more valuable than the raw data since these models contain patterns and insights that help make the enterprise competitive, if not the dominant player. Are these models secure? Do we know who is using them, when, and with what operations? In short, what audit capabilities are available? While security is a hot topic for most enterprises, it is essential to have a well-defined backup process in place. Enterprises normally have well-established database backup procedures that database administrators (DBAs) rigorously follow. If data and models are stored in flat files, perhaps in a distributed environment, one must ask what procedures exist and with what guarantees. Are the data files taxing file system backup mechanisms already in place – or not being backed up at all? On the other hand, recovery involves using those backups to restore the database to a consistent state, reapplying any changes since the last backup. Again, enterprises normally have well-established database recovery procedures that are used by DBAs. If separate backup and recovery mechanisms are used for data, models, and scores, it may be difficult, if not impossible, to reconstruct a consistent view of an application or system that uses advanced analytics. If separate mechanisms are in place, they are likely more complex than necessary. For Oracle Advanced Analytics (OAA), data is secured via Oracle Database, which wins security awards and is highly regarded for its ability to provide secure data for confidentiality, integrity, availability, authentication, authorization, and non-repudiation. Oracle Database logs and monitors user activity. Users can work independently or jointly in a shared environment with data access controlled by standard database privileges. The data itself can be encrypted and data redaction is supported. OAA models are secured in one of two ways: (i) models produced in the kernel of the database are treated as first-class database objects with corresponding access privileges (create, update, delete, execute), and (ii) models produced through the R interface can be stored in the R datastore, which exists as a database table in the user's schema with its own access privileges. In either case, users must log into their Oracle Database schema/account, which provides the needed degree of confidentiality, integrity, availability, authentication, authorization, and non-repudiation. Enterprise Oracle DBAs already follow rigorous backup and recovery procedures. The ability to reuse these procedures in conjunction with advanced analytics solutions is a major simplification and helps to ensure the integrity of data, models, and results.


Best Practices

Pain Point #4: “Recoding R (or other) models into SQL, C, or Java takes time and is error prone”

In the previous post in this series Addressing Analytic Pain Points, I focused on some issues surrounding production deployment of advanced analytics solutions. One specific aspect of production deployment involves how to get predictive model results (e.g., scores) from R or leading vendor tools into applications that are based on programming languages such as SQL, C, or Java. In certain environments, one way to integrate predictive models involves recoding them into one of these languages. Recoding involves identifying the minimal information needed for scoring, i.e., making predictions, and implementing that in a language that is compatible with the target environment. For example, consider a linear regression model with coefficients. It can be fairly straightforward to write a SQL statement or a function in C or Java to produce a score using these coefficients. This translated model can then be integrated with production applications or systems. While recoding has been a technique used for decades, it suffers from several drawbacks: latency, quality, and robustness. Latency refers to the time delay between the data scientist developing the solution and leveraging that solution in production. Customers recount historic horror stories where the process from analyst to software developers to application deployment took months. Quality comes into play on two levels: the coding and testing quality of the software produced, and the freshness of the model itself. In fast changing environments, models may become “stale” within days or weeks. As a result, latency can impact quality. In addition, while a stripped down implementation of the scoring function is possible, it may not account for all cases considered by the original algorithm implementer. As such, robustness, i.e., the ability to handle greater variation in the input data, may suffer. One way to address this pain point is to make it easy to leverage predictive models immediately (especially open source R and in-database Oracle Advanced Analytics models), thereby eliminating the need to recode models. Since enterprise applications normally know how to interact with databases via SQL, as soon as a model is produced, it can be placed into production via SQL access. In the case of R models, these can be accessed using Oracle R Enterprise embedded R execution in parallel via ore.rowApply and, for select models, the ore.predict capability performs automatic translation of native R models for execution inside the database. In the case of native SQL Oracle Advanced Analytics interface algorithms, as found in Oracle Data Mining and exposed through an R interface in Oracle R Enterprise, users can perform scoring directly in Oracle Database. This capability minimizes or even eliminates latency, dramatically increases quality, and leverages the robustness of the original algorithm implementations.
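To make the no-recoding path for R models concrete, here is a rough sketch using ore.predict to score an open source R model directly against an ore.frame proxy in Oracle Database. The table name MY_TABLE and the column names are placeholders, and exact arguments may vary by Oracle R Enterprise version; this is illustrative, not the only deployment route described above.

library(ORE)
ore.connect(user = "oreuser", password = "password", sid = "orcl",
            host = "dbhost", all = TRUE)

# pull a manageable subset locally to fit an ordinary open source R model
dat <- ore.pull(head(MY_TABLE, 10000))
fit <- lm(SALES ~ PRICE + QUANTITY, data = dat)

# score the full table inside Oracle Database without recoding the model
scores <- ore.predict(fit, newdata = MY_TABLE)
head(scores)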


Best Practices

Pain Point #3: “Putting R (or other) models and results into production is ad hoc and complex”

Continuing in our series Addressing Analytic Pain Points, another concern for data scientists and analysts, as well as enterprise management, is how to leverage analytic results in production systems. These production systems can include (i) dashboards used by management to make business decisions, (ii) call center applications where representatives see personalized recommendations for the customer they’re speaking to or how likely that customer is to churn, (iii) real-time recommender systems for customer retail web applications, (iv) automated network intrusion detection systems, and (v) semiconductor manufacturing alert systems that monitor product quality and equipment parameters via sensors – to name a few.

When a data scientist or analyst begins examining a data-based business problem, one of the first steps is to acquire the available data relevant to that problem. In many enterprises, this involves having it extracted from a data warehouse and operational systems, or acquiring supplemental data from third parties. They then explore the data, prepare it with various transformations, build models using a variety of algorithms and settings, evaluate the results, and after choosing a “best” approach, produce results such as predictions or insights that can be used by the enterprise. If the end goal is to produce a slide deck or report, aside from those final documents, the work is done. However, reaping financial benefits from advanced analytics often needs to go beyond PowerPoint! It involves automating the process described above: extract and prepare the data, build and select the “best” model, generate predictions or highlight model details such as descriptive rules, and utilize them in production systems.

One of the biggest challenges enterprises face involves realizing in production the benefits the data scientist achieved in the lab. How do you take that cleverly crafted R script, for example, and put all the necessary “plumbing” around it to enable not only the execution of the R script, but also the movement of data and delivery of results where they are needed, parallel and distributed script execution across compute nodes, and execution scheduling? As a production deployment, care needs to be taken to safeguard against potential failures in the process. Further, more “moving parts” result in greater complexity. Since the plumbing is often custom implemented for each deployment, it needs to be reinvented and thoroughly tested for each project. Unfortunately, code and process reuse is seldom realized across an enterprise, even for similar projects, which results in duplication of effort.

Oracle Advanced Analytics (Oracle R Enterprise and Oracle Data Mining) with Oracle Database provides an environment that eliminates the need for a separately managed analytics server, the corresponding movement of data and results between such a server and the database, and the need for custom plumbing. Users can store their R and SQL scripts directly in Oracle Database and invoke them through standard database mechanisms. For example, R scripts can be invoked via SQL, and SQL scripts can be scheduled for execution through Oracle Database’s DBMS_SCHEDULER package. Parallel and distributed execution of R scripts is supported through embedded R execution, while the database kernel supports parallel and distributed execution of SQL statements and in-database data mining algorithms.
In addition, using the Oracle Advanced Analytics GUI, Oracle Data Miner, users can convert “drag and drop” analytic workflows to SQL scripts for ease of deployment in Oracle Database. By making solution deployment a well-defined and routine part of the production process and reducing complexity through fewer moving parts and built-in capabilities, enterprises are able to realize and then extend the value they get from predictive analytics faster and with greater confidence.
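As a compact sketch of the deployment path described above, an R function can be stored once in the database-resident R script repository and then invoked by name, from R or (via rqEval) from SQL. The functions used here (ore.scriptCreate, ore.doEval) appear later in this blog; the script name scoreBatch and its trivial body are illustrative assumptions.

library(ORE)
ore.connect(user = "oreuser", password = "password", sid = "orcl",
            host = "dbhost", all = TRUE)

# store the R function once in the database-resident R script repository
ore.scriptCreate("scoreBatch",
                 function() {
                   # placeholder body: prepare data, score, and return a data.frame
                   data.frame(N = 1L)
                 })

# invoke it by name; the same stored script can be called from SQL via rqEval
res <- ore.doEval(FUN.NAME = "scoreBatch",
                  FUN.VALUE = data.frame(N = integer()))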


Best Practices

Pain Point #2: “I can’t analyze or mine all of my data – it has to be sampled”

Continuing in our series Addressing Analytic Pain Points, another concern for enterprise data scientists and analysts is having to compromise accuracy due to sampling. While sampling is an important technique for data analysis, it’s one thing to sample because you choose to; it’s quite another if you are forced to sample or to use a much smaller sample than is useful. A combination of memory, compute power, and algorithm design normally contributes to this. In some cases, data simply cannot fit in memory. As a result, users must either process data in batches (adding to code or process complexity), or limit the data they use through sampling. In some environments, sampling itself introduces a catch 22 problem: the data is too big to fit in memory so it needs to be sampled, but to sample it with the current tool, I need to fit the data in memory! As a result, sampling large volume data may require processing it in batches, involving extra coding. As data volumes increase, computing statistics and predictive analytics models on a data sample can significantly reduce accuracy. For example, to find all the unique values for a given variable, a sample may miss values, especially those that occur infrequently. In addition, for environments like open source R, it is not enough for data to fit in memory; sufficient memory must be left over to perform the computation. This results from R’s call-by-value semantics. Even when data fits in memory, local machines, such as laptops, may have insufficient CPU power to process larger data sets. Insufficient computing resources means that performance suffers and users must wait for results - perhaps minutes, hours, or longer. This wastes the valuable (and expensive) time of the data scientist or analyst. Having multiple fast cores for parallel computations, as normally present on database server machines, can significantly reduce execution time. So let’s say we can fit the data in memory with sufficient memory left over, and we have ample compute resources. It may still be the case that performance is slow, or worse, the computation effectively “never” completes. A computation that would take days or weeks to complete on the full data set may be deemed as “never” completing by the user or business, especially where the results are time-sensitive. To address this problem, algorithm design must be addressed. Serial, non-threaded algorithms, especially with quadratic or worse order run time do not readily scale. Algorithms need to be redesigned to work in a parallel and even distributed manner to handle large data volumes. Oracle Advanced Analytics provides a range of statistical computations and predictive algorithms implemented in a parallel, distributed manner to enable processing much larger volume data. By virtue of executing in Oracle Database, client-side memory limitations can be eliminated. For example, with Oracle R Enterprise, R users operate on database tables using proxy objects – of type ore.frame, a subclass of data.frame – such that data.frame functions are transparently converted to SQL and executed in Oracle Database. This eliminates data movement from the database to the client machine. Users can also leverage the Oracle Data Miner graphical interface or SQL directly. When high performance hardware, such as Oracle Exadata, is used, there are powerful resources available to execute operations efficiently on big data. 
On Hadoop, Oracle R Advanced Analytics for Hadoop – a part of the Big Data Connectors often deployed on Oracle Big Data Appliance – also provides a range of pre-packaged parallel, distributed algorithms for scalability and performance across the Hadoop cluster.
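A small sketch of the transparency layer mentioned above: an ore.frame proxy accepts familiar data.frame operations, which Oracle R Enterprise translates to SQL so the work runs in Oracle Database rather than in client R memory. The table name SALES_HISTORY and the AMOUNT column are placeholders.

library(ORE)
ore.connect(user = "oreuser", password = "password", sid = "orcl",
            host = "dbhost", all = TRUE)

ore.sync(table = "SALES_HISTORY")
sales <- ore.get("SALES_HISTORY")    # ore.frame proxy; no data is pulled to the client

class(sales)                         # "ore.frame"
nrow(sales)                          # row count computed in the database
head(sales, 3)                       # only a few rows are brought back for display
summary(sales$AMOUNT)                # summary statistics pushed down as SQL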


Best Practices

Pain Point #1: “It takes too long to get my data or to get the ‘right’ data”

This is the first in a series on Addressing Analytic Pain Points: “It takes too long to get my data or to get the ‘right’ data.” Analytics users can be characterized along multiple dimensions. One such dimension is how they get access to or receive data. For example, some receive data via flat files. Since we’re talking about “enterprise” users, this often means data stored in RDBMSs where users request data extracts from a DBA or more generally the IT department. Turnaround time can be hours to days, or even weeks, depending on the organization. If the data scientist needs more or different data, the cycle repeats – often leading to frustration on both sides and delays in generating results. Other users are granted access to databases directly using programmatic access tools like ODBC, JDBC, their corresponding R variants, or ROracle. These users may be given read-only access to a range of data tables, possibly in a sandbox schema. Here, analytics users don’t have to go back to their DBA or IT to obtain extracts, but they still need to pull the data from the database to their client environment, e.g., a laptop, and push results back to the database. If significant volumes of data are involved, the time required for pulling data can hinder productivity. (Of course, this assumes the client has enough RAM to load the needed data sets, but that’s a topic for the next blog post.)

To address the first type of user, since much of the data in question resides in databases, empowering users with a self-service model mitigates the vicious cycle described above. When the available data are readily accessible to analytics users, they can see and select what they need at will. An Oracle Database solution addresses this data access pain point by providing schema access, possibly in a sandbox with read-only table access, for the analytics user. Even so, this approach just turns the first type of user into the second mentioned above.

An Oracle Database solution further addresses this pain point by either minimizing or eliminating data movement as much as possible. Most analytics engines bring data to the computation, requiring extracts and in some cases even proprietary formats before being able to perform analytics. This takes time. Often, data movement can dwarf the time required to perform the actual computation. From the perspective of the analytics user, this is wasted time because it is just a perfunctory step on the way to getting the desired results. By bringing computation to the data, using Oracle Advanced Analytics (Oracle R Enterprise and Oracle Data Mining), the time normally required to move data is eliminated. Consider the time savings of being able to prepare data, compute statistics, or build predictive models and score data directly in the database. Using Oracle Advanced Analytics, either from R via Oracle R Enterprise, SQL via Oracle Data Mining, or the graphical interface Oracle Data Miner, users can leverage Oracle Database as a high performance computational engine.

We should also note that Oracle Database has the high performance Oracle Call Interface (OCI) library for programmatic data access. For R users, Oracle provides the package ROracle that is optimized using OCI for fast data access. While ROracle performance may be much faster than other methods (ODBC- and JDBC-based), the time is still greater than zero, and there are other problems that I’ll address in the next pain point.
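To make the programmatic-access path concrete, here is a minimal ROracle sketch (ROracle implements the DBI interface over OCI); the connection details, table, and columns are placeholders, not a recommendation of any particular schema.

library(ROracle)

drv <- dbDriver("Oracle")
con <- dbConnect(drv, username = "analyst", password = "password",
                 dbname = "//dbhost:1521/orcl")   # easy connect string

# pull only what is needed to the client; the heavy lifting stays in the database
df <- dbGetQuery(con, "SELECT cust_id, age, income FROM customers WHERE rownum <= 1000")
str(df)

dbDisconnect(con)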


Best Practices

Oracle R Enterprise 1.4.1 Released

Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, makes the open source R statistical programming language and environment ready for the enterprise and big data. Designed for problems involving large data volumes, Oracle R Enterprise integrates R with Oracle Database. R users can execute R commands and scripts for statistical and graphical analyses on data stored in Oracle Database. R users can develop, refine, and deploy R scripts that leverage the parallelism and scalability of the database to automate data analysis. Data analysts and data scientists can use open source R packages and develop and operationalize R scripts for analytical applications in one step – from R or SQL.

With the new release of Oracle R Enterprise 1.4.1, Oracle enables support for Multitenant Container Database (CDB) in Oracle Database 12c and pluggable databases (PDB). With support for CDB / PDB, enterprises can take advantage of new ways of organizing their data: easily taking entire databases offline and easily bringing them back online when needed. Enterprises, such as pharmaceutical companies, that collect vast quantities of data across multiple experiments for individual projects immediately benefit from this capability.

This point release also includes the following enhancements:

• Certified for use with R 3.1.1 and Oracle R Distribution 3.1.1.
• Simplified and enhanced script for install, upgrade, and uninstall of ORE Server and the creation and configuration of ORE users.
• New supporting packages: arules and statmod.
• ore.glm accepts offset terms in the model formula and can fit negative binomial and tweedie families of GLM.
• The ore.sync argument query creates an ore.frame object from a SELECT statement without creating a view. This allows users to effectively access a view of the data without the CREATE VIEW privilege.
• Global option for serialization, ore.envAsEmptyenv, specifies whether referenced environment objects in an R object, e.g., in an lm model, should be replaced with an empty environment during serialization to the ORE R datastore. This is used by (i) ore.push, which for a list object accepts envAsEmptyenv as an optional argument, (ii) ore.save, which has envAsEmptyenv as a named argument, and (iii) ore.doEval and the other embedded R execution functions, which accept ore.envAsEmptyenv as a control argument.

Oracle R Enterprise 1.4.1 can be downloaded from OTN here.
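As a hedged sketch of the new ore.sync query argument listed above (exact usage may vary by release and is not taken from this post), a SELECT statement can back an ore.frame without creating a database view; the query, the name BIG_SPENDERS, and the table are illustrative assumptions.

library(ORE)
ore.connect(user = "oreuser", password = "password", sid = "orcl",
            host = "dbhost", all = TRUE)

# the name on the left becomes the ore.frame proxy for the query's result set
ore.sync(query = c(BIG_SPENDERS =
  "SELECT cust_id, SUM(amount) total FROM sales GROUP BY cust_id HAVING SUM(amount) > 10000"))
ore.attach()
head(BIG_SPENDERS)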


Best Practices

Seismic Data Repository: on-the-fly data analysis and visualization using Oracle R Enterprise

RN-KrasnoyarskNIPIneft Establishes Seismic Information Repository for One of the World’s Largest Oil and Gas Companies. Read the complete customer story here; excerpts follow.

RN-KrasnoyarskNIPIneft (KrasNIPI) is a research and development subsidiary of Rosneft Oil Company, a top oil and gas company in Russia and worldwide. KrasNIPI provides high-quality information from seismic surveys to Rosneft—delivering key information that oil and gas companies seek to lower costs, environmental impacts, and risks while exploring for resources to satisfy growing energy needs. KrasNIPI’s primary activities include preparing the information base used for the exploration of hydrocarbons, development and construction of oil and gas fields, processing and interpretation of 2-D and 3-D seismic data, and seismic data warehousing.

Part of the solution involved on-the-fly data analysis and visualization for remote users with only a thin client—such as a web browser (without additional plug-ins and extensions). This was made possible by using Oracle R Enterprise (a component of Oracle Advanced Analytics) to support applications requiring extensive analytical processing.

“We store vast amounts of seismic data, process this information with sophisticated math algorithms, and deliver it to remote users under tight deadlines. We deployed Oracle Database together with Oracle Spatial and Graph, Oracle Fusion Middleware MapViewer on Oracle WebLogic Server, and Oracle R Enterprise to keep these complex business processes running smoothly. The result exceeded our most optimistic expectations.”
– Artem Khodyaev, Chief Engineer
Corporate Center of Seismic Information Repository
RN-KrasnoyarskNIPIneft


Best Practices

Oracle R Distribution 3.1.1 Released

Oracle R Distribution version 3.1.1 has been released to Oracle's public yum today. R-3.1.1 (code name "Sock it to Me") is an update to R-3.1.0 that consists mainly of bug fixes. It also includes enhancements related to accessing package help files, improved accuracy when importing data with large integers, and better integration with RStudio graphics. The full list of new features and bug fixes is in the NEWS file.

To install Oracle R Distribution using yum, follow the instructions in the Oracle R Enterprise Installation and Administration Guide. Installing using yum will resolve any operating system dependencies automatically, so we recommend using yum to install Oracle R Distribution. However, if yum is not available, you can install the Oracle R Distribution RPMs directly using RPM commands.

For Oracle Linux 5, the Oracle R Distribution RPMs are available in the Enterprise Linux Add-Ons repository:

  R-3.1.1-1.el5.x86_64.rpm
  R-core-3.1.1-1.el5.x86_64.rpm
  R-devel-3.1.1-1.el5.x86_64.rpm
  libRmath-3.1.1-1.el5.x86_64.rpm
  libRmath-devel-3.1.1-1.el5.x86_64.rpm
  libRmath-static-3.1.1-1.el5.x86_64.rpm

For Oracle Linux 6, the Oracle R Distribution RPMs are available in the Oracle Linux Add-Ons repository:

  R-3.1.1-1.el6.x86_64.rpm
  R-core-3.1.1-1.el6.x86_64.rpm
  R-devel-3.1.1-1.el6.x86_64.rpm
  libRmath-3.1.1-1.el6.x86_64.rpm
  libRmath-devel-3.1.1-1.el6.x86_64.rpm
  libRmath-static-3.1.1-1.el6.x86_64.rpm

For example, this command installs the R 3.1.1 RPM on Oracle Linux x86-64 version 6:

  rpm -i R-3.1.1-1.el6.x86_64.rpm

To complete the Oracle R Distribution 3.1.1 installation, repeat this command for each of the 6 RPMs, resolving dependencies as required.

Oracle R Distribution 3.1.1 is certified with Oracle R Enterprise 1.4.x. Refer to Table 1-2 in the Oracle R Enterprise Installation Guide for supported configurations of Oracle R Enterprise components, or check this blog for updates. The Oracle R Distribution 3.1.1 binaries for Windows, AIX, Solaris SPARC, and Solaris x86 are also available on OSS, Oracle's Open Source Software portal.


Customers

Real-time Big Data Analytics is a reality for StubHub with Oracle Advanced Analytics

What can you use for a comprehensive platform for real-time analytics? How can you process big data volumes for near-real-time recommendations and dramatically reduce fraud? Learn in this video what StubHub achieved with Oracle R Enterprise from the Oracle Advanced Analytics option to Oracle Database, and read more on their story here.

Advanced analytics solutions that impact the bottom line of a business are challenging due to the range of skills and individuals involved in realizing such solutions. While we hear a lot about the role of the data scientist, that role is but one piece of the puzzle. Advanced analytics solutions also have an operationalization aspect that requires close proximity to where the transactional activity occurs.

The data scientist needs access to the right data with which to model the business problem. This involves IT for data collection, management, and administration, as well as ensuring zero downtime (a website needs to be up 24x7). This also involves working with the data scientist to keep predictive models refreshed with the latest scripts.

Integrating advanced analytics solutions into enterprise apps involves not just generating predictions, but supporting the whole life-cycle: from data collection, to model building, model assessment, and then outcome assessment and feedback to the model building process again. Application and web interface designers need to take into account how end users will see and use the advanced analytics results, e.g., supporting operations staff that need to handle the potentially fraudulent transactions.

As just described, advanced analytics projects can be "complicated" from just a human perspective. The extent to which software can simplify the interactions among users and systems will increase the likelihood of project success. The ability to quickly operationalize advanced analytics projects and demonstrate measurable value means the difference between a successful project and just a nice research report.

By standardizing on Oracle Database and SQL invocation of R, along with in-database modeling as found in Oracle Advanced Analytics, expedient model deployment and zero downtime for refreshing models become a reality. Meanwhile, data scientists are also able to explore leading edge techniques available in open source. The Oracle solution propels the entire organization forward to realize the value of advanced analytics.


Best Practices

Selecting the most predictive variables – returning Attribute Importance results as a database table

Attribute Importance (AI) is a technique of Oracle Advanced Analytics (OAA) that ranks the relative importance of predictors given a categorical or numeric target for classification or regression models, respectively. OAA AI uses the minimum description length algorithm and produces importance scores such that predictors with positive scores help predict the target, while those with zero or negative scores do not, and may even contribute noise to a model, making it less accurate. OAA AI, however, considers predictors only pairwise with the target, so interactions among predictors are not taken into account. OAA AI is a good first assessment of which predictors should be included in a classification or regression model, enabling what is sometimes called feature selection or variable selection.

In my series on Oracle R Enterprise Embedded R Execution, I explored how structured table results could be returned from embedded R calls. In a subsequent post, I explored how to return select results from a principal components analysis (PCA) model as a table. In this post, I describe how you can work with results from an Attribute Importance model from ORE embedded R execution via an R function. This R function takes a table name and target variable name as input, places the predictor rankings in a named ORE datastore also specified as input, and returns a data.frame with the predictor variable name, rank, and importance value.

The function below implements this functionality. Notice that we dynamically sync the named table and get its ore.frame proxy object. From here, we invoke ore.odmAI using the formula dynamically generated from the targetName argument. We pull out the importance component of the result, explicitly assign the column variable to the row names, and then reorder the columns. Next, we nullify the row names since these are now redundant with the column variable. The next three lines assign the result to a datastore. This is technically not necessary since the result is returned by this function, but if a user wanted to access this result without recomputing it, the user could simply retrieve the datastore object using another embedded R function. Loading the named datastore and returning the contents as an ore.frame in R or a database table in SQL is left as an exercise for the reader (a hedged sketch appears at the end of this post). Lastly, the resulting data.frame is returned.

rankPredictors <- function(tableName,targetName,dsName) {
  ore.sync(table=tableName)
  ore.attach()
  dat <- ore.get(tableName)
  formulaStr <- paste(targetName,".",sep="~")
  res <- ore.odmAI(as.formula(formulaStr),dat)
  res <- res$importance
  res$variable <- rownames(res)
  res <- res[,c("variable","rank","importance")]
  row.names(res) <- NULL
  resName <- paste(tableName,targetName,"AI",sep=".")
  assign(resName,res)
  ore.save(list=c(resName),name=dsName,overwrite=TRUE)
  res
}

To test this function, we invoke it explicitly with suitable arguments.

res <- rankPredictors("IRIS","Species","/DS/Test1")
res

Here, you see the results.

> res
      variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

The contents of the datastore can be accessed as well.
ore.datastore(pattern="/DS")
ore.datastoreSummary(name="/DS/Test1")
ore.load("/DS/Test1")
IRIS.Species.AI

> ore.datastore(pattern="/DS")
  datastore.name object.count size       creation.date description
1      /DS/Test1            1  355 2014-08-14 16:38:46        <na>
> ore.datastoreSummary(name="/DS/Test1")
      object.name      class size length row.count col.count
1 IRIS.Species.AI data.frame  355      3         4         3
> ore.load("/DS/Test1")
[1] "IRIS.Species.AI"
> IRIS.Species.AI
      variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

With the confidence that our R function is behaving correctly, we load it into the R Script Repository in Oracle Database.

ore.scriptDrop("rankPredictors")
ore.scriptCreate("rankPredictors",rankPredictors)

To test that the function behaves properly with embedded R execution, we invoke it first from R using ore.doEval, passing the desired parameters and returning the result as an ore.frame. This last part is enabled through the specification of the FUN.VALUE argument. Since we are using a datastore and the transparency layer, ore.connect is set to TRUE.

ore.doEval(FUN.NAME="rankPredictors",
  tableName="IRIS",
  targetName="Species",
  dsName="/AttributeImportance/IRIS/Species",
  FUN.VALUE=data.frame(variable=character(0),
                       rank=numeric(0),
                       importance=numeric(0)),
  ore.connect=TRUE)

Notice we get the same result as above.

      variable rank importance
1  Petal.Width    1  1.1701851
2 Petal.Length    2  1.1494402
3 Sepal.Length    3  0.5248815
4  Sepal.Width    4  0.2504077

Again, we can view the datastore contents for the execution above. Notice our use of the "/" notation to organize our datastore content. While we can name datastores with any arbitrary string, this approach can help structure the retrieval of datastore contents.

ore.datastore(pattern="/AttributeImportance/IRIS")
ore.datastoreSummary(name="/AttributeImportance/IRIS/Species")

We have a single datastore matching our IRIS data set, followed by the summary with the IRIS.Species.AI object, which is an R data.frame with 3 columns and 4 rows.

> ore.datastore(pattern="/AttributeImportance/IRIS")
                     datastore.name object.count size       creation.date description
1 /AttributeImportance/IRIS/Species            1  355 2014-08-14 16:55:40
> ore.datastoreSummary(name="/AttributeImportance/IRIS/Species")
      object.name      class size length row.count col.count
1 IRIS.Species.AI data.frame  355      3         4         3

To execute this R script from SQL, use the ORE SQL API.

select * from table(rqEval(
  cursor(select 1 "ore.connect",
         'IRIS' "tableName",
         'Species' "targetName",
         '/AttributeImportance/IRIS/Species' "dsName"
         from dual),
  'select cast(''a'' as varchar2(50)) "variable",
   1 "rank",
   1 "importance"
   from dual',
  'rankPredictors'));

In summary, we’ve explored how to use ORE embedded R execution to extract model elements from an in-database algorithm and present them as an R data.frame, ore.frame, and SQL table. The process used above can also serve as a template for working on your own embedded R execution projects:

+ Interactively develop an R script that does what you need and wrap it in a function
+ Validate that the R function behaves as expected
+ Store the function in the R Script Repository
+ Validate that the R interface to embedded R execution produces the desired results
+ Generate the SQL query that invokes the R function
+ Validate that the SQL interface to embedded R execution produces the desired results
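Following up on the "exercise for the reader" above, here is a minimal sketch of an embedded R function that reloads the saved ranking from the named datastore and returns it as a data.frame (and hence an ore.frame via FUN.VALUE). The function name getRankedPredictors is an assumption, and ore.load's environment handling may differ slightly by ORE version.

getRankedPredictors <- function(dsName, tableName, targetName) {
  resName <- paste(tableName, targetName, "AI", sep = ".")
  ore.load(name = dsName, list = resName)   # restores e.g. IRIS.Species.AI
  get(resName)                              # return the restored data.frame
}

# assumes the function was first stored via ore.scriptCreate("getRankedPredictors", ...)
ore.doEval(FUN.NAME = "getRankedPredictors",
           dsName = "/AttributeImportance/IRIS/Species",
           tableName = "IRIS",
           targetName = "Species",
           FUN.VALUE = data.frame(variable = character(0),
                                  rank = numeric(0),
                                  importance = numeric(0)),
           ore.connect = TRUE)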


Best Practices

For CMOs: Take Your Company’s Data to a New Level for Marketing Insights

This guest post from Phyllis Zimbler Miller, Digital Marketer, comments on uses of predictive analytics for marketing insights that could benefit from in-database scalability and ease of production deployment with Oracle R Enterprise.

Does your company have tons of data, such as how many seconds people watch each short video on your site before clicking away, and you are not yet leveraging this data to benefit your company’s bottom line?

Missed opportunities can be overcome by utilizing predictive analytics

Predictive analytics uses statistical and machine learning techniques that analyze current and historical facts to make predictions about events. For example, your company could take data you’ve already collected and, utilizing statistical analysis software, gain insights into the behavior of your target audiences. Previously, running the software to analyze this data could take many hours or even days. Today, with advanced software and hardware options, this analysis can take minutes.

Customer segmentation and customer satisfaction based on data analysis

Using predictive analytics you could, for example, begin to evaluate which prospective customers in which part of the country tend to watch which videos on your site longer than the other videos on your site. This evaluation can then be used by your marketing people to craft regional messages that can better resonate with people in those regions. In terms of data analysis for customer satisfaction, imagine an online entertainment streaming service using data analysis to determine at what point people stop watching a particular film or TV episode. Presumably this information could then be used, among other things, to improve the individual recommendations for site members. Or imagine an online game company using data analysis of player actions for customer satisfaction insights. Although certain actions may not be against the rules, these actions might artificially increase a player’s ranking against other players, which would interfere with the game satisfaction of others. The company could use data analysis to look for players “gaming” the system and take appropriate action.

Customer retention opportunities from data analysis

Perhaps one of the most important opportunities for analysis of data your company may already have is for customer retention efforts. Let’s say you have a subscription model business. You perform data analysis and discover that your biggest drop-offs are at the 3-month and 6-month points. First, your marketing department comes up with incentives offered to customers right before those drop-off points – incentives that require extending the customer’s subscription. Then you use data analysis to evaluate whether there is a statistical difference in the drop-offs after the incentives have been instituted. Next you try different incentives for those drop-off points and analyze that data. Which incentives seem to better improve customer retention?

Companies with large volume data

Your company may already be using Oracle Database. If your company’s database has a huge amount of data, Oracle has an enterprise solution to improve the efficiency and scalability of running the R statistical programming language, which can be effectively used in many cases for this type of predictive analytics. Oracle R Enterprise offers scalability, performance, and ease of production deployment.
Using Oracle R Enterprise, your company’s data analysis procedures can overcome R memory constraints and, utilizing parallel distributed algorithms, considerably reduce execution time.Regardless of the amount of data your company has, you still need to consider how to get your advanced analytics into production quickly and easily. The ability to integrate R scripts with production database applications using SQL eliminates delays in moving from development to production use. And the quicker and easier you can analyze your data, the sooner you can benefit from valuable insights into customer segmentation, satisfaction, and retention in addition to many other customer/marketing applications.


Best Practices

Addressing Data Order Between R and Relational Databases

Almost all data in R is a vector or is based upon vectors (vectors themselves, matrices, data frames, lists, and so forth). The elements of a vector in R have an explicit order, and each element can be individually indexed. R's in-memory processing relies on this order of elements for many computations, e.g., computing quantiles and summaries for time series objects.

By design, query results in relational algebra are unordered. Repeating the same query multiple times is not guaranteed to return results in the same order. Similarly, database-backed relational data also do not guarantee row order. However, an explicit order on database tables and views can be defined by using an ORDER BY clause in the SQL SELECT statement. Ordering is usually achieved by having a unique identifier, either a single or multi-column key, specified in the ORDER BY clause.

To bridge between ordered R data frames and normally unordered data in a relational database such as Oracle Database, Oracle R Enterprise provides the ability to create ordered and unordered ore.frame objects. Oracle R Enterprise supports ordering an ore.frame by assigning row names using the function row.names.

Ordering Using Row Names

Oracle R Enterprise supports ordering using row names. For example, suppose that the ore.frame object NARROW, which is a proxy object for the corresponding database table, is not indexed. The following example illustrates using the row.names function to create a unique identifier for each row. When retrieving row names for unordered ore.frame objects, an error is returned:

R> row.names(NARROW)
Error: ORE object has no unique key

If an ore.frame is unordered, row indexing is not permitted, since there is no unique ordering. For example, an attempt to retrieve the 5th row from the NARROW data returns an error:

R> NARROW[5,]
Error: ORE object has no unique key

The R function row.names can also be used to assign row names explicitly and thus create a unique row identifier. We'll do this using the variable "ID" on the NARROW data:

R> row.names(NARROW) <- NARROW$ID
R> row.names(head(NARROW[ ,1:3]))
[1] 101501 101502 101503 101504 101505 101506

We can now index to a specific row number using integer indexing:

R> NARROW[5,]
           ID GENDER AGE MARITAL_STATUS                  COUNTRY EDUCATION OCCUPATION YRS_RESIDENCE CLASS
101505 101505   <NA>  34         NeverM United States of America   Masters      Sales             5     1

Similarly, to index a range of row numbers, use:

R> NARROW[2:3,]
           ID GENDER AGE MARITAL_STATUS                  COUNTRY EDUCATION OCCUPATION YRS_RESIDENCE CLASS
101502 101502   <NA>  27         NeverM United States of America     Bach.      Sales             3     0
101503 101503   <NA>  20         NeverM United States of America   HS-grad    Cleric.             2     0

To index a specific row by row name, use character indexing:

R> NARROW["101502",]
           ID GENDER AGE MARITAL_STATUS                  COUNTRY EDUCATION OCCUPATION YRS_RESIDENCE CLASS
101502 101502   <NA>  27         NeverM United States of America     Bach.      Sales             3     0

Ordering Using Keys

You can also use the primary key of a database table to order an ore.frame object. When you execute ore.connect in an R session, Oracle R Enterprise creates a connection to a schema in an Oracle Database instance. To gain access to the data in the database tables in the schema, you can explicitly call the ore.sync function. That function creates an ore.frame object that is a proxy for a table in a schema. With the schema argument, you can specify the schema for which you want to create an R environment and proxy objects.
With the use.keys argument, you can specify whether you want to use primary keys in the table to order the ore.frame object.

To return the NARROW data to its unordered state, remove the previously created row names:

R> row.names(NARROW) <- NULL

Using a SQL statement, alter the NARROW table to add a primary key:

R> ore.exec("alter table NARROW add constraint NARROW primary key (\"ID\")")

Synchronize the table to pick up the new key using the ore.sync command, setting the use.keys argument to TRUE:

R> ore.sync(table = "NARROW", use.keys = TRUE)

The row names of the ordered NARROW data are now the primary key column values:

R> head(NARROW[, 1:3])
      ID GENDER AGE
1 101501   <NA>  41
2 101502   <NA>  27
3 101503   <NA>  20
4 101504   <NA>  45
5 101505   <NA>  34
6 101506   <NA>  38

If your database table already contains a key, there is no need to create the key again. Simply execute ore.sync with use.keys set to TRUE when you want to use the primary key:

R> ore.sync(table = "TABLE_NAME", use.keys = TRUE)

Ordering database tables and views is known to reduce performance because it requires sorting. As most operations in R do not require ordering, the performance hit due to sorting is unnecessary, and you should generally set use.keys to FALSE in ore.sync. Only when ordering is necessary, for operations such as sampling data or running the diff command to compare objects, should keys be used.

Options for Ordering

Oracle R Enterprise contains options that relate to the ordering of an ore.frame object. The ore.warn.order global option specifies whether you want Oracle R Enterprise to display a warning message if you use an unordered ore.frame object in a function whose results are order dependent. If you know what to expect from operations involving aggregates, group summaries, or embedded R computations, then you might want to turn the warnings off so they do not appear in the output:

R> options("ore.warn.order")
$ore.warn.order
[1] TRUE

R> options("ore.warn.order" = FALSE)
R> options("ore.warn.order")
$ore.warn.order
[1] FALSE

Note that in some circumstances unordered data may appear to have a repeatable order; however, since order was never guaranteed in the first place, it may change in future runs. Additionally, parallel query execution can significantly change the order of the result compared to sequential execution.
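Pulling these recommendations together, here is a minimal sketch of a session that works unordered by default and enables key-based ordering only when needed. It assumes the NARROW table and its primary key from the example above; the sample size of 10 is an arbitrary illustrative choice.

R> # keep the proxy unordered for general transparency-layer work
R> ore.sync(table = "NARROW", use.keys = FALSE)
R> ore.attach()
R> options("ore.warn.order" = FALSE)      # suppress order warnings once you know what to expect
R> mean(NARROW$AGE)                       # order-independent work needs no key
R> # switch key-based ordering on only when row indexing or sampling is required
R> ore.sync(table = "NARROW", use.keys = TRUE)
R> NARROW[sample(nrow(NARROW), 10), ]     # integer row indexing against the ordered proxy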


Best Practices

Are you experiencing analytics pain points?

At the useR! 2014 conference at UCLA in early July, which was a stimulating and well-attended conference, I spoke about Oracle’s R Technologies during the sponsor talks. One of my slides focused on examples of analytics pain points we often hear from customers and prospects. For example:

“It takes too long to get my data or to get the ‘right’ data”
“I can’t analyze or mine all of my data – it has to be sampled”
“Putting R models and results into production is ad hoc and complex”
“Recoding R models into SQL, C, or Java takes time and is error prone”
“Our company is concerned about data security, backup and recovery”
“We need to build 10s of thousands of models fast to meet business objectives”

After the talk, several people approached me remarking how these are exactly the problems they encounter in their organizations. One person even asked if I’d interviewed her for my talk, since she is experiencing every one of these pain points. Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, addresses these pain points. Let’s take a look at them one by one.

If it takes too long to get your data, perhaps because you’re moving it from the database where it resides to your external analytics server or laptop, the ideal solution is: don’t move it. Analyze it where it is. This is exactly what Oracle R Enterprise allows you to do using the transparency layer and in-database predictive analytics capabilities. With Oracle R Enterprise, R functions normally performed on data.frames are translated to SQL for execution in the database, taking advantage of query optimization, indexes, parallel-distributed execution, etc. With the advent of the Oracle Database In-Memory option, this has even more advantages, but that’s a topic for another post. The second part of this pain point is getting access to the “right” data. Allowing your data scientist to have a sandbox with access to the range of data necessary to perform his or her work avoids the delay of requesting flat file extracts via the DBA, only to realize that more or different data is required. The cycle time in getting the “right” data impedes progress, not to mention annoying some key individuals in your organization. We’ll come back to the security aspects later.

Increasingly, data scientists want to avoid sampling data when analyzing data or building predictive models. Minimally, they at least want to use much more data than may fit in typical analytics servers. Oracle R Enterprise provides an R interface to powerful in-database analytic functions and data mining algorithms. These algorithms are designed to work in a parallel, distributed manner whether the data fits in memory or not. In other cases, sampling is desired, if not required, but this results in a chicken-and-egg problem: the data need to be sampled since they won’t fit in memory, but the data are too big to fit in memory to sample! Users have developed home-grown techniques to chunk the data and combine partial samples; however, they shouldn’t have to. When sampling is desired or required, Oracle R Enterprise can leverage row indexing and in-database sampling to extract only the database table rows that are in the sample, using standard R syntax or Oracle R Enterprise-based sampling functions.

Our next pain point involves production deployment. Many good predictive models have been laid waste for lack of integration with, or complexity introduced by, production environments. Enterprise applications and dashboards often speak SQL and know how to access data.
However, to craft a solution that extracts data, invokes an R script in an external R engine, and places batch results back in the database requires a lot of manual coding, often leveraging ad hoc cron jobs. Oracle R Enterprise enables the execution of R scripts on the database server machine, in local R engines under the control of Oracle Database. This can be done from R and SQL. Using the SQL API, R scripts can be invoked to return results in the form of table data, images, and XML. In addition, data can be moved to these R engines more efficiently, and the powerful database hardware, such as Exadata machines, can be leveraged for data-parallel and task-parallel R script execution.

When users don’t have access to a tight integration between R and SQL as noted above, another pain point involves using R only to build the models and relying on developers to recode the scoring procedures in a programming language that fits with the production environment, e.g., SQL, C, or Java. This has multiple downsides: it takes time to recode, manual recoding is error prone, and the resulting code requires significant testing. When the model is refreshed, the process repeats.

The pain points discussed so far also raise concerns about security, backup, and recovery. If data is being moved around in flat files, what security protocols or access controls are placed on those flat files? How can access be audited? Oracle R Enterprise enables analytics users to leverage an Oracle Database secured environment for data access. Moving on, if R scripts, models, and other R objects are stored and managed as flat files, how are these backed up? How are they synced with the deployed application? By storing all these artifacts in Oracle Database via Oracle R Enterprise, backup is a normal part of DBA operations with established protocols. The R Script Repository and Datastore simplify backup (see the short sketch at the end of this post). With ad hoc solutions involving third-party analytic servers, there is also the issue of recovery, or resilience to failures. Fewer moving parts mean lower complexity. Programming for failure contingencies in a distributed application adds significant complexity to an application. Allowing Oracle Database to control the execution of R scripts in database server-side R engines reduces complexity and frees application developers and data scientists to focus on the more creative aspects of their work.

Lastly, users of advanced analytics software – data scientists, analysts, statisticians – are increasingly pushing the barrier of scalability: not just in the volume of data processed, but in the number and frequency of their computations and analyses, e.g., predictive model building. Where only a few models are involved, it may be tractable to manage a few files to store predictive models on disk (although, as noted above, this has its own complications). When you need to build thousands or hundreds of thousands of models, managing these models becomes a challenge in its own right.

In summary, customers are facing a wide range of pain points in their analytics activities. Oracle R Enterprise, a component of the Oracle Advanced Analytics option to Oracle Database, addresses these pain points, allowing data scientists, analysts, and statisticians, as well as the IT staff who support them, to be more productive, while promoting and enabling new uses of advanced analytics.
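To make the backup point above concrete, here is a minimal, hypothetical sketch of keeping analytics artifacts in the database rather than in flat files; the model, datastore, and script names are illustrative only and not taken from a customer deployment.

# persist an R object in a database datastore instead of a local .RData file
mod <- lm(mpg ~ cyl + wt, data = mtcars)        # any client-side R object
ore.save(mod, name = "my_models")               # stored in the connected schema's datastore
ore.datastore(pattern = "my_models")            # datastores are backed up along with the database

# keep the scoring logic in the R Script Repository for reuse from R or SQL
ore.scriptCreate("my_scoring_script",
                 function(dat) {
                   ore.load(name = "my_models")  # reload the saved model inside the server-side R engine
                   predict(mod, newdata = dat)
                 })

When the script is later invoked through embedded R execution or the SQL API, both the model and the code come from the database, so there are no flat files to secure or back up separately.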


Customers

StubHub Taps into Big Data for Insight into Millions of Customers’ Ticket-Buying Patterns, Fraud Detection, and Optimized Ticket Prices

What can you use as a comprehensive platform for real-time analytics? How do you drive company growth by leveraging the actions of millions of customers? How can you process big data volumes for near-real-time recommendations and dramatically reduce fraud?

These questions, and others, represent challenges faced by StubHub. Read what StubHub achieved with Oracle R Enterprise, part of the Oracle Advanced Analytics option to Oracle Database.

Mike Barber, Senior Manager of Data Science at StubHub, said: “Big data is having a tremendous impact on how we run our business. Oracle Database and its various options—including Oracle Advanced Analytics—combine high-performance data-mining functions with the open source R language to enable predictive analytics, data mining, text mining, statistical analysis, advanced numerical computations, and interactive graphics—all inside the database.”

Yadong Chen, Principal Architect, Data Systems at StubHub, said: “We considered solutions from several other vendors, but Oracle Database was a natural choice for us because it enabled us to run analytics at the data source. This capability, together with the integration of open source R with the database, ensured scalability and enabled near-real-time analytics capabilities.”

Read the full press release here.


Customers

Using Embedded R Execution: Imputing Missing Data While Preserving Data Structure

This guest post from Matt Fritz, Data Scientist, demonstrates a method for imputing missing values in data using Embedded R Execution with Oracle R Enterprise.

Missing data is a common issue in analyses and is mitigated by imputation. Several techniques handle this process within Oracle R Enterprise; however, some bias the data or generate outputs as data objects that are less accessible than others. This post illustrates ways to effectively impute data while specifying the exact data structure of the output, so that the result remains directly usable in Oracle R Enterprise.

Let's first introduce missing values into the WorldPhones data set and create a corresponding table in Oracle R Enterprise:

  WorldPhones[c(2,6),c(1,2,4)] <- NA
  WorldPhones <- as.data.frame(WorldPhones)
  ore.create(WorldPhones, table = 'PHONES')
  class(PHONES)

  > class(PHONES)
  [1] "ore.frame"
  > PHONES
        N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
  1951   45939  21574 2876   1815    1646     89      555
  1956      NA     NA 4708     NA    2366   1411      733
  1957   64721  32510 5230   2695    2526   1546      773
  1958   68484  35218 6662   2845    2691   1663      836
  1959   71799  37598 6856   3000    2868   1769      911
  1960      NA     NA 8220     NA    3054   1905     1008
  1961   79831  43173 9053   3338    3224   2005     1076

The easiest way to handle missing data is to substitute these values with a constant, such as zero. We are ready to recode the missing values and can use either the Transparency Layer or Embedded R Execution. The Transparency Layer converts the base R code below into SQL and runs the generated SQL inside the database:

  newPHONES <- PHONES
  newPHONES$N.Amer <- ifelse(is.na(newPHONES$N.Amer),0,newPHONES$N.Amer)
  newPHONES$Europe <- ifelse(is.na(newPHONES$Europe),0,newPHONES$Europe)
  newPHONES$S.Amer <- ifelse(is.na(newPHONES$S.Amer),0,newPHONES$S.Amer)
  newPHONES

  > newPHONES
        N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
  1951   45939  21574 2876   1815    1646     89      555
  1956       0      0 4708      0    2366   1411      733
  1957   64721  32510 5230   2695    2526   1546      773
  1958   68484  35218 6662   2845    2691   1663      836
  1959   71799  37598 6856   3000    2868   1769      911
  1960       0      0 8220      0    3054   1905     1008
  1961   79831  43173 9053   3338    3224   2005     1076

This process can also be executed with Embedded R Execution – which spawns an R engine on the database server under the control of Oracle Database – by using a custom R function, such as:

  function(x) ifelse(is.na(x),0,x)

One way to call this custom function is with ore.doEval. This method requires code to be written as if it were to be executed on the client; however, the ore.doEval wrapper moves the code to the database server and then leverages the database server’s superior processing capacity:

  newPHONE <- ore.doEval(
    function() {
      ore.sync(table="PHONES")
      ore.attach()
      data.frame(apply(ore.pull(PHONES)
                       ,2
                       ,function(x) ifelse(is.na(x),0,x)))}
    ,ore.connect=TRUE)

Note that we explicitly pull the data from the database using Oracle R Enterprise’s Transparency Layer on the database table PHONES. We must connect to the database to obtain the ore.frame that corresponds to the PHONES table. This is accomplished through the ore.sync function. The ore.attach function allows us to reference the ore.frame by its table name.
The second way is via ore.tableApply, which applies a function to an entire input table within Oracle R Enterprise. This produces the same result as ore.doEval, and although both operations are successful, the output's structure defaults to an ORE object instead of a data frame:

  newPHONES <- ore.tableApply(PHONES
                   ,function(y) {
                      apply(y
                           ,2
                           ,function(x) ifelse(is.na(x),0,x))})
  class(newPHONES)

  > class(newPHONES)
  [1] "ore.object"

Since we cannot work with this object the same way as data frames or matrices, we must pull the ORE object onto the client in order to deserialize it into an R matrix:

  newphones <- ore.pull(newPHONES)
  class(newphones)

  > class(newphones)
  [1] "matrix"
  > head(newphones)
     N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
  1   45939  21574 2876   1815    1646     89      555
  2       0      0 4708      0    2366   1411      733
  3   64721  32510 5230   2695    2526   1546      773
  4   68484  35218 6662   2845    2691   1663      836
  5   71799  37598 6856   3000    2868   1769      911
  6       0      0 8220      0    3054   1905     1008

In this example, it is preferable for the output to be structured as a data frame so that we can continue to work within Oracle R Enterprise rather than on the client. The FUN.VALUE argument of Embedded R Execution provides this flexibility by defining the structure of the output data. For example, the output can be explicitly expressed as a data frame of numeric columns with the same names as the input:

  newPHONES <- ore.tableApply(PHONES,
                  function(y) {
                    data.frame(apply(y,
                          2,
                          function(x) ifelse(is.na(x),0,x)))},
                             FUN.VALUE=data.frame(setNames(replicate(7,
                                                         numeric(0),
                                                         simplify=F),
                                                         colnames(PHONES))))
  class(newPHONES)

  > class(newPHONES)
  [1] "ore.frame"

We can now continue to work with the newPHONES output within Oracle R Enterprise just as we would with a data frame.

While these methods are technically sufficient, they are not practical for this type of data set. As this is panel data ranging from 1951 to 1961, simply recoding missing values to zero would strongly bias the data. Perhaps we prefer to calculate the average of each missing observation's pre- and post-period values. Embedded R allows for a simple solution by utilizing the open-source zoo package.
  newPHONES <- ore.tableApply(PHONES,
                  function(y) {
                    library(zoo)
                    data.frame(
                      apply(y, 2, function(x) (na.locf(x) + rev(na.locf(rev(x))))/2))},
                           FUN.VALUE=data.frame(setNames(replicate(7,
                                                      numeric(0),
                                                      simplify=F),
                                                      colnames(PHONES))))
  newPHONES

  > newPHONES
    N.Amer  Europe Asia S.Amer Oceania Africa Mid.Amer
  1  45939 21574.0 2876   1815    1646     89      555
  2  55330 27042.0 4708   2255    2366   1411      733
  3  64721 32510.0 5230   2695    2526   1546      773
  4  68484 35218.0 6662   2845    2691   1663      836
  5  71799 37598.0 6856   3000    2868   1769      911
  6  75815 40385.5 8220   3169    3054   1905     1008
  7  79831 43173.0 9053   3338    3224   2005     1076

These imputed values seem much more reasonable, and the output's structure acts just like a data frame within Oracle R Enterprise.

To recap, handling missing values plays an important role in data analysis, and several imputation methods can be leveraged via the Transparency Layer or Embedded R. Further, Embedded R's FUN.VALUE feature explicitly defines the output's structure and allows results to be analyzed immediately within Oracle R Enterprise. The FUN.VALUE feature requires more tuning when the output comprises both numeric and character columns. Check back for a later post that explains how to define a data frame of 'mixed class'.
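Because the zoo-based result above is an ore.frame, it can be summarized and persisted in the database without pulling it to the client. The following is a minimal sketch; it assumes that ore.create accepts the proxy ore.frame and that the table name PHONES_IMPUTED is free — both illustrative choices, not part of the original post.

  # materialize the imputed result as a database table (hypothetical table name)
  ore.create(newPHONES, table = "PHONES_IMPUTED")
  # verify through the Transparency Layer -- the computation runs in the database
  head(PHONES_IMPUTED)
  summary(PHONES_IMPUTED$Europe)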


Best Practices

Convert ddply {plyr} to Oracle R Enterprise, or use with Embedded R Execution

The plyr package contains a set of tools for partitioning a problem into smaller sub-problems that can be more easily processed. One function within {plyr} is ddply, which allows you to specify subsets of a data.frame and then apply a function to each subset. The result is gathered into a single data.frame. Such a capability is very convenient. The function ddply also has a parallel option that, if TRUE, will apply the function in parallel, using the backend provided by foreach.

This type of functionality is available through Oracle R Enterprise using the ore.groupApply function. In this blog post, we show a few examples from Sean Anderson's "A quick introduction to plyr" to illustrate the corresponding functionality using ore.groupApply.

To get started, we'll create a demo data set and load the plyr package.

set.seed(1)
d <- data.frame(year = rep(2000:2014, each = 3),
                count = round(runif(45, 0, 20)))
dim(d)
library(plyr)

This first example takes the data frame, partitions it by year, and calculates the coefficient of variation of the count, returning a data frame.

# Example 1
res <- ddply(d, "year", function(x) {
  mean.count <- mean(x$count)
  sd.count <- sd(x$count)
  cv <- sd.count/mean.count
  data.frame(cv.count = cv)
  })

To illustrate the equivalent functionality in Oracle R Enterprise, using embedded R execution, we use the ore.groupApply function on the same data, but pushed to the database, creating an ore.frame. The function ore.push creates a temporary table in the database, returning a proxy object, the ore.frame.

D <- ore.push(d)
res <- ore.groupApply (D, D$year, function(x) {
  mean.count <- mean(x$count)
  sd.count <- sd(x$count)
  cv <- sd.count/mean.count
  data.frame(year=x$year[1], cv.count = cv)
  }, FUN.VALUE=data.frame(year=1, cv.count=1))

You'll notice the similarities in the first three arguments. With ore.groupApply, we augment the function to return the specific data.frame we want. We also specify the argument FUN.VALUE, which describes the resulting data.frame. From our previous blog posts, you may recall that by default, ore.groupApply returns an ore.list containing the results of each function invocation. To get a data.frame, we specify the structure of the result.

The results in both cases are the same; however, the ore.groupApply result is an ore.frame. In this case the data stays in the database until it's actually required. This can result in significant memory and time savings when data is large.

R> class(res)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> head(res)
  year  cv.count
1 2000 0.3984848
2 2001 0.6062178
3 2002 0.2309401
4 2003 0.5773503
5 2004 0.3069680
6 2005 0.3431743

To make the ore.groupApply execute in parallel, you can specify the argument parallel with either TRUE, to use default database parallelism, or a specific number, which serves as a hint to the database as to how many parallel R engines should be used (see the short sketch at the end of this post).

The next ddply example uses the summarise function, which creates a new data.frame. In ore.groupApply, the year column is passed in with the data. Since no automatic creation of columns takes place, we explicitly set the year column in the data.frame result to the value of the first row, since all rows received by the function have the same year.
# Example 2ddply(d, "year", summarise, mean.count = mean(count))res <- ore.groupApply (D, D$year, function(x) {  mean.count <- mean(x$count)  data.frame(year=x$year[1], mean.count = mean.count)  }, FUN.VALUE=data.frame(year=1, mean.count=1))R> head(res)   year mean.count1 2000 7.6666672 2001 13.3333333 2002 15.0000004 2003 3.0000005 2004 12.3333336 2005 14.666667 Example 3 uses the transform function with ddply, which modifies the existing data.frame. With ore.groupApply, we again construct the data.frame explicilty, which is returned as an ore.frame. # Example 3ddply(d, "year", transform, total.count = sum(count))res <- ore.groupApply (D, D$year, function(x) {  total.count <- sum(x$count)  data.frame(year=x$year[1], count=x$count, total.count = total.count)  }, FUN.VALUE=data.frame(year=1, count=1, total.count=1)) > head(res)   year count total.count1 2000 5 232 2000 7 233 2000 11 234 2001 18 405 2001 4 406 2001 18 40In Example 4, the mutate function with ddply enables you to define new columns that build on columns just defined. Since the construction of the data.frame using ore.groupApply is explicit, you always have complete control over when and how to use columns. # Example 4ddply(d, "year", mutate, mu = mean(count), sigma = sd(count),      cv = sigma/mu)res <- ore.groupApply (D, D$year, function(x) {  mu <- mean(x$count)  sigma <- sd(x$count)  cv <- sigma/mu  data.frame(year=x$year[1], count=x$count, mu=mu, sigma=sigma, cv=cv)  }, FUN.VALUE=data.frame(year=1, count=1, mu=1,sigma=1,cv=1))R> head(res)   year count mu sigma cv1 2000 5 7.666667 3.055050 0.39848482 2000 7 7.666667 3.055050 0.39848483 2000 11 7.666667 3.055050 0.39848484 2001 18 13.333333 8.082904 0.60621785 2001 4 13.333333 8.082904 0.60621786 2001 18 13.333333 8.082904 0.6062178 In Example 5, ddply is used to partition data on multiple columns before constructing the result. Realizing this with ore.groupApply involves creating an index column out of the concatenation of the columns used for partitioning. This example also allows us to illustrate using the ORE transparency layer to subset the data. # Example 5baseball.dat <- subset(baseball, year > 2000) # data from the plyr packagex <- ddply(baseball.dat, c("year", "team"), summarize,           homeruns = sum(hr))We first push the data set to the database to get an ore.frame. We then add the composite column and perform the subset, using the transparency layer. Since the results from database execution are unordered, we will explicitly sort these results and view the first 6 rows. BB.DAT <- ore.push(baseball)BB.DAT$index <- with(BB.DAT, paste(year, team, sep="+"))BB.DAT2 <- subset(BB.DAT, year > 2000)X <- ore.groupApply (BB.DAT2, BB.DAT2$index, function(x) {  data.frame(year=x$year[1], team=x$team[1], homeruns=sum(x$hr))  }, FUN.VALUE=data.frame(year=1, team="A", homeruns=1), parallel=FALSE)res <- ore.sort(X, by=c("year","team")) R> head(res)   year team homeruns1 2001 ANA 42 2001 ARI 1553 2001 ATL 634 2001 BAL 585 2001 BOS 776 2001 CHA 63 Our next example is derived from the ggplot function documentation. This illustrates the use of ddply within using the ggplot2 package. We first create a data.frame with demo data and use ddply to create some statistics for each group (gp). We then use ggplot to produce the graph. We can take this same code, push the data.frame df to the database and invoke this on the database server. 
The graph will be returned to the client window, as depicted below.

# Example 6 with ggplot2
library(ggplot2)
df <- data.frame(gp = factor(rep(letters[1:3], each = 10)),
                 y = rnorm(30))
# Compute sample mean and standard deviation in each group
library(plyr)
ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))
# Set up a skeleton ggplot object and add layers:
ggplot() +
  geom_point(data = df, aes(x = gp, y = y)) +
  geom_point(data = ds, aes(x = gp, y = mean),
             colour = 'red', size = 3) +
  geom_errorbar(data = ds, aes(x = gp, y = mean,
                               ymin = mean - sd, ymax = mean + sd),
             colour = 'red', width = 0.4)

DF <- ore.push(df)
ore.tableApply(DF, function(df) {
  library(ggplot2)
  library(plyr)
  ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))
  ggplot() +
    geom_point(data = df, aes(x = gp, y = y)) +
    geom_point(data = ds, aes(x = gp, y = mean),
               colour = 'red', size = 3) +
    geom_errorbar(data = ds, aes(x = gp, y = mean,
                                 ymin = mean - sd, ymax = mean + sd),
                  colour = 'red', width = 0.4)
})

But let's take this one step further. Suppose we wanted to produce multiple graphs, partitioned on some index column. We replicate the data three times and add some noise to the y values, just to make the graphs a little different. We also create an index column to form our three partitions. Note that we've also specified that this should be executed in parallel, allowing Oracle Database to control and manage the server-side R engines. The result of ore.groupApply is an ore.list that contains the three graphs. Each graph can be viewed by printing the list element.

df2 <- rbind(df,df,df)
df2$y <- df2$y + rnorm(nrow(df2))
df2$index <- c(rep(1,30), rep(2,30), rep(3,30))
DF2 <- ore.push(df2)
res <- ore.groupApply(DF2, DF2$index, function(df) {
  df <- df[,1:2]
  library(ggplot2)
  library(plyr)
  ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))
  ggplot() +
    geom_point(data = df, aes(x = gp, y = y)) +
    geom_point(data = ds, aes(x = gp, y = mean),
               colour = 'red', size = 3) +
    geom_errorbar(data = ds, aes(x = gp, y = mean,
                                 ymin = mean - sd, ymax = mean + sd),
                  colour = 'red', width = 0.4)
  }, parallel=TRUE)
res[[1]]
res[[2]]
res[[3]]

To recap, we've illustrated how various uses of ddply from the plyr package can be realized with ore.groupApply, which gives the user explicit control over the contents of the data.frame result in a straightforward manner. We've also highlighted how ddply can be used within an ore.groupApply call.
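As promised above, here is a minimal sketch of the parallel hint mentioned for ore.groupApply, reusing the data from Example 1. The degree of parallelism (2) is an arbitrary illustrative value, and the database treats it only as a hint.

D <- ore.push(d)
res <- ore.groupApply(D, D$year, function(x) {
  data.frame(year = x$year[1], cv.count = sd(x$count)/mean(x$count))
  },
  FUN.VALUE = data.frame(year = 1, cv.count = 1),
  parallel = 2)    # hint: request up to two server-side R engines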


Best Practices

Financial institutions build predictive models using Oracle R Enterprise to speed model deployment

See the Oracle press release, Financial Institutions Leverage Metadata Driven Modeling Capability Built on the Oracle R Enterprise Platform to Accelerate Model Deployment and Streamline Governance, for a description of how a "unified environment for analytics data management and model lifecycle management brings the power and flexibility of the open source R statistical platform, delivered via the in-database Oracle R Enterprise engine to support open standards compliance."

Through its integration with Oracle R Enterprise, Oracle Financial Services Analytical Applications provides "productivity, management, and governance benefits to financial institutions, including the ability to:

Centrally manage and control models in a single, enterprise model repository, allowing for consistent management and application of security and IT governance policies across enterprise assets

Reuse models and rapidly integrate with applications by exposing models as services

Accelerate development with seeded models and common modeling and statistical techniques available out-of-the-box

Cut risk and speed model deployment by testing and tuning models with production data while working within a safe sandbox

Support compliance with regulatory requirements by carrying out comprehensive stress testing, which captures the effects of adverse risk events that are not estimated by standard statistical and business models. This approach supplements the modeling process and supports compliance with the Pillar I and the Internal Capital Adequacy Assessment Process stress testing requirements of the Basel II Accord

Improve performance by deploying and running models co-resident with data. Oracle R Enterprise engines run in database, virtually eliminating the need to move data to and from client machines, thereby reducing latency and improving security"


FAQ

R Package Installation with Oracle R Enterprise

Programming languages give developers the opportunity to write reusable functions and to bundle those functions into logical, deployable entities. In R, these are called packages. R has thousands of such packages provided by an almost equally large group of third-party contributors. To allow others to benefit from these packages, users can share packages on the CRAN system for use by the vast R development community worldwide.

R's package system, along with the CRAN framework, provides a process for authoring, documenting and distributing packages to millions of users. In this post, we'll illustrate the various ways in which such R packages can be installed for use with R and together with Oracle R Enterprise. The same instructions apply whether you use open source R or Oracle R Distribution.

In this post, we cover the following package installation scenarios:

R command line
Linux shell command line
Use with Oracle R Enterprise
Installation on Exadata or RAC
Installing all packages in a CRAN Task View
Troubleshooting common errors

1. R Package Installation Basics

R package installation basics are outlined in Chapter 6 of the R Installation and Administration Guide. There are two ways to install packages from the command line: from the R command line and from the shell command line. For this first example on Oracle Linux using Oracle R Distribution, we’ll install the arules package as root so that packages will be installed in the default R system-wide location where all users can access it, /usr/lib64/R/library.

Within R, using the install.packages function always attempts to install the latest version of the requested package available on CRAN:

R> install.packages("arules")

If the arules package depends upon other packages that are not already installed locally, the R installer automatically downloads and installs those required packages. This is a huge benefit that frees users from the task of identifying and resolving those dependencies.

You can also install R packages from the shell command line. This is useful for some packages when an internet connection is not available or for installing packages not uploaded to CRAN. To install packages this way, first locate the package on CRAN and then download the package source to your local machine. For example:

$ wget http://cran.r-project.org/src/contrib/arules_1.1-2.tar.gz

Then, install the package using the command R CMD INSTALL:

$ R CMD INSTALL arules_1.1-2.tar.gz

A major difference between installing R packages using the R package installer at the R command line and at the shell command line is that package dependencies must be resolved manually at the shell command line. Package dependencies are listed in the Depends section of the package’s CRAN site. If dependencies are not identified and installed prior to the package’s installation, you will see an error similar to:

ERROR: dependency ‘xxx’ is not available for package ‘yyy’

As a best practice and to save time, always refer to the package’s CRAN site to understand the package dependencies prior to attempting an installation.

If you don’t run R as root, you won’t have permission to write packages into the default system-wide location and you will be prompted to create a personal library accessible by your userid. You can accept the personal library path chosen by R, or specify the library location by passing parameters to the install.packages function.
For example, to create an R package repository in your home directory:

R> install.packages("arules", lib="/home/username/Rpackages")

or

$ R CMD INSTALL arules_1.1-2.tar.gz --library=/home/username/Rpackages

Refer to the install.packages help file in R, or execute R CMD INSTALL --help at the shell command line, for a full list of command line options.

To set the library location and avoid having to specify this at every package install, simply create the R startup environment file .Renviron in your home area if it does not already exist, and add the following line to it:

R_LIBS_USER = "/home/username/Rpackages"

2. Setting the Repository

Each time you install an R package from the R command line, you are asked which CRAN mirror, or server, R should use. To set the repository and avoid having to specify this during every package installation, create the R startup command file .Rprofile in your home directory and add the following R code to it:

cat("Setting Seattle repository")
r = getOption("repos")
r["CRAN"] = "http://cran.fhcrc.org/"
options(repos = r)
rm(r)

This code snippet sets the R package repository to the Seattle CRAN mirror at the start of each R session.

3. Installing R Packages for use with Oracle R Enterprise

Embedded R execution with Oracle R Enterprise allows the use of CRAN or other third-party R packages in user-defined R functions executed on the Oracle Database server. The steps for installing and configuring packages for use with Oracle R Enterprise are the same as for open source R. The database-side R engine just needs to know where to find the R packages.

The Oracle R Enterprise installation is performed by user oracle, which typically does not have write permission to the default site-wide library, /usr/lib64/R/library. On Linux and UNIX platforms, the Oracle R Enterprise Server installation provides the ORE script, which is executed from the operating system shell to install R packages and to start R. The ORE script is a wrapper for the default R script, a shell wrapper for the R executable. It can be used to start R, run batch scripts, and build or install R packages. Unlike the default R script, the ORE script installs packages to a location writable by user oracle and accessible by all ORE users - $ORACLE_HOME/R/library.

To install a package on the database server so that it can be used by any R user and for use in embedded R execution, an Oracle DBA would typically download the package source from CRAN using wget. If the package depends on any packages that are not in the R distribution in use, download the sources for those packages, also. For a single Oracle Database instance, replace the R script with ORE to install the packages in the same location as the Oracle R Enterprise packages.

$ wget http://cran.r-project.org/src/contrib/arules_1.1-2.tar.gz
$ ORE CMD INSTALL arules_1.1-2.tar.gz

Behind the scenes, the ORE script performs the equivalent of setting R_LIBS_USER to the value of $ORACLE_HOME/R/library, and all R packages installed with the ORE script are installed to this location.
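If you want to confirm which library location and repository are in effect — for example after editing .Renviron and .Rprofile as above, or inside an R session started with the ORE script — base R can report both. This is just a quick check; the path shown in the comment is the illustrative one used in this post.

R> Sys.getenv("R_LIBS_USER")     # e.g., /home/username/Rpackages on a client, per .Renviron
R> .libPaths()                   # the full library search path; under ORE this includes $ORACLE_HOME/R/library
R> getOption("repos")["CRAN"]    # the CRAN mirror set in .Rprofile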
For installing a package on multiple database servers, such as those in an Oracle Real Application Clusters (Oracle RAC) or a multinode Oracle Exadata Database Machine environment, use the ORE script in conjunction with the Exadata Distributed Command Line Interface (DCLI) utility.

$ dcli -g nodes -l oracle ORE CMD INSTALL arules_1.1-1.tar.gz

The DCLI -g flag designates a file containing a list of nodes to install on, and the -l flag specifies the user id to use when executing the commands. For more information on using DCLI with Oracle R Enterprise, see Chapter 5 in the Oracle R Enterprise Installation Guide.

If you are using an Oracle R Enterprise client, install the package in the same way as any R package, bearing in mind that you must install the same version of the package on both the client and server machines to avoid incompatibilities.

4. CRAN Task Views

CRAN also maintains a set of Task Views that identify packages associated with a particular task or methodology. Task Views are helpful in guiding users through the huge set of available R packages. They are actively maintained by volunteers who include detailed annotations for routines and packages. If you find that one of the task views is a perfect match, you can install every package in that view using the ctv package - an R package for automating package installation.

To use the ctv package to install a task view, first install and load the ctv package.

R> install.packages("ctv")
R> library(ctv)

Then query the names of the available task views and install the view you choose.

R> available.views()
R> install.views("TimeSeries")

5. Using and Managing R packages

To use a package, start up R and load packages one at a time with the library command. Load the arules package in your R session:

R> library(arules)

Verify the version of arules installed:

R> packageVersion("arules")
[1] '1.1.2'

Verify the version of arules installed on the database server using embedded R execution:

R> ore.doEval(function() packageVersion("arules"))

View the help file for the apropos function:

R> ?apropos

Over time, your package repository will contain more and more packages, especially if you are using the system-wide repository where others are adding additional packages. It’s good to know the entire set of R packages accessible in your environment. To list all available packages in your local R session, use the installed.packages command:

R> myLocalPackages <- row.names(installed.packages())
R> myLocalPackages

To access the list of available packages on the ORE database server from the ORE client, use the following embedded R syntax:

R> myServerPackages <- ore.doEval(function() row.names(installed.packages()))
R> myServerPackages

6. Troubleshooting Common Problems

Installing Older Versions of R packages

If you immediately upgrade to the latest version of R, you will have no problem installing the most recent versions of R packages.
However, if your version of R is older, some of the more recent package releases will not work and install.packages will generate a message such as:

Warning message: In install.packages("arules"): package ‘arules’ is not available

This is when you have to go to the Old sources link on the CRAN page for the arules package and determine which version is compatible with your version of R.

Begin by determining what version of R you are using:

$ R --version
Oracle Distribution of R version 3.0.1 (--) -- "Good Sport"
Copyright (C) The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

Given that R-3.0.1 was released May 16, 2013, any version of the arules package released after this date may work. Scanning the arules archive, we might try installing version 1.1-1, released in January of 2014:

$ wget http://cran.r-project.org/src/contrib/Archive/arules/arules_1.1-1.tar.gz
$ R CMD INSTALL arules_1.1-1.tar.gz

For use with ORE:

$ ORE CMD INSTALL arules_1.1-1.tar.gz

The "package not available" error can also be thrown if the package you’re trying to install lives elsewhere, either on another R package site or because it has been removed from CRAN. A quick Google search usually leads to more information on the package’s location and status.

Oracle R Enterprise is not in the R library path

On Linux hosts, after installing the ORE server components, starting R, and attempting to load the ORE packages, you may receive the error:

R> library(ORE)
Error in library(ORE) : there is no package called ‘ORE’

If you know the ORE packages have been installed and you receive this error, this is the result of not starting R with the ORE script. To resolve this problem, exit R and restart using the ORE script. After restarting R and running the command to load the ORE packages, you should not receive any errors.

$ ORE
R> library(ORE)

On Windows servers, the solution is to make the location of the ORE packages visible to R by adding them to the R library paths. To accomplish this, exit R, then add the following line to the .Rprofile file. On Windows, the .Rprofile file is located in the R\etc directory, C:\Program Files\R\R-<version>\etc:

.libPaths("<path to $ORACLE_HOME>/R/library")

The above line tells R to include the R directory in the Oracle home as part of its search path. When you start R, the path above will be included, and future R package installations will also be saved to $ORACLE_HOME/R/library. This path should be writable by the user oracle, or by the userid of the DBA tasked with installing R packages.

Binary package compiled with a different version of R

By default, R will install pre-compiled versions of packages if they are found. If the version of R under which the package was compiled does not match your installed version of R, you will get an error message:

Warning message: package ‘xxx’ was built under R version 3.0.0

The solution is to download the package source and build it for your version of R.

$ wget http://cran.r-project.org/src/contrib/Archive/arules/arules_1.1-1.tar.gz
$ R CMD INSTALL arules_1.1-1.tar.gz

For use with ORE:

$ ORE CMD INSTALL arules_1.1-1.tar.gz

Unable to execute files in the /tmp directory

By default, R uses the /tmp directory to install packages. On security-conscious machines, the /tmp directory is often marked as "noexec" in the /etc/fstab file.
This means that no file under /tmp can ever be executed, and users who attempt to install an R package will receive an error:

ERROR: 'configure' exists but is not executable -- see the 'R Installation and Administration Manual'

The solution is to set the TMP and TMPDIR environment variables to a location that R will use as the compilation directory. For example:

$ mkdir <some path>/tmp
$ export TMPDIR=<some path>/tmp
$ export TMP=<some path>/tmp

This error typically appears on Linux client machines and not database servers, as Oracle Database writes to the value of the TMP environment variable for several tasks, including holding temporary files during database installation.

7. Creating your own R package

Creating your own package and submitting it to CRAN is for advanced users, but it is not difficult. The procedure to follow, along with details of R's package system, is detailed in the Writing R Extensions manual.
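As a gentle on-ramp to that manual, base R can generate a package skeleton for you. The package and function names below are made-up examples, and the generated DESCRIPTION and Rd files still need editing before R CMD check will pass.

R> myfun <- function(x) x + 1                          # a function you want to package
R> package.skeleton(name = "mytools", list = "myfun")  # writes a ./mytools source directory

Then, from the shell:

$ R CMD build mytools                 # produces mytools_1.0.tar.gz (version comes from DESCRIPTION)
$ R CMD check mytools_1.0.tar.gz
$ R CMD INSTALL mytools_1.0.tar.gz    # or ORE CMD INSTALL, to install it for Oracle R Enterprise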


Tips and Tricks

Model cross-validation with ore.CV()

In this blog post we illustrate how to use Oracle R Enterprise for performing cross-validation of regression and classification models. We describe a new utility R function, ore.CV, that leverages features of Oracle R Enterprise and is available for download and use.

Predictive models are usually built on given data and verified on held-aside or unseen data. Cross-validation is a model improvement technique that avoids the limitations of a single train-and-test experiment by building and testing multiple models via repeated sampling from the available data. Its purpose is to offer better insight into how well the model would generalize to new data and to avoid over-fitting and deriving wrong conclusions from misleading peculiarities of the seen data.

In a k-fold cross-validation the data is partitioned into k (roughly) equal-size subsets. One of the subsets is retained for testing and the remaining k-1 subsets are used for training. The process is repeated k times, with each of the k subsets serving exactly once as testing data. Thus, all observations in the original data set are used for both training and testing. The choice of k depends, in practice, on the size n of the data set. For large data, k=3 could be sufficient. For very small data, the extreme case where k=n, leave-one-out cross-validation (LOOCV), would use a single observation from the original sample as testing data and the remaining observations as training data. Common choices are k=10 or k=5.

For a select set of algorithms and cases, the function ore.CV performs cross-validation for models generated by ORE regression and classification functions using in-database data. ORE embedded R execution is leveraged to support cross-validation also for models built with vanilla R functions.

Usage

ore.CV(funType, function, formula, dataset, nFolds=<nb.folds>, fun.args=NULL, pred.args=NULL, pckg.lst=NULL)

funType - "regression" or "classification"
function - ORE predictive modeling functions for regression & classification or R function (regression only)
formula - object of class "formula"
dataset - name of the ore.frame
nFolds - number of folds
fun.args - list of supplementary arguments for 'function'
pred.args - list of supplementary arguments for 'predict'. Must be consistent with the model object/model generator 'function'.
pckg.lst - list of packages to be loaded by the DB R engine for embedded execution.

The set of functions supported for ORE includes: ore.lm, ore.stepwise, ore.neural, ore.glm, ore.odmDT, ore.odmSVM, ore.odmGLM, ore.odmNB.

The set of functions supported for R includes: lm, glm, svm.

Note: The 'ggplot2' and 'reshape' packages are required on the R client side for data post-processing and plotting (classification CV).

Examples

In the following examples, we illustrate various ways to invoke ore.CV using some datasets we have seen in previous posts. The datasets can be created as ore.frame objects using:

IRIS <- ore.push(iris)
LONGLEY <- ore.push(longley)
library(rpart)
KYPHOSIS <- ore.push(kyphosis)
library(PASWR)
TITANIC3 <- ore.push(titanic3)
MTCARS <- ore.push(mtcars)

(A) Cross-validation for models generated with ORE functions.
# Basic specification
ore.CV("regression","ore.lm",Sepal.Length~.-Species,"IRIS",nFolds=5)
ore.CV("regression","ore.neural",Employed~GNP+Population+Year,
       "LONGLEY",nFolds=5)

# Specification of function arguments
ore.CV("regression","ore.stepwise",Employed~.,"LONGLEY",nFolds=5,
       fun.args= list(add.p=0.15,drop.p=0.15))
ore.CV("regression","ore.odmSVM",Employed~GNP+Population+Year,
       "LONGLEY",nFolds=5, fun.args="regression")

# Specification of function arguments and prediction arguments
ore.CV("classification","ore.glm",Kyphosis~.,"KYPHOSIS",nFolds=5,
       fun.args=list(family=binomial()),pred.args=list(type="response"))
ore.CV("classification","ore.odmGLM",Kyphosis~.,"KYPHOSIS",nFolds=5,
       fun.args= list(type="logistic"),pred.args=list(type="response"))

(B) Cross-validation for models generated with R functions via the ORE embedded execution mechanism.

ore.CV("regression","lm",mpg~cyl+disp+hp+drat+wt+qsec,"MTCARS",nFolds=3)
ore.CV("regression","svm",Sepal.Length~.-Species,"IRIS",nFolds=5,
       fun.args=list(type="eps-regression"), pckg.lst=c("e1071"))

Restrictions

The signature of the model generator 'function' must be of the following type: function(formula,data,...). For example, functions like ore.stepwise, ore.odmGLM and lm are supported, but the R step(object,scope,...) function for AIC model selection via the stepwise algorithm does not satisfy this requirement.

The model validation process requires the prediction function to return a (1-dimensional) vector with the predicted values. If the (default) returned object is different, the requirement must be met by providing an appropriate argument through 'pred.args'. For example, for classification with ore.glm or ore.odmGLM the user should specify pred.args=list(type="response").

Cross-validation of classification models via embedded R execution of vanilla R functions is not supported yet.

Remark: Cross-validation is not a technique intended for large data, as the cost of multiple model training and testing can become prohibitive. Moreover, with large data sets, it is possible to produce effective sampled train and test data sets. The current ore.CV does not impose any restrictions on the size of the input, and the user working with large data should use good judgment when choosing the model generator and the number of folds.

Output

The function ore.CV provides output on several levels: datastores to contain model results, plots, and text output.

Datastores

The results of each cross-validation run are saved into a datastore named dsCV_funTyp_data_Target_function_nFxx, where funTyp, function, and nF(=nFolds) have been described above and Target is the left-hand side of the formula. For example, if one runs the ore.neural, ore.glm, and ore.odmNB-based cross-validation examples from above, the following three datastores are produced:

R> ds <- ore.datastore(pattern="dsCV")
R> print(ds)
                                        datastore.name object.count    size       creation.date description
1   dsCV_classification_KYPHOSIS_Kyphosis_ore.glm_nF5            10 4480326 2014-04-30 18:19:55        <NA>
2 dsCV_classification_TITANIC3_survived_ore.odmNB_nF5            10  592083 2014-04-30 18:21:35        <NA>
3     dsCV_regression_LONGLEY_Employed_ore.neural_nF5            10  497204 2014-04-30 18:16:35        <NA>

Each datastore contains the models and prediction tables for every fold. Every prediction table has 3 columns: the fold index together with the target variable/class and the predicted values.
If we consider the example from above and examine the most recent datastore (the Naive Bayes classification CV), we would see:

R> ds.last <- ds$datastore.name[which.max(as.numeric(ds$creation.date))]
R> ore.datastoreSummary(name=ds.last)
   object.name     class   size length row.count col.count
1  model.fold1 ore.odmNB  66138      9        NA        NA
2  model.fold2 ore.odmNB  88475      9        NA        NA
3  model.fold3 ore.odmNB 110598      9        NA        NA
4  model.fold4 ore.odmNB 133051      9        NA        NA
5  model.fold5 ore.odmNB 155366      9        NA        NA
6   test.fold1 ore.frame   7691      3       261         3
7   test.fold2 ore.frame   7691      3       262         3
8   test.fold3 ore.frame   7691      3       262         3
9   test.fold4 ore.frame   7691      3       262         3
10  test.fold5 ore.frame   7691      3       262         3

Plots

The following plots are generated automatically by ore.CV and saved in an automatically generated OUTPUT directory:

Regression: ore.CV compares predicted vs. target values and produces root mean square error (RMSE) and relative error (RERR) boxplots per fold. The example below is based on 5-fold cross-validation with the ore.lm regression model for Sepal.Length ~ .-Species using the ore.frame IRIS dataset.

Classification: ore.CV outputs a multi-plot figure for classification metrics like Precision, Recall and F-measure. Each metric is captured per target class (side-by-side barplots) and fold (groups of barplots). The example below is based on the 5-fold CV of the ore.odmSVM classification model for Species ~ . using the ore.frame IRIS dataset.

Text output

For classification problems, the confusion tables for each fold are saved in an output file residing in the OUTPUT directory, together with a summary table displaying the precision, recall and F-measure metrics for every fold and predicted class.

file.show("OUTDIR/tbl_CV_classification_IRIS_Species_ore.odmSVM_nF5")

Confusion table for fold 1 :
           setosa versicolor virginica
setosa          9          0         0
versicolor      0         12         1
virginica       0          1         7

Confusion table for fold 2 :
           setosa versicolor virginica
setosa          9          0         0
versicolor      0          8         1
virginica       0          2        10

Confusion table for fold 3 :
           setosa versicolor virginica
setosa         11          0         0
versicolor      0         10         2
virginica       0          0         7

Confusion table for fold 4 :
           setosa versicolor virginica
setosa          9          0         0
versicolor      0         10         0
virginica       0          2         9

Confusion table for fold 5 :
           setosa versicolor virginica
setosa         12          0         0
versicolor      0          5         1
virginica       0          0        12

Accuracy, Recall & F-measure table per {class,fold}

   fold      class TP  m  n Precision Recall F_meas
1     1     setosa  9  9  9     1.000  1.000  1.000
2     1 versicolor 12 13 13     0.923  0.923  0.923
3     1  virginica  7  8  8     0.875  0.875  0.875
4     2     setosa  9  9  9     1.000  1.000  1.000
5     2 versicolor  8  9 10     0.889  0.800  0.842
6     2  virginica 10 12 11     0.833  0.909  0.870
7     3     setosa 11 11 11     1.000  1.000  1.000
8     3 versicolor 10 12 10     0.833  1.000  0.909
9     3  virginica  7  7  9     1.000  0.778  0.875
10    4     setosa  9  9  9     1.000  1.000  1.000
11    4 versicolor 10 10 12     1.000  0.833  0.909
12    4  virginica  9 11  9     0.818  1.000  0.900
13    5     setosa 12 12 12     1.000  1.000  1.000
14    5 versicolor  5  6  5     0.833  1.000  0.909
15    5  virginica 12 12 13     1.000  0.923  0.960

What's next

Several extensions of ore.CV are possible, involving sampling, parallel model training and testing, support for vanilla R classifiers, post-processing, and output. More material for future posts.
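To work further with the per-fold artifacts that ore.CV stores, the datastore objects can be loaded back into the R session. The following is a minimal sketch; it assumes the Naive Bayes datastore listed above exists in the connected schema and that ore.load restores its objects into the current environment.

ds.name <- "dsCV_classification_TITANIC3_survived_ore.odmNB_nF5"
ore.load(name = ds.name)     # brings model.fold1..5 and test.fold1..5 into the session
class(model.fold1)           # an ore.odmNB model, usable with predict()
head(test.fold1)             # fold index, target class, and predicted value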


Best Practices

Step-by-step: Returning R statistical results as a Database Table

R provides a rich set of statistical functions that we may want to use directly from SQL. Many of these results can be readily expressed as structured table data for use with other SQL tables, or for use by SQL-enabled applications, e.g., dashboards or other statistical tools. In this blog post, we illustrate in a sequence of five simple steps how to go from an R function to a SQL-enabled result.

Taken from a recent "proof of concept" customer engagement, our example involves using the function princomp, which performs a principal components analysis on a given numeric data matrix and returns the results as an object of class princomp. The customer actively uses this R function to produce loadings used in subsequent computations and analysis. The loadings object is a matrix whose columns contain the eigenvectors. The customer's current process of pulling data from their Oracle Database, starting an R engine, invoking the R script, and placing the results back in the database was proving non-performant and unnecessarily complex. The goal was to leverage Oracle R Enterprise to streamline this process and allow the results to be immediately accessible through SQL.

As a best practice, here is a process that can get you from start to finish:

Step 1: Invoke from command line, understand results

If you're using a particular R function, chances are you are familiar with its output. However, you may not be familiar with its structure. We'll use an example from the R princomp documentation that uses the USArrests data set. We see that the class of the result is of type princomp, and the model prints the call and standard deviations of the components. To understand the underlying structure, we invoke the function str and see there are seven elements in the list, one of which is the matrix loadings.

mod <- princomp(USArrests, cor = TRUE)
class(mod)
mod
str(mod)

Results:

R> mod <- princomp(USArrests, cor = TRUE)
R> class(mod)
[1] "princomp"
R> mod
Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

 4  variables and  50 observations.
R> str(mod)
List of 7
 $ sdev    : Named num [1:4] 1.575 0.995 0.597 0.416
  ..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
 $ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
  .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
 $ center  : Named num [1:4] 7.79 170.76 65.54 21.23
  ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
 $ scale   : Named num [1:4] 4.31 82.5 14.33 9.27
  ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
 $ n.obs   : int 50
 $ scores  : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:50] "1" "2" "3" "4" ...
  .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
 $ call    : language princomp(x = USArrests, cor = TRUE)
 - attr(*, "class")= chr "princomp"

Step 2: Wrap script in a function, and invoke from ore.tableApply

Since we want to invoke princomp on database data, we first push the demo data, USArrests, to the database to create an ore.frame. Other data we wish to use will also be in database tables.

We'll use ore.tableApply (for the reasons cited in the previous blog post), providing the ore.frame as the first argument and simply returning within our function the model produced by princomp. We'll then look at its class, retrieve the result from the database, and check its class and structure once again.
Notice that we obtain the exact same result with the database R engine, through embedded R execution, as we did with our local R engine.

dat <- ore.push(USArrests)
computePrincomp <- function(dat) princomp(dat, cor=TRUE)
res <- ore.tableApply(dat, computePrincomp)
class(res)
res.local <- ore.pull(res)
class(res.local)
str(res.local)
res.local
res

Results:

R> dat <- ore.push(USArrests)
R> computePrincomp <- function(dat) princomp(dat, cor=TRUE)
R> res <- ore.tableApply(dat, computePrincomp)
R> class(res)
[1] "ore.object"
attr(,"package")
[1] "OREembed"
R> res.local <- ore.pull(res)
R> class(res.local)
[1] "princomp"
R> str(res.local)
List of 7
 $ sdev    : Named num [1:4] 1.575 0.995 0.597 0.416
  ..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
 $ loadings: loadings [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
  .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
 $ center  : Named num [1:4] 7.79 170.76 65.54 21.23
  ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
 $ scale   : Named num [1:4] 4.31 82.5 14.33 9.27
  ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
 $ n.obs   : int 50
 $ scores  : num [1:50, 1:4] -0.986 -1.95 -1.763 0.141 -2.524 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:50] "1" "2" "3" "4" ...
  .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
 $ call    : language princomp(x = dat, cor = TRUE)
 - attr(*, "class")= chr "princomp"
R> res.local
Call:
princomp(x = dat, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

 4  variables and  50 observations.
R> res
Call:
princomp(x = dat, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

 4  variables and  50 observations.

Step 3: Determine what results we really need

Since we are only interested in the loadings, and any result we return needs to be a data.frame so it can be turned into a database row set (table result), we build the model, transform the loadings object into a data.frame, and return the data.frame as the function result. We then view the class of the result and its values. Since we do this from the R API, we can simply print res to display the returned data.frame, as the print does an implicit ore.pull. We also use ore.create to store the USArrests data as the persistent database table USARRESTS, which we will query from SQL in Step 5.

returnLoadings <- function(dat) {
                    mod <- princomp(dat, cor=TRUE)
                    dd <- dim(mod$loadings)
                    ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
                    ldgs$variables <- row.names(ldgs)
                    ldgs
                  }
res <- ore.tableApply(dat, returnLoadings)
class(res)
res
ore.create(USArrests, table="USARRESTS")

Results:

R> res <- ore.tableApply(dat, returnLoadings)
R> class(res)
[1] "ore.object"
attr(,"package")
[1] "OREembed"
R> res
             Comp.1     Comp.2     Comp.3     Comp.4 variables
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780    Murder
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748   Assault
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773  UrbanPop
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432      Rape

Step 4: Load the script into the R Script Repository in the database

We're now at the point of loading the script into the R Script Repository so it can be invoked from SQL. We can create the function from R or from SQL.
In R:

ore.scriptCreate('princomp.loadings', returnLoadings)

or from SQL:

begin
  --sys.rqScriptDrop('princomp.loadings');
  sys.rqScriptCreate('princomp.loadings',
      'function(dat) {
        mod <- princomp(dat, cor=TRUE)
        dd <- dim(mod$loadings)
        ldgs <- as.data.frame(mod$loadings[1:dd[1],1:dd[2]])
        ldgs$variables <- row.names(ldgs)
        ldgs
      }');
end;
/

Step 5: Invoke from a SQL SELECT statement

Finally, we're able to invoke the function from SQL using the rqTableEval table function. We pass in a cursor with the data from our USARRESTS table. We have no parameters, so the next argument is NULL. To get the results as a table, we specify a SELECT string that defines the structure of the result. Note that the column names must be identical to those returned in the R data.frame. The last parameter is the name of the function in the R script repository. Invoking this, we see the result as a table from the SELECT statement.

select *
from table(rqTableEval( cursor(select * from USARRESTS),
                        NULL,
                        'select 1 as "Comp.1", 1 as "Comp.2", 1 as "Comp.3", 1 as "Comp.4", cast(''a'' as varchar2(12)) "variables" from dual',
                        'princomp.loadings'));

Results:

    Comp.1     Comp.2     Comp.3     Comp.4 variables
---------- ---------- ---------- ---------- ------------
-.53589947  .418180865 -.34123273  .649227804 Murder
-.58318363  .187985604 -.26814843 -.74340748  Assault
-.27819087 -.87280619  -.37801579  .133877731 UrbanPop
-.54343209 -.16731864   .817777908 .089024323 Rape

As you can see above, the loadings result is returned as a SQL table. In this example, we walked through the steps of moving from invoking an R function to obtain a specific result, to producing that same result from SQL by invoking an R script at the database server under the control of Oracle Database.
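As an aside, once the script is registered, it can also be invoked by name from the R interface. The following is a minimal sketch, assuming the 'princomp.loadings' script created in Step 4 and the dat ore.frame from Step 2; the FUN.NAME argument asks ore.tableApply to look the function up in the R Script Repository rather than shipping a client-side function.

# Invoke the stored 'princomp.loadings' script by name (sketch; requires an
# active ore.connect() session and the script registered as in Step 4)
res <- ore.tableApply(dat, FUN.NAME = "princomp.loadings")
res   # printing implicitly pulls the data.frame of loadings to the client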


Best Practices

ore.doEval and ore.tableApply: Which one is right for me?

When beginning to use Oracle R Enterprise, users quickly grasp the techniques and benefits of using embedded R to run scripts in database-side R engines, and gain a solid sense of the available functions for executing R scripts through Oracle Database. However, several embedded R functions are closely related, and a few tips can help in learning which functions are most appropriate for the problem you wish to solve. In this post, we'll demonstrate best practices for two of the non-parallel embedded R functions, ore.doEval and ore.tableApply.

As with all embedded R functions, both ore.doEval and ore.tableApply invoke R scripts at the database server in an R engine. The difference is that ore.doEval does not take a data table as an input parameter, as it's designed simply to execute the function provided. In contrast, ore.tableApply accepts a data table (i.e., an ore.frame) as input to be delivered to the embedded R function. The two functions can be made equivalent simply by passing the name of the database table and pulling the table data within the ore.doEval function.

In the following examples, we show embedded R run times for ore.doEval and ore.tableApply using simple functions that build linear models to predict flight arrival delay based on distance traveled and departure delay.

Model 1: Although ore.doEval does not explicitly accept data from a dedicated input argument, it's possible to retrieve data from the database using ore.sync and ore.pull within the function:

R> system.time(mod1 <- ore.doEval(function() {
           ore.sync(table = "ONTIME_S")
           dat <- ore.pull(ore.get("ONTIME_S"))
           lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)},
           ore.connect = TRUE))
   user  system elapsed
  0.008   0.000   4.941

Model 2: Data can also be passed to a function in the R interface of embedded R execution, as shown here with ore.doEval, when connected to the database schema where the data table resides:

R> system.time(mod2 <- ore.doEval(function(dat) {
           lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)},
           dat = ONTIME_S))
   user  system elapsed
  3.196   0.128   9.476

Model 3: The ore.tableApply function is designed to accept a database table, that is, an ore.frame, as the first input argument:

R> system.time(mod3 <- ore.tableApply(ONTIME_S, function(dat) {
           lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)}))
   user  system elapsed
  0.001   0.000   3.870

As the elapsed timings show, ore.tableApply (Model 3) is faster than both ore.doEval implementations (Models 1 and 2). Results may vary depending on data size and the operation being performed. The ONTIME_S airline data used in these examples contains 220,000 rows and 26 columns, and the tests were executed on a Linux 5.8 server with 12 GB RAM and a single processor.

In summary, ore.doEval takes a function parameter but can be programmed to source data from Oracle Database or another external source. If your processing is driven by a database table, ore.tableApply is preferable because it's optimized for data transfer from Oracle Database to R. For both approaches, the data must fit in the database R engine's available memory.

Unlike other embedded R functions, ore.doEval and ore.tableApply run serially, executing a single R process with the entire data in memory. The other embedded R functions are enabled for parallel execution, and each has a distinct use case: row-wise "chunked" computations can be executed using ore.rowApply.
The function ore.groupApply can be applied to grouped data for data sets that have a natural partitioning. Lastly, the function ore.indexApply supports task-based execution, where one or more R engines perform the same or different calculations, or tasks, a specified number of times.

In this post, we haven't yet addressed the functions rqEval and rqTableEval, the SQL equivalents of ore.doEval and ore.tableApply. One distinction between the R and SQL embedded R execution interfaces is that you can pass a data.frame or ore.frame as an argument in the R interface, as illustrated with ore.doEval above; the SQL interface, however, takes only scalar arguments as input, for example, with the rqEval function. The rqTableEval function accepts a full table through a dedicated input argument, and this input may be a cursor or the result of a query.
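To make the parallel variants described above concrete, here is a minimal sketch assuming the ONTIME_S ore.frame used in the timing examples; the chunk size, the UNIQUECARRIER grouping column, and the number of tasks are illustrative assumptions rather than recommendations.

# Row-chunked execution: the function is called once per chunk of rows (sketch)
chunk.means <- ore.rowApply(ONTIME_S,
                  function(dat) mean(dat$DEPDELAY, na.rm = TRUE),
                  rows = 50000)

# Grouped execution: one invocation per value of the partitioning column (sketch)
carrier.mods <- ore.groupApply(ONTIME_S, ONTIME_S$UNIQUECARRIER,
                  function(dat) lm(ARRDELAY ~ DEPDELAY, data = dat))

# Task-based execution: run the same task a fixed number of times (sketch)
sims <- ore.indexApply(10, function(i) summary(rnorm(1000)))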


Oracle's Strategy for Advanced Analytics

At Oracle, our goal is to enable you to get timely insight from all of your data. We continuously enhance Oracle Database to allow workloads that have traditionally required extracting data from the database to run in-place. We do this to narrow the gap between the insights that can be obtained and the available data, because any data movement introduces latencies, complexity due to more moving parts, the ensuing need for data reconciliation and governance, and increased cost. The Oracle tool set considers the needs of all types of enterprise users: users preferring GUI-based access to analytics with smart defaults and heuristics out of the box, users choosing to work interactively and quantitatively with data using R, and users preferring SQL and focusing on operationalization of models.

Oracle recognized the need to support data analysts, statisticians, and data scientists with a widely used and rapidly growing statistical programming language. Oracle chose R, recognizing it as the new de facto standard for computational statistics and advanced analytics. Oracle supports R in at least three ways:

1. R as the language of interaction with the database
2. R as the language in which analytics can be written and executed in the database as a high performance computing platform
3. R as the language in which several native high performance analytics have been written that execute in database

Additionally, of course, you may choose to leverage any of the CRAN algorithms to execute R scripts at the database server, leveraging several forms of data parallelism.

Providing the first and only supported commercial distribution of R from an established company, Oracle released Oracle R Distribution. In 2012, Oracle embarked on the Hadoop journey, acknowledging the alternative data management options emerging in open source for managing unstructured or not-yet-structured data. In keeping with our strategy of delivering analytics close to where data is stored, Oracle extended Advanced Analytics capabilities to execute on HDFS-resident data in Hadoop environments. R has been integrated into Hadoop in exactly the same manner as it has been with the database.

Realizing that data is stored in both database and non-database environments, Oracle gives users options for where to store their data (Oracle Database, HDFS, or Spark RDDs), where to perform computations (in-database or on the Hadoop cluster), and where results should be stored (Oracle Database or HDFS). Users can write R scripts that can be leveraged across database and Hadoop environments. Oracle Database, as a preferred location for storing R scripts, data, and result objects, provides a real-time scoring and deployment platform. It is also easy to create a model factory environment with authorization, roles, and privileges, combined with auditing, backup, recovery, and security.

Oracle provides a common infrastructure that supports both in-database and custom R algorithms, along with an integrated GUI for business users, offering both R-based and GUI-based access to in-database analytics. A major part of Oracle's strategy is to maintain agility in our portfolio of supported techniques, remaining responsive to customer needs.


Best Practices

Why choose Oracle for Advanced Analytics?

If you're an enterprise company, chances are you have your data in an Oracle database. You chose Oracle for its global reputation for providing the best software products (and now engineered systems) to support your organization. Oracle Database is known for stellar performance and scalability, and Oracle delivers world-class support. If your data is already in Oracle Database, or moving in that direction, leverage the high performance computing environment of the database to analyze your data. Traditionally, it was common practice to move data to separate analytic servers for the explicit purpose of model building. This is no longer necessary, nor is it scalable as your organization seeks to deliver value from Big Data. Oracle Database now provides several state-of-the-art algorithms that execute in a parallel and distributed architecture directly in-database, augmented by custom algorithms in the R statistical programming language. Leveraging Oracle Database for Advanced Analytics has benefits including:

• Eliminates data movement to analytic servers
• Enables analysis of all data, not just samples
• Puts your database infrastructure to even greater use
• Eliminates the impedance mismatch of model translation when operationalizing models
• Makes all aspects of modeling and deployment optionally available via SQL, enabling integration with other IT software
• Leverages CRAN algorithms directly in the database

Customers such as Stubhub, dunnhumby, CERN OpenLab, Financiera Uno, Turkcell, and others leverage Oracle Advanced Analytics to scale their applications, simplify their analytics architecture, and reduce the time to market of predictive models from weeks to hours or even minutes.

Oracle leverages its own advanced analytics products, for example, by using Oracle Advanced Analytics in a wide range of Oracle Applications and internal deployments, including:

• Human Capital Management with Predictive Workforce to produce employee turnover and performance prediction, and "what if" analysis
• Customer Relationship Management with Sales Prediction Engine to predict sales opportunities, what to sell, how much, and when
• Supply Chain Management with Spend Classification to flag non-compliance or anomalies in expense submissions
• Retail Analytics with Oracle Retail Customer Analytics to perform shopping cart analysis and next best offers
• Oracle Financial Services Analytic Applications to enable quantitative analysts in credit risk management divisions to author rules/models directly in R

Oracle wants you to be successful with advanced analytics. By working closely with customers to integrate Oracle Advanced Analytics as an integral part of their analytics strategy, customers are able to put their advanced analytics into production much faster.


Best Practices

ROracle 1-1.11 released - binaries for Windows and other platforms available on OTN

We are pleased to announce the latest update of the open source ROracle package, version 1-1.11, with enhancements and bug fixes. ROracle provides high performance and scalable interaction from R with Oracle Database. In addition to availability on CRAN, ROracle binaries for Windows and other platforms can be downloaded from the Oracle Technology Network. Users of ROracle, please take our brief survey. We want to hear from you!

Latest enhancements in version 1-1.11 of ROracle:

• Performance enhancements for RAW data types and large result sets
• Ability to cache the result set in memory to reduce memory consumption on successive reads
• Added session mode to connect as SYSDBA or using external authentication
• Bug 17383542: Enhanced dbWriteTable() and dbRemoveTable() to work on a global schema

Users of ROracle are quite pleased with the performance and functionality:

"In my position as a quantitative researcher, I regularly analyze database data up to a gigabyte in size on client-side R engines. I switched to ROracle from RJDBC because the performance of ROracle is vastly superior, especially when writing large tables. I've also come to depend on ROracle for transactional support, pulling data to my R client, and general scalability. I have been very satisfied with the support from Oracle -- their response has been prompt, friendly and knowledgeable."
           -- Antonio Daggett, Quantitative Researcher in Finance Industry

"Having used ROracle for over a year now with our Oracle Database data, I've come to rely on ROracle for high performance read/write of large data sets (greater than 100 GB), and SQL execution with transactional support for building predictive models in R. We tried RODBC but found ROracle to be faster, much more stable, and scalable."
           -- Dr. Robert Musk, Senior Forest Biometrician, Forestry Tasmania

See the ROracle NEWS for the complete list of updates. We encourage ROracle users to post questions and provide feedback on the Oracle R Technology Forum. In addition to being a high performance database interface to Oracle Database from R for general use, ROracle supports database access for Oracle R Enterprise.
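For readers new to the package, the following is a minimal sketch of typical ROracle usage; the credentials, connect string, and table name are placeholders, and the new caching and SYSDBA options from this release are not shown.

# Basic ROracle round trip (sketch; substitute your own credentials and connect string)
library(ROracle)
drv <- dbDriver("Oracle")
con <- dbConnect(drv, username = "scott", password = "tiger",
                 dbname = "//dbhost:1521/orcl")

# Write an R data.frame to a database table and read a result back
dbWriteTable(con, "IRIS_TAB", iris, overwrite = TRUE)
res <- dbGetQuery(con, "select count(*) n from IRIS_TAB")
print(res)

dbDisconnect(con)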


Best Practices

Oracle R Enterprise Upgrade Steps

We've recently announced that Oracle R Enterprise 1.4 is available on all platforms. To upgrade Oracle R Enterprise to the latest version:

  1. *Install the version of R that is required for the new version of Oracle R Enterprise. See the Oracle R Enterprise supported platforms matrix for the latest requirements.
  2. Update Oracle R Enterprise Server on the database server by running the install.sh script and following the prompts for the upgrade path.
  3. Update the Oracle R Enterprise supporting packages on the database server.
  4. Update the Oracle R Enterprise client and supporting packages on the client.

For RAC/Exadata installations, steps 1, 2, and 3 must be performed on all compute nodes.

*If you've changed the R installation directory between releases, manually update the location of the R_HOME directory in the Oracle R Enterprise configuration table. The sys.rqconfigset procedure edits settings in a configuration table called sys.rq_config. Use of this procedure requires the sys privilege. You can view the contents of this table to verify various environment settings for Oracle R Enterprise. Among the settings stored in sys.rq_config is the R installation directory, R_HOME. The following query shows sample values stored in sys.rq_config for a Linux server:

SQL> select * from sys.rq_config;

NAME         VALUE
-----------  ------------------------------------------------------
R_HOME       /usr/lib64/R
MIN_VSIZE    32M
MAX_VSIZE    4G
R_LIBS_USER  /u01/app/oracle/product/12.0.1/dbhome_1/R/library
VERSION      1.4
MIN_NSIZE    2M
MAX_NSIZE    20M

7 rows selected.

To point to the correct R_HOME:

SQL> exec sys.rqconfigset('R_HOME', '<path to current R installation directory>')

All Oracle R Enterprise downloads are available on the Oracle Technology Network. Refer to the instructions in section 8.3 of the Oracle R Enterprise Installation Guide for detailed steps on upgrading Oracle R Enterprise, and don't hesitate to post questions to the Oracle R forum.
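Once the upgrade is complete, a quick client-side check can confirm the new package version and server connectivity. This is a minimal sketch; the user, password, SID, and host values are placeholders for your environment.

# Verify the upgraded client and server from R (sketch)
library(ORE)
packageVersion("ORE")        # should report the new release, e.g. 1.4
ore.connect(user = "rquser", password = "rquser_pwd",
            sid = "orcl", host = "dbhost", all = TRUE)
ore.is.connected()           # TRUE when the connection succeeds
ore.ls()                     # database tables now visible as ore.frames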


News

Oracle R Enterprise 1.4 Released

We’re pleased to announce that Oracle R Enterprise (ORE) 1.4 is now available for download on all supported platforms. In addition to numerous bug fixes, ORE 1.4 introduces an enhanced high performance computing infrastructure, new and enhanced parallel distributed predictive algorithms for both scalability and performance, added support for production deployment, and compatibility with the latest R versions. These updates enable IT administrators to easily migrate the ORE database schema to speed production deployment, while statisticians and analysts gain access to a larger set of analytics techniques for more powerful predictive models. Here are the highlights of the new and upgraded features in ORE 1.4:

Upgraded R version compatibility

• ORE 1.4 is certified with R-3.0.1, both open source R and Oracle R Distribution. See the server support matrix for the complete list of supported R versions. R-3.0.1 brings improved performance and big-vector support to R, and compatibility with more than 5000 community-contributed R packages.

High Performance Computing Enhancements

• Ability to specify the degree of parallelism (DOP) for parallel-enabled functions (ore.groupApply, ore.rowApply, and ore.indexApply)
• An additional global option, ore.parallel, to set the number of parallel threads used in embedded R execution

Data Transformations and Analytics

• ore.neural now provides a highly flexible network architecture with a wide range of activation functions, supporting thousands of formula-derived columns, in addition to being a parallel and distributed implementation capable of supporting billion-row data sets
• ore.glm now also prevents selection of less optimal coefficient methods with parallel distributed in-database execution
• Support for weights in regression models
• New ore.esm enables time series analysis, supporting both simple and double exponential smoothing for scalable in-database execution
• Execute standard R functions for Principal Component Analysis (princomp), ANOVA (anova), and factor analysis (factanal) on database data

Oracle Data Mining Model Algorithm Functions

Newly exposed in-database Oracle Data Mining algorithms:
• ore.odmAssocRules for building Oracle Data Mining association models using the apriori algorithm
• ore.odmNMF for building Oracle Data Mining feature extraction models using the Non-Negative Matrix Factorization (NMF) algorithm
• ore.odmOC for building Oracle Data Mining clustering models using the Orthogonal Partitioning Cluster (O-Cluster) algorithm

Production Deployment

• New migration utility eases production deployment from development environments
• "Snapshotting" of production environments for debugging in test systems

For a complete list of new features, see the Oracle R Enterprise User's Guide. To learn more about Oracle R Enterprise, check out the white paper "Bringing R to the Enterprise - A Familiar R Environment with Enterprise-Caliber Performance, Scalability, and Security", visit Oracle R Enterprise on Oracle's Technology Network, or review the variety of use cases on the Oracle R blog.
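To give a feel for two of the additions, here is a minimal sketch; it assumes the ore.parallel option behaves as described above and that the overloaded princomp accepts an ore.frame with the familiar base R arguments, as the release highlights suggest.

# Global default for the number of parallel threads in embedded R execution (sketch)
options(ore.parallel = 4)

# Principal components directly on database data via the overloaded princomp
# (sketch; dat is an ore.frame, here pushed from the USArrests demo data)
dat <- ore.push(USArrests)
pca <- princomp(dat, cor = TRUE)
pca        # prints the call and component standard deviations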

