Wednesday Apr 17, 2013

Mind Reading... What are our customers thinking?

Overhauling analytics processes is becoming a recurring theme among customers. A major telecommunications provider recently set out to rework its analytics process for customer surveys, with three broad technical goals:

  • Provide an agile environment that empowers business analysts to test hypotheses based on survey results
  • Allow dynamic customer segmentation based on survey responses and even specific survey questions to drive hypothesis testing
  • Make results of new surveys readily available for research

The ultimate goal is to derive greater value from survey research that drives measurable improvements in survey service delivery, and as a result, overall customer satisfaction.

This provider chose Oracle Advanced Analytics (OAA) to power their survey research. Survey results and analytics are maintained in Oracle Database and delivered via a parameterized BI dashboard. Both the database and BI infrastructure are standard components in their architecture.

A parameterized BI dashboard enables analysts to create samples for hypothesis testing by filtering respondents to a survey question based on a variety of filtering criteria. This provider required the ability to deploy a range of statistical techniques depending on the survey variables, level of measurement of each variable, and the needs of survey research analysts.

Oracle Advanced Analytics offers a range of in-database statistical techniques complemented by a unique architecture supporting deployment of open source R packages in-database to optimize data transport to and from database-side R engines. Additionally, depending on the nature of functionality in such R packages, it is possible to leverage data-parallelism constructs available as part of in-database R integration. Finally, all OAA functionality is exposed through SQL, the ubiquitous language of the IT environment. This enables OAA-based solutions to be readily integrated with BI and other IT technologies.

The survey application noted above has been in production for 3 months. It supports a team of 20 business analysts and has already begun to demonstrate measurable improvements in customer satisfaction.

In the rest of this blog, we explore the range of statistical techniques deployed as part of this application.

At the heart of survey research is hypothesis testing. A completed customer satisfaction survey contains data used to draw conclusions about the state of the world. In the survey domain, hypothesis testing means comparing the answers to specific survey questions across two distinct groups of customers and assessing whether the difference between them is significant. Such groups are identified based on knowledge of the business and technically specified through filtering predicates.
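As a rough illustration of how two groups might be specified through filtering predicates, here is a minimal sketch in plain Python. It is not the provider's implementation. The field names (`plan`, `tenure_months`, `q7_score`), the data values, and the `sample` helper are all hypothetical.

```python
# Hypothetical survey responses; field names and values are illustrative only.
responses = [
    {"plan": "prepaid",  "tenure_months": 30, "q7_score": 4},
    {"plan": "postpaid", "tenure_months": 8,  "q7_score": 2},
    {"plan": "prepaid",  "tenure_months": 14, "q7_score": 5},
    {"plan": "postpaid", "tenure_months": 26, "q7_score": 3},
    {"plan": "prepaid",  "tenure_months": 3,  "q7_score": 3},
]


def sample(rows, predicate, question):
    """Return the answers to one question for respondents matching a predicate."""
    return [row[question] for row in rows if predicate(row)]


# Each group is defined by a filtering predicate chosen by the analyst.
group_a = sample(
    responses,
    lambda r: r["plan"] == "prepaid" and r["tenure_months"] >= 12,
    "q7_score",
)
group_b = sample(responses, lambda r: r["plan"] == "postpaid", "q7_score")
```

In a dashboard setting, the predicates would presumably come from the filter parameters the analyst selects rather than from hard-coded lambdas.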

Hypothesis testing sets up the world as consisting of 2 mutually exclusive hypotheses:

a) Null hypothesis - states that there is no difference in satisfaction levels between the 2 groups of customers

b) Alternative hypothesis - states that there is a significant difference in satisfaction levels between the 2 groups of customers

Only one of these can be true, and the decision between them rests on how strong the evidence in the survey data is against the null hypothesis. Put simply, the degree of difference between, for example, the average score on a specific survey question across two customer groups can provide the evidence needed to decide which hypothesis to accept.

In practice, making that decision requires access to a range of test statistics. A test statistic is a number calculated from the sample data of the groups being compared that helps determine whether to retain the null hypothesis or reject it in favor of the alternative. A great deal of theory, experience, and business knowledge goes into selecting the right statistic for the problem at hand.

The t-statistic (available in-database) is a fundamental function used in hypothesis testing that helps assess the difference in means across two groups. When the t-value computed from 2 groups of customers for a specific survey question is extreme, the alternative hypothesis is likely to be true. It is common to set a critical value that the absolute value of the observed t-value must exceed in order to conclude that the satisfaction survey results across the two groups are significantly different. Other statistics available in-database include the F-test, cross tabulation (frequencies of various response combinations captured as a table), and related hypothesis testing functions such as chi-square functions, Fisher's exact test, Kendall's coefficients, correlation coefficients, and a range of lambda functions.
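To make the critical-value comparison concrete, here is a minimal stand-alone sketch of a two-sample t-statistic (Welch's form, which does not assume equal variances) in plain Python. It is not the in-database function mentioned above. The scores are made-up illustration data, and `welch_t` and `CRITICAL` are hypothetical names.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t-statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up scores for one survey question in two customer groups.
group_a = [4, 5, 3, 4, 5, 4]
group_b = [2, 3, 3, 2, 4, 3]

t, df = welch_t(group_a, group_b)

# Approximate two-sided 5% critical value of the t distribution with 10
# degrees of freedom; a different df or significance level needs another value.
CRITICAL = 2.23
significant = abs(t) > CRITICAL
```

For this made-up data the two groups happen to have equal sizes and variances, so the degrees of freedom come out to 10 and the hard-coded critical value applies. In general the critical value would be looked up for the computed degrees of freedom.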

If an analyst wants to compare more than 2 groups, analysis of variance (ANOVA) is the collection of techniques commonly used. This is an area where the R package ecosystem is rich, with several proven implementations. The R stats package has implementations of several test statistics, and its glm function allows analysis of the count data common in survey results, including building Poisson and log-linear models. R's MASS package implements a popular survey analysis technique called iterative proportional fitting. R's survey package has a rich collection of features (http://faculty.washington.edu/tlumley/survey/).
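For readers who want to see what a one-way ANOVA computes, here is a minimal sketch of the F statistic in plain Python. It is an illustration only, not the R stats implementation. The scores are made-up data and `one_way_anova_f` is a hypothetical name.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic and its degrees of freedom."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, (k - 1, n - k)


# Made-up scores for one survey question in three customer groups.
scores = [
    [4, 5, 3, 4],
    [2, 3, 3, 2],
    [5, 4, 5, 4],
]
f_stat, (df_between, df_within) = one_way_anova_f(scores)
```

A large F value relative to the critical value of the F distribution for these degrees of freedom suggests that at least one group mean differs from the others.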

The provider was specifically interested in one function in the survey package - raking (also known as sample balancing) - a process that assigns a weight to each customer who responded to a survey such that the weighted distribution of the sample is in very close agreement with the known distribution of other customer attributes, such as the type of cellular plan, demographics, or average bill amount. Raking is an iterative process that uses the sample design weight as the starting weight and terminates when convergence is achieved.
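The iterative nature of raking can be sketched in a few lines of plain Python. This is a simplified illustration, not the survey package's implementation. The attribute names, target shares, and the `rake` helper are hypothetical, and the sketch assumes every target category has at least one respondent.

```python
def rake(rows, targets, weights=None, tol=1e-8, max_iter=100):
    """Iteratively rescale weights so weighted margins match the targets.

    rows    : list of dicts, one per respondent
    targets : {attribute: {category: target share of the total weight}}
    weights : starting (design) weights; defaults to 1.0 per respondent
    """
    w = list(weights) if weights is not None else [1.0] * len(rows)
    for _ in range(max_iter):
        max_change = 0.0
        for attr, shares in targets.items():
            total = sum(w)
            # Current weighted total for each category of this attribute.
            current = {cat: 0.0 for cat in shares}
            for row, wi in zip(rows, w):
                current[row[attr]] += wi
            # Rescale so this attribute's weighted margin hits its target.
            for i, row in enumerate(rows):
                factor = shares[row[attr]] * total / current[row[attr]]
                max_change = max(max_change, abs(factor - 1.0))
                w[i] *= factor
        # Converged once a full pass over all attributes barely moves any weight.
        if max_change < tol:
            break
    return w


# Made-up respondents and target shares for two customer attributes.
rows = [
    {"plan": "prepaid",  "age": "under35"},
    {"plan": "prepaid",  "age": "35plus"},
    {"plan": "postpaid", "age": "under35"},
    {"plan": "postpaid", "age": "35plus"},
    {"plan": "postpaid", "age": "35plus"},
]
targets = {
    "plan": {"prepaid": 0.3, "postpaid": 0.7},
    "age": {"under35": 0.4, "35plus": 0.6},
}
weights = rake(rows, targets)
```

Adjusting one attribute disturbs the margins of the others, which is why the passes repeat until the weights stop changing.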

For this survey application, R scripts that expose a wide variety of statistical techniques - some in-database accessible through the transparency layer in Oracle R Enterprise and some in CRAN packages - were built and stored in the Oracle R Enterprise in-database R script repository. These parameterized scripts accept various arguments that identify the samples of customers to work with as well as specific constraints for the various hypothesis test functions. The net result is greater agility, since the business analyst determines both the set of samples to analyze and the technique to apply to each sample based on the hypothesis being pursued.

For more information see these links for Oracle's R Technologies software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop

Friday Feb 03, 2012

What is R?

For many in the Oracle community, the addition of R through Oracle R Enterprise could leave them wondering "What is R?"

R has been receiving a lot of attention recently, although it’s been around for over 15 years. R is an open-source language and environment for statistical computing and data visualization, supporting data manipulation and transformations, as well as sophisticated graphical displays. It's being taught in colleges and universities in courses on statistics and advanced analytics - even replacing more traditional statistical software tools. Corporate data analysts and statisticians often know R and use it in their daily work, either writing their own R functionality, or leveraging the more than 3400 open source packages. The Comprehensive R Archive Network (CRAN) open source packages support a wide range of statistical and data analysis capabilities. They also focus on analytics specific to individual fields, such as bioinformatics, finance, econometrics, medical image analysis, and others (see CRAN Task Views).

So why do statisticians and data analysts use R?

Well, R is a statistics language similar to SAS or SPSS. It’s a powerful, extensible environment, and as noted above, it has a wide range of statistics and data visualization capabilities. It’s easy to install and use, and it’s free – downloadable from the CRAN R project website.

Statisticians and data analysts, by contrast, typically don’t know SQL and are not familiar with database tasks. R gives them access to a wide range of analytical capabilities in a natural statistical language, allowing them to remain highly productive. For example, writing R functions is simple and can be done quickly. Functions can be made to return R objects that can be easily passed to and manipulated by other R functions. By comparison, traditional statistical tools can make the implementation of functions cumbersome, such that programmers resort to macro-oriented programming constructs instead.

So why do we need anything else?

R was conceived as a single-user tool that is not multi-threaded. The client and server components are bundled together as a single executable, much like Excel.

R is limited by the memory and processing power of the machine where it runs, but in addition, being single threaded, it cannot automatically leverage the CPU capacity on a user’s multi-processor laptop without special packages and programming.

However, there is another issue that limits R’s scalability…

R’s approach to passing data between function invocations results in data duplication – this chews up memory faster. So inherently, R is not good for big data, or depending on the machine and tasks, even gigabyte-sized data sets.

This is where Oracle R Enterprise comes in. As we'll continue to discuss in this blog, Oracle R Enterprise lifts the memory and computational constraints found in R today by executing requested R calculations on data in the database, using the database itself as the computational engine. Oracle R Enterprise allows users to further leverage Oracle's engineered systems, like Exadata, Big Data Appliance, and Exalytics, for enterprise-wide analytics, as well as reporting tools like Oracle Business Intelligence Enterprise Edition dashboards and BI Publisher documents.




