### Mind Reading... What are our customers thinking?

#### By Mark Hornick on Apr 17, 2013

Overhauling analytics processes is becoming a recurring theme among customers. A major telecommunications provider recently embarked on overhauling its analytics process for customer surveys. It had three broad technical goals:

- Provide an agile environment that empowers business analysts to test hypotheses based on survey results
- Allow dynamic customer segmentation based on survey responses and even specific survey questions to drive hypothesis testing
- Make results of new surveys readily available for research

The ultimate goal is to derive greater value from survey research that drives measurable improvements in survey service delivery, and as a result, overall customer satisfaction.

This provider chose Oracle Advanced Analytics (OAA) to power their survey research. Survey results and analytics are maintained in Oracle Database and delivered via a parameterized BI dashboard. Both the database and BI infrastructure are standard components in their architecture.

A parameterized BI dashboard enables analysts to create samples for hypothesis testing by filtering respondents to a survey question based on a variety of filtering criteria. This provider required the ability to deploy a range of statistical techniques depending on the survey variables, level of measurement of each variable, and the needs of survey research analysts.

Oracle Advanced Analytics offers a range of in-database statistical techniques complemented by a unique architecture supporting deployment of open source R packages in-database to optimize data transport to and from database-side R engines. Additionally, depending on the nature of functionality in such R packages, it is possible to leverage data-parallelism constructs available as part of in-database R integration. Finally, all OAA functionality is exposed through SQL, the ubiquitous language of the IT environment. This enables OAA-based solutions to be readily integrated with BI and other IT technologies.

The survey application noted above has been in production for 3 months. It supports a team of 20 business analysts and has already begun to demonstrate measurable improvements in customer satisfaction.

In the rest of this blog, we explore the range of statistical techniques deployed as part of this application.

At the heart of survey research is *hypothesis testing*. A completed customer satisfaction survey contains data used to draw conclusions about the state of the world. In the survey domain, hypothesis testing means determining whether the answers to a specific survey question differ significantly between two distinct groups of customers. Such groups are identified based on knowledge of the business and technically specified through filtering predicates.

Hypothesis testing sets up the world as consisting of two mutually exclusive hypotheses:

a) Null hypothesis - states that there is no difference in satisfaction levels between the two groups of customers

b) Alternative hypothesis - states that there is a significant difference in satisfaction levels between the two groups of customers

Only one of these can be true, and the choice between them rests on how strong the evidence against the null hypothesis is. Simplistically, the degree of difference between, e.g., the average score for a specific survey question across two customer groups could provide the evidence needed to decide which hypothesis to accept.

In practice, providing evidence to make a decision involves having access to a range of test statistics. A test statistic is a number calculated from the sample data that helps determine the choice between the null and the alternative hypothesis. A great deal of theory, experience, and business knowledge goes into selecting the right statistic based on the problem at hand.

The t-statistic (available in-database) is a fundamental function used in hypothesis testing that helps understand the difference in means between two groups. When the t-value computed from two groups of customers for a specific survey question is extreme, the null hypothesis is rejected in favor of the alternative hypothesis. It is common to set a critical value that the observed t-value must exceed, in absolute value, to conclude that the satisfaction survey results for the two groups are significantly different. Other similar statistics available in-database include the F-test, cross tabulation (frequencies of various response combinations captured as a table), related hypothesis testing functions such as chi-square functions, Fisher's exact test, Kendall's coefficients, correlation coefficients, and a range of lambda functions.
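To make the critical-value comparison concrete, here is a minimal sketch of a two-sample t-statistic. It is written in Python with only the standard library, purely for illustration, and is not the in-database function. The function name `welch_t`, the two group names, and the scores are all hypothetical, and the sketch assumes the unequal-variance (Welch) form of the test.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t, degrees_of_freedom) for Welch's unequal-variance t-test."""
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard error of each group mean: sample variance / group size.
    se2_a = variance(group_a) / n_a
    se2_b = variance(group_b) / n_b
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical 1-5 satisfaction scores for one survey question,
# split into two customer groups by a filtering predicate.
prepaid = [4, 5, 3, 4, 4, 5, 3, 4]
postpaid = [3, 2, 4, 3, 3, 2, 3, 4]

t_value, dof = welch_t(prepaid, postpaid)

# Assumption: two-sided test at the 5% level; 2.145 is the tabulated
# critical value of the t distribution with 14 degrees of freedom.
CRITICAL_VALUE = 2.145
reject_null = abs(t_value) > CRITICAL_VALUE
print(f"t = {t_value:.3f}, df = {dof:.1f}, reject null: {reject_null}")
```

With these made-up scores the observed t-value exceeds the critical value, so the null hypothesis of equal mean satisfaction would be rejected. In a real analysis the critical value (or a p-value) would come from the t distribution for the computed degrees of freedom rather than from a hard-coded constant.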

If an analyst wants to compare more than two groups, analysis of variance (ANOVA) is a commonly used collection of techniques. This is an area where the R package ecosystem is rich with several proven implementations. The R **stats** package has implementations of several test statistics, and its function **glm** allows analysis of count data common in survey results, including building Poisson and log-linear models. R's **MASS** package implements a popular survey analysis technique called *iterative proportional fitting*. R's **survey** package has a rich collection of features (http://faculty.washington.edu/tlumley/survey/).
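As an illustration of what a comparison across more than two groups involves, the following sketch computes the F statistic of a one-way ANOVA by hand. It is written in Python with only the standard library and is not taken from any of the R packages named above. The function name `one_way_anova_f`, the plan names, and the scores are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k, n = len(groups), len(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical 1-5 satisfaction scores for three cellular plans.
scores_by_plan = {
    "basic": [2, 3, 2, 1, 2],
    "family": [3, 3, 4, 2, 3],
    "premium": [4, 4, 5, 3, 4],
}

f_value, df_b, df_w = one_way_anova_f(list(scores_by_plan.values()))
print(f"F = {f_value:.2f} on ({df_b}, {df_w}) degrees of freedom")
```

A large F value indicates that the group means differ by more than the within-group variation would explain. As with the t-statistic, the decision is made by comparing the observed value with a critical value of the F distribution for the given degrees of freedom.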

The provider was specifically interested in one function in the **survey** package: raking (also known as sample balancing). Raking assigns a weight to each customer who responded to a survey such that the weighted distribution of the sample is in very close agreement with the known distribution of other customer attributes, such as the type of cellular plan, demographics, or average bill amount. Raking is an iterative process that uses the sample design weight as the starting weight and terminates when convergence is achieved.
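The iteration described above can be sketched in a few lines. This is a simplified illustration in Python using only the standard library, not the implementation in the **survey** package. The function names `rake` and `weighted_share`, the attribute names, the target shares, and the sample are all hypothetical, and the sketch assumes that every target category appears at least once in the sample and that the target shares for each attribute sum to 1.

```python
def rake(respondents, targets, weights=None, tol=1e-8, max_iter=100):
    """Adjust weights until weighted margins match the target shares.

    respondents: list of dicts, e.g. {"plan": "prepaid", "region": "north"}
    targets: {attribute: {category: target share}}
    weights: starting (design) weights; defaults to 1.0 per respondent
    """
    w = list(weights) if weights is not None else [1.0] * len(respondents)
    for _ in range(max_iter):
        max_change = 0.0
        for attr, shares in targets.items():
            total = sum(w)
            # Current weighted total for each category of this attribute.
            current = {}
            for r, wi in zip(respondents, w):
                current[r[attr]] = current.get(r[attr], 0.0) + wi
            # Scale each weight so this attribute's margin hits its target.
            for i, r in enumerate(respondents):
                factor = shares[r[attr]] * total / current[r[attr]]
                max_change = max(max_change, abs(factor - 1.0))
                w[i] *= factor
        # Converged when a full pass leaves every weight essentially unchanged.
        if max_change < tol:
            break
    return w


def weighted_share(respondents, weights, attr, category):
    """Weighted fraction of respondents whose attribute equals category."""
    matching = sum(w for r, w in zip(respondents, weights) if r[attr] == category)
    return matching / sum(weights)


# Hypothetical sample that over-represents prepaid customers in the north.
sample = (
    [{"plan": "prepaid", "region": "north"}] * 3
    + [{"plan": "prepaid", "region": "south"}]
    + [{"plan": "postpaid", "region": "north"}]
    + [{"plan": "postpaid", "region": "south"}]
)
# Hypothetical known shares in the customer base.
population_shares = {
    "plan": {"prepaid": 0.5, "postpaid": 0.5},
    "region": {"north": 0.5, "south": 0.5},
}

raked = rake(sample, population_shares)
print(round(weighted_share(sample, raked, "plan", "prepaid"), 4))
print(round(weighted_share(sample, raked, "region", "north"), 4))
```

Each pass rescales the weights one attribute at a time, which disturbs the margins matched earlier in the pass, so the loop repeats until the scaling factors settle at 1. Over-represented respondents end up with weights below their starting weight and under-represented ones with weights above it.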

For this survey application, R scripts that expose a wide variety of statistical techniques - some in-database accessible through the transparency layer in Oracle R Enterprise and some in CRAN packages - were built and stored in the Oracle R Enterprise in-database R script repository. These parameterized scripts accept various arguments that identify the samples of customers to work with as well as specific constraints for the various hypothesis test functions. The net result is greater agility, since the business analyst determines both the set of samples to analyze and the appropriate technique to apply to each sample based on the hypothesis being pursued.

For more information see these links for Oracle's R Technologies software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop.