In a competitive world, the key to business success is to understand enough about your customers' behavior and preferences so that you can provide a personalized service to both your prospective and existing customer base. Using customer behavior analytics techniques, you can predict how a customer will respond to a business situation, thereby giving you the information you need to approach them. These techniques can be broadly grouped into two buckets:
Regression problems: The response to this type of problem is a numerical value; for example, based on customer demographics and past purchasing patterns, how much will the customer spend in a particular instance?
Classification problems: The response to this type of problem is categorical; for example, based on the customer demographics and his or her past experience with the product/service, will the customer stay or leave the product/service?
Predicting whether a customer will stop using your product or service is an important component of customer behavior analytics called churn prediction. In this post, I will analyze two aspects of churn prediction:
For the given scenario, which factors are mainly responsible for customer churn?
Given the data pertaining to a new customer, how do you predict whether the customer will churn or stay, and what are the associated probabilities?
The Business Problem: Predicting Churn at a Telecom Service Provider
The telecom business is challenged by frequent customer churn due to several factors related to service and customer demographics. The dataset we'll use in our analysis includes a list of service-related factors about existing customers and information about whether they have stayed or left the service provider. Our objective is to understand which of the factors contribute most to customer churn and to predict which customers will potentially churn based on service-related factors.
About the Dataset
The dataset used for the analysis can be downloaded here. It consists of information for 5,000 customers and includes independent variables such as account length, number of voicemail messages, total daytime charge, total evening charge, total night charge, total international charge, and number of customer service calls.
The dependent variable in the dataset is whether the customer churned or not, which is indicated by a 1 for "yes" and 0 for "no."
What is Discriminant Analysis?
In order to uncover which variables are responsible for churn and predict whether a customer will churn or not, we will use discriminant analysis.
Discriminant analysis is a segmentation tool. It segments groups in a way as to achieve maximum separation between them. This technique makes use of the information provided by the X variables to achieve the clearest possible separation between two groups (in our case, the two groups are customers who stay and customers who churn). Below is our formula:
D= b0 + b1X1 + b2X2 + .. bnXn
Here, D is the discriminant score, b is the discriminant coefficient, and X1 and X2 are independent variables.
The discriminant coefficient is estimated by maximizing the ratio of the variation between the classes of customers and the variation within the classes. In other words, points belonging to the same class should be close together, while also being far away from the other clusters.
Discriminant Analysis: A High-Level Approach
Below, I've broken down the steps we will be following in order to predict customer churn, starting with data preparation and ending with validating the accuracy of our model.
Let's get started.
Formulating the Problem and Preparing the Data
To begin, we need to read the CSV file in R and convert the target variables into categorical variables. For the purpose of discriminant analysis, the Y variable has to be a categorical variable (meaning it is one of a limited number of possible values).
## Setting the working directory and reading the source file
setwd("C:/Users/Sowmya CR/Google Drive/datascience_blog")
data=read.csv("churn2.csv")
## Converting the target variable to categorical variable
data$churn=factor(data$churn)
str(data)
## 'data.frame': 5000 obs. of 8 variables:
## $ account_length : int 128 107 137 84 75 118 121 147 117 141 ...
## $ number_vmail_messages : int 25 26 0 0 0 0 24 0 0 37 ...
## $ total_day_charge : num 45.1 27.5 41.4 50.9 28.3 ...
## $ total_eve_charge : num 16.78 16.62 10.3 5.26 12.61 ...
## $ total_night_charge : num 11.01 11.45 7.32 8.86 8.41 ...
## $ total_intl_charge : num 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
## $ number_customer_service_calls: int 1 1 0 2 3 0 3 0 1 0 ...
## $ churn : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
Now that we've converted our target variables, let's find the baseline churn rate for the dataset:
prop.table((table(data$churn)))
##
## 0 1
## 0.8586 0.1414
As you can see here, the baseline churn rate is 14.14%, which is an indicator that the dataset is unbalanced — there are more 0s than 1s. Subsequently, this will have an impact on how we interpret model performance measures.
Exploratory Data Analysis
Using ggplot2, I will explore our dataset and determine the impact our variables have on churn, starting with account length.
ggplot(data, aes(x=churn, y=account_length, fill=churn)) +
geom_boxplot()
As you can see in our box plot, account length does not seem to have an influence on customer churn. Now, let's look at voicemails:
ggplot(data, aes(x=churn, y=number_vmail_messages, fill=churn)) +
geom_boxplot()
The number of voicemail messages seems to be lower for customers who churn compared to those who don't.
ggplot(data, aes(x=churn, y=total_day_charge, fill=churn)) +
geom_boxplot()
Generally, tariffs for day, evening, night, and international calls are higher for customers who churn compared to those who don't. This could indicate that customers who churn are not happy with the amount of money they are paying for their plan.
ggplot(data, aes(x=churn, y=number_customer_service_calls, fill=churn)) +
geom_boxplot()
Here, we see that the number of customer service calls made to customers who churn is relatively high. This indicates that customers who have churned have tried contacting customer service — but might have not received a satisfactory resolution to their issue.
Using the assumptions we've derived from our exploratory data analysis, we will dive deeper into our variables using discriminant analysis.
Since the data have variables that are in various dimensions, it is advisable to scale them so the range of each variable does not have an influence on the discriminant coefficients. After the dataset is scaled, we will split the dataset into training and test datasets. We will build the model based on the training dataset and test the model performance using the test dataset.:
## Scale X variables
x <- subset(data,select=c(1:7))
scaled_x=scale(x)
data1=cbind(data[8],scaled_x)
## Split into training and test sets
library(caTools)
set.seed(123)
split = sample.split(data1$churn, SplitRatio = 0.7)
traindata = subset(data1, split == TRUE)
testdata = subset(data1, split == FALSE)
#### Check if distribution of partition data is correct Testing dataset
prop.table((table(traindata$churn)))
##
## 0 1
## 0.8585714 0.1414286
prop.table((table(testdata$churn)))
##
## 0 1
## 0.8586667 0.1413333
Step One: Test the Significance of the Discriminant Function Using MANOVA
To test the significance of the discriminant function, we will use multivariate analysis of variance (MANOVA). Here's the approach we will take:
The null hypothesis of MANOVA is that all the means of the independent variables are equal, which implies that the independent variables are not differentiators of the group.
The alternative hypothesis is that at least one independent variable has a different mean or, in other words, a significant differentiator.
head(traindata)
## churn account_length number_vmail_messages total_day_charge
## 1 0 0.6988716 1.2730178 1.57391660
## 3 0 0.9256029 -0.5724919 1.17116913
## 6 0 0.4469479 -0.5724919 0.80007390
## 7 0 0.5225250 1.1991974 0.70293426
## 9 0 0.4217555 -0.5724919 0.07862111
## 10 0 1.0263724 2.1588624 1.45276492
## total_eve_charge total_night_charge total_intl_charge
## 1 -0.06384268 0.87619875 -0.09549925
## 3 -1.57192653 -0.74666192 0.69590136
## 6 0.39463343 0.07136541 -1.43685621
## 7 2.92439754 0.24288727 -0.99420841
## 9 2.98723437 0.30445922 -0.56497419
## 10 0.42256091 2.49466143 0.33373498
## number_customer_service_calls
## 1 -0.436632
## 3 -1.202116
## 6 -1.202116
## 7 1.094336
## 9 -0.436632
## 10 -1.202116
X1=cbind(as.matrix(traindata[,2:8]))
Y1=as.vector(traindata[,1])
Manova=manova(X1~Y1)
summary(Manova, test = "Wilks")
## Df Wilks approx F num Df den Df Pr(>F)
## Y1 1 0.89534 58.315 7 3492 < 2.2e-16 ***
## Residuals 3498
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As you see above, the Wilks' lambda for MANOVA is closer to 1, indicating that the extent of discrimination in the model is relatively low. But the p-value is highly significant, indicating that the null hypothesis cannot be accepted. This implies that the discriminant model is highly significant.
Step 2: Develop the Fisher Discriminant Function
Now we need to identify a combination of features that separates our customers that are likely to churn from those who are not. I'll be using the packages DiscriMiner and MASS to do so.
library(DiscriMiner)
library(MASS)
discPower(X1,Y1)
## correl_ratio wilks_lambda F_statistic
## account_length 0.000069076 0.9999309 0.2416445
## number_vmail_messages 0.008786183 0.9912138 31.0064957
## total_day_charge 0.036972676 0.9630273 134.2956924
## total_eve_charge 0.005896454 0.9941035 20.7481365
## total_night_charge 0.002381859 0.9976181 8.3516345
## total_intl_charge 0.004819239 0.9951808 16.9393317
## number_customer_service_calls 0.043692063 0.9563079 159.8175953
## p_value
## account_length 6.230517e-01
## number_vmail_messages 2.765485e-08
## total_day_charge 0.000000e+00
## total_eve_charge 5.416984e-06
## total_night_charge 3.877117e-03
## total_intl_charge 3.948399e-05
## number_customer_service_calls 0.000000e+00
##Fischer discriminant functions
desDA(X1,Y1)
Since there are only two values for the Y variable, there is only one discriminant function DF1. The coefficients of the discriminant function can be seen below:
##
## Descriptive Discriminant Analysis
## ---------------------------------
## $power discriminant power
## $values table of eigenvalues
## $discrivar discriminant variables
## $discor correlations
## $scores discriminant scores
## ---------------------------------
##
## $power
## cor_ratio wilks_lamb
## account_length 0.00006907600 0.99993092400
## number_vmail_messages 0.00878618268 0.99121381732
## total_day_charge 0.03697267618 0.96302732382
## total_eve_charge 0.00589645399 0.99410354601
## total_night_charge 0.00238185881 0.99761814119
## total_intl_charge 0.00481923872 0.99518076128
## number_customer_service_calls 0.04369206260 0.95630793740
## F_statistic p_values
## account_length 0.24164453981 0.62305169936
## number_vmail_messages 31.00649574143 0.00000002765
## total_day_charge 134.29569242558 0.00000000000
## total_eve_charge 20.74813648496 0.00000541698
## total_night_charge 8.35163453174 0.00387711739
## total_intl_charge 16.93933172621 0.00003948399
## number_customer_service_calls 159.81759534952 0.00000000000
##
##
## $values
## value proportion accumulated
## DF1 0.103 100.000 100.000
##
##
## $discrivar
## DF1
## constant 0.02478
## account_length 0.05173
## number_vmail_messages -0.30702
## total_day_charge 0.62922
## total_eve_charge 0.28724
## total_night_charge 0.16000
## total_intl_charge 0.25025
## number_customer_service_calls 0.70106
##
##
## $discor
## DF1
## account_length 0.02569
## number_vmail_messages -0.28974
## total_day_charge 0.59435
## total_eve_charge 0.23736
## total_night_charge 0.15086
## total_intl_charge 0.21458
##
##
## $scores
## z1
## 1 0.45226
## 3 -0.25423
## 6 -0.35048
## 7 1.52316
## 9 0.73109
## 10 0.09042
## ...
If we sort the X variables by descending order of the coefficients, we are able to understand the influence of each X variable on differentiating the Y variable:
constant 0.02478
number_customer_service_calls 0.70106
total_day_charge 0.62922
total_eve_charge 0.28724
total_intl_charge 0.25025
total_night_charge 0.16
account_length 0.05173
number_vmail_messages -0.30702
In terms of magnitude, the number of customer service calls has the most impact, whereas the account length has the least impact. Number of voice mail messages has a negative sign indicating that it has a negative impact on churn. In other words, as the number of voice mail messages increases, the probability of churn decreases.
Step 3: Differentiation of Individual Independent Variables and Wilks' Lamba
The p-value is not significant for account length, but it highly significant for the other X variables. This indicates that all X variables except account length are excellent predictors in terms of differentiating our customer groups.
Step 4: Correlation Between Discriminant Function and Independent Variable
The relative importance of the X variables can be determined by the correlation ratio of each of the variables. The number of customer service calls emerges as the most significant variable in its ability to differentiate the groups.
Step 5: Classify Records Based on Discriminant Analysis of X Variables and Predict Y Variables for the Test Set
Since the discriminant model is significant, we will use it to classify records as belonging to either customers who will churn or those who will not churn depending on the X variables. We will use the lda() function in R to classify records based on value of X variables and predict the class and probability for the test set.
## churn account_length number_vmail_messages total_day_charge
## 1 0 0.6988716 1.2730178 1.57391660
## 3 0 0.9256029 -0.5724919 1.17116913
## 6 0 0.4469479 -0.5724919 0.80007390
## 7 0 0.5225250 1.1991974 0.70293426
## 9 0 0.4217555 -0.5724919 0.07862111
## 10 0 1.0263724 2.1588624 1.45276492
## total_eve_charge total_night_charge total_intl_charge
## 1 -0.06384268 0.87619875 -0.09549925
## 3 -1.57192653 -0.74666192 0.69590136
## 6 0.39463343 0.07136541 -1.43685621
## 7 2.92439754 0.24288727 -0.99420841
## 9 2.98723437 0.30445922 -0.56497419
## 10 0.42256091 2.49466143 0.33373498
## number_customer_service_calls
## 1 -0.436632
## 3 -1.202116
## 6 -1.202116
## 7 1.094336
## 9 -0.436632
## 10 -1.202116
sublda=lda(churn~.,data = traindata)
sublda
## Call:
## lda(churn ~ ., data = traindata)
##
## Prior probabilities of groups:
## 0 1
## 0.8585714 0.1414286
##
## Group means:
## account_length number_vmail_messages total_day_charge total_eve_charge
## 0 -0.002297854 0.0520458 -0.0936634 -0.03051636
## 1 0.021527668 -0.2178557 0.4578240 0.18870353
## total_night_charge total_intl_charge number_customer_service_calls
## 0 -0.02835861 -0.03201133 -0.09579425
## 1 0.11297312 0.16979135 0.49432007
##
## Coefficients of linear discriminants:
## LD1
## account_length 0.05172731
## number_vmail_messages -0.30702403
## total_day_charge 0.62921658
## total_eve_charge 0.28723614
## total_night_charge 0.15999506
## total_intl_charge 0.25025370
## number_customer_service_calls 0.70105887
Let us review the output from the lda() function:
The difference in group means is highest for number of customer service calls and lowest for account length. This gives us insight into which factors contribute most to the discrimination between the groups.
The coefficients also give a similar pattern and throw light on which X variables contribute most to group separation.
Step 6: Visualizing the Groups
Now, let's visualize what our non-churning customers look like compared to our customers who will most likely churn. Remember, 0 indicates customers who will not churn and 1 indicates customers who will.
plot(sublda, dimen = 1, type = "b")
The groups created by discriminant analysis can be seen in the graphs, and are in sync with the Wilks lambda value of 0.89 that we got from our MANOVA test. These graphs are a good indicator that although the model is significant, our two groups are not completely separated. There is some overlap.
Step 7: Make Predictions on the Test Set
Using our test set, let's predict whether a customer will churn or stay based on the X variables of the test set. We will apply the discriminant model that we built using the training set to make predictions about the test set. The objective of this is to measure how the model performs on a new set of data.
lda.pred=predict(sublda, newdata = testdata)
library(hmeasure)
class.lda=lda.pred$class
true.class<-testdata[,1]
lda.counts <- misclassCounts(class.lda,true.class)
lda.counts$conf.matrix
## pred.1 pred.0
## actual.1 27 185
## actual.0 30 1258
print(lda.counts$metrics,digits=3)
## ER Sens Spec Precision Recall TPR FPR F Youden
## 1 0.143 0.127 0.977 0.474 0.127 0.127 0.0233 0.201 0.104
Step 8: Evaluate Model Performance Measures
The accuracy of the model is 1-Error rate = 1-0.1433 = 85.67%
Accuracy indicates how many correct predictions are made by the model. The model has a fairly good accuracy. However, since the dataset is unbalanced, accuracy alone may not be the sole indicator that the model is a robust model. Hence we will look at few other model performance measures.
Sensitivity also called the true positive rate is defined as the proportion of actual positives that are correctly identified by the model. The sensitivity of the model is 12.7% which is very low. For a churn prediction model, it is important that the model picks up positives as positives. It is important to make an accurate prediction of customers who will churn which is given by sensitivity.
We will now vary the threshold of the model from the default 50% to other values to decide on a optimum balance for sensitivity and specificity.
## classifying with a default threshold of 0.5
lda.pred$posterior[1:3,]
## 0 1
## 2 0.9302405 0.06975946
## 4 0.8165096 0.18349036
## 5 0.8499270 0.15007298
scores.lda <- lda.pred$posterior[,2]
all((scores.lda > 0.5)== (class.lda=="1"))
## [1] TRUE
The model, by default, uses a 50% threshold to classify records as 0 or 1.
## threshold of 30%
lda.counts.T03 <- misclassCounts(scores.lda>0.3,true.class)
lda.counts.T03$conf.matrix
## pred.1 pred.0
## actual.1 82 130
## actual.0 106 1182
## threshold of 20%
lda.counts.T02 <- misclassCounts(scores.lda>0.2,true.class)
lda.counts.T02$conf.matrix
## pred.1 pred.0
## actual.1 134 78
## actual.0 237 1051
## threshold of 17%
lda.counts.T017 <- misclassCounts(scores.lda>0.17,true.class)
lda.counts.T017$conf.matrix
## pred.1 pred.0
## actual.1 151 61
## actual.0 302 986
## threshold of 16%
lda.counts.T016 <- misclassCounts(scores.lda>0.16,true.class)
lda.counts.T016$conf.matrix
## pred.1 pred.0
## actual.1 157 55
## actual.0 329 959
## threshold of 15%
lda.counts.T015 <- misclassCounts(scores.lda>0.15,true.class)
lda.counts.T015$conf.matrix
## pred.1 pred.0
## actual.1 163 49
## actual.0 359 929
## threshold of 10%
lda.counts.T01 <- misclassCounts(scores.lda>0.1,true.class)
lda.counts.T01$conf.matrix
## pred.1 pred.0
## actual.1 182 30
## actual.0 604 684
Now, let's compare the values of sensitivity and specificity for three threshold values.
## ER Sens Spec
## 1 0.1573333 0.3867925 0.9177019
lda.counts.T02$metrics[c('ER', 'Sens','Spec')]
## ER Sens Spec
## 1 0.21 0.6320755 0.8159938
lda.counts.T017$metrics[c('ER', 'Sens','Spec')]
## ER Sens Spec
## 1 0.242 0.7122642 0.765528
lda.counts.T016$metrics[c('ER', 'Sens','Spec')]
## ER Sens Spec
## 1 0.256 0.740566 0.7445652
lda.counts.T015$metrics[c('ER', 'Sens','Spec')]
## ER Sens Spec
## 1 0.272 0.7688679 0.7212733
lda.counts.T01$metrics[c('ER', 'Sens','Spec')]
## ER Sens Spec
## 1 0.4226667 0.8584906 0.5310559
When we plot the accuracy, sensitivity, and specificity for various threshold values, the three lines intersect at a particular point. The threshold corresponding to this point indicates the optimum threshold for the model.
From the chart above, it is clear that the optimum threshold for the model is 16%. When we make predictions at this threshold, the accuracy, sensitivity, and specificity are 74%.
Insights From the Model and Business Recommendations
Based on the discriminant coefficients and the correl_ratio provided by the model, an increase in the below variables increases the probability of customer churn:
Number of customer service calls
Total day charge
Total evening charge
Total international charge
Total night charge
Additionally, an increase in number of voice mail messages decreases the probability of customer churn.
These insights from the discriminant model can help the business formulate strategies to reduce customer churn. Here's what I would recommend to the business based on what we've learned: First, customer issues should be resolved within the first or second call, as repeated calls to customer service causes customer churn. Second, there should be an organized escalation procedure for issues not resolved within two calls. Lastly, the provider should offer more attractive plans that reduce the cost of day, evening, and international calls based on usage.