Capstone - Project - Final - Report - Churn - Prediction

This document presents the findings of a capstone project analyzing customer churn. It includes sections on exploratory data analysis, data cleaning, model building using logistic regression, decision trees, random forests and other techniques, and model validation. Univariate and bivariate analyses were conducted to understand customer attributes and their relationship to churn. Data preprocessing addressed missing values, outliers and transformations. Multiple models were trained and validated on test data, with random forests showing the best performance. Key insights and recommendations are provided to help reduce customer churn.

CAPSTONE PROJECT

FINAL REPORT
SUBMITTED BY

PUVYA RAVI
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
1. Introduction
   a. Problem statement
   b. Need of the Study/Project
   c. Understanding Business/Social opportunity
2. Exploratory Data Analysis and Business Implication
   a. Understanding how data was collected in terms of time, frequency and methodology
   b. Visual inspection of data
   c. Understanding of attributes
   d. Univariate Analysis
   e. Bivariate Analysis
3. Data Cleaning and Pre-processing
   a. Removal of unwanted variables
   b. Missing values treatment
   c. Outlier treatment
   d. Variable transformation
4. Model building
   a. Logistic Regression
   b. Decision Tree
   c. Random Forest
   d. Linear Discriminant Analysis
   e. K-Nearest Neighbours
5. Model Validation
   a. Test predictive model against the test set using various appropriate performance metrics
      i. Logistic Regression
      ii. Decision Tree
      iii. Random Forest
      iv. Linear Discriminant Analysis
      v. K-Nearest Neighbours
   b. Interpretation of the model(s)
   c. Ensemble modelling, wherever applicable
   d. Any other model tuning measure
   e. Interpretation of the most optimum model and its implication on the business
6. Final Interpretation and Recommendations
   a. Insights
   b. Recommendations
Appendix
LIST OF FIGURES
Figure 2.1: Count plot – Account Segment
Figure 2.2: Count plot – Gender
Figure 2.3: Count plot – Login Device
Figure 2.4: Count plot – Marital Status
Figure 2.5: Count plot – Payment
Figure 2.6: Count plot – Revenue per Month
Figure 2.7: Count plot – Tenure
Figure 2.8: Count plot – Account User Count
Figure 2.9: Count plot – Revenue Growth
Figure 2.10: Count plot – Coupon used for Payment
Figure 2.11: Count plot – Days since CC connect
Figure 2.12: Count plot – Cashback
Figure 2.13: Heat Map
Figure 2.14: Account Segment vs Churn
Figure 2.15: Gender vs Churn
Figure 2.16: Login Device vs Churn
Figure 2.17: Marital Status vs Churn
Figure 2.18: Revenue per Month vs Churn
Figure 2.19: City Tier vs Churn
Figure 2.20: Account User Count vs Churn
Figure 2.21: Payment vs Churn
Figure 2.22: Service Score vs Churn
Figure 3.1: Boxplot of Variables
Figure 3.2: Boxplot of Variables – Post outliers removal
Figure 4.1: Coefficients of Logistic Regression model
Figure 4.2: Decision Tree plot – 1
Figure 4.3: Decision Tree plot – 2
Figure 5.1: Confusion matrix of LR model train (L) and test set (R)
Figure 5.2: Classification report of LR model train set (L) and test set (R)
Figure 5.3: ROC Curve of LR model train set (L) and test set (R)
Figure 5.4: Confusion matrix of DT model train (L) and test set (R)
Figure 5.5: Classification report of DT model train set (L) and test set (R)
Figure 5.6: ROC Curve of DT model train set (L) and test set (R)
Figure 5.7: Confusion matrix of RF model train (L) and test set (R)
Figure 5.8: Classification report of RF model train set (L) and test set (R)
Figure 5.9: ROC Curve of RF model train set (L) and test set (R)
Figure 5.10: Confusion matrix of LDA model train (L) and test set (R)
Figure 5.11: Classification report of LDA model train set (L) and test set (R)
Figure 5.12: ROC Curve of LDA model train set (L) and test set (R)
Figure 5.13: Confusion matrix of KNN model train (L) and test set (R)
Figure 5.14: Classification report of KNN model train set (L) and test set (R)
Figure 5.15: ROC Curve of KNN model train set (L) and test set (R)
Figure 5.16: Comparison of ROC Curve of all model train set (L) and test set (R)
LIST OF TABLES
Table 2.1: First 5 rows of dataset
Table 2.2: Last 5 rows of dataset
Table 2.3: Statistical description of the numerical variables of dataset
Table 2.4: Information about the dataset
Table 3.1: Percent of null values in each column
Table 4.1: Top 5 rows of independent variables – Train set
Table 4.2: Top 5 rows of independent variables – Test set
Table 4.3: Top 5 rows of dependent variables – Train set (L) and Test set (R)
Table 4.4: Decision Tree Prediction probability – Train set (L) and Test set (R)
Table 4.5: Random Forest Prediction probability – Train set (L) and Test set (R)
Table 5.1: Logistic Regression model Performance metrics
Table 5.2: Decision Tree model Performance metrics
Table 5.3: Random Forest model Performance metrics
Table 5.4: Linear Discriminant Analysis model Performance metrics
Table 5.5: K-Nearest Neighbours model Performance metrics
Table 5.6: Consolidated Performance Metrics
1. Introduction
a. Problem statement
An e-commerce company faces intense competition in the current market, and retaining existing customers has
become a challenge. The company therefore wants to develop a model that predicts account churn so that it can
provide segmented offers to potential churners. Account churn matters a great deal for this company because one
account can have multiple customers; by losing one account, the company may lose more than one customer.

b. Need of the Study/Project

• To understand the factors that are responsible for customer churn
• To propose commercial actions aimed at retaining customers who show signs of churn, and to make them
customised offers
• To develop a churn prediction model for this company
• To provide business recommendations on the campaign

c. Understanding Business/Social opportunity

The average customer churn stands at 16.84%. This presents an opportunity to reduce the costs associated with
customer churn by a significant margin, by identifying where customers are lost and understanding why they leave.
Retaining existing customers is as important as acquiring new customers, if not more so. The cost of acquiring a
new customer is commonly cited as about five times the cost of retaining an existing one. While acquisition
increases the number of customers, retention maximizes the value of the customers already captured.
Every business needs to balance acquisition and retention costs. Acquisition is important to draw in new
consumers and expand the base. Retention is normally less costly and builds loyalty and the brand. Retained
customers also tend to be far less price sensitive, so the long-term costs are recouped through sales. It is
essential to track the ROI for both customer acquisition and customer retention. Poor ratios in either area are strong
indicators that more research is necessary to see what is changing in the market or what is happening within the
business.

2. Exploratory Data Analysis and Business Implication

a. Understanding how data was collected in terms of time, frequency and methodology
The dataset gives no information on how it was collected, and it does not contain a date or time variable.
There is therefore insufficient evidence to conclude how the data was collected, and no insight can be drawn in
this regard.

b. Visual inspection of data

• The data is in a shape of 11260 rows and 19 columns
• There are 3 data types in the data set – 12 Object, 5 Float and 2 Integer
• The data set has 2676 null values in total
   AccountID  Churn  Tenure  City_Tier  CC_Contacted_LY  Payment      Gender  Service_Score  Account_user_count
0  20000      1      4       3          6                Debit Card   Female  3              3
1  20001      1      0       1          8                UPI          Male    3              4
2  20002      1      0       1          30               Debit Card   Male    2              4
3  20003      1      0       3          15               Debit Card   Male    2              4
4  20004      1      0       1          12               Credit Card  Male    2              3

   account_segment  CC_Agent_Score  Marital_Status  rev_per_month  Complain_ly  rev_growth_yoy  coupon_used_for_payment  Day_Since_CC_connect  cashback  Login_device
0  Super            2               Single          9              1            11              1                        5                     159.93    Mobile
1  Regular Plus     3               Single          7              1            15              0                        0                     120.9     Mobile
2  Regular Plus     3               Single          6              1            14              0                        3                     NaN       Mobile
3  Super            5               Single          8              0            23              0                        3                     134.07    Mobile
4  Regular Plus     5               Single          3              0            11              1                        3                     129.6     Mobile

Table 2.1: First 5 rows of dataset

       AccountID  Churn  Tenure  City_Tier  CC_Contacted_LY  Payment      Gender  Service_Score  Account_user_count
11255  31255      0      10      1          34               Credit Card  Male    3              2
11256  31256      0      13      1          19               Credit Card  Male    3              5
11257  31257      0      1       1          14               Debit Card   Male    3              2
11258  31258      0      23      3          11               Credit Card  Male    4              5
11259  31259      0      8       1          22               Credit Card  Male    3              2

       account_segment  CC_Agent_Score  Marital_Status  rev_per_month  Complain_ly  rev_growth_yoy  coupon_used_for_payment  Day_Since_CC_connect  cashback  Login_device
11255  Super            1               Married         9              0            19              1                        4                     153.71    Computer
11256  HNI              5               Married         7              0            16              1                        8                     226.91    Mobile
11257  Super            4               Married         7              1            22              1                        4                     191.42    Mobile
11258  Super            4               Married         7              0            16              2                        9                     179.9     Computer
11259  Super            3               Married         5              0            13              2                        3                     175.04    Mobile

Table 2.2: Last 5 rows of dataset

AccountID Churn City_Tier CC_Contacted_LY Service_Score CC_Agent_Score Complain_ly
count 11260 11260 11148 11158 11162 11144 10903
mean 25629.5 0.168384 1.653929 17.867091 2.902526 3.066493 0.285334
std 3250.62635 0.374223 0.915015 8.853269 0.725584 1.379772 0.451594
min 20000 0 1 4 0 1 0
25% 22814.75 0 1 11 2 2 0
50% 25629.5 0 1 16 3 3 0
75% 28444.25 0 3 23 3 4 1
max 31259 1 3 132 5 5 1

Table 2.3: Statistical description of the numerical variables of dataset

c. Understanding of attributes
# Column Non-Null Count Dtype
0 AccountID 11260 non-null int64
1 Churn 11260 non-null int64
2 Tenure 11158 non-null object
3 City_Tier 11148 non-null float64
4 CC_Contacted_LY 11158 non-null float64
5 Payment 11151 non-null object
6 Gender 11152 non-null object
7 Service_Score 11162 non-null float64
8 Account_user_count 11148 non-null object
9 account_segment 11163 non-null object
10 CC_Agent_Score 11144 non-null float64
11 Marital_Status 11048 non-null object
12 rev_per_month 11158 non-null object
13 Complain_ly 10903 non-null float64
14 rev_growth_yoy 11260 non-null object
15 coupon_used_for_payment 11260 non-null object
16 Day_Since_CC_connect 10903 non-null object
17 cashback 10789 non-null object
18 Login_device 11039 non-null object

Table 2.4: Information about the dataset

The first step in data exploration is to import the Python libraries needed to explore and visualize the data. The
numerical and categorical columns are then explored, and missing data is identified.

d. Univariate Analysis
Categorical variables are analysed using the following charts

Figure 2.1: Count plot – Account Segment

Figure 2.2: Count plot – Gender

Figure 2.3: Count plot – Login Device

Figure 2.4: Count plot – Marital Status

Figure 2.5: Count plot – Payment

Figure 2.6: Count plot – Revenue per Month

Figure 2.7: Count plot – Tenure

Figure 2.8: Count plot – Account User Count

Figure 2.9: Count plot – Revenue Growth

Figure 2.10: Count plot – Coupon used for Payment

Figure 2.11: Count plot – Days since CC connect

Figure 2.12: Count plot – Cashback

All the above shown charts are the count plots of the Categorical variables of the data set. The insights are as follows:

• According to the count plots, highest count is for the following cases
o Account Segment – Super
o Gender – Male
o Login Device – Mobile
o Marital Status – Married
o Payment – Debit Card
o Revenue per Month – 3
o Tenure – 1 Year
o Account user count – 4
o Revenue growth – 14
o Coupon used for payment – 1
o Days since CC connected – 3
• The majority of the customers (8186) fall under the account segments Super and Regular Plus
• Debit card and Credit card alone account for 8098 customers in the Payment counts
• Customers with either no tenure or a tenure of just 1 year number more than 2000. Every other tenure period
has a customer count of less than 520.

e. Bivariate Analysis

Figure 2.13: Heat Map

The analysis checks the relationship of each variable with Churn (the target). The heat map shown above
indicates the correlation among all the numerical variables. Focusing on the target, Churn has its highest
correlation with Complain_ly and its lowest with AccountID. AccountID is also the only variable with which Churn
has a negative correlation.
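
As an illustration of what a single cell of such a heat map represents, the sketch below computes a Pearson correlation coefficient in plain Python. The ten value pairs are the Churn and Complain_ly entries of the rows shown in Tables 2.1 and 2.2, so the result describes only those ten rows and not the full dataset. The function name pearson is ours; the report's own figure was presumably produced with a plotting library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Churn and Complain_ly for the first five and last five rows of the dataset.
churn = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
complain_ly = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]

r = pearson(churn, complain_ly)
print(round(r, 3))
```

A positive value means that rows with a complaint in the last year are more often churned rows, which is the direction the heat map reports for the full data.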

Figure 2.14: Account Segment vs Churn

Figure 2.15: Gender vs Churn

Figure 2.16: Login Device vs Churn

Figure 2.17: Marital Status vs Churn

Figure 2.18: Revenue per Month vs Churn

Figure 2.19: City Tier vs Churn

Figure 2.20: Account User Count vs Churn

Figure 2.21: Payment vs Churn

Figure 2.22: Service Score vs Churn

Insights from Bi-variate analysis are as follows:

• Churn count is the highest for the below scenarios

o Account Segment – Regular Plus
o Gender – Male
o Login Device – Mobile
o Marital Status – Single
o Payment – Debit card
o City tier – 1
o Account user count – 4
o Revenue per month – 3
o Service score - 3
• The main objective was to distinguish between churned and retained customers and to find the attributes
associated with churn. It was first observed that single male customers have a slightly higher probability of
churn.
• In addition, using Mobile as the login device is related to customer churn. This might be caused by the user
experience of the phone version of the e-commerce platform.
• However, this study shows that Service score is higher in churned customers, which was not expected. On the
other hand, Tenure and the number of orders are lower for churned customers, which is reasonable.

3. Data Cleaning and Pre-processing

a. Removal of unwanted variables
The column AccountID is an identifier and is not related to any other variable. Its correlation with Churn,
the target variable, is very low (-0.0095). It is therefore considered an unwanted variable and is removed.

b. Missing values treatment

The dataset contains a considerable number of null entries. As the table below shows, only a few variables
have no null entries; the remaining variables all have some.
Column                     Fraction of dataset
AccountID 0.000000
Churn 0.000000
Tenure 0.009059
City_Tier 0.009947
CC_Contacted_LY 0.009059
Payment 0.009680
Gender 0.009591
Service_Score 0.008703
Account_user_count 0.009947
account_segment 0.008615
CC_Agent_Score 0.010302
Marital_Status 0.018828
rev_per_month 0.009059
Complain_ly 0.031705
rev_growth_yoy 0.000000
coupon_used_for_payment 0.000000
Day_Since_CC_connect 0.031705
cashback 0.041829
Login_device 0.019627

Table 3.1: Percent of null values in each column

For each column, this table gives the share of the overall data made up by that column's null values. The
values are expressed as fractions of the 11260 rows, so 0.041829 corresponds to about 4.18%. The table also shows
that 4 variables have no null values at all. The variable with the highest share of null values is cashback. Null
values are ignored in order to avoid errors in later steps of the project; they are not imputed with the mean,
median or mode of the column.
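
The figures in Table 3.1 can be reproduced from the non-null counts listed in Table 2.4. The short sketch below does this for four of the columns. The dictionary and variable names are ours, and the report itself most likely obtained the values directly from the data frame.

```python
TOTAL_ROWS = 11260  # number of rows in the dataset

# Non-null counts taken from Table 2.4 for a few columns.
non_null = {
    "Marital_Status": 11048,
    "Complain_ly": 10903,
    "cashback": 10789,
    "Login_device": 11039,
}

# Fraction of rows that are null in each column.
null_fraction = {
    col: (TOTAL_ROWS - count) / TOTAL_ROWS for col, count in non_null.items()
}

for col, frac in null_fraction.items():
    print(f"{col}: fraction={frac:.6f}, percent={frac * 100:.2f}%")

# Column with the largest share of null values.
worst_column = max(null_fraction, key=null_fraction.get)
print(worst_column)
```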

c. Outlier treatment
The following plots show the outliers present in each of the variables.

Figure 3.1: Boxplot of Variables

Because outliers are present in the variables, including the target, outlier treatment is carried out
before moving forward with the analysis. The figure below shows the box plots of the variables after the removal
of outliers.
Figure 3.2: Boxplot of Variables – Post outliers removal
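
The report does not state which rule was used to treat the outliers. A common choice, and the one a box plot uses to draw its whiskers, is the 1.5 × IQR rule. The sketch below applies that rule, as an assumption, to CC_Contacted_LY using the quartiles from Table 2.3 (25% = 11, 75% = 23). The helper names iqr_fences and cap are hypothetical, and capping is only one possible treatment: dropping the affected rows is another.

```python
def iqr_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap(value, lower, upper):
    """Clamp a value to the closed interval [lower, upper]."""
    return max(lower, min(value, upper))


# Quartiles of CC_Contacted_LY from Table 2.3.
lower, upper = iqr_fences(11, 23)
print(lower, upper)

# The reported maximum of 132 lies far above the upper fence,
# so under this rule it would be capped to the fence.
capped_max = cap(132, lower, upper)
print(capped_max)
```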

d. Variable transformation
Several variables that should be of integer or float data type are stored as object, and some score variables
stored as float are better treated as categories. They are changed to the appropriate data type as follows:

Tenure – Object → Float

Complain_ly – Object → Float
City_Tier – Float → Object
Service_Score – Float → Object
CC_Agent_Score – Float → Object
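
A minimal plain-Python sketch of the two directions of conversion listed above. It assumes that a numeric column ends up as object because it contains a few entries that cannot be parsed as numbers; the "#" entry below is a hypothetical example of such a value, not one taken from the report. The helper name to_float is ours.

```python
def to_float(raw):
    """Return raw as a float, or None when it cannot be parsed."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


# Object -> Float: text entries become numbers, unparseable ones become missing.
tenure_raw = ["4", "0", "23", "#", None]  # "#" is a hypothetical stray entry
tenure = [to_float(v) for v in tenure_raw]
print(tenure)

# Float -> Object: tier codes become category labels instead of magnitudes.
city_tier = [3.0, 1.0, 1.0]
city_tier_labels = [str(int(v)) for v in city_tier]
print(city_tier_labels)
```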

4. Model building
Supervised learning is the category of data analysis in which the target outcome is known or labelled
(e.g., whether a customer churned or did not). When the intention is instead to group customers based on why
each churned, the task becomes unsupervised. This may be done to explore the relationship between customers and
their reasons for churning.
Classification and regression both belong to supervised learning; the former applies where the outcome
takes a finite set of values, while the latter applies where the outcome is continuous. A classification
algorithm learns from the training observations and then assigns each new observation to one of a number of
classes or groups. Classes are also called targets, labels or categories.
The main goal of a classification algorithm is to identify the category of a given observation, and these
algorithms are mainly used to predict categorical outputs. Classification algorithms can be further
divided into two categories: linear models and non-linear models.
The data is split into “X” and “y”, where “X” is a data frame of the independent variables and “y” holds the
dependent variable. These are further split into train and test sets using train_test_split.
The train-to-test ratio is 70:30.
      Tenure  City_Tier  CC_Contacted_LY  Payment  Gender  Service_Score  Account_user_count  account_segment  CC_Agent_Score
7580  11      1          22               4        1       0              3                   4                5
5198  22      3          14               3        1       4              3                   4                5
1929  15      1          14               2        1       3              2                   2                1
3427  15      1          14               1        0       3              3                   4                3
6249  0       1          22               1        0       3              4                   4                4

      Marital_Status  rev_per_month  Complain_ly  rev_growth_yoy  coupon_used_for_payment  Day_Since_CC_connect  cashback  Login_device
7580  1               6              0            5               1                        2                     2417      1
5198  1               5              0            2               1                        3                     3241      1
1929  1               1              0            3               5                        9                     5158      0
3427  3               7              0            5               2                        2                     3005      1
6249  0               5              0            5               1                        0                     1510      1

Table 4.1: Top 5 rows of independent variables – Train set

      Tenure  City_Tier  CC_Contacted_LY  Payment  Gender  Service_Score  Account_user_count  account_segment  CC_Agent_Score
784   0       1          31               2        0       2              0                   3                2
6943  0       3          22               3        1       2              2                   3                5
3709  10      3          23               3        0       4              4                   4                5
6439  0       1          10               1        0       2              2                   3                4
5310  9       1          25               2        1       3              1                   4                5

      Marital_Status  rev_per_month  Complain_ly  rev_growth_yoy  coupon_used_for_payment  Day_Since_CC_connect  cashback  Login_device
784   3               1              1            2               1                        1                     575       1
6943  1               4              0            10              1                        0                     76        0
3709  0               5              0            4               2                        24                    3308      1
6439  3               3              0            9               1                        1                     111       1
5310  1               2              0            2               2                        11                    3105      1

Table 4.2: Top 5 rows of independent variables – Test set

Train set       Test set
Row   Churn     Row   Churn
7580  0         784   0
5198  0         6943  1
1929  0         3709  0
3427  0         6439  0
6249  1         5310  0

Table 4.3: Top 5 rows of dependent variables – Train set (L) and Test set (R)
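
The 70:30 split described above shuffles the rows before dividing them, which is why the row labels in Tables 4.1 to 4.3 are not in order. The report uses train_test_split for this step. The sketch below is only a plain-Python illustration of the idea: the function name train_test_split_indices, the seed and the row count of 1000 are all hypothetical.

```python
import random


def train_test_split_indices(n_rows, test_size=0.30, seed=1):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(1000)
print(len(train_idx), len(test_idx))

# X and y must be indexed with the same row lists so that every
# feature row stays paired with its own Churn label.
```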

Since this project addresses a classification problem, the following models are built to analyse the
dataset, compare performance and assess the importance of the available features, which in turn gives more
information about the customers.

• Logistic Regression
• Decision Tree
• Random Forest
• Linear Discriminant Analysis
• K Nearest Neighbours

a. Logistic Regression
Logistic Regression is one of the “white box” algorithms: it outputs a probability for each observation, and an
appropriate cut-off point is then chosen to convert the probabilities into target classes. This model is used
because it makes few assumptions about the data and classification is fast.
The model is configured with a max_iter of 100, the liblinear solver, a tolerance of 0.0001 and an l2 penalty.
It is fit on the training set and its accuracy is obtained. The model intercept is -1.458 and the coefficients
are as follows:

Figure 4.1: Coefficients of Logistic Regression model
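The step from coefficients to a class label can be illustrated with a minimal sketch. Only the intercept of -1.458 comes from the text; the two coefficients, the feature values and the alternative cut-off of 0.15 are hypothetical, since the fitted coefficients are only shown in Figure 4.1.

```python
import math

INTERCEPT = -1.458  # model intercept reported in the text

def churn_probability(features, coefficients, intercept=INTERCEPT):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(coef * x))))."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability, cutoff=0.5):
    """Turn a probability into a class label using the chosen cut-off."""
    return 1 if probability >= cutoff else 0

hypothetical_coefs = [-0.2, 0.5]  # placeholders, not the fitted values
baseline = churn_probability([0.0, 0.0], hypothetical_coefs)  # all features at zero
print(round(baseline, 3), classify(baseline), classify(baseline, cutoff=0.15))
```

With all features at zero the probability depends on the intercept alone, and lowering the cut-off flips the predicted class, which is why the choice of cut-off matters for recall.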

b. Decision Tree
A decision tree is a supervised machine learning model that categorizes or predicts by asking a sequence of
questions about the features, where each answer determines the next question. Being supervised, the model is
trained and tested on a set of data that contains the desired categorization. This model is selected because it is
a white-box algorithm and the trees formed can be visualized. Also, null values present in the data set do not
affect the algorithm.
The Gini criterion is selected, along with a maximum depth of 10, a minimum of 10 samples per leaf and a minimum
of 50 samples per split. The best parameters are estimated using Grid Search.
Best score of the model: 0.894
Figure 4.2: Decision Tree plot – 1

Figure 4.3: Decision Tree plot – 2

Train set                   Test set
   0         1                 0         1
0  0.988848  0.011152       0  0.000000  1.000000
1  1.000000  0.000000       1  0.187500  0.812500
2  1.000000  0.000000       2  0.988848  0.011152
3  0.988848  0.011152       3  0.405405  0.594595
4  0.913043  0.086957       4  1.000000  0.000000

Table 4.4: Decision Tree Prediction probability – Train set (L) and Test set (R)
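The Gini criterion and the probabilities in Table 4.4 can be related with a short sketch. A tree's predicted probability for a customer is the class fraction in the leaf the customer falls into. The leaf below (269 training customers, 3 of them churners) is a hypothetical composition that happens to reproduce the first row of the table; the report does not give the actual leaf counts.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def leaf_probabilities(labels, classes=(0, 1)):
    """Class fractions in a leaf, reported as the prediction probabilities."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]

leaf = [0] * 266 + [1] * 3  # hypothetical leaf: 266 non-churners, 3 churners
probs = leaf_probabilities(leaf)
print([round(p, 6) for p in probs], round(gini(leaf), 4))
```

A pure leaf has a Gini impurity of 0 and yields the 1.000000 / 0.000000 rows seen in the table.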

c. Random Forest
Random Forest is a supervised machine learning algorithm that is widely used in classification and
regression problems. It builds decision trees on different samples of the data and takes their majority vote for
classification, or their average in the case of regression. Grid Search is used to estimate the best parameters for
the model. This model is selected for its efficiency in handling large datasets and its high level of accuracy.
The parameter values are a maximum depth of 10, maximum features of 11, a minimum of 10 samples per leaf, a
minimum of 50 samples per split and 100 estimators.

Train set                   Test set
   0         1                 0         1
0  0.991175  0.008825       0  0.190284  0.809716
1  0.988552  0.011448       1  0.290077  0.709923
2  0.948677  0.051323       2  0.966620  0.033380
3  0.991443  0.008557       3  0.627443  0.372557
4  0.510515  0.489485       4  0.992902  0.007098

Table 4.5: Random Forest Prediction probability – Train set (L) and Test set (R)
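The two sources of randomness in a random forest, and the way tree outputs are combined into one probability, can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names and the five per-tree probabilities are hypothetical, and only the feature names and the maximum features value of 11 come from the report.

```python
import random

FEATURES = [
    "Tenure", "City_Tier", "CC_Contacted_LY", "Payment", "Gender",
    "Service_Score", "Account_user_count", "account_segment",
    "CC_Agent_Score", "Marital_Status", "rev_per_month", "Complain_ly",
    "rev_growth_yoy", "coupon_used_for_payment", "Day_Since_CC_connect",
    "cashback", "Login_device",
]

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; each tree sees its own sample."""
    return [rng.choice(rows) for _ in rows]

def candidate_features(feature_names, max_features, rng):
    """Random subset of features that one split is allowed to consider."""
    return rng.sample(feature_names, max_features)

def average_probability(tree_probabilities):
    """Average the churn probability predicted by the individual trees."""
    return sum(tree_probabilities) / len(tree_probabilities)

rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)   # toy row indices
subset = candidate_features(FEATURES, 11, rng)    # max features = 11
tree_probs = [0.9, 0.8, 0.4, 0.7, 0.6]            # hypothetical outputs of 5 trees
forest_probability = average_probability(tree_probs)
predicted_label = 1 if forest_probability >= 0.5 else 0
```

Averaging over many trees is why the probabilities in Table 4.5 take many more distinct values than the single-tree probabilities in Table 4.4.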

d. Linear Discriminant Analysis

Linear Discriminant Analysis (LDA), also called Normal Discriminant Analysis or Discriminant Function Analysis,
is a dimensionality reduction technique that is commonly used for supervised classification problems. It is used for
modelling differences between groups, i.e., separating two or more classes. LDA is selected as one of the models
because it seeks the projection that minimises the variance within each class relative to the separation between
classes. The LDA model is built using the eigen solver with automatic shrinkage.

e. K-Nearest Neighbours
The K-Nearest Neighbours algorithm, also known as KNN, is a non-parametric, supervised learning classifier that
uses proximity to make classifications or predictions about the grouping of an individual data point. This model is
selected mainly because it has no separate training period, which makes it fast to build.
The KNN model is built using 15 neighbours, uniform weights and the minkowski metric.
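A uniform-weight nearest-neighbour classifier with a Minkowski distance can be sketched in a few lines. The power parameter `p=2` (Euclidean distance) is an assumption, as the report does not state it, and the six toy points are hypothetical; the toy call uses `k=3` because there are fewer than 15 points.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance; p=2 gives the Euclidean distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(train_X, train_y, query, k=15, p=2):
    """Uniform weights: majority class among the k closest training points."""
    order = sorted(range(len(train_X)),
                   key=lambda i: minkowski(train_X[i], query, p))
    nearest_labels = [train_y[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: two well separated groups (hypothetical, not the project data)
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (0.5, 0.5), k=3))  # 0
print(knn_predict(train_X, train_y, (5.5, 5.5), k=3))  # 1
```

All the distance computations happen at prediction time, which is the sense in which the model has no training period.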
5. Model Validation
a. Test predictive model against the test set using various appropriate performance metrics
For every model built in the project, the same steps are followed to evaluate the output generated by each
algorithm. The performance metrics used to analyse these models are accuracy, the confusion matrix and the
classification report with Precision, Recall and F1 score for each model.
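The metrics named above all derive from the four cells of the confusion matrix, as the following sketch shows. The ten labels and predictions are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with churn (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # hypothetical actual labels
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # hypothetical predictions
metrics = classification_metrics(y_true, y_pred)
```

In this toy case accuracy is 0.7 while recall is only 0.5, the same pattern as in the results below, where accuracy stays high even when many churners are missed.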

i. Logistic Regression
Accuracy score for the Logistic Regression model’s train set is 0.8722 and test set is 0.8712. Area under the ROC
curve is 0.841 for train set and 0.845 for test set

Figure 5.1: Confusion matrix of LR model train (L) and test set (R)

Figure 5.2: Classification report of LR model train set (L) and test set (R)

Figure 5.3: ROC Curve of LR model train set (L) test set (R)
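The area under the ROC curve reported for each model can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. A minimal sketch of that pairwise definition follows; the four scores are hypothetical, and the quadratic loop is meant for illustration rather than for large datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted churn probabilities
print(roc_auc(y_true, scores))   # 0.75
```

Unlike accuracy, this value does not depend on the cut-off chosen for the predicted probabilities.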

Train Test
Precision 0.75 0.72
Recall 0.36 0.38
F1 0.48 0.50

Table 5.1: Logistic Regression model Performance metrics

ii. Decision Tree

Accuracy score for the Decision Tree model’s train set is 0.9211 and test set is 0.9002
Area under the ROC curve is 0.897 for train set and 0.891 for test set

Figure 5.4: Confusion matrix of DT model train (L) and test set (R)
Figure 5.5: Classification report of DT model train set (L) and test set (R)

Figure 5.6: ROC Curve of DT model train set (L) and test set (R)

Train Test
Precision 0.74 0.73
Recall 0.59 0.58
F1 0.66 0.64

Table 5.2: Decision Tree model Performance metrics

iii. Random Forest

Accuracy score for the Random Forest model’s train set is 0.9211 and test set is 0.9002
Area under the ROC curve is 0.961 for train set and 0.926 for test set

Figure 5.7: Confusion matrix of RF model train (L) and test set (R)

Figure 5.8: Classification report of RF model train set (L) and test set (R)
Figure 5.9: ROC Curve of RF model train set (L) and test set (R)

Train Test
Precision 0.87 0.82
Recall 0.67 0.64
F1 0.76 0.72

Table 5.3: Random Forest model Performance metrics

iv. Linear Discriminant Analysis

Accuracy score for the Linear Discriminant Analysis model’s train set is 0.871 and test set is 0.872
Area under the ROC curve is 0.897 for train set and 0.891 for test set

Figure 5.10: Confusion matrix of LDA model train (L) and test set (R)

Figure 5.11: Classification report of LDA model train set (L) and test set (R)

Figure 5.12: ROC Curve of LDA model train set (L) and test set (R)
Train Test
Precision 0.75 0.89
Recall 0.35 0.97
F1 0.48 0.93

Table 5.4: Linear Discriminant Analysis model Performance metrics

v. K-Nearest Neighbours
Accuracy score for the K-Nearest Neighbours model’s train set is 0.842 and test set is 0.830
Area under the ROC curve is 0.823 for train set and 0.720 for test set

Figure 5.13: Confusion matrix of KNN model train (L) and test set (R)

Figure 5.14: Classification report of KNN model train set (L) and test set (R)

Figure 5.15: ROC Curve of KNN model train set (L) and test set (R)

Train Test
Precision 0.69 0.48
Recall 0.12 0.07
F1 0.20 0.12

Table 5.5: K-Nearest Neighbours model Performance metrics

b. Interpretation of the model(s)

           LR Model      DT Model      RF Model      LDA Model     KNN Model
           Train  Test   Train  Test   Train  Test   Train  Test   Train  Test
Precision  0.75   0.72   0.74   0.73   0.87   0.82   0.75   0.89   0.69   0.48
Recall     0.36   0.38   0.59   0.58   0.67   0.64   0.35   0.97   0.12   0.07
F1 Score   0.48   0.50   0.66   0.64   0.76   0.72   0.48   0.93   0.20   0.12
Accuracy   0.87   0.87   0.92   0.90   0.92   0.90   0.87   0.87   0.84   0.83
AUC        0.84   0.85   0.90   0.89   0.96   0.93   0.90   0.89   0.82   0.72
Table 5.6: Consolidated Performance Metrics
The table above gives a consolidated view of the performance metrics of each model. The metrics for the
K-Nearest Neighbours model are low compared to the other models, and the metrics of the Random Forest model are
the highest overall. F1 score is the point of focus. Random Forest has the highest F1 score on the train set (0.76)
and also on the test set (0.72), apart from the LDA test-set figure of 0.93, which is far out of line with LDA's
train-set F1 score of 0.48.
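One way to read Table 5.6 programmatically is to rank the models on the worse of their train and test F1 scores, so that a single outlying figure cannot decide the ranking. The F1 values below are copied from the table; the ranking rule and the 0.10 gap threshold are assumptions made for this sketch, not criteria stated in the report.

```python
# F1 scores (train, test) copied from Table 5.6
f1_scores = {
    "LR": (0.48, 0.50),
    "DT": (0.66, 0.64),
    "RF": (0.76, 0.72),
    "LDA": (0.48, 0.93),
    "KNN": (0.20, 0.12),
}

def robust_f1(pair):
    """Worse of the train and test F1 scores."""
    return min(pair)

def train_test_gap(pair):
    """Absolute difference between train and test F1 scores."""
    return abs(pair[0] - pair[1])

best_model = max(f1_scores, key=lambda name: robust_f1(f1_scores[name]))
# Flag models whose train and test F1 disagree by more than 0.10 (assumed threshold)
flagged = [name for name, pair in f1_scores.items() if train_test_gap(pair) > 0.10]
print(best_model, flagged)  # RF ['LDA']
```

Under this rule Random Forest ranks first, and the LDA row is flagged because its test F1 score is 0.45 higher than its train F1 score.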

c. Ensemble modelling, wherever applicable

Ensemble modelling is a process where multiple diverse models are created to predict an outcome, either by
using many different modelling algorithms or by using different training data sets. The ensemble model then
aggregates the predictions of the base models and produces one final prediction for the unseen data.
In this case, five different models are built for analysing the data and predicting.
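The aggregation step described above can be sketched as a hard vote across different base models. The report does not state that such a combined model was actually built, so this is only an illustration; the predicted labels for the four customers are hypothetical, and the model names are used purely as dictionary keys.

```python
from collections import Counter

def majority_vote(predictions_by_model):
    """Combine per-model label lists into one final label per customer."""
    label_lists = list(predictions_by_model.values())
    n_rows = len(label_lists[0])
    final = []
    for i in range(n_rows):
        votes = Counter(labels[i] for labels in label_lists)
        final.append(votes.most_common(1)[0][0])  # most frequent label wins
    return final

predictions = {  # hypothetical predictions for 4 customers
    "LR":  [0, 1, 0, 0],
    "DT":  [0, 1, 1, 0],
    "RF":  [0, 1, 1, 0],
    "LDA": [0, 0, 0, 0],
    "KNN": [0, 0, 1, 1],
}
print(majority_vote(predictions))  # [0, 1, 1, 0]
```

Using an odd number of base models avoids tied votes in a two-class problem.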

d. Any other model tuning measure

Grid Search, also known as parameter sweeping, is one of the most basic and traditional methods of
hyperparameter optimization. This method involves manually defining a subset of the hyperparameter space and
exhausting all combinations of the specified hyperparameter values. Each combination’s performance is then
evaluated, typically using cross-validation, and the best performing hyperparameter combination is chosen.
For each model, grid search is used to estimate the best parameters for building the model.
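The exhaustive search described above can be sketched with `itertools.product`. The candidate values in `param_grid` are hypothetical, since the report states only the values finally chosen, and `toy_score` is a stand-in for a cross-validated score that would normally come from fitting the model on each fold.

```python
from itertools import product

param_grid = {  # hypothetical candidate values
    "max_depth": [5, 10],
    "min_samples_leaf": [10, 20],
    "min_samples_split": [50, 100],
}

def iter_param_combinations(grid):
    """Yield every combination of the listed hyperparameter values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

def grid_search(grid, score_fn):
    """Return (best_params, best_score) over all combinations."""
    best_params, best_score = None, float("-inf")
    for params in iter_param_combinations(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Placeholder score; a real run would return a cross-validated metric."""
    return -(abs(params["max_depth"] - 10)
             + abs(params["min_samples_leaf"] - 10)
             + abs(params["min_samples_split"] - 50))

best_params, best_score = grid_search(param_grid, toy_score)
n_combinations = len(list(iter_param_combinations(param_grid)))
```

The number of combinations is the product of the list lengths (here 2 × 2 × 2 = 8), so the cost of the search grows quickly as more values are added.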

e. Interpretation of the most optimum model and its implication on the business
The most optimum model is the Random Forest model. This is evident from the performance metrics shown above
and from the combined ROC curves shown below, where the Random Forest model has outperformed all the other
models.

Figure 5.16: Comparison of ROC Curve of all model train set (L) and test set (R)

6. Final Interpretation and Recommendations

a. Insights
• The business must increase the tenure of its customers, which can be done by initiating loyalty programs or
special pricing for loyal customers.
• Target customers in Tier 2 and Tier 3 cities appropriately.
• The best time to offer promotions is at T = 2 (2 units of tenure).
• Focus more on customers who use CC, COD and e-wallet as mode of payment, since this is a very critical
segment.
• Most customers rated the services provided by the business "3".
• Most customers rated their interactions with customer care representatives "3".
• Transactions via UPI and e-wallet are very low.
• Maximum churn is from the account segment "Regular+".
• Customers whose marital status is "single" contribute the most to churn.
• Complaints raised in the last 12 months do not show any impact on churn.
• Tenure and cashback are directly proportional to each other.
• Computer usage is highest in Tier 1 cities, followed by Tier 3 and Tier 2 cities.

b. Recommendations
• The business can partner with other lifestyle vendors to provide vouchers to new as well as existing loyal
customers.
• Send customized email responses to priority customers, based on segmentation, for better customer interaction.
• Set up a specialized customer service team for top-tier customers to avoid waiting time and improve customer
experience and interaction.
• The business can promote its own e-wallet as a payment option by giving a discount on the bill.
• Thanking customers with handwritten notes on invoices will create goodwill.
• The business needs to make sure that all complaints and queries raised are resolved on time.
• The business can promote payment via standing instruction on a bank account or UPI, which can be hassle-free
and safe for the customer.
• Follow up on customers' issues and take regular feedback on them.

Appendix
• https://public.tableau.com/views/Capstone-Notes1_16694798934420/Bi-varaite2?:language=en-US&:display_count=n&:origin=viz_share_link
