
Capstone Project Final Report: Churn Prediction

This document presents the findings of a capstone project analyzing customer churn. It includes sections on exploratory data analysis, data cleaning, model building using logistic regression, decision trees, random forests and other techniques, and model validation. Univariate and bivariate analyses were conducted to understand customer attributes and their relationship to churn. Data preprocessing addressed missing values, outliers and transformations. Multiple models were trained and validated on test data, with random forests showing the best performance. Key insights and recommendations are provided to help reduce customer churn.


CAPSTONE PROJECT

FINAL REPORT
SUBMITTED BY

PUVYA RAVI
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
1. Introduction
   a. Problem statement
   b. Need of the Study/Project
   c. Understanding Business/Social opportunity
2. Exploratory Data Analysis and Business Implication
   a. Understanding how data was collected in terms of time, frequency and methodology
   b. Visual inspection of data
   c. Understanding of attributes
   d. Univariate Analysis
   e. Bivariate Analysis
3. Data Cleaning and Pre-processing
   a. Removal of unwanted variables
   b. Missing values treatment
   c. Outlier treatment
   d. Variable transformation
4. Model building
   a. Logistic Regression
   b. Decision Tree
   c. Random Forest
   d. Linear Discriminant Analysis
   e. K-Nearest Neighbours
5. Model Validation
   a. Test predictive model against the test set using various appropriate performance metrics
      i. Logistic Regression
      ii. Decision Tree
      iii. Random Forest
      iv. Linear Discriminant Analysis
      v. K-Nearest Neighbours
   b. Interpretation of the model(s)
   c. Ensemble modelling, wherever applicable
   d. Any other model tuning measure
   e. Interpretation of the most optimum model and its implication on the business
6. Final Interpretation and Recommendations
   a. Insights
   b. Recommendations
Appendix
LIST OF FIGURES
Figure 2.1: Count plot – Account Segment
Figure 2.2: Count plot – Gender
Figure 2.3: Count plot – Login Device
Figure 2.4: Count plot – Marital Status
Figure 2.5: Count plot – Payment
Figure 2.6: Count plot – Revenue per Month
Figure 2.7: Count plot – Tenure
Figure 2.8: Count plot – Account User Count
Figure 2.9: Count plot – Revenue Growth
Figure 2.10: Count plot – Coupon used for Payment
Figure 2.11: Count plot – Days since CC connect
Figure 2.12: Count plot – Cashback
Figure 2.13: Heat Map
Figure 2.14: Account Segment vs Churn
Figure 2.15: Gender vs Churn
Figure 2.16: Login Device vs Churn
Figure 2.17: Marital Status vs Churn
Figure 2.18: Revenue per Month vs Churn
Figure 2.19: City Tier vs Churn
Figure 2.20: Account User Count vs Churn
Figure 2.21: Payment vs Churn
Figure 2.22: Service Score vs Churn
Figure 3.1: Boxplot of Variables
Figure 3.2: Boxplot of Variables – Post outliers removal
Figure 4.1: Coefficients of Logistic Regression model
Figure 4.2: Decision Tree plot – 1
Figure 4.3: Decision Tree plot – 2
Figure 5.1: Confusion matrix of LR model train set (L) and test set (R)
Figure 5.2: Classification report of LR model train set (L) and test set (R)
Figure 5.3: ROC Curve of LR model train set (L) and test set (R)
Figure 5.4: Confusion matrix of DT model train set (L) and test set (R)
Figure 5.5: Classification report of DT model train set (L) and test set (R)
Figure 5.6: ROC Curve of DT model train set (L) and test set (R)
Figure 5.7: Confusion matrix of RF model train set (L) and test set (R)
Figure 5.8: Classification report of RF model train set (L) and test set (R)
Figure 5.9: ROC Curve of RF model train set (L) and test set (R)
Figure 5.10: Confusion matrix of LDA model train set (L) and test set (R)
Figure 5.11: Classification report of LDA model train set (L) and test set (R)
Figure 5.12: ROC Curve of LDA model train set (L) and test set (R)
Figure 5.13: Confusion matrix of KNN model train set (L) and test set (R)
Figure 5.14: Classification report of KNN model train set (L) and test set (R)
Figure 5.15: ROC Curve of KNN model train set (L) and test set (R)
Figure 5.16: Comparison of ROC Curves of all models – train set (L) and test set (R)
LIST OF TABLES
Table 2.1: First 5 rows of dataset
Table 2.2: Last 5 rows of dataset
Table 2.3: Statistical description of the numerical variables of dataset
Table 2.4: Information about the dataset
Table 3.1: Fraction of null values in each column
Table 4.1: Top 5 rows of independent variables – Train set
Table 4.2: Top 5 rows of independent variables – Test set
Table 4.3: Top 5 rows of dependent variable – Train set (L) and Test set (R)
Table 4.4: Decision Tree prediction probability – Train set (L) and Test set (R)
Table 4.5: Random Forest prediction probability – Train set (L) and Test set (R)
Table 5.1: Logistic Regression model performance metrics
Table 5.2: Decision Tree model performance metrics
Table 5.3: Random Forest model performance metrics
Table 5.4: Linear Discriminant Analysis model performance metrics
Table 5.5: K-Nearest Neighbours model performance metrics
Table 5.6: Consolidated performance metrics
1. Introduction
a. Problem statement
An e-commerce company is facing stiff competition in the current market, and retaining existing customers has become a challenge. The company therefore wants to develop a model to predict churn at the account level and provide segmented offers to potential churners. For this company, account churn is a major concern because one account can have multiple customers; by losing one account, the company may be losing more than one customer.

b. Need of the Study/Project


• To understand the factors that are responsible for customer churn
• To propose commercial actions aimed at retaining clients that are showing signs of churn, and to present them with customised offers
• To develop a churn prediction model for this company
• To provide business recommendations on the campaign

c. Understanding Business/Social opportunity


The average customer churn stands at 16.84%. This presents an opportunity to reduce the costs associated with customer churn by a significant margin by identifying the leaks and understanding churn behaviour. Data shows that retaining existing customers is equally important, if not more so, than acquiring new ones: the cost of acquiring a new customer is roughly five times the cost of retaining an existing one. While acquisition increases the number of customers, retention maximises the value of the customers already captured.
Every business needs to balance acquisition and retention costs. Acquisition is important to draw in new consumers and expand the base. Retention is normally less costly and builds loyalty and the brand; it also often relies on far less price sensitivity from customers, so long-term costs are recouped through sales. It is essential to track the ROI of both customer acquisition and customer retention. Poor ratios in either area are strong indicators that more research is necessary to see what is changing in the market or what is happening within the business.

2. Exploratory Data Analysis and Business Implication


a. Understanding how data was collected in terms of time, frequency and methodology
The dataset does not give information on how it was collected; it does not contain a date or time variable. Therefore, there is insufficient evidence to conclude how the data was collected, and no insight can be drawn in this regard.

b. Visual inspection of data


• The data has a shape of 11260 rows and 19 columns
• There are 3 data types in the dataset: 12 object, 5 float and 2 integer columns
• The dataset has 2676 null values in total
Columns (in order): AccountID, Churn, Tenure, City_Tier, CC_Contacted_LY, Payment, Gender, Service_Score, Account_user_count, account_segment, CC_Agent_Score, Marital_Status, rev_per_month, Complain_ly, rev_growth_yoy, coupon_used_for_payment, Day_Since_CC_connect, cashback, Login_device

0: 20000, 1, 4, 3, 6, Debit Card, Female, 3, 3, Super, 2, Single, 9, 1, 11, 1, 5, 159.93, Mobile
1: 20001, 1, 0, 1, 8, UPI, Male, 3, 4, Regular Plus, 3, Single, 7, 1, 15, 0, 0, 120.9, Mobile
2: 20002, 1, 0, 1, 30, Debit Card, Male, 2, 4, Regular Plus, 3, Single, 6, 1, 14, 0, 3, NaN, Mobile
3: 20003, 1, 0, 3, 15, Debit Card, Male, 2, 4, Super, 5, Single, 8, 0, 23, 0, 3, 134.07, Mobile
4: 20004, 1, 0, 1, 12, Credit Card, Male, 2, 3, Regular Plus, 5, Single, 3, 0, 11, 1, 3, 129.6, Mobile

Table 2.1: First 5 rows of dataset

Columns (in order): AccountID, Churn, Tenure, City_Tier, CC_Contacted_LY, Payment, Gender, Service_Score, Account_user_count, account_segment, CC_Agent_Score, Marital_Status, rev_per_month, Complain_ly, rev_growth_yoy, coupon_used_for_payment, Day_Since_CC_connect, cashback, Login_device

11255: 31255, 0, 10, 1, 34, Credit Card, Male, 3, 2, Super, 1, Married, 9, 0, 19, 1, 4, 153.71, Computer
11256: 31256, 0, 13, 1, 19, Credit Card, Male, 3, 5, HNI, 5, Married, 7, 0, 16, 1, 8, 226.91, Mobile
11257: 31257, 0, 1, 1, 14, Debit Card, Male, 3, 2, Super, 4, Married, 7, 1, 22, 1, 4, 191.42, Mobile
11258: 31258, 0, 23, 3, 11, Credit Card, Male, 4, 5, Super, 4, Married, 7, 0, 16, 2, 9, 179.9, Computer
11259: 31259, 0, 8, 1, 22, Credit Card, Male, 3, 2, Super, 3, Married, 5, 0, 13, 2, 3, 175.04, Mobile

Table 2.2: Last 5 rows of dataset


AccountID Churn City_Tier CC_Contacted_LY Service_Score CC_Agent_Score Complain_ly
count 11260 11260 11148 11158 11162 11144 10903
mean 25629.5 0.168384 1.653929 17.867091 2.902526 3.066493 0.285334
std 3250.62635 0.374223 0.915015 8.853269 0.725584 1.379772 0.451594
min 20000 0 1 4 0 1 0
25% 22814.75 0 1 11 2 2 0
50% 25629.5 0 1 16 3 3 0
75% 28444.25 0 3 23 3 4 1
max 31259 1 3 132 5 5 1

Table 2.3: Statistical description of the numerical variables of dataset

c. Understanding of attributes
# Column Non-Null Count Dtype
0 AccountID 11260 non-null int64
1 Churn 11260 non-null int64
2 Tenure 11158 non-null object
3 City_Tier 11148 non-null float64
4 CC_Contacted_LY 11158 non-null float64
5 Payment 11151 non-null object
6 Gender 11152 non-null object
7 Service_Score 11162 non-null float64
8 Account_user_count 11148 non-null object
9 account_segment 11163 non-null object
10 CC_Agent_Score 11144 non-null float64
11 Marital_Status 11048 non-null object
12 rev_per_month 11158 non-null object
13 Complain_ly 10903 non-null float64
14 rev_growth_yoy 11260 non-null object
15 coupon_used_for_payment 11260 non-null object
16 Day_Since_CC_connect 10903 non-null object
17 cashback 10789 non-null object
18 Login_device 11039 non-null object

Table 2.4: Information about the dataset

The first step in data exploration is to import several libraries in Python to explore and visualize the data. Then the numerical and categorical columns are explored, in addition to identification of missing data.
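As a sketch of this step, the usual pandas inspection calls are shown below on a tiny hand-made frame (the report does not give the dataset's file path, so no pd.read_csv call is shown; the column names are taken from Table 2.4):

```python
import pandas as pd

# Toy frame with a few of the report's columns; the project would
# instead load the full dataset, e.g. with pd.read_csv.
df = pd.DataFrame({
    "AccountID": [20000, 20001, 20002],
    "Churn": [1, 1, 1],
    "Tenure": [4, 0, 0],
    "cashback": [159.93, 120.9, None],
})

print(df.head())                 # visual inspection (Tables 2.1/2.2)
print(df.describe())             # numerical summary (Table 2.3)
df.info()                        # dtypes and non-null counts (Table 2.4)
print(df.isnull().sum().sum())   # total null entries
```

On the full dataset these calls produce the shapes and null counts quoted above.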

d. Univariate Analysis
Categorical variables are analysed using the following charts

Figure 2.1: Count plot – Account Segment


Figure 2.2: Count plot – Gender

Figure 2.3: Count plot – Login Device


Figure 2.4: Count plot – Marital Status

Figure 2.5: Count plot – Payment


Figure 2.6: Count plot – Revenue per Month

Figure 2.7: Count plot – Tenure


Figure 2.8: Count plot – Account User Count

Figure 2.9: Count plot – Revenue Growth


Figure 2.10: Count plot – Coupon used for Payment

Figure 2.11: Count plot – Days since CC connect


Figure 2.12: Count plot – Cashback

The charts shown above are count plots of the categorical variables of the dataset. The insights are as follows:

• According to the count plots, highest count is for the following cases
o Account Segment – Super
o Gender – Male
o Login Device – Mobile
o Marital Status – Married
o Payment – Debit Card
o Revenue per Month – 3
o Tenure – 1 Year
o Account user count – 4
o Revenue growth – 14
o Coupon used for payment – 1
o Days since CC connected – 3
• The majority of customers (8186) fall under the Super and Regular Plus account segments
• Debit card and Credit card together account for 8098 customers in the Payment counts
• Customers with either no tenure or just 1 year of tenure number more than 2000; all other tenure periods have customer counts below 520.
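Count plots like those above can be produced with pandas and matplotlib; the frame below is a hypothetical mini-sample standing in for the project data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical mini-sample, not the project data.
df = pd.DataFrame({"account_segment":
                   ["Super", "Super", "Regular Plus", "HNI", "Super"]})

counts = df["account_segment"].value_counts()  # counts behind the plot
counts.plot(kind="bar")
plt.title("Count plot - Account Segment")
plt.tight_layout()
plt.savefig("account_segment_counts.png")
print(counts)
```

The same pattern (value_counts followed by a bar plot, or seaborn's countplot) yields each of Figures 2.1 to 2.12.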

e. Bivariate Analysis

Figure 2.13: Heat Map


Analysis is done to check the relation of each variable with Churn (the target). The heat map shown above indicates the correlation among all the numerical variables. Focusing on the target, the highest correlation of Churn is with Complain_ly and the lowest is with AccountID; AccountID is also the only variable with which Churn is negatively correlated.
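The correlations behind the heat map come from DataFrame.corr(); sketched here on a made-up toy frame mimicking three of the numerical columns, not the report's data:

```python
import pandas as pd

# Toy frame: Complain_ly tracks Churn exactly, AccountID is a running index.
df = pd.DataFrame({
    "Churn":       [1, 1, 0, 0, 1, 0],
    "Complain_ly": [1, 1, 0, 0, 1, 0],
    "AccountID":   [20000, 20001, 20002, 20003, 20004, 20005],
})

corr = df.corr()  # pairwise Pearson correlation of numerical columns
print(corr["Churn"].sort_values(ascending=False))
# The report visualises the full matrix as a heat map
# (e.g. with seaborn.heatmap(corr, annot=True)).
```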

Figure 2.14: Account Segment vs Churn

Figure 2.15: Gender vs Churn


Figure 2.16: Login Device vs Churn

Figure 2.17: Marital Status vs Churn


Figure 2.18: Revenue per Month vs Churn

Figure 2.19: City Tier vs Churn


Figure 2.20: Account User Count vs Churn

Figure 2.21: Payment vs Churn


Figure 2.22: Service Score vs Churn

Insights from Bi-variate analysis are as follows:

• Churn count is the highest for the below scenarios


o Account Segment – Regular Plus
o Gender – Male
o Login Device – Mobile
o Marital Status – Single
o Payment – Debit card
o City tier – 1
o Account user count – 4
o Revenue per month – 3
o Service score - 3
• The main objective was to distinguish between churned and retained customers and to find the attributes associated with churn. At first, it was observed that single male customers have a slightly higher probability of churn.
• In addition, Mobile as the preferred login device is related to customer churn as well; this might be caused by the user experience of the mobile version of the e-commerce platform.
• However, this study shows that the Service score is higher among churned customers, which was not expected. On the other hand, Tenure and the count of orders are lower for churned customers, which is reasonable.

3. Data Cleaning and Pre-processing


a. Removal of unwanted variables
The column named AccountID is not related to any other variable; Churn, the target variable, has a very low correlation with AccountID (-0.0095). Therefore, it is considered an unwanted variable and is removed.

b. Missing values treatment


The dataset contains a considerable number of null entries. As shown in the table below, a few variables have no null entries while the remaining do.
Column Fraction of nulls
AccountID 0.000000
Churn 0.000000
Tenure 0.009059
City_Tier 0.009947
CC_Contacted_LY 0.009059
Payment 0.009680
Gender 0.009591
Service_Score 0.008703
Account_user_count 0.009947
account_segment 0.008615
CC_Agent_Score 0.010302
Marital_Status 0.018828
rev_per_month 0.009059
Complain_ly 0.031705
rev_growth_yoy 0.000000
coupon_used_for_payment 0.000000
Day_Since_CC_connect 0.031705
cashback 0.041829
Login_device 0.019627

Table 3.1: Fraction of null values in each column

This table shows, for each column, the fraction of the overall data that its null values contribute. It also shows that 4 variables have no null values at all; the variable with the highest fraction of null values is cashback. Null values are left untreated, rather than imputed with the mean, median or mode of the column, in order to avoid introducing error in the later steps of the project.
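The per-column fractions in Table 3.1 correspond to pandas' isnull().mean(); sketched here on a toy frame (the real computation would run on the full dataset):

```python
import pandas as pd

# Toy frame; the real computation runs over all 19 columns.
df = pd.DataFrame({
    "Churn": [1, 0, 0, 1],
    "Tenure": [4.0, None, 10.0, 1.0],
    "cashback": [159.93, 120.9, None, None],
})

null_frac = df.isnull().mean()  # fraction of nulls per column (Table 3.1)
print(null_frac)
# As in the report, the nulls are left untreated here rather than
# imputed with the mean, median or mode.
```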

c. Outlier treatment
The following plots show the outliers present in each of the variables.

Figure 3.1: Boxplot of Variables

Due to the presence of outliers in the variables, including the target, it is essential to carry out outlier
treatment before moving forward with the analysis. The below figure shows the box plot of variables after the removal
of outliers.
Figure 3.2: Boxplot of Variables – Post outliers removal
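The report does not state the exact treatment method; one common approach consistent with the before/after box plots is IQR-based capping, sketched here on a toy series:

```python
import pandas as pd

# Toy series with one obvious outlier (90).
s = pd.Series([10, 12, 11, 13, 12, 11, 90])

# Tukey's rule: values beyond 1.5 * IQR from the quartiles are outliers.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

treated = s.clip(lower, upper)  # cap values beyond the whiskers
print(treated.tolist())
```

Capping (rather than dropping rows) keeps the dataset size intact, which matters when the churn class is already a minority.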

d. Variable transformation
Some variables that should be in integer or float data types are stored as objects, while some categorical codes are stored as floats. It was necessary to change them to the appropriate data types, as follows:

Tenure – Object → Float
Complain_ly – Object → Float
City_Tier – Float → Object
Service_Score – Float → Object
CC_Agent_Score – Float → Object
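These conversions map onto pandas' to_numeric and astype; a minimal sketch with hypothetical values (the None entry shows why the numeric column ends up as float rather than int):

```python
import pandas as pd

df = pd.DataFrame({
    "Tenure": ["4", None, "10"],   # numeric values stored as object
    "City_Tier": [3.0, 1.0, 1.0],  # categorical code stored as float
})

# Object -> Float (nulls become NaN, forcing a float dtype)
df["Tenure"] = pd.to_numeric(df["Tenure"], errors="coerce")

# Float -> Object (categorical code rendered as a string label)
df["City_Tier"] = df["City_Tier"].astype(int).astype(str)

print(df.dtypes)
```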

4. Model building
Supervised learning is the category of data analysis where the target outcome is known or labelled (e.g., whether or not a customer churned). When the intention is instead to group customers based on why each churned, the problem becomes unsupervised; this may be done to explore the relationship between customers and the reasons they churn.
Classification and regression both belong to supervised learning, but the former applies where the outcome is finite while the latter handles infinitely many possible outcome values. A classification algorithm identifies the category of new observations on the basis of training data: a program learns from the given dataset or observations and then assigns new observations to one of a number of classes or groups, also called targets, labels or categories.
The main goal of a classification algorithm is to predict the output for categorical data. Classification algorithms can be further divided into two categories: linear models and non-linear models.
Data is split into “X” and “y”, where “X” is a data frame of the independent variables and “y” holds the dependent variable. These are further split into train and test sets using train_test_split with a 70:30 train:test ratio.
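A sketch of the split described above (the frame here is a small placeholder; in the project, X holds all the prepared predictor columns):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame; the project uses the full cleaned dataset.
df = pd.DataFrame({
    "Tenure": range(10),
    "Complain_ly": [0, 1] * 5,
    "Churn": [0, 0, 1, 0, 1, 0, 0, 1, 0, 1],
})
X = df.drop(columns=["Churn"])  # independent variables
y = df["Churn"]                 # dependent variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)  # 70:30 train:test split
print(X_train.shape, X_test.shape)
```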
Columns (in order): Tenure, City_Tier, CC_Contacted_LY, Payment, Gender, Service_Score, Account_user_count, account_segment, CC_Agent_Score, Marital_Status, rev_per_month, Complain_ly, rev_growth_yoy, coupon_used_for_payment, Day_Since_CC_connect, cashback, Login_device

7580: 11, 1, 22, 4, 1, 0, 3, 4, 5, 1, 6, 0, 5, 1, 2, 2417, 1
5198: 22, 3, 14, 3, 1, 4, 3, 4, 5, 1, 5, 0, 2, 1, 3, 3241, 1
1929: 15, 1, 14, 2, 1, 3, 2, 2, 1, 1, 1, 0, 3, 5, 9, 5158, 0
3427: 15, 1, 14, 1, 0, 3, 3, 4, 3, 3, 7, 0, 5, 2, 2, 3005, 1
6249: 0, 1, 22, 1, 0, 3, 4, 4, 4, 0, 5, 0, 5, 1, 0, 1510, 1

Table 4.1: Top 5 rows of independent variables – Train set


Columns (in order): Tenure, City_Tier, CC_Contacted_LY, Payment, Gender, Service_Score, Account_user_count, account_segment, CC_Agent_Score, Marital_Status, rev_per_month, Complain_ly, rev_growth_yoy, coupon_used_for_payment, Day_Since_CC_connect, cashback, Login_device

784: 0, 1, 31, 2, 0, 2, 0, 3, 2, 3, 1, 1, 2, 1, 1, 575, 1
6943: 0, 3, 22, 3, 1, 2, 2, 3, 5, 1, 4, 0, 10, 1, 0, 76, 0
3709: 10, 3, 23, 3, 0, 4, 4, 4, 5, 0, 5, 0, 4, 2, 24, 3308, 1
6439: 0, 1, 10, 1, 0, 2, 2, 3, 4, 3, 3, 0, 9, 1, 1, 111, 1
5310: 9, 1, 25, 2, 1, 3, 1, 4, 5, 1, 2, 0, 2, 2, 11, 3105, 1

Table 4.2: Top 5 rows of independent variables – Test set

Train set (L): 7580 → 0, 5198 → 0, 1929 → 0, 3427 → 0, 6249 → 1
Test set (R): 784 → 0, 6943 → 1, 3709 → 0, 6439 → 0, 5310 → 0

Table 4.3: Top 5 rows of dependent variable – Train set (L) and Test set (R)

Since this project is focused on a classification problem, the following models are built to analyse the dataset,
evaluate predictive performance and assess the importance of the available features:

• Logistic Regression
• Decision Tree
• Random Forest
• Linear Discriminant Analysis
• K Nearest Neighbours

a. Logistic Regression
Logistic Regression is one of the “white box” algorithms: it outputs the probability of each class, and an
appropriate cut-off can then be chosen to map those probabilities to the target classes. This model is used because
it makes few assumptions about the data and classification is fast.
A max_iter of 100, the liblinear solver, a tolerance of 0.0001 and an l2 penalty are chosen for the model. The
model is fit on the training set and its accuracy obtained. The model intercept is -1.458, and the coefficients are
as follows:

Figure 4.1: Coefficients of Logistic Regression model
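A minimal sketch of the fit described above, using synthetic data in place of the project's churn frame (the hyperparameter values are the ones stated in the text; the dataset here is a stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (17 predictors, as in Table 4.1)
X, y = make_classification(n_samples=500, n_features=17, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Hyperparameters as stated: max_iter=100, liblinear solver, tol=0.0001, l2 penalty
lr = LogisticRegression(max_iter=100, solver="liblinear", tol=1e-4, penalty="l2")
lr.fit(X_train, y_train)

print(lr.intercept_)  # the report's fitted intercept is about -1.458
print(lr.coef_)       # one coefficient per predictor (Figure 4.1)
```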

b. Decision Tree
A decision tree is a type of supervised machine learning model used to categorize or make predictions based on how
a previous set of questions was answered; the model is trained and tested on data that contains the desired
categorization. This model is selected because it is a white-box algorithm, null values present in the data set do
not affect it, and the resulting trees can be visualized.
The gini criterion is selected along with a maximum depth of 10, a minimum samples leaf of 10 and a minimum
samples split of 50; these best parameters are estimated using Grid Search.
Best score of the model: 0.894
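The tuning step can be sketched as follows. The candidate grid here is an illustrative assumption built around the reported best parameters (gini, depth 10, leaf 10, split 50), and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=500, n_features=17, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Candidate grid around the reported best parameters
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10],
    "min_samples_leaf": [10, 20],
    "min_samples_split": [50, 100],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=3)
grid.fit(X_train, y_train)

best_dt = grid.best_estimator_
proba = best_dt.predict_proba(X_test)  # per-class probabilities, as in Table 4.4
```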
Figure 4.2: Decision Tree plot – 1

Figure 4.3: Decision Tree plot – 2

Train set (L)                    Test set (R)
      P(0)      P(1)                   P(0)      P(1)
0  0.988848  0.011152           0  0.000000  1.000000
1  1.000000  0.000000           1  0.187500  0.812500
2  1.000000  0.000000           2  0.988848  0.011152
3  0.988848  0.011152           3  0.405405  0.594595
4  0.913043  0.086957           4  1.000000  0.000000

Table 4.4: Decision Tree Prediction probability – Train set (L) and Test set (R)

c. Random Forest
Random Forest is a supervised machine learning algorithm that is used widely in classification and regression
problems. It builds decision trees on different samples and takes their majority vote for classification and their
average in the case of regression. Grid Search is used to estimate the best parameters for the model. This model
is selected because it handles large datasets efficiently and achieves a high level of accuracy.
The chosen values are: a maximum depth of 10, maximum features of 11, minimum samples leaf of 10, minimum samples
split of 50 and 100 estimators.
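With those parameter values, the model fit can be sketched as below (synthetic stand-in data; the parameter values are the ones stated in the text):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=500, n_features=17, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Parameters as stated in the text
rf = RandomForestClassifier(
    max_depth=10,
    max_features=11,
    min_samples_leaf=10,
    min_samples_split=50,
    n_estimators=100,
    random_state=1,
)
rf.fit(X_train, y_train)
proba = rf.predict_proba(X_test)  # per-class probabilities, as in Table 4.5
```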

Train set (L)                    Test set (R)
      P(0)      P(1)                   P(0)      P(1)
0  0.991175  0.008825           0  0.190284  0.809716
1  0.988552  0.011448           1  0.290077  0.709923
2  0.948677  0.051323           2  0.966620  0.033380
3  0.991443  0.008557           3  0.627443  0.372557
4  0.510515  0.489485           4  0.992902  0.007098

Table 4.5: Random Forest Prediction probability – Train set (L) and Test set (R)

d. Linear Discriminant Analysis


Linear Discriminant Analysis (also called Normal Discriminant Analysis or Discriminant Function Analysis) is a
dimensionality reduction technique that is commonly used for supervised classification problems. It is used for
modelling differences between groups, i.e., separating two or more classes. Linear Discriminant Analysis is
selected as one of the models because it yields a low-variance classifier. The LDA model is built using the eigen
solver with automatic shrinkage.
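The LDA configuration described above maps directly onto scikit-learn (synthetic stand-in data; `shrinkage="auto"` selects the shrinkage intensity automatically, which is only available with the lsqr and eigen solvers):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=500, n_features=17, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Eigen solver with automatic shrinkage, as described in the text
lda = LinearDiscriminantAnalysis(solver="eigen", shrinkage="auto")
lda.fit(X_train, y_train)
acc = lda.score(X_test, y_test)  # test-set accuracy
```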

e. K-Nearest Neighbours
The K-Nearest Neighbours algorithm, also known as KNN, is a non-parametric, supervised learning classifier which
uses proximity to make classifications or predictions about the grouping of an individual data point. This model
is selected because it has no explicit training phase, which makes it fast to build.
The KNN model is built with 15 neighbours, uniform weights and the Minkowski metric.
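A sketch of that configuration (synthetic stand-in data; "fitting" KNN only stores the training points, which is why there is no real training phase):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=500, n_features=17, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# 15 neighbours, uniform weights, Minkowski metric, as stated in the text
knn = KNeighborsClassifier(n_neighbors=15, weights="uniform", metric="minkowski")
knn.fit(X_train, y_train)  # stores the training data; no model parameters are learned
pred = knn.predict(X_test)
```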
5. Model Validation
a. Test predictive model against the test set using various appropriate performance metrics
For every model built in this project, the same evaluation steps are applied to the output of each algorithm. The
performance metrics used to analyse these models are accuracy, the confusion matrix, and a classification report
containing the F1 score, recall and precision for each model.
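Those metrics can be computed as follows; Logistic Regression on synthetic data stands in here for any of the five fitted models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data, with any one of the fitted models
X, y = make_classification(n_samples=500, n_features=17, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
model = LogisticRegression(max_iter=100, solver="liblinear").fit(X_train, y_train)
pred = model.predict(X_test)

acc = accuracy_score(y_test, pred)
cm = confusion_matrix(y_test, pred)           # 2x2 table of counts
report = classification_report(y_test, pred)  # precision, recall, F1 per class
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])  # area under ROC
```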

i. Logistic Regression
Accuracy score for the Logistic Regression model’s train set is 0.8722 and test set is 0.8712. Area under the ROC
curve is 0.841 for the train set and 0.845 for the test set.

Figure 5.1: Confusion matrix of LR model train (L) and test set (R)

Figure 5.2: Classification report of LR model train set (L) and test set (R)

Figure 5.3: ROC Curve of LR model train set (L) test set (R)

Train Test
Precision 0.75 0.72
Recall 0.36 0.38
F1 0.48 0.50

Table 5.1: Logistic Regression model Performance metrics

ii. Decision Tree


Accuracy score for the Decision Tree model’s train set is 0.9211 and test set is 0.9002
Area under the ROC curve is 0.897 for train set and 0.891 for test set

Figure 5.4: Confusion matrix of DT model train (L) and test set (R)
Figure 5.5: Classification report of DT model train set (L) and test set (R)

Figure 5.6: ROC Curve of DT model train set (L) and test set (R)

Train Test
Precision 0.74 0.73
Recall 0.59 0.58
F1 0.66 0.64

Table 5.2: Decision Tree model Performance metrics

iii. Random Forest


Accuracy score for the Random Forest model’s train set is 0.9211 and test set is 0.9002
Area under the ROC curve is 0.961 for train set and 0.926 for test set

Figure 5.7: Confusion matrix of RF model train (L) and test set (R)

Figure 5.8: Classification report of RF model train set (L) and test set (R)
Figure 5.9: ROC Curve of RF model train set (L) and test set (R)

Train Test
Precision 0.87 0.82
Recall 0.67 0.64
F1 0.76 0.72

Table 5.3: Random Forest model Performance metrics

iv. Linear Discriminant Analysis


Accuracy score for the Linear Discriminant Analysis model’s train set is 0.871 and test set is 0.872
Area under the ROC curve is 0.897 for train set and 0.891 for test set

Figure 5.10: Confusion matrix of LDA model train (L) and test set (R)

Figure 5.11: Classification report of LDA model train set (L) and test set (R)

Figure 5.12: ROC Curve of LDA model train set (L) and test set (R)
Train Test
Precision 0.75 0.89
Recall 0.35 0.97
F1 0.48 0.93

Table 5.4: Linear Discriminant Analysis model Performance metrics

v. K-Nearest Neighbours
Accuracy score for the K-Nearest Neighbours model’s train set is 0.842 and test set is 0.830
Area under the ROC curve is 0.823 for train set and 0.720 for test set

Figure 5.13: Confusion matrix of KNN model train (L) and test set (R)

Figure 5.14: Classification report of KNN model train set (L) and test set (R)

Figure 5.15: ROC Curve of KNN model train set (L) and test set (R)

Train Test
Precision 0.69 0.48
Recall 0.12 0.07
F1 0.20 0.12

Table 5.5: K-Nearest Neighbours model Performance metrics

b. Interpretation of the model(s)

LR Model DT Model RF Model LDA Model KNN Model


Train Test Train Test Train Test Train Test Train Test
Precision 0.75 0.72 0.74 0.73 0.87 0.82 0.75 0.89 0.69 0.48
Recall 0.36 0.38 0.59 0.58 0.67 0.64 0.35 0.97 0.12 0.07
F1 Score 0.48 0.50 0.66 0.64 0.76 0.72 0.48 0.93 0.20 0.12
Accuracy 0.87 0.87 0.92 0.90 0.92 0.90 0.87 0.87 0.84 0.83
AUC 0.84 0.85 0.90 0.89 0.96 0.93 0.90 0.89 0.82 0.72
Table 5.6: Consolidated Performance Metrics
The table shown above is a consolidated view of all the performance metrics of each model. It is evident that all
the metrics for the K-Nearest Neighbours model are low compared to the other models, while the metrics of the
Random Forest model are the highest of all. The F1 score is the point of focus, and it is highest for both the
train and test sets of the Random Forest model.

c. Ensemble modelling, wherever applicable


Ensemble modelling is a process where multiple diverse models are created to predict an outcome, either by using
many different modelling algorithms or by using different training data sets. The ensemble model then aggregates
the prediction of each base model into one final prediction for unseen data.
In this case, five different models are built for analysing the data and predicting; of these, Random Forest is
itself an ensemble method.
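The report evaluates the models separately rather than combining them, but a soft-voting ensemble over several of the base models could be sketched as below (synthetic stand-in data; the estimator list and settings are illustrative assumptions, not the project's configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=500, n_features=17, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Soft voting averages the base models' predicted probabilities
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=100, solver="liblinear")),
        ("dt", DecisionTreeClassifier(max_depth=10, random_state=1)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=1)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
acc = ensemble.score(X_test, y_test)  # aggregated accuracy on unseen data
```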

d. Any other model tuning measure


Grid Search, also known as parameter sweeping, is one of the most basic and traditional methods of hyperparameter
optimization. This method involves manually defining a subset of the hyperparameter space and exhausting all
combinations of the specified hyperparameter values. Each combination’s performance is then evaluated, typically
using cross-validation, and the best-performing hyperparameter combination is chosen.
For each model, grid search is used to estimate the best parameters for building the model.
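The "exhausting all combinations" step can be made concrete with a small hypothetical grid: with two candidate values for each of three parameters, grid search scores every one of the 2 × 2 × 2 = 8 combinations, each via cross-validation:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid for illustration: grid search enumerates the full
# Cartesian product of the candidate values
grid = {
    "max_depth": [5, 10],
    "min_samples_leaf": [10, 50],
    "n_estimators": [100, 200],
}
print(len(ParameterGrid(grid)))  # 8 candidate combinations
```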

e. Interpretation of the most optimum model and its implication on the business
The optimal model is the Random Forest model. This is evident from the performance metrics shown above and from
the combined ROC curves shown below: the Random Forest model has outperformed all the other models.

Figure 5.16: Comparison of ROC Curve of all model train set (L) and test set (R)

6. Final Interpretation and Recommendations


a. Insights
• Business must work to increase customer tenure, for example by initiating loyalty programmes or special pricing
for loyal customers
• Target customers in Tier 2 and Tier 3 cities more precisely
• The best time to offer promotions is at T = 2 (2 units of tenure)
• Focus more on customers who use CC, COD and e-wallet modes of payment, since this is a very critical segment
• Most customers rated the services provided by the business "3"
• Most customers rated their interactions with customer care representatives "3"
• Transactions via UPI and e-wallet are very low
• Maximum churn comes from the "Regular+" account segment
• Customers with marital status "Single" contribute the most towards churn
• Complaints raised in the last 12 months do not show any impact on churn
• Tenure and cashback are directly proportional to each other
• Computer usage is highest in Tier 1 cities, followed by Tier 3 and Tier 2 cities

b. Recommendations
• Business can partner with lifestyle vendors to provide vouchers to new as well as existing loyal customers
• Send customized email responses to priority customers, based on segmentation, for better customer interaction
• Set up a specialized customer-service team for top-tier customers to avoid waiting time and improve customer
experience and interaction
• Business can promote its own e-wallet as a payment option by offering a discount on the bill
• Thanking customers with handwritten notes on invoices will create goodwill
• Business needs to make sure that all complaints and queries raised are resolved on time
• Business can promote payment via standing instructions on bank accounts or via UPI, which is hassle-free and
safe for customers
• Follow up on customer issues and take regular feedback on the same

Appendix
• https://public.tableau.com/views/Capstone-Notes1_16694798934420/Bi-varaite2?:language=en-
US&:display_count=n&:origin=viz_share_link
