Capstone - Project - Final - Report - Churn - Prediction
FINAL REPORT
SUBMITTED BY
PUVYA RAVI
Index | AccountID | Churn | Tenure | City_Tier | CC_Contacted_LY | Payment | Gender | Service_Score | Account_user_count | account_segment | CC_Agent_Score | Marital_Status | rev_per_month | Complain_ly | rev_growth_yoy | coupon_used_for_payment | Day_Since_CC_connect | cashback | Login_device
11255 | 31255 | 0 | 10 | 1 | 34 | Credit Card | Male | 3 | 2 | Super | 1 | Married | 9 | 0 | 19 | 1 | 4 | 153.71 | Computer
11256 | 31256 | 0 | 13 | 1 | 19 | Credit Card | Male | 3 | 5 | HNI | 5 | Married | 7 | 0 | 16 | 1 | 8 | 226.91 | Mobile
11257 | 31257 | 0 | 1 | 1 | 14 | Debit Card | Male | 3 | 2 | Super | 4 | Married | 7 | 1 | 22 | 1 | 4 | 191.42 | Mobile
11258 | 31258 | 0 | 23 | 3 | 11 | Credit Card | Male | 4 | 5 | Super | 4 | Married | 7 | 0 | 16 | 2 | 9 | 179.9 | Computer
11259 | 31259 | 0 | 8 | 1 | 22 | Credit Card | Male | 3 | 2 | Super | 3 | Married | 5 | 0 | 13 | 2 | 3 | 175.04 | Mobile
c. Understanding of attributes
# Column Non-Null Count Dtype
0 AccountID 11260 non-null int64
1 Churn 11260 non-null int64
2 Tenure 11158 non-null object
3 City_Tier 11148 non-null float64
4 CC_Contacted_LY 11158 non-null float64
5 Payment 11151 non-null object
6 Gender 11152 non-null object
7 Service_Score 11162 non-null float64
8 Account_user_count 11148 non-null object
9 account_segment 11163 non-null object
10 CC_Agent_Score 11144 non-null float64
11 Marital_Status 11048 non-null object
12 rev_per_month 11158 non-null object
13 Complain_ly 10903 non-null float64
14 rev_growth_yoy 11260 non-null object
15 coupon_used_for_payment 11260 non-null object
16 Day_Since_CC_connect 10903 non-null object
17 cashback 10789 non-null object
18 Login_device 11039 non-null object
The first step in data exploration is to import the Python libraries needed to explore and visualize the data. The
numerical and categorical columns are then explored, and missing data are identified.
d. Univariate Analysis
Categorical variables are analysed using the following charts
All the charts shown above are count plots of the categorical variables in the data set. The insights are as follows:
• According to the count plots, the highest count is for the following categories:
o Account Segment – Super
o Gender – Male
o Login Device – Mobile
o Marital Status – Married
o Payment – Debit Card
o Revenue per Month – 3
o Tenure – 1 Year
o Account user count – 4
o Revenue growth – 14
o Coupon used for payment – 1
o Days since CC connected – 3
• The majority of customers (8186) fall under the account segments Super and Regular Plus.
• Debit Card and Credit Card alone account for 8098 customers by payment method.
• Customers with either no tenure or just 1 year of tenure account for more than 2000 customers. All the
other tenure periods have a customer count of less than 520.
e. Bivariate Analysis
This table shows, for each column, the percentage of the overall data that its null values represent. It also
shows that 4 variables have no null values at all. The variable with the highest percentage of null values is
Cashback. Null values are ignored in order to avoid errors in later steps of the project; they are not imputed with
the mean, median or mode of the column.
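A null-percentage table of this kind can be computed with pandas. The sketch below is a minimal illustration on a small made-up frame: the column names follow the data set, but the values are hypothetical. Dropping incomplete rows with `dropna` is an assumption about how the nulls were set aside, since the text only says they are ignored.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; values are made up, not taken from the project data.
df = pd.DataFrame({
    "AccountID": [1, 2, 3, 4],
    "Tenure": [10, np.nan, 1, 23],
    "cashback": [153.71, np.nan, np.nan, 179.9],
})

# Count of nulls per column and their share of all rows, in percent.
null_counts = df.isnull().sum()
null_percent = (null_counts / len(df) * 100).round(2)
null_table = pd.DataFrame({"null_count": null_counts, "null_percent": null_percent})
null_table = null_table.sort_values("null_percent", ascending=False)
print(null_table)

# Assumption: "ignoring" nulls means dropping incomplete rows rather than imputing.
df_complete = df.dropna()
print(len(df_complete))
```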
c. Outlier treatment
The following plots show the outliers present in each of the variables.
Because outliers are present in the variables, including the target, it is essential to carry out outlier
treatment before moving forward with the analysis. The figure below shows the box plots of the variables after the
removal of outliers.
Figure 3.2: Boxplot of Variables – Post outliers removal
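The report does not state which rule was used to treat the outliers. One common choice, shown here purely as an assumption, is to cap values lying more than 1.5 times the interquartile range beyond the quartiles. The helper `cap_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up cashback values with one extreme observation.
cashback = pd.Series([150.0, 160.0, 170.0, 180.0, 190.0, 2000.0])
capped = cap_outliers_iqr(cashback)
print(capped.tolist())
```

Capping keeps every row and only pulls extreme values back to the whisker limits. Dropping the affected rows instead would be an equally plausible reading of "removal".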
d. Variable transformation
Variables that should be of integer or float data type are stored as object data type, so it was necessary to
convert them to the appropriate data type. They are as follows:
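A conversion of this kind can be done with `pandas.to_numeric` and `errors="coerce"`, which turns entries that cannot be parsed into NaN. The sketch below is an assumption about the approach rather than the project's actual code. The stray symbols in the sample values are invented for illustration, and `numeric_cols` is a hypothetical subset of the object columns in the dtype listing.

```python
import pandas as pd

# Made-up sample: numeric columns stored as text, with stray symbols.
df = pd.DataFrame({
    "Tenure": ["10", "13", "#", "23"],
    "cashback": ["153.71", "226.91", "$", "179.9"],
})

numeric_cols = ["Tenure", "cashback"]  # assumed subset of the object columns
for col in numeric_cols:
    # Entries that cannot be parsed as numbers become NaN.
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
```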
4. Model building
Supervised Learning is the category of data analysis in which the target outcome is known or
labelled (e.g., whether the customer churned or did not). However, when the intention is to group customers based on
why each churned, the problem becomes Unsupervised. This may be done to explore the relationship between customers
and why they churn.
Classification and Regression both belong to Supervised Learning, but the former is applied where the
outcome takes a finite set of values, while the latter is for outcomes with infinitely many possible (continuous)
values. A Classification algorithm identifies the category of new observations on the basis of training data: the
program learns from the given observations and then assigns each new observation to one of a number of classes or
groups. Classes are also called targets, labels or categories.
The main goal of a Classification algorithm is to identify the category of a given observation, and these
algorithms are mainly used to predict categorical outputs. Classification algorithms can be further
divided into two categories: Linear Models and Non-Linear Models.
The data is split into “X” and “y”, where “X” is a data frame of the independent variables and “y” holds the
dependent variable. These are further split into train and test data sets using train_test_split.
The train-to-test ratio is 70:30.
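The split described above can be sketched as follows. The small frame is made up and stands in for the prepared data set; `random_state=1` is an arbitrary choice for reproducibility and is not stated in the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up encoded data standing in for the prepared data set.
df = pd.DataFrame({
    "Tenure": [11, 22, 15, 15, 0, 4, 9, 1, 7, 30],
    "City_Tier": [1, 3, 1, 1, 1, 2, 3, 1, 1, 2],
    "Churn": [0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
})

X = df.drop(columns="Churn")  # independent variables
y = df["Churn"]               # dependent variable

# 70:30 train-to-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```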
Index | Tenure | City_Tier | CC_Contacted_LY | Payment | Gender | Service_Score | Account_user_count | account_segment | CC_Agent_Score | Marital_Status | rev_per_month | Complain_ly | rev_growth_yoy | coupon_used_for_payment | Day_Since_CC_connect | cashback | Login_device
7580 | 11 | 1 | 22 | 4 | 1 | 0 | 3 | 4 | 5 | 1 | 6 | 0 | 5 | 1 | 2 | 2417 | 1
5198 | 22 | 3 | 14 | 3 | 1 | 4 | 3 | 4 | 5 | 1 | 5 | 0 | 2 | 1 | 3 | 3241 | 1
1929 | 15 | 1 | 14 | 2 | 1 | 3 | 2 | 2 | 1 | 1 | 1 | 0 | 3 | 5 | 9 | 5158 | 0
3427 | 15 | 1 | 14 | 1 | 0 | 3 | 3 | 4 | 3 | 3 | 7 | 0 | 5 | 2 | 2 | 3005 | 1
6249 | 0 | 1 | 22 | 1 | 0 | 3 | 4 | 4 | 4 | 0 | 5 | 0 | 5 | 1 | 0 | 1510 | 1
Train index | Churn | Test index | Churn
7580 | 0 | 784 | 0
5198 | 0 | 6943 | 1
1929 | 0 | 3709 | 0
3427 | 0 | 6439 | 0
6249 | 1 | 5310 | 0
Table 4.3: Top 5 rows of dependent variables – Train set (L) and Test set (R)
Since this project is focused on a Classification problem, the following models are built to analyse and
review the dataset, to measure performance and to obtain the importance of the features available in the dataset,
which can provide more information about the subjects.
• Logistic Regression
• Decision Tree
• Random Forest
• Linear Discriminant Analysis
• K Nearest Neighbours
a. Logistic Regression
Logistic Regression is one of the “white box” algorithms. It outputs a probability for each observation, and an
appropriate cut-off point is then chosen to convert these probabilities into target class outputs. This model is
used because it makes few assumptions about the data and classification is fast.
A max_iter of 100, the liblinear solver, a tolerance of 0.0001 and an l2 penalty are chosen for the model. The model
is fit on the training set and accuracy is obtained. The model intercept is -1.458 and the coefficients are as follows:
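A model with these settings can be sketched in scikit-learn as shown here. The data is synthetic (generated with `make_classification`) and only stands in for the churn data, so the intercept and coefficients it prints will differ from those reported. The 0.4 cut-off is an arbitrary example, not a value from the report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn data set.
X, y = make_classification(
    n_samples=1000, n_features=17, n_informative=8,
    weights=[0.83, 0.17], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Hyperparameters as stated in the text.
lr_model = LogisticRegression(
    max_iter=100, solver="liblinear", tol=0.0001, penalty="l2"
)
lr_model.fit(X_train, y_train)

print("Intercept:", lr_model.intercept_)
print("Coefficients:", lr_model.coef_)
print("Train accuracy:", lr_model.score(X_train, y_train))

# Class probabilities allow a custom cut-off instead of the default 0.5.
churn_prob = lr_model.predict_proba(X_test)[:, 1]
custom_pred = (churn_prob >= 0.4).astype(int)  # 0.4 is an arbitrary example cut-off
```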
b. Decision Tree
A decision tree is a supervised machine learning model that categorizes or makes predictions based on how a
sequence of questions is answered. Being a form of supervised learning, the model is trained and tested on a set of
data that contains the desired categorization. This model is selected because it is a white-box algorithm. Also,
null values present in the data set do not affect the algorithm, and the trees formed can be visualized.
A criterion of Gini is selected along with a maximum depth of 10, a minimum of 10 samples per leaf and a minimum
of 50 samples per split. The best parameters are estimated using Grid Search.
Best score of the model: 0.894
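The grid search step can be sketched with scikit-learn's `GridSearchCV`. The grid below is hypothetical: it only needs to contain the reported values, and the report does not list the alternatives that were searched. The 3-fold cross-validation and the synthetic data are likewise assumptions, so the printed score will not match 0.894.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the training set.
X, y = make_classification(
    n_samples=600, n_features=17, n_informative=8, random_state=1
)

# Hypothetical search grid; it contains the values reported in the text.
param_grid = {
    "criterion": ["gini"],
    "max_depth": [5, 10],
    "min_samples_leaf": [10, 20],
    "min_samples_split": [50, 100],
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid=param_grid,
    cv=3,                # number of folds is an assumption
    scoring="accuracy",
)
grid.fit(X, y)

print(grid.best_params_)
print(round(grid.best_score_, 3))
dt_model = grid.best_estimator_  # refit on all of X, y with the best parameters
```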
Figure 4.2: Decision Tree plot – 1
Row | Train: class 0 | Train: class 1 | Test: class 0 | Test: class 1
0 | 0.988848 | 0.011152 | 0.000000 | 1.000000
1 | 1.000000 | 0.000000 | 0.187500 | 0.812500
2 | 1.000000 | 0.000000 | 0.988848 | 0.011152
3 | 0.988848 | 0.011152 | 0.405405 | 0.594595
4 | 0.913043 | 0.086957 | 1.000000 | 0.000000
Table 4.4: Decision Tree Prediction probability – Train set (L) and Test set (R)
c. Random Forest
Random Forest is a supervised machine learning algorithm that is widely used in Classification and
Regression problems. It builds decision trees on different samples and takes their majority vote for classification
and their average for regression. Grid Search is used to estimate the best parameters for the model. This model is
selected for its efficiency in handling large datasets and its higher level of accuracy.
The parameter values are a maximum depth of 10, maximum features of 11, a minimum of 10 samples per leaf, a minimum
of 50 samples per split and 100 estimators.
Row | Train: class 0 | Train: class 1 | Test: class 0 | Test: class 1
0 | 0.991175 | 0.008825 | 0.190284 | 0.809716
1 | 0.988552 | 0.011448 | 0.290077 | 0.709923
2 | 0.948677 | 0.051323 | 0.966620 | 0.033380
3 | 0.991443 | 0.008557 | 0.627443 | 0.372557
4 | 0.510515 | 0.489485 | 0.992902 | 0.007098
Table 4.5: Random Forest Prediction probability – Train set (L) and Test set (R)
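Prediction probabilities like those in the table can be obtained with `predict_proba`. The sketch below uses the reported hyperparameters on synthetic data, so the probabilities it prints are illustrative only; `random_state=1` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data with 17 features, standing in for the churn data.
X, y = make_classification(
    n_samples=1000, n_features=17, n_informative=8,
    weights=[0.83, 0.17], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Hyperparameters as stated in the text.
rf_model = RandomForestClassifier(
    max_depth=10, max_features=11, min_samples_leaf=10,
    min_samples_split=50, n_estimators=100, random_state=1,
)
rf_model.fit(X_train, y_train)

# Each row holds P(class 0) and P(class 1); the two values sum to 1.
train_prob = rf_model.predict_proba(X_train)
test_prob = rf_model.predict_proba(X_test)
print(test_prob[:5])
```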
e. K-Nearest Neighbours
The K-Nearest Neighbours algorithm, also known as KNN, is a non-parametric, supervised learning classifier which
uses proximity to make classifications or predictions about the grouping of an individual data point. This model is
selected mainly because it does not have a training period, which makes it faster to fit.
The KNN model is built using 15 neighbours, uniform weights and the minkowski metric.
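The configuration above corresponds to the following scikit-learn sketch, again run on synthetic data as a stand-in. With scikit-learn's default `p=2`, the minkowski metric is the Euclidean distance; whether the project kept that default is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data standing in for the churn data set.
X, y = make_classification(
    n_samples=1000, n_features=17, n_informative=8,
    weights=[0.83, 0.17], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Settings as stated in the text; p is left at its default of 2.
knn_model = KNeighborsClassifier(
    n_neighbors=15, weights="uniform", metric="minkowski"
)
# Fitting mainly indexes the training data; distances are computed at prediction time.
knn_model.fit(X_train, y_train)

knn_pred = knn_model.predict(X_test)
print(knn_model.score(X_train, y_train), knn_model.score(X_test, y_test))
```

Because KNN is distance based, features on large scales (such as cashback) can dominate the distance, so scaling the inputs first is often worth trying.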
5. Model Validation
a. Test predictive model against the test set using various appropriate performance metrics
For every model built in the project, the same steps are used to evaluate the output generated by each
algorithm. The performance metrics chosen to analyse these models are accuracy, the confusion matrix, and the
classification report with F1 score, Recall and Precision for each model.
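These metrics can be computed with `sklearn.metrics`. The helper `evaluate` below is hypothetical and is shown with a logistic regression on synthetic data. It also returns the area under the ROC curve, which is reported for the models in this section.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix, roc_auc_score,
)
from sklearn.model_selection import train_test_split

def evaluate(model, X, y):
    """Return accuracy, confusion matrix, classification report and ROC AUC."""
    pred = model.predict(X)
    prob = model.predict_proba(X)[:, 1]  # probability of the positive class
    return {
        "accuracy": accuracy_score(y, pred),
        "confusion_matrix": confusion_matrix(y, pred),
        "report": classification_report(y, pred),
        "roc_auc": roc_auc_score(y, prob),
    }

# Synthetic imbalanced data standing in for the churn data set.
X, y = make_classification(
    n_samples=1000, n_features=17, n_informative=8,
    weights=[0.83, 0.17], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
model = LogisticRegression(solver="liblinear").fit(X_train, y_train)

for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    results = evaluate(model, X_part, y_part)
    print(name, round(results["accuracy"], 3), round(results["roc_auc"], 3))
    print(results["confusion_matrix"])
    print(results["report"])
```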
i. Logistic Regression
Accuracy score for the Logistic Regression model’s train set is 0.8722 and test set is 0.8712. Area under the ROC
curve is 0.841 for train set and 0.845 for test set.
Figure 5.1: Confusion matrix of LR model train (L) and test set (R)
Figure 5.2: Classification report of LR model train set (L) and test set (R)
Figure 5.3: ROC Curve of LR model train set (L) test set (R)
Train Test
Precision 0.75 0.72
Recall 0.36 0.38
F1 0.48 0.50
ii. Decision Tree
Figure 5.4: Confusion matrix of DT model train (L) and test set (R)
Figure 5.5: Classification report of DT model train set (L) and test set (R)
Figure 5.6: ROC Curve of DT model train set (L) and test set (R)
Train Test
Precision 0.74 0.73
Recall 0.59 0.58
F1 0.66 0.64
iii. Random Forest
Figure 5.7: Confusion matrix of RF model train (L) and test set (R)
Figure 5.8: Classification report of RF model train set (L) and test set (R)
Figure 5.9: ROC Curve of RF model train set (L) and test set (R)
Train Test
Precision 0.87 0.82
Recall 0.67 0.64
F1 0.76 0.72
iv. Linear Discriminant Analysis
Figure 5.10: Confusion matrix of LDA model train (L) and test set (R)
Figure 5.11: Classification report of LDA model train set (L) and test set (R)
Figure 5.12: ROC Curve of LDA model train set (L) and test set (R)
Train Test
Precision 0.75 0.89
Recall 0.35 0.97
F1 0.48 0.93
v. K-Nearest Neighbours
Accuracy score for the K-Nearest Neighbours model’s train set is 0.842 and test set is 0.830
Area under the ROC curve is 0.823 for train set and 0.720 for test set
Figure 5.13: Confusion matrix of KNN model train (L) and test set (R)
Figure 5.14: Classification report of KNN model train set (L) and test set (R)
Figure 5.15: ROC Curve of KNN model train set (L) and test set (R)
Train Test
Precision 0.69 0.48
Recall 0.12 0.07
F1 0.20 0.12
e. Interpretation of the optimal model and its implication on the business
The optimal model is the Random Forest model. This is evident from the performance metrics obtained for each
model, as shown above, and from the combined ROC curves shown below.
The Random Forest model has outperformed all the other models.
Figure 5.16: Comparison of ROC Curve of all model train set (L) and test set (R)
b. Recommendations
• The business can partner with other lifestyle vendors to provide vouchers to new as well as existing loyal
customers.
• Customized email responses to priority customers, based on segmentation, for better customer interaction.
• A specialized customer service team for top-tier customers to avoid waiting time and improve customer
experience and interaction.
• The business can promote its own e-wallet as a payment option by offering a discount on the bill.
• Thanking customers with handwritten notes on invoices will create goodwill.
• The business needs to make sure that all complaints and queries raised are resolved on time.
• The business can promote payment via standing instruction on a bank account or UPI, which can be hassle-free and
safe for the customer.
• Follow up on customers’ issues and take regular feedback on the same.
Appendix
• https://public.tableau.com/views/Capstone-Notes1_16694798934420/Bi-varaite2?:language=en-US&:display_count=n&:origin=viz_share_link