Predictive Modeling

Uploaded by

amansinhmar2303

Index

1. Introduction
2. Data Overview
 2.1 Dataset Description
 2.2 Data Exploration
3. Data Preprocessing
 3.1 Data Cleaning
 3.2 Feature Engineering
 3.3 Train-Test Split
4. Model Development
 4.1 Logistic Regression
 4.1.1 Model Training
 4.1.2 Model Evaluation
 4.2 Linear Discriminant Analysis (LDA)
 4.2.1 Model Training
 4.2.2 Model Evaluation
 4.3 Decision Tree
 4.3.1 Model Training
 4.3.2 Model Evaluation
5. Model Performance Comparison
 5.1 Training Set Performance
 5.2 Testing Set Performance
6. Feature Importance Analysis
7. Business Recommendations
8. Conclusion
9. References

Define the problem and perform Exploratory Data Analysis

Problem Definition: The primary objective is to analyze the data and build a machine
learning model that helps identify which leads are more likely to convert to paid
customers for ExtraaLearn. This involves:
1. Analyzing the dataset to understand the features and their relevance to lead conversion.
2. Building a predictive model to identify leads with a higher probability of conversion.
3. Determining the factors driving the lead conversion process.
4. Creating a profile of leads likely to convert based on the insights gained from the model.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Sets the display precision of floating-point numbers to 5 decimal places
pd.set_option("display.float_format", lambda x: "%.5f" % x)

# Library to split data
from sklearn.model_selection import train_test_split

# To build models for prediction
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models
from sklearn.model_selection import GridSearchCV

# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)

# Suppress warnings to keep the printed output readable
import warnings
warnings.filterwarnings("ignore")

Shape of the dataset: (4612, 15)

| Column | Data Type |
|------------------------|-------------------|
| ID | Object |
| age | Int64 |
| current_occupation | Object |
| first_interaction | Object |
| profile_completed | Object |
| website_visits | Int64 |
| time_spent_on_website | Int64 |
| page_views_per_visit | Float64 |
| last_activity | Object |
| print_media_type1 | Object |
| print_media_type2 | Object |
| digital_media | Object |
| educational_channels | Object |
| referral | Object |
| status | Int64 |

Statistical summary of numerical columns:

              age  website_visits  time_spent_on_website  page_views_per_visit      status
count  4612.00000      4612.00000             4612.00000            4612.00000  4612.00000
mean     46.20121         3.56678              724.01127               3.02613     0.29857
std      13.16145         2.82913              743.82868               1.96812     0.45768
min      18.00000         0.00000                0.00000               0.00000     0.00000
25%      36.00000         2.00000              148.75000               2.07775     0.00000
50%      51.00000         3.00000              376.00000               2.79200     0.00000
75%      57.00000         5.00000             1336.75000               3.75625     1.00000
max      63.00000        30.00000             2537.00000              18.43400     1.00000

Univariate analysis

Number of leads who haven't visited the website: 174


Multivariate analysis

status                   0     1   All
current_occupation
All                   3235  1377  4612
Professional          1687   929  2616
Unemployed            1058   383  1441
Student                490    65   555

status                   0     1   All
first_interaction
All                   3235  1377  4612
Website               1383  1159  2542
Mobile App            1852   218  2070

status                   0     1   All
profile_completed
All                   3235  1377  4612
High                  1318   946  2264
Medium                1818   423  2241
Low                     99     8   107

status                   0     1   All
last_activity
All                   3235  1377  4612
Email Activity        1587   691  2278
Website Activity       677   423  1100
Phone Activity         971   263  1234

status                   0     1   All
print_media_type1
All                   3235  1377  4612
No                    2897  1218  4115
Yes                    338   159   497

status                   0     1   All
print_media_type2
All                   3235  1377  4612
No                    3077  1302  4379
Yes                    158    75   233

status                   0     1   All
digital_media
All                   3235  1377  4612
No                    2876  1209  4085
Yes                    359   168   527

status                   0     1   All
educational_channels
All                   3235  1377  4612
No                    2727  1180  3907
Yes                    508   197   705

status                   0     1   All
referral
All                   3235  1377  4612
No                    3205  1314  4519
Yes                     30    63    93
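
The crosstabs above are easier to compare as conversion rates. The following is a minimal sketch that uses only counts reported above for `first_interaction`, `profile_completed`, and `referral`, and assumes that `status` 1 marks a converted lead. The helper name `conversion_rate` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the crosstabs above: (status 0, status 1).
counts = {
    ("first_interaction", "Website"): (1383, 1159),
    ("first_interaction", "Mobile App"): (1852, 218),
    ("profile_completed", "High"): (1318, 946),
    ("profile_completed", "Medium"): (1818, 423),
    ("profile_completed", "Low"): (99, 8),
    ("referral", "No"): (3205, 1314),
    ("referral", "Yes"): (30, 63),
}


def conversion_rate(not_converted: int, converted: int) -> float:
    """Share of leads in a group with status == 1 (hypothetical helper)."""
    return converted / (not_converted + converted)


# Tuple keys produce a two-level index: (feature, category).
rates = pd.Series(
    {key: conversion_rate(*value) for key, value in counts.items()},
    name="conversion_rate",
)
print(rates.round(3))
```

On these counts, leads whose first interaction was the website convert at about 45.6%, against about 10.5% for the mobile app, and referred leads convert at about 67.7% (63 of 93).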
#Data Preparation for modeling

Shape of Training set : (3228, 4627)
Shape of Test set : (1384, 4627)

Percentage of classes in training set:
status
0 0.70415
1 0.29585
Name: proportion, dtype: float64

Percentage of classes in test set:
status
0 0.69509
1 0.30491
Name: proportion, dtype: float64
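
The 4,627 columns reported for the training and test sets are what one would get if the `ID` column were one-hot encoded with `drop_first=True` together with the other categorical columns (4,611 ID dummies + 12 other dummies + 4 numeric columns). This is an inference from the reported shapes, not something the source states. The following is a minimal sketch of dropping the identifier before encoding, on a small synthetic frame; the values, `test_size=0.30`, and `random_state=1` are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the leads data (all values are made up).
data = pd.DataFrame(
    {
        "ID": [f"EXT{i:03d}" for i in range(10)],
        "age": [25, 47, 52, 33, 60, 41, 58, 29, 36, 55],
        "first_interaction": ["Website", "Mobile App"] * 5,
        "referral": ["No"] * 8 + ["Yes"] * 2,
        "status": [1, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    }
)

# ID is unique per row, so it carries no generalizable signal; drop it
# together with the target before one-hot encoding.
X = pd.get_dummies(data.drop(columns=["ID", "status"]), drop_first=True)
y = data["status"]

# For comparison: leaving ID in adds one dummy per unique ID (minus one).
X_with_id = pd.get_dummies(data.drop(columns=["status"]), drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print("Without ID:", X.shape, "With ID:", X_with_id.shape)
print("Train:", X_train.shape, "Test:", X_test.shape)
```

With an identifier encoded this way, a model can memorize individual training rows, which would be one possible explanation for a large gap between training and test scores.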

#Model evaluation criterion


Accuracy Recall Precision F1
0 0.80130 0.60190 0.70360 0.64879
Debugging: Inside confusion_matrix_statsmodels function
Model: LogisticRegression()
Predictors shape: (1384, 4627)
Target shape: (1384,)
Predictions shape: (1384,)
Confusion Matrix: [[855 107]
[168 254]]
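
The reported metrics can be reproduced from the confusion matrix alone. The following is a minimal sketch, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes), so the matrix reads as [[TN, FP], [FN, TP]].

```python
import numpy as np

# Test-set confusion matrix reported above.
cm = np.array([[855, 107], [168, 254]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all leads classified correctly
recall = tp / (tp + fn)           # share of actual converters that were found
precision = tp / (tp + fp)        # share of predicted converters that converted
f1 = 2 * precision * recall / (precision + recall)

print(
    f"Accuracy={accuracy:.5f} Recall={recall:.5f} "
    f"Precision={precision:.5f} F1={f1:.5f}"
)
```

These values agree with the Accuracy, Recall, Precision, and F1 figures printed above.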

#Building Logistic Regression Model

Model performance on test set:


Accuracy Recall Precision F1
0 0.80130 0.60190 0.70360 0.64879
Train ROC-AUC score is : 0.8772828307723493

Test ROC-AUC score is : 0.8565821107290302


#Using GridSearch for Hyperparameter tuning of our logistic regression model
Train ROC-AUC score is : 0.9989528795811519

Test ROC-AUC score is : 0.38762673537555054
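
The source does not show the grid that was searched. The following is a minimal sketch of tuning a logistic regression with `GridSearchCV` and comparing train and test ROC-AUC. The synthetic data, the parameter grid, `cv=5`, and the stratified split are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the encoded lead features (an assumption).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.7, 0.3], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# Hypothetical grid: C is the inverse of the regularization strength.
param_grid = {"C": [0.01, 0.1, 1, 10]}

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",  # select the model by cross-validated ROC-AUC
    cv=5,
)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
train_auc = roc_auc_score(y_train, best_model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])

print("Best parameters:", grid.best_params_)
print("Train ROC-AUC:", train_auc)
print("Test ROC-AUC:", test_auc)
```

A train ROC-AUC close to 1 combined with a test ROC-AUC below 0.5, as reported above, indicates that the tuned model does not generalize, so the search setup and the input features are worth re-checking before using it.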


#Building Linear Discriminant Analysis (LDA) Model

# Checking performance on training and test sets

Performance on Training Set: 0.718091697645601


Confusion Matrix for Training Set:
[[2005 268]
[ 642 313]]

Performance on Test Set: 0.7044797687861272


Confusion Matrix for Test Set:
[[832 130]
[279 143]]
#Building Decision Tree Model

# Text representation of the decision tree rules

|--- time_spent_on_website <= 415.50
| |--- age <= 26.50
| | |--- page_views_per_visit <= 0.04
| | | |--- weights: [2.70, 2.10] class: 0
| | |--- page_views_per_visit > 0.04
| | | |--- page_views_per_visit <= 3.34
| | | | |--- weights: [37.20, 0.70] class: 0
| | | |--- page_views_per_visit > 3.34
| | | | |--- time_spent_on_website <= 138.50
| | | | | |--- weights: [10.50, 0.00] class: 0
| | | | |--- time_spent_on_website > 138.50
| | | | | |--- weights: [17.10, 6.30] class: 0
| |--- age > 26.50
| | |--- page_views_per_visit <= 3.71
| | | |--- time_spent_on_website <= 175.50
| | | | |--- time_spent_on_website <= 169.50
| | | | | |--- weights: [138.90, 77.00] class: 0
| | | | |--- time_spent_on_website > 169.50
| | | | | |--- weights: [0.90, 2.80] class: 1
| | | |--- time_spent_on_website > 175.50
| | | | |--- page_views_per_visit <= 3.68
| | | | | |--- weights: [144.90, 58.10] class: 0
| | | | |--- page_views_per_visit > 3.68
| | | | | |--- weights: [1.20, 2.80] class: 1
| | |--- page_views_per_visit > 3.71
| | | |--- page_views_per_visit <= 3.84
| | | | |--- age <= 58.50
| | | | | |--- weights: [18.90, 0.70] class: 0
| | | | |--- age > 58.50
| | | | | |--- weights: [2.10, 1.40] class: 0
| | | |--- page_views_per_visit > 3.84
| | | | |--- page_views_per_visit <= 3.85
| | | | | |--- weights: [0.00, 1.40] class: 1
| | | | |--- page_views_per_visit > 3.85
| | | | | |--- weights: [84.30, 27.30] class: 0
|--- time_spent_on_website > 415.50
| |--- age <= 25.50
| | |--- website_visits <= 3.50
| | | |--- page_views_per_visit <= 5.39
| | | | |--- time_spent_on_website <= 2223.50
| | | | | |--- weights: [15.00, 16.80] class: 1
| | | | |--- time_spent_on_website > 2223.50
| | | | | |--- weights: [2.70, 0.00] class: 0
| | | |--- page_views_per_visit > 5.39
| | | | |--- weights: [3.60, 0.00] class: 0
| | |--- website_visits > 3.50
| | | |--- time_spent_on_website <= 1933.50
| | | | |--- weights: [14.10, 2.80] class: 0
| | | |--- time_spent_on_website > 1933.50
| | | | |--- time_spent_on_website <= 2039.50
| | | | | |--- weights: [0.30, 1.40] class: 1
| | | | |--- time_spent_on_website > 2039.50
| | | | | |--- weights: [3.60, 1.40] class: 0
| |--- age > 25.50
| | |--- time_spent_on_website <= 2204.00
| | | |--- weights: [169.20, 403.90] class: 1
| | |--- time_spent_on_website > 2204.00
| | | |--- weights: [14.70, 61.60] class: 1
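
Every leaf weight in the tree above is a whole multiple of 0.3 for class 0 and of 0.7 for class 1, which is consistent with a tree fitted with `class_weight={0: 0.3, 1: 0.7}`. That is an inference from the printed weights, not something the source states. The following is a minimal sketch of producing this kind of text report with `export_text`; the synthetic data and `max_depth=3` are assumptions, and the feature names are reused only as labels.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data; the column names below are labels only (an assumption).
X, y = make_classification(
    n_samples=300,
    n_features=4,
    n_informative=2,
    n_redundant=0,
    random_state=1,
)
feature_names = [
    "age",
    "website_visits",
    "time_spent_on_website",
    "page_views_per_visit",
]

# class_weight makes each class-1 sample count 0.7 and each class-0 sample 0.3.
model = DecisionTreeClassifier(
    class_weight={0: 0.3, 1: 0.7}, max_depth=3, random_state=1
)
model.fit(X, y)

# show_weights=True prints the weighted class counts at every leaf.
report = export_text(model, feature_names=feature_names, show_weights=True)
print(report)
```

Limiting depth, as in this sketch, is one simple form of pre-pruning; the depth used for the tree in the source is not stated.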
#Feature Importance

Training performance comparison:

           Logistic Regression      LDA  Decision Tree
Accuracy               0.71747  0.71809        0.99318
Recall                 0.28168  0.32775        0.97696
Precision              0.54343  0.53873        1.00000
F1                     0.37103  0.40755        0.98835

Testing performance comparison:

           Logistic Regression      LDA  Decision Tree
Accuracy               0.70014  0.70448        0.64451
Recall                 0.27725  0.33886        0.45498
Precision              0.51542  0.52381        0.42291
F1                     0.36055  0.41151        0.43836
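
One way to read the two comparison tables together is to subtract the testing scores from the training scores; a large positive gap points to overfitting. The following is a minimal sketch that uses only the values reported in the tables above.

```python
import pandas as pd

models = ["Logistic Regression", "LDA", "Decision Tree"]
metrics = ["Accuracy", "Recall", "Precision", "F1"]

# Values copied from the training and testing comparison tables.
train = pd.DataFrame(
    [
        [0.71747, 0.71809, 0.99318],
        [0.28168, 0.32775, 0.97696],
        [0.54343, 0.53873, 1.00000],
        [0.37103, 0.40755, 0.98835],
    ],
    index=metrics,
    columns=models,
)
test = pd.DataFrame(
    [
        [0.70014, 0.70448, 0.64451],
        [0.27725, 0.33886, 0.45498],
        [0.51542, 0.52381, 0.42291],
        [0.36055, 0.41151, 0.43836],
    ],
    index=metrics,
    columns=models,
)

# Positive values mean the model scored higher on training than on testing.
gap = train - test
print(gap.round(5))
```

The Decision Tree loses about 0.35 in accuracy between training and testing, while Logistic Regression and LDA each lose less than 0.02.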

Use Logistic Regression or LDA for Predictive Modeling: In the comparison
tables above, both Logistic Regression and LDA perform consistently across the
training and testing sets (accuracy of about 0.72 on training and 0.70 on
testing), so they generalize without obvious overfitting. Their recall is low,
however (0.28 and 0.34 on the testing set), which means they miss most of the
leads that actually convert. They are reasonable starting points for predictive
modeling in your business context, but recall should be improved before they
are relied on to find converters.
Consider the Decision Tree Model for Further Investigation: The Decision Tree
model scores far higher on the training set (accuracy 0.99) than on the testing
set (accuracy 0.64), which indicates overfitting. Its testing accuracy is lower
than that of Logistic Regression and LDA, although its testing recall (0.45)
and F1-score (0.44) are the highest of the three models. Further investigation
into the decision tree model's structure, feature importance, and potential
pruning techniques may help improve its generalization performance.
Evaluate Feature Importance: Analyze the feature importance provided by
each model to understand which variables contribute most to the prediction.
This can provide valuable insights into lead behavior and preferences.
For example, the decision tree above splits first on time_spent_on_website and
then on age, page_views_per_visit, and website_visits, which suggests that
these features are influential in predicting lead status.
Refine Marketing Strategies: Tailor marketing strategies based on the insights
gained from predictive modeling. For instance, focus marketing efforts on
leads who exhibit behaviors indicative of higher conversion rates, such
as frequent website visits, longer time spent on the website, and higher page
views per visit. In the crosstabs above, leads whose first interaction was the
website converted far more often than mobile app leads (1159 of 2542 versus
218 of 2070), as did leads with a highly completed profile (946 of 2264) and
referred leads (63 of 93).
Regular Model Monitoring and Updates: Continuously monitor model
performance and update the models as new data becomes available.
Lead behavior and preferences may evolve over time, requiring
adjustments to the predictive models to maintain their effectiveness.
Aim for Balanced Performance Metrics: Choose the balance of accuracy,
recall, precision, and F1-score that fits the specific business objectives
and constraints. For example, if false positives (predicting a lead will
convert when they won't) are costly, prioritize precision. If false
negatives (failing to identify potential converters) are more concerning,
prioritize recall.
By incorporating these recommendations into your business strategy, you can
leverage predictive modeling techniques to better understand customer
behavior, optimize marketing efforts, and ultimately improve conversion rates.
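
The precision and recall trade-off described in the last recommendation can be explored by moving the classification threshold instead of using the default of 0.5. The following is a minimal sketch using `precision_recall_curve`, which the notebook imports; the synthetic data, the validation split, and the `target_recall` value of 0.80 are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the lead features (an assumption).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.7, 0.3], random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds; the final
# entry has no threshold attached, so it is dropped below.
precision, recall, thresholds = precision_recall_curve(y_val, scores)

target_recall = 0.80  # hypothetical business requirement
meets_target = recall[:-1] >= target_recall

# Recall falls as the threshold rises, so the highest qualifying threshold
# gives the best precision available at the required recall.
chosen_threshold = thresholds[meets_target].max()

predictions = (scores >= chosen_threshold).astype(int)
achieved_recall = recall_score(y_val, predictions)
achieved_precision = precision_score(y_val, predictions)

print("Chosen threshold:", chosen_threshold)
print("Recall:", achieved_recall, "Precision:", achieved_precision)
```

The threshold is chosen on a validation split here so that the test set stays untouched for the final performance estimate.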
