Predictive Modeling

Uploaded by

amansinhmar2303


Index

1. Introduction
2. Data Overview
 2.1 Dataset Description
 2.2 Data Exploration
3. Data Preprocessing
 3.1 Data Cleaning
 3.2 Feature Engineering
 3.3 Train-Test Split
4. Model Development
 4.1 Logistic Regression
  4.1.1 Model Training
  4.1.2 Model Evaluation
 4.2 Linear Discriminant Analysis (LDA)
  4.2.1 Model Training
  4.2.2 Model Evaluation
 4.3 Decision Tree
  4.3.1 Model Training
  4.3.2 Model Evaluation
5. Model Performance Comparison
 5.1 Training Set Performance
 5.2 Testing Set Performance
6. Feature Importance Analysis
7. Business Recommendations
8. Conclusion
9. References

Define the problem and perform Exploratory Data Analysis

Problem Definition: The primary objective is to analyze the data and build a machine
learning model that identifies which leads are more likely to convert to paid customers
for ExtraaLearn. This involves:
Analyzing the dataset to understand the features and their relevance to lead conversion.
Building a predictive model to identify leads with a higher probability of conversion.
Determining the factors driving the lead conversion process.
Creating a profile of leads likely to convert based on the insights gained from the model.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization

import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns

pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)

# Library to split data

from sklearn.model_selection import train_test_split

# To build model for prediction

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models

from sklearn.model_selection import GridSearchCV

# To get different metric scores

from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)

import warnings
warnings.filterwarnings("ignore")
Shape of the dataset: (4612, 15)

| Column | Data Type |
|------------------------|-------------------|
| ID | Object |
| age | Int64 |
| current_occupation | Object |
| first_interaction | Object |
| profile_completed | Object |
| website_visits | Int64 |
| time_spent_on_website | Int64 |
| page_views_per_visit | Float64 |
| last_activity | Object |
| print_media_type1 | Object |
| print_media_type2 | Object |
| digital_media | Object |
| educational_channels | Object |
| referral | Object |
| status | Int64 |

Statistical summary of numerical columns:

| Statistic | age | website_visits | time_spent_on_website | page_views_per_visit | status |
|-----------|------------|----------------|-----------------------|----------------------|------------|
| count | 4612.00000 | 4612.00000 | 4612.00000 | 4612.00000 | 4612.00000 |
| mean | 46.20121 | 3.56678 | 724.01127 | 3.02613 | 0.29857 |
| std | 13.16145 | 2.82913 | 743.82868 | 1.96812 | 0.45768 |
| min | 18.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| 25% | 36.00000 | 2.00000 | 148.75000 | 2.07775 | 0.00000 |
| 50% | 51.00000 | 3.00000 | 376.00000 | 2.79200 | 0.00000 |
| 75% | 57.00000 | 5.00000 | 1336.75000 | 3.75625 | 1.00000 |
| max | 63.00000 | 30.00000 | 2537.00000 | 18.43400 | 1.00000 |

Univariate analysis

Number of leads who haven't visited the website: 174
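A count like the one above can be obtained with a boolean mask. The following is a minimal sketch on a small made-up frame: only the column name `website_visits` comes from the dataset, and the values are illustrative rather than the real 4,612 rows.

```python
import pandas as pd

# Toy data for illustration only; the values are made up.
leads = pd.DataFrame({"website_visits": [0, 3, 5, 0, 2, 1]})

# The mask is True where a lead never visited the website;
# summing a boolean Series counts its True values.
no_visit_count = int((leads["website_visits"] == 0).sum())
print("Number of leads who haven't visited the website:", no_visit_count)
```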

Multivariate analysis

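| current_occupation | status 0 | status 1 | All |
|--------------------|----------|----------|------|
| All | 3235 | 1377 | 4612 |
| Professional | 1687 | 929 | 2616 |
| Unemployed | 1058 | 383 | 1441 |
| Student | 490 | 65 | 555 |

| first_interaction | status 0 | status 1 | All |
|-------------------|----------|----------|------|
| All | 3235 | 1377 | 4612 |
| Website | 1383 | 1159 | 2542 |
| Mobile App | 1852 | 218 | 2070 |

| profile_completed | status 0 | status 1 | All |
|-------------------|----------|----------|------|
| All | 3235 | 1377 | 4612 |
| High | 1318 | 946 | 2264 |
| Medium | 1818 | 423 | 2241 |
| Low | 99 | 8 | 107 |

| last_activity | status 0 | status 1 | All |
|------------------|----------|----------|------|
| All | 3235 | 1377 | 4612 |
| Email Activity | 1587 | 691 | 2278 |
| Website Activity | 677 | 423 | 1100 |
| Phone Activity | 971 | 263 | 1234 |

| print_media_type1 | status 0 | status 1 | All |
|-------------------|----------|----------|------|
| All | 3235 | 1377 | 4612 |
| No | 2897 | 1218 | 4115 |
| Yes | 338 | 159 | 497 |

| print_media_type2 | status 0 | status 1 | All |
|-------------------|----------|----------|------|
| All | 3235 | 1377 | 4612 |
| No | 3077 | 1302 | 4379 |
| Yes | 158 | 75 | 233 |

| digital_media | status 0 | status 1 | All |
|---------------|----------|----------|------|
| All | 3235 | 1377 | 4612 |
| No | 2876 | 1209 | 4085 |
| Yes | 359 | 168 | 527 |

| educational_channels | status 0 | status 1 | All |
|----------------------|----------|----------|------|
| All | 3235 | 1377 | 4612 |
| No | 2727 | 1180 | 3907 |
| Yes | 508 | 197 | 705 |

| referral | status 0 | status 1 | All |
|----------|----------|----------|------|
| All | 3235 | 1377 | 4612 |
| No | 3205 | 1314 | 4519 |
| Yes | 30 | 63 | 93 |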
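The raw counts are easier to compare as conversion rates (converted leads divided by all leads in the group). A small sketch of that arithmetic, using a handful of counts copied from the cross-tabulations above; the group labels and the helper name `conversion_rate` are only illustrative:

```python
# Counts copied from the cross-tabulations above: (status 0, status 1).
counts = {
    "first_interaction = Website": (1383, 1159),
    "first_interaction = Mobile App": (1852, 218),
    "profile_completed = High": (1318, 946),
    "profile_completed = Low": (99, 8),
    "referral = Yes": (30, 63),
    "referral = No": (3205, 1314),
}


def conversion_rate(not_converted, converted):
    """Share of leads in a group whose status is 1."""
    return converted / (not_converted + converted)


rates = {group: conversion_rate(*pair) for group, pair in counts.items()}

# Print groups from the highest to the lowest conversion rate.
for group, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{group:32s} {rate:.1%}")
```

Read this way, referred leads convert at about 68% (63 of 93) against an overall rate of about 30% (1377 of 4612), though the referred group is small.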
#Data Preparation for modeling

Shape of Training set : (3228, 4627)

Shape of Test set : (1384, 4627)
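A feature matrix with 4,627 columns built from a 15-column dataset suggests that the unique `ID` column was one-hot encoded together with the genuine categorical features. This is an inference from the shapes alone, not something the report states. Under that assumption, a minimal sketch of dropping the identifier before encoding, on a made-up frame (column names follow the dataset, values are invented):

```python
import pandas as pd

# Toy frame for illustration; the values are made up.
df = pd.DataFrame({
    "ID": ["EXT001", "EXT002", "EXT003", "EXT004"],
    "age": [57, 56, 52, 23],
    "first_interaction": ["Website", "Mobile App", "Website", "Website"],
    "status": [1, 0, 0, 1],
})

# Encoding with ID included creates one dummy column per distinct ID
# (minus one, because of drop_first=True).
with_id = pd.get_dummies(df.drop(columns=["status"]), drop_first=True)

# Dropping the identifier first keeps only informative predictors.
X = pd.get_dummies(df.drop(columns=["ID", "status"]), drop_first=True)
y = df["status"]

print(with_id.shape, X.shape)
```

On the toy frame the identifier alone accounts for three of the five encoded columns; on the real data it would account for thousands.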
Percentage of classes in training set:
status
0 0.70415
1 0.29585
Name: proportion, dtype: float64

Percentage of classes in test set:
status
0 0.69509
1 0.30491
Name: proportion, dtype: float64
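The class shares differ slightly between the two partitions (29.6% versus 30.5% converted). Whether the original split was stratified is not shown; if it was not, passing `stratify` to `train_test_split` is one way to keep the shares aligned. The sketch below uses a synthetic target, a 30% test size inferred from the 3228/1384 row counts, and an arbitrary `random_state`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target: 1,000 leads, 30% converted (illustrative).
y = np.array([1] * 300 + [0] * 700)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature matrix

# stratify=y preserves the class ratio in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

print("train share of class 1:", y_train.mean())
print("test share of class 1:", y_test.mean())
```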

#Model evaluation criterion

Accuracy Recall Precision F1
0 0.80130 0.60190 0.70360 0.64879
Debugging: Inside confusion_matrix_statsmodels function
Model: LogisticRegression()
Predictors shape: (1384, 4627)
Target shape: (1384,)
Predictions shape: (1384,)
Confusion Matrix: [[855 107]
[168 254]]
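The four scores reported above follow directly from the confusion matrix. As a quick check, assuming the matrix uses scikit-learn's layout of [[TN, FP], [FN, TP]] with class 1 as the positive class (the label order is not shown in the output), the arithmetic is:

```python
# Confusion matrix from the output above, read as [[TN, FP], [FN, TP]].
tn, fp = 855, 107
fn, tp = 168, 254

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
recall = tp / (tp + fn)                      # share of actual converters that are found
precision = tp / (tp + fp)                   # share of predicted converters that convert
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy {accuracy:.5f}  Recall {recall:.5f}  "
      f"Precision {precision:.5f}  F1 {f1:.5f}")
```

The rounded values agree with the reported row: 0.80130, 0.60190, 0.70360 and 0.64879.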

#Building Logistic Regression Model

Model performance on test set:

Accuracy Recall Precision F1
0 0.80130 0.60190 0.70360 0.64879
Train ROC-AUC score is : 0.8772828307723493

Test ROC-AUC score is : 0.8565821107290302
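ROC-AUC can be read as the probability that a randomly chosen converted lead receives a higher score than a randomly chosen non-converted lead, so 0.5 corresponds to random ranking. The sketch below implements that pairwise definition directly on made-up scores. It is meant as an illustration of the metric, not as the way `roc_auc_score` is computed internally, although both give the same value.

```python
def pairwise_auc(y_true, y_score):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for s, y in zip(y_score, y_true) if y == 1]
    negatives = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_score)
auc_reversed = pairwise_auc(y_true, [1 - s for s in y_score])
print(auc, auc_reversed)  # 0.75 0.25
```

Reversing the scores mirrors the result around 0.5, which is why a value below 0.5 means the ranking is worse than chance.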

#Using GridSearch for Hyperparameter tuning of our logistic regression model
Train ROC-AUC score is : 0.9989528795811519

Test ROC-AUC score is : 0.38762673537555054
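A train ROC-AUC near 1.0 combined with a test ROC-AUC below 0.5 means the tuned model does not generalize: it ranks unseen leads worse than random. The report does not show the grid that produced these numbers. As a hedged sketch of one common setup, the code below tunes only the regularization strength `C` with cross-validated ROC-AUC on synthetic data; the pipeline, the grid values and the data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lead data (illustrative only).
X, y = make_classification(n_samples=600, n_features=10, weights=[0.7, 0.3],
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# Scaling inside the pipeline is refit on every CV fold, which avoids leakage.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Smaller C means stronger L2 regularization.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

# GridSearchCV refits the best pipeline on the full training set by default.
train_auc = roc_auc_score(y_train, search.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(train_auc, 3), round(test_auc, 3))
```

Comparing `train_auc` with `test_auc` after tuning is the same check the two scores above perform.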

#Building Linear Discriminant Analysis (LDA) Model

# Checking performance on training set

Performance on Training Set: 0.718091697645601

Confusion Matrix for Training Set:
[[2005 268]
[ 642 313]]

Performance on Test Set: 0.7044797687861272

Confusion Matrix for Test Set:
[[832 130]
[279 143]]
#Building Decision Tree Model

# importance of features in the tree building

|--- time_spent_on_website <= 415.50
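The rule shown above is in the text format printed by `sklearn.tree.export_text`. The sketch below shows how such output can be produced and how limiting depth and leaf size (a simple form of pre-pruning) shrinks a tree. It does not reproduce the report's tree: the data are synthetic, the feature names are borrowed from the dataset for readability, and `max_depth=3` and `min_samples_leaf=20` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["time_spent_on_website", "website_visits", "page_views_per_visit"]

# Synthetic data standing in for three numeric predictors (illustrative only).
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)

# An unrestricted tree can grow until it memorizes the training set.
unrestricted = DecisionTreeClassifier(random_state=1).fit(X, y)

# Capping depth and leaf size limits how closely the tree can fit noise.
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20,
                                random_state=1).fit(X, y)

print("depth:", unrestricted.get_depth(), "->", pruned.get_depth())
print(export_text(pruned, feature_names=feature_names))
```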

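Training performance comparison:

| Metric | Logistic Regression | LDA | Decision Tree |
|-----------|---------------------|---------|---------------|
| Accuracy | 0.71747 | 0.71809 | 0.99318 |
| Recall | 0.28168 | 0.32775 | 0.97696 |
| Precision | 0.54343 | 0.53873 | 1.00000 |
| F1 | 0.37103 | 0.40755 | 0.98835 |

Testing performance comparison:

| Metric | Logistic Regression | LDA | Decision Tree |
|-----------|---------------------|---------|---------------|
| Accuracy | 0.70014 | 0.70448 | 0.64451 |
| Recall | 0.27725 | 0.33886 | 0.45498 |
| Precision | 0.51542 | 0.52381 | 0.42291 |
| F1 | 0.36055 | 0.41151 | 0.43836 |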
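One way to read the two comparisons together is to subtract each testing score from the matching training score: a large positive gap is a sign of overfitting. A short sketch of that arithmetic, using accuracy and F1 values copied from the comparisons above:

```python
# Scores copied from the comparisons above: (training, testing).
scores = {
    "Logistic Regression": {"Accuracy": (0.71747, 0.70014), "F1": (0.37103, 0.36055)},
    "LDA": {"Accuracy": (0.71809, 0.70448), "F1": (0.40755, 0.41151)},
    "Decision Tree": {"Accuracy": (0.99318, 0.64451), "F1": (0.98835, 0.43836)},
}

# Gap = training score minus testing score, rounded to the reported precision.
gaps = {
    model: {metric: round(train - test, 5) for metric, (train, test) in metrics.items()}
    for model, metrics in scores.items()
}

for model, gap in gaps.items():
    print(f"{model:20s} accuracy gap {gap['Accuracy']:+.5f}  F1 gap {gap['F1']:+.5f}")
```

The Decision Tree loses roughly 0.35 in accuracy and 0.55 in F1 between training and testing, while the other two models move by less than 0.02.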

Use Logistic Regression or LDA for Predictive Modeling: Both Logistic
Regression and LDA perform consistently across the training and testing
sets, with accuracy of about 0.72 and 0.70 respectively. Their recall is low,
however (0.28 to 0.34 on the testing set), so they miss many leads that
convert. Their stability still makes them the more reliable choices for
predictive modeling in your business context.

Consider the Decision Tree Model for Further Investigation: Although the
Decision Tree model reaches 0.99 accuracy on the training set, its accuracy
on the testing set (0.64) is lower than that of Logistic Regression and LDA.
This gap indicates overfitting. Inspecting the tree's structure and feature
importance, and applying pruning, may improve its generalization
performance.

Evaluate Feature Importance: Analyze the feature importance provided by
each model to understand which variables contribute most to the prediction.
This can provide valuable insights into customer behavior and preferences.
For example, features related to website visits, time spent on the website, and
page views per visit appear to be influential in predicting customer status.

Refine Marketing Strategies: Tailor marketing strategies based on the insights
gained from predictive modeling. For instance, focus marketing efforts on
leads who exhibit behaviors associated with higher conversion rates, such
as frequent website visits, longer time spent on the website, and higher page
views per visit.

Regular Model Monitoring and Updates: Continuously monitor model
performance and update the models as new data becomes available.
Customer behavior and preferences may evolve over time, requiring
adjustments to the predictive models to maintain their effectiveness.

Aim for Balanced Performance Metrics: Choose the balance of accuracy,
recall, precision, and F1-score that fits the specific business objectives and
constraints. For example, if false positives (predicting a lead will convert
when they won't) are costly, prioritize precision. If false negatives (failing to
identify potential converters) are more concerning, prioritize recall.

By incorporating these recommendations into your business strategy, you can
leverage predictive modeling techniques to better understand customer
behavior, optimize marketing efforts, and ultimately improve conversion rates.
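The precision versus recall trade-off mentioned above is usually controlled through the probability threshold at which a lead is labeled as a likely converter. The sketch below illustrates the effect on made-up labels and probabilities; the helper `precision_recall_at` is hypothetical, and scikit-learn's `precision_recall_curve` (imported at the top of the notebook) performs a similar sweep over all thresholds.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting 1 for probabilities >= threshold."""
    predicted = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, yhat in zip(y_true, predicted) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, predicted) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, predicted) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels and predicted conversion probabilities (illustrative only).
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(y_true, y_prob, threshold)
    print(f"threshold {threshold:.1f}: precision {precision:.2f}, recall {recall:.2f}")
```

Raising the threshold trades recall for precision, so the threshold can be chosen to match whichever error type is more costly.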
