Data Mining Aug 2021
Business Report
AUG DSBA BATCH 2021
STUDENT NAME:
SHAKSHI NARANG
Table of Contents
1) Problem 1 : Clustering
1.1) Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bivariate, and Multivariate analysis).
1.2) Do you think scaling is necessary for clustering in this case? Justify
1.3) Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them
1.4) Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve
and silhouette score. Explain the results properly. Interpret and write inferences on the finalized
clusters.
1.5) Describe cluster profiles for the clusters defined. Recommend different promotional strategies for
different clusters.
2) Problem 2: CART-RF-ANN
2.1) Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bivariate, and Multivariate analysis).
2.2) Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial Neural
Network
2.3) Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each model.
2.4) Final Model: Compare all the models and write an inference which model is best/optimized.
2.5) Inference: Based on the whole Analysis, what are the business insights and recommendations
Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its customers. They collected a sample that summarizes the activities of users during the past few months. You are given the task of identifying the segments based on credit card usage.
Dataset for Problem 1: bank_marketing_part1_Data.csv
We will explore the bank dataset and perform the analysis on it. The major topics covered are below:
Exploratory Data Analysis (EDA)
Scaling
Applying hierarchical clustering to scaled data using Dendrogram
Applying K-Means clustering on scaled data and determining optimum clusters
Applying elbow curve and silhouette score
Cluster Profiling
1.1) Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bivariate, and Multivariate analysis).
In this step, we performed the operations below to understand what the dataset comprises.
Head of the dataset
Shape of the dataset
Info/Datatypes of the dataset
Checking of missing values (if any)
Summary of the dataset
Checking for duplicate data (if any)
Univariate /Bivariate Analysis
Multivariate Analysis
Checking for correlations
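A minimal sketch of these initial checks in Python (pandas assumed; the file name follows the problem statement above):

import pandas as pd

# Read the clustering dataset named in the problem statement
df = pd.read_csv("bank_marketing_part1_Data.csv")

print(df.head())              # first few rows
print(df.shape)               # (rows, columns) -> (210, 7)
df.info()                     # datatypes and non-null counts
print(df.isnull().sum())      # missing values per column
print(df.describe().T)        # summary statistics
print(df.duplicated().sum())  # number of duplicate rows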
1.1) Insights from Exploratory Data Analysis (EDA)
The data seems to be clean overall.
The shape of the data is (210, 7)
The info of the data indicates that all variables are of float type.
No null or missing values in the data.
Summary of the data:
No null values present in any variables.
The mean and median values seem to be almost equal.
The standard deviation for spending is high when compared to other variables.
No duplicates in the dataset
1.1) Univariate Analysis
1.1) Multivariate Analysis - Pairplot
1.1) Checking correlation and depicting it in a heatmap
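A small sketch of the pairplot and correlation heatmap, assuming seaborn/matplotlib and the dataframe df from the snippet above:

import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df)                                     # pairwise distributions of the numeric variables
plt.show()

plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")  # correlation coefficients
plt.show()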
1.2) Do you think scaling is necessary for clustering in this case? Justify
Observations
Yes, scaling is very important here: the model works on distance-based computations, so scaling is necessary for the unscaled data.
Scaling needs to be done because the variables are on different scales.
Spending and advance payments take larger values and would otherwise get more weightage.
Scaling brings all the values into a comparable range.
I have used StandardScaler for scaling.
Below is the snapshot of the scaled data.
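A minimal sketch of the scaling step with scikit-learn's StandardScaler (df assumed from the earlier snippets):

import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Transform every variable to zero mean and unit variance
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled_df.head())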
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them.
Cluster creation
Both methods give almost similar cluster means, with only minor variation, which is expected.
There was not too much variation between the two methods.
Based on the dendrogram, a grouping of 3 or 4 clusters looks good. After further analysis of the dataset, a 3-cluster solution was chosen.
The three-cluster solution gives a pattern of high/medium/low spending together with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).
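A sketch of the hierarchical clustering step with scipy; Ward linkage is shown here as one common choice, and scaled_df is assumed from the scaling snippet:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Linkage matrix on the scaled data
wardlink = linkage(scaled_df, method="ward")

# Dendrogram to inspect candidate cluster counts (3 or 4 here)
plt.figure(figsize=(10, 5))
dendrogram(wardlink, truncate_mode="lastp", p=25)
plt.show()

# Cut the tree into 3 clusters and profile them on the original scale
df["h_cluster"] = fcluster(wardlink, 3, criterion="maxclust")
print(df.groupby("h_cluster").mean())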
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and silhouette score. Explain the results properly. Interpret and write inferences on the
finalized clusters.
Steps performed in this section:
K-means clustering: we initially choose n_clusters = 3 and look at the distribution of observations across the clusters.
We apply the K-means technique to the scaled data, as sketched below.
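A minimal sketch of this step (scaled_df assumed from the scaling snippet; the fixed random_state is an assumption for reproducibility):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=1)
kmeans.fit(scaled_df)
df["km_cluster"] = kmeans.labels_
print(df["km_cluster"].value_counts())  # distribution of observations across the 3 clusters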
1.4 Apply elbow curve and silhouette score.
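A sketch of the elbow curve (inertia for k = 1..10) and of the silhouette scores, again assuming scaled_df:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Elbow curve: within-cluster sum of squares (inertia) for a range of k
wss = []
for k in range(1, 11):
    wss.append(KMeans(n_clusters=k, random_state=1).fit(scaled_df).inertia_)
plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Within-cluster sum of squares")
plt.show()

# Silhouette score for the chosen 3-cluster solution
km3 = KMeans(n_clusters=3, random_state=1).fit(scaled_df)
print(silhouette_score(scaled_df, km3.labels_))
# Per-observation silhouette widths; a negative minimum would indicate misassigned points
print(silhouette_samples(scaled_df, km3.labels_).min())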
1.4 Cluster Profiling.
Cluster 0 – Medium
Cluster 1 – Low
Cluster 2 – High
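A small sketch of how these profiles can be derived (km_cluster assumed from the K-means snippet; the Medium/Low/High labels follow the cluster means for spending):

# Mean of each variable per K-means cluster, plus cluster sizes
profile = df.groupby("km_cluster").mean()
profile["freq"] = df["km_cluster"].value_counts()
print(profile)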
1.4 Observation.
By the K-means method, 3 clusters appear optimal, since beyond that point there is no large drop in inertia values. The elbow curve also shows similar results.
The minimum silhouette width score for K-means is also a very small value, which indicates that the data points are properly assigned to their clusters; there is no mismatch in the cluster assignments.
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional
strategies for different clusters.
Problem 2: CART-RF-ANN
An insurance firm providing tour insurance is facing a higher claim frequency. The management decides to collect data from the past few years. You are assigned the task of building a model which predicts the claim status and of providing recommendations to management. Use CART, RF & ANN and compare the models' performance on the train and test sets.
Dataset for Problem 2: insurance_part2_data-1.csv
Attribute Information:
Target: Claim Status (Claimed)
Code of tour firm (Agency_Code)
Type of tour insurance firms (Type)
Distribution channel of tour insurance agencies (Channel)
Name of the tour insurance products (Product)
Duration of the tour (Duration in days)
Destination of the tour (Destination)
Amount of sales of tour insurance policies per customer, in rupees (in 100's) (Sales)
The commission received by the tour insurance firm (Commision, as a percentage of sales)
Age of insured (Age)
Problem Statement 2: CART-RF-ANN
Insurance Dataset
We will explore the insurance dataset and perform the analysis on it. The major topics covered are below:
Exploratory Data Analysis (EDA)
Splitting of data into train and test set
Building the following classification models:
CART
Random Forest (RF)
Artificial Neural Network (ANN)
Performance check for each model/Classification reports for each model
Compare all models prepared
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bivariate, and Multivariate analysis).
The insurance data set has 3000 observations and 10 variables.
There are 6 categorical variables namely, Agency_Code, Type, Claimed, Channel, Product Name and Destination in this data
set and there are 4 Continuous or numerical variables namely, Age, Commision, Duration and Sales.
There are 6 object type variables, 2 integer type and 2 float type variables in the dataset.
There are no missing values in the data set.
There are 139 duplicate rows present in the data set.
Outliers are present in all 4 numerical variables (Age, Commision, Duration and Sales). We treat these outliers before building the Random Forest classifier.
2.1 Insights from EDA (Contd.)
Summary of the dataset
The info function clearly indicates that the dataset has object, integer and float columns, so we have to change the object data types to categorical codes.
2.1 Further Analysis based on EDA results (Contd.)
We also checked the proportion of 1s and 0s in the target after converting the object datatypes to categorical codes.
As we found that the dataset has duplicate rows, we further remove the duplicate rows, as shown below.
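A hedged sketch of these steps, assuming the insurance data is read into a dataframe ins_df from the file named in the problem statement:

import pandas as pd

ins_df = pd.read_csv("insurance_part2_data-1.csv")

# Convert object columns to categorical codes
for col in ins_df.select_dtypes(include="object").columns:
    ins_df[col] = pd.Categorical(ins_df[col]).codes

# Proportion of 1s and 0s in the target after the conversion
print(ins_df["Claimed"].value_counts(normalize=True))

# Remove the duplicate rows
ins_df = ins_df.drop_duplicates()
print(ins_df.shape)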
We will further look at the distribution of the dataset through Univariate/Bivariate analysis and pairwise distributions, and check for correlations.
Bivariate Analysis of Categorical Variables
Univariate Analysis of Numerical Variables
Bivariate Analysis of Categorical Variables
Bivariate Analysis of Categorical Variables (Contd.)
Multivariate Analysis:
Pairplot of numerical variables; heatmap showing correlation coefficients.
2.2 Data Split: Split the data into test and train, build classification model CART, Random
Forest, Artificial Neural Network.
Steps performed in this section are:
Extracting the target column "Claimed" into separate vectors for the training and test sets.
For training and testing purposes, we split the dataset into train and test data in the ratio 70:30 (see the sketch after this list).
Checking the dimensions of the training and test data with the shape command:
X_train (2002, 9) and X_test (859, 9)
Building classification models for CART, Random Forest (RF) and Artificial Neural Network (ANN)
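A minimal sketch of the split (ins_df assumed from the earlier snippet; the fixed random_state is an assumption):

from sklearn.model_selection import train_test_split

X = ins_df.drop("Claimed", axis=1)  # predictor variables
y = ins_df["Claimed"]               # target vector

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)  # (2002, 9) and (859, 9)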
Model 1: Steps performed for building the Decision Tree classifier on the training data (see the sketch after this list):
Firstly, we check the features/variables to be used for prediction.
Secondly, we perform a grid search to find the optimal values for the hyperparameters.
We fit the model on the training set with the optimal values from the best grid.
We add the tuning parameters to the model.
We find the variable importance.
We predict on the training and test datasets with the Decision Tree.
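A sketch of the decision-tree workflow following those steps; the parameter grid below is an illustrative assumption, not the report's tuned values:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 7, 10],
    "min_samples_leaf": [20, 50],
    "min_samples_split": [60, 150],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=3)
grid.fit(X_train, y_train)

best_dt = grid.best_estimator_  # tree refit with the optimal hyperparameters
print(grid.best_params_)

# Variable importance, then predictions on the train and test sets
print(pd.Series(best_dt.feature_importances_, index=X_train.columns)
        .sort_values(ascending=False))
y_train_pred_dt = best_dt.predict(X_train)
y_test_pred_dt = best_dt.predict(X_test)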
2.2 Data Split: Split the data into test and train, Random Forest, Artificial Neural Network
(Contd.).
Steps performed in this section are:
Model 2: Steps performed for building an ensemble Random Forest classifier on the training and test data:
Firstly, we treat the outliers present in the dataset for Random Forest, as sketched below.
Graphic representation after outlier treatment.
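A hedged sketch of one common way to treat these outliers (capping at 1.5 * IQR beyond the quartiles; whether the report used exactly this rule is an assumption):

# Cap each numeric variable at 1.5 * IQR beyond the quartiles
for col in ["Age", "Commision", "Duration", "Sales"]:
    q1, q3 = ins_df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    ins_df[col] = ins_df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)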
2.2 Data Split: Split the data into test and train, Random Forest, Artificial Neural Network
(Contd.).
Steps performed in this section are:
Model 2: Random Forest classifier (contd.), see the sketch after this list:
Splitting the data 70:30 into training and test sets.
Using RandomForestClassifier to fit on the randomly selected features.
Performing a grid search to find the optimal values for the hyperparameters.
Fitting the RFCL model with the values obtained by the grid search.
Assigning the best parameters and best estimator values to the best grid.
Predicting on the training and test datasets with RF.
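A sketch of the Random Forest workflow (the grid values are illustrative assumptions; the data is re-split 70:30 after the outlier treatment, as described above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Re-split the outlier-treated data 70:30
X = ins_df.drop("Claimed", axis=1)
y = ins_df["Claimed"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 7],
    "max_features": [3, 5],
    "min_samples_leaf": [10, 20],
}
grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
grid.fit(X_train, y_train)

best_rf = grid.best_estimator_
print(grid.best_params_)
y_train_pred_rf = best_rf.predict(X_train)
y_test_pred_rf = best_rf.predict(X_test)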
Model 3: Building a Neural Network classifier (see the sketch after this list):
Before building the model, we scale the values using MinMaxScaler.
Applying the scaling transformation to the training and test data.
We build the model using MLPClassifier and fit it to the training data.
Performing a grid search to find the optimal values for the hyperparameters.
Fitting the model using the optimal values from the grid search.
Assigning the best estimator values to the best grid.
Predicting on the training and test datasets with the Neural Network classifier.
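A sketch of the neural-network workflow (MinMaxScaler and MLPClassifier as named above; the hidden-layer sizes and other grid values are assumptions):

from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

# Scale the features before training the network
mm = MinMaxScaler()
X_train_s = mm.fit_transform(X_train)
X_test_s = mm.transform(X_test)

param_grid = {
    "hidden_layer_sizes": [(50,), (100,)],
    "activation": ["relu", "logistic"],
    "max_iter": [2500],
    "tol": [0.01],
}
grid = GridSearchCV(MLPClassifier(random_state=1), param_grid, cv=3)
grid.fit(X_train_s, y_train)

best_ann = grid.best_estimator_
y_train_pred_ann = best_ann.predict(X_train_s)
y_test_pred_ann = best_ann.predict(X_test_s)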
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and
Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC
score, classification reports for each model.
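A sketch of the metric calls used for every model; the decision tree on the training set is shown, and the same pattern applies to RF, ANN and the test sets:

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score, roc_curve)

model = best_dt                     # same pattern for best_rf and best_ann
y_pred = model.predict(X_train)
print(accuracy_score(y_train, y_pred))
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))

# ROC curve and AUC are computed from the predicted probability of the positive class
probs = model.predict_proba(X_train)[:, 1]
print(roc_auc_score(y_train, probs))
fpr, tpr, _ = roc_curve(y_train, probs)
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()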
Decision Tree prediction on the training set using Accuracy and Confusion Matrix; model evaluation for the Decision Tree on the training set showing the AUC score and depicting the ROC curve.
2.3 Decision Tree: Performance check on Predictions of Test set using Accuracy, Confusion
Matrix, Plot ROC curve and get ROC_AUC score.
Decision Tree prediction on the test set using Accuracy and Confusion Matrix; model evaluation for the Decision Tree on the test set showing the AUC score and depicting the ROC curve.
2.3 Model 1: Decision Tree: Performance metrics on Training and Test Sets
RF prediction on the training set using Accuracy and Confusion Matrix; model evaluation for RF on the training set showing the AUC score and depicting the ROC curve.
2.3 Random Forest: Performance check on Predictions of Test set using Accuracy, Confusion
Matrix, Plot ROC curve and get ROC_AUC score.
RF prediction on the test set using Accuracy and Confusion Matrix; model evaluation for RF on the test set showing the AUC score and depicting the ROC curve.
2.3 Model 2: Random Forest: Performance metrics on Training and Test Sets
ANN prediction on the training set using Accuracy and Confusion Matrix; model evaluation for ANN on the training set showing the AUC score and depicting the ROC curve.
2.3 ANN: Performance check on Predictions of Test set using Accuracy, Confusion Matrix, Plot
ROC curve and get ROC_AUC score.
ANN prediction on the test set using Accuracy and Confusion Matrix; model evaluation for ANN on the test set showing the AUC score and depicting the ROC curve.
2.3 Model 3: ANN: Performance metrics on Training and Test Sets
Depiction of the ROC curve on training data; depiction of the ROC curve on test data.
2.4 Final Model: Conclusion
I am selecting the RF model, as it has better accuracy, precision, recall, and F1 score than the other two models (CART & ANN), and a better ROC curve as well.
2.5 Inference: Based on the whole Analysis, what are the business insights and
recommendations
Looking at the model, more data would help us understand the problem better and improve the predictions.
Streamlining online experiences benefitted customers, leading to an increase in conversions, which subsequently raised
profits.
As per the data, around 90% of the insurance business comes through the online channel.
Another interesting fact is that almost all of the offline business has a claim associated with it.
We need to train the JZI agency resources to pick up sales, as they are at the bottom; we should run a promotional marketing campaign or evaluate whether we need to tie up with an alternate agency.
Also, since the model gives about 80% accuracy, whenever a customer books airline tickets or plans a trip, we can cross-sell the insurance based on the claim data pattern.
Another interesting fact is that more sales happen via the Agency channel than via Airlines, yet the trend shows that claims are processed more for Airlines. We may need to dive deeper into the process to understand the workflow and the reason for this.
Key performance indicators (KPIs) for insurance claims are:
Increase customer satisfaction, which in turn will bring more revenue.
Combat fraudulent transactions; deploy measures to detect and prevent fraud as early as possible.
Optimize claims recovery method
Reduce claim handling costs.
Thank You!