
Data Mining

Business Report
AUG DSBA BATCH 2021
STUDENT NAME:
SHAKSHI NARANG
Table of Contents

1) Problem 1 : Clustering
1.1) Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-
variate, and multivariate analysis).
1.2) Do you think scaling is necessary for clustering in this case? Justify
1.3) Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them
1.4) Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve
and silhouette score. Explain the results properly. Interpret and write inferences on the finalized
clusters.
1.5) Describe cluster profiles for the clusters defined. Recommend different promotional strategies for
different clusters.

2) Problem 2: CART-RF-ANN
2.1) Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate
analysis).
2.2) Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial Neural
Network
2.3) Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each model.
2.4) Final Model: Compare all the models and write an inference which model is best/optimized.
2.5) Inference: Based on the whole Analysis, what are the business insights and recommendations
Problem 1: Clustering

 A leading bank wants to develop a customer segmentation to give promotional offers to its customers. They
collected a sample that summarizes the activities of users during the past few months. You are given the task to
identify the segments based on credit card usage.
 Dataset for Problem 1: bank_marketing_part1_Data.csv

 Data Dictionary for Market Segmentation:


 spending: Amount spent by the customer per month (in 1000s)
 advance_payments: Amount paid by the customer in advance by cash (in 100s)
 probability_of_full_payment: Probability of payment done in full by the customer to the bank
 current_balance: Balance amount left in the account to make purchases (in 1000s)
 credit_limit: Limit of the amount in the credit card (in 10000s)
 min_payment_amt: Minimum amount paid by the customer while making monthly payments for purchases (in 100s)
 max_spent_in_single_shopping: Maximum amount spent in one purchase (in 1000s)
Problem 1: Clustering

Bank Marketing- Dataset

We will explore the Bank Dataset and do the analysis on the dataset. The major topics to be covered are below:
 Exploratory Data Analysis (EDA)
 Scaling
 Applying hierarchical clustering to scaled data using Dendrogram
 Applying K-Means clustering on scaled data and determining optimum clusters
 Applying elbow curve and silhouette score
 Cluster Profiling
1.1) Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-
variate, and multivariate analysis).

Basic Data Exploration

In this step, we performed the operations below to check what the dataset comprises.
 Head of the dataset
 Shape of the dataset
 Info/Datatypes of the dataset
 Checking of missing values (if any)
 Summary of the dataset
 Checking for duplicate data (if any)
 Univariate /Bivariate Analysis
 Multivariate Analysis
 Checking for correlations
1.1) Insights from Exploratory Data Analysis (EDA)
 The data appears to be clean.
 The shape of the data is (210, 7).
 The info of the data indicates that all values are float.
 No null values in the data.
 No missing values in the data.
 Summary of the data:
 No null values present in any variable.
 The mean and median values seem to be almost equal.
 The standard deviation for spending is high compared to the other variables.
 No duplicates in the dataset.
1.1) Univariate Analysis
1.1) Univariate Analysis
1.1) Multivariate Analysis - Pairplot
1.1) Checking correlation and depiction in heatmap

Heatmap for Better Visualization

Observations

Strong positive correlation between:


 spending & advance_payments
 advance_payments & current_balance
 credit_limit & spending
 spending & current_balance
 credit_limit & advance_payments
 max_spent_in_single_shopping & current_balance
1.2 Do you think scaling is necessary for clustering in this case? Justify

 Yes, scaling is necessary here: clustering is based on distance computations, and unscaled data would distort those distances.
 Scaling needs to be done because the variables are on different scales.
 Variables such as spending and advance_payments have larger numeric values and would otherwise get more weight.
 Scaling brings all the values into a comparable range.
 I have used StandardScaler for scaling.
 A snapshot of the scaled data is shown below.
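A minimal sketch of this scaling step, assuming the dataset file name from the problem statement and scikit-learn's StandardScaler:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the bank marketing data (file name as given in the problem statement)
df = pd.read_csv("bank_marketing_part1_Data.csv")

# Fit a standard scaler (zero mean, unit variance) on all numeric columns
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Keep the scaled values in a DataFrame for the clustering steps that follow
scaled_df = pd.DataFrame(scaled, columns=df.columns)
print(scaled_df.head())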
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them.

Steps performed in this are

Hierarchical Clustering – Ward’s method
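A minimal sketch of Ward's-method hierarchical clustering and the dendrogram, assuming scipy and the scaled_df from the scaling step above:

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Ward's method minimises the within-cluster variance at each merge
ward_link = linkage(scaled_df, method="ward")

# Plot the dendrogram to inspect candidate cluster counts
plt.figure(figsize=(12, 5))
dendrogram(ward_link, truncate_mode="lastp", p=20)
plt.title("Dendrogram - Ward's method")
plt.xlabel("Observations (truncated)")
plt.ylabel("Distance")
plt.show()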


1.3 Creating clusters, Cluster Frequency and Cluster Profiling.
Hierarchical Clustering – Ward’s method

Cluster creation

Cluster frequency and profiling
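A sketch of how the clusters could be created from the Ward linkage and profiled on the original (unscaled) variables; the three-cluster cut follows the observation in this section:

from scipy.cluster.hierarchy import fcluster

# Cut the Ward linkage into 3 clusters
df["ward_cluster"] = fcluster(ward_link, t=3, criterion="maxclust")

# Cluster frequency
print(df["ward_cluster"].value_counts())

# Cluster profiling: mean of each original variable per cluster
print(df.groupby("ward_cluster").mean())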


1.3 Apply hierarchical clustering to scaled data using Average Method

Steps performed in this are:
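The same steps apply with average linkage; a brief sketch under the same assumptions as above:

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Average linkage uses the mean pairwise distance between clusters
avg_link = linkage(scaled_df, method="average")

plt.figure(figsize=(12, 5))
dendrogram(avg_link, truncate_mode="lastp", p=20)
plt.title("Dendrogram - Average method")
plt.show()

# Cut into 3 clusters, as with Ward's method
df["average_cluster"] = fcluster(avg_link, t=3, criterion="maxclust")
print(df["average_cluster"].value_counts())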


1.3 Creating clusters, Cluster Frequency and Cluster Profiling.
Hierarchical Clustering – Average method

Cluster creation

Cluster frequency and profiling


1.3 Observation

 Both methods give almost similar cluster means, with only minor variation, which is expected.
 There was not much variation between the two methods.
 Based on the dendrogram, a grouping of 3 or 4 clusters looks good; after further analysis of the dataset, we went with a 3-cluster solution.
 The three-cluster solution gives a pattern of high/medium/low spending combined with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payment behaviour).
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and silhouette score. Explain the results properly. Interpret and write inferences on the
finalized clusters.
Steps performed in this
 K-Means clustering.
 We initially set n_clusters = 3 and look at the distribution of observations across the clusters.
 We apply the K-Means technique to the scaled data.
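A minimal sketch of fitting K-Means with n_clusters = 3 on the scaled data, continuing from the scaled_df created in the scaling sketch:

import pandas as pd
from sklearn.cluster import KMeans

# Fit K-Means with 3 clusters on the scaled data
kmeans = KMeans(n_clusters=3, random_state=1)
kmeans.fit(scaled_df)

# Cluster labels and distribution of observations per cluster
labels = kmeans.labels_
print(pd.Series(labels).value_counts())
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)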
1.4 Apply elbow curve and silhouette score.
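A sketch of the elbow curve (within-cluster sum of squares versus number of clusters) and the silhouette scores, under the same assumptions as the K-Means sketch above:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

wss = []
sil = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=1).fit(scaled_df)
    wss.append(km.inertia_)
    sil.append(silhouette_score(scaled_df, km.labels_))

# Elbow curve: look for the point where the drop in inertia levels off
plt.plot(range(2, 11), wss, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Within-cluster sum of squares (inertia)")
plt.title("Elbow curve")
plt.show()

# Silhouette scores per k; higher is better
for k, s in zip(range(2, 11), sil):
    print(f"k={k}: silhouette score = {s:.3f}")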
1.4 Cluster Profiling.

The 3-group K-Means clustering gives a fairly even split of observations across the clusters.

Cluster 0 – Medium
Cluster 1 – Low
Cluster 2 – High
1.4 Observation.

 With the K-Means method, 3 clusters appear optimal: beyond this point there is no large drop in the inertia values, and the elbow curve shows similar results.
 The silhouette width scores for K-Means also indicate that the data points are assigned to appropriate clusters; there is no mismatch in the data points with regard to the clustering.
 Based on the dendrogram, a grouping of 3 or 4 clusters looks good; after further analysis of the dataset, we went with a 3-cluster solution.
 The three-cluster solution gives a pattern of high/medium/low spending combined with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payment behaviour).
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional
strategies for different clusters.

 Group 1: High Spending Group:


 Giving reward points might increase their purchases.
 max_spent_in_single_shopping is high for this group, so they can be offered a discount/offer on the next transaction upon full payment.
 Increase their credit limit.
 Encourage higher spending habits.
 Give loans against the credit card, as they are customers with a good repayment record.
 Tie up with luxury brands, which will drive more one-time maximum spending.
 Group 2: Low Spending Group – these customers should be given reminders for payments. Offers can be provided on early payments to improve their payment rate. Increase their spending habits by tying up with grocery stores and utilities (electricity, phone, gas, others).
 Group 3: Medium Spending Group – these are potential target customers who are paying bills, making purchases and maintaining a comparatively good credit score. We can increase their credit limit or lower the interest rate. Promote premium cards/loyalty cards to increase transactions. Increase spending habits by tying up with premium e-commerce sites, travel portals and airlines/hotels, as this will encourage them to spend more.
Problem 2: CART-RF-ANN

 An Insurance firm providing tour insurance is facing higher claim frequency. The management decides to collect data
from the past few years. You are assigned the task to make a model which predicts the claim status and provide
recommendations to management. Use CART, RF & ANN and compare the models' performances in train and test
sets.
 Dataset for Problem 2: insurance_part2_data-1.csv
 Attribute Information:
 Target: Claim Status (Claimed)
 Code of tour firm (Agency_Code)
 Type of tour insurance firms (Type)
 Distribution channel of tour insurance agencies (Channel)
 Name of the tour insurance products (Product)
 Duration of the tour (Duration in days)
 Destination of the tour (Destination)
 Amount of sales per customer for tour insurance policies, in rupees (in 100s) (Sales)
 Commission received by the tour insurance firm (Commision, as a percentage of sales)
 Age of insured (Age)
Problem Statement 2: CART-RF-ANN
Insurance- Dataset

We will explore the Insurance Dataset and do the analysis on the dataset. The major topics to be covered are below:
 Exploratory Data Analysis (EDA)
 Splitting of data into train and test set
 Building Classification models in the following:
 CART
 Random Forest (RF)
 Artificial Neural Network (ANN)
 Performance check for each model/Classification reports for each model
 Compare all models prepared
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-
variate, and multivariate analysis).

Basic Data Exploration


In this step, we performed the operations below to check what the dataset comprises.
 Head of the dataset
 Shape of the dataset
 Info/Datatypes of the dataset
 Checking of missing values (if any)
 Summary of the dataset
 Checking for duplicate data (if any)
 Checking for Outliers(if any)
 Univariate /Bivariate Analysis
 Pairwise Distribution of Continuous variables
 Checking for correlations
 Proportions of 1s and 0s
2.1 Insights from EDA

 The insurance data set has 3000 observations and 10 variables in the data set.
 There are 6 categorical variables namely, Agency_Code, Type, Claimed, Channel, Product Name and Destination in this data
set and there are 4 Continuous or numerical variables namely, Age, Commision, Duration and Sales.
 There are 6 object type variables, 2 integer type and 2 float type variables in the dataset.
 There are no missing values in the data set.
 There are 139 duplicate rows present in the data set.
 Outliers are present in all the 4 numerical variables (Age, Commision, Duration and Sales). We treat these outliers before building the Random Forest classifier.
2.1 Insights from EDA (Contd.)
 Summary of the dataset

 From the summary of the dataset, we observed the following:


 We have 4 numeric and 6 categorical variables; Agency_Code EPX has a frequency of 1365.
 The most common Type is Travel Agency, and the most common Channel is Online.
 The Customised Plan is the product most sought by customers.
 ASIA is the destination most sought by customers.
2.1 Further Analysis based on EDA results (Contd.)

 The info function indicates that the dataset has object, integer and float columns, so we have to convert the object data type to categorical codes, as sketched below.
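A minimal sketch of converting the object columns to categorical codes, assuming the dataset file name from the problem statement:

import pandas as pd

df_insured = pd.read_csv("insurance_part2_data-1.csv")

# Convert every object column to pandas category codes (integer labels)
for col in df_insured.select_dtypes(include="object").columns:
    df_insured[col] = pd.Categorical(df_insured[col]).codes

df_insured.info()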
2.1 Further Analysis based on EDA results (Contd.)

 We also checked proportion of 1s and 0s after converting object datatype to categorical codes.

 As we found that the dataset has duplicate rows, we remove the duplicate rows (see the sketch after this list).
 We will further look at the distribution of dataset in Univariate/Bivariate analysis, Pairwise distribution and check for
correlations.
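A brief sketch of the duplicate-row removal and the proportion of 1s and 0s in the target, continuing from the conversion step above:

# Drop exact duplicate rows (139 duplicates were reported in the EDA)
df_insured = df_insured.drop_duplicates()
print("Shape after removing duplicates:", df_insured.shape)

# Proportion of claimed (1) vs. not claimed (0) in the target column
print(df_insured["Claimed"].value_counts(normalize=True))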
Bivariate Analysis of Categorical Variables
Univariate Analysis of Numerical Variables
Bivariate Analysis of Categorical Variables
Bivariate Analysis of Categorical Variables (Contd.)
Multivariate Analysis:
Pairplot of Numerical Variables; Heatmap showing correlation coefficients.
2.2 Data Split: Split the data into test and train, build classification model CART, Random
Forest, Artificial Neural Network.
Steps performed in this are

 Extracting the target column “Claimed” into separate vectors for the training and test sets.
 For training and testing purposes, we split the dataset into train and test data in the ratio 70:30.
 Checking the dimensions of the training and test data with the shape command: X_train (2002, 9) and X_test (859, 9).
 Building classification models for CART, Random Forest (RF) and Artificial Neural Network (ANN).
 Model 1: Steps performed for building the Decision Tree Classifier on the training data (see the sketch after this list):
 First, we check which features/variables to use for prediction.
 Second, we perform a grid search to find the optimal values for the hyperparameters.
 Fitting the best grid obtained to the training set.
 Adding tuning parameters to the model.
 Finding the variable importance.
 Predicting on the training and test datasets for the Decision Tree.
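A sketch of the train/test split and the grid-searched decision tree, assuming the cleaned df_insured from the steps above; the parameter grid shown is illustrative, not the exact grid used in the original analysis:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Separate predictors and target, then split 70:30
X = df_insured.drop("Claimed", axis=1)
y = df_insured["Claimed"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)

# Grid search over a small, illustrative hyperparameter grid
param_grid = {
    "max_depth": [4, 6, 8],
    "min_samples_leaf": [5, 10, 20],
    "min_samples_split": [30, 60, 90],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
grid.fit(X_train, y_train)
best_dt = grid.best_estimator_

# Variable importance and predictions on train and test
print(dict(zip(X.columns, best_dt.feature_importances_)))
dt_train_pred = best_dt.predict(X_train)
dt_test_pred = best_dt.predict(X_test)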
2.2 Data Split: Split the data into test and train, Random Forest, Artificial Neural Network
(Contd.).
Steps performed in this are
 Model 2: Steps performed for building an ensemble Random Forest Classifier for the training and test data are:
 First, we treat the outliers present in the dataset for the Random Forest model, as shown below:

We made a copy of the original dataset df_insured, stored it as df_insured_rf, and used it for the RF classifier.

Graphic representation after outlier treatment
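A minimal sketch of the outlier treatment on a copy of the data kept for the Random Forest model; capping at the 1.5*IQR fences is an assumption, since the original slides show the result only as a plot:

# Copy kept specifically for the Random Forest model
df_insured_rf = df_insured.copy()

def cap_outliers(series):
    """Cap values outside the 1.5*IQR fences at the fence values."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series.clip(lower=lower, upper=upper)

for col in ["Age", "Commision", "Duration", "Sales"]:
    df_insured_rf[col] = cap_outliers(df_insured_rf[col])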
2.2 Data Split: Split the data into test and train, Random Forest, Artificial Neural Network
(Contd.).
Steps performed in this are
 Model 2: Random Forest Classifier (contd.):
 Splitting the data in a 70:30 ratio for the training and test sets
 Using RandomForestClassifier to fit on randomly selected features
 Performing a grid search to find the optimal values for the hyperparameters
 Fitting the model with the values obtained from the grid search
 Assigning the best parameters and best estimator to the best grid
 Predicting on the training and test datasets for RF
 Model 3: Building a Neural Network Classifier:
 Before building the model, we scale the values using MinMaxScaler
 Applying the scaling transformation to the training and test data
 We build the model using MLPClassifier and fit it to the training data
 Performing a grid search to find the optimal values for the hyperparameters
 Fitting the model using the optimal values from the grid search
 Assigning the best estimator to the best grid
 Predicting on the training and test datasets for the Neural Network Classifier (see the sketch below)
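A sketch of the grid-searched Random Forest and the MinMax-scaled MLP classifier, continuing from the earlier sketches; the parameter grids are illustrative, and the RF model reuses the outlier-treated copy df_insured_rf:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

# Model 2: Random Forest on the outlier-treated copy, split 70:30
X_rf = df_insured_rf.drop("Claimed", axis=1)
y_rf = df_insured_rf["Claimed"]
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_rf, y_rf, test_size=0.30, random_state=1
)
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    {"n_estimators": [100, 300], "max_depth": [5, 7], "min_samples_leaf": [5, 10]},
    cv=5,
)
rf_grid.fit(Xr_train, yr_train)
best_rf = rf_grid.best_estimator_

# Model 3: MLP (ANN) on MinMax-scaled features from the earlier 70:30 split
scaler = MinMaxScaler()
Xs_train = scaler.fit_transform(X_train)
Xs_test = scaler.transform(X_test)
ann_grid = GridSearchCV(
    MLPClassifier(max_iter=2000, random_state=1),
    {"hidden_layer_sizes": [(50,), (100,)], "tol": [0.001, 0.0001]},
    cv=5,
)
ann_grid.fit(Xs_train, y_train)
best_ann = ann_grid.best_estimator_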
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and
Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC
score, classification reports for each model.
Decision Tree prediction on the training set using accuracy and confusion matrix; model evaluation for the Decision Tree on the training set showing the AUC score and the ROC curve.
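A minimal sketch of the performance metrics computed for each model (accuracy, confusion matrix, ROC AUC, classification report), shown here for the decision tree on the training set; the same pattern applies to the test set and to the RF and ANN models:

from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_auc_score, roc_curve,
)
import matplotlib.pyplot as plt

# Accuracy, confusion matrix and classification report on the training set
print("Accuracy:", accuracy_score(y_train, dt_train_pred))
print(confusion_matrix(y_train, dt_train_pred))
print(classification_report(y_train, dt_train_pred))

# ROC AUC and ROC curve using predicted probabilities of the positive class
train_probs = best_dt.predict_proba(X_train)[:, 1]
print("ROC AUC:", roc_auc_score(y_train, train_probs))
fpr, tpr, _ = roc_curve(y_train, train_probs)
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC curve - Decision Tree (training set)")
plt.show()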
2.3 Decision Tree: Performance check on Predictions of Test set using Accuracy, Confusion
Matrix, Plot ROC curve and get ROC_AUC score.

Decision Tree prediction on the test set using accuracy and confusion matrix; model evaluation for the Decision Tree on the test set showing the AUC score and the ROC curve.
2.3 Model 1: Decision Tree: Performance metrics on Training and Test Sets

CART Metrics on Training Set

CART Metrics on Test Set


2.3 Random Forest: Performance check on Predictions of Training set using Accuracy, Confusion
Matrix, Plot ROC curve and get ROC_AUC score.

RF prediction on the training set using accuracy and confusion matrix; model evaluation for RF on the training set showing the AUC score and the ROC curve.
2.3 Random Forest: Performance check on Predictions of Test set using Accuracy, Confusion
Matrix, Plot ROC curve and get ROC_AUC score.

RF prediction on the test set using accuracy and confusion matrix; model evaluation for RF on the test set showing the AUC score and the ROC curve.
2.3 Model 2: Random Forest: Performance metrics on Training and Test Sets

RF Metrics on Training Set

RF Metrics on Test Set


2.3 ANN: Performance check on Predictions of Training set using Accuracy, Confusion Matrix,
Plot ROC curve and get ROC_AUC score.

ANN prediction on the training set using accuracy and confusion matrix; model evaluation for ANN on the training set showing the AUC score and the ROC curve.
2.3 ANN: Performance check on Predictions of Test set using Accuracy, Confusion Matrix, Plot
ROC curve and get ROC_AUC score.

ANN prediction on the test set using accuracy and confusion matrix; model evaluation for ANN on the test set showing the AUC score and the ROC curve.
2.3 Model 3: ANN: Performance metrics on Training and Test Sets

ANN Metrics on Training Set

ANN Metrics on Test Set


2.4 Final Model: Compare all the models and write an inference which model is
best/optimized.

Comparison of the performance metrics from the 3 models


2.4 Final Model: ROC curve for the 3 models

Depiction of the ROC curves on the Training Data and on the Test Data

CONCLUSION:

I am selecting the RF model, as its accuracy, precision, recall and F1 score are better than those of the other two models (CART and ANN), and its ROC curve is also better.
2.5 Inference: Based on the whole Analysis, what are the business insights and
recommendations
 Looking at the model, more data would help us understand the patterns and predict better.
 Streamlining online experiences benefitted customers, leading to an increase in conversions, which subsequently raised profits.
 As per the data, 90% of insurance is sold through the online channel.
 Another interesting fact is that almost all offline business has a claim associated with it.
 The JZI agency needs training to pick up sales, as it is at the bottom; we need to run a promotional marketing campaign for it or evaluate whether to tie up with an alternate agency.
 Based on the model we are getting about 80% accuracy, so when a customer books airline tickets or plans a trip, we can cross-sell insurance based on the claim data pattern.
 Another interesting fact is that more sales happen via Agency than via Airlines, yet the trend shows that claims are processed more for Airlines; we may need to deep dive into the process to understand the workflow and why.
 Key performance indicators (KPIs) for insurance claims:
 Increase customer satisfaction, which in turn will give more revenue
 Combat fraudulent transactions; deploy measures to detect and avoid fraud as early as possible
 Optimize the claims recovery method
 Reduce claim handling costs.
Thank You!
