Data Mining For Business in Python Deck

Data Mining for Business in Python

Data Mining for Business in Python 2021

1 Survival Analysis

2 Cox Proportional Hazard

3 CHAID

4 Gaussian Mixture Model

5 Dimension Reduction

6 Association Rule Learning

7 Random Forest

8 LIME

9 XGBoost and SHAP


Survival Analysis


Introduction to Survival Analysis

Context
• Survival Analysis is very common in subscription-type businesses and is well suited to studying customer churn
• Imagine a customer decides to cancel their subscription. How long do you wait until you try to win that customer back?
  • 1 day
  • 1 week
  • 1 month

Visualization
[Figure: share of customers that have not renewed (y-axis, starting at 100%) over time (x-axis)]

Case Study Briefing

Lung Cancer
Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group.

Case study¹
• Determine the survival curve with the Kaplan-Meier estimator
• Understand differences between males and females

1: Authors: Terry M. Therneau [aut, cre], Thomas Lumley [ctb, trl] (original S->R port and R maintainer until 2009), Elizabeth Atkinson [ctb], Cynthia Crowson [ctb]

Survival Analysis Step by Step

Prepare Dataset

Perform Survival Curve

Visualize and Interpret Results

Perform Log Rank Test (if the case requires it)


Kaplan-Meier Estimator

Explanation
• Non-parametric statistic used to estimate the survival function (the probability of a subject surviving) from lifetime data.
• In medical research, it is often used to measure the fraction of patients living for a specific time after treatment or diagnosis.

Formula
$$S(t_i) = S(t_{i-1}) \cdot \left(1 - \frac{d_i}{n_i}\right)$$

Where:
$S(t_i)$ = probability of survival at time $t_i$
$d_i$ = number of events at time $t_i$
$n_i$ = number of subjects still at risk (survivors) just before time $t_i$

[Figure: customers remaining (y-axis) over time (x-axis)]
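The recursion above can be sketched in a few lines of plain Python. This is a minimal illustration, not the deck's notebook code: the function name `kaplan_meier`, the toy durations, and the convention that an event flag of 1 means "event observed" and 0 means "censored" are all assumptions made for the example.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(t_i, S(t_i))] for each distinct time with at least one event.

    durations: observed time for each subject
    events:    1 if the event was observed, 0 if the subject was censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    removed = Counter(durations)      # every subject leaves the risk set at its time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(removed):
        d = deaths.get(t, 0)
        if d > 0:
            survival *= 1 - d / at_risk   # S(t_i) = S(t_{i-1}) * (1 - d_i / n_i)
            curve.append((t, survival))
        at_risk -= removed[t]             # drop both events and censored subjects
    return curve

# Hypothetical toy data: months until churn, 0 = still subscribed (censored)
durations = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(durations, events)
```

Censored subjects never reduce the survival estimate directly; they only shrink the risk set for later time points. In practice a library such as lifelines would normally be used for this step.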

Censoring

Types

Right Censoring:
The subject under observation is still alive at the end of the study. In this case, we do not know the time at which the event of interest (death) occurs.

Left Censoring:
The event cannot be observed for some reason, for example because it had already happened before recording started.

Interval Censoring:
We only have data for specific intervals, so we know at most that the event of interest occurred within an interval, not exactly when.

Log Rank Test

Context

Goal:
To test whether there are statistically significant differences in the survival distributions of two or more groups

Null Hypothesis:
There is no difference between the groups

If p-value > 0.05:
We cannot reject the null hypothesis: there is no statistically significant difference between the groups
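To make the test concrete, here is a hedged sketch of the two-group log-rank statistic written from scratch. The function name `logrank_test` and the toy data are hypothetical; the p-value uses the fact that a chi-square variable with one degree of freedom is a squared standard normal. The sketch does not handle the degenerate case where the variance is zero.

```python
import math

def logrank_test(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test. Returns (chi-square statistic, p-value)."""
    event_times = sorted({t for t, e in zip(durations_a + durations_b,
                                            events_a + events_b) if e == 1})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in durations_a if d >= t)   # at risk in group A
        n_b = sum(1 for d in durations_b if d >= t)   # at risk in group B
        d_a = sum(1 for d, e in zip(durations_a, events_a) if d == t and e == 1)
        d_b = sum(1 for d, e in zip(durations_b, events_b) if d == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n                     # events expected under H0
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n ** 2 * (n - 1))
    statistic = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(statistic / 2))     # chi-square tail, 1 df
    return statistic, p_value

# Hypothetical toy data: group A fails early, group B fails late
group_a = ([2, 3, 4, 5, 6], [1, 1, 1, 1, 1])
group_b = ([8, 9, 10, 11, 12], [1, 1, 1, 0, 1])
statistic, p_value = logrank_test(*group_a, *group_b)
```

With these made-up numbers the p-value falls below 0.05, so the null hypothesis of equal survival would be rejected. Comparing a group with itself gives a statistic of 0 and a p-value of 1.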


Survival Analysis extra Resources

Deep dives

Book
Survival Analysis
Stephen P. Jenkins


Challenge – Electronic components

Experiment
In 1988, an experiment was designed and implemented at one of AT&T's factories.
The goal was to investigate alternatives in the "wave soldering" procedure for mounting electronic components to printed circuit boards.
The response is the number of visible solder skips.

Output
1 Transform the Solder variable into 1 and 0
2 Fit the Kaplan-Meier estimator
3 Plot survival curves
4 Do a log-rank test for the Panel variable. Use multivariate_logrank_test. Or be creative :D

Dataset source: Survival package from CRAN

Cox Proportional
Hazard Regression


Cox Proportional Hazard Regression

Explanation
• Survival Analysis on its own does not allow other predictors
• At best, you can split into groups by gender, age, etc. and perform a log-rank test
• Thus, Cox Proportional Hazard regression helps to determine the relationship between the survival time of a subject and one or more predictor variables
• $\exp(b_n)$ are called the Hazard Ratios (HR)

Formula
$$h(t) = h_0(t) \cdot \exp(b_1 x_1 + \dots + b_n x_n)$$

Where:
$h_0(t)$ = baseline hazard
$b_n$ = impact coefficients
$x_n$ = covariates

Result interpretation:
HR > 1: increase
HR < 1: decrease
HR = 1: neutral
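As a small worked example of the interpretation rules, the snippet below turns coefficients into hazard ratios and evaluates the exponential part of the formula. The coefficient values and covariate names are invented for illustration and are not results from the lung cancer case study.

```python
import math

# Hypothetical fitted coefficients b_n (not taken from the case study)
coefficients = {"age": 0.02, "sex_female": -0.50}

def hazard_ratios(coefs):
    """HR_n = exp(b_n) for every covariate."""
    return {name: math.exp(b) for name, b in coefs.items()}

def relative_hazard(coefs, covariates):
    """h(t) / h_0(t) = exp(b_1 * x_1 + ... + b_n * x_n)."""
    return math.exp(sum(coefs[name] * x for name, x in covariates.items()))

hr = hazard_ratios(coefficients)
# hr["age"] > 1        -> each extra year increases the hazard
# hr["sex_female"] < 1 -> being female decreases the hazard

baseline = relative_hazard(coefficients, {"age": 0, "sex_female": 0})
patient = relative_hazard(coefficients, {"age": 10, "sex_female": 1})
```

A subject with all covariates at zero sits exactly on the baseline hazard, which is why `baseline` evaluates to 1.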
Cox Proportional Hazard Regression Step by Step

Prepare Dataset

Cox Proportional Regression

Visualize and Interpret Results


Case Study Briefing

Lung Cancer
Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group.

Case study¹
1 Driver Analysis with Cox Proportional Hazard
2 Visualize Results

1: Authors: Terry M. Therneau [aut, cre], Thomas Lumley [ctb, trl] (original S->R port and R maintainer until 2009), Elizabeth Atkinson [ctb], Cynthia Crowson [ctb]

Cox Proportional Hazard extra Resources

Deep dives

Time-Dependent Covariates in the Cox Proportional-Hazards Regression Model
Lloyd D. Fisher and D. Y. Lin, 1999

Cox Proportional-Hazards Regression for Survival Data in R
John Fox & Sanford Weisberg

Challenge – Veteran Lung Cancer A/B test

Experiment
Randomised trial of two treatment regimens for lung cancer.

Challenge¹
1 Transform Cell Type into dummy variables. Use pd.get_dummies or drop the variable
2 Perform Cox Proportional Hazard Regression
3 Plot CPH results

1: Dataset source: Survival package from CRAN

CHAID


Case Study Briefing

Labor Market Ethnic Discrimination
Cross-section data about resume, call-back and employer information

Case study¹
4,870 fictitious resumes sent in response to employment advertisements in Chicago and Boston in 2001.
The resumes contained information concerning the ethnicity of the applicant.

1: Bertrand, M. and Mullainathan, S. (2004). Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. American Economic Review, 94, 991–1013.

CHAID Step by Step

Variable selection

Transforming continuous variables into categorical

Do your first tree

Prune it for better interpretability


Factors influencing call-backs
[Figure: four 2×2 grids rating the call-back likelihood (High, Medium or Low) for pairs of factors: employment gaps × honorary degree, computer skills × college, resume quality × work experience, military experience × special skills]

Complexity increases as you dive deeper into your problem

Why?
Problem depth:
Having more than 20, 50 or 100 drivers increases the complexity

Importance:
How do you know which driver actually matters most?

Relevance:
Some variables might be relevant only in combination with some of the others, but not all

One of CHAID's benefits is that it figures out which drivers are more important

Which?
Importance ranking:
CHAID figures out which drivers matter more by doing significance tests

Segmented Driver Analysis:
CHAID will segment the population and perform driver analysis for each segment.

Interpretability:
CHAID provides easy-to-read graphs with customer segments

Let's see how it works visually

[Figure: example CHAID tree of results, branching on quality, education, computer skills, military experience, honors, employment holes, experience and jobs, with High/Low and Yes/No branches]

How CHAID processes

What does it do?
CHAID looks at all predictors and tries to find the one where the "yes" is most different from the "no"

How does it work?
CHAID performs a Chi-square test. It shows whether the frequencies of the categorical variables are different or not. Its role is similar to that of a t-test, but it compares observed and expected frequencies, which makes it ideal for categorical variables.

And then?
After it finds the first segment split, it tries to find the next one where the "yes" differs most from the "no"

[Figure: contingency table of "Is called" (Yes/No) by "Has honors" (Yes/No); then, of the people who have honors, "Is called" (Yes/No) by "IT skills" (Yes/No)]
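The split-selection idea can be sketched with a hand-written Pearson chi-square statistic. The counts below are made up, and picking the largest statistic is a simplification: full CHAID implementations compare adjusted p-values and also merge similar categories, which this sketch leaves out.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = predictor is Yes / No, columns = called Yes / No
tables = {
    "has_honors": [[30, 70], [10, 90]],
    "it_skills": [[22, 78], [18, 82]],
}
scores = {name: chi_square(table) for name, table in tables.items()}
best_split = max(scores, key=scores.get)   # predictor where "yes" differs most from "no"
```

Here `has_honors` wins because its call-back rates (30% vs 10%) differ far more than those of `it_skills` (22% vs 18%). The same procedure would then be repeated inside each resulting segment.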
Last few things to consider

Which?
Tree size:
You can choose how many levels the tree will have

Bucket size:
You can choose a minimum number of observations that each bucket must contain

Continuous variables:
CHAID accepts only categorical variables, so continuous variables must be converted into categories first

CHAID extra Resources

Deep dives

A CHAID Based Performance Prediction Model in Educational Data Mining
M. Ramaswami and R. Bhaskaran, 2010

Tree Structured Data Analysis: AID, CHAID and CART
Leland Wilkinson, 1992

Challenge – Police Racial Bias

Description
You have been hired to investigate vehicle searches by the police and whether there is racial bias

Challenge¹
1 Create a dataset with these 5 variables: problem, vehicleSearch, race, gender, policePrecinct
2 Transform string variables into dummies
3 Get the names of the dependent and independent variables
4 Perform CHAID and visualize. Set max depth to 2

1: Dataset source: carStops package from CRAN

Clustering:
Gaussian Mixture
Model


Case Study Briefing – Country Segmentation

Socio-Economic Data
Dataset with socio-economic indicators per country

Case study
1 Find the optimal number of clusters
2 Visualize the optimal number of clusters
3 Create clusters
4 Interpret the clusters

What are clustering techniques?

Key ideas
• Groups observations in terms of their characteristics
• Main task of exploratory data mining
• Clustering is more of an art than a science

Visualization
[Figure: clustering illustration with axis X2]

Gaussian Mixture Model

Key ideas
• Gaussian Mixture Model is a probabilistic method for clustering
• Often a better choice than traditional clustering algorithms such as Kmeans, because each observation gets a probability of belonging to each cluster
• The probabilities make it easier to evaluate edge cases

Gaussian Mixture Model vs. Kmeans

Key ideas
• No need to standardize data
• The cluster sizes do not have specific structures that might or might not apply
• Faster to compute
• Poor at dealing with a low number of data points

[Figure: two scatter plots on axes X1 and X2, one labelled Kmeans and one labelled Gaussian Mixture Model]

Akaike’s Information Criterion (AIC) and Bayesian
Information Criterion (BIC)
Key Ideas
• AIC and BIC help us determine the optimal number of clusters
• AIC and BIC provide a means to select a model
• Trade-off between simplicity and goodness of fit
• Deal with overfitting and underfitting

Pseudo-visualization
[Figure: trade-off between goodness of fit (one axis) and simplicity (other axis)]
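The trade-off can be made explicit with the standard definitions AIC = 2k − 2·ln(L) and BIC = k·ln(n) − 2·ln(L), where k is the number of free parameters, n the number of observations and L the maximised likelihood. The log-likelihoods below are invented; the parameter counts assume a 2-dimensional mixture with full covariance matrices (6 per cluster, minus 1 for the mixing weights).

```python
import math

def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: lower is better, penalises size harder."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical fits on 200 observations: clusters -> (log-likelihood, free parameters)
candidates = {1: (-520.0, 5), 2: (-450.0, 11), 3: (-445.0, 17), 4: (-443.0, 23)}
n_obs = 200

aic_scores = {k: aic(ll, p) for k, (ll, p) in candidates.items()}
bic_scores = {k: bic(ll, p, n_obs) for k, (ll, p) in candidates.items()}
best_k = min(bic_scores, key=bic_scores.get)
```

Going from 2 to 3 or 4 clusters improves the fit only slightly, so the penalty for the extra parameters outweighs the gain and both criteria settle on 2 clusters in this made-up example.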


Gaussian Mixture Model Step by Step

Prepare Dataset

Find Optimal Clusters

Perform Gaussian Mixture Model

Interpret results


Gaussian Mixture Model extra Resources

Deep dives

On the Number of Components in a Gaussian mixture model

Geoffrey J. McLachlan, Suren Rathnayake

The Infinite Gaussian Mixture Model

Carl Edward Rasmussen, 2000


Challenge – Wine Quality

Description
You are a wannabe wine connoisseur, trying to find the best wines for your parties using Data Mining

Challenge¹
1 Determine the optimal number of clusters
2 Perform Gaussian Mixture Model
3 Interpret results

1: Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez; A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region (CVRVV), Porto, Portugal, @2009

Dimension Reduction


Dimension Reduction Goal

[Figure: variance per variable in a data set with 4 independent variables, next to the variance per component after Dimension Reduction]

You have more information than you need

Problem statement
Dimension Reduction helps to:
1 Solve multicollinearity issues
2 Solve the computational issues of a large number of predictors
3 Avoid noisy models due to overfitting
4 Create new variables (called components)
5 Pre-process data for predictive models or forecasting

What is Principal Component Analysis?

Key Ideas
• An algorithm for Dimension Reduction
• Linearly transforms variables into components
• Components can be determined by the percentage of variance explained
• Choosing components is more of an art than a science
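To illustrate "percentage of variance explained", the sketch below runs PCA on two variables only, where the eigenvalues of the 2×2 covariance matrix have a closed form. The function name `pca_2d` and the data points are assumptions for the example; real projects would standardize the data first and use a library routine that handles any number of variables.

```python
import math

def pca_2d(points):
    """Explained-variance ratios of the two principal components of 2-D data."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix [[a, b], [b, c]]
    a = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix = variance captured by each component
    centre = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [centre + spread, centre - spread]
    total = a + c
    return [ev / total for ev in eigenvalues]

# Hypothetical, strongly correlated variables (y is roughly 2 * x)
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
ratios = pca_2d(points)
```

Because the two variables move almost in lockstep, the first component captures nearly all of the variance, which is the situation in which dropping the second component loses very little information.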


PCA vs Manifold

Key ideas
• There are inherent curves in the relationship among the data that carry information
• Methods like PCA cannot absorb that information because of their linearity
• No need to standardize data
• Con: Manifold is less interpretable than PCA
• Con: No good quantitative way of determining components
• There are several algorithms for Manifold learning. We will use t-SNE

Visualization
[Figure: data on axes X1 and X2, comparing a PCA fit with a Manifold fit]

Pros and Cons of t-SNE (t-Distributed Stochastic Neighbor Embedding)

Pros
1 Excellent in high-dimensional datasets
2 Focuses on preserving local structures
3 Easy implementation

Cons
1 Very computationally intensive

Dimension Reduction extra Resources

Deep dives
Principal component analysis
Hervé Abdi and Lynne J. Williams

What is principal component analysis?
Markus Ringnér

Algorithms for manifold learning
Lawrence Cayton

Large-Scale Manifold Learning
Ameet Talwalkar, Sanjiv Kumar, and Henry Rowley

Challenge - Abalone

Description
The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope - a time-consuming task. Other measurements, which are easier to obtain, are used to predict the age.

Challenge¹
1 Transform gender variable and remove rings variable
2 Perform Correlation Matrix and Standardize data
3 Find Optimal Number of Clusters
4 Perform PCA and interpret components
5 Perform t-SNE and visualize results

1: Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994) "The Population Biology of Abalone (_Haliotis_ species) in Tasmania. I. Blacklip Abalone (_H. rubra_) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288)

Association Rule
Learning – Apriori


Case Study Briefing - Groceries

Transaction Data
Data on customer grocery shopping purchases

Case study¹
1 We have a file with almost 10k transactions
2 We need to find patterns in our data to maximize baskets
3 Perform Association Rule Learning

1: Michael Hahsler, Kurt Hornik, and Thomas Reutterer (2006). Implications of probabilistic data modeling for mining association rules. In M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nuernberger, and W. Gaul, editors, From Data and Information Analysis to Knowledge Engineering, Studies in Classification, Data Analysis, and Knowledge Organization, pages 598–605. Springer-Verlag.

Association Rule Learning Step by Step

Prepare Dataset

Define Support

Define Confidence

Execute Association Rules Learning

Visualize Results


The output of Association Rule Learning Algorithm

If...              Then...
Game of Thrones →  Lord of the Rings
Burger          →  Fries
Jay-Z           →  Kanye

Key Ideas
• The output is an If…then… type of analysis
• Association Rule Learning is a very simple recommender system

Concepts you need to know - Support

Methodological background
$$\text{Support}(\text{Burgers}) = \frac{\#\,\text{Transactions with Burger}}{\text{Total Transactions}}$$

To consider
• It does not matter if Burgers happen more than once per transaction
• Support indicates the relevance of the item

Burger Support Visualization

Data
• Population is 20
• 6 people like burgers

Formula
$$\text{Support}(\text{Burgers}) = \frac{6}{20} = 30\%$$

Mayo Support Visualization

Data
• Population is 20
• 4 people like Mayo

Formula
$$\text{Support}(\text{Mayo}) = \frac{4}{20} = 20\%$$

Concepts you need to know - Confidence

Methodological background
$$\text{Confidence}(\text{Mayo} \mid \text{Burgers}) = \frac{\#\,\text{Transactions with Burger \& Mayo}}{\text{Total Burger Transactions}}$$

To consider
• It does not matter if Burgers or Mayo happen more than once per transaction
• Confidence indicates the strength of the relationship

Confidence Visualization

Data
• Population is 20
• 6 people like burgers
• Of the 6, 2 like Mayo

Formula
$$\text{Confidence}(\text{Mayo} \mid \text{Burgers}) = \frac{2}{6} = 33\%$$

Concepts you need to know - Lift

Methodological background
$$\text{Lift}(\text{Mayo} \mid \text{Burgers}) = \frac{\text{Confidence}(\text{Mayo} \mid \text{Burgers})}{\text{Support}(\text{Mayo})}$$

Key idea
• Lift measures the likelihood of buying Mayo and Burgers together vs. just buying Mayo
• Lift bigger than 1 means increased likelihood to buy
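The three measures can be checked against the numbers used in the slides (20 people, 6 who like burgers, 4 who like mayo, 2 who like both). The basket list below is constructed to match those counts; the "fries" filler baskets and the function names are assumptions made for the sketch.

```python
# 20 hypothetical baskets matching the slides: 6 contain burgers,
# 4 contain mayo, and 2 contain both.
transactions = (
    [{"burger", "mayo"}] * 2
    + [{"burger"}] * 4
    + [{"mayo"}] * 2
    + [{"fries"}] * 12
)

def support(itemset, baskets):
    """Share of baskets that contain every item in `itemset` (a set)."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Support of both itemsets together divided by support of the antecedent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

support_burger = support({"burger"}, transactions)                     # 6 / 20
support_mayo = support({"mayo"}, transactions)                         # 4 / 20
confidence_mayo = confidence({"burger"}, {"mayo"}, transactions)       # 2 / 6
lift_mayo = lift({"burger"}, {"mayo"}, transactions)
```

The lift works out to roughly 1.67, so in this toy data burger buyers are more likely than the average customer to also take mayo. Using sets means an item counts once per basket, however many times it was bought.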


Apriori is an Association Rule Learning Algorithm

What is it?
Key characteristics
1 Mines frequent itemsets for Boolean Association Rules
2 Works by finding items that have occurred a minimum number of times (support)
3 And the corresponding itemsets that pass a certain cut-off (confidence)

Limitations
1 Slow in processing itemsets
2 Only allows Boolean values
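A compact sketch of the frequent-itemset stage is shown below. It is a teaching version under stated assumptions: the function name `apriori`, the toy baskets and the 0.4 support threshold are invented, and the second stage (turning frequent itemsets into rules and filtering them by confidence) is left out.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support_of(candidate):
        return sum(candidate <= b for b in baskets) / n

    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items}
    frequent = {}
    size = 1
    while current:
        # Keep only candidates that reach the minimum support
        level = {c: support_of(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate with an infrequent k-subset (the Apriori property)
        size += 1
        joined = {a | b for a in level for b in level if len(a | b) == size}
        current = {c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, size - 1))}
    return frequent

# Hypothetical baskets
baskets = [
    {"burger", "fries", "cola"},
    {"burger", "fries"},
    {"burger", "mayo"},
    {"fries", "cola"},
    {"burger", "fries", "mayo"},
]
frequent = apriori(baskets, min_support=0.4)
```

The pruning step is what keeps the search manageable: a three-item candidate is only counted if all of its two-item subsets were already frequent. Even so, every level requires another pass over all baskets, which is the slowness the limitations above refer to.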


Association Rule Learning extra Resources

Deep dives

Online Association Rule Mining

Christian Hidber

Association Rule Mining: A Survey
Qiankun Zhao and Sourav S. Bhowmick

Algorithms for Association Rule Mining – A General Survey and Comparison
Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh

Challenge - NYC restaurants cuisine, borough and
sanitary grade

Description
You have a dataset with NYC restaurants, their boroughs and sanitary grade

Challenge
1 Create a list with the transactions
2 Encode the transaction list into a DataFrame
3 Perform Association Rule Learning. Play around with support and confidence
4 Visualize the results

Random Forest


You were hired to figure out the main drivers of customers signing up for a savings account at a bank

Problem Relevance
Customer churn:
Calling a customer who is not going to sign up can lead them to unsubscribe

Opportunity cost:
Offering the wrong product can create a loss if the customer would have been interested in signing up for another one

Relevance:
Continuously sending information that the customer is not interested in can lead to lower open rates in the future

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

Random Forest Step by Step

Prepare Dataset

Split into training and test set

Perform Random Forest

Predict using the Random Forest

Model Assessment

Execute Driver Importance


How do Decision trees work?

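Visualization
[Figure: scatter plot on axes x1 and x2 with split lines at x2 = 50, x1 = 10 and x1 = 70]

Decision tree
[Figure: tree with root split X2 < 50 (No/Yes), followed by the splits X1 > 10 and X1 > 70 (No/Yes), ending in leaves such as Blue]

Key Ideas:
• A split is chosen using entropy-based logic: where would a split yield the most information?
• The prediction is based on the relative frequency of the classes in the leaf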
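The "where would it yield more information" question can be answered numerically with entropy and information gain. The sketch below scores two candidate thresholds on a single feature; the data, the colour labels and the function names are hypothetical, and a real tree learner would scan every feature and threshold.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on `value < threshold`."""
    left = [l for v, l in zip(values, labels) if v < threshold]
    right = [l for v, l in zip(values, labels) if v >= threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Hypothetical points: feature x2 and a colour label
x2 = [10, 20, 30, 40, 60, 70, 80, 90]
colour = ["blue", "blue", "blue", "blue", "red", "red", "red", "red"]

gain_at_50 = information_gain(x2, colour, 50)   # separates the colours perfectly
gain_at_25 = information_gain(x2, colour, 25)   # leaves a mixed right-hand side
```

Splitting at 50 removes all uncertainty (gain of 1 bit), while splitting at 25 removes only part of it, so the tree would prefer the threshold at 50.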


Random Forest is an Ensemble Learning Algorithm

What is it?
1 Ensemble Learning is when you have a plurality of models predicting your output
2 In simple words, an ensemble is an average of models
3 A Random Forest is a combination of decision trees
4 Can be used for Regression and Classification problems
5 Random Forests can still overfit, although usually less than a single decision tree

Let's imagine this is our full data set
[Figure: the full data set]

Splitting between training and test sets enables an unbiased model assessment
[Figure: the data set divided into a Training Set and a Test Set, with the Test Set used for Model Assessment]

The Confusion Matrix allows us to assess the results of a classifier

Confusion Matrix
                   Truth: False      Truth: True
Predicted: False   True negative     False negative
Predicted: True    False positive    True positive

Accuracy
• Accuracy = (true positive + true negative) / all
• Used when we have a balanced dataset

Sensitivity or Recall
• True positive / (true positive + false negative)
• Used when skewed towards False values

Specificity or True Negative Rate
• True negative / (true negative + false positive)
• Used when skewed towards True values

Precision
• True positive / (true positive + false positive)
• Used when skewed towards False values
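The four ratios can be computed directly from the cells of the confusion matrix. In this sketch the label vectors are made up, 1 stands for True and 0 for False, and the helper names are assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Count the four cells of the confusion matrix for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
    }

def classification_metrics(c):
    """Accuracy, recall, specificity and precision from the cell counts."""
    return {
        "accuracy": (c["tp"] + c["tn"]) / (c["tp"] + c["tn"] + c["fp"] + c["fn"]),
        "recall": c["tp"] / (c["tp"] + c["fn"]),
        "specificity": c["tn"] / (c["tn"] + c["fp"]),
        "precision": c["tp"] / (c["tp"] + c["fp"]),
    }

# Hypothetical truth and predictions for 10 cases
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(counts)
```

With one missed positive and one false alarm, accuracy is 0.8 while recall and precision are both 0.75. The sketch assumes every denominator is non-zero, so a classifier that never predicts True would need extra handling.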
Area under the ROC curve (AUC)

Key ideas
• AUC is a performance measure for classification problems
• It tells us how well the model is able to distinguish between positives and negatives

Visualization
[Figure: ROC curve compared with the random-chance diagonal, with Sensitivity and Specificity on the axes, each running up to 100%]
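One way to see what "distinguishing between positives and negatives" means is the rank interpretation of AUC: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing every positive with every negative; the scores are invented, and this pairwise form is equivalent to the area under the ROC curve but is not how libraries usually compute it.

```python
def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted probabilities
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
model_auc = auc(y_true, scores)
```

A value of 0.5 corresponds to the random-chance line and 1.0 to a model that ranks every positive above every negative. In this example one positive (0.4) is ranked below one negative (0.5), so the result is 8 out of 9 pairs.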


The F1 score should be used when we have an unbalanced dataset

Confusion Matrix
                   Truth: False      Truth: True
Predicted: False   True negative     False negative
Predicted: True    False positive    True positive

Sensitivity or Recall
• True positive / (true positive + false negative)
• Used when skewed towards False values

Precision
• True positive / (true positive + false positive)
• Used when we are skewed towards True values

F-score
• 2 * (precision * recall) / (precision + recall)
• Used for unbalanced datasets
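A quick numeric illustration of why accuracy can mislead on unbalanced data: the counts below are hypothetical (95 negatives, 5 positives), chosen so that the model looks excellent on accuracy while missing most of the positives.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * (precision * recall) / (precision + recall)

# Hypothetical unbalanced data: 95 negatives and 5 positives.
# The model finds only 1 of the 5 positives and raises 1 false alarm.
tp, fp, fn, tn = 1, 1, 4, 94

accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = f1_score(tp, fp, fn)
```

Accuracy comes out at 0.95, yet the F1 score is below 0.3 because recall is only 0.2. The function assumes at least one true positive; with none, precision and recall would need a defined fallback.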


Random Forest extra Resources

Deep dives

How Many Trees in a Random Forest?

Thais Mayumi Oshiro, Pedro Santoro Perez, and José Augusto Baranauskas

Random forest classifier for remote sensing classification

M. Pal

Real-Time Human Pose Recognition in Parts from Single Depth Images

Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio,
Richard Moore, Alex Kipman, and Andrew Blake

A Random Forest Guided Tour

Gérard Biau and Erwan Scornet


Random Forest Challenge – Extramarital affairs

A Theory of Extramarital Affairs
Key characteristics of cheaters

Challenge
1 Isolate X and Y
2 Transform Y into binary format
3 Create a dummy variable out of the occupation variable
4 Transform X string variables into dummies
5 Perform Random Forest
6 Create Importance drivers

LIME


Interpreting Advanced Machine Learning Models

Problem Statement
• How do we explain Advanced Machine Learning models?
• How can we trust something that does not explain itself?
• From a Data Mining perspective, it feels like a great loss to not be able to take advantage of the newer Data Science algorithms

Introducing LIME
• Local Interpretable Model-agnostic Explanations -> works with most models
• LIME is the application of surrogate models
• Surrogate models are trained to approximate the predictions of the underlying black box model
• LIME is best applied to Classification problems!
• LIME's focus is on explaining individual predictions
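The surrogate idea can be sketched without the LIME package: perturb one instance, query the black box, weight the perturbed points by their closeness to the instance, and fit a simple linear model to them. Everything here is a simplified assumption: the black box is a one-feature toy function, the perturbations lie on an even grid (LIME itself samples randomly), and the kernel width is arbitrary.

```python
import math

def black_box(x):
    """Stand-in for an opaque model; here simply a non-linear function."""
    return x ** 2

def local_surrogate(model, x0, n_samples=201, radius=1.0, kernel_width=0.5):
    """Fit a weighted straight line to the model's predictions near x0."""
    # Perturb the instance on an even grid from x0 - radius to x0 + radius
    offsets = [radius * (2 * i / (n_samples - 1) - 1) for i in range(n_samples)]
    xs = [x0 + o for o in offsets]
    ys = [model(x) for x in xs]                                   # query the black box
    ws = [math.exp(-(o ** 2) / kernel_width ** 2) for o in offsets]  # closer = heavier
    w_sum = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / w_sum
    mean_y = sum(w * y for w, y in zip(ws, ys)) / w_sum
    cov = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = local_surrogate(black_box, x0=3.0)
local_prediction = slope * 3.0 + intercept
```

The fitted slope is close to 6, the local rate of change of the toy model at x = 3, which is the kind of "this feature pushes the prediction up by this much here" statement a local explanation provides. The same line would be a poor description of the model far away from the instance, which is why the explanation is only local.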


LIME explanation example


LIME extra Resources

Deep dives

Model-Agnostic Interpretability of Machine Learning

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin

Statistical stability indices for LIME: obtaining reliable explanations for Machine Learning models
Giorgio Visani, Enrico Bagli, Federico Chesani, Alessandro Poluzzi and Davide Capuzzo

Challenge – Understanding Remote Work predictions

Stack Overflow dataset
Workers' characteristics and job-related queries

Challenge
1 Install LIME
2 Transform string variables
3 Isolate X and Y
4 Perform Random Forest
5 Prepare LIME explainer
6 Use LIME to explain a couple of instances

SHAP


Case Study Briefing – Car prices

Pricing a car
List of cars, their price, and characteristics

Case study
1 Build an XGBoost model and measure its accuracy
2 Use SHAP to get insights

XGBoost and SHAP step by step

Prepare dataset, isolate X and Y

Split into Training and Test Set, and create Matrices

Set Parameters

Run XGBoost

Assess Model

Implement SHAP
XGBoost is a state-of-the-art Machine Learning Algorithm

What is it?
1 Stands for Extreme Gradient Boosting
2 Can be constructed with a tree-based algorithm or a linear one (worse results)
3 It is an ensemble algorithm
4 Each new model is built upon the preceding one -> continuous improvement
5 Can be used for both Regression and Classification
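The "each new model is built upon the preceding one" idea can be sketched with plain gradient boosting on one-split trees (stumps), where every round fits the errors left by the previous rounds. This is not XGBoost itself, which adds regularisation, second-order information and many engineering optimisations; the data, the function names and the parameter values are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split minimising squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def boost(xs, ys, n_rounds=20, eta=0.3):
    """Return a predictor and the training error after every round."""
    base = sum(ys) / len(ys)                  # start from the average outcome
    trees = []
    predictions = [base] * len(ys)
    errors = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)       # new tree targets what is still wrong
        trees.append(tree)
        predictions = [p + eta * tree(x) for x, p in zip(xs, predictions)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))

    def predict(x):
        return base + eta * sum(t(x) for t in trees)

    return predict, errors

# Hypothetical data: one predictor and a continuous outcome
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [15, 22, 25, 34, 30, 41, 45, 52]
predict, errors = boost(xs, ys)
```

The list `errors` never increases from one round to the next, which is the continuous improvement described above. The parameter `eta` plays the role of a learning rate: smaller values make each tree correct less of the remaining error.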


Linear vs Decision Trees

Linear Approach
[Figure: regression line with slope β relating Marketing Costs to Revenue]

Decision Tree
[Figure: tree that first asks "Crust?" (No: not a pie) and then "Borders?" (No: not a pie; Yes: pie)]

XGBoost gives different weights depending on how
difficult it is to predict

Outcome  Predictor  Weight (First Tree)  Weight (Second Tree)  Weight (Third Tree)
1        X          25%                  20%                   23%
0        X          25%                  20%                   15%
0        X          25%                  30%                   35%
1        X          25%                  30%                   27%

XGBoost looks at parts of the observations at a time

Outcome  Predictor  Weight (First Tree)  Weight (Second Tree)  Weight (Third Tree)
1        X1         25%                  20%                   23%
0        X2         25%                  20%                   not used
0        X3         not used             30%                   35%
1        X4         25%                  not used              27%

Key Idea
XGBoost only looks at a fraction of the observations at a time
Observations that are more difficult to predict are given a bigger weight

The logic is similar for Regression-based tasks

First Tree                              Second Tree
Error  Outcome  Predictor  Weight       Error  Outcome  Predictor  Weight
-5     15       X1         33%          -1     19       X1         40%
 2     22       X2         33%          -1     25       X2         30%
 4     34       X4         33%           3     35       X4         35%

XGBoost also gives different weights to different
predictors

First Tree                 Second Tree                Third Tree
Error  Outcome  Weight     Error  Outcome  Weight     Error  Outcome  Weight
-5     15       33%        -1     19       40%        1      21       35%
 2     22       33%        -1     25       30%        0      24       30%
 4     34       33%         3     35       35%        2      36       40%

[Figure: each tree also shows the predictor columns X1, X2 and X3 with column weights: 50% per predictor used in the first and second trees, 40% and 60% in the third tree]

Key Idea
Predictors also have different weights if they yield different model results

XGBoost quirks

Which?
NA:
Unlike other regression models, XGBoost treats NAs as information

Non-linearity:
XGBoost is excellent at dealing with non-linear relationships between the dependent and the independent variables.

XGBoost has 7 main tuning parameters

Parameter: Description

Minimum child weight: Minimum sum of the observation weights required in a child node. Low values allow nodes that contain only a few observations

ETA: Learning rate. How fast do you want the model to learn?

Max depth: How big should the tree be? Bigger trees go into more detail

Gamma: Minimum loss reduction required to split a node further. Higher values make the tree split less readily

Subsample: Share of observations used in each tree

Colsample by tree: Share of predictors (columns) sampled for each tree

Number of rounds: How many times do we want the analysis to be run?

Mean Absolute Error (MAE) vs Root Mean Squared Error (RMSE)

Key ideas
• MAE and RMSE are performance indicators for Regression models with continuous dependent variables

$$MAE = \frac{\sum |y - \hat{y}|}{n} \qquad RMSE = \sqrt{\frac{\sum (\hat{y} - y)^2}{n}}$$

• RMSE is quite useful for models with extremes / outliers, because large errors are penalised more heavily
• MAE is more interpretable.

Visualization
[Figure: data points and the fitted Model line on axes X and Y]
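A small numeric comparison shows how the two indicators react to an outlier. The two sets of predictions below are invented so that they share the same MAE while differing in RMSE.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

y_true = [10, 20, 30, 40]
small_errors = [12, 18, 32, 38]   # every prediction is off by 2
one_outlier = [10, 20, 30, 48]    # three perfect predictions, one off by 8

mae_small, rmse_small = mae(y_true, small_errors), rmse(y_true, small_errors)
mae_outlier, rmse_outlier = mae(y_true, one_outlier), rmse(y_true, one_outlier)
```

Both models have an MAE of 2, but the RMSE doubles from 2 to 4 for the model with the single large miss. That sensitivity is why RMSE is the indicator to watch when large errors are especially costly.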


Introduction to SHAP

• SHapley Additive exPlanations were introduced by Lundberg and Lee (2016)

• SHAP aims to explain each instance by computing the marginal contribution of each feature to the prediction

• SHAP computes each value using coalitional game theory
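The game-theory idea can be shown exactly on a tiny example: a feature's Shapley value is its marginal contribution averaged over every order in which the features could be added. The coalition prices below are invented for a two-feature car-price model, and the brute-force enumeration is only feasible for a handful of features; the SHAP library relies on faster approximations and model-specific algorithms.

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = frozenset()
        for f in order:
            with_f = coalition | {f}
            totals[f] += value(with_f) - value(coalition)   # marginal contribution
            coalition = with_f
    return {f: t / len(orderings) for f, t in totals.items()}

# Hypothetical predicted price (in thousands) when only the listed features are
# known; the empty coalition stands for the average prediction.
coalition_price = {
    frozenset(): 20.0,
    frozenset({"age"}): 16.0,
    frozenset({"horsepower"}): 26.0,
    frozenset({"age", "horsepower"}): 24.0,
}
phi = shapley_values(["age", "horsepower"], coalition_price.__getitem__)
```

In this example age pulls the price down by 3 and horsepower pushes it up by 7. The two values add up to the gap between this car's prediction (24) and the average prediction (20), which is the additive property the method is named after.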


There are 3 main areas of insights

Global Interpretability
• The SHAP values can show how much each predictor contributes, either positively or negatively, to the target variable

Local Interpretability
• Each observation gets its own set of SHAP values
• We can explain why a case receives its prediction and the contributions of the predictors

Dependency Plots
• Show the relation between an independent variable and the output
• Also show how the predictor interacts with its closest independent variable

XGBoost and SHAP extra Resources

Deep dives

XGBoost: A Scalable Tree Boosting System

Tianqi Chen and Carlos Guestrin

A Unified Approach to Interpreting Model Predictions

Scott M. Lundberg and Su-In Lee

Toward safer highways, application of XGBoost and SHAP for real-time accident detection and feature analysis
Amir Bahador Parsa, Ali Movahedi, Homa Taghipour, Sybil Derrible, and Abolfazl (Kouros) Mohammadian

Challenge – Understanding house price drivers

Dataset with house characteristics and prices

Challenge
1 Install SHAP and import libraries
2 Transform string variables
3 Isolate X and Y, and generate XGBoost matrix
4 Set parameters and run XGBoost
5 Local interpretability
6 Dependency plots
7 Global interpretability

7 pages
AIML Expt
No ratings yet
AIML Expt
7 pages
AIDS - DM Using Python - Lab Programs
No ratings yet
AIDS - DM Using Python - Lab Programs
19 pages
Chapter 4 Data Mining
No ratings yet
Chapter 4 Data Mining
5 pages
Spreadsheet Modeling & Decision Analysis: A Practical Introduction To Business Analytics
No ratings yet
Spreadsheet Modeling & Decision Analysis: A Practical Introduction To Business Analytics
35 pages
Chapter 01 2
No ratings yet
Chapter 01 2
19 pages
Dr. Gaurav Dixit: Department of Management Studies
No ratings yet
Dr. Gaurav Dixit: Department of Management Studies
26 pages
Analytics Boot Camp
No ratings yet
Analytics Boot Camp
126 pages
SPSS Prgms
No ratings yet
SPSS Prgms
25 pages
Basic Concept of Classification (Data Mining)
No ratings yet
Basic Concept of Classification (Data Mining)
11 pages
DM Unit 2-5 Notes
No ratings yet
DM Unit 2-5 Notes
78 pages
(eBook PDF) Essentials of Business Analytics 2nd Edition pdf download
100% (1)
(eBook PDF) Essentials of Business Analytics 2nd Edition pdf download
51 pages
1. DADV_Lab_Subject_303105315
No ratings yet
1. DADV_Lab_Subject_303105315
35 pages
1152CS239-Intro. To Data Science-Syllabus
No ratings yet
1152CS239-Intro. To Data Science-Syllabus
6 pages
data-mining-lab-manual-CSE-VII-Sem
No ratings yet
data-mining-lab-manual-CSE-VII-Sem
63 pages
DATA SCIENCE With DA, ML, DL, AI Using Python & R PDF
100% (1)
DATA SCIENCE With DA, ML, DL, AI Using Python & R PDF
10 pages
(eBook PDF) Essentials of Business Analytics 2nd Editioninstant download
100% (2)
(eBook PDF) Essentials of Business Analytics 2nd Editioninstant download
48 pages
DS assignment COMPLETED DOC
No ratings yet
DS assignment COMPLETED DOC
11 pages
Download full Data Mining for Business Intelligence Concepts Techniques and Applications in Microsoft Office Excel r with XLMiner r 2nd ed Edition Patel ebook all chapters
100% (6)
Download full Data Mining for Business Intelligence Concepts Techniques and Applications in Microsoft Office Excel r with XLMiner r 2nd ed Edition Patel ebook all chapters
61 pages
SPSS Prgms
No ratings yet
SPSS Prgms
17 pages
04 DS 2023
No ratings yet
04 DS 2023
63 pages
ssrn-3526707
No ratings yet
ssrn-3526707
5 pages
Module 1
No ratings yet
Module 1
91 pages
01 Intro To Data Mining
No ratings yet
01 Intro To Data Mining
32 pages
Module-1 C1-C2
No ratings yet
Module-1 C1-C2
39 pages
Machine Learning Summer Training
No ratings yet
Machine Learning Summer Training
118 pages
Python Programming: General-Purpose Libraries; NumPy,Pandas,Matplotlib,Seaborn,Requests,os & sys: Python, #2
From Everand
Python Programming: General-Purpose Libraries; NumPy,Pandas,Matplotlib,Seaborn,Requests,os & sys: Python, #2
e3
No ratings yet
Mastering Parallel Programming with R
From Everand
Mastering Parallel Programming with R
Thorsten Forster
No ratings yet
CO PO PSO Statements
No ratings yet
CO PO PSO Statements
911 pages
Statistics for Engineers and Scientists 5th Edition William Navidi - The latest updated ebook version is ready for download
No ratings yet
Statistics for Engineers and Scientists 5th Edition William Navidi - The latest updated ebook version is ready for download
49 pages
MINITAB 14 Supplement For Biostatistics For Health Sciences
No ratings yet
MINITAB 14 Supplement For Biostatistics For Health Sciences
87 pages
Econometrics I: Problem Set II: Prof. Nicolas Berman November 30, 2018
No ratings yet
Econometrics I: Problem Set II: Prof. Nicolas Berman November 30, 2018
4 pages
Regresi Logistik - Bahan
No ratings yet
Regresi Logistik - Bahan
89 pages
Assignment 4 Area Under The Stanard Normal Curve
No ratings yet
Assignment 4 Area Under The Stanard Normal Curve
7 pages
2.conditional Probability and Bayes Theorem
No ratings yet
2.conditional Probability and Bayes Theorem
68 pages
CLSSGB Exam Mock - Sample Paper
No ratings yet
CLSSGB Exam Mock - Sample Paper
44 pages
R Programming On Abalone Dataset
100% (1)
R Programming On Abalone Dataset
12 pages
Bernoulli PDF
No ratings yet
Bernoulli PDF
19 pages
Exercise 1:: Chapter 3: Describing Data: Numerical Measures
100% (1)
Exercise 1:: Chapter 3: Describing Data: Numerical Measures
11 pages
Effect of Teamwork On Employee Performance: Cite This Paper
No ratings yet
Effect of Teamwork On Employee Performance: Cite This Paper
18 pages
Chap15 Statistical Quality Control
No ratings yet
Chap15 Statistical Quality Control
111 pages
Quantitative Research in Public Administration PA-224
No ratings yet
Quantitative Research in Public Administration PA-224
3 pages
DEM 1110_2015_DRAFT NOTES_POINTERS- statistics
No ratings yet
DEM 1110_2015_DRAFT NOTES_POINTERS- statistics
52 pages
Get Data Visualization: Exploring and Explaining with Data 1st Edition Jeffrey D. Camm - eBook PDF PDF ebook with Full Chapters Now
100% (7)
Get Data Visualization: Exploring and Explaining with Data 1st Edition Jeffrey D. Camm - eBook PDF PDF ebook with Full Chapters Now
69 pages
3 Bpa MV Regression Reference Guide May2012 Final
No ratings yet
3 Bpa MV Regression Reference Guide May2012 Final
58 pages
BBA Syllabus 2021 Revised
No ratings yet
BBA Syllabus 2021 Revised
39 pages
Business Analytics For Management Decision
No ratings yet
Business Analytics For Management Decision
7 pages
الأسئلة11
No ratings yet
الأسئلة11
4 pages
SSC CGL Guide
No ratings yet
SSC CGL Guide
95 pages
Lean 6 Sigma Formulas
No ratings yet
Lean 6 Sigma Formulas
30 pages
3 Variation Reduction Overview
No ratings yet
3 Variation Reduction Overview
37 pages
3334 Exam Cheat Sheet
No ratings yet
3334 Exam Cheat Sheet
26 pages
Amiblu Stream Magazine November19
No ratings yet
Amiblu Stream Magazine November19
18 pages
Preferred Types of Marketing Strategies and Types of Senior High School Consumers in Central Colleges of The Philippines
No ratings yet
Preferred Types of Marketing Strategies and Types of Senior High School Consumers in Central Colleges of The Philippines
76 pages
1997 - 3 - Structural Integrity Assessment Procedures For
No ratings yet
1997 - 3 - Structural Integrity Assessment Procedures For
92 pages
Parametric Statistics
No ratings yet
Parametric Statistics
2 pages
Implementation of Senior High School Work Immersion Classroom and Field - TABAMO
No ratings yet
Implementation of Senior High School Work Immersion Classroom and Field - TABAMO
8 pages
Discrete Simulation
No ratings yet
Discrete Simulation
26 pages