
Data Mining For Business in Python Deck



Data Mining for

Business in Python

Data Mining for Business in Python 2021


1 Survival Analysis

2 Cox Proportional Hazard

3 CHAID

4 Gaussian Mixture Model

5 Dimension Reduction

6 Association Rule Learning

7 Random Forest

8 LIME

9 XGBoost and SHAP



Survival Analysis



Introduction to Survival Analysis

Context
• Survival analysis is very common in subscription-type businesses and well suited to studying customer churn
• Imagine a customer decides to cancel their subscription. How long do you wait until you try to win that customer back?
  • 1 day
  • 1 week
  • 1 month

Visualization
[Figure: share of customers that have not renewed (starting at 100%) plotted against time]



Case Study Briefing

Lung Cancer
Survival in patients with advanced lung cancer from the North
Central Cancer Treatment Group.
Case study1

• Determine the survival curve through the Kaplan-Meier estimator

• Understand differences between Males and Females

1: Author Terry M Therneau [aut, cre], Thomas Lumley [ctb, trl] (original S->R port and R maintainer until 2009), Atkinson Elizabeth [ctb], Crowson Cynthia [ctb]



Survival Analysis Step by Step

Prepare Dataset

Perform Survival Curve

Visualize and Interpret Results

Perform Log Rank Test (if the case requires it)



Kaplan-Meier Estimator

Explanation
• Non-parametric statistic used to estimate the survival function (probability of a person surviving) from lifetime data.
• In medical research, it is often used to measure the fraction of patients living for a specific time after treatment or diagnosis.

Formula
$$S(t_i) = S(t_{i-1}) \cdot \left(1 - \frac{d_i}{n_i}\right)$$

Where:
$S(t_i)$ = probability of survival at time $t_i$
$d_i$ = number of events at time $t_i$
$n_i$ = number of survivors still at risk just before time $t_i$
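The recursion in the Kaplan-Meier formula can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's implementation: the function name `kaplan_meier` and the toy churn data are assumptions made for this example, and subjects censored at a given time are counted as still at risk at that time.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(t_i, S(t_i))] for each distinct event time, in increasing order.

    durations: how long each subject was followed
    observed:  True if the event happened at that time, False if censored
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    survival = 1.0
    curve = []
    for t in sorted(events):
        n_i = sum(1 for d in durations if d >= t)  # still at risk just before t
        d_i = events[t]                            # events exactly at t
        survival *= 1 - d_i / n_i                  # S(t_i) = S(t_{i-1}) * (1 - d_i / n_i)
        curve.append((t, survival))
    return curve

# Hypothetical toy data: months until churn for 6 customers (False = still subscribed)
durations = [1, 2, 2, 3, 4, 5]
observed = [True, True, False, True, False, True]
curve = kaplan_meier(durations, observed)
```

With this toy data the curve drops at months 1, 2, 3 and 5. The two censored customers never trigger a drop, but they still count in the number at risk until they leave observation.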

Censoring

Types
Right Censoring:
The subject under observation is still alive when the study ends, so we do not know when the event of interest (death) occurs.

Left Censoring:
The event cannot be observed directly because it had already happened before recording started.

Interval Censoring:
We only have data for specific intervals, so we know the event of interest occurred between two observations but not exactly when.



Log Rank Test

Context

Goal:
To test whether there are statistically significant differences in the survival distributions of two or more groups

Null Hypothesis:
There is no difference between the groups

If p-value > 0.05:
We cannot reject the null hypothesis, so there is no evidence of a difference between the groups
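To make the test concrete, here is a minimal sketch of a two-group log-rank test in plain Python. It is a hand-rolled illustration under stated assumptions, not a library API: the function name `logrank_test` and the toy data are hypothetical, and the p-value uses the chi-square distribution with one degree of freedom, which only applies to the two-group case.

```python
import math

def logrank_test(durations_a, observed_a, durations_b, observed_b):
    """Two-group log-rank test. Returns (chi-square statistic, p-value)."""
    event_times = sorted(
        {t for t, e in zip(durations_a, observed_a) if e}
        | {t for t, e in zip(durations_b, observed_b) if e}
    )
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in durations_a if d >= t)  # at risk in group A
        n_b = sum(1 for d in durations_b if d >= t)  # at risk in group B
        d_a = sum(1 for d, e in zip(durations_a, observed_a) if e and d == t)
        d_b = sum(1 for d, e in zip(durations_b, observed_b) if e and d == t)
        n, d = n_a + n_b, d_a + d_b
        # Under the null hypothesis, group A expects its share of the events
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    statistic = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(statistic / 2))  # chi-square tail, 1 degree of freedom
    return statistic, p_value

# Hypothetical toy data: group A fails early, group B fails late (all events observed)
stat, p = logrank_test([1, 2, 3, 4, 5], [True] * 5, [10, 11, 12, 13, 14], [True] * 5)
same_stat, same_p = logrank_test([1, 2, 3], [True] * 3, [1, 2, 3], [True] * 3)
```

In the first call the p-value falls below 0.05, so the null hypothesis of equal survival is rejected. In the second call the groups are identical, the statistic is 0 and the p-value is 1.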



Survival Analysis extra Resources

Deep dives

Book
Survival Analysis
Stephen P. Jenkins



Challenge – Electronic components

Experiment
In 1988, an experiment was designed and implemented at one of AT&T's factories.

The goal was to investigate alternatives in the "wave soldering" procedure for mounting electronic components to printed circuit boards.

The response is the number of visible solder skips.

Output
1 Transform the Solder variable into 1 and 0
2 Fit the Kaplan-Meier estimator
3 Plot survival curves
4 Do a log-rank test for the Panel variable. Use multivariate_logrank_test. Or be creative :D

Dataset source: Survival package from CRAN



Cox Proportional
Hazard Regression



Cox Proportional Hazard Regression

Explanation
• Survival analysis with Kaplan-Meier curves does not allow other predictors
• At best, you can split into groups by gender, age, etc. and perform a log-rank test
• Thus, Cox Proportional Hazard regression helps to determine the relationship between the survival time of a subject and one or more predictor variables
• $\exp(b_n)$ are called the Hazard Ratios (HR)

Formula
$$h(t) = h_0(t) \cdot \exp(b_1 x_1 + \dots + b_n x_n)$$

Where:
$h_0(t)$ = baseline hazard
$b_n$ = impact coefficients
$x_n$ = covariates

Result interpretation:
HR > 1: increase in hazard
HR < 1: decrease in hazard
HR = 1: neutral
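The link between coefficients and hazard ratios can be sketched as follows. The coefficient values and covariate names are hypothetical, chosen only to show one case per interpretation rule. They are not results from the lung cancer case study, and the helper names are assumptions for this example.

```python
import math

# Hypothetical fitted coefficients b_n (illustrative values only)
coefficients = {"age": 0.02, "sex_female": -0.5, "treatment": 0.0}

def interpret(hr, tol=1e-9):
    """Apply the interpretation rule: HR > 1 increase, HR < 1 decrease, HR = 1 neutral."""
    if abs(hr - 1) <= tol:
        return "neutral"
    return "increase" if hr > 1 else "decrease"

def relative_hazard(x, coefficients):
    """h(t) / h0(t) = exp(b_1 * x_1 + ... + b_n * x_n); the baseline hazard cancels out."""
    return math.exp(sum(coefficients[name] * value for name, value in x.items()))

# The hazard ratio of a covariate is exp(b_n)
hazard_ratios = {name: math.exp(b) for name, b in coefficients.items()}
interpretation = {name: interpret(hr) for name, hr in hazard_ratios.items()}

# A subject with all covariates at zero sits exactly on the baseline hazard
baseline_ratio = relative_hazard({"age": 0, "sex_female": 0, "treatment": 0}, coefficients)
```

Because the baseline hazard cancels in the ratio, hazard ratios can be read without ever estimating $h_0(t)$ itself.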

Cox Proportional Hazard Regression Step by Step

Prepare Dataset

Perform Cox Proportional Hazard Regression

Visualize and Interpret Results



Case Study Briefing

Lung Cancer
Survival in patients with advanced lung cancer from the North
Central Cancer Treatment Group.
Case study1
1 Driver Analysis with Cox Proportional Hazard

2 Visualize Results

1: Author Terry M Therneau [aut, cre], Thomas Lumley [ctb, trl] (original S->R port and R maintainer until 2009), Atkinson Elizabeth [ctb], Crowson Cynthia [ctb]



Cox Proportional Hazard extra Resources

Deep dives

Time-dependent covariates in the Cox proportional-hazards regression model
Lloyd D. Fisher and D. Y. Lin, 1999

Cox Proportional-Hazards Regression for Survival Data in R
John Fox & Sanford Weisberg



Challenge – Veteran Lung Cancer A/B test

Experiment
Randomised trial of two treatment regimens for lung cancer.

Challenge1
1 Transform Cell Type into dummy variables.
Use pd.get_dummies or drop the variable

2 Cox Proportional Hazard Regression

3 Plot CPH results

Dataset source: Survival package from CRAN



CHAID



Case Study Briefing

Labor Market Ethnic Discrimination


Cross-section data about resume, call-back and employer information

Case study1
4,870 fictitious resumes sent in response to employment advertisements in Chicago and Boston in 2001

The resumes contained information concerning the ethnicity of the applicant.

1: Bertrand, M. and Mullainathan, S. (2004). Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. American Economic Review, 94, 991–1013.

CHAID Step by Step

Variable selection

Transforming continuous variables into categorical

Build your first tree

Prune it for better interpretability



Factors influencing call-backs
[Figure: four grids rating call-back influence as high, medium or low for pairs of resume factors: employment gaps (yes/no) by honorary degree (yes/no), computer skills (yes/no) by college (yes/no), resume quality (yes/no) by work experience (high/low), and military experience (yes/no) by special skills (yes/no)]



Complexity increases as you dive deeper into your problem

Why?
Problem depth:
Having more than 20, 50 or 100 drivers increases the complexity

Importance:
How do you know which driver actually matters most?

Relevance:
Some variables might be relevant in combination with some other variables, but not all



One of CHAID's benefits is that it figures out which drivers are more important

Which?
Importance ranking:
CHAID figures out which drivers matter more by doing significance tests

Segmented Driver Analysis:
CHAID will segment the population and perform driver analysis for each segment

Interpretability:
CHAID provides easy-to-read graphs with customer segments



Let's see how it works visually
[Figure: example CHAID tree of call-back results, with splits on Quality (high/low), Education, Computer (yes/no), Military (yes/no), Honors (yes/no), Emp holes (yes/no), Experience (low/high) and Jobs (high/low)]

How CHAID processes

What does it do?
CHAID looks at all predictors and tries to find the one where the "yes" is most different from the "no"

How does it work?
CHAID performs a Chi-square test. It shows whether the frequencies of the categorical variables are different or not. It plays a role similar to a t-test, but it compares observed and expected frequencies rather than means, which makes it ideal for categorical variables.

And then?
After it finds the first segment split, it tries to find the next split where the "yes" differs most from the "no"

[Figure: cross-table of "Is called" (yes/no) by "Has honors" (yes/no); then, of the people who have honors, "Is called" (yes/no) by "IT skills" (yes/no)]
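The split search can be sketched with a plain-Python chi-square statistic. This is a simplified illustration, not a full CHAID implementation: real CHAID compares adjusted p-values and can merge categories, whereas this sketch only ranks binary predictors by their raw chi-square statistic (equivalent here because all tables have the same degrees of freedom). The function names and the toy resume data are assumptions.

```python
def contingency(predictor, outcome):
    """Cross-tabulate two categorical lists into a table of counts (list of rows)."""
    levels_p = sorted(set(predictor))
    levels_o = sorted(set(outcome))
    return [
        [sum(1 for p, o in zip(predictor, outcome) if p == lp and o == lo) for lo in levels_o]
        for lp in levels_p
    ]

def chi_square(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def best_split(predictors, outcome):
    """Pick the predictor whose 'yes' differs most from its 'no' on the outcome."""
    scores = {name: chi_square(contingency(values, outcome)) for name, values in predictors.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy data: 8 resumes
called = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
predictors = {
    "honors": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "it_skills": ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
}
best, scores = best_split(predictors, called)
```

Here `honors` wins the first split because call-backs are unevenly distributed across its levels, while `it_skills` shows no difference at all. The same search would then be repeated inside each resulting segment.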

Last few things to consider

Which?
Tree size:
You can choose how many levels the tree will have

Bucket size:
You can choose a minimum size threshold that you want your buckets to have

Continuous variables:
CHAID accepts only categorical variables, so continuous variables have to be transformed into categories first



CHAID extra Resources

Deep dives

A CHAID Based Performance Prediction Model in Educational Data Mining
M. Ramaswami and R. Bhaskaran, 2010

Tree Structured Data Analysis: AID, CHAID and CART
Leland Wilkinson, 1992



Challenge – Police Racial Bias

Description
You have been hired to investigate vehicle searches by the police and whether there is racial bias

Challenge1
1 Create a dataset with these 5 variables: problem, vehicleSearch, race, gender, policePrecinct
2 Transform string variables into dummy variables
3 Get the names of the dependent and independent variables
4 Perform CHAID and visualize. Set max depth to 2

Dataset source: carStops package from CRAN



Clustering:
Gaussian Mixture
Model



Case Study Briefing – Country Segmentation

Socio-Economic Data
Data with country socio-economic data

Case study1
1 Find the optimal number of clusters
2 Visualize the optimal number of clusters
3 Create clusters
4 Interpret the clusters



What are clustering techniques?

Key ideas
• Groups observations in terms of their characteristics
• Main task of exploratory data mining
• Clustering is more of an art than a science

Visualization
[Figure: scatter plot on axes X1 and X2]



Gaussian Mixture Model

Key ideas
• Gaussian Mixture Model is a probabilistic method for clustering
• Often a better choice than traditional clustering algorithms, like K-means
• The cluster membership probabilities make it easier to evaluate edge cases



Gaussian Mixture Model vs. Kmeans

Key ideas
• No need to standardize data
• The cluster sizes do not have specific structures that might or might not apply
• Faster to compute
• Poor at dealing with a low amount of data points

Visualization
[Figure: K-means and Gaussian Mixture Model clusters side by side, each on axes X1 and X2]

Akaike's Information Criterion (AIC) and Bayesian Information Criterion (BIC)

Key Ideas
• AIC and BIC help us determine the optimal number of clusters
• AIC and BIC provide a means to select a model
• Trade-off between simplicity and goodness of fit
• Deal with overfitting and underfitting

Pseudo-visualization
[Figure: goodness of fit plotted against simplicity]
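The trade-off can be made concrete with the standard definitions AIC = 2k − 2 ln(L) and BIC = k ln(n) − 2 ln(L), where k is the number of free parameters, n the number of observations and ln(L) the log-likelihood. The sketch below applies them to hypothetical log-likelihoods for mixtures with 1 to 4 clusters. The numbers are made up for illustration, and the parameter counts assume two features with full covariance matrices.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: lower is better, penalizes parameters more for large n."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical results: number of clusters -> (total log-likelihood, free parameters)
candidates = {1: (-520.0, 5), 2: (-470.0, 11), 3: (-462.0, 17), 4: (-459.0, 23)}
n_obs = 150

scores = {k: (aic(ll, p), bic(ll, p, n_obs)) for k, (ll, p) in candidates.items()}
best_aic = min(scores, key=lambda k: scores[k][0])
best_bic = min(scores, key=lambda k: scores[k][1])
```

In this made-up example AIC prefers 3 clusters while BIC prefers 2, which shows that BIC leans further towards simplicity. When the two disagree, the final choice remains a judgment call.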



Gaussian Mixture Model Step by Step

Prepare Dataset

Find Optimal Clusters

Perform Gaussian Mixture Model

Interpret results



Gaussian Mixture Model extra Resources

Deep dives

On the Number of Components in a Gaussian Mixture Model
Geoffrey J. McLachlan, Suren Rathnayake

The Infinite Gaussian Mixture Model
Carl Edward Rasmussen, 2000



Challenge – Wine Quality

Description
You are a wanna be Wine Connoisseur, trying to find the best
wines for your parties using Data Mining
Challenge1
1 Determine the Optimal number of Clusters

2 Perform Gaussian Mixture Model

3 Interpret Results
1: Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez
A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region (CVRVV), Porto, Portugal
@2009

Dimension Reduction



Dimension Reduction Goal

[Figure: variance of a data set with 4 independent variables, next to the variance of the components after dimension reduction]



You have more information than you need

Dimension Reduction helps to solve

Problem statement
1 Multicollinearity issues
2 Computational issues of a large number of predictors
3 Noisy models due to overfitting
4 Create new variables (called components)
5 Pre-processing data for predictive models or forecasting



What is Principal Component Analysis?

Key Ideas
• An algorithm for Dimension Reduction
• Linearly transforms variables into components
• The number of components to keep can be determined by the percentage of variance explained
• Choosing components is more of an art than a science



PCA vs Manifold

Key ideas
• There are inherent curves in the relationship among the data that carry information
• Methods like PCA cannot absorb that information because of their linearity
• No need to standardize data
• Con: Manifold is less interpretable than PCA
• Con: No good quantitative way of determining components
• There are several algorithms for Manifold. We will use t-SNE

Visualization
[Figure: PCA and Manifold fits compared on axes X1 and X2]

Pros and Cons of t-SNE (t-Distributed Stochastic Neighbor Embedding)

Pros
1 Excellent in high dimensional datasets
2 Focuses on preserving local structures
3 Easy implementation

Cons
1 Very computationally intensive



Dimension Reduction extra Resources

Deep dives

Principal component analysis
Hervé Abdi and Lynne J. Williams

What is principal component analysis?
Markus Ringnér

Algorithms for manifold learning
Lawrence Cayton

Large-Scale Manifold Learning
Ameet Talwalkar, Sanjiv Kumar, and Henry Rowley



Challenge - Abalone

Description
The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, a time-consuming task. Other measurements, which are easier to obtain, are used to predict the age.

Challenge1
1 Transform gender variable and remove rings variable
2 Perform Correlation Matrix and Standardize data
3 Find Optimal Number of Clusters
4 Perform PCA and interpret components
5 Perform t-SNE and visualize results

1: Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994) "The Population Biology of Abalone (_Haliotis_ species) in Tasmania. I. Blacklip Abalone (_H. rubra_) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288)

Association Rule
Learning – Apriori



Case Study Briefing - Groceries

Transaction Data
Data on customer grocery shopping purchases

Case study1
1 We have a file with almost 10k transactions
2 We need to find patterns in our data to maximize baskets
3 Perform Association Rule Learning

1: Michael Hahsler, Kurt Hornik, and Thomas Reutterer (2006) Implications of probabilistic data modeling for mining association rules. In M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nuernberger, and W. Gaul, editors, From Data and Information Analysis to Knowledge Engineering, Studies in Classification, Data Analysis, and Knowledge Organization, pages 598–605. Springer-Verlag.

Association Rule Learning Step by Step

Prepare Dataset

Define Support

Define Confidence

Execute Association Rules Learning

Visualize Results



The output of Association Rule Learning Algorithm

If... Then...

Game of Thrones → Lord of the rings

Burger → Fries

Jay-Z → Kanye

Key Ideas
The output is an If…then… type of analysis
Association Rule Learning is a very simple recommender system



Concepts you need to know - Support

Methodological background
$$\text{Support}(\text{Burgers}) = \frac{\#\,\text{Transactions with Burger}}{\text{Total Transactions}}$$

To consider
• It does not matter if Burgers happen more than once per transaction
• Support indicates the relevance of the item



Burger Support Visualization

Data
• Population is 20
• 6 people like burgers

Formula
$$\text{Support}(\text{Burgers}) = \frac{6}{20} = 30\%$$



Mayo Support Visualization

Data
• Population is 20
• 4 people like Mayo

Formula
$$\text{Support}(\text{Mayo}) = \frac{4}{20} = 20\%$$



Concepts you need to know - Confidence

Methodological background
$$\text{Confidence}(\text{Mayo} \mid \text{Burgers}) = \frac{\#\,\text{Transactions with Burger \& Mayo}}{\text{Total Burger Transactions}}$$

To consider
• It does not matter if Burgers or Mayo happen more than once per transaction
• Confidence indicates the strength of the relationship



Confidence Visualization

Data
• Population is 20
• 6 people like burgers
• Of the 6, 2 like Mayo

Formula
$$\text{Confidence}(\text{Mayo} \mid \text{Burgers}) = \frac{2}{6} = 33\%$$

Concepts you need to know - Lift

Methodological background
$$\text{Lift}(\text{Mayo} \mid \text{Burgers}) = \frac{\text{Confidence}(\text{Mayo} \mid \text{Burgers})}{\text{Support}(\text{Mayo})}$$

Key idea
• Lift measures the likelihood of buying Mayo and Burgers together vs. just buying Mayo
• Lift bigger than 1 means increased likelihood to buy
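Support, confidence and lift can be reproduced with a few lines of plain Python. The sketch below rebuilds the burger and mayo example as a hypothetical list of 20 transactions (6 with burgers, 4 with mayo, 2 with both). The other items and the function names are assumptions added for illustration.

```python
def support(transactions, items):
    """Share of transactions containing all the given items (repeats count once)."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions with the antecedent, the share that also has the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to how common the consequent is overall."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical baskets matching the counts in the example above
transactions = (
    [["burger", "mayo"]] * 2
    + [["burger", "burger", "fries"]] * 4  # a repeated item still counts once
    + [["mayo", "salad"]] * 2
    + [["salad"]] * 12
)

support_burger = support(transactions, ["burger"])
support_mayo = support(transactions, ["mayo"])
confidence_mayo = confidence(transactions, ["burger"], ["mayo"])
lift_mayo = lift(transactions, ["burger"], ["mayo"])
```

With these counts the lift of Mayo given Burgers is 33% / 20%, roughly 1.67. That is bigger than 1, so burger buyers are more likely than the average customer to buy mayo.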



Apriori is an Association Rule Learning Algorithm

Key characteristics

What is it?
1 Mines frequent itemsets for Boolean Association Rules
2 Works by finding items that have occurred a minimum number of times (support)
3 And the corresponding itemsets that pass a certain cut-off (confidence)

Limitations
1 Slow in processing itemsets
2 Only allows Boolean values



Association Rule Learning extra Resources

Deep dives

Online Association Rule Mining
Christian Hidber

Association Rule Mining: A Survey
Qiankun Zhao and Sourav S. Bhowmick

Algorithms for Association Rule Mining – A General Survey and Comparison
Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh



Challenge - NYC restaurants cuisine, borough and
sanitary grade

Description
You have a dataset with NYC restaurants, their boroughs and sanitary grade

Challenge1
1 Create a list with the transactions
2 Encode the transaction list into a Dataframe
3 Perform Association Rules Learning. Play around with support and confidence
4 Visualize the results



Random Forest



You were hired to figure out the main drivers of customers signing up for a savings account in a bank

Problem relevance
Customer churn:
Calling a customer who cannot sign up can lead them to unsubscribe

Opportunity cost:
Sending the wrong product for the customer to sign up for can create a loss in case the customer would have been interested in signing up for another

Relevance:
Continuously sending information that the customer is not interested in can potentially lead to a lower open rate in the future

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

Random Forest Step by Step

Prepare Dataset

Split into training and test set

Perform Random Forest

Predict using the Random Forest

Model Assessment

Execute Driver Importance


Random Forest is an Ensemble Learning Algorithm

Description: What is it?
1 Ensemble Learning is when you have a plurality of models predicting your output
2 In simple words, an ensemble is an average of models
3 A Random Forest is a combination of decision trees



How do Decision trees work?

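Visualization
[Figure: scatter plot on axes x1 and x2 with thresholds at x1 = 10, x1 = 70 and x2 = 50, next to a decision tree that first splits on X2 < 50 and then on X1 > 10 and X1 > 70, with leaves such as Blue]

Key Ideas:
• Each split is chosen with a maximum entropy reduction logic: where would a split yield more information?
• The prediction in a leaf is based on the relative frequency of the classes in it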
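The "where would it yield more information" logic can be sketched with entropy and information gain in plain Python. This is a minimal illustration, not how a particular library builds its trees: the function names, the toy points and the second class label "red" are assumptions made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: 0 for a pure group, 1 for a 50/50 binary group."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature, threshold):
    """Entropy reduction from splitting on feature < threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] < threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] >= threshold]
    if not left or not right:
        return 0.0  # the split does not separate anything
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

def leaf_prediction(labels):
    """Predict the class with the highest relative frequency in the leaf."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical toy data: four points with two features
rows = [{"x1": 5, "x2": 20}, {"x1": 8, "x2": 30}, {"x1": 60, "x2": 70}, {"x1": 80, "x2": 90}]
labels = ["blue", "blue", "red", "red"]

gain_x2 = information_gain(rows, labels, "x2", 50)  # separates the classes perfectly
gain_x1 = information_gain(rows, labels, "x1", 6)   # leaves one side mixed
```

The split on x2 < 50 removes all uncertainty, so a tree would choose it over the split on x1 < 6.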



Random Forest is an Ensemble Learning Algorithm

Description: What is it?
1 Ensemble Learning is when you have a plurality of models predicting your output
2 In simple words, an ensemble is an average of models
3 A Random Forest is a combination of decision trees
4 Can be used for Regression and Classification problems
5 Random Forests can still overfit, although averaging many trees makes this less likely than with a single decision tree



Let‘s imagine this is our full data set

Description



Splitting between training and test enables an unbiased
model assessment

Training Set Test Set

Model Assessment



The Confusion Matrix allows us to assess the results of a classifier

Confusion Matrix
                  Truth: False      Truth: True
Predicted False   True negative     False negative
Predicted True    False positive    True positive

Accuracy
• Accuracy = (True positive + True negative) / All
• Used when we have a balanced dataset

Sensitivity or Recall
• True positive / (True positive + False negative)
• Used when skewed towards False values

Specificity or True Negative Rate
• True negative / (True negative + False positive)
• Used when skewed towards True values

Precision
• True positive / (True positive + False positive)
• Used when skewed towards False values

Area under the ROC curve (AUC)

Visualization
[Figure: ROC curve plotting Sensitivity (up to 100%) against Specificity (up to 100%), with a random chance reference line]

Key ideas
• AUC is a performance measure for classification problems
• It tells us how well the model is able to distinguish between positives and negatives
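One way to see what AUC measures is its pairwise reading: the share of (positive, negative) pairs in which the positive case receives the higher score, with ties counting as half. The sketch below computes AUC that way in plain Python. The function name and the toy scores are assumptions, and for large datasets a library routine would be the practical choice.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y]
    neg = [s for y, s in zip(y_true, scores) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities
perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])   # every positive outranks every negative
partial = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])   # one positive is ranked below a negative
no_skill = roc_auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])  # constant scores, same as random chance
```

A value of 1.0 means perfect separation, while 0.5 corresponds to the random chance line in the ROC plot.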



The F1 score should be used when we have an unbalanced dataset

Confusion Matrix
                  Truth: False      Truth: True
Predicted False   True negative     False negative
Predicted True    False positive    True positive

Sensitivity or Recall
• True positive / (True positive + False negative)
• Used when skewed towards False values

Precision
• True positive / (True positive + False positive)
• Used when we are skewed towards True values

F-score
• 2 * (precision * recall) / (precision + recall)
• Used for unbalanced datasets
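The confusion-matrix measures can be computed directly from the four cell counts. The following sketch uses plain Python on hypothetical labels. The function names are assumptions, and the code does not guard against empty cells (for example no predicted positives), which would cause a division by zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (true positives, true negatives, false positives, false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,                  # sensitivity
        "specificity": tn / (tn + fp),     # true negative rate
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical example: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this example accuracy is 0.8 while recall, precision and F1 are all 0.75, which shows how accuracy can look better than the measures that focus on the positive class.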



Random Forest extra Resources

Deep dives

How Many Trees in a Random Forest?
Thais Mayumi Oshiro, Pedro Santoro Perez, and José Augusto Baranauskas

Random forest classifier for remote sensing classification
M. Pal

Real-Time Human Pose Recognition in Parts from Single Depth Images
Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake

A Random Forest Guided Tour
Gérard Biau and Erwan Scornet



Random Forest Challenge – Extramarital affairs

A Theory of Extramarital Affairs
Key characteristics of cheaters

Challenge1
1 Isolate X and Y
2 Transform Y into binary format
3 Create a dummy variable out of the occupation variable
4 Transform X string variables into dummies
5 Perform Random Forest
6 Create Importance drivers



LIME



Interpreting Advanced Machine Learning Models

Problem Statement
• How do we explain Advanced Machine Learning models?
• How can we trust something that does not explain itself?
• From a Data Mining perspective, it feels like a great loss to not be able to take advantage of the newer Data Science algorithms

Introducing LIME
• Local interpretable model-agnostic explanations -> works with most models
• LIME is the application of surrogate models
• Surrogate models are trained to approximate the predictions of the underlying black box model
• LIME is best applied to Classification problems!
• LIME's focus is on explaining individual predictions



LIME explanation example



LIME extra Resources

Deep dives

Model-Agnostic Interpretability of Machine Learning
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin

Statistical stability indices for LIME: obtaining reliable explanations for Machine Learning models
Giorgio Visani, Enrico Bagli, Federico Chesani, Alessandro Poluzzi and Davide Capuzzo



Challenge – Understanding Remote Work predictions

Stackoverflow dataset
Worker's characteristics, and job related queries

Challenge1
1 Install LIME
2 Transform string variables
3 Isolate X and Y
4 Perform Random Forest
5 Prepare LIME explainer
6 Use LIME to explain a couple of instances



SHAP



Case Study Briefing – Car prices

Pricing a car
List of cars, their price, and characteristics

Case study1
1 Build an XGBoost model to measure accuracy

2 Use SHAP to get insights



XGBoost and SHAP step by step

Prepare dataset, isolate X and Y

Split into Training and Test Set, and create Matrices

Set Parameters

Run XGBoost

Assess Model

Implement SHAP

XGBoost is a state-of-the-art Machine Learning Algorithm

Description: What is it?
1 Stands for Extreme Gradient Boosting
2 Can be constructed with a tree-based algorithm or a linear one (worse results)
3 It is an ensemble algorithm
4 Each new model is built upon the preceding one -> continuous improvement
5 Can be used for both Regression and Classification



Linear vs Decision Trees

[Figure: a linear approach (Revenue against Marketing Costs, with slope β) next to a decision tree (Crust? No: not a pie. Yes: Borders? No: not a pie. Yes: pie)]

XGBoost gives different weights depending on how difficult it is to predict

                     First Tree   Second Tree   Third Tree
Outcome  Predictor   Weight       Weight        Weight
1        X           25%          20%           23%
0        X           25%          20%           15%
0        X           25%          30%           35%
1        X           25%          30%           27%



XGBoost looks at parts of the observations at a time

                     First Tree   Second Tree   Third Tree
Outcome  Predictor   Weight       Weight        Weight
1        X1          25%          20%           23%
0        X2          25%          20%           not used
0        X3          not used     30%           35%
1        X4          25%          not used      27%

Key Idea
XGBoost only looks at a fraction of the observations at a time
Observations that are more difficult to predict are given a bigger weight



The logic is similar for Regression-based tasks

            First Tree                    Second Tree
Predictor   Error   Outcome   Weight     Error   Outcome   Weight
X1          -5      15        33%        -1      19        40%
X2          2       22        33%        -1      25        30%
X4          4       34        33%        3       35        35%



XGBoost also gives different weights to different
predictors

              First Tree                    Second Tree                   Third Tree
Observation   Error   Outcome   Weight     Error   Outcome   Weight     Error   Outcome   Weight
1             -5      15        33%        -1      19        40%        1       21        35%
2             2       22        33%        -1      25        30%        0       24        30%
3             4       34        33%        3       35        35%        2       36        40%

Predictor weights (columns X1, X2, X3): two predictors at 50% each in the first and second trees, and 40% and 60% in the third tree

Key Idea
Predictors also have different weights if they yield different model results

XGBoost quirks

Which?
NA:
Unlike other regression models, XGBoost treats NAs as information

Non-linearity:
XGBoost is excellent at dealing with non-linear relationships between the dependent and the independent variables



XGBoost has 7 main tuning parameters

Parameter              Description
Minimum child weight   Relates to the sum of the weights of the observations in a child node. Low values can mean that not a lot of observations are in the node
ETA                    Learning rate. How fast do you want the model to learn?
Max depth              How big should the tree be? Bigger trees go into more detail
Gamma                  How much must a split improve the model before the tree is split further?
Subsample              Share of observations used in each tree
Colsample by tree      Share of predictors (columns) used in each tree
Number of rounds       How many times do we want the analysis to be run?



Mean Absolute Error (MAE) vs Root Mean Squared Error (RMSE)

Visualization
[Figure: model line fitted to data points on axes X and Y]

Key ideas
• MAE and RMSE are performance indicators for Regression models with continuous dependent variables

$$MAE = \frac{\sum |y - \hat{y}|}{n} \qquad RMSE = \sqrt{\frac{\sum (\hat{y} - y)^2}{n}}$$

• RMSE is quite useful for models with extremes / outliers, because squaring penalizes large errors more heavily
• MAE is more interpretable.
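The difference between the two indicators is easiest to see on a small example. The sketch below implements both in plain Python and compares two hypothetical sets of predictions with the same MAE: one with evenly spread errors and one with a single large error. All values and names are made up for illustration.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors, in the units of y."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: squaring makes large errors weigh more."""
    return math.sqrt(sum((y_hat - y) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true))

y_true = [10, 20, 30, 40]
y_pred_even = [12, 18, 32, 38]     # every prediction is off by 2
y_pred_outlier = [10, 20, 30, 48]  # one prediction is off by 8

mae_even, rmse_even = mae(y_true, y_pred_even), rmse(y_true, y_pred_even)
mae_outlier, rmse_outlier = mae(y_true, y_pred_outlier), rmse(y_true, y_pred_outlier)
```

Both sets of predictions have an MAE of 2, but the RMSE doubles from 2 to 4 when the same total error is concentrated in one observation.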



Introduction to SHAP

• SHapley Additive exPlanations (SHAP) were introduced by Lundberg and Lee (2016)

• SHAP aims to explain each instance by computing the marginal contribution of each
feature to the prediction

• SHAP computes each value using coalitional game theory
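The coalitional game idea can be illustrated by computing exact Shapley values for a tiny, hypothetical "model". In this sketch the function `toy_value` stands in for the prediction when only a coalition of features is known, and all feature names and amounts are assumptions. The exact enumeration below grows exponentially with the number of features, so it is only practical for toy cases like this one.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values: weighted average marginal contribution over all coalitions."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            for coalition in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_f = value(set(coalition) | {f})
                without_f = value(set(coalition))
                total += weight * (with_f - without_f)
        phi[f] = total
    return phi

def toy_value(coalition):
    """Hypothetical car price prediction given the set of known features."""
    price = 20000.0  # baseline prediction when nothing is known
    if "low_mileage" in coalition:
        price += 3000
    if "premium_brand" in coalition:
        price += 5000
    if {"low_mileage", "premium_brand"} <= coalition:
        price += 1000  # interaction effect, shared equally by the two features
    return price

features = ["low_mileage", "premium_brand", "color"]
phi = shapley_values(features, toy_value)
```

The contributions add up to the difference between the full prediction and the baseline (9000 here), and the feature that never changes the prediction receives a value of 0.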



There are 3 main areas of insights

Global Interpretability
• The SHAP values can show how much each predictor contributes, either positively or negatively, to the target variable

Local Interpretability
• Each observation gets its own set of SHAP values
• We can explain why a case receives its prediction and the contributions of the predictors

Dependency Plots
• Shows the relations between an independent variable and the output
• Also shows how the predictor interacts with its closest independent variable



XGBoost and SHAP extra Resources

Deep dives

XGBoost: A Scalable Tree Boosting System
Tianqi Chen and Carlos Guestrin

A Unified Approach to Interpreting Model Predictions
Scott M. Lundberg and Su-In Lee

Toward safer highways, application of XGBoost and SHAP for real-time accident detection and feature analysis
Amir Bahador Parsa, Ali Movahedi, Homa Taghipour, Sybil Derrible, and Abolfazl (Kouros) Mohammadian



Challenge – Understanding house price drivers

Dataset with house characteristics and prices

Challenge1
1 Install SHAP and import libraries
2 Transform string variables
3 Isolate X and Y, and generate XGBoost matrix
4 Set parameters and run XGBoost
5 Local interpretability
6 Dependency plots
7 Global interpretability
