Data Mining For Business in Python Deck
3 CHAID
5 Dimension Reduction
7 Random Forest
8 LIME
Context Visualization
• 1 day
• 1 week
• 1 month
Time
Lung Cancer
Survival in patients with advanced lung cancer from the North
Central Cancer Treatment Group.
Case study1
1: Author Terry M Therneau [aut, cre], Thomas Lumley [ctb, trl] (original S->R port and R maintainer until 2009), Atkinson Elizabeth [ctb], Crowson Cynthia [ctb]
Prepare Dataset
Explanation
• Non-parametric statistic used to estimate the survival function (probability of a person surviving) from the lifetime data.
• In medical research, it is often used to measure the fraction of patients living for a specific time after treatment or diagnosis.
Formula
$S(t_i) = S(t_{i-1}) \cdot \left(1 - \frac{d_i}{n_i}\right)$
Where:
$S(t_i)$ = probability of survival at time $t_i$
$d_i$ = number of events at time $t_i$
$n_i$ = number of survivors at time $t_i$ (subjects still at risk just before $t_i$)
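The formula above is the product-limit estimator, commonly known as Kaplan-Meier. A minimal sketch in plain Python follows, assuming each subject has a follow-up time and a flag saying whether the event was observed. The function name `kaplan_meier` and the toy data are hypothetical and are not taken from the lung cancer dataset.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t_i, S(t_i))] for each distinct time with at least one event.

    durations: follow-up time of each subject
    observed:  True if the event happened, False if the subject was censored
    """
    deaths = Counter(t for t, e in zip(durations, observed) if e)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # n_i: subjects still under observation just before t
        n_i = sum(1 for d in durations if d >= t)
        d_i = deaths[t]
        # S(t_i) = S(t_{i-1}) * (1 - d_i / n_i)
        survival *= 1 - d_i / n_i
        curve.append((t, survival))
    return curve


# Hypothetical toy data: times in months, False marks a censored subject
durations = [2, 3, 3, 5, 8, 8, 9]
observed = [True, True, False, True, True, False, True]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Censored subjects never count as events, but they stay in the at-risk count `n_i` until their follow-up ends, which is how the estimator uses partial information.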
Description
Types
Right Censoring:
The subject under observation is still alive when the observation ends. In this case, we cannot record the time at which our event of interest (death) occurs.
Left Censoring:
The event cannot be observed for some reason. The event may also have started before recording began.
Interval Censoring:
We only have data for a specific interval, so it is possible that the event of interest does not occur during that time.
Context Visualization
Goal:
To test if there are statistical differences
in the survival distribution of >= 2 groups
Null Hypothesis:
There is no difference in the survival
distribution between the groups
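A test with this goal and null hypothesis is commonly run as a log-rank test. The sketch below is a minimal two-group version in plain Python, assuming that is the test the slide refers to. The function name `logrank_test` and the toy data are hypothetical, and the p-value uses the chi-square distribution with one degree of freedom.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for d, _, g in data if d >= t and g == 0)  # at risk in A
        n_b = sum(1 for d, _, g in data if d >= t and g == 1)  # at risk in B
        d_a = sum(1 for d, e, g in data if d == t and e and g == 0)
        d_b = sum(1 for d, e, g in data if d == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # events expected in A under the null
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical toy data: group A fails early, group B fails late
early = [1, 2, 3, 4, 5]
late = [10, 11, 12, 13, 14]
all_events = [True] * 5
chi2, p_value = logrank_test(early, all_events, late, all_events)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

A small p-value is evidence against the null hypothesis that the groups share the same survival distribution.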
Deep dives
Book
Survival Analysis
Stephen P. Jenkins
Experiment
In 1988, an experiment was designed and implemented at one of AT&T’s factories.
Output
1 Transform Solder Variable into 1 and 0
Explanation Formula
Prepare Dataset
Lung Cancer
Survival in patients with advanced lung cancer from the North
Central Cancer Treatment Group.
Case study1
1 Driver Analysis with Cox Proportional Hazard
2 Visualize Results
Deep dives
Experiment
Randomised trial of two treatment regimens for lung cancer.
Challenge1
1 Transform Cell Type into dummy variables.
Use pd.get_dummies or drop the variable
Variable selection
Description
Why?
Problem depth:
Having more than 20, 50 or 100 drivers increases the complexity
Importance:
How do you know which driver actually matters most?
Relevance:
Some variables might be relevant only in combination with some of the other variables, but not all
[Figure: example decision trees predicting Quality (High/Low) from Education, Computer, Military, Honors, Emp holes, Experience and Jobs, with Yes/No and Low/High branches]
Data Mining for Business in Python 2021
How CHAID processes
Description
Tree size:
You can choose how many levels the tree will have
Which?
Bucket size:
You can choose a minimum threshold that you want your buckets
to have
Continuous variables:
CHAID accepts only categorical variables
Deep dives
Description
You have been hired to investigate vehicle searches by the
police and whether there is racial bias
Challenge1
1 Create a dataset with these 5 variables: problem,
vehicleSearch, race, gender, policePrecinct
2 Transform string variables into dummy variables.
Socio-Economic Data
Socio-economic data by country
3 Create clusters
Key ideas
• No need to standardize data
• The cluster sizes do not have specific structures that might or might not apply.
• Faster to compute
• Poor at dealing with a low amount of data points
[Figure: the same data on axes X1 and X2, clustered with Kmeans and with a Gaussian Mixture Model]
Akaike’s Information Criterion (AIC) and Bayesian
Information Criterion (BIC)
Key Ideas Pseudo-visualization
Simplicity
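Both criteria trade goodness of fit against simplicity, and lower values are better. A minimal sketch of the standard definitions follows, assuming the log-likelihood and the number of free parameters of each candidate model are already known. The candidate values are hypothetical and only illustrate how the number of clusters could be picked.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits: number of clusters -> (log-likelihood, free parameters)
candidates = {2: (-520.0, 11), 3: (-480.0, 17), 4: (-476.0, 23)}
n_obs = 200

scores = {
    k: (aic(ll, p), bic(ll, p, n_obs)) for k, (ll, p) in candidates.items()
}
best_by_aic = min(scores, key=lambda k: scores[k][0])
best_by_bic = min(scores, key=lambda k: scores[k][1])
print(best_by_aic, best_by_bic)
```

BIC penalizes extra parameters more heavily than AIC once the sample has more than a handful of observations, so it tends to favor the simpler model.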
Prepare Dataset
Interpret results
Deep dives
Description
You are a wannabe wine connoisseur trying to find the best
wines for your parties using Data Mining
Challenge1
1 Determine the Optimal number of Clusters
3 Interpret Results
Paulo Cortez,
University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez
A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde
Region(CVRVV), Porto, Portugal
@2009
Dimension Reduction
Variance Variance
Easy implementation
3
Deep dives
Principal component analysis
Hervé Abdi and Lynne J. Williams
Description
The age of abalone is determined by cutting the shell through the cone, staining it,
and counting the number of rings through a microscope - a time-consuming task.
Other measurements, which are easier to obtain, are used to predict the age.
Challenge1
1 Transform gender variable and remove rings variable
2 Compute the correlation matrix and standardize the data
Transaction Data
Data on customers' grocery shopping purchases
Case study1
1 We have a file with almost 10k transactions
Prepare Dataset
Define Support
Define Confidence
Visualize Results
If... Then...
Burger → Fries
Jay-Z → Kanye
Key Ideas
The output is an If…then… type of analysis
Association Rule Learning is a very simple recommender system
To consider
• It does not matter if Burgers happen more than once per transaction
• Support indicates the relevance of the item
Methodological background
$\text{Support}(\text{Burgers}) = \frac{\#\,\text{Transactions with Burger}}{\text{Total Transactions}}$
Visualization
Data
• Population is 20
• 6 people like burgers
Formula
$\text{Support}(\text{Burgers}) = \frac{6}{20} = 30\%$
Visualization
Data
• Population is 20
• 4 people like Mayo
Formula
$\text{Support}(\text{Mayo}) = \frac{4}{20} = 20\%$
Visualization
Data
• Population is 20
• 6 people like burgers
• Of the 6, 2 like Mayo
Formula
$\text{Confidence}(\text{Mayo}|\text{Burgers}) = \frac{2}{6} = 33\%$
Concepts you need to know - Lift
Key idea
• Lift measures the likelihood of buying Mayo and Burgers together vs. just buying Mayo
• Lift bigger than 1 means increased likelihood to buy
Methodological background
$\text{Lift}(\text{Mayo}|\text{Burgers}) = \frac{\text{Confidence}(\text{Mayo}|\text{Burgers})}{\text{Support}(\text{Mayo})}$
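Support, confidence and lift can be checked by hand with a few lines of Python. The sketch below builds hypothetical baskets that match the counts used on the slides (20 transactions, 6 with burgers, 4 with mayo, 2 with both). The item "fries" in the remaining baskets and all function names are assumptions made for illustration.

```python
# Hypothetical baskets matching the slide counts:
# 20 transactions, 6 with burgers, 4 with mayo, 2 with both.
transactions = (
    [{"burger", "mayo"}] * 2
    + [{"burger"}] * 4
    + [{"mayo"}] * 2
    + [{"fries"}] * 12
)


def support(itemset):
    """Share of transactions that contain every item in the itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Share of transactions with the antecedent that also have the consequent."""
    return support(antecedent | consequent) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence of the rule relative to the plain support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)


support_burger = support({"burger"})
support_mayo = support({"mayo"})
confidence_rule = confidence({"burger"}, {"mayo"})
lift_rule = lift({"burger"}, {"mayo"})
print(support_burger, support_mayo, confidence_rule, lift_rule)
```

Baskets are stored as sets, so an item bought several times in one transaction still counts once. With these counts the lift is about 1.67, so burger buyers are more likely than the average customer to buy mayo.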
Key characteristics
Deep dives
Description
You have a dataset with NYC restaurants, their boroughs and
sanitary grade
Challenge1
1 Create a list with the transactions
Prepare Dataset
Model Assessment
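Description
[Figure: decision tree that first splits on X2 < 50, then on X1 > 10 and X1 > 70 (No/Yes branches, leaves labelled Blue), next to the matching partition of the x1-x2 plane at x1 = 10, x1 = 70 and x2 = 50]
Key Ideas:
• Each split is chosen with an entropy-based logic: where would a split yield the most information?
• The prediction is based on the relative frequency of the classes in the leaf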
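The question "where would a split yield more information" is usually answered with information gain, which is the drop in entropy after a split. A minimal sketch follows, assuming two numeric predictors and a binary label. The function names and the four toy observations are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)


def information_gain(rows, labels, feature, threshold):
    """Entropy reduction obtained by splitting on rows[feature] < threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] < threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] >= threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty tells us nothing
    weighted = (
        len(left) * entropy(left) + len(right) * entropy(right)
    ) / len(labels)
    return entropy(labels) - weighted


# Hypothetical observations: x2 separates the classes, x1 does not
rows = [
    {"x1": 5, "x2": 20},
    {"x1": 80, "x2": 30},
    {"x1": 20, "x2": 70},
    {"x1": 90, "x2": 80},
]
labels = ["Blue", "Blue", "Red", "Red"]

gain_x2 = information_gain(rows, labels, "x2", 50)
gain_x1 = information_gain(rows, labels, "x1", 50)
print(gain_x2, gain_x1)
```

A tree builder would evaluate many candidate thresholds like this and keep the one with the highest gain, then predict with the most frequent class in each leaf.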
Description
Description
Model Assessment
Confusion Matrix
                   Truth: False      Truth: True
Predicted False    True negative     False negative
Predicted True     False positive    True positive
Accuracy
• Accuracy = (True positive + True negative) / All
• Used when we have a balanced dataset
Sensitivity or Recall
• True positive / (True positive + False negative)
• Used when skewed towards False values
Precision
• True positive / (True positive + False positive)
• Used when skewed towards False values
Area under the ROC curve (AUC)
Key ideas
• AUC is a performance measure for classification problems
[Figure: ROC curve plotting Sensitivity (up to 100%) against Specificity, with the random-chance line for reference]
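One way to read AUC is as the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case. The sketch below computes it that way by comparing every positive-negative pair. The function name `auc` and the scores are hypothetical, and this pairwise approach is only practical for small samples.

```python
def auc(labels, scores):
    """Area under the ROC curve from pairwise comparisons.

    labels: 1 for the positive class, 0 for the negative class
    scores: predicted probability or score of the positive class
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions for four cases
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = auc(labels, scores)
print(auc_value)
```

A value of 0.5 corresponds to the random-chance line, and 1.0 means every positive case is ranked above every negative case.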
Confusion Matrix
                   Truth: False      Truth: True
Predicted False    True negative     False negative
Predicted True     False positive    True positive
Accuracy
• Accuracy = (True positive + True negative) / All
• Used when we have a balanced dataset
Precision
• True positive / (True positive + False positive)
• Used when we are skewed towards True values
F-score
• 2 * (precision * recall) / (precision + recall)
• Used for unbalanced datasets
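The four measures follow directly from the cells of the confusion matrix. A minimal sketch follows, with a hypothetical function name and hypothetical counts.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision and F-score from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)  # also called sensitivity
    precision = tp / (tp + fp)
    f_score = 2 * (precision * recall) / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_score": f_score,
    }


# Hypothetical counts for 100 predictions
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The sketch assumes at least one predicted positive and one actual positive. With none of either, precision or recall would divide by zero and would need a guard.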
Deep dives
Introducing Lime
Problem Statement
• How do we explain Advanced Machine Learning models?
• How can we trust something that does not explain itself?
• From a Data Mining perspective, it feels like a great loss to not be able to take advantage of the newer Data Science algorithms

• LIME: Local interpretable model-agnostic explanations -> works with most models
• LIME is the application of surrogate models
• Surrogate models are trained to approximate the predictions of the underlying black box model
• LIME is best applied to Classification problems!
• LIME's focus is on explaining individual predictions
Deep dives
Stackoverflow dataset
Workers' characteristics and job-related queries
3 Isolate X and Y
Pricing a car
List of cars, their price, and characteristics
Case study1
1 Build an XGBoost model to measure accuracy
Set Parameters
Run XGBoost
Assess Model
Implement SHAP
XGBoost is a state-of-the-art Machine Learning Algorithm
Description
3 It is an ensemble algorithm
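[Figure: a regression of Revenue on Marketing Costs with slope β, next to a decision tree with No/Yes branches that splits on Crust into "Not a pie" and "Pie"]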
XGBoost gives different weights depending on how
difficult it is to predict
Key Idea
XGBoost only looks at a fraction of the observations at a time
Observations that are more difficult to predict are given a bigger weight
Key Idea
Predictors also have different weights
[Figure: table for the third tree with columns Error, Outcome, X1, X2, X3 and Weight, showing the weights given to observations and predictors]
Description
Which?
NA:
Unlike other regression models, XGBoost treats NAs as
information
Non-linearity:
XGBoost is excellent at dealing with non-linear relationships
between the dependent and the independent variables.
Parameter | Description
Minimum Child weight | Relates to the sum of the weights of the observations in a node. Low values allow nodes that contain only a few observations
ETA | Learning Rate. How fast do you want the model to learn?
Max depth | How big should the tree be? Bigger trees go into more detail
Colsample by tree | What fraction of the predictors (columns) should be sampled for each tree?
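In the XGBoost Python package these four settings go into a parameter dictionary under the names `min_child_weight`, `eta`, `max_depth` and `colsample_bytree`. The sketch below only builds and sanity-checks such a dictionary. The values are illustrative assumptions, not tuned for the car pricing data, and the helper `check_params` is hypothetical.

```python
# Illustrative values only (assumptions, not tuned for any dataset)
xgb_params = {
    "min_child_weight": 1,    # minimum sum of observation weights in a child node
    "eta": 0.3,               # learning rate
    "max_depth": 6,           # maximum depth of each tree
    "colsample_bytree": 0.8,  # fraction of columns sampled for each tree
    "objective": "reg:squarederror",  # regression with squared error loss
}


def check_params(params):
    """Hypothetical helper: reject values outside their valid ranges."""
    if not 0 < params["eta"] <= 1:
        raise ValueError("eta must be in (0, 1]")
    if not 0 < params["colsample_bytree"] <= 1:
        raise ValueError("colsample_bytree must be in (0, 1]")
    if params["max_depth"] < 0:
        raise ValueError("max_depth must not be negative")
    if params["min_child_weight"] < 0:
        raise ValueError("min_child_weight must not be negative")
    return True


params_ok = check_params(xgb_params)
print(params_ok)
```

A dictionary like this is typically passed as the first argument to `xgboost.train`, together with the training data wrapped in a `DMatrix`.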
• MAE and RMSE are performance indicators for regression models with continuous dependent variables
$MAE = \frac{1}{n} \sum |y - \hat{y}|$
$RMSE = \sqrt{\frac{1}{n} \sum (\hat{y} - y)^2}$
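Both indicators can be computed directly from the actual and predicted values. A minimal sketch with hypothetical prices follows, where `y_true` holds the observed values and `y_pred` the model output.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    squared = sum((p - y) ** 2 for y, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Hypothetical car prices (in thousands) and model predictions
y_true = [10.0, 12.0, 15.0, 20.0]
y_pred = [12.0, 11.0, 15.0, 16.0]

mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(mae_value, rmse_value)
```

RMSE is never smaller than MAE on the same data, and the gap widens when a few predictions are far off, as with the last car here.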
• SHAP aims to explain each instance by computing the marginal contribution of each feature to the prediction
• The SHAP values can show how much each predictor contributes, either positively or negatively, to the target variable
• Each observation gets its own set of SHAP values
• We can explain why a case receives its prediction and the contributions of the predictors
• Dependency plots show the relation between an independent variable and the output
• They also show how the predictor interacts with its closest independent variable
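The "marginal contribution of each feature" is a Shapley value. For a handful of features it can be computed exactly by trying every combination of features, as sketched below. This is a brute-force illustration and not how the SHAP library computes its values. The pricing function, the single baseline observation and all names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(instance)
    features = range(n)

    def value(coalition):
        # Features in the coalition take the instance's value,
        # the others keep the baseline's value.
        x = [instance[i] if i in coalition else baseline[i] for i in features]
        return predict(x)

    contributions = []
    for i in features:
        others = [j for j in features if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                phi += weight * (with_i - without_i)
        contributions.append(phi)
    return contributions


def predict(x):
    """Hypothetical car price model with one interaction term."""
    horsepower, age, is_convertible = x
    bonus = 2000 * is_convertible * (horsepower > 150)
    return 5000 + 100 * horsepower - 800 * age + bonus


instance = [200, 3, 1]  # the car we want to explain
baseline = [120, 5, 0]  # assumed reference car
phi = shapley_values(predict, instance, baseline)
print(phi)
```

The contributions add up to the difference between the prediction for the instance and the prediction for the baseline, which is what lets each case be explained predictor by predictor.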
Deep dives
5 Local interpretability
6 Dependency plots
7 Global interpretability