Practical Data Science
An Introduction to Supervised Machine Learning
and Pattern Classification: The Big Picture
Sebastian Raschka
Michigan State University
NextGen Bioinformatics Seminars - 2015
Feb. 11, 2015
A Little Bit About Myself ...
PhD candidate in Dr. L. Kuhn's lab:
Developing software & methods for
- Protein-ligand docking
- Large-scale drug/inhibitor discovery
and some other machine learning side-projects
What is Machine Learning?
"Field of study that gives computers the
ability to learn without being explicitly
programmed."
(Arthur Samuel, 1959)
By Phillip Taylor [CC BY 2.0]
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Examples of Machine Learning
Text Recognition
Biology
http://commons.wikimedia.org/wiki/
File:American_book_company_1916._letter_envelope-2.JPG#filelinks
[public domain]
Spam Filtering
https://flic.kr/p/5BLW6G [CC BY 2.0]
Examples of Machine Learning
Self-driving cars
Recommendation systems
http://commons.wikimedia.org/wiki/File:Netflix_logo.svg [public domain]
By Steve Jurvetson [CC BY 2.0]
Photo search
and many, many
more ...
http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html
How many of you have used
machine learning before?
Our Agenda
Concepts and the big picture
Workflow
Practical tips & good habits
Supervised Learning:
- Labeled data
- Direct feedback
- Predict outcome/future

Unsupervised Learning:
- No labels
- No feedback
- Find hidden structure

Reinforcement Learning:
- Decision process
- Reward system
- Learn series of actions
Unsupervised Learning
- Clustering: [DBSCAN on a toy dataset]

Supervised Learning
- Regression: [Soccer Fantasy Score prediction]
- Classification: [SVM on 2 classes of the Wine dataset]  (today's topic)
Nomenclature
IRIS
https://archive.ics.uci.edu/ml/datasets/Iris
Instances (samples, observations), features (attributes, dimensions), and classes (targets):

       sepal_length  sepal_width  petal_length  petal_width  class
  1    5.1           3.5          1.4           0.2          setosa
  2    4.9           3.0          1.4           0.2          setosa
  ...
  50   6.4           3.2          4.5           1.5          versicolor
  ...
  150  5.9           3.0          5.1           1.8          virginica

Each row is an instance (sample, observation); the first four columns are the features (attributes, dimensions); the last column holds the classes (targets).
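As a quick, minimal sketch (using the copy of the Iris data bundled with scikit-learn), the dataset can be loaded and inspected like this:

from sklearn.datasets import load_iris

# Load the 150-sample Iris dataset (4 features, 3 classes)
iris = load_iris()
X, y = iris.data, iris.target          # X: (150, 4) feature matrix, y: class labels 0-2

print(iris.feature_names)              # sepal/petal length and width (cm)
print(iris.target_names)               # ['setosa' 'versicolor' 'virginica']
print(X[0], iris.target_names[y[0]])   # first instance and its class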
Classification
1) Learn from training data
2) Map unseen (new) data
[Scatter plot: class1 vs. class2 in a two-dimensional feature space (x1, x2)]
Supervised Learning Workflow

Raw data collection
-> Pre-processing: feature extraction, missing data, sampling
-> Split into a training dataset and a test dataset
-> Pre-processing of the training data: feature selection, feature scaling, dimensionality reduction
-> Training: learning algorithm, cross-validation, hyperparameter optimization, refinement
-> Model selection -> final model
-> Evaluation: performance metrics on the test dataset
-> Final classification/regression model -> prediction on new data, post-processing

Sebastian Raschka 2014
This work is licensed under a Creative Commons Attribution 4.0 International License.
A Few Common Classifiers
Perceptron
Naive Bayes
Decision Tree
K-Nearest Neighbor
Logistic Regression
Artificial Neural Network / Deep Learning
Support Vector Machine
Ensemble Methods: Random Forest, Bagging, AdaBoost
Discriminative Algorithms
Map x -> y directly.
E.g., distinguish between people speaking different languages
without learning the languages.
Examples: Logistic Regression, SVM, Neural Networks

Generative Algorithms
Model a more general problem: how the data was generated.
I.e., the distribution of the class; the joint probability distribution p(x, y).
Examples: Naive Bayes, Bayesian Belief Network classifier, Restricted
Boltzmann Machine
Examples of Discriminative Classifiers:
Perceptron
F. Rosenblatt. The Perceptron: A Perceiving and Recognizing Automaton (Project Para). Cornell Aeronautical Laboratory, 1957.

[Diagram: inputs $x_{i1}, x_{i2}$ (plus bias) with weights $w_0, w_1, w_2$ feeding a threshold unit that outputs $\hat{y}_i$]

$\hat{y} = w^T x = w_0 + w_1 x_1 + w_2 x_2, \quad y \in \{-1, 1\}$

where
$w_j$ = weight
$x_i$ = training sample
$y_i$ = desired output
$\hat{y}_i$ = actual output
$t$ = iteration step
$\eta$ = learning rate
$\theta$ = threshold (here 0)

update rule:

$\hat{y}_i = \begin{cases} 1 & \text{if } w^T x_i \geq \theta \\ -1 & \text{otherwise} \end{cases}$

$w_j(t+1) = w_j(t) + \eta\,(y_i - \hat{y}_i)\,x_{ij}$

until $t+1$ = max iterations or error $= 0$
Discriminative Classifiers:
Perceptron
F. Rosenblatt. The Perceptron: A Perceiving and Recognizing Automaton (Project Para). Cornell Aeronautical Laboratory, 1957.

[Diagram: perceptron with inputs $x_{i1}, x_{i2}$, bias weight $w_0$, weights $w_1, w_2$, output $\hat{y}_i \in \{-1, 1\}$]
Binary classifier (multi-class via one-vs-all, OVA)
Convergence problems (set n iterations)
Modification: stochastic gradient descent
Modern perceptron: Support Vector Machine (maximize margin)
Multilayer perceptron (MLP)
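Below is a minimal NumPy sketch of the update rule from the previous slide (a toy implementation on an assumed linearly separable example, not the SVM or MLP variants):

import numpy as np

def train_perceptron(X, y, eta=0.1, max_iter=10):
    """Rosenblatt perceptron: w_j(t+1) = w_j(t) + eta * (y_i - y_hat_i) * x_ij."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend x0 = 1 for the bias weight w0
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1 if w @ xi >= 0 else -1         # threshold at 0
            w += eta * (yi - y_hat) * xi             # update only on misclassification
            errors += int(y_hat != yi)
        if errors == 0:                              # converged
            break
    return w

# Toy example: an AND-like, linearly separable problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])
w = train_perceptron(X, y)
print(w, [1 if w @ np.r_[1, xi] >= 0 else -1 for xi in X])

If the classes are not linearly separable, the inner loop never reaches zero errors and training only stops at max_iter, which is the convergence problem noted above.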
Generative Classifiers:
Naive Bayes
Bayes' Theorem:

$P(\omega_j \mid x_i) = \dfrac{P(x_i \mid \omega_j)\, P(\omega_j)}{P(x_i)}$

Posterior probability = (Likelihood x Prior probability) / Evidence

Iris example: $P(\text{"Setosa"} \mid x_i)$, with $x_i = [4.5\ \text{cm},\ 7.4\ \text{cm}]$
Generative Classifiers:
Naive Bayes
Bayes' Theorem:

$P(\omega_j \mid x_i) = \dfrac{P(x_i \mid \omega_j)\, P(\omega_j)}{P(x_i)}$

Decision Rule:

predicted class label $\hat{\omega}_j = \arg\max_j P(\omega_j \mid x_i), \quad i = 1, \ldots, m$

e.g., $\omega_j \in \{\text{Setosa}, \text{Versicolor}, \text{Virginica}\}$
Generative Classifiers:
Naive Bayes
$P(\omega_j \mid x_i) = \dfrac{P(x_i \mid \omega_j)\, P(\omega_j)}{P(x_i)}$

Evidence: $P(x_i)$ (cancels out in the decision rule)

Prior probability: $P(\omega_j) = \dfrac{N_{\omega_j}}{N_c}$ (class frequency)

Class-conditional probability (here: Gaussian kernel):

$P(x_{ik} \mid \omega_j) = \dfrac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left(-\dfrac{(x_{ik} - \mu_j)^2}{2\sigma_j^2}\right)$

$P(x_i \mid \omega_j) = \prod_k P(x_{ik} \mid \omega_j)$
Generative Classifiers:
Naive Bayes
Naive conditional independence assumption typically
violated
Works well for small datasets
Multinomial model still quite popular for text classification
(e.g., spam filter)
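A short, hedged sketch with scikit-learn's GaussianNB, which estimates the class priors and a per-feature Gaussian class-conditional density as above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

nb = GaussianNB()                      # class priors and per-feature Gaussians estimated from the data
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))        # accuracy on held-out data
print(nb.predict_proba(X_test[:3]))    # posterior probabilities P(class | x)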
Non-Parametric Classifiers:
K-Nearest Neighbor
[Nearest-neighbor illustration, e.g. k = 3 or k = 1]

- Simple!
- Lazy learner
- Very susceptible to the curse of dimensionality

Iris example:
[Decision regions; C = 3 classes, k = 3, Mahalanobis distance, uniform weights]
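A brief scikit-learn sketch along the lines of the figure caption; the Mahalanobis setup (passing the feature covariance matrix and forcing a ball-tree search) is one way to configure it and is an assumption here:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# k = 3 neighbors, uniform weights; the Mahalanobis metric needs the
# feature covariance matrix V
knn = KNeighborsClassifier(
    n_neighbors=3,
    weights='uniform',
    algorithm='ball_tree',
    metric='mahalanobis',
    metric_params={'V': np.cov(X.T)},
)
knn.fit(X, y)
print(knn.predict(X[:3]))   # lazy learner: most of the work happens at prediction time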
Decision Tree

petal length <= 2.45?
 Y -> Setosa
 N -> petal length <= 4.75?
       Y -> Versicolor
       N -> Virginica

[Decision regions for trees of depth = 2 and depth = 4]

Entropy $= -\sum_i p_i \log_k p_i$
e.g., for two equally likely classes: $2 \cdot (-0.5 \log_2 0.5) = 1$

Information Gain = entropy(parent) - [avg entropy(children)]
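A minimal scikit-learn sketch (the entropy criterion and the depth are assumptions chosen to mirror the slide):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned splits, e.g. "petal length (cm) <= 2.45"
print(export_text(tree, feature_names=iris.feature_names))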
"No Free Lunch" :(
D. H. Wolpert. The supervised learning no-free-lunch theorems. In Soft Computing and Industry, pages 25-42. Springer, 2002.
Our model is a simplification of reality
Simplification is based on assumptions (model bias)
Assumptions fail in certain situations
Roughly speaking:
No one model works best for all possible situations.
Which Algorithm?
What is the size and dimensionality of my training set?
Is the data linearly separable?
How much do I care about computational efficiency?
- Model building vs. real-time prediction time
- Eager vs. lazy learning / on-line vs. batch learning
- prediction performance vs. speed
Do I care about interpretability or should it "just work well?"
...
[Supervised learning workflow diagram, repeated as a roadmap]
Missing Values:
- Remove features (columns)
- Remove samples (rows)
- Imputation (mean, nearest neighbor, ...)

Sampling:
- Random split into training and validation sets
- Typically 60/40, 70/30, 80/20
- Don't use the validation set until the very end!
(overfitting)
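A small sketch of both steps using the current scikit-learn API (SimpleImputer; older releases exposed this as preprocessing.Imputer), with a 70/30 split as one of the typical ratios above:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[5.1, 3.5], [4.9, np.nan], [6.4, 3.2], [5.9, 3.0]])
y = np.array([0, 0, 1, 2])

# Impute missing entries with the column (feature) mean
X_imputed = SimpleImputer(strategy='mean').fit_transform(X)

# Random 70/30 split; keep the held-out set untouched until the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)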
Categorical Variables

   color  size  prize  class label
0  green  M     10.1   class1
1  red    L     13.5   class2
2  blue   XL    15.3   class1

color is nominal, size is ordinal.

Encodings:
- nominal (one-hot): green -> (1, 0, 0), red -> (0, 1, 0), blue -> (0, 0, 1)
- ordinal: M -> 1, L -> 2, XL -> 3
- class label: class1 -> 0, class2 -> 1

   color=blue  color=green  color=red  prize  size  class label
0  0           1            0          10.1   1     0
1  0           0            1          13.5   2     1
2  1           0            0          15.3   3     0
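A hedged pandas sketch of these mappings (column names as in the table above):

import pandas as pd

df = pd.DataFrame({
    'color': ['green', 'red', 'blue'],
    'size':  ['M', 'L', 'XL'],
    'prize': [10.1, 13.5, 15.3],
    'class label': ['class1', 'class2', 'class1'],
})

# Ordinal feature and class label: map to integers
df['size'] = df['size'].map({'M': 1, 'L': 2, 'XL': 3})
df['class label'] = df['class label'].map({'class1': 0, 'class2': 1})

# Nominal feature: one-hot encode into color=blue / color=green / color=red
df = pd.get_dummies(df, columns=['color'], prefix_sep='=')
print(df)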
[Supervised learning workflow diagram, repeated as a roadmap]
Generalization Error and Overfitting
How well does the model perform on unseen data?
Generalization Error and Overfitting
Error Metrics: Confusion Matrix

here: setosa = positive class

                 predicted positive   predicted negative
actual positive  TP                   FN
actual negative  FP                   TN

[Linear SVM on sepal/petal lengths]
Error Metrics

here: setosa = positive class

                 predicted positive   predicted negative
actual positive  TP                   FN
actual negative  FP                   TN

[Linear SVM on sepal/petal lengths]

(micro and macro averaging for multi-class problems)

Accuracy = (TP + TN) / (FP + FN + TP + TN) = 1 - Error

False Positive Rate = FP / N

True Positive Rate = TP / P   (Recall)

Precision = TP / (TP + FP)
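A short sketch computing these metrics with scikit-learn, assuming a binary setosa-vs-rest labeling and a linear SVM as in the figure caption:

from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
y = (y == 0).astype(int)   # setosa = positive class (1), everything else negative (0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
y_pred = SVC(kernel='linear').fit(X_train, y_train).predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tp, fn, fp, tn)
print(accuracy_score(y_test, y_pred), precision_score(y_test, y_pred), recall_score(y_test, y_pred))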
Receiver Operating Characteristic
(ROC) Curves
Model Selection
Complete dataset -> split into a training dataset and a test dataset

k-fold cross-validation (k = 4): the training dataset is split into 4 folds.

1st iteration: fold 1 is held out as the test fold, folds 2-4 are used for training -> calc. error
2nd iteration: fold 2 is held out -> calc. error
3rd iteration: fold 3 is held out -> calc. error
4th iteration: fold 4 is held out -> calc. error

-> calculate the avg. error over the 4 iterations
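A minimal sketch of 4-fold cross-validation with scikit-learn (the classifier is just a placeholder choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 4-fold CV: each fold is held out once, the model is trained on the rest,
# and the per-fold scores are averaged
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=4)
print(scores, scores.mean())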
k-fold CV and ROC
Feature Selection
IMPORTANT!
(Noise, overfitting, curse of dimensionality, efficiency)
- Domain knowledge
- Variance threshold
- Exhaustive search
- Decision trees

Simplest example: Greedy Backward Selection

start: X = [x1, x2, x3, x4]
       X = [x1, x3, x4]
stop:  X = [x1, x3]   (if d = k)
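A hedged sketch of greedy backward selection (backward_selection is a hypothetical helper that scores feature subsets by cross-validated accuracy):

from sklearn.model_selection import cross_val_score

def backward_selection(estimator, X, y, k):
    """Greedily remove one feature at a time until only k features remain."""
    features = list(range(X.shape[1]))
    while len(features) > k:
        # Score the remaining subset for each candidate removal
        scores = {f: cross_val_score(estimator,
                                     X[:, [g for g in features if g != f]],
                                     y, cv=4).mean()
                  for f in features}
        # Remove the feature whose removal leaves the highest CV score (the least useful one)
        features.remove(max(scores, key=scores.get))
    return features

# Example: keep d = k = 2 features of Iris
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
print(backward_selection(KNeighborsClassifier(n_neighbors=3), X, y, k=2))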
Dimensionality Reduction
Transformation onto a new feature subspace
e.g., Principal Component Analysis (PCA)
Find directions of maximum variance
Retain most of the information
PCA in 3 Steps
0. Standardize the data:

$z = \dfrac{x_{ik} - \mu_k}{\sigma_k}$

1. Compute the covariance matrix:

$\sigma_{jk} = \dfrac{1}{n-1} \sum_i (x_{ij} - \mu_j)(x_{ik} - \mu_k)$

$\Sigma = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \sigma_{13} & \sigma_{14} \\
\sigma_{21} & \sigma_2^2 & \sigma_{23} & \sigma_{24} \\
\sigma_{31} & \sigma_{32} & \sigma_3^2 & \sigma_{34} \\
\sigma_{41} & \sigma_{42} & \sigma_{43} & \sigma_4^2
\end{pmatrix}$
PCA in 3 Steps
2. Eigendecomposition and sorting of the eigenvalues:

$\Sigma v = \lambda v$

Eigenvectors:
[[ 0.52237162 -0.37231836 -0.72101681  0.26199559]
 [-0.26335492 -0.92555649  0.24203288 -0.12413481]
 [ 0.58125401 -0.02109478  0.14089226 -0.80115427]
 [ 0.56561105 -0.06541577  0.6338014   0.52354627]]

Eigenvalues (from high to low):
[ 2.93035378  0.92740362  0.14834223  0.02074601]
PCA in 3 Steps
3. Select the top k eigenvectors and transform the data

(Eigenvectors and eigenvalues as above; for k = 2, keep the two eigenvectors belonging to the largest eigenvalues, 2.93 and 0.93.)

[Scatter plot: first 2 principal components of Iris]
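A NumPy sketch of the three steps on the standardized Iris data; it should reproduce eigenvalues close to those above (eigenvector signs and ordering may differ):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# 0. Standardize (zero mean, unit variance per feature)
X_std = StandardScaler().fit_transform(X)

# 1. Covariance matrix (4 x 4)
cov = np.cov(X_std.T)

# 2. Eigendecomposition; sort eigenvalues from high to low
eig_vals, eig_vecs = np.linalg.eigh(cov)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
print(eig_vals)          # ~ [2.93, 0.93, 0.15, 0.02]

# 3. Project onto the top k = 2 eigenvectors (the first 2 principal components)
W = eig_vecs[:, :2]
X_pca = X_std @ W
print(X_pca.shape)       # (150, 2)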
Hyperparameter Optimization:
GridSearch in scikit-learn
[Decision-region plots for different hyperparameter settings, e.g. SVM with C=1000, gamma=0.1 vs. C=1; KNN with k=11, uniform weights]
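A minimal GridSearchCV sketch; the parameter grid values are illustrative assumptions in the spirit of the slide:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=4)   # exhaustive search with 4-fold CV
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)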
Non-Linear Problems
- e.g., the XOR gate
[Decision regions on XOR-like data; depth = 4]
Kernel Trick

Kernel function: map the data onto a higher-dimensional space (non-linear combinations of the features) where it becomes linearly separable.

Trick: no explicit dot product in the high-dimensional space is needed!

Radial Basis Function (RBF) kernel:

$K(x_i, x_j) = \exp\!\left(-\gamma\, \lVert x_i - x_j \rVert^2\right)$

Kernel PCA
[PC1 from linear PCA vs. PC1 from kernel PCA on a non-linearly separable dataset]
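A hedged sketch contrasting linear PCA and RBF-kernel PCA on a toy non-linear dataset (make_circles); the gamma value is an assumption:

from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

pc1_linear = PCA(n_components=1).fit_transform(X)                                 # linear PCA
pc1_kernel = KernelPCA(n_components=1, kernel='rbf', gamma=10).fit_transform(X)   # kernel PCA

# Along the kernel-PCA component the two classes separate; along linear PC1 they do not
print(pc1_linear[:3].ravel(), pc1_kernel[:3].ravel())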
[Supervised learning workflow diagram, repeated as a summary]
Thanks!
Questions?
@rasbt
[email protected]
https://github.com/rasbt
Additional Slides
Inspiring Literature
P. N. Klein. Coding the Matrix: Linear
Algebra Through Computer Science
Applications. Newtonian Press, 2013.
S. Gutierrez. Data Scientists at Work.
Apress, 2014.
R. Schutt and C. O'Neil. Doing Data
Science: Straight Talk from the Frontline.
O'Reilly Media, Inc., 2013.
R. O. Duda, P. E. Hart, and D. G. Stork.
Pattern Classification. 2nd Edition. New
York, 2001.
Useful Online Resources
https://www.coursera.org/course/ml
http://stats.stackexchange.com
http://www.kaggle.com
My Favorite Tools
http://scikit-learn.org/stable/
http://www.numpy.org
http://pandas.pydata.org
Seaborn
http://stanford.edu/~mwaskom/software/seaborn/
http://ipython.org/notebook.html
Which one to pick?

[Two decision boundaries separating class1 and class2: a simple one and a complex one that fits the training data perfectly]

Generalization error!
The problem of overfitting