Machine Learning
Dr. Shazzad Hosain
Department of EECS
North South University
[email protected]
What is Machine Learning?
[Diagram: TRAINING DATA is fed to a learning algorithm, which produces a trained machine; given a query, the trained machine returns an answer. Data mining is a similar concept.]
For which tasks?
Classification (binary/categorical target)
Regression and time series prediction
(continuous targets)
Clustering (targets unknown)
Rule discovery
For which applications?
[Chart: number of training examples vs. number of inputs for typical applications]
Customer knowledge
Quality control
Market analysis
Text categorization
System diagnosis
OCR
Machine vision
HWR (handwriting recognition)
Bioinformatics
Banking / Telecom / Retail
Identify:
Prospective customers
Dissatisfied customers
Good customers
Bad payers
Obtain:
More effective
advertising
Less credit risk
Less fraud
Decreased churn rate
Biomedical / Biometrics
Medicine:
Screening
Diagnosis and prognosis
Drug discovery
Security:
Face recognition
Signature / fingerprint / iris
verification
DNA fingerprinting
Computer / Internet
Computer interfaces:
Troubleshooting wizards
Handwriting and speech
Brain waves
Internet
Hit ranking
Spam filtering
Text categorization
Text translation
Recommendation
ML in a Nutshell
Tens of thousands of machine learning algorithms
Hundreds of new ones every year
Every machine learning algorithm has three
components:
Representation
Evaluation
Optimization
Representation
Decision trees
Sets of rules / Logic programs
Instances
Graphical models (Bayes/Markov nets)
Neural networks
Support vector machines
Model ensembles
Etc.
Evaluation
Accuracy
Precision and recall
Squared error
Likelihood
Posterior probability
Cost / Utility
Margin
Entropy
K-L divergence
Etc.
Optimization
Combinatorial optimization
E.g.: Greedy search
Convex optimization
E.g.: Gradient descent
Constrained optimization
E.g.: Linear programming
Types of Learning
Supervised (inductive) learning
Training data includes desired outputs
Unsupervised learning
Training data does not include desired outputs
Semi-supervised learning
Training data includes a few desired outputs
Reinforcement learning
Rewards from sequence of actions
Supervised Learning
Learning Through Examples
Supervised Learning
When a set of targets of interest is provided by an
external teacher
we say that the learning is Supervised
The targets are usually in the form of an input-output
mapping that the network should learn
Learning From Examples
[Example: input-output pairs such as 1→1, 2→4, 3→9, 4→16, 5→25, 6→36; the learner must infer the underlying mapping (here, squaring).]
What We’ll Cover
Supervised learning
Decision tree induction
Neural networks
Rule induction
Instance-based learning
Bayesian learning
Support vector machines
Model ensembles
Learning theory
Classification: Decision Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
[Figure: the corresponding axis-parallel decision regions in the X-Y plane, with splits at X = 2, X = 5 and Y = 3]
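As a concrete illustration (not from the slides), the rule above can be written directly as a small Python function; the point coordinates and colour labels are just those from the figure.

def classify_point(x, y):
    # A literal rendering of the decision rule above: axis-parallel tests on X and Y.
    if x > 5:
        return "blue"
    elif y > 3:
        return "blue"
    elif x > 2:
        return "green"
    else:
        return "blue"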
Classification: Neural Nets
Can select more
complex regions
Can be more accurate
Also can overfit the
data – find patterns in
random noise
Decision Tree Learning
Learning Through Examples
Learning decision trees
Problem: decide whether to wait for a table at a restaurant,
based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous)
E.g., situations where I will/won't wait for a table:
Classification of examples is positive (T) or negative (F)
Decision tree
Choosing an attribute
Idea: a good attribute splits the examples into subsets that
are (ideally) "all positive" or "all negative"
Patrons? is a better choice
Choosing the Best Attribute
The key problem is choosing which attribute to split a
given set of examples.
Some possibilities are:
Random: Select any attribute at random
Least-Values: Choose the attribute with the smallest number
of possible values (fewer branches)
Most-Values: Choose the attribute with the largest number of
possible values (smaller subsets)
Max-Gain: Choose the attribute that has the largest expected
information gain, i.e. select attribute that will result in the
smallest expected size of the subtrees rooted at its children.
The ID3 algorithm uses the Max-Gain method of
selecting the best attribute.
ID3 (Iterative Dichotomiser 3) Algorithm
Top-down, greedy search through space of
possible decision trees
Remember, decision trees represent hypotheses, so
this is a search through hypothesis space.
What is top-down?
How to start tree?
What attribute should represent the root?
As you proceed down tree, choose attribute for each
successive node.
No backtracking:
So, algorithm proceeds from top to bottom
Question?
How do you determine which attribute best
classifies data?
Answer: Entropy!
Information gain:
Statistical quantity measuring how well an
attribute classifies the data.
Calculate the information gain for each attribute.
Choose attribute with greatest information gain.
Information Theory Background
If there are n equally probable possible messages, then the
probability p of each is 1/n
Information conveyed by a message is -log(p) = log(n)
Eg, if there are 16 messages, then log(16) = 4 and we need 4
bits to identify/send each message.
In general, if we are given a probability distribution
P = (p1, p2, .., pn)
the information conveyed by distribution (aka Entropy of P) is:
H(P) = -(p1*log(p1) + p2*log(p2) + .. + pn*log(pn))
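As a small illustration (not from the slides), the entropy of a distribution can be computed in a few lines of Python; the function name is my own.

import math

def entropy_of_distribution(probabilities):
    # H(P) = -(p1*log2(p1) + ... + pn*log2(pn)), treating 0*log2(0) as 0
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# 16 equally probable messages carry log2(16) = 4 bits each:
print(entropy_of_distribution([1/16] * 16))   # 4.0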
Information Gain
Information gain is our metric for how well one attribute Ai
classifies the training data.
Calculate the entropy for all training examples
positive and negative cases
p+ = #pos/Tot p- = #neg/Tot
H(S) = -p+log2(p+) - p-log2(p-)
Determine which single attribute best classifies the training
examples using information gain.
For each attribute find:
Gain(S, Ai) = H(S) - Σ v ∈ Values(Ai) P(Ai = v) * H(Sv)
where H(S) is the entropy of the whole set S and H(Sv) is the
entropy of the subset Sv of examples having value v for attribute Ai
Use attribute with greatest information gain as a root
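A sketch of the gain computation in Python, reusing entropy_of_distribution() from the previous snippet; representing examples as dicts of attribute -> value (including the target) is my own choice, not something prescribed by the slides.

from collections import Counter

def entropy(labels):
    # H(S) over a list of class labels, e.g. ['+', '+', '-']
    total = len(labels)
    return entropy_of_distribution([c / total for c in Counter(labels).values()])

def information_gain(examples, attribute, target):
    # Gain(S, A) = H(S) - sum over v in Values(A) of P(A = v) * H(Sv)
    labels = [ex[target] for ex in examples]
    gain = entropy(labels)
    for value in set(ex[attribute] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain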
Example: PlayTennis
Four attributes used for classification:
Outlook = {Sunny,Overcast,Rain}
Temperature = {Hot, Mild, Cool}
Humidity = {High, Normal}
Wind = {Weak, Strong}
One predicted (target) attribute (binary)
PlayTennis = {Yes, No}
Given 14 Training examples
9 positive
5 negative
Training Examples
[Table: the 14 training examples (also called minterms, cases, objects, or test cases); 9 of the 14 cases are positive.]
Step 1: Calculate entropy for all cases:
NPos = 9 NNeg = 5 NTot = 14
H(S) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) = 0.940 (the entropy of the full set)
Step 2: Loop over all attributes, calculate gain:
Attribute = Outlook
Loop over values of Outlook
Outlook = Sunny
NPos = 2 NNeg = 3 NTot = 5
H(Sunny) = -(2/5)*log2(2/5) - (3/5)*log2(3/5) = 0.971
Outlook = Overcast
NPos = 4 NNeg = 0 NTot = 4
H(Overcast) = -(4/4)*log2(4/4) - (0/4)*log2(0/4) = 0.00 (taking 0*log2(0) = 0)
Outlook = Rain
NPos = 3 NNeg = 2 NTot = 5
H(Rain) = -(3/5)*log2(3/5) - (2/5)*log2(2/5) = 0.971
Calculate Information Gain for attribute Outlook
Gain(S, Outlook) = H(S) - NSunny/NTot*H(Sunny)
- NOver/NTot*H(Overcast)
- NRain/NTot*H(Rain)
Gain(S, Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0 - (5/14)*0.971
Gain(S, Outlook) = 0.246
Attribute = Temperature
(Repeat process looping over {Hot, Mild, Cool})
Gain(S, Temperature) = 0.029
Attribute = Humidity
(Repeat process looping over {High, Normal})
Gain(S, Humidity) = 0.151
Attribute = Wind
(Repeat process looping over {Weak, Strong})
Gain(S, Wind) = 0.048
Find attribute with greatest information gain:
Gain(S,Outlook) = 0.246, Gain(S,Temperature) = 0.029
Gain(S,Humidity) = 0.151, Gain(S,Wind) = 0.048
Outlook is root node of tree
Iterate the algorithm to find the attributes that best classify the
training examples under each value of the root node
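As a quick numerical check of the Outlook calculation above, the counts given in the slides can be fed through the entropy() helper from the earlier snippet (the '+' / '-' labels are just placeholders):

h_s     = entropy(['+'] * 9 + ['-'] * 5)   # ~0.940
h_sunny = entropy(['+'] * 2 + ['-'] * 3)   # ~0.971
h_over  = entropy(['+'] * 4)               # 0.0
h_rain  = entropy(['+'] * 3 + ['-'] * 2)   # ~0.971
gain_outlook = h_s - (5/14)*h_sunny - (4/14)*h_over - (5/14)*h_rain
print(gain_outlook)   # ~0.25, consistent with Gain(S, Outlook) = 0.246 above
                      # (small differences come from rounding the intermediate values)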
Example continued
Take three subsets:
Outlook = Sunny (NTot = 5)
Outlook = Overcast (NTot = 4)
Outlook = Rain (NTot = 5)
For each subset, repeat the above calculation looping over all
attributes other than Outlook
For example:
Outlook = Sunny (NPos = 2, NNeg=3, NTot = 5) H=0.971
Temp = Hot (NPos = 0, NNeg=2, NTot = 2) H = 0.0
Temp = Mild (NPos = 1, NNeg=1, NTot = 2) H = 1.0
Temp = Cool (NPos = 1, NNeg=0, NTot = 1) H = 0.0
Gain(SSunny, Temperature) = 0.971 - (2/5)*0 - (2/5)*1 - (1/5)*0
Gain(SSunny, Temperature) = 0.571
Similarly:
Gain(SSunny, Humidity) = 0.971
Gain(SSunny, Wind) = 0.020
Humidity classifies the Outlook=Sunny
instances best and is placed as the node under
the Sunny branch.
Repeat this process for Outlook = Overcast & Rain
End up with tree:
Important:
Attributes are excluded from consideration if
they appear higher in the tree
Process continues for each new leaf node
until:
Every attribute has already been included
along path through the tree
or
Training examples associated with this leaf
all have same target attribute value.
Note: In this example data were perfect.
No contradictions
Branches led to unambiguous Yes, No decisions
If there are contradictions take the majority vote
This handles noisy data.
Another note:
Attributes are eliminated when they are assigned to a
node and never reconsidered.
e.g., you would not go back and reconsider Outlook under
Humidity
ID3 uses all of the training data at once
In contrast to Candidate-Elimination
So it can handle noisy data.
The ID3 algorithm is used to build a decision tree, given a set of non-categorical attributes
C1, C2, .., Cn, the categorical attribute C, and a training set T of records.
function ID3 (R: a set of non-categorical attributes,
              C: the categorical attribute,
              S: a training set) returns a decision tree;
begin
    If S is empty, return a single node with value Failure;
    If every example in S has the same value for the categorical
        attribute, return a single node with that value;
    If R is empty, then return a single node with the most
        frequent of the values of the categorical attribute found in
        the examples of S; [note: there will be errors, i.e., improperly
        classified records];
    Let D be the attribute with the largest Gain(D,S) among R's attributes;
    Let {dj | j=1,2, .., m} be the values of attribute D;
    Let {Sj | j=1,2, .., m} be the subsets of S consisting
        respectively of records with value dj for attribute D;
    Return a tree with root labeled D and arcs labeled
        d1, d2, .., dm going respectively to the trees
        ID3(R-{D},C,S1), ID3(R-{D},C,S2), .., ID3(R-{D},C,Sm);
end ID3;
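A runnable Python sketch of the ID3 pseudocode above, reusing the entropy() and information_gain() helpers from the earlier snippets; the dict-based tree representation (a {"label": ...} dict for leaves, and {"attribute", "branches", "majority"} for internal nodes) is my own choice rather than anything prescribed by the slides.

from collections import Counter

def id3(examples, attributes, target):
    if not examples:                                     # S is empty
        return {"label": "Failure"}
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                            # every example has the same class
        return {"label": labels[0]}
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                                   # R is empty: use the most frequent class
        return {"label": majority}
    # D = attribute with the largest Gain(D, S) among R's attributes
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    branches = {}
    for value in set(ex[best] for ex in examples):       # one subtree per value dj of D
        subset = [ex for ex in examples if ex[best] == value]
        branches[value] = id3(subset, [a for a in attributes if a != best], target)
    return {"attribute": best, "branches": branches, "majority": majority}

On the 14 PlayTennis examples, a call such as id3(examples, ["Outlook", "Temperature", "Humidity", "Wind"], "PlayTennis") would place Outlook at the root, as computed earlier.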
Entropy
Decision Tree Learning
Does Entropy Make Sense?
If an event conveys information, that means it’s a
surprise.
If an event always occurs, P(Ai)=1, then it carries no
information. -log2(1) = 0
If an event rarely occurs (e.g. P(Ai)=0.001), it
carries a lot of info. -log2(0.001) = 9.97
The less likely (more uncertain) the event, the more
information it carries, since, for 0 ≤ P(Ai) ≤ 1,
-log2(P(Ai)) increases as P(Ai) goes from 1 to 0.
(Note: ignore events with P(Ai)=0 since they never occur.)
What about entropy?
Is it a good measure of the information carried by an
ensemble of events?
If the events are equally probable, the entropy is maximum.
1) For N events, each occurring with probability 1/N:
H = -N * (1/N) * log2(1/N) = -log2(1/N) = log2(N)
This is the maximum value.
(e.g., for N = 256 (ASCII characters), log2(256) = 8, the
number of bits needed per character.
Base-2 logs measure information in bits.)
This is a good thing since an ensemble of equally probable
events is as uncertain as it gets.
(Remember, information corresponds to surprise - uncertainty.)
[Figure: entropy as a function of the class proportions; entropy is largest when the classes are equally likely. Boolean functions with the same number of ones and zeros have the largest entropy.]
2) H is a continuous function of the probabilities.
That is always a good thing.
3) If you sub-group events into compound events, the
entropy calculated for these compound groups is the same.
That is good since the uncertainty is the same.
It is a remarkable fact that the equation for entropy
shown above (up to a multiplicative constant) is the
only function which satisfies these three conditions.
The choice of base-2 logs corresponds to choosing the unit of
information (bits).
Another remarkable thing:
This is the same definition of entropy used in statistical
mechanics for the measure of disorder.
Corresponds to macroscopic thermodynamic quantity of
Second Law of Thermodynamics.
The concept of a quantitative measure for information
content plays an important role in many areas:
For example,
Data communications (channel capacity)
Data compression (limits on error-free encoding)
Entropy in a message corresponds to minimum number of
bits needed to encode that message.
In our case, for a set of training data, the entropy measures
the number of bits needed to encode classification for an
instance.
Use probabilities found from entire set of training data.
Prob(Class=Pos) = Num. of positive cases / Total cases
Prob(Class=Neg) = Num. of negative cases / Total cases
Hypothesis Space
Decision Tree Learning
Hypothesis Space
The tree itself forms the hypothesis
Disjunction (ORs) of conjunctions (ANDs)
Each path from root to leaf forms a conjunction
of constraints on attributes
Separate branches are disjunctions
Example from PlayTennis decision tree:
(Outlook=Sunny ∧ Humidity=Normal)
∨ (Outlook=Overcast)
∨ (Outlook=Rain ∧ Wind=Weak)
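For illustration, this hypothesis can be written directly as a Python predicate; the attribute names follow the PlayTennis example, everything else is my own phrasing.

def play_tennis(outlook, humidity, wind):
    # (Outlook=Sunny AND Humidity=Normal) OR (Outlook=Overcast) OR (Outlook=Rain AND Wind=Weak)
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))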
Expressiveness
Decision trees can express any function of the input attributes.
E.g., for Boolean functions, truth table row → path to leaf:
Trivially, there is a consistent decision tree for any training set, with one path to a
leaf for each example (unless f is nondeterministic in x), but it probably won't
generalize to new examples
Prefer to find more compact decision trees
Hypothesis spaces
How many distinct decision trees with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)
E.g., with 6 Boolean attributes, there are 2^64 =
18,446,744,073,709,551,616 trees
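The count above can be checked in one line of Python:

print(2 ** (2 ** 6))   # 18446744073709551616 distinct Boolean functions of 6 attributes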
Aim: find a small tree consistent with the training examples
Idea: (recursively) choose "most significant" attribute as root of
(sub)tree
Extensions of Decision Tree Learning
Extensions of Decision Tree Learning
Noisy data and Overfitting
Cross-Validation for Experimental Validation of
Performance
Pruning Decision Trees
Real-valued data
Using gain ratios
Generation of rules
Setting Parameters
Incremental learning
Noisy data and Overfitting
Many kinds of "noise" that could occur in the examples:
Two examples have same attribute/value pairs, but different classifications
Some values of attributes are incorrect because of:
Errors in the data acquisition process
Errors in the preprocessing phase
The classification is wrong (e.g., + instead of -) because of some error
Some attributes are irrelevant to the decision-making process,
e.g., color of a die is irrelevant to its outcome.
Irrelevant attributes can result in overfitting the training data.
Noisy data and Overfitting
Black dots are positive examples, the others negative
The two lines represent two hypotheses
The thick line is a complex hypothesis that correctly
classifies all of the data
The thin line is a simple hypothesis that incorrectly
classifies some of the data
The simple hypothesis makes some errors
but reasonably closely represents the trend in
the data
The complex hypothesis fits every point but does not
represent the overall trend of the data at all
Fixing the overfitting / overlearning
problem
By cross-validation
By pruning lower nodes in the decision tree
Cross Validation: An Evaluation Methodology
Standard methodology: cross validation
1. Collect a large set of examples (all with correct classifications!).
2. Randomly divide collection into two disjoint sets: training and
test.
3. Apply learning algorithm to training set giving hypothesis H
4. Measure performance of H w.r.t. test set
Important: keep the training and test sets disjoint!
The goal of learning is not to minimize the training error but
the error on the test/cross-validation set: this is a way to detect and fix overfitting
To study the efficiency and robustness of an algorithm,
repeat steps 2-4 for different training sets and sizes of
training sets.
If you improve your algorithm, start again with step 1 to
avoid evolving the algorithm to work well on just this
collection.
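A minimal sketch of steps 2-4 in Python, reusing the id3() sketch from earlier; classify(), accuracy() and the random split are my own helpers, not part of the slides.

import random

def classify(node, example):
    # Walk from the root to a leaf, following the branch for each attribute value.
    while "label" not in node:
        branch = node["branches"].get(example[node["attribute"]])
        if branch is None:                      # value not seen during training
            return node["majority"]
        node = branch
    return node["label"]

def accuracy(tree, data, target):
    return sum(classify(tree, ex) == ex[target] for ex in data) / len(data)

def evaluate(examples, attributes, target, train_fraction=0.7, seed=0):
    # Step 2: random split into disjoint training and test sets
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    train, test = shuffled[:cut], shuffled[cut:]
    # Steps 3-4: learn hypothesis H on the training set, measure it on the test set
    tree = id3(train, attributes, target)
    return accuracy(tree, test, target)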
Pruning Decision Trees
Pre-pruning: Stop growing the tree before it is
fully grown
Post-pruning: Trim the fully grown
tree from the bottom
Reduced Error Pruning
Rule post pruning
Reduced Error Pruning
Partitioning data in tree induction
Reduced Error Pruning
A post-pruning, cross-validation approach.
Partition the training data into "grow" and "validation" sets.
Build a complete tree from the “grow” data.
Until accuracy on validation set decreases do:
For each non-leaf node, n, in the tree do:
Temporarily prune the subtree below n and replace it with a
leaf labeled with the current majority class at that node.
Measure and record the accuracy of the pruned tree on the validation set.
Permanently prune the node that results in the greatest increase in accuracy
on the validation set.
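A sketch of the procedure above on the dict-based trees produced by the id3() sketch, reusing accuracy() from the cross-validation snippet; the grow/validation split itself is assumed to be done by the caller, and the helper names are my own.

def internal_nodes(node):
    # Collect every non-leaf node in the tree.
    if "label" in node:
        return []
    nodes = [node]
    for child in node["branches"].values():
        nodes.extend(internal_nodes(child))
    return nodes

def reduced_error_prune(tree, validation, target):
    while True:
        best_node, best_acc = None, accuracy(tree, validation, target)
        for node in internal_nodes(tree):
            saved = dict(node)
            node.clear()
            node["label"] = saved["majority"]    # temporarily replace subtree by a majority-class leaf
            acc = accuracy(tree, validation, target)
            node.clear()
            node.update(saved)                   # restore the subtree
            if acc >= best_acc:                  # keep pruning while validation accuracy does not decrease
                best_node, best_acc = node, acc
        if best_node is None:
            return tree
        majority = best_node["majority"]
        best_node.clear()
        best_node["label"] = majority            # prune the best node permanently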
Ockham’s Razor
Principle proposed by William of
Ockham in the fourteenth century:
“Pluralitas non est ponenda sine necessitate”
(plurality should not be posited without necessity).
Of two theories providing
similarly good predictions, prefer
the simplest one.
Shave off unnecessary parameters
of your models.
Real-valued data
Select a set of thresholds defining intervals;
each interval becomes a discrete value of the attribute
We can use some simple heuristics
always divide into quartiles
We can use domain knowledge
divide age into infant (0-2), toddler (3 - 5), and school aged (5-8)
or treat this as another learning problem
try a range of ways to discretize the continuous variable
Find out which yield “better results” with respect to some metric.
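A small sketch of the quartile heuristic in Python; the function names and bin labels are my own.

import statistics

def quartile_thresholds(values):
    # The three cut points (Q1, Q2, Q3) that divide the values into quartiles (Python 3.8+).
    return statistics.quantiles(values, n=4)

def discretize(value, thresholds, labels=("low", "mid-low", "mid-high", "high")):
    # Map a real value to the discrete label of the interval it falls into.
    for threshold, label in zip(thresholds, labels):
        if value <= threshold:
            return label
    return labels[-1]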
Performance Evaluation
Decision Tree Learning
Metrics for Performance Evaluation
Focus on the predictive capability of a model
Rather than how long it takes to classify or build models, scalability, etc.
Confusion Matrix:
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   TP          FN
CLASS    Class=No    FP          TN

TP (true positive): predicted to be in YES, and is actually in it
FP (false positive): predicted to be in YES, but is not actually in it
TN (true negative): predicted not to be in YES, and is not actually in it
FN (false negative): predicted not to be in YES, but is actually in it
Metrics for Performance Evaluation: Accuracy
Most widely-used metric:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy: the class imbalance problem
Consider a 2-class problem
Number of Class 0 examples = 9990
Number of Class 1 examples = 10
If model predicts everything to be class 0, accuracy
is 9990/10000 = 99.9 %
Accuracy is misleading because model does not detect
any class 1 example
Classifier Evaluation Metrics:
Accuracy, Error Rate, Sensitivity and Specificity
A\P       Yes   No    Total
Yes       TP    FN    P
No        FP    TN    N
Total     P'    N'    All

Sensitivity: True Positive recognition rate
Sensitivity = TP/P
Specificity: True Negative recognition rate
Specificity = TN/N
Classifier Accuracy, or recognition rate: percentage of test set
tuples that are correctly classified
Accuracy = (TP + TN)/All
Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All
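The four quantities above, computed directly from confusion-matrix counts in Python; the function name is my own.

def basic_metrics(tp, fn, fp, tn):
    p, n = tp + fn, fp + tn                 # actual positives and negatives
    total = p + n                           # "All"
    return {
        "sensitivity": tp / p,              # TP / P
        "specificity": tn / n,              # TN / N
        "accuracy": (tp + tn) / total,      # (TP + TN) / All
        "error_rate": (fp + fn) / total,    # 1 - accuracy
    }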
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
precision = TP / (TP + FP)
Recall: completeness – what % of positive tuples did the
classifier label as positive?
recall = TP / (TP + FN)
A perfect score is 1.0
F-measure (F1 score or F-score)
harmonic mean of precision and recall:
F = 2 * precision * recall / (precision + recall)
Precision is biased towards TP & FP
Recall is biased towards TP & FN
F-measure is biased towards all except TN
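The same definitions written as a small Python function; the function name is my own.

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)                            # exactness
    recall = tp / (tp + fn)                               # completeness
    f1 = 2 * precision * recall / (precision + recall)    # harmonic mean
    return precision, recall, f1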
Classifier Evaluation Metrics:
Matthews correlation coefficient (MCC)
MCC takes into account true and false positives and negatives.
Generally regarded as a balanced measure which can be used
even if the classes are of very different sizes.
It returns a value between −1 and +1.
1 represents a perfect prediction
0 no better than random prediction
−1 indicates total disagreement between prediction and observation
Classifier Evaluation Metrics:
Matthews correlation coefficient (MCC)
N = TN + TP + FN + FP
S = (TP + FN) / N
P = (TP + FP) / N
MCC = (TP/N - S*P) / sqrt(P*S*(1-S)*(1-P))

Equivalently, in terms of the counts:
MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
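The count-based form of MCC in Python; returning 0 when the denominator is 0 is a common convention, not something stated in the slides.

import math

def mcc(tp, tn, fp, fn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0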
Summary
Decision Tree Learning
A greedy search approach
At each step, make the decision that gives the
greatest improvement in whatever you are
trying to optimize.
Do not backtrack (unless you hit a dead end)
This type of search is unlikely to find a
globally optimal solution, but it generally works
well.
Types of problems decision tree learning is
good for:
Instances represented by attribute-value pairs
For the algorithm in the book, attributes take on a small number
of discrete values
Robust to imperfect training data
classification errors
errors in attribute values
missing attribute values
Can be extended to real-valued attributes
(numerical data)
Target function has discrete output values
The algorithm in the book assumes Boolean functions
Can be extended to multiple output values
Example Use
Equipment diagnosis
Medical diagnosis
Credit card risk analysis
Robot movement
Pattern Recognition
face recognition
hexapod walking gaits
How well does it work?
Many case studies have shown that decision trees are at
least as accurate as human experts.
A study for diagnosing breast cancer:
humans correctly classified the examples 65% of the time;
the decision tree classified 72% correctly.
British Petroleum designed a decision tree for gas-oil separation
for offshore oil platforms.
It replaced an earlier rule-based expert system.
Cessna designed an airplane flight controller using 90,000
examples and 20 attributes per example.
Summary of DT Learning
Inducing decision trees is one of the most widely used learning
methods in practice
Can out-perform human experts in many problems
Strengths include
Fast
simple to implement
can convert result to a set of easily interpretable rules
empirically valid in many commercial products
handles noisy data
Weaknesses include:
"Univariate" splits/partitioning using only one attribute at a time so limits
types of possible trees
large decision trees may be hard to understand
requires fixed-length feature vectors
References
Chapter 18 of “Artificial Intelligence: A Modern Approach” by Stuart Russell and Peter Norvig.
Chapter 10 of “AI Illuminated” by Ben Coppin.