2 Supervised Learning
Machine Learning
Dr. Hammad Afzal
[email protected]
Prof (NUST)
Data and Text Processing Lab
www.codteem.com
Agenda
• Supervised Learning
• Decision Tree
Supervised Learning
Lecture 3
Classification Process
Classification Applications
• Classification
– Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations.
– New data is classified based on the training set
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– A classification algorithm learns a model from the training data
• Model usage: classifying future or unseen data
– The model's accuracy is first estimated on an independent test set; if it is acceptable, the model is applied to unseen data

Example test set, followed by an unseen query tuple:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
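A minimal sketch of the two steps, assuming scikit-learn and an illustrative integer encoding of RANK (none of this appears on the original slides):

```python
# Step 1: model construction from labeled training data.
# Step 2: model usage on an unseen tuple -- (Jeff, Professor, 4).
from sklearn.tree import DecisionTreeClassifier

# Illustrative encoding: Assistant Prof = 0, Associate Prof = 1, Professor = 2
X = [[0, 2], [1, 7], [2, 5], [0, 7]]   # (rank, years) for Tom..Joseph
y = ["no", "no", "yes", "yes"]         # TENURED labels

model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.predict([[2, 4]]))         # prediction for Jeff
```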
Dividing a Dataset
• We need independent data sets to train, set
parameters, and test performance
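A sketch of such a three-way split, assuming scikit-learn and illustrative ratios:

```python
# Train / validation / test split: fit on train, tune parameters on
# validation, report final performance on test. Ratios are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)             # 20% held out for testing
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 60/20/20 overall
```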
Model Evaluation and Selection
Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
Classifier Evaluation Metrics: Confusion Matrix
• True Positives (TP):
– Positive tuples correctly classified as positive
• True Negatives (TN):
– Negative tuples correctly classified as negative
• False Positives (FP):
– Negative tuples incorrectly classified as positive
• False Negatives (FN):
– Positive tuples incorrectly classified as negative
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:

Actual class \ Predicted class   C1                     ¬C1
C1                               True Positives (TP)    False Negatives (FN)
¬C1                              False Positives (FP)   True Negatives (TN)

With totals (P/N = actual positives/negatives; P’/N’ = predicted positives/negatives):

A \ P   C    ¬C
C       TP   FN   | P
¬C      FP   TN   | N
        P’   N’   | All
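A sketch of tallying the four cells from predicted and actual labels (the data here is illustrative):

```python
# Count TP, FN, FP, TN for a binary problem; "yes" is the positive class.
y_true = ["yes", "yes", "no", "no", "yes", "no"]
y_pred = ["yes", "no",  "no", "yes", "yes", "no"]

pairs = list(zip(y_true, y_pred))
tp = sum(t == "yes" and p == "yes" for t, p in pairs)
fn = sum(t == "yes" and p == "no"  for t, p in pairs)
fp = sum(t == "no"  and p == "yes" for t, p in pairs)
tn = sum(t == "no"  and p == "no"  for t, p in pairs)
print(tp, fn, fp, tn)  # 2 1 1 2
```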
Sensitivity and Specificity
◼ Class Imbalance Problem:
◼ One class may be rare, e.g., the fraud or HIV-positive class; overall accuracy is then misleading
◼ Sensitivity = TP / P: the true positive (recognition) rate, i.e., the proportion of positive tuples correctly identified
◼ Specificity = TN / N: the true negative rate, i.e., the proportion of negative tuples correctly identified
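A sketch computing both rates; the counts are illustrative values for a highly imbalanced problem:

```python
# Sensitivity = TP / P (recognition rate on actual positives);
# Specificity = TN / N (recognition rate on actual negatives).
tp, fn, fp, tn = 90, 210, 140, 9560     # illustrative counts
p, n = tp + fn, fp + tn
print(f"sensitivity = {tp / p:.4f}")    # 0.3000
print(f"specificity = {tn / n:.4f}")    # 0.9856
```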
Precision and Recall, and F-measures
• Precision (exactness): what fraction of tuples labeled positive are actually positive
– Precision = TP / (TP + FP)
• Recall (completeness): what fraction of positive tuples are labeled positive
– Recall = TP / (TP + FN)
• F-measure (F₁): the harmonic mean of precision and recall
– F = 2 · Precision · Recall / (Precision + Recall)
• F_β: weighted measure, where β is the relative weight given to recall
– F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall)
Classifier Evaluation Metrics: Example
• Suppose a classifier labels 230 tuples as positive, of which 90 are correct (TP = 90, FP = 140), while the data contains 300 actual positives (FN = 210); then
– Precision = 90/230 = 39.13%    Recall = 90/300 = 30.00%
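A sketch reproducing these numbers, plus the corresponding F₁ score:

```python
# Precision, recall, and F1 for the worked example above.
tp, fp, fn = 90, 140, 210

precision = tp / (tp + fp)                       # 90 / 230
recall    = tp / (tp + fn)                       # 90 / 300
f1 = 2 * precision * recall / (precision + recall)
print(f"{precision:.2%} {recall:.2%} {f1:.2%}")  # 39.13% 30.00% 33.96%
```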
Holdout Method
• Holdout method
– Given data is randomly partitioned into two independent sets
▪ Training set (e.g., 2/3) for model construction
▪ Test set (e.g., 1/3) for accuracy estimation
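A sketch of the holdout method, assuming scikit-learn (the data set and model are illustrative):

```python
# Holdout: 2/3 of the data for training, 1/3 for accuracy estimation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(f"holdout accuracy: {model.score(X_test, y_test):.3f}")
```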
Cross-Validation Methods
• Cross-validation (k-fold, where k = 10 is most popular)
– The data is randomly partitioned into k mutually exclusive folds of approximately equal size
– Each fold is used exactly once for testing while the remaining k − 1 folds are used for training; the k accuracy estimates are averaged
5-Fold Cross-Validation
(Figure: the data set is split into 5 folds; in each of 5 rounds a different fold serves as the test set and the rest as training data.)
Cross-Validation Methods
– Leave-one-out: k folds where k = the number of tuples; used for small data sets
– Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as in the initial data
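A sketch of 10-fold stratified cross-validation, assuming scikit-learn (the model and data are illustrative):

```python
# Each fold keeps roughly the class distribution of the full data set;
# every tuple is tested exactly once and trained on k-1 = 9 times.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(f"mean accuracy over 10 folds: {scores.mean():.3f}")
```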
Decision Tree
Apply Model To Test Data
(Figure sequence: starting at the root, a test record is routed down the tree by the attribute test at each internal node until it reaches a leaf, whose class label becomes the prediction.)
Example Decision Tree
Decision Tree Classification
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm), sketched in code below
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At the start, all training examples are at the root; they are then partitioned recursively based on selected attributes
• Conditions for stopping the recursion
– All samples at a node belong to the same class
– No remaining attributes for further partitioning (majority voting labels the leaf)
– No samples are left
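A compact sketch of the greedy recursion in pure Python; `best_attribute` stands in for a selection measure such as information gain (introduced below), and all names are illustrative:

```python
# Greedy top-down induction: pick the best attribute, partition, recurse.
from collections import Counter

def majority_class(rows):
    """Most common label among (features, label) pairs."""
    return Counter(label for _, label in rows).most_common(1)[0][0]

def build_tree(rows, attributes, best_attribute):
    labels = {label for _, label in rows}
    if len(labels) == 1:                  # all samples in one class -> leaf
        return labels.pop()
    if not attributes:                    # no attributes left -> majority vote
        return majority_class(rows)
    a = best_attribute(rows, attributes)  # greedy choice; never revisited
    children = {}
    for value in {features[a] for features, _ in rows}:
        subset = [(f, l) for f, l in rows if f[a] == value]
        children[value] = build_tree(subset, attributes - {a}, best_attribute)
    return (a, children)                  # internal node: attribute + branches
```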
How to Determine the Best Split
• Greedy approach: prefer splits whose resulting nodes have a homogeneous (pure) class distribution
• This requires a measure of node impurity, reviewed next
Brief Review of Entropy
• Entropy measures the impurity (uncertainty) of a class distribution: H = −Σᵢ pᵢ · log₂(pᵢ)
• H = 0 when all tuples belong to one class; H is maximal (log₂ m for m classes) when the classes are equally likely, as sketched below
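A minimal sketch of the computation (the function name is illustrative):

```python
# Entropy of a class distribution: H = -sum(p_i * log2(p_i)).
from math import log2

def entropy(probabilities):
    """Entropy in bits of a discrete distribution (zero terms skipped)."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # 1.0  (maximum uncertainty for 2 classes)
print(entropy([1.0, 0.0]))    # 0.0  (no uncertainty)
print(entropy([9/14, 5/14]))  # 0.940... (the buys_computer data set below)
```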
Attribute Selection Measure: Information Gain (ID3/C4.5)
Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)

Info(D) = I(9,5) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) = 0.940

Computing the entropy after splitting on age:

Info_age(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694

where (5/14)·I(2,3) means the branch “age <= 30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048, so age is selected as the splitting attribute.

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
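A sketch recomputing Info(D) and Gain(age) directly from the table (pure Python; the exact arithmetic differs from the slide's rounded intermediate values in the third decimal):

```python
# Recompute Info(D), Info_age(D), and Gain(age) from the table above.
from collections import Counter
from math import log2

def info(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

rows = [  # (age, buys_computer) columns of the table above
    ("<=30", "no"), ("<=30", "no"), ("31…40", "yes"), (">40", "yes"),
    (">40", "yes"), (">40", "no"), ("31…40", "yes"), ("<=30", "no"),
    ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31…40", "yes"),
    ("31…40", "yes"), (">40", "no"),
]

info_d = info([label for _, label in rows])            # I(9,5) = 0.940
info_age = 0.0
for v in {a for a, _ in rows}:                         # <=30, 31…40, >40
    branch = [label for a, label in rows if a == v]
    info_age += len(branch) / len(rows) * info(branch)

print(f"Info(D)     = {info_d:.3f}")                   # 0.940
print(f"Info_age(D) = {info_age:.3f}")                 # 0.694
print(f"Gain(age)   = {info_d - info_age:.3f}")        # 0.247 (the slide's
                                                       # 0.246 uses rounding)
```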
A Decision Tree for “Buys_Computer”

age?
├── <=30   → student?
│            ├── no  → no
│            └── yes → yes
├── 31..40 → yes
└── >40    → credit_rating?
             ├── excellent → no
             └── fair      → yes
Extracting Classification Rules from Trees
• Represent the knowledge as IF-THEN rules: one rule is created for each path from the root to a leaf, each attribute-value pair along the path forms a conjunct, and the leaf node holds the class prediction
• Rules are easier for humans to understand; e.g., from the tree above:
– IF age = “<=30” AND student = “no” THEN buys_computer = “no”
– IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
– IF age = “31..40” THEN buys_computer = “yes”
– IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
– IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
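For comparison, a fitted scikit-learn tree can be printed in a rule-like form with export_text; the toy data and its encoding below are illustrative assumptions:

```python
# Print a fitted tree as indented IF-THEN-style text.
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative encoding: age <=30 -> 0, 31..40 -> 1, >40 -> 2;
# student: no -> 0, yes -> 1.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = ["no", "yes", "yes", "yes", "no", "yes"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "student"]))
```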
Gain Ratio (C4.5)
• Information gain measure is biased towards attributes
with a large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
SplitInfo_A(D) = − Σ_{j=1..v} (|D_j| / |D|) · log₂(|D_j| / |D|)

– Gain_Ratio(A) = Gain(A) / SplitInfo_A(D)
• Ex.: income splits D into 4 “low”, 6 “medium”, and 4 “high” tuples
– SplitInfo_income(D) = −(4/14)·log₂(4/14) − (6/14)·log₂(6/14) − (4/14)·log₂(4/14) = 1.557
– Gain_Ratio(income) = 0.029 / 1.557 = 0.019
• Comparison:
– Information gain: biased towards multi-valued attributes
– Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
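A sketch recomputing the example (the branch counts come from the income column of the buys_computer table):

```python
# Recompute SplitInfo_income(D) and GainRatio(income).
from math import log2

branch_sizes = [4, 6, 4]           # |Dj| for income = low / medium / high
total = sum(branch_sizes)

split_info = -sum(s / total * log2(s / total) for s in branch_sizes)
gain_income = 0.029                # Gain(income) from the earlier slide

print(f"SplitInfo = {split_info:.3f}")                 # 1.557
print(f"GainRatio = {gain_income / split_info:.3f}")   # 0.019
```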
Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm; measure based on the χ² test for independence
• C-SEP: performs better than information gain and Gini index in certain cases
• MDL (Minimal Description Length principle):
– The best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree