
CS-471

Machine Learning
Dr. Hammad Afzal
[email protected]

Prof (NUST)
Data and Text Processing Lab
www.codteem.com

1
Agenda
• Supervised Learning

• Model Selection and Evaluation

• Decision Tree

2
Supervised Learning
Lecture 3

3
Classification Process

4
Classification Applications
• Classification
– Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations.
– New data is classified based on the training set

5
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, decision trees, or mathematical formulae

6
Classification—A Two-Step Process
• Model usage:
– For classifying future or unknown objects
– Estimate accuracy of the model
▪ The known label of test sample is compared with
the classified result from the model
▪ Accuracy rate is the percentage of test set
samples that are correctly classified by the model
▪ Test set is independent of training set (otherwise
over-fitting)
– If the accuracy is acceptable, use the model to classify new data

7
Process (1): Model Construction

Training Data + Classification Algorithm → Classifier (Model)

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Learned model:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’

8
Process (2): Model Usage for Prediction

Classifier (learned model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’

Testing Data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured?

9
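The learned rule from the previous slide can be applied directly in code. Below is a minimal sketch; the function name predict_tenured and the tuple encoding are illustrative, not part of the lecture.

```python
def predict_tenured(rank: str, years: int) -> str:
    """Learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Unseen tuple from the slide: (Jeff, Professor, 4)
print(predict_tenured("Professor", 4))  # -> yes

# Estimate accuracy by comparing predictions with the known labels of the testing data
testing_data = [("Tom", "Assistant Prof", 2, "no"), ("Merlisa", "Associate Prof", 7, "no"),
                ("George", "Professor", 5, "yes"), ("Joseph", "Assistant Prof", 7, "yes")]
correct = sum(predict_tenured(rank, years) == label for _, rank, years, label in testing_data)
print(correct / len(testing_data))  # fraction of test tuples classified correctly
```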
Dividing a Dataset
• We need independent data sets to train, set
parameters, and test performance

• Thus we will often divide a data set into three


– Training set
– Parameter selection set (Validation Dataset)
– Test set

• These must be independent


• The second data set (the validation set) is not always necessary

10
Model Evaluation and Selection

11
Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other metrics to consider?
• Use a test set of class-labeled tuples, rather than the training set, when assessing accuracy
• Some of the measures are:
  – Accuracy – suitable when class tuples are evenly distributed
  – Precision – suitable when class tuples are not evenly distributed
  – Recall – also known as sensitivity

12
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:

Actual class \ Predicted class   Yes                    No
Yes                              True Positives (TP)    False Negatives (FN)
No                               False Positives (FP)   True Negatives (TN)

Actual class \ Predicted class   C1                     ¬C1
C1                               True Positives (TP)    False Negatives (FN)
¬C1                              False Positives (FP)   True Negatives (TN)

• Given m classes, an entry CM(i,j) in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals

13
Classifier Evaluation Metrics: Confusion Matrix

• True Positives:
– Positive tuples correctly classified as positive.

• True Negatives:
– Negative tuples correctly classified as negative.

• False Positives:
– Negative tuples incorrectly classified as positives.

• False Negatives:
– Positive tuples incorrectly classified as negatives
14
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class \ Predicted class   C1                     ¬C1
C1                               True Positives (TP)    False Negatives (FN)
¬C1                              False Positives (FP)   True Negatives (TN)

Example of Confusion Matrix:

Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000

• Given m classes, an entry CM(i,j) in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals

15
Accuracy/Error Rate

A \ P   C     ¬C
C       TP    FN    P
¬C      FP    TN    N
        P’    N’    All

• Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN) / All
• Error rate: 1 – accuracy, or
  Error rate = (FP + FN) / All

16
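As a quick sketch, both rates can be computed directly from the confusion-matrix counts; the function names below are illustrative, and the numbers are the buy_computer counts from the earlier confusion-matrix slide.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / All."""
    return (tp + tn) / (tp + tn + fp + fn)

def error_rate(tp: int, tn: int, fp: int, fn: int) -> float:
    """Error rate = (FP + FN) / All = 1 - accuracy."""
    return (fp + fn) / (tp + tn + fp + fn)

# buy_computer example: TP = 6954, TN = 2588, FP = 412, FN = 46
print(accuracy(6954, 2588, 412, 46))    # 0.9542
print(error_rate(6954, 2588, 412, 46))  # 0.0458
```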
Sensitivity and Specificity
◼ Class Imbalance Problem:
  ◼ One class may be rare, e.g. fraud, or HIV-positive
  ◼ Significant majority of the negative class and minority of the positive class
◼ Sensitivity: True Positive recognition rate
  ◼ Sensitivity = TP / P
◼ Specificity: True Negative recognition rate
  ◼ Specificity = TN / N

17
Precision and Recall, and F-measures

• Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive
  Precision = TP / (TP + FP)
• Recall: completeness – what % of positive tuples did the classifier label as positive?
  Recall = TP / (TP + FN)
• Perfect score is 1.0

18
Precision and Recall, and F-measures

• Inverse relationship between precision & recall
• F measure (F1 or F-score): harmonic mean of precision and recall
  F1 = 2 × Precision × Recall / (Precision + Recall)
• Fβ: weighted measure of precision and recall
  – assigns β times as much weight to recall as to precision
  Fβ = (1 + β²) × Precision × Recall / (β² × Precision + Recall)

19
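A minimal sketch of these three measures; the helper names are illustrative.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of tuples labeled positive that are actually positive."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of actual positive tuples that were labeled positive (sensitivity)."""
    return tp / (tp + fn)

def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """F-measure: harmonic mean of p and r when beta = 1; beta > 1 weights recall more."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```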
Classifier Evaluation Metrics: Example

Actual class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.50 (accuracy)

– Precision = 90/230 = 39.13%    Recall = 90/300 = 30.00%

20
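The numbers on this slide can be checked from the raw counts (TP = 90, FN = 210, FP = 140, TN = 9560); the short sketch below is only an arithmetic check.

```python
tp, fn, fp, tn = 90, 210, 140, 9560

print("precision   =", tp / (tp + fp))                   # 90/230    = 0.3913
print("recall      =", tp / (tp + fn))                   # 90/300    = 0.30 (sensitivity)
print("specificity =", tn / (tn + fp))                   # 9560/9700 = 0.9856
print("accuracy    =", (tp + tn) / (tp + tn + fp + fn))  # 9650/10000 = 0.965
```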
Holdout Method

• Holdout method
– Given data is randomly partitioned into two independent sets
▪ Training set (e.g., 2/3) for model construction
▪ Test set (e.g., 1/3) for accuracy estimation

– Random sampling: a variation of holdout


▪ Repeat holdout k times, accuracy = avg. of the accuracies
obtained

21
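A sketch of the holdout split with scikit-learn (assuming scikit-learn is installed; the iris data and the 2/3–1/3 ratio are only for illustration).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Hold out 1/3 of the data for testing, train on the remaining 2/3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))
```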
Cross-Validation Methods
• Cross-validation (k-fold, where k = 10 is most popular)
  – Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
  – Training and testing are performed k times
  – At the i-th iteration, use Di as the test set and the remaining subsets as the training set
  – Each subset is used k − 1 times for training and exactly once for testing
  – Accuracy = overall number of correct classifications over the k iterations / total number of tuples

22
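A sketch of 10-fold cross-validation with scikit-learn; cross_val_score handles the partitioning, training, and testing, and returns one accuracy per fold (the iris data is only for illustration).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# k = 10 folds: each fold is the test set once and part of the training set nine times
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores)
print("mean accuracy    :", scores.mean())
```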
5-Fold Cross-Validation
Cross-Validation Methods
– Leave-one-out: k folds where k = # of tuples, for small-sized data
– Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data

24
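Both variants have ready-made splitters in scikit-learn; a minimal sketch, again on the iris data for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Stratified 10-fold: each fold keeps roughly the original class distribution
strat_scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
print("stratified 10-fold mean accuracy:", strat_scores.mean())

# Leave-one-out: k = number of tuples (practical only for small data sets)
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("leave-one-out mean accuracy     :", loo_scores.mean())
```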
Decision Tree

25
Decision Tree

26
Decision Tree

27
Apply Model To Test Data

28
Apply Model To Test Data

29
Apply Model To Test Data

30
Apply Model To Test Data

31
Apply Model To Test Data

32
Apply Model To Test Data

33
Apply Model To Test Data

34
Example Decision Tree

35
Decision Tree Classification

36
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
  – Tree is constructed in a top-down recursive divide-and-conquer manner
  – At start, all the training examples are at the root
  – Attributes are categorical (works best with categorical attributes – if values are continuous, they can be discretized)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a statistical measure (e.g., information gain)

37
Algorithm for Decision Tree Induction

• Conditions for stopping partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf

38
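A compact sketch of this greedy, top-down induction, using information gain to pick the test attribute and the two stopping conditions above. This is an illustrative ID3-style implementation, not code from the lecture; rows are assumed to be dicts mapping attribute names to categorical values.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Expected information needed to classify a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy obtained by partitioning on attribute attr."""
    n = len(labels)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                       # stop: all samples in one class
        return labels[0]
    if not attrs:                                   # stop: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(r[best] for r in rows):        # partition recursively on the best attribute
        sub = [(r, lab) for r, lab in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = [r for r, _ in sub], [lab for _, lab in sub]
        tree[best][value] = build_tree(sub_rows, sub_labels, [a for a in attrs if a != best])
    return tree
```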
Decision Tree Classification

39
Decision Tree Classification

40
How to Determine the best Split

41
How to Determine the best Split

42
Brief Review of Entropy

43
Attribute Selection Measure: Information Gain (ID3/C4.5)

Computing the Entropy / Information Gain

• Class P: buys_computer = “yes” (9 tuples)
• Class N: buys_computer = “no” (5 tuples)

Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no

46
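The figures above can be reproduced from the counts on the slide (9 yes / 5 no overall; for age: 2/3, 4/0, and 3/2 yes/no per group). A small sketch, only as an arithmetic check:

```python
from math import log2

def I(*counts):
    """Expected information I(c1, c2, ...) for a tuple of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

info_d = I(9, 5)                                             # 0.940
info_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)  # 0.694
gain_age = info_d - info_age                                 # about 0.247 (the slide rounds to 0.246)
print(round(info_d, 3), round(info_age, 3), round(gain_age, 3))
```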
A Decision Tree for “Buys_Computer”

age?
  <=30   → student?
             no  → no
             yes → yes
  31..40 → yes
  >40    → credit_rating?
             excellent → no
             fair      → yes

47
Extracting Classification Rules from Trees

• Represent the knowledge in the form of IF-THEN rules


• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”

48
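A sketch of this extraction for the buys_computer tree two slides back, with the tree represented as nested dicts. The representation and the function name are illustrative, not from the lecture.

```python
def extract_rules(tree, path=(), target="buys_computer"):
    """Walk every root-to-leaf path and print one IF-THEN rule per leaf."""
    if not isinstance(tree, dict):                 # a leaf holds the class prediction
        conditions = " AND ".join(f'{attr} = "{value}"' for attr, value in path)
        print(f'IF {conditions} THEN {target} = "{tree}"')
        return
    (attr, branches), = tree.items()               # internal node: one attribute test
    for value, subtree in branches.items():        # each branch adds one conjunct to the path
        extract_rules(subtree, path + ((attr, value),), target)

buys_computer_tree = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31…40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}
extract_rules(buys_computer_tree)
```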
Gain Ratio (C4.5)
• Information gain measure is biased towards attributes with a large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain)

  SplitInfo_A(D) = − Σ (j = 1..v) (|Dj| / |D|) · log2(|Dj| / |D|)

  – Gain_Ratio(A) = Gain(A) / SplitInfo_A(D)
• Ex.
  – Income splits D into subsets of size 4 (high), 6 (medium), and 4 (low), so SplitInfo_income(D) = 1.557
  – Gain_Ratio(income) = 0.029 / 1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the splitting attribute

50
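A sketch checking the 1.557 and 0.019 figures; the partition sizes 4/6/4 come from the income column of the buys_computer table.

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum over the v partitions of (|Dj|/|D|) * log2(|Dj|/|D|)."""
    total = sum(partition_sizes)
    return -sum(s / total * log2(s / total) for s in partition_sizes)

si = split_info([4, 6, 4])           # income = high / medium / low
print(round(si, 3))                  # 1.557
print(round(0.029 / si, 3))          # Gain_Ratio(income) = 0.019
```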
Comparing Attribute Selection
Measures
• These measures, in general, return good results, but

– Information gain:
▪ biased towards multi-valued attributes

– Gain ratio:
▪ tends to prefer unbalanced splits in which one partition is
much smaller than the others

51
Other Attribute Selection
Measures
• CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence

• C-SEP: performs better than info. gain and gini index in certain cases

• G-statistic: has a close approximation to χ2 distribution

• MDL (Minimal Description Length) principle (i.e., the simplest solution is


preferred):

– The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree

• Multivariate splits (partition based on multiple variable combinations)

• CART: finds multivariate splits based on a linear comb. of attrs.


• Which attribute selection measure is the best?
  – Most give good results; none is significantly superior to the others

52
Thank You

