
Chapter 4 Machine Learning

COMP 472 Artificial Intelligence

Russell & Norvig – Section 18.1 & 18.2


2 Supervised Learning Algorithms

´ Linear Regression
´ Logistic Regression
´ Naïve Bayes Classifier
´ Decision Tree
´ Random Forest
3 Decision Tree

´ One of the simplest, yet most successful, forms of learning algorithm


´ The best-known algorithm is ID3 (Quinlan, 1986) and its successor C4.5
´ Look for features that are good indicators of the result, and place these features (as questions) in the nodes of the tree
´ Split the examples so that those with different values for the chosen feature end up in different subsets
´ Repeat the same process with another feature

*ID3 = Iterative Dichotomiser 3


4 ID3* / C4.5 Algorithm

´ Top-down construction of the decision tree


´ Recursive selection of the "best feature" to use at the current node in the tree
´ Once the feature is selected for the current node, generate child nodes, one for each possible value of the selected feature
´ Partition the examples using the possible values of this feature, and assign these subsets of examples to the appropriate child node
´ Repeat for each child node until all examples associated with a node are classified (see the sketch below)
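A minimal Python sketch of this top-down recursion (not from the slides), assuming each example is a dict mapping feature names to values and `target` names the output column; `information_gain` is the selection criterion introduced on slides 17–21 and sketched there, and all helper names here are illustrative:

```python
from collections import Counter

def id3(examples, features, target):
    """Top-down, recursive ID3-style construction of a decision tree,
    represented as nested dicts: {feature: {value: subtree_or_leaf}}."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:          # all examples share one class: leaf
        return labels[0]
    if not features:                   # no features left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # pick the feature with the highest information gain
    # (information_gain is sketched after slide 21)
    best = max(features, key=lambda f: information_gain(examples, f, target))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        rest = [f for f in features if f != best]
        tree[best][value] = id3(subset, rest, target)
    return tree
```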
5 Example 1
Information on last year’s students to determine if a student will get an ‘A’ this year

Features (X)                                                 Output f(X)

Student       'A' last year?  Black hair?  Works hard?  Drinks?  'A' this year?
X1: Richard   Yes             Yes          No           Yes      No
X2: Alan      Yes             Yes          Yes          No       Yes
X3: Alison    No              No           Yes          No       No
X4: Jeff      No              Yes          No           Yes      No
X5: Gail      Yes             No           Yes          Yes      Yes
X6: Simon     No              Yes          Yes          Yes      No
6 Example 1
Decision tree learned from the data above (the training table is repeated on the slide):

'A' last year?
  no  -> Output = No
  yes -> Works hard?
           yes -> Output = Yes
           no  -> Output = No
7 Example 2 The Restaurant
´ Goal: learn whether one should wait for a table
´ Attributes
´ Alternate: another suitable restaurant nearby
´ Bar: comfortable bar for waiting
´ Fri/Sat: true on Fridays and Saturdays
´ Hungry: whether one is hungry
´ Patrons: how many people are present (none, some, full)
´ Price: price range ($, $$, $$$)
´ Raining: raining outside
´ Reservation: reservation made
´ Type: kind of restaurant (French, Italian, Thai, Burger)
´ WaitEstimate: estimated wait by host (0-10 mins, 10-30, 30-60, >60)
8 Example 2 The Restaurant
´ Training Data
9 A First Decision Tree
Is it the best decision tree we can build?
10 Ockham's Razor Principle
It is vain to do more than can be done with less… Entities should
not be multiplied beyond necessity. [Ockham, 1324]

´ In other words… always favor the simplest answer that correctly fits
the training data
´ i.e. the smallest tree on average
´ This type of assumption is called inductive bias
´ inductive bias = making a choice beyond what the training
instances contain
11 Finding the Best Tree
[Figure: the search space of decision trees, from the empty tree down to a complete tree]

´ Finding the best tree can be seen as searching the space of all possible decision trees
´ Inductive bias: prefer shorter trees on average
´ How?
  ´ search the space of all decision trees
  ´ always pick the next attribute to split the data on based on its "discriminating power" (information gain)
  ´ in effect, a steepest-ascent hill-climbing search where the heuristic is information gain
12 Which Tree is the Best?

[Figure: two trees over features F1–F7 with the same leaves: a deep, chain-like tree (F1 → F2 → … → F7, one class leaf at each level) versus a shallow, balanced tree (F1 at the root, F2 and F3 below it, F4–F7 above eight class leaves)]
13 Choosing the Next Feature

´ The key problem is choosing which feature to split a given set of examples on
´ ID3 uses Maximum Information Gain:
  ´ Choose the attribute that has the largest information gain
  ´ i.e., the attribute that will result in the smallest expected size of the subtrees rooted at its children
  ´ based on information theory
14 Intuitively
´ Patrons:
  ´ If value is Some… all outputs = Yes
  ´ If value is None… all outputs = No
  ´ If value is Full… we need more tests
´ Type:
  ´ If value is French… we need more tests
  ´ If value is Italian… we need more tests
  ´ If value is Thai… we need more tests
  ´ If value is Burger… we need more tests
´ …
´ So Patrons may lead to a shorter tree…
15 Next Feature

´ For only data where patron = Full


´ hungry
´ If value is Yes… we need more tests
´ If value is No… all output= No
´ type:
´ If value is French… all output= No
´ If value is Italian… all output= No
´ If value is Thai… we need more tests
´ If value is Burger… we need more tests
´…
´ So hungry is more discriminating (only 1 new branch)…
16 Next Feature

´ 4 tests instead of 9
´ 11 branches instead of 21
17 Choosing the Next Attribute

´ The key problem is choosing which feature to split a given set of examples on
´ Most-used strategy: information theory

Entropy (or information content):

    H(X) = -\sum_{x_i \in X} p(x_i) \log_2 p(x_i)

Example: entropy of a fair coin toss, with 2 possible outcomes each of probability 1/2:

    H(fair coin toss) = H(1/2, 1/2) = -(1/2 \log_2 1/2 + 1/2 \log_2 1/2) = 1 bit
18 Entropy

´ Let X be a discrete random variable (RV) with n possible outcomes x_i


´ Entropy (or information content):

    H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

  ´ measures the amount of information in a RV
  ´ average uncertainty of a RV
  ´ the average length of the message needed to transmit an outcome x_i of that variable
  ´ measured in bits
´ for only 2 outcomes x_1 and x_2: 0 ≤ H(X) ≤ 1
19 Why -p(x)log2 (p(x))
20 Example: The Coin Flip
´ Fair coin:
    H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) = -(1/2 \log_2 1/2 + 1/2 \log_2 1/2) = 1 bit

´ Rigged coin (99% heads, 1% tails):
    H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) = -(99/100 \log_2 99/100 + 1/100 \log_2 1/100) ≈ 0.08 bits

fair coin -> high entropy
rigged coin -> low entropy

[Figure: entropy plotted as a function of P(head), peaking at 1 bit when P(head) = 1/2]
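As a sanity check, a small Python sketch (not from the slides) that reproduces the two entropy values above:

```python
import math

def entropy(probs):
    """H(X) = -sum_i p(x_i) * log2 p(x_i), using the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))      # fair coin   -> 1.0 bit
print(entropy([0.99, 0.01]))    # rigged coin -> ~0.08 bits
```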
21 Choosing the Best Feature

´The "discriminating power" of an attribute A given a data set S


´ Let Values(A) = the set of values that attribute A can take
´ Let Sv = the set of examples in the data set which have value
v for attribute A (for each value v from Values(A) )

information gain (or


entropy reduction)

gain(S, A) = H(S) - H(S | A)


Sv
= H(S) - å x H(Sv )
v Î values(A) S
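A minimal Python sketch of this criterion (illustrative, not from the slides), building on the `entropy` helper sketched after slide 20; examples are assumed to be dicts mapping attribute names to values, with `target` naming the output column. This is the `information_gain` helper that the ID3 sketch after slide 4 relies on:

```python
from collections import Counter

def entropy_of(examples, target):
    """Entropy of the class distribution in a set of examples."""
    counts = Counter(ex[target] for ex in examples)
    total = sum(counts.values())
    return entropy([c / total for c in counts.values()])

def information_gain(examples, attribute, target):
    """gain(S, A) = H(S) - sum_v (|S_v| / |S|) * H(S_v)."""
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += len(subset) / len(examples) * entropy_of(subset, target)
    return entropy_of(examples, target) - remainder
```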
22 Some Intuition
Size   Color  Shape   Output
Big    Red    Circle  +
Small  Red    Circle  +
Small  Red    Square  -
Big    Blue   Circle  -

´ Size is the least discriminating attribute (i.e. smallest information gain)
´ Shape and Color are the most discriminating attributes (i.e. highest information gain)
23 A Small Example 1
Size   Color  Shape   Output
Big    Red    Circle  +
Small  Red    Circle  +
Small  Red    Square  -
Big    Blue   Circle  -

Values(Color) = {red, blue}
    n_red: 2+ 1-        n_blue: 0+ 1-

gain(S, Color) = H(S) - \sum_{v \in Values(Color)} \frac{|S_v|}{|S|} H(S_v)

H(S) = -(2/4 \log_2 2/4 + 2/4 \log_2 2/4) = 1

for each v of Values(Color):
    H(S | Color = red)  = H(2/3, 1/3) = -(2/3 \log_2 2/3 + 1/3 \log_2 1/3) = 0.918
    H(S | Color = blue) = H(0, 1) = -(1 \log_2 1) = 0

H(S | Color) = 3/4 (0.918) + 1/4 (0) = 0.6885

gain(Color) = H(S) - H(S | Color) = 1 - 0.6885 = 0.3115


24 A Small Example 2
Size   Color  Shape   Output
Big    Red    Circle  +
Small  Red    Circle  +
Small  Red    Square  -
Big    Blue   Circle  -

Values(Shape) = {circle, square}
    n_circle: 2+ 1-     n_square: 0+ 1-

Note: by definition, \log_2 0 = -∞, but 0 \log_2 0 is taken to be 0

H(S) = -(2/4 \log_2 2/4 + 2/4 \log_2 2/4) = 1

H(S | Shape) = 3/4 (0.918) + 1/4 (0) = 0.6885

gain(Shape) = H(S) - H(S | Shape) = 1 - 0.6885 = 0.3115
25 A Small Example 3
Size   Color  Shape   Output
Big    Red    Circle  +
Small  Red    Circle  +
Small  Red    Square  -
Big    Blue   Circle  -

Values(Size) = {big, small}
    n_big: 1+ 1-        n_small: 1+ 1-

H(S) = -(2/4 \log_2 2/4 + 2/4 \log_2 2/4) = 1

H(S | Size) = 1/2 (1) + 1/2 (1) = 1

gain(Size) = H(S) - H(S | Size) = 1 - 1 = 0
26 A Small Example 4
Size   Color  Shape   Output
Big    Red    Circle  +
Small  Red    Circle  +
Small  Red    Square  -
Big    Blue   Circle  -

gain(Shape) = 0.3115
gain(Color) = 0.3115
gain(Size)  = 0

´ So first separate according to either Color or Shape (root of the tree)
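For reference, a small Python check of these numbers (not from the slides), reusing the `information_gain` helper sketched after slide 21:

```python
toy = [
    {"Size": "Big",   "Color": "Red",  "Shape": "Circle", "Output": "+"},
    {"Size": "Small", "Color": "Red",  "Shape": "Circle", "Output": "+"},
    {"Size": "Small", "Color": "Red",  "Shape": "Square", "Output": "-"},
    {"Size": "Big",   "Color": "Blue", "Shape": "Circle", "Output": "-"},
]

for attr in ("Shape", "Color", "Size"):
    print(attr, round(information_gain(toy, attr, "Output"), 4))
# Shape 0.3113, Color 0.3113, Size 0.0
# (the slides' 0.3115 comes from rounding H(2/3, 1/3) to 0.918 before subtracting)
```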
27 A Small Example 4
Size   Color  Shape   Output
Big    Red    Circle  +
Small  Red    Circle  +
Small  Red    Square  -
Big    Blue   Circle  -

Tree so far: Color at the root
    red  -> subset S2 (split next on Size? or Shape?)
    blue -> Output = -

H(S2) = -(2/3 \log_2 2/3 + 1/3 \log_2 1/3) = 0.918

for each v of Values(Size):
    H(S2 | Size = big)   = H(1/1, 0/1) = 0
    H(S2 | Size = small) = H(1/2, 1/2) = 1
for each v of Values(Shape):
    H(S2 | Shape = circle) = H(2/2, 0/2) = 0
    H(S2 | Shape = square) = H(0/1, 1/1) = 0

H(S2 | Size)  = 1/3 (0) + 2/3 (1) = 2/3
H(S2 | Shape) = 2/3 (0) + 1/3 (0) = 0

gain(Size)  = H(S2) - H(S2 | Size)  = 0.918 - 0.667 = 0.251
gain(Shape) = H(S2) - H(S2 | Shape) = 0.918 - 0     = 0.918

⇒ Shape has the higher gain, so S2 is split on Shape next.
28 Back to the Restaurant
´ Training data:
29 The Restaurant Example
gain(alt) = ...    gain(bar) = ...    gain(fri) = ...    gain(hun) = ...

gain(pat) = 1 - [ 2/12 \cdot H(0/2, 2/2) + 4/12 \cdot H(0/4, 4/4) + 6/12 \cdot H(2/6, 4/6) ]
          = 1 - [ 2/12 \cdot 0 + 4/12 \cdot 0 + 6/12 \cdot 0.918 ] ≈ 0.541 bits

gain(price) = ...    gain(rain) = ...    gain(res) = ...

gain(type) = 1 - [ 2/12 \cdot H(1/2, 1/2) + 2/12 \cdot H(1/2, 1/2) + 4/12 \cdot H(2/4, 2/4) + 4/12 \cdot H(2/4, 2/4) ] = 0 bits

gain(est) = ...

´ Attribute pat (Patrons) has the highest gain, so the root of the tree should be the attribute Patrons
´ do recursively for the subtrees
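A quick numerical check of these two gains in Python (not from the slides), using the per-value class counts of the restaurant training data and the `entropy` helper sketched after slide 20:

```python
def gain_from_counts(total, groups):
    """gain = H(parent) - sum_v (|S_v| / total) * H(S_v), where the parent
    entropy is 1 bit (6 'wait' vs. 6 'do not wait' examples) and each group
    is a (positives, negatives) count pair for one attribute value."""
    remainder = sum((p + n) / total * entropy([p / (p + n), n / (p + n)])
                    for p, n in groups)
    return 1 - remainder

# Patrons: None = (0+, 2-), Some = (4+, 0-), Full = (2+, 4-)
print(gain_from_counts(12, [(0, 2), (4, 0), (2, 4)]))            # ~0.541
# Type: French = (1+, 1-), Italian = (1+, 1-), Thai = (2+, 2-), Burger = (2+, 2-)
print(gain_from_counts(12, [(1, 1), (1, 1), (2, 2), (2, 2)]))    # 0.0
```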
30 Decision Boundaries
[Figure: a 2-D feature space (Feature 1 vs. Feature 2) with unlabeled data points]

31 Decision Boundaries
[Figure: a first split "Feature 2 > t1" draws a boundary at t1 on the Feature 2 axis; the tree so far tests Feature 2 > t1, with one branch still unresolved (??)]

32 Decision Boundaries
[Figure: a second split "Feature 1 > t2" adds a boundary at t2 on the Feature 1 axis; the tree tests Feature 2 > t1, then Feature 1 > t2, with one region still unresolved (??)]

33 Decision Boundaries
[Figure: a third split "Feature 2 > t3" adds a boundary at t3; the final tree tests Feature 2 > t1, Feature 1 > t2, and Feature 2 > t3, carving the feature space into axis-aligned rectangles]
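To make the geometry concrete, here is a hypothetical classifier in the same spirit (the thresholds, tree shape, and class names are invented for illustration and are not taken from the figures): each root-to-leaf path of threshold tests selects one axis-aligned rectangle of the plane.

```python
# Illustrative thresholds; the slides do not give values for t1, t2, t3
T1, T2, T3 = 3.0, 5.0, 7.0

def classify(feature_1, feature_2):
    """Nested threshold tests partition the (feature_1, feature_2) plane
    into axis-aligned rectangular regions, one per leaf."""
    if feature_2 > T1:
        if feature_1 > T2:
            return "class A"   # region: feature_2 > t1 and feature_1 > t2
        return "class B"       # region: feature_2 > t1 and feature_1 <= t2
    if feature_1 > T3:
        return "class C"       # region: feature_2 <= t1 and feature_1 > t3
    return "class D"           # region: feature_2 <= t1 and feature_1 <= t3
```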
34 Supervised Learning Algorithms

´ Linear Regression
´ Logistic Regression
´ Naïve Bayes Classifier
´ Decision Tree
´ Random Forest
35 Random Forest

´ A Random Forest builds multiple decision trees (the "forest") and combines their predictions to get a more accurate and stable result (see the sketch below).
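For orientation, a minimal scikit-learn sketch of this idea (the library choice and the 0/1 encoding of the categorical columns are mine, not the slides'); the rows are the heart-disease table from the next slides:

```python
from sklearn.ensemble import RandomForestClassifier

# Features: [blood flow normal?, blocked arteries?, chest pain?, weight]
X = [[1, 1, 1, 195],
     [0, 0, 0, 130],
     [1, 0, 1, 218],
     [0, 1, 1, 180]]
y = [1, 0, 0, 1]   # heart disease: yes = 1, no = 0

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict([[0, 0, 1, 185]]))   # majority vote across the 100 trees
```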
36 Why Random Forest

´ More accuracy
  ´ reduces the variance of the predictions by combining the results of multiple decision trees built on different samples of the data set
´ Example: use the following data set to create a Random Forest that predicts whether a person has heart disease or not.
37 Creating a Random Forest

´ Example: use the following data set to create a Random Forest that predicts whether a person has heart disease or not.

Blood Flow  Blocked Arteries  Chest Pain  Weight  Heart Disease
Normal      Yes               Yes         195     Yes
Abnormal    No                No          130     No
Normal      No                Yes         218     No
Abnormal    Yes               Yes         180     Yes
38 Creating a Random Forest

´ Step 1: Create a Bootstrapped Data Set
  Bootstrapping is an estimation method used to make predictions on a data set by re-sampling it (drawing rows with replacement, so some rows repeat), as sketched below.

Blood Flow  Blocked Arteries  Chest Pain  Weight  Heart Disease
Normal      Yes               Yes         195     Yes
Abnormal    No                No          130     No
Abnormal    Yes               Yes         180     Yes
Abnormal    Yes               Yes         180     Yes
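A minimal Python sketch of this re-sampling step (illustrative, not the slides' exact procedure):

```python
import random

def bootstrap_sample(rows):
    """Draw len(rows) rows with replacement: some rows repeat, others are left out."""
    return random.choices(rows, k=len(rows))

data = [
    ("Normal",   "Yes", "Yes", 195, "Yes"),
    ("Abnormal", "No",  "No",  130, "No"),
    ("Normal",   "No",  "Yes", 218, "No"),
    ("Abnormal", "Yes", "Yes", 180, "Yes"),
]
print(bootstrap_sample(data))
```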
39 Creating a Random Forest

´ Step 2: Creating a Decision Tree
  ´ Build a decision tree using the bootstrapped data set
  ´ Begin at the root node and choose the best attribute to split the data set on
  ´ Repeat the same process for each of the upcoming branch nodes

[Figure: a decision tree whose root node tests Blocked Arteries]
40 Creating a Random Forest

´ Step 3: Go Back to Step 1 and Repeat
  ´ Each decision tree predicts the output class based on the respective predictor variables used in that tree.
  ´ Go back to Step 1, create a new bootstrapped data set, and build a decision tree by considering only a subset of variables at each step.
  ´ This iteration is performed hundreds of times, creating multiple decision trees.
41 Creating a Random Forest

´ Step 4: Predicting the Outcome of a New Data Point
  ´ To predict whether a new patient has heart disease, run the new data down each of the decision trees.
  ´ After running the data down all the trees in the Random Forest, check which class got the majority of the votes (see the sketch below).

A new patient's record:

Blood Flow  Blocked Arteries  Chest Pain  Weight  Heart Disease
Abnormal    No                Yes         185     ?

Vote tally shown on the slide: Heart Disease Yes = 1, No = 0  ->  Output: Yes
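A short Python sketch of the voting step (illustrative; `trees` and `predict_tree` are hypothetical stand-ins for the forest built in Steps 1–3):

```python
from collections import Counter

def forest_predict(trees, new_row):
    """Run the new record down every tree and return the class with the most votes."""
    # predict_tree(tree, row) is a hypothetical helper that classifies `row` with one tree
    votes = Counter(predict_tree(tree, new_row) for tree in trees)
    return votes.most_common(1)[0][0]
```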
42 Creating a Random Forest

´ Step 5: Evaluate the Model
  ´ In a real-world problem, about 1/3 of the original data set is not included in any given bootstrapped data set.
  ´ The samples that do not appear in the bootstrapped data set are known as the Out-Of-Bag (OOB) data set.
  ´ We can measure the accuracy of a Random Forest by the proportion of OOB samples that are correctly classified.

Bootstrapped data set:
Blood Flow  Blocked Arteries  Chest Pain  Weight  Heart Disease
Normal      Yes               Yes         195     Yes
Abnormal    No                No          130     No
Abnormal    Yes               Yes         180     Yes
Abnormal    Yes               Yes         180     Yes

OOB data set (testing data set):
Blood Flow  Blocked Arteries  Chest Pain  Weight  Heart Disease
Normal      No                Yes         218     No


43 The End
