Function Approximation
Problem Setting
• Set of possible instances X
• Set of possible labels Y
• Unknown target function f : X → Y
• Set of function hypotheses H = {h | h : X → Y}
Input: Training examples of unknown target function f
{⟨xi, yi⟩} for i = 1, …, n, i.e., {⟨x1, y1⟩, …, ⟨xn, yn⟩}
Output: Hypothesis h ∈ H that best approximates f
Based on slide by Tom Mitchell
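To make this setup concrete, here is a minimal Python sketch of the same objects; the names (Instance, training_error, best_hypothesis) and the enumerate-and-score strategy are illustrative, not from the slides.

```python
# Minimal sketch of the function-approximation setting (illustrative names).
# An instance is a feature dictionary, a label is a string, and a hypothesis
# is any callable mapping X -> Y.
from typing import Callable, Dict, List, Tuple

Instance = Dict[str, str]
Label = str
Hypothesis = Callable[[Instance], Label]

def training_error(h: Hypothesis, data: List[Tuple[Instance, Label]]) -> float:
    """Fraction of training examples <x_i, y_i> on which h disagrees with y_i."""
    return sum(h(x) != y for x, y in data) / len(data)

def best_hypothesis(H: List[Hypothesis], data: List[Tuple[Instance, Label]]) -> Hypothesis:
    """Pick the hypothesis in a finite set H that best approximates f on the training data."""
    return min(H, key=lambda h: training_error(h, data))
```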
Sample Dataset
• Columns denote features Xi
• Rows denote labeled instances ⟨xi, yi⟩
• Class label denotes whether a tennis game was played
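As a concrete illustration, a few PlayTennis-style rows written as ⟨xi, yi⟩ pairs; the specific values below are examples and may not match the slide's table exactly.

```python
# Illustrative rows in the spirit of the PlayTennis dataset (values are examples,
# not necessarily the exact table on the slide). Each row is a feature dict x_i
# plus a class label y_i saying whether tennis was played.
dataset = [
    ({"outlook": "sunny",    "temperature": "hot",  "humidity": "high", "wind": "weak"},   "no"),
    ({"outlook": "overcast", "temperature": "hot",  "humidity": "high", "wind": "weak"},   "yes"),
    ({"outlook": "rain",     "temperature": "mild", "humidity": "high", "wind": "strong"}, "no"),
]
```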
Decision Tree
• A possible decision tree for the data:
• Each internal node: test one attribute Xi
• Each branch from a node: selects one value for Xi
• Each leaf node: predict Y (or p(Y | x ∈ leaf))
Based on slide by Tom Mitchell
Decision Tree
• A possible decision tree for the data:
• What prediction would we make for
<outlook=sunny, temperature=hot, humidity=high, wind=weak> ?
Based on slide by Tom Mitchell
Decision Tree
• If features are continuous, internal nodes can test the value of a feature against a threshold
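For example, a continuous-valued test at an internal node might look like the sketch below; the feature name and the threshold 75 are made up for illustration.

```python
# Sketch of an internal node testing a continuous feature against a threshold
# (feature name and threshold are illustrative, not from the slide).
def humidity_node(x):
    if x["humidity"] > 75.0:   # continuous test: Humidity > 75 ?
        return "no"            # one branch per outcome of the threshold test
    return "yes"
```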
Decision Tree Learning
Problem Setting:
• Set of possible instances X
– each instance x in X is a feature vector
– e.g., <Humidity=low, Wind=weak, Outlook=rain, Temp=hot>
• Unknown target function f : X → Y
– Y is discrete valued
• Set of function hypotheses H = { h | h : X → Y }
– each hypothesis h is a decision tree
– each tree sorts x to a leaf, which assigns y
Slide by Tom Mitchell
Stages of (Batch) Machine Learning
Given: labeled training data X, Y = {⟨xi, yi⟩} for i = 1, …, n
• Assumes each xi ~ D(X) with yi = f_target(xi)
Train the model:
model ← classifier.train(X, Y)
[Diagram: X, Y → learner → model, and x → model → y_prediction]
Apply the model to new data:
• Given: new unlabeled instance x ~ D(X)
y_prediction ← model.predict(x)
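These stages map onto, for example, scikit-learn's fit/predict API; the sketch below assumes scikit-learn and NumPy are installed and uses a tiny toy dataset.

```python
# Sketch of the batch-learning stages with a decision tree classifier
# (assumes scikit-learn and NumPy; the toy data is illustrative).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 1], [1, 1], [1, 0], [0, 0]])  # n training feature vectors
Y = np.array(["yes", "yes", "no", "no"])        # n training labels

model = DecisionTreeClassifier().fit(X, Y)      # model <- classifier.train(X, Y)

x_new = np.array([[0, 1]])                      # new unlabeled instance x ~ D(X)
y_prediction = model.predict(x_new)             # y_prediction <- model.predict(x)
```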
Example Application: A Tree to Predict Caesarean Section Risk
Based on Example by Tom Mitchell
Decision Tree Induced Partition
[Figure: the induced tree first tests Color (green, blue, red); green → test Size (big: −, small: +); blue → +; red → test Shape (round: +, square → test Size (big: −, small: +))]
Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-
parallel (hyper-)rectangles
• Each rectangular region is labeled with one label
– or a probability distribution over labels
[Figure: axis-parallel decision boundary in a two-dimensional feature space]
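A depth-2 tree over two continuous features makes this concrete; the thresholds 0.5 and 0.3 below are illustrative.

```python
# Sketch: a depth-2 tree over two continuous features carves the feature space
# into axis-parallel rectangles (thresholds are illustrative).
def tree_label(x1: float, x2: float) -> str:
    if x1 <= 0.5:
        return "-" if x2 <= 0.3 else "+"
    return "+"
# The three leaves correspond to the rectangles
# {x1 <= 0.5, x2 <= 0.3}, {x1 <= 0.5, x2 > 0.3}, and {x1 > 0.5}.
```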
Expressiveness
• Decision trees can represent any boolean function of the input attributes
Truth table row → path to leaf
• In the worst case, the tree will require exponentially many nodes
Expressiveness
Decision trees have a variable-sized hypothesis space
• As the #nodes (or depth) increases, the hypothesis
space grows
– Depth 1 (“decision stump”): can represent any boolean function of one feature
– Depth 2: any boolean function of two features, and some involving three features (e.g., (x1 ∧ x2) ∨ (¬x1 ∧ ¬x3); see the sketch after this list)
– etc.
Based on slide by Pedro Domingos
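The sketch below, referenced from the depth-2 item above, checks that a depth-2 tree splitting on x1 first computes exactly (x1 ∧ x2) ∨ (¬x1 ∧ ¬x3).

```python
# Sketch: each truth-table row follows one root-to-leaf path of a depth-2 tree.
from itertools import product

def f(x1, x2, x3):
    return (x1 and x2) or ((not x1) and (not x3))

def depth2_tree(x1, x2, x3):
    # root tests x1; the second level tests x2 on one branch and x3 on the other
    return bool(x2) if x1 else (not x3)

assert all(f(*row) == depth2_tree(*row) for row in product([False, True], repeat=3))
```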
Another Example:
Restaurant Domain (Russell & Norvig)
Model a patron’s decision of whether to wait for a table at a restaurant
~7,000 possible cases
A Decision Tree
from Introspection
Is this the best decision tree?
Preference bias: Ockham’s Razor
• Principle stated by William of Ockham (1285-1347)
– “non sunt multiplicanda entia praeter necessitatem”
– entities are not to be multiplied beyond necessity
– AKA Occam’s Razor, Law of Economy, or Law of Parsimony
Idea: The simplest consistent explanation is the best
• Therefore, the smallest decision tree that correctly
classifies all of the training examples is best
• Finding the provably smallest decision tree is NP-hard
• ...So instead of constructing the absolute smallest tree consistent with the training examples, construct one that is pretty small
Basic Algorithm for Top-Down
Induction of Decision Trees
[ID3, C4.5 by Quinlan]
node = root of decision tree
Main loop:
1. A ← the “best” decision attribute for the next node.
2. Assign A as decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, stop.
Else, recurse over new leaf nodes.
How do we choose which attribute is best?
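A minimal Python sketch of this main loop, with the attribute-selection rule left as a parameter (ID3 plugs in information gain, defined later in these slides); examples are (feature-dict, label) pairs and the function names are illustrative.

```python
# Sketch of top-down decision tree induction; `choose_best_attribute` is pluggable.
from collections import Counter

def build_tree(examples, attributes, choose_best_attribute):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                     # perfectly classified: stop
        return labels[0]
    if not attributes:                            # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    A = choose_best_attribute(examples, attributes)          # 1. pick the "best" attribute
    tree = {A: {}}                                           # 2. A is this node's decision attribute
    for v in {x[A] for x, _ in examples}:                    # 3. one descendant per value of A
        subset = [(x, y) for x, y in examples if x[A] == v]  # 4. sort examples to the new nodes
        rest = [a for a in attributes if a != A]
        tree[A][v] = build_tree(subset, rest, choose_best_attribute)  # 5. recurse
    return tree
```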
Choosing the Best Attribute
Key problem: choosing which attribute to split a given set of examples
• Some possibilities are:
– Random: Select any attribute at random
– Least-Values: Choose the attribute with the smallest number of possible values
– Most-Values: Choose the attribute with the largest number of possible values
– Max-Gain: Choose the attribute that has the largest expected information gain
• i.e., the attribute that results in the smallest expected size of subtrees rooted at its children
• The ID3 algorithm uses the Max-Gain method of selecting the best attribute
Choosing an Attribute
Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative”
Which split is more informative: Patrons? or Type?
Based on Slide from M. desJardins & T. Finin
ID3-induced
Decision Tree
Based on Slide from M. desJardins & T. Finin
Compare the Two Decision Trees
Based on Slide from M. desJardins & T. Finin
Information Gain
Which test is more informative?
Split over whether Balance exceeds 50K (branches: ≤ 50K, > 50K) vs. split over whether the applicant is employed (branches: Unemployed, Employed)
Based on slide by Pedro Domingos
Information Gain
Impurity/Entropy (informal)
– Measures the level of impurity in a group of examples
Based on slide by Pedro Domingos
Impurity
[Figure: three example groups: very impure, less impure, minimum impurity]
Based on slide by Pedro Domingos
Entropy: a common way to measure impurity
Entropy H(X) of a random variable X with n possible values:
H(X) = − Σ_{i=1..n} P(X = i) log2 P(X = i)
H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)
Why? Information theory:
• Most efficient code assigns −log2 P(X = i) bits to encode the message X = i
• So, the expected number of bits to code one random X is Σ_i P(X = i) · (−log2 P(X = i)) = H(X)
Slide by Tom Mitchell
Example: Huffman code
• In 1952 MIT student David Huffman devised, in the course
of doing a homework assignment, an elegant coding
scheme which is optimal in the case where all symbols’
probabilities are integral powers of 1/2.
• A Huffman code can be built in the following manner:
– Rank all symbols in order of probability of occurrence
– Successively combine the two symbols of the lowest
probability to form a new composite symbol; eventually
we will build a binary tree where each node is the
probability of all nodes beneath it
– Trace a path to each leaf, noting the direction taken at each node
Based on Slide from M. desJardins & T. Finin
Huffman code example
Symbol   Code   Length   Prob    Length × Prob
A        000    3        0.125   0.375
B        001    3        0.125   0.375
C        01     2        0.250   0.500
D        1      1        0.500   0.500
                                 average message length: 1.750
If we use this code to send many messages (A, B, C, or D) with this probability distribution, then, over time, the average bits/message should approach 1.75.
Based on Slide from M. desJardins & T. Finin
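A small sketch that rebuilds a Huffman code for this distribution using Python's heapq; the 0/1 labels may come out mirrored relative to the slide, but the code lengths and the 1.75 bits/message average are the same.

```python
# Sketch: Huffman coding for the distribution above (A, B: 0.125; C: 0.25; D: 0.5).
import heapq

probs = {"A": 0.125, "B": 0.125, "C": 0.25, "D": 0.5}

# Each heap entry: (probability, tie-breaker, {symbol: code-so-far}).
heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    p0, _, c0 = heapq.heappop(heap)        # combine the two least probable subtrees
    p1, _, c1 = heapq.heappop(heap)
    merged = {s: "0" + code for s, code in c0.items()}
    merged.update({s: "1" + code for s, code in c1.items()})
    heapq.heappush(heap, (p0 + p1, counter, merged))
    counter += 1

code = heap[0][2]
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code, avg_len)   # code lengths 3, 3, 2, 1 -> average 1.75 bits/message
```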
2-Class Cases:
Entropy: H(X) = − Σ_{i=1..n} P(X = i) log2 P(X = i)
• What is the entropy of a group in which all examples belong to the same class? (minimum impurity)
– entropy = −1 · log2 1 = 0
– not a good training set for learning
• What is the entropy of a group with 50% in either class? (maximum impurity)
– entropy = −0.5 · log2 0.5 − 0.5 · log2 0.5 = 1
– a good training set for learning
Based on slide by Pedro Domingos
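A short sketch reproducing these two cases (the function name entropy is the only thing assumed here):

```python
# Sketch: two-class entropy, reproducing the pure and 50/50 groups above.
import math
from collections import Counter

def entropy(labels):
    """H = - sum_i p_i log2 p_i over the class proportions in `labels`."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["+"] * 10))              # 0.0 : minimum impurity (pure group)
print(entropy(["+"] * 5 + ["-"] * 5))   # 1.0 : maximum impurity (50/50 split)
```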
Sample Entropy
For a sample S with proportion p⊕ of positive examples and p⊖ of negative examples:
H(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖
Slide by Tom Mitchell
Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a decision tree.
Based on slide by Pedro Domingos
From Entropy to Information Gain
Entropy H(X) of a random variable X:
H(X) = − Σ_i P(X = i) log2 P(X = i)
Specific conditional entropy H(X | Y = v) of X given Y = v:
H(X | Y = v) = − Σ_i P(X = i | Y = v) log2 P(X = i | Y = v)
Conditional entropy H(X | Y) of X given Y:
H(X | Y) = Σ_v P(Y = v) H(X | Y = v)
Mutual information (aka Information Gain) of X and Y:
I(X, Y) = H(X) − H(X | Y)
Slide by Tom Mitchell
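The same four quantities as a sketch, computed from paired samples of X and Y; the function names are illustrative.

```python
# Sketch: entropy, specific conditional entropy, conditional entropy, and
# mutual information (information gain) estimated from paired samples.
import math
from collections import Counter

def H(xs):
    n = len(xs)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(xs).values())

def H_given_value(xs, ys, v):
    """Specific conditional entropy H(X | Y = v)."""
    return H([x for x, y in zip(xs, ys) if y == v])

def H_given(xs, ys):
    """Conditional entropy H(X | Y) = sum_v P(Y = v) * H(X | Y = v)."""
    n = len(ys)
    return sum((c / n) * H_given_value(xs, ys, v) for v, c in Counter(ys).items())

def mutual_information(xs, ys):
    """I(X, Y) = H(X) - H(X | Y), i.e., information gain."""
    return H(xs) - H_given(xs, ys)
```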
Information Gain
Information Gain is the mutual information between
input attribute A and target variable Y
Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A:
Gain(S, A) = H_S(Y) − H_S(Y | A)
Slide by Tom Mitchell
Calculating Information Gain
Information Gain = entropy(parent) − [average entropy(children)]
Entire population: 30 instances (14 of one class, 16 of the other), split into one child with 17 instances (13 vs. 4) and one child with 13 instances (1 vs. 12).
parent entropy = −(14/30) log2(14/30) − (16/30) log2(16/30) = 0.996
child entropy (17 instances) = −(13/17) log2(13/17) − (4/17) log2(4/17) = 0.787
child entropy (13 instances) = −(1/13) log2(1/13) − (12/13) log2(12/13) = 0.391
(Weighted) average entropy of children = (17/30) · 0.787 + (13/30) · 0.391 = 0.615
Information Gain = 0.996 − 0.615 = 0.38
Based on slide by Pedro Domingos
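The same arithmetic as a sketch (the helper H2 is illustrative):

```python
# Sketch: reproducing the numbers above (parent 14 vs. 16; children 13 vs. 4 and 1 vs. 12).
import math

def H2(a, b):
    """Two-class entropy of a group with class counts a and b."""
    total = a + b
    return sum(-(c / total) * math.log2(c / total) for c in (a, b) if c > 0)

parent = H2(14, 16)                                # ~0.996
left, right = H2(13, 4), H2(1, 12)                 # ~0.787 and ~0.391
children = (17 / 30) * left + (13 / 30) * right    # ~0.615
gain = parent - children                           # ~0.38
print(round(parent, 3), round(children, 3), round(gain, 3))
```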
Entropy-Based Automatic Decision Tree Construction
Training set X: x1 = (f11, f12, …, f1m), x2 = (f21, f22, …, f2m), …, xn = (fn1, fn2, …, fnm)
At the root node (Node 1): what feature should be used? What values?
Quinlan suggested information gain in his ID3 system and later the gain ratio, both based on entropy.
Based on slide by Pedro Domingos
Using Information Gain to Construct a Decision Tree
• Choose the attribute A with the highest information gain for the full training set X at the root of the tree.
• Construct child nodes for each value v1, v2, …, vk of A. Each child gets the associated subset of vectors in which A has that particular value, e.g., X_v1 = {x ∈ X | value(A) = v1}.
• Repeat recursively. (Till when?)
Disadvantage of information gain:
• It prefers attributes with a large number of values that split the data into small, pure subsets
• Quinlan’s gain ratio uses normalization to improve this
Based on slide by Pedro Domingos
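Putting the pieces together: a sketch of information-gain attribute selection that could plug into the induction skeleton sketched earlier. Quinlan's gain ratio would additionally divide the gain by the entropy of the split itself (the split information) to penalize many-valued attributes; the function names below are illustrative.

```python
# Sketch: choose the split attribute by information gain (ID3-style selection).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """entropy(parent) - weighted average entropy of the children induced by `attribute`."""
    n = len(examples)
    parent = entropy([y for _, y in examples])
    remainder = 0.0
    for v, count in Counter(x[attribute] for x, _ in examples).items():
        child_labels = [y for x, y in examples if x[attribute] == v]
        remainder += (count / n) * entropy(child_labels)
    return parent - remainder

def choose_best_attribute(examples, attributes):
    return max(attributes, key=lambda a: information_gain(examples, a))

# Usage with the earlier skeleton (both are sketches):
#   tree = build_tree(dataset, ["outlook", "temperature", "humidity", "wind"], choose_best_attribute)
```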