Note to other teachers and users of these slides.
Andrew would be delighted if you found this source
material useful in giving your own lectures. Feel free
to use these slides verbatim, or to modify them to fit
your own needs. PowerPoint originals are available. If
you make use of a significant portion of these slides in
your own lecture, please include this message, or the
following link to the source repository of Andrew's
tutorials: [Link] .
Comments and corrections gratefully received.
Decision Trees
Today's lecture
Information Gain for measuring association
between inputs and outputs
Learning a decision tree classifier from data
Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
[Link]/~awm
awm@[Link]
412-268-7599
Copyright 2001, Andrew W. Moore
July 30, 2001
Data Mining
Data Mining is all about automating the process of searching for patterns in the data.
Which patterns are interesting?
Which might be mere illusions?
And how can they be exploited?
Deciding whether a pattern is interesting: that's what we'll look at right now.
And the answer will turn out to be the engine that drives decision tree learning.
Information Gain
We will use information theory
A very large topic, originally used for
compressing signals
But more recently used for data mining
(The topic of Information Gain will now be
discussed, but you will find it in a separate
Andrew Handout)
Bits
You are watching a set of independent random samples of X
You see that X has four possible values
P(X=A) = 1/4 P(X=B) = 1/4 P(X=C) = 1/4 P(X=D) = 1/4
So you might see: BAACBADCDADDDA
You transmit data over a binary serial link. You can encode
each reading with two bits (e.g. A = 00, B = 01, C = 10, D =
11)
0100001001001110110011111100
Fewer Bits
Someone tells you that the probabilities are not equal:
P(X=A) = 1/2   P(X=B) = 1/4   P(X=C) = 1/8   P(X=D) = 1/8
It's possible to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How?
A = 0
B = 10
C = 110
D = 111
(This is just one of several ways)
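As a quick sanity check of the 1.75-bit claim (this snippet is mine, not part of the original slides), the expected codeword length of the prefix code above equals the entropy of the distribution:

```python
import math

# Unequal symbol probabilities from the slide.
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
# One possible prefix code achieving 1.75 bits per symbol on average.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}

expected_length = sum(p * len(code[s]) for s, p in probs.items())
entropy = -sum(p * math.log2(p) for p in probs.values())

print(expected_length)  # 1.75
print(entropy)          # 1.75 -- this code is optimal for this distribution
```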
Fewer Bits
Suppose there are three equally likely values:
P(X=B) = 1/3   P(X=C) = 1/3   P(X=D) = 1/3
Here's a naïve coding, costing 2 bits per symbol (e.g. codewords 00, 01, 10).
Can you think of a coding that would need only 1.6 bits per symbol on average?
In theory, it can in fact be done with 1.58496 bits per symbol.

General Case
Suppose X can have one of m values V1, V2, ... Vm, with
P(X=V1) = p1, P(X=V2) = p2, ..., P(X=Vm) = pm.
What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It's

H(X) = -p1 log2 p1 - p2 log2 p2 - ... - pm log2 pm
     = -Σj=1..m pj log2 pj

H(X) = the entropy of X.
High entropy means X is from a uniform (boring) distribution: a histogram of the frequency distribution of values of X would be flat, and so the values sampled from it would be all over the place.
Low entropy means X is from a varied (peaks and valleys) distribution: a histogram of the frequency distribution of values of X would have many lows and one or two highs, and so the values sampled from it would be more predictable.
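A minimal Python sketch of the entropy formula above (my addition, not from the slides); it reproduces the 2-bit, 1.75-bit and 1.58496-bit figures quoted earlier:

```python
import math

def entropy(probs):
    """H(X) = -sum_j p_j log2 p_j, skipping zero-probability values."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
print(entropy([1/3, 1/3, 1/3]))            # ~1.58496 bits
```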
Entropy in a nut-shell
Low Entropy: ...the values (locations of soup) sampled entirely from within the soup bowl, so more predictable.
High Entropy: ...the values (locations of soup) unpredictable... almost uniformly sampled throughout our dining room.

Specific Conditional Entropy H(Y|X=v)
Suppose I'm trying to predict output Y and I have input X.
X = College Major, Y = Likes Gladiator

X          Y
Math       Yes
History    No
CS         Yes
Math       No
Math       No
CS         Yes
History    No
Math       Yes

Let's assume this reflects the true probabilities. E.g. from this data we estimate:
P(LikeG = Yes) = 0.5
P(Major = Math & LikeG = No) = 0.25
P(Major = Math) = 0.5
P(LikeG = Yes | Major = History) = 0
Note: H(X) = 1.5, H(Y) = 1

Definition of Specific Conditional Entropy:
H(Y|X=v) = the entropy of Y among only those records in which X has value v.
Example:
H(Y|X=Math) = 1
H(Y|X=History) = 0
H(Y|X=CS) = 0

Conditional Entropy H(Y|X)
Definition of Conditional Entropy:
H(Y|X) = the average specific conditional entropy of Y
= if you choose a record at random, what will be the conditional entropy of Y, conditioned on that row's value of X
= expected number of bits to transmit Y if both sides will know the value of X
= Σj Prob(X=vj) H(Y|X=vj)
Conditional Entropy
X = College Major, Y = Likes Gladiator (the same eight records as above)

Definition of Conditional Entropy:
H(Y|X) = the average conditional entropy of Y = Σj Prob(X=vj) H(Y|X=vj)

Example:
vj         Prob(X=vj)   H(Y|X=vj)
Math       0.5          1
History    0.25         0
CS         0.25         0

H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5
Information Gain
X = College Major, Y = Likes Gladiator

Definition of Information Gain:
IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X?
IG(Y|X) = H(Y) - H(Y|X)

Example:
H(Y) = 1
H(Y|X) = 0.5
Thus IG(Y|X) = 1 - 0.5 = 0.5
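These numbers can be checked directly from the eight College Major / Likes Gladiator records. A small sketch of my own (the record layout and helper names are mine, not from the slides):

```python
import math
from collections import Counter

records = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
           ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy_of(values):
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def conditional_entropy(records):
    """H(Y|X) = sum_v P(X=v) * H(Y | X=v)."""
    n = len(records)
    total = 0.0
    for v in set(x for x, _ in records):
        subset = [y for x, y in records if x == v]
        total += len(subset) / n * entropy_of(subset)
    return total

h_y = entropy_of([y for _, y in records])   # 1.0
h_y_given_x = conditional_entropy(records)  # 0.5
print(h_y, h_y_given_x, h_y - h_y_given_x)  # IG(Y|X) = 0.5
```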
Back to Decision Trees

Learning Decision Trees
A Decision Tree is a tree-structured plan of a set of attributes to test in order to predict the output.
To decide which attribute should be tested first, simply find the one with the highest information gain.
Then recurse...
A small dataset: Miles Per Gallon
40 records, from the UCI repository (thanks to Ross Quinlan). All attributes have been discretized. The first few records:

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
(remaining records omitted)
Suppose we want to predict MPG.
Look at all the information gains...
A Decision Stump

Recursion Step
Take the original dataset... and partition it according to the value of the attribute we split on:
Records in which cylinders = 4
Records in which cylinders = 5
Records in which cylinders = 6
Records in which cylinders = 8
Recursion Step
Build tree from these records: records in which cylinders = 4
Build tree from these records: records in which cylinders = 5
Build tree from these records: records in which cylinders = 6
Build tree from these records: records in which cylinders = 8

Second level of tree
Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia.
(Similar recursion in the other cases)
The final tree

Base Case One
Don't split a node if all matching records have the same output value.
Base Case Two
Don't split a node if none of the attributes can create multiple non-empty children.
Base Case Two: No attributes can distinguish.
Base Cases
Base Case One: If all records in the current data subset have the same output then don't recurse.
Base Case Two: If all records have exactly the same set of input attributes then don't recurse.

Base Cases: An idea
Proposed Base Case 3: If all attributes have zero information gain then don't recurse.
Is this a good idea?
The problem with Base Case 3
y = a XOR b

a  b  y
0  0  0
0  1  1
1  0  1
1  1  0

The information gains: IG(y|a) = 0 and IG(y|b) = 0, so with Proposed Base Case 3 we would not split at all, and the resulting decision tree is a single root node.

If we omit Base Case 3:
The resulting decision tree splits on a, then on b, and classifies every record correctly.
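To see the problem numerically, here is a small sketch (mine, not from the slides) computing IG(y|a) and IG(y|b) for the XOR data; both gains are exactly zero, so Proposed Base Case 3 would refuse to split:

```python
import math
from collections import Counter

def H(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def info_gain(rows, attr, target):
    """IG(target | attr) = H(target) - sum_v P(attr=v) H(target | attr=v)."""
    n = len(rows)
    h_cond = 0.0
    for v in set(r[attr] for r in rows):
        sub = [r[target] for r in rows if r[attr] == v]
        h_cond += len(sub) / n * H(sub)
    return H([r[target] for r in rows]) - h_cond

xor_rows = [{"a": 0, "b": 0, "y": 0}, {"a": 0, "b": 1, "y": 1},
            {"a": 1, "b": 0, "y": 1}, {"a": 1, "b": 1, "y": 0}]

print(info_gain(xor_rows, "a", "y"))  # 0.0
print(info_gain(xor_rows, "b", "y"))  # 0.0
```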
Basic Decision Tree Building Summarized
BuildTree(DataSet, Output)
If all output values are the same in DataSet, return a leaf node that says "predict this unique output".
If all input values are the same, return a leaf node that says "predict the majority output".
Else find attribute X with highest Info Gain.
Suppose X has nX distinct values (i.e. X has arity nX).
Create and return a non-leaf node with nX children.
The i'th child should be built by calling BuildTree(DSi, Output), where DSi consists of all those records in DataSet for which X = the i'th distinct value of X.
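A compact Python rendering of the BuildTree pseudocode above (my own sketch; records are assumed to be dicts mapping attribute names to values, and the dict-based tree representation and helper names are mine):

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def info_gain(records, attr, output):
    n = len(records)
    h_cond = 0.0
    for v in set(r[attr] for r in records):
        sub = [r[output] for r in records if r[attr] == v]
        h_cond += len(sub) / n * entropy(sub)
    return entropy([r[output] for r in records]) - h_cond

def build_tree(records, attributes, output):
    outputs = [r[output] for r in records]
    # Base case one: all outputs identical -> leaf predicting that output.
    if len(set(outputs)) == 1:
        return {"predict": outputs[0]}
    # Attributes that can still create multiple non-empty children here.
    usable = [a for a in attributes if len(set(r[a] for r in records)) > 1]
    # Base case two: no attribute distinguishes the records -> majority leaf.
    if not usable:
        return {"predict": Counter(outputs).most_common(1)[0][0]}
    # Otherwise split on the attribute with the highest information gain.
    best = max(usable, key=lambda a: info_gain(records, a, output))
    node = {"split_on": best, "children": {}}
    for v in set(r[best] for r in records):
        subset = [r for r in records if r[best] == v]
        node["children"][v] = build_tree(subset, attributes, output)
    return node
```

Called as, say, build_tree(records, ["cylinders", "maker"], "mpg"), this would return nested dicts mirroring the unpruned tree described above; prediction follows split_on / children until a predict leaf is reached.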
Training Set Error
For each record, follow the decision tree to see what it would predict.
For what number of records does the decision tree's prediction disagree with the true value in the database?
This quantity is called the training set error. The smaller the better.

MPG Training error
Stop and reflect: Why are we doing this learning anyway?
It is not usually in order to predict the training data's output on data we have already seen.
It is more commonly in order to predict the output value for future data we have not yet seen.
Warning: A common data mining misperception is that the above two bullets are the only possible reasons for learning. There are at least a dozen others.
Test Set Error
Suppose we are forward thinking.
We hide some data away when we learn the decision tree.
But once learned, we see how well the tree predicts that data.
This is a good simulation of what happens when we try to predict future data.
And it is called Test Set Error.

MPG Test set error
The test set error is much worse than the training set error. Why?

An artificial example
We'll create a training dataset:
32 records.
Five inputs, all bits, are generated in all 32 possible combinations.
Output y = copy of e, except a random 25% of the records have y set to the opposite of e.
In our artificial example
Suppose someone generates a test set according to the same method.
The test set is identical, except that some of the y's will be different.
Some y's that were corrupted in the training set will be uncorrupted in the testing set.
Some y's that were uncorrupted in the training set will be corrupted in the test set.

Building a tree with the artificial training set
Suppose we build a full tree (we always split until base case 2): the root splits on e, each branch then splits on a, and so on until every leaf contains exactly one record.
25% of these leaf node labels will be corrupted.

Training set error for our artificial tree
All the leaf nodes contain exactly one record, so we would have a training set error of zero.

Testing the tree with the test set
1/4 of the tree's leaf nodes are corrupted, and (independently) 1/4 of the test set records are corrupted:
1/16 of the test set will be correctly predicted for the wrong reasons (corrupted leaf and corrupted test record).
3/16 of the test set will be wrongly predicted because the test record is corrupted.
3/16 of the test predictions will be wrong because the tree node is corrupted.
9/16 of the test predictions will be fine.
In total, we expect to be wrong on 3/8 of the test set predictions.
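The 3/8 figure can also be checked by a quick simulation (my sketch, assuming the full tree simply memorizes the training label of each of the 32 input combinations, so its test prediction equals the training y):

```python
import random

random.seed(0)
trials, wrong, total = 1000, 0, 0
for _ in range(trials):
    for e in range(2):           # only e matters; the full tree memorizes each record
        for _ in range(16):      # 16 input combinations share each value of e
            train_y = e if random.random() < 0.75 else 1 - e  # 25% training noise
            test_y = e if random.random() < 0.75 else 1 - e   # fresh 25% test noise
            wrong += (train_y != test_y)  # full tree predicts the memorized train_y
            total += 1
print(wrong / total)  # close to 3/8 = 0.375
```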
What has this example shown us?
This explains the discrepancy between training and test set error.
But more importantly, it indicates there's something we should do about it if we want to predict well on future data.
Suppose we had less data
Let's not look at the irrelevant bits: those bits are hidden from the learner.
32 records.
Output y = copy of e, except a random 25% of the records have y set to the opposite of e.
What decision tree would we learn now?
Without access to the irrelevant bits
The learned tree is just a root split on e, with children e=0 and e=1. These nodes will be unexpandable.
In about 12 of the 16 records in the e=0 node the output will be 0, so this node will almost certainly predict 0.
In about 12 of the 16 records in the e=1 node the output will be 1, so this node will almost certainly predict 1.

Testing this small tree on the test set:
Almost certainly none of the tree nodes are corrupted (almost certainly all are fine).
1/4 of the test set records are corrupted, so 1/4 of the test set will be wrongly predicted because the test record is corrupted.
3/4 of the test set records are fine, so 3/4 of the test predictions will be fine.
In total, we expect to be wrong on only 1/4 of the test set predictions.

Overfitting
Definition: if your machine learning algorithm fits noise (i.e. pays attention to parts of the data that are irrelevant) it is overfitting.
Fact (theoretical and empirical): if your machine learning algorithm is overfitting then it may perform less well on test set data.
Avoiding overfitting
Usually we do not know in advance which are the irrelevant variables, and it may depend on the context.
For example, if y = a AND b then b is an irrelevant variable only in the portion of the tree in which a = 0.
But we can use simple statistics to warn us that we might be overfitting.
Consider this split...
A chi-squared test
Suppose that mpg was completely uncorrelated with maker.
What is the chance we'd have seen data of at least this apparent level of association anyway?
By using a particular kind of chi-squared test, the answer is 13.5%.
What is a Chi-Square test?
Google "chi square" for excellent explanations.
It takes into account the surprise that a feature generates: terms of the form (unsplit-number - split-number)^2 / unsplit-number.
It gives the probability that the rate you saw was generated by luck of the draw.
Does likes-Matrix predict CS grad?

              CS      Non CS
Likes Matrix  15972   145643
Hates Matrix  3       37

              CS      Non CS
Likes Matrix  21543   145643
Hates Matrix  3       173

Using Chi-squared to avoid overfitting
Build the full decision tree as before.
But when you can grow it no more, start to
prune:
Beginning at the bottom of the tree, delete
splits in which pchance > MaxPchance.
Continue working your way up until there are no
more prunable nodes.
MaxPchance is a magic parameter you must specify to the decision tree,
indicating your willingness to risk fitting noise.
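For concreteness, here is a hedged sketch of how pchance for one split might be computed; this is my illustration rather than the deck's exact procedure, the branch-versus-output counts are hypothetical, and it assumes scipy is available:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of bad/good mpg within each branch of a candidate split.
# Rows: one per child branch; columns: one per output value.
observed = [[9, 3],    # e.g. maker = america: 9 bad, 3 good
            [3, 5],    # e.g. maker = asia:    3 bad, 5 good
            [2, 2]]    # e.g. maker = europe:  2 bad, 2 good

chi2, pchance, dof, expected = chi2_contingency(observed)
print(pchance)  # if pchance > MaxPchance, this split would be pruned away
```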
Pruning example
With MaxPchance = 0.1, you will see the following MPG decision tree.
Note the improved test set accuracy compared with the unpruned tree.

MaxPchance
Good news: the decision tree can automatically adjust its pruning decisions according to the amount of apparent noise and data.
Bad news: the user must come up with a good value of MaxPchance. (Note: Andrew usually uses 0.05, which is his favorite value for any magic parameter.)
Good news: but with extra work, the best MaxPchance value can be estimated automatically by a technique called cross-validation.
MaxPchance
Technical note (dealt with in other lectures): MaxPchance is a regularization parameter.
Expected test set error is high at both extremes: decreasing MaxPchance too far gives high bias, increasing it too far gives high variance.

Note that this pruning is heuristically trying to find
"the simplest tree structure for which all within-leaf-node disagreements can be explained by chance".
This is not the same as saying "the simplest classification scheme for which..."
Decision trees are biased to prefer classifiers that can be expressed as trees.

Expressiveness of Decision Trees
Assume all inputs are Boolean and all outputs are Boolean.
What is the class of Boolean functions that are possible to represent by decision trees?
Answer: all Boolean functions.
Simple proof:
1. Take any Boolean function
2. Convert it into a truth table
3. Construct a decision tree in which each row of the truth table corresponds to one path through the decision tree
Real-Valued inputs
What should we do if some of the inputs are real-valued?

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          97            75          2265    18.2          77         asia
bad   6          199           90          2648    15            70         america
bad   4          121           110         2600    12.8          77         europe
bad   8          350           175         4100    13            73         america
bad   6          198           95          3102    16.5          74         america
bad   4          108           94          2379    16.5          73         asia
bad   4          113           95          2228    14            71         asia
bad   8          302           139         3570    12.8          78         america
:     :          :             :           :       :             :          :
(remaining records omitted)

One branch for each numeric value idea:
Idea One: Branch on each possible real value.
Hopeless: with such a high branching factor we will shatter the dataset and overfit.
Note: pchance is 0.222 in the above... if MaxPchance was 0.05 that would end up pruning away to a single root node.

A better idea: thresholded splits
Suppose X is real valued.
Define IG(Y|X:t) as H(Y) - H(Y|X:t)
Define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
IG(Y|X:t) is the information gain for predicting Y if all you know is whether X is greater than or less than t.
Then define IG*(Y|X) = max_t IG(Y|X:t)
For each real-valued attribute, use IG*(Y|X) for assessing its suitability as a split.
Computational Issues
You can compute IG*(Y|X) in time R log R + 2 R ny, where
R is the number of records in the node under consideration, and
ny is the arity (number of distinct values) of Y.
How? Sort records according to increasing values of X. Then create a 2 x ny contingency table corresponding to the computation of IG(Y|X:xmin). Then iterate through the records, testing for each threshold between adjacent values of X, incrementally updating the contingency table as you go. For a minor additional speedup, only test between values of Y that differ.
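A sketch of the sorted-scan computation just described (my own code, not from the slides): sort by X once, sweep candidate thresholds between adjacent distinct X values while maintaining below/above class counts, and keep the best IG(Y|X:t):

```python
import math
from collections import Counter

def entropy_from_counts(counts, n):
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)

def best_threshold_ig(xs, ys):
    """Return (IG*(Y|X), best threshold t), scanning thresholds between
    adjacent distinct sorted values of X."""
    pairs = sorted(zip(xs, ys))               # the R log R sort
    n = len(pairs)
    h_y = entropy_from_counts(Counter(ys), n)
    below, above = Counter(), Counter(y for _, y in pairs)
    best_ig, best_t = 0.0, None
    for i in range(n - 1):
        x_i, y_i = pairs[i]
        below[y_i] += 1                       # move record i below the threshold
        above[y_i] -= 1
        if x_i == pairs[i + 1][0]:
            continue                          # no threshold between equal X values
        t = (x_i + pairs[i + 1][0]) / 2
        n_below, n_above = i + 1, n - i - 1
        h_cond = (n_below / n) * entropy_from_counts(below, n_below) \
               + (n_above / n) * entropy_from_counts(above, n_above)
        ig = h_y - h_cond
        if ig > best_ig:
            best_ig, best_t = ig, t
    return best_ig, best_t

# Toy usage: horsepower-like values against a good/bad label.
print(best_threshold_ig([75, 90, 94, 110, 139, 175],
                        ["good", "good", "good", "bad", "bad", "bad"]))
```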
Example with MPG
Unpruned tree using reals
Pruned tree using reals
Binary categorical splits
One of Andrew's favorite tricks.
Allow splits of the following form:
Root, with one branch for "Attribute equals value" and one branch for "Attribute doesn't equal value".
Example:
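A sketch of how such equals / doesn't-equal splits could be scored (my own illustration, reusing the information-gain idea; helper names are mine):

```python
import math
from collections import Counter

def H(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def best_binary_categorical_split(rows, attr, target):
    """Best split of the form (attr == v) versus (attr != v), by information gain."""
    n = len(rows)
    h_y = H([r[target] for r in rows])
    best = (0.0, None)
    for v in set(r[attr] for r in rows):
        eq = [r[target] for r in rows if r[attr] == v]
        ne = [r[target] for r in rows if r[attr] != v]
        if not eq or not ne:
            continue
        ig = h_y - (len(eq) / n * H(eq) + len(ne) / n * H(ne))
        best = max(best, (ig, v), key=lambda t: t[0])
    return best  # (information gain, value to test equality against)
```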
Predicting age from census
Predicting gender from census
Predicting wealth from census
Conclusions
Decision trees are the single most popular data mining tool:
Easy to understand
Easy to implement
Easy to use
Computationally cheap
It's possible to get in trouble with overfitting.
They do classification: predict a categorical output from categorical and/or real inputs.

What you should know
What's information gain, and why we use it
The recursive algorithm for building an unpruned decision tree
What are training and test set errors
Why test set errors can be bigger than training set errors
Why pruning can reduce test set error
How to exploit real-valued inputs

What we haven't discussed
It's easy to have real-valued outputs too---these are called Regression Trees*
Bayesian Decision Trees can take a different approach to preventing overfitting
Computational complexity (straightforward and cheap)*
Alternatives to Information Gain for splitting nodes
How to choose MaxPchance automatically*
The details of Chi-Squared testing*
Boosting---a simple way to improve accuracy*
* = discussed in other Andrew lectures
For more information
Two nice books:
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning.
Dozens of nice papers, including:
Wray Buntine. Learning Classification Trees. Statistics and Computing (1992), Vol. 2, pages 63-73.
Kearns and Mansour. On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. STOC: ACM Symposium on Theory of Computing, 1996.
Dozens of software implementations are available on the web, for free and commercially, for prices ranging between $50 and $300,000.
Discussion
Instead of using information gain, why not choose the splitting attribute to be the one with the highest prediction accuracy?
Instead of greedily, heuristically, building the tree, why not do a combinatorial search for the optimal tree?
If you build a decision tree to predict wealth, and marital status, age and gender are chosen as attributes near the top of the tree, is it reasonable to conclude that those three inputs are the major causes of wealth?
...would it be reasonable to assume that attributes not mentioned in the tree are not causes of wealth?
...would it be reasonable to assume that attributes not mentioned in the tree are not correlated with wealth?
What about multi-attribute splits?