Note to other teachers and users of these slides.
Andrew would be delighted if you found this source
material useful in giving your own lectures. Feel free
to use these slides verbatim, or to modify them to fit
your own needs. PowerPoint originals are available. If
you make use of a significant portion of these slides in
your own lecture, please include this message, or the
following link to the source repository of Andrew's
tutorials: [Link] .
Comments and corrections gratefully received.
Decision Trees
Today's lecture
Information Gain for measuring association
between inputs and outputs
Learning a decision tree classifier from data
Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
[Link]/~awm
awm@[Link]
412-268-7599
Copyright 2001, Andrew W. Moore
July 30, 2001
Data Mining
Data Mining is all about automating the process of searching for patterns in the data.
Which patterns are interesting?
Which might be mere illusions?
And how can they be exploited?
Deciding whether a pattern is interesting: that's what we'll look at right now.
And the answer will turn out to be the engine that drives decision tree learning.
Information Gain
We will use information theory
A very large topic, originally used for
compressing signals
But more recently used for data mining
(The topic of Information Gain will now be
discussed, but you will find it in a separate
Andrew Handout)
Bits
You are watching a set of independent random samples of X
You see that X has four possible values
P(X=A) = 1/4 P(X=B) = 1/4 P(X=C) = 1/4 P(X=D) = 1/4
So you might see: BAACBADCDADDDA
You transmit data over a binary serial link. You can encode
each reading with two bits (e.g. A = 00, B = 01, C = 10, D =
11)
0100001001001110110011111100
Fewer Bits
Someone tells you that the probabilities are not equal:
P(X=A) = 1/2   P(X=B) = 1/4   P(X=C) = 1/8   P(X=D) = 1/8
It's possible to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How?
A = 0
B = 10
C = 110
D = 111
(This is just one of several ways)
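As a quick sanity check of the 1.75-bit claim (this snippet is mine, not part of the original slides), the expected codeword length of the prefix code above equals the entropy of the distribution:

```python
import math

# Unequal symbol probabilities from the slide.
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
# One possible prefix code achieving 1.75 bits per symbol on average.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}

expected_length = sum(p * len(code[s]) for s, p in probs.items())
entropy = -sum(p * math.log2(p) for p in probs.values())

print(expected_length)  # 1.75
print(entropy)          # 1.75 -- this code is optimal for this distribution
```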
Fewer Bits
Suppose there are three equally likely values:
P(X=B) = 1/3   P(X=C) = 1/3   P(X=D) = 1/3
Here's a naïve coding, costing 2 bits per symbol (e.g. codewords 00, 01, 10).
Can you think of a coding that would need only 1.6 bits per symbol on average?
In theory, it can in fact be done with 1.58496 bits per symbol.

General Case
Suppose X can have one of m values V1, V2, ... Vm, with
P(X=V1) = p1, P(X=V2) = p2, ..., P(X=Vm) = pm.
What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It's

H(X) = -p1 log2 p1 - p2 log2 p2 - ... - pm log2 pm
     = -Σj=1..m pj log2 pj

H(X) = the entropy of X.
High entropy means X is from a uniform (boring) distribution: a histogram of the frequency distribution of values of X would be flat, and so the values sampled from it would be all over the place.
Low entropy means X is from a varied (peaks and valleys) distribution: a histogram of the frequency distribution of values of X would have many lows and one or two highs, and so the values sampled from it would be more predictable.
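A minimal Python sketch of the entropy formula above (my addition, not from the slides); it reproduces the 2-bit, 1.75-bit and 1.58496-bit figures quoted earlier:

```python
import math

def entropy(probs):
    """H(X) = -sum_j p_j log2 p_j, skipping zero-probability values."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
print(entropy([1/3, 1/3, 1/3]))            # ~1.58496 bits
```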
Entropy in a nut-shell
Low Entropy: ...the values (locations of soup) sampled entirely from within the soup bowl, so more predictable.
High Entropy: ...the values (locations of soup) unpredictable... almost uniformly sampled throughout our dining room.

Specific Conditional Entropy H(Y|X=v)
Suppose I'm trying to predict output Y and I have input X.
X = College Major, Y = Likes Gladiator

X          Y
Math       Yes
History    No
CS         Yes
Math       No
Math       No
CS         Yes
History    No
Math       Yes

Let's assume this reflects the true probabilities. E.g. from this data we estimate:
P(LikeG = Yes) = 0.5
P(Major = Math & LikeG = No) = 0.25
P(Major = Math) = 0.5
P(LikeG = Yes | Major = History) = 0
Note: H(X) = 1.5, H(Y) = 1

Definition of Specific Conditional Entropy:
H(Y|X=v) = the entropy of Y among only those records in which X has value v.
Example:
H(Y|X=Math) = 1
H(Y|X=History) = 0
H(Y|X=CS) = 0

Conditional Entropy H(Y|X)
Definition of Conditional Entropy:
H(Y|X) = the average specific conditional entropy of Y
= if you choose a record at random, what will be the conditional entropy of Y, conditioned on that row's value of X
= expected number of bits to transmit Y if both sides will know the value of X
= Σj Prob(X=vj) H(Y|X=vj)
Conditional Entropy
X = College Major, Y = Likes Gladiator (the same eight records as above)

Definition of Conditional Entropy:
H(Y|X) = the average conditional entropy of Y = Σj Prob(X=vj) H(Y|X=vj)

Example:
vj         Prob(X=vj)   H(Y|X=vj)
Math       0.5          1
History    0.25         0
CS         0.25         0

H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5
Information Gain
X = College Major, Y = Likes Gladiator

Definition of Information Gain:
IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X?
IG(Y|X) = H(Y) - H(Y|X)

Example:
H(Y) = 1
H(Y|X) = 0.5
Thus IG(Y|X) = 1 - 0.5 = 0.5
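These numbers can be checked directly from the eight College Major / Likes Gladiator records. A small sketch of my own (the record layout and helper names are mine, not from the slides):

```python
import math
from collections import Counter

records = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
           ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy_of(values):
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def conditional_entropy(records):
    """H(Y|X) = sum_v P(X=v) * H(Y | X=v)."""
    n = len(records)
    total = 0.0
    for v in set(x for x, _ in records):
        subset = [y for x, y in records if x == v]
        total += len(subset) / n * entropy_of(subset)
    return total

h_y = entropy_of([y for _, y in records])   # 1.0
h_y_given_x = conditional_entropy(records)  # 0.5
print(h_y, h_y_given_x, h_y - h_y_given_x)  # IG(Y|X) = 0.5
```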
Back to Decision Trees

Learning Decision Trees
A Decision Tree is a tree-structured plan of a set of attributes to test in order to predict the output.
To decide which attribute should be tested first, simply find the one with the highest information gain.
Then recurse...
A small dataset: Miles Per Gallon
40 records, from the UCI repository (thanks to Ross Quinlan). All attributes have been discretized. The first few records:

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
(remaining records omitted)
Suppose we want to predict MPG.
Look at all the information gains...
A Decision Stump

Recursion Step
Take the original dataset... and partition it according to the value of the attribute we split on:
Records in which cylinders = 4
Records in which cylinders = 5
Records in which cylinders = 6
Records in which cylinders = 8
Recursion Step
Build tree from these records: records in which cylinders = 4
Build tree from these records: records in which cylinders = 5
Build tree from these records: records in which cylinders = 6
Build tree from these records: records in which cylinders = 8

Second level of tree
Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia.
(Similar recursion in the other cases)
The final tree

Base Case One
Don't split a node if all matching records have the same output value.
Base Case Two
Don't split a node if none of the attributes can create multiple non-empty children.
Base Case Two: No attributes can distinguish.
Base Cases
Base Case One: If all records in the current data subset have the same output then don't recurse.
Base Case Two: If all records have exactly the same set of input attributes then don't recurse.

Base Cases: An idea
Proposed Base Case 3: If all attributes have zero information gain then don't recurse.
Is this a good idea?
The problem with Base Case 3
y = a XOR b

a  b  y
0  0  0
0  1  1
1  0  1
1  1  0

The information gains: IG(y|a) = 0 and IG(y|b) = 0, so with Proposed Base Case 3 we would not split at all, and the resulting decision tree is a single root node.

If we omit Base Case 3:
The resulting decision tree splits on a, then on b, and classifies every record correctly.
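To see the problem numerically, here is a small sketch (mine, not from the slides) computing IG(y|a) and IG(y|b) for the XOR data; both gains are exactly zero, so Proposed Base Case 3 would refuse to split:

```python
import math
from collections import Counter

def H(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def info_gain(rows, attr, target):
    """IG(target | attr) = H(target) - sum_v P(attr=v) H(target | attr=v)."""
    n = len(rows)
    h_cond = 0.0
    for v in set(r[attr] for r in rows):
        sub = [r[target] for r in rows if r[attr] == v]
        h_cond += len(sub) / n * H(sub)
    return H([r[target] for r in rows]) - h_cond

xor_rows = [{"a": 0, "b": 0, "y": 0}, {"a": 0, "b": 1, "y": 1},
            {"a": 1, "b": 0, "y": 1}, {"a": 1, "b": 1, "y": 0}]

print(info_gain(xor_rows, "a", "y"))  # 0.0
print(info_gain(xor_rows, "b", "y"))  # 0.0
```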
Basic Decision Tree Building Summarized
BuildTree(DataSet, Output)
If all output values are the same in DataSet, return a leaf node that says "predict this unique output".
If all input values are the same, return a leaf node that says "predict the majority output".
Else find attribute X with highest Info Gain.
Suppose X has nX distinct values (i.e. X has arity nX).
Create and return a non-leaf node with nX children.
The i'th child should be built by calling BuildTree(DSi, Output), where DSi consists of all those records in DataSet for which X = the i'th distinct value of X.
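A compact Python rendering of the BuildTree pseudocode above (my own sketch; records are assumed to be dicts mapping attribute names to values, and the dict-based tree representation and helper names are mine):

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def info_gain(records, attr, output):
    n = len(records)
    h_cond = 0.0
    for v in set(r[attr] for r in records):
        sub = [r[output] for r in records if r[attr] == v]
        h_cond += len(sub) / n * entropy(sub)
    return entropy([r[output] for r in records]) - h_cond

def build_tree(records, attributes, output):
    outputs = [r[output] for r in records]
    # Base case one: all outputs identical -> leaf predicting that output.
    if len(set(outputs)) == 1:
        return {"predict": outputs[0]}
    # Attributes that can still create multiple non-empty children here.
    usable = [a for a in attributes if len(set(r[a] for r in records)) > 1]
    # Base case two: no attribute distinguishes the records -> majority leaf.
    if not usable:
        return {"predict": Counter(outputs).most_common(1)[0][0]}
    # Otherwise split on the attribute with the highest information gain.
    best = max(usable, key=lambda a: info_gain(records, a, output))
    node = {"split_on": best, "children": {}}
    for v in set(r[best] for r in records):
        subset = [r for r in records if r[best] == v]
        node["children"][v] = build_tree(subset, attributes, output)
    return node
```

Called as, say, build_tree(records, ["cylinders", "maker"], "mpg"), this would return nested dicts mirroring the unpruned tree described above; prediction follows split_on / children until a predict leaf is reached.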
Training Set Error
For each record, follow the decision tree to see what it would predict.
For what number of records does the decision tree's prediction disagree with the true value in the database?
This quantity is called the training set error. The smaller the better.

MPG Training error
Stop and reflect: Why are we doing this learning anyway?
It is not usually in order to predict the training data's output on data we have already seen.
It is more commonly in order to predict the output value for future data we have not yet seen.
Warning: A common data mining misperception is that the above two bullets are the only possible reasons for learning. There are at least a dozen others.
Test Set Error
Suppose we are forward thinking.
We hide some data away when we learn the decision tree.
But once learned, we see how well the tree predicts that data.
This is a good simulation of what happens when we try to predict future data.
And it is called Test Set Error.

MPG Test set error
The test set error is much worse than the training set error. Why?

An artificial example
We'll create a training dataset:
32 records.
Five inputs, all bits, are generated in all 32 possible combinations.
Output y = copy of e, except a random 25% of the records have y set to the opposite of e.
In our artificial example
Suppose someone generates a test set according to the same method.
The test set is identical, except that some of the y's will be different.
Some y's that were corrupted in the training set will be uncorrupted in the testing set.
Some y's that were uncorrupted in the training set will be corrupted in the test set.

Building a tree with the artificial training set
Suppose we build a full tree (we always split until base case 2): the root splits on e, each branch then splits on a, and so on until every leaf contains exactly one record.
25% of these leaf node labels will be corrupted.

Training set error for our artificial tree
All the leaf nodes contain exactly one record, so we would have a training set error of zero.

Testing the tree with the test set
1/4 of the tree's leaf nodes are corrupted, and (independently) 1/4 of the test set records are corrupted:
1/16 of the test set will be correctly predicted for the wrong reasons (corrupted leaf and corrupted test record).
3/16 of the test set will be wrongly predicted because the test record is corrupted.
3/16 of the test predictions will be wrong because the tree node is corrupted.
9/16 of the test predictions will be fine.
In total, we expect to be wrong on 3/8 of the test set predictions.
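The 3/8 figure can also be checked by a quick simulation (my sketch, assuming the full tree simply memorizes the training label of each of the 32 input combinations, so its test prediction equals the training y):

```python
import random

random.seed(0)
trials, wrong, total = 1000, 0, 0
for _ in range(trials):
    for e in range(2):           # only e matters; the full tree memorizes each record
        for _ in range(16):      # 16 input combinations share each value of e
            train_y = e if random.random() < 0.75 else 1 - e  # 25% training noise
            test_y = e if random.random() < 0.75 else 1 - e   # fresh 25% test noise
            wrong += (train_y != test_y)  # full tree predicts the memorized train_y
            total += 1
print(wrong / total)  # close to 3/8 = 0.375
```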
What has this example shown us?
This explains the discrepancy between training and test set error.
But more importantly, it indicates there's something we should do about it if we want to predict well on future data.
Suppose we had less data
Let's not look at the irrelevant bits: those bits are hidden from the learner.
32 records.
Output y = copy of e, except a random 25% of the records have y set to the opposite of e.
What decision tree would we learn now?
Without access to the irrelevant bits
The learned tree is just a root split on e, with children e=0 and e=1. These nodes will be unexpandable.
In about 12 of the 16 records in the e=0 node the output will be 0, so this node will almost certainly predict 0.
In about 12 of the 16 records in the e=1 node the output will be 1, so this node will almost certainly predict 1.

Testing this small tree on the test set:
Almost certainly none of the tree nodes are corrupted (almost certainly all are fine).
1/4 of the test set records are corrupted, so 1/4 of the test set will be wrongly predicted because the test record is corrupted.
3/4 of the test set records are fine, so 3/4 of the test predictions will be fine.
In total, we expect to be wrong on only 1/4 of the test set predictions.

Overfitting
Definition: if your machine learning algorithm fits noise (i.e. pays attention to parts of the data that are irrelevant) it is overfitting.
Fact (theoretical and empirical): if your machine learning algorithm is overfitting then it may perform less well on test set data.
Avoiding overfitting
Usually we do not know in advance which are the irrelevant variables, and it may depend on the context.
For example, if y = a AND b then b is an irrelevant variable only in the portion of the tree in which a = 0.
But we can use simple statistics to warn us that we might be overfitting.
Consider this split...
A chi-squared test
Suppose that mpg was completely uncorrelated with maker.
What is the chance we'd have seen data of at least this apparent level of association anyway?
By using a particular kind of chi-squared test, the answer is 13.5%.
What is a Chi-Square test?
Google "chi square" for excellent explanations.
It takes into account the surprise that a feature generates: terms of the form (unsplit-number - split-number)^2 / unsplit-number.
It gives the probability that the rate you saw was generated by luck of the draw.
Does likes-Matrix predict CS grad?

              CS      Non CS
Likes Matrix  15972   145643
Hates Matrix  3       37

              CS      Non CS
Likes Matrix  21543   145643
Hates Matrix  3       173

Using Chi-squared to avoid overfitting
Build the full decision tree as before.
But when you can grow it no more, start to
prune:
Beginning at the bottom of the tree, delete
splits in which pchance > MaxPchance.
Continue working your way up until there are no
more prunable nodes.
MaxPchance is a magic parameter you must specify to the decision tree,
indicating your willingness to risk fitting noise.
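For concreteness, here is a hedged sketch of how pchance for one split might be computed; this is my illustration rather than the deck's exact procedure, the branch-versus-output counts are hypothetical, and it assumes scipy is available:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of bad/good mpg within each branch of a candidate split.
# Rows: one per child branch; columns: one per output value.
observed = [[9, 3],    # e.g. maker = america: 9 bad, 3 good
            [3, 5],    # e.g. maker = asia:    3 bad, 5 good
            [2, 2]]    # e.g. maker = europe:  2 bad, 2 good

chi2, pchance, dof, expected = chi2_contingency(observed)
print(pchance)  # if pchance > MaxPchance, this split would be pruned away
```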
Pruning example
With MaxPchance = 0.1, you will see the following MPG decision tree.
Note the improved test set accuracy compared with the unpruned tree.

MaxPchance
Good news: the decision tree can automatically adjust its pruning decisions according to the amount of apparent noise and data.
Bad news: the user must come up with a good value of MaxPchance. (Note: Andrew usually uses 0.05, which is his favorite value for any magic parameter.)
Good news: but with extra work, the best MaxPchance value can be estimated automatically by a technique called cross-validation.
MaxPchance
Technical note (dealt with in other lectures): MaxPchance is a regularization parameter.
Expected test set error is high at both extremes: decreasing MaxPchance too far gives high bias, increasing it too far gives high variance.

Note that this pruning is heuristically trying to find
"the simplest tree structure for which all within-leaf-node disagreements can be explained by chance".
This is not the same as saying "the simplest classification scheme for which..."
Decision trees are biased to prefer classifiers that can be expressed as trees.

Expressiveness of Decision Trees
Assume all inputs are Boolean and all outputs are Boolean.
What is the class of Boolean functions that are possible to represent by decision trees?
Answer: all Boolean functions.
Simple proof:
1. Take any Boolean function
2. Convert it into a truth table
3. Construct a decision tree in which each row of the truth table corresponds to one path through the decision tree
Real-Valued inputs
What should we do if some of the inputs are real-valued?

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          97            75          2265    18.2          77         asia
bad   6          199           90          2648    15            70         america
bad   4          121           110         2600    12.8          77         europe
bad   8          350           175         4100    13            73         america
bad   6          198           95          3102    16.5          74         america
bad   4          108           94          2379    16.5          73         asia
bad   4          113           95          2228    14            71         asia
bad   8          302           139         3570    12.8          78         america
:     :          :             :           :       :             :          :
(remaining records omitted)

One branch for each numeric value idea:
Idea One: Branch on each possible real value.
Hopeless: with such a high branching factor we will shatter the dataset and overfit.
Note: pchance is 0.222 in the above... if MaxPchance was 0.05 that would end up pruning away to a single root node.

A better idea: thresholded splits
Suppose X is real valued.
Define IG(Y|X:t) as H(Y) - H(Y|X:t)
Define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
IG(Y|X:t) is the information gain for predicting Y if all you know is whether X is greater than or less than t.
Then define IG*(Y|X) = max_t IG(Y|X:t)
For each real-valued attribute, use IG*(Y|X) for assessing its suitability as a split.
Computational Issues
You can compute IG*(Y|X) in time R log R + 2 R ny, where
R is the number of records in the node under consideration, and
ny is the arity (number of distinct values) of Y.
How? Sort records according to increasing values of X. Then create a 2 x ny contingency table corresponding to the computation of IG(Y|X:xmin). Then iterate through the records, testing for each threshold between adjacent values of X, incrementally updating the contingency table as you go. For a minor additional speedup, only test between values of Y that differ.
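A sketch of the sorted-scan computation just described (my own code, not from the slides): sort by X once, sweep candidate thresholds between adjacent distinct X values while maintaining below/above class counts, and keep the best IG(Y|X:t):

```python
import math
from collections import Counter

def entropy_from_counts(counts, n):
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)

def best_threshold_ig(xs, ys):
    """Return (IG*(Y|X), best threshold t), scanning thresholds between
    adjacent distinct sorted values of X."""
    pairs = sorted(zip(xs, ys))               # the R log R sort
    n = len(pairs)
    h_y = entropy_from_counts(Counter(ys), n)
    below, above = Counter(), Counter(y for _, y in pairs)
    best_ig, best_t = 0.0, None
    for i in range(n - 1):
        x_i, y_i = pairs[i]
        below[y_i] += 1                       # move record i below the threshold
        above[y_i] -= 1
        if x_i == pairs[i + 1][0]:
            continue                          # no threshold between equal X values
        t = (x_i + pairs[i + 1][0]) / 2
        n_below, n_above = i + 1, n - i - 1
        h_cond = (n_below / n) * entropy_from_counts(below, n_below) \
               + (n_above / n) * entropy_from_counts(above, n_above)
        ig = h_y - h_cond
        if ig > best_ig:
            best_ig, best_t = ig, t
    return best_ig, best_t

# Toy usage: horsepower-like values against a good/bad label.
print(best_threshold_ig([75, 90, 94, 110, 139, 175],
                        ["good", "good", "good", "bad", "bad", "bad"]))
```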
Example with MPG
Unpruned tree using reals
Pruned tree using reals
Binary categorical splits
One of Andrew's favorite tricks.
Allow splits of the following form:
Root, with one branch for "Attribute equals value" and one branch for "Attribute doesn't equal value".
Example:
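A sketch of how such equals / doesn't-equal splits could be scored (my own illustration, reusing the information-gain idea; helper names are mine):

```python
import math
from collections import Counter

def H(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def best_binary_categorical_split(rows, attr, target):
    """Best split of the form (attr == v) versus (attr != v), by information gain."""
    n = len(rows)
    h_y = H([r[target] for r in rows])
    best = (0.0, None)
    for v in set(r[attr] for r in rows):
        eq = [r[target] for r in rows if r[attr] == v]
        ne = [r[target] for r in rows if r[attr] != v]
        if not eq or not ne:
            continue
        ig = h_y - (len(eq) / n * H(eq) + len(ne) / n * H(ne))
        best = max(best, (ig, v), key=lambda t: t[0])
    return best  # (information gain, value to test equality against)
```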
Predicting age from census
Predicting gender from census
Predicting wealth from census
Conclusions
Decision trees are the single most popular data mining tool:
Easy to understand
Easy to implement
Easy to use
Computationally cheap
It's possible to get in trouble with overfitting.
They do classification: predict a categorical output from categorical and/or real inputs.

What you should know
What's information gain, and why we use it
The recursive algorithm for building an unpruned decision tree
What are training and test set errors
Why test set errors can be bigger than training set errors
Why pruning can reduce test set error
How to exploit real-valued inputs

What we haven't discussed
It's easy to have real-valued outputs too---these are called Regression Trees*
Bayesian Decision Trees can take a different approach to preventing overfitting
Computational complexity (straightforward and cheap)*
Alternatives to Information Gain for splitting nodes
How to choose MaxPchance automatically*
The details of Chi-Squared testing*
Boosting---a simple way to improve accuracy*
* = discussed in other Andrew lectures
For more information
Two nice books:
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning.
Dozens of nice papers, including:
Wray Buntine. Learning Classification Trees. Statistics and Computing (1992), Vol. 2, pages 63-73.
Kearns and Mansour. On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. STOC: ACM Symposium on Theory of Computing, 1996.
Dozens of software implementations are available on the web, for free and commercially, for prices ranging between $50 and $300,000.
Discussion
Instead of using information gain, why not choose the splitting attribute to be the one with the highest prediction accuracy?
Instead of greedily, heuristically, building the tree, why not do a combinatorial search for the optimal tree?
If you build a decision tree to predict wealth, and marital status, age and gender are chosen as attributes near the top of the tree, is it reasonable to conclude that those three inputs are the major causes of wealth?
...would it be reasonable to assume that attributes not mentioned in the tree are not causes of wealth?
...would it be reasonable to assume that attributes not mentioned in the tree are not correlated with wealth?
What about multi-attribute splits?