Function Approximation
Problem Setting
• Set of possible instances X
• Set of possible labels Y
• Unknown target function f : X → Y
• Set of function hypotheses H = {h | h : X → Y}
Input: Training examples of unknown target function f
{⟨xi, yi⟩} for i = 1, …, n, i.e., {⟨x1, y1⟩, …, ⟨xn, yn⟩}
Output: Hypothesis h ∈ H that best approximates f
Based on slide by Tom Mitchell
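To make this setup concrete, here is a minimal Python sketch of the same objects; the names (Instance, training_error, best_hypothesis) and the enumerate-and-score strategy are illustrative, not from the slides.

```python
# Minimal sketch of the function-approximation setting (illustrative names).
# An instance is a feature dictionary, a label is a string, and a hypothesis
# is any callable mapping X -> Y.
from typing import Callable, Dict, List, Tuple

Instance = Dict[str, str]
Label = str
Hypothesis = Callable[[Instance], Label]

def training_error(h: Hypothesis, data: List[Tuple[Instance, Label]]) -> float:
    """Fraction of training examples <x_i, y_i> on which h disagrees with y_i."""
    return sum(h(x) != y for x, y in data) / len(data)

def best_hypothesis(H: List[Hypothesis], data: List[Tuple[Instance, Label]]) -> Hypothesis:
    """Pick the hypothesis in a finite set H that best approximates f on the training data."""
    return min(H, key=lambda h: training_error(h, data))
```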
Sample Dataset
• Columns denote features Xi
• Rows denote labeled instances ⟨xi, yi⟩
• Class label denotes whether a tennis game was played
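As a concrete illustration, a few PlayTennis-style rows written as ⟨xi, yi⟩ pairs; the specific values below are examples and may not match the slide's table exactly.

```python
# Illustrative rows in the spirit of the PlayTennis dataset (values are examples,
# not necessarily the exact table on the slide). Each row is a feature dict x_i
# plus a class label y_i saying whether tennis was played.
dataset = [
    ({"outlook": "sunny",    "temperature": "hot",  "humidity": "high", "wind": "weak"},   "no"),
    ({"outlook": "overcast", "temperature": "hot",  "humidity": "high", "wind": "weak"},   "yes"),
    ({"outlook": "rain",     "temperature": "mild", "humidity": "high", "wind": "strong"}, "no"),
]
```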
Decision Tree
• A possible decision tree for the data:
• Each internal node: test one attribute Xi
• Each branch from a node: selects one value for Xi
• Each leaf node: predict Y (or p(Y | x ∈ leaf))
Based on slide by Tom Mitchell
Decision Tree
• A possible decision tree for the data:
• What prediction would we make for
<outlook=sunny, temperature=hot, humidity=high, wind=weak> ?
Based on slide by Tom Mitchell
Decision Tree
• If features are continuous, internal nodes can test the value of a feature against a threshold
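For example, a continuous-valued test at an internal node might look like the sketch below; the feature name and the threshold 75 are made up for illustration.

```python
# Sketch of an internal node testing a continuous feature against a threshold
# (feature name and threshold are illustrative, not from the slide).
def humidity_node(x):
    if x["humidity"] > 75.0:   # continuous test: Humidity > 75 ?
        return "no"            # one branch per outcome of the threshold test
    return "yes"
```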
Decision Tree Learning
Problem Setting:
• Set of possible instances X
– each instance x in X is a feature vector
– e.g., <Humidity=low, Wind=weak, Outlook=rain, Temp=hot>
• Unknown target function f : X → Y
– Y is discrete valued
• Set of function hypotheses H = { h | h : X → Y }
– each hypothesis h is a decision tree
– each tree sorts x to a leaf, which assigns y
Slide by Tom Mitchell
Stages of (Batch) Machine Learning
Given: labeled training data X, Y = {⟨xi, yi⟩} for i = 1, …, n
• Assumes each xi ~ D(X) with yi = f_target(xi)
Train the model:
model ← classifier.train(X, Y)
[Diagram: X, Y → learner → model, and x → model → y_prediction]
Apply the model to new data:
• Given: new unlabeled instance x ~ D(X)
y_prediction ← model.predict(x)
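These stages map onto, for example, scikit-learn's fit/predict API; the sketch below assumes scikit-learn and NumPy are installed and uses a tiny toy dataset.

```python
# Sketch of the batch-learning stages with a decision tree classifier
# (assumes scikit-learn and NumPy; the toy data is illustrative).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 1], [1, 1], [1, 0], [0, 0]])  # n training feature vectors
Y = np.array(["yes", "yes", "no", "no"])        # n training labels

model = DecisionTreeClassifier().fit(X, Y)      # model <- classifier.train(X, Y)

x_new = np.array([[0, 1]])                      # new unlabeled instance x ~ D(X)
y_prediction = model.predict(x_new)             # y_prediction <- model.predict(x)
```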
Example Application: A Tree to Predict Caesarean Section Risk
Based on Example by Tom Mitchell
Decision Tree Induced Partition
[Figure: the induced tree first tests Color (green, blue, red); green → test Size (big: −, small: +); blue → +; red → test Shape (round: +, square → test Size (big: −, small: +))]
Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-
parallel (hyper-)rectangles
• Each rectangular region is labeled with one label
– or a probability distribution over labels
[Figure: axis-parallel decision boundary in a two-dimensional feature space]
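A depth-2 tree over two continuous features makes this concrete; the thresholds 0.5 and 0.3 below are illustrative.

```python
# Sketch: a depth-2 tree over two continuous features carves the feature space
# into axis-parallel rectangles (thresholds are illustrative).
def tree_label(x1: float, x2: float) -> str:
    if x1 <= 0.5:
        return "-" if x2 <= 0.3 else "+"
    return "+"
# The three leaves correspond to the rectangles
# {x1 <= 0.5, x2 <= 0.3}, {x1 <= 0.5, x2 > 0.3}, and {x1 > 0.5}.
```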
Expressiveness
• Decision trees can represent any boolean function of the input attributes
Truth table row → path to leaf
• In the worst case, the tree will require exponentially many nodes
Expressiveness
Decision trees have a variable-sized hypothesis space
• As the #nodes (or depth) increases, the hypothesis
space grows
– Depth 1 (“decision stump”): can represent any boolean function of one feature
– Depth 2: any boolean function of two features, and some involving three features (e.g., (x1 ∧ x2) ∨ (¬x1 ∧ ¬x3); see the sketch after this list)
– etc.
Based on slide by Pedro Domingos
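The sketch below, referenced from the depth-2 item above, checks that a depth-2 tree splitting on x1 first computes exactly (x1 ∧ x2) ∨ (¬x1 ∧ ¬x3).

```python
# Sketch: each truth-table row follows one root-to-leaf path of a depth-2 tree.
from itertools import product

def f(x1, x2, x3):
    return (x1 and x2) or ((not x1) and (not x3))

def depth2_tree(x1, x2, x3):
    # root tests x1; the second level tests x2 on one branch and x3 on the other
    return bool(x2) if x1 else (not x3)

assert all(f(*row) == depth2_tree(*row) for row in product([False, True], repeat=3))
```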
Another Example:
Restaurant Domain (Russell & Norvig)
Model a patron’s decision of whether to wait for a table at a restaurant
~7,000 possible cases
A Decision Tree
from Introspection
Is this the best decision tree?
Preference bias: Ockham’s Razor
• Principle stated by William of Ockham (1285-1347)
– “non sunt multiplicanda entia praeter necessitatem”
– entities are not to be multiplied beyond necessity
– AKA Occam’s Razor, Law of Economy, or Law of Parsimony
Idea: The simplest consistent explanation is the best
• Therefore, the smallest decision tree that correctly
classifies all of the training examples is best
• Finding the provably smallest decision tree is NP-hard
• ...So instead of constructing the absolute smallest tree consistent with the training examples, construct one that is pretty small
Basic Algorithm for Top-Down
Induction of Decision Trees
[ID3, C4.5 by Quinlan]
node = root of decision tree
Main loop:
1. A ← the “best” decision attribute for the next node.
2. Assign A as decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, stop.
Else, recurse over new leaf nodes.
How do we choose which attribute is best?
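A minimal Python sketch of this main loop, with the attribute-selection rule left as a parameter (ID3 plugs in information gain, defined later in these slides); examples are (feature-dict, label) pairs and the function names are illustrative.

```python
# Sketch of top-down decision tree induction; `choose_best_attribute` is pluggable.
from collections import Counter

def build_tree(examples, attributes, choose_best_attribute):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                     # perfectly classified: stop
        return labels[0]
    if not attributes:                            # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    A = choose_best_attribute(examples, attributes)          # 1. pick the "best" attribute
    tree = {A: {}}                                           # 2. A is this node's decision attribute
    for v in {x[A] for x, _ in examples}:                    # 3. one descendant per value of A
        subset = [(x, y) for x, y in examples if x[A] == v]  # 4. sort examples to the new nodes
        rest = [a for a in attributes if a != A]
        tree[A][v] = build_tree(subset, rest, choose_best_attribute)  # 5. recurse
    return tree
```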
Choosing the Best Attribute
Key problem: choosing which attribute to split a given set of examples
• Some possibilities are:
– Random: Select any attribute at random
– Least-Values: Choose the attribute with the smallest number of possible values
– Most-Values: Choose the attribute with the largest number of possible values
– Max-Gain: Choose the attribute that has the largest expected information gain
• i.e., the attribute that results in the smallest expected size of subtrees rooted at its children
• The ID3 algorithm uses the Max-Gain method of selecting the best attribute
Choosing an Attribute
Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative”
Which split is more informative: Patrons? or Type?
Based on Slide from M. desJardins & T. Finin
ID3-induced
Decision Tree
Based on Slide from M. desJardins & T. Finin
Compare the Two Decision Trees
Based on Slide from M. desJardins & T. Finin
Information Gain
Which test is more informative?
Split over whether Balance exceeds 50K (branches: ≤ 50K, > 50K) vs. split over whether the applicant is employed (branches: Unemployed, Employed)
Based on slide by Pedro Domingos
Information Gain
Impurity/Entropy (informal)
– Measures the level of impurity in a group of examples
Based on slide by Pedro Domingos
Impurity
[Figure: three example groups: very impure, less impure, minimum impurity]
Based on slide by Pedro Domingos
Entropy: a common way to measure impurity
Entropy H(X) of a random variable X with n possible values:
H(X) = − Σ_{i=1..n} P(X = i) log2 P(X = i)
H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)
Why? Information theory:
• Most efficient code assigns −log2 P(X = i) bits to encode the message X = i
• So, the expected number of bits to code one random X is Σ_i P(X = i) · (−log2 P(X = i)) = H(X)
Slide by Tom Mitchell
Example: Huffman code
• In 1952 MIT student David Huffman devised, in the course
of doing a homework assignment, an elegant coding
scheme which is optimal in the case where all symbols’
probabilities are integral powers of 1/2.
• A Huffman code can be built in the following manner:
– Rank all symbols in order of probability of occurrence
– Successively combine the two symbols of the lowest
probability to form a new composite symbol; eventually
we will build a binary tree where each node is the
probability of all nodes beneath it
– Trace a path to each leaf, noting the direction taken at each node
Based on Slide from M. desJardins & T. Finin
Huffman code example
Symbol   Code   Length   Prob    Length × Prob
A        000    3        0.125   0.375
B        001    3        0.125   0.375
C        01     2        0.250   0.500
D        1      1        0.500   0.500
                                 average message length: 1.750
If we use this code to send many messages (A, B, C, or D) with this probability distribution, then, over time, the average bits/message should approach 1.75.
Based on Slide from M. desJardins & T. Finin
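A small sketch that rebuilds a Huffman code for this distribution using Python's heapq; the 0/1 labels may come out mirrored relative to the slide, but the code lengths and the 1.75 bits/message average are the same.

```python
# Sketch: Huffman coding for the distribution above (A, B: 0.125; C: 0.25; D: 0.5).
import heapq

probs = {"A": 0.125, "B": 0.125, "C": 0.25, "D": 0.5}

# Each heap entry: (probability, tie-breaker, {symbol: code-so-far}).
heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    p0, _, c0 = heapq.heappop(heap)        # combine the two least probable subtrees
    p1, _, c1 = heapq.heappop(heap)
    merged = {s: "0" + code for s, code in c0.items()}
    merged.update({s: "1" + code for s, code in c1.items()})
    heapq.heappush(heap, (p0 + p1, counter, merged))
    counter += 1

code = heap[0][2]
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code, avg_len)   # code lengths 3, 3, 2, 1 -> average 1.75 bits/message
```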
2-Class Cases:
Entropy: H(X) = − Σ_{i=1..n} P(X = i) log2 P(X = i)
• What is the entropy of a group in which all examples belong to the same class? (minimum impurity)
– entropy = −1 · log2 1 = 0
– not a good training set for learning
• What is the entropy of a group with 50% in either class? (maximum impurity)
– entropy = −0.5 · log2 0.5 − 0.5 · log2 0.5 = 1
– a good training set for learning
Based on slide by Pedro Domingos
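A short sketch reproducing these two cases (the function name entropy is the only thing assumed here):

```python
# Sketch: two-class entropy, reproducing the pure and 50/50 groups above.
import math
from collections import Counter

def entropy(labels):
    """H = - sum_i p_i log2 p_i over the class proportions in `labels`."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["+"] * 10))              # 0.0 : minimum impurity (pure group)
print(entropy(["+"] * 5 + ["-"] * 5))   # 1.0 : maximum impurity (50/50 split)
```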
Sample Entropy
For a sample S with proportion p⊕ of positive examples and p⊖ of negative examples:
H(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖
Slide by Tom Mitchell
Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a decision tree.
Based on slide by Pedro Domingos
From Entropy to Information Gain
Entropy H(X) of a random variable X:
H(X) = − Σ_i P(X = i) log2 P(X = i)
Specific conditional entropy H(X | Y = v) of X given Y = v:
H(X | Y = v) = − Σ_i P(X = i | Y = v) log2 P(X = i | Y = v)
Conditional entropy H(X | Y) of X given Y:
H(X | Y) = Σ_v P(Y = v) H(X | Y = v)
Mutual information (aka Information Gain) of X and Y:
I(X, Y) = H(X) − H(X | Y)
Slide by Tom Mitchell
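The same four quantities as a sketch, computed from paired samples of X and Y; the function names are illustrative.

```python
# Sketch: entropy, specific conditional entropy, conditional entropy, and
# mutual information (information gain) estimated from paired samples.
import math
from collections import Counter

def H(xs):
    n = len(xs)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(xs).values())

def H_given_value(xs, ys, v):
    """Specific conditional entropy H(X | Y = v)."""
    return H([x for x, y in zip(xs, ys) if y == v])

def H_given(xs, ys):
    """Conditional entropy H(X | Y) = sum_v P(Y = v) * H(X | Y = v)."""
    n = len(ys)
    return sum((c / n) * H_given_value(xs, ys, v) for v, c in Counter(ys).items())

def mutual_information(xs, ys):
    """I(X, Y) = H(X) - H(X | Y), i.e., information gain."""
    return H(xs) - H_given(xs, ys)
```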
Information Gain
Information Gain is the mutual information between
input attribute A and target variable Y
Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A:
Gain(S, A) = H_S(Y) − H_S(Y | A)
Slide by Tom Mitchell
Calculating Information Gain
Information Gain = entropy(parent) − [average entropy(children)]
Entire population: 30 instances (14 of one class, 16 of the other), split into one child with 17 instances (13 vs. 4) and one child with 13 instances (1 vs. 12).
parent entropy = −(14/30) log2(14/30) − (16/30) log2(16/30) = 0.996
child entropy (17 instances) = −(13/17) log2(13/17) − (4/17) log2(4/17) = 0.787
child entropy (13 instances) = −(1/13) log2(1/13) − (12/13) log2(12/13) = 0.391
(Weighted) average entropy of children = (17/30) · 0.787 + (13/30) · 0.391 = 0.615
Information Gain = 0.996 − 0.615 = 0.38
Based on slide by Pedro Domingos
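The same arithmetic as a sketch (the helper H2 is illustrative):

```python
# Sketch: reproducing the numbers above (parent 14 vs. 16; children 13 vs. 4 and 1 vs. 12).
import math

def H2(a, b):
    """Two-class entropy of a group with class counts a and b."""
    total = a + b
    return sum(-(c / total) * math.log2(c / total) for c in (a, b) if c > 0)

parent = H2(14, 16)                                # ~0.996
left, right = H2(13, 4), H2(1, 12)                 # ~0.787 and ~0.391
children = (17 / 30) * left + (13 / 30) * right    # ~0.615
gain = parent - children                           # ~0.38
print(round(parent, 3), round(children, 3), round(gain, 3))
```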
Entropy-Based Automatic Decision Tree Construction
Training set X: x1 = (f11, f12, …, f1m), x2 = (f21, f22, …, f2m), …, xn = (fn1, fn2, …, fnm)
At the root node (Node 1): what feature should be used? What values?
Quinlan suggested information gain in his ID3 system and later the gain ratio, both based on entropy.
Based on slide by Pedro Domingos
Using Information Gain to Construct a Decision Tree
• Choose the attribute A with the highest information gain for the full training set X at the root of the tree.
• Construct child nodes for each value v1, v2, …, vk of A. Each child gets the associated subset of vectors in which A has that particular value, e.g., X_v1 = {x ∈ X | value(A) = v1}.
• Repeat recursively. (Till when?)
Disadvantage of information gain:
• It prefers attributes with a large number of values that split the data into small, pure subsets
• Quinlan’s gain ratio uses normalization to improve this
Based on slide by Pedro Domingos
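Putting the pieces together: a sketch of information-gain attribute selection that could plug into the induction skeleton sketched earlier. Quinlan's gain ratio would additionally divide the gain by the entropy of the split itself (the split information) to penalize many-valued attributes; the function names below are illustrative.

```python
# Sketch: choose the split attribute by information gain (ID3-style selection).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """entropy(parent) - weighted average entropy of the children induced by `attribute`."""
    n = len(examples)
    parent = entropy([y for _, y in examples])
    remainder = 0.0
    for v, count in Counter(x[attribute] for x, _ in examples).items():
        child_labels = [y for x, y in examples if x[attribute] == v]
        remainder += (count / n) * entropy(child_labels)
    return parent - remainder

def choose_best_attribute(examples, attributes):
    return max(attributes, key=lambda a: information_gain(examples, a))

# Usage with the earlier skeleton (both are sketches):
#   tree = build_tree(dataset, ["outlook", "temperature", "humidity", "wind"], choose_best_attribute)
```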