AIML Module 04
Chapter 8
Bayesian Learning
"In science, progress is possible. In fact, if one believes in Bayes'
theorem, scientific progress is inevitable as predictions are made
and as beliefs are tested and refined."
— Nate Silver
Bayesian Learning is a learning method that describes and represents knowledge in an uncertain
domain and provides a way to reason about this knowledge using probability measures. It uses
Bayes theorem to infer the unknown parameters of a model. Bayesian inference is useful in
many applications that involve reasoning and diagnosis, such as game theory, medicine, etc.
Bayesian inference is also powerful for handling missing data and for estimating the
uncertainty in predictions.
Learning Objectives
• Understand the basics of probability-based learning and probability theory
• Learn the fundamentals of Bayes theorem
• Introduce Bayes classification models such as the Brute Force Bayes learning algorithm, Bayes Optimal classifier, and Gibbs algorithm
• Introduce Naive Bayes classification models that work on the principle of Bayes theorem
• Explore the Naive Bayes classification algorithm
• Study the Naive Bayes algorithm for continuous attributes using the Gaussian distribution
• Introduce other popular types of Naive Bayes classifiers such as the Bernoulli Naive Bayes classifier, Multinomial Naive Bayes classifier, and Multi-class Naive Bayes classifier
A deterministic model, in contrast, has no randomness; hence it exhibits the same initial conditions every time the model is run and is likely to give a single
possible outcome as the solution.
Bayesian learning differs from probabilistic learning as it uses subjective probabilities
(i.e., probability that is based on an individual's belief or interpretation about the outcome of an
event, and that can change over time) to infer the parameters of a model. Two practical learning algorithms
called Naive Bayes learning and Bayesian Belief Network (BBN) form the major part of Bayesian
learning. These algorithms use prior probabilities and apply Bayes rule to infer useful information.
Bayesian Belief Networks (BBN) are explained in detail in Chapter 9.
Prior Probability
It is the general probability of an uncertain event before an observation is seen or some evidence is
collected. It is the initial probability that is believed before any new information is collected.
Likelihood Probability
Likelihood probability is the relative probability of the observation occurring for each class or the
sampling density for the evidence given the hypothesis. It is stated as P (Evidence | Hypothesis),
which denotes the likeliness of the occurrence of the evidence given the parameters.
Posterior Probability
It is the updated or revised probability of an event taking into account the observations from the
training data. P (Hypothesis | Evidence) is the posterior distribution representing the belief about
the hypothesis, given the evidence from the training data. Therefore,
Posterior probability = prior probability + new evidence
P(Hypothesis h | Evidence E) is calculated from the prior probability P(Hypothesis h), the
likelihood probability P(Evidence E | Hypothesis h) and the marginal probability P(Evidence E).
It can be written as:

P(Hypothesis h | Evidence E) = [P(Evidence E | Hypothesis h) × P(Hypothesis h)] / P(Evidence E)    (8.1)

where, P(Hypothesis h) is the prior probability of the hypothesis h without observing the training
data or considering any evidence. It denotes the prior belief or the initial probability that the
hypothesis h is correct. P(Evidence E) is the prior probability of the evidence E from the training
dataset without any knowledge of which hypothesis holds. It is also called the marginal probability.
P(Evidence E | Hypothesis h) is the prior probability of Evidence E given Hypothesis h.
It is the likelihood probability of the Evidence E after observing the training data that the
hypothesis h is correct. P(Hypothesis h | Evidence E) is the posterior probability of Hypothesis h
given Evidence E. It is the probability of the hypothesis h after observing the training data that the
evidence E is correct. In other words, from Bayes theorem, Eq. (8.1), one can observe that:

Posterior Probability ∝ Prior Probability × Likelihood Probability
Bayes theorem helps in calculating the posterior probability for a number of hypotheses, from
which the hypothesis with the highest probability can be selected.
This selection of the most probable hypothesis from a set of hypotheses is formally defined as
Maximum A Posteriori (MAP) Hypothesis.
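As an illustrative sketch (not part of the original text), the short Python fragment below ranks a set of candidate hypotheses by prior × likelihood and picks the MAP hypothesis; the hypothesis names and probability values here are hypothetical.

    # Hypothetical priors P(h) and likelihoods P(E | h) for three candidate hypotheses.
    priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
    likelihoods = {"h1": 0.2, "h2": 0.6, "h3": 0.4}

    # The posterior is proportional to likelihood x prior; the evidence P(E) is a
    # common scaling factor, so it can be ignored when only ranking hypotheses.
    unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
    h_map = max(unnormalized, key=unnormalized.get)
    print(h_map, unnormalized[h_map])   # h2 0.18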
P(A | B) = 2/4
P(B | A) = 2/5
P(A | B) = P(B | A) P(A) / P(B) = 2/4
P(B | A) = P(A | B) P(B) / P(A) = 2/5
Let us now consider a numerical example to illustrate the use of Bayes theorem.

Example: Consider a boy who has a volleyball tournament on the next day, but today he feels
sick. Since he is a healthy boy, there is only a 40% chance that he would fall sick. Now,
find the probability of the boy participating in the tournament. The boy is very much interested in
volleyball, so there is a 90% probability that he would participate in tournaments and a 20% probability that he
will fall sick given that he participates in the tournament.
Solution: P (Boy participating in the tournament) = 90%
P (He is sick | Boy participating in the tournament) = 20%
P (He is Sick) = 40%
The probability of the boy participating in the tournament given that he is sick is:
P(Boy participating in the tournament | He is sick) = P(Boy participating in the tournament)
× P(He is sick | Boy participating in the tournament) / P(He is sick)
P(Boy participating in the tournament | He is sick) = (0.9 × 0.2)/0.4
= 0.45
Hence, 45% is the probability that the boy will participate in the tournament given that he is sick.
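The same computation can be reproduced in a couple of lines of Python; the probabilities are the ones stated in the example above.

    # Bayes rule applied to the tournament example.
    p_participate = 0.9                 # P(participates)
    p_sick_given_participate = 0.2      # P(sick | participates)
    p_sick = 0.4                        # P(sick)

    p_participate_given_sick = p_participate * p_sick_given_participate / p_sick
    print(round(p_participate_given_sick, 2))   # 0.45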
One related concept of Bayes theorem is the principle of Minimum Description Length (MDL). The
minimum description length (MDL) principle is yet another powerful method, like Occam's razor,
to perform inductive inference. It states that the best and most probable hypothesis
is chosen for a set of observed data as the one with the minimum description length. Recall the
Maximum A Posteriori (MAP) Hypothesis, h_MAP, of Eq. (8.2), which says that given a set of candidate
hypotheses, the hypothesis which has the maximum value is considered as the maximum probable
hypothesis or most probable hypothesis. The Naive Bayes algorithm uses Bayes theorem and applies
this principle to find the best hypothesis for a given problem. Let us clearly understand how the algorithm works through the following steps:
1. Compute the prior probability for the target class.
2. Compute the frequency matrix and likelihood probability for each feature.
3. Use Bayes theorem, Eq. (8.1), to calculate the posterior probability of all hypotheses.
4. Use the Maximum A Posteriori (MAP) Hypothesis, h_MAP, Eq. (8.2), to classify the test object to the hypothesis with the maximum probability.
Example: Assess a student's performance using the Naive Bayes algorithm with the dataset
provided in Table 8.1. Predict whether a student gets a job offer or not in his final year of the course.
Table 8.1: Training Dataset
Solution: The training dataset T consists of 10 data instances with attributes such as ‘CGPA’,
“Interactiveness’, ‘Practical Knowledge’ and ‘Communication Skills’ as shown in Table 8.1. The
target variable is Job Offer which is classified as Yes or No for a candidate student.
Step 1: Compute the prior probability for the target feature Job Offer’. The target feature ‘Job
Offer has two classes, ‘Yes’ and ‘No’. It is a binary classification problem. Given a student instance,
we need to classify whether ‘Job Offer = Yes’ or ‘Job Offer = No’.
From the training dataset, we observe that the frequency or the number of instances with ‘Job
Offer = Yes’ is 7 and ‘Job Offer = No' is 3.
The prior probability for the target feature is calculated by dividing the number of instances
belonging to a particular target class by the total number of instances.
Hence, the prior probability for ‘Job Offer = Yes’ is 7/10 and ‘Job Offer = No’ is 3/10 as shown
in Table 8.2.
Step 2: Compute the frequency matrix and likelihood probability for each feature.
Step 2(a): Feature - CGPA
Table 8.3 shows the frequency matrix for the feature CGPA.
Table 8.3: Frequency Matrix of CGPA
CGPA      Job Offer = Yes    Job Offer = No
≥9        3                  1
≥8        4                  0
<8        0                  2
Total     7                  3
Table 8.4 shows how the likelihood probability is calculated for CGPA using conditional
probability.
Table 8.4: Likelihood Probability of CGPA
≥9    P(CGPA ≥9 | Job Offer = Yes) = 3/7    P(CGPA ≥9 | Job Offer = No) = 1/3
≥8    P(CGPA ≥8 | Job Offer = Yes) = 4/7    P(CGPA ≥8 | Job Offer = No) = 0/3
<8    P(CGPA <8 | Job Offer = Yes) = 0/7    P(CGPA <8 | Job Offer = No) = 2/3
As explained earlier, the likelihood probability is stated as the sampling density for the
evidence given the hypothesis. It is denoted as P(Evidence | Hypothesis), which says how likely
is the occurrence of the evidence given the parameters.
It is calculated as the number of instances of each attribute value for a given class value
divided by the number of instances with that class value.
For example, P(CGPA ≥9 | Job Offer = Yes) denotes the number of instances with 'CGPA ≥9'
and 'Job Offer = Yes' divided by the total number of instances with 'Job Offer = Yes'.
From the Table 8.3 Frequency Matrix of CGPA, the number of instances with 'CGPA ≥9' and 'Job
Offer = Yes' is 3. The total number of instances with 'Job Offer = Yes' is 7. Hence, P(CGPA ≥9 | Job
Offer = Yes) = 3/7.
Similarly, the Likelihood probability is calculated for all attribute values of feature CGPA.
Step 2(b): Feature — Interactiveness
Table 8.5 shows the frequency matrix for the feature Interactiveness.
Table 8.5: Frequency Matrix of Interactiveness

Interactiveness    Job Offer = Yes    Job Offer = No
Yes                5                  1
No                 2                  2
Total              7                  3
Table 8.6 shows how the likelihood probability is calculated for Interactiveness using condi-
tional probability.
Yes    P(Interactiveness = Yes | Job Offer = Yes) = 5/7    P(Interactiveness = Yes | Job Offer = No) = 1/3
No     P(Interactiveness = No | Job Offer = Yes) = 2/7     P(Interactiveness = No | Job Offer = No) = 2/3
Step 2(c): Feature - Practical Knowledge
Table 8.7 shows the frequency matrix for the feature Practical Knowledge.
Table 8.7: Frequency Matrix of Practical Knowledge
Practical Knowledge Job Offer = Yes Job Offer = No
Very Good 2 0
Average 1 2
Good 4 1
Total 7 3
Table 8.8 shows how the likelihood probability is calculated for Practical Knowledge using
conditional probability.
Table 8.8: Likelihood Probability of Practical Knowledge

Very Good    P(Practical Knowledge = Very Good | Job Offer = Yes) = 2/7    P(Practical Knowledge = Very Good | Job Offer = No) = 0/3
Average      P(Practical Knowledge = Average | Job Offer = Yes) = 1/7      P(Practical Knowledge = Average | Job Offer = No) = 2/3
Good         P(Practical Knowledge = Good | Job Offer = Yes) = 4/7         P(Practical Knowledge = Good | Job Offer = No) = 1/3
Table 8.10 shows how the likelihood probability is calculated for Communication Skills using
conditional probability.
P(Job Offer = Yes | Test data) = P(CGPA ≥8 | Job Offer = Yes) P(Interactiveness = Yes | Job Offer
= Yes) P(Practical Knowledge = Average | Job Offer = Yes) P(Communication Skills = Good | Job
Offer = Yes) P(Job Offer = Yes)
= 4/7 × 5/7 × 1/7 × 4/7 × 7/10
= 0.0233
Similarly, for the other case 'Job Offer = No', when we compute the probability:
P(Job Offer = No | Test data) = [P(CGPA ≥8 | Job Offer = No) P(Interactiveness = Yes | Job Offer
= No) P(Practical Knowledge = Average | Job Offer = No) P(Communication Skills = Good | Job
Offer = No) P(Job Offer = No)] / P(Test Data)
Since the denominator P(Test Data) is common to both classes, it can be ignored for comparison, so we compute
= P(CGPA ≥8 | Job Offer = No) P(Interactiveness = Yes | Job Offer = No) P(Practical Knowledge
= Average | Job Offer = No) P(Communication Skills = Good | Job Offer = No) P(Job Offer = No)
= 0/3 × 1/3 × 2/3 × 1/3 × 3/10
= 0
Since the probability value is zero, the model fails to predict; this is called the zero-probability
error. The problem arises because there are no instances in Table 8.1 with the attribute value
CGPA ≥8 and Job Offer = No, and hence the probability of this case is zero. The zero-probability
error can be solved by applying a smoothing technique called Laplace correction. The idea is that,
given, say, 1000 data instances in the training dataset, if there are zero instances for a particular
value of a feature, we add 1 instance for each attribute-value pair of that feature; this makes little
difference for 1000 data instances, and the overall probability no longer becomes zero.
Now, let us scale the values given in Table 8.1 for 1000 data instances. The scaled values without
Laplace correction are shown in Table 8.11.
Table 8.11: Scaled Values to 1000 without Laplace Correction
Now, add 1 instance for each CGPA-value pair for 'Job Offer = No'. Then,
P(CGPA ≥9 | Job Offer = No) = 101/303 = 0.333
P(CGPA ≥8 | Job Offer = No) = 1/303 = 0.0033
P(CGPA <8 | Job Offer = No) = 201/303 = 0.6634
With values scaled to 1003 data instances, we get
P(Job Offer = Yes | Test data) = P(CGPA ≥8 | Job Offer = Yes) P(Interactiveness = Yes | Job
Offer = Yes) P(Practical Knowledge = Average | Job Offer = Yes) P(Communication Skills = Good |
Job Offer = Yes) P(Job Offer = Yes)
= 400/700 × 500/700 × 100/700 × 400/700 × 700/1003
= 0.02325
P(Job Offer = No | Test data) = P(CGPA ≥8 | Job Offer = No) P(Interactiveness = Yes | Job Offer
= No) P(Practical Knowledge = Average | Job Offer = No) P(Communication Skills = Good | Job
Offer = No) P(Job Offer = No)
= 1/303 × 100/300 × 200/300 × 100/300 × 303/1003
= 0.00007385
Thus, using Laplace Correction, Zero Probability error can be solved with Naive Bayes classifier.
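The classification procedure above can be sketched in a few lines of Python. This is only a minimal illustration, not the textbook's code: it trains a categorical Naive Bayes model from (features, label) rows and applies add-one (Laplace) smoothing. Note that the standard add-one form also adds the number of distinct attribute values to the denominator, which differs slightly in detail from the scaling illustration in the text, and the toy rows below are hypothetical stand-ins since Table 8.1 is not reproduced here.

    from collections import defaultdict

    def train_naive_bayes(rows):
        """Count class frequencies and (feature, class) value frequencies."""
        class_counts = defaultdict(int)
        value_counts = defaultdict(lambda: defaultdict(int))
        feature_values = defaultdict(set)
        for features, label in rows:
            class_counts[label] += 1
            for f, v in features.items():
                value_counts[(f, label)][v] += 1
                feature_values[f].add(v)
        return class_counts, value_counts, feature_values

    def predict(model, test):
        class_counts, value_counts, feature_values = model
        total = sum(class_counts.values())
        scores = {}
        for label, c in class_counts.items():
            score = c / total                        # prior P(class)
            for f, v in test.items():
                counts = value_counts[(f, label)]
                # Laplace correction: add 1 to every attribute-value count.
                score *= (counts[v] + 1) / (c + len(feature_values[f]))
            scores[label] = score
        return max(scores, key=scores.get), scores

    # Hypothetical toy rows standing in for Table 8.1.
    rows = [({"CGPA": ">=9", "Interactiveness": "Yes"}, "Yes"),
            ({"CGPA": "<8", "Interactiveness": "No"}, "No")]
    model = train_naive_bayes(rows)
    print(predict(model, {"CGPA": ">=9", "Interactiveness": "Yes"}))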
Example: Given a hypothesis space with four hypotheses h1, h2, h3 and h4, determine whether the
patient is diagnosed as COVID positive or COVID negative using the Bayes Optimal classifier.
Solution: From the training dataset T, the posterior probabilities of the four different hypotheses
for a new instance are given in Table 8.12.
Table 8.12: Posterior Probability Values

Hypothesis    Posterior P(hi | T)    Prediction
h1            0.3                    COVID Negative
h2            0.1                    COVID Positive
h3            0.2                    COVID Positive
h4            0.1                    COVID Positive
h_MAP chooses h1, which has the maximum probability value 0.3, as the solution and gives the result
that the patient is COVID negative. But the Bayes Optimal classifier combines the predictions of h2, h3
and h4, which sum to 0.4, and gives the result that the patient is COVID positive.

Σ_{hi ∈ H} P(COVID Negative | hi) P(hi | T) = 0.3 × 1 = 0.3
Σ_{hi ∈ H} P(COVID Positive | hi) P(hi | T) = 0.1 × 1 + 0.2 × 1 + 0.1 × 1 = 0.4
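As a sketch of the computation above, the following Python fragment combines the per-hypothesis predictions weighted by their posterior probabilities; the values are the ones from Table 8.12.

    # Posterior probabilities of the four hypotheses and the class each predicts.
    posteriors = {"h1": 0.3, "h2": 0.1, "h3": 0.2, "h4": 0.1}
    prediction = {"h1": "Negative", "h2": "Positive", "h3": "Positive", "h4": "Positive"}

    # Bayes optimal classification: sum the posterior weight behind each predicted class.
    totals = {}
    for h, p in posteriors.items():
        totals[prediction[h]] = totals.get(prediction[h], 0) + p
    print(max(totals, key=totals.get))   # Positive (about 0.4 versus 0.3 for Negative)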
8.3.4 Gibbs Algorithm
The main drawback of the Bayes optimal classifier is that it computes the posterior probability for all
hypotheses in the hypothesis space and then combines the predictions to classify a new instance.
The Gibbs algorithm is a sampling technique which randomly selects a hypothesis from the
hypothesis space according to the posterior probability distribution and uses it to classify a new instance.
It has been shown that the expected prediction error of the Gibbs algorithm is at most twice that of the Bayes
Optimal classifier.
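A minimal sketch of this idea, reusing the hypothetical posterior values of the previous example: a single hypothesis is sampled in proportion to its posterior probability and only its prediction is used.

    import random

    posteriors = {"h1": 0.3, "h2": 0.1, "h3": 0.2, "h4": 0.1}
    prediction = {"h1": "Negative", "h2": "Positive", "h3": "Positive", "h4": "Positive"}

    # Sample one hypothesis according to the (unnormalized) posterior distribution.
    hyps = list(posteriors)
    weights = [posteriors[h] for h in hyps]
    sampled = random.choices(hyps, weights=weights, k=1)[0]
    print(sampled, "->", prediction[sampled])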
Example: Assess a student's performance using the Naive Bayes algorithm for a continuous
attribute. Predict whether a student gets a job offer or not in his final year of the course. The train-
ing dataset T consists of 10 data instances with attributes such as ‘CGPA’ and ‘Interactiveness’ as
shown in Table 8.13. The target variable is Job Offer which is classified as Yes or No for a candidate
student.
Table 8.13: Training Dataset with Continuous Attribute
S.No. CGPA Interactiveness Job Offer
1. 9.5 Yes Yes
2. 8.2 No Yes
3. 9.3 No No
4. 7.6 No No
5. 8.4 Yes Yes
6. 9.1 Yes Yes
7. 7.5 Yes No
8. 9.6 No Yes
9. 8.6 Yes Yes
10. 8.3 Yes Yes
Solution:
Step 1: Compute the prior probability for the target feature ‘Job Offer’.
Prior probabilities of both the classes are calculated using the same formula (refer to
Table 8.14).
Table 8.14: Prior Probability of Target Class
Table 8.16 shows how the likelihood probability is calculated for Interactiveness using condi-
tional probability.
Table 8.16: Likelihood Probability of Interactiveness
For the continuous attribute CGPA, the likelihood is modelled with the Gaussian (normal) distribution. For a test instance with CGPA = 8.5:

P(X_CGPA = 8.5 | Job Offer = Yes) = (1 / (σ_CGPA,Yes √(2π))) · e^(−(8.5 − μ_CGPA,Yes)² / (2σ²_CGPA,Yes))

where μ_CGPA,Yes and σ_CGPA,Yes are the mean and standard deviation of CGPA for the instances with Job Offer = Yes.
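A short sketch of this Gaussian likelihood computation, using the CGPA values of the 'Job Offer = Yes' instances from Table 8.13 and a test value of 8.5; the population variance is used here, while the text may use the sample variance, so the exact figure can differ slightly.

    import math

    # CGPA values of the instances with Job Offer = Yes (from Table 8.13).
    cgpa_yes = [9.5, 8.2, 8.4, 9.1, 9.6, 8.6, 8.3]

    mu = sum(cgpa_yes) / len(cgpa_yes)
    var = sum((x - mu) ** 2 for x in cgpa_yes) / len(cgpa_yes)   # population variance

    def gaussian_likelihood(x, mu, var):
        """Gaussian density used as P(x | class) for a continuous attribute."""
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    print(round(mu, 3), round(var, 3))
    print(gaussian_likelihood(8.5, mu, var))   # likelihood of CGPA = 8.5 given Job Offer = Yes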
• Bayes' theorem uses conditional probability, which can be defined via joint probabilities as
P(A | B) = P(A, B) / P(B).
• Bayes theorem is used to select the most probable hypothesis from data, considering both prior
knowledge and posterior distributions.
• Bayes theorem helps in calculating the posterior probability for several hypotheses and selects the
hypothesis with the highest probability.
• Naive Bayes Algorithm is a supervised binary-class or multi-class classification algorithm that works
on the principle of Bayes theorem.
• The zero-probability error with the Naive Bayes model can be solved by applying a smoothing technique
called Laplace correction.
• The Naive Bayes algorithm for continuous attributes can be solved using the Gaussian distribution.
• Other popular types of Naive Bayes classifiers are the Bernoulli Naive Bayes classifier, Multinomial
Naive Bayes classifier and Multi-class Naive Bayes classifier, etc.
Probability-based Learning — It is one of the most important practical learning methods which
combines prior knowledge or prior probabilities with observed data.
Probabilistic Model - A model in which randomness plays a major role and which gives probability
distribution as a solution.
Deterministic Model — A model in which there is no randomness and hence it exhibits the same
initial condition every time it is run and is likely to get a single possible outcome as the solution.
Bayesian Learning — A learning method that describes and represents knowledge in an uncertain
domain and provides a way to reason about this knowledge using probability measure.
Conditional Probability - The probability of an event A, given that event B occurs, written as P(A | B).
Joint Probability — The probability of the intersection of two or more events. .
Bayesian Probability — Otherwise called as Personal probability, it is a person’s degree of belief in
event A and does not require repeated trials.
Marginal Probability - The probability of an event A occurring unconditionally, not conditioned on another event.
Belief Measure - Means a person’s belief in a statement “S” depends on some knowledge “K".
Prior Probability - The general probability of an uncertain event before an observation or some
evidence is collected.
Likelihood Probability - The relative probability of the observation occurring for each class or the
sampling density for the evidence given the hypothesis.
Posterior Probability - The updated or revised probability of an event taking into account the obser-
vations from the training data.
Maximum A Posteriori (MAP) Hypothesis, h_MAP - The hypothesis which has the maximum value
among a given set of candidate hypotheses, and is considered as the maximum probable hypothesis
or most probable hypothesis.
Maximum Likelihood (ML) Hypothesis, h_ML - The hypothesis that gives the maximum likelihood
for P(E | h).
Review Questions
1. What is meant by probability-based learning?
2. Differentiate between probabilistic model and deterministic model.
3. What is meant by Bayesian learning?
4. Define the following:
* Conditional probability
* Joint probability
* Bayesian Probability
* Marginal probability
5. What is belief measure?
6. What is marginalization?
7. What is the difference between prior, posterior and likelihood probabilities?
8. Define Maximum A Posteriori (MAP) Hypothesis, h_MAP, and Maximum Likelihood (ML) Hypothesis, h_ML.
10. Check the correctness of Bayes theorem with an example.
11. Consider there are three baskets, Basket I, Basket II and Basket III, with each basket containing rings
of red colour and green colour. Basket I contains 6 red rings and 5 green rings. Basket II contains 3 green
rings and 2 red rings, while Basket III contains 6 rings which are all red. A person chooses a ring
randomly from a basket. If the ring picked is red, find the probability that it was taken from Basket II.
12. Assume the following probabilities, the probability of a person having Malaria to be 0.02%, the
probability of the test to be positive on detecting Malaria, given that the person has Malaria is 98%
and similarly the probability of the test to be negative on detecting Malaria, given that the person
doesn’t have malaria to be 95%. Find the probability of a person having Malaria; given that, the test
result is positive.
13. Take a real-time example of predicting the result of a student using the Naive Bayes algorithm. The
training dataset T consists of 8 data instances with attributes such as 'Assessment', 'Assignment',
'Project' and 'Seminar' as shown in Table 8.17. The target variable is Result, which is classified as Pass
or Fail for a candidate student. Given a test data instance (Assessment = Average, Assignment = Yes,
Project = No and Seminar = Good), predict the result of the student. Apply Laplace correction if the zero-
probability problem occurs.
Table 8.17: Training Dataset
Across
3. MAP hypothesis has the __________ (maximum/minimum) value among the given set of candidate hypotheses.
• The Naive Bayes algorithm assumes that features are __________ (dependent/independent) of each other.
• The degree of belief can be denoted by probability. (True/False)
• The updated probability of an event taking into account observations from training data is called __________ (prior/posterior) probability.
• Bayes theorem combines prior knowledge with __________ distributions.
• Deterministic model has no randomness. (True/False)
• Bayesian learning uses __________ (frequency/subjective) based reasoning to infer parameters.
• Bayes theorem uses __________ (marginal/conditional) probability.
• __________ probability is the general probability of an uncertain event before observation data is collected.
• Bayes theorem is noted for its usefulness in computing __________ (prior/posterior) probability.
• Probabilistic learning uses the concept of probability theory that describes how to model randomness, uncertainty and noise to predict future events. (True/False)
11. Naive Bayesian algorithm cannot be used to solve problems with continuous attributes. (True/False)
12. Bayesian learning can be used to make predictions based on historical data. (True/False)
13. Zero-probability error can be solved using a smoothing technique called __________ correction.
14. ML hypothesis gives the minimum likelihood for P(E/h). (True/False)
Module 04
Chapter 6
Decision Tree Learning

"Prediction is very difficult, especially if it's about the future."
— Niels Bohr

Decision Tree Learning is a widely used predictive model for supervised learning that spans
a number of practical applications in various areas. It is used for both classification and
regression tasks. The decision tree model basically represents logical rules that predict the
value of a target variable by inferring from data features. This chapter provides a keen insight
into how to construct a decision tree and infer knowledge from the tree.
Learning Objectives
This model can be used to classify both categorical target variables and continuous-valued
target variables. Given a training dataset X, this model computes a hypothesis function h(X) as a
decision tree.
Inputs to the model are data instances or objects with a set of features or attributes, which can
be discrete or continuous, and the output of the model is a decision tree which predicts or classifies
the target class for the test data object.
In statistical terms, attributes or features are called independent variables. The target feature
or target class is called the response variable, which indicates the category we need to predict for
a test object.
The decision tree learning model generates a complete hypothesis space in the form of a
tree structure with the given training dataset and allows us to search through the possible set of
hypotheses, which in fact would be a smaller decision tree as we walk through the tree. This kind
of search bias is called a preference bias.

[Figure legend: root node, decision node, leaf node.]
Goal Construct a decision tree with the given training dataset. The tree is constructed in a
top-down fashion. It starts from the root node. At every level of tree construction, we need to find
the best split attribute or best decision node among all attributes. This process is recursive and
continued until we end up in the last level of the tree or finding a leaf node which cannot be split
further. The tree construction is complete when all the test conditions lead to a leaf node. The leaf
node contains the target class or output of classification.
Goal Given a test instance, infer the target class it belongs to.
Classification Inferring the target class for the test instance or object is based on inductive
inference on the constructed decision tree. In order to classify an object, we need to start traversing
the tree from the root. We traverse as we evaluate the test condition on every decision node with
the test object attribute value and walk to the branch corresponding to the test’s outcome. This
process is repeated until we end up in a leaf node which contains the target class of the test object.
Output Target label of the test instance.
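As an illustrative sketch (not from the text), a decision tree can be represented as nested nodes and a test instance classified by walking from the root to a leaf. The tree built here is a hypothetical, simplified version of the kind of tree constructed later in the chapter; the attribute values and branch outcomes are assumptions for illustration only.

    # A minimal decision tree node: either a leaf holding a target class, or a
    # decision node holding a test attribute and one child branch per outcome.
    class Node:
        def __init__(self, attribute=None, branches=None, label=None):
            self.attribute = attribute      # test attribute at a decision node
            self.branches = branches or {}  # outcome value -> child Node
            self.label = label              # target class at a leaf node

    def classify(node, instance):
        """Traverse from the root, following the branch matching the instance's
        value for each decision node's test attribute, until a leaf is reached."""
        while node.label is None:
            node = node.branches[instance[node.attribute]]
        return node.label

    # Hypothetical, simplified tree (the chapter's worked example splits the
    # '>=9' branch on a different attribute).
    tree = Node("CGPA", {
        ">=9": Node("Interactiveness", {"Yes": Node(label="Yes"),
                                        "No": Node(label="No")}),
        ">=8": Node(label="Yes"),
        "<8": Node(label="No"),
    })
    print(classify(tree, {"CGPA": ">=9", "Interactiveness": "Yes"}))   # Yes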
Example: How to draw a decision tree to predict a student's academic performance based on
the given information such as class attendance, class assignments, home-work assignments, tests,
participation in competitions or other events, group activities such as projects and presentations, etc.
Solution: The target feature is the student performance in the final examination whether he will
pass or fail in the examination. The decision nodes are test nodes which check for conditions like
‘What's the student’s class attendance?’, "How did he perform in his class assignments?’, ‘Did he
do his home assignments properly?’ ‘What about his assessment results?’, ‘Did he participate in
competitions or other events?’, ‘What is the performance rating in group activities such as projects
and presentations?’. Table 6.1 shows the attributes and set of values for each attribute.
Table 6.1: Attributes and Associated Values
Attributes Values
Class attendance Good, Average, Poor
Class assignments Good, Moderate, Poor
Home-work assignments Yes, No
Assessment Good, Moderate, Poor
Participation in competitions or other events Yes, No
Group activities such as projects and presentations Yes, No
Exam Result Pass, Fail
The leaf nodes represent the outcomes, that is, either ‘pass’, or ‘fail’.
A decision tree would be constructed by following a set of if-else conditions which may or
may not include all the attributes, and decision nodes outcomes are two or more than two. Hence,
the tree is not a binary tree.
Note: A decision tree is not always a binary tree. It is a tree which can have more than two branches.
Example: Predict a student's academic performance of whether he will pass or fail based
on the given information such as ‘Assessment’ and ‘Assignment’. The following Table 6.2 shows
the independent variables, Assessment and Assignment, and the target variable Exam Result with
their values. Draw a binary decision tree.
Table 6.2: Attributes and Associated Values
Attributes Values
Assessment ≥50, <50
Assignment Yes, No
Exam Result Pass, Fail
Solution: Consider the root node is 'Assessment'. If a student's marks are ≥50, the root node is
branched to leaf node ‘Pass’ and if the assessment marks are <50, it is branched to another decision
node. If the decision node in next level of the tree is ‘Assignment’ and if a student has submitted
his assignment, the node branches to ‘Pass’ and if not submitted, the node branches to ‘Fail’.
Figure 6.2 depicts this rule.
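The rule depicted in Figure 6.2 can be written directly as a short conditional; this is only a sketch of the stated rule, and the function name and parameters are hypothetical.

    def exam_result(assessment_marks, assignment_submitted):
        """Binary decision tree of Figure 6.2: Assessment is the root node and
        Assignment is the decision node on the < 50 branch."""
        if assessment_marks >= 50:
            return "Pass"
        return "Pass" if assignment_submitted else "Fail"

    print(exam_result(62, False))   # Pass (assessment branch)
    print(exam_result(40, True))    # Pass (assignment branch)
    print(exam_result(40, False))   # Fail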
Entropy_Info(6, 4) is calculated as:

Entropy_Info(6, 4) = −[6/10 log₂(6/10) + 4/10 log₂(4/10)]    (6.1)
1. Find the best attribute from the training dataset using an at!ribute selection measure and
place it at the root of the tree.
2. Split the training dataset into subsets based on the outcomes of the test attribute and each
subset in a branch contains the data instances or tuples with the same value for the selected
test attribute.
3. Repeat step 1 and step 2 on each subset until we end up in leaf nodes in all the branches of the tree.
4. This splitting process is recursive until the stopping criterion is reached.
Stopping Criteria
The following are some of the common stopping conditions:
1. The data instances are homogeneous, which means they all belong to the same class, and
hence the entropy is 0.
2. A node with some defined minimum number of data instances becomes a leaf (Number
of data instances in a node is between 0.25 and 1.00% of the full training dataset).
3. The maximum tree depth is reached, so further splitting is not done and the node
becomes a leaf node.
6.2 DECISION TREE INDUCTION ALGORITHMS
There are many decision tree algorithms, such as ID3, C4.5, CART, CHAID, QUEST, GUIDE,
CRUISE, and CTREE, that are used for classification in real-time environment. The most
commonly used decision tree algorithms are ID3 (Iterative Dichotomizer 3), developed by
J.R Quinlan in 1986, and C4.5 is an advancement of ID3 presented by the same author in 1993.
CART, that stands for Classification and Regression Trees, is another algorithm which was
developed by Breiman et al. in 1984.
The accuracy of the tree constructed depends upon the selection of the best split attribute.
Different algorithms are used for building decision trees which use different measures to decide on
the splitting criterion. Algorithms such as ID3, C4.5 and CART are popular algorithms used in the
construction of decision trees. The algorithm ID3 uses ‘Information Gain’ as the splitting criterion
whereas the algorithm C4.5 uses ‘Gain Ratio” as the splitting criterion. The CART algorithm is
popularly used for classifying both categorical and continuous-valued target variables. CART uses
GINI Index to construct a decision tree.
Decision trees constructed using ID3 and C4.5 are also called univariate decision trees, which
consider only one feature/attribute to split at each decision node, whereas decision trees constructed
using the CART algorithm are multivariate decision trees, which consider a conjunction of univariate
splits. The details about univariate and multivariate data have been discussed in Chapter 2.
ID3 works well if the attributes or features are discrete/categorical values.
If some attributes are continuous, then we have to partition or discretize them into
nominal attributes or features.
The algorithm builds the tree using a purity measure called 'Information Gain' with the given
training data instances and then uses the constructed tree to classify the test data. It is applied
to training sets with only nominal or categorical attributes and with no missing values
for classification. ID3 works well for a large dataset. If the dataset is small, overfitting may occur.
Moreover, it is not accurate if the dataset has missing attribute values.
No pruning is done during or after construction of the tree, and it is prone to outliers. C4.5 and
CART can handle both categorical attributes and continuous attributes. Both C4.5 and CART can
also handle missing values, but C4.5 is prone to outliers whereas CART can handle outliers as well.
1. Compute Entropy_Info Eq. (6.8) for the whole training dataset based on the target attribute,
2. Compute Entropy_Info Eq. (6.9) and Information_Gain Eq. (6.10) for each of the attribute in
the training dataset.
3. Choose the attribute for which entropy is minimum and therefore the gain is maximum as
the best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the test
condition of the root node attribute. Accordingly, the training dataset is also split
into subsets.
6. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in
the subset.
Definitions
Let T be the training dataset.
Let A be the set of attributes A = {A1, A2, A3, ..., An}.
Let m be the number of classes in the training dataset.
Let P_i be the probability that a data instance or a tuple 'd' belongs to class C_i. It is calculated as:

P_i = (Total no. of data instances that belong to class C_i in T) / (Total no. of tuples in the training set T)    (6.6)

Mathematically, it is represented as shown in Eq. (6.7):

P_i = |C_i, T| / |T|    (6.7)

where, the attribute A has got 'v' distinct values {a1, a2, ..., av}, |A_i| is the number of instances for
distinct value 'i' in attribute A, and Entropy_Info(A_i) is the entropy for that set of instances.
Information_Gain is a metric that measures how much information is gained by branching on
an attribute A. In other words, it measures the reduction in impurity in an arbitrary subset of data.
It is calculated as given in Eq. (6.10):
Information_Gain(A) = Entropy_Info(T) — Entropy_Info(T, A) (6.10)
It can be noted that as entropy increases, information gain decreases. They are inversely
proportional to each other.
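A small sketch of the entropy and information gain computations used by ID3. The class counts below are the 'Job Offer' counts from the worked example (7 Yes, 3 No) and the CGPA partition counts from its frequency matrix; the printed values differ from the text's 0.8807 and 0.5564 only because the text rounds intermediate results.

    import math

    def entropy(counts):
        """Entropy_Info of a class distribution given as a list of class counts."""
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def information_gain(class_counts, partitions):
        """Information_Gain(A) = Entropy_Info(T) minus the weighted entropy of the
        subsets obtained by splitting T on attribute A."""
        total = sum(class_counts)
        weighted = sum(sum(p) / total * entropy(p) for p in partitions)
        return entropy(class_counts) - weighted

    # Job Offer distribution (7 Yes, 3 No); CGPA partitions >=9, >=8 and <8.
    print(round(entropy([7, 3]), 4))                                     # ~0.8813
    print(round(information_gain([7, 3], [[3, 1], [4, 0], [0, 2]]), 4))  # ~0.5568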
Example 6.3: Assess a student's performance during his course of study and predict whether
a student will get a job offer or not in his final year of the course. The training dataset T consists
of 10 data instances with attributes such as ‘CGPA’, ‘Interactiveness’, ‘Practical Knowledge’ and
‘Communication Skills’ as shown in Table 6.3. The target class attribute is the ‘Job Offer’.
Table 6.3: Training Dataset T
Solution:
Iteration 1:
Step 1: Calculate the Entropy for the target class 'Job Offer'.
Entropy_Info(Target Attribute = Job Offer) = Entropy_Info(7, 3)
= −[7/10 log₂(7/10) + 3/10 log₂(3/10)]
= −(−0.3599 + −0.5208) = 0.8807
Step 2:
Calculate the Entropy_Info and Gain(].nformation_Gain) for each of the attribute in the training
dataset.
Table 6.4 shows the number of data instances classified with Job Offer as Yes or No for the attribute
CGPA.
Table 6.4: Entropy Information for CGPA
CGPA      Job Offer = Yes    Job Offer = No    Total
≥9        3                  1                 4
≥8        4                  0                 4
<8        0                  2                 2

Entropy_Info(T, CGPA) = 4/10[−3/4 log₂(3/4) − 1/4 log₂(1/4)] + 4/10[−4/4 log₂(4/4) − 0] + 2/10[−2/2 log₂(2/2) − 0]
= 4/10(0.3111 + 0.4997) + 0 + 0
= 0.3243
Gain(CGPA) = 0.8807 − 0.3243
= 0.5564

Entropy_Info(T, Interactiveness) = 6/10[−5/6 log₂(5/6) − 1/6 log₂(1/6)] + 4/10[−2/4 log₂(2/4) − 2/4 log₂(2/4)]
= 6/10(0.2191 + 0.4306) + 4/10(0.4997 + 0.4997)
= 0.3898 + 0.3998 = 0.7896
Gain(Interactiveness) = 0.8807 — 0.7896
=0.0911
Table 6.6 shows the number of data instances classified with Job Offer as Yes or No for the
attribute Practical Knowledge.
Table 6.6: Entropy Information for Practical Knowledge

Practical Knowledge    Job Offer = Yes    Job Offer = No    Total
Very Good              2                  0                 2
Average                1                  2                 3
Good                   4                  1                 5
Entropy_Info(T, Practical Knowledge) = 2/10[−2/2 log₂(2/2) − 0] + 3/10[−1/3 log₂(1/3) − 2/3 log₂(2/3)] + 5/10[−4/5 log₂(4/5) − 1/5 log₂(1/5)]
= 2/10(0) + 3/10(0.5280 + 0.3897) + 5/10(0.2574 + 0.4641)
=0+0.2753 + 0.3608
=0.6361
Gain(Practical Knowledge) = 0.8807 — 0.6361
=0.2446
Table 6.7 shows the number of data instances classified with Job Offer as Yes or No for the
attribute Communication Skills.
Table 6.7: Entropy Information for Communication Skills
Table 6.8: Gain of Attributes

Attribute               Gain
CGPA                    0.5564
Interactiveness         0.0911
Practical Knowledge     0.2446
Communication Skills    0.5203
Step 3: From Table 6.8, choose the attribute for which entropy is minimum and therefore the gain
is maximum as the best split attribute.
The best split attribute is CGPA since it has the maximum gain. So, we choose CGPA as the
root node. There are three distinct values for CGPA with outcomes ≥9, ≥8 and <8. The entropy
value is 0 for ≥8 and <8, with all instances classified as Job Offer = Yes for ≥8 and Job Offer = No for
<8. Hence, both ≥8 and <8 end up in a leaf node. The tree grows with the subset of instances with
CGPA ≥9 as shown in Figure 6.3.
[Figure 6.3: Decision tree after Iteration 1, with CGPA as the root node; the branches ≥8 and <8 end in leaf nodes and the branch ≥9 is expanded further.]

Iteration 2:
The subset with CGPA ≥9 has 4 instances (3 with Job Offer = Yes and 1 with Job Offer = No).

Entropy_Info(T) = Entropy_Info(3, 1) = −[3/4 log₂(3/4) + 1/4 log₂(1/4)]
= −(−0.3111 + −0.4997)
= 0.8108
Entropy_Info(T, Interactiveness) = 2/4[−2/2 log₂(2/2) − 0] + 2/4[−1/2 log₂(1/2) − 1/2 log₂(1/2)]
= 0 + 0.4997
Gain(Interactiveness) = 0.8108 — 0.4997
=0.3111
Entropy_Info(T, Practical Knowledge) = 0
Gain(Practical Knowledge) = 0.8108
Entropy_Info(T, Communication Skills) = 0
Gain(Communication Skills) = 0.8108
Here, both the attributes ‘Practical Knowledge’ and “Communication Skills” have the same
Gain. So, we can either construct the decision tree using ‘Practical Knowledge’ or ‘Communication
Skills”. The final decision tree is shown in Figure 6.4.
[Figure 6.4: Final decision tree constructed using ID3, with CGPA as the root node (branches ≥9, ≥8 and <8) and the ≥9 branch split further.]
The algorithm C4.5 is based on Occam’s Razor which says that given two correct solutions,
the simpler solution has to be chosen. Moreover, the algorithm requires a larger training set
for better accuracy. It uses Gain Ratio as a measure during the construction of decision trees.
ID3 is more biased towards attributes with larger values. For example, if there is an attribute called
‘Register No’ for students it would be unique for every student and will have distinct value for
every data instance resulting in more values for the attribute. Hence, every instance belongs to
a category and would have higher Information Gain than other attributes. To overcome this bias
issue, C4.5 uses a purity measure Gain ratio to identify the best split attribute. In C4.5 algorithm,
the Information Gain measure used in ID3 algorithm is normalized by computing another
factor called Split_Info. This normalized information gain of an attribute called as Gain_Ratio is
computed by the ratio of the calculated Split_Info and Information Gain of each attribute. Then,
the attribute with the highest normalized information gain, that is, highest gain ratio is used as
the splitting criteria.
As an example, we will choose the same training dataset shown in Table 6.3 to construct a
decision tree using the C4.5 algorithm.
Given a training dataset T, the Split_Info of an attribute A is computed as given in Eq. (6.11):

Split_Info(T, A) = −Σ_{i=1..v} (|A_i| / |T|) log₂(|A_i| / |T|)    (6.11)

where the attribute A has 'v' distinct values and |A_i| is the number of instances with the i-th value of A.
The Gain_Ratio is then computed as given in Eq. (6.12):

Gain_Ratio(A) = Info_Gain(A) / Split_Info(T, A)    (6.12)
1. Compute Entropy_Info Eq. (6.8) for the whole training dataset based on the target attribute.
2. Compute Entropy_Info Eq. (6.9), Info_Gain Eq. (6.10), Split_Info Eq. (6.11) and Gain_Ratio
Eq. (6.12) for each of the attribute in the training dataset.
3. Choose the attribute for which Gain_Ratio is maximum as the best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the
test condition of the root node attribute. Accordingly, the training dataset is also split
into subsets.
6. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in
the subset. .
Example: Make use of the Information Gain of the attributes calculated in the ID3
algorithm in Example 6.3 to construct a decision tree using C4.5.
Solution:
Iteration 1:
Step 1: Calculate the Class_Entropy for the target class ‘Job Offer’.
Entropy_Info(Target Attribute = Job Offer) = Entropy_Info(7, 3)
= −[7/10 log₂(7/10) + 3/10 log₂(3/10)]
= −(−0.3599 + −0.5208)
= 0.8807
Step 2: Calculate the Entropy_Info, Gain(Info_Gain), Split_Info, Gain_Ratio for each of the attribute
in the training dataset.
CGPA:
Entropy_Info(T, CGPA) = 4/10[−3/4 log₂(3/4) − 1/4 log₂(1/4)] + 4/10[−4/4 log₂(4/4) − 0] + 2/10[−2/2 log₂(2/2) − 0]
= 0.3243
Gain(CGPA) = 0.8807 − 0.3243 = 0.5564
Split_Info(T, CGPA) = −4/10 log₂(4/10) − 4/10 log₂(4/10) − 2/10 log₂(2/10) = 1.5219
Gain_Ratio(CGPA) = Gain(CGPA) / Split_Info(T, CGPA) = 0.3658

Interactiveness:
Entropy_Info(T, Interactiveness) = 6/10[−5/6 log₂(5/6) − 1/6 log₂(1/6)] + 4/10[−2/4 log₂(2/4) − 2/4 log₂(2/4)]
= 6/10(0.2191 + 0.4306) + 4/10(0.4997 + 0.4997)
= 0.3898 + 0.3998 = 0.7896
Gain(Interactiveness) = 0.8807 − 0.7896 = 0.0911
Split_Info(T, Interactiveness) = −6/10 log₂(6/10) − 4/10 log₂(4/10) = 0.9704
Gain_Ratio(Interactiveness) = Gain(Interactiveness) / Split_Info(T, Interactiveness)
= 0.0911 / 0.9704
= 0.0939

Practical Knowledge:
Entropy_Info(T, Practical Knowledge) = 2/10[−2/2 log₂(2/2) − 0] + 3/10[−1/3 log₂(1/3) − 2/3 log₂(2/3)] + 5/10[−4/5 log₂(4/5) − 1/5 log₂(1/5)]
= 0.6361
Gain(Practical Knowledge) = 0.8807 − 0.6361 = 0.2446
Split_Info(T, Practical Knowledge) = −2/10 log₂(2/10) − 3/10 log₂(3/10) − 5/10 log₂(5/10) = 1.4855
Gain_Ratio(Practical Knowledge) = Gain(Practical Knowledge) / Split_Info(T, Practical Knowledge) = 0.1648
Table 6.10 shows the Gain_Ratio computed for all the attributes.
Table 6.10: Gain_Ratio
Attribute Gain_Ratio
CGPA 0.3658
INTERACTIVENESS 0.0939
PRACTICAL KNOWLEDGE | 0.1648
COMMUNICATION SKILLS | 0.3502
Step 3: Choose the attribute for which Gain_Ratio is maximum as the best split attribute.
From Table 6.10, we can see that CGPA has highest gain ratio and it is selected as the best split
attribute. We can construct the decision tree placing CGPA as the root node shown in Figure 6.5.
The training dataset is split into subsets with 4 data instances.
[Figure 6.5: Decision tree after Iteration 1 of C4.5, with CGPA as the root node; the subset with CGPA ≥9 is expanded further.]

Iteration 2:
Entropy_Info(T) = Entropy_Info(3, 1) = −[3/4 log₂(3/4) + 1/4 log₂(1/4)] = 0.8108

Interactiveness:
Entropy_Info(T, Interactiveness) = 2/4[−2/2 log₂(2/2) − 0] + 2/4[−1/2 log₂(1/2) − 1/2 log₂(1/2)]
= 0 + 0.4997
Gain(Interactiveness) = 0.8108 — 0.4997 = 0.3111
Split_Info(T, Interactiveness) = −2/4 log₂(2/4) − 2/4 log₂(2/4) = 0.5 + 0.5 = 1
Gain_Ratio(Interactiveness) = Gain(Interactiveness) / Split_Info(T, Interactiveness) = 0.3112

Practical Knowledge:
Entropy_Info(T, Practical Knowledge) = 2/4[−2/2 log₂(2/2) − 0] + 1/4[−1/1 log₂(1/1) − 0] + 1/4[−1/1 log₂(1/1) − 0]
= 0
Gain(Practical Knowledge) = 0.8108
Split_Info(T, Practical Knowledge) = −2/4 log₂(2/4) − 1/4 log₂(1/4) − 1/4 log₂(1/4) = 1.5
Gain_Ratio(Practical Knowledge) = Gain(Practical Knowledge) / Split_Info(T, Practical Knowledge) = 0.5408

Communication Skills:
Entropy_Info(T, Communication Skills) = 2/4[−2/2 log₂(2/2) − 0] + 1/4[−1/1 log₂(1/1) − 0] + 1/4[−1/1 log₂(1/1) − 0]
= 0
Gain(Communication Skills) = 0.8108
Split_Info(T, Communication Skills) = −2/4 log₂(2/4) − 1/4 log₂(1/4) − 1/4 log₂(1/4) = 1.5
Gain_Ratio(Communication Skills) = Gain(Communication Skills) / Split_Info(T, Communication Skills) = 0.5408
Table 6.11 shows the Gain_Ratio computed for all the attributes.
Table 6.11: Gain-Ratio
Attributes Gain_Ratio
Interactiveness 0.3112
Practical Knowledge 0.5408
Communication Skills 0.5408
Both “Practical Knowledge’ and ‘Communication Skills’ have the highest gain ratio. So, the best
splitting attribute can either be ‘Practical Knowledge’ or ‘Communication Skills’, and therefore, the
split can be based on any one of these.
Here, we split based on ‘Practical Knowledge’. The final decision tree is shown in Figure 6.6.
[Table: Class counts (Job Offer = Yes/No) and entropy values for each candidate split point of the continuous attribute CGPA; the detailed columns are not reproduced here.]

For a sample, the calculations are shown below for a single distinct value, say, CGPA ∈ 6.8.
Entropy_Info(T, Job_Offer) = −[7/10 log₂(7/10) + 3/10 log₂(3/10)]
= −(−0.3599 + −0.5209)
= 0.8808
Entropy(7, 2) = −[7/9 log₂(7/9) + 2/9 log₂(2/9)]
= −(−0.2818 + −0.4819)
= 0.7637
Entropy_Info(T, CGPA ∈ 6.8) = 1/10 × Entropy(0, 1) + 9/10 × Entropy(7, 2)
= 1/10 × 0 + 9/10[−7/9 log₂(7/9) − 2/9 log₂(2/9)]
= 0 + 9/10(0.7637)
= 0.6873
Gain(CGPA ∈ 6.8) = 0.8808 − 0.6873
=0.1935
Similarly, the calculations are done for each of the distinct values of the attribute CGPA and
a table is created. Now, the value of CGPA with maximum gain is chosen as the threshold value
or the best split point. From Table 6.13, we can observe that CGPA of 7.9 has the maximum gain
of 0.4462. Hence, CGPA ∈ 7.9 is chosen as the split point. Now, we can discretize the continuous values
of CGPA as two categories with CGPA < 7.9 and CGPA > 7.9. The resulting discretized instances are
shown in Table 6.14.
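The split-point search can be sketched as follows: sort the values, evaluate the information gain of a binary split at the midpoint between each pair of adjacent values, and keep the best. The CGPA values and labels are those of Table 8.13, which match this continuous CGPA example; the text evaluates candidates at the distinct values themselves rather than midpoints, so intermediate numbers differ, but the chosen split point 7.9 and its gain agree.

    import math

    def entropy(labels):
        total = len(labels)
        return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                    for c in set(labels))

    def best_split_point(values, labels):
        """Try the midpoint between adjacent sorted values as a binary split
        threshold and return the threshold with maximum information gain."""
        base = entropy(labels)
        pairs = sorted(zip(values, labels))
        best = (None, -1.0)
        for i in range(1, len(pairs)):
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [lab for v, lab in pairs if v <= threshold]
            right = [lab for v, lab in pairs if v > threshold]
            gain = base - (len(left) / len(pairs)) * entropy(left) \
                        - (len(right) / len(pairs)) * entropy(right)
            if gain > best[1]:
                best = (threshold, gain)
        return best

    cgpa = [9.5, 8.2, 9.3, 7.6, 8.4, 9.1, 7.5, 9.6, 8.6, 8.3]
    job = ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]
    print(best_split_point(cgpa, job))   # threshold ~7.9, gain ~0.446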
The splitting subset with minimum Gini_Index is chosen as the best splitting subset for an
attribute. The best splitting attribute is chosen by the minimum Gini_Index, which otherwise gives the
maximum ΔGini because it reduces the impurity.
ΔGini is computed as given in Eq. (6.15):
ΔGini(A) = Gini(T) − Gini(T, A)    (6.15)
• Compute ΔGini, Eq. (6.15), for the best splitting subset of that attribute.
7. The root node is branched into two subtrees, with each subtree as an outcome of the test
condition of the root node attribute. Accordingly, the training dataset is also split into
two subsets.
8. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in
the subset. (The Gini_Index and ΔGini computations are sketched in the code below.)
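A small sketch of the Gini index computations that CART uses. The class counts are the 'Job Offer' counts from Table 6.3 and the binary split {(≥9, ≥8)} versus {<8} from the worked example below; the text's 0.1755 differs from the exact 0.175 only through intermediate rounding.

    def gini(counts):
        """Gini index of a class distribution: 1 minus the sum of squared class proportions."""
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    def gini_split(partitions):
        """Weighted Gini index of a split; partitions is a list of class-count lists."""
        total = sum(sum(p) for p in partitions)
        return sum(sum(p) / total * gini(p) for p in partitions)

    # Whole dataset: 7 Yes, 3 No.
    print(round(gini([7, 3]), 2))              # 0.42
    # CGPA split {(>=9, >=8)} vs {<8}: (7 Yes, 1 No) vs (0 Yes, 2 No).
    split = gini_split([[7, 1], [0, 2]])
    print(round(split, 4))                     # ~0.175
    print(round(gini([7, 3]) - split, 4))      # delta Gini ~0.245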
Example: Choose the same training dataset shown in Table 6.3 and construct a decision tree
using the CART algorithm.
Solution:
Step 1: Calculate the Gini_Index for the dataset shown in Table 6.3, which consists of 10 data
instances. The target attribute ‘Job Offer’ has 7 instances as Yes and 3 instances as No.
Gini_Index(T) = 1 − (7/10)² − (3/10)²
= 1 − 0.49 − 0.09
= 1 − 0.58
Gini_Index(T) = 0.42
Step 2: Compute the Gini_Index for each attribute and for each subset of values in the attribute.
CGPA has 3 categories, so there are 6 subsets and hence 3 combinations of subsets (as shown
in Table 6.15).
Gini_Index(T, CGPA ∈ {≥9, ≥8}) = 1 − (7/8)² − (1/8)²
= 1 − 0.7806
= 0.2194
Gini_Index(T, CGPA ∈ {<8}) = 1 − (0/2)² − (2/2)²
= 1 − 1
= 0
Gini_Index(T, CGPA ∈ {(≥9, ≥8), <8}) = (8/10) × 0.2194 + (2/10) × 0
= 0.17552
Gini_Index(T, CGPA ∈ {≥9, <8}) = 1 − (3/6)² − (3/6)²
= 1 − 0.5 = 0.5
Gini_Index(T, CGPA ∈ {≥8}) = 1 − (4/4)² − (0/4)²
= 1 − 1 = 0
Gini_Index(T, CGPA ∈ {(≥9, <8), ≥8}) = (6/10) × 0.5 + (4/10) × 0
= 0.3
Gini_Index(T, CGPA ∈ {≥8, <8}) = 1 − (4/6)² − (2/6)²
= 1 − 0.555
= 0.445
Gini_Index(T, CGPA ∈ {≥9}) = 1 − (3/4)² − (1/4)²
= 1 − 0.625
= 0.375
Gini_Index(T, CGPA ∈ {(≥8, <8), ≥9}) = (6/10) × 0.445 + (4/10) × 0.375
= 0.417
Table 6.16 shows the Gini_Index for 3 subsets of CGPA.
Table 6.16: Gini_Index of CGPA
Subsets                  Gini_Index
{(≥9, ≥8)}, {<8}         0.1755
{(≥9, <8)}, {≥8}         0.3
{(≥8, <8)}, {≥9}         0.417
Step 3: Choose the best splitting subset which has minimum Gini_Index for an attribute.
The subset CGPA ∈ {(≥9, ≥8), <8} has the lowest Gini_Index value of 0.1755 and is chosen as the best
splitting subset.
Gini_Index(T, Interactiveness ∈ {Yes}) = 1 − (5/6)² − (1/6)²
= 1 − 0.72
= 0.28
Gini_Index(T, Interactiveness ∈ {No}) = 1 − (2/4)² − (2/4)²
= 1 − 0.5
= 0.5
Gini_Index(T, Interactiveness ∈ {Yes, No}) = (6/10) × 0.28 + (4/10) × 0.5
=0.168 +0.2
=0.368
AGini(Interactiveness) = Gini(T ) — Gini(T, Interactiveness)
=0.42-0.368
=0.052
Table 6.18: Categories for Practical Knowledge
Practical Knowledge Job Offer = Yes Job Offer = No
Very Good 2 0
Good 4 1
Average 1 2
Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}) = 1 − (6/7)² − (1/7)²
= 1 − 0.7544
= 0.2456
Gini_Index(T, Practical Knowledge ∈ {Average}) = 1 − (1/3)² − (2/3)²
= 1 − 0.555 = 0.445
Gini_Index(T, Practical Knowledge ∈ {(Very Good, Good), Average}) = (7/10) × 0.2456 + (3/10) × 0.445
= 0.3054
Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}) = 1 − (3/5)² − (2/5)²
= 1 − 0.52
= 0.48
Gini_Index(T, Practical Knowledge ∈ {Good}) = 1 − (4/5)² − (1/5)²
= 1 − 0.68
= 0.32
Gini_Index(T, Practical Knowledge ∈ {(Very Good, Average), Good}) = (5/10) × 0.48 + (5/10) × 0.32
= 0.40
Gini_Index(T, Practical Knowledge ∈ {Good, Average}) = 1 − (5/8)² − (3/8)²
= 1 − 0.5312 = 0.4688
Gini_Index(T, Practical Knowledge ∈ {Very Good}) = 1 − (2/2)² − (0/2)²
= 1 − 1 = 0
Gini_Index(T, Practical Knowledge ∈ {(Good, Average), Very Good}) = (8/10) × 0.4688 + (2/10) × 0
= 0.3750
Table 6.19 shows the Gini_Index for various subsets of Practical Knowledge.
Table 6.19: Gini_Index for Practical Knowledge
Subsets Gini_Index
(Very Good, Good) Average 0.3054
(Very Good, Average) | Good 0.40
(Good, Average) Very Good 0.3750
Gini_Index(T, Communication Skills ∈ {Good, Moderate}) = 1 − (7/8)² − (1/8)²
= 1 − 0.7806
= 0.2194
Gini_Index(T, Communication Skills ∈ {Poor}) = 1 − (0/2)² − (2/2)²
= 1 − 1 = 0
Gini_Index(T, Communication Skills ∈ {(Good, Moderate), Poor}) = (8/10) × 0.2194 + (2/10) × 0
= 0.1755
Gini_Index(T, Communication Skills ∈ {Good, Poor}) = 1 − (4/7)² − (3/7)²
= 1 − 0.5102 = 0.4898
Gini_Index(T, Communication Skills ∈ {Moderate}) = 1 − (3/3)² − (0/3)²
= 1 − 1 = 0
Gini_Index(T, Communication Skills ∈ {(Good, Poor), Moderate}) = (7/10) × 0.4898 + (3/10) × 0
= 0.3429
Gini_Index(T, Communication Skills ∈ {Moderate, Poor}) = 1 − (3/5)² − (2/5)²
= 1 − 0.52
= 0.48
Gini_Index(T, Communication Skills ∈ {Good}) = 1 − (4/5)² − (1/5)²
= 1 − 0.68
= 0.32
Gini_Index(T, Communication Skills ∈ {(Moderate, Poor), Good}) = (5/10) × 0.48 + (5/10) × 0.32
= 0.40
Table 6.21 shows the Gini_Index for various subsets of Communication Skills.
Table 6.21: Gini-Index for Subsets of Communication Skills
Subsets Gini_Index
(Good, Moderate) Poor 0.1755
(Good, Poor) Moderate 0.3429
(Moderate, Poor) Good 0.40
Step 5: Choose the best splitting attribute that has the maximum ΔGini.
CGPA and Communication Skills have the highest ΔGini value. We can choose CGPA as the
root node and split the dataset into two subsets as shown in Figure 6.7, since the tree constructed by
CART is a binary tree.

Iteration 2: The subset with CGPA ∈ {≥9, ≥8} contains 8 instances (7 with Job Offer = Yes and 1 with Job Offer = No).

Gini_Index(T) = 1 − (7/8)² − (1/8)²
= 1 − 0.766 − 0.0156
Gini_Index(T) = 0.2184
Tables 6.24, 6.25, and 6.27 show the categories for attributes Interactiveness, Practical Knowledge,
and Communication Skills, respectively.
Table 6.24: Categories for Interactiveness
Interactiveness Job Offer = Yes Job Offer = No
Yes 5 0
No 2 1
Gini_Index(T, Interactiveness ∈ {Yes}) = 1 − (5/5)² − (0/5)²
= 1 − 1 = 0
Gini_Index(T, Interactiveness ∈ {No}) = 1 − (2/3)² − (1/3)²
= 1 − 0.44 − 0.111 = 0.449
Gini_Index(T, Interactiveness ∈ {Yes, No}) = 0.056
ΔGini(Interactiveness) = Gini(T) − Gini(T, Interactiveness)
= 0.2184 − 0.056 = 0.1624

Gini_Index(T, Practical Knowledge ∈ {Average}) = 1 − (1/2)² − (1/2)²
= 1 − 0.25 − 0.25
= 0.5
Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}) = 1 − (6/6)² − (0/6)² = 0
Gini_Index(T, Practical Knowledge ∈ {(Very Good, Good), Average}) = (6/8) × 0 + (2/8) × 0.5
= 0.125
Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}) = 1 − (3/4)² − (1/4)²
= 1 − 0.5625 − 0.0625
= 0.375
Gini_Index(T, Practical Knowledge ∈ {Good}) = 1 − (4/4)² − (0/4)²
= 1 − 1 = 0
Gini_Index(T, Practical Knowledge ∈ {(Very Good, Average), Good}) = (4/8) × 0.375 + (4/8) × 0
= 0.1875
Gini_Index(T, Practical Knowledge ∈ {Good, Average}) = 1 − (5/6)² − (1/6)²
= 1 − 0.694 − 0.028
= 0.278
Gini_Index(T, Practical Knowledge ∈ {Very Good}) = 1 − (2/2)² − (0/2)²
= 1 − 1 = 0
Gini_Index(T, Practical Knowledge ∈ {(Good, Average), Very Good}) = (6/8) × 0.278 + (2/8) × 0
= 0.2085
Table 6.26 shows the Gini_Index values for various subsets of Practical Knowledge.
Table 6.26: Gini_Index for Subsets of Practical Knowledge
Subsets Gini_Index
(Very Good, Good) Average 0.125
(Very Good, Average) Good 0.1875
(Good, Average) Very Good 0.2085
Gini_Index(T, Communication Skills ∈ {Good, Moderate}) = 1 − (7/7)² − (0/7)²
= 1 − 1 = 0
Gini_Index(T, Communication Skills ∈ {Poor}) = 1 − (0/1)² − (1/1)² = 0
Gini_Index(T, Communication Skills ∈ {(Good, Moderate), Poor}) = (7/8) × 0 + (1/8) × 0
= 0
Gini_Index(T, Communication Skills ∈ {Good, Poor}) = 1 − (4/5)² − (1/5)²
= 1 − 0.64 − 0.04
= 0.32
Gini_Index(T, Communication Skills ∈ {Moderate}) = 1 − (3/3)² − (0/3)²
= 1 − 1 = 0
Gini_Index(T, Communication Skills ∈ {(Good, Poor), Moderate}) = (5/8) × 0.32 + (3/8) × 0
= 0.2
Gini_Index(T, Communication Skills ∈ {Moderate, Poor}) = 1 − (3/4)² − (1/4)²
= 1 − 0.5625 − 0.0625
= 0.375
Gini_Index(T, Communication Skills ∈ {Good}) = 1 − (4/4)² − (0/4)²
= 1 − 1 = 0
Gini_Index(T, Communication Skills ∈ {(Moderate, Poor), Good}) = (4/8) × 0.375 + (4/8) × 0
= 0.1875
Attribute               Gini_Index    ΔGini
Interactiveness 0.056 0.1624
Practical knowledge 0.125 0.0934
Communication Skills 0 0.2184
Communication Skills has the highest ΔGini value. The tree is further branched based on the
attribute ‘Communication Skills'. Here, we see all branches end up in a leaf node and the process
of construction is completed. The final tree is shown in Figure 6.8.
1. Compute standard deviation for each attribute with respect to target attribute.
2. Compute the standard deviation for the data instances of each distinct value of an attribute.
3. Compute the weighted standard deviation for each attribute.
4. Compute standard deviation reduction by subtracting weighted standard deviation for each
attribute from standard deviation of each attribute.
5. Choose the attribute with a higher standard deviation reduction as the best split attribute.
6. The best split attribute is placed as the root node.
7. The root node is branched into subtrees with each subtree as an outcome of the test condition
of the root node attribute. Accordingly, the training dataset is also split into different subsets.
8. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in the subset.
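A sketch of the standard deviation reduction computation used for regression trees. The numbers reproduce the 'Assessment' attribute calculation from the worked example below; the sample standard deviation (dividing by n − 1) is used because it is what reproduces the figures quoted in the text (e.g. 16.55 for the Result column).

    import math

    def std_dev(values):
        """Sample standard deviation (divides by n - 1)."""
        mean = sum(values) / len(values)
        return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

    def sdr(target, groups):
        """Standard Deviation Reduction: SD of the target minus the weighted SD
        of the target within each group induced by an attribute."""
        n = len(target)
        weighted = sum(len(g) / n * std_dev(g) for g in groups)
        return std_dev(target) - weighted

    results = [95, 70, 75, 45, 98, 80, 75, 65, 58, 89]
    assessment_groups = [[95, 75, 98, 75, 89],   # Assessment = Good
                         [70, 80, 58],           # Assessment = Average
                         [45, 65]]               # Assessment = Poor
    print(round(std_dev(results), 2))                  # ~16.55
    print(round(sdr(results, assessment_groups), 2))   # ~4.97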
Example: Construct a regression tree using the following Table 6.30, which consists of
10 data instances and 3 attributes 'Assessment', 'Assignment' and 'Project'. The target attribute is
'Result', which is a continuous attribute.
Table 6.30: Training Dataset
Solution:
Step 1: Compute standard deviation for each attribute with respect to the target attribute:
Average = (95 + 70 + 75 + 45 + 98 + 80 + 75 + 65 + 58 + 89) / 10 = 75
Standard Deviation = √{[(95 − 75)² + (70 − 75)² + (75 − 75)² + (45 − 75)² + (98 − 75)² + (80 − 75)²
+ (75 − 75)² + (65 − 75)² + (58 − 75)² + (89 − 75)²] / (10 − 1)}
= 16.55
Assessment = Good (Table 6.31)
Table 6.31: Attribute Assessment = Good
S.No. Assessment Assignment Project Result (%)
1. | Good Yes Yes 95:
3. | Good No Yes 75
5. | Good Yes Yes 98
7. | Good No No 75
10. | Good Yes Yes 89
Average = (95 + 75 + 98 + 75 + 89) / 5 = 86.4
Standard Deviation = √{[(95 − 86.4)² + (75 − 86.4)² + (98 − 86.4)² + (75 − 86.4)² + (89 − 86.4)²] / (5 − 1)}
= 10.9
Assessment = Average (Table 6.32)
Table 6.32: Attribute Assessment = Average
S.No. Assessment Assignment Project Result (%)
2. | Average Yes No 70
6. | Average No Yes 80
9. Average No No 58
Average = (70 + 80 + 58) / 3 = 69.3
The standard deviation for Assessment = Average is 11.01 and, for Assessment = Poor (the two instances 45 and 65), it is 14.14.
Weighted standard deviation for Assessment = (5/10) × 10.9 + (3/10) × 11.01 + (2/10) × 14.14
= 11.58
Standard deviation reduction for Assessment = 16.55 — 11.58 = 4.97
Assignment = Yes (Table 6.35)
Table 6.35: Assignment = Yes
S.No. Assessment Assignment Project Result (%)
1. | Good Yes Yes 95
2. | Average Yes No 70
5. | Good Yes Yes 98
8. | Poor Yes Yes 65
10. | Good Yes Yes 89
Average = (95 + 70 + 98 + 65 + 89) / 5 = 83.4
Standard Deviation = √{[(95 − 83.4)² + (70 − 83.4)² + (98 − 83.4)² + (65 − 83.4)² + (89 − 83.4)²] / (5 − 1)}
= 14.98
Assignment = No (Table 6.36)
Table 6.36: Assignment = No
3. Good No Yes 75
4. Poor No No 45
6. Average No Yes 80
7. Good No No 75
9. Average No No 58
Average = (75 + 45 + 80 + 75 + 58) / 5 = 66.6
Standard Deviation = √{[(75 − 66.6)² + (45 − 66.6)² + (80 − 66.6)² + (75 − 66.6)² + (58 − 66.6)²] / (5 − 1)}
= 14.7
Table 6.37 shows the Standard Deviation and Data Instances for attribute, Assignment.
Table 6.37: Standard Deviation for Assignment
Weighted standard deviation for Assignment = (5/10) × 14.98 + (5/10) × 14.7 = 14.84
Standard deviation reduction for Assignment = 16.55 − 14.84 = 1.71
For the attribute Project, the standard deviation for Project = Yes is 12.6, and for Project = No:
Average = (70 + 45 + 75 + 58) / 4 = 62
Standard Deviation = √{[(70 − 62)² + (45 − 62)² + (75 − 62)² + (58 − 62)²] / (4 − 1)}
= 13.39
Table 6.40 shows the Standard Deviation and Data Instances for attribute, Project.
Table 6.40: Standard Deviation for Project
The attribute 'Assessment' has the maximum Standard Deviation Reduction and hence it is
chosen as the best splitting attribute.
The training dataset is split into subsets based on the attribute ‘Assessment’ and this process is
continued until the entire tree is constructed. Figure 6.9 shows the regression tree with ‘Assessment’
as the root node and the subsets in each branch.
[Figure 6.9: Regression tree with 'Assessment' as the root node; each branch (Good, Average, Poor) holds the corresponding subset of training instances.]
Based on the number of instances correctly classified and the number of instances wrongly classified, the Average Squared Error
(ASE) is computed. The tree nodes are pruned based on these computations and the resulting tree
is validated until we get a tree that performs better. Cross validation is another way to construct
an optimal decision tree. Here, the dataset is split into k folds, among which k−1 folds are used
for training the decision tree and the remaining fold is used for validation, and the errors are computed. The
process is repeated so that each fold serves as the validation set once, and the mean of the errors is computed for the different
trees. The tree with the lowest error is chosen, with which the performance of the tree is improved.
This tree can now be tested with the test dataset and predictions are made.
Another approach is that after the tree is constructed using the training set, statistical tests like
error estimation and Chi-square test are used to estimate whether pruning or splitting is required
for a particular node to find a better accurate tree.
The third approach is using a principle called Minimum Description Length which uses
a complexity measure for encoding the training set and the growth of the decision tree is stopped
when the encoding size (i.e., (size(tree)) + size(misclassifications(tree)) is minimized. CART and
C4.5 perform post-pruning, that is, pruning the tree to a smaller size after construction in order
to minimize the misclassification error. CART makes use of 10-fold cross validation method to
validate and prune the trees, whereas C4.5 uses heuristic formula to estimate misclassification
error rates.
Some of the tree pruning methods are listed below:
• Reduced Error Pruning
• Optimal Pruning
Summary
1. The decision tree learning model performs an Inductive inference that reaches a general conclusion
from observed examples.
2. The decision tree learning model generates a complete hypothesis space in the form of a tree structure.
3. A decision tree has a structure that consists of a root node, internal nodes/decision nodes, branches,
and terminal nodes/leaf nodes.
4. Every path from root to a leaf node represents a logical rule that corresponds to a conjunction of test
attributes and the whole tree represents a disjunction of these conjunctions.
5. A decision tree consists of two major procedures, namely building the tree and knowledge inference
or classification.
6. A decision tree is constructed by finding the attribute or feature that best describes the target class
for the given test instances.
7. Decision tree algorithms such as ID3, C4.5, CART, CHAID, QUEST, GUIDE, CRUISE, and CTREE
are used for classification.
8. The univariate decision tree algorithm ID3 uses 'Information Gain' as the splitting criterion whereas
the algorithm C4.5 uses 'Gain Ratio' as the splitting criterion.
9. The multivariate decision tree algorithm called Classification and Regression Trees (CART) is
popularly used for classifying both categorical and continuous-valued target variables.
10. CART uses GINI Index to construct a decision tree.
11. ID3 works well if the attributes or features are considered as discrete/categorical values,
12. C4.5 and CART can handle both categorical attributes and continuous attributes.
13. Both C4.5 and CART can also handle missing values. C4.5 is prone to outliers but CART can handle
outliers too.
14. The C4.5 algorithm is further improved by considering attributes which are continuous; a
continuous attribute is discretized by finding a split point or threshold.
15. The algorithm C4.5 is based on Occam’s Razor which says that given two correct solutions, the
simpler solution must be chosen.
16. Regression trees are a variant of decision trees where the target feature is a continuous-valued
variable.
Information Gain - It is a metric that measures how much information is gained by branching on
an attribute.
Gain Ratio - It is the normalized information gain computed by the ratio of Split_Info and Gain of
an attribute.
GINI Index - It is an impurity measure computed from the proportions of data instances belonging to
each class.
Inductive Bias - It refers to a set of assumptions added to the training data in order to perform
induction.
Pre-Pruning - It is a process of pruning the tree nodes during construction, or stopping the construction
early without exploring the branches of the nodes.
Post-Pruning - It is a process of pruning the tree nodes after the construction is over.
Review Questions
1. How does the structure of a decision tree help in classifying a data instance?
2. What are the different metrics used in deciding the splitting attribute?
3. Define Entropy.
4. Relate Entropy and Information Gain.
5. How does a C4.5 algorithm perform better than ID3? What metric is used in the algorithm?
6. What is CART?
7. How does CART solve regression problems?
8. What is meant by pre-pruning and post-pruning? Compare both the methods.
9. How are continuous attributes discretized?
10. Consider the training dataset shown in Table 6.42. Discretize the continuous attribute ‘Percentage’.
Table 6.42: Training Dataset
S.No. Percentage Award
1. 95 Yes
2. 80 Yes
3. 72 No
4. 65 Yes
5. 95 Yes
6. 32 No
7. 66 No
8. 54 No
9. 89 Yes
10. 72 Yes
11. Consider the training dataset in Table 6.43. Construct decision trees using ID3, C45, and CART.
Table 6.43: Training Dataset
S.No. Assessment Assignment Project Seminar Result
1. | Good Yes Yes Good Pass
2. | Average Yes No Poor Fail
3. | Good No Yes Good Pass
4. | Poor No No Poor Fail
5. | Good Yes Yes Good Pass
6. | Average No Yes Good Pass
7. | Good No No Fair Pass
8. | Poor Yes Yes Good Fail
9. | Average No No Poor Fail
10. | Good Yes Yes Fair Pass
Find and mark the words listed below.
[Word-search puzzle grid not reproduced.]