
Module - 04

Chapter 8
Bayesian Learning

“In science, progress is possible. In fact, if one believes in Bayes’ theorem, scientific progress is inevitable as predictions are made and as beliefs are tested and refined.”
— Nate Silver

Bayesian Learning is a learning method that describes and represents knowledge in an uncertain
domain and provides a way to reason about this knowledge using probability measures. It uses
Bayes theorem to infer the unknown parameters of a model. Bayesian inference is useful in
many applications that involve reasoning and diagnosis, such as game theory and medicine.
Bayesian inference is particularly powerful in handling missing data and in estimating the
uncertainty of predictions.

Learning Objectives
* Understand the basics of probability-based learning and probability theory
* Learn the fundamentals of Bayes theorem
* Introduce Bayes classification models such as the Brute Force Bayes learning algorithm, Bayes Optimal classifier, and Gibbs algorithm
* Introduce Naive Bayes classification models that work on the principle of Bayes theorem
* Explore the Naive Bayes classification algorithm
* Study the Naive Bayes algorithm for continuous attributes using the Gaussian distribution
* Introduce other popular types of Naive Bayes classifiers such as the Bernoulli Naive Bayes classifier, Multinomial Naive Bayes classifier, and Multi-class Naive Bayes classifier

8.1 INTRODUCTION TO PROBABILITY-BASED LEARNING


Probability-based learning is one of the most important practical learning methods; it combines
prior knowledge or prior probabilities with observed data. Probabilistic learning uses the concepts
of probability theory, which describes how to model randomness, uncertainty, and noise in order to predict
future events. It is a tool for modelling large datasets and uses Bayes rule to infer unknown
quantities, predict, and learn from data. In a probabilistic model, randomness plays a major role and
the solution is a probability distribution, while in a deterministic model there is no randomness:
for the same initial conditions it behaves the same way every time the model is run and yields a single
possible outcome as the solution.
Bayesian learning differs from probabilistic learning in that it uses subjective probabilities
(i.e., probabilities based on an individual’s belief or interpretation about the outcome of an
event, which can change over time) to infer the parameters of a model. Two practical learning algorithms
called Naive Bayes learning and Bayesian Belief Network (BBN) form the major part of Bayesian
learning. These algorithms use prior probabilities and apply Bayes rule to infer useful information.
Bayesian Belief Networks (BBN) are explained in detail in Chapter 9.


8.2 FUNDAMENTALS OF BAYES THEOREM


The Naive Bayes model relies on Bayes theorem, which works on the principle of three kinds of probabilities called prior probability, likelihood probability, and posterior probability.

Prior Probability
It is the general probability of an uncertain event before an observation is seen or some evidence is
collected. It is the initial probability that is believed before any new information is collected.

Likelihood Probability
Likelihood probability is the relative probability of the observation occurring for each class or the
sampling density for the evidence given the hypothesis. It is stated as P (Evidence | Hypothesis),
which denotes the likeliness of the occurrence of the evidence given the parameters.

Posterior Probability
It is the updated or revised probability of an event taking into account the observations from the
training data. P (Hypothesis | Evidence) is the posterior distribution representing the belief about
the hypothesis, given the evidence from the training data. Therefore,
Posterior probability = prior probability updated with new evidence

8.3 CLASSIFICATION USING BAYES MODEL


Naive Bayes classification models work on the principle of Bayes theorem. Bayes’ rule is a mathematical
formula used to determine the posterior probability, given prior probabilities of events.
Generally, Bayes theorem is used to select the most probable hypothesis from data, considering
both prior knowledge and posterior distributions. It is based on the calculation of the posterior
probability and is stated as:
P (Hypothesis h | Evidence E)
where Hypothesis h is the target class to be classified and Evidence E is the given test instance.
P (Hypothesis h | Evidence E) is calculated from the prior probability P (Hypothesis h), the
likelihood probability P (Evidence E | Hypothesis h) and the marginal probability P (Evidence E).
It can be written as:

P (Hypothesis h | Evidence E) = [P (Evidence E | Hypothesis h) P (Hypothesis h)] / P (Evidence E)     (8.1)

where P (Hypothesis h) is the prior probability of the hypothesis h without observing the training
data or considering any evidence. It denotes the prior belief or the initial probability that the
hypothesis h is correct. P (Evidence E) is the prior probability of the evidence E from the training
dataset without any knowledge of which hypothesis holds. It is also called the marginal probability.
P (Evidence E | Hypothesis h) is the probability of Evidence E given Hypothesis h. It is the
likelihood of the Evidence E, computed from the training data, under the assumption that the
hypothesis h is correct. P (Hypothesis h | Evidence E) is the posterior probability of Hypothesis h
given Evidence E. It is the probability of the hypothesis h after observing the training data and the
evidence E. In other words, from Bayes Eq. (8.1), one can observe that:

Posterior Probability ∝ Prior Probability × Likelihood Probability

Bayes theorem helps in calculating the posterior probability for a number of hypotheses, from
which the hypothesis with the highest probability can be selected.
This selection of the most probable hypothesis from a set of hypotheses is formally defined as the
Maximum A Posteriori (MAP) Hypothesis.

Maximum A Posteriori (MAP) Hypothesis, h_MAP

Given a set of candidate hypotheses, the hypothesis which has the maximum posterior probability is considered
the maximum probable hypothesis or most probable hypothesis. This most probable hypothesis is called
the Maximum A Posteriori Hypothesis h_MAP. Bayes theorem Eq. (8.1) can be used to find h_MAP:

h_MAP = max_{h ∈ H} P (Hypothesis h | Evidence E)
      = max_{h ∈ H} [P (Evidence E | Hypothesis h) P (Hypothesis h)] / P (Evidence E)
      = max_{h ∈ H} P (Evidence E | Hypothesis h) P (Hypothesis h)     (8.2)

(The denominator P (Evidence E) is the same for every hypothesis, so it can be dropped when comparing hypotheses.)

Maximum Likelihood (ML) Hypothesis, h_ML

Given a set of candidate hypotheses, if every hypothesis is equally probable, only P (E | h) is used
to find the most probable hypothesis. The hypothesis that gives the maximum likelihood for P (E | h)
is called the Maximum Likelihood (ML) Hypothesis, h_ML:

h_ML = max_{h ∈ H} P (Evidence E | Hypothesis h)     (8.3)
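The following short Python sketch illustrates how h_MAP and h_ML are obtained from a set of candidate hypotheses. It is only an illustration, not code from the text; the priors and likelihoods are made-up numbers.

```python
# Illustrative sketch (made-up numbers): selecting the MAP and ML hypotheses.
priors = {"h1": 0.6, "h2": 0.3, "h3": 0.1}        # P(h), assumed values
likelihoods = {"h1": 0.2, "h2": 0.5, "h3": 0.9}   # P(E | h), assumed values

# Unnormalised posteriors P(h | E) proportional to P(E | h) P(h), as in Eq. (8.2)
scores = {h: likelihoods[h] * priors[h] for h in priors}

h_map = max(scores, key=scores.get)               # MAP hypothesis
h_ml = max(likelihoods, key=likelihoods.get)      # ML hypothesis (assumes equal priors)

evidence = sum(scores.values())                   # marginal probability P(E)
posteriors = {h: s / evidence for h, s in scores.items()}
print(posteriors, h_map, h_ml)
```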

Correctness of Bayes Theorem

Consider two events A and B observed over eight trials in a sample space S:

A: T F T T F T T F
B: F T T F T F T F

P(A) = 5/8
P(B) = 4/8

P(A | B) = 2/4
P(B | A) = 2/5
Check using Bayes theorem:
P(A | B) = P(B | A) P(A)/P(B) = (2/5 × 5/8)/(4/8) = 2/4
P(B | A) = P(A | B) P(B)/P(A) = (2/4 × 4/8)/(5/8) = 2/5
Both results agree with the direct counts, confirming the theorem.
Let us consider a numerical example to illustrate the use of Bayes theorem.

Example: Consider a boy who has a volleyball tournament the next day, but today he feels sick. Since he is a healthy boy, it is unusual for him to fall sick: there is only a 40% chance of it. The boy is very much interested in volleyball, so there is a 90% probability that he participates in tournaments and a 20% probability that he falls sick given that he participates in the tournament. Find the probability that the boy participates in the tournament given that he is sick.
Solution: P (Boy participating in the tournament) = 90%
P (He is sick | Boy participating in the tournament) = 20%
P (He is Sick) = 40%
The probability of the boy participating in the tournament given that he is sick is:
P (Boy participating in the tournament | He is sick) = P (Boy participating in the tournament)
× P (He is sick | Boy participating in the tournament) / P (He is sick)
= (0.9 × 0.2)/0.4
= 0.45
Hence, 45% is the probability that the boy will participate in the tournament given that he is sick.
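This arithmetic can be checked with a few lines of Python; a minimal sketch of the same calculation:

```python
# Verifying the worked example with Bayes theorem, Eq. (8.1)
p_participate = 0.9             # P(participates in the tournament)
p_sick_given_participate = 0.2  # P(sick | participates)
p_sick = 0.4                    # P(sick)

p_participate_given_sick = p_participate * p_sick_given_participate / p_sick
print(p_participate_given_sick)  # 0.45, i.e. 45%
```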
One concept related to Bayes theorem is the principle of Minimum Description Length (MDL). The
Minimum Description Length (MDL) principle is yet another powerful method, like Occam’s razor,
to perform inductive inference. It states that the best and most probable hypothesis for a set of
observed data is the one with the minimum description length. Recall from Eq. (8.2) the Maximum
A Posteriori (MAP) Hypothesis, h_MAP, which says that given a set of candidate hypotheses, the
hypothesis with the maximum posterior probability is considered the maximum probable hypothesis
or most probable hypothesis. The Naive Bayes algorithm uses Bayes theorem and applies this
principle to find the best hypothesis for a given problem. Let us understand how this algorithm
works in the following Section 8.3.1.

8.3.1 NAIVE BAYES ALGORITHM


It is a supervised binary-class or multi-class classification algorithm that works on the principle of
Bayes theorem. There is a family of Naive Bayes classifiers based on a common principle. These
algorithms classify datasets by assuming that the features are independent and that each feature is
given equal weightage. The algorithm works particularly well for large datasets and is very fast. It is one of
the most effective and simple classification algorithms. This algorithm considers all features to be
independent of each other even though they may individually be dependent on the classified object.
Each of the features contributes a probability value independently during classification and hence
this algorithm is called the Naive algorithm.
Some important applications of these algorithms are text classification, recommendation
systems and face recognition.
Algorithm 8.1: Naive Bayes

1. Compute the prior probability for the target class.
2. Compute the frequency matrix and likelihood probability for each of the features.
3. Use Bayes theorem Eq. (8.1) to calculate the probability of all hypotheses.
4. Use the Maximum A Posteriori (MAP) Hypothesis, h_MAP, Eq. (8.2) to classify the test object to the hypothesis with the highest probability.
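As a rough illustration of these four steps, the sketch below implements a categorical Naive Bayes classifier in Python. It is not the textbook's code; the dictionary-based data layout and function names are assumptions made for readability, and only a tiny sample in the spirit of Table 8.1 is shown as input.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    """Steps 1-2: prior probabilities and frequency counts for each feature."""
    n = len(rows)
    class_counts = Counter(row[target] for row in rows)
    priors = {c: class_counts[c] / n for c in class_counts}
    counts = defaultdict(Counter)          # counts[feature][(value, class)]
    for row in rows:
        c = row[target]
        for feature, value in row.items():
            if feature != target:
                counts[feature][(value, c)] += 1
    return priors, counts, class_counts

def classify(test, priors, counts, class_counts):
    """Steps 3-4: posterior score per class (P(Test Data) ignored) and MAP class."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for feature, value in test.items():
            score *= counts[feature][(value, c)] / class_counts[c]
        scores[c] = score
    return max(scores, key=scores.get), scores

# Tiny sample with only two features, purely for illustration
rows = [
    {"CGPA": ">=9", "Interactiveness": "Yes", "JobOffer": "Yes"},
    {"CGPA": ">=9", "Interactiveness": "Yes", "JobOffer": "Yes"},
    {"CGPA": ">=8", "Interactiveness": "No",  "JobOffer": "Yes"},
    {"CGPA": ">=9", "Interactiveness": "No",  "JobOffer": "No"},
    {"CGPA": "<8",  "Interactiveness": "Yes", "JobOffer": "No"},
]
model = train_naive_bayes(rows, "JobOffer")
print(classify({"CGPA": ">=9", "Interactiveness": "Yes"}, *model))
```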

Example 8.1: Assess a student's performance using the Naive Bayes algorithm with the dataset provided in Table 8.1. Predict whether a student gets a job offer or not in his final year of the course.

Table 8.1: Training Dataset

S.No.  CGPA  Interactiveness  Practical Knowledge  Communication Skills  Job Offer
1.     ≥9    Yes              Very good            Good                  Yes
2.     ≥8    No               Good                 Moderate              Yes
3.     ≥9    No               Average              Poor                  No
4.     <8    No               Average              Good                  No
5.     ≥8    Yes              Good                 Moderate              Yes
6.     ≥9    Yes              Good                 Moderate              Yes
7.     <8    Yes              Good                 Poor                  No
8.     ≥9    No               Very good            Good                  Yes
9.     ≥8    Yes              Good                 Good                  Yes
10.    ≥8    Yes              Average              Good                  Yes

Solution: The training dataset T consists of 10 data instances with attributes such as ‘CGPA’,
‘Interactiveness’, ‘Practical Knowledge’ and ‘Communication Skills’ as shown in Table 8.1. The
target variable is Job Offer, which is classified as Yes or No for a candidate student.
Step 1: Compute the prior probability for the target feature ‘Job Offer’. The target feature ‘Job
Offer’ has two classes, ‘Yes’ and ‘No’, so it is a binary classification problem. Given a student instance,
we need to classify whether ‘Job Offer = Yes’ or ‘Job Offer = No’.
From the training dataset, we observe that the frequency or the number of instances with ‘Job
Offer = Yes’ is 7 and with ‘Job Offer = No’ is 3.
The prior probability for the target feature is calculated by dividing the number of instances
belonging to a particular target class by the total number of instances.
Hence, the prior probability for ‘Job Offer = Yes’ is 7/10 and ‘Job Offer = No’ is 3/10 as shown
in Table 8.2.
Table 8.2: Frequency Matrix and Prior Probability of Job Offer

Job Offer Classes   No. of Instances   Probability Value
Yes                 7                  P (Job Offer = Yes) = 7/10
No                  3                  P (Job Offer = No) = 3/10

Step 2: Compute the frequency matrix and likelihood probability for each of the features.
Step 2(a): Feature – CGPA

Table 8.3 shows the frequency matrix for the feature CGPA.

Table 8.3: Frequency Matrix of CGPA

CGPA    Job Offer = Yes   Job Offer = No
≥9      3                 1
≥8      4                 0
<8      0                 2
Total   7                 3

Table 8.4 shows how the likelihood probability is calculated for CGPA using conditional
probability.
Table 8.4: Likelihood Probability of CGPA

CGPA   P (Job Offer = Yes)                          P (Job Offer = No)
≥9     P (CGPA ≥9 | Job Offer = Yes) = 3/7          P (CGPA ≥9 | Job Offer = No) = 1/3
≥8     P (CGPA ≥8 | Job Offer = Yes) = 4/7          P (CGPA ≥8 | Job Offer = No) = 0/3
<8     P (CGPA <8 | Job Offer = Yes) = 0/7          P (CGPA <8 | Job Offer = No) = 2/3

As explained earlier, the likelihood probability is the sampling density for the evidence given the
hypothesis. It is denoted as P (Evidence | Hypothesis), which says how likely the occurrence of the
evidence is given the parameters.
It is calculated as the number of instances of each attribute value for a given class value
divided by the number of instances with that class value.
For example, P (CGPA ≥9 | Job Offer = Yes) denotes the number of instances with ‘CGPA ≥9’
and ‘Job Offer = Yes’ divided by the total number of instances with ‘Job Offer = Yes’.
From the Table 8.3 frequency matrix of CGPA, the number of instances with ‘CGPA ≥9’ and ‘Job
Offer = Yes’ is 3. The total number of instances with ‘Job Offer = Yes’ is 7. Hence, P (CGPA ≥9 | Job
Offer = Yes) = 3/7.
Similarly, the likelihood probability is calculated for all attribute values of the feature CGPA.
Step 2(b): Feature — Interactiveness
Table 8.5 shows the frequency matrix for the feature Interactiveness.
Table 8.5: Frequency Matrix of Interactiveness

Interactiveness   Job Offer = Yes   Job Offer = No
YES               5                 1
NO                2                 2
Total             7                 3
Table 8.6 shows how the likelihood probability is calculated for Interactiveness using condi-
tional probability.

Table 8.6: Likelihood Probability of Interactiveness

Interactiveness   P (Job Offer = Yes)                                  P (Job Offer = No)
YES               P (Interactiveness = Yes | Job Offer = Yes) = 5/7    P (Interactiveness = Yes | Job Offer = No) = 1/3
NO                P (Interactiveness = No | Job Offer = Yes) = 2/7     P (Interactiveness = No | Job Offer = No) = 2/3
Step 2(c): Feature - Practical Knowledge
Table 8.7 shows the frequency matrix for the feature Practical Knowledge.

Table 8.7: Frequency Matrix of Practical Knowledge

Practical Knowledge   Job Offer = Yes   Job Offer = No
Very Good             2                 0
Average               1                 2
Good                  4                 1
Total                 7                 3
Table 8.8 shows how the likelihood probability is calculated for Practical Knowledge using conditional probability.

Table 8.8: Likelihood Probability of Practical Knowledge

Practical Knowledge   P (Job Offer = Yes)                                            P (Job Offer = No)
Very Good             P (Practical Knowledge = Very Good | Job Offer = Yes) = 2/7    P (Practical Knowledge = Very Good | Job Offer = No) = 0/3
Average               P (Practical Knowledge = Average | Job Offer = Yes) = 1/7      P (Practical Knowledge = Average | Job Offer = No) = 2/3
Good                  P (Practical Knowledge = Good | Job Offer = Yes) = 4/7         P (Practical Knowledge = Good | Job Offer = No) = 1/3
Step 2(d): Feature — Communication Skills
Table 8.9 shows the frequency matrix for the feature Communication Skills.
Table 8.9: Frequency Matrix of Communication Skills

Communication Skills   Job Offer = Yes   Job Offer = No
Good                   4                 1
Moderate               3                 0
Poor                   0                 2
Total                  7                 3

Table 8.10 shows how the likelihood probability is calculated for Communication Skills using
conditional probability.
Table 8.10: Likelihood Probability of Communication Skills

Communication Skills   P (Job Offer = Yes)                                              P (Job Offer = No)
Good                   P (Communication Skills = Good | Job Offer = Yes) = 4/7          P (Communication Skills = Good | Job Offer = No) = 1/3
Moderate               P (Communication Skills = Moderate | Job Offer = Yes) = 3/7      P (Communication Skills = Moderate | Job Offer = No) = 0/3
Poor                   P (Communication Skills = Poor | Job Offer = Yes) = 0/7          P (Communication Skills = Poor | Job Offer = No) = 2/3
Step 3: Use Bayes theorem Eq. (8.1) to calculate the probability of all hypotheses.
Given the test data = (CGPA ≥9, Interactiveness = Yes, Practical Knowledge = Average, Communication Skills = Good), apply the Bayes theorem to classify whether the given student gets a job offer or not.
P (Job Offer = Yes | Test data) = P (CGPA ≥9 | Job Offer = Yes) P (Interactiveness = Yes | Job Offer = Yes) P (Practical Knowledge = Average | Job Offer = Yes) P (Communication Skills = Good | Job Offer = Yes) P (Job Offer = Yes) / P (Test Data)
We can ignore P (Test Data) in the denominator since it is common for all the cases to be considered.
Hence, P (Job Offer = Yes | Test data) = P (CGPA ≥9 | Job Offer = Yes) P (Interactiveness = Yes | Job Offer = Yes) P (Practical Knowledge = Average | Job Offer = Yes) P (Communication Skills = Good | Job Offer = Yes) P (Job Offer = Yes)
= 3/7 × 5/7 × 1/7 × 4/7 × 7/10
= 0.0175
Similarly, for the other case ‘Job Offer = No’, we compute the probability:
P (Job Offer = No | Test data) = P (CGPA ≥9 | Job Offer = No) P (Interactiveness = Yes | Job Offer = No) P (Practical Knowledge = Average | Job Offer = No) P (Communication Skills = Good | Job Offer = No) P (Job Offer = No) / P (Test Data)
Ignoring the common denominator,
P (CGPA ≥9 | Job Offer = No) P (Interactiveness = Yes | Job Offer = No) P (Practical Knowledge = Average | Job Offer = No) P (Communication Skills = Good | Job Offer = No) P (Job Offer = No)
= 1/3 × 1/3 × 2/3 × 1/3 × 3/10
= 0.0074
Step 4: Use the Maximum A Posteriori (MAP) Hypothesis, h_MAP, Eq. (8.2) to classify the test object to the hypothesis with the highest probability.
Since P (Job Offer = Yes | Test data) has the highest probability value, the test data is classified as ‘Job Offer = Yes’.
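The two unnormalised posteriors computed in Steps 3 and 4 can be reproduced exactly with the fractions above; a short check in Python:

```python
from fractions import Fraction as F

# Likelihoods and priors copied from Tables 8.2, 8.4, 8.6, 8.8 and 8.10
p_yes = F(3, 7) * F(5, 7) * F(1, 7) * F(4, 7) * F(7, 10)
p_no = F(1, 3) * F(1, 3) * F(2, 3) * F(1, 3) * F(3, 10)

print(float(p_yes), float(p_no))                       # about 0.0175 and 0.0074
print("Job Offer =", "Yes" if p_yes > p_no else "No")  # MAP decision
```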

Zero Probability Error


In Example 8.1, consider the test data to be (CGPA ≥8, Interactiveness = Yes, Practical Knowledge = Average, Communication Skills = Good).
When computing the posterior probability,
P (Job Offer = Yes | Test data) = P (CGPA ≥8 | Job Offer = Yes) P (Interactiveness = Yes | Job Offer = Yes) P (Practical Knowledge = Average | Job Offer = Yes) P (Communication Skills = Good | Job Offer = Yes) P (Job Offer = Yes) / P (Test Data)
Ignoring the common denominator P (Test Data),
P (Job Offer = Yes | Test data) = P (CGPA ≥8 | Job Offer = Yes) P (Interactiveness = Yes | Job Offer = Yes) P (Practical Knowledge = Average | Job Offer = Yes) P (Communication Skills = Good | Job Offer = Yes) P (Job Offer = Yes)
= 4/7 × 5/7 × 1/7 × 4/7 × 7/10
= 0.0233
Similarly, for the other case ‘Job Offer = No’, when we compute the probability:
P (Job Offer = No | Test data) = P (CGPA ≥8 | Job Offer = No) P (Interactiveness = Yes | Job Offer = No) P (Practical Knowledge = Average | Job Offer = No) P (Communication Skills = Good | Job Offer = No) P (Job Offer = No) / P (Test Data)
Ignoring the common denominator,
= P (CGPA ≥8 | Job Offer = No) P (Interactiveness = Yes | Job Offer = No) P (Practical Knowledge = Average | Job Offer = No) P (Communication Skills = Good | Job Offer = No) P (Job Offer = No)
= 0/3 × 1/3 × 2/3 × 1/3 × 3/10
= 0
Since the probability value is zero, the model fails to predict; this is called the Zero-Probability error. This problem arises because there are no instances in Table 8.1 with the attribute value CGPA ≥8 and Job Offer = No, and hence the probability value for this case is zero. The zero-probability error can be solved by applying a smoothing technique called Laplace correction. The idea is that, given 1000 data instances in the training dataset, if there are zero instances for a particular value of a feature, we add 1 instance for each attribute value of that feature. This makes little difference for 1000 data instances, and the overall probability no longer becomes zero.
Now, let us scale the values given in Table 8.1 to 1000 data instances. The scaled values without Laplace correction are shown in Table 8.11.
Table 8.11: Scaled Values to 1000 without Laplace Correction

CGPA   P (Job Offer = Yes)                              P (Job Offer = No)
≥9     P (CGPA ≥9 | Job Offer = Yes) = 300/700          P (CGPA ≥9 | Job Offer = No) = 100/300
≥8     P (CGPA ≥8 | Job Offer = Yes) = 400/700          P (CGPA ≥8 | Job Offer = No) = 0/300
<8     P (CGPA <8 | Job Offer = Yes) = 0/700            P (CGPA <8 | Job Offer = No) = 200/300

Now, add 1 instance for each CGPA value for ‘Job Offer = No’. Then,
P (CGPA ≥9 | Job Offer = No) = 101/303 = 0.3333
P (CGPA ≥8 | Job Offer = No) = 1/303 = 0.0033
P (CGPA <8 | Job Offer = No) = 201/303 = 0.6634
With the values scaled to 1003 data instances, we get:
P (Job Offer = Yes | Test data) = P (CGPA ≥8 | Job Offer = Yes) P (Interactiveness = Yes | Job Offer = Yes) P (Practical Knowledge = Average | Job Offer = Yes) P (Communication Skills = Good | Job Offer = Yes) P (Job Offer = Yes)
= 400/700 × 500/700 × 100/700 × 400/700 × 700/1003
= 0.02325
P (Job Offer = No | Test data) = P (CGPA ≥8 | Job Offer = No) P (Interactiveness = Yes | Job Offer = No) P (Practical Knowledge = Average | Job Offer = No) P (Communication Skills = Good | Job Offer = No) P (Job Offer = No)
= 1/303 × 100/300 × 200/300 × 100/300 × 303/1003
= 0.00007385
Thus, using Laplace Correction, Zero Probability error can be solved with Naive Bayes classifier.
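The same add-one adjustment can be written compactly in Python; a minimal sketch using the scaled CGPA counts for 'Job Offer = No':

```python
# Laplace (add-one) correction on the scaled counts for Job Offer = No
counts_no = {">=9": 100, ">=8": 0, "<8": 200}

k = 1                                                 # add k = 1 to every attribute value
total = sum(counts_no.values()) + k * len(counts_no)  # 300 + 3 = 303
smoothed = {v: (c + k) / total for v, c in counts_no.items()}

print(smoothed)   # {'>=9': ~0.3333, '>=8': ~0.0033, '<8': ~0.6634}
```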

8.3.2 Brute Force Bayes Algorithm


Applying Bayes theorem, the Brute Force Bayes algorithm relies on the idea of concept learning: given a hypothesis space H for the training dataset T, the algorithm computes the posterior probabilities for all the hypotheses h_i ∈ H. Then, the Maximum A Posteriori (MAP) Hypothesis, h_MAP, is used to output the hypothesis with the maximum posterior probability. The algorithm is quite expensive since it requires computations for all the hypotheses. Although computing all posterior probabilities is inefficient, this idea is applied in various other algorithms, which is quite interesting.

8.3.3 Bayes Optimal Classifier


The Bayes optimal classifier is a probabilistic model which uses Bayes theorem to find the most probable classification for a new instance, given the training data, by combining the predictions of all hypotheses weighted by their posterior probabilities. This is different from the Maximum A Posteriori (MAP) Hypothesis, h_MAP, which chooses only the single most probable hypothesis.
Here, a new instance can be classified to a possible classification value c_j by the following Eq. (8.4):

c = max_{c_j ∈ C} Σ_{h_i ∈ H} P (c_j | h_i) P (h_i | T)     (8.4)

Example: Given a hypothesis space with four hypotheses h1, h2, h3 and h4, determine whether the patient is diagnosed as COVID positive or COVID negative using the Bayes Optimal classifier.
Solution: From the training dataset T, the posterior probabilities of the four different hypotheses
for a new instance are given in Table 8.12.
Table 8.12: Posterior Probability Values

Hypothesis   P (h_i | T)   P (COVID Positive | h_i)   P (COVID Negative | h_i)
h1           0.3           0                          1
h2           0.1           1                          0
h3           0.2           1                          0
h4           0.1           1                          0

h_MAP chooses h1, which has the maximum posterior probability value 0.3, and gives the result that the patient is COVID negative. But the Bayes Optimal classifier combines the predictions of h2, h3 and h4, which together give 0.4, and gives the result that the patient is COVID positive.

Σ_{h_i ∈ H} P (COVID Negative | h_i) P (h_i | T) = 0.3 × 1 = 0.3
Σ_{h_i ∈ H} P (COVID Positive | h_i) P (h_i | T) = 0.1 × 1 + 0.2 × 1 + 0.1 × 1 = 0.4

Therefore, max_{c_j ∈ {COVID Positive, COVID Negative}} Σ_{h_i ∈ H} P (c_j | h_i) P (h_i | T) = COVID Positive.
Thus, this algorithm diagnoses the new instance to be COVID positive.
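The combination of hypotheses in Eq. (8.4) is easy to verify numerically; a short sketch using the values of Table 8.12:

```python
# Each tuple: (P(h_i | T), P(COVID Positive | h_i), P(COVID Negative | h_i))
hypotheses = [(0.3, 0, 1), (0.1, 1, 0), (0.2, 1, 0), (0.1, 1, 0)]

p_positive = sum(p_h * p_pos for p_h, p_pos, _ in hypotheses)   # 0.4
p_negative = sum(p_h * p_neg for p_h, _, p_neg in hypotheses)   # 0.3

print("COVID Positive" if p_positive > p_negative else "COVID Negative")
```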

8.3.4 Gibbs Algorithm
The main drawback of the Bayes optimal classifier is that it computes the posterior probability for all hypotheses in the hypothesis space and then combines the predictions to classify a new instance.
The Gibbs algorithm is a sampling technique which randomly selects a hypothesis from the hypothesis space according to the posterior probability distribution and uses it to classify a new instance. It is found that the expected prediction error of the Gibbs algorithm is at most twice that of the Bayes Optimal classifier.

8.4 NAIVE BAYES ALGORITHM FOR CONTINUOUS ATTRIBUTES


There are two ways to predict with Naive Bayes algorithm for continuous attributes:
1. Discretize continuous feature to discrete feature.
2. Apply Normal or Gaussian distribution for continuous feature.

Gaussian Naive Bayes Algorithm


In Gaussian Naive Bayes, the values of continuous features are assumed to be sampled from a
Gaussian distribution.

Example: Assess a student's performance using the Naive Bayes algorithm for a continuous attribute. Predict whether a student gets a job offer or not in his final year of the course. The training dataset T consists of 10 data instances with the attributes ‘CGPA’ and ‘Interactiveness’ as shown in Table 8.13. The target variable is Job Offer, which is classified as Yes or No for a candidate student.

Table 8.13: Training Dataset with Continuous Attribute

S.No.  CGPA  Interactiveness  Job Offer
1.     9.5   Yes              Yes
2.     8.2   No               Yes
3.     9.3   No               No
4.     7.6   No               No
5.     8.4   Yes              Yes
6.     9.1   Yes              Yes
7.     7.5   Yes              No
8.     9.6   No               Yes
9.     8.6   Yes              Yes
10.    8.3   Yes              Yes
Solution:
Step 1: Compute the prior probability for the target feature ‘Job Offer’.
Prior probabilities of both the classes are calculated using the same formula (refer to
Table 8.14).
Table 8.14: Prior Probability of Target Class

Job Offer Classes No. of Instances Probability Value


Yes 7 P (Job Offer = Yes) = 7/10
No 3 P (Job Offer = No) = 3/10
Step 2: Compute the frequency matrix and likelihood probability for each of the features.
The likelihood probabilities for a continuous attribute are obtained from the Gaussian (Normal) distribution. In the above dataset, CGPA is a continuous attribute, for which we need to apply the Gaussian distribution to calculate the likelihood probability.
The Gaussian distribution for a continuous feature is calculated using the formula:

P (X_i = x_k | C_j) = g(x_k, μ_ij, σ_ij)     (8.5)

where,
X_i is the i-th continuous attribute in the given dataset and x_k is a value of that attribute.
C_j denotes the j-th class of the target feature.
μ_ij denotes the mean of the values of the continuous attribute X_i with respect to the class j of the target feature.
σ_ij denotes the standard deviation of the values of the continuous attribute X_i with respect to the class j of the target feature.
Hence, the normal distribution formula is given as:

P (X_i = x_k | C_j) = (1 / (σ_ij √(2π))) e^(-(x_k - μ_ij)² / (2 σ_ij²))     (8.6)
Step 2(a): Consider the feature CGPA
In this example, CGPA is a continuous attribute. To calculate the likelihood probability for this continuous attribute, first compute the mean and standard deviation of CGPA with respect to each class of the target feature ‘Job Offer’.
Here, X_i = CGPA.
For C_j = ‘Job Offer = Yes’, the mean and standard deviation are:
μ_1 = μ_CGPA-Yes = 8.814286
σ_1 = σ_CGPA-Yes = 0.58146
For C_j = ‘Job Offer = No’, the mean and standard deviation are:
μ_2 = μ_CGPA-No = 8.133333
σ_2 = σ_CGPA-No = 1.011599
Once the mean and standard deviation are computed, the likelihood probability for any test value can be calculated using the Gaussian distribution formula.
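These class-wise means and standard deviations can be checked with Python's statistics module (the values above correspond to the sample standard deviation):

```python
from statistics import mean, stdev

cgpa_yes = [9.5, 8.2, 8.4, 9.1, 9.6, 8.6, 8.3]   # CGPA where Job Offer = Yes
cgpa_no = [9.3, 7.6, 7.5]                         # CGPA where Job Offer = No

print(mean(cgpa_yes), stdev(cgpa_yes))   # about 8.814286, 0.58146
print(mean(cgpa_no), stdev(cgpa_no))     # about 8.133333, 1.011599
```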
Step 2(b): Consider the feature Interactiveness


Interactiveness is a discrete feature whose probability is calculated as earlier.
Table 8.15 shows the frequency matrix for the feature Interactiveness.
Table 8.15: Frequency Matrix of Interactiveness

Interactiveness Job Offer = Yes Job Offer = No


YES 5 1
NO 2 2
Total 7 3

Table 8.16 shows how the likelihood probability is calculated for Interactiveness using condi-
tional probability.
Table 8.16: Likelihood Probability of Interactiveness

Interactiveness   P (Job Offer = Yes)                                  P (Job Offer = No)
YES               P (Interactiveness = Yes | Job Offer = Yes) = 5/7    P (Interactiveness = Yes | Job Offer = No) = 1/3
NO                P (Interactiveness = No | Job Offer = Yes) = 2/7     P (Interactiveness = No | Job Offer = No) = 2/3

Step 3: Use Bayes theorem to calculate the probability of all hypotheses.

Consider the test data to be (CGPA = 8.5, Interactiveness = Yes).
For the hypothesis ‘Job Offer = Yes’:
P (Job Offer = Yes | Test data) = P (CGPA = 8.5 | Job Offer = Yes) × P (Interactiveness = Yes | Job Offer = Yes) × P (Job Offer = Yes)
To compute P (CGPA = 8.5 | Job Offer = Yes), use the Gaussian distribution formula:

P (X_CGPA = 8.5 | C_'Job Offer = Yes') = (1 / (σ_CGPA-Yes √(2π))) e^(-(8.5 - μ_CGPA-Yes)² / (2 σ_CGPA-Yes²))

P (CGPA = 8.5 | Job Offer = Yes) = g(x_k = 8.5, μ_1 = 8.814, σ_1 = 0.581)
= (1 / (0.581 √(2π))) e^(-(8.5 - 8.814)² / (2 × 0.581²))
= 0.594
P (Interactiveness = Yes|Job Offer = Yes ) = 5/7
P (Job Offer = Yes) =7/10
Hence:
P (Job Offer = Yes | Test data) = P (CGPA = 8.5 | Job Offer = Yes) × P (Interactiveness = Yes | Job Offer = Yes) × P (Job Offer = Yes)
= 0.594 × 5/7 × 7/10
= 0.297
Similarly, for the hypothesis ‘Job Offer = No’:
P (Job Offer = No | Test data) = P (CGPA = 8.5 | Job Offer = No) × P (Interactiveness = Yes | Job Offer = No) × P (Job Offer = No)
P (CGPA = 8.5 | Job Offer = No) = g(x_k = 8.5, μ_2 = 8.133, σ_2 = 1.0116)
= (1 / (1.0116 √(2π))) e^(-(8.5 - 8.133)² / (2 × 1.0116²))
= 0.369
P (Interactiveness = Yes | Job Offer = No) = 1/3
P (Job Offer = No) = 3/10
Hence,
P (Job Offer = No | Test data) = P (CGPA = 8.5 | Job Offer = No) × P (Interactiveness = Yes | Job Offer = No) × P (Job Offer = No)
= 0.369 × 1/3 × 3/10
= 0.0369
Step 4: Use the Maximum A Posteriori (MAP) Hypothesis, h_MAP, to classify the test object to the hypothesis with the highest probability.
Since P (Job Offer = Yes | Test data) has the highest probability value of 0.297, the test data is classified as ‘Job Offer = Yes’.
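The whole Gaussian Naive Bayes calculation for this test instance fits in a few lines of Python; a minimal sketch using Eq. (8.6) and the values computed above:

```python
import math

def gaussian(x, mu, sigma):
    """Normal density, Eq. (8.6)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Test instance: CGPA = 8.5, Interactiveness = Yes
score_yes = gaussian(8.5, 8.814286, 0.58146) * (5 / 7) * (7 / 10)    # about 0.297
score_no = gaussian(8.5, 8.133333, 1.011599) * (1 / 3) * (3 / 10)    # about 0.037

print(score_yes, score_no)
print("Job Offer =", "Yes" if score_yes > score_no else "No")
```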

8.5 OTHER POPULAR TYPES OF NAIVE BAYES CLASSIFIERS


Some of the popular variants of Bayesian classifier are listed below:

Bernoulli Naive Bayes Classifier


Bernoulli Naive Bayes works with discrete features. In this algorithm, the features used for making predictions are Boolean variables that take only two values, ‘yes’ or ‘no’. This is particularly useful for text classification where all features are binary, each feature indicating whether a word occurs in the document or not.

Multinomial Naive Bayes Classifier


This algorithm is a generalization of the Bernoulli Naive Bayes model that works for categorical data, or particularly integer features. This classifier is useful for text classification where each feature has an integer value that represents the frequency of occurrence of words.

Multi-class Naive Bayes Classifier


This algorithm is useful for classification problems with more than two classes, where the target feature contains multiple classes and a test instance has to be predicted with the class it belongs to.
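If the scikit-learn library is available, these variants (together with the Gaussian one used earlier) are provided as ready-made classes; a minimal sketch, where the tiny feature matrices and labels are made up purely for illustration:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

y = np.array([0, 1, 1, 0])                                  # two target classes

X_binary = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])       # word present/absent
X_counts = np.array([[3, 0], [0, 2], [1, 4], [2, 1]])       # word frequencies
X_real = np.array([[9.5, 1.0], [7.6, 0.0], [8.4, 1.0], [9.3, 0.0]])  # continuous values

print(BernoulliNB().fit(X_binary, y).predict(X_binary))
print(MultinomialNB().fit(X_counts, y).predict(X_counts))
print(GaussianNB().fit(X_real, y).predict(X_real))
```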

1. Probabilistic learning uses the concepts of probability theory, which describes how to model randomness, uncertainty and noise to predict future events.
2. It is a tool for modelling large datasets and uses Bayes rule to infer unknown quantities, predict and learn from data.
3. Bayesian learning differs from probabilistic learning in that it uses subjective probabilities to infer the parameters of a model.
4. Probabilities are used to denote the degree of belief in the occurrence of an event.
5. Bayes’ theorem uses conditional probability, which can be defined via joint probabilities: P(A | B) = P(A, B)/P(B).
6. Bayes theorem is used to select the most probable hypothesis from data, considering both prior knowledge and posterior distributions.
7. Bayes theorem helps in calculating the posterior probability for several hypotheses and selects the hypothesis with the highest probability.
8. The Naive Bayes algorithm is a supervised binary-class or multi-class classification algorithm that works on the principle of Bayes theorem.
9. The zero probability error with the Naive Bayes model can be solved by applying a smoothing technique called Laplace correction.
10. The Naive Bayes algorithm for continuous attributes can be solved using the Gaussian distribution.
11. Other popular types of Naive Bayes classifiers are the Bernoulli Naive Bayes classifier, the Multinomial Naive Bayes classifier and the Multi-class Naive Bayes classifier.

Probability-based Learning — It is one of the most important practical learning methods which
combines prior knowledge or prior probabilities with observed data.
Probabilistic Model - A model in which randomness plays a major role and which gives probability
distribution as a solution.
Deterministic Model — A model in which there is no randomness and hence it exhibits the same
initial condition every time it is run and is likely to get a single possible outcome as the solution.
Bayesian Learning — A learning method that describes and represents knowledge in an uncertain
domain and provides a way to reason about this knowledge using probability measure.
Conditional Probability — The probability of an event A, given that event B occurs, is written as P (A | B).
Joint Probability — The probability of the intersection of two or more events. .
Bayesian Probability — Otherwise called personal probability, it is a person’s degree of belief in an event A and does not require repeated trials.
Marginal Probability — The probability of an event occurring, P (A), unconditionally, i.e., not conditioned on another event.
Belief Measure — A person’s belief in a statement “S” that depends on some knowledge “K”.
Prior Probability - The general probability of an uncertain event before an observation or some
evidence is collected.
Likelihood Probability - The relative probability of the observation occurring for each class or the
sampling density for the evidence given the hypothesis.
Posterior Probability - The updated or revised probability of an event taking into account the obser-
vations from the training data.
Maximum A Posteriori (MAP) Hypothesis, h_MAP - The hypothesis which has the maximum posterior probability among a given set of candidate hypotheses; it is considered the maximum probable hypothesis or most probable hypothesis.
Maximum Likelihood (ML) Hypothesis, h_ML - The hypothesis that gives the maximum likelihood for P (E | h).

Review Questions
1. What is meant by Probabilistic-based learning?
2. Differentiate between probabilistic model and deterministic model.
3. What is meant by Bayesian learning?
4. Define the following:
* Conditional probability
* Joint probability
* Bayesian Probability
* Marginal probability
5. What is belief measure?
6. What is marginalization?
7. What is the difference between prior, posterior and likelihood probabilities?
8. State Bayes theorem.
9. Define Maximum A Posteriori (MAP) Hypothesis, h_MAP, and Maximum Likelihood (ML) Hypothesis, h_ML.
10. Check the correctness of Bayes theorem with an example.
11. Consider three baskets, Basket I, Basket II and Basket III, with each basket containing rings of red color and green color. Basket I contains 6 red rings and 5 green rings. Basket II contains 3 green rings and 2 red rings, while Basket III contains 6 rings which are all red. A person chooses a ring randomly from a basket. If the ring picked is red, find the probability that it was taken from Basket II.
12. Assume the following probabilities, the probability of a person having Malaria to be 0.02%, the
probability of the test to be positive on detecting Malaria, given that the person has Malaria is 98%
and similarly the probability of the test to be negative on detecting Malaria, given that the person
doesn’t have malaria to be 95%. Find the probability of a person having Malaria; given that, the test
result is positive.
13. Take a real-time example of predicting the result of a student using the Naive Bayes algorithm. The training dataset T consists of 8 data instances with attributes such as ‘Assessment’, ‘Assignment’, ‘Project’ and ‘Seminar’ as shown in Table 8.17. The target variable is Result, which is classified as Pass or Fail for a candidate student. Given a test data to be (Assessment = Average, Assignment = Yes, Project = No and Seminar = Good), predict the result of the student. Apply Laplace correction if the zero probability problem occurs.

Table 8.17: Training Dataset

S.No.  Assessment  Assignment  Project  Seminar  Result
1.     Good        Yes         Yes      Good     Pass
2.     Average     Yes         No       Poor     Fail
3.     Good        No          Yes      Good     Pass
4.     Average     No          No       Poor     Fail
5.     Average     No          Yes      Good     Pass
6.     Good        No          No       Poor     Pass
7.     Average     Yes         Yes      Good     Fail
8.     Good        Yes         Yes      Poor     Pass
14. Consider an example of predicting a student’s result using the Gaussian Naive Bayes algorithm for a continuous attribute. The training dataset T consists of 10 data instances with attributes such as ‘Assessment Marks’, ‘Assignment Marks’ and ‘Seminar Done’ as shown in Table 8.18. The target variable is Result, which is classified as Pass or Fail for a candidate student. Given a test data to be (Assessment Marks = 75, Assignment Marks = 6, Seminar Done = Poor), predict the result of the student.

Table 8.18: Training Dataset

S.No.  Assessment Marks  Assignment Marks  Seminar Done  Result
1.     95                8                 Good          Pass
2.     71                5                 Poor          Fail
3.     93                9                 Good          Pass
4.     62                4                 Poor          Fail
5.     81                9                 Good          Pass
6.     93                8.5               Poor          Pass
7.     65                9                 Good          Pass
8.     45                3                 Poor          Fail
9.     78                8.5               Good          Pass
10.    56                4                 Poor          Fail
Across
3. MAP hypothesis has _______ (maximum/minimum) value among the given set of candidate hypotheses.
• The Naive Bayes algorithm assumes that features are _______ (dependent/independent) of each other.
• The degree of belief can be denoted by probability. (True/False)
• The updated probability of an event taking into account observations from training data is called _______ (prior/posterior) probability.
• Bayes theorem combines prior knowledge with _______ distributions.
• Deterministic model has no randomness. (True/False)
12. Bayesian learning can be used to make predictions based on historical data. (True/False)
13. Zero probability error can be solved using a smoothing technique called _______ correction.

Down
• Bayesian learning uses _______ (frequency/subjective) based reasoning to infer parameters.
• Bayes theorem uses _______ (marginal/conditional) probability.
• _______ probability is the general probability of an uncertain event before observation data is collected.
• Bayes theorem is noted for its usefulness in computing _______ (prior/posterior) probability.
• Probabilistic learning uses the concept of probability theory that describes how to model randomness, uncertainty and noise to predict future events. (True/False)
11. Naive Bayesian algorithm cannot be used to solve a problem with continuous attributes. (True/False)
14. ML hypothesis gives the minimum likelihood for P (E | h). (True/False)
Module - 04

Chapter 6
Decision Tree Learning

“Prediction is very difficult, especially if it’s about the future.”
— Niels Bohr

Decision Tree Learning is a widely used predictive model for supervised learning that spans
a number of practical applications in various areas. It is used for both classification and
regression tasks. The decision tree model basically represents logical rules that predict the
value of a target variable by inferring from data features. This chapter provides a keen insight
into how to construct a decision tree and infer knowledge from the tree.

Learning Objectives

* Understand the structure of the decision tree
* Know about the fundamentals of Entropy
* Learn and understand popular univariate Decision Tree Induction algorithms such as ID3 and C4.5, and the multivariate decision tree algorithm CART
* Deal with continuous attributes using the improved C4.5 algorithm
* Construct Classification and Regression Trees (CART) for classifying both categorical and continuous-valued target variables
* Construct regression trees where the target feature is a continuous-valued variable
* Understand the basics of validating and pruning decision trees

6.1 INTRODUCTION TO DECISION TREE LEARNING MODEL


The decision tree learning model, one of the most popular supervised predictive learning models, classifies data instances with high accuracy and consistency. The model performs an inductive inference that reaches a general conclusion from observed examples. This model is used for solving complex classification applications.
A decision tree is a concept tree which summarizes the information contained in the training dataset in the form of a tree structure. Once the concept model is built, test data can be easily classified.
This model can be used to classify both categorical target variables and continuous-valued target variables. Given a training dataset X, this model computes a hypothesis function h(X) as a decision tree.
Inputs to the model are data instances or objects with a set of features or attributes, which can
be discrete or continuous, and the output of the model is a decision tree which predicts or classifies the target class for the test data object.
In statistical terms, attributes or features are called independent variables. The target feature or target class is called the response variable, which indicates the category we need to predict for a test object.
The decision tree learning model generates a complete hypothesis space in the form of a tree structure from the given training dataset and allows us to search through the possible set of hypotheses, which in fact becomes a smaller decision tree as we walk through the tree. This kind of search bias is called preference bias.

6.1.1 Structure of a Decision Tree


A decision tree has a structure that consists of a root node, internal nodes/decision nodes, branches, and terminal nodes/leaf nodes. The topmost node in the tree is the root node. Internal nodes are the test nodes and are also called decision nodes. These nodes represent a choice or test of an input attribute, and the outcomes or outputs of the test condition are the branches emanating from this decision node. The branches are labelled as per the outcomes or output values of the test condition. Each branch represents a sub-tree or subsection of the entire tree. Every decision node is part of a path to a leaf node. The leaf nodes represent the labels or the outcome of a decision path. The labels of the leaf nodes are the different target classes a data instance can belong to.
Every path from root to a leaf node represents a logical rule that corresponds to
a conjunction
of test attributes and the whole tree represents a disjunction of these conjunctions. The decision
tree model, in general, represents a collection of logical rules of classification in the form of a tree
structure.
Decision networks, otherwise called as influence diagrams, have a directed graph structure
with nodes and links. It is an extension of Bayesian belief networks that represents information
about each node’s current state, its possible actions, the possible outcome of those actions, and their
utility. The concept of Bayesian Belief Network (BBN) is discussed in Chapter 9.
Figure 6.1 shows symbols that are used in this book to represent different nodes in the construction
of a decision tree. A circle is used to represent a root node, a diamond symbol is used to represent a
decision node or the internal nodes, and all leaf nodes are represented with a rectangle.

Figure 6.1: Nodes in a Decision Tree (a circle represents the root node, a diamond represents a decision node, and a rectangle represents a leaf node)


A decision tree consists of two major procedures discussed below.
Building the Tree

Goal Construct a decision tree with the given training dataset. The tree is constructed in a
top-down fashion. It starts from the root node. At every level of tree construction, we need to find
the best split attribute or best decision node among all attributes. This process is recursive and
continued until we end up in the last level of the tree or finding a leaf node which cannot be split
further. The tree construction is complete when all the test conditions lead to a leaf node. The leaf
node contains the target class or output of classification.

Output Decision tree representing the complete hypothesis space.

Knowledge Inference or Classification

Goal Given a test instance, infer to the target class it belongs to.

Classification Inferring the target class for the test instance or object is based on inductive
inference on the constructed decision tree. In order to classify an object, we need to start traversing
the tree from the root. We traverse as we evaluate the test condition on every decision node with
the test object attribute value and walk to the branch corresponding to the test’s outcome. This
process is repeated until we end up in a leaf node which contains the target class of the test object.
Output Target label of the test instance.

Advantages of Decision Trees


1. Easy to model and interpret
2. Simple to understand
3. The input and output attributes can be discrete or continuous predictor variables.
4. Can model a high degree of nonlinearity in the relationship between the target variables and the
predictor variables
5. Quick to train

Disadvantages of Decision Trees


Some of the issues that generally arise with a decision tree learning are that:
1. It is difficult to determine how deeply a decision tree can be grown or when to stop
growing it.
2. If training data has errors or missing attribute values, then the decision tree constructed
may become unstable or biased.
3. If the training data has continuous valued attributes, handling it is computationally
complex and has to be discretized.
4. A complex decision tree may also be over-fitting with the training data.
5. Decision tree learning is not well suited for classifying multiple output classes.

6. Learning an optimal decision tree is also known to be NP-complete.


Example 6.1: How to draw a decision tree to predict a student’s academic performance based on the given information such as class attendance, class assignments, home-work assignments, tests, participation in competitions or other events, group activities such as projects and presentations, etc.?
Solution: The target feature is the student's performance in the final examination, i.e., whether he will
pass or fail in the examination. The decision nodes are test nodes which check for conditions like
‘What's the student’s class attendance?’, ‘How did he perform in his class assignments?’, ‘Did he
do his home assignments properly?’ ‘What about his assessment results?’, ‘Did he participate in
competitions or other events?’, ‘What is the performance rating in group activities such as projects
and presentations?'. Table 6.1 shows the attributes and set of values for each attribute.
Table 6.1: Attributes and Associated Values

Attributes                                             Values
Class attendance                                       Good, Average, Poor
Class assignments                                      Good, Moderate, Poor
Home-work assignments                                  Yes, No
Assessment                                             Good, Moderate, Poor
Participation in competitions or other events          Yes, No
Group activities such as projects and presentations    Yes, No
Exam Result                                            Pass, Fail

The leaf nodes represent the outcomes, that is, either ‘pass’, or ‘fail’.
A decision tree would be constructed by following a set of if-else conditions which may or may not include all the attributes, and the outcomes of a decision node can be two or more. Hence, the tree is not a binary tree.

Note: A decision tree is not always a binary tree. It is a tree which can have more than two branches.

Example 6.2: Predict a student’s academic performance of whether he will pass or fail based
on the given information such as ‘Assessment’ and ‘Assignment’. The following Table 6.2 shows
the independent variables, Assessment and Assignment, and the target variable Exam Result with
their values. Draw a binary decision tree.
Table 6.2: Attributes and Associated Values

Attributes     Values
Assessment     ≥50, <50
Assignment     Yes, No
Exam Result    Pass, Fail

Solution: Consider that the root node is ‘Assessment’. If a student’s marks are ≥50, the root node is branched to the leaf node ‘Pass’ and if the assessment marks are <50, it is branched to another decision
node. If the decision node in next level of the tree is ‘Assignment’ and if a student has submitted
his assignment, the node branches to ‘Pass’ and if not submitted, the node branches to ‘Fail’.
Figure 6.2 depicts this rule.
Figure 6.2: Illustration of a Decision Tree


This tree can be interpreted as a sequence of logical rules as follows:
if (Assessment ≥ 50) then ‘Pass’
else if (Assessment < 50) then
    if (Assignment = Yes) then ‘Pass’
    else if (Assignment = No) then ‘Fail’
Now, if a test instance is given, such as a student has scored 42 marks in his assessment and has
not submitted his assignment, then it is predicted with the decision tree that his exam result is ‘Fail’.
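The same sequence of logical rules can be written directly as a small function; a minimal sketch mirroring Figure 6.2:

```python
def predict_exam_result(assessment_marks, assignment_submitted):
    # Root node: Assessment
    if assessment_marks >= 50:
        return "Pass"
    # Decision node: Assignment (reached only when Assessment < 50)
    return "Pass" if assignment_submitted else "Fail"

print(predict_exam_result(42, False))   # 'Fail', as predicted for the test instance above
```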
Many algorithms exist which will be studied for constructing decision trees in the sections below.

6.1.2 Fundamentals of Entropy


Given the training dataset with a set of attributes or features, the decision tree is constructed by
finding the attribute or feature that best describes the target class for the given test instances.
The best split feature is the one which contains more information about how to split the dataset
among all features so that the target class is accurately identified for the test instances. In other
words, the best split attribute is more informative to split the dataset into sub datasets and this
process is continued until the stopping criterion is reached. This splitting should be pure at every
stage of selecting the best feature.
The best feature is selected based on the amount of information among the features which
are basically calculated on probabilities. Quantifying information is closely related to information
theory. In the field of information theory, the features are quantified by a measure called Shannon
Entropy which is calculated based on the probability distribution of the events.
Entropy is the amount of uncertainty or randomness in the outcome of a random variable or an event. Moreover, entropy describes the homogeneity of the data instances. The best feature is selected based on the entropy value. For example, when a coin is flipped, head or tail are the two outcomes, hence its entropy is lower when compared to rolling a die, which has six outcomes.
Hence, the interpretation is,
158 « Machine Learning
-

Example How to draw a decision tree to predict a student’s academic performance based on
the given information such as class attendance, class assignments, home-work assignments, tests,
participation in competitions or other events, group activities such as projects and presentations, etc.
Solution: The target feature is the student performance in the final examination whether he will
pass or fail in the examination. The decision nodes are test nodes which check for conditions like
‘What's the student’s class attendance?’, "How did he perform in his class assignments?’, ‘Did he
do his home assignments properly?’ ‘What about his assessment results?’, ‘Did he participate in
competitions or other events?’, ‘What is the performance rating in group activities such as projects
and presentations?’. Table 6.1 shows the attributes and set of values for each attribute.
Table 6.1: Attributes and Associated Values
Attributes Values
Class attendance Good, Average, Poor
Class assignments Good, Moderate, Poor
Home-work assignments Yes, No
Assessment Good, Moderate, Poor
Participation in competitions or other events Yes, No
Group activities such as projects and presentations Yes, No
Exam Result Pass, Fail
The leaf nodes represent the outcomes, that is, either ‘pass’, or ‘fail’.
A decision tree would be constructed by following a set of if-else conditions which may or
may not include all the attributes, and decision nodes outcomes are two or more than two. Hence,
the tree is not a binary tree.
o

60&: A decision tree is not always a binary tree. It is a tree which can have more than two branches. )

Example Predict a student’s academic performance of whether he will pass or fail based
on the given information such as ‘Assessment’ and ‘Assignment’. The following Table 6.2 shows
the independent variables, Assessment and Assignment, and the target variable Exam Result with
their values. Draw a binary decision tree.
Table 6.2: Attributes and Associated Values
Attributes Values
Assessment 250, <50
Assignment Yes, No
Exam Result Pass, Fail

Solution: Consider the root node is ‘Assessment’. If a student’s marks are 250, the root node is
branched to leaf node ‘Pass’ and if the assessment marks are <50, it is branched to another decision
node. If the decision node in next level of the tree is ‘Assignment’ and if a student has submitted
his assignment, the node branches to ‘Pass’ and if not submitted, the node branches to ‘Fail’.
Figure 6.2 depicts this rule.
Decision Tree Learning « 159

Figure 6.2: lllustration of a Decision Tree


This tree can be interpreted as a sequence of logical rules as follows:
if (Assessment > 50) then ‘Pass’
else if (Assessment < 50) then
if (Assignment = Yes) then ‘Pass’
else if (Assignment = No) then ‘Fail’
Now, if a test instance is given, such as a student has scored 42 marks in his assessment and has
not submitted his assignment, then it is predicted with the decision tree that his exam result is ‘Fail’.
®
Many algorithms exist which will be studied for constructing decision trees in the sections below.

6.1.2 Fundamentals of Entropy '


Given the training dataset with a set of attributes or features, the decision tree is constructed by
finding the attribute or feature that best describes the target class for the given test instances.
The best split feature is the one which contains more information about how to split the dataset
among all features so that the target class is accurately identified for the test instances. In other
words, the best split attribute is more informative to split the dataset into sub datasets and this
process is continued until the stopping criterion is reached. This splitting should be pure at every
stage of selecting the best feature.
The best feature is selected based on the amount of information among the features which
are basically calculated on probabilities. Quantifying information is closely related to information
theory. In the field of information theory, the features are quantified by a measure called Shannon
Entropy which is calculated based on the probability distribution of the events.
Entropy is the amount of uncertainty or randomness in the outcome of a random variable or an
event. Moreover, entropy describes about the homogeneity of the data instances. The best feature
is selected based on the entropy value. For example, when a coin is flipped, head or tail are the only two outcomes, hence its entropy is lower when compared to rolling a die, which has six possible outcomes.
Hence, the interpretation is,

\
160 « Machine Learning

Higher the entropy → Higher the uncertainty
Lower the entropy → Lower the uncertainty
Similarly, if all instances are homogenous, say (1, 0), which means all instances belong to the
same class (here it is positive) or (0, 1) where all instances are negative, then the entropy is 0.
On the other hand, if the instances are equally distributed, say (0.5, 0.5), which means 50% positive
and 50% negative, then the entropy is 1. If there are 10 data instances, out of which 6 belong to
positive class and 4 belong to negative class, then the entropy is calculated as shown in Eq. (6.1),
Entropy = −[(6/10) log₂(6/10) + (4/10) log₂(4/10)]    (6.1)
It is concluded that if the dataset has instances that are completely homogeneous, then the
entropy is 0 and if the dataset has samples that are equally divided (i.e., 50% — 50%), it has an
entropy of 1. Thus, the entropy value ranges between 0 and 1 based on the randomness of the
samples in the dataset. If the entropy is 0, then the split is pure which means that all samples in
the set will partition into one class or category. But if the entropy is 1, the split is impure and the
distribution of the samples is more random. The stopping criterion is based on the entropy value.
Let P be the probability distribution of data instances from 1 to n as shown in Eq. (6.2).
P = {P_1, P_2, …, P_n}    (6.2)
Entropy of P is the information measure of this probability distribution given in Eq. (6.3),
Entropy_Info(P) = Entropy_Info(P_1, …, P_n)
= −(P_1 log₂(P_1) + P_2 log₂(P_2) + …… + P_n log₂(P_n))    (6.3)
where P_1 is the probability of data instances classified as class 1, P_2 is the probability of data instances classified as class 2, and so on.
P_1 = |No. of data instances belonging to class 1| / |Total no. of data instances in the training dataset|
Entropy_Info(P) can be computed as shown in Eq. (6.4).
For example, Entropy_Info(6, 4) is calculated as −[(6/10) log₂(6/10) + (4/10) log₂(4/10)], as in Eq. (6.1).

Mathematically, entropy is defined in Eq. (6.5) as:
Entropy_Info(X) = Σ_x Pr[X = x] · log₂(1 / Pr[X = x])    (6.5)
where Pr[X = x] is the probability of a random variable X with a possible outcome x.

Note: log₂(1 / Pr[X = x]) = −log₂(Pr[X = x])
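The entropy of a class distribution can be computed in a few lines. The following is a short sketch in Python (not from the textbook); the function name is an illustrative choice, and it reproduces the example of 6 positive and 4 negative instances above.

# Shannon entropy (in bits) of a class distribution given as raw counts.
import math

def entropy(counts):
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:                      # 0 * log2(0) is taken as 0
            p = c / total
            result -= p * math.log2(p)
    return result

print(entropy([6, 4]))   # ~0.971, matching Entropy_Info(6, 4)
print(entropy([5, 5]))   # 1.0  -> equally distributed instances
print(entropy([10, 0]))  # 0.0  -> homogeneous instances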

Algorithm 6.1: General Algorithm for Decision Trees

1. Find the best attribute from the training dataset using an attribute selection measure and place it at the root of the tree.

2. Split the training dataset into subsets based on the outcomes of the test attribute, where each subset in a branch contains the data instances or tuples with the same value for the selected test attribute.
3. Repeat step 1 and step 2 on each subset until we end up in leaf nodes in all the branches of the tree.
4. This splitting process is recursive until the stopping criterion is reached.

Stopping Criteria
The following are some of the common stopping conditions:
1. The data instances are homogeneous, which means all belong to the same class C_i, and hence its entropy is 0.
2. A node with some defined minimum number of data instances becomes a leaf (the number of data instances in a node is between 0.25 and 1.00% of the full training dataset).
3. The maximum tree depth is reached, so further splitting is not done and the node becomes a leaf node.
6.2 DECISION TREE INDUCTION ALGORITHMS
There are many decision tree algorithms, such as ID3, C4.5, CART, CHAID, QUEST, GUIDE,
CRUISE, and CTREE, that are used for classification in real-time environment. The most
commonly used decision tree algorithms are ID3 (Iterative Dichotomizer 3), developed by
J.R Quinlan in 1986, and C4.5 is an advancement of ID3 presented by the same author in 1993.
CART, that stands for Classification and Regression Trees, is another algorithm which was
developed by Breiman et al. in 1984.
The accuracy of the tree constructed depends upon the selection of the best split attribute.
Different algorithms are used for building decision trees which use different measures to decide on
the splitting criterion. Algorithms such as ID3, C4.5 and CART are popular algorithms used in the
construction of decision trees. The algorithm ID3 uses ‘Information Gain’ as the splitting criterion
whereas the algorithm C4.5 uses ‘Gain Ratio” as the splitting criterion. The CART algorithm is
popularly used for classifying both categorical and continuous-valued target variables. CART uses
GINI Index to construct a decision tree.
Decision trees constructed using ID3 and C4.5 are also called as univariate decision trees which
consider only one feature/attribute to split at each decision node whereas decision trees constructed
using CART algorithm are multivariate decision trees which consider a conjunction of univariate
splits. The details about univariate and multivariate data have been discussed in Chapter 2.

6.2.1 ID3 Tree Construction


ID3 is a supervised learning algorithm which uses a training dataset with labels and constructs a decision tree. ID3 is an example of univariate decision trees as it considers only one feature at each decision node. This leads to axis-aligned splits. The tree is then used to classify the future test instances. It constructs the tree using a greedy approach in a top-down fashion by identifying the best attribute at each level of the tree.

ID3 works well if the attributes or features are considered as discrete/categorical values. If some attributes are continuous, then the attributes or features have to be discretized into nominal attributes or features.
The algorithm builds the tree using a purity measure called 'Information Gain' with the given training data instances and then uses the constructed tree to classify the test data. It is applied for a training set with only nominal or categorical attributes and with no missing values for classification. ID3 works well for a large dataset. If the dataset is small, overfitting may occur. Moreover, it is not accurate if the dataset has missing attribute values.
No pruning is done during or after construction of the tree and it is prone to outliers. C4.5 and CART can handle both categorical attributes and continuous attributes. Both C4.5 and CART can also handle missing values, but C4.5 is prone to outliers whereas CART can handle outliers as well.

Algorithm 6.2: Procedure to Construct a Decision Tree using ID3

1. Compute Entropy_Info Eq. (6.8) for the whole training dataset based on the target attribute.
2. Compute Entropy_Info Eq. (6.9) and Information_Gain Eq. (6.10) for each of the attribute in
the training dataset.
3. Choose the attribute for which entropy is minimum and therefore the gain is maximum as
the best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the test
condition of the root node attribute. Accordingly, the training dataset is also split
into subsets.
6. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in
the subset.

Note: We stop branching a node if entropy is 0.
The best split attribute at every iteration is the attribute with the highest information gain.

Definitions
Let T be the training dataset.
Let A be the set of attributes A = {A_1, A_2, A_3, …, A_n}.
Let m be the number of classes in the training dataset.
Let P_i be the probability that a data instance or a tuple 'd' belongs to class C_i. It is calculated as,
P_i = |Total no. of data instances that belong to class C_i in T| / |Total no. of tuples in the training set T|    (6.6)
Mathematically, it is represented as shown in Eq. (6.7).
P_i = |C_i, T| / |T|    (6.7)
where |C_i, T| is the number of data instances belonging to class C_i in T.

Expected information or Entropy needed to classify a data instance 'd' in T is denoted as Entropy_Info(T), given in Eq. (6.8).
Entropy_Info(T) = −Σ_{i=1}^{m} P_i log₂(P_i)    (6.8)

Entropy of every attribute, denoted as Entropy_Info(T, A), is shown in Eq. (6.9) as:
Entropy_Info(T, A) = Σ_{i=1}^{v} (|A_i| / |T|) × Entropy_Info(A_i)    (6.9)
where the attribute A has got 'v' distinct values {a_1, a_2, …, a_v}, |A_i| is the number of instances for distinct value 'i' in attribute A, and Entropy_Info(A_i) is the entropy for that set of instances.
Information_Gain is a metric that measures how much information is gained by branching on an attribute A. In other words, it measures the reduction in impurity in an arbitrary subset of data. It is calculated as given in Eq. (6.10):
Information_Gain(A) = Entropy_Info(T) − Entropy_Info(T, A)    (6.10)
It can be noted that as entropy increases, information gain decreases. They are inversely
proportional to each other.
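Equations (6.8)–(6.10) can be coded directly from class counts. The sketch below in Python is not the book's code; the function names are illustrative, and the example counts are read off Table 6.3 for the CGPA attribute (small differences from the text's figures are due to rounding of intermediate values).

# Information gain of one attribute from per-class counts.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(class_counts, partition_counts):
    """class_counts: counts per class for the whole set T.
    partition_counts: per-class counts for each value of attribute A."""
    total = sum(class_counts)
    entropy_T = entropy(class_counts)                      # Eq. (6.8)
    entropy_T_A = sum(sum(part) / total * entropy(part)    # Eq. (6.9)
                      for part in partition_counts)
    return entropy_T - entropy_T_A                         # Eq. (6.10)

# Job Offer: 7 Yes, 3 No; CGPA splits T into >=9 (3 Yes, 1 No), >=8 (4, 0), <8 (0, 2).
print(information_gain([7, 3], [[3, 1], [4, 0], [0, 2]]))  # ~0.5568 (the text's 0.5564)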

Scan for ‘Additional Examples”

Example 6.3: Assess a student's performance during his course of study and predict whether
a student will get a job offer or not in his final year of the course. The training dataset T consists
of 10 data instances with attributes such as ‘CGPA’, ‘Interactiveness’, ‘Practical Knowledge’ and
‘Communication Skills’ as shown in Table 6.3. The target class attribute is the ‘Job Offer’.
Table 6.3: Training Dataset T

S.No.  CGPA  Interactiveness  Practical Knowledge  Communication Skills  Job Offer
1.     ≥9    Yes              Very good            Good                  Yes
2.     ≥8    No               Good                 Moderate              Yes
3.     ≥9    No               Average              Poor                  No
4.     <8    No               Average              Good                  No
5.     ≥8    Yes              Good                 Moderate              Yes
6.     ≥9    Yes              Good                 Moderate              Yes
7.     <8    Yes              Good                 Poor                  No
8.     ≥9    No               Very good            Good                  Yes
9.     ≥8    Yes              Good                 Good                  Yes
10.    ≥8    Yes              Average              Good                  Yes

Solution:
Step 1:
Calculate the Entropy for the target class ‘Job Offer’.
Entropy_Info(Target Attribute = Job Offer) = Entropy_Info(7, 3)
= −[(7/10) log₂(7/10) + (3/10) log₂(3/10)] = −(−0.3599 − 0.5208) = 0.8807
Iteration 1:
Step 2:
Calculate the Entropy_Info and Gain (Information_Gain) for each of the attributes in the training dataset.
Table 6.4 shows the number of data instances classified with Job Offer as Yes or No for the attribute CGPA.
Table 6.4: Entropy Information for CGPA

CGPA   Job Offer = Yes   Job Offer = No   Total   Entropy
≥9     3                 1                4       0.8108
≥8     4                 0                4       0
<8     0                 2                2       0

Entropy_Info(T, CGPA) = (4/10)[−(3/4) log₂(3/4) − (1/4) log₂(1/4)] + (4/10)[−(4/4) log₂(4/4) − (0/4) log₂(0/4)] + (2/10)[−(0/2) log₂(0/2) − (2/2) log₂(2/2)]
= (4/10)(0.3111 + 0.4997) + 0 + 0
= 0.3243
Gain(CGPA) = 0.8807 − 0.3243
= 0.5564
Table 6.5 shows the number of data instances classified with Job Offer as Yes or No for the
attribute Interactiveness.
Table 6.5: Entropy Information for Interactiveness
Interactiveness Job Offer = Yes Job Offer=No Total Entropy
YES 5 1 6
NO 2 2 4

Entropy_Info(T, Interactiveness) = (6/10)[−(5/6) log₂(5/6) − (1/6) log₂(1/6)] + (4/10)[−(2/4) log₂(2/4) − (2/4) log₂(2/4)]
= (6/10)(0.2191 + 0.4306) + (4/10)(0.4997 + 0.4997)
= 0.3898 + 0.3998 = 0.7896
Gain(Interactiveness) = 0.8807 — 0.7896
=0.0911
Table 6.6 shows the number of data instances classified with Job Offer as Yes or No for the
attribute Practical Knowledge.

Table 6.6: Entropy Information for Practical Knowledge


Practical Knowledge Job Offer = Yes Job Offer= No Total Entropy

Very Good 2 0 2 0

Average 1 2 3
Good 4 1 5

Entropy_Info(T, Practical Knowledge) = (2/10)[−(2/2) log₂(2/2) − (0/2) log₂(0/2)] + (3/10)[−(1/3) log₂(1/3) − (2/3) log₂(2/3)] + (5/10)[−(4/5) log₂(4/5) − (1/5) log₂(1/5)]
= (2/10)(0) + (3/10)(0.5280 + 0.3897) + (5/10)(0.2574 + 0.4641)
= 0 + 0.2753 + 0.3608
= 0.6361
Gain(Practical Knowledge) = 0.8807 — 0.6361
=0.2446
Table 6.7 shows the number of data instances classified with Job Offer as Yes or No for the
attribute Communication Skills.
Table 6.7: Entropy Information for Communication Skills

Communication Skills Job Offer = Yes Job Offer =No Total


Good 4 1 5
Moderate 3 0 3
Poor 0 2 2

Entropy_Info(T, Communication Skills) = (5/10)[−(4/5) log₂(4/5) − (1/5) log₂(1/5)] + (3/10)[−(3/3) log₂(3/3) − (0/3) log₂(0/3)] + (2/10)[−(0/2) log₂(0/2) − (2/2) log₂(2/2)]
= (5/10)(0.2574 + 0.4641) + (3/10)(0) + (2/10)(0)
= 0.3609
Gain(Communication Skills) = 0.8813 — 0.36096


=0.5203
The Gain calculated for all the attributes is shown in Table 6.8:
Table 6.8: Gain

Attribute               Gain
CGPA                    0.5564
Interactiveness         0.0911
Practical Knowledge     0.2446
Communication Skills    0.5203

Step 3: From Table 6.8, choose the attribute for which entropy is minimum and therefore the gain
is maximum as the best split attribute.
The best split attribute is CGPA since it has the maximum gain. So, we choose CGPA as the root node. There are three distinct values for CGPA with outcomes ≥9, ≥8 and <8. The entropy value is 0 for ≥8 and <8, with all instances classified as Job Offer = Yes for ≥8 and Job Offer = No for <8. Hence, both ≥8 and <8 end up in a leaf node. The tree grows with the subset of instances with CGPA ≥9 as shown in Figure 6.3.

[Figure 6.3 shows CGPA as the root node: the branches CGPA ≥8 and CGPA <8 end in the leaf nodes 'Yes' and 'No' respectively, while the branch CGPA ≥9 carries the subset of 4 data instances (rows 1, 3, 6 and 8 of Table 6.3).]

Figure 6.3: Decision Tree After Iteration 1


Now, continue the same process for the subset of data instances branched with CGPA ≥ 9.
Iteration 2:
In this iteration, the same process of computing the Entropy_Info and Gain are repeated with the
subset of training set. The subset consists of 4 data instances as shown in the above Figure 6.3.
Entropy_Info(T) = Entropy_Info(3, 1)
= −[(3/4) log₂(3/4) + (1/4) log₂(1/4)]
= −(−0.3111 − 0.4997)
= 0.8108

Entropy_Info(T, Interactiveness) = (2/4)[−(2/2) log₂(2/2) − (0/2) log₂(0/2)] + (2/4)[−(1/2) log₂(1/2) − (1/2) log₂(1/2)]
= 0 + 0.4997
Gain(Interactiveness) = 0.8108 − 0.4997
= 0.3111

Entropy_Info(T, Practical Knowledge) = (2/4)[−(2/2) log₂(2/2)] + (1/4)[−(1/1) log₂(1/1)] + (1/4)[−(1/1) log₂(1/1)]
= 0
Gain(Practical Knowledge) = 0.8108

Entropy_Info(T, Communication Skills) = (2/4)[−(2/2) log₂(2/2)] + (1/4)[−(1/1) log₂(1/1)] + (1/4)[−(1/1) log₂(1/1)]
= 0
Gain(Communication Skills) = 0.8108


The gain calculated for all the attributes is shown in Table 6.9.
Table 6.9: Total Gain
Attributes Gain
Interactiveness 0.3111
Practical Knowledge 0.8108
Communication Skills 0.8108

Here, both the attributes ‘Practical Knowledge’ and “Communication Skills” have the same
Gain. So, we can either construct the decision tree using ‘Practical Knowledge’ or ‘Communication
Skills”. The final decision tree is shown in Figure 6.4.

[Figure 6.4 shows the final decision tree: CGPA is the root node; the branches ≥8 and <8 end in the leaf nodes 'Yes' and 'No', and the ≥9 branch is split further on Practical Knowledge (or Communication Skills).]
Figure 6.4: Final Decision Tree
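A library implementation can be used to approximate this result. The following is a hedged sketch (not from the textbook) that fits an entropy-based tree on the Table 6.3 data with scikit-learn and pandas; scikit-learn grows binary, CART-style trees over one-hot encoded attributes, so the printed tree approximates rather than exactly reproduces the multiway ID3 tree above. The column names and the use of get_dummies are illustrative choices.

# Fitting an entropy-criterion tree on the Table 6.3 data with scikit-learn.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "CGPA": [">=9", ">=8", ">=9", "<8", ">=8", ">=9", "<8", ">=9", ">=8", ">=8"],
    "Interactiveness": ["Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes"],
    "PracticalKnowledge": ["Very good", "Good", "Average", "Average", "Good",
                           "Good", "Good", "Very good", "Good", "Average"],
    "CommunicationSkills": ["Good", "Moderate", "Poor", "Good", "Moderate",
                            "Moderate", "Poor", "Good", "Good", "Good"],
    "JobOffer": ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"],
})

X = pd.get_dummies(data.drop(columns="JobOffer"))   # one-hot encode the categoricals
y = data["JobOffer"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))   # textual view of the tree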


6.2.2 C4.5 Construction
C4.5 is an improvement over ID3. C4.5 works with continuous and discrete attributes and missing
values, and it also supports post-pruning. C5.0 is the successor of C4.5 and is more efficient and
used for building smaller decision trees. C4.5 works with missing values by marking as ‘?’, but
these missing attribute values are not considered in the calculations.

The algorithm C4.5 is based on Occam’s Razor which says that given two correct solutions,
the simpler solution has to be chosen. Moreover, the algorithm requires a larger training set
for better accuracy. It uses Gain Ratio as a measure during the construction of decision trees.
ID3 is more biased towards attributes with larger values. For example, if there is an attribute called
‘Register No’ for students it would be unique for every student and will have distinct value for
every data instance resulting in more values for the attribute. Hence, every instance belongs to
a category and would have higher Information Gain than other attributes. To overcome this bias
issue, C4.5 uses a purity measure Gain ratio to identify the best split attribute. In C4.5 algorithm,
the Information Gain measure used in ID3 algorithm is normalized by computing another
factor called Split_Info. This normalized information gain of an attribute called as Gain_Ratio is
computed by the ratio of the calculated Split_Info and Information Gain of each attribute. Then,
the attribute with the highest normalized information gain, that is, highest gain ratio is used as
the splitting criteria.
As an example, we will choose the same training dataset shown in Table 6.3 to construct a
decision tree using the C4.5 algorithm.
Given a Training dataset T,
The Split_Info of an attribute A is computed as given in Eq. (6.11):
Split_Info(T, A) = −Σ_{i=1}^{v} (|A_i| / |T|) × log₂(|A_i| / |T|)    (6.11)
where the attribute A has got 'v' distinct values {a_1, a_2, …, a_v}, and |A_i| is the number of instances for distinct value 'i' in attribute A.
The Gain_Ratio of an attribute A is computed as given in Eq. (6.12):
Gain_Ratio(A) = Info_Gain(A) / Split_Info(T, A)    (6.12)
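The normalization in Eqs. (6.11)–(6.12) needs only the sizes of the partitions. Below is a short sketch in Python (not the book's code, with illustrative function names), applied to the CGPA partition sizes from Table 6.3; small differences from the text's figures are due to rounding.

# Split_Info and Gain_Ratio from partition sizes, Eqs. (6.11)-(6.12).
import math

def split_info(partition_sizes):
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes if s > 0)

def gain_ratio(info_gain, partition_sizes):
    return info_gain / split_info(partition_sizes)

# CGPA partitions T into groups of sizes 4, 4 and 2; Info_Gain(CGPA) = 0.5564.
print(split_info([4, 4, 2]))          # ~1.5219 (the text's 1.5211)
print(gain_ratio(0.5564, [4, 4, 2]))  # ~0.3656 (the text's 0.3658)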

Algorithm 6.3: Procedure to Construct a Decision Tree using C4.5

1. Compute Entropy_Info Eq. (6.8) for the whole training dataset based on the target attribute.
2. Compute Entropy_Info Eq. (6.9), Info_Gain Eq. (6.10), Split_Info Eq. (6.11) and Gain_Ratio
Eq. (6.12) for each of the attribute in the training dataset.
3. Choose the attribute for which Gain_Ratio is maximum as the best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the
test condition of the root node attribute. Accordingly, the training dataset is also split
into subsets.
6. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in
the subset. .
Decision Tree Learning « 169

Example: Make use of the Information Gain of the attributes calculated using the ID3 algorithm in Example 6.3 to construct a decision tree using C4.5.
Solution:
Iteration 1:
Step 1: Calculate the Class_Entropy for the target class ‘Job Offer’.
Entropy_Info(Target Attribute = Job Offer) = Entropy_Info(7, 3)
= −[(7/10) log₂(7/10) + (3/10) log₂(3/10)]
= −(−0.3599 − 0.5208)
= 0.8807
Step 2: Calculate the Entropy_Info, Gain(Info_Gain), Split_Info, Gain_Ratio for each of the attribute
in the training dataset.
CGPA:
Entropy_Info(T, CGPA) = (4/10)[−(3/4) log₂(3/4) − (1/4) log₂(1/4)] + (4/10)[−(4/4) log₂(4/4) − (0/4) log₂(0/4)] + (2/10)[−(0/2) log₂(0/2) − (2/2) log₂(2/2)]
= (4/10)(0.3111 + 0.4997) + 0 + 0
= 0.3243
Gain(CGPA) = 0.8807 − 0.3243
= 0.5564
Split_Info(T, CGPA) = −(4/10) log₂(4/10) − (4/10) log₂(4/10) − (2/10) log₂(2/10)
= 0.5285 + 0.5285 + 0.4641
= 1.5211
Gain_Ratio(CGPA) = Gain(CGPA) / Split_Info(T, CGPA)
= 0.5564 / 1.5211
= 0.3658
Interactiveness:
Entropy_Info(T, Interactiveness) = (6/10)[−(5/6) log₂(5/6) − (1/6) log₂(1/6)] + (4/10)[−(2/4) log₂(2/4) − (2/4) log₂(2/4)]
= (6/10)(0.2191 + 0.4306) + (4/10)(0.4997 + 0.4997)
= 0.3898 + 0.3998 = 0.7896
Gain(Interactiveness) = 0.8807 − 0.7896 = 0.0911
Split_Info(T, Interactiveness) = −(6/10) log₂(6/10) − (4/10) log₂(4/10) = 0.9704
Gain_Ratio(Interactiveness) = Gain(Interactiveness) / Split_Info(T, Interactiveness)
= 0.0911 / 0.9704
= 0.0939
Practical Knowledge:
Entropy_Info(T, Practical Knowledge) = (2/10)[−(2/2) log₂(2/2) − (0/2) log₂(0/2)] + (3/10)[−(1/3) log₂(1/3) − (2/3) log₂(2/3)] + (5/10)[−(4/5) log₂(4/5) − (1/5) log₂(1/5)]
= (2/10)(0) + (3/10)(0.5280 + 0.3897) + (5/10)(0.2574 + 0.4641)
= 0 + 0.2753 + 0.3608 = 0.6361
Gain(Practical Knowledge) = 0.8807 − 0.6361
= 0.2448
Split_Info(T, Practical Knowledge) = −(2/10) log₂(2/10) − (5/10) log₂(5/10) − (3/10) log₂(3/10)
= 1.4853
Gain_Ratio(Practical Knowledge) = Gain(Practical Knowledge) / Split_Info(T, Practical Knowledge)
= 0.2448 / 1.4853
= 0.1648
Communication Skills:
Entropy_Info(T, Communication Skills) = (5/10)[−(4/5) log₂(4/5) − (1/5) log₂(1/5)] + (3/10)[−(3/3) log₂(3/3)] + (2/10)[−(2/2) log₂(2/2)]
= (5/10)(0.2574 + 0.4641) + (3/10)(0) + (2/10)(0)
= 0.3609
Gain(Communication Skills) = 0.8813 − 0.3609
= 0.5202
Split_Info(T, Communication Skills) = −(5/10) log₂(5/10) − (3/10) log₂(3/10) − (2/10) log₂(2/10)
= 1.4853
Gain_Ratio(Communication Skills) = Gain(Communication Skills) / Split_Info(T, Communication Skills)
= 0.5202 / 1.4853
= 0.3502

Table 6.10 shows the Gain_Ratio computed for all the attributes.
Table 6.10: Gain_Ratio
Attribute Gain_Ratio
CGPA 0.3658
INTERACTIVENESS 0.0939
PRACTICAL KNOWLEDGE | 0.1648
COMMUNICATION SKILLS | 0.3502

Step 3: Choose the attribute for which Gain_Ratio is maximum as the best split attribute.
From Table 6.10, we can see that CGPA has highest gain ratio and it is selected as the best split
attribute. We can construct the decision tree placing CGPA as the root node shown in Figure 6.5.
The training dataset is split into subsets with 4 data instances.

[Figure 6.5 shows CGPA as the root node; the branch CGPA ≥9 carries the subset of 4 data instances (rows 1, 3, 6 and 8 of Table 6.3), while the ≥8 and <8 branches end in leaf nodes.]

Figure 6.5: Decision Tree after Iteration 1


Iteration 2:
Total Samples: 4
Repeat the same process for this resultant dataset with 4 data instances.
Job Offer has 3 instances as Yes and 1 instance as No.

Entropy_Info(Target Class = Job Offer) = −[(3/4) log₂(3/4) + (1/4) log₂(1/4)]
= 0.3112 + 0.5
= 0.8112
Interactiveness:
Entropy_Info(T, Interactiveness) = (2/4)[−(2/2) log₂(2/2) − (0/2) log₂(0/2)] + (2/4)[−(1/2) log₂(1/2) − (1/2) log₂(1/2)]
= 0 + 0.5
Gain(Interactiveness) = 0.8112 − 0.5 = 0.3112

Split_Info(T, Interactiveness) = −(2/4) log₂(2/4) − (2/4) log₂(2/4) = 0.5 + 0.5 = 1
Gain_Ratio(Interactiveness) = Gain(Interactiveness) / Split_Info(T, Interactiveness)
= 0.3112 / 1 = 0.3112
Practical Knowledge:
Entropy_Info(T, Practical Knowledge) = (2/4)[−(2/2) log₂(2/2) − (0/2) log₂(0/2)] + (1/4)[−(1/1) log₂(1/1)] + (1/4)[−(1/1) log₂(1/1)]
= 0
Gain(Practical Knowledge) = 0.8112
Split_Info(T, Practical Knowledge) = −(2/4) log₂(2/4) − (1/4) log₂(1/4) − (1/4) log₂(1/4) = 1.5
Gain_Ratio(Practical Knowledge) = Gain(Practical Knowledge) / Split_Info(T, Practical Knowledge)
= 0.8112 / 1.5 = 0.5408
Communication Skills:
Entropy_Info(T, Communication Skills) = (2/4)[−(2/2) log₂(2/2) − (0/2) log₂(0/2)] + (1/4)[−(1/1) log₂(1/1)] + (1/4)[−(1/1) log₂(1/1)]
= 0
Gain(Communication Skills) = 0.8112
Split_Info(T, Communication Skills) = −(2/4) log₂(2/4) − (1/4) log₂(1/4) − (1/4) log₂(1/4) = 1.5
Gain_Ratio(Communication Skills) = Gain(Communication Skills) / Split_Info(T, Communication Skills)
= 0.8112 / 1.5 = 0.5408

Table 6.11 shows the Gain_Ratio computed for all the attributes.
Table 6.11: Gain-Ratio
Attributes Gain_Ratio
Interactiveness 0.3112
Practical Knowledge 0.5408
Communication Skills 0.5408

Both “Practical Knowledge’ and ‘Communication Skills’ have the highest gain ratio. So, the best
splitting attribute can either be ‘Practical Knowledge’ or ‘Communication Skills’, and therefore, the
split can be based on any one of these.
Here, we split based on ‘Practical Knowledge’. The final decision tree is shown in Figure 6.6.

[Figure 6.6 shows the final decision tree: CGPA is the root node; the branches ≥8 and <8 end in leaf nodes, and the ≥9 branch is split on Practical Knowledge, with the 'Average' value leading to 'No'.]
Figure 6.6: Final Decision Tree


Dealing with Continuous Attributes in C4.5
The C4.5 algorithm is further improved by considering attributes which are continuous, and a continuous attribute is discretized by finding a split point or threshold. When an attribute 'A' has numerical values which are continuous, a threshold or best split point 's' is found such that the set of values is categorized into two sets, A < s and A ≥ s. The best split point is the attribute value which has maximum information gain for that attribute.
Now, let us consider the set of continuous values for the attribute CGPA in the sample dataset
as shown in Table 6.12.
Table 6.12: Sample Dataset
S.No. CGPA Job Offer
1 9.5 Yes
2 8.2 Yes
3. 9.1 No
4. 6.8 No
5. 8.5 Yes
6. 9.5 Yes
7 7.9 No
8. 9.1 Yes
9. 8.8 Yes
10. 8.8 Yes

First, sort the values in ascending order.
6.8   7.9   8.2   8.5   8.8   8.8   9.1   9.1   9.5   9.5
Remove the duplicates and consider only the unique values of the attribute.
6.8   7.9   8.2   8.5   8.8   9.1   9.5
Now, compute the Gain for the distinct values of this continuous attribute. Table 6.13 shows
the computed values.
For each candidate split point s, the instances are partitioned into CGPA ≤ s and CGPA > s.
Table 6.13: Gain Values for CGPA
Split point     6.8      7.9      8.2      8.5      8.8      9.1      9.5
Entropy_Info    0.6873   0.4346   0.6892   0.7898   0.8749   0.7630   0.8808
Gain            0.1935   0.4462   0.1916   0.091    0.0059   0.1178   0

For a sample, the calculations are shown below for a single split point, say CGPA = 6.8 (i.e., CGPA ≤ 6.8 versus CGPA > 6.8).
Entropy_Info(T, Job_Offer) = −[(7/10) log₂(7/10) + (3/10) log₂(3/10)]
= −(−0.3599 − 0.5209)
= 0.8808
Entropy(7, 2) = −[(7/9) log₂(7/9) + (2/9) log₂(2/9)]
= −(−0.2818 − 0.4819)
= 0.7637
Entropy_Info(T, CGPA = 6.8) = (1/10) × Entropy(0, 1) + (9/10) × Entropy(7, 2)
= 0 + (9/10)(0.7637)
= 0.6873
Gain(CGPA = 6.8) = 0.8808 − 0.6873
= 0.1935
Similarly, the calculations are done for each of the distinct values of the attribute CGPA and a table is created. Now, the value of CGPA with maximum gain is chosen as the threshold value or the best split point. From Table 6.13, we can observe that CGPA = 7.9 has the maximum gain of 0.4462. Hence, CGPA = 7.9 is chosen as the split point. Now, we can discretize the continuous values of CGPA into two categories, CGPA ≤ 7.9 and CGPA > 7.9. The resulting discretized instances are shown in Table 6.14.

Table 6.14: Discretized Instances


S.No.  CGPA Continuous   CGPA Discretized   Job Offer
1.     9.5               >7.9               Yes
2.     8.2               >7.9               Yes
3.     9.1               >7.9               No
4.     6.8               ≤7.9               No
5.     8.5               >7.9               Yes
6.     9.5               >7.9               Yes
7.     7.9               ≤7.9               No
8.     9.1               >7.9               Yes
9.     8.8               >7.9               Yes
10.    8.8               >7.9               Yes
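The threshold search can be automated by trying every unique value and keeping the one with maximum gain. The sketch below (Python, not the book's code; the helper names are illustrative) uses the CGPA/Job Offer sample of Table 6.12 and recovers the split point 7.9.

# Finding the best split point of a continuous attribute by information gain.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split_point(values, labels):
    best_s, best_gain = None, -1.0
    base = entropy(labels)
    for s in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= s]
        right = [l for v, l in zip(values, labels) if v > s]
        info = sum(len(part) / len(labels) * entropy(part)
                   for part in (left, right) if part)
        gain = base - info
        if gain > best_gain:
            best_s, best_gain = s, gain
    return best_s, best_gain

cgpa = [9.5, 8.2, 9.1, 6.8, 8.5, 9.5, 7.9, 9.1, 8.8, 8.8]
job_offer = ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]
print(best_split_point(cgpa, job_offer))   # -> (7.9, ~0.446), as in Table 6.13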

6.2.3 Classification and Regression Trees Construction


The Classification and Regression Trees (CART) algorithm is a multivariate decision tree learning
used for classifying both categorical and continuous-valued target variables. CART algorithm is
an example of multivariate decision trees that gives oblique splits. It solves both classification and
regression problems. If the target feature is categorical, it constructs a classification tree and if the
target feature is continuous, it constructs a regression tree. CART uses GINI Index to construct a
decision tree. GINI Index measures the impurity of a set of data instances in terms of the proportion of instances belonging to each class. It constructs the tree as a binary tree by recursively splitting a node into two nodes. Therefore, even if an attribute has more than two possible values, GINI Index is calculated for all subsets of the attribute values and the subset which gives the minimum value is selected as the best split subset. For example, if an attribute A has three distinct values, say {a_1, a_2, a_3}, the possible subsets are { }, {a_1}, {a_2}, {a_3}, {a_1, a_2}, {a_1, a_3}, {a_2, a_3}, and {a_1, a_2, a_3}. So, if an attribute has 3 distinct values, the number of possible subsets is 2³, which means 8. Excluding the empty set { } and the full set {a_1, a_2, a_3}, we have 6 subsets. With 6 subsets, we can form three possible combinations such as:
{a_1} with {a_2, a_3}
{a_2} with {a_1, a_3}
{a_3} with {a_1, a_2}
Hence, in this CART algorithm, we need to compute the best splitting attribute and the best split subset of the chosen attribute.
Lower the GINI Index value, higher is the homogeneity of the data instances.
Gini_Index(T) is computed as given in Eq. (6.13).
Gini_Index(T) = 1 − Σ_{i=1}^{m} P_i²    (6.13)
where P_i is the probability that a data instance or a tuple 'd' belongs to class C_i. It is computed as:
P_i = |No. of data instances belonging to class i| / |Total no. of data instances in the training dataset T|
GINI Index assumes a binary split on each attribute; therefore, every attribute is considered as a binary attribute which splits the data instances into two subsets S_1 and S_2.

Gini_Index(T, A) is computed as given in Eq. (6.14).
Gini_Index(T, A) = (|S_1| / |T|) Gini(S_1) + (|S_2| / |T|) Gini(S_2)    (6.14)
The splitting subset with minimum Gini_Index is chosen as the best splitting subset for an attribute. The best splitting attribute is chosen by the minimum Gini_Index, which is otherwise the maximum ΔGini, because it reduces the impurity the most.
ΔGini is computed as given in Eq. (6.15):
ΔGini(A) = Gini(T) − Gini(T, A)    (6.15)
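Equations (6.13)–(6.15) are easy to verify numerically. The following Python sketch is not the book's code (function names are illustrative); the class counts come from Table 6.3 and the binary split CGPA ∈ {≥9, ≥8} versus {<8}.

# Gini index of a node and of a binary split, plus the corresponding delta-Gini.
def gini(counts):
    """Gini index of a node given its per-class counts, Eq. (6.13)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    """Weighted Gini index of a binary split, Eq. (6.14)."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    total = n_left + n_right
    return (n_left / total) * gini(left_counts) + (n_right / total) * gini(right_counts)

# 7 Yes / 3 No overall; CGPA in {>=9, >=8}: (7 Yes, 1 No); CGPA in {<8}: (0 Yes, 2 No).
g_T = gini([7, 3])                       # 0.42
g_split = gini_split([7, 1], [0, 2])     # ~0.1755
print(g_T, g_split, g_T - g_split)       # delta-Gini ~0.2445, Eq. (6.15)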

Algorithm 6.4: Procedure to Construct a Decision Tree using CART

1. Compute Gini_Index Eq. (6.13) for the whole training dataset based on the target attribute.
2. Compute Gini_Index for each of the attributes Eq. (6.14) and for the subsets of each attribute in the training dataset.
3. Choose the best splitting subset which has minimum Gini_Index for an attribute.
4. Compute ΔGini Eq. (6.15) for the best splitting subset of that attribute.
5. Choose the best splitting attribute that has maximum ΔGini.
6. The best split attribute with the best split subset is placed as the root node.
7. The root node is branched into two subtrees with each subtree an outcome of the test condition of the root node attribute. Accordingly, the training dataset is also split into two subsets.
8. Recursively apply the same operation for the subset of the training set with the remaining attributes until a leaf node is derived or no more training instances are available in the subset.
Example: Choose the same training dataset shown in Table 6.3 and construct a decision tree using the CART algorithm.
Solution:
Step 1: Calculate the Gini_Index for the dataset shown in Table 6.3, which consists of 10 data instances. The target attribute 'Job Offer' has 7 instances as Yes and 3 instances as No.
Gini_Index(T) = 1 − (7/10)² − (3/10)²
= 1 − 0.49 − 0.09
= 1 − 0.58
Gini_Index(T) = 0.42
Step 2: Compute Gini_Index for each of the attributes and each of the subsets in the attribute. CGPA has 3 categories, so there are 6 subsets and hence 3 combinations of subsets (as shown in Table 6.15).

Table 6.15: Categories of CGPA
CGPA   Job Offer = Yes   Job Offer = No
≥9     3                 1
≥8     4                 0
<8     0                 2

Gini_Index(T, CGPA ∈ {≥9, ≥8}) = 1 − (7/8)² − (1/8)²
= 1 − 0.7806
= 0.2194
Gini_Index(T, CGPA ∈ {<8}) = 1 − (0/2)² − (2/2)²
= 1 − 1
= 0
Gini_Index(T, CGPA ∈ {(≥9, ≥8), <8}) = (8/10) × 0.2194 + (2/10) × 0
= 0.17552
Gini_Index(T, CGPA ∈ {≥9, <8}) = 1 − (3/6)² − (3/6)²
= 1 − 0.5 = 0.5
Gini_Index(T, CGPA ∈ {≥8}) = 1 − (4/4)² − (0/4)²
= 1 − 1 = 0
Gini_Index(T, CGPA ∈ {(≥9, <8), ≥8}) = (6/10) × 0.5 + (4/10) × 0
= 0.3
Gini_Index(T, CGPA ∈ {≥8, <8}) = 1 − (4/6)² − (2/6)²
= 1 − 0.555
= 0.445
Gini_Index(T, CGPA ∈ {≥9}) = 1 − (3/4)² − (1/4)²
= 1 − 0.625
= 0.375
Gini_Index(T, CGPA ∈ {(≥8, <8), ≥9}) = (6/10) × 0.445 + (4/10) × 0.375
= 0.417
Table 6.16 shows the Gini_Index for the 3 subsets of CGPA.
Table 6.16: Gini_Index of CGPA
Subsets                 Gini_Index
{≥9, ≥8} | {<8}         0.1755
{≥9, <8} | {≥8}         0.3
{≥8, <8} | {≥9}         0.417
Step 3: Choose the best splitting subset which has minimum Gini_Index for an attribute.
The subset CGPA ∈ {(≥9, ≥8), <8}, which has the lowest Gini_Index value of 0.1755, is chosen as the best splitting subset.

Step 4: Compute ΔGini for the best splitting subset of that attribute.

ΔGini(CGPA) = Gini(T) − Gini(T, CGPA)
= 0.42 − 0.1755
= 0.2445
Repeat the same process for the remaining attributes in the dataset, such as for Interactiveness shown in Table 6.17, Practical Knowledge in Table 6.18, and Communication Skills in Table 6.20.
Table 6.17: Categories for Interactiveness
Interactiveness   Job Offer = Yes   Job Offer = No
Yes               5                 1
No                2                 2

Gini_Index(T, Interactiveness ∈ {Yes}) = 1 − (5/6)² − (1/6)²
= 1 − 0.72
= 0.28
Gini_Index(T, Interactiveness ∈ {No}) = 1 − (2/4)² − (2/4)²
= 1 − 0.5
= 0.5
Gini_Index(T, Interactiveness ∈ {Yes, No}) = (6/10)(0.28) + (4/10)(0.5)
= 0.168 + 0.2
= 0.368
ΔGini(Interactiveness) = Gini(T) − Gini(T, Interactiveness)
= 0.42 − 0.368
= 0.052
Table 6.18: Categories for Practical Knowledge
Practical Knowledge   Job Offer = Yes   Job Offer = No
Very Good             2                 0
Good                  4                 1
Average               1                 2

Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}) = 1 − (6/7)² − (1/7)²
= 1 − 0.7544
= 0.2456
Gini_Index(T, Practical Knowledge ∈ {Average}) = 1 − (1/3)² − (2/3)²
= 1 − 0.555 = 0.445

Gini_Index(T, Practical Knowledge ∈ {(Very Good, Good), Average}) = (7/10) × 0.2456 + (3/10) × 0.445
= 0.3054
Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}) = 1 − (3/5)² − (2/5)²
= 1 − 0.52
= 0.48
Gini_Index(T, Practical Knowledge ∈ {Good}) = 1 − (4/5)² − (1/5)²
= 1 − 0.68
= 0.32
Gini_Index(T, Practical Knowledge ∈ {(Very Good, Average), Good}) = (5/10) × 0.48 + (5/10) × 0.32
= 0.40
Gini_Index(T, Practical Knowledge ∈ {Good, Average}) = 1 − (5/8)² − (3/8)²
= 1 − 0.5312 = 0.4688
Gini_Index(T, Practical Knowledge ∈ {Very Good}) = 1 − (2/2)² − (0/2)²
= 1 − 1 = 0
Gini_Index(T, Practical Knowledge ∈ {(Good, Average), Very Good}) = (8/10) × 0.4688 + (2/10) × 0
= 0.3750
Table 6.19 shows the Gini_Index for various subsets of Practical Knowledge.
Table 6.19: Gini_Index for Practical Knowledge
Subsets                            Gini_Index
{Very Good, Good} | {Average}      0.3054
{Very Good, Average} | {Good}      0.40
{Good, Average} | {Very Good}      0.3750

ΔGini(Practical Knowledge) = Gini(T) − Gini(T, Practical Knowledge)
= 0.42 − 0.3054 = 0.1146
Table 6.20: Categories for Communication Skills

Communication Skills Job Offer = Yes Job Offer = No


Good 4 1
Moderate 3 0
Poor 0 2
Gini_Index(T, Communication Skills ∈ {Good, Moderate}) = 1 − (7/8)² − (1/8)²
= 1 − 0.7806
= 0.2194
Gini_Index(T, Communication Skills ∈ {Poor}) = 1 − (0/2)² − (2/2)² = 0
Gini_Index(T, Communication Skills ∈ {(Good, Moderate), Poor}) = (8/10) × 0.2194 + (2/10) × 0
= 0.1755
Gini_Index(T, Communication Skills ∈ {Good, Poor}) = 1 − (4/7)² − (3/7)²
= 1 − 0.5101
= 0.4899
Gini_Index(T, Communication Skills ∈ {Moderate}) = 1 − (3/3)² − (0/3)²
= 1 − 1 = 0
Gini_Index(T, Communication Skills ∈ {(Good, Poor), Moderate}) = (7/10) × 0.4899 + (3/10) × 0
= 0.3429
Gini_Index(T, Communication Skills ∈ {Moderate, Poor}) = 1 − (3/5)² − (2/5)²
= 1 − 0.52
= 0.48
Gini_Index(T, Communication Skills ∈ {Good}) = 1 − (4/5)² − (1/5)²
= 1 − 0.68
= 0.32
Gini_Index(T, Communication Skills ∈ {(Moderate, Poor), Good}) = (5/10) × 0.48 + (5/10) × 0.32
= 0.40
Table 6.21 shows the Gini_Index for various subsets of Communication Skills.
Table 6.21: Gini-Index for Subsets of Communication Skills
Subsets Gini_Index
(Good, Moderate) Poor 0.1755
(Good, Poor) Moderate 0.3429
(Moderate, Poor) Good 0.40

ΔGini(Communication Skills) = Gini(T) − Gini(T, Communication Skills)
= 0.42 − 0.1755
= 0.2445
Table 6.22 shows the Gini_Index and AGini values calculated for all the attributes.
Table 6.22: Gini_Index and ΔGini for all Attributes
Attribute              Gini_Index   ΔGini
CGPA                   0.1755       0.2445
Interactiveness        0.368        0.052
Practical Knowledge    0.3054       0.1146
Communication Skills   0.1755       0.2445

Step 5: Choose the best splitting attribute that has maximum AGini.
CGPA and Communication Skills have the highest AGini value. We can choose CGPA as the
root node and split the datasets into two subsets shown in Figure 6.7 since the tree constructed by
CART is a binary tree.

[Figure 6.7 shows CGPA as the root node with the binary split CGPA ∈ {≥9, ≥8} versus CGPA ∈ {<8}: the <8 branch ends in the leaf 'No', while the {≥9, ≥8} branch carries the subset of 8 data instances listed in Table 6.23.]

Figure 6.7: Decision Tree after Iteration 1


Iteration 2:
In the second iteration, the dataset has 8 data instances as shown in Table 6.23. Repeat the same
process to find the best splitting attribute and the splitting subset for that attribute.

Table 6.23: Subset of the Training Dataset after Iteration 1

S.No.  CGPA  Interactiveness  Practical Knowledge  Communication Skills  Job Offer
1.     ≥9    Yes              Very good            Good                  Yes
2.     ≥8    No               Good                 Moderate              Yes
3.     ≥9    No               Average              Poor                  No
5.     ≥8    Yes              Good                 Moderate              Yes
6.     ≥9    Yes              Good                 Moderate              Yes
8.     ≥9    No               Very good            Good                  Yes
9.     ≥8    Yes              Good                 Good                  Yes
10.    ≥8    Yes              Average              Good                  Yes

Gini_Index(T) = 1 − (7/8)² − (1/8)²
= 1 − 0.766 − 0.0156
Gini_Index(T) = 0.2184
Tables 6.24, 6.25, and 6.27 show the categories for attributes Interactiveness, Practical Knowledge,
and Communication Skills, respectively.
Table 6.24: Categories for Interactiveness
Interactiveness Job Offer = Yes Job Offer = No
Yes 5 0
No 2 1

Gini_Index(T, Interactiveness ∈ {Yes}) = 1 − (5/5)² − (0/5)²
= 1 − 1 = 0
Gini_Index(T, Interactiveness ∈ {No}) = 1 − (2/3)² − (1/3)²
= 1 − 0.44 − 0.111 = 0.449
Gini_Index(T, Interactiveness ∈ {Yes, No}) = (5/8) × 0 + (3/8) × 0.449
= 0.168
ΔGini(Interactiveness) = Gini(T) − Gini(T, Interactiveness)
= 0.2184 − 0.168 = 0.050

Table 6.25: Categories for Practical Knowledge


Practical Knowledge Job Offer = Yes Job Offer = No
Very Good 2 0
Good 4 0
Average 1 1

Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}) = 1 − (6/6)² − (0/6)²
= 1 − 1 = 0
Gini_Index(T, Practical Knowledge ∈ {Average}) = 1 − (1/2)² − (1/2)²
= 1 − 0.25 − 0.25
= 0.5
Gini_Index(T, Practical Knowledge ∈ {(Very Good, Good), Average}) = (6/8) × 0 + (2/8) × 0.5
= 0.125
Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}) = 1 − (3/4)² − (1/4)²
= 1 − 0.5625 − 0.0625
= 0.375
Gini_Index(T, Practical Knowledge ∈ {Good}) = 1 − (4/4)² − (0/4)²
= 1 − 1 = 0
Gini_Index(T, Practical Knowledge ∈ {(Very Good, Average), Good}) = (4/8) × 0.375 + (4/8) × 0
= 0.1875
Gini_Index(T, Practical Knowledge ∈ {Good, Average}) = 1 − (5/6)² − (1/6)²
= 1 − 0.694 − 0.028
= 0.278
Gini_Index(T, Practical Knowledge ∈ {Very Good}) = 1 − (2/2)² − (0/2)²
= 1 − 1 = 0
Gini_Index(T, Practical Knowledge ∈ {(Good, Average), Very Good}) = (6/8) × 0.278 + (2/8) × 0
= 0.2085
Table 6.26 shows the Gini_Index values for various subsets of Practical Knowledge.
Table 6.26: Gini_Index for Subsets of Practical Knowledge
Subsets                            Gini_Index
{Very Good, Good} | {Average}      0.125
{Very Good, Average} | {Good}      0.1875
{Good, Average} | {Very Good}      0.2085

ΔGini(Practical Knowledge) = Gini(T) − Gini(T, Practical Knowledge)
= 0.2184 − 0.125
= 0.0934
Table 6.27: Categories for Communication Skills
Communication Skills   Job Offer = Yes   Job Offer = No
Good                   4                 0
Moderate               3                 0
Poor                   0                 1

Gini_Index(T, Communication Skills ∈ {Good, Moderate}) = 1 − (7/7)² − (0/7)²
= 1 − 1 = 0
Gini_Index(T, Communication Skills ∈ {Poor}) = 1 − (0/1)² − (1/1)² = 0
Gini_Index(T, Communication Skills ∈ {(Good, Moderate), Poor}) = (7/8) × 0 + (1/8) × 0
= 0
Gini_Index(T, Communication Skills ∈ {Good, Poor}) = 1 − (4/5)² − (1/5)²
= 1 − 0.64 − 0.04
= 0.32
Gini_Index(T, Communication Skills ∈ {Moderate}) = 1 − (3/3)² − (0/3)²
= 1 − 1 = 0
Gini_Index(T, Communication Skills ∈ {(Good, Poor), Moderate}) = (5/8) × 0.32 + (3/8) × 0
= 0.2
Gini_Index(T, Communication Skills ∈ {Moderate, Poor}) = 1 − (3/4)² − (1/4)²
= 1 − 0.5625 − 0.0625
= 0.375
Gini_Index(T, Communication Skills ∈ {Good}) = 1 − (4/4)² − (0/4)²
= 1 − 1 = 0
Gini_Index(T, Communication Skills ∈ {(Moderate, Poor), Good}) = (4/8) × 0.375 + (4/8) × 0
= 0.1875

Table 6.28 shows the Gini_Index for subsets of Communication Skills.


Table 6.28: Gini_Index for Subsets of Communication Skills
Subsets                           Gini_Index
{Good, Moderate} | {Poor}         0
{Good, Poor} | {Moderate}         0.2
{Moderate, Poor} | {Good}         0.1875

ΔGini(Communication Skills) = Gini(T) − Gini(T, Communication Skills)
= 0.2184 − 0 = 0.2184
Table 6.29 shows the Gini_Index and AGini values for all attributes.
Table 6.29: Gini_Index and ΔGini Values for All Attributes

Attribute              Gini_Index   ΔGini
Interactiveness        0.168        0.050
Practical Knowledge    0.125        0.0934
Communication Skills   0            0.2184

Communication Skills has the highest AGini value. The tree is further branched based on the
attribute ‘Communication Skills'. Here, we see all branches end up in a leaf node and the process
of construction is completed. The final tree is shown in Figure 6.8.

Figure 6.8: Final Tree


6.2.4 Regression Trees

Regression trees are a variant of decision trees where the target feature is a continuous-valued variable. These trees can be constructed using an algorithm called reduction in variance which uses standard deviation to choose the best splitting attribute.

Algorithm 6.5: Procedure for Constructing Regression Trees

1. Compute the standard deviation of the target attribute for the whole training dataset.
2. Compute the standard deviation of the target attribute for the data instances of each distinct value of an attribute.
3. Compute the weighted standard deviation for each attribute.

4. Compute standard deviation reduction by subtracting weighted standard deviation for each
attribute from standard deviation of each attribute.
5. Choose the attribute with a higher standard deviation reduction as the best split attribute.
6. The best split attribute is placed as the root node.
7. The root node is branched into subtrees with each subtree as an outcome of the test condition
of the root node attribute. Accordingly, the training dataset is also split into different subsets.
8. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in the subset.
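The core quantity in this algorithm, the standard deviation reduction (SDR) of one attribute, can be sketched as follows in Python. This is not the book's code (names are illustrative); it uses the sample standard deviation, which is what the worked example below effectively computes, and the groups correspond to Assessment = Good / Average / Poor in Table 6.30.

# Standard deviation reduction of one attribute for a regression tree split.
import statistics

def weighted_std(groups):
    """Weighted sample standard deviation over the groups of target values
    induced by one attribute's distinct values."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * statistics.stdev(g) for g in groups)

def sdr(target, groups):
    """SDR = stdev(whole target set) - weighted stdev over the groups."""
    return statistics.stdev(target) - weighted_std(groups)

result = [95, 70, 75, 45, 98, 80, 75, 65, 58, 89]               # target from Table 6.30
by_assessment = [[95, 75, 98, 75, 89], [70, 80, 58], [45, 65]]  # Good / Average / Poor
print(sdr(result, by_assessment))   # ~4.96, matching the text's 4.97 up to rounding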

Example: Construct a regression tree using the following Table 6.30 which consists of 10 data instances and 3 attributes 'Assessment', 'Assignment' and 'Project'. The target attribute is the 'Result' which is a continuous attribute.
Table 6.30: Training Dataset

S.No.  Assessment  Assignment  Project  Result (%)


1. | Good Yes Yes 95
2. | Average Yes No 70
3. | Good No Yes 75
4. | Poor No No 45
5. | Good Yes Yes 98
6. | Average No Yes 80
7. | Good No No 75
8. | Poor Yes Yes 65
9. | Average No No 58
10. | Good Yes Yes 89

Solution:
Step 1: Compute the standard deviation of the target attribute for the whole dataset (the sample standard deviation, with division by n − 1, is used here):
Average = (95 + 70 + 75 + 45 + 98 + 80 + 75 + 65 + 58 + 89)/10 = 75
Standard Deviation = √{[(95 − 75)² + (70 − 75)² + (75 − 75)² + (45 − 75)² + (98 − 75)² + (80 − 75)² + (75 − 75)² + (65 − 75)² + (58 − 75)² + (89 − 75)²] / (10 − 1)}
= 16.55
Assessment = Good (Table 6.31)
Table 6.31: Attribute Assessment = Good
S.No.  Assessment  Assignment  Project  Result (%)
1.     Good        Yes         Yes      95
3.     Good        No          Yes      75
5.     Good        Yes         Yes      98
7.     Good        No          No       75
10.    Good        Yes         Yes      89

Average = (95 + 75 + 98 + 75 + 89)/5 = 86.4
Standard Deviation = √{[(95 − 86.4)² + (75 − 86.4)² + (98 − 86.4)² + (75 − 86.4)² + (89 − 86.4)²] / 4}
= 10.9
Assessment = Average (Table 6.32)
Table 6.32: Attribute Assessment = Average
S.No. Assessment Assignment Project Result (%)
2. | Average Yes No 70
6. | Average No Yes 80
9. Average No No 58
Average = (70 + 80 + 58)/3 = 69.3
Standard Deviation = √{[(70 − 69.3)² + (80 − 69.3)² + (58 − 69.3)²] / 2} = 11.01

Assessment = Poor (Table 6.33)


Table 6.33: Attribute Assessment = Poor
S.No. Assessment Assignment Project Result(%)
4. | Poor No No 45
8. | Poor Yes Yes 65
Average = (45 + 65)/2 = 55
Standard Deviation = √{[(45 − 55)² + (65 − 55)²] / 1} = 14.14


Table 6.34 shows the standard deviation and data instances for the attribute-Assessment.
Table 6.34: Standard Deviation for Assessment
Assessment Standard Deviation Data Instances
Good 10.9 5
Average 11.01 3
Poor 14.14 2

Weighted standard deviation for Assessment = (5/10) × 10.9 + (3/10) × 11.01 + (2/10) × 14.14
= 11.58
Standard deviation reduction for Assessment = 16.55 — 11.58 = 4.97
Assignment = Yes (Table 6.35)
Table 6.35: Assignment = Yes
S.No. Assessment Assignment Project Result (%)
1. | Good Yes Yes 95
2. | Average Yes No 70
5. | Good Yes Yes 98
8. | Poor Yes Yes 65
10. | Good Yes Yes 89
Average = (95 + 70 + 98 + 65 + 89)/5 = 83.4
Standard Deviation = √{[(95 − 83.4)² + (70 − 83.4)² + (98 − 83.4)² + (65 − 83.4)² + (89 − 83.4)²] / 4}
= 14.98
Assignment = No (Table 6.36)
Table 6.36: Assignment = No

S.No. Assessment Assignment Project Result (%)


3. | Good No Yes 75
4 Poor No No 45

6. Average No Yes 80
7. | Good No No 75
9. | Average No No 58
Average = (75 + 45 + 80 + 75 + 58)/5 = 66.6
Standard Deviation = √{[(75 − 66.6)² + (45 − 66.6)² + (80 − 66.6)² + (75 − 66.6)² + (58 − 66.6)²] / 4}
= 14.7
Table 6.37 shows the Standard Deviation and Data Instances for attribute, Assignment.
Table 6.37: Standard Deviation for Assignment

Assignment   Standard Deviation   Data Instances


Yes 14.98 5
No 14.7 5

Weighted standard deviation for Assignment = (5/10) × 14.98 + (5/10) × 14.7 = 14.84

Standard deviation reduction for Assignment = 16.55 - 14.84 = 1.71


Project = Yes (Table 6.38)
Table 6.38: Project = Yes
S.No. Assessment Assignment Project Result (%)
1. | Good Yes Yes 95
3. | Good No Yes 75
5. | Good Yes Yes 98
6. | Average No Yes 80
8. | Poor Yes Yes 65
10. | Good Yes Yes 89
Decision Tree Learning « 189

Average = (95 + 75 + 98 + 80 + 65 + 89)/6 = 83.7
Standard Deviation = √{[(95 − 83.7)² + (75 − 83.7)² + (98 − 83.7)² + (80 − 83.7)² + (65 − 83.7)² + (89 − 83.7)²] / 5}
= 12.6

Project = No (Table 6.39)


Table 6.39: Project = No
S.No. Assessment Assignment Project Result (%)
2. | Average Yes No 70
4. | Poor No No 45
7. | Good No No 75
9. | Average No No 58

Average = (70 + 45 + 75 + 58)/4 = 62
Standard Deviation = √{[(70 − 62)² + (45 − 62)² + (75 − 62)² + (58 − 62)²] / 3}
= 13.39
Table 6.40 shows the Standard Deviation and Data Instances for attribute, Project.
Table 6.40: Standard Deviation for Project

Project Standard Deviation Data Instances


Yes 12.6 6
No 13.39 4

Weighted standard deviation for Project = (6/10) × 12.6 + (4/10) × 13.39 = 12.92

Standard deviation reduction for Project = 16.55 − 12.92 = 3.63


Table 6.41 shows the standard deviation reduction for each attribute in the training dataset.
Table 6.41: Standard Deviation Reduction for Each Attribute
Attributes Standard Deviation Reduction
Assessment 4.97
Assignment   1.71
Project 3.63

The attribute 'Assessment' has the maximum Standard Deviation Reduction and hence it is
chosen as the best splitting attribute.
The training dataset is split into subsets based on the attribute ‘Assessment’ and this process is
continued until the entire tree is constructed. Figure 6.9 shows the regression tree with ‘Assessment’
as the root node and the subsets in each branch.

[Figure 6.9 shows 'Assessment' as the root node with three branches: Good (rows 1, 3, 5, 7 and 10), Average (rows 2, 6 and 9) and Poor (rows 4 and 8), each branch carrying the corresponding subset of Table 6.30.]

Figure 6.9: Regression Tree with Assessment as Root Node


The rest of regression tree construction can be done as an exercise.
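For comparison, a library regression tree can be fitted to the same data. The sketch below (Python, not from the book) uses scikit-learn's DecisionTreeRegressor, which minimizes squared error (variance) rather than the standard-deviation-reduction criterion used above, so its splits may differ slightly; the column names and one-hot encoding are illustrative choices.

# Fitting a regression tree to the Table 6.30 data with scikit-learn.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

data = pd.DataFrame({
    "Assessment": ["Good", "Average", "Good", "Poor", "Good",
                   "Average", "Good", "Poor", "Average", "Good"],
    "Assignment": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
    "Project":    ["Yes", "No", "Yes", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
    "Result":     [95, 70, 75, 45, 98, 80, 75, 65, 58, 89],
})

X = pd.get_dummies(data.drop(columns="Result"))   # one-hot encode the categoricals
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, data["Result"])
print(export_text(reg, feature_names=list(X.columns)))   # textual view of the tree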

6.3 VALIDATING AND PRUNING OF DECISION TREES


Inductive bias refers to a set of assumptions about the domain knowledge added to the training data to perform induction, that is, to construct a general model out of the training data. A bias is
generally required as without it induction is not possible, since the training data can normally be
generalized to a larger hypothesis space. Inductive bias in ID3 algorithm is the one that prefers the
first acceptable shorter trees over larger trees, and when selecting the best split attribute during
construction, attributes with high information gain are chosen. Thus, even though ID3 searches
a large space of decision trees, it constructs only a single decision tree when there may exist
many alternate decision trees for the same training data. It applies a hill-climbing search that does
not backtrack and may finally converge to a locally optimal solution that is not globally optimal.
The shorter tree is preferred using Occam’s razor principle which states that the simplest solution
is the best solution.
Overfitting is also a general problem with decision trees. Once the decision tree is constructed,
it must be validated for better accuracy and to avoid over-fitting and under-fitting. There is always
a tradeoff between accuracy and complexity of the tree. The tree must be simple and accurate. If the
tree is more complex, it can classify the data instances accurately for the training set, but when test data is given, the constructed tree may perform poorly, which means misclassifications are higher and accuracy is reduced. This problem is called overfitting.
To avoid overfitting of the tree, we need to prune the trees and construct an optimal decision
tree. Trees can be pre-pruned or post-pruned. If tree nodes are pruned during construction or
the construction is stopped earlier without exploring the nodes' branches, then it is called
as pre-pruning whereas if tree nodes are pruned after the construction is over then it is called
as post-pruning. Basically, the dataset is split into three sets called training dataset, validation
dataset and test dataset. Generally, 40% of the dataset is used for training the decision tree and
the remaining 60% is used for validation and testing. Once the decision tree is constructed, it is
validated with the validation dataset and the misclassifications are identified. Using the number of


instances correctly classified and number of instances wrongly classified, Average Squared Error
(ASE) is computed. The tree nodes are pruned based on these computations and the resulting tree
is validated until we get a tree that performs better. Cross validation is another way to construct
an optimal decision tree. Here, the dataset is split into k folds, among which k−1 folds are used for training the decision tree and the kth fold is used for validation, and the errors are computed. The process is repeated with a different fold held out each time, and the mean of the errors is computed for the different trees. The tree with the lowest error is chosen, with which the performance of the tree is improved.
This tree can now be tested with the test dataset and predictions are made.
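In practice, post-pruning combined with cross validation is available in standard libraries. The following hedged sketch (Python, not the book's code) uses scikit-learn's cost-complexity pruning path and 5-fold cross validation on a synthetic dataset generated with make_classification; the dataset and parameter choices are illustrative only.

# Selecting a pruning strength (ccp_alpha) by cross validation in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

base = DecisionTreeClassifier(criterion="entropy", random_state=0)
path = base.cost_complexity_pruning_path(X, y)        # candidate alpha values

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=alpha, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()  # k = 5 folds
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=best_alpha,
                                random_state=0).fit(X, y)
print(best_alpha, best_score, pruned.get_depth())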
Another approach is that after the tree is constructed using the training set, statistical tests like
error estimation and Chi-square test are used to estimate whether pruning or splitting is required
for a particular node to find a better accurate tree.
The third approach is using a principle called Minimum Description Length which uses
a complexity measure for encoding the training set and the growth of the decision tree is stopped
when the encoding size (i.e., (size(tree)) + size(misclassifications(tree)) is minimized. CART and
C4.5 perform post-pruning, that is, pruning the tree to a smaller size after construction in order
to minimize the misclassification error. CART makes use of 10-fold cross validation method to
validate and prune the trees, whereas C4.5 uses heuristic formula to estimate misclassification
error rates.
Some of the tree pruning methods are listed below:
1. Reduced Error Pruning
2. Minimum Error Pruning (MEP)
3. Pessimistic Pruning
4. Error-based Pruning (EBP)
5. Optimal Pruning
6. Minimum Description Length (MDL) Pruning
7. Minimum Message Length Pruning
8. Critical Value Pruning
Summary
1. The decision tree learning model performs an Inductive inference that reaches a general conclusion
from observed examples.
2. The decision tree learning model generates a complete hypothesis space in the form of a tree structure.
3. A decision tree has a structure that consists of a root node, internal nodes/decision nodes, branches,
and terminal nodes/leaf nodes.
4. Every path from root to a leaf node represents a logical rule that corresponds to a conjunction of test
attributes and the whole tree represents a disjunction of these conjunctions.
5. A decision tree consists of two major procedures, namely building the tree and knowledge inference
or classification.
6. A decision tree is constructed by finding the attribute or feature that best describes the target class
for the given test instances.
7. Decision tree algorithms such as ID3, C4.5, CART, CHAID, QUEST, GUIDE, CRUISE, and CTREE are used for classification.
8. The univariate decision tree algorithm ID3 uses ‘Information Gain’ as the splitting
criterion whereas
algorithm C4.5 uses ‘Gain Ratio’ as the splitting criterion.

9. The multivariate decision tree algorithm called Classification and Regression Trees (CART) is popularly used for classifying both categorical and continuous-valued target variables.
10. CART uses GINI Index to construct a decision tree.
11. ID3 works well if the attributes or features are considered as discrete/categorical values,

12. C4.5 and CART can handle both categorical attributes and continuous attributes.
13. Both C4.5 and CART can also handle missing values. C4.5 is prone to outliers but CART can handle outliers too.
14. The C45 algorithm is further improved by considering attributes which are continuous, and a
continuous attribute is discretized by finding a split point or threshold.
15. The algorithm C4.5 is based on Occam’s Razor which says that given two correct solutions, the
simpler solution must be chosen.
16. Regression trees are a variant of decision trees where the target feature is a continuous-valued
variable.

Entropy - It is the amount of uncertainty or randomness in the outcome of a random variable or an


event.

Information Gain - It is a metric that measures how much information is gained by branching on
an attribute.
Gain Ratio - It is the normalized information gain computed by the ratio of Split_Info and Gain of
an attribute.
GINI Index - It is defined as the number of data instances for a class or it is the proportion of
instances.
Inductive Bias - It refers to a set of assumptions added to the training data in order to perform
induction.
Pre-Pruning - It is a process of pruning the tree nodes during construction or if the construction is
stopped earlier without exploring the nodes branches.
Post-Pruning - It is a process of pruning the tree nodes after the construction is over.

Review Questions
1. How does the structure of a decision tree help in classifying a data instance?
2. What are the different metrics used in deciding the splitting attribute?
3. Define Entropy.
4. Relate Entropy and Information Gain.
5. How does a C4.5 algorithm perform better than ID3? What metric is used in the algorithm?

6. What is CART?
7. How does CART solve regression problems?
8. What is meant by pre-pruning and post-pruning? Compare both the methods.
9. How are continuous attributes discretized?
10. Consider the training dataset shown in Table 6.42. Discretize the continuous attribute ‘Percentage’.
Table 6.42: Training Dataset
S.No. Percentage Award
1. 95 Yes
2. 80 Yes
3. 72 No
4. 65 Yes
5. 95 Yes
6. 32 No
7 66 No
8. 54 No
9. 89 Yes
10. 72 Yes

11. Consider the training dataset in Table 6.43. Construct decision trees using ID3, C45, and CART.
Table 6.43: Training Dataset
S.No.  Assessment  Assignment  Project  Seminar  Result
1. | Good Yes Yes Good Pass
2. | Average Yes No Poor Fail
3. | Good No Yes Good Pass
4. | Poor No No Poor Fail
5. | Good Yes Yes Good Pass
6. | Average No Yes Good Pass
7. | Good No No Fair Pass
8. | Poor Yes Yes Good Fail
9. | Average No No Poor Fail
10. | Good Yes Yes Fair Pass

1. Decision tree is based on _____ (Induction/Deduction).
2. Internal node is also known as _____ node.
3. CART uses GINI index for splitting. (True/False)
4. ID3 uses _____ as splitting criteria.
5. Entropy is a measure of _____ in the system.
6. C4.5 is not prone to outlier problems. (True/False)
7. Decision tree generates the hypothesis space in terms of decision _____.
8. Two phases of a decision tree are construction of a decision tree and inference or classification. (Yes/No)
9. Terminal nodes are also known as _____ nodes.
10. Decision tree performs inductive _____ to reach a general conclusion from observed examples.
11. CART uses gain ratio as a criterion for splitting. (True/False)
12. Occam's razor states that a complex solution is _____ (better/worse) than a simpler solution.
13. CART is not a decision tree. (True/False)
14. A decision tree consists of a root node, internal nodes and terminal nodes. (Yes/No)
15. Missing value problem cannot affect C4.5 algorithm. (True/False)
16. Post-pruning is removal of nodes _____ (prior/after) the construction of a tree.
17. Categorical variables as target variables are not allowed in CART algorithm. (True/False)
18. Every path from a root node to a leaf node constitutes a decision _____.
Find and mark the words listed below.
Information gain, Induction, Inference, Orderliness, Decision, Rule, Tree, Worse, After, True, False, Yes