

Machine Learning - Classification Task

Prof. Dr. Dewan Md. Farid

Dept. of Computer Science & Engineering,


United International University

June 29, 2024

Outline

k-NN classifier

Naı̈ve Bayes Classifier

Decision Tree Induction

Tree Pruning & Scalability

Evaluating Classifiers Performance


Classification

Classification is a machine learning task that describes and distinguishes data classes or concepts. The goal of classification is to accurately predict the class labels of instances whose attribute values are known but whose class values are unknown. It is a form of data analysis that extracts models (called classifiers) describing important data classes. It is a two-step process:
Learning step (or training phase): a classification model (classifier) is constructed. A classification algorithm builds the classifier by analysing a training dataset made up of instances and their associated class labels.
Classification step: the classification model is used to predict class labels for given data.


Classification Accuracy

- The classification accuracy of a classifier on a given test set is the percentage of test set instances that are correctly classified by the classifier.
- If the accuracy of the classifier is considered acceptable, the classifier can be used to classify future data instances for which the class label is not known.


Data Instance

- In the context of classification in machine learning, instances can be referred to as data points, examples, tuples, samples, or objects. The instances making up the training set are referred to as training instances and are randomly sampled from the database under analysis.
- Given a dataset, D = {x1, x2, · · · , xn}, each instance, xi, is represented by an n-dimensional attribute vector, xi = {xi1, xi2, · · · , xin}.
- The dataset, D, contains the attributes {A1, A2, · · · , An}.
- Each attribute, Ai, takes values from {Ai1, Ai2, · · · , Aih}, which represent the possible feature values of xi.
- The dataset, D, is associated with a set of classes C = {c1, c2, · · · , cm}.
- Each instance, xi, belongs to a predefined class label, ci.


Table: Commonly used symbols and terms.


Symbol Term
D Training Data
xi A data instance
X A subset of instances
Aj A feature
aji A feature’s value
cl A class label
kNN k-Nearest-Neighbor
NB Naı̈ve Bayes
DT Decision Tree


k-nearest-neighbor (kNN) classifier

- kNN is a simple classifier that uses distance measurement techniques widely used in pattern recognition.
- kNN finds the k instances, X = {x1, x2, · · · , xk} ∈ Dtraining, that are closest to the test instance, xtest, and assigns the most frequent class label among X, cl → xtest.
- When a classification is to be made for a new instance, xnew, its distance to each instance xj ∈ Dtraining must be determined.
- Only the k closest instances, X ∈ Dtraining, are considered further.
- Closeness is defined in terms of a distance metric, such as Euclidean distance.


kNN classifier (con.)

Figure: Nearest-neighbour classification using the 11-NN rule; the point denoted by a "star" is assigned to the majority class among its 11 nearest neighbours.


Euclidean Distance

The Euclidean distance between two points, x1 = (x11, x12, · · · , x1n) and x2 = (x21, x22, · · · , x2n), is shown in Eq. 1.

$dist(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$  (1)


Distance Measures
The distance between two points in the plane with coordinates (x, y) and (a, b) is given by:

$EuclideanDistance((x, y), (a, b)) = \sqrt{(x - a)^2 + (y - b)^2}$  (2)

$ManhattanDistance((x, y), (a, b)) = |x - a| + |y - b|$  (3)

Figure: Euclidean and Manhattan distance.
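The two distance measures can be written directly in code. Below is a minimal Python sketch (the function names euclidean_distance and manhattan_distance are ours, not from the slides) that generalises Eq. 2 and Eq. 3 to n-dimensional points.

```python
import math
from typing import Sequence

def euclidean_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Eq. 2 generalised to n dimensions: square root of the sum of squared differences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Eq. 3 generalised to n dimensions: sum of absolute differences."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# Example: distance between the points (x, y) = (1, 2) and (a, b) = (4, 6).
print(euclidean_distance([1, 2], [4, 6]))  # 5.0
print(manhattan_distance([1, 2], [4, 6]))  # 7
```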


kNN classifier (con.)

- For the kNN classifier, the unknown instance, xunknown, is assigned the most common class, cl, among its k nearest neighbours.
- k is chosen to be odd for two-class classification and, in general, not to be a multiple of the number of classes M.
- Usually, kNN achieves good results when the dataset is large.
- The value of k should be larger when classifying noisy data.
- Majority voting over the k nearest neighbours also helps to deal with noisy instances.
- Algorithm 1 outlines the k-nearest-neighbor algorithm.


Algorithm 1 k-nearest-neighbor classifier

Input: D = {x1, · · · , xi, · · · , xn}, a test instance xtest, and the number of neighbours k.
Output: A class label for xtest.
Method:
1: find X ⊆ D containing the k nearest neighbours of xtest, regardless of class label, cl.
2: out of these instances, X = {x1, x2, · · · , xk}, identify the number of instances, ki, that belong to class cl, l = 1, 2, · · · , M. Obviously, $\sum_i k_i = k$.
3: assign xtest to the class cl with the maximum number ki of instances.
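A minimal Python sketch of Algorithm 1 follows. It is an illustrative brute-force implementation, not the lecturer's code: it computes the Euclidean distance from the test instance to every training instance, keeps the k closest, and returns the majority class among them.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x_test, k=3):
    """Brute-force k-nearest-neighbour prediction (Algorithm 1)."""
    # Step 1: distance from x_test to every training instance.
    distances = [(math.dist(x_i, x_test), y_i) for x_i, y_i in zip(train_X, train_y)]
    # Step 2: keep the k closest instances and count their class labels.
    k_nearest = sorted(distances, key=lambda d: d[0])[:k]
    votes = Counter(label for _, label in k_nearest)
    # Step 3: assign the most frequent class label among the k neighbours.
    return votes.most_common(1)[0][0]
```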


Disadvantage of kNN classifier

- The main disadvantage of the kNN classifier is that it is a lazy learner, i.e. it does not learn anything from the training data and simply uses the training data itself for classification.
- A serious drawback of the kNN technique is the complexity, O(kN), of searching for the nearest neighbour(s) among the N available training samples. Although, due to its asymptotic error performance, the kNN rule achieves good results when the dataset is large, the performance of the classifier may degrade dramatically when the number N of training instances is relatively small.


kNN - An Illustrative Example


Table: Data for Height Classification.
Name Gender Height Output
Kristina F 1.6 m Short
Jim M 2m Tall
Maggie F 1.9 m Medium
Martha F 1.88 m Medium
Stephanie F 1.7 m Short
Bob M 1.85 m Medium
Kathy F 1.6 m Short
Dave M 1.7 m Short
Worth M 2.2 m Tall
Steven M 2.1 m Tall
Debbie F 1.8 m Medium
Todd M 1.95 m Medium
Kim F 1.9 m Medium
Amy F 1.8 m Medium
Wynette F 1.75 m Medium


An Illustrative Example (con.)

- Using the sample data from Table 2 and the Output column as the training set output value, we classify the instance (Pat, F, 1.6).
- Only the height is used for distance calculation, so both the Euclidean and Manhattan distance measures yield the same results; that is, the distance is simply the absolute value of the difference between the height values.
- Suppose that k = 5 is given. The k nearest neighbours to the input instance are then (Kristina, F, 1.6), (Kathy, F, 1.6), (Stephanie, F, 1.7), (Dave, M, 1.7), and (Wynette, F, 1.75).
- Of these five items, four are classified as Short and one as Medium. Thus, kNN will classify Pat as Short (a small code check follows below).
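As a quick check of this worked example, the knn_predict sketch given earlier can be applied to the height data; the variable names below are ours, and only the Height column is used as a feature, as in the slide.

```python
# Heights and class labels copied from the table above, in the same order.
heights = [1.6, 2.0, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 2.2, 2.1, 1.8, 1.95, 1.9, 1.8, 1.75]
labels = ["Short", "Tall", "Medium", "Medium", "Short", "Medium", "Short", "Short",
          "Tall", "Tall", "Medium", "Medium", "Medium", "Medium", "Medium"]

train_X = [[h] for h in heights]          # each instance is a 1-dimensional point

print(knn_predict(train_X, labels, [1.6], k=5))   # Short, as in the worked example
```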


Naı̈ve Bayes Classifier

- A naïve Bayes (NB) classifier is a simple probabilistic classifier based on: (a) Bayes' theorem, (b) strong (naïve) independence assumptions, and (c) independent feature models.
- It is an important data mining classifier for pattern classification and is applied in many real-world classification problems because of its high classification performance.
- An NB classifier can easily handle missing attribute values by simply omitting the corresponding probabilities for those attributes when calculating the likelihood of membership for each class.
- The NB classifier requires class conditional independence, i.e. the effect of an attribute on a given class is independent of those of the other attributes.


Advantages of NB classifier

The naïve Bayes (NB) classifier has several advantages:
1. Easy to use.
2. Only one scan of the training data is required.
3. Handles missing attribute values.
4. Handles continuous data.
5. High classification performance.


NB classifier
The NB classifier predicts that the instance X belongs to the class Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i. The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.

$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$  (4)

In Bayes' theorem, shown in Eq. 4, P(X) is constant for all classes, so only P(X|Ci)P(Ci) needs to be maximized. If the class prior probabilities are not known, it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = · · · = P(Cm), and therefore only P(X|Ci) is maximized. Otherwise, P(X|Ci)P(Ci) is maximized. The class prior probabilities are calculated by P(Ci) = |Ci,D|/|D|, where |Ci,D| is the number of training instances belonging to the class Ci in D.


NB classifier (con.)
Computing P(X|Ci) in a dataset with many attributes is extremely computationally expensive. Thus, the naïve assumption of class conditional independence is made in order to reduce the computation involved in evaluating P(X|Ci): the attributes are assumed to be conditionally independent of one another, given the class label of the instance. Thus, Eq. 5 and Eq. 6 are used to compute P(X|Ci).

$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$  (5)

$P(X|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$  (6)

In Eq. 5, xk refers to the value of attribute Ak for instance X. These probabilities P(x1|Ci), P(x2|Ci), · · · , P(xn|Ci) can easily be estimated from the training instances.


NB classifier (con.)
Moreover, the attributes in training datasets can be categorical or continuous-valued. If the attribute Ak is categorical, then P(xk|Ci) is the number of instances of class Ci ∈ D with the value xk for Ak, divided by |Ci,D|, the number of instances belonging to class Ci ∈ D. If Ak is continuous-valued, then Ak is typically assumed to have a Gaussian distribution with mean µ and standard deviation σ, defined respectively by the following two equations:

$P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$  (7)

$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$  (8)

In Eq. 7, µCi is the mean and σCi is the standard deviation of the values of attribute Ak for the training instances of class Ci. These two quantities are plugged into Eq. 8, together with xk, in order to estimate P(xk|Ci).

NB classifier (con.)

To predict the class label of instance X, P(X|Ci)P(Ci) is evaluated for each class Ci ∈ D. The NB classifier predicts that the class label of instance X is the class Ci if and only if

$P(X|C_i)P(C_i) > P(X|C_j)P(C_j)$  (9)

In Eq. 9, 1 ≤ j ≤ m and j ≠ i. That is, the predicted class label is the class Ci for which P(X|Ci)P(Ci) is maximum. Algorithm 2 outlines the naïve Bayes classifier algorithm.


Algorithm 2 Naı̈ve Bayes classifier


Input: D = {x1 , x2 , · · · , xn } // Training data.
Output: A naı̈ve Bayes Model.
Method:
1: for each class, Ci ∈ D, do
2: Find the prior probabilities, P(Ci ).
3: end for
4: for each attribute, Ai ∈ D, do
5: for each attribute value, Aij ∈ Ai , do
6: Find the class conditional probabilities, P(Aij |Ci ).
7: end for
8: end for
9: for each instance, xi ∈ D, do
10: Find the posterior probability, P(Ci |xi );
11: end for
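A minimal Python sketch of Algorithm 2 for categorical attributes is shown below. The class name NaiveBayesCategorical and its methods are our own illustrative choices, not from the slides or any library: fit estimates priors and class-conditional probabilities by counting, and predict scores a test instance with P(Ci) multiplied by the product of the conditional probabilities.

```python
from collections import Counter, defaultdict

class NaiveBayesCategorical:
    """Naive Bayes for categorical attributes (Algorithm 2), without smoothing."""

    def fit(self, X, y):
        n = len(y)
        self.classes = sorted(set(y))
        self.class_counts = Counter(y)
        # Prior probabilities P(Ci) = |Ci,D| / |D|.
        self.priors = {c: self.class_counts[c] / n for c in self.classes}
        # Counts behind the class-conditional probabilities P(Aij | Ci),
        # keyed by (class label, attribute index).
        self.value_counts = defaultdict(Counter)
        for xi, ci in zip(X, y):
            for j, value in enumerate(xi):
                self.value_counts[(ci, j)][value] += 1
        return self

    def predict(self, x):
        best_class, best_score = None, -1.0
        for c in self.classes:
            # Posterior score P(Ci) * prod_k P(xk | Ci); ties broken by class order.
            score = self.priors[c]
            for j, value in enumerate(x):
                score *= self.value_counts[(c, j)][value] / self.class_counts[c]
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

Applied to the 14 playing-tennis instances from the illustrative example that follows, this sketch multiplies the same unsmoothed probabilities as the worked calculation and therefore reproduces the Play = No prediction for D1 (Sunny, Hot, High, Weak).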


Laplace Correction

- A zero probability cancels the effects of all the other (posteriori) probabilities (on Ci) involved in the product.
- We can assume that our training database, D, is so large that adding one to each count we need would make only a negligible difference in the estimated probability value, yet would conveniently avoid probability values of zero (a small numeric illustration follows below).
- This estimation technique is known as the Laplacian correction or Laplace estimator, named after Pierre Laplace, a French mathematician who lived from 1749 to 1827.
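One hedged way to apply the correction to the NaiveBayesCategorical sketch above is to add one to each numerator count and the number of distinct values of the attribute to the denominator (a standard form of Laplace smoothing; the helper name below is ours).

```python
def smoothed_conditional(count_value_in_class, count_class, n_distinct_values):
    """Laplace-corrected estimate of P(value | class).

    Adds 1 to the numerator count and the number of distinct attribute values
    to the denominator, so no estimate is ever exactly zero.
    """
    return (count_value_in_class + 1) / (count_class + n_distinct_values)

# Example: P(Outlook = Overcast | Play = No) without correction is 0/5 = 0;
# with the correction (3 distinct Outlook values) it becomes (0 + 1) / (5 + 3) = 0.125.
print(smoothed_conditional(0, 5, 3))  # 0.125
```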


NB classifier - An Illustrative Example

To illustrate the operation of the naïve Bayes (NB) classifier, we consider the small dataset in Table 3, described by four attributes, namely Outlook, Temperature, Humidity, and Wind, which represent the weather condition of a particular day. Each attribute has several unique attribute values. The Play column in Table 3 represents the class category of each instance. It indicates whether a particular weather condition is suitable for playing tennis or not.


Table: The playing tennis dataset


Day Outlook Temperature Humidity Wind Play
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No


Table: Prior probabilities for each class generated using the playing tennis dataset
Probability Value
P(Play = Yes) 9/14 = 0.642
P(Play = No) 5/14 = 0.357

Table: Conditional probabilities for Outlook calculated using the playing tennis dataset
Probability Value
P(Outlook = Sunny|Play = Yes) 2/9 = 0.222
P(Outlook = Sunny|Play = No) 3/5 = 0.6
P(Outlook = Overcast|Play = Yes) 4/9 = 0.444
P(Outlook = Overcast|Play = No) 0/5 = 0.0
P(Outlook = Rain|Play = Yes) 3/9 = 0.333
P(Outlook = Rain|Play = No) 2/5 = 0.4


Table: Conditional probabilities for Temperature, Humidity, and Wind calculated using the playing tennis dataset
Probability Value
P(Temperature = Hot|Play = Yes) 2/9 = 0.222
P(Temperature = Hot|Play = No) 2/5 = 0.4
P(Temperature = Mild|Play = Yes) 4/9 = 0.444
P(Temperature = Mild|Play = No) 2/5 = 0.4
P(Temperature = Cool|Play = Yes) 3/9 = 0.333
P(Temperature = Cool|Play = No) 1/5 = 0.2
P(Humidity = High|Play = Yes) 3/9 = 0.333
P(Humidity = High|Play = No) 4/5 = 0.8
P(Humidity = Normal|Play = Yes) 6/9 = 0.666
P(Humidity = Normal|Play = No) 1/5 = 0.2
P(Wind = Weak|Play = Yes) 6/9 = 0.666
P(Wind = Weak|Play = No) 2/5 = 0.4
P(Wind = Strong |Play = Yes) 3/9 = 0.333
P(Wind = Strong |Play = No) 3/5 = 0.6


Using these probabilities, we obtain

P(D1|Play = Yes) = P(Outlook = Sunny|Play = Yes) × P(Temperature = Hot|Play = Yes) × P(Humidity = High|Play = Yes) × P(Wind = Weak|Play = Yes) = 0.222 × 0.222 × 0.333 × 0.666 = 0.0109.

Similarly,

P(D1|Play = No) = P(Outlook = Sunny|Play = No) × P(Temperature = Hot|Play = No) × P(Humidity = High|Play = No) × P(Wind = Weak|Play = No) = 0.6 × 0.4 × 0.8 × 0.4 = 0.0768.

To find the class, Ci, that maximises P(X|Ci)P(Ci), we compute:

P(D1|Play = Yes)P(Play = Yes) = 0.0109 × 0.642 = 0.00699
P(D1|Play = No)P(Play = No) = 0.0768 × 0.357 = 0.0274

Therefore, the naïve Bayes classifier predicts Play = No for instance D1.

Note: As P(Outlook = Overcast|Play = No) = 0/5 = 0, all instances with Outlook = Overcast will be classified as Yes (D3, D7, D12, and D13 will be Yes).


Tree

Figure: A picture of a tree.


Decision Tree
Decision tree (DT) induction is the learning of decision trees from class-labeled training instances using a top-down, recursive, divide-and-conquer algorithm. The goal of DT induction is to create a model (classifier) that predicts the value of the target class for an unseen test instance based on several input features. DTs have various advantages:
1. Simple to understand.
2. Easy to implement.
3. Require little prior knowledge.
4. Able to handle both numerical and categorical data.
5. Robust.
6. Deal with large and noisy datasets.
7. Nonlinear relationships between features do not affect tree performance.


Iterative Dichotomiser 3 (ID3)


The goal of DT induction is to iteratively partition the data into smaller subsets until all the subsets belong to a single class or the stopping criteria of the tree-building process are met.
- The ID3 (Iterative Dichotomiser 3) algorithm uses information theory to select the best feature, Aj. The Aj with the maximum Information Gain is chosen as the root node of the tree.
- To classify an instance, xi ∈ D, the average amount of information needed to identify its class, cl, is shown in Eq. 10, where pi is the probability that an instance in D belongs to class ci and is estimated by |ci,D|/|D|, and the sum runs over the M classes.

$Info(D) = -\sum_{i=1}^{M} p_i \log_2(p_i)$  (10)


ID3 (con.)

InfoA(D) is the expected information required to correctly classify xi ∈ D based on the partitioning by Aj. Eq. 11 shows the InfoA(D) calculation, where |Dj|/|D| acts as the weight of the jth partition.

$Info_A(D) = \sum_{j=1}^{n} \frac{|D_j|}{|D|} \times Info(D_j)$  (11)

Information gain is defined as the difference between Info(D) and InfoA(D), as shown in Eq. 12.

$Gain(A) = Info(D) - Info_A(D)$  (12)
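A small Python sketch of Eq. 10 to Eq. 12 is given below; the function names are ours. It computes Info(D) from a list of class labels, the expected information InfoA(D) of a split on one attribute column, and the resulting gain.

```python
import math
from collections import Counter

def info(labels):
    """Info(D), Eq. 10: entropy of the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_attribute(values, labels):
    """Info_A(D), Eq. 11: weighted entropy of the partitions induced by attribute A."""
    n = len(labels)
    total = 0.0
    for v, size in Counter(values).items():
        subset = [lab for val, lab in zip(values, labels) if val == v]
        total += (size / n) * info(subset)
    return total

def gain(values, labels):
    """Gain(A), Eq. 12: information gain of splitting on attribute A."""
    return info(labels) - info_attribute(values, labels)
```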


C4.5
Quinlan later presented C4.5 (a successor of the ID3 algorithm), which became a benchmark in supervised learning algorithms. C4.5 uses an extension of Information Gain known as Gain Ratio. It applies a kind of normalisation of Information Gain using a Split Information value defined analogously to Info(D), as shown in Eq. 13.

$SplitInfo_A(D) = -\sum_{j=1}^{n} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$  (13)

The Aj with the maximum Gain Ratio, defined in Eq. 14, is selected as the splitting feature.

$GainRatio(A) = \frac{Gain(A)}{SplitInfo(A)}$  (14)


Gini Index
The Gini Index is used in the Classification and Regression Trees (CART) algorithm, which generates a binary classification tree for decision making. It measures the impurity of D, a data partition or subset X, as shown in Eq. 15, where pi is the probability that an instance in D belongs to class ci and is estimated by |ci,D|/|D|. The sum is computed over the M classes.

$Gini(D) = 1 - \sum_{i=1}^{M} p_i^2$  (15)

The Gini Index considers a binary split: a weighted sum of the impurity of each resulting partition. For example, if a binary split on A partitions D into D1 and D2, the Gini Index of D given that partitioning is shown in Eq. 16.

$Gini_A(D) = \frac{|D_1|}{|D|}\,Gini(D_1) + \frac{|D_2|}{|D|}\,Gini(D_2)$  (16)


Gini Index (con.)

For each Aj, each of the possible binary splits is considered. The Aj that maximises the reduction in impurity is selected as the splitting feature, as shown in Eq. 17.

$\Delta Gini(A) = Gini(D) - Gini_A(D)$  (17)

The time and space complexity of a tree depends on the size of the dataset, the number of features, and the size of the generated tree. The key disadvantage of DTs is that, without proper pruning (or limiting tree growth), they tend to overfit the training data.
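The Gini computations of Eq. 15 to Eq. 17 can be sketched in the same style as the entropy helpers above (function names are our own):

```python
from collections import Counter

def gini(labels):
    """Gini(D), Eq. 15: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_binary_split(labels_d1, labels_d2):
    """Gini_A(D), Eq. 16: weighted Gini of a binary split of D into D1 and D2."""
    n = len(labels_d1) + len(labels_d2)
    return (len(labels_d1) / n) * gini(labels_d1) + (len(labels_d2) / n) * gini(labels_d2)

def gini_reduction(labels, labels_d1, labels_d2):
    """Delta Gini(A), Eq. 17: impurity reduction achieved by the split."""
    return gini(labels) - gini_binary_split(labels_d1, labels_d2)
```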


Algorithm 3 Decision Tree Induction

Input: D = {x1, · · · , xi, · · · , xN}
Output: A decision tree, DT.
Method:
1: DT = ∅;
2: find the root node with the best splitting attribute, Aj ∈ D;
3: DT = create the root node;
4: DT = add an arc to the root node for each split predicate and label;
5: for each arc do
6:   Dj created by applying the splitting predicate to D;
7:   if stopping point reached for this path then
8:     DT′ = create a leaf node and label it with cl;
9:   else
10:    DT′ = DTBuild(Dj);
11:  end if
12:  DT = add DT′ to arc;
13: end for


Tree using ID3 - An Illustrative Example

- To illustrate the operation of DT induction, we consider the small dataset in Table 7, described by four attributes, namely Outlook, Temperature, Humidity, and Wind, which represent the weather condition of a particular day.
- Each attribute has several unique attribute values.
- The Play column in Table 7 represents the class category of each instance. It indicates whether a particular weather condition is suitable for playing tennis or not.


Table: The playing tennis dataset


Day Outlook Temperature Humidity Wind Play
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No


Gain Calculation

   
$Info(D) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$

$Info_{Outlook}(D) = \frac{5}{14}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right) + \frac{4}{14}\left(-\frac{4}{4}\log_2\frac{4}{4}\right) + \frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) = 0.694$

Therefore, the gain in information from such a partitioning would be:

$Gain(Outlook) = Info(D) - Info_{Outlook}(D) = 0.940 - 0.694 = 0.246$


Gain Calculation (con.)


 
$Info_{Temperature}(D) = \frac{4}{14}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) + \frac{6}{14}\left(-\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6}\right) + \frac{4}{14}\left(-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) = 0.911$

Therefore, the gain in information from such a partitioning would be:

$Gain(Temperature) = Info(D) - Info_{Temperature}(D) = 0.940 - 0.911 = 0.029$

So, the Information Gain of Outlook = 0.246, Temperature = 0.029, Humidity = 0.151, and Wind = 0.048. Outlook has the highest gain and is therefore selected as the root node of the tree; a code check of these figures follows below.
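These gain figures can be checked with the info/gain helpers sketched earlier; the attribute columns below are copied from Table 7.

```python
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
temperature = ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
               "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

print(round(info(play), 3))                   # 0.94
print(round(info_attribute(outlook, play), 3))  # 0.694
print(round(gain(outlook, play), 3))          # 0.247 (0.246 on the slide, which rounds intermediates)
print(round(gain(temperature, play), 3))      # 0.029
```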


Table: The playing tennis sub-dataset, Outlook=Sunny


Day Temperature Humidity Wind Play
D1 Hot High Weak No
D2 Hot High Strong No
D8 Mild High Weak No
D9 Cool Normal Weak Yes
D11 Mild Normal Strong Yes

Table: The playing tennis sub-dataset, Outlook=Overcast


Day Temperature Humidity Wind Play
D3 Hot High Weak Yes
D7 Cool Normal Strong Yes
D12 Mild High Strong Yes
D13 Hot Normal Weak Yes


Table: The playing tennis sub-dataset, Outlook=Rain


Day Temperature Humidity Wind Play
D4 Mild High Weak Yes
D5 Cool Normal Weak Yes
D6 Cool Normal Strong No
D10 Mild Normal Weak Yes
D14 Mild High Strong No


Decision Tree

Figure: A DT generated by the playing tennis dataset.


Tree Pruning

- Tree pruning methods address the problem of overfitting the data.
- When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise.
- There are two common approaches to tree pruning:
  1. Pre-pruning
  2. Post-pruning


Pre-pruning

- In pre-pruning, a tree is pruned by halting its construction early (e.g. by deciding not to further split or partition the subset of training instances at a given node).
- Upon halting, the node becomes a leaf.
- The leaf may hold the most frequent class among the subset instances or the probability distribution of those instances.
- In choosing an appropriate threshold, high thresholds could result in oversimplified trees, whereas low thresholds could result in very little simplification.


Post-pruning

- Post-pruning is a commonly used tree pruning approach that removes subtrees from a fully grown tree.
- A subtree at a given node is pruned by removing its branches and replacing it with a leaf.
- The leaf is labeled with the most frequent class among the instances of the subtree being replaced.
- Post-pruning requires more computation than pre-pruning, yet generally leads to a more reliable tree.


Repetition & Replication

- No single pruning method has been found to be superior to all others.
- Although some pruning methods depend on the availability of additional data for pruning, this is usually not a concern when dealing with large databases.
- Decision trees can suffer from repetition and replication.
  Repetition occurs when an attribute is repeatedly tested along a given branch of the tree.
  Replication occurs when duplicate subtrees exist within the tree.
  These situations can impede the accuracy and comprehensibility of a decision tree.


Tree Pruning by Cost Complexity


I The cost complexity pruning algorithm used in CART (Classification
and Regression Trees) is an example of the post-pruning approach.
I This approach considers the cost complexity of a tree to be a
function of the number of leaves in the tree and the error rate of
the tree (where the error rate is the percentage of instances
misclassified by the tree).
I It starts from the bottom of the tree.
I For each internal node, N, it computes the cost complexity of the
subtree at N, and the cost complexity of the subtree at N if it were
to be pruned (i.e., replaced by a leaf node).
I The two values are compared. If pruning the subtree at node N
would result in a smaller cost complexity, then the subtree is pruned.
Otherwise, it is kept.


Pruning Set

- A pruning set of class-labeled instances is used to estimate cost complexity.
- This set is independent of the training set used to build the unpruned tree and of any test set used for accuracy estimation.
- The algorithm generates a set of progressively pruned trees.
- In general, the smallest decision tree that minimises the cost complexity is preferred.


Pessimistic Tree Pruning

- Pessimistic tree pruning is used by the C4.5 decision tree induction algorithm. It is similar to the cost complexity method in that it also uses error rate estimates to make decisions regarding subtree pruning.
- Pessimistic pruning does not require a separate prune set. Instead, it uses the training set to estimate error rates. Recall that an estimate of accuracy or error based on the training set is overly optimistic and, therefore, strongly biased.
- The pessimistic pruning method therefore adjusts the error rates obtained from the training set by adding a penalty, so as to counter the bias incurred.


Scalability & Decision Tree Induction

- In data mining applications, very large training sets of millions of instances are common.
- Most often, the training data will not fit in memory.
- The efficiency of existing decision tree algorithms, such as ID3, C4.5, and CART, has been well established for relatively small datasets.
- For very large datasets, decision tree construction becomes inefficient due to the swapping of training instances in and out of main and cache memories.
- More scalable approaches, capable of handling training data that are too large to fit in memory, are required.


RainForest Tree

RainForest adapts to the amount of main memory available and can be applied with any decision tree induction algorithm. The method maintains an AVC-set (Attribute-Value, Class-label) for each attribute, at each tree node, describing the training instances at the node. The AVC-set of an attribute A at node N gives the class label counts for each value of A for the instances at N. Table 11 shows the AVC-set of the Outlook attribute of Table 7.

Table: AVC-set of the Outlook attribute of Table 7

Outlook Yes No
Sunny 2 3
Overcast 4 0
Rain 3 2


Bootstrapped Optimistic Algorithm for Tree Construction (BOAT)

- BOAT creates several smaller subsets of the given training data, each of which fits in memory. Each subset is used to construct a tree, resulting in several trees.
- The trees are examined and used to construct a new tree, T′, which turns out to be very close to the tree that would have been generated if all the original training data had fit in memory.
- BOAT can use any attribute selection measure that selects binary splits and that is based on the notion of purity of partitions, such as the Gini index.


BOAT (con.)

- BOAT was found to be two to three times faster than RainForest, while constructing exactly the same tree.
- An additional advantage of BOAT is that it can be used for incremental updates. That is, BOAT can take new insertions and deletions to the training data and update the decision tree to reflect these changes, without having to reconstruct the tree from scratch.


Evaluating Classifier Performance

The performance measures of a classifier indicate how accurately the classifier predicts the class labels of instances (both training and testing instances). There are four terms we need to know that are used in computing many evaluation measures.

Table: Class prediction outcomes (confusion matrix).

                    Predicted positive      Predicted negative
Actual positive     True Positive, TP       False Negative, FN
Actual negative     False Positive, FP      True Negative, TN


Evaluating Classifier Performance (con.)


True positives, TP: the positive instances (e.g., Play = Yes) that were correctly labeled by the classifier. Let TP be the number of true positives.
True negatives, TN: the negative instances (e.g., Play = No) that were correctly labeled by the classifier. Let TN be the number of true negatives.
False positives, FP: the negative instances that were incorrectly labeled as positive by the classifier (e.g., instances of Play = No for which the classifier predicted Play = Yes). Let FP be the number of false positives.
False negatives, FN: the positive instances that were incorrectly labeled as negative by the classifier (e.g., instances of Play = Yes for which the classifier predicted Play = No). Let FN be the number of false negatives.

Classification accuracy
The classification accuracy of a classifier on a given test set is the percentage of test set instances that are correctly classified by the classifier. In the pattern recognition literature, this is also referred to as the overall recognition rate of the classifier; that is, it reflects how well the classifier recognises instances of the various classes. The classification accuracy can be measured by any of Equation 18 to Equation 20, where P and N denote the numbers of positive and negative instances, respectively, and assess(xi) is 1 if xi is correctly classified and 0 otherwise.

$accuracy = \frac{TP + TN}{P + N}$  (18)

$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$  (19)

$accuracy = \frac{\sum_{i=1}^{|X|} assess(x_i)}{|X|}, \quad x_i \in X$  (20)


Error rate

The error rate or misclassification rate of a classifier on a given test set is the percentage of test set instances that are misclassified by the classifier; when it is estimated on the training set, it is also known as the resubstitution error. The error rate is shown in Equation 21.

$errorRate = \frac{FP + FN}{P + N}$  (21)


Sensitivity

The sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive instances that are correctly identified), shown in Equation 22.

$sensitivity = \frac{TP}{P}$  (22)


Specificity

The specificity is referred to as the true negative rate (i.e., the proportion of negative instances that are correctly identified), shown in Equation 23. A perfect classifier would have 100% sensitivity and 100% specificity.

$specificity = \frac{TN}{N}$  (23)


Precision & Recall

The precision and recall measures are also widely used in classification. Precision can be thought of as a measure of exactness (i.e., what percentage of instances labeled as positive are actually positive), whereas recall is a measure of completeness (what percentage of positive instances are labeled as positive). If recall seems familiar, that's because it is the same as sensitivity (or the true positive rate).

$precision = \frac{TP}{TP + FP}$  (24)

$recall = \frac{TP}{TP + FN} = \frac{TP}{P}$  (25)


Table: Evaluation measures of classifier performance.

Measure                                      Formula
Accuracy / recognition rate                  (TP + TN) / (P + N)
Error rate / misclassification rate          (FP + FN) / (P + N)
Sensitivity / true positive rate / recall    TP / P
Specificity / true negative rate             TN / N
Precision                                    TP / (TP + FP)
F / F1 / F-score                             (2 × precision × recall) / (precision + recall)
Fβ (β a non-negative real number)            ((1 + β²) × precision × recall) / (β² × precision + recall)
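The measures in the table can be computed directly from the four confusion-matrix counts. The sketch below is our own helper (not from the slides); it returns the measures as a dictionary and uses β = 1 by default for the F-score.

```python
def evaluate(tp, tn, fp, fn, beta=1.0):
    """Classifier performance measures from the confusion-matrix counts."""
    p, n = tp + fn, tn + fp           # numbers of actual positive / negative instances
    precision = tp / (tp + fp)
    recall = tp / p                    # same as sensitivity / true positive rate
    f_beta = ((1 + beta**2) * precision * recall) / (beta**2 * precision + recall)
    return {
        "accuracy": (tp + tn) / (p + n),
        "error_rate": (fp + fn) / (p + n),
        "sensitivity": recall,
        "specificity": tn / n,
        "precision": precision,
        "f_score": f_beta,
    }

# Example with made-up counts: 90 TP, 50 TN, 10 FP, 5 FN.
print(evaluate(90, 50, 10, 5))
```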


k-fold Cross-Validation
- In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds", D1, D2, · · · , Dk, each of approximately equal size.
- Training and testing are performed k times.
- In iteration i, the partition Di is reserved as the test set, and the remaining partitions are collectively used to train the classifier.
- 10-fold cross-validation breaks the data into 10 sets of size N/10.
- It trains the classifier on 9 of the sets and tests it on the remaining one. This repeats 10 times, and the mean accuracy is taken.
- For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of instances in the initial dataset (a minimal code sketch of this procedure is given below).
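A minimal sketch of k-fold cross-validation follows. It shuffles the indices, splits them into k folds, and reports the overall accuracy across all iterations; classify_fn is a placeholder for any train-and-predict routine (for example, one built around the kNN or NB sketches above), and the function name is our own.

```python
import random

def k_fold_cross_validation(X, y, classify_fn, k=10, seed=0):
    """Estimate accuracy with k-fold cross-validation.

    classify_fn(train_X, train_y, test_X) must return predicted labels for test_X.
    """
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal-sized folds

    correct = 0
    for i in range(k):
        test_idx = set(folds[i])
        train_X = [X[j] for j in indices if j not in test_idx]
        train_y = [y[j] for j in indices if j not in test_idx]
        test_X = [X[j] for j in folds[i]]
        test_y = [y[j] for j in folds[i]]
        predictions = classify_fn(train_X, train_y, test_X)
        correct += sum(pred == true for pred, true in zip(predictions, test_y))

    # Overall number of correct classifications divided by the dataset size.
    return correct / len(y)
```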


*** THANK YOU ***

