

Machine Learning - Classification Task

Prof. Dr. Dewan Md. Farid

Dept. of Computer Science & Engineering,


United International University

June 29, 2024

Outline

k-NN classifier

Naı̈ve Bayes Classifier

Decision Tree Induction

Tree Pruning & Scalability

Evaluating Classifiers Performance


Classification

Classification is a machine learning task that describes and distinguishes data classes or concepts. The goal of classification is to accurately predict the class labels of instances whose attribute values are known but whose class values are unknown. It is a form of data analysis that extracts models (called classifiers) describing important data classes. It is a two-step process:
Learning step (or training phase): a classification model (classifier) is constructed. A classification algorithm builds the classifier by analysing a training dataset made up of instances and their associated class labels.
Classification step: the classification model is used to predict class labels for given data.


Classification Accuracy

- The classification accuracy of a classifier on a given test set is the percentage of test set instances that are correctly classified by the classifier.
- If the accuracy of the classifier is considered acceptable, the classifier can be used to classify future data instances for which the class label is not known.


Data Instance

- In the context of classification in machine learning, instances can be referred to as data points, examples, tuples, samples, or objects. The instances making up the training set are referred to as training instances and are randomly sampled from the database under analysis.
- Given a dataset, D = {x1, x2, · · · , xn}, each instance, xi, is represented by an n-dimensional attribute vector, xi = {xi1, xi2, · · · , xin}.
- The dataset, D, contains the attributes {A1, A2, · · · , An}.
- Each attribute, Ai, takes values from {Ai1, Ai2, · · · , Aih}, which represent the possible feature values of xi.
- The dataset, D, is associated with a set of classes C = {c1, c2, · · · , cm}.
- Each instance, xi, belongs to a predefined class label, ci.


Table: Commonly used symbols and terms.


Symbol Term
D Training Data
xi A data instance
X A subset of instances
Aj A feature
aji A feature’s value
cl A class label
kNN k-Nearest-Neighbor
NB Naı̈ve Bayes
DT Decision Tree


k-nearest-neighbor (kNN) classifier

- kNN is a simple classifier that uses distance measurement techniques widely used in pattern recognition.
- kNN finds the k instances, X = {x1, x2, · · · , xk} ∈ Dtraining, that are closest to the test instance, xtest, and assigns the most frequent class label among X, cl → xtest.
- When a classification is to be made for a new instance, xnew, its distance to each instance xj ∈ Dtraining must be determined.
- Only the k closest instances, X ∈ Dtraining, are considered further.
- Closeness is defined in terms of a distance metric, such as Euclidean distance.


kNN classifier (con.)

Figure: Nearest-neighbour classification using the 11-NN rule; the point denoted by a "star" is assigned to the majority class among its 11 nearest neighbours.


Euclidean Distance

The Euclidean distance between two points, x1 = (x11, x12, · · · , x1n) and x2 = (x21, x22, · · · , x2n), is shown in Eq. 1.

$dist(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$  (1)


Distance Measures
The distance between two points in the plane with coordinates (x, y) and (a, b) is given by:

$EuclideanDistance((x, y), (a, b)) = \sqrt{(x - a)^2 + (y - b)^2}$  (2)

$ManhattanDistance((x, y), (a, b)) = |x - a| + |y - b|$  (3)

Figure: Euclidean and Manhattan distance.
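The two distance measures can be written directly in code. Below is a minimal Python sketch (the function names euclidean_distance and manhattan_distance are ours, not from the slides) that generalises Eq. 2 and Eq. 3 to n-dimensional points.

```python
import math
from typing import Sequence

def euclidean_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Eq. 2 generalised to n dimensions: square root of the sum of squared differences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Eq. 3 generalised to n dimensions: sum of absolute differences."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# Example: distance between the points (x, y) = (1, 2) and (a, b) = (4, 6).
print(euclidean_distance([1, 2], [4, 6]))  # 5.0
print(manhattan_distance([1, 2], [4, 6]))  # 7
```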


kNN classifier (con.)

- For the kNN classifier, the unknown instance, xunknown, is assigned the most common class, cl, among its k nearest neighbours.
- k is chosen to be odd for two-class classification and, in general, not to be a multiple of the number of classes M.
- Usually, kNN achieves good results when the dataset is large.
- The value of k should be larger when classifying noisy data.
- Majority voting over the k nearest neighbours also helps to deal with noisy instances.
- Algorithm 1 outlines the k-nearest-neighbor algorithm.


Algorithm 1 k-nearest-neighbor classifier

Input: D = {x1, · · · , xi, · · · , xn}, a test instance xtest, and the number of neighbours k.
Output: A class label for xtest.
Method:
1: find X ⊆ D containing the k nearest neighbours of xtest, regardless of class label, cl.
2: out of these instances, X = {x1, x2, · · · , xk}, identify the number of instances, ki, that belong to class cl, l = 1, 2, · · · , M. Obviously, $\sum_i k_i = k$.
3: assign xtest to the class cl with the maximum number ki of instances.
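A minimal Python sketch of Algorithm 1 follows. It is an illustrative brute-force implementation, not the lecturer's code: it computes the Euclidean distance from the test instance to every training instance, keeps the k closest, and returns the majority class among them.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x_test, k=3):
    """Brute-force k-nearest-neighbour prediction (Algorithm 1)."""
    # Step 1: distance from x_test to every training instance.
    distances = [(math.dist(x_i, x_test), y_i) for x_i, y_i in zip(train_X, train_y)]
    # Step 2: keep the k closest instances and count their class labels.
    k_nearest = sorted(distances, key=lambda d: d[0])[:k]
    votes = Counter(label for _, label in k_nearest)
    # Step 3: assign the most frequent class label among the k neighbours.
    return votes.most_common(1)[0][0]
```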


Disadvantage of kNN classifier

- The main disadvantage of the kNN classifier is that it is a lazy learner, i.e. it does not learn anything from the training data and simply uses the training data itself for classification.
- A serious drawback of the kNN technique is the complexity, O(kN), of searching for the nearest neighbour(s) among the N available training samples. Although, due to its asymptotic error performance, the kNN rule achieves good results when the dataset is large, the performance of the classifier may degrade dramatically when the number N of training instances is relatively small.


kNN - An Illustrative Example


Table: Data for Height Classification.
Name Gender Height Output
Kristina F 1.6 m Short
Jim M 2m Tall
Maggie F 1.9 m Medium
Martha F 1.88 m Medium
Stephanie F 1.7 m Short
Bob M 1.85 m Medium
Kathy F 1.6 m Short
Dave M 1.7 m Short
Worth M 2.2 m Tall
Steven M 2.1 m Tall
Debbie F 1.8 m Medium
Todd M 1.95 m Medium
Kim F 1.9 m Medium
Amy F 1.8 m Medium
Wynette F 1.75 m Medium


An Illustrative Example (con.)

- Using the sample data from Table 2 and the Output column as the training set output value, we classify the instance (Pat, F, 1.6).
- Only the height is used for distance calculation, so both the Euclidean and Manhattan distance measures yield the same results; that is, the distance is simply the absolute value of the difference between the height values.
- Suppose that k = 5 is given. The k nearest neighbours to the input instance are then (Kristina, F, 1.6), (Kathy, F, 1.6), (Stephanie, F, 1.7), (Dave, M, 1.7), and (Wynette, F, 1.75).
- Of these five items, four are classified as Short and one as Medium. Thus, kNN will classify Pat as Short (a small code check follows below).
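As a quick check of this worked example, the knn_predict sketch given earlier can be applied to the height data; the variable names below are ours, and only the Height column is used as a feature, as in the slide.

```python
# Heights and class labels copied from the table above, in the same order.
heights = [1.6, 2.0, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 2.2, 2.1, 1.8, 1.95, 1.9, 1.8, 1.75]
labels = ["Short", "Tall", "Medium", "Medium", "Short", "Medium", "Short", "Short",
          "Tall", "Tall", "Medium", "Medium", "Medium", "Medium", "Medium"]

train_X = [[h] for h in heights]          # each instance is a 1-dimensional point

print(knn_predict(train_X, labels, [1.6], k=5))   # Short, as in the worked example
```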


Naı̈ve Bayes Classifier

- A naïve Bayes (NB) classifier is a simple probabilistic classifier based on: (a) Bayes' theorem, (b) strong (naïve) independence assumptions, and (c) independent feature models.
- It is an important data mining classifier for pattern classification and is applied in many real-world classification problems because of its high classification performance.
- An NB classifier can easily handle missing attribute values by simply omitting the corresponding probabilities for those attributes when calculating the likelihood of membership for each class.
- The NB classifier requires class conditional independence, i.e. the effect of an attribute on a given class is independent of those of the other attributes.


Advantages of NB classifier

The naïve Bayes (NB) classifier has several advantages:
1. Easy to use.
2. Only one scan of the training data is required.
3. Handles missing attribute values.
4. Handles continuous data.
5. High classification performance.


NB classifier
The NB classifier predicts that the instance X belongs to the class Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i. The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.

$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$  (4)

In Bayes' theorem, shown in Eq. 4, P(X) is constant for all classes, so only P(X|Ci)P(Ci) needs to be maximized. If the class prior probabilities are not known, it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = · · · = P(Cm), and therefore only P(X|Ci) is maximized. Otherwise, P(X|Ci)P(Ci) is maximized. The class prior probabilities are calculated by P(Ci) = |Ci,D|/|D|, where |Ci,D| is the number of training instances belonging to the class Ci in D.


NB classifier (con.)
Computing P(X|Ci) in a dataset with many attributes is extremely computationally expensive. Thus, the naïve assumption of class conditional independence is made in order to reduce the computation involved in evaluating P(X|Ci): the attributes are assumed to be conditionally independent of one another, given the class label of the instance. Thus, Eq. 5 and Eq. 6 are used to compute P(X|Ci).

$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$  (5)

$P(X|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$  (6)

In Eq. 5, xk refers to the value of attribute Ak for instance X. These probabilities P(x1|Ci), P(x2|Ci), · · · , P(xn|Ci) can easily be estimated from the training instances.


NB classifier (con.)
Moreover, the attributes in training datasets can be categorical or continuous-valued. If the attribute Ak is categorical, then P(xk|Ci) is the number of instances of class Ci ∈ D with the value xk for Ak, divided by |Ci,D|, the number of instances belonging to class Ci ∈ D. If Ak is continuous-valued, then Ak is typically assumed to have a Gaussian distribution with mean µ and standard deviation σ, defined respectively by the following two equations:

$P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$  (7)

$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$  (8)

In Eq. 7, µCi is the mean and σCi is the standard deviation of the values of attribute Ak for the training instances of class Ci. These two quantities are plugged into Eq. 8, together with xk, in order to estimate P(xk|Ci).

NB classifier (con.)

To predict the class label of instance X, P(X|Ci)P(Ci) is evaluated for each class Ci ∈ D. The NB classifier predicts that the class label of instance X is the class Ci if and only if

$P(X|C_i)P(C_i) > P(X|C_j)P(C_j)$  (9)

In Eq. 9, 1 ≤ j ≤ m and j ≠ i. That is, the predicted class label is the class Ci for which P(X|Ci)P(Ci) is maximum. Algorithm 2 outlines the naïve Bayes classifier algorithm.


Algorithm 2 Naı̈ve Bayes classifier


Input: D = {x1 , x2 , · · · , xn } // Training data.
Output: A naı̈ve Bayes Model.
Method:
1: for each class, Ci ∈ D, do
2: Find the prior probabilities, P(Ci ).
3: end for
4: for each attribute, Ai ∈ D, do
5: for each attribute value, Aij ∈ Ai , do
6: Find the class conditional probabilities, P(Aij |Ci ).
7: end for
8: end for
9: for each instance, xi ∈ D, do
10: Find the posterior probability, P(Ci |xi );
11: end for
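A minimal Python sketch of Algorithm 2 for categorical attributes is shown below. The class name NaiveBayesCategorical and its methods are our own illustrative choices, not from the slides or any library: fit estimates priors and class-conditional probabilities by counting, and predict scores a test instance with P(Ci) multiplied by the product of the conditional probabilities.

```python
from collections import Counter, defaultdict

class NaiveBayesCategorical:
    """Naive Bayes for categorical attributes (Algorithm 2), without smoothing."""

    def fit(self, X, y):
        n = len(y)
        self.classes = sorted(set(y))
        self.class_counts = Counter(y)
        # Prior probabilities P(Ci) = |Ci,D| / |D|.
        self.priors = {c: self.class_counts[c] / n for c in self.classes}
        # Counts behind the class-conditional probabilities P(Aij | Ci),
        # keyed by (class label, attribute index).
        self.value_counts = defaultdict(Counter)
        for xi, ci in zip(X, y):
            for j, value in enumerate(xi):
                self.value_counts[(ci, j)][value] += 1
        return self

    def predict(self, x):
        best_class, best_score = None, -1.0
        for c in self.classes:
            # Posterior score P(Ci) * prod_k P(xk | Ci); ties broken by class order.
            score = self.priors[c]
            for j, value in enumerate(x):
                score *= self.value_counts[(c, j)][value] / self.class_counts[c]
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

Applied to the 14 playing-tennis instances from the illustrative example that follows, this sketch multiplies the same unsmoothed probabilities as the worked calculation and therefore reproduces the Play = No prediction for D1 (Sunny, Hot, High, Weak).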


Laplace Correction

- A zero probability cancels the effects of all the other (posteriori) probabilities (on Ci) involved in the product.
- We can assume that our training database, D, is so large that adding one to each count we need would make only a negligible difference in the estimated probability value, yet would conveniently avoid probability values of zero (a small numeric illustration follows below).
- This estimation technique is known as the Laplacian correction or Laplace estimator, named after Pierre Laplace, a French mathematician who lived from 1749 to 1827.
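One hedged way to apply the correction to the NaiveBayesCategorical sketch above is to add one to each numerator count and the number of distinct values of the attribute to the denominator (a standard form of Laplace smoothing; the helper name below is ours).

```python
def smoothed_conditional(count_value_in_class, count_class, n_distinct_values):
    """Laplace-corrected estimate of P(value | class).

    Adds 1 to the numerator count and the number of distinct attribute values
    to the denominator, so no estimate is ever exactly zero.
    """
    return (count_value_in_class + 1) / (count_class + n_distinct_values)

# Example: P(Outlook = Overcast | Play = No) without correction is 0/5 = 0;
# with the correction (3 distinct Outlook values) it becomes (0 + 1) / (5 + 3) = 0.125.
print(smoothed_conditional(0, 5, 3))  # 0.125
```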


NB classifier - An Illustrative Example

To illustrate the operation of the naïve Bayes (NB) classifier, we consider the small dataset in Table 3, described by four attributes, namely Outlook, Temperature, Humidity, and Wind, which represent the weather condition of a particular day. Each attribute has several unique attribute values. The Play column in Table 3 represents the class category of each instance. It indicates whether a particular weather condition is suitable for playing tennis or not.


Table: The playing tennis dataset


Day Outlook Temperature Humidity Wind Play
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No


Table: Prior probabilities for each class generated using the playing tennis dataset
Probability Value
P(Play = Yes) 9/14 = 0.642
P(Play = No) 5/14 = 0.357

Table: Conditional probabilities for Outlook calculated using the playing tennis dataset
Probability Value
P(Outlook = Sunny|Play = Yes) 2/9 = 0.222
P(Outlook = Sunny|Play = No) 3/5 = 0.6
P(Outlook = Overcast|Play = Yes) 4/9 = 0.444
P(Outlook = Overcast|Play = No) 0/5 = 0.0
P(Outlook = Rain|Play = Yes) 3/9 = 0.333
P(Outlook = Rain|Play = No) 2/5 = 0.4


Table: Conditional probabilities for Temperature, Humidity, and Wind calculated using the playing tennis dataset
Probability Value
P(Temperature = Hot|Play = Yes) 2/9 = 0.222
P(Temperature = Hot|Play = No) 2/5 = 0.4
P(Temperature = Mild|Play = Yes) 4/9 = 0.444
P(Temperature = Mild|Play = No) 2/5 = 0.4
P(Temperature = Cool|Play = Yes) 3/9 = 0.333
P(Temperature = Cool|Play = No) 1/5 = 0.2
P(Humidity = High|Play = Yes) 3/9 = 0.333
P(Humidity = High|Play = No) 4/5 = 0.8
P(Humidity = Normal|Play = Yes) 6/9 = 0.666
P(Humidity = Normal|Play = No) 1/5 = 0.2
P(Wind = Weak|Play = Yes) 6/9 = 0.666
P(Wind = Weak|Play = No) 2/5 = 0.4
P(Wind = Strong |Play = Yes) 3/9 = 0.333
P(Wind = Strong |Play = No) 3/5 = 0.6


Using these probabilities, we obtain

P(D1|Play = Yes) = P(Outlook = Sunny|Play = Yes) × P(Temperature = Hot|Play = Yes) × P(Humidity = High|Play = Yes) × P(Wind = Weak|Play = Yes) = 0.222 × 0.222 × 0.333 × 0.666 = 0.0109.

Similarly,

P(D1|Play = No) = P(Outlook = Sunny|Play = No) × P(Temperature = Hot|Play = No) × P(Humidity = High|Play = No) × P(Wind = Weak|Play = No) = 0.6 × 0.4 × 0.8 × 0.4 = 0.0768.

To find the class, Ci, that maximises P(X|Ci)P(Ci), we compute:

P(D1|Play = Yes)P(Play = Yes) = 0.0109 × 0.642 = 0.00699
P(D1|Play = No)P(Play = No) = 0.0768 × 0.357 = 0.0274

Therefore, the naïve Bayes classifier predicts Play = No for instance D1.

Note: As P(Outlook = Overcast|Play = No) = 0/5 = 0, all instances with Outlook = Overcast will be classified as Yes (D3, D7, D12, and D13 will be Yes).


Tree

Figure: A picture of a tree.


Decision Tree
Decision tree (DT) induction is the learning of decision trees from class-labeled training instances using a top-down, recursive, divide-and-conquer algorithm. The goal of DT induction is to create a model (classifier) that predicts the value of the target class for an unseen test instance based on several input features. DTs have various advantages:
1. Simple to understand.
2. Easy to implement.
3. Require little prior knowledge.
4. Able to handle both numerical and categorical data.
5. Robust.
6. Deal with large and noisy datasets.
7. Nonlinear relationships between features do not affect tree performance.


Iterative Dichotomiser 3 (ID3)


The goal of DT induction is to iteratively partition the data into smaller subsets until all the subsets belong to a single class or the stopping criteria of the tree-building process are met.
- The ID3 (Iterative Dichotomiser 3) algorithm uses information theory to select the best feature, Aj. The Aj with the maximum Information Gain is chosen as the root node of the tree.
- To classify an instance, xi ∈ D, the average amount of information needed to identify its class, cl, is shown in Eq. 10, where pi is the probability that an instance in D belongs to class ci and is estimated by |ci,D|/|D|, and the sum runs over the M classes.

$Info(D) = -\sum_{i=1}^{M} p_i \log_2(p_i)$  (10)


ID3 (con.)

InfoA(D) is the expected information required to correctly classify xi ∈ D based on the partitioning by Aj. Eq. 11 shows the InfoA(D) calculation, where |Dj|/|D| acts as the weight of the jth partition.

$Info_A(D) = \sum_{j=1}^{n} \frac{|D_j|}{|D|} \times Info(D_j)$  (11)

Information gain is defined as the difference between Info(D) and InfoA(D), as shown in Eq. 12.

$Gain(A) = Info(D) - Info_A(D)$  (12)
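A small Python sketch of Eq. 10 to Eq. 12 is given below; the function names are ours. It computes Info(D) from a list of class labels, the expected information InfoA(D) of a split on one attribute column, and the resulting gain.

```python
import math
from collections import Counter

def info(labels):
    """Info(D), Eq. 10: entropy of the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_attribute(values, labels):
    """Info_A(D), Eq. 11: weighted entropy of the partitions induced by attribute A."""
    n = len(labels)
    total = 0.0
    for v, size in Counter(values).items():
        subset = [lab for val, lab in zip(values, labels) if val == v]
        total += (size / n) * info(subset)
    return total

def gain(values, labels):
    """Gain(A), Eq. 12: information gain of splitting on attribute A."""
    return info(labels) - info_attribute(values, labels)
```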


C4.5
Quinlan later presented C4.5 (a successor of the ID3 algorithm), which became a benchmark in supervised learning algorithms. C4.5 uses an extension of Information Gain known as Gain Ratio. It applies a kind of normalisation of Information Gain using a Split Information value defined analogously to Info(D), as shown in Eq. 13.

$SplitInfo_A(D) = -\sum_{j=1}^{n} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$  (13)

The Aj with the maximum Gain Ratio, defined in Eq. 14, is selected as the splitting feature.

$GainRatio(A) = \frac{Gain(A)}{SplitInfo(A)}$  (14)


Gini Index
The Gini Index is used in the Classification and Regression Trees (CART) algorithm, which generates a binary classification tree for decision making. It measures the impurity of D, a data partition or subset X, as shown in Eq. 15, where pi is the probability that an instance in D belongs to class ci and is estimated by |ci,D|/|D|. The sum is computed over the M classes.

$Gini(D) = 1 - \sum_{i=1}^{M} p_i^2$  (15)

The Gini Index considers a binary split: a weighted sum of the impurity of each resulting partition. For example, if a binary split on A partitions D into D1 and D2, the Gini Index of D given that partitioning is shown in Eq. 16.

$Gini_A(D) = \frac{|D_1|}{|D|}\,Gini(D_1) + \frac{|D_2|}{|D|}\,Gini(D_2)$  (16)


Gini Index (con.)

For each Aj, each of the possible binary splits is considered. The Aj that maximises the reduction in impurity is selected as the splitting feature, as shown in Eq. 17.

$\Delta Gini(A) = Gini(D) - Gini_A(D)$  (17)

The time and space complexity of a tree depends on the size of the dataset, the number of features, and the size of the generated tree. The key disadvantage of DTs is that, without proper pruning (or limiting tree growth), they tend to overfit the training data.
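The Gini computations of Eq. 15 to Eq. 17 can be sketched in the same style as the entropy helpers above (function names are our own):

```python
from collections import Counter

def gini(labels):
    """Gini(D), Eq. 15: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_binary_split(labels_d1, labels_d2):
    """Gini_A(D), Eq. 16: weighted Gini of a binary split of D into D1 and D2."""
    n = len(labels_d1) + len(labels_d2)
    return (len(labels_d1) / n) * gini(labels_d1) + (len(labels_d2) / n) * gini(labels_d2)

def gini_reduction(labels, labels_d1, labels_d2):
    """Delta Gini(A), Eq. 17: impurity reduction achieved by the split."""
    return gini(labels) - gini_binary_split(labels_d1, labels_d2)
```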


Algorithm 3 Decision Tree Induction

Input: D = {x1, · · · , xi, · · · , xN}
Output: A decision tree, DT.
Method:
1: DT = ∅;
2: find the root node with the best splitting attribute, Aj ∈ D;
3: DT = create the root node;
4: DT = add an arc to the root node for each split predicate and label;
5: for each arc do
6:   Dj created by applying the splitting predicate to D;
7:   if stopping point reached for this path then
8:     DT′ = create a leaf node and label it with cl;
9:   else
10:    DT′ = DTBuild(Dj);
11:  end if
12:  DT = add DT′ to arc;
13: end for


Tree using ID3 - An Illustrative Example

- To illustrate the operation of DT induction, we consider the small dataset in Table 7, described by four attributes, namely Outlook, Temperature, Humidity, and Wind, which represent the weather condition of a particular day.
- Each attribute has several unique attribute values.
- The Play column in Table 7 represents the class category of each instance. It indicates whether a particular weather condition is suitable for playing tennis or not.


Table: The playing tennis dataset


Day Outlook Temperature Humidity Wind Play
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No


Gain Calculation

   
$Info(D) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$

$Info_{Outlook}(D) = \frac{5}{14}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right) + \frac{4}{14}\left(-\frac{4}{4}\log_2\frac{4}{4}\right) + \frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) = 0.694$

Therefore, the gain in information from such a partitioning would be:

$Gain(Outlook) = Info(D) - Info_{Outlook}(D) = 0.940 - 0.694 = 0.246$


Gain Calculation (con.)


 
$Info_{Temperature}(D) = \frac{4}{14}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) + \frac{6}{14}\left(-\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6}\right) + \frac{4}{14}\left(-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) = 0.911$

Therefore, the gain in information from such a partitioning would be:

$Gain(Temperature) = Info(D) - Info_{Temperature}(D) = 0.940 - 0.911 = 0.029$

So, the Information Gain of Outlook = 0.246, Temperature = 0.029, Humidity = 0.151, and Wind = 0.048. Outlook has the highest gain and is therefore selected as the root node of the tree; a code check of these figures follows below.
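These gain figures can be checked with the info/gain helpers sketched earlier; the attribute columns below are copied from Table 7.

```python
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
temperature = ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
               "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

print(round(info(play), 3))                   # 0.94
print(round(info_attribute(outlook, play), 3))  # 0.694
print(round(gain(outlook, play), 3))          # 0.247 (0.246 on the slide, which rounds intermediates)
print(round(gain(temperature, play), 3))      # 0.029
```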


Table: The playing tennis sub-dataset, Outlook=Sunny


Day Temperature Humidity Wind Play
D1 Hot High Weak No
D2 Hot High Strong No
D8 Mild High Weak No
D9 Cool Normal Weak Yes
D11 Mild Normal Strong Yes

Table: The playing tennis sub-dataset, Outlook=Overcast


Day Temperature Humidity Wind Play
D3 Hot High Weak Yes
D7 Cool Normal Strong Yes
D12 Mild High Strong Yes
D13 Hot Normal Weak Yes


Table: The playing tennis sub-dataset, Outlook=Rain


Day Temperature Humidity Wind Play
D4 Mild High Weak Yes
D5 Cool Normal Weak Yes
D6 Cool Normal Strong No
D10 Mild Normal Weak Yes
D14 Mild High Strong No


Decision Tree

Figure: A DT generated by the playing tennis dataset.


Tree Pruning

- Tree pruning methods address the problem of overfitting the data.
- When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise.
- There are two common approaches to tree pruning:
  1. Pre-pruning
  2. Post-pruning


Pre-pruning

- In pre-pruning, a tree is pruned by halting its construction early (e.g. by deciding not to further split or partition the subset of training instances at a given node).
- Upon halting, the node becomes a leaf.
- The leaf may hold the most frequent class among the subset instances or the probability distribution of those instances.
- In choosing an appropriate threshold, high thresholds could result in oversimplified trees, whereas low thresholds could result in very little simplification.


Post-pruning

- Post-pruning is a commonly used tree pruning approach that removes subtrees from a fully grown tree.
- A subtree at a given node is pruned by removing its branches and replacing it with a leaf.
- The leaf is labeled with the most frequent class among the instances of the subtree being replaced.
- Post-pruning requires more computation than pre-pruning, yet generally leads to a more reliable tree.


Repetition & Replication

- No single pruning method has been found to be superior to all others.
- Although some pruning methods depend on the availability of additional data for pruning, this is usually not a concern when dealing with large databases.
- Decision trees can suffer from repetition and replication.
  Repetition occurs when an attribute is repeatedly tested along a given branch of the tree.
  Replication occurs when duplicate subtrees exist within the tree.
  These situations can impede the accuracy and comprehensibility of a decision tree.


Tree Pruning by Cost Complexity


I The cost complexity pruning algorithm used in CART (Classification
and Regression Trees) is an example of the post-pruning approach.
I This approach considers the cost complexity of a tree to be a
function of the number of leaves in the tree and the error rate of
the tree (where the error rate is the percentage of instances
misclassified by the tree).
I It starts from the bottom of the tree.
I For each internal node, N, it computes the cost complexity of the
subtree at N, and the cost complexity of the subtree at N if it were
to be pruned (i.e., replaced by a leaf node).
I The two values are compared. If pruning the subtree at node N
would result in a smaller cost complexity, then the subtree is pruned.
Otherwise, it is kept.


Pruning Set

- A pruning set of class-labeled instances is used to estimate cost complexity.
- This set is independent of the training set used to build the unpruned tree and of any test set used for accuracy estimation.
- The algorithm generates a set of progressively pruned trees.
- In general, the smallest decision tree that minimises the cost complexity is preferred.


Pessimistic Tree Pruning

- Pessimistic tree pruning is used by the C4.5 decision tree induction algorithm. It is similar to the cost complexity method in that it also uses error rate estimates to make decisions regarding subtree pruning.
- Pessimistic pruning does not require a separate prune set. Instead, it uses the training set to estimate error rates. Recall that an estimate of accuracy or error based on the training set is overly optimistic and, therefore, strongly biased.
- The pessimistic pruning method therefore adjusts the error rates obtained from the training set by adding a penalty, so as to counter the bias incurred.


Scalability & Decision Tree Induction

- In data mining applications, very large training sets of millions of instances are common.
- Most often, the training data will not fit in memory.
- The efficiency of existing decision tree algorithms, such as ID3, C4.5, and CART, has been well established for relatively small datasets.
- For very large datasets, decision tree construction becomes inefficient due to the swapping of training instances in and out of main and cache memories.
- More scalable approaches, capable of handling training data that are too large to fit in memory, are required.


RainForest Tree

RainForest adapts to the amount of main memory available and can be applied with any decision tree induction algorithm. The method maintains an AVC-set (Attribute-Value, Class-label) for each attribute, at each tree node, describing the training instances at the node. The AVC-set of an attribute A at node N gives the class label counts for each value of A for the instances at N. Table 11 shows the AVC-set of the Outlook attribute of Table 7.

Table: AVC-set of the Outlook attribute of Table 7

Outlook Yes No
Sunny 2 3
Overcast 4 0
Rain 3 2


Bootstrapped Optimistic Algorithm for Tree Construction (BOAT)

- BOAT creates several smaller subsets of the given training data, each of which fits in memory. Each subset is used to construct a tree, resulting in several trees.
- The trees are examined and used to construct a new tree, T′, which turns out to be very close to the tree that would have been generated if all the original training data had fit in memory.
- BOAT can use any attribute selection measure that selects binary splits and that is based on the notion of purity of partitions, such as the Gini index.


BOAT (con.)

- BOAT was found to be two to three times faster than RainForest, while constructing exactly the same tree.
- An additional advantage of BOAT is that it can be used for incremental updates. That is, BOAT can take new insertions and deletions to the training data and update the decision tree to reflect these changes, without having to reconstruct the tree from scratch.


Evaluating Classifier Performance

The performance measures of a classifier indicate how accurately the classifier predicts the class labels of instances (both training and testing instances). There are four terms we need to know that are used in computing many evaluation measures.

Table: Class prediction outcomes (confusion matrix).

                    Predicted positive      Predicted negative
Actual positive     True Positive, TP       False Negative, FN
Actual negative     False Positive, FP      True Negative, TN


Evaluating Classifier Performance (con.)


True positives, TP: the positive instances (e.g., Play = Yes) that were correctly labeled by the classifier. Let TP be the number of true positives.
True negatives, TN: the negative instances (e.g., Play = No) that were correctly labeled by the classifier. Let TN be the number of true negatives.
False positives, FP: the negative instances that were incorrectly labeled as positive by the classifier (e.g., instances of Play = No for which the classifier predicted Play = Yes). Let FP be the number of false positives.
False negatives, FN: the positive instances that were incorrectly labeled as negative by the classifier (e.g., instances of Play = Yes for which the classifier predicted Play = No). Let FN be the number of false negatives.

Classification accuracy
The classification accuracy of a classifier on a given test set is the percentage of test set instances that are correctly classified by the classifier. In the pattern recognition literature, this is also referred to as the overall recognition rate of the classifier; that is, it reflects how well the classifier recognises instances of the various classes. The classification accuracy can be measured by any of Equation 18 to Equation 20, where P and N denote the numbers of positive and negative instances, respectively, and assess(xi) is 1 if xi is correctly classified and 0 otherwise.

$accuracy = \frac{TP + TN}{P + N}$  (18)

$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$  (19)

$accuracy = \frac{\sum_{i=1}^{|X|} assess(x_i)}{|X|}, \quad x_i \in X$  (20)


Error rate

The error rate or misclassification rate of a classifier on a given test set is the percentage of test set instances that are misclassified by the classifier; when it is estimated on the training set, it is also known as the resubstitution error. The error rate is shown in Equation 21.

$errorRate = \frac{FP + FN}{P + N}$  (21)


Sensitivity

The sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive instances that are correctly identified), shown in Equation 22.

$sensitivity = \frac{TP}{P}$  (22)


Specificity

The specificity is referred to as the true negative rate (i.e., the proportion of negative instances that are correctly identified), shown in Equation 23. A perfect classifier would have 100% sensitivity and 100% specificity.

$specificity = \frac{TN}{N}$  (23)


Precision & Recall

The precision and recall measures are also widely used in classification. Precision can be thought of as a measure of exactness (i.e., what percentage of instances labeled as positive are actually positive), whereas recall is a measure of completeness (what percentage of positive instances are labeled as positive). If recall seems familiar, that's because it is the same as sensitivity (or the true positive rate).

$precision = \frac{TP}{TP + FP}$  (24)

$recall = \frac{TP}{TP + FN} = \frac{TP}{P}$  (25)


Table: Evaluation measures of classifier performance.

Measure                                      Formula
Accuracy / recognition rate                  (TP + TN) / (P + N)
Error rate / misclassification rate          (FP + FN) / (P + N)
Sensitivity / true positive rate / recall    TP / P
Specificity / true negative rate             TN / N
Precision                                    TP / (TP + FP)
F / F1 / F-score                             (2 × precision × recall) / (precision + recall)
Fβ (β a non-negative real number)            ((1 + β²) × precision × recall) / (β² × precision + recall)
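The measures in the table can be computed directly from the four confusion-matrix counts. The sketch below is our own helper (not from the slides); it returns the measures as a dictionary and uses β = 1 by default for the F-score.

```python
def evaluate(tp, tn, fp, fn, beta=1.0):
    """Classifier performance measures from the confusion-matrix counts."""
    p, n = tp + fn, tn + fp           # numbers of actual positive / negative instances
    precision = tp / (tp + fp)
    recall = tp / p                    # same as sensitivity / true positive rate
    f_beta = ((1 + beta**2) * precision * recall) / (beta**2 * precision + recall)
    return {
        "accuracy": (tp + tn) / (p + n),
        "error_rate": (fp + fn) / (p + n),
        "sensitivity": recall,
        "specificity": tn / n,
        "precision": precision,
        "f_score": f_beta,
    }

# Example with made-up counts: 90 TP, 50 TN, 10 FP, 5 FN.
print(evaluate(90, 50, 10, 5))
```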


k-fold Cross-Validation
- In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds", D1, D2, · · · , Dk, each of approximately equal size.
- Training and testing are performed k times.
- In iteration i, the partition Di is reserved as the test set, and the remaining partitions are collectively used to train the classifier.
- 10-fold cross-validation breaks the data into 10 sets of size N/10.
- It trains the classifier on 9 of the sets and tests it on the remaining one. This repeats 10 times, and the mean accuracy is taken.
- For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of instances in the initial dataset (a minimal code sketch of this procedure is given below).
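A minimal sketch of k-fold cross-validation follows. It shuffles the indices, splits them into k folds, and reports the overall accuracy across all iterations; classify_fn is a placeholder for any train-and-predict routine (for example, one built around the kNN or NB sketches above), and the function name is our own.

```python
import random

def k_fold_cross_validation(X, y, classify_fn, k=10, seed=0):
    """Estimate accuracy with k-fold cross-validation.

    classify_fn(train_X, train_y, test_X) must return predicted labels for test_X.
    """
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal-sized folds

    correct = 0
    for i in range(k):
        test_idx = set(folds[i])
        train_X = [X[j] for j in indices if j not in test_idx]
        train_y = [y[j] for j in indices if j not in test_idx]
        test_X = [X[j] for j in folds[i]]
        test_y = [y[j] for j in folds[i]]
        predictions = classify_fn(train_X, train_y, test_X)
        correct += sum(pred == true for pred, true in zip(predictions, test_y))

    # Overall number of correct classifications divided by the dataset size.
    return correct / len(y)
```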


*** THANK YOU ***

