Unit II Classification
Bayes’ Theorem is used to determine the conditional probability of an event. It is named after the
English statistician Thomas Bayes, whose formulation of the result was published in 1763. Bayes’ Theorem is a very
important theorem in mathematics that laid the foundation of a distinct statistical inference approach
called Bayesian inference. It is used to find the probability of an event based on prior knowledge of
conditions that might be related to that event.
Bayes’ theorem (also known as Bayes’ Rule or Bayes’ Law) is used to determine the conditional
probability of event A when event B has already occurred.
The general statement of Bayes’ theorem is: “The conditional probability of an event A, given the
occurrence of another event B, is equal to the product of the probability of B given A and the probability of
A, divided by the probability of event B,” i.e.
P(A|B) = P(B|A) P(A) / P(B)
where P(A|B) is the probability of A given B, P(B|A) is the probability of B given A, and P(A) and P(B) are
the probabilities of events A and B respectively.
Let E1, E2,…, En be a set of events associated with the sample space S, in which all the events E1, E2,…,
En have a non-zero probability of occurrence and together form a partition of S. Let A be an
event from the space S whose probability we have to find. Then, according to Bayes’ theorem,
P(Ek|A) = P(Ek) P(A|Ek) / ∑_{i=1}^{n} P(Ei) P(A|Ei),   for k = 1, 2, 3, …, n
For any two events A and B, the Bayes theorem formula is given by:
P(A|B) = P(B|A) P(A) / P(B)
where,
• P(A|B) is the conditional probability of event A given that event B has occurred,
• P(B|A) is the conditional probability of event B given that event A has occurred,
• P(A) and P(B) are the probabilities of events A and B, and P(B) is never equal to zero.
The proof of Bayes’ Theorem is given as follows. According to the conditional probability formula,
P(Ei|A) = P(Ei∩A) / P(A)……(i)
Using the multiplication rule of probability,
P(Ei∩A) = P(Ei)P(A|Ei)……(ii)
Using the total probability theorem,
P(A) = ∑ P(Ek)P(A|Ek)…..(iii)
Substituting the values of P(Ei∩A) and P(A) from eq (ii) and eq (iii) in eq (i), we get
P(Ei|A) = P(Ei)P(A|Ei) / ∑ P(Ek)P(A|Ek)
Bayes’ theorem is also known as the formula for the probability of “causes”. As we know, the Ei‘s are a
partition of the sample space S, and at any given time only one of the events Ei occurs. Thus we
conclude that the Bayes’ theorem formula gives the probability of a particular Ei, given the event A has
occurred.
Having covered Bayes’ theorem in detail, let us now understand some important terms related to the
concepts used in its formula and derivation.
• Hypotheses: The events E1, E2,…, En in the sample space are called the hypotheses.
• Prior Probability: The prior probability is the initial probability of an event occurring before any new
data is taken into account. P(Ei) is the prior probability of hypothesis Ei.
Conditional Probability
• The probability of an event A based on the occurrence of another event B is termed conditional
Probability.
• It is denoted as P(A|B) and represents the probability of A when event B has already happened.
Joint Probability
When the probability of two or more events occurring together at the same time is measured, it is
termed Joint Probability. For two events A and B, the joint probability is denoted as P(A∩B).
Random Variables
Real-valued variables whose possible values are determined by random experiments are called random
variables. The probability of finding such variables is the experimental probability.
Bayesian inference is directly derived from Bayes’ theorem and is very important, having found application
in various fields, including medicine, science, philosophy, engineering, sports, and law.
Example: Bayes’ theorem can be used to interpret a medical test result by combining how likely a
person is to have the disease (the prior) with the overall accuracy of the test (its sensitivity and false-positive rate).
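As an illustration of this example, here is a minimal Python sketch of the computation; the prevalence, sensitivity, and false-positive rate below are hypothetical numbers chosen only for demonstration.

# Hypothetical values, for illustration only
p_disease = 0.01             # prior: 1% of people have the disease, P(D)
p_pos_given_disease = 0.95   # test sensitivity, P(+|D)
p_pos_given_healthy = 0.05   # false-positive rate, P(+|not D)

# Total probability of a positive test, P(+)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(D|+) = P(+|D) * P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # ~0.161, despite the positive test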
The difference between conditional probability and Bayes’ theorem can be summarized as follows:
conditional probability P(A|B) is the probability of A given that B has occurred, computed directly as
P(A∩B)/P(B), whereas Bayes’ theorem expresses the same quantity in terms of the reverse conditional
probability, P(A|B) = P(B|A)P(A)/P(B), so that a posterior probability can be computed from a prior
probability and a likelihood.
Let E1, E2, . . ., En be mutually exclusive and exhaustive events associated with a random experiment, and
let E be an event that occurs with some Ei. Then, prove that
P(E) = ∑_{i=1}^{n} P(E/Ei) . P(Ei)
Proof:
Since E1, E2, . . ., En are exhaustive and mutually exclusive,
S = E1 ∪ E2 ∪ E3 ∪ . . . ∪ En and Ei ∩ Ej = ∅ for i ≠ j.
E = E ∩ S
⇒ E = E ∩ (E1 ∪ E2 ∪ E3 ∪ . . . ∪ En)
⇒ E = (E ∩ E1) ∪ (E ∩ E2) ∪ . . . ∪ (E ∩ En)
Since the events E ∩ Ei are pairwise disjoint,
P(E) = P(E ∩ E1) + P(E ∩ E2) + . . . + P(E ∩ En)
⇒ P(E) = P(E/E1) . P(E1) + P(E/E2) . P(E2) + . . . + P(E/En) . P(En) [by the multiplication theorem]
⇒ P(E) = ∑_{i=1}^{n} P(E/Ei) . P(Ei)
The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning method employed to tackle
classification and regression problems. Evelyn Fix and Joseph Hodges developed this algorithm in 1951,
which was subsequently expanded by Thomas Cover. This section explores the fundamentals, workings,
and implementation of the KNN algorithm.
KNN is one of the most basic yet essential classification algorithms in machine learning. It belongs to
the supervised learning domain and finds intense application in pattern recognition, data mining, and
intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make any
underlying assumptions about the distribution of data (as opposed to other algorithms such as GMM,
which assume a Gaussian distribution of the given data). We are given some prior data (also called
training data), which classifies coordinates into groups identified by an attribute.
As an example, consider the following table of data points containing two features:
Now, given another set of data points (also called testing data), allocate these points to a group by
analyzing the training set. Note that the unclassified points are marked as ‘White’.
If we plot these points on a graph, we may be able to locate some clusters or groups. Now, given an
unclassified point, we can assign it to a group by observing what group its nearest neighbors belong to.
This means a point close to a cluster of points classified as ‘Red’ has a higher probability of getting
classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’, and the second point
(5.5, 4.5) should be classified as ‘Red’.
The K-NN algorithm works by finding the K nearest neighbors to a given data point based on a distance
metric, such as Euclidean distance. The class or value of the data point is then determined by the
majority vote or average of the K neighbors. This approach allows the algorithm to adapt to different
patterns and make predictions based on the local structure of the data.
As we know, the KNN algorithm helps us identify the nearest points or groups for a query point.
But to determine the closest groups or the nearest points for a query point, we need some metric. For
this purpose, we use the distance metrics below:
Euclidean Distance
This is nothing but the cartesian distance between the two points which are in the
plane/hyperplane. Euclidean distance can also be visualized as the length of the straight line that joins
the two points which are into consideration. This metric helps us calculate the net displacement done
between the two states of an object.
distance(x, Xi) = √( ∑_{j=1}^{d} (xj − Xij)² )
Manhattan Distance
Manhattan Distance metric is generally used when we are interested in the total distance traveled by the
object instead of the displacement. This metric is calculated by summing the absolute difference
between the coordinates of the points in n-dimensions.
d(x, y) = ∑_{i=1}^{n} |xi − yi|
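As a quick illustration of these two metrics, here is a minimal NumPy sketch; the point values are arbitrary and chosen only for demonstration.

import numpy as np

x = np.array([2.5, 7.0])
y = np.array([5.5, 4.5])

# Euclidean distance: straight-line length between the two points
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(x - y))

print(euclidean, manhattan)   # ~3.905 and 5.5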
The value of k is very crucial in the KNN algorithm to define the number of neighbors in the algorithm.
The value of k in the k-nearest neighbors (k-NN) algorithm should be chosen based on the input data. If
the input data has more outliers or noise, a higher value of k would be better. It is recommended to
choose an odd value for k to avoid ties in classification. Cross-validation methods can help in selecting
the best k value for the given dataset.
A fit using K-NN (with K > 1) is more reasonable than 1-NN, since K-NN is affected much less by noise when
the dataset is large.
In the K-NN algorithm, we can see jumps in the prediction values for a small change in the input. The
reason for this is the change in the set of neighbors. To handle this situation, we can weight the neighbors
in the algorithm: if the distance to a neighbor is high, we want less effect from that neighbor, and if the
distance is low, that neighbor should be more influential than the others.
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it predicts the
label or value of a new data point by considering the labels or values of its K nearest neighbors in the
training dataset.
• K represents the number of nearest neighbors that need to be considered while making the
prediction.
• To measure the similarity between target and training data points, Euclidean distance is used.
Distance is calculated between each of the data points in the dataset and target point.
• In the classification problem, the class labels of K-nearest neighbors are determined by
performing majority voting. The class with the most occurrences among the neighbors becomes
the predicted class for the target data point.
• In the regression problem, the prediction is calculated by taking the average of the target values of the K
nearest neighbors. The calculated average value becomes the predicted output for the target
data point.
Let X be the training dataset with n data points, where each data point is represented by a d-dimensional
feature vector Xi, and let Y be the corresponding labels or values for each data point in X. Given a new data
point x, the algorithm calculates the distance between x and each data point Xi in X using a distance
metric, such as the Euclidean distance:
distance(x, Xi) = √( ∑_{j=1}^{d} (xj − Xij)² )
The algorithm selects the K data points from X that have the shortest distances to x. For classification
tasks, the algorithm assigns the label y that is most frequent among the K nearest neighbors to x. For
regression tasks, the algorithm calculates the average or weighted average of the values y of the K
nearest neighbors and assigns it as the predicted value for x.
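A minimal scikit-learn sketch of this procedure is given below; it is only an illustration, assuming scikit-learn is available. The training points are hypothetical values chosen so that the two query points from the earlier intuition come out ‘Green’ and ‘Red’, and weights='distance' applies the inverse-distance weighting of neighbors discussed above.

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: two features per point, labels 'Green' / 'Red'
X_train = [[2, 7], [3, 8], [1, 6], [6, 4], [7, 5], [5, 3]]
y_train = ['Green', 'Green', 'Green', 'Red', 'Red', 'Red']

# K = 3 neighbors, Euclidean distance, neighbors weighted by inverse distance
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean', weights='distance')
knn.fit(X_train, y_train)

print(knn.predict([[2.5, 7.0], [5.5, 4.5]]))   # expected: ['Green' 'Red']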
• Adapts Easily – The KNN algorithm stores all the training data in memory, so whenever a new
example or data point is added, the algorithm incorporates it and it contributes to future
predictions as well.
• Few Hyperparameters – The only parameters required in training a KNN algorithm are the value
of k and the choice of the distance metric to be used for evaluation.
• Does not scale – The KNN algorithm is also considered a lazy algorithm: all computation is
deferred to prediction time, so it requires a lot of computing power as well as data storage. This
makes the algorithm both time-consuming and resource-intensive.
• Curse of Dimensionality – Due to the peaking phenomenon, the KNN algorithm is affected by the
curse of dimensionality: it has a hard time classifying data points properly when the
dimensionality is too high.
• Prone to Overfitting – Because the algorithm is affected by the curse of dimensionality, it is prone
to overfitting as well. Feature selection and dimensionality reduction techniques are generally
applied to deal with this problem.
What are Naive Bayes Classifiers?
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is
not a single algorithm but a family of algorithms where all of them share a common principle,
i.e. every pair of features being classified is independent of each other. To start with, let us
consider a dataset.
One of the most simple and effective classification algorithms, the Naïve Bayes classifier aids in
the rapid development of machine learning models with rapid prediction capabilities.
The Naïve Bayes algorithm is used for classification problems and is heavily used in text classification. In
text classification tasks, the data is high-dimensional (as each word represents one feature in the
data). It is used in spam filtering, sentiment detection, rating classification, etc. The advantage of
using Naïve Bayes is its speed: it is fast, and making predictions is easy even with high-dimensional
data.
This model predicts the probability that an instance belongs to a class, given a set of feature
values. It is a probabilistic classifier, and it is called “naive” because it assumes that one feature in the
model is independent of the existence of any other feature. In other words, each feature contributes to
the prediction with no relation to the others. In the real world, this condition is rarely satisfied. The
algorithm uses Bayes’ theorem for training and prediction.
• Assumes that features are independent, which may not always hold in real-world data.
• Text Classification: Used in sentiment analysis, document categorization, and topic classification.
In the weather example, the probability of the 'Yes' class is higher, so you can determine that if the weather
is overcast, the players will play the sport.
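The corresponding dataset and calculation are not reproduced above, so here is a minimal, hypothetical sketch of how such a Naive Bayes estimate can be computed by hand in Python for a single 'weather' feature; the counts are illustrative stand-ins for the usual weather/play table, not values taken from this document.

# Hypothetical counts for a weather/play dataset (illustrative only)
total = 14
play_yes, play_no = 9, 5                       # class counts
overcast_given_yes, overcast_given_no = 4, 0   # 'Overcast' counts per class

# Priors and likelihoods
p_yes, p_no = play_yes / total, play_no / total
p_overcast_yes = overcast_given_yes / play_yes
p_overcast_no = overcast_given_no / play_no

# Unnormalized posteriors via Bayes' theorem (the denominator cancels when comparing)
score_yes = p_overcast_yes * p_yes
score_no = p_overcast_no * p_no
print('Yes' if score_yes > score_no else 'No')   # 'Yes': play when overcast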
Advantages
• It is not only a simple approach but also a fast and accurate method for prediction.
• It performs well in the case of discrete response variables compared to continuous variables.
• When the assumption of independence holds, a Naive Bayes classifier performs better compared
to other models like logistic regression.
Disadvantages
• The assumption of independent features: in practice, it is almost impossible that the model will get a
set of predictors which are entirely independent.
• If there is no training tuple of a particular class, this causes zero posterior probability. In this
case, the model is unable to make predictions. This problem is known as Zero
Probability/Frequency Problem.
The Apriori algorithm is a classic algorithm used in association rule mining and market basket analysis. It
helps in identifying frequent item sets (groups of items) in large datasets, and from these frequent item
sets, it derives association rules. These rules indicate how the presence of one item (or set of items) in a
transaction affects the likelihood of the presence of other items.
1. Support: The support of an itemset is the fraction of transactions in which the itemset appears.
It is a measure of how frequently the itemset appears in the dataset.
For example, if a dataset contains 100 transactions and "bread" appears in 40 of them, then the support
for "bread" is 40/100 = 0.4.
2. Confidence: Confidence is the conditional probability that a transaction contains an item Y, given
that it contains an item X.
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
If confidence is high, it means that the presence of X strongly suggests the presence of Y.
3. Lift: Lift measures how much more likely two items are to appear together than if they were
independent:
Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y)) = Confidence(X → Y) / Support(Y)
A lift greater than 1 means X and Y appear together more often than expected under independence.
The algorithm then finds frequent itemsets and rules in two repeated steps:
1. Generate candidate itemsets:
o Start by identifying all single items that meet a minimum support threshold.
o Then, iteratively combine these frequent itemsets to form larger itemsets (pairs, triples,
etc.) that also meet the support threshold.
2. Prune:
o In each iteration, itemsets that do not meet the minimum support threshold are
discarded (pruned).
o Once frequent itemsets are found, association rules are generated by calculating
confidence for each rule. Only rules with confidence above a threshold are considered.
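A compact sketch of these two phases on a toy set of transactions is shown below; it is a simplified, illustrative implementation (frequent 1- and 2-itemsets plus one confidence check), not a full Apriori engine, and the transactions are made up for demonstration.

from itertools import combinations

# Hypothetical transactions, for illustration only
transactions = [
    {'bread', 'milk'},
    {'bread', 'diaper', 'beer'},
    {'milk', 'diaper', 'beer'},
    {'bread', 'milk', 'diaper'},
    {'bread', 'milk', 'diaper', 'beer'},
]
min_support = 0.4
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / n

# Generate: keep all 1- and 2-itemsets meeting the support threshold (prune the rest)
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
pairs = [frozenset(p) for p in combinations(items, 2) if support(set(p)) >= min_support]
print(frequent + pairs)

# Rule generation: confidence of the rule {diaper} -> {beer}
conf = support({'diaper', 'beer'}) / support({'diaper'})
print(round(conf, 2))   # 0.75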
Perceptron
The perceptron is one of the oldest and earliest neural networks, proposed by Frank Rosenblatt in 1958.
A perceptron is also known as a single-layer artificial neural network. The perceptron is often used to
compute logical gates such as AND, OR, and NOR, which have binary inputs and a binary output.
Here the activation function can be any nonlinearity such as sigmoid, tanh, or ReLU. Based on the
requirement, we choose the most appropriate nonlinear activation function to produce the best result.
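As a small illustration of a perceptron computing a logical gate, the sketch below implements an AND gate with a simple step activation; the weights and bias are one hand-picked choice that works, not the only possible one.

def perceptron_and(x1, x2):
    # Hand-picked weights and bias for the AND gate
    w1, w2, bias = 1.0, 1.0, -1.5
    total = w1 * x1 + w2 * x2 + bias
    return 1 if total >= 0 else 0   # step activation

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron_and(a, b))   # output is 1 only when both inputs are 1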
Multi-layer Perceptron
A multi-layer perceptron is also known as an MLP. It consists of fully connected dense layers, which
transform any input dimension to the desired dimension. A multi-layer perceptron is a neural
network that has multiple layers.
A multi-layer perceptron has one input layer with one neuron (or node) for each input, one output layer
with a single node for each output, and any number of hidden layers, where each hidden layer can have
any number of nodes. A typical multi-layer perceptron (MLP) is described below.
Consider, for example, a multi-layer perceptron with three inputs and thus three input nodes, a hidden
layer with three nodes, and an output layer that gives two outputs, so there are two output nodes. The
nodes in the input layer take the input and forward it for further processing: each node in the input layer
forwards its output to each of the three nodes in the hidden layer, and in the same way, the hidden layer
processes the information and passes it to the output layer.
Every node in the multi-layer perceptron uses a sigmoid activation function. The sigmoid
activation function takes real values as input and converts them to numbers between 0 and 1
using the sigmoid formula σ(x) = 1 / (1 + e^(−x)).
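To make the layer-by-layer flow concrete, here is a minimal NumPy sketch of one forward pass through the 3-3-2 network described above; the weights and input values are random/arbitrary, purely for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])          # three inputs

# Hidden layer: 3 nodes, each connected to all 3 inputs
W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)
hidden = sigmoid(W1 @ x + b1)

# Output layer: 2 nodes, each connected to all 3 hidden nodes
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
output = sigmoid(W2 @ hidden + b2)
print(output)                           # two values between 0 and 1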
Association rule mining finds interesting associations and relationships among large sets of data
items. Such a rule shows how frequently an itemset occurs in a transaction. A typical example is
market basket analysis. Market basket analysis is one of the key techniques used by large
retailers to show associations between items. It allows retailers to identify relationships between
the items that people buy together frequently. Given a set of transactions, we can find rules that
will predict the occurrence of an item based on the occurrences of other items in the
transaction.
TID Items
1 Bread, Milk
Before we start defining the rule, let us first see the basic definitions.
• Support Count(σ) – Frequency of occurrence of an itemset.
• Support(s) – The number of transactions that include items in both the {X} and {Y} parts of the rule, as
a percentage of the total number of transactions. It is a measure of how frequently the collection
of items occurs together, as a fraction of all transactions.
• Support = (number of transactions containing both X and Y) / (total number of transactions). It is
interpreted as the fraction of transactions that contain both X and Y.
• Confidence(c) – It is the ratio of the number of transactions that include all items in both {A} and {B}
to the number of transactions that include all items in {A}:
Confidence(A ⇒ B) = Support(A ∪ B) / Support(A)
• Lift(l) – The lift of the rule X ⇒ Y is the confidence of the rule divided by the expected confidence,
assuming that the itemsets X and Y are independent of each other. The expected confidence is simply
the frequency (support) of {Y}.
• Lift(X ⇒ Y) = Conf(X ⇒ Y) / Supp(Y) – A lift value near 1 indicates that X and Y appear together about
as often as expected, greater than 1 means they appear together more often than expected, and less
than 1 means they appear together less often than expected. Greater lift values indicate a stronger
association.
As a worked example for one such rule over five transactions:
Support = 2/5 = 0.4
Confidence = 2/3 = 0.67
Lift = 0.4 / (0.6 × 0.6) ≈ 1.11
Association rules are very useful in analyzing datasets. The data is collected using bar-code
scanners in supermarkets. Such databases consist of a large number of transaction records,
each listing all items bought by a customer in a single purchase. With this data, a manager can
tell whether certain groups of items are consistently purchased together and use it for adjusting store
layouts, cross-selling, and promotions based on these statistics.
A Decision Tree is a machine learning algorithm used for classification and regression tasks. It's a
predictive model that maps observations about data (represented in the branches) to conclusions
(represented in the leaves). Decision trees are easy to understand, interpret, and visualize, which makes
them popular for both technical and non-technical audiences.
1. Root Node: The topmost node in the tree, representing the entire dataset. From here, the
dataset is split based on certain conditions.
2. Branches: Each branch represents the result of a decision (or test) on an attribute. It leads to
another node or a leaf.
3. Internal Nodes: These nodes contain decisions based on input features (e.g., whether an
individual’s age is above or below 30). They represent intermediate stages in the decision-
making process.
4. Leaf Nodes (Terminal Nodes): These nodes represent the final output or decision. In a
classification tree, they represent class labels; in a regression tree, they represent a numeric
output.
5. Splitting: The process of dividing a node into two or more sub-nodes. The goal is to improve the
homogeneity (for classification) or minimize error (for regression) within the resulting subsets.
6. Impurity Measures:
o Gini Index: Measures the impurity in a classification decision tree. Lower values indicate
purer splits.
o Entropy: Used in information gain, another measure for deciding the best split in
classification.
o Variance: In regression trees, it measures the variability of the target variable in each
node.
• Classification Trees: Used when the output is a category or class (e.g., "spam" or "not spam").
• Regression Trees: Used when the output is a continuous value (e.g., predicting house prices).
Advantages:
• Easy to understand, interpret, and visualize, for both technical and non-technical audiences.
• Can handle both classification and regression tasks.
Disadvantages:
• Overfitting: Trees can become too complex and capture noise in the data.
• Bias towards dominant classes: If one class dominates the dataset, the tree may overly focus on
that class.
• Unstable: Small changes in the data can lead to very different trees.
• CART (Classification and Regression Trees): One of the most common implementations.
• ID3 (Iterative Dichotomiser 3): A decision tree algorithm that uses entropy and information gain
for classification.
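As a brief illustration of training a classification tree, here is a minimal scikit-learn sketch; it assumes scikit-learn is available, and the tiny dataset is hypothetical, chosen only to show the API.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: [age, income] -> buys product (1) or not (0)
X = [[22, 30], [25, 45], [47, 60], [52, 28], [46, 80], [56, 52]]
y = [0, 0, 1, 0, 1, 1]

# CART-style tree using Gini impurity; limit depth to keep it interpretable
tree = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=['age', 'income']))   # text view of the splits
print(tree.predict([[30, 55]]))                              # prediction for a new observation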
Decision trees are often used in ensemble methods like Random Forests and Gradient Boosting to
improve accuracy and address some of their limitations.
Neural networks are computational models that mimic the way biological neural networks in the human
brain process information. They consist of layers of neurons that transform the input data into
meaningful outputs through a series of mathematical operations.
Feedforward Neural Networks (FNN)
• Definition: Feedforward neural networks are a form of artificial neural network in which connections do
not form any cycles between layers or nodes; inputs pass forward through the nodes of the hidden
layers to the output nodes.
• Architecture: Made up of layers with a unidirectional flow of data (from the input layer through the
hidden layers to the output layer).
• Training: Backpropagation is often used during training for the main aim of reducing the
prediction errors.
• Applications: In visual and voice recognition, NLP, financial forecasting, and recommending
systems.
Convolutional Neural Networks (CNN)
• Definition: The convolutional neural network structure is designed for processing grid-like data such
as images and videos, using convolutional layers whose filters detect patterns and spatial
hierarchies.
• Key Components: Utilizing convolutional layers, pooling layers and fully connected layers.
• Applications: Used for image classification, object detection, medical imaging analysis,
autonomous driving, and visualization in augmented reality.
Recurrent Neural Networks (RNN)
• Definition: Recurrent neural networks handle sequential data in which the current output depends on
previous inputs; the network loops over itself to hold an internal state (memory).
• Architecture: Contains recurrent connections that enable feedback loops for processing
sequences.
• Challenges: Problems such as vanishing gradients become apparent, since they limit the network's
ability to capture long-range dependencies.
• Applications: Language translation, open-ended text classification, one-to-one (conversational)
interaction, and time series prediction are its applications.
Long Short-Term Memory Networks (LSTM)
• Definition: LSTM networks, a recurrent neural network variant, use memory cells to solve the
vanishing gradient issue and retain information over long ranges.
• Key Features: Memory cells and gates control the flow of information and mitigate the vanishing
gradient issue.
• Applications: Their value over plain RNNs lies in bringing long-term memory into the model, e.g.,
language translation and time-series forecasting.
Gated Recurrent Units (GRU)
• Definition: The GRU is another common variant of RNNs; it uses a gating mechanism just like the
LSTM but with fewer parameters.
Radial Basis Function Networks (RBF)
• Definition: Radial basis function (RBF) networks are models built on radial basis functions, which are
very useful in function approximation and classification, and in modelling complex input-output
data.
• Applications: Regression, pattern recognition, and system control methods that require fast
responses.
Self-Organizing Maps (SOM)
• Definition: Self-organizing maps are unsupervised neural networks used for cluster generation; they
retain the topological features of high-dimensional input data while transforming it into a
low-dimensional output representation.
• Features: They reduce the dimension of data from a high-dimensional space to a low-dimensional
one without losing the underlying geometry of the data.
Deep Belief Networks (DBN)
• Definition: The architecture of deep belief networks is built from many layers of stochastic, latent
variables and is used for both deep supervised and unsupervised tasks such as nonlinear
feature learning and intermediate representations.
• Function: They learn layered representations of the data that can then be used for tasks such as
classification.
• Applications: Image and voice recognition, natural language understanding, and recommendation
systems in smart devices.
Generative Adversarial Networks (GAN)
• Definition: Generative adversarial networks are made up of two neural networks, the generator and
the discriminator, which compete against each other. The generator creates fake data, and the
discriminator learns to differentiate real data from fake data.
• Working Principle: The generator improves at producing fake data after each iteration, and this
simultaneously makes the discriminator better at determining whether a sample is real or
generated.
• Applications: They have proved useful not only for data generation but also for data
augmentation, style transfer, and unsupervised learning.
Autoencoders (AE)
• Definition: Autoencoders are feedforward networks (ANNs) trained to learn useful representations of
the data by encoding and then reconstructing the input. The encoder maps the input into a latent
space representation, while the decoder does the opposite, reconstructing the input from this
representation.
Siamese Neural Networks
• Definition: Siamese neural networks consist of two sub-networks with the same structure and identical
weights. The two inputs are compared via a similarity metric that tells the degree of resemblance
between them.
Capsule Networks
• Definition: The layers of capsule networks not only capture local features of the data but also preserve
part-whole (spatial) relationships, passing information from lower convolutional layers to
higher-level capsules in a multilevel structure.
• Applications: Image classification, object detection and scene understanding over large amounts of
visual data.
Transformer Networks
• Definition: Transformer networks use a self-attention mechanism, which allows the input tokens to
be processed in parallel and improves the capturing of long-range dependencies.
• Key Features: They perform strongly on natural language processing and handle tasks such as
machine translation, text generation, and document summarization.
• Applications: The technology has become very popular, especially in language understanding tasks,
and is increasingly applied to image and audio data processing and similar tasks.
Spiking Neural Networks (SNN)
• Definition: Spiking neural networks model brain functionality by communicating through action
potentials (spikes), in the same way as biological neurons. They are a key element of
"neuromorphic" technology, which performs learning with event-driven rather than conventional
processing.
The uses of neural networks are diverse and cut across many distinct industries and domains; processes
and innovations are being transformed and even revolutionized by this advancement in technology.
• Healthcare: Neural networks play a critical role in medical image analysis, disease diagnosis,
personalized treatment plans, drug discovery and healthcare management systems.
• Finance: They have a very strong influence on algorithmic trading, fraud detection, credit
scoring, risk management and portfolio optimization.
• Manufacturing: They innovate in supply chain management especially in optimizing it, predictive
maintenance, quality control processes and industrial automation.
• Transportation: Neural networks are incorporated into self-driving cars for perception,
decision-making, and navigation.
• Environmental Sciences: They help construct climate models, satellite monitoring, and
ecological observation.
Neural networks are a basic backbone of modern artificial intelligence, changing the way machines
learn from data and carry out sophisticated tasks that were once considered exclusively human. Research
is developing every day, and computational resources are becoming more readily available, so neural
networks are constantly evolving and transforming industries. The coming years will see smart systems
integrated into daily life, opening many possibilities across healthcare, finance, entertainment,
manufacturing, transportation, and other fields.
CART( Classification And Regression Trees) is a variation of the decision tree algorithm. It can handle
both classification and regression tasks. Scikit-Learn uses the Classification And Regression Tree (CART)
algorithm to train Decision Trees (also called “growing” trees). CART was first produced by Leo Breiman,
Jerome Friedman, Richard Olshen, and Charles Stone in 1984.
CART is a predictive algorithm used in Machine learning and it explains how the target variable’s values
can be predicted based on other matters. It is a decision tree where each fork is split into a predictor
variable and each node has a prediction for the target variable at the end.
The term CART serves as a generic term for the following categories of decision trees:
• Classification Trees: The tree is used to determine which “class” the target variable is most likely
to fall into when the target is categorical.
• Regression Trees: The tree is used to predict the value of a continuous target variable.
In the decision tree, nodes are split into sub-nodes based on a threshold value of an attribute. The root
node is taken as the training set and is split into two by considering the best attribute and threshold
value. Further, the subsets are also split using the same logic. This continues until the last pure sub-set is
found in the tree or the maximum number of possible leaves in that growing tree is reached.
CART Algorithm
Classification and Regression Trees (CART) is a decision tree algorithm that is used for both classification
and regression tasks. It is a supervised learning algorithm that learns from labelled data to predict
unseen data.
• Tree structure: CART builds a tree-like structure consisting of nodes and branches. The nodes
represent different decision points, and the branches represent the possible outcomes of those
decisions. The leaf nodes in the tree contain a predicted class label or value for the target
variable.
• Splitting criteria: CART uses a greedy approach to split the data at each node. It evaluates all
possible splits and selects the one that best reduces the impurity of the resulting subsets. For
classification tasks, CART uses Gini impurity as the splitting criterion; the lower the Gini impurity,
the purer the subset. For regression tasks, CART splits so as to reduce the residual error (variance) of
the target; the lower the remaining residual error, the better the fit of the model to the data.
• Pruning: To prevent overfitting of the data, pruning is a technique used to remove the nodes
that contribute little to the model accuracy. Cost complexity pruning and information gain
pruning are two popular pruning techniques. Cost complexity pruning involves calculating the
cost of each node and removing nodes that have a negative cost. Information gain pruning
involves calculating the information gain of each node and removing nodes that have a low
information gain.
• The best split point is first found for each input variable (Step 1).
• Based on the best-split points of each input in Step 1, the new “best” split point over all inputs is
identified, and the node is split using it.
• Continue splitting until a stopping rule is satisfied or no further desirable splitting is available.
The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does so by
searching for the best homogeneity of the sub-nodes, with the help of the Gini index criterion.
The Gini index is a metric for classification tasks in CART. It stores the sum of squared
probabilities of each class. It measures the probability that a specific instance would be wrongly
classified if it were labelled randomly according to the class distribution, and it is a variation of the Gini
coefficient. It works on categorical variables, gives outcomes of either “success” or “failure”, and hence
conducts binary splitting only.
• A value of 0 indicates that all the elements belong to a single class, or only one class exists there.
• A Gini index close to 1 means a high level of impurity, where each class contains only a very small
fraction of the elements.
• A value of 1 − 1/n occurs when the elements are uniformly distributed into n classes and each
class has an equal probability of 1/n. For example, with two classes, the Gini impurity is 1 − 1/2 =
0.5.
Gini = 1 − ∑_{i=1}^{n} (pi)²
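To make the formula concrete, here is a small illustrative Python function for the Gini impurity of a set of class labels; the example labels are arbitrary.

from collections import Counter

def gini_impurity(labels):
    # Gini = 1 - sum(p_i^2) over the class proportions p_i
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(['A', 'A', 'A', 'A']))       # 0.0  -> pure node
print(gini_impurity(['A', 'A', 'B', 'B']))       # 0.5  -> two classes, evenly split
print(gini_impurity(['A', 'B', 'C', 'A', 'B']))  # 0.64 -> more mixed classes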
• CART (Classification and Regression Trees): The original algorithm that uses binary splits to build
decision trees.
• C4.5 and C5.0: Related decision tree algorithms (successors of ID3) that allow for multiway splits and
handle categorical variables more effectively.
• Random Forests: Ensemble methods that use multiple decision trees (often CART) to improve
predictive performance and reduce overfitting.
• Gradient Boosting Machines (GBM): Boosting algorithms that also use decision trees (often
CART) as base learners, sequentially improving model performance.
Advantages of CART
• Simple to understand, interpret, and visualize.
• Can handle both classification and regression tasks.
Limitations of CART
• Overfitting.
• High variance.
• Low bias.
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the
following form:
IF condition THEN conclusion
Points to remember −
• The IF part of the rule is called the rule antecedent or precondition, and the THEN part is called the
rule consequent.
• The antecedent part, the condition, consists of one or more attribute tests, and these tests are
logically ANDed.
If the condition holds true for a given tuple, then the antecedent is satisfied.
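As an illustration of how such a rule can be checked against a tuple, here is a tiny hypothetical Python sketch; the attribute names and the rule itself are made up for demonstration.

# Hypothetical rule: IF age = 'youth' AND student = 'yes' THEN buys_computer = 'yes'
def rule_antecedent(tuple_):
    # The attribute tests of the antecedent are logically ANDed
    return tuple_['age'] == 'youth' and tuple_['student'] == 'yes'

tuple_ = {'age': 'youth', 'student': 'yes'}
if rule_antecedent(tuple_):
    print("Antecedent satisfied -> predict buys_computer = 'yes'")
else:
    print('Rule does not fire for this tuple')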