
Unit II: Classification

What is Bayes’ Theorem?

Bayes' Theorem is used to determine the conditional probability of an event. It is named after the English statistician Thomas Bayes, whose formulation of the result was published in 1763. Bayes' Theorem is an important result in probability theory that laid the foundation of a distinct approach to statistical inference known as Bayesian inference. It is used to find the probability of an event based on prior knowledge of conditions that might be related to that event.

Bayes theorem (also known as the Bayes Rule or Bayes Law) is used to determine the conditional
probability of event A when event B has already occurred.

The general statement of Bayes' theorem is: "The conditional probability of an event A, given the occurrence of another event B, is equal to the product of the probability of B given A and the probability of A, divided by the probability of B." i.e.

P(A|B) = P(B|A)P(A) / P(B)

where,

• P(A) and P(B) are the probabilities of events A and B

• P(A|B) is the probability of event A when event B happens

• P(B|A) is the probability of event B when A happens
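As a small illustration of the formula (a minimal sketch with hypothetical numbers, echoing the medical-test example discussed later in this unit):

# Hypothetical numbers: a disease affects 1% of people, the test detects it 95% of
# the time, and gives a false positive 10% of the time.
p_disease = 0.01                     # P(A): prior probability of disease
p_pos_given_disease = 0.95           # P(B|A): positive test given disease
p_pos_given_healthy = 0.10           # P(B|not A): false-positive rate

# P(B): total probability of a positive test
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # ~0.088

Even with a positive result, the posterior probability of disease stays below 9% because the prior P(A) is so small.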

Bayes Theorem Statement

Bayes’ Theorem for n set of events is defined as,

Let E1, E2, …, En be a set of events associated with the sample space S, in which all the events E1, E2, …, En have a non-zero probability of occurrence and together form a partition of S. Let A be an event of S whose probability we want to find. Then, according to Bayes' theorem,

P(Ei|A) = P(Ei)P(A|Ei) / ∑k P(Ek)P(A|Ek)

where the sum in the denominator runs over k = 1, 2, …, n.

Bayes Theorem Formula

For any two events A and B, the Bayes theorem formula is:

P(A|B) = P(B|A)P(A) / P(B)

where,

• P(A) and P(B) are the probabilities of events A and B; P(B) must be non-zero.

• P(A|B) is the probability of event A when event B happens

• P(B|A) is the probability of event B when A happens

Bayes Theorem Derivation

The proof of Bayes' Theorem follows from the conditional probability formula:

P(Ei|A) = P(Ei∩A) / P(A)…..(i)

Then, by using the multiplication rule of probability, we get

P(Ei∩A) = P(Ei)P(A|Ei)……(ii)

Now, by the total probability theorem,

P(A) = ∑ P(Ek)P(A|Ek)…..(iii)

Substituting the values of P(Ei∩A) and P(A) from equations (ii) and (iii) into equation (i), we get
P(Ei|A) = P(Ei)P(A|Ei) / ∑ P(Ek)P(A|Ek)

Bayes’ theorem is also known as the formula for the probability of “causes”. As we know, the Ei‘s are a
partition of the sample space S, and at any given time only one of the events Ei occurs. Thus we
conclude that the Bayes’ theorem formula gives the probability of a particular Ei, given the event A has
occurred.

Terms Related to Bayes Theorem

Having covered Bayes' theorem in detail, let us now look at some important terms related to the formula and its derivation.

• Hypotheses: The events E1, E2, …, En in the sample space are called the hypotheses.

• Prior Probability: The prior probability is the initial probability of an event before any new
data is taken into account. P(Ei) is the prior probability of hypothesis Ei.

• Posterior Probability: The posterior probability is the updated probability of an event after considering new information. P(Ei|A) is the posterior probability of hypothesis Ei.

Conditional Probability

• The probability of an event A based on the occurrence of another event B is termed conditional
Probability.

• It is denoted as P(A|B) and represents the probability of A when event B has already happened.

Joint Probability

The probability of two or more events occurring together is called the joint probability. For two events A and B, the joint probability is denoted P(A∩B).

Random Variables

Real-valued variables whose possible values are determined by the outcome of a random experiment are called random variables. The probabilities of such values can be estimated empirically from repeated trials, giving the experimental probability.

Bayes’ Theorem Applications

Bayesian inference, which is derived directly from Bayes' theorem, has found application in many fields, including medicine, science, philosophy, engineering, sports, and law.

Example: Bayes' theorem lets us interpret a medical test result by combining how likely a person is to have the disease in the first place with the overall accuracy of the test.

Difference Between Conditional Probability and Bayes Theorem

The difference between conditional probability and Bayes' theorem can be summarized as follows:

• Bayes' Theorem: Derived using the definition of conditional probability; it is used to find the "reverse" probability P(A|B) from P(B|A). Formula: P(A|B) = [P(B|A)P(A)] / P(B)

• Conditional Probability: The probability of event A given that event B has already occurred. Formula: P(A|B) = P(A∩B) / P(B)

Theorem of Total Probability

Let E1, E2, …, En be mutually exclusive and exhaustive events associated with a random experiment, and let E be an event that occurs together with some Ei. Then, prove that

P(E) = ∑i P(E|Ei) · P(Ei), where the sum runs over i = 1, 2, …, n.

Proof:

Let S be the sample space. Then,

S = E1 ∪ E2 ∪ E3 ∪ . . . ∪ En and Ei ∩ Ej = ∅ for i ≠ j.

E=E∩S

⇒ E = E ∩ (E1 ∪ E2 ∪ E3 ∪ . . . ∪ En)
⇒ E = (E ∩ E1) ∪ (E ∩ E2) ∪ . . . ∪ (E ∩ En)
P(E) = P{(E ∩ E1) ∪ (E ∩ E2)∪ . . . ∪(E ∩ En)}

⇒ P(E) = P(E ∩ E1) + P(E ∩ E2) + . . . + P(E ∩ En)


[since (E ∩ E1), (E ∩ E2), …, (E ∩ En) are pairwise disjoint]

⇒ P(E) = P(E|E1)·P(E1) + P(E|E2)·P(E2) + … + P(E|En)·P(En) [by the multiplication theorem]
⇒ P(E) = ∑i P(E|Ei)·P(Ei)
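A small numeric sketch of the theorem (the numbers are hypothetical, not from the text above):

# Hypothetical partition: three machines produce 50%, 30%, 20% of items (P(Ei)),
# with defect rates 2%, 3%, 4% (P(E|Ei)); E = "item is defective".
p_Ei = [0.50, 0.30, 0.20]
p_E_given_Ei = [0.02, 0.03, 0.04]

# Theorem of total probability: P(E) = sum_i P(E|Ei) * P(Ei)
p_E = sum(pe * pi for pe, pi in zip(p_E_given_Ei, p_Ei))
print(p_E)   # 0.027

# Bayes' theorem then gives the posterior of each cause Ei given E
posteriors = [pe * pi / p_E for pe, pi in zip(p_E_given_Ei, p_Ei)]
print([round(p, 3) for p in posteriors])   # [0.37, 0.333, 0.296]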

The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning method employed to tackle
classification and regression problems. Evelyn Fix and Joseph Hodges developed this algorithm in 1951,
which was subsequently expanded by Thomas Cover. The article explores the fundamentals, workings,
and implementation of the KNN algorithm.

What is the K-Nearest Neighbors Algorithm?

KNN is one of the most basic yet essential classification algorithms in machine learning. It belongs to
the supervised learning domain and finds intense application in pattern recognition, data mining, and
intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make any underlying assumptions about the distribution of the data (unlike algorithms such as GMM, which assume a Gaussian distribution of the given data). We are given some prior data (also called training data), which classifies coordinates into groups identified by an attribute.

As an example, consider a set of training data points described by two features, where each point is labelled with a group such as 'Red' or 'Green'. [Figure: KNN algorithm working visualization]

Now, given another set of data points (also called testing data), we allocate these points to a group by analyzing the training set. Note that the unclassified points are marked as 'White'.

Intuition Behind KNN Algorithm

If we plot these points on a graph, we may be able to locate some clusters or groups. Now, given an
unclassified point, we can assign it to a group by observing what group its nearest neighbors belong to.
This means a point close to a cluster of points classified as ‘Red’ has a higher probability of getting
classified as ‘Red’.

Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’, and the second point
(5.5, 4.5) should be classified as ‘Red’.

Why do we need a KNN algorithm?


The K-NN algorithm is a versatile and widely used machine learning algorithm, valued primarily for its
simplicity and ease of implementation. It does not require any assumptions about the underlying data
distribution. It can also handle both numerical and categorical data, making it a flexible choice for
various types of datasets in classification and regression tasks. It is a non-parametric method that makes
predictions based on the similarity of data points in a given dataset. K-NN is less sensitive to outliers
compared to other algorithms.

The K-NN algorithm works by finding the K nearest neighbors to a given data point based on a distance
metric, such as Euclidean distance. The class or value of the data point is then determined by the
majority vote or average of the K neighbors. This approach allows the algorithm to adapt to different
patterns and make predictions based on the local structure of the data.

Distance Metrics Used in KNN Algorithm

The KNN algorithm helps us identify the nearest points or groups for a query point, but to determine which groups or points are closest we need a distance metric. For this purpose, we use the distance metrics below:

Euclidean Distance

This is simply the Cartesian distance between two points in a plane or hyperplane. The Euclidean distance can also be visualized as the length of the straight line joining the two points under consideration. This metric corresponds to the net displacement between two states of an object.

distance(x, Xi) = √( ∑_{j=1}^{d} (xj − Xij)² )

Manhattan Distance

Manhattan Distance metric is generally used when we are interested in the total distance traveled by the
object instead of the displacement. This metric is calculated by summing the absolute difference
between the coordinates of the points in n-dimensions.

d(x, y) = ∑_{i=1}^{n} |xi − yi|
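Both metrics translate directly into code (a minimal sketch; the function names are ours, and the sample points reuse the (2.5, 7) and (5.5, 4.5) points from the intuition example above):

import math

def euclidean_distance(x, y):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan_distance(x, y):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

print(euclidean_distance((2.5, 7.0), (5.5, 4.5)))   # ~3.905
print(manhattan_distance((2.5, 7.0), (5.5, 4.5)))   # 5.5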

How to choose the value of k for KNN Algorithm?

The value of k, which defines the number of neighbors considered, is crucial in the KNN algorithm and should be chosen based on the input data. If
the input data has more outliers or noise, a higher value of k would be better. It is recommended to
choose an odd value for k to avoid ties in classification. Cross-validation methods can help in selecting
the best k value for the given dataset.

Algorithm for K-NN

DistanceToNN = sorted list of distances from the test example to the first k training examples
values = target values of those k training examples

for i = k+1 to number of training records:

    Dist = distance(test example, i-th training example)

    if Dist < the largest distance in DistanceToNN:

        Remove that largest-distance entry from DistanceToNN and its value from values.

        Insert the new distance into DistanceToNN and its value into values, keeping sorted order.

Return the average of values (for regression; for classification, return the majority class instead).

A fit using K-NN (with k > 1) is more reasonable than 1-NN, since K-NN is affected much less by noise when the dataset is large.

In the K-NN algorithm, predictions can jump abruptly for a small change in the input, because the set of nearest neighbors changes. To handle this, we can weight the neighbors: if a neighbor is far away, it should have less effect on the prediction; if it is close, it should have more effect than the others.

Workings of KNN algorithm

The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity, where it predicts the
label or value of a new data point by considering the labels or values of its K nearest neighbors in the
training dataset.

Step-by-Step explanation of how KNN works is discussed below:

Step 1: Selecting the optimal value of K

• K represents the number of nearest neighbors that need to be considered while making a prediction.

Step 2: Calculating distance

• To measure the similarity between target and training data points, Euclidean distance is used.
Distance is calculated between each of the data points in the dataset and target point.

Step 3: Finding Nearest Neighbors


• The k data points with the smallest distances to the target point are the nearest neighbors.

Step 4: Voting for Classification or Taking Average for Regression

• In a classification problem, the class label is determined by majority voting among the K nearest neighbors. The class with the most occurrences among the neighbors becomes the predicted class for the target data point.

• In a regression problem, the predicted value is calculated by taking the average of the target values of the K nearest neighbors. This average becomes the predicted output for the target data point.

Let X be the training dataset with n data points, where each data point is represented by a d-dimensional feature vector Xi, and let Y be the corresponding labels or values for each data point in X. Given a new data point x, the algorithm calculates the distance between x and each data point Xi in X using a distance metric such as the Euclidean distance:

distance(x, Xi) = √( ∑_{j=1}^{d} (xj − Xij)² )

The algorithm selects the K data points from X that have the shortest distances to x. For classification
tasks, the algorithm assigns the label y that is most frequent among the K nearest neighbors to x. For
regression tasks, the algorithm calculates the average or weighted average of the values y of the K
nearest neighbors and assigns it as the predicted value for x.
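This description can be turned into a short self-contained implementation (a sketch only, not an optimized library routine; the name knn_predict and the toy data are ours):

import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, regression=False):
    # Step 2: distance from x to every training point (Euclidean)
    dists = [(math.dist(xi, x), yi) for xi, yi in zip(X_train, y_train)]
    # Step 3: keep the k nearest neighbors
    neighbors = sorted(dists)[:k]
    if regression:
        # Step 4 (regression): average of the neighbors' target values
        return sum(y for _, y in neighbors) / k
    # Step 4 (classification): majority vote among the neighbors' labels
    return Counter(y for _, y in neighbors).most_common(1)[0][0]

X_train = [(1, 1), (2, 1), (7, 8), (8, 8)]
y_train = ["Red", "Red", "Green", "Green"]
print(knn_predict(X_train, y_train, (2, 2), k=3))   # Red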

Advantages of the KNN Algorithm

• Easy to implement as the complexity of the algorithm is not that high.

• Adapts Easily – Since KNN stores all the training data in memory, whenever a new example or data point is added the algorithm automatically takes it into account, and the new example contributes to future predictions.

• Few Hyperparameters – The only parameters required when training a KNN model are the value of k and the choice of distance metric.

Disadvantages of the KNN Algorithm

• Does not scale – KNN is a "lazy" algorithm: it defers all computation to prediction time and stores the full training set, which makes it both time-consuming and resource-intensive.

• Curse of Dimensionality – KNN is affected by the curse of dimensionality (sometimes called the peaking phenomenon): it has a hard time classifying data points properly when the dimensionality is too high.

• Prone to Overfitting – Because of the curse of dimensionality, KNN is also prone to overfitting; feature selection and dimensionality reduction techniques are generally applied to deal with this problem.
What are Naive Bayes Classifiers?

Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem. It is not a single algorithm but a family of algorithms that all share a common principle: every pair of features being classified is independent of each other.

One of the simplest and most effective classification algorithms, the Naïve Bayes classifier aids in the rapid development of machine learning models with fast prediction capabilities.

The Naïve Bayes algorithm is used for classification problems and is widely used in text classification. In text classification tasks the data is high-dimensional (each word represents one feature). It is used in spam filtering, sentiment detection, rating classification, etc. The main advantage of Naïve Bayes is its speed: it is fast, and making predictions is easy even with high-dimensional data.

This model predicts the probability that an instance belongs to a class given a set of feature values; it is a probabilistic classifier. It is called "naive" because it assumes that one feature in the model is independent of the existence of any other feature; in other words, each feature contributes to the prediction with no relation to the others. In the real world this condition is rarely satisfied. The model uses Bayes' theorem for both training and prediction.
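As a hedged illustration of how such a classifier is typically used in practice (assuming scikit-learn is installed; the numeric data below is made up):

from sklearn.naive_bayes import GaussianNB

# Made-up numeric features (two measurements) and binary class labels
X = [[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.2]]
y = [0, 0, 1, 1]

model = GaussianNB()    # assumes each feature is normally distributed within each class
model.fit(X, y)
print(model.predict([[1.1, 2.0], [3.1, 4.1]]))   # expected: [0 1]
print(model.predict_proba([[1.1, 2.0]]))         # class membership probabilities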

Advantages of Naive Bayes Classifier

• Easy to implement and computationally efficient.

• Effective in cases with a large number of features.

• Performs well even with limited training data.

• It performs well in the presence of categorical features.

• For numerical features, the data is assumed to come from a normal distribution (Gaussian Naive Bayes).

Disadvantages of Naive Bayes Classifier

• Assumes that features are independent, which may not always hold in real-world data.

• Can be influenced by irrelevant attributes.

• May assign zero probability to unseen events, leading to poor generalization.

Applications of Naive Bayes Classifier

• Spam Email Filtering: Classifies emails as spam or non-spam based on features.

• Text Classification: Used in sentiment analysis, document categorization, and topic classification.

• Medical Diagnosis: Helps in predicting the likelihood of a disease based on symptoms.

• Credit Scoring: Evaluates creditworthiness of individuals for loan approval.

• Weather Prediction: Classifies weather conditions based on various factors.


Example: consider the classic "play tennis" weather dataset of 14 records, in which the outlook is Overcast in 4 records, the class is Yes (play) in 9 records and No in 5 records, and all 4 Overcast records have class Yes. Suppose the weather is Overcast.

Probability of playing:

P(Yes | Overcast) = P(Overcast | Yes) P(Yes) / P(Overcast) .....................(1)

1. Calculate Prior Probabilities:

P(Overcast) = 4/14 = 0.29

P(Yes)= 9/14 = 0.64

2. Calculate the Likelihood:

P(Overcast | Yes) = 4/9 = 0.44

3. Put the prior and likelihood into equation (1):

P(Yes | Overcast) = 0.44 × 0.64 / 0.29 ≈ 0.97 (higher)

Similarly, you can calculate the probability of not playing:

Probability of not playing:

P(No | Overcast) = P(Overcast | No) P(No) / P (Overcast) .....................(2)

1. Calculate Prior Probabilities:

P(Overcast) = 4/14 = 0.29

P(No) = 5/14 = 0.36

2. Calculate the Likelihood:

P(Overcast | No) = 0/5 = 0

3. Put the prior and likelihood into equation (2):

P(No | Overcast) = 0 × 0.36 / 0.29 = 0

The probability of the 'Yes' class is higher, so you can conclude that if the weather is overcast, the players will play the sport.
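The same arithmetic can be checked with a few lines of code (the counts come from the worked example above; with unrounded fractions the 'Yes' posterior is exactly 1 here):

# Counts from the weather ("play tennis") example: 14 records,
# 4 Overcast, 9 Yes, 5 No, and all 4 Overcast records are Yes.
total, overcast = 14, 4
yes, no = 9, 5
overcast_and_yes, overcast_and_no = 4, 0

p_yes_given_overcast = (overcast_and_yes / yes) * (yes / total) / (overcast / total)
p_no_given_overcast = (overcast_and_no / no) * (no / total) / (overcast / total)
print(p_yes_given_overcast, p_no_given_overcast)   # 1.0 0.0  -> predict "Yes"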
Advantages

• It is not only a simple approach but also a fast and accurate method for prediction.

• Naive Bayes has a very low computation cost.

• It can efficiently work on a large dataset.

• It performs better with discrete response variables than with continuous ones.

• It can be used with multiple class prediction problems.

• It also performs well in the case of text analytics problems.

• When the assumption of independence holds, a Naive Bayes classifier performs better compared
to other models like logistic regression.

Disadvantages

• It assumes that features are independent. In practice, it is almost impossible to obtain a set of predictors that are entirely independent.

• If a feature value never occurs with a particular class in the training data, its conditional probability is zero, which forces the posterior probability for that class to zero and prevents the model from making a sensible prediction. This is known as the Zero Probability/Frequency Problem and is usually handled with smoothing techniques such as the Laplace correction.

The Apriori algorithm is a classic algorithm used in association rule mining and market basket analysis. It
helps in identifying frequent item sets (groups of items) in large datasets, and from these frequent item
sets, it derives association rules. These rules indicate how the presence of one item (or set of items) in a
transaction affects the likelihood of the presence of other items.

Key Concepts in Apriori Algorithm

1. Support: The support of an itemset is the fraction of transactions in which the itemset appears.
It is a measure of how frequently the itemset appears in the dataset.

Support(A) = (Number of transactions containing A) / (Total number of transactions)

For example, if a dataset contains 100 transactions and "bread" appears in 40 of them, then the support for "bread" is 40/100 = 0.4.

2. Confidence: Confidence is the conditional probability that a transaction contains an item Y, given
that it contains an item X.
Confidence(X → Y) = Support(X ∪ Y) / Support(X)

If confidence is high, it means that the presence of X strongly suggests the presence of Y.

3. Lift: Lift measures how much more likely two items are to appear together than if they were
independent.

Lift(X → Y) = Confidence(X → Y) / Support(Y)

A lift value greater than 1 indicates a strong association.

Working of Apriori Algorithm

The Apriori algorithm works in the following steps:

1. Generate frequent itemsets:

o Start by identifying all single items that meet a minimum support threshold.

o Then, iteratively combine these frequent itemsets to form larger itemsets (pairs, triples,
etc.) that also meet the support threshold.

o This process is repeated until no more frequent itemsets can be formed.

2. Prune:

o In each iteration, itemsets that do not meet the minimum support threshold are
discarded (pruned).

o This reduces the search space and improves efficiency.

3. Generate association rules:

o Once frequent itemsets are found, association rules are generated by calculating
confidence for each rule. Only rules with confidence above a threshold are considered.
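A minimal pure-Python sketch of this level-wise generate-and-prune process (illustrative only; the helper name apriori_frequent_itemsets is ours, and the transactions reuse the market-basket table shown later in this unit; libraries such as mlxtend offer production-grade implementations):

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    n = len(transactions)
    transactions = [set(t) for t in transactions]

    def support(itemset):
        # fraction of transactions that contain every item in the itemset
        return sum(itemset <= t for t in transactions) / n

    # Step 1 (level 1): frequent single items
    items = sorted({i for t in transactions for i in t})
    frequent = {frozenset([i]): support(frozenset([i]))
                for i in items if support(frozenset([i])) >= min_support}
    result, k = dict(frequent), 2
    # Iteratively combine frequent (k-1)-itemsets into candidate k-itemsets, then prune (step 2)
    while frequent:
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = {c: support(c) for c in candidates if support(c) >= min_support}
        result.update(frequent)
        k += 1
    return result

transactions = [["Bread", "Milk"],
                ["Bread", "Diaper", "Beer", "Eggs"],
                ["Milk", "Diaper", "Beer", "Coke"],
                ["Bread", "Milk", "Diaper", "Beer"],
                ["Bread", "Milk", "Diaper", "Coke"]]
for itemset, s in apriori_frequent_itemsets(transactions, min_support=0.6).items():
    print(sorted(itemset), s)

With min_support = 0.6 this reports the frequent single items and pairs such as {Bread, Milk}, {Milk, Diaper} and {Diaper, Beer}; association rules (step 3) are then generated from these itemsets by computing confidence.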

What is Single Layer Perceptron?

The single layer perceptron is one of the oldest and earliest neural networks, proposed by Frank Rosenblatt in 1958. The perceptron is a simple artificial neural network. It is mainly used to compute logic gates such as AND, OR, and NOR, which have binary inputs and binary outputs.

The main functionality of the perceptron is to:

• Take inputs from the input layer.

• Multiply them by their weights and sum them up.

• Pass the sum through a nonlinear activation function to produce the output.


[Figure: Single-layer neural network]

Here the activation function can be sigmoid, tanh, ReLU, etc. Based on the requirement, we choose the most appropriate nonlinear activation function to produce the best result.
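A minimal sketch of a single-layer perceptron computing the AND gate (the weights and bias are chosen by hand for illustration, and a step function stands in for the nonlinear activation):

def perceptron(inputs, weights, bias):
    # weighted sum of the inputs, followed by a step activation function
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

# Hand-picked weights/bias that realize the logical AND gate
weights, bias = [1.0, 1.0], -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, weights, bias))   # only (1, 1) -> 1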

Multi-layer Perceptron

A multi-layer perceptron is also known as an MLP. It consists of fully connected dense layers that transform any input dimension to the desired dimension. A multi-layer perceptron is a neural network that has multiple layers.

A multi-layer perceptron has one input layer with one neuron (or node) for each input, one output layer with a single node for each output, and any number of hidden layers, where each hidden layer can have any number of nodes. A schematic diagram of a Multi-Layer Perceptron (MLP) is depicted below.
In the multi-layer perceptron diagram above, there are three inputs and thus three input nodes, and the hidden layer has three nodes. The output layer gives two outputs, so there are two output nodes. The nodes in the input layer take the input and forward it for further processing: each input node forwards its output to each of the three nodes in the hidden layer, and in the same way the hidden layer processes the information and passes it on to the output layer.

Every node in the multi-layer perceptron uses a sigmoid activation function. The sigmoid activation function takes real values as input and converts them to numbers between 0 and 1 using the sigmoid formula σ(x) = 1 / (1 + e^(−x)).
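A forward pass through the 3-3-2 network described above can be sketched with NumPy (the weights are random and purely illustrative; a real MLP would learn them with backpropagation):

import numpy as np

def sigmoid(z):
    # squashes real values into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)   # input (3 nodes) -> hidden (3 nodes)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)   # hidden (3 nodes) -> output (2 nodes)

x = np.array([0.5, -1.2, 3.0])                  # three inputs
hidden = sigmoid(x @ W1 + b1)                   # every hidden node sees every input
output = sigmoid(hidden @ W2 + b2)              # every output node sees every hidden node
print(output)                                   # two outputs, each in (0, 1)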

Association rule mining finds interesting associations and relationships among large sets of data items. These rules show how frequently an itemset occurs in transactions. A typical example is Market Basket Analysis, one of the key techniques used by large retailers to uncover associations between items. It allows retailers to identify relationships between the items that people frequently buy together. Given a set of transactions, we can find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.
TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Before we start defining the rules, let us first see the basic definitions.

Support Count (σ) – The frequency of occurrence of an itemset.

Here σ({Milk, Bread, Diaper}) = 2

Frequent Itemset – An itemset whose support is greater than or equal to the minsup threshold.

Association Rule – An implication expression of the form X -> Y, where X and Y are any two itemsets.

Example: {Milk, Diaper}->{Beer}

Rule Evaluation Metrics –

• Support(s) – The number of transactions that include the items in both the {X} and {Y} parts of the rule, as a fraction of the total number of transactions. It measures how frequently the collection of items occurs together across all transactions.

• Support(X => Y) = σ(X ∪ Y) / |T| – It is interpreted as the fraction of transactions that contain both X and Y.

• Confidence(c) – The ratio of the number of transactions that include all items in both X and Y to the number of transactions that include all items in X.

• Conf(X => Y) = Supp(X ∪ Y) / Supp(X) – It measures how often the items in Y appear in transactions that also contain the items in X.

• Lift(l) – The lift of the rule X => Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is simply the support (frequency) of {Y}.

• Lift(X => Y) = Conf(X => Y) / Supp(Y) – A lift value near 1 indicates that X and Y appear together about as often as expected, greater than 1 means they appear together more often than expected, and less than 1 means they appear together less often than expected. Greater lift values indicate a stronger association.

Example – From the above table, for the rule {Milk, Diaper} => {Beer}:

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4

c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67

l = Supp({Milk, Diaper, Beer}) / (Supp({Milk, Diaper}) × Supp({Beer})) = 0.4 / (0.6 × 0.6) = 1.11
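These values can be checked directly from the transaction table above (a small sketch; the helper name supp is ours):

transactions = [{"Bread", "Milk"},
                {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"},
                {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]

def supp(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = supp(X | Y)
c = supp(X | Y) / supp(X)
l = c / supp(Y)
print(round(s, 2), round(c, 2), round(l, 2))   # 0.4 0.67 1.11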

Association rules are very useful in analyzing datasets. In supermarkets, the data is collected using bar-code scanners; such databases consist of a large number of transaction records, each listing all items bought by a customer in a single purchase. From these rules a manager can learn whether certain groups of items are consistently purchased together and use this information for adjusting store layouts, cross-selling, and promotions.

A Decision Tree is a machine learning algorithm used for classification and regression tasks. It's a
predictive model that maps observations about data (represented in the branches) to conclusions
(represented in the leaves). Decision trees are easy to understand, interpret, and visualize, which makes
them popular for both technical and non-technical audiences.

Key Concepts of Decision Trees:

1. Root Node: The topmost node in the tree, representing the entire dataset. From here, the
dataset is split based on certain conditions.

2. Branches: Each branch represents the result of a decision (or test) on an attribute. It leads to
another node or a leaf.

3. Internal Nodes: These nodes contain decisions based on input features (e.g., whether an
individual’s age is above or below 30). They represent intermediate stages in the decision-
making process.
4. Leaf Nodes (Terminal Nodes): These nodes represent the final output or decision. In a
classification tree, they represent class labels; in a regression tree, they represent a numeric
output.

5. Splitting: The process of dividing a node into two or more sub-nodes. The goal is to improve the
homogeneity (for classification) or minimize error (for regression) within the resulting subsets.

6. Impurity Measures:

o Gini Index: Measures the impurity in a classification decision tree. Lower values indicate
purer splits.

o Entropy: Used in information gain, another measure for deciding the best split in
classification.

o Variance: In regression trees, it measures the variability of the target variable in each
node.

Types of Decision Trees:

• Classification Trees: Used when the output is a category or class (e.g., "spam" or "not spam").

• Regression Trees: Used when the output is a continuous value (e.g., predicting house prices).

Example of a Simple Decision Tree:

Imagine you're deciding whether to play tennis based on weather conditions:

• Root Node: "Is it sunny?"

o If yes, branch to "Is humidity high?"

▪ If yes, outcome = "Don't play."

▪ If no, outcome = "Play."

o If no, outcome = "Play."
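The same tree can be written as nested conditionals, which is exactly how a trained tree is evaluated at prediction time (a sketch; the function name play_tennis is ours):

def play_tennis(sunny: bool, high_humidity: bool) -> str:
    # Root node: "Is it sunny?"
    if sunny:
        # Internal node: "Is humidity high?"
        return "Don't play" if high_humidity else "Play"
    return "Play"

print(play_tennis(sunny=True, high_humidity=True))    # Don't play
print(play_tennis(sunny=True, high_humidity=False))   # Play
print(play_tennis(sunny=False, high_humidity=True))   # Play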

Advantages:

• Easy to understand and interpret.

• Can handle both numerical and categorical data.

• Does not require scaling of data.

• Can capture complex, non-linear relationships.

Disadvantages:

• Overfitting: Trees can become too complex and capture noise in the data.

• Bias towards dominant classes: If one class dominates the dataset, the tree may overly focus on
that class.
• Unstable: Small changes in the data can lead to very different trees.

Popular Decision Tree Algorithms:

• CART (Classification and Regression Trees): One of the most common implementations.

• ID3 (Iterative Dichotomiser 3): A decision tree algorithm that uses entropy and information gain
for classification.

Decision trees are often used in ensemble methods like Random Forests and Gradient Boosting to
improve accuracy and address some of their limitations.

What are Neural Networks?

Neural networks are computational models that mimic the way biological neural networks in the human
brain process information. They consist of layers of neurons that transform the input data into
meaningful outputs through a series of mathematical operations.

Feedforward Neural Networks

• Definition: Feedforward neural networks are a form of artificial neural network in which connections do not form any cycles between layers or nodes; data passes from the input nodes through the hidden layers to the output nodes in one direction only.

• Architecture: Made up of layers with a unidirectional flow of data (from the input layer through the hidden layers to the output layer).

• Training: Backpropagation is typically used during training, with the main aim of reducing prediction error.

• Applications: Visual and voice recognition, NLP, financial forecasting, and recommender systems.
Convolutional Neural Networks (CNN)

• Definition: Convolutional neural networks are designed to process grid-like data such as images and videos, using convolutional layers whose filters capture patterns and spatial hierarchies.

• Key Components: Convolutional layers, pooling layers, and fully connected layers.

• Applications: Image classification, object detection, medical image analysis, autonomous driving, and augmented reality.

Recurrent Neural Networks (RNN)

• Definition: Recurrent neural networks handle sequential data, in which the current output depends on previous inputs, by feeding their outputs back into themselves to hold an internal state (memory).

• Architecture: Contains recurrent connections that enable feedback loops for processing sequences.

• Challenges: Problems such as vanishing gradients limit the network's ability to capture long-range dependencies.

• Applications: Language translation, text classification, conversational interaction, and time-series prediction.

Long Short-Term Memory Networks (LSTM)

• Definition: LSTM networks, a recurrent neural network variant, use memory cells to address the vanishing gradient problem and retain information over long ranges.

• Key Features: Gated memory cells control which information flows through and is retained, mitigating the vanishing gradient issue.

• Applications: Tasks where long-term memory matters, e.g. language translation and time-series forecasting.

Gated Recurrent Units (GRU)

• Definition: GRU is the second usual variant of RNNs which is working on gating mechanism just
like LSTM but with little parameter.

• Advantages: Vanishing gradient issue is addressed and it is compute-efficient than LSTM.


• Applications: LSTM is also involved in tasks that can be categorized as similar to speech
recognition and text monitoring.

Radial Basis Function Networks (RBFN)

• Definition: Radial basis function (RBF) networks use radial basis functions as their activations; they are useful for function approximation and classification, and for modelling complex input-output relationships.

• Applications: Regression, pattern recognition, and fast-response system control.

Self-Organizing Maps (SOM)

• Definition: Self-Organizing Maps are unsupervised neural networks used for clustering; they transform high-dimensional data into a low-dimensional output while retaining its topological features.

• Features: They reduce data from a high dimension to a low dimension without losing the underlying geometry of the data.

• Applications: Data visualization, customer segmentation, anomaly detection, and feature selection.

Deep Belief Networks (DBN)

• Definition: Deep Belief Networks are built from many layers of stochastic latent variables and are used for both supervised and unsupervised tasks such as nonlinear feature learning and learning compact intermediate representations.

• Function: They learn layered representations of the data that can then be used effectively for classification.

• Applications: Image and voice recognition, natural language understanding, and recommendation systems.

Generative Adversarial Networks (GAN)

• Definition: Generative Adversarial Networks are made up of two neural networks, a generator and a discriminator, which compete against each other. The generator creates fake data, and the discriminator learns to differentiate real data from fake data.

• Working Principle: With each iteration the generator gets better at producing realistic fake data, which simultaneously forces the discriminator to get better at judging whether samples are real or generated.

• Applications: Pattern and image generation, data augmentation, style transfer, and unsupervised learning.

Autoencoders (AE)

• Definition: Autoencoders are feedforward networks (ANNs) trained to learn useful representations of the information by re-encoding the input data. The encoder maps the input into a latent space representation, while the decoder does the opposite, reconstructing the input from that representation.

• Functionality: Used for dimensionality reduction, feature extraction, noise removal, and generative modelling.

• Types: Variants include undercomplete, overcomplete, and variational autoencoders.


Siamese Neural Networks

• Definition: Siamese neural networks use two sub-networks with the same structure and identical weights. Their outputs are compared via a similarity metric that tells how closely the two inputs resemble each other.

• Applications: Face and signature recognition, information retrieval, image similarity comparison, and verification tasks.

Capsule Networks (CapsNet)

• Definition: Capsule Networks capture not only the presence of features but also their spatial (part-whole) relationships, passing information from lower convolutional layers to higher-level capsules so that multilevel structure is preserved.

• Applications: Image classification, object detection, and scene understanding from large amounts of visual data.
Transformer Networks

• Definition: Transformer networks use a self-attention mechanism that processes the input tokens in parallel, which makes processing faster and improves the capture of long-range dependencies.

• Key Features: Strong performance on natural language tasks such as machine translation, text generation, and document summarization.

• Applications: Widely used in language understanding tasks and, increasingly, in image and audio processing.

Spiking Neural Networks (SNN)

• Definition: Spiking Neural Networks model brain functionality more closely by communicating through discrete action potentials (spikes), just as biological neurons do. They are a key element of "neuromorphic" technology, which performs learning with a very different style of processing from conventional deep learning.

• Applications: Neuromorphic hardware, event-driven learning and computation, and modeling of cognitive processes.

Applications of Neural Networks

The uses of neural networks are diverse and cut across many distinct industries and domains; processes
and innovations are being transformed and even revolutionized by this advancement in technology.

• Healthcare: Neural networks play a critical role in medical image analysis, disease diagnosis,
personalized treatment plans, drug discovery and healthcare management systems.

• Finance: They have a very strong influence on algorithmic trading, fraud detection, credit
scoring, risk management and portfolio optimization.

• Entertainment: Neural networks enable recommendation systems for movies, music, and books, as well as character animation and virtual reality experiences.

• Manufacturing: They drive innovation in supply chain management and optimization, predictive maintenance, quality control processes, and industrial automation.

• Transportation: Neural networks are incorporated into self-driving cars for perception, decision-making, and navigation.

• Environmental Sciences: They help construct climate models, satellite monitoring, and
ecological observation.

Neural networks are a basic backbone of modern artificial intelligence, changing the way machines learn from data and carry out sophisticated tasks that were once considered uniquely human. As research advances and computational resources become more readily available, neural networks continue to evolve and transform industries. The coming years will see smart systems integrated ever more deeply into daily life, opening up possibilities across healthcare, finance, entertainment, manufacturing, transportation, and other domains.

CART( Classification And Regression Trees) is a variation of the decision tree algorithm. It can handle
both classification and regression tasks. Scikit-Learn uses the Classification And Regression Tree (CART)
algorithm to train Decision Trees (also called “growing” trees). CART was first produced by Leo Breiman,
Jerome Friedman, Richard Olshen, and Charles Stone in 1984.

CART(Classification And Regression Tree) for Decision Tree

CART is a predictive algorithm used in machine learning that explains how the target variable's values can be predicted from the other variables. It builds a decision tree in which each fork is a split on a predictor variable and each leaf node carries a prediction for the target variable.

The term CART serves as a generic term for the following categories of decision trees:

• Classification Trees: The tree is used to determine which "class" the target variable is most likely to fall into when the target is categorical (discrete).

• Regression trees: These are used to predict a continuous variable’s value.

In the decision tree, nodes are split into sub-nodes based on a threshold value of an attribute. The root node corresponds to the entire training set and is split into two by considering the best attribute and threshold value. The subsets are then split further using the same logic, and this continues until pure subsets are reached or the maximum allowed number of leaves in the growing tree is reached.

CART Algorithm

Classification and Regression Trees (CART) is a decision tree algorithm that is used for both classification
and regression tasks. It is a supervised learning algorithm that learns from labelled data to predict
unseen data.

• Tree structure: CART builds a tree-like structure consisting of nodes and branches. The nodes
represent different decision points, and the branches represent the possible outcomes of those
decisions. The leaf nodes in the tree contain a predicted class label or value for the target
variable.

• Splitting criteria: CART uses a greedy approach to split the data at each node. It evaluates all possible splits and selects the one that best reduces the impurity of the resulting subsets. For classification tasks, CART uses Gini impurity as the splitting criterion: the lower the Gini impurity, the purer the subset. For regression tasks, CART uses the reduction in residual error (variance) as the splitting criterion: the split that reduces the residual error the most gives the best fit of the model to the data.
• Pruning: To prevent overfitting of the data, pruning is a technique used to remove the nodes
that contribute little to the model accuracy. Cost complexity pruning and information gain
pruning are two popular pruning techniques. Cost complexity pruning involves calculating the
cost of each node and removing nodes that have a negative cost. Information gain pruning
involves calculating the information gain of each node and removing nodes that have a low
information gain.

The CART algorithm works via the following process:

• The best-split point of each input is obtained.

• Based on the best-split points of each input in Step 1, the new “best” split point is identified.

• Split the chosen input according to the “best” split point.

• Continue splitting until a stopping rule is satisfied or no further desirable splitting is available.

The CART algorithm uses Gini impurity to split the dataset when building a decision tree. It does so by searching for the greatest homogeneity in the sub-nodes, with the help of the Gini index criterion.

Gini index/Gini impurity

The Gini index is a metric for classification tasks in CART. It is based on the sum of squared class probabilities and measures how likely a randomly chosen element would be classified incorrectly; it is a variation of the Gini coefficient. It works on categorical target variables, gives outcomes of the form "success" or "failure", and hence performs binary splitting only.

The degree of the Gini index varies from 0 to 1,

• Where 0 depicts that all the elements are allied to a certain class, or only one class exists there.
• Gini index close to 1 means a high level of impurity, where each class contains a very small
fraction of elements, and

• A value of 1-1/n occurs when the elements are uniformly distributed into n classes and each
class has an equal probability of 1/n. For example, with two classes, the Gini impurity is 1 – 1/2 =
0.5.

Mathematically, we can write Gini Impurity as follows:

Gini = 1 − ∑ (pi)²

where the sum runs over the n classes and pi is the probability of an object being classified to a particular class.
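The formula translates directly into code (a minimal sketch; the class counts used below are illustrative):

def gini_impurity(class_counts):
    # Gini = 1 - sum_i (p_i)^2, where p_i is the proportion of class i in the node
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(gini_impurity([10, 0]))    # 0.0  -> pure node (only one class present)
print(gini_impurity([5, 5]))     # 0.5  -> two equally likely classes (1 - 1/2)
print(gini_impurity([4, 3, 3]))  # 0.66 -> more mixed node, higher impurity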

POPULAR CART-BASED ALGORITHMS:

• CART (Classification and Regression Trees): The original algorithm that uses binary splits to build
decision trees.

• C4.5 and C5.0: Related decision tree algorithms (successors of ID3) that allow multiway splits and handle categorical variables more effectively.

• Random Forests: Ensemble methods that use multiple decision trees (often CART) to improve
predictive performance and reduce overfitting.

• Gradient Boosting Machines (GBM): Boosting algorithms that also use decision trees (often
CART) as base learners, sequentially improving model performance.

Advantages of CART

• Results are simple to understand and interpret.

• Classification and regression trees are Nonparametric and Nonlinear.

• Classification and regression trees implicitly perform feature selection.

• Outliers have no meaningful effect on CART.

• It requires minimal supervision and produces easy-to-understand models.

Limitations of CART

• Overfitting.

• High variance.

• Low bias.

• The tree structure may be unstable.

Applications of the CART algorithm


• For quick Data insights.

• In Blood Donors Classification.

• For environmental and ecological data.

• In the financial sectors.

IF-THEN Rules

A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form −

IF condition THEN conclusion

Let us consider a rule R1,

R1: IF age = youth AND student = yes

THEN buy_computer = yes

Points to remember −

• The IF part of the rule is called rule antecedent or precondition.

• The THEN part of the rule is called rule consequent.

• The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed.

• The consequent part consists of class prediction.

Note − We can also write rule R1 as follows −

R1: (age = youth) ^ (student = yes) => (buys_computer = yes)

If the condition holds true for a given tuple, then the antecedent is satisfied.
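Rule R1 can be read as a simple predicate over a tuple's attributes (a sketch; the dictionary keys and function name are illustrative):

def rule_r1(tuple_):
    # IF (age = youth) AND (student = yes) THEN buys_computer = yes
    if tuple_.get("age") == "youth" and tuple_.get("student") == "yes":
        return "buys_computer = yes"
    return None   # the rule does not fire; its antecedent is not satisfied

print(rule_r1({"age": "youth", "student": "yes"}))    # buys_computer = yes
print(rule_r1({"age": "senior", "student": "yes"}))   # None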
