Unit II NOTES
Bayes' theorem is stated as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed
event B.
P(B|A) is the Likelihood probability: the probability of the evidence given that the
hypothesis is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the
evidence.
P(B) is the Marginal probability: the probability of the evidence.
Working of Naïve Bayes' Classifier:
The working of the Naïve Bayes classifier can be understood with the help of the
example below:
Suppose we have a dataset of weather conditions and a corresponding target
variable "Play". Using this dataset, we need to decide whether we should play
on a particular day according to the weather conditions. To solve this problem,
we need to follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the below dataset:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the Weather Conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table for the weather conditions:
Weather    No            Yes           P(Weather)
Overcast   0             5             5/14 = 0.35
Rainy      2             2             4/14 = 0.29
Sunny      2             3             5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes)= 3/10= 0.3
P(Sunny)= 0.35
P(Yes)=0.71
So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60
P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No)= 0.29
P(Sunny)= 0.35
So P(No|Sunny)= 0.5*0.29/0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
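The calculation above can be reproduced in a few lines of Python. The following is a minimal sketch (not part of the original notes; plain Python, no libraries required) that builds the frequency counts from the dataset and applies Bayes' theorem:

# Reproduce the frequency table and Bayes'-theorem calculation above.
from collections import Counter

outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play    = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
           "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

n = len(play)
class_counts = Counter(play)                 # {"Yes": 10, "No": 4}
joint_counts = Counter(zip(outlook, play))   # e.g. ("Sunny", "Yes") -> 3

def posterior(weather, label):
    prior = class_counts[label] / n                                    # P(label)
    likelihood = joint_counts[(weather, label)] / class_counts[label]  # P(weather|label)
    evidence = (joint_counts[(weather, "Yes")]
                + joint_counts[(weather, "No")]) / n                   # P(weather)
    return likelihood * prior / evidence

print(posterior("Sunny", "Yes"))   # ~0.60
print(posterior("Sunny", "No"))    # ~0.40 (0.41 above comes from rounding)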
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting
the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in multi-class predictions as compared to other
algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated,
so it cannot learn the relationship between features.
Applications of Naïve Bayes Classifier:
o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier
is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment
analysis.
Types of Naïve Bayes Model:
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a
normal distribution. This means if predictors take continuous values
instead of discrete, then the model assumes that these values are
sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the
data is multinomially distributed. It is primarily used for document
classification problems, i.e., determining which category a particular
document belongs to, such as Sports, Politics, Education, etc.
The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial
classifier, but the predictor variables are independent Boolean
variables, such as whether a particular word is present or not in a document.
This model is also popular for document classification tasks. A short usage
sketch of all three variants follows below.
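As a quick illustration, the sketch below (assuming scikit-learn is available; the toy data are made up purely for illustration) shows how the three variants are used:

# The three Naive Bayes variants described above on small toy data.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Gaussian NB: continuous features assumed normally distributed per class.
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [6.7, 3.1]])
print(GaussianNB().fit(X_cont, y).predict([[5.0, 3.2]]))

# Multinomial NB: word-count features, e.g. for document classification.
X_counts = np.array([[2, 1, 0], [3, 0, 0], [0, 2, 3], [0, 1, 4]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 3]]))

# Bernoulli NB: binary features (word present / absent).
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 1]]))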
Bayesian Belief Network in artificial intelligence
A Bayesian belief network is a key technology for dealing with probabilistic events and for solving problems that
have uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional
dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a probability distribution, and also use
probability theory for prediction and anomaly detection.
Real-world applications are probabilistic in nature, and to represent the relationships between multiple events we need a
Bayesian network. It can also be used in various tasks including prediction, anomaly detection, diagnostics, automated
insight, reasoning, time-series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and experts' opinions, and it consists of two parts:
o Directed Acyclic Graph
o Table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is
known as an Influence diagram.
A Bayesian network graph is made up of nodes and Arcs (directed links), where:
o Each node corresponds to a random variable, and a variable can be continuous or discrete.
o Arcs or directed arrows represent the causal relationships or conditional probabilities between random variables.
These directed links or arrows connect pairs of nodes in the graph.
These links represent that one node directly influences the other node; if there is no directed link between two
nodes, they are independent of each other.
o In the above diagram, A, B, C, and D are random variables represented by the nodes of the network graph.
o If we are considering node B, which is connected with node A by a directed arrow, then node A is called the
parent of node B.
o Node C is independent of node A.
Note: The Bayesian network graph does not contain any cycle. Hence, it is known as a directed acyclic graph, or
DAG.
The Bayesian network has mainly two components:
o Causal Component
o Actual numbers
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parent(Xi)), which determines the effect of
the parent on that node.
A Bayesian network is based on the joint probability distribution and conditional probability. So let's first understand the joint
probability distribution:
Joint probability distribution:
If we have variables x1, x2, x3, ....., xn, then the probabilities of the different combinations of x1, x2, x3, ..., xn are known as the
joint probability distribution.
P[x1, x2, x3, ....., xn] can be written in the following way in terms of the joint probability distribution:
= P[x1 | x2 , x3 ,....., xn]P[x2 , x3 ,....., xn]
= P[x1 | x2 , x3 ,....., xn]P[x2 |x3 ,....., xn]....P[xn-1 |xn]P[xn].
In general for each variable Xi, we can write the equation as:
P(Xi|Xi-1 ,........., X1 ) = P(Xi |Parents(Xi ))
Consider the classic burglar-alarm example: a Burglary (B) or an Earthquake (E) may trigger an Alarm (A), and when the alarm
rings, David (D) and Sophia (S) may call. Let's take the observed probabilities for the Burglary and Earthquake components:
P(B= True) = 0.002, which is the probability of burglary.
P(B= False)= 0.998, which is the probability of no burglary.
P(E= True)= 0.001, which is the probability of a minor earthquake
P(E= False)= 0.999, which is the probability that an earthquake has not occurred.
We can provide the conditional probabilities as per the below tables:
Conditional probability table for Alarm A:
The conditional probability of Alarm A depends on Burglary and Earthquake:
B        E        P(A= True)    P(A= False)
True     True     0.94          0.06
True     False    0.95          0.05
False    True     0.31          0.69
False    False    0.001         0.999
Conditional probability table for David Calls:
The conditional probability that David will call depends on the state of the Alarm.
A        P(D= True)    P(D= False)
True     0.91          0.09
False    0.05          0.95
Conditional probability table for Sophia Calls:
The conditional probability that Sophia will call depends on its parent node "Alarm."
A        P(S= True)    P(S= False)
True     0.75          0.25
False    0.02          0.98
From the formula of the joint distribution, we can write the problem statement (the alarm has sounded, both Sophia and David
call, and neither a burglary nor an earthquake has occurred) in the form of a probability distribution:
P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B ∧ ¬E) * P(¬B) * P(¬E)
= 0.75 * 0.91 * 0.001 * 0.998 * 0.999
= 0.00068045
Hence, a Bayesian network can answer any query about the domain by using Joint distribution.
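The factorization above can also be evaluated in code. The sketch below (plain Python; the dictionary layout is an illustrative choice, not a prescribed API) stores the priors and conditional probability tables and multiplies the relevant entries:

# Priors for Burglary (B) and Earthquake (E)
p_b = {True: 0.002, False: 0.998}
p_e = {True: 0.001, False: 0.999}

# CPT for Alarm given (B, E): P(A=True | B, E)
p_a = {(True, True): 0.94, (True, False): 0.95,
       (False, True): 0.31, (False, False): 0.001}

# CPTs for David and Sophia calling given Alarm: P(D=True | A), P(S=True | A)
p_d = {True: 0.91, False: 0.05}
p_s = {True: 0.75, False: 0.02}

def joint(b, e, a, d, s):
    # P(B=b, E=e, A=a, D=d, S=s) via the chain-rule factorization of the network
    pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
    pd = p_d[a] if d else 1 - p_d[a]
    ps = p_s[a] if s else 1 - p_s[a]
    return p_b[b] * p_e[e] * pa * pd * ps

# P(S, D, A, ¬B, ¬E) as computed above
print(joint(b=False, e=False, a=True, d=True, s=True))  # ~0.00068045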
The semantics of Bayesian Network:
There are two ways to understand the semantics of the Bayesian network, which are given below:
1. To understand the network as the representation of the Joint probability distribution.
It is helpful to understand how to construct the network.
2. To understand the network as an encoding of a collection of conditional independence statements.
Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or nonlinear classification, regression, and even outlier detection
tasks. SVMs can be used for a variety of tasks, such as text classification, image classification, spam detection, handwriting identification, gene expression
analysis, face detection, and anomaly detection. SVMs are adaptable and efficient in a variety of applications because they can manage high-dimensional
data and nonlinear relationships.
SVM algorithms are very effective because they try to find the maximum-margin separating hyperplane between the different classes of the target feature.
Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression. Though we say regression
problems as well, it is best suited for classification. The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional space
that can separate the data points of different classes in the feature space. The hyperplane is chosen so that the margin between the closest points of the
different classes is as large as possible. The dimension of the hyperplane depends upon the number of features. If the number of input features is two,
then the hyperplane is just a line. If the number of input features is three, then the hyperplane becomes a 2-D plane. It becomes difficult to imagine when the
number of features exceeds three.
Let’s consider two independent variables x1, x2, and one dependent variable which is either a blue circle or a red circle.
From the figure above it’s very clear that there are multiple lines (our hyperplane here is a line because we are considering only two input features x1, x2)
that segregate our data points or do a classification between red and blue circles. So how do we choose the best line or in general the best hyperplane that
segregates our data points?
One reasonable choice as the best hyperplane is the one that represents the largest separation or margin between the two classes.
So we choose the hyperplane whose distance from it to the nearest data point on each side is maximized. If such a hyperplane exists it is known as
the maximum-margin hyperplane/hard margin. So from the above figure, we choose L2. Let’s consider a scenario like shown below
Selecting hyperplane for data with outlier
Here we have one blue ball within the boundary of the red balls. So how does SVM classify the data? It's simple! The blue ball within the boundary of the red
ones is an outlier of the blue balls. The SVM algorithm has the characteristic of ignoring outliers and finding the best hyperplane that maximizes the margin. SVM is
robust to outliers.
For this type of data, SVM finds the maximum margin as it did with the previous data sets, and in addition it adds a penalty each time a
point crosses the margin. The margins in these cases are called soft margins. When there is a soft margin, the SVM tries to
minimize (1/margin + C·(∑penalty)), where C is a regularisation parameter. Hinge loss is a commonly used penalty: if there are no violations there is no
hinge loss, and if there are violations the hinge loss is proportional to the distance of the violation.
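As an illustrative sketch (assuming scikit-learn; the data are randomly generated for the example), the regularisation parameter C controls how strictly margin violations are penalised:

# Soft-margin SVM: a smaller C tolerates more margin violations,
# a larger C penalizes them more strictly.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, size=(20, 2)), rng.normal(loc=2, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors")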
Till now, we were talking about linearly separable data (the groups of blue balls and red balls are separable by a straight line). What do we do if the data
are not linearly separable?
Original 1D dataset for classification
Say our data are as shown in the figure above. SVM solves this by creating a new variable using a kernel. For a point xi on the line, we create a new
variable yi as a function of its distance from the origin o. If we plot this, we get something like what is shown below.
In this case, the new variable y is created as a function of distance from the origin. A non-linear function that creates a new variable is referred to as a kernel.
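A small sketch of this idea (assuming NumPy and scikit-learn; the points and the feature map x -> (x, x^2) are chosen purely for illustration) is shown below:

# 1-D points that are not linearly separable become separable after adding a
# new variable that grows with the distance from the origin.
import numpy as np
from sklearn.svm import SVC

x = np.array([-4, -3, -2, 2, 3, 4, -1, -0.5, 0, 0.5, 1], dtype=float)
y = np.array([ 1,  1,  1, 1, 1, 1,  0,    0, 0,   0, 0])   # outer points vs inner points

X_mapped = np.column_stack([x, x ** 2])          # explicit feature map x -> (x, x^2)
print(SVC(kernel="linear").fit(X_mapped, y).score(X_mapped, y))  # 1.0: now separable

# A polynomial kernel lets SVC do a similar mapping implicitly, without
# constructing the new feature column by hand.
clf = SVC(kernel="poly", degree=2, coef0=1).fit(x.reshape(-1, 1), y)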
1. Hyperplane: Hyperplane is the decision boundary that is used to separate the data points of different classes in a feature space. In the case
of linear classifications, it will be a linear equation i.e. wx+b = 0.
2. Support Vectors: Support vectors are the closest data points to the hyperplane, and they play a critical role in deciding the hyperplane and
margin (see the sketch after this list).
3. Margin: Margin is the distance between the support vector and hyperplane. The main objective of the support vector machine algorithm is
to maximize the margin. The wider margin indicates better classification performance.
4. Kernel: A kernel is a mathematical function used in SVM to map the original input data points into high-dimensional feature spaces,
so that the hyperplane can be found easily even if the data points are not linearly separable in the original input space. Some of the
common kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid.
5. Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a hyperplane that properly separates the data points of
different categories without any misclassifications.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a soft margin technique. Each data point has a
slack variable introduced by the soft-margin SVM formulation, which softens the strict margin requirement and permits certain
misclassifications or violations. It discovers a compromise between increasing the margin and reducing violations.
7. C: The regularisation parameter C in SVM balances margin maximisation against misclassification penalties. It decides the penalty for going over the
margin or misclassifying data items. A stricter penalty is imposed with a greater value of C, which results in a smaller margin
and perhaps fewer misclassifications.
8. Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect classifications or margin violations. The objective function in
SVM is frequently formed by combining it with the regularisation term.
9. Dual Problem: A dual Problem of the optimisation problem that requires locating the Lagrange multipliers related to the support vectors
can be used to solve SVM. The dual formulation enables the use of kernel tricks and more effective computing.
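The following minimal sketch (assuming scikit-learn; the toy points are made up) shows how several of the terms above, the support vectors and the margin width 2/||w||, can be inspected on a trained linear SVM:

# Inspect support vectors and margin width after fitting a linear SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.coef_[0]                                  # normal vector of the hyperplane
print("support vectors:\n", clf.support_vectors_)
print("margin width:", 2 / np.linalg.norm(w))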
Advantages of SVM
Linear Kernel
The linear kernel is defined as:
K(x, y) = x · y
Where x and y are the input feature vectors. The dot product of the input
vectors is a measure of their similarity or distance in the original feature
space.
When using a linear kernel in an SVM, the decision boundary is a linear
hyperplane that separates the different classes in the feature space. This
linear boundary can be useful when the data is already separable by a linear
decision boundary or when dealing with high-dimensional data, where the use
of more complex kernel functions may lead to overfitting.
Polynomial Kernel
A particular kind of kernel function utilised in machine learning, such as in
SVMs, is a polynomial kernel (Support Vector Machines). It is a nonlinear
kernel function that employs polynomial functions to transfer the input data
into a higher-dimensional feature space.
One definition of the polynomial kernel is:
K(x, y) = (x · y + c)^d
Where x and y are the input feature vectors, c is a constant term, and d is the
degree of the polynomial. The constant term is added to the dot product of the
input vectors, and the result is raised to the degree of the polynomial.
The decision boundary of an SVM with a polynomial kernel might capture
more intricate correlations between the input characteristics because it is a
nonlinear hyperplane.
The degree of nonlinearity in the decision boundary is determined by the
degree of the polynomial.
The polynomial kernel has the benefit of being able to detect both linear and
nonlinear correlations in the data. It can be difficult to select the proper degree
of the polynomial, though, as a larger degree can result in overfitting while a
lower degree cannot adequately represent the underlying relationships in the
data.
In general, the polynomial kernel is an effective tool for converting the input
data into a higher-dimensional feature space in order to capture nonlinear
correlations between the input characteristics.
Gaussian (RBF) Kernel
The Gaussian kernel, also known as the radial basis function (RBF) kernel, is
a popular kernel function used in machine learning, particularly in SVMs
(Support Vector Machines). It is a nonlinear kernel function that maps the
input data into a higher-dimensional feature space using a Gaussian function.
The Gaussian kernel can be defined as:
K(x, y) = exp(-||x - y||² / (2σ²))
Where x and y are the input feature vectors and σ is a parameter that controls the width of the Gaussian function.
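As an illustrative sketch (assuming scikit-learn; the concentric-circles dataset is just a stand-in for data that is not linearly separable), the three kernels discussed above can be compared as follows:

# Compare linear, polynomial, and RBF kernels on non-linearly-separable data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, degree=2, coef0=1).fit(X, y)
    # linear is near chance here; poly and rbf can separate the circles
    print(kernel, clf.score(X, y))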
Properties of SVM
Feature Selection