
Machine Learning

Subject Code: 18PC1CS09

By

Dr.G.MADHU
M.Tech., Ph.D., MIEEE., MCSI., MISTE., MISRS., MIRSS., MIAENG

Professor,
Department of Information Technology,
VNR Vignana Jyothi Institute of Engineering & Technology,
Bachupally, Nizampet (S.O.)
Hyderabad - 500 090, Ranga Reddy Dt., Telangana, India.
Cell: +919849085728
E-mail: [email protected]
Unit-4: Probability and Bayes Learning



Introduction to Probability and Bayes Learning
• Bayesian Reasoning provides a probabilistic
approach to inference.
• It is based on the assumption that the quantities
of interest are governed by probability
distributions and that optimal decisions can be
made by reasoning about these probabilities
together with observed data.
• It is important to machine learning because it
provides a quantitative approach to weighing the
evidence supporting alternative hypotheses.



• Bayesian Reasoning provides the basis for
learning algorithms that directly manipulate
probabilities, as well as a framework for
analyzing the operation of other algorithms
that do not explicitly manipulate probabilities.
• Bayesian learning methods are relevant to our
study of machine learning for two different
reasons.
– First, Bayesian learning algorithms that calculate
explicit probabilities for hypotheses, such as the
naive Bayes classifier, are among the most practical
approaches to certain types of learning problems.



• The second reason that Bayesian methods are
important to our study of machine learning is
that they provide a useful perspective for
understanding many learning algorithms that
do not explicitly manipulate probabilities.
• For example,
– In this chapter we analyze algorithms such as the
FIND-S and CANDIDATE ELIMINATION Algorithms
of Chapter 2 to determine conditions under which
they output the most probable hypothesis given
the training data.
• A basic familiarity with Bayesian Methods is
important to understanding and characterizing
the operation of many algorithms in machine
learning.



Features of Bayesian Learning Methods
• Each observed training example can
incrementally decrease or increase the
estimated probability that a hypothesis is
correct.
– This provides a more flexible approach to
learning than algorithms that completely
eliminate a hypothesis if it is found to be
inconsistent with any single example.



• Prior knowledge can be combined with
observed data to determine the final probability
of a hypothesis.
• In Bayesian learning, prior knowledge is
provided by asserting
– (1) a prior probability for each candidate hypothesis,
and
– (2) a probability distribution over observed data for
each possible hypothesis.



• Bayesian Methods can accommodate hypotheses
that make probabilistic predictions (e.g.,
hypotheses such as "this pneumonia patient has
a 93% chance of complete recovery").
• New instances can be classified by combining the
predictions of multiple hypotheses, weighted by
their probabilities.
• Even in cases where Bayesian methods prove
computationally intractable, they can provide a
standard of optimal decision making against
which other practical methods can be measured.
BAYES THEOREM
• In machine learning we are often interested in
determining the best hypothesis from some
space H, given the observed training data E.
• One way to specify what we mean by the best
hypothesis is to say that we demand the most
probable hypothesis, given the data E plus any
initial knowledge about the prior probabilities of
the various hypotheses in H.
• Bayes theorem provides a direct method for
calculating such probabilities.
• More precisely, Bayes theorem provides a way
to calculate the probability of a hypothesis
based on its prior probability, the probabilities
of observing various data given the hypothesis,
and the observed data itself.



BAYES THEOREM
• To define Bayes theorem precisely, let us first
introduce a little notation.
• We shall write P(H) to denote the initial
probability that hypothesis H holds, before we
have observed the training data.
• P(H) is often called the prior probability of H
and may reflect any background knowledge we
have about the chance that H is a correct
hypothesis.
• If we have no such prior knowledge, then we
might simply assign the same prior probability
to each candidate hypothesis.
• Similarly, we will write P(E) to denote the prior
probability that training data E will be
observed (i.e., the probability of E given no
knowledge about which hypothesis holds).
• Next, we will write P(E|H) to denote the
probability of observing data E given some
world in which hypothesis H holds.



• More generally, we write 𝑃(𝑥|𝑦) to denote the
probability of x given y.
• In machine learning problems we are interested
in the probability P(H|E) that H holds given the
observed training data E.
• P(H|E) is called the posterior probability of H,
because it reflects our confidence that H holds
after we have seen the training data E.



Bayes’ Theorem

• Bayes’ theorem is a relationship between the conditional
probabilities of two events. For a hypothesis H and evidence E:

P(H|E) = P(E|H) P(H) / P(E)

Source: https://storage.ning.com/topology/rest/1.0/file/get/1891844156?profile=original
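As a quick numerical illustration (a minimal Python sketch with made-up numbers, not taken from the slides): suppose a disease affects 1% of a population, a test detects it 95% of the time, and the test also fires on 5% of healthy people.

# Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)
# All numbers below are hypothetical, for illustration only.
p_h = 0.01              # prior: P(disease)
p_e_given_h = 0.95      # likelihood: P(positive test | disease)
p_e_given_not_h = 0.05  # false-positive rate: P(positive test | no disease)

# Total probability of the evidence: P(E) = P(E|H)P(H) + P(E|~H)P(~H)
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

posterior = p_e_given_h * p_h / p_e
print(f"P(disease | positive test) = {posterior:.3f}")  # ~0.161

Even with a positive test, the posterior stays low because the prior P(H) is small; this combination of prior knowledge with observed data is exactly what Bayesian learning formalizes.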



What is the Naive Bayes algorithm?
• It is based on Bayes’ Theorem with an
assumption of independence among
predictors.
• In simple terms, a Naive Bayes classifier
assumes that the presence of a particular
feature in a class is unrelated to the presence
of any other feature.
• Basically, it's "naive" because it makes
assumptions that may or may not turn out to
be correct.



• The Naive Bayes classifier assumes that the
presence of a feature in a class is unrelated to
any other feature.
• Even if these features depend on each other or
upon the existence of the other features, all of
these properties independently contribute to
the probability that a particular fruit is an
apple or an orange or a banana, and that is
why it is known as "Naive."



How does the Naive Bayes algorithm work?
• Step 1: Convert the data set into a frequency
table.
• Step 2: Create a likelihood table by finding the
probabilities, e.g. the probability of Overcast is
0.29 and the probability of playing is 0.64.
• Step 3: Use the naive Bayes equation to calculate
the posterior probability for each class.
• Step 4: The class with the highest posterior
probability is the outcome of the prediction (see
the sketch below).
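These steps can be sketched in a few lines of Python. The (outlook, play) records below are an assumption: they are the classic 14-day weather dataset that reproduces the counts quoted on the slide (4 Overcast days out of 14, 9 Yes, 5 No).

from collections import Counter

# (outlook, play) pairs from the classic weather dataset (assumed here;
# only the totals 4/14 Overcast, 9/14 Yes, 5/14 No appear on the slide).
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"),
        ("Rainy", "Yes"), ("Rainy", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
        ("Sunny", "Yes"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rainy", "No")]

n = len(data)                                    # 14 observations
freq = Counter(data)                             # Step 1: frequency table
outlook_counts = Counter(o for o, _ in data)
play_counts = Counter(p for _, p in data)

# Step 2: likelihood table entries
p_overcast = outlook_counts["Overcast"] / n      # 4/14 = 0.29
p_yes = play_counts["Yes"] / n                   # 9/14 = 0.64
p_overcast_given_yes = freq[("Overcast", "Yes")] / play_counts["Yes"]  # 4/9 = 0.44

# Step 3: posterior via Bayes' theorem
p_yes_given_overcast = p_overcast_given_yes * p_yes / p_overcast
print(f"P(Yes | Overcast) = {p_yes_given_overcast:.2f}")
# Prints 1.00: every Overcast day in this data is a 'Yes'. The slide's 0.98
# comes from rounding 4/9, 9/14 and 4/14 to two decimals before multiplying.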
[Figure: frequency and likelihood tables for the weather data. Source: https://www.saedsayad.com/naive_bayesian.htm]
Now suppose you want to calculate the probability of playing when the weather is
overcast.



Probability of playing:
• P(Yes | Overcast) = P(Overcast | Yes) P(Yes) / P(Overcast) ......(1)
1. Calculate the prior probabilities:
   P(Overcast) = 4/14 = 0.29
   P(Yes) = 9/14 = 0.64
2. Calculate the likelihood:
   P(Overcast | Yes) = 4/9 = 0.44
3. Put the prior and likelihood into equation (1):
   P(Yes | Overcast) = 0.44 × 0.64 / 0.29 = 0.98 (higher)

Similarly, you can calculate the probability of not playing:
Probability of not playing:
• P(No | Overcast) = P(Overcast | No) P(No) / P(Overcast) .......(2)
1. Calculate the prior probabilities:
   P(Overcast) = 4/14 = 0.29
   P(No) = 5/14 = 0.36
2. Calculate the likelihood:
   P(Overcast | No) = 0/5 = 0
3. Put the prior and likelihood into equation (2):
   P(No | Overcast) = 0 × 0.36 / 0.29 = 0

The probability of the 'Yes' class is higher, so you
can conclude that if the weather is overcast, players
will play the sport.
In this example we have 4 inputs (predictors). The final posterior
probabilities can be normalized so that they lie between 0 and 1 and
sum to 1 across the classes, as in the sketch below.
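A small sketch of that normalization step, using hypothetical unnormalized class scores (the prior times the product of the per-feature likelihoods) for a two-class problem with 4 predictors:

# Unnormalized posteriors: prior * product of per-feature likelihoods.
# The numbers below are hypothetical placeholders, not slide values.
score_yes = 0.64 * 0.44 * 0.33 * 0.67 * 0.33  # P(Yes) * 4 feature likelihoods
score_no = 0.36 * 0.00 * 0.40 * 0.20 * 0.40   # P(No) * 4 feature likelihoods

total = score_yes + score_no
p_yes = score_yes / total  # normalized to [0, 1]; p_yes + p_no == 1
p_no = score_no / total
print(f"P(Yes | x) = {p_yes:.2f}, P(No | x) = {p_no:.2f}")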

Logistic Regression
• In statistics, logistic regression, or logit
regression, or logit model is a regression model
where the dependent variable (DV) is
categorical.
• Logistic regression measures the relationship
between the categorical dependent variable and
one or more independent variables by
estimating probabilities using a logistic function,
which is the cumulative logistic distribution.
• Thus, it treats the same set of problems as
probit regression using similar techniques, with
the latter using a cumulative normal distribution
curve instead.
• Logistic regression can be seen as a special case
of the generalized linear model and thus
similar to linear regression.
• For example (see the sketch below):
– To predict whether an email is spam (1) or not (0)
– Whether a tumor is malignant (1) or not (0)
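A minimal sketch of binary logistic regression with scikit-learn on synthetic data (the dataset and seed are assumptions for illustration); it also verifies that the predicted probability is the sigmoid of w·x + b:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary data: one feature, label 1 when the feature is large.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

# The model predicts probabilities through the sigmoid of w*x + b.
w, b = model.coef_[0, 0], model.intercept_[0]
x_new = np.array([[1.0]])
manual = 1.0 / (1.0 + np.exp(-(w * x_new[0, 0] + b)))  # sigmoid by hand
print(model.predict_proba(x_new)[0, 1], manual)        # the two values agree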

Two features of Logistic Regression
• First, the conditional distribution y|x is a
Bernoulli distribution rather than a Gaussian
distribution, because the dependent variable
is binary.
• Second, the predicted values are probabilities
and are therefore restricted to (0,1) through
the logistic distribution function because
logistic regression predicts the probability of
particular outcomes.
Types of Logistic Regression
1. Binary Logistic Regression:
– The categorical response has only two possible
outcomes. Example: Spam or Not
2. Multinomial Logistic Regression:
– Three or more categories without ordering. Example:
Predicting which food is preferred more (Veg, Non-Veg,
Vegan); see the sketch after this list.
3. Ordinal Logistic Regression:
– Three or more categories with ordering. Example:
Movie rating from 1 to 5
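For the multinomial case mentioned above, scikit-learn's LogisticRegression handles three or more unordered classes directly; a sketch with synthetic food-preference data (features, labels and seed are made up):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features (e.g., age, spice tolerance) and 3 unordered classes:
# 0 = Veg, 1 = Non-Veg, 2 = Vegan. Data is synthetic.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = rng.integers(0, 3, size=300)

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X[:1])  # one probability per class, summing to 1
print(probs, probs.sum())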
Binary Logistic Regression
• The binary logistic model gives the probability of the positive
class as P(y = 1 | x) = σ(wᵀx + b), where σ(z) = 1 / (1 + e^(−z)) is
the sigmoid (logistic) function, so predictions always lie in (0, 1).
Support Vector Machine
History
• In 1963, Vladimir Vapnik and Alexey Chervonenkis
developed another classification tool, the support
vector machine.
• Vapnik refined this classification method in the
1990s and extended uses for SVMs.
• Support vector machines have become a great tool
for the data scientist.
What is a Support Vector Machine?
❖ A "Support Vector Machine" (SVM) is a supervised machine
learning algorithm which can be used for both classification
and regression problems.
❖ The idea of Support Vector Machines is simple: in a
classification problem, the algorithm creates a line (or
hyperplane) which separates the classes.
❖ The goal of the line is to maximize the margin between the
points on either side of the so-called decision boundary.
❖ The benefit of this process is that, after the separation, the
model can easily predict the target classes (labels) for new
cases.
• SVM is basically a binary classifier, although it
can be modified for multi-class classification
as well as regression.
• Unlike logistic regression and neural network
models, SVMs try to maximize the separation
between two classes of points.



How does it work?

• The basic principle behind the working of
support vector machines is simple:
o Create a hyperplane that separates the
dataset into classes.

Let us start with a sample problem.


• Suppose that for a given
dataset, you have to
classify red triangles
from blue circles. Your
goal is to create a line
that classifies the data
into two classes, creating
a distinction between
red triangles and blue
circles.
Difference between Linear and Non-Linear Data

❖ Data is linear when we can classify it with a linear
classifier. The linear classifier makes its classification
decision based on a linear combination of characteristics.
❖ The characteristics are also known as features in
machine learning.


Real-Life Applications of SVM (Support Vector Machines)

Vanilla (Plain) SVM & Its Objective Function
• A support vector machine constructs a hyperplane
or set of hyperplanes in a high- or infinite-dimensional
space, which can be used for classification, regression,
or other tasks like outlier detection.
• First, the SVM creates a hyperplane (the
generalization of a simple line to n dimensions).



• These two support hyperplanes pass through the
most extreme points between the classes; these
points are called the support vectors.

• The distance between the support hyperplanes is
called the Margin.

• Hence, our goal is simply to find the maximum
margin M. Using vector operations, we can find that,
given the OPTIMAL HYPERPLANE (w·x + b = 0), the
Margin is equal to:

M = 2 / ‖w‖


[Figures: derivation of the maximum-margin objective. Source: https://gkunapuli.github.io/files/cs6375/06-SupportVectorMachines.pdf]
This is the big drawback of hard-margin SVMs! The two classes need to be fully
separable, which is almost never the case in real-world datasets. This is where
Soft Margin SVMs come into play.


• Question: What is the best separating
hyperplane?
• SVM Answer: The one that maximizes the
distance to the closest data points from both
classes. We say it is the hyperplane with
maximum margin.
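This maximum-margin answer can be seen directly with scikit-learn (synthetic, linearly separable blobs; the large C value, which approximates a hard margin, is an assumption of this sketch):

import numpy as np
from sklearn.svm import SVC

# Two linearly separable blobs (synthetic).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=[-2, -2], size=(20, 2)),
               rng.normal(loc=[2, 2], size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

# A large C approximates a hard-margin SVM with a linear kernel.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                   # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)   # the margin M = 2 / ||w||
print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin)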



[Figures: maximum-margin hyperplane and soft-margin examples. Source: https://gkunapuli.github.io/files/cs6375/06-SupportVectorMachines.pdf]
Duality
• In mathematical optimization theory, duality
means that optimization problems may be
viewed from either of two perspectives, the
primal problem or the dual problem (the duality
principle).
• The solution to the dual problem provides a
lower bound to the solution of the primal
(minimization) problem.



• The Primal Problem is something that we want to
minimize; P∗ denotes the minimum of the Primal
Objective P.
• The Dual Problem is something we want to
maximize; D∗ maximizes the Dual Objective D.
• Here we want to convert the Primal Problem (P)
to a Dual Problem (D).
• Sometimes solving the Dual Problem is the same
as solving the Primal Problem.
• (P∗ − D∗) is called the Duality Gap. Weak duality
guarantees P∗ − D∗ ≥ 0; when P∗ − D∗ = 0, strong
duality holds.



SVM: The Dual Formulation in Machine Learning

• The SVM training problem is an optimization problem with a convex
quadratic objective and linear constraints.

• It can be solved using Quadratic Programming (QP).

✓ The dual formulation allows us to use kernels to get optimal margin
classifiers to work efficiently in very high-dimensional spaces.
✓ It also allows us to derive an efficient algorithm for solving the
optimization problem that will typically do much better than generic
QP software.


Lagrangian Duality in brief
• The Primal Problem:

min_w f(w) subject to g_i(w) ≤ 0 and h_i(w) = 0

• The generalized Lagrangian:

L(w, 𝛼, β) = f(w) + Σ_i 𝛼_i g_i(w) + Σ_i β_i h_i(w)

• The 𝛼's (𝛼_i ≥ 0) and β's are called the Lagrange multipliers.
• Lemma: max over 𝛼, β (with 𝛼_i ≥ 0) of L(w, 𝛼, β) equals f(w)
if w satisfies the primal constraints, and +∞ otherwise.


A re-written Primal:

p∗ = min_w max_{𝛼, β : 𝛼_i ≥ 0} L(w, 𝛼, β) (the Primal Problem)

The Dual Problem:

d∗ = max_{𝛼, β : 𝛼_i ≥ 0} min_w L(w, 𝛼, β)

Theorem (weak duality): d∗ ≤ p∗ always holds.

Theorem (strong duality): under suitable conditions (for example, a
convex problem satisfying Slater's condition), d∗ = p∗.

If there exists a saddle point of L(w, 𝛼, β), we have d∗ = p∗.
Nonlinear SVM and Kernel Function
• In this lecture, we are going to learn about a
detailed description of the SVM kernel and
different kernel functions, with examples
such as
– linear,
– nonlinear,
– polynomial,
– Gaussian kernel,
– Radial basis function (RBF), sigmoid etc.



What is Nonlinear SVM?
• Nonlinear classification: SVM can be extended
to solve nonlinear classification tasks when
the set of samples cannot be separated
linearly.
• By applying kernel functions, the samples are
mapped onto a high-dimensional feature
space, in which the linear classification is
possible.



• A kernel function k(x, y) is an inner product between
the mapped samples:
– 𝑘(𝑥, 𝑦) = ⟨𝜙(𝑥), 𝜙(𝑦)⟩
• Valid kernel functions must satisfy Mercer's condition:
the kernel must be symmetric, k(x, y) = k(y, x), and
positive semi-definite.
• By using the kernel function, a nonlinear version of
SVM can be developed, and the optimization problem
for such an SVM can be written (in the dual form) as:

max_𝛼 Σ_i 𝛼_i − (1/2) Σ_i Σ_j 𝛼_i 𝛼_j y_i y_j k(x_i, x_j)
subject to 0 ≤ 𝛼_i ≤ C and Σ_i 𝛼_i y_i = 0
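A brief sketch of the kernel idea in Python: compute an RBF (Gaussian) kernel matrix by hand, check the symmetry required by Mercer's condition, then let scikit-learn's SVC use the same kernel implicitly (the data and the gamma value are arbitrary choices for illustration):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # not linearly separable

def rbf_kernel(A, B, gamma=0.5):
    # k(x, y) = exp(-gamma * ||x - y||^2)
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

K = rbf_kernel(X, X)
assert np.allclose(K, K.T)  # Mercer: the kernel matrix is symmetric

# The same kernel, used implicitly by SVC to separate the classes
clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)
print("training accuracy:", clf.score(X, y))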

• Note that the samples closest to the separating
hyperplane are those whose coefficients 𝛼_i are
nonzero.
• These samples are called support vectors.
• The support vectors contain all the information
necessary to construct the optimal hyperplane,
while the other samples have no effect on it.
• This is the reason why SVM can be used for
classification tasks where the number of samples
is limited.
• There are different kernel functions used in SVM,
such as the linear, polynomial and Gaussian RBF
kernels.
How to solve the Dual problem of SVM

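Since the dual is a quadratic program, a general-purpose constrained optimizer can solve small instances; production libraries such as libsvm (used by scikit-learn) rely on the specialized SMO algorithm instead, which scales far better. A minimal sketch with scipy.optimize on toy data (the data, C value and tolerance are assumptions of this sketch):

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical).
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.5], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

K = X @ X.T                 # Gram matrix for the linear kernel
Q = np.outer(y, y) * K      # Q[i, j] = y_i y_j k(x_i, x_j)

def neg_dual(alpha):
    # Negated dual objective, since scipy minimizes and the dual is maximized.
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

C, n = 10.0, len(y)
res = minimize(neg_dual, np.zeros(n),
               bounds=[(0.0, C)] * n,
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum a_i y_i = 0
alpha = res.x

w = (alpha * y) @ X              # recover the primal weight vector
sv = alpha > 1e-6                # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)   # offset computed from the support vectors
print("alpha:", np.round(alpha, 3), "w:", w, "b:", round(b, 3))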