
MACHINE LEARNING

UNIT 3
BAYESIAN LEARNING
 Bayesian learning methods are relevant to our study of machine learning for two different reasons.
 First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses, such as the naive Bayes classifier, are among the most practical approaches to certain types of learning problems.
 Second, Bayesian methods provide a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.
Features of Bayesian learning methods

 Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct.
 Prior knowledge can be combined with observed data to determine the final probability of a hypothesis.
 Bayesian methods can accommodate hypotheses that make probabilistic predictions.
 New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
NAIVE BAYES CLASSIFIER
 One highly practical Bayesian learning method is the naive Bayes learner, often called the naive Bayes classifier.
 The naive Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V.
 A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values (a1, a2, ..., an). The learner is asked to predict the target value, or classification, for this new instance.
 The naive Bayes classifier outputs

   vNB = argmax vj∈V  P(vj) ∏i P(ai | vj)

 where vNB denotes the target value output by the naive Bayes classifier.
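As an illustration, here is a minimal sketch of this decision rule for categorical attributes. The function names and the tiny toy dataset are invented for this example; a practical implementation would also apply smoothing for attribute values not seen during training.

from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Estimate P(v) and P(a_i | v) from (attribute_tuple, target_value) pairs."""
    class_counts = Counter(v for _, v in examples)
    # cond_counts[(i, a, v)] = number of class-v examples whose i-th attribute is a
    cond_counts = defaultdict(int)
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            cond_counts[(i, a, v)] += 1
    priors = {v: c / len(examples) for v, c in class_counts.items()}
    return priors, cond_counts, class_counts

def classify(attrs, priors, cond_counts, class_counts):
    """Return v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    best_v, best_score = None, -1.0
    for v, prior in priors.items():
        score = prior
        for i, a in enumerate(attrs):
            score *= cond_counts[(i, a, v)] / class_counts[v]
        if score > best_score:
            best_v, best_score = v, score
    return best_v

# Toy usage with invented data: two attributes, target values 'yes'/'no'.
data = [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes"),
        (("sunny", "mild"), "yes"), (("rainy", "hot"), "no")]
model = train_naive_bayes(data)
print(classify(("sunny", "mild"), *model))   # -> 'yes' on this toy data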
GIBBS ALGORITHM
 Although the Bayes optimal classifier obtains the best performance that can be achieved from the given training data, it can be quite costly to apply.
 The expense is due to the fact that it computes the posterior probability for every hypothesis in H and then combines the predictions of all hypotheses to classify each new instance.
 An alternative, less optimal method is the Gibbs algorithm (see Opper and Haussler 1991), defined as follows:
 The Gibbs algorithm simply applies a hypothesis drawn at random according to the current posterior probability distribution.
 Surprisingly, it can be shown that under certain conditions the expected misclassification error for the Gibbs algorithm is at most twice the expected error of the Bayes optimal classifier.
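A minimal sketch of the idea, assuming we already have a list of hypotheses together with their posterior probabilities (the hypothesis objects and the toy values below are invented for illustration):

import random

def gibbs_classify(x, hypotheses, posteriors):
    """Draw one hypothesis according to P(h|D) and use it alone to classify x.

    hypotheses: list of callables h(x) -> label
    posteriors: list of posterior probabilities P(h|D), summing to 1
    """
    h = random.choices(hypotheses, weights=posteriors, k=1)[0]
    return h(x)

# Toy usage with three hand-made hypotheses (posteriors 0.4, 0.3, 0.3).
h1 = lambda x: "positive"
h2 = lambda x: "negative"
h3 = lambda x: "negative"
print(gibbs_classify("some instance", [h1, h2, h3], [0.4, 0.3, 0.3]))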
AN EXAMPLE: LEARNING TO CLASSIFY TEXT
K-NEAREST NEIGHBOUR CLASSIFICATION
K-Nearest Neighbours
 K-Nearest Neighbours is one of the most basic yet essential classification algorithms in machine learning.
 It belongs to the supervised learning domain.
 It has applications in pattern recognition, data mining and intrusion detection.
 We are given some prior data (also called training data), which classifies coordinates into groups identified by an attribute.
 Given another set of data points (also called testing data), the task is to allocate these points to a group by analysing the training set.
 Given an unclassified point, we can assign it to a group by observing what group its nearest neighbours belong to (see the sketch after this list).
 This means a point close to a cluster of points classified as 'Red' has a higher probability of getting classified as 'Red'.
 Intuitively, we can see that the first point (2.5, 7) should be classified as 'Green' and the second point (5.5, 4.5) should be classified as 'Red'.
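The following is a minimal sketch of unweighted k-NN classification with Euclidean distance. The toy training points and labels are invented for illustration; they are not the points from the original figure.

import math
from collections import Counter

def knn_classify(query, training_points, labels, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(query, p), label)
        for p, label in zip(training_points, labels)
    ]
    distances.sort(key=lambda d: d[0])
    k_nearest_labels = [label for _, label in distances[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]

# Toy usage with invented 2-D points.
points = [(1, 1), (2, 2), (1, 2), (6, 6), (7, 7), (6, 7)]
labels = ["Green", "Green", "Green", "Red", "Red", "Red"]
print(knn_classify((2.0, 1.5), points, labels, k=3))   # -> 'Green' on this toy data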
Weighted K-Nearest Neighbour Example
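The worked example from the original slide is not reproduced in the extracted text. As a stand-in, the sketch below shows one common weighting scheme, distance-weighted voting, where each of the k neighbours contributes a vote weighted by the inverse of its distance (data and function names are invented for illustration):

import math
from collections import defaultdict

def weighted_knn_classify(query, training_points, labels, k=3):
    """Distance-weighted k-NN: closer neighbours get larger votes (1/d weighting)."""
    distances = sorted(
        (math.dist(query, p), label)
        for p, label in zip(training_points, labels)
    )
    votes = defaultdict(float)
    for d, label in distances[:k]:
        if d == 0:                 # query coincides with a training point
            return label
        votes[label] += 1.0 / d    # inverse-distance weight
    return max(votes, key=votes.get)

# Same invented toy data as above.
points = [(1, 1), (2, 2), (1, 2), (6, 6), (7, 7), (6, 7)]
labels = ["Green", "Green", "Green", "Red", "Red", "Red"]
print(weighted_knn_classify((3.0, 3.0), points, labels, k=3))   # -> 'Green'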
Minimum Description Length Principle
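The slide body for this topic did not survive the text extraction. As a reference point, the principle is commonly stated (for example in Mitchell's formulation) as choosing the hypothesis that minimises the total description length of the hypothesis plus the description length of the data given that hypothesis:

% L_{C_1}(h): description length of hypothesis h under encoding C_1
% L_{C_2}(D|h): description length of the training data D given h, under encoding C_2
h_{MDL} = \arg\min_{h \in H} \left[ L_{C_1}(h) + L_{C_2}(D \mid h) \right]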
BAYES OPTIMAL CLASSIFIER
 The Bayes Optimal Classifier is a probabilistic model that predicts the most likely outcome for a new instance.
 Bayes' theorem is a method for calculating a hypothesis's probability based on its prior probability, the probability of observing specific data given the hypothesis, and the observed data itself.
 Maximum a Posteriori (MAP) is a probabilistic framework for determining the most likely hypothesis for a training dataset.
 Take a hypothesis space that has 3 hypotheses h1, h2, and h3.

 The posterior probabilities of the hypotheses are as follows:
   h1 -> 0.4
   h2 -> 0.3
   h3 -> 0.3
 Hence, h1 is the MAP hypothesis (MAP => maximum a posteriori).
 Suppose a new instance x is encountered, which is classified negative by h2 and h3 but positive by h1.
 Taking all hypotheses into account, the probability that x is positive is 0.4 and the probability that it is negative is therefore 0.6.
 The classification generated by the MAP hypothesis thus differs from the most probable classification, which in this case is negative.

 The most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.
 If the new instance's classification can be any value vj from a set V, the probability P(vj|D) that the right classification for the new instance is vj is

   P(vj|D) = Σ hi∈H  P(vj|hi) P(hi|D)

 The denominator is omitted since we are only using this for comparison and all the values of P(vj|D) would have the same denominator.
 The value vj for which P(vj|D) is maximum is the best classification for the new instance.
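To make the arithmetic of the three-hypothesis example concrete, here is a small sketch that combines the hypotheses' predictions weighted by their posteriors. Representing each hypothesis's prediction as a dictionary of per-class probabilities is an assumption made only for this illustration.

def bayes_optimal_classify(posteriors, predictions):
    """Return argmax over v of sum_i P(v | h_i) * P(h_i | D).

    posteriors: list of P(h_i | D)
    predictions: list of dicts mapping class value -> P(v | h_i)
    """
    combined = {}
    for p_h, pred in zip(posteriors, predictions):
        for v, p_v_given_h in pred.items():
            combined[v] = combined.get(v, 0.0) + p_h * p_v_given_h
    return max(combined, key=combined.get), combined

# The example from the slides: P(h1|D)=0.4, P(h2|D)=P(h3|D)=0.3,
# h1 predicts positive, h2 and h3 predict negative (deterministically).
posteriors = [0.4, 0.3, 0.3]
predictions = [{"positive": 1.0, "negative": 0.0},
               {"positive": 0.0, "negative": 1.0},
               {"positive": 0.0, "negative": 1.0}]
print(bayes_optimal_classify(posteriors, predictions))
# Combined probabilities are 0.4 for positive and 0.6 for negative, so 'negative' wins.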
Computational Learning Theory
PAC MODEL (PROBABLY APPROXIMATELY CORRECT)
 PAC (Probably Approximately Correct) learning is a framework used for the mathematical analysis of machine learning. A PAC learner tries to learn a concept (approximately correctly) by selecting, from a set of hypotheses, a hypothesis that has a low generalization error.
 A hypothesis h is approximately correct if its error with respect to the target concept c is small: error(h) = Pr(h ⊕ c) <= ɛ, where 0 < ɛ < 0.5 and h ⊕ c denotes the set of instances on which h and c disagree.
 The learner is probably approximately correct if it outputs such a hypothesis with probability at least 1 - δ, i.e. Pr(error(h) <= ɛ) >= 1 - δ, where δ is the confidence parameter and 0 < δ < 0.5. The formal definition is stated below.
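For reference, the usual formal statement of PAC-learnability (following the standard textbook definition, e.g. Mitchell's) can be written as follows, where X is the instance space, D a distribution over X, n the size of instances, and size(c) the encoding length of the target concept:

% C is PAC-learnable by learner L using hypothesis space H if:
\forall c \in C,\; \forall \text{ distributions } \mathcal{D} \text{ over } X,\;
\forall \varepsilon, \delta \in (0, \tfrac{1}{2}):\quad
\Pr\!\left[ \mathrm{error}_{\mathcal{D}}(h_L) \le \varepsilon \right] \ge 1 - \delta,
% with L running in time polynomial in 1/ε, 1/δ, n, and size(c).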
SAMPLE COMPLEXITY FOR FINITE HYPOTHESIS SPACES
 PAC-learnability is largely determined by the number of training examples required by the learner.
 The growth in the number of required training examples with problem size is called the sample complexity of the learning problem.
 A learner is consistent if it outputs hypotheses that perfectly fit the training data, whenever possible.
 We can derive a bound on the number of training examples required by any consistent learner, independent of the specific algorithm it uses (see the bound quoted below).
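The bound referred to above is the standard sample-complexity bound for consistent learners in a finite hypothesis space (as derived via the version-space argument in Mitchell's treatment): with probability at least 1 - δ, every hypothesis consistent with the training data has true error at most ɛ, provided the number of training examples m satisfies

m \ge \frac{1}{\varepsilon} \left( \ln\lvert H \rvert + \ln\frac{1}{\delta} \right)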
SAMPLE COMPLEXITY FOR INFINITE HYPOTHESIS SPACES
 We consider a second measure of the complexity of H, called the Vapnik-Chervonenkis dimension of H (VC(H)), and state bounds on sample complexity that use VC(H) rather than |H| (one such bound is quoted below).
 The VC dimension measures the complexity of the hypothesis space H, not by the number of distinct hypotheses |H|, but instead by the number of distinct instances from X that can be completely discriminated using H.
 It uses a notion called shattering: a set of instances S is shattered by H if, for every possible dichotomy (labelling) of S, there exists some hypothesis in H consistent with that dichotomy. VC(H) is the size of the largest finite subset of X shattered by H.
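For reference, the analogue of the finite-space bound, stated in terms of VC(H) (the upper bound due to Blumer et al., as quoted in Mitchell's treatment), is

m \ge \frac{1}{\varepsilon} \left( 4 \log_2\frac{2}{\delta} + 8\, VC(H) \log_2\frac{13}{\varepsilon} \right)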
