
Machine Learning Notes, 6th Sem CSE Elective, 2019-20 | Sujata Joshi, Assoc Prof, CSE

UNIT 3- BAYESIAN LEARNING

Bayes Theorem, Concept Learning, Maximum Likelihood, Minimum Description Length Principle,
Bayes Optimal Classifier, Gibbs Algorithm, Naïve Bayes Classifier, Bayesian Belief Network, EM
Algorithm

Introduction
Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that
the quantities of interest are governed by probability distributions and that optimal decisions can be
made by reasoning about these probabilities together with observed data. It is important to machine
learning because it provides a quantitative approach to weighing the evidence supporting alternative
hypotheses. Bayesian reasoning provides the basis for learning algorithms that directly manipulate
probabilities, as well as a framework for analyzing the operation of other algorithms that do not
explicitly manipulate probabilities.

Features of Bayesian learning:


• Each observed training example can incrementally decrease or increase the estimated
  probability that a hypothesis is correct. This provides a more flexible approach to learning
  than algorithms that completely eliminate a hypothesis if it is found to be inconsistent with
  any single example.
• Prior knowledge can be combined with observed data to determine the final probability of a
  hypothesis. In Bayesian learning, prior knowledge is provided by asserting (1) a prior
  probability for each candidate hypothesis, and (2) a probability distribution over observed
  data for each possible hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions (e.g.,
  hypotheses such as "this pneumonia patient has a 93% chance of complete recovery").
• New instances can be classified by combining the predictions of multiple hypotheses,
  weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they can provide a
  standard of optimal decision making against which other practical methods can be measured.

Limitations of Bayesian learning

• One practical difficulty in applying Bayesian methods is that they typically require initial
  knowledge of many probabilities. When these probabilities are not known in advance, they
  are often estimated based on background knowledge, previously available data, and
  assumptions about the form of the underlying distributions.
• A second practical difficulty is the significant computational cost required to determine the
  Bayes optimal hypothesis in the general case.


BAYES THEOREM
In machine learning our objective is to determine the best hypothesis from some space H, given the
observed training data D. One way to specify the best hypothesis is to say that we demand the most
probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the
various hypotheses in H. Bayes theorem provides a way to calculate the probability of a hypothesis
based on its prior probability, the probabilities of observing various data given the hypothesis, and
the observed data itself.

Notations
P(h) - the initial probability that hypothesis h holds, before we have observed the training data.

P(h) is often called the prior probability of h and may reflect any background knowledge we have
about the chance that h is a correct
hypothesis. If we have no such prior knowledge, then we might simply assign the same prior
probability to each candidate hypothesis.

P(D) - the prior probability that training data D will be observed (i.e., the probability of D given no
knowledge about which hypothesis holds).

P(D|h) - the probability of observing data D given some world in which hypothesis h holds. More
generally, we write P(x|y) to denote the probability of x given y.
In machine learning problems we are interested in the probability P(h|D) that h holds given the
observed training data D.

P(h|D) is called the posterior probability of h, because it reflects our confidence that h holds after
we have seen the training data D.

According to Bayes theorem, we can compute the posterior probability P(h|D) from the prior
probability P(h), together with P(D) and P(D|h), as

P(h|D) = P(D|h) P(h) / P(D)

In many learning scenarios, the learner considers some set of candidate hypotheses H and is
interested in finding the most probable hypothesis h ∈ H given the observed data D. Any such
maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis. We can
determine the MAP hypotheses by using Bayes theorem to calculate the posterior probability of each
candidate hypothesis.

More precisely, we will say that hMAP is a MAP hypothesis provided

hMAP = argmax_{h in H} P(h|D)
     = argmax_{h in H} P(D|h) P(h) / P(D)
     = argmax_{h in H} P(D|h) P(h)


Notice in the final step above we dropped the term P(D) because it is a constant independent of h. In
some cases, we will assume that every hypothesis in H is equally probable a priori (P(hi) = P(hj) for
all hi and hj in H). In this case we need only consider the term P(D|h) to find the most probable
hypothesis. P(D|h) is often called the likelihood of the data D given h, and any hypothesis that
maximizes P(D|h) is called a maximum likelihood (ML) hypothesis, hML:

hML = argmax_{h in H} P(D|h)
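To make these definitions concrete, the short Python sketch below computes hMAP and hML for a small hypothesis space. The hypothesis names, priors, and likelihoods are invented purely for illustration and are not taken from these notes.

# A minimal sketch of the MAP and ML hypothesis definitions.
# The priors and likelihoods below are made-up illustrative numbers.

priors = {"h1": 0.3, "h2": 0.4, "h3": 0.3}           # P(h)
likelihoods = {"h1": 0.05, "h2": 0.01, "h3": 0.20}   # P(D|h)

# MAP hypothesis: argmax_h P(D|h) * P(h)   (P(D) is a constant and can be dropped)
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

# ML hypothesis: argmax_h P(D|h)   (appropriate when all hypotheses are equally probable a priori)
h_ml = max(likelihoods, key=likelihoods.get)

print("hMAP =", h_map)   # h3: 0.20*0.3 = 0.06 beats 0.05*0.3 and 0.01*0.4
print("hML  =", h_ml)    # h3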

An Example
Consider a medical diagnosis problem in which there are two alternative hypotheses:
(1) that the patient has a particular form of cancer.
(2) that the patient does not.

The available data is from a particular laboratory test with two possible outcomes:
+ (positive) and - (negative).
We have prior knowledge that over the entire population of people only .008 have this
disease. The lab test is only an imperfect indicator of the disease. The test returns a correct
positive result in only 98% of the cases in which the disease is actually present and a correct
negative result in only 97% of the cases in which the disease is not present. In other cases,
the test returns the opposite result. The above situation can be summarized by the following
probabilities:

P(cancer) = 0.008        P(¬cancer) = 0.992
P(+|cancer) = 0.98       P(-|cancer) = 0.02
P(+|¬cancer) = 0.03      P(-|¬cancer) = 0.97

Suppose we now observe a new patient for whom the lab test returns a positive result. Should we
diagnose the patient as having cancer or not? The maximum a posteriori hypothesis can be found
using

P(+|cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(+|¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298

Thus, hMAP = ¬cancer.
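The arithmetic in this example can be checked with a few lines of Python; this is only a sketch that re-uses the probabilities stated above.

# Worked cancer-diagnosis example: which hypothesis is MAP given a positive test?
p_cancer, p_no_cancer = 0.008, 0.992            # priors
p_pos_given_cancer = 0.98                       # test sensitivity
p_pos_given_no_cancer = 1 - 0.97                # false-positive rate = 0.03

score_cancer = p_pos_given_cancer * p_cancer           # P(+|cancer) P(cancer)   = 0.00784
score_no_cancer = p_pos_given_no_cancer * p_no_cancer  # P(+|~cancer) P(~cancer) = 0.02976

h_map = "cancer" if score_cancer > score_no_cancer else "not cancer"
print(h_map)                                    # not cancer

# Normalizing gives the posterior probability of cancer given a positive test:
print(score_cancer / (score_cancer + score_no_cancer))   # about 0.21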


BAYES THEOREM AND CONCEPT LEARNING

Brute-Force Bayes Concept Learning


We can design a straightforward concept learning algorithm to output the maximum a posteriori
hypothesis, based on Bayes theorem, as follows:

BRUTE-FORCE MAP LEARNING algorithm
1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability.

This algorithm may require significant computation, because it applies Bayes theorem to each
hypothesis in H to calculate P(h|D).

In order to specify a learning problem for the BRUTE-FORCE MAP LEARNING algorithm, we must
specify what values are to be used for P(h) and for P(D|h).

The following are the assumptions.


1. The training data D is noise free (i.e., di = c(xi)).
2. The target concept c is contained in the hypothesis space H
3. We have no a priori reason to believe that any hypothesis is more probable than any other.

To specify P(h)
Given no prior knowledge that one hypothesis is more likely than another, it is reasonable to assign
the same prior probability to every hypothesis h in H. Furthermore, because we assume the target
concept is contained in H, we should require that these prior probabilities sum to 1. Together these
constraints imply that we should choose

P(h) = 1/|H|   for all h in H

To specify P(D|h)
P(D|h) is the probability of observing the target values D = (d1, ..., dm) for the fixed set of instances
(x1, ..., xm), given a world in which hypothesis h holds. Since we assume noise-free training data, the
probability of observing classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore,

P(D|h) = 1   if di = h(xi) for every di in D
P(D|h) = 0   otherwise

In other words, the probability of data D given hypothesis h is 1 if D is consistent with h, and 0
otherwise.

By Bayes theorem, we have

P(h|D) = P(D|h) P(h) / P(D)
First consider the case where h is inconsistent with the training data D. Since P(D|h) is 0 when h is
inconsistent with D, we have

P(h|D) = (0 · P(h)) / P(D) = 0
The posterior probability of a hypothesis inconsistent with D is zero.

Now consider the case where h is consistent with D. Since P(D|h) is 1 when h is consistent with D,
we have

P(h|D) = (1 · 1/|H|) / P(D) = (1/|H|) / (|VS_H,D| / |H|) = 1 / |VS_H,D|

where the total probability of the data under the uniform prior is P(D) = |VS_H,D| / |H|.

To summarize, Bayes theorem implies that the posterior probability P(h|D) under our assumed P(h)
and P(D|h) is

P(h|D) = 1 / |VS_H,D|   if h is consistent with D
P(h|D) = 0              otherwise


where |VS_H,D| is the number of hypotheses from H consistent with D (the size of the version space of H with respect to D).
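The following Python sketch illustrates this result for brute-force MAP learning with noise-free data and a uniform prior. The toy hypothesis space and training examples are invented for illustration; consistent hypotheses end up sharing posterior probability 1/|VS_H,D| and all others get 0.

# Brute-force MAP learning over a toy hypothesis space.
# Each hypothesis maps an instance (an integer here) to a boolean classification;
# the hypotheses and training data below are made up for illustration.

hypotheses = {
    "h_ge_2": lambda x: x >= 2,
    "h_ge_3": lambda x: x >= 3,
    "h_ge_5": lambda x: x >= 5,
    "h_even": lambda x: x % 2 == 0,
}
train = [(4, True), (7, True), (1, False)]   # noise-free (x, c(x)) pairs

def likelihood(h, data):
    # P(D|h) = 1 if h is consistent with every training example, else 0
    return 1.0 if all(h(x) == label for x, label in data) else 0.0

prior = 1.0 / len(hypotheses)                # uniform prior P(h) = 1/|H|
unnorm = {name: likelihood(h, train) * prior for name, h in hypotheses.items()}
p_data = sum(unnorm.values())                # P(D) = |VS_H,D| / |H|
posterior = {name: v / p_data for name, v in unnorm.items()}

print(posterior)   # h_ge_2 and h_ge_3 each get 0.5 (= 1/|VS_H,D|); the rest get 0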

BAYES OPTIMAL CLASSIFIER


The question considered till now is "What is the most probable hypothesis given the training data?"
In fact, the question that is often most significant is "What is the most probable classification of the
new instance given the training data?"

Consider a hypothesis space containing three hypotheses, h1, h2, and h3. Suppose that the posterior
probabilities of these hypotheses given the training data are 0.4, 0.3, and 0.3 respectively. Thus, h1 is
the MAP hypothesis. Suppose a new instance x is encountered, which is classified positive by h1,
but negative by h2 and h3. Taking all hypotheses into account, the probability that x is positive is 0.4
(the probability associated with h1), and the probability that it is negative is therefore 0.6. The most
probable classification (negative) in this case is different from the classification generated by the
MAP hypothesis.

In general, the most probable classification of the new instance is obtained by combining the
predictions of all hypotheses, weighted by their posterior probabilities.
If the possible classification of the new example can take on any value vj from some set V, then the
probability P(vj|D) that the correct classification for the new instance is vj, is

P(vj|D) = Σ_{hi in H} P(vj|hi) P(hi|D)

The optimal classification of the new instance is the value vj for which P(vj|D) is maximum.

To illustrate in terms of the above example, the set of possible classifications of the new instance is
V = {+, -}, and

P(h1|D) = 0.4,   P(+|h1) = 1,   P(-|h1) = 0
P(h2|D) = 0.3,   P(+|h2) = 0,   P(-|h2) = 1
P(h3|D) = 0.3,   P(+|h3) = 0,   P(-|h3) = 1

therefore

Σ_{hi in H} P(+|hi) P(hi|D) = 0.4
Σ_{hi in H} P(-|hi) P(hi|D) = 0.6


Any system that classifies new instances according to

argmax_{vj in V} Σ_{hi in H} P(vj|hi) P(hi|D)

is called a Bayes optimal classifier, or Bayes optimal learner. This method maximizes the
probability that the new instance is classified correctly, given the available data, hypothesis space,
and prior probabilities over the hypotheses.
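The three-hypothesis example above can be reproduced with a short Python sketch; the posteriors and per-hypothesis predictions are taken directly from that example.

# Bayes optimal classification for the h1/h2/h3 example.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}            # P(h|D)
# P(v|h) for the new instance x: h1 says positive, h2 and h3 say negative
predicts = {"h1": {"+": 1.0, "-": 0.0},
            "h2": {"+": 0.0, "-": 1.0},
            "h3": {"+": 0.0, "-": 1.0}}

def p_class(v):
    # P(v|D) = sum over h of P(v|h) * P(h|D)
    return sum(predicts[h][v] * posterior[h] for h in posterior)

scores = {v: p_class(v) for v in ("+", "-")}
print(scores)                       # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))  # '-' : the Bayes optimal classification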

GIBBS ALGORITHM
Although the Bayes optimal classifier obtains the best performance that can be achieved from the
given training data, it can be quite costly to apply. The expense arises because it computes the posterior
probability for every hypothesis in H and then combines the predictions of all hypotheses to
classify each new instance.

An alternative, less optimal method is the Gibbs algorithm, defined as follows:


1. Choose a hypothesis h from H at random, according to the posterior probability distribution over
H.
2. Use h to predict the classification of the next instance x.

Given a new instance to classify, the Gibbs algorithm simply applies a hypothesis drawn at random
according to the current posterior probability distribution. It can be shown that, under certain
conditions (in particular, when the target concept is itself drawn at random according to the prior
distribution assumed by the learner), the expected misclassification error of the Gibbs algorithm is
at worst twice the expected error of the Bayes optimal classifier.
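A minimal Python sketch of the Gibbs algorithm, re-using the posterior from the earlier three-hypothesis example: one hypothesis is sampled in proportion to its posterior probability and then used to classify the instance.

import random

# Gibbs algorithm sketch: sample one hypothesis according to P(h|D), then classify with it.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}             # P(h|D), from the earlier example
predicts = {"h1": "+", "h2": "-", "h3": "-"}              # each hypothesis's label for x

names = list(posterior)
h = random.choices(names, weights=[posterior[n] for n in names], k=1)[0]
print("sampled hypothesis:", h, "-> classification:", predicts[h])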


NAIVE BAYES CLASSIFIER


A highly practical Bayesian learning method is the naive Bayes learner, often called the naive
Bayes classifier. In some domains its performance has been shown to be comparable to that of
neural network and decision tree learning.

The naive Bayes classifier applies to learning tasks where each instance x is described by a
conjunction of attribute values and where the target function f(x) can take on any value from some
finite set V. A set of training examples of the target function is provided, and a new instance is
presented, described by the tuple of attribute values (a1, a2, ..., an). The learner is asked to predict the
target value, or classification, for this new instance.

The Bayesian approach to classifying the new instance is to assign the most probable target value,
vMAP, given the attribute values (a1, a2, ..., an) that describe the instance:

vMAP = argmax_{vj in V} P(vj | a1, a2, ..., an)

We can use Bayes theorem to rewrite this expression as

vMAP = argmax_{vj in V} P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
     = argmax_{vj in V} P(a1, a2, ..., an | vj) P(vj)

The naive Bayes classifier is based on the simplifying assumption that the attribute values are
conditionally independent given the target value. In other words, the assumption is that given the
target value of the instance, the probability of observing the conjunction a1, a2, ..., an is just the
product of the probabilities for the individual attributes: P(a1, a2, ..., an | vj) = Π_i P(ai | vj).
Substituting this into the expression above gives the naive Bayes classifier:

vNB = argmax_{vj in V} P(vj) Π_i P(ai | vj)

where vNB denotes the target value output by the naive Bayes classifier.

Estimating probabilities: when the training data are limited, the probabilities P(ai|vj) can be estimated
using the m-estimate

(nc + m·p) / (n + m)

Here, nc is the number of training examples satisfying the condition (having attribute value ai and class vj),
n is the total number of training examples of class vj, p is the prior estimate of the probability we wish to
determine, and m is a constant called the equivalent sample size, which determines how heavily to weight
p relative to the observed data.
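The sketch below shows one possible implementation of a naive Bayes classifier with m-estimate smoothing. The tiny training set, the attribute names, and the choice of p = 1/(number of values of the attribute) are illustrative assumptions, not part of these notes.

from collections import Counter, defaultdict

# Toy training data: each example is ({attribute: value, ...}, class_label).
# The data set is invented purely for illustration.
train = [
    ({"Outlook": "Sunny",    "Wind": "Weak"},   "No"),
    ({"Outlook": "Sunny",    "Wind": "Strong"}, "No"),
    ({"Outlook": "Rain",     "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Overcast", "Wind": "Weak"},   "Yes"),
]
m = 3.0                                                # equivalent sample size

class_counts = Counter(label for _, label in train)
attr_values = defaultdict(set)
for x, _ in train:
    for a, v in x.items():
        attr_values[a].add(v)

def p_attr_given_class(a, v, c):
    # m-estimate: (nc + m*p) / (n + m), with p = 1 / |values of attribute a|
    n = class_counts[c]
    nc = sum(1 for x, label in train if label == c and x[a] == v)
    p = 1.0 / len(attr_values[a])
    return (nc + m * p) / (n + m)

def classify(x):
    scores = {}
    for c, n in class_counts.items():
        score = n / len(train)                         # P(vj)
        for a, v in x.items():
            score *= p_attr_given_class(a, v, c)       # product of P(ai|vj)
        scores[c] = score
    return max(scores, key=scores.get)

print(classify({"Outlook": "Sunny", "Wind": "Weak"}))  # expected: "No"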


Example




2.

i. Estimate the conditional probabilities P(Color|Yes), P(Color|No), P(Type|Yes), P(Type|No),
P(Origin|Yes), P(Origin|No). Predict the class of the example (Red, Domestic, SUV) using Naïve
Bayes.
ii. Estimate the conditional probabilities using the m-estimate. Predict the class of the example
(Red, Domestic, SUV) using Naïve Bayes.


Bayesian Belief Networks

Bayesian Belief Network is a Probabilistic Graphical Model (PGM) that represents a set of variables
and their conditional dependencies using a directed acyclic graph.

Bayesian networks are probabilistic because they are built from probability distributions and use
probability theory for tasks such as prediction and anomaly detection.

Many real-world problems are probabilistic in nature, and a Bayesian network gives us a way to
represent the relationships between multiple events. It can be used for various tasks including
prediction, anomaly detection, diagnostics, automated insight, reasoning, time-series prediction, and
decision making under uncertainty.

A Bayesian network can be used for building models from data and expert opinions, and it consists of
two parts:

o Directed Acyclic Graph


o Table of conditional probabilities.

A Bayesian network graph is made up of nodes and arcs (directed links), where

o Each node corresponds to a random variable, and a variable can be continuous or discrete.
o Arcs (directed arrows) represent causal relationships or conditional dependencies between
  random variables. These directed links connect pairs of nodes in the graph. A link indicates
  that one node directly influences the other; if there is no directed link between two nodes,
  there is no direct dependence between them.
o In the example diagram of the network, A, B, C, and D are random variables represented by
  the nodes of the graph.
o If node B is connected to node A by a directed arrow pointing from A to B, then node A is
  called the parent of node B.


o Node C is independent of node A.


o Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)),
  which quantifies the effect of the parents on that node.

Benefits of Bayesian Belief Networks

o Visualization. Provides a direct way to visualize the structure of the model and to motivate
  the design of new models.
o Relationships. Provides insights into the presence and absence of the relationships
between random variables.
o Computations. Provides a way to structure complex probability calculations.

Joint probability distribution:

If we have variables x1, x2, x3, ..., xn, then the probability of a particular combination of values of
x1, x2, x3, ..., xn is given by the joint probability distribution.

P[x1, x2, x3, ..., xn] can be written in terms of conditional probabilities using the chain rule:

P[x1, x2, x3, ..., xn] = P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]

= P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn]

In general, for each variable Xi in a Bayesian network we can write

P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))

provided the variables are ordered so that each node's parents appear before it.
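This factorization translates directly into code: given each node's parents and its CPT, the joint probability of a complete assignment is the product of P(Xi | Parents(Xi)). The two-node network (Rain -> WetGrass) and its numbers below are made up purely for illustration.

# Sketch: joint probability of a full assignment in a Bayesian network,
# computed as the product of P(Xi | Parents(Xi)).
parents = {"Rain": [], "WetGrass": ["Rain"]}

# CPTs: map (tuple of parent values) -> P(node = True | parents)
cpt = {
    "Rain":     {(): 0.2},
    "WetGrass": {(True,): 0.9, (False,): 0.1},
}

def p_node(node, value, assignment):
    key = tuple(assignment[p] for p in parents[node])
    p_true = cpt[node][key]
    return p_true if value else 1.0 - p_true

def joint(assignment):
    # P(x1, ..., xn) = product over nodes of P(xi | parents(xi))
    prob = 1.0
    for node, value in assignment.items():
        prob *= p_node(node, value, assignment)
    return prob

print(joint({"Rain": True,  "WetGrass": True}))   # 0.2 * 0.9 = 0.18
print(joint({"Rain": False, "WetGrass": True}))   # 0.8 * 0.1 = 0.08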

Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm responds
reliably to a burglary, but it also responds to minor earthquakes. Harry has two neighbors, David and
Sophia, who have taken responsibility to inform Harry at work when they hear the alarm. David
always calls Harry when he hears the alarm, but sometimes he confuses the telephone ringing with
the alarm and calls then as well. On the other hand, Sophia likes to listen to loud music, so sometimes
she fails to hear the alarm. Here we would like to compute the probability of events involving the
burglary, the earthquake, the alarm, and the two calls.

Problem:

Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has
occurred, and both David and Sophia have called Harry.


Solution:

o The Bayesian network for the above problem is given below. The network structure shows
  that Burglary and Earthquake are the parent nodes of Alarm and directly affect the
  probability of the alarm going off, whereas David's and Sophia's calls depend only on the
  alarm.
o The network thus represents our assumptions that David and Sophia do not directly perceive
  the burglary, do not notice minor earthquakes, and do not confer before calling.
o The conditional distribution for each node is given as a conditional probability table, or
  CPT.
o Each row in a CPT must sum to 1, because the entries in a row represent an exhaustive set of
  cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents requires 2^k rows, one for each
  combination of parent values. Hence, if there are two parents, the CPT contains 4 rows of
  probability values.

List of all events occurring in this network:

o Burglary (B)
o Earthquake (E)
o Alarm (A)
o David Calls (D)
o Sophia Calls (S)

We can write the events of the problem statement as the joint probability P[D, S, A, B, E], and expand
it using the chain rule together with the conditional independences encoded by the network:

P[D, S, A, B, E] = P[D | S, A, B, E] · P[S, A, B, E]

= P[D | S, A, B, E] · P[S | A, B, E] · P[A, B, E]

= P[D | A] · P[S | A, B, E] · P[A, B, E]

= P[D | A] · P[S | A] · P[A | B, E] · P[B, E]

= P[D | A] · P[S | A] · P[A | B, E] · P[B | E] · P[E]

and since Burglary and Earthquake are independent, P[B | E] = P[B].


Let us take the prior probabilities for the Burglary and Earthquake components:

P(B = True) = 0.002, which is the probability of a burglary.

P(B = False) = 0.998, which is the probability of no burglary.

P(E = True) = 0.001, which is the probability of a minor earthquake.

P(E = False) = 0.999, which is the probability that no earthquake occurred.

We can provide the conditional probabilities as per the below tables:

Conditional probability table for Alarm A:

The conditional probability of Alarm (A) depends on Burglary (B) and Earthquake (E):

B       E       P(A=True)   P(A=False)
True    True    0.94        0.06
True    False   0.95        0.05
False   True    0.31        0.69
False   False   0.001       0.999


Conditional probability table for David Calls:

The conditional probability that David calls depends on the state of the Alarm:

A       P(D=True)   P(D=False)
True    0.91        0.09
False   0.05        0.95

Conditional probability table for Sophia Calls:

The conditional probability that Sophia calls depends on its parent node Alarm:

A       P(S=True)   P(S=False)
True    0.75        0.25
False   0.02        0.98

From the formula for the joint distribution, we can now answer the problem statement:

P(S, D, A, ¬B, ¬E) = P(S|A) · P(D|A) · P(A|¬B ∧ ¬E) · P(¬B) · P(¬E)

= 0.75 × 0.91 × 0.001 × 0.998 × 0.999

≈ 0.00068
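The same calculation can be checked in Python using the CPT values from the tables above (a sketch; the dictionary layout is just one convenient way to store the CPTs):

# Joint probability P(S=T, D=T, A=T, B=F, E=F) for the burglar-alarm network,
# using the CPT values from the tables above.
p_b = {True: 0.002, False: 0.998}                  # P(Burglary)
p_e = {True: 0.001, False: 0.999}                  # P(Earthquake)
p_a = {(True, True): 0.94, (True, False): 0.95,    # P(Alarm=True | B, E)
       (False, True): 0.31, (False, False): 0.001}
p_d = {True: 0.91, False: 0.05}                    # P(David calls=True | Alarm)
p_s = {True: 0.75, False: 0.02}                    # P(Sophia calls=True | Alarm)

b, e, a = False, False, True                       # the query assignment (both calls are True)

# Since D and S are both True in the query, their CPT entries can be used directly.
prob = p_s[a] * p_d[a] * p_a[(b, e)] * p_b[b] * p_e[e]
print(prob)                                        # approximately 0.00068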

