ARTIFICIAL INTELLIGENCE
By
BITS Team AI
Pilani
Pilani | Dubai | Goa | Hyderabad
Contact Session-6
Probability, Bayesian Networks and Hidden Markov Models
Facts
Delhi is the capital of India
You can reach Bangalore Airport from MG Road within 90 mins if
you go by route A.
Would you call that a fact?
Are you sure this would always be true?
Uncertainty
You can reach Bangalore Airport from MG Road within 90 mins if you go by
route A.
There is uncertainty in this information due to partial observability and non-determinism
Agents should handle such uncertainty
Earlier approaches such as logic represent all possible world states
Such approaches cannot be used here, as multiple possible states would need to be enumerated to handle the uncertainty in our information
Belief
You can reach Bangalore Airport from MG Road within 90 mins if
you go by route A.
Such information can only provide a degree of belief, e.g., we are
80% confident that it would be true on any given day
To deal with such degrees of belief, we need Probability Theory
Probability Theory
Probability provides a way to summarize the uncertainty
Sample Space: Set of all possible outcomes.
Ex: After tossing 2 coins, the set of all possible outcomes is
{HH, HT, TH, TT}
Event: A subset of a sample space.
An event of interest might be - {HH}
Probability Basics – Probability Model
A fully specified probability model associates a numerical
probability P(ω) with each possible world.
The basic axioms
Every possible world has a probability between 0 and 1
Sum of probabilities of possible worlds is 1
E.g., P(HH) = 0.25; P(HT) = 0.25; P(TT) = 0.25, P(TH) = 0.25
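The two-coin model above can be written down directly; a minimal sketch in Python (the `prob` helper is ours, introduced for illustration):

```python
from fractions import Fraction

# A fully specified probability model for tossing two fair coins
sample_space = ['HH', 'HT', 'TH', 'TT']
P = {w: Fraction(1, 4) for w in sample_space}

# The basic axioms: each P(w) lies in [0, 1], and the P(w) sum to 1
assert all(0 <= p <= 1 for p in P.values())
assert sum(P.values()) == 1

def prob(event):
    """Probability of an event, i.e. a subset of the sample space."""
    return sum(P[w] for w in event)
```

For instance, prob({'HH'}) is 1/4, and prob({'HH', 'HT', 'TH'}) ("at least one head") is 3/4.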
Probability Basics – Unconditional / Prior
Unconditional / Prior probabilities: Propositions like
P(sum = 11) or P(the two dice roll doubles) are called
unconditional or prior probabilities
They refer to degrees of belief in the absence of any other
information
Probability Basics - Conditional
However, most of the time we have some information, we
call it evidence
E.g., we may be interested in two dice rolling a double (i.e., 1-1, 2-2,
etc.) when one die has already rolled a 5 and the other die is still
spinning
Here, we are not interested in the unconditional probability of
rolling a double
Instead, we want the conditional or posterior probability of
rolling a double given that the first die has rolled a 5,
written P(double | die1 = 5), where | is pronounced "given"
E.g., if you go to a dentist for a checkup, P(cavity) = 0.2
If you have a toothache, then P(cavity | toothache) = 0.6
Probability Basics - Conditional
Conditional probabilities can be expressed in terms of unconditional
probabilities
E.g., for propositions a, b:
P(a | b) = P(a ∧ b) / P(b), which holds whenever P(b) > 0
Product rule: P(a ∧ b) = P(a | b) P(b)
Example
In a factory there are 100 units of a certain product, 5 of which are defective. We
pick three units from the 100 units at random. What is the probability that none of
them are defective?
Let Ai be the event that the i-th picked unit is not defective, i = 1, 2, 3.
By the product rule,
P(A1 ∧ A2 ∧ A3) = P(A1) P(A2 | A1) P(A3 | A1 ∧ A2) = (95/100)(94/99)(93/98) ≈ 0.856
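The chain-rule computation for the defective-units example can be checked numerically; a small sketch using exact rational arithmetic:

```python
from fractions import Fraction

# P(A1 ∧ A2 ∧ A3) = P(A1) · P(A2 | A1) · P(A3 | A1, A2)
# After each non-defective pick, one fewer good unit (and one fewer
# unit overall) remains in the pool.
p_none_defective = (Fraction(95, 100)
                    * Fraction(94, 99)
                    * Fraction(93, 98))
```

This evaluates to roughly 0.856, i.e. an 85.6% chance that none of the three picked units is defective.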
Probability Distribution
If there is a random variable Weather with domain {sunny, rain, cloudy,
snow}, we could write
P(Weather = sunny) = 0.6
P(Weather = rain) = 0.1
P(Weather = cloudy) = 0.2
P(Weather = snow) = 0.1
P(Weather) = <0.6, 0.1, 0.2, 0.1>
P defines a probability distribution for the random variable Weather
P is used for conditional distributions; P(X|Y) gives values for P(X = i | Y = j)
for all combinations of i and j
Probability Density Function
For continuous variables (e.g., weight) it is infeasible to write the distribution
as a vector
Instead, we define a density as a function of the value the variable can
take
P(Weight = x) = Normal(x | mean = 60, std = 10)
This says that the weight of an individual is Normal (Gaussian) distributed with
an average weight of 60 kgs and a standard deviation of 10 kgs
Such functions are called probability density functions
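A density such as Normal(x | mean = 60, std = 10) is just a function of x; a minimal sketch of the Gaussian density:

```python
import math

def normal_pdf(x, mean=60.0, std=10.0):
    """Density of a Normal(mean, std) distribution evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))
```

The density peaks at the mean, and its value there is not the probability of weighing exactly 60 kg; probabilities for a continuous variable come from integrating the density over an interval.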
Joint Probability Distributions
Instead of a distribution over a single variable, we can model a distribution over
multiple variables, separated by commas
E.g., P(A, B) = P(A | B) . P(B)
P(A, B) is the probability distribution over combination of all values of A and
B
E.g., if A = Weather and B = Cavity
Independence
If we have two random variables, TimeToBnlrAirport and
HyderabadWeather
P(TimeToBnlrAirport, HyderabadWeather)
To determine their relation, use the product rule
= P(TimeToBnlrAirport | HyderabadWeather) P(HyderabadWeather)
However, we would argue that HyderabadWeather and
TimeToBnlrAirport have no relation, and hence
P(TimeToBnlrAirport | HyderabadWeather) = P(TimeToBnlrAirport)
This is called Independence or Marginal Independence
Independence between propositions a and b can be written as
P(a ∧ b) = P(a) P(b), or equivalently P(a | b) = P(a)
Bayes Rule
Using the product rule for propositions a and b:
P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
Equating the right-hand sides and dividing by P(a):
P(b | a) = P(a | b) P(b) / P(a)
This is called the Bayes Rule
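As a quick sketch, Bayes' rule can turn the dentist example around. Only P(cavity) = 0.2 and P(cavity | toothache) = 0.6 come from the example above; P(toothache) = 0.1 below is an assumed, illustrative prior:

```python
def bayes(p_a_given_b, p_b, p_a):
    """P(b | a) = P(a | b) · P(b) / P(a)."""
    return p_a_given_b * p_b / p_a

# a = cavity, b = toothache:
# P(toothache | cavity) = P(cavity | toothache) · P(toothache) / P(cavity)
# P(toothache) = 0.1 is an assumed value for illustration.
p_toothache_given_cavity = bayes(0.6, 0.1, 0.2)
```

With these numbers, P(toothache | cavity) = 0.6 × 0.1 / 0.2 = 0.3.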
Conditional Independence
Two variables X and Y are conditionally independent given Z if
P(X, Y | Z) = P(X | Z) P(Y | Z)
This generalizes to more than 2 random variables
E.g., for K different symptom variables X1, X2, … XK, and C = disease:
P(X1, X2, …, XK | C) = Π P(Xi | C)
Also known as the naïve Bayes assumption
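Under the naive Bayes assumption the joint likelihood factorizes into a simple product; a sketch with hypothetical per-symptom likelihoods (the three values below are made up for illustration):

```python
from math import prod

def naive_bayes_likelihood(symptom_likelihoods):
    """P(X1, ..., XK | C) = Π P(Xi | C), assuming the Xi are
    conditionally independent given the disease C."""
    return prod(symptom_likelihoods)

# Hypothetical values of P(Xi | C = disease) for K = 3 symptoms
p = naive_bayes_likelihood([0.9, 0.7, 0.2])
```

With these values p = 0.9 × 0.7 × 0.2 = 0.126.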
Bayesian Networks
Full joint probability distributions can be
used for inference; however, this becomes
intractable as the number of variables
grows
Bayesian Networks represent
dependencies among variables
Can represent full joint probability
distribution
What is a Bayesian Network
A Bayesian network is a directed graph in which each node is annotated with
quantitative probability information.
Each node corresponds to a random variable, which may be discrete or
continuous.
A set of directed links or arrows connects pairs of nodes. If there is an arrow
from node X to node Y, X is said to be a parent of Y. The graph has no
directed cycles (and hence is a directed acyclic graph, or DAG).
Each node Xi has a conditional probability distribution P(Xi | Parents(Xi )) that
quantifies the effect of the parents on the node.
Bayesian Networks
Once the topology of the Bayesian
network is laid out, we specify the
conditional probability distribution for
each variable given its parents
The topology and the conditional
probabilities together suffice to specify
the full joint probability distribution over
all variables
Building a Bayesian Network
BITS Pilani, Deemed to be University under Section 3 of UGC
BITS Pilani, Pilani Campus
Example Bayesian Net #2
A Burglary Alarm System
– Fairly reliable at detecting a burglary
– Also responds to earthquakes
– Two neighbors, John and Mary, are asked to call you at work when a
burglary happens and they hear the alarm
– John nearly always calls when he hears the alarm, but sometimes
confuses the telephone ring with the alarm and calls then too
– Mary likes loud music and often misses the alarm altogether
– Problem: Given the information about who has or has not called, we
need to estimate the probability of a burglary; e.g., calculate the
probability that the alarm has sounded, but neither a burglary nor an
earthquake has happened, and both John and Mary called.
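The query at the end of the slide can be answered by multiplying the relevant CPT entries along the network. The numeric CPT values below (P(B) = 0.001, P(E) = 0.002, P(A | ¬B, ¬E) = 0.001, P(J | A) = 0.90, P(M | A) = 0.70) are the classic textbook values for this example and are assumed here, since the slide does not list them:

```python
# Query: P(alarm, ¬burglary, ¬earthquake, JohnCalls, MaryCalls)
#      = P(j | a) · P(m | a) · P(a | ¬b, ¬e) · P(¬b) · P(¬e)
p_b = 0.001        # P(Burglary)                       (assumed)
p_e = 0.002        # P(Earthquake)                     (assumed)
p_a_nb_ne = 0.001  # P(Alarm | ¬Burglary, ¬Earthquake) (assumed)
p_j_a = 0.90       # P(JohnCalls | Alarm)              (assumed)
p_m_a = 0.70       # P(MaryCalls | Alarm)              (assumed)

p = p_j_a * p_m_a * p_a_nb_ne * (1 - p_b) * (1 - p_e)
```

With these values the probability is about 0.00063; the point is that the full joint entry is a product of one CPT entry per node.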
Example Bayesian Net #3: Traffic Prediction - Travel Estimation
– The AI system reminds the traveler, taking the day and time into account
– The travel plan is to reach Delhi, and the weather of Delhi may influence
the accommodation plans
– The traveler always takes a car to reach the airport
– The car may be rerouted either due to a road block or weekday traffic during
working hours, which delays the arrival at the airport
– Bars are always observed to be full on weekends
– Authorities block roads for the processions
– Processions are observed during the festive season or due to political rallies
– Problem: Given the information that a political rally is expected,
estimate the probability of late arrival
Building the network (steps):
– Identify the RVs
– Identify the dependencies among the RVs
– Find the conditional independences
– Use ML to get the best linearization among the RVs
– Construct the Bayes net
– Encode the local dependencies by CPTs
Random variables (nodes): Festival, Political Rally, Procession,
Weather @ Delhi, Weekend, Road Block, Cars Rerouted, All Bars Are Full,
Late for BGLR Airport
Dependencies: Festival and Political Rally → Procession;
Procession → Road Block; Road Block and Weekend → Cars Rerouted;
Weekend → All Bars Are Full; Cars Rerouted → Late for BGLR Airport
Example Bayesian Net #3
[Figure: the travel network annotated with a conditional probability table
(CPT) at each node, e.g. P(Procession | Festival, Rally),
P(Road Block | Procession), P(Cars Rerouted | Road Block, Weekend),
P(All Bars Full | Weekend), P(Late for BGLR Airport | Cars Rerouted);
the numeric CPT entries are omitted]
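The dependency structure described in this example can be encoded as a parent map; a minimal sketch (node names and edges are read off the scenario, so treat this as one plausible structure rather than the definitive network):

```python
# Each node maps to the list of its parents; together they form a DAG
parents = {
    'Festival': [],
    'PoliticalRally': [],
    'Weekend': [],
    'WeatherDelhi': [],
    'Procession': ['Festival', 'PoliticalRally'],
    'RoadBlock': ['Procession'],
    'CarsRerouted': ['RoadBlock', 'Weekend'],
    'AllBarsFull': ['Weekend'],
    'LateForAirport': ['CarsRerouted'],
}

def ancestors(node):
    """All nodes that can influence `node` through directed paths."""
    seen = set()
    stack = list(parents[node])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents[p])
    return seen
```

For example, ancestors('LateForAirport') contains PoliticalRally: a rally can cause a procession, hence a road block, hence a rerouted car, and finally a late arrival.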
Probabilistic Reasoning over Time
Hidden Markov Model
Partial Observability
Agents in a partially observable environment should keep track of the current
state, to the extent allowed by their sensors
E.g., a robot moving in a new maze
The agent maintains a belief state representing the current possible world states
Transition Model: Using the belief state and the transition model, the agent can
predict how the world might evolve in the next time step
Sensor Model: With the observed percepts and the sensor model, the agent can
update the belief state
Degree of belief
Earlier, the states were captured as
facts in logic, which can only take the values True/False
To capture a degree of belief, we will use Probability
Theory
We model the change in world using a variable for each
aspect of state and at each point in time
Transition models – describe the probability distribution of
variables at time t, given the state of the world at past times
Sensor models – describe the probability of each percept at
time t, given the current state of the world
Time and Uncertainty
Static World: Each random variable would
have a single fixed value
E.g., Diagnosing a broken car
Dynamic World: The state information
keeps changing with time
E.g., treating a diabetic patient, tracking
the location of robot, tracking economic
activity of a nation
Hidden Markov model
Markov model /Markov chain
A Markov process is a process that generates a sequence of
outcomes in such a way that the probability of the next outcome
depends only on the current outcome and not on what happened
earlier.
MARKOV CHAIN: WEATHER EXAMPLE
Design a Markov chain to
predict tomorrow's weather
using information from the
past days.
Our model has only 3 states: S = {S1, S2, S3},
where S1 = Sunny, S2 = Rainy, S3 = Cloudy.
State sequence notation: q1, q2, q3, q4, q5, …, where
qi ∈ {Sunny, Rainy, Cloudy}.
Markov Property
P(qt+1 | qt, qt-1, …, q1) = P(qt+1 | qt): the next state depends only on the
current state
Example
Given that today is Sunny, what’s the probability that tomorrow is
Sunny and the next day Rainy?
Example 2
Assume that yesterday's weather was Rainy and today is
Cloudy. What is the probability that tomorrow will be Sunny?
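Both examples reduce to reading entries off the transition matrix. The matrix below is illustrative (assumed), since the slide's numbers are not shown; only the method is the point:

```python
# Assumed one-step transition probabilities P(tomorrow | today)
T = {
    'Sunny':  {'Sunny': 0.8, 'Rainy': 0.05, 'Cloudy': 0.15},
    'Rainy':  {'Sunny': 0.2, 'Rainy': 0.6,  'Cloudy': 0.2},
    'Cloudy': {'Sunny': 0.2, 'Rainy': 0.3,  'Cloudy': 0.5},
}

# Example 1: today Sunny -> tomorrow Sunny -> day after Rainy
p1 = T['Sunny']['Sunny'] * T['Sunny']['Rainy']

# Example 2: by the Markov property, yesterday (Rainy) is irrelevant;
# only today (Cloudy) matters
p2 = T['Cloudy']['Sunny']
```

With this matrix p1 = 0.8 × 0.05 = 0.04, and p2 = 0.2 regardless of yesterday's weather.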
WHAT IS A HIDDEN MARKOV MODEL (HMM)?
A Hidden Markov Model is a stochastic model where the states
of the model are hidden. Each state can emit an output, which is
observed.
Imagine: You were locked in a room for several days and you
were asked about the weather outside. The only piece of
evidence you have is whether the person who comes into the
room bringing your daily meal is carrying an umbrella or not.
What is hidden? Sunny, Rainy, Cloudy
What can you observe? Umbrella or Not
Markov Chain vs HMM
[Figure: side-by-side comparison of a Markov chain and an HMM]
Hidden Markov Models (Formal)
• States Q = q1, q2…qN;
• Observations O= o1, o2…oN;
• Transition probabilities
• Transition probability matrix A = {aij}
• Emission Probability /Output probability
• Output probability matrix B={bi(k)}
• Special initial probability vector
First-Order HMM Assumptions
• Markov assumption: probability of a state depends
only on the state that precedes it
P(qi | q1, …, qi-1) = P(qi | qi-1)
How to build a second-order HMM?
• Second-order HMM
• Current state only depends on previous 2 states
• Example
• Trigram model over POS tags
Markov Chain for Weather
What is the probability of 4 consecutive warm
days?
Sequence is
warm-warm-warm-warm
And state sequence is
3-3-3-3
P(3, 3, 3, 3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432
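The computation uses only the initial probability of the warm state (π3 = 0.2) and its self-transition probability (a33 = 0.6), both taken from the slide:

```python
pi_warm = 0.2  # initial probability of state 3 (warm)
a_33 = 0.6     # self-transition probability warm -> warm

# Four consecutive warm days: one initial draw, then three self-transitions
p = pi_warm * a_33 ** 3
```

This reproduces p = 0.2 × 0.6³ = 0.0432.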
Hidden Markov Models
It is a sequence model.
Assigns a label or class to each unit in a sequence, thus
mapping a sequence of observations to a sequence
of labels.
Probabilistic sequence model: given a sequence of units
(e.g. words, letters, morphemes, sentences), compute a
probability distribution over possible sequences of labels
and choose the best label sequence.
This is a kind of generative model.
Hidden Markov Model (HMM)
Oftentimes we want to know what produced the sequence – the
hidden sequence for the observed sequence. For example,
– Inferring the words (hidden) from acoustic signal (observed) in
speech recognition
– Assigning part-of-speech tags (hidden) to a sentence
(sequence of words) – POS tagging.
– Assigning named entity categories (hidden) to a sentence
(sequence of words) – Named Entity Recognition.
Problem 1:
Observation Likelihood
• The probability of an observation sequence given a model
• Evaluation problem
Problem 2:
• Most probable state sequence given a model and an
observation sequence
• Decoding problem
Problem 3:
• Infer the best model parameters, given a partial model and an
observation sequence...
– That is, fill in the A and B tables with the right numbers:
• the numbers that make the observation sequence most likely
• This is how the probabilities are learned!
Solutions
Problem 1: Forward algorithm (computes the observation likelihood)
Problem 2: Viterbi algorithm (finds the best state sequence)
Problem 3: Forward-Backward algorithm (learns the probabilities)
– An instance of EM (Expectation Maximization)
Example: HMMs for Ice Cream
You are a climatologist in the year 2799 studying global warming
You can’t find any records of the weather in Baltimore for summer of
2007
But you find Jason Eisner’s diary which lists how many ice-creams
Jason ate every day that summer
Your job: figure out how hot it was each day
Hidden Markov Model
Example -1 – Viterbi Algorithm - Initialization
Observation sequence: 3 1 3 (number of ice creams eaten each day)
Initial probabilities <S>: Hot 0.8 | Cold 0.2
Transition matrix A: Hot→Hot 0.6 | Hot→Cold 0.4 | Cold→Hot 0.5 | Cold→Cold 0.5
Emission matrix B: Hot: P(1) = .2, P(2) = .4, P(3) = .4 | Cold: P(1) = .5, P(2) = .4, P(3) = .1
Initialization (first observation = 3):
v1(Cold) = P(C) × P(3|C) = 0.2 × 0.1 = 0.02
v1(Hot)  = P(H) × P(3|H) = 0.8 × 0.4 = 0.32
Hidden Markov Model
Example -1 – Viterbi Algorithm - Recursion
Recursion (second observation = 1):
v2(Cold) = max[ v1(C)·P(C|C)·P(1|C), v1(H)·P(C|H)·P(1|C) ]
         = max[ 0.02 × 0.5 × 0.5, 0.32 × 0.4 × 0.5 ] = max[ 0.005, 0.064 ] = 0.064 (from Hot)
v2(Hot)  = max[ v1(C)·P(H|C)·P(1|H), v1(H)·P(H|H)·P(1|H) ]
         = max[ 0.02 × 0.5 × 0.2, 0.32 × 0.6 × 0.2 ] = max[ 0.002, 0.0384 ] = 0.0384 (from Hot)
(Tables as before: A: Hot→Hot 0.6, Hot→Cold 0.4, Cold→Hot 0.5, Cold→Cold 0.5;
B: P(1|Hot) = 0.2, P(1|Cold) = 0.5)
Hidden Markov Model
Example -1 – Viterbi Algorithm – Termination through Back Trace
Termination (third observation = 3, using P(3|Hot) = 0.4 and P(3|Cold) = 0.1):
v3(Cold) = max[ v2(C)·P(C|C)·P(3|C), v2(H)·P(C|H)·P(3|C) ]
         = max[ 0.064 × 0.5 × 0.1, 0.0384 × 0.4 × 0.1 ] = max[ 0.0032, 0.0015 ] = 0.0032 (from Cold)
v3(Hot)  = max[ v2(C)·P(H|C)·P(3|H), v2(H)·P(H|H)·P(3|H) ]
         = max[ 0.064 × 0.5 × 0.4, 0.0384 × 0.6 × 0.4 ] = max[ 0.0128, 0.0092 ] = 0.0128 (from Cold)
Best final state: Hot (0.0128); backtracing through the best predecessors gives the
best sequence: Hot → Cold → Hot
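The three slides above can be reproduced with a short Viterbi implementation over the same π, A, and B tables; a sketch that recomputes the trellis for the observation sequence 3 1 3:

```python
# Viterbi over the ice-cream HMM; pi, A, B are the tables from the slides
states = ['Hot', 'Cold']
pi = {'Hot': 0.8, 'Cold': 0.2}
A = {'Hot': {'Hot': 0.6, 'Cold': 0.4},    # P(next state | Hot)
     'Cold': {'Hot': 0.5, 'Cold': 0.5}}   # P(next state | Cold)
B = {'Hot': {1: 0.2, 2: 0.4, 3: 0.4},     # P(ice creams | Hot)
     'Cold': {1: 0.5, 2: 0.4, 3: 0.1}}    # P(ice creams | Cold)

def viterbi(obs):
    """Return (best state path, trellis) for an observation sequence."""
    V = [{s: pi[s] * B[s][obs[0]] for s in states}]   # initialization
    back = []
    for o in obs[1:]:                                 # recursion
        prev, col, ptr = V[-1], {}, {}
        for s in states:
            best = max(states, key=lambda q: prev[q] * A[q][s])
            col[s] = prev[best] * A[best][s] * B[s][o]
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])        # termination
    path = [last]
    for ptr in reversed(back):                        # back trace
        path.append(ptr[path[-1]])
    path.reverse()
    return path, V

path, V = viterbi([3, 1, 3])
```

V[0]['Hot'] comes out as 0.32 and V[1]['Cold'] as 0.064, matching the initialization and recursion steps worked out above.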
Hidden Markov Model
Example -1 – Viterbi Algorithm
Source Credit : Speech and Language Processing - Jurafsky and Martin
Hidden Markov Model
Example -4 – Naïve Search
Observation sequence: 1 3 1
Initial probabilities <S>: Hot 0.8 | Cold 0.2
Transition matrix A: Hot→Hot 0.7 | Hot→Cold 0.3 | Cold→Hot 0.4 | Cold→Cold 0.6
Emission matrix B: Hot: P(1) = .2, P(2) = .4, P(3) = .4 | Cold: P(1) = .5, P(2) = .4, P(3) = .1
Enumerate all 2^3 = 8 state sequences and score each, e.g.:
HHH: P(H)·P(1|H)·P(H|H)·P(3|H)·P(H|H)·P(1|H)
CHC: P(C)·P(1|C)·P(H|C)·P(3|H)·P(C|H)·P(1|C) = 0.2 × 0.5 × 0.4 × 0.4 × 0.3 × 0.5 = 0.0024
The remaining sequences (HHC, HCC, CCC, CCH, CHH, HCH) are scored the same
way; the highest-scoring sequence is the answer.
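The enumeration can be done mechanically; a brute-force sketch over the tables on this slide, for the observation sequence 1 3 1:

```python
from itertools import product

# Naive (exhaustive) scoring of every hidden state sequence
states = ['Hot', 'Cold']
pi = {'Hot': 0.8, 'Cold': 0.2}
A = {'Hot': {'Hot': 0.7, 'Cold': 0.3},   # transition table from this slide
     'Cold': {'Hot': 0.4, 'Cold': 0.6}}
B = {'Hot': {1: 0.2, 2: 0.4, 3: 0.4},
     'Cold': {1: 0.5, 2: 0.4, 3: 0.1}}

def joint(seq, obs):
    """P(states = seq, observations = obs) for one candidate sequence."""
    p = pi[seq[0]] * B[seq[0]][obs[0]]
    for prev, cur, o in zip(seq, seq[1:], obs[1:]):
        p *= A[prev][cur] * B[cur][o]
    return p

obs = [1, 3, 1]
scores = {seq: joint(seq, obs) for seq in product(states, repeat=len(obs))}
```

scores[('Cold', 'Hot', 'Cold')] reproduces the 0.0024 worked out on the slide. With N states and T observations this enumeration costs O(N^T), which is exactly the blow-up that the forward and Viterbi algorithms avoid.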