Naïve Bayes Classifier: April 25, 2006
MAP Hypothesis
$$h_{MAP} = \arg\max_{h \in H} P(h \mid D)
          = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)}
          = \arg\max_{h \in H} P(D \mid h)\,P(h)$$
H: the set of all hypotheses.
Note that we can drop P(D), since the probability of the data is constant
(and independent of the hypothesis).
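To make the MAP rule concrete, here is a minimal sketch (not from the original slides) of MAP selection over a small discrete hypothesis space; the coin-bias hypotheses, the prior, and the observed flips are all assumed for illustration.

```python
from math import comb

# Hypothesis space: candidate biases of a coin (assumed for illustration).
hypotheses = [0.3, 0.5, 0.7]
# Assumed prior: the fair coin is believed more likely a priori.
prior = {0.3: 0.2, 0.5: 0.6, 0.7: 0.2}

# Observed data D: 7 heads in 10 flips.
heads, flips = 7, 10

def likelihood(h):
    """P(D | h) under a binomial model."""
    return comb(flips, heads) * h**heads * (1 - h)**(flips - heads)

# h_MAP = argmax_h P(D | h) P(h); P(D) is dropped, as noted above.
h_map = max(hypotheses, key=lambda h: likelihood(h) * prior[h])
print(h_map)  # 0.5: the strong prior on the fair coin outweighs the 7/10 heads
```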
Maximum Likelihood
Now assume that all hypotheses are equally probable a priori, i.e., P(h_i) = P(h_j) for all h_i, h_j in H.
This is called assuming a uniform prior. It simplifies computing the posterior:
$$h_{ML} = \arg\max_{h \in H} P(D \mid h)$$
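As a small check of why the uniform prior can be dropped (assumed toy numbers, not from the slides): multiplying every likelihood by the same constant 1/|H| cannot change which hypothesis attains the maximum.

```python
# Assumed toy likelihood values P(D | h) for three hypotheses.
likelihoods = {"h1": 0.02, "h2": 0.10, "h3": 0.05}
uniform_prior = 1 / len(likelihoods)  # P(h) = 1/|H| for every h

# Scaling every likelihood by the same constant preserves the argmax,
# so h_ML coincides with h_MAP under a uniform prior.
h_ml = max(likelihoods, key=likelihoods.get)
h_map = max(likelihoods, key=lambda h: likelihoods[h] * uniform_prior)
assert h_ml == h_map == "h2"
```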
MAP Classification
$$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)
          = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)}{P(x_1, x_2, \ldots, x_n)}
          = \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)$$
Parameter Estimation
P(c_j) can be estimated from the frequency of classes in the training examples.
P(x_1, x_2, ..., x_n | c_j) requires O(|X|^n · |C|) parameters, so it could only be
estimated if a very, very large number of training examples were available.
Independence Assumption: attribute values are conditionally independent given the
target value; this is the naïve Bayes assumption:
$$P(x_1, x_2, \ldots, x_n \mid c_j) = \prod_i P(x_i \mid c_j)$$
which gives the naïve Bayes classifier:
$$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j)$$
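The following is a minimal sketch of naïve Bayes training (frequency estimates for P(c_j) and P(x_i | c_j)) and classification with the rule above; the tiny two-attribute dataset at the bottom is assumed for illustration only.

```python
from collections import Counter, defaultdict

def train_nb(examples, labels):
    """Estimate P(c_j) and P(x_i | c_j) from frequencies in the training data."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}

    # cond_counts[(i, x_i, c)] = N(X_i = x_i, C = c)
    cond_counts = defaultdict(int)
    for x, c in zip(examples, labels):
        for i, value in enumerate(x):
            cond_counts[(i, value, c)] += 1

    def cond_prob(i, value, c):
        return cond_counts[(i, value, c)] / class_counts[c]

    return priors, cond_prob

def classify_nb(x, priors, cond_prob):
    """c_NB = argmax_c P(c) * prod_i P(x_i | c)."""
    def score(c):
        p = priors[c]
        for i, value in enumerate(x):
            p *= cond_prob(i, value, c)
        return p
    return max(priors, key=score)

# Assumed toy data: attributes (outlook, temperature), class = play?
examples = [("sunny", "hot"), ("sunny", "cool"), ("rain", "cool"), ("rain", "hot")]
labels = ["no", "yes", "yes", "no"]
priors, cond_prob = train_nb(examples, labels)
print(classify_nb(("sunny", "cool"), priors, cond_prob))  # -> "yes"
```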
Properties
Estimating P(x_i | c_j) instead of P(x_1, x_2, ..., x_n | c_j) greatly reduces the
number of parameters (and the data sparseness problem).
The learning step in Naïve Bayes consists of estimating P(x_i | c_j) and P(c_j)
based on the frequencies in the training data.
An unseen instance is classified by computing the class that maximizes the posterior.
When conditional independence is satisfied, Naïve Bayes corresponds to MAP classification.
Question: For the day <sunny, cool, high, strong>, what’s
the play prediction?
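One way to work out the answer, assuming the probability estimates that come from the standard 14-day PlayTennis training set in Mitchell's Machine Learning textbook (the table itself is not reproduced on these slides):

```python
# Assumed estimates from the standard PlayTennis data (9 "yes" days, 5 "no" days).
p_yes, p_no = 9/14, 5/14
cond_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9}
cond_no  = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5}

day = ["sunny", "cool", "high", "strong"]
score_yes, score_no = p_yes, p_no
for value in day:
    score_yes *= cond_yes[value]
    score_no *= cond_no[value]

print(round(score_yes, 4), round(score_no, 4))        # ~0.0053 vs ~0.0206
print("play =", "yes" if score_yes > score_no else "no")  # predicts "no"
```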
Underflow Prevention
Multiplying lots of probabilities, which are
between 0 and 1 by definition, can result in
floating-point underflow.
Since log(xy) = log(x) + log(y), it is better to
perform all computations by summing logs of
probabilities rather than multiplying
probabilities.
The class with the highest final unnormalized log-probability score is still
the most probable.
$$c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in \text{positions}} \log P(x_i \mid c_j) \Big]$$
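A minimal log-space variant of the earlier classification sketch; `priors` and `cond_prob` are assumed to be the (smoothed, strictly positive) estimates produced by a training step like the one sketched above.

```python
from math import log

def classify_nb_log(x, priors, cond_prob):
    """c_NB = argmax_c [ log P(c) + sum_i log P(x_i | c) ], avoiding underflow."""
    def log_score(c):
        s = log(priors[c])
        for i, value in enumerate(x):
            s += log(cond_prob(i, value, c))  # assumes smoothed, non-zero estimates
        return s
    return max(priors, key=log_score)
```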
Smoothing to Avoid Overfitting
$$\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i,\, C = c_j) + 1}{N(C = c_j) + k}$$
where k is the number of values of X_i (add-one smoothing).
A somewhat more subtle version:
$$\hat{P}(x_{i,k} \mid c_j) = \frac{N(X_i = x_{i,k},\, C = c_j) + m\,p_{i,k}}{N(C = c_j) + m}$$
where p_{i,k} is the overall fraction of the data where X_i = x_{i,k}, and m controls
the extent of "smoothing".
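A minimal sketch of both smoothed estimators computed directly from counts; the function names and the toy counts in the usage lines are assumed for illustration.

```python
def smoothed_cond_prob(count_xi_and_c, count_c, num_values_of_xi):
    """P_hat(x_i | c_j) = (N(X_i = x_i, C = c_j) + 1) / (N(C = c_j) + k)."""
    return (count_xi_and_c + 1) / (count_c + num_values_of_xi)

def m_estimate(count_xik_and_c, count_c, p_ik, m):
    """P_hat(x_{i,k} | c_j) = (N(X_i = x_{i,k}, C = c_j) + m*p_ik) / (N(C = c_j) + m)."""
    return (count_xik_and_c + m * p_ik) / (count_c + m)

# Assumed toy counts: an attribute value never seen with class c_j
# still gets a non-zero (smoothed) probability estimate.
print(smoothed_cond_prob(0, 9, 3))        # 1/12 instead of 0
print(m_estimate(0, 9, p_ik=0.25, m=4))   # 1/13 instead of 0
```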