
Lecture 7: Naïve Bayes

Naïve Bayes classifier

• It is a classification technique based on Bayes' theorem with an
  independence assumption among features (predictors).

• The Naïve Bayes model is easy to build, with no complicated iterative
  parameter estimation, which makes it particularly useful for very large
  datasets.
Bayes Theorem
▪ Given a class C and a feature X which bears on the class:

      P(C|X) = P(X|C) P(C) / P(X)

▪ P(C): unconditional probability of C (the hypothesis): the prior probability
▪ P(X): unconditional probability of X (the data, i.e. the predictor): the evidence
▪ P(X|C): conditional probability of X given C: the likelihood
▪ P(C|X): conditional probability of C given X: the posterior probability
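As a quick numeric illustration (the prior and likelihood values below are made up, not from the lecture), here is a minimal Python sketch of applying Bayes' theorem:

```python
# Minimal sketch of Bayes' theorem with hypothetical numbers.
p_c = 0.3                      # prior P(C)
p_x_given_c = 0.8              # likelihood P(X|C)
p_x_given_not_c = 0.2          # likelihood P(X|not C)

# Evidence P(X) by total probability: P(X) = P(X|C)P(C) + P(X|not C)P(not C)
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)

# Posterior P(C|X) = P(X|C) P(C) / P(X)
p_c_given_x = p_x_given_c * p_c / p_x
print(p_c_given_x)             # 0.24 / 0.38 ≈ 0.632
```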
Maximum A Posteriori
▪ Based on Bayes' theorem, we can compute the Maximum A Posteriori (MAP)
  hypothesis for the data.
▪ We are interested in the best hypothesis from some space C given the
  observed training data X.

      c_MAP ≡ argmax_{c ∈ C} P(c|X)
            = argmax_{c ∈ C} P(X|c) P(c) / P(X)
            = argmax_{c ∈ C} P(X|c) P(c)

  C: the set of all hypotheses (classes).

  Note that we can drop P(X), since the probability of the data is constant
  (and independent of the hypothesis).
Bayes Classifiers
Assumption: the training set consists of instances of different classes c_j,
described as conjunctions of attribute values.
Task: classify a new instance d, given as a tuple of attribute values
(x_1, x_2, ..., x_n), into one of the classes c_j ∈ C.
Key idea: assign the most probable class c_MAP using Bayes' theorem.

      c_MAP = argmax_{c_j ∈ C} P(c_j | x_1, x_2, ..., x_n)
            = argmax_{c_j ∈ C} P(x_1, x_2, ..., x_n | c_j) P(c_j) / P(x_1, x_2, ..., x_n)
            = argmax_{c_j ∈ C} P(x_1, x_2, ..., x_n | c_j) P(c_j)
The Naïve Bayes Model

▪ The Naïve Bayes assumption: assume that the effect of the value of a
  predictor (X) on a given class (C) is independent of the values of the
  other predictors.

▪ This assumption is called class conditional independence:

      P(x_1, x_2, ..., x_n | C) = P(x_1|C) × P(x_2|C) × ... × P(x_n|C)
                                = ∏_{i=1}^{n} P(x_i|C)
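To make the factorization concrete, here is a tiny sketch (the per-feature likelihood values are hypothetical, not from the slides) of computing the class-conditional joint probability as a product of per-feature conditionals:

```python
import math

# Hypothetical per-feature conditional probabilities P(x_i | C) for one class.
per_feature_likelihoods = [0.2, 0.5, 0.9]   # P(x1|C), P(x2|C), P(x3|C)

# Under class conditional independence:
# P(x1, x2, x3 | C) = P(x1|C) * P(x2|C) * P(x3|C)
joint_likelihood = math.prod(per_feature_likelihoods)
print(joint_likelihood)   # ≈ 0.09
```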
Naïve Bayes Algorithm
• The Naïve Bayes algorithm (for discrete input attributes) has two phases
  (see the Python sketch below).
  – 1. Learning Phase: given a training set S, learning is easy; just build
       probability tables.
         For each target value c_i (c_i = c_1, ..., c_L):
             P̂(C = c_i) ← estimate P(C = c_i) from the examples in S
         For every attribute value x_jk of each attribute X_j
         (j = 1, ..., n; k = 1, ..., N_j):
             P̂(X_j = x_jk | C = c_i) ← estimate P(X_j = x_jk | C = c_i)
             from the examples in S
         Output: conditional probability tables; for X_j, N_j × L elements

  – 2. Test Phase: given an unknown instance X' = (a_1, ..., a_n), look up the
       tables and assign the label c* to X' if

         [P̂(a_1|c*) · ... · P̂(a_n|c*)] P̂(c*) > [P̂(a_1|c) · ... · P̂(a_n|c)] P̂(c)
         for all c ≠ c*, c = c_1, ..., c_L

       Classification is easy; just multiply probabilities.
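Below is a minimal Python sketch of both phases, assuming the training data is a list of (features, label) pairs where features is a dict of discrete attribute values; the names fit_naive_bayes and predict are illustrative, not from the lecture.

```python
from collections import Counter

def fit_naive_bayes(examples):
    """Learning phase: build the prior and the conditional probability tables."""
    class_counts = Counter(label for _, label in examples)
    priors = {c: class_counts[c] / len(examples) for c in class_counts}

    # cond_counts[(attribute, value, class)] = number of matching training examples
    cond_counts = Counter()
    for features, label in examples:
        for attr, value in features.items():
            cond_counts[(attr, value, label)] += 1

    def cond_prob(attr, value, label):
        # P̂(X_j = x_jk | C = c_i), estimated by relative frequency
        return cond_counts[(attr, value, label)] / class_counts[label]

    return priors, cond_prob

def predict(priors, cond_prob, instance):
    """Test phase: return the class maximizing P̂(a_1|c) ... P̂(a_n|c) P̂(c)."""
    def score(label):
        s = priors[label]
        for attr, value in instance.items():
            s *= cond_prob(attr, value, label)
        return s
    return max(priors, key=score)
```

With the Play Tennis data used in the following example, this relative-frequency counting reproduces the probability tables shown on the next slides.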
Example
• Example: Play Tennis
Example
• Learning Phase

Outlook    Play=Yes  Play=No      Temperature  Play=Yes  Play=No
Sunny      2/9       3/5          Hot          2/9       2/5
Overcast   4/9       0/5          Mild         4/9       2/5
Rain       3/9       2/5          Cool         3/9       1/5

Humidity   Play=Yes  Play=No      Wind         Play=Yes  Play=No
High       3/9       4/5          Strong       3/9       3/5
Normal     6/9       1/5          Weak         6/9       2/5

P(Play=Yes) = 9/14                P(Play=No) = 5/14


Example
• Test Phase

– Given a new instance, predict its label:

  x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)

– Look up the tables built in the learning phase:

  P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
  P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
  P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
  P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
  P(Play=Yes) = 9/14                     P(Play=No) = 5/14

– Decision making with the MAP rule:

  P(Yes|x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
  P(No|x')  ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206

  Since P(Yes|x') < P(No|x'), we label x' as "No".
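A quick numeric check of the two unnormalized scores (a sketch; the variable names are illustrative):

```python
# Unnormalized posterior scores for x' = (Sunny, Cool, High, Strong)
score_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
score_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)

print(round(score_yes, 4))                        # 0.0053
print(round(score_no, 4))                         # 0.0206
print("No" if score_no > score_yes else "Yes")    # "No"
```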


Naïve Bayes
• Algorithm: Continuous-valued Features
– A continuous-valued feature can take uncountably many values, so we cannot
  tabulate counts for each value.
– The conditional probability is instead often modeled with the normal
  (Gaussian) distribution:

      P̂(x_j | c_i) = 1 / (√(2π) σ_ji) · exp( −(x_j − μ_ji)² / (2 σ_ji²) )

      μ_ji: mean (average) of the feature values x_j of the examples for which C = c_i
      σ_ji: standard deviation of the feature values x_j of the examples for which C = c_i

– Learning Phase: for X = (X_1, ..., X_n) and C = c_1, ..., c_L,
  Output: n × L normal distributions and P(C = c_i), i = 1, ..., L

– Test Phase: given an unknown instance X' = (a_1, ..., a_n),
  • instead of looking up tables, calculate the conditional probabilities with
    the normal distributions obtained in the learning phase (see the sketch below)
  • apply the MAP rule to assign a label (the same as in the discrete case)
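A minimal Python sketch of this continuous-valued variant, assuming each class's examples are given as a list of feature values; the helper names and the use of the sample standard deviation are assumptions, not prescribed by the slides.

```python
import math
from statistics import mean, stdev   # stdev uses the (N-1) sample estimate

def gaussian_pdf(x, mu, sigma):
    """Normal density used to model P̂(x_j | c_i) for a continuous feature."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def fit_gaussian_per_class(values_by_class):
    """Learning phase: estimate (mean, std) of the feature for each class."""
    return {c: (mean(vals), stdev(vals)) for c, vals in values_by_class.items()}

def likelihood(params, x, label):
    """Test phase: P̂(x | label) under the fitted normal distribution."""
    mu, sigma = params[label]
    return gaussian_pdf(x, mu, sigma)
```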
Naïve Bayes
• Example: Continuous-valued Features
– Temperature is naturally a continuous-valued feature.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate the mean and variance for each class:

      μ = (1/N) Σ_{n=1}^{N} x_n,    σ² = (1/(N−1)) Σ_{n=1}^{N} (x_n − μ)²

      μ_Yes = 21.64, σ_Yes = 2.35
      μ_No  = 23.88, σ_No  = 7.09

– Learning Phase: output two Gaussian models for P(temperature|C):

      P̂(x|Yes) = 1/(2.35 √(2π)) · exp( −(x − 21.64)² / (2 · 2.35²) )
                ≈ 1/(2.35 √(2π)) · exp( −(x − 21.64)² / 11.09 )

      P̂(x|No)  = 1/(7.09 √(2π)) · exp( −(x − 23.88)² / (2 · 7.09²) )
                ≈ 1/(7.09 √(2π)) · exp( −(x − 23.88)² / 100.5 )
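A short check of these estimates with Python's statistics module (a sketch; the sample standard deviation with the N−1 denominator is assumed, consistent with the values quoted above):

```python
from statistics import mean, stdev

yes_temps = [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8]
no_temps  = [27.3, 30.1, 17.4, 29.5, 15.1]

print(round(mean(yes_temps), 2), round(stdev(yes_temps), 2))  # 21.64 2.35
print(round(mean(no_temps), 2), round(stdev(no_temps), 2))    # 23.88 7.09
```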
Zero conditional probability
• If no training example contains a given feature value for some class
  – In this circumstance, we face a zero conditional probability problem
    during testing:

      P̂(x_1|c_i) · ... · P̂(a_jk|c_i) · ... · P̂(x_n|c_i) = 0
      whenever x_j = a_jk and P̂(a_jk|c_i) = 0

  – As a remedy, the class-conditional probabilities are re-estimated with
    the m-estimate (see the sketch below):

      P̂(a_jk | c_i) = (n_c + m·p) / (n + m)

      n_c: number of training examples for which x_j = a_jk and c = c_i
      n:   number of training examples for which c = c_i
      p:   prior estimate (usually p = 1/t for t possible values of x_j)
      m:   weight given to the prior (the number of "virtual" examples, m ≥ 1)
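A minimal sketch of the m-estimate as a Python helper (the function name is illustrative):

```python
def m_estimate(n_c, n, p, m):
    """Smoothed estimate of P(a_jk | c_i) = (n_c + m*p) / (n + m).

    n_c: count of class-c_i examples with x_j = a_jk
    n:   count of class-c_i examples
    p:   prior estimate, e.g. 1/t for t possible attribute values
    m:   equivalent number of "virtual" examples
    """
    return (n_c + m * p) / (n + m)

# Example from the next slide: P(Outlook=Overcast|No) with n_c=0, n=5, p=1/3, m=1
print(m_estimate(0, 5, 1/3, 1))   # 0.0555... = 1/18
```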
Zero conditional probability
• Example: P(Outlook=Overcast|No) = 0 in the Play Tennis dataset
  – Add m "virtual" examples (m: up to 1% of the number of training examples)
    • In this dataset, the number of training examples for the "No" class is 5.
    • So we add only m = 1 "virtual" example in the m-estimate remedy.
  – The "Outlook" feature can take only 3 values, so p = 1/3.
  – Re-estimate P(Outlook=Overcast|No) with the m-estimate:

      n_c = 0   (number of samples with Outlook=Overcast and Play=No)
      n   = 5   (number of samples with Play=No)
      p   = 1/3 (Outlook has 3 values: Sunny, Overcast, Rain)
      m   = 1

      P̂(Outlook=Overcast|No) = (0 + 1·(1/3)) / (5 + 1) = 1/18 ≈ 0.056
Conclusion
▪ Naïve Bayes is based on the independence assumption.
▪ Training is very easy and fast; it only requires considering each attribute
  in each class separately.
▪ Testing is straightforward; it only requires looking up tables or computing
  conditional probabilities with the fitted normal distributions.

▪ Naïve Bayes
  • The performance of Naïve Bayes is competitive with most state-of-the-art
    classifiers, even when the independence assumption is violated.
  • It has many successful applications, e.g., spam mail filtering.
