
6.867 Section 3: Classification

Reading:

Probabilistic models: Bishop 4.1, 4.2

Bayesian methods: Bishop 4.3.1-4.3.4, 4.4-4.5

SVMs: Bishop (briefly) first part of 7.1; Murphy (grudgingly) 14.5; Hastie, Tibshirani, and
Friedman 12.1-3

Contents
1 Intro

2 Representation

3 Probabilistic models
3.1 Estimating Pr(X, Y)
3.1.1 Linear discriminant analysis
3.1.2 Factoring the class conditional probability
3.1.3 Exponential family
3.1.3.1 Bernoulli is in exponential family
3.1.3.2 Normal is in exponential family
3.1.3.3 LDA for exponential family
3.2 Estimating Pr(Y | X)
3.2.1 Least squares
3.2.2 Logistic regression
3.2.3 Generalized linear models
3.3 Generative versus discriminative

4 Distributions over models

MIT 6.867 Fall 2016 2

1 Intro
Now we will look at the supervised learning problem of classification, in which the data
given is a set of pairs
$$\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\},$$
with $x^{(i)} \in \mathbb{R}^D$ and $y^{(i)} \in \{c_1, \ldots, c_k\}$; that is, the output is a discrete
value indicating the class of the example. By default, we'll think about the two-class
classification problem, with the classes or labels often drawn from the set $\{0, 1\}$ or
$\{+1, -1\}$. This problem isn't conceptually different from regression, but it offers a
different kind of structure to be exploited during learning.
We'll look at learning prediction rules, probabilistic models, and distributions over
models, and focus on the case in which the model represents a thresholded linear
relationship between inputs and outputs. In later parts of the course, we will return to
classification and look at non-linear and non-parametric approaches.
Because we have recently been looking at probabilistic approaches to regression, we
will begin by looking at probabilistic and Bayesian approaches to classification. When we
are done with that, we will consider some interesting strategies for finding linear separators
directly, without probabilistic modeling.

2 Representation
What does it mean to have a linear model for classification? Generally, it will be that we
can express the output value $y^{(i)}$ by specifying a $(D-1)$-dimensional hyperplane in the
$D$-dimensional feature space. Then, points that are on one side of the hyperplane are
considered to be in one class, points on the other side, in the other class.
That is, there are some weight values $w_0$ and $w = (w_1, \ldots, w_D)$ such that

$$y = \begin{cases} +1 & \text{if } w_0 + w_1 x_1 + \ldots + w_D x_D > 0 \\ -1 & \text{otherwise.} \end{cases}$$

Such a model is known as a linear separator. So, in a 2D feature space, our separator would
be a 1D hyperplane (otherwise known as a line). We would use three parameters to describe
the line, which is really one more than necessary, because multiplying all three by the
same positive constant gives the same separator.
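The thresholded linear rule can be sketched in a few lines. This is only an illustration: the function name `predict` and the example weights are assumptions, not something from the notes.

```python
def predict(w0, w, x):
    """Return +1 if w0 + w . x > 0, else -1 (the thresholded linear rule)."""
    activation = w0 + sum(wj * xj for wj, xj in zip(w, x))
    return 1 if activation > 0 else -1

# Hypothetical 2D separator: the line x1 + x2 = 1.
w0, w = -1.0, [1.0, 1.0]
above = predict(w0, w, [2.0, 0.5])   # point on the positive side of the line
below = predict(w0, w, [0.2, 0.3])   # point on the negative side of the line

# Scaling (w0, w) by a positive constant leaves the prediction unchanged.
scaled = predict(3 * w0, [3 * wj for wj in w], [2.0, 0.5])
print(above, below, scaled)
```

The last call shows why three numbers over-parameterize a line in 2D: only the direction of the weight vector (including the offset) matters, not its length.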
As in regression, we can transform the input space, via a set of non-linear basis
functions, and find a linear separator in the transformed space. Such a separator will be
non-linear when projected back down into the original space.
In the following, to simplify notation, we will omit the possibility of using basis
functions to transform the input values, but the extension is completely straightforward.

3 Probabilistic models
                 No model   Prediction rule   Prob model   Dist over models
Classification                                    ✓

3.1 Estimating Pr(X, Y)


Estimating the joint distribution is often referred to as learning a generative model. We will
make a point estimate of the parameters governing the distribution of the data, and then
we can use our loss function to decide how to make decisions or predictions.

3.1.1 Linear discriminant analysis


If we're going to estimate the joint distribution, we need to make some distributional
assumptions. A common model is:

$$\begin{aligned}
Y &\sim \text{Bernoulli}(\pi) \\
X \mid Y = c &\sim \text{Gaussian}(\mu_c, \Sigma_c)
\end{aligned}$$

where $\pi_c$ specifies the unconditional probability of getting an object of class $c$, and
$\pi$ is a vector of $\pi_c$ for the possible class values $c$. We will use $\theta$ to name all of
the parameters: $\pi$, $\mu$, $\Sigma$.
We can find the maximum-likelihood parameter estimates for this model straightforwardly.
Letting $n_c = |\{i \in 1 \ldots n \mid y^{(i)} = c\}|$ be the number of examples in $\mathcal{D}$ of
class $c$, we have:

$$\begin{aligned}
\pi_c &= \frac{n_c}{n} \\
\mu_c &= \frac{1}{n_c} \sum_{\{i \mid y^{(i)} = c\}} x^{(i)} \\
\Sigma_c &= \frac{1}{n_c} \sum_{\{i \mid y^{(i)} = c\}} (x^{(i)} - \mu_c)(x^{(i)} - \mu_c)^T
\end{aligned}$$
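As a rough sketch of these maximum-likelihood estimates, the following computes the class prior, mean, and covariance for each class using plain Python lists. The function name `fit_gaussian_classes` and the toy data are assumptions made for illustration.

```python
def fit_gaussian_classes(xs, ys):
    """Per-class ML estimates: params[c] = (pi_c, mu_c, Sigma_c)."""
    n, dim = len(xs), len(xs[0])
    params = {}
    for c in sorted(set(ys)):
        members = [x for x, y in zip(xs, ys) if y == c]
        n_c = len(members)
        pi_c = n_c / n                                   # fraction of examples in class c
        mu_c = [sum(x[j] for x in members) / n_c for j in range(dim)]
        sigma_c = [[sum((x[j] - mu_c[j]) * (x[k] - mu_c[k]) for x in members) / n_c
                    for k in range(dim)]
                   for j in range(dim)]                  # divides by n_c (ML), not n_c - 1
        params[c] = (pi_c, mu_c, sigma_c)
    return params

# Made-up 2D data with two classes.
xs = [[0.0, 0.0], [2.0, 0.0], [4.0, 1.0], [4.0, 3.0], [6.0, 2.0]]
ys = [0, 0, 1, 1, 1]
params = fit_gaussian_classes(xs, ys)
for c, (pi_c, mu_c, sigma_c) in params.items():
    print(c, pi_c, mu_c, sigma_c)
```

Note that the covariance uses the $1/n_c$ normalization from the formulas above, which is the maximum-likelihood estimate rather than the unbiased one.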

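Now, how should we make predictions? If we have 0-1 loss,

$$h(x) = \begin{cases} 1 & \text{if } \Pr(Y = 1 \mid X = x; \theta) > \Pr(Y = 0 \mid X = x; \theta) \\ 0 & \text{otherwise} \end{cases}$$

Let's concentrate on the conditions under which we predict 1:

$$\begin{aligned}
\Pr(Y = 1 \mid X = x; \theta) &> \Pr(Y = 0 \mid X = x; \theta) \\
\Pr(X = x \mid Y = 1; \theta) \Pr(Y = 1; \theta) &> \Pr(X = x \mid Y = 0; \theta) \Pr(Y = 0; \theta) \\
\log \Pr(X = x \mid Y = 1; \theta) + \log \Pr(Y = 1; \theta) &> \log \Pr(X = x \mid Y = 0; \theta) + \log \Pr(Y = 0; \theta) \\
\log \Pr(X = x \mid Y = 1; \theta) - \log \Pr(X = x \mid Y = 0; \theta) &> \log \Pr(Y = 0; \theta) - \log \Pr(Y = 1; \theta) \\
-\tfrac{1}{2} \log \det(\Sigma_1) - \tfrac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)
+ \tfrac{1}{2} \log \det(\Sigma_0) + \tfrac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) &> \log \pi_0 - \log \pi_1 \\
-(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) &> 2(\log \pi_0 - \log \pi_1) + \log \det(\Sigma_1) - \log \det(\Sigma_0)
\end{aligned}$$

This is pretty clearly quadratic in $x$, so it's what we would call a quadratic separator or
quadratic discriminant.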
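If we assume that the covariances of the two classes are equal, $\Sigma_1 = \Sigma_0 = \Sigma$, then
things simplify and we are doing linear discriminant analysis. Now we have that (the
determinant of $\Sigma$ cancels):

$$\Pr(Y = c \mid X = x) \propto \exp\left(\mu_c^T \Sigma^{-1} x - \tfrac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \log \pi_c\right) \exp\left(-\tfrac{1}{2} x^T \Sigma^{-1} x\right).$$

Define

$$\begin{aligned}
\beta_c &= \Sigma^{-1} \mu_c \\
\gamma_c &= -\tfrac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \log \pi_c
\end{aligned}$$

Then

$$\begin{aligned}
\Pr(Y = c \mid X = x) &= \frac{\exp\left(\beta_c^T x + \gamma_c\right) \exp\left(-\tfrac{1}{2} x^T \Sigma^{-1} x\right)}{\sum_{c'} \exp\left(\beta_{c'}^T x + \gamma_{c'}\right) \exp\left(-\tfrac{1}{2} x^T \Sigma^{-1} x\right)} \\
&= \frac{\exp\left(\beta_c^T x + \gamma_c\right)}{\sum_{c'} \exp\left(\beta_{c'}^T x + \gamma_{c'}\right)}
\end{aligned}$$

Note that, in the two class case, this reduces to

$$\begin{aligned}
\Pr(Y = 1 \mid X = x) &= \frac{\exp\left(\beta_1^T x + \gamma_1\right)}{\exp\left(\beta_0^T x + \gamma_0\right) + \exp\left(\beta_1^T x + \gamma_1\right)} \\
&= \frac{1}{\exp\left(\beta_0^T x + \gamma_0 - \beta_1^T x - \gamma_1\right) + 1} \\
&= \text{sigmoid}(\beta_1^T x + \gamma_1 - \beta_0^T x - \gamma_0)
\end{aligned}$$

where the sigmoid function and its inverse, the logit function, are defined as follows:

$$\begin{aligned}
\text{sigmoid}(a) &= \frac{1}{1 + \exp(-a)} = \frac{\exp a}{1 + \exp a} \\
\text{logit}(p) &= \log\left(\frac{p}{1 - p}\right).
\end{aligned}$$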

The sigmoid is a soft step function, which takes a real number and maps it to the interval
$(0, 1)$. If we have an expression of the form $\text{sigmoid}(W^T X)$, then the larger the
magnitude of $W$, the steeper the slope on the sigmoid.
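A small sketch of the sigmoid and logit, checking that they are inverses and that a larger weight gives a steeper sigmoid. The helper `slope_at_zero` is a hypothetical name introduced here; it estimates the slope numerically.

```python
import math

def sigmoid(a):
    """Soft step function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def logit(p):
    """Inverse of the sigmoid: the log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

# Round trip: logit undoes sigmoid.
a = 1.7
round_trip = logit(sigmoid(a))

def slope_at_zero(w, h=1e-6):
    """Central-difference estimate of d/dx sigmoid(w * x) at x = 0."""
    return (sigmoid(w * h) - sigmoid(-w * h)) / (2 * h)

# The slope at the origin is w / 4, so it grows with the magnitude of w.
slopes = [slope_at_zero(w) for w in (1.0, 4.0, 16.0)]
print(round_trip, slopes)
```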
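So, with two classes, we predict class 1 when

$$\begin{aligned}
\Pr(Y = 1 \mid X = x; \theta) &> \Pr(Y = 0 \mid X = x; \theta) \\
\exp\left(\beta_1^T x + \gamma_1\right) &> \exp\left(\beta_0^T x + \gamma_0\right) \\
\beta_1^T x + \gamma_1 &> \beta_0^T x + \gamma_0 \\
x^T (\beta_1 - \beta_0) + \gamma_1 - \gamma_0 &> 0
\end{aligned}$$

This is definitely a linear separator.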
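To make the shared-covariance result concrete, here is a one-dimensional sketch (so the covariance is a single variance) that compares the sigmoid form of the posterior against a direct application of Bayes' rule. The names `beta` and `gamma` stand for the per-class slope and offset defined above; the function names and the numeric values of the priors, means, and variance are invented for illustration.

```python
import math

def lda_parameters(pi, mu, var):
    """1D shared-variance case: beta_c = mu_c / var, gamma_c = -mu_c^2 / (2 var) + log pi_c."""
    beta = [m / var for m in mu]
    gamma = [-m * m / (2 * var) + math.log(p) for m, p in zip(mu, pi)]
    return beta, gamma

def posterior_sigmoid(x, pi, mu, var):
    """Pr(Y = 1 | x) as a sigmoid of a linear function of x."""
    beta, gamma = lda_parameters(pi, mu, var)
    a = (beta[1] - beta[0]) * x + gamma[1] - gamma[0]
    return 1.0 / (1.0 + math.exp(-a))

def posterior_bayes(x, pi, mu, var):
    """Pr(Y = 1 | x) from Bayes' rule; the shared Gaussian normalizer cancels."""
    joint = [p * math.exp(-(x - m) ** 2 / (2 * var)) for p, m in zip(pi, mu)]
    return joint[1] / (joint[0] + joint[1])

pi, mu, var = [0.7, 0.3], [0.0, 2.0], 1.5    # hypothetical parameters
beta, gamma = lda_parameters(pi, mu, var)
boundary = (gamma[0] - gamma[1]) / (beta[1] - beta[0])   # where the activation is zero
for x in (-1.0, 1.0, 3.0):
    print(x, posterior_sigmoid(x, pi, mu, var), posterior_bayes(x, pi, mu, var))
print("decision boundary at x =", boundary)
```

In one dimension the linear separator is a single threshold; with the unequal priors chosen here it sits to the right of the midpoint between the two means.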

3.1.2 Factoring the class conditional probability


In the previous section, we assumed a special structure, namely, that all the elements in
a particular class were drawn from a Gaussian distribution (over $\mathbb{R}^D$). Now, we'll
pursue another method that is also based on a special distributional structure.
This is a method called Naive Bayes. Let's assume now that $x^{(i)} \in \{0, 1\}^D$. You could
interpret the $X$ values as numbers, encoded in binary, and put a multinomial distribution
over all $2^D$ values for each class. But that is a lot of parameters to estimate!
Assume: Features are independent, given the class. That is, that

$$\Pr(X \mid Y = c) = \prod_{j=1}^{D} \Pr(X_j \mid Y = c).$$

So

$$\begin{aligned}
Y &\sim \text{Bernoulli}(\pi) \\
X_j \mid Y = c &\sim \text{Bernoulli}(\theta_{cj})
\end{aligned}$$

So, now, if we have two classes, we have $2D$ parameters, each of which is easy to estimate:

$$\begin{aligned}
\Pr(X_j = 1 \mid Y = 1) &= \theta_{1j} \\
\Pr(X_j = 1 \mid Y = 0) &= \theta_{0j}
\end{aligned}$$

Use the Bernoulli ML estimate, maybe with a Laplace correction:

$$\hat{\theta}_{1j} = \frac{\#(X_j = 1, Y = 1) + 1}{\#(Y = 1) + 2}.$$
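The Laplace-corrected estimate can be sketched as follows. The function name `fit_naive_bayes` and the tiny binary data set are assumptions for illustration; the class prior here is the plain ML estimate, with smoothing applied only to the per-feature probabilities.

```python
def fit_naive_bayes(xs, ys):
    """Return (pi, theta): pi[c] = Pr(Y = c), theta[c][j] = smoothed Pr(X_j = 1 | Y = c)."""
    n, dim = len(xs), len(xs[0])
    pi, theta = {}, {}
    for c in (0, 1):
        members = [x for x, y in zip(xs, ys) if y == c]
        pi[c] = len(members) / n
        # Laplace correction: add 1 to the count of ones and 2 to the class count.
        theta[c] = [(sum(x[j] for x in members) + 1) / (len(members) + 2)
                    for j in range(dim)]
    return pi, theta

# Made-up binary feature vectors with two examples per class.
xs = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
ys = [1, 1, 0, 0]
pi, theta = fit_naive_bayes(xs, ys)
print(pi, theta)
```

The correction keeps every estimate strictly between 0 and 1, so a feature value never seen in a class does not force a zero probability.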

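Now, for prediction. Given $x$, predict $C = 1$ if

$$\begin{aligned}
\Pr(x \mid C = 1) \Pr(C = 1) &> \Pr(x \mid C = 0) \Pr(C = 0) \\
\prod_{j=1}^{D} \Pr(x_j \mid C = 1) \Pr(C = 1) &> \prod_{j=1}^{D} \Pr(x_j \mid C = 0) \Pr(C = 0) \\
\sum_{j=1}^{D} \log \Pr(x_j \mid C = 1) + \log \Pr(C = 1) &> \sum_{j=1}^{D} \log \Pr(x_j \mid C = 0) + \log \Pr(C = 0) \\
\sum_{j=1}^{D} \log\left(\theta_{1j}^{x_j} (1 - \theta_{1j})^{(1 - x_j)}\right) + \log \pi_1 &> \sum_{j=1}^{D} \log\left(\theta_{0j}^{x_j} (1 - \theta_{0j})^{(1 - x_j)}\right) + \log \pi_0 \\
\sum_{j=1}^{D} \left(x_j \log \theta_{1j} + (1 - x_j) \log(1 - \theta_{1j})\right) + \log \pi_1 &> \sum_{j=1}^{D} \left(x_j \log \theta_{0j} + (1 - x_j) \log(1 - \theta_{0j})\right) + \log \pi_0 \\
\sum_{j=1}^{D} x_j \left(\log \frac{\theta_{1j}}{1 - \theta_{1j}} - \log \frac{\theta_{0j}}{1 - \theta_{0j}}\right) &> \log \pi_0 - \log \pi_1 - \sum_{j=1}^{D} \log \frac{1 - \theta_{1j}}{1 - \theta_{0j}}
\end{aligned}$$

So, this is a linear separator of the form $x^T w + w_0 > 0$ with

$$w_j = \log \frac{\theta_{1j}}{1 - \theta_{1j}} - \log \frac{\theta_{0j}}{1 - \theta_{0j}},$$

and

$$w_0 = \log \frac{\pi_1}{\pi_0} + \sum_{j=1}^{D} \log \frac{1 - \theta_{1j}}{1 - \theta_{0j}}.$$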

Interestingly, the probability model is also sigmoidal. We have

$$\begin{aligned}
\Pr(C = 1 \mid X = x) &= \frac{\Pr(x \mid C = 1) \Pr(C = 1)}{\Pr(x \mid C = 1) \Pr(C = 1) + \Pr(x \mid C = 0) \Pr(C = 0)} \\
&= \frac{\exp(f_1(x))}{\exp(f_1(x)) + \exp(f_0(x))} \\
&= \frac{1}{1 + \exp(f_0(x) - f_1(x))} \\
&= \text{sigmoid}(f_1(x) - f_0(x))
\end{aligned}$$

where

$$f_c(x) = \sum_{j=1}^{D} \left(x_j \log \theta_{cj} + (1 - x_j) \log(1 - \theta_{cj})\right) + \log \pi_c.$$
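As a sanity check on the Naive Bayes derivation, the sketch below turns class priors and per-feature Bernoulli parameters into the weights of the linear separator, then confirms that a sigmoid of the linear activation matches the posterior computed directly from Bayes' rule. The function names and the parameter values are made up for illustration.

```python
import math

def nb_weights(pi, theta):
    """Weights (w, w0) of the linear separator x . w + w0 > 0 implied by Naive Bayes."""
    pairs = list(zip(theta[1], theta[0]))
    w = [math.log(t1 / (1 - t1)) - math.log(t0 / (1 - t0)) for t1, t0 in pairs]
    w0 = math.log(pi[1] / pi[0]) + sum(math.log((1 - t1) / (1 - t0)) for t1, t0 in pairs)
    return w, w0

def nb_posterior_direct(x, pi, theta):
    """Pr(C = 1 | x) by multiplying out the per-feature Bernoulli probabilities."""
    joint = {}
    for c in (0, 1):
        p = pi[c]
        for xj, t in zip(x, theta[c]):
            p *= t if xj == 1 else 1 - t
        joint[c] = p
    return joint[1] / (joint[0] + joint[1])

# Hypothetical parameters for three binary features.
pi = {0: 0.5, 1: 0.5}
theta = {0: [0.25, 0.5, 0.75], 1: [0.75, 0.5, 0.5]}
w, w0 = nb_weights(pi, theta)

x = [1, 0, 1]
activation = w0 + sum(wj * xj for wj, xj in zip(w, x))
posterior_from_sigmoid = 1.0 / (1.0 + math.exp(-activation))
print(posterior_from_sigmoid, nb_posterior_direct(x, pi, theta))
```

For this input the activation works out to $\log 2$, so both routes give a posterior of 2/3.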

3.1.3 Exponential family


A really cool family of distributions.

• Only family for which conjugate priors exist

• Finite-size sufficient statistics

• Includes Normal, Bernoulli, Multinomial, Poisson, Gamma, Exponential, Beta, Dirichlet,
  various combinations

A (somewhat simplified) subset of exponential-family distributions can be written in the
form:
$$\Pr(x; \eta) = h(x) g(\eta) \exp(\eta^T u(x))$$
where $x$ may be a scalar or vector, discrete or continuous; $\eta$ is called the natural
parameters, and $g(\eta)$ is a normalization constant.

3.1.3.1 Bernoulli is in exponential family

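$$\begin{aligned}
\Pr(x \mid p) &= p^x (1 - p)^{(1 - x)} \\
&= \exp\left(x \log(p) + (1 - x) \log(1 - p)\right) \\
&= \exp\left(x \log(p) - x \log(1 - p) + \log(1 - p)\right) \\
&= (1 - p) \exp\left(x \log \frac{p}{1 - p}\right)
\end{aligned}$$

This fits in the family with: $u(x) = x$, $\eta = \log \frac{p}{1 - p}$, $h(x) = 1$, and
$g(\eta) = 1 - p$. So

$$\Pr(x \mid \eta) = \text{sigmoid}(-\eta) \exp(\eta x).$$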
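A quick numerical check of the Bernoulli natural-parameter form, written as a sketch with an arbitrarily chosen $p$. It relies on the identity $1 - p = \text{sigmoid}(-\eta)$ when $\eta = \log\frac{p}{1-p}$; the function names are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def bernoulli_direct(x, p):
    """Standard parameterization: p^x (1 - p)^(1 - x)."""
    return p ** x * (1 - p) ** (1 - x)

def bernoulli_natural(x, eta):
    """Exponential-family form with h(x) = 1, u(x) = x, g(eta) = sigmoid(-eta)."""
    return sigmoid(-eta) * math.exp(eta * x)

p = 0.3                              # arbitrary example value
eta = math.log(p / (1 - p))          # natural parameter (the logit of p)
for x in (0, 1):
    print(x, bernoulli_direct(x, p), bernoulli_natural(x, eta))
```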

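3.1.3.2 Normal is in exponential family
We'll just look at the one-dimensional case, but it's true for the multivariate Gaussian as
well.

$$\begin{aligned}
\Pr(x \mid \mu, \sigma^2) &= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right) \\
&= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} x^2 + \frac{\mu}{\sigma^2} x - \frac{1}{2 \sigma^2} \mu^2\right)
\end{aligned}$$

To make this match up, let

$$\begin{aligned}
\eta &= \begin{pmatrix} \mu / \sigma^2 \\ -1 / (2 \sigma^2) \end{pmatrix} \\
u(x) &= \begin{pmatrix} x \\ x^2 \end{pmatrix} \\
h(x) &= (2 \pi)^{-1/2} \\
g(\eta) &= \sqrt{-2 \eta_2} \, \exp\left(\frac{\eta_1^2}{4 \eta_2}\right).
\end{aligned}$$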
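The same kind of check works for the one-dimensional Normal: the sketch below evaluates the density in its usual form and in the exponential-family form $h(x) g(\eta) \exp(\eta^T u(x))$ with $u(x) = (x, x^2)$. The mean and variance are arbitrary example values, and the function names are illustrative.

```python
import math

def normal_direct(x, mu, var):
    """Usual Gaussian density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def normal_natural(x, eta1, eta2):
    """Exponential-family form with u(x) = (x, x^2) and natural parameters (eta1, eta2)."""
    h = (2 * math.pi) ** -0.5
    g = math.sqrt(-2 * eta2) * math.exp(eta1 ** 2 / (4 * eta2))
    return h * g * math.exp(eta1 * x + eta2 * x * x)

mu, var = 1.2, 0.8                       # arbitrary example values
eta1, eta2 = mu / var, -1 / (2 * var)    # natural parameters
for x in (-1.0, 0.0, 1.2, 3.0):
    print(x, normal_direct(x, mu, var), normal_natural(x, eta1, eta2))
```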

3.1.3.3 LDA for exponential family
Cool result! If we have two classes, and $\Pr(X \mid Y = c)$ is a distribution in the
exponential family, with the restrictions that $u(x) = x$, and that the scale parameters are
shared among the classes (e.g., the covariance in the Gaussian case) so that the form is
$$\Pr(x \mid \eta_c, s) = \frac{1}{s} h\left(\frac{1}{s} x\right) g(\eta_c) \exp\left(\frac{1}{s} \eta_c^T x\right),$$
then

• The separator is linear and

• The predictive probability $\Pr(Y = 1 \mid x)$ is a sigmoid on an activation function which
  is the difference between the logs of the class membership probabilities of the two
  classes. That is,
  $$\Pr(Y = 1 \mid x) = \text{sigmoid}(a(x)),$$
  where
  $$a(x) = (\eta_1 - \eta_0)^T x + \log g(\eta_1) - \log g(\eta_0) + \log \pi_1 - \log \pi_0$$
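One way to see this result in action is with Poisson class-conditionals, a choice made here for illustration and not taken from the notes. The Poisson fits the restricted form with $u(x) = x$, $\eta = \log \lambda$, $h(x) = 1/x!$, $g(\eta) = \exp(-e^{\eta})$, and scale $s = 1$. The sketch compares the sigmoid of $a(x)$ with the posterior from Bayes' rule; the rates and priors are made up.

```python
import math

def poisson_pmf(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

def posterior_direct(x, pi, lam):
    """Pr(Y = 1 | x) from Bayes' rule with Poisson class-conditionals."""
    joint = [p * poisson_pmf(x, l) for p, l in zip(pi, lam)]
    return joint[1] / (joint[0] + joint[1])

def posterior_sigmoid(x, pi, lam):
    """Pr(Y = 1 | x) = sigmoid(a(x)) with a(x) linear in x."""
    eta = [math.log(l) for l in lam]      # natural parameters
    log_g = [-l for l in lam]             # log g(eta) = -exp(eta) = -lambda
    a = ((eta[1] - eta[0]) * x + log_g[1] - log_g[0]
         + math.log(pi[1]) - math.log(pi[0]))
    return 1.0 / (1.0 + math.exp(-a))

pi, lam = [0.6, 0.4], [2.0, 5.0]          # hypothetical priors and rates
for x in (0, 2, 4, 8):
    print(x, posterior_direct(x, pi, lam), posterior_sigmoid(x, pi, lam))
```

The $h(x) = 1/x!$ factor is shared by both classes, which is why it cancels and leaves an activation that is linear in $x$.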

3.2 Estimating Pr(Y | X)


This is called estimating a discriminative model.

3.2.1 Least squares


A favorite trick is to reduce our current problem to a previous problem we already know
how to solve. In this case, we could try to treat classification as a regression problem, by
taking the $Y$ values to be in $\{+1, -1\}$ and applying one of our standard regression methods.
We could then predict class $+1$, given $x$, if

$$W^T x + W_0 > 0.$$

That will generate a separator, but it turns out to be a bad idea. There is generally no
hypothesis that does a good job of representing the data, and it is easy to show that there are
situations in which a linear separator for the data exists, but the hypothesis that regression
comes up with does not separate the data.
One reason that this doesn't work out is that it isn't founded on a sensible probabilistic
model: the squared error criterion fundamentally assumes that the $Y$ values are normally
distributed, but they're not.
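A small one-dimensional sketch of the failure case described above: the data are linearly separable (any threshold between 1 and 2 works), but two far-away positive examples pull the least-squares line so that it misclassifies the positive point nearest the boundary. The data set is invented for illustration, and `least_squares_line` is a hypothetical helper implementing ordinary simple linear regression.

```python
def least_squares_line(xs, ys):
    """Ordinary least squares fit of y = w * x + w0 in one dimension."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w = sxy / sxx
    return w, y_bar - w * x_bar

# Separable data: negatives at 0 and 1, positives at 2, 50, and 51.
xs = [0.0, 1.0, 2.0, 50.0, 51.0]
ys = [-1, -1, 1, 1, 1]

w, w0 = least_squares_line(xs, ys)
predictions = [1 if w * x + w0 > 0 else -1 for x in xs]
errors = [x for x, y, p in zip(xs, ys, predictions) if p != y]
print("weights:", w, w0)
print("misclassified inputs:", errors)

# A simple threshold at 1.5 separates the same data perfectly.
threshold_ok = all((1 if x > 1.5 else -1) == y for x, y in zip(xs, ys))
print("threshold at 1.5 separates the data:", threshold_ok)
```

Squared error penalizes the distant points for being "too correct", so the fit trades away the point at 2 to reduce their residuals.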

3.2.2 Logistic regression


So, what assumption can we reasonably make about the distribution of outputs, in the
classification case? Rather than directly trying to predict the value, we could try to find a
regression model that predicts $\Pr(Y = 1 \mid X = x)$ as a function of $x$. One thought would be
to use the form
$$\Pr(Y = 1 \mid X = x) = W^T x,$$
but the problem is that we need the probabilities to be in the interval $[0, 1]$, and that linear
form is unconstrained.
We can take inspiration from what we saw in the LDA case: that in at least two cases
we considered, the predictive distribution could be described as a sigmoid applied to a dot
product. So, let's consider models of the form

$$\Pr(Y = 1 \mid X = x) = \text{sigmoid}(W^T x).$$

Such models are called logistic regression models. (Margin note: Which is confusing, since
we're using them for classification... but the idea is that we're doing a regression to
find a probability value.)
So, how does this differ from LDA? We are not making any distributional assumptions
about $X$. So, we're going to train a predictive model from the same class, but using a
different criterion.
Given data $\mathcal{D}$, what are the maximum-likelihood estimates of the parameters $W$?
Assuming $y^{(i)} \in \{0, 1\}$, the likelihood function is:

$$\Pr(\mathcal{D} \mid W) = \prod_{i=1}^{n} s\left(W^T x^{(i)}\right)^{y^{(i)}} \left(1 - s\left(W^T x^{(i)}\right)\right)^{(1 - y^{(i)})},$$

where $s$ is short for sigmoid. The negative log likelihood (which we want to minimize) is:

$$NLL(W) = -\sum_{i=1}^{n} \left( y^{(i)} \log s\left(W^T x^{(i)}\right) + (1 - y^{(i)}) \log\left(1 - s\left(W^T x^{(i)}\right)\right) \right).$$

The gradient with respect to the weights is

$$\nabla_W NLL(W) = \sum_{i=1}^{n} \left( s\left(W^T x^{(i)}\right) - y^{(i)} \right) x^{(i)}.$$

(Margin note: Here's the cool thing about the derivative of the sigmoid:
$\frac{d}{dx} s(x) = s(x)(1 - s(x))$.)
This is the same as the gradient of the error function for the linear regression model!
(Margin note: We'll see why in the next section.)
If the data is linearly separable, the optimization will want to make the weights as large
as possible (to make the sigmoid as steep as possible, which will push the output values
closer and closer to 1 and 0). Furthermore, again, if the data are separable, there is an
infinity of separators that are equally good; so the optimization problem is not well posed.
This can be addressed by adding a regularization penalty or using a prior. (Margin note: An
old trick from neural network days was to use targets of +0.6 and -0.6, which are actually
attainable with finite weights.)
Unfortunately, there is no closed form solution for the maximum likelihood values of
$W$, so we need to use an iterative optimization method. Two common choices are:

• Gradient descent, especially stochastic gradient descent, which goes through the data
  points one by one and does an update to the weights with a very small step size
  based on each individual point. This can be guaranteed to converge, though it might
  be slow, but it can be less likely than batch gradient descent to get stuck in local
  optima on non-convex functions.

• Iteratively reweighted least squares, which is essentially Newton's method. It can be
  computationally challenging, because it uses the Hessian (matrix of second derivatives),
  but is more reliable than regular gradient descent.

Fortunately, the error function is convex, so these methods, appropriately tuned, will find
the global optimum.
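A minimal sketch of fitting logistic regression by batch gradient descent on the negative log likelihood, using the gradient $\sum_i (s(W^T x^{(i)}) - y^{(i)}) x^{(i)}$. The step size, iteration count, and data are assumptions chosen for illustration; the data are deliberately not separable, so the weights stay finite. The first feature is a constant 1 that plays the role of a bias.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def nll(w, xs, ys):
    """Negative log likelihood for labels in {0, 1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        s = sigmoid(dot(w, x))
        total -= y * math.log(s) + (1 - y) * math.log(1 - s)
    return total

def nll_gradient(w, xs, ys):
    """Gradient of the NLL: sum over points of (sigmoid(w . x) - y) * x."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sigmoid(dot(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad

def gradient_descent(xs, ys, step=0.1, iters=2000):
    w = [0.0] * len(xs[0])
    for _ in range(iters):
        g = nll_gradient(w, xs, ys)
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return w

# Made-up, non-separable 1D data; each input is [1, x] so w[0] acts as a bias.
xs = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [0, 0, 1, 0, 1, 1]
w_hat = gradient_descent(xs, ys)
print("weights:", w_hat)
print("NLL at zero weights:", nll([0.0, 0.0], xs, ys), "NLL at fit:", nll(w_hat, xs, ys))
```

Because the objective is convex, a fixed small step is enough here; on separable data the same loop would keep increasing the weight magnitudes without converging.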

3.2.3 Generalized linear models


Both linear regression and logistic regression are instances of a more general framework of
generalized linear models:

• The distribution of $Y \mid X$ is an exponential family distribution

• The goal is to predict $E[Y \mid X; \theta]$

• The natural parameter and the inputs are linearly related: $\eta = \theta^T x$.

What is cool about this is that any regression-type problem that satisfies the
requirements above can be handled in a common way. So, there are general purpose methods for

• Maximum likelihood parameter estimation (IRLS)

• Bayesian estimation

• Various statistical tests

This also means that these models are not too sensitive to the distributional
assumptions made (beyond the general GLM assumptions), since the parameter-fitting is
independent of which exponential-family distribution you have.

3.3 Generative versus discriminative


Given a choice, should we choose a generative or discriminative model? (Margin note: This
discussion cribbed from Machine Learning: A Probabilistic Perspective by Kevin Murphy.)

• Generative classifiers are usually computationally easier to fit.

• Generative classifiers, in the multi-class case, generalize more easily to adding a new
  class (since the per-class models are independent).

• Generative models can be estimated relatively easily in the presence of missing or
  unlabeled data.

• Generative models can be run backwards (used to predict X from Y).

• Discriminative models apply very well when we expand the input space using a basis
  set of feature functions; generative models can get into trouble due to correlation
  among the inputs.

• If the distributional assumptions are true, generative models can be well estimated with
  fewer training examples than discriminative models.

• If the distributional assumptions are not true, generative models can result in really bad
  predictions.

4 Distributions over models


                 No model   Prediction rule   Prob model   Dist over models
Classification                                                    ✓

Rather than finding a single best weight vector $W^*$ and using that to make predictions
$\Pr(y \mid x; W^*)$, we might again wish to be Bayesian, by putting a prior on $W$ and then using
the posterior on $W$, that is $\Pr(W \mid \mathcal{D})$, to make predictions that take uncertainty in
the weights into account.
Unfortunately, there is no conjugate prior for logistic regression. Here, we will show
one strategy (that can be used in all sorts of cases with non-conjugate priors, sometimes to
good effect, sometimes less so) for approximating the posterior. The idea is to start with a
Gaussian prior with mean $m_0$ and covariance $S_0$, and compute the log posterior, which is,
up to an additive constant,

$$\log \Pr(W \mid \mathcal{D}) = -\frac{1}{2} (W - m_0)^T S_0^{-1} (W - m_0) + \sum_{i=1}^{n} \left( y^{(i)} \log o^{(i)} + (1 - y^{(i)}) \log(1 - o^{(i)}) \right) + \text{const},$$

where $o^{(i)} = \text{sigmoid}(W^T x^{(i)})$.


This clearly does not have the form of a Gaussian. So we want to find a Gaussian
approximation to the posterior, which means finding parameters $m_n$ and $S_n$ that make a
good approximation. Using a second-order Taylor-series expansion about the mode, we
find that the MAP estimate of the weights, which under a simple diagonal Gaussian prior
is
$$W_{MAP} = \arg\min_W \; NLL(W) + \lambda W^T W,$$
is the appropriate choice of $m_n$. (Margin note: This is known as the Laplace approximation;
see Bishop 4.4 for more details.)
The gradient of the negative log likelihood at the mode (here NLL includes the prior term,
so it is the negative log posterior) is:

$$\left. \frac{\partial NLL(W)}{\partial W} \right|_{W_{MAP}} = S_0^{-1} (W - m_0) - \sum_{i=1}^{n} (y^{(i)} - o^{(i)}) x^{(i)},$$

where output $o^{(i)} = \text{sigmoid}({x^{(i)}}^T W_{MAP})$.
The covariance matrix is then the inverse of the Hessian, which is a matrix of second
derivatives of NLL. So,

$$S_n^{-1} = H = \left. \frac{\partial^2 NLL(W)}{\partial W \, \partial W^T} \right|_{W_{MAP}} = S_0^{-1} + \sum_{i=1}^{n} o^{(i)} (1 - o^{(i)}) x^{(i)} {x^{(i)}}^T.$$
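A one-dimensional sketch of the Laplace approximation for logistic regression with a single scalar weight and a Gaussian prior with mean `m0` and variance `s0`. It finds the mode with Newton's method, using the gradient and Hessian expressions above, and then takes the inverse Hessian as the approximate posterior variance. The function names, data, and prior settings are assumptions for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def neg_log_posterior_grad(w, xs, ys, m0, s0):
    """Gradient: (w - m0) / s0 - sum_i (y_i - o_i) x_i, with o_i = sigmoid(w x_i)."""
    return (w - m0) / s0 - sum((y - sigmoid(w * x)) * x for x, y in zip(xs, ys))

def neg_log_posterior_hess(w, xs, s0):
    """Hessian: 1 / s0 + sum_i o_i (1 - o_i) x_i^2."""
    return 1.0 / s0 + sum(sigmoid(w * x) * (1 - sigmoid(w * x)) * x * x for x in xs)

def laplace_approximation(xs, ys, m0=0.0, s0=1.0, iters=50):
    """Return (w_map, s_n): mode and variance of the Gaussian approximation."""
    w = m0
    for _ in range(iters):                       # Newton's method on a convex objective
        w -= neg_log_posterior_grad(w, xs, ys, m0, s0) / neg_log_posterior_hess(w, xs, s0)
    return w, 1.0 / neg_log_posterior_hess(w, xs, s0)

# Made-up scalar inputs and binary labels.
xs = [-2.0, -1.0, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1]
w_map, s_n = laplace_approximation(xs, ys)
print("approximate posterior: mean", w_map, "variance", s_n)
```

The data term always adds non-negative curvature to the prior's, so the approximate posterior variance is never larger than the prior variance.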

Going from here to a predictive distribution requires further approximation. We won't
go into it in detail. However, there is one more useful concept to get out of this, which is
the Bayesian information criterion or BIC. We can approximate

$$\log \Pr(\mathcal{D}) \approx \log \Pr(\mathcal{D} \mid W_{MAP}) + \log \Pr(W_{MAP}) + \frac{D}{2} \log(2 \pi) - \frac{1}{2} \log |H|.$$

The last three terms, called the Occam factor, measure the complexity of the model. If the
prior on weights is uniform, then we can simplify to

$$\log \Pr(\mathcal{D}) \approx \log \Pr(\mathcal{D} \mid W_{ML}) - \frac{1}{2} \log |H|.$$

The determinant of a covariance matrix can be seen informally as characterizing the size
of the covariance ellipse: if it's large then we are more uncertain about the weight values,
which means the marginal likelihood will be lower.
If we think of $H$ as being a sum of $H_i$ over the individual data points, and assume that
they are approximable by a single $\hat{H}$, then

$$\begin{aligned}
\log |H| &= \log |n \hat{H}| \\
&= \log\left(n^D |\hat{H}|\right) \\
&= D \log n + \log |\hat{H}|
\end{aligned}$$

We drop $\log |\hat{H}|$ because it is independent of $n$ and $D$, and wind up with a fairly gross
approximation:
$$\log \Pr(\mathcal{D}) \approx \log \Pr(\mathcal{D} \mid W_{ML}) - \frac{D}{2} \log n.$$
This is the BIC score, which can be used as a quick-and-dirty way of selecting model
complexity.
BIC is asymptotically consistent as a selection criterion: that is, given a set of models,
as $n \to \infty$, BIC will select the correct model. It has a tendency to select models that are too
simple, with small $n$, though.
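The BIC score in the form above is a one-liner. In the sketch below, the log-likelihood values, parameter counts, and data set size are invented numbers used only to show how the penalty trades off against fit; `bic_score` is a hypothetical helper name.

```python
import math

def bic_score(log_lik_ml, num_params, n):
    """BIC in the form used above: log Pr(D | W_ML) - (D / 2) log n. Higher is better."""
    return log_lik_ml - 0.5 * num_params * math.log(n)

n = 200                                      # hypothetical number of training examples
simple = bic_score(-120.0, 3, n)             # hypothetical 3-parameter model
complex_ = bic_score(-118.5, 10, n)          # hypothetical 10-parameter model, slightly better fit
print("simple:", simple, "complex:", complex_)
print("selected:", "simple" if simple > complex_ else "complex")
```

With these numbers the richer model's small gain in likelihood does not cover its larger complexity penalty, so the simpler model gets the higher score.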
