6.867 Section 3: Classification
Reading:
SVMs: Bishop (briefly) first part of 7.1; Murphy (grudgingly) 14.5; Hastie Tibshirani
Friedman 12.13
Contents
1 Intro
2 Representation
3 Probabilistic models
    3.1 Estimating Pr(X, Y)
        3.1.1 Linear discriminant analysis
        3.1.2 Factoring the class conditional probability
        3.1.3 Exponential family
            3.1.3.1 Bernoulli is in exponential family
            3.1.3.2 Normal is in exponential family
            3.1.3.3 LDA for exponential family
    3.2 Estimating Pr(Y | X)
        3.2.1 Least squares
        3.2.2 Logistic regression
        3.2.3 Generalized linear models
    3.3 Generative versus discriminative
MIT 6.867 Fall 2016 2
1 Intro
Now we will look at the supervised learning problem of classification, in which the data
given is a set of pairs
\[ \mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\} , \]
with $x^{(i)} \in \mathbb{R}^D$ and $y^{(i)} \in \{c_1, \ldots, c_k\}$; that is, the output is a discrete value indicating the
class of the example. By default, we'll think about the two-class classification problem, with
the classes or labels often drawn from the set $\{0, 1\}$ or $\{+1, -1\}$. This problem isn't
conceptually different from regression, but it offers a different kind of structure to be exploited
during learning.
We'll look at learning prediction rules, probabilistic models, and distributions over
models, and focus on the case in which the model represents a thresholded linear
relationship between inputs and outputs. In later parts of the course, we will return to classification
and look at non-linear and non-parametric approaches.
Because we have recently been looking at probabilistic approaches to regression, we
will begin by looking at probabilistic and Bayesian approaches to classification. When we
are done with that, we will consider some interesting strategies for finding linear separators
directly, without probabilistic modeling.
2 Representation
What does it mean to have a linear model for classification? Generally, it will be that we
can express the output value $y^{(i)}$ by specifying a $(D-1)$-dimensional hyperplane in the
$D$-dimensional feature space. Then, points that are on one side of the hyperplane are
considered to be in one class, and points on the other side, in the other class.
That is, there are some weight values $w_0$ and $w = (w_1, \ldots, w_D)$ such that
\[ y = \begin{cases} +1 & \text{if } w_0 + w_1 x_1 + \ldots + w_D x_D > 0 \\ -1 & \text{otherwise} \end{cases} . \]
Such a model is known as a linear separator. So, in a 2D feature space, our separator would
be a 1D hyperplane (otherwise known as a line). We would use three parameters (which is
really one more than necessary) to describe the line.
As in regression, we can transform the input space, via a set of non-linear basis func-
tions, and find a linear separator in the transformed space. Such a separator will be non-
linear when projected back down into the original space.
In the following, to simplify notation, we will omit the possibility of using basis
functions to transform the input values, but the extension is completely straightforward.
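As a small illustration of the two ideas in this section, here is a hedged sketch in plain Python. The function names `predict` and `quadratic_basis`, the particular basis, and the weights are all made up for this example; they are not part of the notes. The separator is linear in the transformed features but corresponds to a circle in the original 2D space.

```python
def predict(w0, w, x):
    """Linear separator: +1 if w0 + w . x > 0, otherwise -1."""
    activation = w0 + sum(wj * xj for wj, xj in zip(w, x))
    return 1 if activation > 0 else -1


def quadratic_basis(x):
    """Hypothetical basis functions: map (x1, x2) to (x1, x2, x1^2, x2^2)."""
    x1, x2 = x
    return (x1, x2, x1 * x1, x2 * x2)


# A plain line in 2D: x1 + x2 - 1 > 0 (three parameters w0, w1, w2).
line_2d = predict(-1.0, (1.0, 1.0), (2.0, 0.5))

# Linear in the transformed space: 1 - x1^2 - x2^2 > 0.
# In the original space this is the inside of the unit circle.
w0, w = 1.0, (0.0, 0.0, -1.0, -1.0)
inside = predict(w0, w, quadratic_basis((0.5, 0.5)))    # 1 - 0.5 > 0
outside = predict(w0, w, quadratic_basis((1.0, 1.0)))   # 1 - 2 <= 0
```

The same `predict` function is used in both cases; only the representation of the input changes.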
3 Probabilistic models
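[Table: modeling approaches (no model, prediction rule, probabilistic model, distribution over models); row: Classification]
\[ Y \sim \mathrm{Bernoulli}(\pi) \]
\[ X \mid Y = c \sim \mathrm{Gaussian}(\mu_c, \Sigma_c) \]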
This is pretty clearly quadratic in $x$, so it's what we would call a quadratic separator or
quadratic discriminant.
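If we assume that the covariances of the two classes are equal, $\Sigma_1 = \Sigma_0 = \Sigma$, then things
simplify and we are doing linear discriminant analysis. Writing $\pi_c = \Pr(Y = c)$, we now have
(the determinant of $\Sigma$ cancels):
\[ \Pr(Y = c \mid X = x) \propto \exp\left( \mu_c^T \Sigma^{-1} x - \frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \log \pi_c \right) \exp\left( -\frac{1}{2} x^T \Sigma^{-1} x \right) . \]
Define
\[ \beta_c = \Sigma^{-1} \mu_c \]
\[ \gamma_c = -\frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \log \pi_c \]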
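Then
\[ \Pr(Y = c \mid X = x) = \frac{\exp\left(\beta_c^T x + \gamma_c\right) \exp\left(-\frac{1}{2} x^T \Sigma^{-1} x\right)}{\sum_{c'} \exp\left(\beta_{c'}^T x + \gamma_{c'}\right) \exp\left(-\frac{1}{2} x^T \Sigma^{-1} x\right)} = \frac{\exp\left(\beta_c^T x + \gamma_c\right)}{\sum_{c'} \exp\left(\beta_{c'}^T x + \gamma_{c'}\right)} \]
and, with two classes,
\[ \Pr(Y = 1 \mid X = x) = \mathrm{sigmoid}(\beta_1^T x + \gamma_1 - \beta_0^T x - \gamma_0) \]
where the sigmoid function and its inverse, the logit function, are defined as follows:
\[ \mathrm{sigmoid}(a) = \frac{1}{1 + \exp(-a)} = \frac{\exp a}{1 + \exp a} \]
\[ \mathrm{logit}(p) = \log \frac{p}{1 - p} . \]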
The sigmoid is a soft step function, which takes a real number and maps it to the interval
$(0, 1)$. If we have an expression of the form $\mathrm{sigmoid}(W^T X)$, then the larger the magnitude
of $W$, the steeper the slope of the sigmoid.
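So, with two classes, we predict class 1 when
\[ \begin{aligned}
\Pr(Y = 1 \mid X = x; \theta) &> \Pr(Y = 0 \mid X = x; \theta) \\
\exp\left(\beta_1^T x + \gamma_1\right) &> \exp\left(\beta_0^T x + \gamma_0\right) \\
\beta_1^T x + \gamma_1 &> \beta_0^T x + \gamma_0 \\
x^T (\beta_1 - \beta_0) + \gamma_1 - \gamma_0 &> 0
\end{aligned} \]
This is definitely a linear separator.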
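To make the linear discriminant concrete, here is a minimal sketch in plain Python for a 2D feature space with a shared covariance. The means, covariance, and priors are made-up numbers for illustration, and the helper names (`inv2`, `lda_params`, `posterior_class1`) are hypothetical. It computes the per-class linear coefficient and offset from the definitions above and evaluates the class-1 posterior as a sigmoid of their difference.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def inv2(S):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lda_params(mu_c, Sigma_inv, pi_c):
    """beta_c = Sigma^{-1} mu_c and gamma_c = -1/2 mu_c^T Sigma^{-1} mu_c + log pi_c."""
    beta = matvec(Sigma_inv, mu_c)
    gamma = -0.5 * dot(mu_c, beta) + math.log(pi_c)
    return beta, gamma


# Made-up parameters, for illustration only.
Sigma = [[1.0, 0.0], [0.0, 1.0]]   # shared covariance
mu0, mu1 = [0.0, 0.0], [2.0, 0.0]  # class means
pi0, pi1 = 0.5, 0.5                # class priors

Sigma_inv = inv2(Sigma)
beta0, gamma0 = lda_params(mu0, Sigma_inv, pi0)
beta1, gamma1 = lda_params(mu1, Sigma_inv, pi1)


def posterior_class1(x):
    """Pr(Y = 1 | X = x) as a sigmoid of a linear function of x."""
    return sigmoid(dot(beta1, x) + gamma1 - dot(beta0, x) - gamma0)
```

With these numbers the activation reduces to 2 x_1 - 2, so the separator is the vertical line x_1 = 1, halfway between the two means.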
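So, factoring the class-conditional probability over binary features $X_j$, the model is
\[ Y \sim \mathrm{Bernoulli}(\pi) \]
\[ X_j \mid Y = c \sim \mathrm{Bernoulli}(\theta_{c,j}) \]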
So, now, if we have two classes, we have $2D$ parameters, each of which is easy to estimate:
\[ \Pr(X_j = 1 \mid Y = 1) = \theta_{1,j} \]
\[ \Pr(X_j = 1 \mid Y = 0) = \theta_{0,j} \]
\[ \hat{\theta}_{1j} = \frac{\#(X_j = 1, Y = 1) + 1}{\#(Y = 1) + 2} . \]
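We predict class 1 when
\[ \begin{aligned}
\sum_{j=1}^{D} \log \Pr(x_j \mid C = 1) + \log \Pr(C = 1) &> \sum_{j=1}^{D} \log \Pr(x_j \mid C = 0) + \log \Pr(C = 0) \\
\sum_{j=1}^{D} \log\left( \theta_{1j}^{x_j} (1 - \theta_{1j})^{(1 - x_j)} \right) + \log \pi_1 &> \sum_{j=1}^{D} \log\left( \theta_{0j}^{x_j} (1 - \theta_{0j})^{(1 - x_j)} \right) + \log \pi_0 \\
\sum_{j=1}^{D} \left( x_j \log \theta_{1j} + (1 - x_j) \log(1 - \theta_{1j}) \right) + \log \pi_1 &> \sum_{j=1}^{D} \left( x_j \log \theta_{0j} + (1 - x_j) \log(1 - \theta_{0j}) \right) + \log \pi_0 \\
\sum_{j=1}^{D} x_j \left( \log \frac{\theta_{1j}}{1 - \theta_{1j}} - \log \frac{\theta_{0j}}{1 - \theta_{0j}} \right) &> \log \pi_0 - \log \pi_1 - \sum_{j=1}^{D} \log \frac{1 - \theta_{1j}}{1 - \theta_{0j}}
\end{aligned} \]
This is a linear separator $W^T x + W_0 > 0$, with
\[ W_j = \log \frac{\theta_{1j}}{1 - \theta_{1j}} - \log \frac{\theta_{0j}}{1 - \theta_{0j}} , \]
and
\[ W_0 = \log \frac{\pi_1}{\pi_0} + \sum_{j=1}^{D} \log \frac{1 - \theta_{1j}}{1 - \theta_{0j}} . \]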
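The posterior can again be written as a sigmoid:
\[ \begin{aligned}
\Pr(C = 1 \mid X = x) &= \frac{\Pr(x \mid C = 1) \Pr(C = 1)}{\Pr(x \mid C = 1) \Pr(C = 1) + \Pr(x \mid C = 0) \Pr(C = 0)} \\
&= \frac{\exp(f_1(x))}{\exp(f_1(x)) + \exp(f_0(x))} \\
&= \frac{1}{1 + \exp(f_0(x) - f_1(x))} \\
&= \mathrm{sigmoid}(f_1(x) - f_0(x))
\end{aligned} \]
where
\[ f_c(x) = \sum_{j=1}^{D} \left( x_j \log \theta_{cj} + (1 - x_j) \log(1 - \theta_{cj}) \right) + \log \pi_c . \]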
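The factored Bernoulli model above can be sketched in a few lines of plain Python. The data set is tiny and made up, the function names are hypothetical, and the class prior is estimated by simple counting (an assumption; the notes only give the smoothed estimate for the feature parameters). The sketch checks numerically that the sigmoid of the linear activation agrees with the sigmoid of the difference of the per-class log scores.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def fit_bernoulli_nb(X, y):
    """Estimate class priors by counting and feature parameters with the +1 / +2 correction."""
    D = len(X[0])
    pi, theta = {}, {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        pi[c] = len(rows) / len(X)
        theta[c] = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2) for j in range(D)]
    return pi, theta


def linear_weights(pi, theta):
    """Weights W_j and offset W_0 of the equivalent linear separator."""
    D = len(theta[0])
    W = [math.log(theta[1][j] / (1 - theta[1][j]))
         - math.log(theta[0][j] / (1 - theta[0][j])) for j in range(D)]
    W0 = math.log(pi[1] / pi[0]) + sum(
        math.log((1 - theta[1][j]) / (1 - theta[0][j])) for j in range(D))
    return W, W0


def f(c, x, pi, theta):
    """Per-class score: sum_j log Pr(x_j | class c) + log prior of class c."""
    return math.log(pi[c]) + sum(
        xj * math.log(t) + (1 - xj) * math.log(1 - t) for xj, t in zip(x, theta[c]))


# Tiny made-up data set: 3 binary features, 6 examples.
X = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0]

pi, theta = fit_bernoulli_nb(X, y)
W, W0 = linear_weights(pi, theta)

x_new = [1, 0, 1]
activation = W0 + sum(wj * xj for wj, xj in zip(W, x_new))
posterior = sigmoid(f(1, x_new, pi, theta) - f(0, x_new, pi, theta))
```

Here `sigmoid(activation)` and `posterior` are two routes to the same number, which is the point of the derivation.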
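\[ \begin{aligned}
\Pr(x \mid p) &= p^x (1 - p)^{(1 - x)} \\
&= \exp\left( x \log(p) + (1 - x) \log(1 - p) \right) \\
&= \exp\left( x \log(p) - x \log(1 - p) + \log(1 - p) \right) \\
&= (1 - p) \exp\left( x \log \frac{p}{1 - p} \right)
\end{aligned} \]
This fits in the family with: $u(x) = x$, $\eta = \log \frac{p}{1 - p}$, $h(x) = 1$, and $g(\eta) = 1 - p$.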
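3.1.3.2 Normal is in exponential family  We'll just look at the one-dimensional case, but
it's true for the multivariate Gaussian as well.
\[ \begin{aligned}
\Pr(x \mid \mu, \sigma^2) &= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{1}{2 \sigma^2} (x - \mu)^2 \right) \\
&= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{1}{2 \sigma^2} x^2 + \frac{\mu}{\sigma^2} x - \frac{1}{2 \sigma^2} \mu^2 \right)
\end{aligned} \]
This fits in the family with
\[ \eta = \begin{pmatrix} \mu / \sigma^2 \\ -1 / (2 \sigma^2) \end{pmatrix} \]
\[ u(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix} \]
\[ h(x) = (2 \pi)^{-1/2} \]
\[ g(\eta) = \sqrt{-2 \eta_2} \, \exp\left( \frac{\eta_1^2}{4 \eta_2} \right) . \]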
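A quick numerical check of the two exponential-family representations, as a hedged sketch in plain Python. It assumes the family has the form h(x) g(eta) exp(eta^T u(x)), which is the form implied by the quantities u, eta, h and g listed above; the function names and test values are made up for the example.

```python
import math


def exp_family(x, eta, u, h, g):
    """Evaluate h(x) * g(eta) * exp(eta^T u(x)); eta and u(x) are lists."""
    return h(x) * g(eta) * math.exp(sum(e * ux for e, ux in zip(eta, u(x))))


def bernoulli(x, p):
    """Bernoulli: u(x) = x, eta = log(p / (1 - p)), h(x) = 1, g(eta) = 1 - p."""
    eta = [math.log(p / (1.0 - p))]
    return exp_family(
        x, eta,
        u=lambda x: [x],
        h=lambda x: 1.0,
        g=lambda eta: 1.0 / (1.0 + math.exp(eta[0])),  # equals 1 - p
    )


def gaussian(x, mu, sigma2):
    """1D Gaussian: u(x) = (x, x^2), eta = (mu / sigma^2, -1 / (2 sigma^2))."""
    eta = [mu / sigma2, -1.0 / (2.0 * sigma2)]
    return exp_family(
        x, eta,
        u=lambda x: [x, x * x],
        h=lambda x: (2.0 * math.pi) ** -0.5,
        g=lambda eta: math.sqrt(-2.0 * eta[1]) * math.exp(eta[0] ** 2 / (4.0 * eta[1])),
    )


bern_one = bernoulli(1, 0.3)    # should match p
bern_zero = bernoulli(0, 0.3)   # should match 1 - p
gauss_val = gaussian(0.2, mu=1.5, sigma2=0.7)
direct_gauss = math.exp(-(0.2 - 1.5) ** 2 / (2 * 0.7)) / math.sqrt(2 * math.pi * 0.7)
```

Comparing `gauss_val` with `direct_gauss` confirms that the natural-parameter form is just a rearrangement of the usual density.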
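3.1.3.3 LDA for exponential family  Cool result! If we have two classes, and $\Pr(X \mid Y = c)$
is a distribution in the exponential family, with the restrictions that $u(x) = x$, and that the
scale parameters are shared among the classes (e.g., the covariance in the Gaussian case)
so that the form is
\[ \Pr(x \mid \eta_c, s) = \frac{1}{s} \, h\!\left( \frac{1}{s} x \right) g(\eta_c) \exp\left( \frac{1}{s} \eta_c^T x \right) , \]
then
• The separator is linear, and
• The predictive probability $\Pr(Y = 1 \mid x)$ is a sigmoid on an activation function which
  is the difference between the logs of the class membership probabilities of the two
  classes. That is,
  \[ \Pr(Y = 1 \mid x) = \mathrm{sigmoid}(a(x)) , \]
  where, writing $\pi_c = \Pr(Y = c)$ and noting that the factor $\frac{1}{s} h(\frac{1}{s} x)$ is common to both classes and cancels,
  \[ a(x) = \frac{1}{s} (\eta_1 - \eta_0)^T x + \log g(\eta_1) - \log g(\eta_0) + \log \pi_1 - \log \pi_0 . \]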
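Turning to direct estimation of $\Pr(Y \mid X)$, one simple idea is to treat the class labels as real-valued targets, fit the weights by least-squares regression, and predict the positive class when
\[ W^T x + W_0 > 0 . \]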
That will generate a separator, but it turns out to be a bad idea. There is generally no
hypothesis that does a good job of representing the data, and it is easy to show that there are
situations in which a linear separator for the data exists, but the hypothesis that regression
comes up with does not separate the data.
One reason that this doesn't work out is that it isn't founded on a sensible probabilistic
model: the squared error criterion fundamentally assumes that the $Y$ values are normally
distributed, but they're not.
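The failure of least squares on separable data is easy to reproduce. The sketch below is a made-up 1D example in plain Python (the data and the function name `least_squares_line` are assumptions for illustration): a single positive point far from the boundary drags the regression line so that another positive point ends up on the wrong side, even though a simple threshold separates the classes perfectly.

```python
def least_squares_line(xs, ys):
    """Closed-form 1D least squares: (w0, w1) minimizing sum_i (w0 + w1 * x_i - y_i)^2."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w1 = sxy / sxx
    return y_mean - w1 * x_mean, w1


# Linearly separable labels in {-1, +1}, with one positive point far to the right.
xs = [0.0, 1.0, 2.0, 3.0, 100.0]
ys = [-1, -1, +1, +1, +1]

w0, w1 = least_squares_line(xs, ys)
predictions = [1 if w0 + w1 * x > 0 else -1 for x in xs]
misclassified = [x for x, y, p in zip(xs, ys, predictions) if p != y]

# A separator does exist: threshold at x = 1.5.
threshold_predictions = [1 if x - 1.5 > 0 else -1 for x in xs]
```

In this example the regression fit misclassifies the point at x = 2, because squared error penalizes the distant point at x = 100 for being "too correct".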
Logistic regression instead models the class probability directly:
\[ \Pr(Y = 1 \mid X = x) = \mathrm{sigmoid}(W^T x) . \]
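With labels $y^{(i)} \in \{0, 1\}$, the likelihood of the data is
\[ \Pr(\mathcal{D} \mid W) = \prod_{i=1}^{n} s\!\left(W^T x^{(i)}\right)^{y^{(i)}} \left( 1 - s\!\left(W^T x^{(i)}\right) \right)^{(1 - y^{(i)})} , \]
where $s$ is short for sigmoid. The negative log likelihood, which we want to minimize, is:
\[ \mathrm{NLL}(W) = -\sum_{i=1}^{n} \left[ y^{(i)} \log s\!\left(W^T x^{(i)}\right) + (1 - y^{(i)}) \log\left( 1 - s\!\left(W^T x^{(i)}\right) \right) \right] . \]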
The gradient with respect to the weights is
\[ \nabla_W \mathrm{NLL}(W) = \sum_{i=1}^{n} \left( s\!\left(W^T x^{(i)}\right) - y^{(i)} \right) x^{(i)} . \]
(Margin note: Here's the cool thing about the derivative of the sigmoid: $\frac{d}{dx} s(x) = s(x)(1 - s(x))$.)
This has the same form as the gradient of the error function for the linear regression model!
(Margin note: We'll see why in the next section.)
If the data are linearly separable, the optimization will want to make the weights as large
as possible (to make the sigmoid as steep as possible, which will push the output values
closer and closer to 1 and 0). Furthermore, again, if the data are separable, there is an
infinity of separators that are equally good; so the optimization problem is not well posed.
This can be addressed by adding a regularization penalty or using a prior.
(Margin note: An old trick from neural network days was to use targets of +0.6 and -0.6,
which are actually attainable with finite weights.)
Unfortunately, there is no closed form solution for the maximum likelihood values of
$W$, so we need to use an iterative optimization method. Two common choices are:
• Gradient descent, especially stochastic gradient descent, which goes through the data
  points one by one and does an update to the weights with a very small step size
  based on each individual point. This can be guaranteed to converge, though it might
  be slow, but it can be less likely than batch gradient descent to get stuck in local
  optima on non-convex functions.
• Iteratively reweighted least squares, which is essentially Newton's method. It can be
  computationally challenging, because it uses the Hessian (matrix of second derivatives),
  but is more reliable than regular gradient descent.
Fortunately, the error function is convex, so these methods, appropriately tuned, will find
the global optimum.
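As a concrete companion to the negative log likelihood and its gradient, here is a minimal sketch in plain Python. It uses batch gradient descent (not the stochastic variant and not Newton's method), on a small made-up data set that is deliberately not separable so that the maximum likelihood weights are finite. The step size, iteration count, and function names are assumptions chosen for this toy problem.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def nll(W, X, y):
    """Negative log likelihood for labels in {0, 1}."""
    total = 0.0
    for x, t in zip(X, y):
        s = sigmoid(sum(w * xj for w, xj in zip(W, x)))
        total -= t * math.log(s) + (1 - t) * math.log(1 - s)
    return total


def nll_gradient(W, X, y):
    """Gradient: sum_i (s(W^T x_i) - y_i) x_i."""
    grad = [0.0] * len(W)
    for x, t in zip(X, y):
        err = sigmoid(sum(w * xj for w, xj in zip(W, x))) - t
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad


def gradient_descent(X, y, step=0.1, iters=2000):
    """Batch gradient descent from W = 0 with a fixed step size."""
    W = [0.0] * len(X[0])
    for _ in range(iters):
        g = nll_gradient(W, X, y)
        W = [w - step * gj for w, gj in zip(W, g)]
    return W


# Each input is (1, feature); the leading 1 plays the role of the offset W_0.
# The labels overlap (x = 0 is class 1, x = 1 is class 0), so the data are not separable.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]

W = gradient_descent(X, y)
final_gradient = nll_gradient(W, X, y)
```

Because the objective is convex, the near-zero `final_gradient` indicates the global optimum; on separable data the same loop would keep increasing the weights unless a penalty term were added.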
What is cool about this is that any regression-type problem that satisfies the
requirements above can be handled in a common way. So, there are general-purpose methods for
Bayesian estimation
This also means that these models are not too sensitive to the distributional assump-
tions made (beyond the general GLM assumptions), since the parameter-fitting is indepen-
dent of which exponential-family distribution you have.
• Discriminative models apply very well when we expand the input space using a basis
  set of feature functions; generative models can get into trouble due to correlation
  among the inputs.
• If the distributional assumptions are true, generative models can be well estimated with
  fewer training examples than discriminative models.
• If the distributional assumptions are not true, generative models can result in really bad
  predictions.
Rather than finding a single best weight vector $W$ and using that to make predictions
$\Pr(y \mid x; W)$, we might again wish to be Bayesian, by putting a prior on $W$ and then using
the posterior on $W$, that is $\Pr(W \mid \mathcal{D})$, to make predictions that take uncertainty in the
weights into account.
Unfortunately, there is no conjugate prior for logistic regression. Here, we will show
one strategy (that can be used in all sorts of cases with non-conjugate priors, sometimes to
good effect, sometimes less so) for approximating the posterior. The idea is to start with a
Gaussian prior with mean $m_0$ and covariance $S_0$, and compute the log posterior, which is, up to an additive constant,
\[ \log \Pr(W \mid \mathcal{D}) = -\frac{1}{2} (W - m_0)^T S_0^{-1} (W - m_0) + \sum_{i=1}^{n} \left( y^{(i)} \log o^{(i)} + (1 - y^{(i)}) \log(1 - o^{(i)}) \right) , \]
where the output is $o^{(i)} = \mathrm{sigmoid}(x^{(i)T} W)$. Maximizing this gives $W_{MAP}$, which we take as
the mean of a Gaussian approximation to the posterior; in what follows the outputs $o^{(i)}$ are evaluated at $W = W_{MAP}$.
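The covariance matrix of the approximation is then the inverse of the Hessian, which is the matrix of second
derivatives of the negative log posterior, evaluated at $W_{MAP}$. So,
\[ S_n^{-1} = H = -\left. \frac{\partial^2 \log \Pr(W \mid \mathcal{D})}{\partial W \, \partial W^T} \right|_{W_{MAP}} = S_0^{-1} + \sum_{i=1}^{n} o^{(i)} (1 - o^{(i)}) \, x^{(i)} x^{(i)T} . \]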
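The Gaussian approximation can be sketched for a two-weight model (offset plus one feature) in plain Python. The assumptions are labeled in the code: a zero-mean prior with covariance equal to the identity divided by a made-up precision `alpha`, plain gradient descent to find the posterior mode, and hypothetical helper names. The data are separable, which is fine here because the prior keeps the weights finite.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def find_w_map(X, y, prior_precision, step=0.05, iters=5000):
    """Gradient descent on the negative log posterior; prior is N(0, I / prior_precision)."""
    W = [0.0, 0.0]
    for _ in range(iters):
        grad = [prior_precision * w for w in W]            # contribution of the Gaussian prior
        for x, t in zip(X, y):
            err = sigmoid(W[0] * x[0] + W[1] * x[1]) - t   # output minus label
            grad[0] += err * x[0]
            grad[1] += err * x[1]
        W = [w - step * g for w, g in zip(W, grad)]
    return W


def laplace_hessian(W, X, prior_precision):
    """H = S_0^{-1} + sum_i o_i (1 - o_i) x_i x_i^T, with S_0^{-1} = prior_precision * I."""
    H = [[prior_precision, 0.0], [0.0, prior_precision]]
    for x in X:
        o = sigmoid(W[0] * x[0] + W[1] * x[1])
        r = o * (1.0 - o)
        for a in range(2):
            for b in range(2):
                H[a][b] += r * x[a] * x[b]
    return H


def inv2(M):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Made-up data: each input is (1, feature); labels in {0, 1}.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
alpha = 1.0  # assumed prior precision

W_map = find_w_map(X, y, alpha)
H = laplace_hessian(W_map, X, alpha)
S_n = inv2(H)  # covariance of the Gaussian approximation
```

Each data point adds a positive semidefinite term to `H`, so the approximate posterior covariance `S_n` is never wider than the prior in any direction.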
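The same Gaussian approximation yields an estimate of the log marginal likelihood:
\[ \log \Pr(\mathcal{D}) \approx \log \Pr(\mathcal{D} \mid W_{MAP}) + \log \Pr(W_{MAP}) + \frac{D}{2} \log(2 \pi) - \frac{1}{2} \log |H| . \]
The terms after the first, together called the Occam factor, measure the complexity of the model. If the
prior on weights is uniform, then we can simplify to
\[ \log \Pr(\mathcal{D}) \approx \log \Pr(\mathcal{D} \mid W_{ML}) - \frac{1}{2} \log |H| . \]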
The determinant of a covariance matrix can be seen informally as characterizing the size
of the covariance ellipse. Here $H$ is the inverse of the posterior covariance, so a large $|H|$
means a small ellipse: the weights are pinned down very precisely, and the
$-\frac{1}{2} \log |H|$ term makes the approximate marginal likelihood lower.
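If we think of $H$ as being a sum of $H_i$ over the individual data points, and assume that
they are approximable by a single $\hat{H}$, then
\[ \log |H| \approx \log |n \hat{H}| = \log\left( n^D |\hat{H}| \right) = D \log n + \log |\hat{H}| . \]
We drop $\log |\hat{H}|$ because it is independent of $n$ and $D$, and wind up with a fairly gross
approximation:
\[ \log \Pr(\mathcal{D}) \approx \log \Pr(\mathcal{D} \mid W_{ML}) - \frac{D}{2} \log n . \]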
This is the BIC score, which can be used as a quick-and-dirty way of selecting model
complexity.
BIC is asymptotically consistent as a selection criterion: that is, given a set of models,
as $n \to \infty$, BIC will select the correct model. It has a tendency to select models that are too
simple, with small $n$, though.
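To see the BIC trade-off in action, here is a hedged toy sketch in plain Python. It is not logistic regression: it compares two very simple models of coin flips (a fair coin with no free parameters versus a Bernoulli with one fitted parameter), with made-up counts, just to show how the penalty of half the number of parameters times log n interacts with the fit. The function names are hypothetical.

```python
import math


def bic_score(log_likelihood, num_params, n):
    """BIC in the form used above: log Pr(D | W_ML) - (D / 2) log n (larger is better)."""
    return log_likelihood - 0.5 * num_params * math.log(n)


def bernoulli_log_likelihood(heads, n, p):
    return heads * math.log(p) + (n - heads) * math.log(1.0 - p)


def compare(heads, n):
    """Return (BIC of fixed fair coin, BIC of maximum likelihood Bernoulli)."""
    fair = bic_score(bernoulli_log_likelihood(heads, n, 0.5), num_params=0, n=n)
    fitted = bic_score(bernoulli_log_likelihood(heads, n, heads / n), num_params=1, n=n)
    return fair, fitted


# Mild imbalance: the extra parameter does not pay for itself.
fair_55, fitted_55 = compare(heads=55, n=100)

# Strong imbalance: the better fit outweighs the complexity penalty.
fair_70, fitted_70 = compare(heads=70, n=100)
```

With 55 heads out of 100 the simpler model scores higher, while with 70 heads the fitted model wins, which matches the remark that BIC leans toward simple models unless the data clearly demand more.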