6.867 Section 3: Classification

Reading:

Probabilistic models: Bishop 4.1, 4.2

Bayesian methods: Bishop 4.3.1-4.3.4, 4.4-4.5

SVMs: Bishop (briefly) first part of 7.1; Murphy (grudgingly) 14.5; Hastie, Tibshirani, and Friedman 12.1-12.3

Contents
1 Intro
2 Representation
3 Probabilistic models
  3.1 Estimating Pr(X, Y)
    3.1.1 Linear discriminant analysis
    3.1.2 Factoring the class conditional probability
    3.1.3 Exponential family
      3.1.3.1 Bernoulli is in exponential family
      3.1.3.2 Normal is in exponential family
      3.1.3.3 LDA for exponential family
  3.2 Estimating Pr(Y | X)
    3.2.1 Least squares
    3.2.2 Logistic regression
    3.2.3 Generalized linear models
  3.3 Generative versus discriminative
4 Distributions over models

MIT 6.867 Fall 2016

1 Intro
Now we will look at the supervised learning problem of classification, in which the data given is a set of pairs
$$\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\},$$
with $x^{(i)} \in \mathbb{R}^D$ and $y^{(i)} \in \{c_1, \ldots, c_k\}$; that is, the output is a discrete value indicating the class of the example. By default, we'll think about the two-class classification problem, with the classes or labels often drawn from the set $\{0, 1\}$ or $\{+1, -1\}$. This problem isn't conceptually different from regression, but it offers a different kind of structure to be exploited during learning.
We'll look at learning prediction rules, probabilistic models, and distributions over models, and focus on the case in which the model represents a thresholded linear relationship between inputs and outputs. In later parts of the course, we will return to classification and look at non-linear and non-parametric approaches.
Because we have recently been looking at probabilistic approaches to regression, we
will begin by looking at probabilistic and Bayesian approaches to classification. When we
are done with that, we will consider some interesting strategies for finding linear separators
directly, without probabilistic modeling.

2 Representation
What does it mean to have a linear model for classification? Generally, it will be that we can express the output value $y^{(i)}$ by specifying a $(D-1)$-dimensional hyperplane in the $D$-dimensional feature space. Then, points that are on one side of the hyperplane are considered to be in one class, and points on the other side are in the other class.
That is, there are some weight values $w_0$ and $w = (w_1, \ldots, w_D)$ such that

$$y = \begin{cases} +1 & \text{if } w_0 + w_1 x_1 + \ldots + w_D x_D > 0 \\ -1 & \text{otherwise} \end{cases}$$

Such a model is known as a linear separator. So, in a 2D feature space, our separator would be a 1D hyperplane (otherwise known as a line). We would use three parameters (which is really one more than necessary) to describe the line.
As in regression, we can transform the input space, via a set of non-linear basis functions, and find a linear separator in the transformed space. Such a separator will be non-linear when projected back down into the original space.
In the following, to simplify notation, we will omit the possibility of using basis functions to transform the input values, but the extension is completely straightforward.
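As a minimal sketch of the thresholded linear rule described in this section, the following Python function applies a linear separator to a few points. The function name `predict` and the particular weights are illustrative assumptions, not part of the notes.

```python
def predict(w0, w, x):
    """Return +1 if w0 + w . x > 0, otherwise -1."""
    activation = w0 + sum(wj * xj for wj, xj in zip(w, x))
    return 1 if activation > 0 else -1

# Hypothetical 2D separator: the line x1 + x2 = 1.
w0, w = -1.0, [1.0, 1.0]
points = [(2.0, 2.0), (0.0, 0.0), (0.5, 0.5)]
labels = [predict(w0, w, x) for x in points]

# Scaling (w0, w) by any positive constant leaves every prediction unchanged,
# which is why three parameters are one more than a line in 2D needs.
scaled = [predict(3 * w0, [3 * wj for wj in w], x) for x in points]
```

The point (0.5, 0.5) lies exactly on the line, so the strict inequality assigns it to class -1.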

3 Probabilistic models
No model | Prediction rule | Prob model | Dist over models
Classification: this section covers the "Prob model" column.

3.1 Estimating Pr(X, Y)

Estimating the joint distribution is often referred to as learning a generative model. We will
make a point estimate of the parameters governing the distribution of the data, and then
we can use our loss function to decide how to make decisions or predictions.

3.1.1 Linear discriminant analysis

If we're going to estimate the joint distribution, we need to make some distributional assumptions. A common model is:

$$Y \sim \text{Bernoulli}(\pi)$$
$$X \mid Y = c \sim \text{Gaussian}(\mu_c, \Sigma_c)$$

where $\pi_c$ specifies the unconditional probability of getting an object of class $c$, and $\pi$ is a vector of $\pi_c$ for the possible class values $c$. We will use $\theta$ to name all of the parameters: $\pi$, $\mu$, $\Sigma$.
We can find the maximum-likelihood parameter estimates for this model straightforwardly. Letting $n_c = |\{i : y^{(i)} = c,\ i = 1, \ldots, n\}|$ be the number of examples in $\mathcal{D}$ of class $c$, we have:

$$\pi_c = \frac{n_c}{n}$$
$$\mu_c = \frac{1}{n_c} \sum_{\{i \mid y^{(i)} = c\}} x^{(i)}$$
$$\Sigma_c = \frac{1}{n_c} \sum_{\{i \mid y^{(i)} = c\}} (x^{(i)} - \mu_c)(x^{(i)} - \mu_c)^T$$
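The maximum-likelihood estimates above can be sketched in a few lines of NumPy. This is only an illustration: the function name `fit_gaussian_classes` and the toy data are assumptions made here, and the covariance divides by $n_c$ (the ML estimate) rather than $n_c - 1$.

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """ML estimates (pi_c, mu_c, Sigma_c) for each class label in y."""
    n = len(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]                        # rows whose label is c
        n_c = Xc.shape[0]
        mu = Xc.mean(axis=0)                  # class mean
        centered = Xc - mu
        Sigma = centered.T @ centered / n_c   # divide by n_c (ML), not n_c - 1
        params[int(c)] = (n_c / n, mu, Sigma)
    return params

# Made-up data: four points of class 0, two of class 1.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0],
              [4.0, 4.0], [6.0, 4.0]])
y = np.array([0, 0, 0, 0, 1, 1])
params = fit_gaussian_classes(X, y)
pi0, mu0, Sigma0 = params[0]
```

With only two points in class 1, its estimated covariance is singular, which is one practical reason to share a covariance across classes or to regularize the estimate.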

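Now, how should we make predictions? If we have 0-1 loss,

$$h(x) = \begin{cases} 1 & \text{if } \Pr(Y = 1 \mid X = x; \theta) > \Pr(Y = 0 \mid X = x; \theta) \\ 0 & \text{otherwise} \end{cases}$$

Let's concentrate on the conditions under which we predict 1:

$$\begin{aligned}
\Pr(Y = 1 \mid X = x; \theta) &> \Pr(Y = 0 \mid X = x; \theta) \\
\Pr(X = x \mid Y = 1; \theta) \Pr(Y = 1; \theta) &> \Pr(X = x \mid Y = 0; \theta) \Pr(Y = 0; \theta) \\
\log \Pr(X = x \mid Y = 1; \theta) + \log \Pr(Y = 1; \theta) &> \log \Pr(X = x \mid Y = 0; \theta) + \log \Pr(Y = 0; \theta) \\
\log \Pr(X = x \mid Y = 1; \theta) - \log \Pr(X = x \mid Y = 0; \theta) &> \log \Pr(Y = 0; \theta) - \log \Pr(Y = 1; \theta) \\
-\tfrac{1}{2} \log\det(\Sigma_1) - \tfrac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + \tfrac{1}{2} \log\det(\Sigma_0) + \tfrac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) &> \log \pi_0 - \log \pi_1 \\
-(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) &> 2(\log \pi_0 - \log \pi_1) + \log\det(\Sigma_1) - \log\det(\Sigma_0)
\end{aligned}$$

This is pretty clearly quadratic in $x$, so it's what we would call a quadratic separator or quadratic discriminant.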
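If we assume that the covariances of the two classes are equal, $\Sigma_1 = \Sigma_0 = \Sigma$, then things simplify and we are doing linear discriminant analysis. Now we have that (the determinant of $\Sigma$ cancels):

$$\Pr(Y = c \mid X = x) \propto \exp\left(\mu_c^T \Sigma^{-1} x - \frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \log \pi_c\right) \exp\left(-\frac{1}{2} x^T \Sigma^{-1} x\right).$$

Define

$$\beta_c = \Sigma^{-1} \mu_c$$
$$\gamma_c = -\frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \log \pi_c$$

Then

$$\begin{aligned}
\Pr(Y = c \mid X = x) &= \frac{\exp\left(\beta_c^T x + \gamma_c\right) \exp\left(-\frac{1}{2} x^T \Sigma^{-1} x\right)}{\sum_{c'} \exp\left(\beta_{c'}^T x + \gamma_{c'}\right) \exp\left(-\frac{1}{2} x^T \Sigma^{-1} x\right)} \\
&= \frac{\exp\left(\beta_c^T x + \gamma_c\right)}{\sum_{c'} \exp\left(\beta_{c'}^T x + \gamma_{c'}\right)}
\end{aligned}$$

Note that, in the two class case, this reduces to

$$\begin{aligned}
\Pr(Y = 1 \mid X = x) &= \frac{\exp\left(\beta_1^T x + \gamma_1\right)}{\exp\left(\beta_0^T x + \gamma_0\right) + \exp\left(\beta_1^T x + \gamma_1\right)} \\
&= \frac{1}{\exp\left(\beta_0^T x + \gamma_0 - \beta_1^T x - \gamma_1\right) + 1} \\
&= \text{sigmoid}(\beta_1^T x + \gamma_1 - \beta_0^T x - \gamma_0)
\end{aligned}$$

where the sigmoid function and its inverse, the logit function, are defined as follows:

$$\text{sigmoid}(a) = \frac{1}{1 + \exp(-a)} = \frac{\exp a}{1 + \exp a}$$
$$\text{logit}(\sigma) = \log \frac{\sigma}{1 - \sigma}.$$

The sigmoid is a soft step function, which takes a real number and maps it to the interval $(0, 1)$. If we have an expression of the form $\text{sigmoid}(W^T X)$, then the larger the magnitude of $W$, the steeper the slope on the sigmoid.
So, with two classes, we predict class 1 when

$$\begin{aligned}
\Pr(Y = 1 \mid X = x; \theta) &> \Pr(Y = 0 \mid X = x; \theta) \\
\exp\left(\beta_1^T x + \gamma_1\right) &> \exp\left(\beta_0^T x + \gamma_0\right) \\
\beta_1^T x + \gamma_1 &> \beta_0^T x + \gamma_0 \\
x^T (\beta_1 - \beta_0) + \gamma_1 - \gamma_0 &> 0
\end{aligned}$$

This is definitely a linear separator.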
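The shared-covariance result can be checked numerically. The sketch below computes a per-class linear coefficient vector $\Sigma^{-1}\mu_c$ and offset $-\frac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\pi_c$ (called `beta` and `gamma` in the code, names chosen here for illustration), applies the sigmoid to their difference, and compares the result with a direct application of Bayes' rule. All parameter values are made up.

```python
import numpy as np

def lda_parameters(mu0, mu1, Sigma, pi1):
    """Return [(beta_0, gamma_0), (beta_1, gamma_1)] for a shared covariance."""
    out = []
    for mu, pi in ((mu0, 1.0 - pi1), (mu1, pi1)):
        beta = np.linalg.solve(Sigma, mu)        # Sigma^{-1} mu_c
        gamma = -0.5 * mu @ beta + np.log(pi)    # -1/2 mu_c^T Sigma^{-1} mu_c + log pi_c
        out.append((beta, gamma))
    return out

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def posterior_class1(x, mu0, mu1, Sigma, pi1):
    (b0, g0), (b1, g1) = lda_parameters(mu0, mu1, Sigma, pi1)
    return sigmoid((b1 - b0) @ x + g1 - g0)

def gaussian_density(x, mu, Sigma):
    d = x - mu
    norm = np.sqrt((2 * np.pi) ** len(mu) * np.linalg.det(Sigma))
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm

# Hypothetical parameters, chosen only for illustration.
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
pi1 = 0.3
x = np.array([1.5, -0.5])

p_sigmoid = posterior_class1(x, mu0, mu1, Sigma, pi1)

# Direct Bayes rule, for comparison with the sigmoid form.
num = pi1 * gaussian_density(x, mu1, Sigma)
p_bayes = num / (num + (1 - pi1) * gaussian_density(x, mu0, Sigma))
```

The two probabilities agree up to floating-point error, and with equal priors the midpoint between the class means gets posterior 0.5, as expected for a point on the separator.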

3.1.2 Factoring the class conditional probability

In the previous section, we assumed a special structure, namely, that all the elements in a particular class were drawn from a Gaussian distribution (over $\mathbb{R}^D$). Now, we'll pursue another method that is also based on a special distributional structure.
This is a method called Naive Bayes. Let's assume now that $x^{(i)} \in \{0, 1\}^D$. You could interpret the $X$ values as numbers, encoded in binary, and put a multinomial distribution over all $2^D$ values for each class. But that is a lot of parameters to estimate!
Assume: Features are independent, given the class. That is,

$$\Pr(X \mid Y = c) = \prod_{j=1}^{D} \Pr(X_j \mid Y = c).$$

So

$$Y \sim \text{Bernoulli}(\pi)$$
$$X_j \mid Y = c \sim \text{Bernoulli}(\theta_{cj})$$

So, now, if we have two classes, we have $2D$ parameters, each of which is easy to estimate:

$$\Pr(X_j = 1 \mid Y = 1) = \theta_{1j}$$
$$\Pr(X_j = 1 \mid Y = 0) = \theta_{0j}$$

Use the Bernoulli ML estimate, maybe with Laplace correction:

$$\hat{\theta}_{1j} = \frac{\#(X_j = 1, Y = 1) + 1}{\#(Y = 1) + 2}.$$
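A minimal sketch of the Laplace-corrected Bernoulli estimate for binary features follows. The helper name `estimate_theta` and the tiny data set are assumptions made for illustration.

```python
def estimate_theta(X, y, c):
    """Laplace-corrected estimates of Pr(X_j = 1 | Y = c) for binary features."""
    rows = [x for x, label in zip(X, y) if label == c]
    n_c = len(rows)                       # #(Y = c)
    D = len(X[0])
    # (#(X_j = 1, Y = c) + 1) / (#(Y = c) + 2) for each feature j
    return [(sum(row[j] for row in rows) + 1) / (n_c + 2) for j in range(D)]

# Tiny made-up data set with D = 3 binary features.
X = [(1, 0, 1), (1, 1, 1), (0, 0, 1), (0, 1, 0), (0, 0, 0)]
y = [1, 1, 1, 0, 0]
theta1 = estimate_theta(X, y, 1)
theta0 = estimate_theta(X, y, 0)
pi1 = sum(y) / len(y)
```

Note that the third feature is 1 in every class-1 example, yet its estimate is 4/5 rather than 1; the correction keeps every probability strictly between 0 and 1, so the logarithms used at prediction time stay finite.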

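Now, for prediction. Given $x$, predict $C = 1$ if

$$\begin{aligned}
\Pr(x \mid C = 1) \Pr(C = 1) &> \Pr(x \mid C = 0) \Pr(C = 0) \\
\prod_{j=1}^{D} \Pr(x_j \mid C = 1) \Pr(C = 1) &> \prod_{j=1}^{D} \Pr(x_j \mid C = 0) \Pr(C = 0) \\
\sum_{j=1}^{D} \log \Pr(x_j \mid C = 1) + \log \Pr(C = 1) &> \sum_{j=1}^{D} \log \Pr(x_j \mid C = 0) + \log \Pr(C = 0) \\
\sum_{j=1}^{D} \log\left(\theta_{1j}^{x_j} (1 - \theta_{1j})^{(1 - x_j)}\right) + \log \pi_1 &> \sum_{j=1}^{D} \log\left(\theta_{0j}^{x_j} (1 - \theta_{0j})^{(1 - x_j)}\right) + \log \pi_0 \\
\sum_{j=1}^{D} \left(x_j \log \theta_{1j} + (1 - x_j) \log(1 - \theta_{1j})\right) + \log \pi_1 &> \sum_{j=1}^{D} \left(x_j \log \theta_{0j} + (1 - x_j) \log(1 - \theta_{0j})\right) + \log \pi_0 \\
\sum_{j=1}^{D} x_j \left(\log \frac{\theta_{1j}}{1 - \theta_{1j}} - \log \frac{\theta_{0j}}{1 - \theta_{0j}}\right) &> \log \pi_0 - \log \pi_1 - \sum_{j=1}^{D} \log \frac{1 - \theta_{1j}}{1 - \theta_{0j}}
\end{aligned}$$

So, this is a linear separator of the form $x^T w + w_0 > 0$ with

$$w_j = \log \frac{\theta_{1j}}{1 - \theta_{1j}} - \log \frac{\theta_{0j}}{1 - \theta_{0j}},$$

and

$$w_0 = \log \frac{\pi_1}{\pi_0} + \sum_{j=1}^{D} \log \frac{1 - \theta_{1j}}{1 - \theta_{0j}}.$$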

Interestingly, the probability model is also sigmoidal. We have

$$\begin{aligned}
\Pr(C = 1 \mid X = x) &= \frac{\Pr(x \mid C = 1) \Pr(C = 1)}{\Pr(x \mid C = 1) \Pr(C = 1) + \Pr(x \mid C = 0) \Pr(C = 0)} \\
&= \frac{\exp(f_1(x))}{\exp(f_1(x)) + \exp(f_0(x))} \\
&= \frac{1}{1 + \exp(f_0(x) - f_1(x))} \\
&= \text{sigmoid}(f_1(x) - f_0(x))
\end{aligned}$$

where

$$f_c(x) = \sum_{j=1}^{D} \left(x_j \log \theta_{cj} + (1 - x_j) \log(1 - \theta_{cj})\right) + \log \pi_c.$$
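To make the Naive Bayes result concrete, the sketch below evaluates the class-1 posterior in two ways: as a sigmoid of the difference of the two per-class log joint probabilities $f_c(x)$, and as a sigmoid of a linear function $x^T w + w_0$, where the bias is the log prior ratio plus the sum over features of $\log\frac{1 - \theta_{1j}}{1 - \theta_{0j}}$. Function names and parameter values are illustrative assumptions.

```python
import math

def f(x, theta, pi):
    """f_c(x): log joint probability of x and class c under naive Bayes."""
    return sum(xj * math.log(t) + (1 - xj) * math.log(1 - t)
               for xj, t in zip(x, theta)) + math.log(pi)

def linear_weights(theta1, theta0, pi1):
    """Weights (w, w0) of the separator x^T w + w0 > 0."""
    w = [math.log(t1 / (1 - t1)) - math.log(t0 / (1 - t0))
         for t1, t0 in zip(theta1, theta0)]
    w0 = math.log(pi1 / (1 - pi1)) + sum(
        math.log((1 - t1) / (1 - t0)) for t1, t0 in zip(theta1, theta0))
    return w, w0

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical parameter values for D = 3 binary features.
theta1, theta0, pi1 = [0.6, 0.4, 0.8], [0.25, 0.5, 0.25], 0.6
x = (1, 0, 1)

w, w0 = linear_weights(theta1, theta0, pi1)
p_linear = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w0)
p_f = sigmoid(f(x, theta1, pi1) - f(x, theta0, 1 - pi1))
```

Both routes give the same probability, which confirms that the activation inside the sigmoid is linear in $x$.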

3.1.3 Exponential family

A really cool family of distributions.

- Only family for which conjugate priors exist
- Finite-size sufficient statistics
- Includes Normal, Bernoulli, Multinomial, Poisson, Gamma, Exponential, Beta, Dirichlet, various combinations

A (somewhat simplified) subset of exponential-family distributions can be written in the form:
$$\Pr(x; \eta) = h(x) g(\eta) \exp(\eta^T u(x))$$
where $x$ may be a scalar or vector, discrete or continuous; $\eta$ is called the natural parameters, and $g(\eta)$ is a normalization constant.

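3.1.3.1 Bernoulli is in exponential family

$$\begin{aligned}
\Pr(x \mid p) &= p^x (1 - p)^{(1 - x)} \\
&= \exp\left(x \log(p) + (1 - x) \log(1 - p)\right) \\
&= \exp\left(x \log(p) - x \log(1 - p) + \log(1 - p)\right) \\
&= (1 - p) \exp\left(x \log \frac{p}{1 - p}\right)
\end{aligned}$$

This fits in the family with: $u(x) = x$, $\eta = \log \frac{p}{1 - p}$, $h(x) = 1$, and $g(\eta) = 1 - p$. So

$$\Pr(x \mid \eta) = \text{sigmoid}(-\eta) \exp(\eta x).$$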
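A quick numerical check of the Bernoulli case: with natural parameter $\eta = \log\frac{p}{1-p}$, the normalizer $g(\eta) = 1 - p$ equals $\text{sigmoid}(-\eta)$, so the exponential-family form should reproduce $p^x (1-p)^{1-x}$. The function names and the value of `p` are assumptions chosen for this sketch.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def bernoulli_direct(x, p):
    return p ** x * (1 - p) ** (1 - x)

def bernoulli_expfam(x, eta):
    """h(x) g(eta) exp(eta * u(x)) with h = 1, u(x) = x, g(eta) = sigmoid(-eta)."""
    return sigmoid(-eta) * math.exp(eta * x)

p = 0.3
eta = math.log(p / (1 - p))   # natural parameter: the logit of p
pairs = [(bernoulli_direct(x, p), bernoulli_expfam(x, eta)) for x in (0, 1)]
```

Each pair holds the same probability computed both ways, for $x = 0$ and $x = 1$.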

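3.1.3.2 Normal is in exponential family

We'll just look at the one-dimensional case, but it's true for the multivariate Gaussian as well.

$$\begin{aligned}
\Pr(x \mid \mu, \sigma^2) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2} (x - \mu)^2\right) \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2} x^2 + \frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} \mu^2\right)
\end{aligned}$$

To make this match up, let

$$\eta = \begin{pmatrix} \mu / \sigma^2 \\ -1 / (2\sigma^2) \end{pmatrix}$$
$$u(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}$$
$$h(x) = (2\pi)^{-1/2}$$
$$g(\eta) = \sqrt{-2\eta_2} \exp\left(\frac{\eta_1^2}{4\eta_2}\right).$$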
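The same kind of check works for the one-dimensional Normal. The sketch below assumes the natural parameters $\eta_1 = \mu/\sigma^2$ and $\eta_2 = -1/(2\sigma^2)$, sufficient statistics $u(x) = (x, x^2)$, $h(x) = (2\pi)^{-1/2}$, and $g(\eta) = \sqrt{-2\eta_2}\exp(\eta_1^2 / (4\eta_2))$, and compares the product $h \cdot g \cdot \exp(\eta^T u(x))$ with the usual density. Function names and test values are illustrative.

```python
import math

def normal_direct(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def normal_expfam(x, mu, sigma2):
    eta1, eta2 = mu / sigma2, -1.0 / (2 * sigma2)        # natural parameters
    h = (2 * math.pi) ** -0.5
    g = math.sqrt(-2 * eta2) * math.exp(eta1 ** 2 / (4 * eta2))
    return h * g * math.exp(eta1 * x + eta2 * x ** 2)    # eta^T u(x), u(x) = (x, x^2)

checks = [(normal_direct(x, 1.5, 0.7), normal_expfam(x, 1.5, 0.7))
          for x in (-1.0, 0.0, 2.3)]
```

Here $g(\eta)$ simplifies to $\frac{1}{\sigma}\exp(-\mu^2 / (2\sigma^2))$, which is the part of the density that does not depend on $x$ beyond the constant $h$.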

3.1.3.3 LDA for exponential family

Cool result! If we have two classes, and $\Pr(X \mid Y = c)$ is a distribution in the exponential family, with the restrictions that $u(x) = x$, and that the scale parameters are shared among the classes (e.g., the covariance in the Gaussian case) so that the form is

$$\Pr(x \mid \eta_c, s) = \frac{1}{s} h\left(\frac{1}{s} x\right) g(\eta_c) \exp\left(\frac{1}{s} \eta_c^T x\right),$$

then

- The separator is linear, and
- The predictive probability $\Pr(Y = 1 \mid x)$ is a sigmoid on an activation function which is the difference between the logs of the class membership probabilities of the two classes. That is,
$$\Pr(Y = 1 \mid x) = \text{sigmoid}(a(x)),$$
where
$$a(x) = \frac{1}{s} (\eta_1 - \eta_0)^T x + \log g(\eta_1) - \log g(\eta_0) + \log \pi_1 - \log \pi_0.$$
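As an illustration beyond the Gaussian case, the sketch below uses Poisson class-conditional distributions, which fit the restricted form with $u(x) = x$, $s = 1$, $\eta = \log\lambda$, $h(x) = 1/x!$, and $g(\eta) = \exp(-e^{\eta})$. The choice of the Poisson, the rates, and the prior are assumptions made here, not an example from the notes.

```python
import math

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def activation(x, lam1, lam0, pi1):
    eta1, eta0 = math.log(lam1), math.log(lam0)   # natural parameters
    log_g1, log_g0 = -lam1, -lam0                 # log g(eta) = -exp(eta)
    return ((eta1 - eta0) * x + log_g1 - log_g0
            + math.log(pi1) - math.log(1 - pi1))

lam1, lam0, pi1 = 4.0, 1.5, 0.4   # hypothetical rates and class prior
results = []
for x in range(6):
    joint1 = pi1 * poisson_pmf(x, lam1)
    joint0 = (1 - pi1) * poisson_pmf(x, lam0)
    results.append((sigmoid(activation(x, lam1, lam0, pi1)),
                    joint1 / (joint1 + joint0)))
```

The factor $h(x) = 1/x!$ is the same for both classes and cancels, which is why the activation is linear in $x$.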

3.2 Estimating Pr(Y | X)

This is called estimating a discriminative model.

3.2.1 Least squares

A favorite trick is to reduce our current problem to a previous problem we already know how to solve. In this case, we could try to treat classification as a regression problem, by taking the $Y$ values to be in $\{+1, -1\}$ and applying one of our standard regression methods. We could then predict class $+1$, given $x$, if

$$W^T x + W_0 > 0.$$

That will generate a separator, but it turns out to be a bad idea. There is generally no hypothesis that does a good job of representing the data, and it is easy to show that there are situations in which a linear separator for the data exists, but the hypothesis that regression comes up with does not separate the data.
One reason that this doesn't work out is that it isn't founded on a sensible probabilistic model: the squared error criterion fundamentally assumes that the $Y$ values are normally distributed, but they're not.

3.2.2 Logistic regression

So, what assumption can we reasonably make about the distribution of outputs, in the classification case? Rather than directly trying to predict the value, we could try to find a regression model that predicts $\Pr(Y = 1 \mid X = x)$ as a function of $x$. One thought would be to use the form
$$\Pr(Y = 1 \mid X = x) = W^T x,$$
but the problem is that we need the probabilities to be in the interval $[0, 1]$, and that linear form is unconstrained.
We can take inspiration from what we saw in the LDA case: that in at least two cases we considered, the predictive distribution could be described as a sigmoid applied to a dot product. So, let's consider models of the form

$$\Pr(Y = 1 \mid X = x) = \text{sigmoid}(W^T x).$$

Such models are called logistic regression models. (Note: This name is confusing, since we're using them for classification, but the idea is that we're doing a regression to find a probability value.)

So, how does this differ from LDA? We are not making any distributional assumptions about $X$. So, we're going to train a predictive model from the same class, but using a different criterion.
Given data $\mathcal{D}$, what are the maximum-likelihood estimates of the parameters $W$? Assuming $y^{(i)} \in \{0, 1\}$, the likelihood function is:

$$\Pr(\mathcal{D} \mid W) = \prod_{i=1}^{n} s\left(W^T x^{(i)}\right)^{y^{(i)}} \left(1 - s\left(W^T x^{(i)}\right)\right)^{(1 - y^{(i)})},$$

where $s$ is short for sigmoid. The negative log likelihood, which we want to minimize, is:

$$\text{NLL}(W) = -\sum_{i=1}^{n} \left( y^{(i)} \log s\left(W^T x^{(i)}\right) + (1 - y^{(i)}) \log\left(1 - s\left(W^T x^{(i)}\right)\right) \right).$$

The gradient with respect to the weights is

$$\nabla_W \text{NLL}(W) = \sum_{i=1}^{n} \left( s\left(W^T x^{(i)}\right) - y^{(i)} \right) x^{(i)}.$$

(Note: Here's the cool thing about the derivative of the sigmoid: $\frac{d}{dx} s(x) = s(x)(1 - s(x))$.)

This is the same as the gradient of the error function for the linear regression model! (Note: We'll see why in the next section.)
If the data is linearly separable, the optimization will want to make the weights as large as possible (to make the sigmoid as steep as possible, which will push the output values closer and closer to 0 and 1). (Note: An old trick from neural network days was to use targets of +0.6 and -0.6, which are actually attainable with finite weights.) Furthermore, again, if the data are separable, there is an infinity of separators that are equally good; so the optimization problem is not well posed. This can be addressed by adding a regularization penalty or using a prior.
Unfortunately, there is no closed form solution for the maximum likelihood values of $W$, so we need to use an iterative optimization method. Two common choices are:

- Gradient descent, especially stochastic gradient descent, which goes through the data points one by one and does an update to the weights with a very small step size based on each individual point. This can be guaranteed to converge, though it might be slow, but it can be less likely than batch gradient descent to get stuck in local optima on non-convex functions.

- Iteratively reweighted least squares, which is essentially Newton's method. It can be computationally challenging, because it uses the Hessian (matrix of second derivatives), but is more reliable than regular gradient descent.

Fortunately, the error function is convex, so these methods, appropriately tuned, will find the global optimum.
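A minimal sketch of fitting logistic regression by batch gradient descent on the negative log likelihood follows. The step size, iteration count, and toy data are assumptions chosen for illustration; the data are deliberately not linearly separable, so the maximum-likelihood weights are finite.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(W, X, y):
    s = sigmoid(X @ W)
    return -np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))

def nll_gradient(W, X, y):
    return X.T @ (sigmoid(X @ W) - y)     # sum_i (s(W^T x_i) - y_i) x_i

def fit(X, y, step=0.1, iters=2000):
    W = np.zeros(X.shape[1])
    for _ in range(iters):
        W = W - step * nll_gradient(W, X, y)   # batch gradient step
    return W

# Made-up 1D data; the leading column of ones plays the role of a bias term.
X = np.array([[1, -2.0], [1, -1.0], [1, 0.0], [1, 0.5], [1, 1.0], [1, 2.0]])
y = np.array([0, 0, 1, 0, 1, 1])
W = fit(X, y)
```

Because the objective is convex, the gradient at the returned `W` is close to zero and the fitted NLL is lower than at the all-zero starting point. On separable data the same loop would keep increasing the weight magnitudes, which is the ill-posedness that a regularization penalty or prior addresses.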

3.2.3 Generalized linear models

Both linear regression and logistic regression are instances of a more general framework of generalized linear models:

- The distribution of $Y \mid X$ is an exponential family distribution
- The goal is to predict $E[Y \mid X; \theta]$
- The natural parameter and the inputs are linearly related: $\eta = \theta^T x$.

What is cool about this is that any regression-type problem that satisfies the requirements above can be handled in a common way. So, there are general purpose methods for

- Maximum likelihood parameter estimation (IRLS)
- Bayesian estimation
- Various statistical tests

This also means that these models are not too sensitive to the distributional assumptions made (beyond the general GLM assumptions), since the parameter-fitting is independent of which exponential-family distribution you have.

3.3 Generative versus discriminative

Given a choice, should we choose a generative or discriminative model? (Note: This discussion is cribbed from Machine Learning: A Probabilistic Perspective by Kevin Murphy.)

- Generative classifiers are usually computationally easier to fit.
- Generative classifiers, in the multi-class case, generalize more easily to adding a new class (since the per-class models are independent).
- Generative models can be estimated relatively easily in the presence of missing or unlabeled data.
- Generative models can be run backwards (used to predict $X$ from $Y$).
- Discriminative models apply very well when we expand the input space using a basis set of feature functions; generative models can get into trouble due to correlation among the inputs.
- If the distributional assumptions are true, generative models can be well estimated with fewer training examples than discriminative models.
- If the distributional assumptions are not true, generative models can result in really bad predictions.

4 Distributions over models

No model | Prediction rule | Prob model | Dist over models
Classification: this section covers the "Dist over models" column.

Rather than finding a single best weight vector $W$ and using that to make predictions $\Pr(y \mid x; W)$, we might again wish to be Bayesian, by putting a prior on $W$ and then using the posterior on $W$, that is $\Pr(W \mid \mathcal{D})$, to make predictions that take uncertainty in the weights into account.
Unfortunately, there is no conjugate prior for logistic regression. Here, we will show one strategy (that can be used in all sorts of cases with non-conjugate priors, sometimes to good effect, sometimes less so) for approximating the posterior. The idea is to start with a Gaussian prior with mean $m_0$ and covariance $S_0$, and compute the log posterior, which is, up to an additive constant,

$$\log \Pr(W \mid \mathcal{D}) = -\frac{1}{2} (W - m_0)^T S_0^{-1} (W - m_0) + \sum_{i=1}^{n} \left( y^{(i)} \log o^{(i)} + (1 - y^{(i)}) \log(1 - o^{(i)}) \right) + \text{const},$$

where $o^{(i)} = \text{sigmoid}(W^T x^{(i)})$.

This clearly does not have the form of a Gaussian. So we want to find a Gaussian approximation to the posterior, which means finding parameters $m_n$ and $S_n$ that make a good approximation. Using a second-order Taylor-series expansion about the mode, we find that the MAP estimate of the weights, which under a simple diagonal Gaussian prior is
$$W_{MAP} = \arg\min_W \text{NLL}(W) + \lambda W^T W,$$
is the appropriate choice of $m_n$. (Note: This is known as the Laplace approximation; see Bishop 4.4 for more details.)

The gradient of the negative log likelihood, with the prior term included, at the mode is:

$$\left. \frac{\partial \text{NLL}(W)}{\partial W} \right|_{W_{MAP}} = S_0^{-1} (W - m_0) - \sum_{i=1}^{n} (y^{(i)} - o^{(i)}) x^{(i)},$$

where output $o^{(i)} = \text{sigmoid}(x^{(i)T} W_{MAP})$.
The covariance matrix is then the inverse of the Hessian, which is a matrix of second derivatives of NLL. So,

$$S_n^{-1} = H = \left. \frac{\partial^2 \text{NLL}(W)}{\partial W \partial W^T} \right|_{W_{MAP}} = S_0^{-1} + \sum_{i=1}^{n} o^{(i)} (1 - o^{(i)}) x^{(i)} x^{(i)T}.$$
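The Laplace approximation for logistic regression can be sketched as follows: find the posterior mode with Newton's method, then take the inverse Hessian at the mode as the covariance. The use of plain Newton steps, the prior $m_0 = 0$, $S_0 = 4I$, and the toy data are all assumptions made for this illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_and_hessian(W, X, y, m0, S0_inv):
    """Gradient and Hessian of the negative log posterior at W."""
    o = sigmoid(X @ W)
    grad = S0_inv @ (W - m0) - X.T @ (y - o)
    H = S0_inv + (X * (o * (1 - o))[:, None]).T @ X   # S0^{-1} + sum o(1-o) x x^T
    return grad, H

def laplace_posterior(X, y, m0, S0, iters=50):
    """Gaussian approximation N(m_n, S_n) to Pr(W | D)."""
    S0_inv = np.linalg.inv(S0)
    W = m0.copy()
    for _ in range(iters):                 # Newton's method to find the mode
        grad, H = grad_and_hessian(W, X, y, m0, S0_inv)
        W = W - np.linalg.solve(H, grad)
    grad, H = grad_and_hessian(W, X, y, m0, S0_inv)
    return W, np.linalg.inv(H), grad

# Made-up data; the leading column of ones acts as a bias feature.
X = np.array([[1, -2.0], [1, -1.0], [1, 0.0], [1, 0.5], [1, 1.0], [1, 2.0]])
y = np.array([0, 0, 1, 0, 1, 1])
m0, S0 = np.zeros(2), 4.0 * np.eye(2)
m_n, S_n, grad_at_mode = laplace_posterior(X, y, m0, S0)
```

The gradient vanishes at the returned mode, and the approximate posterior covariance `S_n` is tighter than the prior covariance in every direction, since the data term only adds to the prior precision.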

Going from here to a predictive distribution requires further approximation. We won't go into it in detail. However, there is one more useful concept to get out of this, which is the Bayesian information criterion or BIC. We can approximate

$$\log \Pr(\mathcal{D}) \approx \log \Pr(\mathcal{D} \mid W_{MAP}) + \log \Pr(W_{MAP}) + \frac{D}{2} \log(2\pi) - \frac{1}{2} \log |H|.$$

The last three terms, called the Occam factor, measure the complexity of the model. If the prior on weights is uniform, then we can simplify to

$$\log \Pr(\mathcal{D}) \approx \log \Pr(\mathcal{D} \mid W_{ML}) - \frac{1}{2} \log |H|.$$
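The determinant of a covariance matrix can be seen informally as characterizing the size of the covariance ellipse. Here $H$ is the inverse of the posterior covariance, so a large $|H|$ corresponds to a small covariance ellipse: the posterior is concentrated on a small region of weight space, and the approximation to the marginal likelihood is lower.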
If we think of $H$ as being a sum of $H_i$ over the individual data points, and assume that they are approximable by a single $\hat{H}$, then

$$\begin{aligned}
\log |H| &= \log |n \hat{H}| \\
&= \log\left(n^D |\hat{H}|\right) \\
&= D \log n + \log |\hat{H}|
\end{aligned}$$

We drop $\log |\hat{H}|$ because it is independent of $n$ and $D$, and wind up with a fairly gross approximation:
$$\log \Pr(\mathcal{D}) \approx \log \Pr(\mathcal{D} \mid W_{ML}) - \frac{D}{2} \log n.$$
This is the BIC score, which can be used as a quick-and-dirty way of selecting model complexity.
BIC is asymptotically consistent as a selection criterion: that is, given a set of models, as $n \to \infty$, BIC will select the correct model. It has a tendency to select models that are too simple, with small $n$, though.
