
Linear Models for Classification

Sumeet Agarwal, EEL709

(Most figures from Bishop, PRML)


Approaches to classification
Discriminant function: directly assigns each data point x to a particular class Ci
Model the conditional class distribution p(Ci|x): allows separation of inference and decision
Generative approach: model the class likelihoods, p(x|Ci), and priors, p(Ci); use Bayes' theorem to get the posteriors:
p(Ci|x) ∝ p(x|Ci)p(Ci)
Linear discriminant functions
y(x) = wTx + w0
Multiple Classes
Building a K-class classifier by combining multiple two-class discriminants (one-versus-the-rest or one-versus-one) leads to ambiguous regions of input space.
Multiple Classes
Consider instead a single K-class discriminant, comprising K linear functions:
yk(x) = wkTx + wk0
and assign x to class Ck if yk(x) > yj(x) for all j ≠ k.
This implies singly connected and convex decision regions:
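As an illustration (not part of the original slides), a minimal NumPy sketch of this decision rule, assuming a weight matrix W whose k-th column holds (wk0, wk), with a constant feature prepended to x for the bias:

    import numpy as np

    def discriminant_scores(X, W):
        # X: (N, D) data matrix; W: (D + 1, K) weights, with row 0 holding the biases w_k0.
        X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a constant feature for the bias
        return X_aug @ W                                   # y_k(x) = w_k^T x + w_k0, one column per class

    def classify(X, W):
        # Assign each x to the class C_k with the largest discriminant y_k(x).
        return np.argmax(discriminant_scores(X, W), axis=1)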
Least squares for classification
Too sensitive to outliers:
Least squares for classification
Also problematic because the binary (or 1-of-K) target values are evidently non-Gaussian, whereas least squares implicitly assumes Gaussian-distributed targets:
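For reference, a hedged sketch of the least-squares approach being criticised here (function and variable names are mine): fit the weights to 1-of-K target vectors by ordinary least squares and classify by the largest output.

    import numpy as np

    def fit_least_squares_classifier(X, t, K):
        # Encode the integer labels t as 1-of-K target rows, then solve the
        # least-squares problem min_W ||X_aug W - T||^2.
        N = X.shape[0]
        T = np.zeros((N, K))
        T[np.arange(N), t] = 1.0
        X_aug = np.hstack([np.ones((N, 1)), X])   # constant feature for the biases
        W, *_ = np.linalg.lstsq(X_aug, T, rcond=None)
        return W

Outliers pull the fitted hyperplanes towards themselves, which is why the decision boundaries in the slide's figures degrade.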
Fisher's linear discriminant
A linear classification model is like a 1-D projection of the data: y = wTx.
Thus we need to find a decision threshold along this 1-D projection (line). The simplest measure of separation is the difference between the projected class means: m2 − m1 = wT(m2 − m1). If the classes have non-diagonal covariances, a better idea is to use the Fisher criterion:

J(w) = (m2 − m1)2 / (s12 + s22)

where sk2 denotes the variance of class k in the 1-D projection.

Maximising J(w) attempts to give a large separation between the projected class means, while also keeping a small variance within each class.
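A minimal sketch of maximising J(w) for two classes, using the standard closed-form solution w ∝ SW^-1(m2 − m1), with SW the within-class scatter matrix (this formula is from Bishop, not spelled out on the slide; names are mine):

    import numpy as np

    def fisher_direction(X1, X2):
        # X1, X2: (N1, D) and (N2, D) samples from the two classes.
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        # Within-class scatter S_W: sum of the per-class scatter matrices.
        S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
        w = np.linalg.solve(S_W, m2 - m1)   # direction maximising J(w)
        return w / np.linalg.norm(w)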
Fisher's linear discriminant

Figures: projection onto the line joining the class means (left) vs. the Fisher discriminant direction (right)


The Perceptron

Figure: perceptron diagram. Basis functions φ1(x), ..., φ4(x) are weighted by w1, ..., w4 and summed; the activation function f(·) is a step function, equal to +1 when wTφ(x) ≥ 0 and −1 otherwise, giving the output f(wTφ(x)).

A non-linear transformation in the form of a step function


is applied to the weighted sum of the input features. This
is inspired by the way neurons appear to function,
mimicking the action potential.
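A small sketch of that forward pass, assuming the simple basis φ(x) = (1, x) and a step activation returning ±1 (names are illustrative):

    import numpy as np

    def perceptron_predict(X, w):
        # phi(x) = (1, x): prepend a constant feature, then apply the step
        # activation f(a) = +1 if a >= 0 else -1 to the weighted sum w^T phi(x).
        Phi = np.hstack([np.ones((X.shape[0], 1)), X])
        return np.where(Phi @ w >= 0, 1, -1)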
The perceptron criterion
We'd like a weight vector w such that wTφ(xi) > 0 for xi ∈ C1 (say, ti = 1) and wTφ(xi) < 0 for xi ∈ C2 (ti = −1).
Thus, we want wTφ(xi)ti > 0 for all i; the data points for which this does not hold are misclassified.
The perceptron criterion tries to minimise the 'magnitude' of misclassification, i.e., it tries to minimise −wTφ(xi)ti over the misclassified points (the set of which is denoted by M):

EP(w) = −Σi∈M wTφ(xi)ti

Why not just count the number of misclassified points? Because that count is a piecewise constant function of w, so its gradient is zero almost everywhere, making optimisation hard.
Learning by gradient descent
w(τ+1) = w(τ) − η∇EP(w)
       = w(τ) + ηφ(xi)ti
(if xi is misclassified)

We can show that after this update, the error due to xi is reduced:

−w(τ+1)Tφ(xi)ti = −w(τ)Tφ(xi)ti − (φ(xi)ti)Tφ(xi)ti
                < −w(τ)Tφ(xi)ti

(having set η = 1, which can be done without loss of generality)
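Putting the criterion and the update rule together, a hedged sketch of the stochastic perceptron algorithm with η = 1, again taking φ(x) = (1, x) and labels ti ∈ {−1, +1} (the epoch limit is an arbitrary choice of mine):

    import numpy as np

    def train_perceptron(X, t, n_epochs=100):
        # X: (N, D) inputs; t: (N,) labels in {-1, +1}.
        Phi = np.hstack([np.ones((X.shape[0], 1)), X])
        w = np.zeros(Phi.shape[1])
        for _ in range(n_epochs):
            updated = False
            for phi_i, t_i in zip(Phi, t):
                if (w @ phi_i) * t_i <= 0:     # misclassified point
                    w = w + phi_i * t_i        # w <- w + eta * phi(x_i) * t_i, with eta = 1
                    updated = True
            if not updated:                    # all points satisfy w^T phi(x_i) t_i > 0
                break
        return w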
Perceptron convergence

Perceptron
convergence
theorem
guarantees
exact solution in
finite steps for
linearly
separable data;
but no
convergence for
nonseparable
data
Gaussian Discriminant Analysis
Generative approach, with class-conditional densities
(likelihoods) modelled as Gaussians

For the case of two classes, we have:

p(C1|x) = p(x|C1)p(C1) / [p(x|C1)p(C1) + p(x|C2)p(C2)] = σ(a),

where a = ln[ p(x|C1)p(C1) / (p(x|C2)p(C2)) ] and σ(a) = 1/(1 + exp(−a)) is the logistic sigmoid.
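A hedged NumPy/SciPy sketch of this generative computation for two Gaussian class-conditional densities (parameter names are mine; the slide only shows the equation and a figure):

    import numpy as np
    from scipy.stats import multivariate_normal

    def gda_posterior(x, mu1, Sigma1, mu2, Sigma2, prior1=0.5):
        # p(C1|x) via Bayes' theorem from Gaussian class-conditionals and class priors.
        p1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma1) * prior1
        p2 = multivariate_normal.pdf(x, mean=mu2, cov=Sigma2) * (1 - prior1)
        return p1 / (p1 + p2)   # equivalently sigma(a) with a = ln(p1 / p2)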
Gaussian Discriminant Analysis
In the Gaussian case (class-conditionals N(x|μk, Σ) with a shared covariance Σ), we get

p(C1|x) = σ(wTx + w0), where w = Σ^-1(μ1 − μ2) and
w0 = −(1/2) μ1TΣ^-1μ1 + (1/2) μ2TΣ^-1μ2 + ln[ p(C1)/p(C2) ].

The assumption of equal covariance matrices thus leads to linear decision boundaries.
Gaussian Discriminant Analysis

Allowing for unequal covariance matrices for different classes leads to quadratic decision boundaries.
Parameter estimation for GDA
Likelihood (assuming equal covariance matrices), with tn = 1 for class C1, tn = 0 for C2, and prior p(C1) = π:

p(t, X | π, μ1, μ2, Σ) = Πn [π N(xn|μ1, Σ)]^tn [(1 − π) N(xn|μ2, Σ)]^(1 − tn)

Maximum Likelihood Estimators:

π = N1/N,  μ1 = (1/N1) Σn tn xn,  μ2 = (1/N2) Σn (1 − tn) xn,
Σ = (N1/N) S1 + (N2/N) S2, where Sk is the covariance of the data in class Ck.
Logistic Regression
An example of a probabilistic discriminative model.
Rather than learning p(x|Ci) and p(Ci), it attempts to directly learn p(Ci|x).
Advantages: fewer parameters, and better performance if the assumptions made in the class-conditional density formulation are inaccurate.
We have seen how the class posterior for a two-class setting can be written as a logistic sigmoid acting on a linear function of the feature vector φ:

p(C1|φ) = y(φ) = σ(wTφ)

This model is called logistic regression, even though it is a model for classification, not regression!
Parameter learning
If we let

yn = p(C1|φn) = σ(wTφn)

then the likelihood function is

p(t|w) = Πn yn^tn (1 − yn)^(1 − tn)

and we can define a corresponding error, the negative log-likelihood, known as cross-entropy:

E(w) = −ln p(t|w) = −Σn [ tn ln yn + (1 − tn) ln(1 − yn) ]
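A minimal sketch of evaluating this error (the clipping is my addition, purely for numerical safety):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def cross_entropy(w, Phi, t, eps=1e-12):
        # Phi: (N, M) design matrix of basis-function values; t: (N,) targets in {0, 1}.
        y = np.clip(sigmoid(Phi @ w), eps, 1 - eps)
        return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))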
Parameter learning
The derivative of the sigmoid function is given by:

dσ/da = σ(a)(1 − σ(a))

Using this, we can obtain the gradient of the error function with respect to w:

∇E(w) = Σn (yn − tn)φn

Thus the contribution to the gradient from point n is given by the 'error' between the model prediction and the actual class label, (yn − tn), times the basis function vector for that point, φn.
We could use this for sequential learning by gradient descent, exactly as for least-squares linear regression.
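A hedged sketch of that sequential (stochastic) gradient-descent loop; the learning rate eta and the epoch count are illustrative choices of mine, not from the slides:

    import numpy as np

    def train_logistic_regression(Phi, t, eta=0.1, n_epochs=100):
        # Sequential gradient descent on the cross-entropy error:
        # w <- w - eta * (y_n - t_n) * phi_n, taking each point n in turn.
        w = np.zeros(Phi.shape[1])
        for _ in range(n_epochs):
            for phi_n, t_n in zip(Phi, t):
                y_n = 1.0 / (1.0 + np.exp(-(w @ phi_n)))
                w = w - eta * (y_n - t_n) * phi_n
        return w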
Nonlinear basis functions: applying fixed nonlinear basis functions φ(x) can make classes that are not linearly separable in the original input space linearly separable in the feature space.
