
CSCE833 Machine Learning

Lecture 9
Linear Discriminant Analysis

Dr. Jianjun Hu
mleg.cse.sc.edu/edu/csce833

University of South Carolina


Department of Computer Science and Engineering
Likelihood- vs. Discriminant-based Classification

Likelihood-based: assume a model for $p(\mathbf{x} \mid C_i)$ and use Bayes' rule to calculate $P(C_i \mid \mathbf{x})$; the discriminant is $g_i(\mathbf{x}) = \log P(C_i \mid \mathbf{x})$.
Discriminant-based: assume a model for $g_i(\mathbf{x} \mid \Phi_i)$ directly; no density estimation.
Estimating the boundaries is enough; there is no need to accurately estimate the densities inside the boundaries.

Based on: Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, © The MIT Press (V1.1).
Example: Boundary or Class Model

Linear Discriminant

Linear discriminant:
$$g_i(\mathbf{x} \mid \mathbf{w}_i, w_{i0}) = \mathbf{w}_i^T \mathbf{x} + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}$$
Advantages:
- Simple: O(d) space/computation
- Knowledge extraction: weighted sum of attributes; positive/negative weights, magnitudes (credit scoring)
- Optimal when $p(\mathbf{x} \mid C_i)$ are Gaussian with a shared covariance matrix; useful when classes are (almost) linearly separable
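Below is a minimal NumPy sketch (not from the original slides; the weights are illustrative placeholders) of evaluating the linear discriminants $g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}$ and choosing the class with the largest value.

```python
import numpy as np

# Illustrative weights for K=3 classes and d=2 features (placeholder values).
W = np.array([[ 1.0, -0.5],
              [-0.2,  0.8],
              [ 0.3,  0.3]])          # row i holds w_i
w0 = np.array([0.1, -0.4, 0.0])       # bias terms w_i0

def linear_discriminants(x, W, w0):
    """Return g_i(x) = w_i^T x + w_i0 for every class i."""
    return W @ x + w0

x = np.array([0.5, 1.5])
g = linear_discriminants(x, W, w0)
print("g(x) =", g, "-> choose class", int(np.argmax(g)))
```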
Generalized Linear Model

Quadratic discriminant:
$$g_i(\mathbf{x} \mid \mathbf{W}_i, \mathbf{w}_i, w_{i0}) = \mathbf{x}^T \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$$
Higher-order (product) terms:
$$z_1 = x_1, \quad z_2 = x_2, \quad z_3 = x_1^2, \quad z_4 = x_2^2, \quad z_5 = x_1 x_2$$
Map from x to z using nonlinear basis functions and use a linear discriminant in z-space:
$$g_i(\mathbf{x}) = \sum_{j=1}^{k} w_{ij} \phi_j(\mathbf{x})$$
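A minimal sketch (assumed, with placeholder weights) of mapping a 2-D input to z-space with the quadratic basis from the slide and then applying a linear discriminant in that space:

```python
import numpy as np

def quadratic_basis(x):
    """phi(x) for 2-D input: [x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# Placeholder weights for the discriminant in z-space.
w = np.array([0.5, -1.0, 0.2, 0.2, -0.3])
w0 = 0.1

x = np.array([1.0, 2.0])
z = quadratic_basis(x)
g = w @ z + w0       # linear in z, quadratic in x
print("g(x) =", g)
```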
Two Classes

$$
\begin{aligned}
g(\mathbf{x}) &= g_1(\mathbf{x}) - g_2(\mathbf{x}) \\
&= \left(\mathbf{w}_1^T \mathbf{x} + w_{10}\right) - \left(\mathbf{w}_2^T \mathbf{x} + w_{20}\right) \\
&= \left(\mathbf{w}_1 - \mathbf{w}_2\right)^T \mathbf{x} + \left(w_{10} - w_{20}\right) \\
&= \mathbf{w}^T \mathbf{x} + w_0
\end{aligned}
$$
Choose $C_1$ if $g(\mathbf{x}) > 0$ and $C_2$ otherwise.

Geometry

Multiple Classes

$$g_i(\mathbf{x} \mid \mathbf{w}_i, w_{i0}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}$$
Choose $C_i$ if
$$g_i(\mathbf{x}) = \max_{j=1}^{K} g_j(\mathbf{x})$$
Classes are linearly separable.

Pairwise Separation

$$g_{ij}(\mathbf{x} \mid \mathbf{w}_{ij}, w_{ij0}) = \mathbf{w}_{ij}^T \mathbf{x} + w_{ij0}$$
$$g_{ij}(\mathbf{x}) = \begin{cases} > 0 & \text{if } \mathbf{x} \in C_i \\ \le 0 & \text{if } \mathbf{x} \in C_j \\ \text{don't care} & \text{otherwise} \end{cases}$$
Choose $C_i$ if $\forall j \neq i, \; g_{ij}(\mathbf{x}) > 0$.

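A small sketch (assumed, with placeholder pairwise weights) of classifying by pairwise separation: class $C_i$ is chosen only if every pairwise discriminant against the other classes is positive, otherwise the input is rejected.

```python
import numpy as np

# Placeholder pairwise weights w_ij and biases w_ij0 for K=3 classes (illustrative).
W = {(0, 1): (np.array([ 1.0, 0.0]), 0.0),
     (0, 2): (np.array([ 0.0, 1.0]), 0.0),
     (1, 2): (np.array([-1.0, 1.0]), 0.0)}

def g(i, j, x):
    """g_ij(x) = w_ij^T x + w_ij0, with g_ji = -g_ij."""
    if (i, j) in W:
        w, w0 = W[(i, j)]
        return w @ x + w0
    return -g(j, i, x)

def classify_pairwise(x, K=3):
    """Choose C_i if g_ij(x) > 0 for all j != i; otherwise reject (None)."""
    for i in range(K):
        if all(g(i, j, x) > 0 for j in range(K) if j != i):
            return i
    return None

print(classify_pairwise(np.array([2.0, 3.0])))
```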
From Discriminants to Posteriors

When $p(\mathbf{x} \mid C_i) \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma})$:
$$g_i(\mathbf{x} \mid \mathbf{w}_i, w_{i0}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}$$
$$\mathbf{w}_i = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i + \log P(C_i)$$
With $y = P(C_1 \mid \mathbf{x})$ and $P(C_2 \mid \mathbf{x}) = 1 - y$, choose $C_1$ if
$$y > 0.5 \;\Leftrightarrow\; \frac{y}{1-y} > 1 \;\Leftrightarrow\; \log \frac{y}{1-y} > 0$$
and $C_2$ otherwise.

$$\operatorname{logit} P(C_1 \mid \mathbf{x}) = \log \frac{P(C_1 \mid \mathbf{x})}{1 - P(C_1 \mid \mathbf{x})} = \log \frac{P(C_1 \mid \mathbf{x})}{P(C_2 \mid \mathbf{x})} = \log \frac{p(\mathbf{x} \mid C_1)}{p(\mathbf{x} \mid C_2)} + \log \frac{P(C_1)}{P(C_2)}$$
$$= \log \frac{(2\pi)^{-d/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)\right]}{(2\pi)^{-d/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)\right]} + \log \frac{P(C_1)}{P(C_2)} = \mathbf{w}^T \mathbf{x} + w_0$$
where
$$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad w_0 = -\frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) + \log \frac{P(C_1)}{P(C_2)}$$
The inverse of the logit:
$$\log \frac{P(C_1 \mid \mathbf{x})}{1 - P(C_1 \mid \mathbf{x})} = \mathbf{w}^T \mathbf{x} + w_0 \;\Rightarrow\; P(C_1 \mid \mathbf{x}) = \operatorname{sigmoid}(\mathbf{w}^T \mathbf{x} + w_0) = \frac{1}{1 + \exp\!\left[-(\mathbf{w}^T \mathbf{x} + w_0)\right]}$$
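A minimal NumPy sketch (assumed, not from the slides; all numbers are illustrative) of computing $\mathbf{w}$ and $w_0$ from class means, a shared covariance, and priors, then obtaining $P(C_1 \mid \mathbf{x})$ with the sigmoid:

```python
import numpy as np

# Illustrative class-conditional parameters (shared covariance) and priors.
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.0]])
P1, P2 = 0.6, 0.4

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = -0.5 * (mu1 + mu2) @ Sigma_inv @ (mu1 - mu2) + np.log(P1 / P2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.5, 0.2])
p1 = sigmoid(w @ x + w0)           # P(C1 | x)
print("P(C1|x) =", p1, "-> choose", "C1" if p1 > 0.5 else "C2")
```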
Sigmoid (Logistic) Function

1. Calculate $g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$ and choose $C_1$ if $g(\mathbf{x}) > 0$, or
2. Calculate $y = \operatorname{sigmoid}(\mathbf{w}^T \mathbf{x} + w_0)$ and choose $C_1$ if $y > 0.5$.
Gradient-Descent

$E(\mathbf{w} \mid \mathcal{X})$ is the error with parameters $\mathbf{w}$ on sample $\mathcal{X}$:
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{w} \mid \mathcal{X})$$
Gradient:
$$\nabla_{\mathbf{w}} E = \left[ \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_d} \right]^T$$
Gradient descent: start from a random $\mathbf{w}$ and update $\mathbf{w}$ iteratively in the negative direction of the gradient.

Gradient-Descent

$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}, \; \forall i, \qquad w_i \leftarrow w_i + \Delta w_i$$
(Figure: error curve showing the update from $w^t$ to $w^{t+1}$, decreasing $E(w^t)$ to $E(w^{t+1})$ with step size $\eta$.)
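A minimal sketch of the generic update rule, applied here to an assumed quadratic error function purely for illustration (the error and its gradient are placeholders, not from the slides):

```python
import numpy as np

def gradient_descent(grad_E, w, eta=0.1, n_iters=100):
    """Generic update: w <- w - eta * dE/dw, repeated n_iters times."""
    for _ in range(n_iters):
        w = w - eta * grad_E(w)
    return w

# Example with an assumed quadratic error E(w) = ||w - [1, 2]||^2.
grad = lambda w: 2.0 * (w - np.array([1.0, 2.0]))
print(gradient_descent(grad, w=np.zeros(2)))   # converges toward [1, 2]
```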
Logistic Discrimination

Two classes: assume the log likelihood ratio is linear:
$$\log \frac{p(\mathbf{x} \mid C_1)}{p(\mathbf{x} \mid C_2)} = \mathbf{w}^T \mathbf{x} + w_0^o$$
$$\operatorname{logit} P(C_1 \mid \mathbf{x}) = \log \frac{P(C_1 \mid \mathbf{x})}{1 - P(C_1 \mid \mathbf{x})} = \log \frac{p(\mathbf{x} \mid C_1)}{p(\mathbf{x} \mid C_2)} + \log \frac{P(C_1)}{P(C_2)} = \mathbf{w}^T \mathbf{x} + w_0$$
where $w_0 = w_0^o + \log \dfrac{P(C_1)}{P(C_2)}$.
$$y = \hat{P}(C_1 \mid \mathbf{x}) = \frac{1}{1 + \exp\!\left[-(\mathbf{w}^T \mathbf{x} + w_0)\right]}$$
Training: Two Classes

$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_t, \qquad r^t \mid \mathbf{x}^t \sim \operatorname{Bernoulli}(y^t)$$
$$y^t = P(C_1 \mid \mathbf{x}^t) = \frac{1}{1 + \exp\!\left[-(\mathbf{w}^T \mathbf{x}^t + w_0)\right]}$$
$$l(\mathbf{w}, w_0 \mid \mathcal{X}) = \prod_t (y^t)^{r^t} (1 - y^t)^{1 - r^t}$$
$E = -\log l$:
$$E(\mathbf{w}, w_0 \mid \mathcal{X}) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log(1 - y^t) \right]$$
Training: Gradient-Descent

$$E(\mathbf{w}, w_0 \mid \mathcal{X}) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log(1 - y^t) \right]$$
If $y = \operatorname{sigmoid}(a)$, then $\dfrac{dy}{da} = y(1 - y)$, so
$$\Delta w_j = -\eta \frac{\partial E}{\partial w_j} = \eta \sum_t \left[ \frac{r^t}{y^t} - \frac{1 - r^t}{1 - y^t} \right] y^t (1 - y^t)\, x_j^t = \eta \sum_t (r^t - y^t)\, x_j^t, \quad j = 1, \ldots, d$$
$$\Delta w_0 = -\eta \frac{\partial E}{\partial w_0} = \eta \sum_t (r^t - y^t)$$
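A minimal sketch (assumed; the data is a random illustrative sample, not from the slides) of training the two-class logistic discriminant with exactly these batch gradient-descent updates:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic(X, r, eta=0.01, n_epochs=1000):
    """Batch gradient descent with the updates
    Δw_j = η Σ_t (r^t - y^t) x_j^t and Δw_0 = η Σ_t (r^t - y^t)."""
    w = np.zeros(X.shape[1])
    w0 = 0.0
    for _ in range(n_epochs):
        y = sigmoid(X @ w + w0)
        w += eta * X.T @ (r - y)
        w0 += eta * np.sum(r - y)
    return w, w0

# Illustrative two-class data (placeholder sample).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1, (50, 2)), rng.normal(-1, 1, (50, 2))])
r = np.concatenate([np.ones(50), np.zeros(50)])
w, w0 = train_logistic(X, r)
print("accuracy:", np.mean((sigmoid(X @ w + w0) > 0.5) == r))
```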
(Figure: results after 10, 100, and 1000 training iterations.)
K>2 Classes

$$\mathcal{X} = \{\mathbf{x}^t, \mathbf{r}^t\}_t, \qquad \mathbf{r}^t \mid \mathbf{x}^t \sim \operatorname{Mult}_K(1, \mathbf{y}^t)$$
$$\log \frac{p(\mathbf{x} \mid C_i)}{p(\mathbf{x} \mid C_K)} = \mathbf{w}_i^T \mathbf{x} + w_{i0}^o$$
Softmax:
$$y_i = \hat{P}(C_i \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_i^T \mathbf{x} + w_{i0})}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^T \mathbf{x} + w_{j0})}, \quad i = 1, \ldots, K$$
$$l(\{\mathbf{w}_i, w_{i0}\}_i \mid \mathcal{X}) = \prod_t \prod_i (y_i^t)^{r_i^t}$$
$$E(\{\mathbf{w}_i, w_{i0}\}_i \mid \mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t$$
$$\Delta \mathbf{w}_j = \eta \sum_t (r_j^t - y_j^t)\, \mathbf{x}^t, \qquad \Delta w_{j0} = \eta \sum_t (r_j^t - y_j^t)$$
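A minimal sketch (assumed; random illustrative data with one-hot labels) of the K-class softmax discriminant trained with the batch updates above:

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)      # numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def train_softmax(X, R, eta=0.01, n_epochs=500):
    """Batch gradient descent for K>2 classes:
    Δw_j = η Σ_t (r_j^t - y_j^t) x^t and Δw_j0 = η Σ_t (r_j^t - y_j^t)."""
    K, d = R.shape[1], X.shape[1]
    W = np.zeros((K, d))
    w0 = np.zeros(K)
    for _ in range(n_epochs):
        Y = softmax(X @ W.T + w0)             # N x K posteriors
        W += eta * (R - Y).T @ X
        w0 += eta * (R - Y).sum(axis=0)
    return W, w0

# Illustrative 3-class data with one-hot labels (placeholder sample).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in ([0, 0], [2, 0], [0, 2])])
R = np.kron(np.eye(3), np.ones((30, 1)))
W, w0 = train_softmax(X, R)
Y = softmax(X @ W.T + w0)
print("accuracy:", np.mean(np.argmax(Y, 1) == np.argmax(R, 1)))
```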
Example

Generalizing the Linear Model

Quadratic:
$$\log \frac{p(\mathbf{x} \mid C_i)}{p(\mathbf{x} \mid C_K)} = \mathbf{x}^T \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$$
Sum of basis functions:
$$\log \frac{p(\mathbf{x} \mid C_i)}{p(\mathbf{x} \mid C_K)} = \mathbf{w}_i^T \boldsymbol{\phi}(\mathbf{x}) + w_{i0}$$
where $\boldsymbol{\phi}(\mathbf{x})$ are basis functions:
- Kernels in SVM
- Hidden units in neural networks

Discrimination by Regression

Classes are NOT mutually exclusive and exhaustive:
$$r^t = y^t + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$
$$y^t = \operatorname{sigmoid}(\mathbf{w}^T \mathbf{x}^t + w_0) = \frac{1}{1 + \exp\!\left[-(\mathbf{w}^T \mathbf{x}^t + w_0)\right]}$$
$$l(\mathbf{w}, w_0 \mid \mathcal{X}) = \prod_t \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(r^t - y^t)^2}{2\sigma^2}\right]$$
$$E(\mathbf{w}, w_0 \mid \mathcal{X}) = \frac{1}{2} \sum_t (r^t - y^t)^2$$
$$\Delta \mathbf{w} = \eta \sum_t (r^t - y^t)\, y^t (1 - y^t)\, \mathbf{x}^t$$
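A minimal sketch (assumed, not from the slides) of the regression-based update: gradient descent on the squared error with a sigmoid output, using the $\Delta\mathbf{w}$ rule above (and the analogous update for $w_0$):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_regression_discriminant(X, r, eta=0.5, n_epochs=2000):
    """Gradient descent on E = 1/2 Σ_t (r^t - y^t)^2 with a sigmoid output:
    Δw = η Σ_t (r^t - y^t) y^t (1 - y^t) x^t, and likewise for w0."""
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(n_epochs):
        y = sigmoid(X @ w + w0)
        g = (r - y) * y * (1 - y)        # per-sample gradient factor
        w += eta * X.T @ g
        w0 += eta * g.sum()
    return w, w0
```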
Optimal Separating Hyperplane

$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_t \quad \text{where} \quad r^t = \begin{cases} +1 & \text{if } \mathbf{x}^t \in C_1 \\ -1 & \text{if } \mathbf{x}^t \in C_2 \end{cases}$$
Find $\mathbf{w}$ and $w_0$ such that
$$\mathbf{w}^T \mathbf{x}^t + w_0 \ge +1 \;\text{ for } r^t = +1, \qquad \mathbf{w}^T \mathbf{x}^t + w_0 \le -1 \;\text{ for } r^t = -1,$$
which can be rewritten as
$$r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right) \ge +1$$
(Cortes and Vapnik, 1995; Vapnik, 1995)


Margin

Distance from the discriminant to the closest instances on either side.
Distance of $\mathbf{x}^t$ to the hyperplane:
$$\frac{|\mathbf{w}^T \mathbf{x}^t + w_0|}{\lVert \mathbf{w} \rVert}$$
We require
$$\frac{r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right)}{\lVert \mathbf{w} \rVert} \ge \rho, \; \forall t$$
For a unique solution, fix $\rho \lVert \mathbf{w} \rVert = 1$; then, to maximize the margin:
$$\min \frac{1}{2} \lVert \mathbf{w} \rVert^2 \quad \text{subject to} \quad r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right) \ge +1, \; \forall t$$
$$\min \frac{1}{2} \lVert \mathbf{w} \rVert^2 \quad \text{subject to} \quad r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right) \ge +1, \; \forall t$$
The Lagrangian of the primal problem:
$$L_p = \frac{1}{2} \lVert \mathbf{w} \rVert^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right) - 1 \right]
= \frac{1}{2} \lVert \mathbf{w} \rVert^2 - \sum_{t=1}^{N} \alpha^t r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right) + \sum_{t=1}^{N} \alpha^t$$
Setting the derivatives to zero:
$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t \mathbf{x}^t, \qquad
\frac{\partial L_p}{\partial w_0} = 0 \;\Rightarrow\; \sum_{t=1}^{N} \alpha^t r^t = 0$$

Substituting these back gives the dual, to be maximized with respect to $\alpha^t$:
$$
\begin{aligned}
L_d &= \frac{1}{2} \mathbf{w}^T \mathbf{w} - \mathbf{w}^T \sum_t \alpha^t r^t \mathbf{x}^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t \\
&= -\frac{1}{2} \mathbf{w}^T \mathbf{w} + \sum_t \alpha^t \\
&= -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T \mathbf{x}^s + \sum_t \alpha^t
\end{aligned}
$$
subject to $\sum_t \alpha^t r^t = 0$ and $\alpha^t \ge 0, \; \forall t$.

Most $\alpha^t$ are 0 and only a small number have $\alpha^t > 0$; they are the support vectors.

Soft Margin Hyperplane

Not linearly separable:
$$r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right) \ge 1 - \xi^t$$
Soft error:
$$\sum_t \xi^t$$
The new primal is
$$L_p = \frac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_t \xi^t - \sum_t \alpha^t \left[ r^t \left(\mathbf{w}^T \mathbf{x}^t + w_0\right) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$$
Kernel Machines

Preprocess input $\mathbf{x}$ by basis functions:
$$\mathbf{z} = \boldsymbol{\phi}(\mathbf{x}), \qquad g(\mathbf{z}) = \mathbf{w}^T \mathbf{z}, \qquad g(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$$
The SVM solution:
$$\mathbf{w} = \sum_t \alpha^t r^t \mathbf{z}^t = \sum_t \alpha^t r^t \boldsymbol{\phi}(\mathbf{x}^t)$$
$$g(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) = \sum_t \alpha^t r^t \boldsymbol{\phi}(\mathbf{x}^t)^T \boldsymbol{\phi}(\mathbf{x}) = \sum_t \alpha^t r^t K(\mathbf{x}^t, \mathbf{x})$$
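A minimal sketch (assumed; the support vectors and multipliers are illustrative placeholders) of evaluating the kernelized discriminant $g(\mathbf{x}) = \sum_t \alpha^t r^t K(\mathbf{x}^t, \mathbf{x})$ over the support vectors:

```python
import numpy as np

def svm_discriminant(x, support_X, support_r, support_alpha, kernel):
    """g(x) = sum_t alpha^t r^t K(x^t, x), summed over the support vectors."""
    return sum(a * r * kernel(xt, x)
               for xt, r, a in zip(support_X, support_r, support_alpha))

# Illustrative support vectors and multipliers (placeholder values).
support_X = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
support_r = [+1, -1]
support_alpha = [0.7, 0.7]
linear_kernel = lambda u, v: u @ v

print(svm_discriminant(np.array([0.5, 0.5]), support_X, support_r,
                       support_alpha, linear_kernel))
```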
Kernel Functions

Polynomials of degree $q$:
$$K(\mathbf{x}^t, \mathbf{x}) = \left(\mathbf{x}^T \mathbf{x}^t + 1\right)^q$$
For example, with $q = 2$ and two inputs:
$$K(\mathbf{x}, \mathbf{y}) = \left(\mathbf{x}^T \mathbf{y} + 1\right)^2 = \left(x_1 y_1 + x_2 y_2 + 1\right)^2 = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$$
which corresponds to the basis mapping
$$\boldsymbol{\phi}(\mathbf{x}) = \left[ 1, \sqrt{2}\, x_1, \sqrt{2}\, x_2, \sqrt{2}\, x_1 x_2, x_1^2, x_2^2 \right]^T$$
Radial-basis functions:
$$K(\mathbf{x}^t, \mathbf{x}) = \exp\!\left[ -\frac{\lVert \mathbf{x}^t - \mathbf{x} \rVert^2}{2 s^2} \right]$$
Sigmoidal functions:
$$K(\mathbf{x}^t, \mathbf{x}) = \tanh\!\left(2\, \mathbf{x}^T \mathbf{x}^t + 1\right)$$
(Cherkassky and Mulier, 1998)
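A small sketch (assumed) of the three kernels from the slide; the RBF width `s` and the test vectors are illustrative parameters:

```python
import numpy as np

def poly_kernel(xt, x, q=2):
    """K(x^t, x) = (x^T x^t + 1)^q"""
    return (x @ xt + 1.0) ** q

def rbf_kernel(xt, x, s=1.0):
    """K(x^t, x) = exp(-||x^t - x||^2 / (2 s^2))"""
    return np.exp(-np.sum((xt - x) ** 2) / (2.0 * s ** 2))

def sigmoid_kernel(xt, x):
    """K(x^t, x) = tanh(2 x^T x^t + 1)"""
    return np.tanh(2.0 * (x @ xt) + 1.0)

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(u, v), rbf_kernel(u, v), sigmoid_kernel(u, v))
```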
SVM for Regression

Use a linear model (possibly kernelized):
$$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$$
Use the $\epsilon$-sensitive error function:
$$e_\epsilon\!\left(r^t, f(\mathbf{x}^t)\right) = \begin{cases} 0 & \text{if } |r^t - f(\mathbf{x}^t)| < \epsilon \\ |r^t - f(\mathbf{x}^t)| - \epsilon & \text{otherwise} \end{cases}$$
$$\min \frac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_t \left(\xi_+^t + \xi_-^t\right)$$
subject to
$$r^t - \left(\mathbf{w}^T \mathbf{x}^t + w_0\right) \le \epsilon + \xi_+^t, \qquad
\left(\mathbf{w}^T \mathbf{x}^t + w_0\right) - r^t \le \epsilon + \xi_-^t, \qquad
\xi_+^t, \xi_-^t \ge 0$$
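A minimal sketch (assumed; the targets and predictions are placeholder values) of the $\epsilon$-sensitive error function used above, which is zero inside the $\epsilon$-tube and linear outside it:

```python
import numpy as np

def epsilon_sensitive_error(r, f_x, eps=0.1):
    """e_eps(r, f(x)) = 0 inside the eps-tube, |r - f(x)| - eps outside it."""
    diff = np.abs(r - f_x)
    return np.where(diff < eps, 0.0, diff - eps)

print(epsilon_sensitive_error(np.array([1.0, 1.0, 1.0]),
                              np.array([1.05, 1.3, 0.7]), eps=0.1))
```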
