Logistic Regression
Jia Li
Department of Statistics
The Pennsylvania State University
Email: jiali@[Link]
[Link]
Preserve linear classification boundaries.

By the Bayes rule,

    \hat{G}(x) = \arg\max_k \Pr(G = k \mid X = x) .

The decision boundary between class k and class l is determined by the equation

    \Pr(G = k \mid X = x) = \Pr(G = l \mid X = x) .

Divide both sides by Pr(G = l | X = x) and take the log. The above equation is equivalent to

    \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = 0 .
Since we enforce a linear boundary, we can assume

    \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = a_0^{(k,l)} + \sum_{j=1}^{p} a_j^{(k,l)} x_j .

For logistic regression, there are restrictive relations among the a^{(k,l)} for different pairs (k, l).
Assumptions
    \log \frac{\Pr(G = 1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{10} + \beta_1^T x

    \log \frac{\Pr(G = 2 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{20} + \beta_2^T x

    \vdots

    \log \frac{\Pr(G = K-1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{(K-1)0} + \beta_{K-1}^T x
For any pair (k, l),

    \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \beta_{k0} - \beta_{l0} + (\beta_k - \beta_l)^T x .

Number of parameters: (K - 1)(p + 1).

Denote the entire parameter set by

    \theta = \{\beta_{10}, \beta_1, \beta_{20}, \beta_2, \ldots, \beta_{(K-1)0}, \beta_{K-1}\} .

The log-ratios of the posterior probabilities are called log-odds or logit transformations.
Under the assumptions, the posterior probabilities are given by

    \Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} ,   k = 1, \ldots, K - 1 ,

    \Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} .

For Pr(G = k | X = x) given above, obviously:

- The probabilities sum up to 1: \sum_{k=1}^{K} \Pr(G = k \mid X = x) = 1.
- A simple calculation shows that the assumptions are satisfied.
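As a concrete illustration of the two formulas above, here is a minimal numerical sketch (not part of the original lecture; the function name posterior_probs and the example numbers are my own):

    import numpy as np

    def posterior_probs(x, beta0, B):
        """Pr(G = k | X = x) for k = 1, ..., K under the logistic model.
        beta0: length-(K-1) vector of intercepts beta_k0; B: (K-1) x p matrix of slopes beta_k."""
        scores = np.exp(beta0 + B @ x)           # exp(beta_k0 + beta_k^T x), k = 1, ..., K-1
        denom = 1.0 + scores.sum()               # 1 + sum_l exp(beta_l0 + beta_l^T x)
        return np.append(scores, 1.0) / denom    # last entry is class K

    # K = 3 classes, p = 2 inputs; the coefficients are arbitrary illustrative values.
    probs = posterior_probs(np.array([0.5, -1.0]),
                            np.array([0.2, -0.1]),
                            np.array([[1.0, 0.5], [-0.3, 0.8]]))
    print(probs, probs.sum())   # entries lie in (0, 1) and sum to 1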
Comparison with LR on Indicators
Similarities:

- Both attempt to estimate Pr(G = k | X = x).
- Both have linear classification boundaries.

Difference:

- Linear regression on the indicator matrix approximates Pr(G = k | X = x) by a linear function of x. The approximation is not guaranteed to fall between 0 and 1, nor to sum up to 1.
- In logistic regression, Pr(G = k | X = x) is a nonlinear function of x. It is guaranteed to range from 0 to 1 and to sum up to 1.
Fitting Logistic Regression Models
- Criterion: find the parameters that maximize the conditional likelihood of G given X using the training data.
- Denote p_k(x_i; \theta) = \Pr(G = k \mid X = x_i; \theta).
- Given the first input x_1, the posterior probability of its class being g_1 is \Pr(G = g_1 \mid X = x_1).
- Since the samples in the training data set are independent, the posterior probability of the N samples having classes g_i, i = 1, 2, ..., N, given their inputs x_1, x_2, ..., x_N, is

    \prod_{i=1}^{N} \Pr(G = g_i \mid X = x_i) .
The conditional log-likelihood of the class labels in the training data set is

    L(\theta) = \sum_{i=1}^{N} \log \Pr(G = g_i \mid X = x_i) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta) .
Binary Classification
- For binary classification, if g_i = 1, denote y_i = 1; if g_i = 2, denote y_i = 0.
- Let p_1(x; \theta) = p(x; \theta); then

    p_2(x; \theta) = 1 - p_1(x; \theta) = 1 - p(x; \theta) .

- Since K = 2, the parameter set is \theta = \{\beta_{10}, \beta_1\}.
- We denote \beta = (\beta_{10}, \beta_1)^T.
If y_i = 1, i.e., g_i = 1,

    \log p_{g_i}(x; \beta) = \log p_1(x; \beta) = 1 \cdot \log p(x; \beta) = y_i \log p(x; \beta) .

If y_i = 0, i.e., g_i = 2,

    \log p_{g_i}(x; \beta) = \log p_2(x; \beta) = 1 \cdot \log(1 - p(x; \beta)) = (1 - y_i) \log(1 - p(x; \beta)) .

Since either y_i = 0 or 1 - y_i = 0, we have

    \log p_{g_i}(x; \beta) = y_i \log p(x; \beta) + (1 - y_i) \log(1 - p(x; \beta)) .
The conditional log-likelihood is

    L(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta)
             = \sum_{i=1}^{N} \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log(1 - p(x_i; \beta)) \right] .

- There are p + 1 parameters in \beta = (\beta_{10}, \beta_1)^T.
- Assume a column vector form for \beta:

    \beta = (\beta_{10}, \beta_{11}, \beta_{12}, \ldots, \beta_{1,p})^T .
Here we add the constant term 1 to x to accommodate the intercept:

    x = (1, x_{\cdot 1}, x_{\cdot 2}, \ldots, x_{\cdot p})^T .
By the assumption of the logistic regression model,

    p(x; \beta) = \Pr(G = 1 \mid X = x) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)} ,

    1 - p(x; \beta) = \Pr(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta^T x)} .

Substituting the above into L(\beta) gives

    L(\beta) = \sum_{i=1}^{N} \left[ y_i \beta^T x_i - \log(1 + e^{\beta^T x_i}) \right] .
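For readers who want to check the simplified expression numerically, here is a minimal sketch (not from the lecture; the function name log_likelihood is my own):

    import numpy as np

    def log_likelihood(beta, X, y):
        """L(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ].
        X: N x (p+1) matrix whose first column is all ones; y: N-vector of 0/1 labels."""
        eta = X @ beta                                   # beta^T x_i for every sample
        return np.sum(y * eta - np.log1p(np.exp(eta)))   # log1p(e^eta) = log(1 + e^eta)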
To maximize L(\beta), we set the first-order partial derivatives of L(\beta) to zero:

    \frac{\partial L(\beta)}{\partial \beta_{1j}}
      = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} \frac{x_{ij} e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}
      = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} p(x_i; \beta) x_{ij}
      = \sum_{i=1}^{N} x_{ij} (y_i - p(x_i; \beta))

for all j = 0, 1, ..., p.
In matrix form, we write

    \frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i (y_i - p(x_i; \beta)) .

To solve the set of p + 1 nonlinear equations \partial L(\beta) / \partial \beta_{1j} = 0, j = 0, 1, ..., p, we use the Newton-Raphson algorithm.

The Newton-Raphson algorithm requires the second derivatives, i.e., the Hessian matrix:

    \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = - \sum_{i=1}^{N} x_i x_i^T \, p(x_i; \beta) (1 - p(x_i; \beta)) .
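The gradient and Hessian above translate directly into a few lines of code. A minimal sketch (illustrative only; the function name gradient_and_hessian is my own):

    import numpy as np

    def gradient_and_hessian(beta, X, y):
        """X: N x (p+1) with a leading column of ones; y: N-vector of 0/1 labels."""
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))     # p(x_i; beta) for every sample
        grad = X.T @ (y - p)                      # sum_i x_i (y_i - p(x_i; beta))
        w = p * (1.0 - p)                         # p(x_i; beta)(1 - p(x_i; beta))
        hess = -(X.T * w) @ X                     # - sum_i x_i x_i^T p (1 - p)
        return grad, hess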
The element on the jth row and nth column of the Hessian (counting from 0) is

    \frac{\partial^2 L(\beta)}{\partial \beta_{1j} \, \partial \beta_{1n}}
      = - \sum_{i=1}^{N} \frac{(1 + e^{\beta^T x_i}) e^{\beta^T x_i} x_{ij} x_{in} - (e^{\beta^T x_i})^2 x_{ij} x_{in}}{(1 + e^{\beta^T x_i})^2}
      = - \sum_{i=1}^{N} \left[ x_{ij} x_{in} p(x_i; \beta) - x_{ij} x_{in} p(x_i; \beta)^2 \right]
      = - \sum_{i=1}^{N} x_{ij} x_{in} \, p(x_i; \beta) (1 - p(x_i; \beta)) .
Starting with \beta^{old}, a single Newton-Raphson update is

    \beta^{new} = \beta^{old} - \left( \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial L(\beta)}{\partial \beta} ,

where the derivatives are evaluated at \beta^{old}.
The iteration can be expressed compactly in matrix form.

- Let y be the column vector of the y_i.
- Let X be the N x (p + 1) input matrix.
- Let p be the N-vector of fitted probabilities with ith element p(x_i; \beta^{old}).
- Let W be the N x N diagonal matrix of weights with ith diagonal element p(x_i; \beta^{old})(1 - p(x_i; \beta^{old})).

Then

    \frac{\partial L(\beta)}{\partial \beta} = X^T (y - p) ,
    \qquad
    \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = - X^T W X .
The Newton-Raphson step is

    \beta^{new} = \beta^{old} + (X^T W X)^{-1} X^T (y - p)
                = (X^T W X)^{-1} X^T W \left( X \beta^{old} + W^{-1} (y - p) \right)
                = (X^T W X)^{-1} X^T W z ,

where z \triangleq X \beta^{old} + W^{-1} (y - p).

If z is viewed as a response and X as the input matrix, \beta^{new} is the solution to a weighted least squares problem:

    \beta^{new} \leftarrow \arg\min_{\beta} (z - X\beta)^T W (z - X\beta) .

Recall that linear regression by least squares solves

    \arg\min_{\beta} (z - X\beta)^T (z - X\beta) .

- z is referred to as the adjusted response.
- The algorithm is referred to as iteratively reweighted least squares, or IRLS.
Pseudo Code
1. \beta \leftarrow 0.
2. Compute y by setting its elements to

       y_i = 1 if g_i = 1,   y_i = 0 if g_i = 2,   i = 1, 2, ..., N.

3. Compute p by setting its elements to

       p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} ,   i = 1, 2, ..., N.

4. Compute the diagonal matrix W. The ith diagonal element is p(x_i; \beta)(1 - p(x_i; \beta)), i = 1, 2, ..., N.
5. z \leftarrow X\beta + W^{-1}(y - p).
6. \beta \leftarrow (X^T W X)^{-1} X^T W z.
7. If the stopping criterion is met, stop; otherwise go back to step 3.
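A minimal runnable sketch of this pseudo code (illustrative only; the function name irls_binary and the arguments max_iter and tol are my own, not from the lecture):

    import numpy as np

    def irls_binary(X, g, max_iter=50, tol=1e-8):
        """X: N x (p+1) with a leading column of ones; g: N-vector of class labels 1 or 2."""
        y = (g == 1).astype(float)                    # step 2: indicator response
        beta = np.zeros(X.shape[1])                   # step 1: beta <- 0
        for _ in range(max_iter):
            p = 1.0 / (1.0 + np.exp(-(X @ beta)))     # step 3: fitted probabilities
            w = p * (1.0 - p)                         # step 4: diagonal of W
            z = X @ beta + (y - p) / w                # step 5: adjusted response
            beta_new = np.linalg.solve((X.T * w) @ X, (X.T * w) @ z)   # step 6: weighted LS
            if np.max(np.abs(beta_new - beta)) < tol: # step 7: stopping criterion
                return beta_new
            beta = beta_new
        return beta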
Computational Efficiency
- Since W is an N x N diagonal matrix, direct matrix operations with it may be very inefficient.
- A modified pseudo code is provided next.
1. \beta \leftarrow 0.
2. Compute y by setting its elements to

       y_i = 1 if g_i = 1,   y_i = 0 if g_i = 2,   i = 1, 2, ..., N.

3. Compute p by setting its elements to

       p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} ,   i = 1, 2, ..., N.

4. Compute the N x (p + 1) matrix \tilde{X} by multiplying the ith row of X by p(x_i; \beta)(1 - p(x_i; \beta)), i = 1, 2, ..., N:

       X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} ,
       \qquad
       \tilde{X} = \begin{pmatrix} p(x_1; \beta)(1 - p(x_1; \beta)) \, x_1^T \\ p(x_2; \beta)(1 - p(x_2; \beta)) \, x_2^T \\ \vdots \\ p(x_N; \beta)(1 - p(x_N; \beta)) \, x_N^T \end{pmatrix} .

5. \beta \leftarrow \beta + (X^T \tilde{X})^{-1} X^T (y - p).
6. If the stopping criterion is met, stop; otherwise go back to step 3.
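In code, the modification amounts to rescaling the rows of X instead of forming the N x N matrix W. A minimal sketch of one update (illustrative; the function name irls_step_efficient is my own):

    import numpy as np

    def irls_step_efficient(beta, X, y):
        """One update beta <- beta + (X^T Xtilde)^{-1} X^T (y - p); X has a leading column of ones."""
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        Xtilde = X * (p * (1.0 - p))[:, None]         # row i of X scaled by p_i (1 - p_i)
        return beta + np.linalg.solve(X.T @ Xtilde, X.T @ (y - p))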
Example
Diabetes data set
- The input X is two-dimensional. X_1 and X_2 are the two principal components of the original 8 variables.
- Class 1: without diabetes; Class 2: with diabetes.
- Applying logistic regression, we obtain

    \hat{\beta} = (0.7679, -0.6816, -0.3664)^T .
The posterior probabilities are

    \Pr(G = 1 \mid X = x) = \frac{e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}}{1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}} ,

    \Pr(G = 2 \mid X = x) = \frac{1}{1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}} .

The classification rule is

    \hat{G}(x) = \begin{cases} 1 & 0.7679 - 0.6816 X_1 - 0.3664 X_2 \ge 0 \\ 2 & 0.7679 - 0.6816 X_1 - 0.3664 X_2 < 0 \end{cases}
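The fitted rule is easy to apply directly. A minimal sketch (illustrative; the function name classify_diabetes is my own, and the coefficient signs are taken from the fitted decision function above):

    def classify_diabetes(x1, x2):
        """Return class 1 (no diabetes) or 2 (diabetes) from the two principal components."""
        score = 0.7679 - 0.6816 * x1 - 0.3664 * x2
        return 1 if score >= 0 else 2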
[Figure] Solid line: decision boundary obtained by logistic regression. Dashed line: decision boundary obtained by LDA.

- Classification error rate within the training data set: 28.12%.
- Sensitivity: 45.9%.
- Specificity: 85.8%.
Multiclass Case (K ≥ 3)

When K ≥ 3, \beta is a (K - 1)(p + 1)-vector:

    \beta = \begin{pmatrix} \beta_{10} \\ \beta_1 \\ \beta_{20} \\ \beta_2 \\ \vdots \\ \beta_{(K-1)0} \\ \beta_{K-1} \end{pmatrix}
          = \begin{pmatrix} \beta_{10} \\ \beta_{11} \\ \vdots \\ \beta_{1p} \\ \beta_{20} \\ \vdots \\ \beta_{2p} \\ \vdots \\ \beta_{(K-1)0} \\ \vdots \\ \beta_{(K-1)p} \end{pmatrix} .
Let \beta_l = \begin{pmatrix} \beta_{l0} \\ \beta_l \end{pmatrix}, with each x_i augmented by a leading 1 as before.

The likelihood function becomes

    L(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta)
             = \sum_{i=1}^{N} \log \frac{e^{\beta_{g_i}^T x_i}}{1 + \sum_{l=1}^{K-1} e^{\beta_l^T x_i}}
             = \sum_{i=1}^{N} \left[ \beta_{g_i}^T x_i - \log \left( 1 + \sum_{l=1}^{K-1} e^{\beta_l^T x_i} \right) \right] ,

with the convention \beta_K = 0, so that \beta_{g_i}^T x_i = 0 whenever g_i = K.
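A minimal numerical sketch of this log-likelihood (illustrative; the function name multiclass_log_likelihood is my own):

    import numpy as np

    def multiclass_log_likelihood(B, X, g, K):
        """B: (K-1) x (p+1) matrix whose lth row is (beta_l0, beta_l);
        X: N x (p+1) with a leading column of ones; g: integer labels in {1, ..., K}."""
        eta = X @ B.T                                   # eta[i, l] = beta_{l+1}^T x_i
        log_denom = np.log1p(np.exp(eta).sum(axis=1))   # log(1 + sum_l exp(beta_l^T x_i))
        total = 0.0
        for i, gi in enumerate(g):
            num = eta[i, gi - 1] if gi < K else 0.0     # convention beta_K = 0
            total += num - log_denom[i]
        return total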
Note: the indicator function I(·) equals 1 when its argument is true and 0 otherwise.

First-order derivatives:

    \frac{\partial L(\beta)}{\partial \beta_{kj}}
      = \sum_{i=1}^{N} \left[ I(g_i = k) x_{ij} - \frac{e^{\beta_k^T x_i} x_{ij}}{1 + \sum_{l=1}^{K-1} e^{\beta_l^T x_i}} \right]
      = \sum_{i=1}^{N} x_{ij} \left( I(g_i = k) - p_k(x_i; \beta) \right) .
Second-order derivatives:

    \frac{\partial^2 L(\beta)}{\partial \beta_{kj} \, \partial \beta_{mn}}
      = \sum_{i=1}^{N} x_{ij} \, \frac{1}{\left( 1 + \sum_{l=1}^{K-1} e^{\beta_l^T x_i} \right)^2}
        \left[ - e^{\beta_k^T x_i} I(k = m) x_{in} \left( 1 + \sum_{l=1}^{K-1} e^{\beta_l^T x_i} \right) + e^{\beta_k^T x_i} e^{\beta_m^T x_i} x_{in} \right]
      = \sum_{i=1}^{N} x_{ij} x_{in} \left( - p_k(x_i; \beta) I(k = m) + p_k(x_i; \beta) p_m(x_i; \beta) \right)
      = - \sum_{i=1}^{N} x_{ij} x_{in} \, p_k(x_i; \beta) \left[ I(k = m) - p_m(x_i; \beta) \right] .
Matrix form.

- y is the concatenated indicator vector of dimension N(K - 1):

    y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{K-1} \end{pmatrix} ,
    \qquad
    y_k = \begin{pmatrix} I(g_1 = k) \\ I(g_2 = k) \\ \vdots \\ I(g_N = k) \end{pmatrix} ,
    \quad 1 ≤ k ≤ K - 1 .

- p is the concatenated vector of fitted probabilities, also of dimension N(K - 1):

    p = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_{K-1} \end{pmatrix} ,
    \qquad
    p_k = \begin{pmatrix} p_k(x_1; \beta) \\ p_k(x_2; \beta) \\ \vdots \\ p_k(x_N; \beta) \end{pmatrix} ,
    \quad 1 ≤ k ≤ K - 1 .
- \tilde{X} is an N(K - 1) x (p + 1)(K - 1) block-diagonal matrix:

    \tilde{X} = \begin{pmatrix} X & 0 & \cdots & 0 \\ 0 & X & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & X \end{pmatrix} .
- The matrix W is an N(K - 1) x N(K - 1) square matrix:

    W = \begin{pmatrix} W_{11} & W_{12} & \cdots & W_{1(K-1)} \\ W_{21} & W_{22} & \cdots & W_{2(K-1)} \\ \vdots & & & \vdots \\ W_{(K-1),1} & W_{(K-1),2} & \cdots & W_{(K-1),(K-1)} \end{pmatrix} .

- Each submatrix W_{km}, 1 ≤ k, m ≤ K - 1, is an N x N diagonal matrix.
- When k = m, the ith diagonal element of W_{kk} is p_k(x_i; \beta^{old})(1 - p_k(x_i; \beta^{old})).
- When k ≠ m, the ith diagonal element of W_{km} is -p_k(x_i; \beta^{old}) p_m(x_i; \beta^{old}).
Similarly to the binary classification case,

    \frac{\partial L(\beta)}{\partial \beta} = \tilde{X}^T (y - p) ,
    \qquad
    \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = - \tilde{X}^T W \tilde{X} .

The formula for updating \beta^{new} in the binary classification case carries over to the multiclass case:

    \beta^{new} = (\tilde{X}^T W \tilde{X})^{-1} \tilde{X}^T W z ,
    \qquad
    z \triangleq \tilde{X} \beta^{old} + W^{-1} (y - p) ,

or simply

    \beta^{new} = \beta^{old} + (\tilde{X}^T W \tilde{X})^{-1} \tilde{X}^T (y - p) .
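A minimal sketch of one multiclass update using the simplified formula above (illustrative only; the function name multiclass_newton_step is my own, and the full W matrix is built explicitly for clarity rather than efficiency):

    import numpy as np

    def multiclass_newton_step(beta, X, g, K):
        """beta: (K-1)(p+1)-vector; X: N x (p+1) with a leading column of ones;
        g: integer array of labels in {1, ..., K}."""
        N, d = X.shape
        B = beta.reshape(K - 1, d)                        # row l holds (beta_{l0}, beta_l)
        E = np.exp(X @ B.T)                               # N x (K-1)
        P = E / (1.0 + E.sum(axis=1, keepdims=True))      # p_k(x_i; beta)
        Y = (g[:, None] == np.arange(1, K)[None, :]).astype(float)
        y = Y.T.reshape(-1)                               # concatenated y_1, ..., y_{K-1}
        p = P.T.reshape(-1)                               # concatenated p_1, ..., p_{K-1}
        Xt = np.kron(np.eye(K - 1), X)                    # block-diagonal Xtilde
        W = np.zeros((N * (K - 1), N * (K - 1)))
        for k in range(K - 1):
            for m in range(K - 1):
                w_km = P[:, k] * ((1.0 if k == m else 0.0) - P[:, m])   # p_k (I(k=m) - p_m)
                W[k*N:(k+1)*N, m*N:(m+1)*N] = np.diag(w_km)
        return beta + np.linalg.solve(Xt.T @ W @ Xt, Xt.T @ (y - p))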
Computation Issues
- Initialization: one option is to use \beta = 0.
- Convergence is not guaranteed, but it usually occurs.
- Usually, the log-likelihood increases after each iteration, but overshooting can occur.
- In the rare cases where the log-likelihood decreases, cut the step size by half.
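The step-halving safeguard mentioned above can be wrapped around any Newton update. A minimal sketch (illustrative; the function name damped_update and the cap max_halvings are my own):

    def damped_update(beta_old, beta_new, loglik, max_halvings=20):
        """Halve the Newton step until the log-likelihood no longer decreases."""
        step = beta_new - beta_old
        for _ in range(max_halvings):
            if loglik(beta_old + step) >= loglik(beta_old):
                break
            step = step / 2.0
        return beta_old + step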
Connection with LDA
- Under the model of LDA,

    \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)}
      = \log \frac{\pi_k}{\pi_K} - \frac{1}{2} (\mu_k + \mu_K)^T \Sigma^{-1} (\mu_k - \mu_K) + x^T \Sigma^{-1} (\mu_k - \mu_K)
      = a_{k0} + a_k^T x .

- The model of LDA satisfies the assumption of the linear logistic model.
- The linear logistic model only specifies the conditional distribution Pr(G = k | X = x). No assumption is made about Pr(X).
- The LDA model specifies the joint distribution of X and G. Pr(X) is a mixture of Gaussians:

    \Pr(X) = \sum_{k=1}^{K} \pi_k \, \phi(X; \mu_k, \Sigma) ,

  where \phi is the Gaussian density function.
- Linear logistic regression maximizes the conditional likelihood of G given X: Pr(G = k | X = x).
- LDA maximizes the joint likelihood of G and X: Pr(X = x, G = k).
- If the additional assumption made by LDA is appropriate, LDA tends to estimate the parameters more efficiently by using more information about the data.
- Samples without class labels can be used under the model of LDA.
- LDA is not robust to gross outliers.
- As logistic regression relies on fewer assumptions, it seems to be more robust.
- In practice, logistic regression and LDA often give similar results.
Simulation
- Assume the input X is 1-D.
- The two classes have equal priors, and the class-conditional densities of X are shifted versions of each other.
- Each conditional density is a mixture of two normals:
  - Class 1 (red): 0.6 N(-2, 1/4) + 0.4 N(0, 1).
  - Class 2 (blue): 0.6 N(0, 1/4) + 0.4 N(2, 1).
- The class-conditional densities are shown below.

[Figure: the two class-conditional densities.]
LDA Result
- Training data set: 2000 samples for each class.
- Test data set: 1000 samples for each class.
- The estimates by LDA are \hat{\mu}_1 = -1.1948, \hat{\mu}_2 = 0.8224, \hat{\sigma}^2 = 1.5268. The boundary value between the two classes is (\hat{\mu}_1 + \hat{\mu}_2)/2 = -0.1862.
- The classification error rate on the test data is 0.2315.
- Based on the true distribution, the Bayes (optimal) boundary value between the two classes is -0.7750 and the error rate is 0.1765.
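For readers who want to reproduce a data set of this kind, here is a minimal sketch (illustrative only; the function name sample_class, the random seed, and the use of numpy's default generator are my own choices):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_class(n, means, sds, weights):
        """Draw n points from a two-component normal mixture."""
        comp = rng.choice(len(weights), size=n, p=weights)
        return rng.normal(np.array(means)[comp], np.array(sds)[comp])

    # Variances 1/4 and 1 correspond to standard deviations 0.5 and 1.
    x1 = sample_class(2000, means=[-2.0, 0.0], sds=[0.5, 1.0], weights=[0.6, 0.4])  # class 1
    x2 = sample_class(2000, means=[0.0, 2.0], sds=[0.5, 1.0], weights=[0.6, 0.4])   # class 2

    # With equal priors and a common variance, the LDA boundary is the midpoint of the class means.
    print("LDA boundary estimate:", (x1.mean() + x2.mean()) / 2)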
Logistic Regression Result

- Linear logistic regression obtains \hat{\beta} = (-0.3288, -1.3275)^T.
- The boundary value satisfies -0.3288 - 1.3275 X = 0, hence equals -0.2477.
- The error rate on the test data set is 0.2205.
- The estimated posterior probability is

    \Pr(G = 1 \mid X = x) = \frac{e^{-0.3288 - 1.3275 x}}{1 + e^{-0.3288 - 1.3275 x}} .
The estimated posterior probability Pr(G = 1 | X = x) and its true value based on the true distribution are compared in the graph below.

[Figure: estimated versus true posterior probability Pr(G = 1 | X = x).]
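As a closing sketch (illustrative only; the helper names are my own, and the fitted coefficients are the ones quoted above), the two curves in the comparison can be computed as follows: with equal priors, the true posterior is Pr(G = 1 | X = x) = f_1(x) / (f_1(x) + f_2(x)), where f_1 and f_2 are the class-conditional mixture densities of the simulation.

    import numpy as np

    def normal_pdf(x, mean, sd):
        return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

    def true_posterior_class1(x):
        f1 = 0.6 * normal_pdf(x, -2.0, 0.5) + 0.4 * normal_pdf(x, 0.0, 1.0)   # class 1 density
        f2 = 0.6 * normal_pdf(x, 0.0, 0.5) + 0.4 * normal_pdf(x, 2.0, 1.0)    # class 2 density
        return f1 / (f1 + f2)

    def fitted_posterior_class1(x):
        return 1.0 / (1.0 + np.exp(-(-0.3288 - 1.3275 * x)))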