cheatsheet 2

This cheat sheet covers regression techniques (logistic, ridge, and lasso regression) with their mathematical formulations and optimization by gradient, subgradient, and stochastic gradient descent; convexity; the bias-variance trade-off and model/feature selection; linear algebra, SVD, and PCA; kernels and kernel methods; clustering; neural networks; tree ensembles (bagging, random forests, boosting); nearest neighbors; and the bootstrap.


Logistic Regression

P(Y = 1 | x, w) = σ(w^T x) = 1 / (1 + exp(-w^T x));  P(Y = 0 | x, w) = 1 / (1 + exp(w^T x))
Features can be continuous or discrete. σ(w^T x) itself is not convex in w, but the logistic loss below is.
Gradient descent: w_{k+1} = w_k - η ∇f(w_k)

Ridge Regression (gradient step)

∇f(w_t) = -X^T (y - X w_t) + λ w_t    (dropping constant factors of 2)
w_{t+1} = w_t - η [ -X^T (y - X w_t) + λ w_t ]
The +λ w_t term shrinks the weights at every step (shrinkage built into the gradient); w_0 is not regularized.

Lasso (subgradient step)

∂f(w_t) = -X^T (y - X w_t) + λ sign(w_t)
w_{t+1} = w_t - η [ -X^T (y - X w_t) + λ sign(w_t) ]
The λ sign(w_t) term does not scale with w_t, so lasso can drive coefficients exactly to 0; ridge only shrinks them toward 0.

Stochastic Gradient Descent

w_{t+1} = w_t - η ∇ℓ_t(w_t), where ℓ_t uses a single training example (or a small batch).
E[∇ℓ_t(w)] = ∇L(w): the stochastic gradient is an unbiased estimate of the full gradient.
If ||w_k - w*|| ≤ R and ||∇ℓ(w)|| ≤ G for all k, then E[L(w̄_T) - L(w*)] ≤ RG / √T.
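A minimal numpy sketch of the ridge and lasso update rules above on synthetic data (the data, step size η, and λ are assumed for illustration; the factor of 2 from the squared loss is kept explicit in the code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

def ridge_grad(w, lam):
    # gradient of ||y - Xw||^2 + lam * ||w||^2
    return -2 * X.T @ (y - X @ w) + 2 * lam * w

def lasso_subgrad(w, lam):
    # a subgradient of ||y - Xw||^2 + lam * ||w||_1
    return -2 * X.T @ (y - X @ w) + lam * np.sign(w)

eta, lam = 1e-3, 5.0
w_ridge = np.zeros(5)
w_lasso = np.zeros(5)
for _ in range(5000):
    w_ridge -= eta * ridge_grad(w_ridge, lam)
    w_lasso -= eta * lasso_subgrad(w_lasso, lam)

print("ridge:", np.round(w_ridge, 3))   # shrunk, but typically no exact zeros
print("lasso:", np.round(w_lasso, 3))   # irrelevant coefficients driven (near) zero
```

Running it shows the ridge coefficients shrunk but nonzero, while the lasso coefficients on the irrelevant features sit at (or oscillate around) zero.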


Odds, Log-Odds, and the MLE

Odds: P(Y=1 | x) / P(Y=0 | x) = exp(w^T x + w_0);  log-odds = w^T x + w_0.
Decision rule: ŷ = argmax_y P(Y = y | x), i.e. predict 1 when w^T x + w_0 > 0.
w_MLE = argmax_w Π_i P(y_i | x_i; w) = argmin_w Σ_i log(1 + exp(-y_i x_i^T w))    (labels y_i ∈ {-1, +1})
Low loss ⇔ sign(y_i) = sign(x_i^T w).
J(w) is convex, so it is easy to optimize, but there is no closed-form solution.
When the data are linearly separable, ||w|| → ∞ (overfit): increasing ||w|| without changing sign(x_i^T w) keeps decreasing the loss without stopping.

GD vs. SGD

- SGD: efficient (uses a small sample k << n per step), needs less memory, is faster, and works well on large data.
- GD: uses the entire data set per step.
- Noise has a higher impact on the stability of SGD, and SGD is more sensitive to the learning rate.
- Both gradient estimates are unbiased, and both scale linearly with the number of data points.
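A small numpy sketch of the MLE objective above minimized by plain gradient descent (synthetic data, labels in {-1, +1}; the learning rate and iteration count are assumed values):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = np.where(X @ w_true + 0.5 * rng.normal(size=n) > 0, 1.0, -1.0)

def loss(w):
    # J(w) = sum_i log(1 + exp(-y_i x_i^T w))
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

def grad(w):
    # dJ/dw = -sum_i y_i x_i * sigma(-y_i x_i^T w)
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))     # sigma(-y_i x_i^T w)
    return -(X.T @ (y * s))

w = np.zeros(d)
eta = 0.01
for _ in range(2000):
    w -= eta * grad(w)

print("loss:", round(loss(w), 3),
      "train acc:", np.mean(np.sign(X @ w) == y))
```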

Prediction Pitfalls

- Normalize the features: |β_j| indicates importance only if the features are on a common scale (✗ if not scaled).
- Correlated features distort individual coefficients (a useful feature's coefficient can end up near 0); eliminate correlation before interpreting them.
- Correlation ≠ causation: a coefficient says what change to expect in the prediction when a feature changes, not what the feature causes.
- Check for (and drop) features with 0 variance.

Convexity

- A set K is convex if for all x, y ∈ K and λ ∈ [0, 1], (1-λ)x + λy ∈ K.
- f is convex if its epigraph {(x, t) : f(x) ≤ t} is a convex set; equivalently, f((1-λ)x + λy) ≤ (1-λ) f(x) + λ f(y).
- If f is differentiable everywhere: f is convex iff f(y) ≥ f(x) + ∇f(x)^T (y - x) for all x, y ∈ dom(f).
- If f is twice differentiable: f is convex iff ∇²f(x) ⪰ 0 for all x ∈ dom(f).
- A convex function need not have a minimum, but all local minima are global minima.
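A quick numerical illustration of these conditions, using f(x) = log(1 + e^x) (convex, and the building block of the logistic loss); the test points are arbitrary:

```python
import numpy as np

f = lambda x: np.log1p(np.exp(x))
grad_f = lambda x: 1.0 / (1.0 + np.exp(-x))   # f'(x) = sigma(x)

rng = np.random.default_rng(2)
xs, ys = rng.normal(scale=3, size=50), rng.normal(scale=3, size=50)

# first-order condition: f(y) >= f(x) + f'(x) (y - x) for all x, y
print(np.all(f(ys) >= f(xs) + grad_f(xs) * (ys - xs)))                   # True

# definition: f((1-t)x + t*y) <= (1-t) f(x) + t f(y) for t in [0, 1]
t = 0.3
print(np.all(f((1 - t) * xs + t * ys) <= (1 - t) * f(xs) + t * f(ys)))   # True
```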
Lasso Regression

ŵ_lasso = argmin_w Σ_i (y_i - x_i^T w)² + λ ||w||_1

Ridge regularizer r(w) = ||w||²₂:
- convex and smooth ⇒ gradient descent is efficient to optimize
- dense solution; the L2 ball is "smooth"

Lasso regularizer r(w) = ||w||_1:
- convex but non-smooth ⇒ use subgradient methods for the non-smooth part (slower than gradient descent on smooth objectives)
- sparse solution; the L1 ball is "pointy"
- L1 encourages coefficients to be exactly 0 ⇒ feature selection (keep the non-zero ones)

Sub-gradient g at x: any g with f(y) ≥ f(x) + g^T (y - x) for all y.
As λ increases, the feasible set of the equivalent constrained problem shrinks.
Convergence guarantees hold only when f is convex.

Model Selection & Feature Selection: use lasso to select the features with non-zero coefficients, then retrain on that sparse model with λ = 0.
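A minimal sketch contrasting the two solutions on synthetic data: ridge via its closed form, lasso via proximal gradient descent (ISTA, i.e. a gradient step on the squared loss followed by soft-thresholding). ISTA is one standard way to handle the non-smooth L1 term and is not taken from the notes; the data and λ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 8
X = rng.normal(size=(n, d))
w_true = np.array([3.0, -2.0, 0, 0, 0, 0, 1.0, 0])
y = X @ w_true + 0.1 * rng.normal(size=n)
lam = 20.0

# ridge: closed form (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# lasso: ISTA -- gradient step on ||y - Xw||^2, then soft-threshold
step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1/L for the smooth part
w_lasso = np.zeros(d)
for _ in range(2000):
    w_lasso = w_lasso - step * (-2 * X.T @ (y - X @ w_lasso))
    w_lasso = np.sign(w_lasso) * np.maximum(np.abs(w_lasso) - step * lam, 0.0)

print("ridge:", np.round(w_ridge, 2))   # all coefficients shrunk, none exactly 0
print("lasso:", np.round(w_lasso, 2))   # exact zeros on the irrelevant features
```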

Bias-Variance Trade-off

Minimize E_{X,Y}[(Y - ŷ(X))²]; this is equivalent to maximizing the likelihood assuming Gaussian noise.
The optimal predictor is y*(x) = E_{Y|X}[Y | X = x].

E_{Y|X,D}[(Y - f̂_D(x))² | X = x]
  = E_{Y|X}[(Y - y*(x))² | X = x]          irreducible error (label noise)
  + (y*(x) - E_D[f̂_D(x)])²                 bias², from too simple a model
  + E_D[(f̂_D(x) - E_D[f̂_D(x)])²]           variance, from not enough data

Linear Regression: if (X^T X)^{-1} exists, w_MLE = (X^T X)^{-1} X^T y. Demean X with the training mean; the intercept is then b_LS = (1/n) Σ_i y_i.

(The notes sketch learning curves here: training error vs. test error as the polynomial degree p grows.)
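A small simulation of this decomposition (assumptions: a cubic ground truth, polynomial fits of two degrees, Gaussian label noise; none of these specifics are from the notes):

```python
import numpy as np

rng = np.random.default_rng(4)
f_true = lambda x: x - 0.7 * x**3
x_test, sigma, n_train, n_rep = 0.8, 0.3, 30, 2000

for degree in (1, 7):                     # too simple vs. very flexible
    preds = np.empty(n_rep)
    for r in range(n_rep):
        x = rng.uniform(-1, 1, n_train)
        y = f_true(x) + sigma * rng.normal(size=n_train)
        coef = np.polyfit(x, y, degree)   # least-squares polynomial fit
        preds[r] = np.polyval(coef, x_test)
    bias2 = (f_true(x_test) - preds.mean()) ** 2
    var = preds.var()
    print(f"degree {degree}: bias^2={bias2:.4f}  variance={var:.4f}  "
          f"irreducible={sigma**2:.4f}")
```

The simple model shows large bias² and small variance; the flexible one the reverse, while the irreducible σ² term is untouched by either.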
Ridge Regression

Ridge never makes a coefficient exactly zero.
Least squares: if X^T X is not invertible, the solution is not unique and is unstable.

ŵ_ridge = argmin_w Σ_i (y_i - x_i^T w)² + λ ||w||²₂ = argmin_w ||y - Xw||² + λ ||w||² = (X^T X + λI)^{-1} X^T y

Does (X^T X + λI)^{-1} always exist (for λ > 0)?
- X^T X is PSD, so all its eigenvalues are ≥ 0 (def. of PSD): z^T X^T X z = ||Xz||² ≥ 0.
- Eigen pair (v, σ): X^T X v = σ v, so v^T X^T X v = σ for unit v. Let V = [v₁, ..., v_d] be orthonormal (V^T V = V V^T = I) and Λ = diag(σ₁, ..., σ_d), so X^T X = V Λ V^T.
- X^T X + λI = V Λ V^T + λ V V^T = V (Λ + λI) V^T, hence (X^T X + λI)^{-1} = V (Λ + λI)^{-1} V^T, which exists because every σ_i + λ > 0.

Useful gradients
∇_w (w^T A x) = A x;   ∇_w (w^T A) = A;   ∇_w (x^T A w) = A^T x;   for symmetric A, ∇_w (w^T A w) = 2 A w.

f(w) = ||y - Xw||² = (y - Xw)^T (y - Xw) = y^T y - 2 w^T X^T y + w^T X^T X w
∇_w f(w) = 2 X^T X w - 2 X^T y;   ∇²_w f(w) = 2 X^T X
z^T (X^T X) z = ||Xz||² ≥ 0 ⇒ f(w) is convex.
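A short numpy check of the closed form and of the eigendecomposition argument (random data and an illustrative λ):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
lam = 2.0

# closed form: (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# eigendecomposition route: X^T X = V diag(sig) V^T  (symmetric, so use eigh)
sig, V = np.linalg.eigh(X.T @ X)
w_eig = V @ np.diag(1.0 / (sig + lam)) @ V.T @ (X.T @ y)

print(np.allclose(w_ridge, w_eig))   # True: V (Lambda + lam I)^{-1} V^T X^T y
print(np.all(sig >= -1e-10))         # X^T X is PSD: eigenvalues >= 0
```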

Effect of λ
- λ ↑: more regularization ⇒ more bias, less variance.
- λ → 0: ŵ_ridge → ŵ_LS (overfit end);  λ → ∞: ŵ_ridge → 0 (underfit end). w_0 is not regularized.

Bias-Variance for Ridge
Assume X^T X = nI and y = Xw + ε with ε ~ N(0, σ² I). Then
ŵ_ridge = (X^T X + λI)^{-1} X^T X w + (X^T X + λI)^{-1} X^T ε = n/(n+λ) · w + 1/(n+λ) · X^T ε
E_{Y|X,D}[(Y - x^T ŵ_ridge)² | X = x] = σ² + (λ/(n+λ))² (w^T x)² + σ² n/(n+λ)² ||x||²
ŵ_ridge is a biased estimator.
- A change in bias affects variance; a change in variance might not affect bias.
- Stein paradox: the overall (summed) error might be reduced by accepting some bias.
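A quick simulation of that decomposition under the stated assumption X^T X = nI (here X is built from scaled orthonormal columns; σ, λ, and the true w are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, lam, sigma = 100, 3, 50.0, 0.5
w = np.array([1.0, -2.0, 0.5])

# build X with X^T X = n I (columns = sqrt(n) * orthonormal vectors)
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
X = np.sqrt(n) * Q

w_hats = []
for _ in range(3000):
    y = X @ w + sigma * rng.normal(size=n)
    w_hats.append(np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y))
w_hats = np.array(w_hats)

print("E[w_ridge]          :", np.round(w_hats.mean(axis=0), 3))
print("n/(n+lam) * w       :", np.round(n / (n + lam) * w, 3))   # biased toward 0
print("empirical var       :", np.round(w_hats.var(axis=0), 5))
print("sigma^2 n/(n+lam)^2 :", round(sigma**2 * n / (n + lam)**2, 5))
```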

Linear Algebra

- A ∈ R^{n×n} orthonormal: A A^T = A^T A = I, and ||Ax||₂ = ||x||₂.
- B ∈ R^{n×n} invertible and symmetric ⇒ B^{-1} is also symmetric.
- C PSD: symmetric with x^T C x ≥ 0 for all x ⇒ its eigenvalues are non-negative.
- ||z||²₂ = Σ_i z_i² = z^T z.  x, y orthogonal: x^T y = 0. If the columns of A are orthogonal, A^T A is diagonal.
- Eigenvalue λ and eigenvector x of A ∈ R^{n×n}: A x = λ x. If A has distinct eigenvalues, its eigenvectors form a linearly independent set.
- span({x₁, ..., xₙ}) = {v : v = Σ_i a_i x_i};  for A ∈ R^{m×n}: range(A) = {v ∈ R^m : v = Ax} and nullspace(A) = {x ∈ R^n : Ax = 0}.
- X^T X is always symmetric and PSD. X^T X is not invertible ⟺ its null space is non-empty ⟺ the columns of X are not linearly independent.

Overfitting & Model Selection Notes
- Few data points ⇒ more likely to overfit.
- LOOCV tends to overestimate the true error.
- For feature sets S ⊆ S': bias with S' ≤ bias with S (more features, lower bias).
Clustering

K-means: objective F(μ, C) = Σ_j Σ_{i ∈ C_j} ||x_i - μ_j||².
- The objective is non-convex, but the alternating assignment/update iterations always converge (to a local optimum).
- Assumes clusters are roughly spherical.

Mixture of Gaussians (EM)
- Observed data y_i; unobserved (latent) cluster assignments z_i.
- Complete-data log-likelihood (two-component case): ℓ(θ, z) = Σ_i [ z_i log φ_{θ₁}(y_i) + (1 - z_i) log φ_{θ₀}(y_i) ].
- Responsibility: γ_i(θ) = E[z_i | θ, y] = P(z_i = 1 | y_i, θ), computed with θ held fixed (the E-step).
- Can model complex cluster shapes and densities, and accommodates elliptical shapes.
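A compact numpy version of the k-means objective and its two alternating steps (random initialization; the data and k are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2))
               for c in ([0, 0], [3, 0], [0, 3])])
k = 3

mu = X[rng.choice(len(X), k, replace=False)]      # random init
for _ in range(50):
    # assignment step: nearest centroid for each point
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    c = d2.argmin(axis=1)
    # update step: centroid = mean of assigned points (keep old centroid if empty)
    mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                   for j in range(k)])

obj = sum(((X[c == j] - mu[j]) ** 2).sum() for j in range(k))
print("F(mu, C) =", round(obj, 2))
print("centroids:\n", np.round(mu, 2))
```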
Kernel Density Estimate
p̂(X = x) = (1/n) Σ_i K_h(x - x_i)  for a kernel K with bandwidth h.

Bayes optimal classifier: ŷ = argmax_y P(X = x | Y = y) P(Y = y).

Generative vs. Discriminative
- Generative: learn the full joint distribution P(X, Y); enables probabilistic inference (Bayes classification); needs a lot of data (covering all possible combinations of {X, Y}).
- Discriminative: just learn what you need to make a specific class prediction, i.e. only P(Y | X) (e.g. logistic regression). No regard for P(X) or P(X, Y). An easier modeling problem that requires less data, but its utility is limited to queries about P(Y | X).
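A tiny numpy illustration of a Gaussian KDE in one dimension (the sample, bandwidth, and query points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(1, 1.0, 100)])
h = 0.3                                    # bandwidth (assumed value)

def kde(x, data, h):
    # p_hat(x) = (1/n) * sum_i N(x - x_i; 0, h^2)
    u = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

xs = np.linspace(-4, 4, 9)
print(np.round(kde(xs, data, h), 3))       # density estimate at the query points
```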

Feature Extraction

- Uninformative features ⇒ sparse methods (LASSO).
- Superfluous / correlated features ⇒ dimension reduction / down-sampling.
- Autoencoder: find a low-dimensional representation by prediction (reconstructing the input).
- Convolutional layer: a set of learnable filters (kernels) slides over the input image. The number of weights is determined by the number of filters and the kernel size, not by the resolution of the input image; stride and padding impact the output size.
- Multiple filters in a single layer capture different features and add robustness.
- Non-linear pooling layer: adds non-linearity and reduces the size of the input. Pooling is dimensionality reduction; it summarizes the output of convolving the input with a filter.
- The last few layers are typically fully connected (FC) ⇒ the conv layers do feature extraction, preparing the input for the final FC layers.
- Data augmentation: mirror / zoom, etc.
Kernels
- Solve high dimensionality: much more efficient to compute than explicit features.
- Gaussian (RBF): K(x, x') = exp(-||x - x'||² / (2σ²))
- Polynomial of degree exactly k: K(x, x') = (x^T x')^k;  up to degree k: (1 + x^T x')^k
- Sigmoid: K(x, x') = tanh(γ x^T x' + r)

Kernel trick for regularized LS
ŵ = argmin_w Σ_i (y_i - x_i^T w)² + λ ||w||²,  with w = Σ_i α_i x_i:
α̂ = argmin_α Σ_i (y_i - Σ_j α_j ⟨x_j, x_i⟩)² + λ Σ_i Σ_j α_i α_j ⟨x_i, x_j⟩
  = argmin_α ||y - Kα||² + λ α^T K α
  = (K + λI)^{-1} y
Predictor: f̂(x) = Σ_i α̂_i K(x_i, x).
Both λ and the kernel bandwidth σ regularize the predictor.
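A minimal numpy kernel ridge regression with an RBF kernel, using α̂ = (K + λI)^{-1} y directly (the data, bandwidth σ, and λ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
x = np.sort(rng.uniform(-3, 3, 60))
y = np.sin(x) + 0.1 * rng.normal(size=60)
sigma, lam = 0.5, 0.1

def rbf(a, b, sigma):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), for 1-D inputs
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma**2))

K = rbf(x, x, sigma)
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)   # alpha = (K + lam I)^{-1} y

x_new = np.array([-2.0, 0.0, 2.0])
f_hat = rbf(x_new, x, sigma) @ alpha                   # f(x) = sum_i alpha_i K(x_i, x)
print(np.round(f_hat, 3), "vs sin:", np.round(np.sin(x_new), 3))
```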
Trees & Ensembles

Decision tree: low bias, high variance. ✓ intuitive and interpretable; ✓ deals well with categorical data; ✗ overfits easily without regularization.

Random Forest (RF): averages trees to reduce variance while leaving bias roughly unchanged; the power of averaging comes from weakly correlated predictors. ✓ works well with default parameters; ✗ not as intuitive or interpretable.

Bias: the difference between the optimal prediction and the expected prediction of the best possible trained version of your model.

Bagging (bootstrap aggregating): average trees trained on bootstrapped data that use all d features.
RF: average trees trained on bootstrapped data that use a random subset of the features, reducing the correlation between trees and improving performance.

Boosting: additive model, min Σ_i loss(y_i, Σ_t η_t f_t(x_i)), built from "weak" learners (also trees). Reduces bias. ✓ computationally efficient; ✗ can overfit and needs hyperparameter tuning.
A learning rate that is too high overshoots the minimum and fails to converge.
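A hedged sketch of the bagging-vs-boosting contrast using scikit-learn's stock ensembles (assuming scikit-learn is available; the dataset and hyperparameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# bagging-style: many deep, weakly correlated trees averaged (variance reduction)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# boosting: shallow "weak" trees added sequentially (bias reduction)
gb = GradientBoostingClassifier(n_estimators=200, max_depth=2,
                                learning_rate=0.1, random_state=0).fit(X_tr, y_tr)

print("RF test acc:", round(rf.score(X_te, y_te), 3))
print("GB test acc:", round(gb.score(X_te, y_te), 3))
```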

Bagging vs. Boosting
- Bagging averages low-bias, lightly dependent classifiers to reduce variance.
- Boosting learns a linear combination of high-bias, highly dependent classifiers to reduce error (bias).
- RF builds its trees independently, using bagging; boosting builds its trees sequentially, each tree learning from the previous trees' errors.

Conv layer shape example
Conv2d(in_channels = 3, out_channels = 28, kernel_size = 9, stride = 1): output spatial size h_out = (h_in - k + 2·pad)/stride + 1, giving an output of shape (N, 28, 250, 250) in the notes' example, with N the batch size.
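A quick PyTorch check of that shape formula (assuming torch is installed; the 258×258 input size is an assumption chosen so that, with padding 0, the output side comes out to 250 as in the notes):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=28, kernel_size=9, stride=1, padding=0)

x = torch.randn(4, 3, 258, 258)           # N=4 images, 3 channels, 258x258
y = conv(x)

print(y.shape)                            # torch.Size([4, 28, 250, 250]); (258 - 9)/1 + 1 = 250
# weights depend only on the kernel and channel counts, not the image size:
n_weights = sum(p.numel() for p in conv.parameters())
print(n_weights)                          # 28*3*9*9 + 28 biases = 6832
```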
Singular Value Decomposition (SVD)

A ∈ R^{m×n}: A = U S V^T with U^T U = I, V^T V = I, and S diagonal with non-negative entries.
A^T A v_i = (U S V^T)^T (U S V^T) v_i = V S^T S V^T v_i = v_i S_ii², and similarly A A^T u_i = u_i S_ii²:
the v_i are eigenvectors of A^T A and the u_i (the first r columns of U) are eigenvectors of A A^T, with eigenvalues diag(S²).
S has values only on its diagonal, ordered from the largest (top left) down to the smallest.
Low-rank approximation: in A = U S V^T, discard the components with small singular values.

Neural Networks

Cross-entropy loss: L(y, ŷ) = -[ y log(ŷ) + (1 - y) log(1 - ŷ) ], with ŷ = g(z) and sigmoid g(z) = 1/(1 + e^{-z}).
Forward pass: z^(l) = W^(l) a^(l-1),  a^(l) = g(z^(l)),  with a^(0) = x.
Backprop is the chain rule applied layer by layer, using g'(z) = g(z)(1 - g(z)):
∂L/∂z^(L) = ∂L/∂ŷ · g'(z^(L))  (for a sigmoid output with cross-entropy this simplifies to ŷ - y),
∂L/∂z^(l) = (W^(l+1))^T ∂L/∂z^(l+1) ⊙ g'(z^(l)),   ∂L/∂W^(l) = ∂L/∂z^(l) (a^(l-1))^T.
Train by stochastic gradient descent.
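A hedged PyTorch sketch of this training loop (forward → loss → backward → step); the architecture and data are invented for illustration and do not match the notes' 3-4-2 example exactly:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 3)
y = (X[:, 0] - X[:, 1] > 0).float().unsqueeze(1)    # toy binary labels

model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()                              # cross-entropy for y_hat in (0,1)
opt = torch.optim.SGD(model.parameters(), lr=0.5)

for step in range(500):
    opt.zero_grad()
    y_hat = model(X)          # forward: activations and loss
    loss = loss_fn(y_hat, y)
    loss.backward()           # backward: gradients via autodiff (chain rule)
    opt.step()                # step: update the weights

print(round(loss.item(), 4))
print(sum(p.numel() for p in model.parameters()))   # 3*4 + 4 + 4*1 + 1 = 21 parameters
```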
Neural Network Training in Practice
- Use an automatic differentiation Python package: a convenient library with GPU support.
- forward computes the activations and the loss, backward computes the gradients, and step updates the weights.
- Training issues: the objective is no longer convex, and gradients can blow up or vanish. Vanishing gradients ⇒ skip connections (ResNet); sigmoid activations can cause vanishing gradients, so use ReLU.
- Step size, batch size, and momentum all have a large impact on optimizing the training error (and on the validation error).
- Softmax: an ad-hoc approach to represent probabilities by normalizing the output.
- Parameter counting: input 3 → hidden 4 → output 2 gives 3·4 = 12 weights + 4 biases and 4·2 = 8 weights + 2 biases, i.e. 12 + 4 + 8 + 2 = 26 parameters.

PCA
- Given x₁, ..., xₙ ∈ R^d, find a compressed representation: run SVD on the de-meaned data matrix X̃ (rows (x_i - x̄)^T).
- Σ = (1/n) Σ_i (x_i - x̄)(x_i - x̄)^T = (1/n) X̃^T X̃ = U S U^T = Σ_i u_i u_i^T S_ii.
- U_q = the first q eigenvectors of Σ. Objective: min_{U_q} Σ_i ||(x_i - x̄) - U_q U_q^T (x_i - x̄)||² (reconstruction error); equivalently max_{v: v^T v = 1} v^T Σ v, where the top eigenvalue is the variance of the data projected along that direction.
- Projected data: (X - 1 x̄^T) U_q ∈ R^{n×q}. Choose q to explain e.g. 95% of the variance.

Kernel PCA
- Find the eigenvectors of the centered kernel matrix J K J^T.
- Project a new point through its kernel values: ẑ_{new, j} = Σ_i v_{ij} K(x_i, x_new), up to normalization.

Matrix-factorization view of PCA
- Loss = Σ_{ij} ((U V^T)_{ij} - x_{ij})². How to solve: alternating minimization (fix U, minimize over V; then fix V, minimize over U) or gradient descent.
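A short numpy sketch of PCA via SVD of the de-meaned data matrix (synthetic low-rank data embedded in 5 dimensions; q is picked by the 95%-variance rule mentioned above):

```python
import numpy as np

rng = np.random.default_rng(10)
latent = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 1.0]])
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

X_tilde = X - X.mean(axis=0)                 # de-mean with the (training) mean
U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)

var_explained = s**2 / np.sum(s**2)
q = np.searchsorted(np.cumsum(var_explained), 0.95) + 1   # explain 95% of variance
Z = X_tilde @ Vt[:q].T                       # projected data, shape (n, q)

print("variance explained:", np.round(var_explained, 3))
print("q =", q, "projected shape:", Z.shape)
```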
KNN
- Non-parametric; the decision boundary is not linear.
- k ↑ ⇒ bias ↑, variance ↓.
- Distance choices: L2 norm, L1 norm, Mahalanobis, L-infinity.
- ✓ no training phase; ✗ computation (and memory) at prediction time grows with the training set.
- KNN and local linear / kernel smoothing regression only work well with a lot of data; without it, the "neighbors" aren't local.
- Curse of dimensionality: as the number of dimensions grows, the contrast between the nearest and the furthest point from a given reference point tends to decrease.

Losses & Activations
- Convex losses: MSE, logistic, hinge. The logistic loss is differentiable everywhere.
- Activations: sigmoid/logistic 1/(1 + e^{-z}); ReLU max(0, z).
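A minimal numpy k-nearest-neighbors classifier (Euclidean distance; the toy data and k are illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

def knn_predict(X_new, k=5):
    # no training: just store the data; all work happens at prediction time
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d2, axis=1)[:, :k]             # indices of the k nearest points
    votes = y_train[nn]
    return (votes.mean(axis=1) > 0.5).astype(int)  # majority vote

X_new = np.array([[0.2, -0.1], [2.8, 3.2], [1.5, 1.5]])
print(knn_predict(X_new, k=5))
```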
Bootstrap

D = {z₁, ..., zₙ} drawn i.i.d. from F. The b-th bootstrapped data set D*_b = {z*₁, ..., z*ₙ} is obtained by drawing n samples with replacement from D, and the b-th bootstrap estimate is θ̂*_b = t(D*_b).
Uses:
- estimate parameters that escape simple analysis (variances, medians)
- confidence intervals
- estimate the error for a particular example
✓ general, simple, meaningful
✗ only asymptotic: few meaningful finite-sample guarantees
✗ computationally intensive
✗ relies on the test statistic and on the (unknown) rate of convergence of F̂ to F
✗ poor performance on extreme statistics (e.g. the max)
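A short numpy bootstrap for the standard error and a percentile confidence interval of the median (the sample and number of resamples B are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(12)
D = rng.exponential(scale=2.0, size=80)            # observed sample z_1..z_n ~ F

B = 5000
thetas = np.empty(B)
for b in range(B):
    Db = rng.choice(D, size=len(D), replace=True)  # n draws with replacement from D
    thetas[b] = np.median(Db)                      # theta*_b = t(D*_b)

print("median:", round(np.median(D), 3))
print("bootstrap SE:", round(thetas.std(), 3))
print("95% CI:", np.round(np.percentile(thetas, [2.5, 97.5]), 3))
```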
