cheatsheet 2
Logistic regression: P(Y = 1 \mid x, w) = \frac{1}{1 + \exp(-(w_0 + \sum_k w_k x_k))} = \frac{\exp(w^\top x)}{1 + \exp(w^\top x)}
Lasso objective: \|y - Xw\|_2^2 + \lambda \|w\|_1
Shrinkage (sub-gradient) update: w_{t+1} = w_t - \eta \, 2 X^\top (X w_t - y) - \eta \lambda \, \mathrm{sign}(w_t)
Gradient descent: w_{t+1} = w_t - \eta \nabla_w \hat{J}(w_t)
Logistic loss: \hat{J}(w) = \frac{1}{n} \sum_i \log\big(1 + \exp(-y_i \, x_i^\top w)\big), \qquad \hat{w} = \arg\min_w \hat{J}(w)
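A minimal NumPy sketch of batch gradient descent on this logistic loss; the data X, y, the step size eta, and the step count are placeholder choices:

```python
import numpy as np

def logistic_loss_grad(w, X, y):
    # gradient of (1/n) * sum_i log(1 + exp(-y_i * x_i^T w))
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))        # d/dm log(1 + exp(-m)) = -1/(1 + exp(m))
    return (X.T @ coeff) / len(y)

def gd_logistic(X, y, eta=0.1, steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= eta * logistic_loss_grad(w, X, y)  # w_{t+1} = w_t - eta * grad
    return w
```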
Logistic loss is low when sign(y_i) = sign(x_i^\top w); the loss can keep decreasing by increasing \|w\| without changing sign(x_i^\top w) → early stopping.
No closed-form solution, but \hat{J} is convex and easy to optimize: all local minima are global minima.
GD: takes the entire data set per step (more memory).
SGD: efficient (takes a small sample per step); faster, works well on large data; noise has a higher impact on the stability of SGD; more sensitive to the learning rate.
Convexity (first-order condition): f is convex if f(y) \ge f(x) + \nabla f(x)^\top (y - x) for all x, y \in \mathrm{dom}(f).
Lasso Regression
- penalty \Omega(w) = \lambda \|w\|_1; L_1 encourages sparsity
- \|w\|_1 is not differentiable everywhere, and gradient descent only applies to smooth functions
- Ridge, by contrast, gives a dense solution; the L_2 ball is smooth (twice differentiable), easy to optimize
- Sub-gradient g of f at x: f(y) \ge f(x) + g^\top (y - x) for all y
- sub-gradient descent: slower than GD on smooth functions (see the sketch below)
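A minimal sub-gradient descent sketch for the lasso objective; lam, eta, and the step count are illustrative:

```python
import numpy as np

def lasso_subgradient(X, y, lam=0.1, eta=0.01, steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad_smooth = 2 * X.T @ (X @ w - y)   # gradient of ||y - Xw||^2
        subgrad_l1 = lam * np.sign(w)         # a sub-gradient of lam * ||w||_1
        w -= eta * (grad_smooth + subgrad_l1)
    return w
```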
Under a Gaussian noise assumption, maximizing the likelihood is equivalent to minimizing E[(Y - \hat{y}(X))^2]; the minimizer is \hat{y}(x) = E_{Y \mid X}[Y \mid X = x].
Linear regression: if (X^\top X)^{-1} exists, w_{MLE} = (X^\top X)^{-1} X^\top y.
Bias-variance decomposition:
E_{Y \mid X, D}[(Y - \hat{y}_D(x))^2 \mid X = x] = E_{Y \mid X}[(Y - \bar{y}(x))^2 \mid X = x] + E_D[(\bar{y}(x) - \hat{y}_D(x))^2]
= irreducible error (label noise) + (bias^2 + variance)
[Plot: training error and test error vs. polynomial degree p, illustrating the bias-variance (squared error) tradeoff.]
Ridge Regression: never makes coefficients exactly zero.
LS: if X^\top X is not invertible, the solution is not unique and is unstable.
Useful gradients: \nabla_w(w^\top A x) = Ax; \nabla_w(x^\top A w) = A^\top x; \nabla_w(Aw) = A; for symmetric A, \nabla_w(w^\top A w) = 2Aw.
\hat{w}_{ridge} = \arg\min_w \sum_i (y_i - x_i^\top w)^2 + \lambda \|w\|_2^2
Setting the gradient to zero: -2X^\top y + 2(X^\top X + \lambda I)w = 0, so \hat{w}_{ridge} = (X^\top X + \lambda I)^{-1} X^\top y.
Does (X^\top X + \lambda I)^{-1} always exist? X^\top X is PSD, i.e. all eigenvalues \ge 0 (def. of PSD):
\|y - Xw\|^2 = (y - Xw)^\top (y - Xw) = y^\top y - 2 w^\top X^\top y + w^\top X^\top X w, and z^\top X^\top X z = \|Xz\|^2 \ge 0.
Def. eigenpair (v, \sigma): X^\top X v = \sigma v, so v^\top X^\top X v = \sigma for unit v; X^\top X is symmetric. Let V = [v_1, \ldots, v_d] be orthonormal (V V^\top = V^\top V = I) and \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_d), so X^\top X = V \Sigma V^\top.
X^\top X + \lambda I = V \Sigma V^\top + \lambda V V^\top = V(\Sigma + \lambda I) V^\top, so (X^\top X + \lambda I)^{-1} = V(\Sigma + \lambda I)^{-1} V^\top always exists for \lambda > 0.
Convexity of f(w) = \|y - Xw\|^2: \nabla_w f(w) = 2 X^\top X w - 2 X^\top y, \nabla^2_w f(w) = 2 X^\top X, and z^\top (X^\top X) z = \|Xz\|^2 \ge 0, so f(w) is convex.
\lambda \uparrow: more regularization, more bias, less variance.
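A small NumPy sketch of the closed-form ridge solution, plus a check of the eigendecomposition identity above; lam stands for λ:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    d = X.shape[1]
    # solve (X^T X + lam I) w = X^T y rather than forming an explicit inverse
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def check_eig_identity(X, lam=1.0):
    # (X^T X + lam I)^{-1} should equal V (Sigma + lam I)^{-1} V^T
    sig, V = np.linalg.eigh(X.T @ X)             # X^T X = V diag(sig) V^T
    lhs = np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1]))
    rhs = V @ np.diag(1.0 / (sig + lam)) @ V.T
    return np.allclose(lhs, rhs)
```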
Linear Algebra
- A \in \mathbb{R}^{n \times n} orthonormal: A A^\top = A^\top A = I, and \|Ax\|_2 = \|x\|_2
- \|z\|_2^2 = \sum_i z_i^2 = z^\top z
- x, y orthogonal: x^\top y = 0
- PSD: C symmetric with x^\top C x \ge 0 for all x, i.e. all its eigenvalues are non-negative
As \lambda \to \infty, \hat{w}_{ridge} \to 0; the intercept w_0 is not regularized.
Ridge bias-variance: E_{Y \mid X, D}[(Y - x^\top \hat{w}_{ridge})^2 \mid X = x] = E_{Y \mid X}[(Y - x^\top w)^2 \mid X = x] + E_D[(x^\top w - x^\top \hat{w}_{ridge})^2] = \sigma^2 + bias^2 + variance.
\hat{w}_{ridge} = (X^\top X + \lambda I)^{-1} X^\top X w + (X^\top X + \lambda I)^{-1} X^\top \epsilon, so ridge is a biased estimator of w.
- eigenvalue \lambda, eigenvector x of A \in \mathbb{R}^{n \times n}: Ax = \lambda x
- \mathrm{Span}(\{x_1, \ldots, x_n\}) = \{v : v = \sum_i a_i x_i\}
- \mathrm{range}(A) = \{v \in \mathbb{R}^m : v = Ax\}; \mathrm{nullspace}(A) = \{x \in \mathbb{R}^n : Ax = 0\}
- X^\top X is always symmetric (and PSD); it is invertible iff its null space is trivial, i.e. iff the columns of X are linearly independent
With few data points the model is likely to overfit.
Cross-validation (LOOCV) tends to overestimate the true error.
Clustering
K-means: the objective is non-convex, but the algorithm converges; it assumes clusters are spherical.
F(\mu, C) = \sum_j \sum_{i \in C_j} \|\mu_j - x_i\|^2 (sketch below)
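A minimal sketch of Lloyd's algorithm for the k-means objective above; k, the iteration count, and the initialization are illustrative choices:

```python
import numpy as np

def kmeans(X, k=3, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]          # init centers from data
    for _ in range(iters):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances, n x k
        c = d2.argmin(axis=1)                                  # assign to nearest center
        mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                       for j in range(k)])                     # update centers
    return mu, c
```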
Mixture of Gaussians (EM): observed data y_i; unobserved (latent) data z_i indicating the component.
Complete-data log-likelihood (two components): \ell(\theta; y, z) = \sum_i \big[ z_i \log \phi_1(y_i) + (1 - z_i) \log \phi_0(y_i) \big]
E-step: responsibilities \gamma_i(\theta) = E[z_i \mid \theta, y_i] = P(z_i = 1 \mid y_i, \theta)
M-step: choose \theta to maximize the expected complete-data log-likelihood with the \gamma_i fixed.
GMMs model complex cluster shapes and densities and accommodate elliptical shapes (vs. spherical k-means clusters).
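A sketch of the E-step responsibilities for a two-component 1-D Gaussian mixture; the parameter names (pi, mu0, sigma0, mu1, sigma1) are placeholders:

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def e_step(y, pi, mu0, sigma0, mu1, sigma1):
    # gamma_i = P(z_i = 1 | y_i, theta)
    p1 = pi * gaussian_pdf(y, mu1, sigma1)
    p0 = (1 - pi) * gaussian_pdf(y, mu0, sigma0)
    return p1 / (p0 + p1)
```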
Kernel Density Estimate: \hat{p}(X = x) = \frac{1}{n} \sum_i K_h(x - x_i); a generative model (models the full density).
Discriminative: just learn what you need to make a specific class of prediction, i.e. just P(Y \mid X) (e.g. regression). No regard for P(X) or P(X, Y). An easier modeling problem that requires less data, but its utility is limited to P(Y \mid X) queries.
Feature extraction
- uninformative features → sparse set (LASSO)
- superfluous / correlated features → dimension reduction / down-sampling
- autoencoder: find a low-dimensional representation by prediction
- convolutional layer: a set of learnable filters (kernels) slid over the input; the layer convolves the input with each filter (see the toy sketch after this list)
- pooling: dimension reduction; summarizes the output of convolving
- the last few layers are typically fully connected (FC); conv layers do feature extraction, preparing the input for the final FC layers
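A toy NumPy sketch of convolving a 2-D input with one filter, then 2x2 max pooling; a single channel and stride 1 are simplifying assumptions:

```python
import numpy as np

def conv2d(x, k):
    # "valid" convolution (really cross-correlation, as in most DL libraries)
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

def maxpool2x2(x):
    H, W = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:H, :W].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))
```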
Kernels (deal with high dimensionality)
- polynomial: k(x, x') = (1 + x^\top x')^d
- sigmoid: k(x, x') = \tanh(\gamma \, x^\top x' + r)
Regularized LS: \hat{w} = \arg\min_w \sum_i (y_i - x_i^\top w)^2 + \lambda \|w\|_2^2. Kernel trick: the solution can be written w = \sum_i \alpha_i x_i, so
\hat{\alpha} = \arg\min_\alpha \|y - K\alpha\|_2^2 + \lambda \, \alpha^\top K \alpha = (K + \lambda I)^{-1} y
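A NumPy sketch of kernel ridge regression with the closed form above; the RBF kernel here is one assumed example kernel:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # assumed example kernel: k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_ridge_fit(X, y, lam=1.0, sigma=1.0):
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)   # alpha = (K + lam I)^{-1} y

def kernel_ridge_predict(X_train, alpha, X_new, sigma=1.0):
    return rbf_kernel(X_new, X_train, sigma) @ alpha       # f(x) = sum_i alpha_i k(x, x_i)
```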
Random Forest: reduces variance by averaging many weakly correlated trees (each tree alone overfits); the lower the correlation between trees, the better the performance.
Comparison on trees: RF is built using bagging, with trees built independently; boosting builds each tree sequentially, learning from the previous trees' errors.
Neural Network: a^{(l+1)} = g(z^{(l+1)}), \quad z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}
Singular Value Decomposition (SVD): A = U S V^\top for A \in \mathbb{R}^{m \times n}, with V^\top V = I, U^\top U = I, and S \in \mathbb{R}^{m \times n} diagonal with non-negative entries.
A^\top A v_i = (U S V^\top)^\top (U S V^\top) v_i = V S^\top S V^\top v_i = V S^\top S e_i = S_{ii}^2 v_i, so the columns of V are eigenvectors of A^\top A; similarly the columns of U are eigenvectors of A A^\top, with eigenvalues \mathrm{diag}(S^2).
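A minimal NumPy sketch of the forward pass a^{(l+1)} = g(W^{(l)} a^{(l)} + b^{(l)}), using a sigmoid nonlinearity as an example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # weights/biases: lists of per-layer parameters W^(l), b^(l)
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # a^(l+1) = g(W^(l) a^(l) + b^(l))
    return a
```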
PCA: run SVD on the de-meaned data matrix. Empirical covariance: \hat{\Sigma} = \frac{1}{n} \sum_i (x_i - \bar{x})(x_i - \bar{x})^\top = \frac{1}{n} X^\top X for de-meaned X, with X = U S V^\top.
Cross-entropy loss: L(y, \hat{y}) = -\big[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \big]
Low-rank approximation with SVD: A \approx U_k S_k V_k^\top; discard the components with small singular values.
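A NumPy sketch of PCA via SVD of the de-meaned data matrix, keeping the top k components and reporting the fraction of variance they explain:

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                          # de-mean the data matrix
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                              # top-k principal directions
    Z = Xc @ components.T                            # compressed representation q_i in R^k
    explained = (S ** 2)[:k].sum() / (S ** 2).sum()  # fraction of variance explained
    return Z, components, explained
```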
Sigmoid: g(z) = \frac{1}{1 + e^{-z}}
PCA output: a compressed representation q_1, \ldots, q_n \in \mathbb{R}^k of the data.
Automatic differentiation: Python packages compute gradients automatically, with GPU support; convenient libraries for training.
PCA objectives (equivalent):
- minimize reconstruction error: \min_{V_q \text{ orthonormal}} \sum_i \big\| (x_i - \bar{x}) - V_q V_q^\top (x_i - \bar{x}) \big\|^2
- maximize projected variance: \max_{\|v\| = 1} v^\top \hat{\Sigma} v
v^\top \hat{\Sigma} v: the eigenvalue is the variance of the data projected onto the corresponding eigenvector; choose the number of components to explain, e.g., 95% of the variance.
Training NNs: architecture (e.g. ResNet), step size, batch size, and momentum have a large impact on optimizing the training error; sigmoid can cause vanishing gradients, so ReLU is often used instead.
Softmax: an ad-hoc approach to normalize the output so it represents a probability distribution.
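A numerically stable softmax sketch for turning scores into probabilities:

```python
import numpy as np

def softmax(z):
    z = z - z.max()      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()   # probabilities sum to 1
```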
Kernel PCA.
Single unit: a = \sigma(z), \quad z = w^\top x.
Optimization: gradient descent, or alternating minimization (fix U, minimize over the other factor, and vice versa).
KNN: non-linear decision boundary. k \uparrow: bias \uparrow, variance \downarrow. KNN and local linear smoothing need a lot of data; in high dimensions neighbors aren't local and the contrast between the nearest and farthest points shrinks (curse of dimensionality).
Convex losses: logistic, hinge, MSE; L_2 norm, L_1 norm.
Convexity: f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) for \lambda \in [0, 1].
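A minimal k-nearest-neighbors classifier sketch; Euclidean distance and the value of k are illustrative assumptions:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    d2 = ((X_train - x) ** 2).sum(axis=1)   # squared distances to every training point
    nearest = np.argsort(d2)[:k]            # indices of the k nearest neighbors
    votes = y_train[nearest]
    return np.bincount(votes).argmax()      # majority vote (integer class labels)
```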
Bootstrap:
D = \{z_1, \ldots, z_n\}, \; z_i \overset{iid}{\sim} F (unknown).
The b-th bootstrapped data set D^{*b} is obtained by drawing n samples with replacement from D; \hat{\theta}^{*b} = t(D^{*b}).
→ estimate parameters that escape simple analysis (variances / medians)
→ confidence intervals
→ estimate the error for a particular example
~ general, simple, meaningful
~ asymptotic guarantees
✗ few meaningful finite-sample guarantees
✗ computationally intensive
✗ relies on the test statistic and the (unknown) rate of convergence of \hat{F} to F
✗ poor performance on extreme statistics (e.g. the max)
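A sketch of the bootstrap for a statistic that escapes simple analysis (here, the standard error of the median); B is the assumed number of resamples:

```python
import numpy as np

def bootstrap_se(data, stat=np.median, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    # theta*_b = t(D*_b), where D*_b is n draws with replacement from D
    thetas = np.array([stat(rng.choice(data, size=n, replace=True)) for _ in range(B)])
    return thetas.std(ddof=1)    # bootstrap estimate of the standard error
```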