Cours ML

Formalization of supervised learning

- How to formalize the data? 3 spaces:
  - an input space X
  - a label space Y
  - a prediction space
  e.g. for classification of diabetes patients: X = ℝ², Y = {yes, no}, prediction space = {yes, no}.
- The learner has a cost (loss) function ℓ : Y × Y → ℝ, (y, ŷ) ↦ ℓ(y, ŷ).
- The learner generates a model (or prediction function) h : X → Y, x ↦ h(x).
- The dataset is (x_1, y_1), ..., (x_n, y_n) with x_i ∈ X, y_i ∈ Y.

- The empirical risk of a model h on a dataset is its average loss:
  R̂(h) = (1/n) Σ_i ℓ(h(x_i), y_i)

Empirical Risk Minimization (ERM) principle

A learner which outputs ĥ ∈ argmin_h R̂(h) is said to follow ERM.
Ex: ordinary least squares is ERM; nearest neighbors is not.

[Sketch: scatter plot of the points (x_i, y_i) with the fitted line]

Linear regression as an ERM algorithm: here X = ℝ, Y = ℝ.
- The model we are learning is h_{a,b}(x) = a·x + b.
- Usual formalization, with the squared loss ℓ(y, ŷ) = (y − ŷ)².
- ERM formulation:
  (â, b̂) = argmin_{a,b} (1/n) Σ_i (a·x_i + b − y_i)²

How do we compute â and b̂?
- The loss function is strictly convex in a and b, so we set the partial derivatives to zero:
  ∂R̂/∂a = 0 and ∂R̂/∂b = 0.
- The solution of this system of equations gives â and b̂ (derivation in TD); b̂ is the offset (intercept).
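As a minimal sketch (not the course's notation), this closed form can be coded as follows; the data-generating line below is only an illustrative assumption:

import numpy as np

# 1D ordinary least squares: h_{a,b}(x) = a*x + b.
# Closed form obtained by setting the partial derivatives of the
# empirical risk (mean squared error) to zero.
def fit_1d_ols(x, y):
    x_bar, y_bar = x.mean(), y.mean()
    a_hat = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
    b_hat = y_bar - a_hat * x_bar   # the offset (intercept)
    return a_hat, b_hat

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=50)   # noisy linear data (assumed)
print(fit_1d_ols(x, y))   # close to (2.0, 0.5)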

The "homogeneous coordinate trick"
- Assume we add a dimension equal to one to each x_i.
- This simulates the offset: with this trick our prediction function becomes h_θ(x) = sign(θᵀx).
- In the ERM formulation:
  θ̂ = argmin_θ (1/n) Σ_i ℓ(h_θ(x_i), y_i), with ℓ(y, ŷ) = 1 if y ≠ ŷ, 0 else (the 0-1 loss).
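A small sketch of the trick, assuming NumPy arrays and labels in {-1, +1}:

import numpy as np

def add_ones(X):
    # Append a coordinate equal to 1 to each example x_i:
    # the last component of theta then plays the role of the offset.
    return np.hstack([X, np.ones((X.shape[0], 1))])

def predict(theta, X):
    # Linear classifier with the offset absorbed into theta.
    return np.sign(add_ones(X) @ theta)

def empirical_risk_01(theta, X, y):
    # Average 0-1 loss of h_theta on the dataset.
    return np.mean(predict(theta, X) != y)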

Complexity of ERM for linear classification


Overfitting: intuition in regression and classification
① We have few points and try to fit the curve perfectly: we observe oscillations, and things go badly on future data (overfitting).
② The data might be noisy. Underfitting: the model is not complex enough and the error stays high. Best: the "right fit" in between, reaching low error without fitting the noise.
Naïve vision

In regression: 3 cases of overfitting:
- Benign
- Tempered
- Catastrophic
When overfitting, the empirical risk is close to 0 but the generalization error is high (the error we get on future data).

Statistical assumptions:
- Fixed design: the points x_i are fixed but the labels are random. For instance in linear regression,
  y_i = f(x_i) + ε_i with noise ε_i ~ N(0, σ²)   (i.e. y_i ~ N(f(x_i), σ²)).
- Random design (e.g. ML with Naïve Bayes): the x_i are not fixed; each example has been drawn from an unknown distribution.

Generalization error: given a classifier h,
- classification error: c.e. = P_{X,Y}(h(X) ≠ Y)
- mean squared error (regression): m.s.e. = E[(h(X) − Y)²]

Fixed design assumption:
- x_1, ..., x_n are fixed; for each x_i we have a distribution on y_i, e.g. y_i ~ N(f(x_i), 1).
- We draw random labels y_1, ..., y_n to train on, and new random labels y_1^test, ..., y_n^test to evaluate on.
- classification error: c.e. = (1/n) Σ_i P(h(x_i) ≠ y_i^test)

Now in regression (fixed design):
[Sketch: the curve f(x) with training points and a fitted curve]

Expected MSE:
  EMSE = E_{y_1 ... y_n} E_{y_1^test ... y_n^test} [ (1/n) Σ_i (ĥ(x_i) − y_i^test)² ]
recall that y_i = f(x_i) + ε_i with ε_i a noise.
(The more complex the model, e.g. the more leaves in a tree, the better you fit the training y_i.)

Bias-Variance decomposition:
  EMSE = Bias² + Variance + noise
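An illustrative simulation sketch of this decomposition under a fixed design; the sine target, the polynomial model and all names below are assumptions made for the example, not from the course:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)           # fixed design points
f = np.sin(3 * x)                    # true regression function (assumed)
sigma = 0.3                          # noise standard deviation
degree = 5                           # complexity of the fitted model

preds = []
for _ in range(2000):                # many draws of the training labels
    y_train = f + rng.normal(scale=sigma, size=x.size)
    coeffs = np.polyfit(x, y_train, degree)
    preds.append(np.polyval(coeffs, x))
preds = np.array(preds)

bias2 = np.mean((preds.mean(axis=0) - f) ** 2)
var = np.mean(preds.var(axis=0))
# Expected MSE against fresh test labels y^test = f + noise:
print(bias2, var, sigma ** 2, bias2 + var + sigma ** 2)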

II) Mitigating overfitting

2 options: reduce bias, reduce variance.
1) Control the complexity of the learned classifiers (e.g. limit the size of a tree, the number of leaves)
   → reduces the variance but increases the bias (hopefully less).
2) Build ensembles of classifiers:
   bias(ens(x)) = bias(h_k(x)),  Var(ens(x)) = (1/K) · Var(h_k(x))  (for independent members)
   → SAME bias, SMALLER variance.
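A quick illustrative check of the "same bias, smaller variance" claim, assuming K independent noisy predictors of the same quantity (all values below are made up for the example):

import numpy as np

rng = np.random.default_rng(0)
target, K, n_trials = 1.0, 10, 100_000

# Each base predictor = target + a common bias + independent noise.
single = target + 0.2 + rng.normal(scale=1.0, size=n_trials)
members = target + 0.2 + rng.normal(scale=1.0, size=(n_trials, K))
ensemble = members.mean(axis=1)       # average of K independent members

print(single.mean() - target, ensemble.mean() - target)   # same bias (~0.2)
print(single.var(), ensemble.var())                        # variance ~1.0 vs ~1.0/K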

3) Regularization: a way to reduce the complexity of the class of functions we learn.
In parametrized regression (e.g. linear) we will impose either hard or soft constraints on the norm of θ:
- hard constraint: θ̂ = argmin_{‖θ‖ ≤ C} (1/n) Σ_i (θᵀx_i − y_i)²   (Ivanov regularization)
- soft constraint: θ̂ = argmin_θ (1/n) Σ_i (θᵀx_i − y_i)² + λ‖θ‖²   (Tikhonov regularization)
Both are equivalent.
Let us return to our 1D regression:
  â^λ = argmin_a (1/n) Σ_i (a·x_i − y_i)² + λ·a²
A small regularization is always better than no regularization at all (TP).
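A minimal sketch of this 1D Tikhonov (ridge) estimator; the closed form â^λ = (Σ_i x_i y_i) / (Σ_i x_i² + n·λ) follows from setting the derivative to zero, and the data below are only an illustrative assumption:

import numpy as np

def fit_ridge_1d(x, y, lam):
    # Closed form of argmin_a (1/n) * sum_i (a*x_i - y_i)^2 + lam * a^2
    n = x.size
    return (x @ y) / (x @ x + n * lam)

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 1.5 * x + rng.normal(scale=0.5, size=30)
for lam in (0.0, 0.01, 0.1, 1.0):
    print(lam, fit_ridge_1d(x, y, lam))   # the estimate shrinks toward 0 as lam grows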
Supervised generative learning: the Bayes classifier

- The classification setting.
- The classifier / the model / the predictor: h : X → Y. Our goal is to learn h which predicts the label of an example as well as possible, e.g.
  h(x) = yes if x is about sports, no otherwise.
- Empirical risk: R̂(h) = (1/n) Σ_i 1[h(x_i) ≠ y_i]
- RDA: each labeled example (x_i, y_i) has been drawn independently from an unknown distribution P(X, Y).

- The true risk (or generalization error) of a classifier h is
  R(h) = E[ℓ(h(X), Y)], where (X, Y) is a random pair drawn from P(X, Y).
- If X and Y are discrete: R(h) = Σ_{x,y} P(x, y) · ℓ(h(x), y).
- Here ℓ(h(X), Y) = 1[h(X) ≠ Y], so the risk of h, noted R(h), is
  R(h) = E[1[h(X) ≠ Y]] = P(h(X) ≠ Y).

- Plug-in method: from the data, estimate P(Y | X = x), then classify with the estimation (plug-in estimation of the optimal classifier).
- The OBC (Optimal Bayes Classifier) is the function h* ∈ argmin_h R(h), with h a measurable function X → Y.

Reminder on expectations:
· E[1_A] = P(A)
· for any g, if Y is continuous: E[g(X, Y) | X = x] = ∫ g(x, y) p(y | x) dy
· total law of expectation: E[g(X, Y)] = E_X[ E[g(X, Y) | X] ]

R(h) = E[1[h(X) ≠ Y]]
     = E_X[ E[1[h(X) ≠ Y] | X] ]
     = E_X[ P(h(X) ≠ Y | X) ]
     = E_X[ 1 − P(h(X) = Y | X) ]
The minimizer can be determined pointwise when conditioning on X:
  h^opt ∈ argmin_{h measurable} R(h)
  h^opt(x) = argmin_k E[1[Y ≠ k] | X = x] = argmax_k P(Y = k | X = x)
R(h^opt) = E_X[ 1 − max_k P(Y = k | X) ] is the Bayes risk.
Another way to write h^opt:
  h^opt(x) = argmax_{k ∈ Y} P(Y = k | X = x)
           = argmax_{k ∈ Y} P(X = x | Y = k) · P(Y = k) / P(X = x)
           = argmax_{k ∈ Y} P(X = x | Y = k) · P(Y = k)   (generative formulation)

- Generative learning: our goal is to get estimates P̂(X | Y) and P̂(Y) of P(X | Y) and P(Y).
- Start with Y = {yes, no}.
- We have some data {(x_1, y_1), ..., (x_n, y_n)}.
- We will estimate P(Y) using a maximum likelihood approach.
- Random design ⇒ P(Y) is a Bernoulli.
- Let π = P(Y = yes); π is unknown.
- Likelihood of π on the data: L(π) = Π_i P(y_i | π).
- Max log-likelihood approach:
  π̂ = argmax_π log P(y_1, ..., y_n | π)
     = argmax_π Σ_i log P(y_i | π)
     = argmax_π [ N_yes · log π + N_no · log(1 − π) ],  with N_yes = #{i : y_i = yes}
  Setting the derivative to zero: N_yes/π = N_no/(1 − π)
  ⇒ π̂ = N_yes / (N_no + N_yes) = N_yes / n
Worked example (discrete features):
  P̂(red | yes) = 3/5,  P̂(sur | yes) = 1/5,  P̂(dom | yes) = 3/5
  P̂(red | no)  = 2/5,  P̂(sur | no)  = 3/5,  P̂(dom | no)  = 3/5
  ĥ(red, sur, dom) = argmax_{k ∈ Y} P̂(X = (red, sur, dom) | Y = k) · P̂(Y = k)
  For k = yes: P̂(Y = yes) · P̂(red | yes) · ...
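A sketch of this plug-in computation in Python; the prior of 1/2 per class and the naïve (per-feature) factorization of P̂(x | k) are illustrative assumptions that mirror the example above:

# Hypothetical estimates mirroring the worked example above.
prior = {"yes": 0.5, "no": 0.5}                     # P_hat(Y = k), assumed
cond = {                                            # P_hat(feature value | Y = k)
    "yes": {"red": 3/5, "sur": 1/5, "dom": 3/5},
    "no":  {"red": 2/5, "sur": 3/5, "dom": 3/5},
}

def predict(values):
    # Generative plug-in rule: argmax_k  P_hat(x | k) * P_hat(k),
    # with P_hat(x | k) taken as a product over the observed values.
    scores = {}
    for k in prior:
        p = prior[k]
        for v in values:
            p *= cond[k][v]
        scores[k] = p
    return max(scores, key=scores.get), scores

print(predict(["red", "sur", "dom"]))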
Generative learning for continuous data: Linear Discriminant Analysis (LDA)

I) Recap

· We have a dataset (x_1, y_1), ..., (x_n, y_n)
· x_i ∈ X, y_i ∈ Y with Y = {c_1, ..., c_K}
· RDA: we assume all (x_i, y_i) ~ P(X, Y)
· We are looking for a classifier h minimizing the risk P(h(X) ≠ Y)
· If P(·, ·) were known, we would have the optimal Bayes classifier
  h*(x) = argmax_k P(x | c_k) · P(c_k);
  with 2 classes:
  h*(x) = c_1 if P(x | c_1) P(c_1) ≥ P(x | c_2) P(c_2), c_2 else
· In generative learning, we estimate P̂(x | c_k) and P̂(c_k), then we compute ĥ(x)
· Last week, we worked on the case where X is discrete
· Today X = ℝ^d is continuous, and to approximate P(x | c_k) we use Gaussians

II) Gaussian distributions

· Let X ∈ ℝ be a random variable following a Gaussian distribution X ~ N(μ, σ²); its density is
  N(x | μ, σ²) = 1/(σ·√(2π)) · exp( −(x − μ)² / (2σ²) )
· Let X ∈ ℝ^d. An easy extension to the multivariate case is to assume that each component X^(j) of X follows independently N(μ_j, σ_j²):
  P(x) = Π_j N(x^(j) | μ_j, σ_j²) = 1 / ( (2π)^{d/2} Π_j σ_j ) · exp( −Σ_j (x^(j) − μ_j)² / (2σ_j²) )
  [Sketch: level sets of the density in the (X^(1), X^(2)) plane]
· For the general case of a multivariate Gaussian in ℝ^d we need a mean μ ∈ ℝ^d and a covariance matrix Σ ∈ ℝ^{d×d} (symmetric, real, semi-positive definite: eigenvalues ≥ 0):
  N(x | μ, Σ) = 1 / ( (2π)^{d/2} det(Σ)^{1/2} ) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )
  If Σ is diagonal we recover the independent-components case.
· Statistics: E[X] = μ, Cov(X^(j), X^(k)) = Σ_{jk}; in vector form, Var(X) = E[(X − μ)(X − μ)ᵀ] = Σ.
· In 2D, relation with the standard deviations and the correlation coefficient ρ_{1,2}:
  Σ = [ σ_1²              ρ_{1,2} σ_1 σ_2 ]
      [ ρ_{1,2} σ_1 σ_2   σ_2²            ]
  Var(X^(1)) = σ_1²,  Cov(X^(1), X^(2)) = ρ_{1,2} σ_1 σ_2
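A small sketch of evaluating this multivariate density in Python (names and the example numbers are illustrative):

import numpy as np

def gaussian_density(x, mu, sigma):
    # N(x | mu, Sigma) = exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu)) / ((2*pi)^{d/2} * det(Sigma)^{1/2})
    d = mu.size
    diff = x - mu
    quad = diff @ np.linalg.solve(sigma, diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])        # symmetric positive definite covariance
print(gaussian_density(np.array([0.5, -1.0]), mu, sigma))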
· Let us analyse −½ (x − μ)ᵀ Σ⁻¹ (x − μ):
  Σ is real and symmetric, so it can be decomposed as Σ = U D Uᵀ with U orthonormal (its columns u_1, ..., u_d are the eigenvectors) and D diagonal (the eigenvalues).
  So Σ⁻¹ = (U D Uᵀ)⁻¹ = U D⁻¹ Uᵀ.
  Let z = Uᵀ(x − μ): z is the vector x − μ expressed in the basis u_1, u_2, ..., u_d.
  Then −½ (x − μ)ᵀ Σ⁻¹ (x − μ) = −½ zᵀ D⁻¹ z = −½ Σ_j z_j² / d_j.
  The level sets of P(x) are therefore ellipsoids whose axis directions are determined by the eigenvectors of the covariance matrix: axis-aligned if Σ is diagonal, rotated if it is not.
· If P(X^(1), ..., X^(d)) = N(x | μ, Σ), then the marginals (X^(1), ..., X^(k)) and the conditionals (X^(1), ..., X^(k) | X^(k+1), ..., X^(d)) are Gaussians.
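A quick numerical check of this change of basis, as a sketch with an arbitrary covariance:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
sigma = A @ A.T + 3 * np.eye(3)          # a symmetric positive definite covariance
mu = np.zeros(3)
x = rng.normal(size=3)

eigvals, U = np.linalg.eigh(sigma)        # Sigma = U diag(eigvals) U^T, U orthonormal
z = U.T @ (x - mu)                        # coordinates of x - mu in the eigenbasis

quad_direct = (x - mu) @ np.linalg.solve(sigma, x - mu)
quad_eigen = np.sum(z ** 2 / eigvals)     # sum_j z_j^2 / d_j
print(quad_direct, quad_eigen)            # the two quadratic forms agree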

III) LDA

- We need to approximate P(X | Y = c_k).
Assumptions:
- Y = {c_1, c_2}
- P(X | c_1) and P(X | c_2) are Gaussians sharing the same covariance: Σ_1 = Σ_2 = Σ
  P(x | c_1) = N(x | μ_1, Σ),  P(x | c_2) = N(x | μ_2, Σ)
- P(c_1) = π_1,  P(c_2) = π_2
If we are able to estimate (π̂_1, π̂_2, μ̂_1, μ̂_2, Σ̂), then we can compute the classifier: predict c_1 if
  log( P(x | c_1) π_1 / (P(x | c_2) π_2) ) ≥ 0,  and c_2 otherwise.
  log( P(x | c_1) π_1 / (P(x | c_2) π_2) )
    = log( π_1 exp(−½ (x − μ_1)ᵀ Σ⁻¹ (x − μ_1)) ) − log( π_2 exp(−½ (x − μ_2)ᵀ Σ⁻¹ (x − μ_2)) )
    = −½ (x − μ_1)ᵀ Σ⁻¹ (x − μ_1) + ½ (x − μ_2)ᵀ Σ⁻¹ (x − μ_2) + log(π_1/π_2)
  The quadratic terms xᵀ Σ⁻¹ x cancel, so the decision function is linear in x.
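A sketch of this decision rule, assuming the parameters (π_1, π_2, μ_1, μ_2, Σ) have already been estimated (all names and the example values are illustrative):

import numpy as np

def lda_log_ratio(x, mu1, mu2, sigma, pi1, pi2):
    # log[ P(x | c1) * pi1 / (P(x | c2) * pi2) ] with a shared covariance Sigma.
    inv = np.linalg.inv(sigma)
    q1 = (x - mu1) @ inv @ (x - mu1)
    q2 = (x - mu2) @ inv @ (x - mu2)
    return -0.5 * q1 + 0.5 * q2 + np.log(pi1 / pi2)

def lda_predict(x, mu1, mu2, sigma, pi1, pi2):
    # Predict c1 when the log-ratio is >= 0, c2 otherwise.
    return "c1" if lda_log_ratio(x, mu1, mu2, sigma, pi1, pi2) >= 0 else "c2"

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
print(lda_predict(np.array([0.8, 0.3]), mu1, mu2, np.eye(2), 0.5, 0.5))   # -> 'c1'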

IV) Estimating the parameters by maximum likelihood

θ = (π_1, π_2, μ_1, ..., μ_K, Σ)
Log-likelihood: ℓℓ(θ) = log p(x_1, y_1, ..., x_n, y_n | θ);  θ̂ = argmax_θ ℓℓ(θ).
Let I_k = {i : y_i = c_k} and N_k = |I_k|.
p(x_1, y_1, ..., x_n, y_n | θ) = Π_i p(x_i, y_i | θ)   (the sample is i.i.d.)
  = Π_i p(x_i | y_i, θ) · p(y_i | θ)
  = Π_i N(x_i | μ_{y_i}, Σ) · π_{y_i}
The log-likelihood decomposes into a term in π and, for each class, terms in μ_k and Σ:
  θ̂ = argmax_θ Σ_k [ −½ Σ_{i ∈ I_k} (x_i − μ_k)ᵀ Σ⁻¹ (x_i − μ_k) − (N_k/2) log det(Σ) + N_k log π_k ]
For k ∈ {1, 2}:  μ̂_k = argmax_{μ_k} Σ_{i ∈ I_k} −½ (x_i − μ_k)ᵀ Σ⁻¹ (x_i − μ_k)  ⇒  μ̂_k = (1/N_k) Σ_{i ∈ I_k} x_i
Doing similarly for Σ, we get  Σ̂ = (1/n) Σ_i (x_i − μ̂_{y_i})(x_i − μ̂_{y_i})ᵀ
(and π̂_k = N_k / n, as derived before)
⇒ We get the linear classifier.
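A sketch of these maximum-likelihood estimates in Python (the function name and data layout are illustrative assumptions):

import numpy as np

def fit_lda(X, y):
    # LDA maximum-likelihood estimates:
    # pi_k = N_k / n, mu_k = mean of class k, one covariance Sigma shared by all classes.
    n, d = X.shape
    pis, mus = {}, {}
    sigma = np.zeros((d, d))
    for k in np.unique(y):
        Xk = X[y == k]
        pis[k] = len(Xk) / n
        mus[k] = Xk.mean(axis=0)
        centered = Xk - mus[k]
        sigma += centered.T @ centered
    sigma /= n
    return pis, mus, sigma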
V) Beyond the LDA assumptions: Kernel Density Estimation (KDE)

· LDA assumes each class is modeled by one Gaussian.
· KDE assumes each class is modeled by many Gaussians (one per point!), centered on that point, with covariance matrix I_d.
Let X ∈ ℝ^d be a random vector of class c_k:
  LDA assumption:  p(x | c_k) = N(x | μ_k, Σ)
  KDE:             p(x | c_k) = (1/N_k) Σ_{i ∈ I_k} N(x | x_i, I_d)
  ĥ(x) = argmax_k p̂(x | c_k) · P̂(c_k)
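A sketch of such a KDE-based generative classifier; the bandwidth parameter h below is an illustrative addition (h = 1 recovers the identity covariance described above):

import numpy as np

def kde_class_density(x, Xk, h=1.0):
    # p_hat(x | c_k) = (1/N_k) * sum_{i in I_k} N(x | x_i, h^2 * Id)
    d = Xk.shape[1]
    quads = np.sum((Xk - x) ** 2, axis=1) / (h ** 2)
    norm = (2 * np.pi * h ** 2) ** (d / 2)
    return np.mean(np.exp(-0.5 * quads) / norm)

def kde_predict(x, X, y, h=1.0):
    # Generative rule: argmax_k  p_hat(x | c_k) * P_hat(c_k)
    scores = {k: kde_class_density(x, X[y == k], h) * np.mean(y == k)
              for k in np.unique(y)}
    return max(scores, key=scores.get)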
