Ridge

[Figure: ridge coefficient paths; y-axis: Coefficients (0.0 to 0.6), x-axis: 0 to 8]
Yannig Goude
EDF R&D
As seen previously, the Ordinary Least Squares (OLS) estimator is the solution of
$$\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - x_i\beta)^2 = \min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|^2$$
and is given by
$$\hat{\beta} = (X'X)^{-1}X'Y,$$
under the assumption that $X$ has full rank.
Note: for convenience, the variables are assumed to be centred throughout.
In practice, this estimator can fail:
- if the columns $x_{\cdot,j}$ are correlated with one another, $X'X$ is ill-conditioned ($X$ is not of full rank in the case of exact collinearity);
- if $p \gg n$.
In these cases $X'X$ must be regularised in order to be invertible, which is done by adding a penalty: ridge, lasso...
Ridge regression:
We solve the problem
$$\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - x_i\beta)^2 + \lambda\|\beta\|^2,$$
or equivalently, in constrained form,
$$\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - x_i\beta)^2 \quad \text{s.t. } \|\beta\|^2 \le t.$$
Notes:
- there is a one-to-one correspondence between $t$ and $\lambda$;
- the solutions of the problem are not invariant under a change of scale, so the variables are usually standardised beforehand;
- the variables are centred (i.e. the intercept is not penalised).
The ridge estimator of the coefficients is given by
$$\hat{\beta}^{ridge} = (X'X + \lambda I)^{-1}X'Y.$$
It is a biased estimator of $\beta$:
- $E(\hat{\beta}^{ridge}) = (X'X + \lambda I)^{-1}X'X\,E(\hat{\beta}) = \beta - \lambda(X'X + \lambda I)^{-1}\beta$
- $Var(\hat{\beta}^{ridge}) = \sigma^2(X'X + \lambda I)^{-1}X'X(X'X + \lambda I)^{-1}$
Proof:
- $g(\beta) = \sum_{i=1}^{n}(y_i - x_i\beta)^2 + \lambda\beta'\beta$
- $\frac{dg}{d\beta} = 2X'(X\beta - Y) + 2\lambda\beta$
- the gradient vanishes when $(X'X + \lambda I)\beta = X'Y$,
- i.e. for $\beta = (X'X + \lambda I)^{-1}X'Y$.
- $\hat{\beta}^{ridge}$ is biased; its bias is $-\lambda(X'X + \lambda I)^{-1}\beta$
- the inversion of $X'X$ is replaced by that of $(X'X + \lambda I)$, which makes the estimation more robust when the explanatory variables are empirically correlated
Intuition:
- penalising shrinks the search space $S$: it reduces the variance (which is large in the cases mentioned above) but increases the bias
- the bias-variance trade-off amounts to finding the right penalty $\lambda$
- $\hat{\beta}_\lambda = F_\lambda\,\hat{\beta}_0$, where
$$F_\lambda = \Big(X^TX + \sum_j \lambda_j S_j\Big)^{-1}X^TX$$
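In matrix form, computing the ridge estimator is a single linear solve. A minimal numpy sketch, assuming centred (and standardised) data; the helper name `ridge_fit` is ours, not from the course:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimator beta = (X'X + lam*I)^{-1} X'y on centred data."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Increasing lam shrinks the coefficients towards zero:
# beta_ols   = ridge_fit(Xc, yc, 0.0)     # Xc, yc: centred design and response
# beta_ridge = ridge_fit(Xc, yc, 10.0)
```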
[Figure: ridge coefficient paths; labelled curves: lcavol, lbph, gleason; y-axis: Coefficients (0.0 to 0.6), x-axis: 0 to 8]
[Figure: ridge coefficient paths; y-axis: Ridge Coefficients (0.0 to 0.4), x-axis: 0 to 8]
[Figure: scatter plot of log lpsa against lcavol]
Lasso regression (least absolute shrinkage and selection operator):
We solve the problem
$$\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n}(y_i - x_i\beta)^2 + \lambda|\beta|,$$
with $|\beta| = \sum_{j=1}^{p}|\beta_j|$, which is equivalent to the problem
$$\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n}(y_i - x_i\beta)^2 \quad \text{s.t. } |\beta| \le t.$$
The problem is similar to ridge, but the $L_2$ penalty of ridge is replaced here by an $L_1$-norm penalty: the solution of this problem is no longer linear in $y$.
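Because the $L_1$ penalty is not differentiable at zero, the lasso solution is usually computed iteratively, for example by coordinate descent with soft-thresholding. A minimal sketch, assuming centred and standardised data; the names `soft_threshold` and `lasso_cd` are ours, not from the course:

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: this is what produces exact zeros in the lasso."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise ||y - X beta||^2 + lam * |beta|_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)                       # ||x_j||^2 for each column
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]    # partial residual without x_j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return beta
```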
[Figure: lasso coefficient paths; y-axis: Lasso Coefficients (0.0 to 0.4), x-axis: 0 to 8]
[Figure: lasso and ridge coefficient paths side by side; labelled curves: lcavol, lbph, gleason; y-axes: Lasso Coefficients and Ridge Coefficients (0.0 to 0.6), x-axes: 0 to 8]
Forward stagewise algorithm:
1. initialise $\hat{\mu} = 0$
2. compute the correlations $\hat{c} = X'(y - \hat{\mu})$, then $\hat{j} = \arg\max_j|\hat{c}_j|$
3. update $\hat{\mu} \leftarrow \hat{\mu} + \varepsilon\,\mathrm{sign}(\hat{c}_{\hat{j}})\,x_{\hat{j}}$
4. go back to step 2
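A minimal numpy sketch of these steps, using a fixed number of iterations as a (hypothetical) stopping rule; the function name `forward_stagewise` is ours:

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=5000):
    """Incremental forward stagewise: tiny moves along the most correlated predictor."""
    n, p = X.shape
    beta = np.zeros(p)
    mu = np.zeros(n)                        # current fit mu_hat
    for _ in range(n_steps):
        c = X.T @ (y - mu)                  # correlations with the residual
        j = np.argmax(np.abs(c))            # most correlated predictor
        step = eps * np.sign(c[j])
        beta[j] += step
        mu += step * X[:, j]                # mu_hat <- mu_hat + eps*sign(c_j)*x_j
    return beta
```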
LARS (least angle regression), first steps:
1. initialise $\hat{\mu} = 0$, $\hat{\beta} = 0$
2. set $r = y - \hat{\mu}$ and let $x_j$ be the variable most correlated with $r$
3. move $\beta_j$ away from 0 in the direction of $c_j = x_j'r$ until some other variable $x_k$ has as much correlation with $r$ as $x_j$, i.e. $|c_k| \ge |c_j|$
4. move $\beta_j$ and $\beta_k$ (with $\beta_k$ starting from 0) in the joint least-squares direction $\delta = (X'X)^{-1}X'r$ computed on the active variables, until a further variable $x_l$ reaches the same correlation with $r$, and so on
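The full path is rarely coded by hand. A hedged usage example, assuming scikit-learn is available (its `lars_path` function implements this algorithm); the synthetic data are ours:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
y = 2.0 * X[:, 0] - X[:, 3] + 0.5 * rng.standard_normal(100)

# method='lar' gives the plain LARS path; method='lasso' gives the lasso path.
alphas, active, coefs = lars_path(X, y, method="lar")
print(active)        # order in which the variables enter the model
print(coefs.shape)   # (n_features, n_steps): coefficients along the path
```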
[Figure: LARS coefficient paths; y-axis: LARS Coefficients (0.0 to 0.4), x-axis: 0 to 8]
$$X = UDV'$$
with
- $U$ an $n \times p$ matrix with orthonormal columns
- $V$ a $p \times p$ orthogonal matrix
- $D$ a diagonal matrix whose elements satisfy $d_1 \ge d_2 \ge \dots \ge d_p \ge 0$
- if at least one $d_j = 0$, then $X$ is singular
The linear regression of $Y$ on $X$ then reads
$$\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'Y = UU'Y,$$
$U$ being an orthonormal basis of the space spanned by $X$, and $U'Y$ the coefficients of the projection of $Y$ onto this basis.
The principal components of $X$ are the eigenvectors of the covariance matrix $X'X/n$ (recall that the variables are centred).
Using the SVD, and up to the factor $1/n$, this matrix is
$$X'X = (VDU')(UDV') = VD^2V'.$$
The eigenvectors $v_j$ are also called the Karhunen-Loève directions of $X$.
Advantage: simplification and faster computations, with the possibility of keeping only the "important" axes of the covariance matrix.
The projection of $X$ onto the $j$-th principal component $v_j$ is
$$z_j = Xv_j = UDV'v_j = u_jd_j,$$
and one easily checks that $Var(z_1) \ge Var(z_2) \ge \dots \ge Var(z_p)$.
The first principal component of a data set $X$ is the direction that maximises the variance of the projected data.
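A small numpy check of these identities on synthetic data (the data and names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
X = X - X.mean(axis=0)                 # centre the variables, as in the slides

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: X = U diag(d) V'

Z = X @ Vt.T                           # principal components z_j = X v_j
print(np.allclose(Z, U * d))           # check z_j = u_j d_j
print(Z.var(axis=0))                   # variances come out in decreasing order
```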
[Figure: scatter plot of X2 against X1 (both axes from -3 to 3)]
For principal component regression, we proceed as follows (see the sketch after this list):
- compute the SVD of $X$
- the principal components are obtained as $z_j = Xv_j = u_jd_j$
- the coefficients of the projection onto these components are given by $\beta = D^{-1}U'Y$
- keep only the "first" components: plot the eigenvalues $d_j$ and use, for example, the elbow criterion, or nested model selection
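A minimal sketch of this procedure, assuming centred data and strictly positive singular values; the helper name `pcr_fit` is ours:

```python
import numpy as np

def pcr_fit(X, Y, k):
    """Principal component regression keeping the first k components."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    theta = (U.T @ Y) / d          # coefficients beta = D^{-1} U'Y on the components
    theta[k:] = 0.0                # drop the components beyond the first k
    return Vt.T @ theta            # back to the original variables

# Fitted values with k components: y_hat = X @ pcr_fit(X, Y, k)
```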
Link with ridge regression
We saw that for ridge regression
$$X\hat{\beta}^{ridge} = X(X'X + \lambda I)^{-1}X'y;$$
replacing $X$ by its SVD, we get
$$X\hat{\beta}^{ridge} = UD(D^2 + \lambda I)^{-1}DU'y,$$
that is,
$$X\hat{\beta}^{ridge} = \sum_{j=1}^{p} u_j\,\frac{d_j^2}{d_j^2 + \lambda}\,u_j'y.$$
Since $\lambda \ge 0$, we have $\frac{d_j^2}{d_j^2 + \lambda} \le 1$: the coefficients of the ordinary regression on the principal components are simply shrunk by this factor, and the coefficients associated with the smallest eigenvalues are shrunk the most.
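The same shrinkage can be computed directly from the SVD. A minimal sketch, assuming centred data; the function name `ridge_fit_svd` is ours:

```python
import numpy as np

def ridge_fit_svd(X, y, lam):
    """Ridge fit via the SVD: each component u_j'y is shrunk by d_j^2/(d_j^2 + lam)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = d**2 / (d**2 + lam)               # shrinkage factor per direction
    y_hat = U @ (shrink * (U.T @ y))           # X beta_ridge = sum_j u_j shrink_j u_j'y
    beta = Vt.T @ ((d / (d**2 + lam)) * (U.T @ y))   # the coefficients themselves
    return y_hat, beta
```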
Shrinkage

[Figure: shrinkage factor plotted against d (x-axis d from 0 to 10)]
Choice of the penalisation parameter
We proceed as for the model-selection procedures seen for ordinary linear regression:
- cross-validation
- elbow criterion
- a "penalisation" criterion depending on (often proportional to) the dimension of the model, i.e. its number of degrees of freedom: here we need an estimator of this number of degrees of freedom, which is no longer simply the number of parameters as in ordinary regression
Cross-validation
As for ordinary regression, the CV criterion has a closed-form expression:
$$\hat{f}^{-i}(x_i) = \sum_{j \ne i} \frac{H_{i,j}(\lambda)}{1 - H_{i,i}}\,y_j$$
and
$$CV(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\frac{(y_i - \hat{f}_\lambda(x_i))^2}{(1 - H_{i,i})^2}.$$
Hence, using the degrees-of-freedom estimator introduced for ridge regression,
$$GCV(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\frac{(y_i - \hat{f}_\lambda(x_i))^2}{(1 - tr(H)/n)^2}.$$
The CV criterion is therefore an average of the estimation error weighted by the "importance" of each observation; GCV is a CV error in which every observation gets the same "weight".
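A minimal sketch of GCV for ridge over a grid of penalties, using the hat matrix $H(\lambda) = X(X'X + \lambda I)^{-1}X'$; the names are ours:

```python
import numpy as np

def gcv_ridge(X, y, lambdas):
    """GCV(lambda) = mean residual^2 / (1 - tr(H)/n)^2 for a grid of ridge penalties."""
    n, p = X.shape
    scores = []
    for lam in lambdas:
        H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # hat matrix H(lambda)
        resid = y - H @ y
        scores.append(np.mean(resid**2) / (1.0 - np.trace(H) / n) ** 2)
    return np.array(scores)

# lambdas = np.logspace(-3, 3, 30)
# best_lambda = lambdas[np.argmin(gcv_ridge(X, y, lambdas))]
```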
Mallows' Cp
Consider the model $y = f + \varepsilon$ with $\hat{f} = Hy$. Writing $f = y - \varepsilon$,
$$\|f - \hat{f}\|^2 = \|y - Hy - \varepsilon\|^2 = \|y - Hy\|^2 + \|\varepsilon\|^2 - 2\varepsilon'(y - Hy)$$
$$= \|y - Hy\|^2 + \|\varepsilon\|^2 - 2\varepsilon'(f + \varepsilon) + 2\varepsilon'(Hf + H\varepsilon).$$
Hence
$$E\|f - \hat{f}\|^2 = E\|y - Hy\|^2 + n\sigma^2 - 2n\sigma^2 + 2E(\varepsilon'H\varepsilon) = E\|y - Hy\|^2 - n\sigma^2 + 2\sigma^2 tr(H).$$
This leads to the criteria:
- $C_p = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{f}(x_i))^2 + \frac{2\sigma^2 tr(H_\lambda)}{n}$
- $GCV(\lambda) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{f}(x_i))^2 / (1 - tr(H)/n)^2$
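A corresponding sketch for Mallows' $C_p$, assuming an estimate `sigma2` of the noise variance is available; the name `cp_ridge` is ours:

```python
import numpy as np

def cp_ridge(X, y, lam, sigma2):
    """Mallows' Cp for a ridge fit f_hat = H y; tr(H) plays the role of the degrees of freedom."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y
    return np.mean(resid**2) + 2.0 * sigma2 * np.trace(H) / n
```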