Class 02
Tomaso Poggio
Plan
Distribution

The data live in X × Y and are drawn from a probability distribution μ(x, y) = P(x) P(y|x), with marginal P(x) on the inputs and conditional P(y|x) on the labels.
Hypothesis Space
y_pred = f_S(x_new)
For regression, a natural loss is the absolute value:
V(f(x), y) = |f(x) − y|
For binary classification, the most intuitive loss is the 0-1 loss:
V(f(x), y) = Θ(−y f(x))
where Θ is the step function. For tractability and other reasons, we often use the hinge loss (implicitly introduced by Vapnik) in binary classification:
V(f(x), y) = (1 − y · f(x))_+
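As a quick illustration (not from the slides), here is a small NumPy sketch that evaluates both losses on a few made-up real-valued predictions; the function names and the sample values are my own.

import numpy as np

def zero_one_loss(f_x, y):
    # Theta(-y f(x)): 1 when the sign of f(x) disagrees with the label y, else 0
    return (y * f_x <= 0).astype(float)

def hinge_loss(f_x, y):
    # (1 - y f(x))_+ : also penalizes correct predictions whose margin is below 1
    return np.maximum(0.0, 1.0 - y * f_x)

y   = np.array([+1, +1, -1, -1])       # labels
f_x = np.array([2.0, 0.3, -0.5, 0.4])  # real-valued predictions f(x)

print(zero_one_loss(f_x, y))   # [0. 0. 0. 1.]
print(hinge_loss(f_x, y))      # [0.  0.7 0.5 1.4]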
The learning problem: summary so far
The training set S = {(x1, y1), ..., (xn, yn)} = {z1, ..., zn} consists of n samples drawn i.i.d. from μ.
Empirical error, generalization error, generalization

The empirical error of f on S is
I_S[f] = (1/n) ∑_{i=1}^{n} V(f, z_i),
and the expected (generalization) error is I[f] = E_z[V(f, z)]. An algorithm generalizes if |I_S[f_S] − I[f_S]| → 0 in probability as n → ∞.

In other words, the training error for the solution must converge to the expected error and thus be a “proxy” for it. Otherwise the solution would not be “predictive”.
Consistency requires that the expected error of the solution converge, uniformly over distributions, to the best attainable error in H:

∀ε > 0   lim_{n→∞} sup_μ P_S { I[f_S] > inf_{f∈H} I[f] + ε } = 0.
Recall that lim_{n→∞} X_n = X in probability if, for every ε > 0, lim_{n→∞} P(|X_n − X| > ε) = 0.
ERM (and consistency)
f_S = arg min_{f∈H} I_S[f]
If the minimum does not exist we can work with the infimum.
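To make ERM concrete, here is a toy sketch (my own illustration, not course code): the empirical error I_S[f] with the absolute-value loss is minimized over a small, hand-picked hypothesis space of linear functions f(x) = w·x; the data and the grid of slopes are made up.

import numpy as np

# Toy training set S = {(x_i, y_i)}
xs = np.array([0.1, 0.4, 0.5, 0.9])
ys = np.array([0.2, 0.5, 0.4, 1.0])

def empirical_error(f, xs, ys):
    # I_S[f] = (1/n) sum_i V(f(x_i), y_i) with V(f(x), y) = |f(x) - y|
    return np.mean(np.abs(f(xs) - ys))

# A finite hypothesis space H: f(x) = w*x for slopes w on a grid
hypotheses = [lambda x, w=w: w * x for w in np.linspace(0.0, 2.0, 21)]

# ERM: f_S = argmin over H of I_S[f]
f_S = min(hypotheses, key=lambda f: empirical_error(f, xs, ys))
print(empirical_error(f_S, xs, ys))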
• “generalize”

[Figure: training samples, f(x) plotted against x]
[Figure: suppose this is the “true” solution...]
[Figure: ...but suppose ERM gives this solution!]
[Figure: how can I guarantee that, for a sufficiently large number of samples, the ERM solution is close to the true one?]
Classical conditions for consistency of ERM
ERM is consistent if

∀ε > 0   lim_{n→∞} sup_μ P_S { sup_{ℓ∈L} |I[ℓ] − I_S[ℓ]| > ε } = 0,

where L is the class of loss functions induced by H.
Theorem [Vapnik and Červonenkis (71), Alon et al. (97), Dudley, Giné, and Zinn (91)]
A necessary and sufficient condition for generalization (and consistency) of ERM is that L is a uniform Glivenko-Cantelli (uGC) class.
Thus, as we will see later, a proper choice of the hypothesis space H ensures generalization of ERM (and consistency, since for ERM generalization and consistency are equivalent). We will be exploring the uGC definition (and equivalent definitions) in detail in 9.520.
Well-posedness of ERM
Ill-Posed problems
given 10 samples...
[Figure: ten sample points in the unit square]
...we can find the smoothest interpolating polynomial
[Figure: interpolating polynomials through the ten samples (three panels)]
If we restrict ourselves to degree two polynomials...
[Figure: a degree-two polynomial fit to the ten samples]
...the solution varies only a small amount
[Figure: degree-two polynomial fits]
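The figures above can be reproduced in spirit with a short sketch (made-up data, my own code): interpolating ten samples with a degree-9 polynomial reacts strongly to a tiny perturbation of the data, while a degree-2 least-squares fit barely moves.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.05, 0.95, 10)
y = 0.3 * np.sin(2 * np.pi * x) + 0.5           # ten samples of a smooth function
y_pert = y + rng.normal(scale=0.01, size=10)    # tiny perturbation of the outputs

x_grid = np.linspace(0.0, 1.0, 200)
for degree in (9, 2):
    f      = np.polynomial.Polynomial.fit(x, y, degree)
    f_pert = np.polynomial.Polynomial.fit(x, y_pert, degree)
    change = np.max(np.abs(f(x_grid) - f_pert(x_grid)))
    print(f"degree {degree}: max change of the fitted curve = {change:.3f}")
# The degree-9 (interpolating) fit typically changes far more than the degree-2 fit.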
Regularization
A first possibility (Ivanov regularization) is to minimize the empirical error while satisfying

‖f‖² ≤ A.
Alternatively, Tikhonov regularization minimizes over the hypothesis space H, for a fixed positive parameter λ, the regularized functional

(1/n) ∑_{i=1}^{n} V(f(x_i), y_i) + λ ‖f‖²_K,        (1)

where ‖f‖_K is the norm in H, the Reproducing Kernel Hilbert Space (RKHS) defined by the kernel K.
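As an illustration of (1) with the square loss (a sketch of mine, not code from the course): by the representer theorem the minimizer can be written f_S(x) = ∑_j c_j K(x_j, x), and with the square loss the coefficients solve the linear system (K + λ n I) c = y. The Gaussian kernel, its width, the data and λ below are all assumptions made for the example.

import numpy as np

def gaussian_kernel(x1, x2, sigma=0.2):
    # K(x, x') = exp(-(x - x')^2 / (2 sigma^2)) for 1-D inputs
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * sigma ** 2))

# Training data
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=10)

lam = 1e-2                                       # the parameter lambda in (1)
K = gaussian_kernel(x, x)
# Minimizing (1/n) sum (f(x_i) - y_i)^2 + lam ||f||_K^2 leads to (K + lam*n*I) c = y
c = np.linalg.solve(K + lam * len(x) * np.eye(len(x)), y)

def f_S(x_new):
    # f_S(x_new) = sum_j c_j K(x_j, x_new)
    return gaussian_kernel(np.atleast_1d(x_new), x) @ c

print(f_S(0.3))

Larger λ weights the norm term more heavily and gives smoother, more stable solutions; smaller λ fits the data more closely.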
Well-posed and Ill-posed problems
A problem is well-posed if its solution
• exists,
• is unique and
• depends continuously on the data (i.e., is stable).
A problem is ill-posed if it is not well-posed.
Approximation Error
The error is the sum of the sample error and the approximation error:
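In the notation above (my rendering, following Cucker and Smale), with f_H a minimizer of I[f] over H and f* the best possible (target) solution:

\[
\underbrace{I[f_S] - I[f^{*}]}_{\text{excess error}}
 \;=\;
\underbrace{\bigl(I[f_S] - I[f_{\mathcal H}]\bigr)}_{\text{sample error}}
 \;+\;
\underbrace{\bigl(I[f_{\mathcal H}] - I[f^{*}]\bigr)}_{\text{approximation error}}
\]

The sample error depends on the data and shrinks as n grows; the approximation error depends on the richness of H, not on the data.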
Cucker-Smale notation vs. ours:

ℰ(f) ←→ I(f)
ℰ_z(f) ←→ I_S(f)

Thus

L_z(f) ←→ I(f) − I_S(f)

For ERM

f_z ←→ f_S