How Many Samples To Learn A Finite Class?
6.089 GITCS
1 April 2008
Lecture 20
Lecturer: Scott Aaronson
For
$$m = O\!\left(\frac{1}{\epsilon}\log\frac{|C|}{\delta}\right)$$
samples drawn from D, any hypothesis h ∈ C we can find that agrees with all of these samples
(i.e., such that h(x_i) = c(x_i) for all i) will satisfy
$$\Pr_{x \sim D}[h(x) = c(x)] \geq 1 - \epsilon$$
with probability at least 1 − δ over the choice of x_1, ..., x_m.
We can prove this theorem by the contrapositive. Let h ∈ C be any bad hypothesis: that
is, such that Pr_{x∼D}[h(x) = c(x)] < 1 − ε. Then if we independently pick m points from the sample
distribution D, the hypothesis h will be correct on all of these points with probability at most
(1 − ε)^m. So by the union bound, the probability that there exists a bad hypothesis in C that
nevertheless agrees with all our sample data is at most |C|(1 − ε)^m (the number of hypotheses,
good or bad, times the maximum probability of each bad hypothesis agreeing with the sample
data). Now we just do algebra:
$$\begin{aligned}
\delta &= |C|(1-\epsilon)^m \\
m &= \log_{1-\epsilon}\frac{\delta}{|C|} \\
  &= \frac{\log(\delta/|C|)}{\log(1-\epsilon)} \\
  &= O\!\left(\frac{1}{\epsilon}\log\frac{|C|}{\delta}\right).
\end{aligned}$$
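In exact form, the requirement |C|(1 − ε)^m ≤ δ gives m ≥ ln(|C|/δ) / (−ln(1 − ε)). Here is a small Python sketch of that bound (the function name and the example numbers are just illustrative, not from the notes):

    import math

    def sample_bound(num_concepts, epsilon, delta):
        # Smallest m with |C| * (1 - epsilon)^m <= delta,
        # i.e. m >= ln(|C| / delta) / (-ln(1 - epsilon)).
        return math.ceil(math.log(num_concepts / delta) / -math.log(1.0 - epsilon))

    # Example: a million hypotheses, 5% error tolerance, 1% failure probability.
    print(sample_bound(10**6, epsilon=0.05, delta=0.01))  # about 360 samples

Note how m grows only logarithmically with |C|: squaring the number of hypotheses merely doubles the log term.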
Note that there always exists a hypothesis in C that agrees with c on all the sample points:
namely, c itself (i.e. the truth)! So as our learning algorithm, we can simply do the following:
1. Find any hypothesis h ∈ C that agrees with all the sample data (i.e., such that h(x_i) = c(x_i)
for all i = 1, ..., m).
2. Output h.
Such an h will always exist, and by the theorem above it will probably be a good hypothesis.
All we need is to see enough sample points.
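As a toy illustration of this algorithm (the threshold concept class and all names below are hypothetical, chosen only to make the sketch self-contained):

    import random

    def consistent_hypothesis(concept_class, samples):
        # Return any hypothesis agreeing with every labeled sample (h(x_i) == y_i).
        for h in concept_class:
            if all(h(x) == y for x, y in samples):
                return h
        return None  # cannot happen if the true concept is in the class

    # Toy finite class: thresholds t = 0, 1, ..., 100 on the integers.
    concept_class = [(lambda x, t=t: int(x >= t)) for t in range(101)]
    true_concept = concept_class[42]

    # Draw m samples from the uniform distribution D on {0, ..., 100}.
    m = 50
    samples = [(x, true_concept(x)) for x in (random.randrange(101) for _ in range(m))]

    h = consistent_hypothesis(concept_class, samples)

By the theorem, with m around (1/ε) ln(|C|/δ) samples, the returned h will, with probability at least 1 − δ, agree with the true concept on all but an ε fraction of D.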
The bound m = O((1/ε) log(|C|/δ)) works so long as |C| is finite, but it breaks down when |C| is infinite. How can we formalize the
intuition that the concept class of lines is learnable, but the concept class of arbitrary squiggles is
not? A line seems easy to guess (at least approximately), if I give you a small number of random
points and tell you whether each point is above or below the line. But if I tell you that these points
are on one side of a squiggle, and those points are on the other side, then no matter how many
points I give you, it seems impossible to predict which side the next point will be on.
So what's the difference between the two cases? It can't be the number of lines versus the number
of squiggles, since they're both infinite (and can be taken to have the same infinite cardinality).
From the floor: Isn't the difference just that you need two parameters to specify a line, but
infinitely many parameters to specify a squiggle?
That's getting closer! The trouble is that the notion of a "parameter" doesn't occur anywhere
in the theory; it's something we have to insert ourselves. To put it another way, it's possible to
come up with silly parameterizations where even a line takes infinitely many parameters to specify,
as well as clever parameterizations where a squiggle can be specified with just one parameter.
Well, the answer isn't obvious! The idea that finally answered the question is called VC-dimension
(after two of its inventors, Vapnik and Chervonenkis). We say the set of points x_1, ..., x_m
is shattered by a concept class C if for all 2^m possible settings of c(x_1), ..., c(x_m) to 0 or 1 (reject
or accept), there is some concept c ∈ C that agrees with those values. Then the VC-dimension
of C, denoted VCdim(C), is the size of the largest set of points shattered by C. If we can find
arbitrarily large (finite) sets of points that can be shattered, then VCdim(C) = ∞.
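For a finite class over a finite domain, shattering can be checked by brute force. Here is a Python sketch (mine, purely illustrative); with threshold concepts c_t(x) = 1 iff x ≥ t, every single point is shattered but no pair is, so the computed VC-dimension is 1:

    from itertools import combinations

    def is_shattered(points, concept_class):
        # The set is shattered iff all 2^len(points) labelings are realized by some concept.
        realized = {tuple(c(x) for x in points) for c in concept_class}
        return len(realized) == 2 ** len(points)

    def vc_dimension(domain, concept_class):
        # Size of the largest subset of the (finite) domain that is shattered.
        best = 0
        for size in range(1, len(domain) + 1):
            if any(is_shattered(s, concept_class) for s in combinations(domain, size)):
                best = size
        return best

    thresholds = [(lambda x, t=t: int(x >= t)) for t in range(5)]
    print(vc_dimension(range(4), thresholds))  # prints 1

Of course this brute force only applies to finite classes over finite domains; the interesting cases, like lines in the plane below, need a geometric argument instead.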
If we let C be the concept class of lines in the plane, then VCdim(C) = 3. Why? Well, we can
put three points in a triangle, and each of the eight possible classifications of those points can be
realized by a single line. On the other hand, there's no set of four points such that all sixteen possible
classifications of those points can be realized by a line. Either the points form a quadrilateral, in
which case we can't realize the labeling that gives each pair of opposite corners the same classification
but the two pairs different ones; or they form a triangle and an interior point, in which case we
can't make the interior point have a different classification from the other three points; or three of
the points are collinear, in which case we can't give the middle of those three points a different
classification from the outer two.
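The positive half of this claim (that three points in a triangle are shattered) can even be checked numerically. The following sketch is only an illustration: it represents a line as w·x + b = 0, labels points by the side they fall on, and does a crude grid search over angles and offsets (the grid sizes are arbitrary, just fine enough to find a witness for each of the 8 labelings):

    import math
    from itertools import product

    def halfplane_realizes(points, labels, steps=72):
        # Search lines w.x + b = 0 for one putting each labeled point on the required side.
        for k in range(steps):
            theta = 2 * math.pi * k / steps
            w = (math.cos(theta), math.sin(theta))
            for b in [i / 4.0 for i in range(-8, 9)]:  # offsets -2.0, -1.75, ..., 2.0
                signs = [int(w[0] * x + w[1] * y + b > 0) for (x, y) in points]
                if signs == list(labels):
                    return True
        return False

    triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    # Every one of the 2^3 labelings should be realized, so the triangle is shattered.
    print(all(halfplane_realizes(triangle, labels) for labels in product([0, 1], repeat=3)))

It prints True for the triangle; by the argument above, no such search could succeed on all sixteen labelings of four points.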
Blumer et al.¹ proved that a concept class is PAC-learnable if and only if its VC-dimension is
finite, and that
$$m = O\!\left(\frac{\mathrm{VCdim}(C)}{\epsilon}\log\frac{1}{\delta}\right)$$
samples suffice. Once again, a learning algorithm that works is just to output any hypothesis h in
the concept class that agrees with all the data. Unfortunately we don't have time to prove that
here.
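Ignoring the constant hidden inside the O(), the bound grows like (VCdim(C)/ε)·log(1/δ). A tiny sketch (illustrative only; the constant is deliberately omitted) for lines in the plane, where VCdim(C) = 3:

    import math

    # Growth rate of the Blumer et al. bound, with the O() constant omitted.
    def vc_sample_scale(vcdim, epsilon, delta):
        return (vcdim / epsilon) * math.log(1.0 / delta)

    print(vc_sample_scale(3, 0.05, 0.01))  # about 276, times an unknown constant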
A useful intuition is provided by a corollary of Blumer et al.'s result called the Occam's Razor
Theorem: whenever your hypothesis has sufficiently fewer bits of information than the original
data, it will probably correctly predict most future data drawn from the same distribution.
¹ A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, "Learnability and the Vapnik-Chervonenkis dimension," Journal of the ACM, 1989.
MIT OpenCourseWare
http://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.