
9.520 Statistical Learning Theory and Applications

Sasha Rakhlin, Andrea Caponnetto, Ryan Rifkin and Tomaso Poggio

9.520, spring 2006


Learning: Brains and Machines

Learning is the gateway to understanding the brain and to making intelligent machines.

Problem of learning: a focus for
o modern math
o computer algorithms
o neuroscience

9.520, spring 2006


Learning: much more than memory

o The role of learning (theory and applications in many different domains) has grown substantially in CS

o Plasticity and learning have taken center stage in the neurosciences

o Until now the math and engineering of learning have developed independently of neuroscience… but that may begin to change: we will see the example of learning + computer vision…
Learning: math, engineering, neuroscience

$$\min_{f \in \mathcal{H}} \left[ \frac{1}{l} \sum_{i=1}^{l} V(y_i, f(x_i)) + \mu \, \|f\|_K^2 \right]$$

Theorems on foundations of learning:

Learning theory + algorithms → Predictive algorithms

ENGINEERING APPLICATIONS:
• Bioinformatics
• Computer vision
• Computer graphics, speech synthesis, creating a virtual actor

Computational Neuroscience (models + experiments): how visual cortex works, and how it may suggest better computer vision systems
Class

Rules of the game:
o problem sets (2)
o final project (min = review; max = journal paper)
o grading
o participation!
o mathcamps? Monday late afternoon?

Web site: http://www.mit.edu/~9.520/

9.520, spring 2006


9.520 Statistical Learning Theory and Applications
Class 24: Project presentations

2:30-2:45 "Adaboosting SVMs to recover motor behavior from motor data", Neville Sanjana
2:45-3:00 "Review of Hierarchical Learning", Yann LeTallec
3:00-3:15 "An analytic comparison between SVMs and Bayes Point Machines", Ashish Kapoor
3:15-3:30 "Semi-supervised learning for tree-structured data", Charles Kemp
3:30-3:45 "Unsupervised Clustering with Regularized Least Squares classifiers", Ben Recht
3:40-3:50 "Multi-modal Human Identification", Brian Kim
3:50-4:00 "Regret Bounds, Sequential Decision-Making and Online Learning", Sanmay Das

9.520, spring 2003


9.520 Statistical Learning Theory and Applications
Class 25: Project presentations

2:35-2:50 "Learning card playing strategies with SVMs", David Craft and Timothy Chan
2:50-3:00 "Artificial Markets: Learning to trade using Support Vector Machines", Adlar Kim
3:00-3:10 "Feature selection: literature review and new development", Wei Wu
3:10-3:25 "Man vs machines: A computational study on face detection", Thomas Serre

9.520, spring 2003


Overview of overview

o The problem of supervised learning: the "real" math behind it

o Examples of engineering applications (from our group)

o Learning and the brain (example of object recognition)

9.520, spring 2006


Learning from examples: the goal is not to memorize but to generalize, i.e. predict.

INPUT → f → OUTPUT

Given a set of l examples (data)

$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$$

Question: find a function f such that

$$f(x) = \hat{y}$$

is a good predictor of y for a future input x (fitting the data is not enough!).
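A minimal numerical sketch of this setup (my toy example, not course code; assumes numpy): l = 5 examples, a learned f, and a prediction ŷ at a future input x.

```python
# Toy instance of learning from examples: fit f on l = 5 data points,
# then predict at an input that was not in the training set.
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # x_1 ... x_l
Y = np.array([0.1, 1.9, 4.2, 5.8, 8.1])   # y_1 ... y_l

w, b = np.polyfit(X, Y, deg=1)            # f(x) = w*x + b, fit to the data
x_future = 5.0
y_hat = w * x_future + b                  # y-hat = f(x)
print(f"prediction at x = {x_future}: {y_hat:.2f}")
# What matters is not how well f fits X, Y but whether y_hat is close
# to the y actually observed at the future x.
```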
Reason for you to know theory

We will speak today and later about applications… they are not simply about using a black box. The best ones are about the right formulation of the problem: choice of representation (inputs, outputs), choice of examples, validating predictivity, not datamining.

… f(x) = wx + b
Notes

Two strands in learning theory:

o Bayes, graphical models…

o Statistical learning theory, regularization (closer to classical math: functional analysis + probability theory + empirical process theory…)

Interesting development: the theoretical foundations of learning are becoming part of mainstream mathematics.
Learning from examples: predictive, multivariate function estimation from sparse data (not just curve fitting)

[Figure: in the (x, y) plane, data sampled from f, the function f, and its approximation]
Generalization: estimating the value of the function where there are no data (good generalization means predicting the function well; most important is for the empirical or validation error to be a good proxy for the prediction error).

Regression: the function is real valued

Classification: the function is binary

9.520, spring 2006
Thus… the key requirement (and the main focus of learning theory) for solving the problem of learning from examples is generalization (and possibly even consistency).

A standard way to learn from examples is ERM (empirical risk minimization).

The problem does not have a predictive solution in general (just fitting the data does not work). Choosing an appropriate hypothesis space H (for instance a compact set of continuous functions) can guarantee generalization (how good depends on the problem and other parameters). The sketch below illustrates the point.
9.520, spring 2006
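A small illustration (my sketch, not course code; assumes numpy): ERM over a modest hypothesis space versus an overly rich one, using polynomials of low versus high degree as H.

```python
# ERM in two hypothesis spaces: in the rich space the empirical risk
# is driven to ~0 but the error away from the sample blows up.
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-1, 1, 40))
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(40)
x_tr, y_tr = x[::2], y[::2]          # training examples
x_new, y_new = x[1::2], y[1::2]      # inputs with no training data

for degree in (3, 19):               # small H vs. very rich H
    f = np.poly1d(np.polyfit(x_tr, y_tr, deg=degree))
    emp = np.mean((f(x_tr) - y_tr) ** 2)    # empirical risk
    new = np.mean((f(x_new) - y_new) ** 2)  # proxy for prediction error
    print(f"degree {degree:2d}: empirical risk {emp:.4f}, off-sample {new:.4f}")
```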
Learning from examples: another goal (from inverse problems) is to ensure that the problem is well-posed (the solution exists, is unique, and is stable).

A problem is well-posed (J. S. Hadamard, 1865-1963) if its solution

o exists,
o is unique, and
o is stable, i.e. depends continuously on the data (here, the examples).

9.520, spring 2006


Thus… two key requirements for solving the problem of learning from examples: well-posedness and generalization.

Consider the standard learning algorithm, i.e. ERM. The main focus of learning theory is predictivity of the solution, i.e. generalization. The problem is, in addition, ill-posed. It was known that choosing an appropriate hypothesis space H ensures predictivity; it was also known that an appropriate H provides well-posedness. A couple of years ago it was shown that generalization and well-posedness are equivalent, i.e. one implies the other. Thus a stable solution is predictive and (for ERM) also vice versa. A numerical illustration of stability follows.
9.520, spring 2006
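A numerical illustration of stability as continuous dependence on the data (my sketch, not from the course; assumes numpy): solve regularized least squares twice, with a single training label perturbed, and compare the two solutions.

```python
# Stability check: solve regularized least squares twice, with one
# training label perturbed, and see how far the solution moves.
import numpy as np

rng = np.random.default_rng(1)
l = 30
t = rng.uniform(-1, 1, l)
# two nearly collinear input features make the problem ill-posed
X = np.column_stack([t, t + 1e-6 * rng.standard_normal(l), np.ones(l)])
y = 2 * t + 0.1 * rng.standard_normal(l)

def rls(X, y, lam):
    # w = argmin (1/l) ||Xw - y||^2 + lam ||w||^2
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * l * np.eye(d), X.T @ y)

y_pert = y.copy()
y_pert[0] += 0.1                     # change a single example slightly

for lam in (0.0, 1e-1):              # unregularized vs. regularized
    dw = np.linalg.norm(rls(X, y, lam) - rls(X, y_pert, lam))
    print(f"lambda = {lam:g}: solution moved by {dw:.2e}")
# Without regularization the tiny data change swings the solution by
# orders of magnitude more than with lam > 0: ill-posed vs. stable.
```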
More later…..

9.520, spring 2006


Learning theory and the natural sciences

Conditions for generalization in learning theory have deep, almost philosophical, implications: they may be regarded as conditions that guarantee a theory to be predictive (that is, scientific).
We have used a simple algorithm -- one that ensures generalization -- in most of our applications:

$$\min_{f \in \mathcal{H}} \left[ \frac{1}{l} \sum_{i=1}^{l} V(f(x_i) - y_i) + \lambda \, \|f\|_K^2 \right]$$

implies

$$f(x) = \sum_{i=1}^{l} \alpha_i K(x, x_i)$$

The equation includes Regularization Networks (special cases are splines, Radial Basis Functions and Support Vector Machines). The function is nonlinear and a general approximator…

For a review, see Poggio and Smale, The Mathematics of Learning, Notices of the AMS, 2003.
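For the square loss V, this minimizer has a well-known closed form: the coefficients solve (K + λlI)α = y, where K is the kernel matrix on the examples. A minimal implementation sketch (mine, not the course's code; assumes numpy, Gaussian kernel chosen for illustration):

```python
# Regularized least squares in an RKHS with a Gaussian kernel.
# Minimizes (1/l) sum_i (f(x_i) - y_i)^2 + lam * ||f||_K^2 over f in H_K;
# by the representer theorem f(x) = sum_i alpha_i K(x, x_i).
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def rls_fit(X, y, lam=1e-2, sigma=0.5):
    l = len(X)
    K = gaussian_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + lam * l * np.eye(l), y)
    return lambda X_new: gaussian_kernel(X_new, X, sigma) @ alpha

# Usage: regression from sparse, noisy examples.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (30, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(30)
f = rls_fit(X, y)
print(f(np.array([[0.5]])))   # prediction at a new input
```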
Classical framework but with a more general loss function

The algorithm uses a quite general space of functions or "hypotheses": RKHSs. An extension of the classical framework can provide a better measure of "loss" (for instance for classification)…

$$\min_{f \in \mathcal{H}} \left[ \frac{1}{l} \sum_{i=1}^{l} V(f(x_i) - y_i) + \lambda \, \|f\|_K^2 \right]$$

9.520, spring 2006                Girosi, Caprile, Poggio, 1990


Another remark: equivalence to networks

Many different V lead to the same form of solution…

$$f(x) = \sum_{i=1}^{l} c_i K(x, x_i) + b$$

…and the solution can be "written" as the same type of network, where the value of K(x, x_i) corresponds to the "activity" of a "unit" and the c_i correspond to (synaptic) "weights".

[Network diagram: inputs x_1 … x_d feed kernel units K, whose outputs are weighted by the c_i and summed to give f]
Theory summary

In the course we will introduce:

• Generalization (predictivity of the solution)
• Stability (well-posedness)
• RKHSs as hypothesis spaces
• Regularization techniques leading to RNs and SVMs
• Manifold regularization (semi-supervised learning)
• Unsupervised learning
• Generalization bounds based on stability
• Alternative classical bounds (VC and V_γ dimensions)
• Related topics
• Applications

9.520, spring 2006
Syllabus

9.520, spring 2006


Overview of overview

o Supervised learning: real math

o Examples of recent and ongoing in-house engineering applications

o Learning and the brain

9.520, spring 2006


Learning from Examples: engineering
applications

INPUT OUTPUT

Bioinformatics
Artificial Markets
Object categorization
Object identification
Image analysis
Graphics
Text Classification
…..
9.520, spring 2006
Bioinformatics application: predicting the type of cancer from DNA chip signals

Learning from examples paradigm:

[Diagram: Examples → Statistical Learning Algorithm → Prediction; a New sample fed to the trained algorithm yields a Prediction]
9.520, spring 2006


Bioinformatics application: predicting the type of cancer from DNA chips

New feature selection SVM:

o Only 38 training examples, 7100 features

o AML vs ALL: with 40 genes, 34/34 correct, 0 rejects; with 5 genes, 31/31 correct, 3 rejects of which 1 is an error.

Pomeroy, S.L., P. Tamayo, M. Gaasenbeek, L.M. Sturla, M. Angelo, M.E. McLaughlin, J.Y.H. Kim, L.C. Goumnerova, P.M. Black, C. Lau, J.C. Allen, D. Zagzag, M.M. Olson, T. Curran, C. Wetmore, J.A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D.N. Louis, J.P. Mesirov, E.S. Lander and T.R. Golub. Prediction of Central Nervous System Embryonal Tumour Outcome Based on Gene Expression, Nature, 2002.

9.520, spring 2006
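The slides do not spell out the selection algorithm, so the following is only an illustrative stand-in: recursive feature elimination with a linear SVM (SVM-RFE, in the style of Guyon et al., 2002), a standard way to cut thousands of expression features down to a few dozen. It assumes scikit-learn, and the data here are random placeholders, not the leukemia data.

```python
# Sketch of SVM-based feature selection on gene-expression-shaped data:
# repeatedly fit a linear SVM and discard the genes with the smallest
# weights until only 40 remain.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((38, 7100))   # 38 samples x 7100 "genes" (fake)
y = rng.integers(0, 2, 38)            # AML vs. ALL labels (fake)

selector = RFE(LinearSVC(C=1.0, dual=False), n_features_to_select=40,
               step=0.5)              # drop half the remaining genes per round
selector.fit(X, y)
kept_genes = np.flatnonzero(selector.support_)   # indices of the 40 kept genes
print(kept_genes[:10])
```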


Learning from Examples: engineering
applications

INPUT OUTPUT

Bioinformatics
Artificial Markets
Object categorization
Object identification
Image analysis
Graphics
Text Classification
…..
9.520, spring 2006
Face identification: example

An old view-based system: 15 views

Performance: 98% on a 68-person database

Beymer, 1995

9.520, spring 2006


Learning from Examples: engineering
applications

INPUT OUTPUT

Bioinformatics
Artificial Markets
Object categorization
Object identification
Image analysis
Graphics
Text Classification
…..
9.520, spring 2006
System Architecture

[Pipeline: scanning in x, y and scale → preprocessing with an overcomplete dictionary of Haar wavelets → SVM classifier; the classifier is trained offline from a training database via a QP solver]

9.520, spring 2006                Sung, Poggio 1994; Papageorgiou and Poggio, 1998
People classification/detection: training the system

[Two sets of training images: 1848 patterns and 7189 patterns]

Representation: overcomplete dictionary of Haar wavelets; high-dimensional feature space (>1300 features)

Core learning algorithm: Support Vector Machine classifier

→ pedestrian detection system

9.520, spring 2006
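A schematic sketch of that pipeline (my code, not the original CBCL system; assumes numpy and scikit-learn): Haar wavelet responses computed via an integral image feed a support vector classifier. The window size, wavelet layout and data below are all placeholders.

```python
# Haar-wavelet features + SVM, in the spirit of the pedestrian detector.
import numpy as np
from sklearn.svm import SVC

def integral_image(img):
    return np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def box(ii, r0, c0, r1, c1):
    # sum of img[r0:r1, c0:c1] from the integral image
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def haar_features(img, h=8, w=4, stride=4):
    # one vertical-edge wavelet type at one scale, densely placed;
    # the real dictionary was overcomplete: several types and scales
    ii = integral_image(img)
    H, W = img.shape
    return np.array([box(ii, r, c, r + h, c + w) -
                     box(ii, r, c + w, r + h, c + 2 * w)
                     for r in range(0, H - h, stride)
                     for c in range(0, W - 2 * w, stride)])

rng = np.random.default_rng(0)
windows = rng.random((100, 64, 32))        # stand-ins for 64x32 image windows
labels = rng.integers(0, 2, 100)           # pedestrian / non-pedestrian (fake)
X = np.stack([haar_features(im) for im in windows])
clf = SVC(kernel="rbf").fit(X, labels)     # core learning algorithm
print(clf.predict(X[:3]))
```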
Trainable System for Object Detection: Pedestrian detection - Results

Papageorgiou and Poggio, 1998

The system was tested in a test car (Mercedes): a fast version, integrated with a real-time obstacle detection system, was installed in an experimental Mercedes.

[MPEG video]

Constantine Papageorgiou
Face classification/detection: training the system

[Training patterns]

Representation: grey levels (normalized) or overcomplete dictionary of Haar wavelets

Core learning algorithm: Support Vector Machine classifier

→ face detection system

9.520, spring 2006
Face identification: training the system

[Training patterns]

Representation: grey levels (normalized) or overcomplete dictionary of Haar wavelets

Core learning algorithm: Support Vector Machine classifier

→ face identification system

9.520, spring 2006
Computer vision: new StreetScenes Project
Learning Algorithms for Scene Understanding

Project timeline: construction of the StreetScenes Database → automatic learning of object-specific features or parts → recognition of 10 object categories → automatic scene description

Lior Wolf, Stan Bileschi, …


Learning from Examples: Applications

INPUT OUTPUT

Object identification
Object categorization
Image analysis
Graphics
Finance
Bioinformatics

9.520, spring 2006
Image Analysis

IMAGE ANALYSIS: OBJECT RECOGNITION AND POSE


ESTIMATION

⇒ Bear (0° view)

⇒ Bear (45° view)

9.520, spring 2006


Computer vision: analysis of facial expressions


The main goal is to estimate basic facial parameters, e.g.


degree of mouth openness, through learning. One of the main
applications is video-speech fusion to improve speech
recognition systems.
9.520, spring 2002 Kumar, Poggio, 2001
Learning from Examples: engineering
applications
CBCL MIT

INPUT OUTPUT

Bioinformatics
Artificial Markets
Object categorization
Object identification
Image analysis
Image synthesis, e.g. Graphics
Text Classification
…..
9.520, spring 2003
Image Synthesis

Metaphor for UNCONVENTIONAL GRAPHICS

Θ = 0° view ⇒

Θ = 45° view ⇒

9.520, spring 2006


Reconstructed 3D Face Models from 1 image

Blanz and Vetter, MPI, SigGraph '99

9.520, spring 2006


Reconstructed 3D Face Models from 1 image

Blanz and Vetter, MPI, SigGraph '99
V. Blanz, C. Basso, T. Poggio and T. Vetter, 2003

9.520, spring 2006
Extending the same basic learning techniques (in 2D):
Trainable Videorealistic Face Animation
(voice is real, video is synthetic)

Ezzat, Geiger, Poggio, SigGraph 2002


Trainable Videorealistic Face Animation

1. Learning: the system learns from 4 mins of video the face appearance (Morphable Model) and the speech dynamics of the person.

2. Run Time: for any speech input the system provides as output a synthetic video stream.

[Pipeline: phone stream (e.g. /SIL/ /B/ /AE/ /JH/ /SIL/) → phonetic models → trajectory synthesis → MMM + image prototypes]

Tony Ezzat, Geiger, Poggio, SigGraph 2002


A Turing test: what is real and what is synthetic?

We assessed the realism of the talking face with psychophysical experiments. The data suggest that the system passes a visual version of the Turing test.
Overview of overview

o Supervised learning: the problem and how to frame it within classical math

o Examples of in-house applications

o Learning and the brain

9.520, spring 2006


Learning to recognize objects and the ventral stream in visual cortex

Some numbers:

Human brain
o 10^11 - 10^12 neurons
o 10^14+ synapses

Neuron
o Fine dendrites: 0.1 µm diameter
o Lipid bilayer membrane: 5 nm thick
o Specific proteins: pumps, channels, receptors, enzymes
o A synaptic packet of transmitter opens 2 x 10^3 channels (with 10^4 ACh molecules)
o Each channel: conductance g = 10^-11 mho
o Fundamental time scale: 1 msec
A theory of the ventral stream of visual cortex

Thomas Serre, Minjoon Kouh, Charles Cadieu, Ulf Knoblich and Tomaso Poggio

The McGovern Institute for Brain Research,
Department of Brain and Cognitive Sciences,
Massachusetts Institute of Technology
The Ventral Visual Stream: From V1 to IT

[Figure modified from Ungerleider and Haxby, 1994; Hubel & Wiesel, 1959; Desimone, 1991]
Summary of "basic facts"

Accumulated evidence points to three (mostly accepted) properties of the ventral visual stream architecture:

• Hierarchical build-up of invariances (first to translation and scale, then to viewpoint etc.), of the size of the receptive fields, and of the complexity of the preferred stimuli

• Basic feed-forward processing of information (for "immediate" recognition tasks)

• Learning of an individual object generalizes to scale and position
Mapping the ventral stream into a model

Serre, Kouh, Cadieu, Knoblich, Poggio, 2005;
Riesenhuber et al., Nat. Neurosci., 1999, 2000 …
The model

It claims to interpret or predict several existing data in microcircuits and system physiology, and also in cognitive science:

• What some complex cells in V1 and V4 do, and why: MAX…
• View-tuning of IT cells (Logothetis)
• Response to pseudo-mirror views
• Effect of scrambling
• Multiple objects
• Robustness/sensitivity to clutter
• K. Tanaka's simplification procedure
• Categorization tasks (cats vs dogs)
• Invariance to translation, scale etc.
• Read-out data…
• Gender classification
• Face inversion effect: experience, viewpoint, other-race, configural vs. featural representation
• Binding problem, no need for oscillations…
Neural Correlate of Categorization (NCC)

Define categories in morph space:

[Morph-space diagram: 100% Cat prototypes and 100% Dog prototypes at the extremes; 80% and 60% Cat morphs and 60% and 80% Dog morphs in between, separated by the category boundary]

9.520, spring 2006
Categorization task

Train the monkey on a categorization task:

[Trial sequence: Fixation (500 ms) → Sample (600 ms) → Delay (1000 ms) → Test (Match), or Test (Nonmatch) followed by another Delay and a Test (Match)]

After training, record from neurons in IT & PFC.


9.520, spring 2006
Single cell example: a "categorical" PFC neuron that responds more strongly to DOGS than to CATS

[Plot: firing rate (Hz) vs. time from sample stimulus onset (-500 to 2000 ms), spanning the Fixation, Sample, Delay and Choice epochs, for Dog 100%/80%/60% and Cat 100%/80%/60% morphs]

D. Freedman, E. Miller, M. Riesenhuber and T. Poggio (Science, 2001)

9.520, spring 2006
The model fits many physiological data and predicts several new ones…

…recently it provided a surprise (for us) when we compared its performance with machine vision…

Sample results on the Caltech 101-object dataset: the model performs at the level of the best computer vision systems.
…and another surprise was the comparison with human performance (Thomas Serre with Aude Oliva) on rapid categorization of complex natural images.

Experiment: rapid (to avoid backprojections) animal detection in natural images

[Paradigm: image (20 msec) → interstimulus interval (30 msec) → 1/f-noise mask (80 msec) → report: animal present or not?]

[Thorpe et al., 1996; Van Rullen & Koch, 2003; Oliva & Torralba, in press]
Targets and distractors

[Serre, Oliva & Poggio, in prep]


Humans achieve model-level performance

Model results obtained without tuning a single parameter!

Human: 80% correct vs. Model: 82% correct

[Serre, Oliva & Poggio, in prep]


Theory supported by data in V1, V4 and IT; works as well as the best computer vision; mimics human performance.

Freedman et al., Science, 2002
Logothetis et al., Curr. Biol., 1995
Gawne et al., J. Neurophysiol., 2002
Lampl et al., J. Neurophysiol., 2004
A challenge for learning theory:

an unusual, hierarchical architecture with unsupervised and supervised learning, and learning of invariances…

We will see later why this is unusual and interesting for learning theory!
