CS 330
Lifelong Learning
Reminders
Tuesday (Nov 30th): Project poster session
Wednesday (Dec 8th): Project due
This Wednesday: Lecture and instructor office hours over zoom
Plan for Today
The lifelong learning problem statement
Basic approaches to lifelong learning
Can we do better than the basics?
Revisiting the problem statement
from the meta-learning perspective
A brief review of problem statements.
Meta-Learning: given an i.i.d. task distribution, learn a new task efficiently (learn to learn tasks, then quickly learn a new task).
Multi-Task Learning: learn to solve a set of tasks (learn tasks, then perform the tasks).
In contrast, many real world settings look like:
[Diagram: tasks arriving one at a time over a timeline, in contrast to the batch of tasks assumed by meta-learning (learn to learn tasks, quickly learn new task) and multi-task learning (learn tasks, perform tasks)]
Some examples:
- a student learning concepts in school
- a deployed image classification system learning from a stream of images from users
- a robot acquiring an increasingly large set of skills in different environments
- a virtual assistant learning to help different users with different tasks at different points in time
- a doctor’s assistant aiding in medical decision-making
Our agents may not be given a large batch of data/tasks right off the bat!
Some Terminology
Sequential learning settings: online learning, lifelong learning, continual learning, incremental learning, streaming data
(distinct from sequence data and sequential decision-making)
What is the lifelong learning problem statement?
Exercise:
1. Pick an example setting.
2. Discuss the problem statement in your break-out room:
(a) how would you set up an experiment to develop & test your algorithm?
(b) what are desirable/required properties of the algorithm?
(c) how do you evaluate such a system?
Example settings:
A. a student learning concepts in school
B. a deployed image classification system learning from a stream of images from users
C. a robot acquiring an increasingly large set of skills in different environments
D. a virtual assistant learning to help different users with different tasks at different points in time
E. a doctor’s assistant aiding in medical decision-making
Desirable properties/considerations | Evaluation setup
What is the lifelong learning problem statement?
Some considerations:
- computational resources
- memory
- model performance
- data efficiency
Problem variations:
- task/data order: i.i.d. vs. predictable vs. curriculum vs. adversarial
- discrete task boundaries vs. continuous shifts (vs. both)
- known task boundaries/shifts vs. unknown
- others: privacy, interpretability, fairness, test-time compute & memory
Substantial variety in problem statement!
What is the lifelong learning problem statement?
General [supervised] online learning problem:
for t = 1, …, n
  observe 𝑥𝑡   (<— if task boundaries are observable: observe 𝑥𝑡, 𝑧𝑡)
  predict 𝑦̂𝑡
  observe label 𝑦𝑡
i.i.d. setting: 𝑥𝑡 ∼ 𝑝(𝑥), 𝑦𝑡 ∼ 𝑝(𝑦|𝑥), where 𝑝 is not a function of 𝑡
otherwise: 𝑥𝑡 ∼ 𝑝𝑡(𝑥), 𝑦𝑡 ∼ 𝑝𝑡(𝑦|𝑥)
streaming setting: cannot store (𝑥𝑡, 𝑦𝑡)
- lack of memory
- lack of computational resources
- privacy considerations
- want to study neural memory mechanisms
(true in some cases, but not in many cases! recall: replay buffers)
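The loop above can be written compactly as follows (our own illustration; the `learner` object with `predict`/`update` methods and the `loss_fn` are assumed interfaces, not a standard API):

```python
def run_online_learning(learner, stream, loss_fn):
    """Generic online loop: predict first, then observe the label and update."""
    cumulative_loss = 0.0
    for x_t, y_t in stream:
        y_hat = learner.predict(x_t)      # prediction is made before y_t is revealed
        cumulative_loss += loss_fn(y_hat, y_t)
        learner.update(x_t, y_t)          # learn from the labeled example
        # In the streaming setting, (x_t, y_t) must be discarded here;
        # otherwise the learner may store it (e.g., in a replay buffer).
    return cumulative_loss
```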
What do you want from your lifelong learning algorithm?
minimal regret (that grows slowly with 𝑡)
regret: cumulative loss of the learner minus the cumulative loss of the best learner in hindsight
(cannot be evaluated in practice, but useful for analysis)
$$\text{Regret}_T := \sum_{t=1}^{T} \mathcal{L}_t(\theta_t) \;-\; \min_{\theta} \sum_{t=1}^{T} \mathcal{L}_t(\theta)$$
Regret that grows linearly in 𝑡 is trivial. Why?
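A quick note on why (ours, following directly from the definition above): a learner that never improves, suffering a constant excess loss $\epsilon > 0$ on every round relative to the best fixed parameters $\theta^\ast$, already achieves
$$\text{Regret}_T = \sum_{t=1}^{T} \big( \mathcal{L}_t(\theta_t) - \mathcal{L}_t(\theta^\ast) \big) = \epsilon T,$$
which is linear in $T$. Only sublinear regret, i.e. $\text{Regret}_T / T \to 0$, implies that the learner's average loss approaches that of the best predictor in hindsight.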
What do you want from your lifelong learning algorithm?
regret: cumulative loss of the learner minus the cumulative loss of the best learner in hindsight
$$\text{Regret}_T := \sum_{t=1}^{T} \mathcal{L}_t(\theta_t) \;-\; \min_{\theta} \sum_{t=1}^{T} \mathcal{L}_t(\theta)$$
t | 𝑦̂𝑡 | 𝑦𝑡
1 | 10 | 30
2 | 30 | 28
3 | 28 | 32
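As a toy illustration of how regret would be computed from a table like this (our own example; the slide does not specify a loss, so we assume absolute error and a best-in-hindsight constant prediction):

```python
# Toy regret computation for the table above. Assumptions (ours): absolute
# error as the per-step loss, and a single constant prediction as the
# best-in-hindsight comparator.
y_hat = [10, 30, 28]                                        # learner's predictions
y     = [30, 28, 32]                                        # observed labels

learner_loss = sum(abs(p - t) for p, t in zip(y_hat, y))    # 20 + 2 + 4 = 26

# Under absolute error, the best constant prediction in hindsight is the median.
best = sorted(y)[len(y) // 2]                               # 30
best_loss = sum(abs(best - t) for t in y)                   # 0 + 2 + 2 = 4

regret = learner_loss - best_loss                           # 26 - 4 = 22
```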
What do you want from your lifelong learning algorithm?
positive & negative transfer
positive forward transfer: previous tasks cause you to do better on future tasks, compared to learning future tasks from scratch
positive backward transfer: current tasks cause you to do better on previous tasks, compared to learning past tasks from scratch
negative transfer is defined analogously: "better" becomes "worse"
Plan for Today
The lifelong learning problem statement
Basic approaches to lifelong learning
Can we do better than the basics?
Revisiting the problem statement
from the meta-learning perspective
Approaches
Store all the data you’ve seen so far, and train on it. —> follow the leader algorithm
+ will achieve very strong performance
- computation intensive
- can be memory intensive [depends on the application]
Take a gradient step on the datapoint you observe. —> stochastic gradient descent
+ computationally cheap
+ requires 0 memory
- subject to negative backward transfer (“forgetting”, sometimes referred to as catastrophic forgetting)
- slow learning —> continuous fine-tuning can help
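To make the tradeoff concrete, here is a minimal sketch of the two approaches (our own toy PyTorch code, not from the lecture; `follow_the_leader`, `online_sgd`, and the `stream`/`loss_fn` interfaces are illustrative names):

```python
import torch

def follow_the_leader(model, loss_fn, stream, epochs=5, lr=1e-2):
    """Store every (x, y) seen so far and retrain on all of it at every step."""
    buffer = []
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x_t, y_t in stream:
        buffer.append((x_t, y_t))          # memory grows with the stream
        for _ in range(epochs):            # compute grows with the stream too
            for x, y in buffer:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()

def online_sgd(model, loss_fn, stream, lr=1e-2):
    """Take a single gradient step on each datapoint as it arrives (no storage)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x_t, y_t in stream:
        opt.zero_grad()
        loss_fn(model(x_t), y_t).backward()
        opt.step()
```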
Very simple continual RL algorithm
Julian, Swanson, Sukhatme, Levine, Finn, Hausman, Never Stop Learning, 2020
[Figure: 86% vs. 49%; 7 robots collected 580k grasps]
Can we do better?
What about negative transfer?
Plan for Today
The lifelong learning problem statement
Basic approaches to lifelong learning
Can we do better than the basics?
Revisiting the problem statement
from the meta-learning perspective
Case Study: Can we modify vanilla SGD to avoid negative backward transfer?
(from scratch)
Lopez-Paz & Ranzato. Gradient Episodic Memory for Continual Learning. NeurIPS ‘17
Idea:
(1) store a small amount of data per task in memory
(2) when making updates for new tasks, ensure that they don’t unlearn previous tasks
How do we accomplish (2)?
Setup: learning a predictor $y_t = f_\theta(x_t, z_t)$, with memory $\mathcal{M}_k$ for task $z_k$.
For $t = 0, \dots, T$:
$$\min_{\theta} \; \mathcal{L}\big(f_\theta(\cdot, z_t), (x_t, y_t)\big) \quad \text{subject to} \quad \mathcal{L}(f_\theta, \mathcal{M}_k) \le \mathcal{L}(f_\theta^{t-1}, \mathcal{M}_k) \;\; \text{for all } z_k < z_t$$
(i.e. such that the loss on previous tasks doesn’t get worse)
Assume local linearity, so the constraint becomes:
$$\langle g_t, g_k \rangle := \left\langle \frac{\partial \mathcal{L}(f_\theta, (x_t, y_t))}{\partial \theta}, \; \frac{\partial \mathcal{L}(f_\theta, \mathcal{M}_k)}{\partial \theta} \right\rangle \ge 0 \;\; \text{for all } z_k < z_t$$
Can formulate & solve as a QP.
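To make the constraint concrete, here is a minimal sketch of a GEM-style update (our own illustration, not the authors' code; `gem_style_step`, `memories`, and `loss_fn` are assumed names). For simplicity it projects out one violated constraint at a time, whereas full GEM solves the small QP over all violated constraints jointly:

```python
import torch

def flat_grad(loss, params):
    """Gradient of `loss` w.r.t. `params`, flattened into one vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def gem_style_step(model, loss_fn, x_t, y_t, memories, lr=1e-2):
    """One update on (x_t, y_t) that tries not to increase loss on past-task memories."""
    params = [p for p in model.parameters() if p.requires_grad]
    g = flat_grad(loss_fn(model(x_t), y_t), params)

    # Check <g, g_k> >= 0 for each past task k; project if violated.
    for x_k, y_k in memories:                # memories: list of (x_k, y_k) per past task
        g_k = flat_grad(loss_fn(model(x_k), y_k), params)
        dot = torch.dot(g, g_k)
        if dot < 0:
            # Remove the component of g that conflicts with task k
            # (one-constraint-at-a-time simplification of the QP).
            g = g - (dot / torch.dot(g_k, g_k)) * g_k

    # Apply the (possibly projected) gradient as a plain SGD step.
    with torch.no_grad():
        offset = 0
        for p in params:
            n = p.numel()
            p -= lr * g[offset:offset + n].view_as(p)
            offset += n
```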
Lopez-Paz & Ranzato. Gradient Episodic Memory for Continual Learning. NeurIPS ‘17
Experiments
Problems:
- MNIST permutations
- MNIST rotations
- CIFAR-100 (5 new classes/task)
Total memory size: 5012 examples
[Results table: BWT = backward transfer, FWT = forward transfer]
If we take a step back… do these experimental domains make sense?
Can we meta-learn how to avoid negative backward transfer?
Javed & White. Meta-Learning Representations for Continual Learning. NeurIPS ‘19
Beaulieu et al. Learning to Continually Learn. ‘20
Plan for Today
The lifelong learning problem statement
Basic approaches to lifelong learning
Can we do better than the basics?
Revisiting the problem statement
from the meta-learning perspective
More realistically:
[Timeline diagram: a stream of “learn” steps over time, moving from slow learning to rapid learning]
What might be wrong with the online learning formulation?
Online Learning (Hannan ’57, Zinkevich ’03): perform a sequence of tasks while minimizing static regret.
[Timeline diagram: a stream of “perform” steps over time; zero-shot performance]
Online Learning (Hannan ’57, Zinkevich ’03): perform a sequence of tasks while minimizing static regret.
[Timeline of “perform” steps over time; evaluates zero-shot performance]
Online Meta-Learning (Finn*, Rajeswaran*, Kakade, Levine ICML ’18): efficiently learn a sequence of tasks from a non-stationary distribution.
[Timeline of “learn” steps over time; evaluate performance after seeing a small amount of data]
What might be wrong with the online learning formulation?
Primarily a difference in evaluation, rather than the data stream.
The Online Meta-Learning Setting
(Finn*, Rajeswaran*, Kakade, Levine ICML ’18)
Goal: a learning algorithm with sub-linear regret (loss of the algorithm minus loss of the best algorithm in hindsight).
for task t = 1, …, n
  observe $\mathcal{D}_t^{\text{tr}}$
  use update procedure $\Phi(\theta_t, \mathcal{D}_t^{\text{tr}})$ to produce parameters $\phi_t$
  observe $x_t$
  predict $\hat{y}_t = f_{\phi_t}(x_t)$
  observe label $y_t$
(the last three steps are the standard online learning setting)
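The protocol above can be written as a short loop (our own sketch; `adapt`, `predict`, `loss_fn`, and `meta_update` are placeholder names for the update procedure $\Phi$, the predictor, the loss, and the algorithm-specific update of the meta-parameters):

```python
def online_meta_learning(theta, task_stream, adapt, predict, loss_fn, meta_update):
    """Per-task protocol: adapt on D_tr, then get evaluated on (x_t, y_t)."""
    cumulative_loss = 0.0
    for D_tr, (x_t, y_t) in task_stream:
        phi_t = adapt(theta, D_tr)                  # phi_t = Phi(theta_t, D_t^tr)
        y_hat = predict(phi_t, x_t)                 # predict with the adapted parameters
        cumulative_loss += loss_fn(y_hat, y_t)      # label is revealed after prediction
        theta = meta_update(theta, D_tr, x_t, y_t)  # how theta evolves is algorithm-specific
    return theta, cumulative_loss
```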
Can we apply meta-learning in lifelong learning settings?
Recall the follow the leader (FTL) algorithm: store all the data you’ve seen so far, and train on it.
Follow the meta-leader (FTML) algorithm (a toy sketch follows the steps below):
1. Store all the data you’ve seen so far, and meta-train on it.
2. Run the update procedure on the current task.
3. Deploy the model on the current task.
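A minimal sketch of one FTML round (our own toy code; for simplicity it uses a Reptile-style first-order meta-update, whereas the paper builds on MAML, and `adapt`, `ftml_round`, and `task_buffer` are illustrative names):

```python
import copy
import random
import torch

def adapt(model, loss_fn, D_tr, inner_lr=1e-2, inner_steps=5):
    """Update procedure Phi(theta, D_tr): a few SGD steps from the meta-parameters."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    x, y = D_tr
    for _ in range(inner_steps):
        opt.zero_grad()
        loss_fn(adapted(x), y).backward()
        opt.step()
    return adapted

def ftml_round(model, loss_fn, task_buffer, new_task, outer_lr=0.1, meta_steps=100):
    """One round: store the new task, meta-train on the buffer, adapt to the new task."""
    task_buffer.append(new_task)                 # (1) store all data seen so far...
    for _ in range(meta_steps):                  # ...and meta-train on it
        D_tr = random.choice(task_buffer)
        adapted = adapt(model, loss_fn, D_tr)
        with torch.no_grad():                    # Reptile-style outer step:
            for p, q in zip(model.parameters(), adapted.parameters()):
                p += outer_lr * (q - p)          # move meta-params toward adapted ones
    # (2) run the update procedure on the current task, (3) deploy the result
    return adapt(model, loss_fn, new_task)
```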
What meta-learning algorithms are well-suited for FTML?
What if 𝑝𝑡(𝒯) is non-stationary?
Experiments
Experiment with sequences of tasks:
- Colored, rotated, scaled MNIST
- 3D object pose prediction
- CIFAR-100 classification
[Example pose prediction tasks: plane, car, chair]
[Plots: learning efficiency (# datapoints) and learning proficiency (error) vs. task index, on Rainbow MNIST and Pose Prediction]
Comparisons:
- TOE (train on everything): train on all data so far
- FTL (follow the leader): train on all data so far, fine-tune on current task
- From Scratch: train from scratch on each task
Follow The Meta-Leader learns each new task faster & with greater proficiency, and approaches the few-shot learning regime.
Takeaways
Many flavors of lifelong learning, all under the same name.
Defining the problem statement is often the hardest part.
Meta-learning can be viewed as a slice of the lifelong learning problem.
A very open area of research.
Reminders
Tuesday (Nov 30th): Project poster session
Wednesday (Dec 8th): Project due
This Wednesday: Lecture and instructor office hours over zoom
