Learning Without Forgetting
Abstract—When building a unified vision system or gradually adding new capabilities to a system, the usual assumption is that training data for all tasks is always available. However, as the number of tasks grows, storing and retraining on such data becomes infeasible. A new problem arises when we add new capabilities to a Convolutional Neural Network (CNN), but the training data for its existing capabilities are unavailable. We propose our Learning without Forgetting method, which uses only new task data to train the network while preserving the original capabilities. Our method performs favorably compared to the commonly used feature extraction and fine-tuning adaptation techniques and performs similarly to multitask learning that uses the original task data, which we assume to be unavailable. A more surprising observation is that, when the old and new task datasets are similar, Learning without Forgetting may be able to replace fine-tuning and improve new task performance.
Index Terms—Convolutional Neural Networks, Transfer Learning, Multi-task Learning, Deep Learning, Visual Recognition
1 INTRODUCTION
Fig. 1. We wish to add new prediction tasks to an existing CNN vision system without requiring access to the training data for existing tasks. This table shows the relative advantages of our method compared to commonly used methods (X marks a relative disadvantage). Limitations of our method include requiring a known, discrete task ID for each sample, requiring all new task data in advance, and performance that depends on task similarity (Section 5.1).

                             Fine      Duplicating and   Feature       Joint      Learning without
                             Tuning    Fine Tuning       Extraction    Training   Forgetting
new task performance         good      good              X medium      best       best
original task performance    X bad     good              good          good       good
training efficiency          fast      fast              fast          X slow     fast
testing efficiency           fast      X slow            fast          fast       fast
storage requirement          medium    X large           medium        X large    medium
requires previous task data  no        no                no            X yes      no
on all relevant images, the old task accuracy will be the same as the original network. In practice, the images for the new task may provide a poor sampling of the original task domain, but our experiments show that preserving outputs on these examples is still an effective strategy to preserve performance on the old task and also has an unexpected benefit of acting as a regularizer to improve performance on the new task. Our Learning without Forgetting approach has several advantages:

(1) Classification performance: Learning without Forgetting outperforms feature extraction and, more surprisingly, fine-tuning on the new task, while greatly outperforming the use of fine-tuned parameters θs on the old task. Our method also generally performs better in experiments than recent alternatives [8], [9].
(2) Computational efficiency: Training time is faster than joint training and only slightly slower than fine-tuning, and test time is faster than if one uses multiple fine-tuned networks for different tasks.
(3) Simplicity in deployment: Once a task is learned, the training data does not need to be retained or reapplied to preserve performance in the adapting network.

Compared to our previous work [10], we conduct more extensive experiments. We compare to additional methods: fine-tune FC, a commonly used baseline, and Less Forgetting Learning, a recently proposed method. We experiment with adjusting the balance between the old- and new-task losses, providing a more thorough and intuitive comparison of related methods (Figure 7). We switch from the obsolete Places2 dataset to the newer Places365-standard dataset. We perform a stricter, more careful hyperparameter selection process, which slightly changed our results. We also include a more detailed explanation of our method. Finally, we perform an experiment on application to video object tracking in Appendix A.

2 RELATED WORK

Multi-task learning, transfer learning, and related methods have a long history. In brief, our Learning without Forgetting approach could be seen as a combination of Distillation Networks [11] and fine-tuning [6]. Fine-tuning initializes with parameters from an existing network trained on a related data-rich problem and finds a new local minimum by optimizing parameters for a new task with a low learning rate. The idea of Distillation Networks is to learn parameters in a simpler network that produce the same outputs as a more complex ensemble of networks, either on the original training set or on a large unlabeled set of data. Our approach differs in that we solve for a set of parameters that works well on both old and new tasks, using the same data to supervise learning of the new tasks and to provide unsupervised output guidance on the old tasks.

2.1 Compared methods

Feature Extraction [5], [12] (Fig. 2(c)) uses a pre-trained deep CNN to compute features for an image. The extracted features are the activations of one layer (usually the last hidden layer) or multiple layers given the image. Classifiers trained on these features can achieve competitive results, sometimes outperforming human-engineered features [5]. Further studies [13] show how hyper-parameters, e.g. the original network structure, should be selected for better performance. Feature extraction does not modify the original network and allows new tasks to benefit from complex features learned from previous tasks. However, these features are not specialized for the new task and can often be improved by fine-tuning.

Fine-tuning [6] modifies the parameters of an existing CNN to train a new task (Fig. 2(b)). As mentioned in Section 1, a small learning rate is often used, and sometimes part of the network is frozen to prevent overfitting. With appropriate hyper-parameters for training, the resulting model often outperforms feature extraction [6], [13] or learning from a randomly initialized network [14], [15]. Fine-tuning makes θs more discriminative for the new task, and the low learning rate is an indirect mechanism to preserve some of the representational structure learned in the original tasks. Our method provides a more direct way to preserve representations that are important for the original task, improving both original and new task performance relative to fine-tuning in most experiments.

Multitask learning (e.g., [7]; Fig. 2(d)) aims to improve all tasks simultaneously by combining the common knowledge from all tasks. Each task provides extra training data for the parameters that are shared or constrained, serving as a form of regularization for the other tasks [16]. For neural networks, Caruana [7] gives a detailed study of multi-task learning. Usually the bottom layers of the network are shared, while the top layers are task-specific. Multitask learning requires data from all tasks to be present, while our method requires only data for the new tasks.

Adding new nodes to each network layer is a way to preserve the original network parameters while learning new discriminative features.
[Fig. 2 diagrams: (a) original model; (b) fine-tuning; (c) feature extraction; (d) joint training; (e) Less Forgetting Learning; (f) LwF (ours).]
Fig. 2. Illustration of our method (f) and the methods we compare to (b-e). Images and labels used in training are shown. Data for different tasks are used in alternation in joint training.
For example, Terekhov et al. [17] propose Deep Block-Modular Neural Networks for fully-connected neural networks, and Rusu et al. [18] propose Progressive Neural Networks for reinforcement learning. Parameters for the original network are untouched, and newly added nodes are fully connected to the layer beneath them. These methods have the downside of substantially expanding the number of parameters in the network, and can underperform [17] both fine-tuning and feature extraction if insufficient training data is available to learn the new parameters, since they require a substantial number of parameters to be trained from scratch. We experiment with expanding the fully connected layers of the original network but find that the expansion does not provide an improvement over our original approach.

2.2 Topically relevant methods

Our work also relates to methods that transfer knowledge between networks. Hinton et al. [11] propose Knowledge Distillation, where knowledge is transferred from a large network or a network ensemble to a smaller network for efficient deployment. The smaller network is trained using a modified cross-entropy loss (further described in Sec. 3) that encourages both large and small responses of the original and new network to be similar. Romero et al. [19] build on this work to transfer to a deeper network by applying extra guidance on the middle layer. Chen et al. [20] propose the Net2Net method that immediately generates a deeper, wider network that is functionally equivalent to an existing one. This technique can quickly initialize networks for faster hyper-parameter exploration. These methods aim to produce a differently structured network that approximates the original network, while we aim to find new parameters for the original network structure (θs, θo) that approximate the original outputs while tuning the shared parameters θs for new tasks.

Feature extraction and fine-tuning are special cases of Domain Adaptation (when old and new tasks are the same) or Transfer Learning (when the tasks differ). These differ from multitask learning in that tasks are not simultaneously optimized. Transfer Learning uses knowledge from one task to help another, as surveyed by Pan et al. [21]. The Deep Adaptation Network by Long et al. [22] matches the RKHS embedding of the deep representation of both source and target tasks to reduce domain bias. Another similar domain adaptation method is by Tzeng et al. [23], which encourages the shared deep representation to be indistinguishable across domains. This method also uses knowledge distillation, but to help train the new domain instead of preserving the old task. Domain adaptation and transfer learning require that at least unlabeled data is present for both task domains.
In contrast, we are interested in the case when training data for the original tasks (i.e. the source domains) are not available.

Methods that integrate knowledge over time, e.g. Lifelong Learning [24] and Never Ending Learning [25], are also related. Lifelong learning focuses on flexibly adding new tasks while transferring knowledge between tasks. Never Ending Learning focuses on building diverse knowledge and experience (e.g. by reading the web every day). Though topically related to our work, these methods do not provide a way to preserve performance on existing tasks without the original training data. Ruvolo et al. [26] describe a method to efficiently add new tasks to a multitask system, co-training all tasks while using only new task data. However, the method assumes that weights for all classifiers and regression models can be linearly decomposed into a set of bases. In contrast with our method, the algorithm applies only to logistic or linear regression on engineered features, and these features cannot be made task-specific, e.g. by fine-tuning.

2.3 Concurrently developed methods

Concurrent with our previous work [10], two methods have been proposed for continually adding and integrating new tasks without using previous tasks' data.

A-LTM [8], developed independently, is nearly identical in method but has very different experiments and conclusions. The main methodological differences are in the weight decay regularization used for training and the warm-up step that we use prior to full fine-tuning. However, we use large datasets to train our initial network (e.g. ImageNet) and then extend to new tasks from smaller datasets (e.g. PASCAL VOC), while A-LTM uses small datasets for the old task and large datasets for the new task. The experiments in A-LTM [8] find a much larger loss due to fine-tuning than we do, and the paper concludes that maintaining the data from the original task is necessary to maintain performance. Our experiments, in contrast, show that we can maintain good performance for the old task while performing as well or sometimes better than fine-tuning for the new task, without access to original task data. We believe the main difference is the choice of old-task and new-task pairs, and that we observe less of a drop in old-task performance from fine-tuning due to this choice (and in part to the warm-up step; see Table 2(b)). We believe that our experiments, which start from a well-trained network and add tasks with less training data available, are better motivated from a practical perspective.

Less Forgetting Learning [9] is also a similar method, which preserves old task performance by discouraging the shared representation from changing (Fig. 2(e)). This method argues that neither the task-specific decision boundaries nor the shared representation should change. Therefore, LFL adds an L2 loss that discourages the output after θs from changing for new task images, while θo remains as is. In comparison, our LwF method adds a loss that discourages the old task output from changing for new task images, and jointly optimizes both the shared representation and all the final layers. We empirically show that our method outperforms Less Forgetting Learning on the new tasks.

Elastic Weight Consolidation (EWC) [27] can be seen as an advanced version of the L2 soft-constraint baseline in Section 4.2. Instead of adding a loss proportional to the L2 distance between the modified network weights and the original ones, EWC weights the squared distance of each parameter according to its importance to the old task. This weight is estimated using the diagonal precision matrix of an assumed Gaussian distribution for the posterior P(θs, θo | DA), which is in turn estimated using the Fisher matrix. Although the computation of the Fisher matrix requires the presence of the old task data DA, this matrix can be computed before throwing away the old data and be treated as part of the model. In this way the method does not require the old data to be present for new task training. However, it is unclear whether the weights need to be computed again (needing all previous old task data) when subsequent new tasks are added. Moreover, this method reduces new task performance in the MNIST→permuted-MNIST experiment compared to the fine-tuning baseline. This phenomenon is in line with the observations of our L2 soft-constraint baseline. Also notable is that the paper focuses on situations where the input and output forms of all tasks are the same and no task-specific final layers are needed, which is slightly different from our multiple-task scenario.

Cross-stitch Networks [28] address multi-task learning in a way that resembles an advanced version of our network expansion experiment. They introduce the cross-stitch module, which takes two same-structured inputs, jointly learns two same-structured network blocks, and learns two pairs of weights to average their activations to obtain two same-structured outputs. By replacing the original network modules with the cross-stitch module, the authors were able to outperform joint training, among other methods, in most experiments. However, this method still needs the old task data to be present to jointly optimize, and it increases the network size proportionally to the number of tasks.

WA-CNN (Growing a Brain [29]) is most similar to our network expansion alternative design in Section 4.2. The paper increases layer sizes or network depth to improve fine-tuning performance on the new task, using a normalization of the activation scale of the new nodes. Although the model focuses on improving new task performance, when the method freezes all old parameters, it can maintain the performance of the old task while still outperforming traditional fine-tuning on the new task. However, this performance comes at the price of an increased network size: the number of parameters increases faster than LwF when aiming for higher performance than fine-tuning (see the discussion in Section 4.2). But if a greater network size is not considered an issue, the method can be applied in parallel to ours.

3 LEARNING WITHOUT FORGETTING

Given a CNN with shared parameters θs and task-specific parameters θo (Fig. 2(a)), our goal is to add task-specific parameters θn for a new task and to learn parameters that work well on old and new tasks, using images and labels from only the new task (i.e., without using data from existing tasks).
Our algorithm is outlined in Fig. 3, and the network structure is illustrated in Fig. 2(f).

First, we record responses yo on each new task image from the original network for outputs on the old tasks (defined by θs and θo). Our experiments involve classification, so the responses are the set of label probabilities for each training image. Nodes for each new class are added to the output layer, fully connected to the layer beneath, with randomly initialized weights θn. The number of new parameters is equal to the number of new classes times the number of nodes in the last shared layer, typically a very small percentage of the total number of parameters. In our experiments (Sec. 4.2), we also compare alternate ways of modifying the network for the new task.

Next, we train the network to minimize the loss for all tasks plus a regularization term R using stochastic gradient descent. The regularization R corresponds to a simple weight decay of 0.0005. When training, we first freeze θs and θo and train θn to convergence (warm-up step). Then, we jointly train all weights θs, θo, and θn until convergence (joint-optimize step). The warm-up step greatly enhances fine-tuning's old-task performance, but is not so crucial to either our method or the compared Less Forgetting Learning (see Table 2(b)). We still adopt this technique in Learning without Forgetting (as well as in most compared methods) for the slight enhancement and for a fair comparison.

For simplicity, we denote the loss functions, outputs, and ground truth for single examples. The total loss is averaged over all images in a batch during training. For new tasks, the loss encourages predictions ŷn to be consistent with the ground truth yn. The tasks in our experiments are multiclass classification, so we use the common [3], [30] multinomial logistic loss:

Lnew(yn, ŷn) = −yn · log ŷn   (1)

where ŷn is the softmax output of the network and yn is the one-hot ground truth label vector. If there are multiple new tasks, or if the task is multi-label classification where we make true/false predictions for each label, we take the sum of losses across the new tasks and the labels.

For each original task, we want the output probabilities for each image to be close to the recorded output from the original network. We use the Knowledge Distillation loss, which was found by Hinton et al. [11] to work well for encouraging the outputs of one network to approximate the outputs of another. This is a modified cross-entropy loss that increases the weight of smaller probabilities:

Lold(yo, ŷo) = −H(y′o, ŷ′o)   (2)
            = − Σ_{i=1}^{l} y′o^(i) log ŷ′o^(i)   (3)

where l is the number of labels and y′o^(i), ŷ′o^(i) are modified versions of the recorded and current probabilities yo^(i), ŷo^(i):

y′o^(i) = (yo^(i))^{1/T} / Σ_j (yo^(j))^{1/T},   ŷ′o^(i) = (ŷo^(i))^{1/T} / Σ_j (ŷo^(j))^{1/T}.   (4)

If there are multiple old tasks, or if an old task is multi-label classification, we take the sum of the loss for each old task and label. Hinton et al. [11] suggest setting T > 1, which increases the weight of smaller logit values and encourages the network to better encode similarities among classes. We use T = 2 according to a grid search on a held-out set, which aligns with the authors' recommendations. In our experiments, we find that most reasonable losses lead to similar performance, and the use of the knowledge distillation loss gives a marginal boost (Fig. 7(c), (d)). Therefore, it is important to constrain outputs for the original tasks to be similar to the original network's, but the exact similarity measure is not crucial.

λo is a loss balance weight, set to 1 for most of our experiments. Making λo larger favors the old task performance over the new task's, so we can obtain an old-task–new-task performance curve by varying λo (Figure 7).

Relationship to joint training. As mentioned before, the main difference between joint training and our method is the need for the old dataset. Joint training uses the old task's images and labels in training, while Learning without Forgetting no longer uses them, and instead uses the new task images Xn and the recorded responses Yo as substitutes. This eliminates the need to obtain and store the old dataset, brings the benefit of jointly optimizing the shared θs, and also saves computation, since the images Xn only have to pass through the shared layers once for both the new task and the old task. However, the distribution of images from these tasks may be very different, and this substitution may potentially decrease performance. Therefore, joint training's performance may be seen as an upper bound for our method.
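To make the objective concrete, the following is a minimal sketch of the response-recording step and the combined loss for one batch, written in PyTorch purely for illustration (the paper's implementation uses MatConvNet [31]); the names shared, old_head, new_head, loader, and the optimizer wiring are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

T = 2.0         # distillation temperature (Eq. 4)
LAMBDA_O = 1.0  # loss balance weight lambda_o

def record_old_responses(shared, old_head, loader):
    # Step 1: record the original network's old-task outputs (logits) for every
    # new-task image before any weights change. Applying softmax(logits / T)
    # below is equivalent to raising the recorded probabilities to 1/T and
    # renormalizing, as in Eq. 4.
    shared.eval(); old_head.eval()
    with torch.no_grad():
        return [old_head(shared(images)) for images, _ in loader]

def distillation_loss(recorded_logits, current_logits):
    # Knowledge Distillation loss (Eqs. 2-4): cross-entropy between the
    # temperature-scaled recorded (y'_o) and current (yhat'_o) probabilities.
    y_prime = F.softmax(recorded_logits / T, dim=1)
    log_yhat_prime = F.log_softmax(current_logits / T, dim=1)
    return -(y_prime * log_yhat_prime).sum(dim=1).mean()

def lwf_step(shared, old_head, new_head, images, labels, recorded_logits, optimizer):
    # One SGD step on L_new + lambda_o * L_old; the weight-decay regularizer R
    # is folded into the optimizer's weight_decay setting (0.0005 in the paper).
    features = shared(images)                                          # theta_s evaluated once
    loss_new = F.cross_entropy(new_head(features), labels)             # Eq. 1
    loss_old = distillation_loss(recorded_logits, old_head(features))  # Eqs. 2-4
    loss = loss_new + LAMBDA_O * loss_old
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The warm-up and joint-optimize steps then correspond to running this update with an optimizer built first over the new-task parameters only and later over all parameters.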
Efficiency comparison. The most computationally expensive part of using the neural network is evaluating or back-propagating through the shared parameters θs, especially the convolutional layers. For training, feature extraction is the fastest because only the new task parameters are tuned. LwF is slightly slower than fine-tuning because it needs to back-propagate through θo for old tasks, but it needs to evaluate and back-propagate through θs only once. Joint training is the slowest, because different images are used for different tasks, and each task requires separate back-propagation through the shared parameters.

All methods take approximately the same amount of time to evaluate a test image. However, duplicating the network and fine-tuning for each task takes m times as long to evaluate, where m is the total number of tasks.

3.1 Implementation details

We use MatConvNet [31] to train our networks using stochastic gradient descent with a momentum of 0.9 and dropout enabled in the fully connected layers. The data normalization (mean subtraction) of the original task is used for the new task. The resizing follows the implementation of the original network, which is 256 × 256 for AlexNet and 256 pixels on the shortest edge with aspect ratio preserved for VGG. We randomly jitter the training data by taking random fixed-size crops of the resized images with offsets on a 5 × 5 grid, randomly mirroring the crop, and adding variance to the RGB values as in AlexNet [3]. This data augmentation is applied to feature extraction too.

When training networks, we follow the standard practices for fine-tuning existing networks. The hyperparameters, mainly the number of epochs, the warm-up period, and the learning rate schedule, are chosen using the new task performance on a held-out set, which is a 20% subset of the training set whenever (1) the dataset does not have a validation set, or (2) the validation set is used for testing. When testing on the VOC test set, the official validation set is used for hyperparameter selection. We do not look at the old task performance during this selection.

For random initialization of θn, we use Xavier [32] initialization. We use a learning rate much smaller than when training the original network (0.1 ∼ 0.02 times the original rate). The learning rates are selected to maximize new task performance within a reasonable number of epochs. For each scenario, the same learning rate is shared by all methods except feature extraction, which uses 5× the learning rate due to its small number of parameters.

We choose the number of epochs for both the warm-up step and the joint-optimize step based on validation on the held-out set. Since we look at only the new task performance during validation, our hyperparameters favor the new task more. The compared methods converge at similar speeds, so we use the same number of epochs for each method for a fair comparison; however, the convergence speed depends heavily on the original network and the task pair, and we validate the number of epochs separately for each scenario. We lower the learning rate once by 10× at the epoch when the held-out accuracy plateaus. Exceptions to all these are ImageNet→Scene with feature extraction, where we observe overfitting and have to shorten the training, and Places365→MNIST with feature extraction, where convergence is slower and we double the number of epochs. We perform stricter validation than in our previous work [10], and the number of epochs is generally longer for each scenario.

To make a fair comparison, the intermediate network trained using our method (after the warm-up step) is used as a starting point for joint training and fine-tuning, since this may speed up training convergence. In other words, for each run of our experiment, we first freeze θs, θo and train θn, and use the resulting parameters to initialize our method, joint training, and fine-tuning. Feature extraction is trained separately because it does not share the same network structure as our method.

For the feature extraction baseline, instead of extracting features at the last hidden layer of the original network (at the top of θs), we freeze the shared parameters θs, disable the dropout layers, and add a two-layer network with 4096 nodes in the hidden layer on top of it. This has the same effect as training a 2-layer network on the extracted features, but with data augmentation.

For joint training, the loss for one task's output nodes is applied only to its own training images. We interleave batches of different tasks for gradient descent. For implementation convenience, we consider the epoch length of all tasks to be one iteration over the new task dataset. Note that the new dataset is typically much smaller than the old one; therefore, in one epoch of the new dataset, not all training data of the old task will be seen by the network. We simply shuffle both datasets at the end of each epoch, so that every old task training sample will potentially appear in subsequent epochs. Another strategy would be to maintain a separate epoch iterator for each dataset and shuffle each at the end of its own epoch, but we believe that these two strategies would perform very similarly.

4 EXPERIMENTS

Our experiments are designed to evaluate whether Learning without Forgetting (LwF) is an effective method to learn a new task while preserving performance on old tasks. We compare to the common approaches of feature extraction, fine-tuning, and fine-tuning FC, and also to Less Forgetting Learning (LFL) [9]. These methods leverage an existing network for a new task without requiring training data for the original tasks. Feature extraction maintains the exact performance on the original task. We also compare to joint training (sometimes called multitask learning) as an upper bound on possible old task performance, since joint training uses images and labels for original and new tasks, while LwF uses only images and labels for the new tasks.

We experiment on a variety of image classification problems with varying degrees of inter-task similarity. For the original ("old") task, we consider the ILSVRC 2012 subset of ImageNet [4] and the Places365-standard [33] dataset. Note that our previous work used Places2, a taster challenge in ILSVRC 2015 [4] and an earlier version of Places365, but that dataset was deprecated after our publication.
ImageNet has 1,000 object category classes and more than 1,000,000 training images. Places365 has 365 scene classes and ∼1,600,000 training images. We use these large datasets also because we assume we start from a well-trained network, which implies a large-scale dataset. For the new tasks, we consider PASCAL VOC 2012 image classification [34] ("VOC"), Caltech-UCSD Birds-200-2011 fine-grained classification [35] ("CUB"), and MIT indoor scene classification [36] ("Scenes"). These datasets have a moderate number of images for training: 5,717 for VOC, 5,994 for CUB, and 5,360 for Scenes. Among these, VOC is very similar to ImageNet, as subcategories of its labels can be found in ImageNet classes. The MIT indoor scene dataset is in turn similar to Places365. CUB is dissimilar to both, since it includes only birds and requires capturing fine details of the image to make a valid prediction. In one experiment, we use MNIST [37] as the new task, expecting our method to underperform, since the hand-written characters are completely unrelated to ImageNet classes.

We mainly use the AlexNet [3] network structure because it is fast to train and well-studied by the community [6], [13], [15]. We also verify that similar results hold using the 16-layer VGGnet [30] on a smaller set of experiments. For both network structures, the final layer (fc8) is treated as task-specific, and the rest are shared (θs) unless otherwise specified. The original networks pre-trained on ImageNet and Places365-standard are obtained from public online sources.

We report the center-image-crop mean average precision for VOC and the center-image-crop accuracy for all other tasks. We report accuracy on the validation sets of VOC, ImageNet, and Places365, and on the test sets of CUB and Scenes. Since the test performance of the former three cannot be evaluated frequently, we only provide the performance on their test sets in one experiment. Due to the randomness within CNN training, we run our experiments three times and report the mean performance.

Our experiments investigate adding a single new task to the network or adding multiple tasks one-by-one. We also examine the effect of dataset size and network design. In ablation studies, we examine alternative response-preserving losses, the utility of expanding the network structure, and fine-tuning with a lower learning rate as a method to preserve original task performance. Note that the results have multiple sources of variance, including random initialization and training, pre-determined termination (performance can fluctuate by training 1 or 2 additional epochs), etc.

4.1 Main experiments

Single new task scenario. First, we compare the results of learning one new task among different task pairs and different methods. Tables 1(a) and 1(b) show the performance of our method and the relative performance of other methods compared to it using AlexNet. We also visualize the old-new performance comparison on two of the task pairs in Figure 7. We make the following observations:

On the new task, our method consistently outperforms LFL, fine-tuning FC, and feature extraction, while outperforming fine-tuning on most task pairs except Places365→CUB, Places365→VOC, Places365→MNIST (similar performance), and ImageNet→MNIST (worse performance). The gain over fine-tuning was unexpected and indicates that preserving outputs on the old task is an effective regularizer (see Section 5 for a brief discussion). This finding motivates replacing fine-tuning with LwF as the standard approach for adapting a network to a new task.

On the old task, our method performs better than fine-tuning but often underperforms feature extraction, fine-tuning FC, and occasionally LFL. By changing the shared parameters θs, fine-tuning significantly degrades performance on the task for which the original network was trained. By jointly adapting θs and θo to generate outputs similar to the original network on an old task similar to the new one, the performance loss is greatly reduced.

Considering both tasks, Figure 7 shows that if λo is adjusted, LwF can perform better than LFL and fine-tuning FC on the new task for the same old task performance on the first task pair, and perform similarly to LFL on the second. Indeed, fine-tuning FC gives a performance between fine-tuning and feature extraction. LwF provides the freedom of changing the shared representation compared to LFL, which may have boosted the new task performance.

Our method usually performs similarly to joint training with AlexNet. Our method tends to slightly outperform joint training on the new task but underperform on the old task, which we attribute to the different distributions of the two task datasets. Overall, the methods perform similarly (except in the extreme *→MNIST cases), a positive result since our method does not require access to the old task training data and is faster to train. Note that sometimes both tasks' performance degrades when λo is too large or too small. We suspect that making it too large essentially increases the old task learning rate, potentially making it suboptimal, and making it too small lessens the regularization.

Dissimilar new tasks degrade old task performance more. For example, CUB is a very dissimilar task from Places365 [13], and adapting the network to CUB leads to a Places365 accuracy loss of 8.4% (3.8% + 4.6%) for fine-tuning, 3.8% for LwF, and 1.5% (3.8% − 2.3%) for joint training. In these cases, learning the new task causes considerable drift in the shared parameters, which cannot be fully accounted for by LwF because the distributions of CUB and Places365 images are very different. Even joint training leads to more accuracy loss on the old task because it cannot find a set of shared parameters that works well for both tasks. Our method does not outperform fine-tuning for Places365→CUB and, as expected, *→MNIST on the new task, since the hand-written characters provide poor indirect supervision for the old task. The old task accuracy drops substantially with fine-tuning and LwF, though more with fine-tuning.

Similar observations hold for both the VGG and AlexNet structures, except that joint training outperforms consistently for VGG, and LwF performs worse than before on the old task (Table 1(c)). This indicates that these results are likely to hold for other network structures as well, though joint training may have a larger benefit on networks with more representational power. In these experiments, LFL diverges using stochastic gradient descent, so we tuned down the learning rate (0.5×) and used λi = 0.2 instead.
TABLE 1
Performance for the single new task scenario. For all tables, the difference between each method's performance and LwF (our method) is reported to facilitate comparison. Mean average precision is reported for VOC and accuracy for all others. On the new task, LwF outperforms the baselines in most scenarios and performs comparably with joint training, which uses old task training data that we consider unavailable for the other methods. On the old task, our method greatly outperforms fine-tuning and achieves slightly worse performance than joint training. An exception is the ImageNet→MNIST task, where LwF does not perform well on the old task.
Multiple new task scenario. Second, we compare different methods when we cumulatively add new tasks to the system, simulating a scenario in which new object or scene categories are gradually added to the prediction vocabulary. We experiment on gradually adding the VOC task to AlexNet trained on Places365, and adding the Scenes task to AlexNet trained on ImageNet. These pairs have a moderate difference between the original task and the new tasks. We split the new task classes into three parts according to their similarity: VOC into transport, animals, and objects, and Scenes into large rooms, medium rooms, and small rooms (see the supplemental material). The images in Scenes are split into these three subsets. Since VOC is a multilabel dataset, it is not possible to split the images into different categories, so the labels are split for each task and images are shared among all the tasks.

Each time a new task is added, the responses of all other tasks Yo are re-computed, to emulate the situation where data for all original tasks are unavailable. Therefore, Yo for older tasks changes each time. For feature extraction and joint training, cumulative training does not apply, so we only report their performance at the final stage where all tasks are added. Figure 4 shows the results on both dataset pairs. Our findings are mostly consistent with the single new task experiment: LwF outperforms fine-tuning, feature extraction, LFL, and fine-tuning FC for most newly added tasks. However, LwF performs similarly to joint training only on newly added tasks (except for Scenes part 1), and underperforms joint training on the old task after more tasks are added.

Influence of dataset size. We inspect whether the size of the new task dataset affects our performance relative to other methods. We perform this experiment on adding CUB to the ImageNet AlexNet. We subsample the CUB dataset to 30%, 10%, and 3% when training the network, and report the result on the entire validation set. Note that for joint training, since we interleave batches from both datasets, only a small percentage of the ImageNet training set is iterated over in one CUB epoch (see the end of Section 3.1). Our results are shown in Figure 5. They show that the same observations hold. Our method outperforms fine-tuning on both tasks. Differences between methods tend to increase with more data used, although the correlation is not definitive.

4.2 Design choices and alternatives

Choice of task-specific layers. It is possible to regard more layers as task-specific θo, θn (see Figure 6(a)) instead of regarding only the output nodes as task-specific. This may provide an advantage for both tasks because later layers tend to be more task-specific [13]. However, doing so requires more storage, as most parameters in AlexNet are in the first two fully connected layers. Table 2(a) shows the comparison on three task pairs. Our results do not indicate any advantage to having additional task-specific layers.

Network expansion. We explore another way of modifying the network structure, which we refer to as "network expansion", which adds nodes to some layers. This allows for extra new-task-specific information in the earlier layers while still using the original network's information.

Figure 6(b) illustrates this method. We add 1024 nodes to each of the top 3 layers. The weights from all nodes at the previous layer to the new nodes at the current layer are initialized the same way Net2Net [20] would expand a layer, by copying nodes. Weights from new nodes at the previous layer to the original nodes at the current layer are initialized to zero. The top layer weights of the new nodes are randomly re-initialized. Then we either freeze the existing weights and fine-tune the new weights on the new task ("network expansion"), or train using Learning without Forgetting as before ("network expansion + LwF").
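The following is a hedged sketch of that initialization for a single fully connected layer, written in PyTorch for illustration only (the paper's experiments use MatConvNet, and the exact widening scheme follows Net2Net [20]); the function name and the random choice of copied nodes are assumptions.

```python
import torch
import torch.nn as nn

def expand_linear(old: nn.Linear, n_new_in: int, n_new_out: int) -> nn.Linear:
    """Widen a fully connected layer whose input layer has already been
    expanded by n_new_in nodes, adding n_new_out new output nodes."""
    out_f, in_f = old.out_features, old.in_features
    new = nn.Linear(in_f + n_new_in, out_f + n_new_out)
    with torch.no_grad():
        new.weight.zero_()
        new.bias.zero_()
        # Keep the original weights; new previous-layer nodes feed the original
        # nodes with zero weights, so original activations start out unchanged.
        new.weight[:out_f, :in_f] = old.weight
        new.bias[:out_f] = old.bias
        # New nodes copy randomly chosen existing nodes (Net2Net-style copying).
        idx = torch.randint(0, out_f, (n_new_out,))
        new.weight[out_f:, :in_f] = old.weight[idx]
        new.bias[out_f:] = old.bias[idx]
    return new
```

The 1024-node expansions of the top layers described above would chain such calls, with the top layer's new-node weights randomly re-initialized rather than copied.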
[Fig. 4 plots: old and new task performance per sub-graph (Places365, ImageNet, VOC parts 1-3, Scenes parts 1-3) for fine-tuning, joint training, feat. extraction, LwF (ours), LFL, and fine-tune FC.]
Fig. 4. Performance of each task when gradually adding new tasks to a pre-trained network. Different tasks are shown in different sub-graphs. The x-axis labels indicate the new task added to the network each time. Error bars show ±2 standard deviations over 3 runs with different θn random initializations. Markers are jittered horizontally for visualization, but line plots are not jittered to facilitate comparison. For all tasks, our method degrades more slowly over time than fine-tuning and outperforms feature extraction in most scenarios. For Places365→VOC, our method performs comparably to joint training.
Fig. 5. Influence of subsampling the new task training set on the compared methods. The x-axis indicates diminishing training set size. Three runs of our experiments with different random θn initializations and dataset subsamplings are shown. Scatter points are jittered horizontally for visualization, but line plots are not jittered to facilitate comparison. Differences between LwF and the compared methods on both the old task and the new task decrease with less data, but the observations remain the same. LwF outperforms fine-tuning despite the change in training set size.
Note that both methods need the network to scale quadratically with respect to the number of new tasks.

Table 2(a) shows the comparison with our original method. Network expansion by itself performs better than feature extraction, but not as well as LwF on the new task. Network expansion + LwF performs similarly to LwF, with additional computational cost and complexity.

We note that Growing a Brain [29] offers a more thorough experiment on network expansion (without LwF). By freezing all old parameters, adding a novel normalization step, and experimenting with the layer at which to add nodes and the number of nodes to add, their method WA-CNN is able to outperform fine-tuning, but still at the cost of an increased network size. For example, on the SUN-397 dataset [38], the network outperforms fine-tuning by 0.53% while maintaining the original old task performance by adding 1024 nodes to the 4096-node fc7, and by 0.88% if 2048 nodes are added. Adding the 2048 nodes and the new top layer increases the number of parameters by 21.1%, while LwF increases the network size by 2.7%.

We also note that the WA-CNN variant that does not freeze parameters performs better on the new task. In a way, the variant that freezes old parameters suffers from the same drawback as feature extraction. Potentially the freezing variant could be improved by unfreezing and applying LwF, which may increase new task performance while only slightly sacrificing old task performance. However, this experiment is outside the scope of this paper.
Fig. 6. Illustration of alternative network modification methods. In (a), more fully connected layers are task-specific, rather than shared. In (b), nodes for multiple old tasks (not shown) are connected in the same way. LwF can also be applied to Network Expansion by unfreezing all nodes and matching output responses on the old tasks.
TABLE 2
Performance of our method versus various alternative design choices. In most cases, these alternatives do not provide a consistent advantage or disadvantage compared to our method.

(a) Changing the number of task-specific layers, using network expansion, or attempting to lower θs's learning rate when fine-tuning.

                                      ImageNet→CUB    ImageNet→Scenes    Places365→VOC
                                      old     new     old     new        old     new
LwF at output layer (ours)            54.7    57.7    55.9    64.5       50.6    70.2
last hidden layer                     54.7    56.2    55.7    65.0       50.7    70.6
2nd last hidden (Fig. 6(a))           54.6    57.1    55.8    64.2       50.8    70.5
network expansion                     57.0    54.0    57.0    62.5       51.7    67.1
network expansion + LwF               54.4    57.0    55.7    63.9       50.7    70.4
fine-tuning (10% θs learning rate)    52.2    54.9    54.8    62.7       49.3    69.5
(b) Performing LwF, fine-tuning, and LFL with and without the warm-up step. The warm-up step is not crucial for LwF, but is essential for fine-tuning's old task performance.

                              ImageNet→Scenes    Places365→VOC
                              old     new        old     new
LwF                           55.9    64.5       50.6    70.2
fine-tuning                   53.9    63.8       48.4    70.3
LFL                           55.5    63.6       50.8    69.5
LwF (no warm-up)              55.2    64.9       50.4    70.0
fine-tuning (no warm-up)      49.8    63.9       42.3    70.0
LFL (no warm-up)              55.4    63.0       50.6    69.1
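As a companion to Table 2(b), the sketch below shows what the warm-up and joint-optimize stages of Section 3 amount to in code. It is illustrative PyTorch rather than the authors' MatConvNet implementation; shared, old_head, new_head, train_one_epoch, and the epoch counts are assumed placeholders.

```python
import torch

def set_trainable(module, flag):
    # Freeze or unfreeze a sub-network by toggling gradient computation.
    for p in module.parameters():
        p.requires_grad = flag

def two_stage_training(shared, old_head, new_head, train_one_epoch,
                       warmup_epochs=10, joint_epochs=40, lr=1e-3):
    # Warm-up step: freeze theta_s and theta_o, train only theta_n.
    set_trainable(shared, False); set_trainable(old_head, False)
    opt = torch.optim.SGD(new_head.parameters(), lr=lr,
                          momentum=0.9, weight_decay=5e-4)
    for _ in range(warmup_epochs):
        train_one_epoch(opt)

    # Joint-optimize step: unfreeze everything and train all weights together.
    set_trainable(shared, True); set_trainable(old_head, True)
    params = (list(shared.parameters()) + list(old_head.parameters())
              + list(new_head.parameters()))
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=5e-4)
    for _ in range(joint_epochs):
        train_one_epoch(opt)
```

Here train_one_epoch is assumed to iterate over the new-task batches and apply the combined LwF loss from Section 3.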
[Fig. 7 plots: new task performance vs. old task performance for the compared methods (fine-tuning, feature extraction, and others); panels include (c) Places365→VOC and (d) ImageNet→Scene.]
Fig. 7. Visualization of both new and old task performance for the compared methods, some with different loss weights. (a)(b): comparing methods; (c)(d): comparing losses. Larger symbols signify a larger λo, i.e. a heavier weight on the response-preserving loss.
knowledge distillation and fine-tuning, learning parameters that are discriminative for the new task while preserving outputs for the original tasks on the training data. We show the effectiveness of our method on a number of classification tasks.

5.1 Limitations

Firstly, it is worth pointing out that LwF operates on distinct tasks. Like many multitask learning methods, it cannot properly deal with domains that change continually on a spectrum (e.g. the old task being classification from a top-down view and the new task being classification from views at unknown angles); the tasks must be enumerated. In addition, LwF requires each sample to be accompanied by the information of which task it belongs to, and this information is needed for both training and testing.

Secondly, in contrast to methods such as Never Ending Learning [25], LwF requires all new task training data to be present before computing the old task responses. This would not be applicable if the data come in a stream and the model is required to be trained incrementally.

Third, the ability of LwF to incrementally learn new tasks is limited, as the performance on old tasks gradually drops. Fourth, the gap between LwF and joint training performance on both tasks is larger when experimenting with the VGG structure.

Finally, as observed in Section 4.1, the performance of LwF largely depends on how much the new task data resembles the old task's. For example, when the old task is ImageNet classification, we have new tasks ranging from PASCAL VOC multilabel classification (very similar), to MIT indoor scenes (somewhat similar, since scene classification can rely on the presence of certain object categories), to CUB (not similar, since CUB only has pictures of birds, mostly in nature scenes, lacking most ImageNet objects), to MNIST (no resemblance at all). As shown in Table 1(a) and visualized in Figure 8, old task preservation relative to the original performance (feature extraction) is quite good for the former two, a little worse on the quite dissimilar CUB, and poor on the irrelevant MNIST. The same trend emerges with Places365 as the old task, where Places365→CUB old task preservation is perhaps less than satisfactory. We conjecture that LwF will not be very effective for task pairs that are more dissimilar than ImageNet and CUB.

5.2 Usage and future work

Our work has implications for two uses. First, if we want to expand the set of possible predictions on an existing network, our method performs similarly to joint training but is faster to train and does not require access to the training data for previous tasks. Second, if we care only about the performance on the new task, our method often
[Fig. 8 plots: old task performance for fine-tuning, joint training, LwF (ours), LFL, and fine-tune FC; two panels with new tasks ordered by increasing dissimilarity to the old task (VOC, Scene, CUB, MNIST in one panel; Scene, VOC, CUB, MNIST in the other).]
Fig. 8. Influence of new-old task similarity on old task performance preservation, relative to the original old task performance. As the tasks become more dissimilar, old task preservation drops.
outperforms the current standard practice of fine-tuning. [4] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Fine-tuning approaches use a low learning rate in hopes that Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and
L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,”
the parameters will settle in a “good” local minimum not too International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp.
far from the original values. Preserving outputs on the old 211–252, 2015.
task is a more direct and interpretable way to to retain the [5] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng,
important shared structures learned for the previous tasks. and T. Darrell, “Decaf: A deep convolutional activation feature for
generic visual recognition,” in International Conference in Machine
As an additional use-case example, we investigate using Learning (ICML), 2014.
LwF in the application of tracking in the Appendix. We [6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hier-
build on MD-Net [39], which views tracking as a template archies for accurate object detection and semantic segmentation,”
in The IEEE Conference on Computer Vision and Pattern Recognition
classification task. A classifier transferred from training (CVPR), June 2014.
videos is fine-tuned online to classify regions as the object [7] R. Caruana, “Multitask learning,” Machine learning, vol. 28, no. 1,
or background. We propose to replace the fine-tuning step pp. 41–75, 1997.
with Learning without Forgetting. We leave the details and [8] T. Furlanello, J. Zhao, A. M. Saxe, L. Itti, and B. S. Tjan, “Active
long term memory networks,” arXiv preprint arXiv:1606.02355,
implementation to the appendix. We observe some improve- 2016.
ments by applying LwF, but the difference is not statistically [9] H. Jung, J. Ju, M. Jung, and J. Kim, “Less-forgetting learning in
significant. deep neural networks,” arXiv preprint arXiv:1607.00122, 2016.
[10] Z. Li and D. Hoiem, “Learning without forgetting,” in European
We see several directions for future work. We have demonstrated the effectiveness of LwF for image classification and one experiment on tracking, but would like to further experiment on semantic segmentation, detection, and problems outside of computer vision. Additionally, one could explore variants of the approach, such as maintaining a set of unlabeled images to serve as representative examples for previously learned tasks. Theoretically, it would be interesting to bound the old task performance based on preserving outputs for a sample drawn from a different distribution. More generally, there is a need for approaches that are suitable for online learning across different tasks, especially when classes have heavy-tailed distributions.
ACKNOWLEDGMENTS
This work is supported in part by NSF Awards 14-46765 and 10-53768 and ONR MURI N000014-16-1-2007.

REFERENCES
[1] M. McCloskey and N. J. Cohen, "Catastrophic interference in connectionist networks: The sequential learning problem," Psychology of Learning and Motivation, vol. 24, pp. 109–165, 1989.
[2] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio, "An empirical investigation of catastrophic forgetting in gradient-based neural networks," arXiv preprint arXiv:1312.6211, 2013.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[4] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[5] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in International Conference on Machine Learning (ICML), 2014.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[7] R. Caruana, "Multitask learning," Machine Learning, vol. 28, no. 1, pp. 41–75, 1997.
[8] T. Furlanello, J. Zhao, A. M. Saxe, L. Itti, and B. S. Tjan, "Active long term memory networks," arXiv preprint arXiv:1606.02355, 2016.
[9] H. Jung, J. Ju, M. Jung, and J. Kim, "Less-forgetting learning in deep neural networks," arXiv preprint arXiv:1607.00122, 2016.
[10] Z. Li and D. Hoiem, "Learning without forgetting," in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 614–629.
[11] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," in NIPS Workshop, 2014.
[12] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 806–813.
[13] H. Azizpour, A. Razavian, J. Sullivan, A. Maki, and S. Carlsson, "Factors of transferability for a generic convnet representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[14] P. Agrawal, R. Girshick, and J. Malik, "Analyzing the performance of multilayer neural networks for object recognition," in European Conference on Computer Vision (ECCV), 2014.
[15] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[16] O. Chapelle, P. Shivaswamy, S. Vadrevu, K. Weinberger, Y. Zhang, and B. Tseng, "Boosted multi-task learning," Machine Learning, vol. 85, no. 1-2, pp. 149–173, 2011.
[17] A. V. Terekhov, G. Montone, and J. K. O'Regan, "Knowledge transfer in deep block-modular neural networks," in Biomimetic and Biohybrid Systems. Springer, 2015, pp. 268–279.
[18] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, "Progressive neural networks," arXiv preprint arXiv:1606.04671, 2016.
[19] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "FitNets: Hints for thin deep nets," in International Conference on Learning Representations (ICLR), 2015.
[20] T. Chen, I. Goodfellow, and J. Shlens, "Net2Net: Accelerating learning via knowledge transfer," in International Conference on Learning Representations (ICLR), 2016.
[21] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[22] M. Long and J. Wang, "Learning transferable features with deep adaptation networks," arXiv preprint arXiv:1502.02791, 2015.
[23] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, "Simultaneous deep transfer across domains and tasks," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4068–4076.
[24] S. Thrun, "Lifelong learning algorithms," in Learning to Learn. Springer, 1998, pp. 181–209.
[25] T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling, "Never-ending learning," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), 2015.
[26] E. Eaton and P. L. Ruvolo, "ELLA: An efficient lifelong learning algorithm," in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 507–515.
[27] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences, p. 201611835, 2017.
[28] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert, "Cross-stitch networks for multi-task learning," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3994–4003.
[29] Y. Wang, D. Ramanan, and M. Hebert, "Growing a brain: Fine-tuning by increasing model capacity," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[30] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.
[31] A. Vedaldi and K. Lenc, "MatConvNet: Convolutional neural networks for MATLAB," in Proceedings of the ACM International Conference on Multimedia, 2015.
[32] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in AISTATS, vol. 9, 2010, pp. 249–256.
[33] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, "Places: An image database for deep scene understanding," arXiv preprint arXiv:1610.02055, 2016.
[34] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes challenge: A retrospective," International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, Jan. 2015.
[35] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 Dataset," California Institute of Technology, Tech. Rep. CNS-TR-2011-001, 2011.
[36] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 413–420.
[37] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[38] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "SUN database: Large-scale scene recognition from abbey to zoo," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3485–3492.
[39] H. Nam and B. Han, "Learning multi-domain convolutional neural networks for visual tracking," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

Zhizhong Li is a second-year Ph.D. student in Computer Science at the University of Illinois at Urbana-Champaign, supervised by Derek Hoiem. He received his M.S. degree from the Robotics Institute, Carnegie Mellon University, where he was supervised by Daniel Huber. Before that, he received his B.Eng. degree from the Department of Automation, Tsinghua University, completing his thesis with Changshui Zhang. Zhizhong's research interest lies in computer vision, especially its intersection with machine learning. Most recently, his research has focused on the application of transfer learning and deep learning in vision.

Derek Hoiem is an associate professor of Computer Science at the University of Illinois at Urbana-Champaign, since January 2009. Derek received his Ph.D. in Robotics from Carnegie Mellon University in 2007 and completed a postdoctoral fellowship at the Beckman Institute in 2008. Derek's primary research goal is to model the physical and semantic structure of the world, so computers can better understand scenes from images. In particular, he researches algorithms to interpret physical space from images and to relate objects to their environment and to each other. Example applications include creating 3D models of scenes and objects from one image, photorealistic rendering of object models into images, robot navigation, and creating and matching as-built 3D models of construction scenes to planned models. Derek has published dozens of papers and several patents, and his work has been recognized with awards including an ACM Doctoral Dissertation Award honorable mention, a CVPR best paper award, an Intel Early Career Faculty award, a Sloan Fellowship, and the PAMI Significant Young Researcher award. Derek Hoiem is also co-founder and CTO of Reconstruct, which visually documents construction sites, matching images to plans and analyzing productivity and risk for delay.