
Multiple Instance Active Learning for Object Detection

Tianning Yuan†, Fang Wan†, Mengying Fu†,
Jianzhuang Liu‡, Songcen Xu‡, Xiangyang Ji§ and Qixiang Ye†

† University of Chinese Academy of Sciences, Beijing, China
‡ Noah's Ark Lab, Huawei Technologies, Shenzhen, China. § Tsinghua University, Beijing, China
{yuantianning19,fumengying19}@mails.ucas.ac.cn, {wanfang,qxye}@ucas.ac.cn
{liu.jianzhuang,xusongcen}@huawei.com, [email protected]

* Corresponding Authors.

Abstract

Despite the substantial progress of active learning for image recognition, there still lacks an instance-level active learning method specified for object detection. In this paper, we propose Multiple Instance Active Object Detection (MI-AOD), to select the most informative images for detector training by observing instance-level uncertainty. MI-AOD defines an instance uncertainty learning module, which leverages the discrepancy of two adversarial instance classifiers trained on the labeled set to predict the instance uncertainty of the unlabeled set. MI-AOD treats unlabeled images as instance bags and feature anchors in images as instances, and estimates the image uncertainty by re-weighting instances in a multiple instance learning (MIL) fashion. Iterative instance uncertainty learning and re-weighting facilitate suppressing noisy instances, toward bridging the gap between instance uncertainty and image-level uncertainty. Experiments validate that MI-AOD sets a solid baseline for instance-level active learning. On commonly used object detection datasets, MI-AOD outperforms state-of-the-art methods with significant margins, particularly when the labeled sets are small. Code is available at https://github.com/yuantn/MI-AOD.

Figure 1. Comparison of active object detection methods. (a) Conventional methods compute image uncertainty by simply averaging instance uncertainties, ignoring interference from a large number of background instances. (b) Our MI-AOD leverages uncertainty re-weighting via multiple instance learning to filter out interfering instances while bridging the gap between instance uncertainty and image uncertainty. (Best viewed in color)

1. Introduction

The key idea of active learning is that a machine learning algorithm can achieve better performance with fewer training samples if it is allowed to select which to learn. Despite the rapid progress of methods with less supervision [21, 20], e.g., weak supervision and semi-supervision, active learning remains the cornerstone of many practical applications for its simplicity and higher performance.

In the computer vision area, active learning has been widely explored for image classification (active image classification) by empirically generalizing the model trained on the labeled set to the unlabeled set [9, 30, 18, 39, 4, 24, 36, 25, 34].
Uncertainty-based methods define various metrics for selecting informative images and adapting trained models to the unlabeled set [9]. Distribution-based approaches [30, 1] aim at estimating the layout of unlabeled images to select samples of large diversity. Expected model change methods [8, 15] find the samples that can cause the greatest change of model parameters or the largest loss [42].

Despite the substantial progress, there still lacks an instance-level active learning method specified for object detection (i.e., active object detection [42, 43, 2]), where an instance denotes an object proposal in an image. The goal of active object detection is to select the most informative images for detector training. However, recent methods tackled it by simply summing or averaging instance/pixel uncertainty as the image uncertainty, and unfortunately ignored the large imbalance of negative instances in object detection, which produces many noisy instances in the background and interferes with the learning of image uncertainty, Fig. 1(a). The noisy instances also cause inconsistency between image and instance uncertainty, which hinders selecting informative images.

In this paper, we propose the Multiple Instance Active Object Detection (MI-AOD) approach, Fig. 1(b), which targets selecting informative images from the unlabeled set by learning and re-weighting instance uncertainty with discrepancy learning and multiple instance learning (MIL). To learn instance-level uncertainty, MI-AOD first defines an instance uncertainty learning (IUL) module, which leverages two adversarial instance classifiers plugged atop the detection network (e.g., a feature pyramid network) to learn the uncertainty of unlabeled instances. Maximizing the prediction discrepancy of the two instance classifiers predicts instance uncertainty, while minimizing the classifiers' discrepancy drives feature learning to reduce the distribution bias between the labeled and unlabeled instances.

To establish the relationship between instance uncertainty and image uncertainty, MI-AOD incorporates a MIL module in parallel with the instance classifiers. MIL treats each unlabeled image as an instance bag and performs instance uncertainty re-weighting (IUR) by evaluating instance appearance consistency across images. During MIL, the instance uncertainty and image uncertainty are forced to be consistent, driven by a classification loss defined on image class labels (or pseudo-labels). Optimizing the image-level classification loss facilitates suppressing noisy instances while highlighting truly representative ones. Iterative instance uncertainty learning and instance uncertainty re-weighting bridge the gap between instance-level observation and image-level evaluation, towards selecting the most informative images for detector training.

The contributions of this paper include:

(1) We propose Multiple Instance Active Object Detection (MI-AOD), establishing a solid baseline to model the relationship between instance uncertainty and image uncertainty for informative image selection.

(2) We design instance uncertainty learning (IUL) and instance uncertainty re-weighting (IUR) modules, providing effective approaches to highlight informative instances while filtering out noisy ones in object detection.

(3) We apply MI-AOD to object detection on commonly used datasets, improving state-of-the-art methods with significant margins.

2. Related Work

2.1. Active Learning

Uncertainty-based Methods. Uncertainty is the most popular metric for selecting samples in active learning [31]. It can be defined as the posterior probability of a predicted class [17, 16], or the margin between the posterior probabilities of the first and second predicted classes [14, 28]. It can also be defined upon entropy [32, 26, 14] to measure the variance of unlabeled samples. Expected model change methods [29, 33] utilized the present model to estimate the expected gradient or prediction changes [8, 15] for sample selection. MIL-based methods [33, 13, 40, 6] selected informative images by discovering representative instances. However, they are designed for image classification and are not applicable to object detection, due to the challenge of crowded and noisy instances [38, 37].

Distribution-based Methods. These methods select diverse samples by estimating the distribution of unlabeled samples. Clustering [27] was applied to build the unlabeled sample distribution, while discrete optimization methods [11, 7, 41] were employed to perform sample selection. Considering the distances to surrounding samples, context-aware methods [12, 3] selected the samples that can represent the global sample distribution. Core-set [30] defined active learning as core-set selection, i.e., choosing a set of points such that a model learned on the labeled set can capture the diversity of the unlabeled samples.

In the deep learning era, active learning methods still fall into the uncertainty-based or distribution-based routines [18, 39, 4]. Sophisticated methods extended active learning to open sets [24] or combined it with self-paced learning [36]. Nevertheless, it remains questionable whether the intermediate feature representation is effective for sample selection. The learning loss method [42] can be categorized as either uncertainty-based or distribution-based. By introducing a module to predict the "loss" of unlabeled samples, it estimates sample uncertainty and selects samples with large "loss", like hard negative mining.
Figure 2. MI-AOD illustration. (a) Instance uncertainty learning (IUL) utilizing two adversarial classifiers. (b) Instance uncertainty re-weighting (IUR) using multiple instance learning. Bigger symbols ("+" and "−") indicate larger weights. (Best viewed in color)

2.2. Active Learning for Object Detection

Despite the substantial progress of active learning, few methods are specified for active object detection, which faces complex instance distributions within the same images and is more challenging than active image classification. By simply sorting the loss predictions of instances to evaluate the image uncertainty, the learning loss method [42] specified for image classification was directly applied to object detection. The image-level uncertainty can also be estimated from the uncertainty of a large number of background pixels [2]. CDAL [1] introduced spatial context to active detection and selected diverse samples according to their distances to the labeled set. Existing approaches simply used instance-/pixel-level observations to represent image-level uncertainty. There still lacks a systematic method to learn the image uncertainty by leveraging instance-level models [44, 45].

3. The Proposed Approach

3.1. Overview

For active object detection, a small set of images X_L^0 (the labeled set) with instance labels Y_L^0 and a large set of images X_U^0 (the unlabeled set) without labels are given. For each image, the label consists of bounding boxes (y_x^loc) and categories (y_x^cls) for objects of interest. A detection model M_0 is first initialized using the labeled set {X_L^0, Y_L^0}. With the initialized model M_0, active learning targets selecting a set of images X_S^0 from X_U^0 to be manually labeled and merged with X_L^0 into a new labeled set X_L^1, i.e., X_L^1 = X_L^0 ∪ X_S^0. The selected image set X_S^0 should be the most informative, i.e., it should improve the detection performance as much as possible. Based on the updated labeled set X_L^1, the task model is retrained and updated to M_1. Model training and sample selection repeat for several cycles until the size of the labeled set reaches the annotation budget.

Considering the large number¹ of instances in each image, there are two key problems for active object detection: (1) how to evaluate the uncertainty of the unlabeled instances using the detector trained on the labeled set; (2) how to precisely estimate the image uncertainty while filtering out noisy instances. MI-AOD handles these two problems by introducing two learning modules, respectively. For the first problem, MI-AOD incorporates instance uncertainty learning, with the aim of highlighting informative instances in the unlabeled set, as well as aligning the distributions of the labeled and unlabeled sets, Fig. 2(a). It is motivated by the fact that most active learning methods simply generalize models trained on the labeled set to the unlabeled set, which is problematic when there is a distribution bias between the two sets [10]. For the second problem, MI-AOD introduces MIL on both the labeled and unlabeled sets to estimate the image uncertainty by re-weighting the instance uncertainty. This is done by treating each image as an instance bag while re-weighting the instance uncertainty under the supervision of the image classification loss. Optimizing the image classification loss facilitates highlighting truly representative instances belonging to the same object classes while suppressing noisy ones, Fig. 2(b).

¹ For example, the RetinaNet detector [19] produces ∼100k anchors (instances) for an image.
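To make this cycle concrete, the sketch below lays out the outer loop in Python. The helper names `train_mi_aod`, `image_uncertainty`, and `annotate` are hypothetical placeholders for the training procedure (Secs. 3.2 and 3.3), the uncertainty scoring (Sec. 3.4), and the oracle labeling step; this illustrates the loop under those assumptions and is not the interface of the released code.

```python
import numpy as np

def active_learning_loop(all_indices, init_ratio=0.05, step_ratio=0.025,
                         budget_ratio=0.20, seed=0):
    """Sketch of one MI-AOD run: initialize X_L, then train / score / select / merge."""
    rng = np.random.default_rng(seed)
    n = len(all_indices)
    labeled = set(rng.choice(all_indices, size=int(init_ratio * n),
                             replace=False).tolist())
    while len(labeled) < budget_ratio * n:
        unlabeled = [i for i in all_indices if i not in labeled]
        model = train_mi_aod(labeled, unlabeled)          # hypothetical: IUL + IUR training
        scores = {i: image_uncertainty(model, i) for i in unlabeled}
        n_sel = min(int(step_ratio * n), len(unlabeled))  # X_S: images to annotate this cycle
        selected = sorted(unlabeled, key=scores.get, reverse=True)[:n_sel]
        annotate(selected)                                # hypothetical oracle labeling
        labeled |= set(selected)                          # X_L <- X_L U X_S
    return labeled
```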

Figure 3. Network architecture for instance uncertainty learning. (a) Label set training. (b) Maximizing instance uncertainty by maximizing classifier prediction discrepancy. (c) Minimizing instance uncertainty by minimizing classifier prediction discrepancy.

3.2. Instance Uncertainty Learning

Label Set Training. Using RetinaNet as the baseline [19], we construct a detector with two discrepant instance classifiers (f_1 and f_2) and a bounding box regressor (f_r), Fig. 3(a). We utilize the prediction discrepancy between the two instance classifiers to learn the instance uncertainty on the unlabeled set. Let g denote the feature extractor parameterized by θ_g. The discrepant classifiers are parameterized by θ_f1 and θ_f2, and the regressor by θ_fr. Θ = {θ_f1, θ_f2, θ_fr, θ_g} denotes the set of all parameters, where θ_f1 and θ_f2 are independently initialized.

In object detection, each image x can be represented by multiple instances {x_i, i = 1, ..., N} corresponding to the feature anchors on the feature map [19]. N is the number of instances in image x, and {y_i, i = 1, ..., N} denote the labels for the instances. Given the labeled set, a detection model is trained by optimizing the detection loss

  argmin_Θ l_det(x) = Σ_i ( FL(ŷ_i^{f1}, y_i^{cls}) + FL(ŷ_i^{f2}, y_i^{cls}) + SmoothL1(ŷ_i^{fr}, y_i^{loc}) ),   (1)

where FL(·) is the focal loss function for instance classification and SmoothL1(·) is the smooth L1 loss function for bounding box regression [19]. ŷ_i^{f1} = f_1(g(x_i)), ŷ_i^{f2} = f_2(g(x_i)), and ŷ_i^{fr} = f_r(g(x_i)) denote the prediction results (classification and localization) for the instances. y_i^{cls} and y_i^{loc} denote the ground-truth class label and bounding box label, respectively.
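As a minimal PyTorch-style sketch of Eq. (1) (not the released implementation), the per-image detection loss can be written as follows; the anchor-target matching that RetinaNet performs before the loss is abstracted away, and f1, f2, and fr are assumed to map per-anchor features to class logits and box offsets.

```python
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def detection_loss(feat, y_cls, y_loc, f1, f2, fr):
    """Eq. (1): focal loss on both classifier heads plus smooth-L1 box regression.

    feat  : g(x), features of the N instances (anchors), shape [N, D]
    y_cls : one-hot class targets, shape [N, C]
    y_loc : box regression targets, shape [N, 4]
    """
    l_cls = sigmoid_focal_loss(f1(feat), y_cls, reduction="mean") \
          + sigmoid_focal_loss(f2(feat), y_cls, reduction="mean")  # two FL terms
    l_loc = F.smooth_l1_loss(fr(feat), y_loc)                      # SmoothL1 term
    return l_cls + l_loc
```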
Maximizing Instance Uncertainty. Before the labeled set can precisely represent the unlabeled set, there exists a distribution bias between the labeled and unlabeled sets, especially when the labeled set is small. The informative instances lie in the biased distribution area. To find them, f_1 and f_2 are designed as adversarial instance classifiers with larger prediction discrepancy on instances close to the decision boundary, Fig. 2(a). The instance uncertainty is defined as the prediction discrepancy of f_1 and f_2.

To find the most informative instances, we fine-tune the network to maximize the prediction discrepancy of the adversarial classifiers, Fig. 3(b). In this procedure, θ_g is fixed so that the distributions of both the labeled and unlabeled instances are fixed. θ_f1 and θ_f2 are fine-tuned on the unlabeled set to maximize the prediction discrepancies for all instances, while at the same time preserving the detection performance on the labeled set. This is fulfilled by optimizing the following loss function:

  argmin_{Θ\θ_g} L_max = Σ_{x∈X_L} l_det(x) − λ · Σ_{x∈X_U} l_dis(x),   (2)

where

  l_dis(x) = Σ_i |ŷ_i^{f1} − ŷ_i^{f2}|   (3)

denotes the prediction discrepancy loss. ŷ_i^{f1}, ŷ_i^{f2} ∈ R^{1×C} are the instance classification predictions of the two classifiers for the i-th instance in image x, where C is the number of object classes in the dataset, and λ is a regularization hyper-parameter determined by experiment. As shown in Fig. 2(a), the informative instances with different predictions by the adversarial classifiers tend to have larger prediction discrepancy and larger uncertainty.

Minimizing Instance Uncertainty. After maximizing the prediction discrepancy, we further propose to minimize the prediction discrepancy to align the distributions of the labeled and unlabeled instances, Fig. 3(c). In this procedure, the classifier parameters θ_f1 and θ_f2 are fixed, while the parameters θ_g of the feature extractor are optimized by minimizing the prediction discrepancy loss:

  argmin_{θ_g} L_min = Σ_{x∈X_L} l_det(x) + λ · Σ_{x∈X_U} l_dis(x).   (4)

By minimizing the prediction discrepancy, the distribution bias between the labeled and unlabeled sets is minimized and their features are aligned as much as possible.

In each active learning cycle, the max-min prediction discrepancy procedure repeats several times so that the instance uncertainty is learned and the instance distributions of the labeled and unlabeled sets are progressively aligned. This actually defines an unsupervised learning procedure, which leverages the information (i.e., prediction discrepancy) of the unlabeled set to improve the detection model.
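The max-min procedure of Eqs. (2)-(4) then amounts to two alternating updates. The sketch below reuses `detection_loss` from above and assumes `opt_heads` optimizes {θ_f1, θ_f2, θ_fr} while `opt_g` optimizes θ_g only; batching over X_L and X_U is reduced to single tensors for brevity, and the classifier outputs are passed through a sigmoid so the discrepancy is computed on probabilities (an assumption, since Eq. (3) only specifies the predictions ŷ).

```python
import torch

def l_dis(feat, f1, f2):
    """Eq. (3): summed absolute discrepancy between the two classifiers."""
    return (torch.sigmoid(f1(feat)) - torch.sigmoid(f2(feat))).abs().sum()

def max_min_round(x_l, y_cls_l, y_loc_l, x_u, g, f1, f2, fr,
                  opt_heads, opt_g, lam=0.5):
    # Step 1, Eq. (2): theta_g fixed (features detached); fine-tune the heads to
    # MAXIMIZE discrepancy on X_U while preserving detection on X_L.
    feat_l, feat_u = g(x_l).detach(), g(x_u).detach()
    loss_max = detection_loss(feat_l, y_cls_l, y_loc_l, f1, f2, fr) \
             - lam * l_dis(feat_u, f1, f2)
    opt_heads.zero_grad(); loss_max.backward(); opt_heads.step()

    # Step 2, Eq. (4): heads fixed (opt_g only updates theta_g); update the
    # feature extractor to MINIMIZE discrepancy and align the two distributions.
    loss_min = detection_loss(g(x_l), y_cls_l, y_loc_l, f1, f2, fr) \
             + lam * l_dis(g(x_u), f1, f2)
    opt_g.zero_grad(); loss_min.backward(); opt_g.step()
```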

Figure 4. Network architecture for instance uncertainty re-weighting. (a) Label set training. (b) Re-weighting and maximizing instance uncertainty. (c) Re-weighting while minimizing instance uncertainty.

3.3. Instance Uncertainty Re-weighting

With instance uncertainty learning, the informative instances are highlighted. However, as there is a large number of instances (∼100k) in each image, the instance uncertainty may not be consistent with the image uncertainty. Some instances of high uncertainty are simply background noise or hard negatives for the detector. We thereby introduce an MIL procedure to bridge the gap between instance-level and image-level uncertainty by filtering out noisy instances.

Multiple Instance Learning. MIL treats each image as an instance bag and utilizes the instance classification predictions to estimate the bag labels. In turn, it re-weights the instance uncertainty scores by minimizing the image classification loss. This actually defines an Expectation-Maximization procedure [35, 5] to re-weight instance uncertainty across bags while filtering out noisy instances.

Specifically, we add an MIL classifier f_mil parameterized by θ_fmil in parallel with the instance classifiers, Fig. 4. The image classification score ŷ_{i,c}^{cls} for multiple instances in an image is calculated as

  ŷ_{i,c}^{cls} = [ exp(ŷ_{i,c}^{fmil}) / Σ_c exp(ŷ_{i,c}^{fmil}) ] · [ exp((ŷ_{i,c}^{f1} + ŷ_{i,c}^{f2})/2) / Σ_i exp((ŷ_{i,c}^{f1} + ŷ_{i,c}^{f2})/2) ],   (5)

where ŷ^{fmil} = f_mil(g(x)) is an N × C score matrix, and ŷ_{i,c}^{fmil} is the element in ŷ^{fmil} indicating the score of the i-th instance for class c. According to Eq. (5), the image classification score ŷ_{i,c}^{cls} is large only when x_i belongs to class c (the first term in Eq. (5)) and its instance classification scores ŷ_{i,c}^{f1} and ŷ_{i,c}^{f2} are significantly larger than those of the other instances (the second term in Eq. (5)).

Considering that the image classification scores of the instances from other classes/backgrounds are small, the image classification loss l_imgcls is defined as

  l_imgcls(x) = − Σ_c ( y_c^{cls} log Σ_i ŷ_{i,c}^{cls} + (1 − y_c^{cls}) log(1 − Σ_i ŷ_{i,c}^{cls}) ),   (6)

where y_c^{cls} ∈ {0, 1} denotes the image class label, which can be directly obtained from the instance class labels y_i^{cls} in the labeled set. Optimizing Eq. (6) drives the MIL classifier to activate instances with large MIL scores (ŷ_{i,c}^{fmil}) and large classification outputs (ŷ_{i,c}^{f1} + ŷ_{i,c}^{f2}). Instances with small MIL scores are suppressed as background. The image classification loss is first applied in label set training to get the initial model, and then used to re-weight the instance uncertainty in the unlabeled set.
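Eqs. (5) and (6) translate into a few tensor operations. The sketch below assumes f1, f2, and fmil produce [N, C] outputs for the N instances of one image, with the two softmax normalizations running over classes and over instances, respectively.

```python
import torch

def image_cls_scores(feat, f1, f2, fmil):
    """Eq. (5): per-instance, per-class image classification scores, shape [N, C]."""
    y_mil = fmil(feat)                        # [N, C] MIL scores
    y_avg = (f1(feat) + f2(feat)) / 2         # averaged instance classifier outputs
    cls_term = torch.softmax(y_mil, dim=1)    # normalize over classes c
    inst_term = torch.softmax(y_avg, dim=0)   # normalize over instances i
    return cls_term * inst_term

def image_cls_loss(y_cls_scores, y_img):
    """Eq. (6): per-class binary cross-entropy on the bag scores.

    y_img: image class labels in {0, 1}, shape [C] (or pseudo labels, Eq. (10)).
    """
    bag = y_cls_scores.sum(dim=0).clamp(1e-6, 1 - 1e-6)  # sum_i of y_hat_{i,c}^{cls}
    return -(y_img * bag.log() + (1 - y_img) * (1 - bag).log()).sum()
```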
Uncertainty Re-weighting. To ensure that the instance uncertainty is consistent with the image uncertainty, we assemble the image classification scores for all classes into a score vector w_i and re-weight the instance uncertainty as

  l̃_dis(x) = Σ_i |w_i · (ŷ_i^{f1} − ŷ_i^{f2})|,   (7)

where w_i = ŷ_i^{cls}. We then update Eq. (2) to

  argmin_{Θ̃\θ_g} L̃_max = Σ_{x∈X_L} ( l_det(x) + l_imgcls(x) ) − λ · Σ_{x∈X_U} l̃_dis(x),   (8)

where Θ̃ = Θ ∪ {θ_fmil}. By optimizing Eq. (8), the discrepancies of instances with large image classification scores are preferentially estimated, while those with small classification scores are suppressed. Similarly, Eq. (4) is updated to

  argmin_{θ_g, θ_fmil} L̃_min = Σ_{x∈X_L} ( l_det(x) + l_imgcls(x) ) + Σ_{x∈X_U} ( λ · l̃_dis(x) + l_imgcls(x) ).   (9)

In Eq. (9), the image classification loss is applied to the unlabeled set, where the pseudo image labels are estimated using the outputs of the instance classifiers, as

  y_c^{pseudo} = 1( max_i (ŷ_{i,c}^{f1} + ŷ_{i,c}^{f2}) / 2, 0.5 ),   (10)

where 1(a, b) is a binarization function that returns 1 when a > b and 0 otherwise. Eq. (10) is defined based on the observation that the instance classifiers can find true instances but are easily confused by complex backgrounds. We use the maximum instance score to predict pseudo image labels and leverage MIL to reduce background interference. According to Eqs. (5) and (6), the image classification loss ensures that the highlighted instances are representative of the image, i.e., minimizing the image classification loss bridges the gap between the instance uncertainty and image uncertainty. By iteratively optimizing Eqs. (8) and (9), informative object instances of the same class are statistically highlighted, while the background instances are suppressed.
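The re-weighting of Eq. (7) and the pseudo labels of Eq. (10) reuse the scores above. This is again a sketch, assuming the classifier heads output sigmoid probabilities in [0, 1] so that the 0.5 threshold of Eq. (10) is meaningful.

```python
import torch

def reweighted_l_dis(feat, w, f1, f2):
    """Eq. (7): discrepancy re-weighted by w = image_cls_scores(...), shape [N, C]."""
    return (w * (torch.sigmoid(f1(feat)) - torch.sigmoid(f2(feat)))).abs().sum()

def pseudo_image_label(feat, f1, f2):
    """Eq. (10): binarize the per-class maximum of the averaged instance scores."""
    y_avg = (torch.sigmoid(f1(feat)) + torch.sigmoid(f2(feat))) / 2   # [N, C]
    return (y_avg.max(dim=0).values > 0.5).float()                    # [C]
```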

Figure 5. Performance comparison of active object detection methods: (a) on PASCAL VOC using RetinaNet (mAP vs. proportion of labeled images); (b) on PASCAL VOC using SSD (mAP vs. number of labeled images); (c) on MS COCO using RetinaNet (AP vs. proportion of labeled images). Compared methods: Random Sampling, Entropy Sampling, Core-set (ICLR 18), LL4AL (CVPR 19), CDAL (ECCV 20), and MI-AOD (Ours).

3.4. Informative Image Selection

In each learning cycle, after instance uncertainty learning (IUL) and instance uncertainty re-weighting (IUR), we select the most informative images from the unlabeled set by observing the top-k instance uncertainties defined in Eq. (3) for each image, where k is a hyper-parameter. This is based on the fact that the noisy instances have been suppressed and the instance uncertainty has become consistent with the image uncertainty. The selected images are merged into the labeled set for the next learning cycle.
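A sketch of this selection step is given below, assuming a `model` object that exposes the feature extractor `g` and heads `f1`/`f2` from the earlier sketches. Scoring an image by the mean of its top-k instance discrepancies is one plausible reduction consistent with the description above; the text specifies observing the top-k instance uncertainties without pinning down the exact reduction here.

```python
import torch

@torch.no_grad()
def image_uncertainty(model, x, k=10000):
    """Sec. 3.4: image uncertainty from the top-k instance discrepancies of Eq. (3)."""
    feat = model.g(x)                                          # [N, D] anchor features
    disc = (torch.sigmoid(model.f1(feat))
            - torch.sigmoid(model.f2(feat))).abs().sum(dim=1)  # [N] per-instance
    k = min(k, disc.numel())
    return disc.topk(k).values.mean().item()

def select_images(model, unlabeled, n_select, k=10000):
    """Rank unlabeled (index, image) pairs by uncertainty; return the top n_select."""
    scored = [(image_uncertainty(model, x, k), idx) for idx, x in unlabeled]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:n_select]]
```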
4. Experiments

4.1. Experimental Settings

Datasets. The trainval sets of the PASCAL VOC 2007 and 2012 datasets, which contain 5011 and 11540 images respectively, are used for training. The VOC 2007 test set is used to evaluate mean average precision (mAP). The MS COCO dataset contains 80 object categories with challenging aspects including dense objects and small objects with occlusion. We use the train set with 117k images for active learning and the val set with 5k images for evaluating AP.

Active Learning Settings. We use RetinaNet [19] with ResNet-50 and SSD [23] with VGG-16 as the base detectors. For RetinaNet, MI-AOD uses 5.0% of randomly selected images from the training set to initialize the labeled set on PASCAL VOC. In each active learning cycle, it selects 2.5% images from the remaining unlabeled set until the labeled images reach 20.0% of the training set. For the large-scale MS COCO, MI-AOD uses only 2.0% of randomly selected images from the training set to initialize the labeled set, and then selects 2.0% images from the rest of the unlabeled set in each cycle until reaching 10.0% of the training set. In each cycle, the model is trained for 26 epochs with mini-batch size 2 and learning rate 0.001. After 20 epochs, the learning rate decreases to 0.0001. The momentum and weight decay are set to 0.9 and 0.0001, respectively. For SSD, we follow the settings in LL4AL [42] and CDAL [1], where 1k images in the training set are selected to initialize the labeled set and 1k images are selected in each cycle. The learning rate is 0.001 for the first 240 epochs and reduced to 0.0001 for the last 60 epochs. The mini-batch size is set to 32, which is required by LL4AL.

We compare MI-AOD with random sampling, entropy sampling, Core-set [30], LL4AL [42], and CDAL [1]. For entropy sampling, we use the averaged instance entropy as the image uncertainty. We repeat all experiments 5 times and use the mean performance. MI-AOD and the other methods share the same random seed and initialization for a fair comparison. λ defined in Eqs. (2), (4), (8), and (9) is set to 0.5, and k mentioned in Sec. 3.4 is set to 10k.

4.2. Performance

PASCAL VOC. In Fig. 5, we report the performance of MI-AOD and compare it with state-of-the-art methods on a TITAN V GPU. Using either the RetinaNet [19] or SSD [22] detector, MI-AOD outperforms state-of-the-art methods with large margins. Particularly, it outperforms state-of-the-art methods by 18.08%, 7.78%, and 5.19% when using 5.0%, 7.5%, and 10.0% samples. With 20.0% samples, MI-AOD achieves 72.27% detection mAP, which significantly outperforms CDAL by 3.20%. The improvements validate that MI-AOD can precisely learn instance uncertainty while selecting informative images. When using the SSD detector, MI-AOD outperforms state-of-the-art methods in almost all cycles, demonstrating its general applicability to object detectors.

MS COCO. MS COCO is a challenging dataset with more categories, denser objects, and larger scale variation, where MI-AOD also outperforms the compared methods, Fig. 5. Particularly, it outperforms Core-set by 0.6%, 0.5%, and 2.0%, and CDAL by 0.6%, 1.3%, and 2.6%, when using 2.0%, 4.0%, and 10.0% labeled images.
tector. For RetinaNet, MI-AOD uses 5.0% of randomly se- With 20.0% samples, MI-AOD achieves 72.27% detection

5335
IUL | IUR | Selection  |  5.0  |  7.5  | 10.0  | 12.5  | 15.0  | 17.5  | 20.0  | 100.0
    |     | Rand.      | 28.31 | 49.42 | 56.03 | 59.81 | 64.02 | 65.95 | 67.09 |
 X  |     | Rand.      | 30.09 | 49.17 | 55.64 | 60.93 | 64.10 | 65.77 | 67.20 | 77.28
 X  |     | Max Unc.   | 30.09 | 49.79 | 58.94 | 63.11 | 65.61 | 67.84 | 69.01 |
 X  |     | Mean Unc.  | 30.09 | 49.74 | 60.60 | 64.29 | 67.13 | 68.76 | 70.06 |
    |  X  | Rand.      | 47.18 | 57.12 | 60.68 | 63.72 | 66.10 | 67.59 | 68.48 |
    |  X  | Max Unc.   | 47.18 | 57.58 | 61.74 | 64.58 | 66.98 | 68.79 | 70.33 | 78.37
    |  X  | Mean Unc.  | 47.18 | 58.03 | 63.98 | 66.58 | 69.57 | 70.96 | 72.03 |

Table 1. Module ablation on PASCAL VOC: mAP (%) under different proportions (%) of labeled images. The first line shows the result of the baseline method with random image selection. "Max Unc." and "Mean Unc." respectively denote that the image uncertainty is represented by the maximum and averaged instance uncertainty. The 100.0% column reports fully supervised training and is shared across rows: 77.28 without IUR and 78.37 with IUR.

IUL |  2.0  |  4.0  |  6.0  |  8.0  | 10.0
    | 51.01 | 61.48 | 69.14 | 75.14 | 79.77
 X  | 58.07 | 67.75 | 74.91 | 78.88 | 80.96

Table 2. The effect of IUL for active image classification: accuracy (%) under different proportions (%) of labeled images. Experiments are conducted on CIFAR-10 using the ResNet-18 backbone while the images are randomly selected in all cycles.

w_i     | Set |  5.0  |  7.5  | 10.0  | 12.5  | 15.0  | 17.5  | 20.0
1       | ∅   | 30.09 | 49.17 | 55.64 | 60.93 | 64.10 | 65.77 | 67.20
ŷ_i^f1  | ∅   | 31.67 | 50.67 | 55.93 | 60.78 | 64.17 | 66.22 | 67.30
1       | X_L | 42.52 | 54.08 | 57.18 | 63.43 | 65.04 | 66.74 | 68.32
ŷ_i^cls | X   | 47.18 | 57.12 | 60.68 | 63.72 | 66.10 | 67.59 | 68.48

Table 3. Ablation study on IUR: mAP (%) under different proportions (%) of labeled images. "w_i" is the w_i in Eq. (7). "Set" denotes the sample set for IUR. X and X_L denote the whole sample set and the labeled set, respectively.

4.3. Ablation Study

IUL. As shown in Tab. 1, with IUL, the detection performance is improved up to 70.06% in the last cycle, which outperforms the Random method by 2.97% (70.06% vs. 67.09%). In Tab. 2, IUL also significantly improves the image classification performance with active learning on CIFAR-10. Particularly when using 2.0% samples, it improves the classification performance by 7.06% (58.07% vs. 51.01%), demonstrating the effectiveness of the discrepancy learning module for instance uncertainty estimation.

IUR. In Tab. 1, IUL achieves performance comparable to the random image selection strategy in the early cycles. This is because there are many noisy instances that make the instance uncertainty inconsistent with the image uncertainty. After using IUR to re-weight instance uncertainty, the performance is improved by 5.04%∼17.09% in the first three cycles (row 4 vs. row 1 in Tab. 3). In the last cycle, the performance is improved by 1.28% (68.48% vs. 67.20%) in comparison with IUL and 1.39% in comparison with the Random method (68.48% vs. 67.09%). As shown in Tab. 3, the image classification score ŷ_i^cls is the best re-weighting metric (row 4 vs. others). Interestingly, when using 100.0% images for training, the detector with IUR outperforms the detector without IUR by 1.09% (78.37% vs. 77.28%). These results clearly verify that the IUR module can suppress the interfering instances while highlighting more representative ones, which can indicate informative images for detector training.

Hyper-parameters and Time Cost. The effects of the regularization factor λ defined in Eqs. (2), (4), (8), and (9) and the valid instance number k in each image for selection are shown in Tab. 4. MI-AOD has the best performance when λ is set to 0.5 and k is set to 10k (for ∼100k instances/anchors in each image). Tab. 5 shows that MI-AOD costs less time at early cycles than CDAL.

λ   | k   |  5.0  |  7.5  | 10.0  | 12.5  | 15.0  | 17.5  | 20.0
2   | 10k | 47.18 | 56.94 | 64.44 | 67.70 | 69.58 | 70.67 | 72.12
1   | 10k | 47.18 | 57.30 | 64.93 | 67.40 | 69.63 | 70.53 | 71.62
0.5 | 10k | 47.18 | 58.41 | 64.02 | 67.72 | 69.79 | 71.07 | 72.27
0.2 | 10k | 47.18 | 58.02 | 64.44 | 67.67 | 69.42 | 70.98 | 72.06
0.5 | N   | 47.18 | 58.03 | 63.98 | 66.58 | 69.57 | 70.96 | 72.03
0.5 | 10k | 47.18 | 58.41 | 64.02 | 67.72 | 69.79 | 71.07 | 72.27
0.5 | 100 | 47.18 | 58.74 | 63.62 | 67.03 | 68.63 | 70.26 | 71.47
0.5 | 1   | 47.18 | 57.58 | 61.74 | 64.58 | 66.98 | 68.79 | 70.33

Table 4. Performance (mAP, %) under different hyper-parameters and proportions (%) of labeled images. The upper rows vary λ with k = 10k; the lower rows vary k with λ = 0.5, where N denotes using all instances in an image.

Method   |  5.0 |  7.5 | 10.0 | 12.5 | 15.0 | 17.5 | 20.0
Random   | 0.77 | 1.12 | 1.45 | 1.78 | 2.12 | 2.45 | 2.78
CDAL [1] | 1.18 | 1.50 | 1.87 | 2.19 | 2.68 | 2.83 | 2.82
MI-AOD   | 1.03 | 1.42 | 1.78 | 2.18 | 2.55 | 2.93 | 3.12

Table 5. Comparison of time cost (h) on PASCAL VOC under different proportions (%) of labeled images.

Figure 6. Visualization of learned and re-weighted instance uncertainty and image classification score (columns: Unlabeled Image, IUL, ŷ^cls, IUR). (Best viewed in color)

Figure 7. The number of true positive instances selected in each active learning cycle on PASCAL VOC using RetinaNet, comparing Random Sampling, Entropy Sampling, Core-set (ICLR 18), CDAL (ECCV 20), and MI-AOD (Ours).

4.4. Model Analysis
Visualization Analysis. In Fig. 6, we visualize the learned and re-weighted uncertainty and image classification scores of instances. The heatmaps are calculated by summarizing the uncertainty scores of all instances. With only IUL, there exist interfering instances from the background (row 1) or around the true positive instance (row 2), and the results tend to miss the true positive instances (row 3) or instance parts (row 4). MIL can assign high image classification scores to the instances of interest while suppressing backgrounds. As a result, IUR leverages the image classification scores to re-weight instances towards accurate instance uncertainty prediction.

Statistical Analysis. In Fig. 7, we calculate the number of true positive instances selected in each active learning cycle. It can be seen that MI-AOD hits significantly more true positives in all learning cycles. This shows that the proposed MI-AOD approach can activate true positive objects better while filtering out interfering instances, which facilitates selecting informative images for detector training.

5. Conclusion

We proposed Multiple Instance Active Object Detection (MI-AOD) to select informative images for detector training by observing instance uncertainty. MI-AOD incorporates a discrepancy learning module, which leverages adversarial instance classifiers to learn the uncertainty of unlabeled instances. MI-AOD treats the unlabeled images as instance bags and estimates the image uncertainty by re-weighting instances in a multiple instance learning (MIL) fashion. Iterative instance uncertainty learning and re-weighting facilitate suppressing noisy instances, towards selecting informative images for detector training. Experiments on large-scale datasets have validated the superiority of MI-AOD, in striking contrast with state-of-the-art methods. MI-AOD sets a solid baseline for active object detection.

Acknowledgement. This work was supported by the Natural Science Foundation of China (NSFC) under Grants 62006216, 61836012 and 61620106005, the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA27000000, the Post Doctoral Innovative Talent Support Program of China under Grant 119103S304, the CAAI-Huawei MindSpore Open Fund, and the MindSpore deep learning computing framework at www.mindspore.cn.

References

[1] Sharat Agarwal, Himanshu Arora, Saket Anand, and Chetan Arora. Contextual diversity for active learning. In ECCV, pages 137–153, 2020.
[2] Hamed Habibi Aghdam, Abel Gonzalez-Garcia, Antonio M. López, and Joost van de Weijer. Active learning for deep detection neural networks. In ICCV, pages 3671–3679, 2019.
[3] Oisin Mac Aodha, Neill D. F. Campbell, Jan Kautz, and Gabriel J. Brostow. Hierarchical subquery evaluation for active learning on a graph. In CVPR, pages 564–571, 2014.
[4] William H. Beluch, Tim Genewein, Andreas Nürnberger, and Jan M. Köhler. The power of ensembles for active learning in image classification. In CVPR, pages 9368–9377, 2018.
[5] Hakan Bilen and Andrea Vedaldi. Weakly supervised deep detection networks. In CVPR, pages 2846–2854, 2016.
[6] Marc-André Carbonneau, Eric Granger, and Ghyslain Gagnon. Bag-level aggregation for multiple-instance active learning in instance classification problems. IEEE TNNLS, 30(5):1441–1451, 2019.
[7] Ehsan Elhamifar, Guillermo Sapiro, Allen Y. Yang, and S. Shankar Sastry. A convex optimization framework for active learning. In ICCV, pages 209–216, 2013.
[8] Alexander Freytag, Erik Rodner, and Joachim Denzler. Selecting influential examples: Active learning with expected model output changes. In ECCV, pages 562–577, 2014.
[9] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In ICML, pages 1183–1192, 2017.
[10] Denis A. Gudovskiy, Alec Hodgkinson, Takuya Yamaguchi, and Sotaro Tsukizawa. Deep active learning for biased datasets via Fisher kernel self-supervision. In CVPR, pages 9038–9046, 2020.
[11] Yuhong Guo. Active instance sampling via matrix partition. In NeurIPS, pages 802–810, 2010.
[12] Mahmudul Hasan and Amit K. Roy-Chowdhury. Context aware active learning of activity recognition models. In ICCV, pages 4543–4551, 2015.
[13] Sheng-Jun Huang, Nengneng Gao, and Songcan Chen. Multi-instance multi-label active learning. In IJCAI, pages 1886–1892, 2017.
[14] Ajay J. Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos. Multi-class active learning for image classification. In CVPR, pages 2372–2379, 2009.
[15] Christoph Käding, Erik Rodner, Alexander Freytag, and Joachim Denzler. Active and continuous exploration with deep neural networks and expected model output changes. arXiv preprint arXiv:1612.06129, 2016.
[16] David D. Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning, pages 148–156, 1994.
[17] David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In SIGIR, pages 3–12, 1994.
[18] Liang Lin, Keze Wang, Deyu Meng, Wangmeng Zuo, and Lei Zhang. Active self-paced learning for cost-effective and progressive face identification. IEEE TPAMI, 40(1):7–19, 2018.
[19] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE TPAMI, 42(2):318–327, 2020.
[20] Binghao Liu, Yao Ding, Jianbin Jiao, Xiangyang Ji, and Qixiang Ye. Anti-aliasing semantic reconstruction for few-shot semantic segmentation. In CVPR, 2021.
[21] Binghao Liu, Jianbin Jiao, and Qixiang Ye. Harmonic feature activation for few-shot semantic segmentation. IEEE TIP, pages 3142–3153, 2021.
[22] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, pages 21–37, 2016.
[23] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, pages 21–37, 2016.
[24] Zhao-Yang Liu and Sheng-Jun Huang. Active sampling for open-set classification without initial annotation. In AAAI, pages 4416–4423, 2019.
[25] Zhao-Yang Liu, Shao-Yuan Li, Songcan Chen, Yao Hu, and Sheng-Jun Huang. Uncertainty aware graph Gaussian process for semi-supervised learning. In AAAI, pages 4957–4964, 2020.
[26] Wenjie Luo, Alexander G. Schwing, and Raquel Urtasun. Latent structured active learning. In NeurIPS, pages 728–736, 2013.
[27] Hieu Tat Nguyen and Arnold W. M. Smeulders. Active learning using pre-clustering. In ICML, 2004.
[28] Dan Roth and Kevin Small. Margin-based active learning for structured output spaces. In ECML, pages 413–424, 2006.
[29] Nicholas Roy and Andrew McCallum. Toward optimal active learning through sampling estimation of error reduction. In ICML, pages 441–448, 2001.
[30] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In ICLR, 2018.
[31] Burr Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. 2012.
[32] Burr Settles and Mark Craven. An analysis of active learning strategies for sequence labeling tasks. In EMNLP, pages 1070–1079, 2008.
[33] Burr Settles, Mark Craven, and Soumya Ray. Multiple-instance active learning. In NeurIPS, pages 1289–1296, 2007.
[34] Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning. In ICCV, pages 5971–5980, 2019.
[35] Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector machines for multiple-instance learning. In NeurIPS, pages 561–568, 2002.
[36] Ying-Peng Tang and Sheng-Jun Huang. Self-paced active learning: Query the right thing at the right time. In AAAI, pages 5117–5124, 2019.
[37] Fang Wan, Chang Liu, Wei Ke, Xiangyang Ji, Jianbin Jiao, and Qixiang Ye. C-MIL: Continuation multiple instance learning for weakly supervised object detection. In CVPR, pages 2199–2208, 2019.
[38] Fang Wan, Pengxu Wei, Zhenjun Han, Jianbin Jiao, and Qixiang Ye. Min-entropy latent model for weakly supervised object detection. IEEE TPAMI, 41(10):2395–2409, 2019.
[39] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE TCSVT, 27(12):2591–2600, 2017.
[40] Ran Wang, Xi-Zhao Wang, Sam Kwong, and Chen Xu. Incorporating diversity and informativeness in multiple-instance active learning. IEEE TFS, 25(6):1460–1475, 2017.
[41] Yi Yang, Zhigang Ma, Feiping Nie, Xiaojun Chang, and Alexander G. Hauptmann. Multi-class active learning by uncertainty sampling with diversity maximization. IJCV, 113(2):113–127, 2015.
[42] Donggeun Yoo and In So Kweon. Learning loss for active learning. In CVPR, pages 93–102, 2019.
[43] Beichen Zhang, Liang Li, Shijie Yang, Shuhui Wang, Zheng-Jun Zha, and Qingming Huang. State-relabeling adversarial active learning. In CVPR, pages 8753–8762, 2020.
[44] Xiaosong Zhang, Fang Wan, Chang Liu, Rongrong Ji, and Qixiang Ye. FreeAnchor: Learning to match anchors for visual object detection. In NeurIPS, pages 147–155, 2019.
[45] Xiaosong Zhang, Fang Wan, Chang Liu, Xiangyang Ji, and Qixiang Ye. Learning to match anchors for visual object detection. IEEE TPAMI, pages 1–14, 2021.
