2008 - CVPR-gjqi - Two-Dimensional Active Learning For Image Classification
†Guo-Jun Qi, ‡Xian-Sheng Hua, ‡Yong Rui, †Jinhui Tang, ‡Hong-Jiang Zhang

† MOE-Microsoft Key Laboratory of Multimedia Computing and Communication
  & Department of Automation, University of Science and Technology of China
  {qgj, jhtang}@mail.ustc.edu.cn
‡ Microsoft Research Asia
  {xshua, yongrui, hjzhang}@microsoft.com
[…] multi-label image classification. Active learning is one of the most used methods in image classification […] the expected classification error. Specifically, in each iteration the annotators are only required to annotate/confirm a selected part of the labels of selected samples, while the remaining unlabeled part is inferred according to the label correlations. We call this strategy 2-Dimensional Active Learning (2DAL), since not only do the samples need to be labeled along the sample dimension, but also the labels along the label dimension. This strategy exploits both sample and label redundancies for multi-labeled samples. Therefore, annotating a set of selected sample-label pairs provides enough information for training the classifiers, since the information in the selected pairs can be propagated to the rest along both the sample and the label "dimensions". Unlike existing binary-based active learning strategies [11][3], which only take the sample redundancy into account, the 2DAL strategy additionally considers the label dimension to leverage the rich relationships embedded in the multiple labels. 2DAL efficiently selects an informative part of the labels, rather than all the labels, for a particular selected sample. Such a strategy significantly reduces the required human labor during active learning. For example, Field and Mountain tend to occur simultaneously in an image, so it is reasonable to select only one label (e.g., Mountain) for annotation, since the uncertainty of the other label is remarkably decreased after annotating this one. Another example is Mountain and Urban: in contrast to Field and Mountain, these two labels seldom occur simultaneously, so annotating one of them will most likely rule out the presence of the other.

To realize 2DAL, we answer the following questions in this paper:

1. What is the proper selection strategy for finding the sample-label pairs? To address this issue, we formulate the selection of sample-label pairs as minimizing a derived Multi-Label Bayesian Classification Error Bound. We will demonstrate that selecting sample-label pairs in this way significantly reduces the uncertainty of both the samples and the labels.

2. How can we model the label relationships/correlations? Since the proposed 2DAL strategy utilizes the label dependencies to reduce the labeling cost, the underlying classifier should be able to model the corresponding label correlations. Accordingly, we propose a Kernelized Maximum Entropy Model (KMEM) to model such correlations. Furthermore, since the 2DAL strategy only annotates a subset of the labels, we formulate an Expectation-Maximization (EM) [5] algorithm to solve the incomplete labeling problem.

To the best of our knowledge, we are the first to present a study of active learning at the granularity of sample-label pairs, with both theoretical analysis and empirical results on real-world data sets.

Figure 1. Some examples of multi-labeled images. "P" means positive label and "N" means negative label.

Figure 2. The proposed two-dimensional 2DAL strategy. Before and after 2DAL selection, each cell of the sample-by-label matrix is either already labeled ("P"/"N"), still unlabeled ("?"), or selected to be labeled ("S").

The rest of the paper is organized as follows. In Section 2, we present the 2DAL selection strategy used in the proposed active learning algorithm. We also show that the traditional active learning formulation is a special case of 2DAL when there is only one label. After that, a Kernelized Maximum Entropy Model is proposed in Section 3 to model the label correlations. In addition, an Expectation-Maximization (EM) algorithm is given in this section to solve the incomplete labeling problem. In Section 4, we evaluate the proposed 2DAL in comparison with a state-of-the-art one-dimensional active learning approach on two real-world data sets. Finally, we conclude in Section 5.

2. Two-Dimensional Active Learning

In this section, we start by detailing the underlying idea of the proposed 2DAL strategy in the multi-label setting from the perspective of information theory. Then, a Bayesian error bound is derived that states the expected classification error given a selected sample-label pair. The proposed 2DAL strategy is then deduced by selecting the sample-label pairs that optimize this bound.

2.1. The proposed 2DAL strategy

Figure 2 illustrates the proposed 2DAL strategy. Different from the typical binary active learning formulation, which selects the most informative samples for annotation, we jointly select both samples and labels simultaneously. The underlying assumption is that different labels of a certain sample contribute differently to minimizing the expected classification error of the to-be-trained classifier, and that annotating a portion of well-selected labels may provide sufficient information for learning the classifier. As shown in Figure 1, this strategy trades off the annotation labor against the learning performance along two dimensions, i.e., the sample and label dimensions. In essence, multi-label classifiers do have uncertainty along different labels as well as along different samples. Traditional active learning algorithms can be seen as a one-dimensional active selection approach, which only reduces the sample uncertainty. In contrast, 2DAL is a two-dimensional active learning strategy, which selects the most "informative"
sample-label pairs to reduce the uncertainty along both the sample and the label dimensions. More specifically, along the label dimension all of the labels interact through their correlations. Therefore, once partial labels are annotated, the remaining unlabeled concepts can be inferred based on the label correlations. Theoretically, the label correlations have a connection with the expected Bayesian error bound (see the lemma and theorem in Section 2.2), and thus these label correlations can help reduce the prediction errors on the testing set during the active learning procedure. This approach saves much of the labor of fully annotating multiple labels, and it is therefore far more efficient when the number of labels is huge. For instance, an image may be associated with hundreds or thousands of concepts, so a full annotation strategy would pay a large labor cost for even a single image. On the contrary, 2DAL only selects the most informative labels for annotation. In the following section, we derive such a two-dimensional selection criterion based on a derived Bayesian classification error bound in the multi-label setting.

On the other hand, it is worth noting that, as illustrated in Figure 2, during the learning process some samples may lack some labels, since only a portion of the labels are annotated. This is different from traditional active learning algorithms. In Section 3.2, we address how to learn the classification model from such incomplete labels.

2.2. Multi-labeled Bayesian error bound

The 2DAL learner requests annotations on the basis of sample-label pairs which, once incorporated into the training set, are expected to result in the lowest generalization error. Here we first derive a Multi-Labeled Bayesian Error Bound for the case in which a selected sample-label pair is labeled under the multi-label setting; 2DAL then iteratively selects the pairs that minimize this bound.

Before we move further, we first define some notation. Each sample x has m labels y_i (1 ≤ i ≤ m), and each of them indicates whether its corresponding semantic concept occurs. As stated before, in each 2DAL iteration some of these labels have already been annotated while others have not. Let U(x) ≜ {i | (x, y_i) is an unlabeled sample-label pair} denote the set of indices of the unlabeled part and L(x) ≜ {i | (x, y_i) is a labeled sample-label pair} denote the labeled part. Note that L(x) can be the empty set ∅, which indicates that no label has been annotated for x. Let P(y|x) be the conditional distribution over samples, where y ∈ {0,1}^m is the complete label vector, and let P(x) be the marginal sample distribution.

First, we establish a Bayesian error bound for classifying one unlabeled y_i once y_s is actively selected for annotation. This error bound originates from the equivocation bound given in [7], and we extend it to the multi-label setting so that it can handle sample-label pairs.

Lemma 1. Given a sample x with its labeled and unlabeled parts L(x) and U(x), once y_s is selected to ask for labeling (but not yet annotated), the Bayesian classification error E(y_i | y_s; y_{L(x)}, x) for an unlabeled y_i, i ∈ U(x), is bounded as

$$\frac{1}{2} H\bigl(y_i \mid y_s; y_{L(x)}, x\bigr) - \varepsilon \;\le\; E\bigl(y_i \mid y_s; y_{L(x)}, x\bigr) \;\le\; \frac{1}{2} H\bigl(y_i \mid y_s; y_{L(x)}, x\bigr) \qquad (1)$$

where

$$H\bigl(y_i \mid y_s; y_{L(x)}, x\bigr) = \sum_{t,r \in \{0,1\}} \Bigl\{ -P\bigl(y_i = t, y_s = r \mid y_{L(x)}, x\bigr)\, \log P\bigl(y_i = t \mid y_s = r; y_{L(x)}, x\bigr) \Bigr\}$$

is the conditional entropy of y_i given the selected part y_s (both y_i and y_s are random variables since they have not yet been labeled) and the known labeled part y_{L(x)}, and $\varepsilon = \frac{1}{2}\log\frac{5}{4}$ is a constant.

This lemma will be proven in the appendix.

Remark 1. It is worth noting that this bound does not depend on the true label of the selected y_s. In fact, before the annotator gives the label of y_s, its true value is unknown; however, no matter whether y_s turns out to be 1 or 0, the bound holds.

Based on this lemma, we obtain the following theorem, which bounds the multi-label error.

Theorem 1 (Multi-labeled Bayesian classification error bound). Under the conditions of Lemma 1, the Bayesian classification error E(y | y_s; y_{L(x)}, x) for sample x over all the labels y satisfies

$$E\bigl(y \mid y_s; y_{L(x)}, x\bigr) \;\le\; \frac{1}{2m} \sum_{i=1}^{m} \Bigl\{ H\bigl(y_i \mid y_{L(x)}, x\bigr) - MI\bigl(y_i; y_s \mid y_{L(x)}, x\bigr) \Bigr\} \qquad (2)$$

where MI(y_i; y_s | y_{L(x)}, x) is the mutual information between the random variables y_i and y_s given the known labeled part y_{L(x)}.

Proof.

$$\begin{aligned}
E\bigl(y \mid y_s; y_{L(x)}, x\bigr)
&\overset{(1)}{=} \frac{1}{m} \sum_{i=1}^{m} E\bigl(y_i \mid y_s; y_{L(x)}, x\bigr) \\
&\overset{(2)}{\le} \frac{1}{2m} \sum_{i=1}^{m} H\bigl(y_i \mid y_s; y_{L(x)}, x\bigr) \\
&\overset{(3)}{=} \frac{1}{2m} \sum_{i=1}^{m} \Bigl\{ H\bigl(y_i \mid y_{L(x)}, x\bigr) - MI\bigl(y_i; y_s \mid y_{L(x)}, x\bigr) \Bigr\}
\end{aligned} \qquad (3)$$

where (1) treats the multi-label error as the average of the per-label errors, (2) comes directly from Lemma 1, and (3) makes use of the relationship between mutual information and entropy: MI(X; Y) = H(X) − H(X|Y).

We are concerned with pool-based active learning, i.e., a large pool P sampled from P(x) is available to the learner, and the proposed 2DAL selects the most informative sample-label pairs from this pool. We first write the expected Bayesian classification error over all samples in P before selecting a sample-label pair (x_s, y_s):

$$E^{b}(\mathcal{P}) = \frac{1}{|\mathcal{P}|} \sum_{x \in \mathcal{P}} E\bigl(y \mid y_{L(x)}, x\bigr) \qquad (4)$$
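Stepping back to Lemma 1 and Theorem 1 for a moment, the following is a minimal Python sketch (ours, not part of the paper) of how the quantities they involve could be computed. It evaluates the conditional entropy, the mutual information, and the bounds of Eq. (1) from a hypothetical 2×2 joint table P(y_i, y_s | y_{L(x)}, x); in practice this table would be estimated by the current classifier, and base-2 logarithms are assumed.

```python
import numpy as np

# Hypothetical joint table P(y_i = t, y_s = r | y_L(x), x); rows index t, columns index r.
joint = np.array([[0.45, 0.05],
                  [0.05, 0.45]])

p_ys = joint.sum(axis=0)                      # marginal P(y_s = r | y_L(x), x)
p_yi = joint.sum(axis=1)                      # marginal P(y_i = t | y_L(x), x)

# Conditional entropy H(y_i | y_s; y_L(x), x), as defined below Eq. (1).
H_cond = -(joint * np.log2(joint / p_ys)).sum()

# H(y_i | y_L(x), x) and MI(y_i; y_s | y_L(x), x) = H(y_i | .) - H(y_i | y_s; .),
# the identity used in step (3) of the proof of Theorem 1.
H_marg = -(p_yi * np.log2(p_yi)).sum()
MI = H_marg - H_cond

eps = 0.5 * np.log2(5.0 / 4.0)                # the constant in Lemma 1
print(f"H(y_i|y_s)={H_cond:.3f}  MI={MI:.3f}  "
      f"Bayes error in [{0.5 * H_cond - eps:.3f}, {0.5 * H_cond:.3f}]")
```

With this strongly correlated toy table the mutual information is large, so annotating y_s would substantially tighten the error bound on y_i, which is exactly what the selection rule derived below rewards.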
We can use the above classification error on the pool to estimate the expected error over the full distribution P(x), i.e., $E_{P(x)}\bigl[E(y \mid y_{L(x)}, x)\bigr] = \int P(x)\, E\bigl(y \mid y_{L(x)}, x\bigr)\, dx$, because the pool provides not only a finite set of samples but also an estimate of P(x). After selecting the pair (x_s, y_s), the expected Bayesian classification error over the pool P is

$$\begin{aligned}
E^{a}(\mathcal{P}) &= \frac{1}{|\mathcal{P}|} \Bigl\{ E\bigl(y \mid y_s; y_{L(x_s)}, x_s\bigr) + \sum_{x \in \mathcal{P} \setminus x_s} E\bigl(y \mid y_{L(x)}, x\bigr) \Bigr\} \\
&= \frac{1}{|\mathcal{P}|} \Bigl\{ E\bigl(y \mid y_s; y_{L(x_s)}, x_s\bigr) - E\bigl(y \mid y_{L(x_s)}, x_s\bigr) + \sum_{x \in \mathcal{P}} E\bigl(y \mid y_{L(x)}, x\bigr) \Bigr\}
\end{aligned} \qquad (5)$$

Therefore, the reduction of the expected Bayesian classification error over the whole pool P after selecting (x_s, y_s) is

$$\Delta E(\mathcal{P}) = E^{b}(\mathcal{P}) - E^{a}(\mathcal{P}) \qquad (6)$$

Thus our goal is to select the best (x_s*, y_s*) that maximizes the above expected error reduction. That is,

$$(x_s^*, y_s^*) = \arg\max_{x_s \in \mathcal{P},\, y_s \in U(x_s)} \Delta E(\mathcal{P}) = \arg\min_{x_s \in \mathcal{P},\, y_s \in U(x_s)} -\Delta E(\mathcal{P}) \qquad (7)$$

Applying Lemma 1 and Theorem 1, we have

$$\begin{aligned}
-\Delta E(\mathcal{P}) &= E^{a}(\mathcal{P}) - E^{b}(\mathcal{P}) \\
&\overset{(1)}{=} \frac{1}{|\mathcal{P}|} \Bigl\{ E\bigl(y \mid y_s; y_{L(x_s)}, x_s\bigr) - E\bigl(y \mid y_{L(x_s)}, x_s\bigr) + \sum_{x \in \mathcal{P}} E\bigl(y \mid y_{L(x)}, x\bigr) \Bigr\} - \frac{1}{|\mathcal{P}|} \sum_{x \in \mathcal{P}} E\bigl(y \mid y_{L(x)}, x\bigr) \\
&= \frac{1}{|\mathcal{P}|} \Bigl\{ E\bigl(y \mid y_s; y_{L(x_s)}, x_s\bigr) - E\bigl(y \mid y_{L(x_s)}, x_s\bigr) \Bigr\} \\
&\overset{(2)}{\le} \frac{1}{|\mathcal{P}|} \Bigl\{ \frac{1}{2m} \sum_{i=1}^{m} H\bigl(y_i \mid y_{L(x_s)}, x_s\bigr) - \frac{1}{2m} \sum_{i=1}^{m} MI\bigl(y_i; y_s \mid y_{L(x_s)}, x_s\bigr) - \frac{1}{m} \sum_{i=1}^{m} E\bigl(y_i \mid y_{L(x_s)}, x_s\bigr) \Bigr\} \\
&\overset{(3)}{\le} \frac{1}{|\mathcal{P}|} \Bigl\{ \frac{1}{2m} \sum_{i=1}^{m} H\bigl(y_i \mid y_{L(x_s)}, x_s\bigr) - \frac{1}{2m} \sum_{i=1}^{m} MI\bigl(y_i; y_s \mid y_{L(x_s)}, x_s\bigr) - \frac{1}{m} \sum_{i=1}^{m} \Bigl[ \frac{1}{2} H\bigl(y_i \mid y_{L(x_s)}, x_s\bigr) - \varepsilon \Bigr] \Bigr\} \\
&= \frac{1}{|\mathcal{P}|} \Bigl\{ \varepsilon - \frac{1}{2m} \sum_{i=1}^{m} MI\bigl(y_i; y_s \mid y_{L(x_s)}, x_s\bigr) \Bigr\}
\end{aligned} \qquad (8)$$

The equality (1) comes from Eqns. (4) and (5). The first inequality (2) follows from Theorem 1, and the second inequality (3) comes from the lower bound in Lemma 1.

Consequently, by minimizing the obtained Bayesian error bound (8), we can select the sample-label pair for annotation according to

$$(x_s^*, y_s^*) = \arg\min_{x_s \in \mathcal{P},\, y_s \in U(x_s)} \frac{1}{|\mathcal{P}|} \Bigl\{ \varepsilon - \frac{1}{2m} \sum_{i=1}^{m} MI\bigl(y_i; y_s \mid y_{L(x_s)}, x_s\bigr) \Bigr\} = \arg\max_{x_s \in \mathcal{P},\, y_s \in U(x_s)} \sum_{i=1}^{m} MI\bigl(y_i; y_s \mid y_{L(x_s)}, x_s\bigr) \qquad (9)$$

2.3. Further Discussions

1. As discussed in Section 2.1, the proposed 2DAL approach is an active learning algorithm along two dimensions, which reduces not only sample uncertainty but also label uncertainty. The selection strategy in Eq. (9) well reflects these two targets. The objective in the last line of Eq. (9) can be rewritten as

$$\sum_{i=1}^{m} MI\bigl(y_i; y_s \mid y_{L(x_s)}, x_s\bigr) = MI\bigl(y_s; y_s \mid y_{L(x_s)}, x_s\bigr) + \sum_{i=1, i \ne s}^{m} MI\bigl(y_i; y_s \mid y_{L(x_s)}, x_s\bigr) = H\bigl(y_s \mid y_{L(x_s)}, x_s\bigr) + \sum_{i=1, i \ne s}^{m} MI\bigl(y_i; y_s \mid y_{L(x_s)}, x_s\bigr) \qquad (10)$$

As we can see, the objective selection function for 2DAL is divided into two parts: $H\bigl(y_s \mid y_{L(x_s)}, x_s\bigr)$ and $\sum_{i \ne s} MI\bigl(y_i; y_s \mid y_{L(x_s)}, x_s\bigr)$. The former entropy measures the uncertainty of the selected pair (x_s*, y_s*) itself; this is consistent with typical one-dimensional active learning algorithms, i.e., selecting the most uncertain samples near the classification boundary [10][9]. The latter mutual information terms measure the statistical redundancy between the selected label and the rest. By maximizing these mutual information terms, 2DAL provides information for inferring the other labels and thereby helps reduce their label uncertainty. Therefore, the obtained strategy confirms our motivation of selecting the most informative sample-label pairs to reduce the uncertainties along both the sample and label dimensions. Note that when there is only one label per sample, Eq. (10) reduces to H(y_s | x_s), and the selection criterion becomes the same as the traditional binary-based criterion, i.e., selecting the most uncertain sample for annotation [9][14]. A brief code sketch of this selection rule is given at the end of this subsection.

2. When computing the mutual information terms in Eq. (9), we need the distribution P(y|x). The true distribution is unknown, but we can estimate it using the current learner. As stated in [13], such an approximation is reasonable because the most useful labeling is usually consistent with the learner's prior belief over the majority (but not all) of the unlabeled pairs.

3. It is worth pointing out that the posterior P(y|x) is significant in modeling the label correlations. If we assume independence among the different labels, i.e., $P(y|x) = \prod_{i=1}^{m} P(y_i|x)$, then the mutual information terms become MI(y_i; y_s | y_{L(x_s)}, x_s) = 0 for i ≠ s. In this case, the selection criterion reduces to $(x_s^*, y_s^*) = \arg\max_{x_s \in \mathcal{P},\, y_s \in U(x_s)} H\bigl(y_s \mid y_{L(x_s)}, x_s\bigr)$, that is, selecting the most uncertain sample-label pair. Such a criterion neglects the label correlations and will be less efficient in reducing the label uncertainty. Therefore, a statistical method that can model the label correlations needs to be adopted. We introduce such a Bayesian model in the following section.
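As referenced above, here is a minimal Python sketch (ours, not from the paper) of the selection rule in Eq. (9). It assumes the current learner exposes a hypothetical `pairwise_joint(x, i, s)` routine that returns the estimated 2×2 table P(y_i, y_s | y_{L(x)}, x).

```python
import numpy as np

def entropy(p):
    """Entropy (base 2) of a discrete distribution given as an array of probabilities."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_info(joint):
    """MI(y_i; y_s | y_L(x), x) computed from a 2x2 joint table."""
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())

def select_pair(pool, unlabeled, num_labels, pairwise_joint):
    """Eq. (9): return the (sample, label) pair maximizing the summed mutual information.

    pool           -- iterable of candidate sample identifiers
    unlabeled      -- unlabeled[x] is the index set U(x) of still-unlabeled labels of x
    num_labels     -- m, the total number of labels
    pairwise_joint -- pairwise_joint(x, i, s): estimated P(y_i, y_s | y_L(x), x), shape (2, 2)
    """
    best_pair, best_score = None, -np.inf
    for x in pool:
        for s in unlabeled[x]:
            # Sum over all labels i; the i == s term equals H(y_s | y_L(x), x),
            # i.e. the sample-uncertainty part of the decomposition in Eq. (10).
            score = sum(mutual_info(pairwise_joint(x, i, s)) for i in range(num_labels))
            if score > best_score:
                best_pair, best_score = (x, s), score
    return best_pair
```

In a full system the double loop would typically be restricted to a candidate subset of the pool for efficiency, and `pairwise_joint` would be derived from the posterior model introduced in the next section.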
3. Maximum Entropy Model and EM Variant

In the above 2DAL strategy, we have indicated that a statistical model is needed to measure the label correlations. However, common multi-label classifiers, such as the one-against-rest encoded binary SVM, tackle the classification of multi-labeled samples in an independent manner. These models neglect the label correlations and hence do not fit our target. In this section, we introduce a multi-labeled Bayesian classifier in which the relations among different labels are well modeled.

3.1. Kernelized maximum entropy model

The principle of the Maximum Entropy Model (MEM) is to model all that is known and to assume nothing about what is unknown. The traditional MEM handles single-label classification and thus suffers from the same problem as the binary SVM. [16] extends the single-labeled MEM to the multi-labeled case. This model is linear and can be effective on a set of samples that vary linearly; however, it fails to capture the structure of the feature space if the variations among the samples are nonlinear. Image classification is exactly such a case, since one is trying to extract features from image categories that vary in their appearance, illumination conditions and complex background clutter. Therefore, a nonlinear version of this MEM is required to classify images based on their nonlinear feature structure. Moreover, [16] does not address the problem brought by incomplete labels. We first introduce the model in [16] and then extend it to a nonlinear case by incorporating a kernel function into the model. This extension is used as the underlying classifier in 2DAL.

Let Q̃(x, y) and Q(x, y) denote the empirical and the model distribution, respectively. The optimal multi-label model can be obtained by solving the following formulation [16]:

$$\begin{aligned}
\hat{P} = \arg\max_{P} H(x, y \mid Q) &= \arg\min_{P} \langle \log P(y|x) \rangle_{Q} \\
\text{s.t.}\quad \langle y_i \rangle_{Q} &= \langle y_i \rangle_{\tilde{Q}} + \eta_i, \quad 1 \le i \le m \\
\langle y_i y_j \rangle_{Q} &= \langle y_i y_j \rangle_{\tilde{Q}} + \theta_{ij}, \quad 1 \le i < j \le m \\
\langle y_i x_l \rangle_{Q} &= \langle y_i x_l \rangle_{\tilde{Q}} + \phi_{il}, \quad 1 \le i \le m,\; 1 \le l \le d \\
\sum_{y} P(y|x) &= 1
\end{aligned} \qquad (11)$$

where H(x, y | Q) is the entropy of x and y under the distribution Q, ⟨·⟩_Q denotes the expectation with respect to the distribution Q, d is the dimension of the feature vector x, and x_l represents its l-th element. η_i, θ_ij and φ_il are estimation errors following a Gaussian distribution; they serve to smooth the MEM and improve the model's generalization ability. By modeling the pair-wise label correlations, the obtained model reveals the underlying label correlations. Formulation (11) can be solved by Lagrange-multiplier algorithms, and the obtained posterior probability is

$$\hat{P}(y|x) = \frac{1}{Z(x)} \exp\bigl( y^{T} (b + R y + W x) \bigr), \qquad Z(x) = \sum_{y} \exp\bigl( y^{T} (b + R y + W x) \bigr)$$

where Z(x) is the partition function and the parameters b, W and R are Lagrangian multipliers that need to be determined. The optimal parameters can be found by minimizing the Lagrangian

$$\begin{aligned}
L(b, R, W) &= -\bigl\langle \log \hat{P}(y|x) \bigr\rangle_{\tilde{Q}} + \frac{\lambda_b}{2n} \|b\|_2^2 + \frac{\lambda_R}{2n} \|R\|_F^2 + \frac{\lambda_W}{2n} \|W\|_F^2 \\
&= \bigl\langle -y^{T}(b + R y + W x) + \log Z(x) \bigr\rangle_{\tilde{Q}} + \frac{\lambda_b}{2n} \|b\|_2^2 + \frac{\lambda_R}{2n} \|R\|_F^2 + \frac{\lambda_W}{2n} \|W\|_F^2
\end{aligned} \qquad (12)$$

where ||·||_F denotes the Frobenius norm and n is the number of samples in the training set.

Now we extend the above multi-labeled MEM to a nonlinear one so that the powerful kernel method can be adopted. A transformation ψ maps samples into a target space in which the kernel function k(x′, x) gives the inner product. We can rewrite the multi-labeled MEM as $\hat{P}(y|x) = \frac{1}{Z(x)} \exp\bigl( y^{T}(b + R y) + y^{T} K(W, x) \bigr)$. According to the Representer Theorem, the optimal weighting vector of the single-labeled problem is a linear combination of the samples. In the proposed multi-labeled setting, the mapped weighting matrix ψ(W) can still be written as a linear combination of the ψ(x_i), except that the combination coefficients are vectors instead of scalars, i.e.,

$$\psi(W) = \sum_{i=1}^{n} \theta(x_i)\, \psi^{T}(x_i) = [\,\theta(x_1)\; \theta(x_2)\; \cdots\; \theta(x_n)\,] \cdot [\,\psi(x_1)\; \psi(x_2)\; \cdots\; \psi(x_n)\,]^{T} = \Theta \cdot [\,\psi(x_1)\; \psi(x_2)\; \cdots\; \psi(x_n)\,]^{T} \qquad (13)$$

where the summation is taken over the samples in the training set {x_i}_{i=1}^{n}, θ(x_i) is an m × 1 coefficient vector, and Θ is an m × n matrix in which each row holds the weighting coefficients for one label. Accordingly, we have

$$K(W, x) = \psi(W) \cdot \psi(x) = \Theta \cdot [\,k(x_1, x)\; \cdots\; k(x_n, x)\,]^{T} = \Theta \cdot k(x) \qquad (14)$$

and so

$$\hat{P}(y|x) = \frac{1}{Z(x)} \exp\bigl( y^{T}(b + R y) + y^{T} K(W, x) \bigr) = \frac{1}{Z(x)} \exp\bigl( y^{T} (b + R y + \Theta k(x)) \bigr) \qquad (15)$$

where k(x) = [k(x_1, x) ⋯ k(x_n, x)]^T is an n × 1 vector that can be seen as a new representation of the sample x. Correspondingly, using the identity ||ψ(W)||_F^2 = tr(ψ(W) ψ(W)^T) = tr(Θ K Θ^T), the Lagrangian in Eq. (12) can be rewritten as

$$L(b, R, \Theta) = -\bigl\langle \log \hat{P}(y|x) \bigr\rangle_{\tilde{Q}} + \frac{\lambda_b}{2n} \|b\|_2^2 + \frac{\lambda_R}{2n} \|R\|_F^2 + \frac{\lambda_W}{2n} \operatorname{tr}(\Theta K \Theta^{T}) \qquad (16)$$

where K = [k(x_i, x_j)]_{n×n} is the kernel matrix. We call the above model the Kernelized Maximum Entropy Model (KMEM) in this paper. By minimizing Eq. (16), we can estimate the optimal parameters of the KMEM.
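As an illustration only (not code from the paper), the KMEM posterior in Eq. (15) can be evaluated by brute-force enumeration of the 2^m label vectors when m is small, as it is for the data sets used here. The random parameters below are placeholders for those that would be learned by minimizing Eq. (16).

```python
import itertools
import numpy as np

def kmem_posterior(b, R, Theta, k_x):
    """Return all label vectors and their posterior P_hat(y | x) per Eq. (15).

    b     -- (m,)   bias vector
    R     -- (m, m) label-correlation weights
    Theta -- (m, n) kernel expansion coefficients
    k_x   -- (n,)   kernel vector k(x) = [k(x_1, x), ..., k(x_n, x)]
    """
    m = b.shape[0]
    ys = np.array(list(itertools.product([0, 1], repeat=m)))    # all 2^m label vectors
    # unnormalized log-score y^T (b + R y + Theta k(x)) for every y
    scores = np.array([y @ (b + R @ y + Theta @ k_x) for y in ys])
    probs = np.exp(scores - scores.max())                        # numerically stabilized
    probs /= probs.sum()                                         # divide by Z(x)
    return ys, probs

# Toy usage with m = 3 labels and n = 5 training samples (random placeholder parameters).
rng = np.random.default_rng(0)
m, n = 3, 5
ys, probs = kmem_posterior(rng.normal(size=m), rng.normal(size=(m, m)),
                           rng.normal(size=(m, n)), rng.normal(size=n))
print(ys[probs.argmax()], probs.max())
```

For large m this enumeration becomes infeasible and the partition function Z(x) would require approximate inference, which is outside the scope of this sketch.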
3.2. EM-based approach for incomplete labels

Given the partially labeled training set constructed by 2DAL (see Figure 2), we can handle the incomplete labels by integrating out the unlabeled part to yield the marginal distribution of the labeled part, $\hat{P}\bigl(y_{L(x)} \mid x\bigr) = \sum_{y_{U(x)}} \hat{P}\bigl(y_{U(x)}, y_{L(x)} \mid x\bigr)$. Substituting it for P̂(y|x) in Eq. (16), we obtain

$$L(b, R, \Theta) = -\Bigl\langle \log \sum_{y_{U(x)}} \hat{P}\bigl(y_{U(x)}, y_{L(x)} \mid x\bigr) \Bigr\rangle_{\tilde{Q}} + \frac{\lambda_b}{2n} \|b\|_2^2 + \frac{\lambda_R}{2n} \|R\|_F^2 + \frac{\lambda_W}{2n} \operatorname{tr}(\Theta K \Theta^{T}) \qquad (17)$$
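A minimal, self-contained sketch (ours) of the marginal $\hat{P}(y_{L(x)} \mid x)$ that appears in Eq. (17): it enumerates the label vectors as in the previous sketch, keeps those consistent with the annotated entries L(x), and sums out the unlabeled part U(x).

```python
import itertools
import numpy as np

def marginal_labeled(b, R, Theta, k_x, labeled):
    """P_hat(y_L(x) | x) = sum over y_U(x) of P_hat(y | x), cf. Eq. (17).

    labeled -- dict {label index: observed 0/1 value} describing L(x)
    """
    m = b.shape[0]
    ys = np.array(list(itertools.product([0, 1], repeat=m)))
    scores = np.array([y @ (b + R @ y + Theta @ k_x) for y in ys])   # Eq. (15) scores
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # keep only the label vectors that agree with the annotated entries,
    # then sum their probabilities (this integrates out y_U(x))
    mask = np.all([ys[:, i] == v for i, v in labeled.items()], axis=0)
    return probs[mask].sum()

# Toy usage: m = 3 labels, n = 4 training samples, labels 0 and 2 already annotated.
rng = np.random.default_rng(1)
m, n = 3, 4
print(marginal_labeled(rng.normal(size=m), rng.normal(size=(m, m)),
                       rng.normal(size=(m, n)), rng.normal(size=n),
                       labeled={0: 1, 2: 0}))
```

The EM variant named in the section title would, under this reading, alternate between filling in the expected values of the unlabeled part under the current model (E-step) and re-minimizing the Lagrangian (M-step); the sketch above covers only the marginalization itself.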
[Figure: pairwise relationships among the six scene concepts Beach, Sunset, Fall foliage, Field, Mountain and Urban.]

[Figure 4. Average F1 score versus the total number of selected sample-label pairs for 2DAL, 1DAL and RND: (a) the Scene data set (up to 6,000 pairs); (b) the Yeast gene data set (up to 14,000 pairs).]
[…]

1. The proposed 2DAL strategy: using the proposed sample-label pair selection criterion in Section 2.2, with KMEM as the underlying classifier.

2. One-dimensional active learning strategy (1DAL): using the mean-max loss active learning strategy proposed in the previous work [11] on multi-label active learning. As stated in Section 1, this strategy selects only along the sample dimension; it does not take advantage of the label correlations to reduce the human labeling cost. Therefore, when a sample is selected, all of its labels have to be labeled.

3. Random strategy (RND): selecting the sample-label pairs at random. For a fair comparison with the proposed 2DAL, we also use KMEM as the classifier.

We use the average F1 score over all the different labels for performance evaluation, i.e., F1 = 2rp/(r + p), where p and r are precision and recall, respectively. For the Scene data set, we use 241 (10%) images as the initial training set. In each iteration, 60 sample-label pairs are selected by 2DAL. Note that 1DAL requests annotation on the basis of samples rather than sample-label pairs, so in each iteration it selects 10 images for annotating all six labels, or equivalently 60 image-label pairs. The average F1 score is then computed over all the remaining unlabeled data. In Figure 4(a), we show the performance of the three strategies against the total number of selected sample-label pairs. The proposed 2DAL has the best performance over all iterations, and the improvement becomes more and more significant as the number of selected pairs increases. Table 2 compares the F1 scores after 100 active learning iterations over all six scene categories. The proposed 2DAL outperforms the other strategies on all the categories; in particular, the improvement is obvious on "Urban". Such an […]

4.2. Gene data set

The second data set is the Yeast data set [11], which consists of micro-array expression data and phylogenetic profiles for 2,417 genes; each gene in the set belongs to one or more of 14 different functional classes. As a multi-labeled gene data set, it contains 33,838 sample-label pairs. Each sample in this data set is annotated with at most 11 positive labels. A detailed description of this biological data set can be found in [6].

In the experiment, 242 (10%) genes with their labels are used as the initial training set. In each iteration, 140 sample-label pairs are selected. Similar to Section 4.1, 1DAL selects 14 samples and annotates all their labels, which is equivalent to 140 sample-label pairs. Figure 4(b) illustrates the performance of the three strategies on this data set.

From the above two experiments, we have observed:

1. Given a fixed number of annotations, 2DAL outperforms 1DAL over all the active learning iterations. This is because the former considers both sample and label uncertainty when selecting sample-label pairs, while 1DAL only considers the sample uncertainty. Therefore, the informative label correlations associated with each sample help reduce the expensive human labor needed to construct the labeled pool.

2. The proposed 2DAL gives good performance on diverse data sets, ranging from natural scene images to gene expression data. This is an important characteristic for an algorithm to be used in real-world applications.

5. Conclusion

In this paper, we proposed an efficient two-dimensional active learning (2DAL) strategy for multi-labeled image classification. This 2DAL strategy selects the sample-label pairs to annotate along both the sample and label dimensions. We also showed that the traditional active learning formulation is a special case of 2DAL when there is only one label. Extensive experiments on two widely used data sets have shown that, for a given number of required annotations, the proposed 2DAL strategy outperforms the state-of-the-art one-dimensional active learning approach.