
Two-Dimensional Active Learning for Image Classification


Guo-Jun Qi†, Xian-Sheng Hua‡, Yong Rui‡, Jinhui Tang†, Hong-Jiang Zhang‡

†MOE-Microsoft Key Laboratory of Multimedia Computing and Communication
& Department of Automation, University of Science and Technology of China
{qgj, jhtang}@mail.ustc.edu.cn

‡Microsoft Research Asia
{xshua, yongrui, hjzhang}@microsoft.com

Abstract

In this paper, we propose a two-dimensional active learning scheme and show its application in image classification. Traditional active learning methods select samples only along the sample dimension. While this is the right strategy for binary classification, it is sub-optimal for multi-label classification. In multi-label classification, we argue that, for each selected sample, only a part of the more effective labels need to be annotated, while the others can be inferred by exploring the correlations among the labels. The reason is that the contributions of different labels to minimizing the classification error differ due to the inherent label correlations. To this end, we propose to select sample-label pairs, rather than only samples, to minimize a multi-label Bayesian classification error bound. This new active learning strategy considers not only the sample dimension but also the label dimension, and we call it Two-Dimensional Active Learning (2DAL). We also show that the traditional active learning formulation is a special case of 2DAL when there is only one label. Extensive experiments conducted on two real-world applications show that 2DAL significantly outperforms the best existing approaches, which do not take label correlation into account.

1. Introduction

Image semantic understanding is typically formulated as either a multi-class or a multi-label classification problem [15][2]. In the multi-class setting, each image is classified into one and only one predefined category; in other words, only one label is assigned to an image. Real-world applications [2], however, require that one or multiple labels can be assigned to an image. This requirement results in multi-label classification, which is significantly more challenging and is the focus of this paper. Specifically, we use active learning as the tool and extend it from a one-dimensional, sample-centric approach to a two-dimensional, joint sample-label-centric approach for multi-label image classification.

Active learning is one of the most widely used methods in image classification, as it can significantly reduce the human cost of labeling training samples. Specifically, active learning methods iteratively annotate a set of carefully selected samples so that the classification error is minimized in each iteration. As a result, the total number of training samples that need to be labeled is smaller than in non-active-learning approaches. It is clear that the core of any active learning approach is the sample selection strategy. In the past decade, a number of active learning approaches have been developed using different sample selection strategies [14][4][10][8]. Most of these approaches focus on the binary or multi-class classification scenario [10][4][15]. However, in many real-world applications such as image and video retrieval [2][12], text search [16] and bioinformatics [6], a sample is usually associated with multiple labels rather than a single one. Under such a multi-label setting, each sample is annotated as "positive" or "negative" for each and every label (see Figure 1 for some examples). As a result, active learning with multi-labeled samples is much more challenging than with binary-labeled ones, especially when the number of labels is large.

A direct way to tackle active learning in the multi-label setting is to translate it into a set of binary problems, i.e., each category/label is handled independently by a binary active learning algorithm. For example, in [11][3] two research groups each proposed such a binary-based active learning algorithm for the multi-labeled classification problem. However, this type of approach does not take into account the inherent relationships among multiple labels. In this paper, we propose a novel active learning strategy which iteratively selects sample-label pairs to minimize the expected classification error.

Figure 1. Some examples of multi-labeled images. "P" means positive label and "N" means negative label. [Three example images with per-concept labels: Field P/N/N, Mountain P/N/P, Urban N/P/N, Beach N/P/P, People N/P/N.]
Specifically, in each iteration, the annotators are only required to annotate/confirm a selected subset of the labels of the selected samples, while the remaining unlabeled part is inferred according to the label correlations. We call this strategy Two-Dimensional Active Learning (2DAL), since it considers not only the samples to be labeled along the sample dimension but also the labels associated with these samples along the label dimension.

Figure 2. The proposed two-dimensional 2DAL strategy. [Sample-label matrices before and after 2DAL selection: rows are samples X1, ..., Xn and columns are labels; "P" = labeled positive concept, "N" = labeled negative concept, "?" = unlabeled concept, "S" = concept selected to be labeled.]

An intuitive explanation of this strategy is that there exist both sample and label redundancies for multi-labeled samples. Therefore, annotating a set of selected sample-label pairs provides enough information for training the classifiers, since the information in the selected pairs can be propagated to the rest along both the sample and the label "dimensions". Unlike existing binary-based active learning strategies [11][3], which only take the sample redundancy into account, the 2DAL strategy additionally considers the label dimension to leverage the rich relationships embedded in the multiple labels. 2DAL efficiently selects an informative part of the labels rather than all the labels for a particular selected sample. Such a strategy significantly reduces the human labor required during active learning. For example, Field and Mountain tend to occur simultaneously in an image. Therefore, it is reasonable to select only one label (e.g., Mountain) for annotation, since the uncertainty of the other label can be remarkably decreased after annotating this one. Another example is Mountain and Urban: in contrast to Field and Mountain, these two labels seldom occur simultaneously. Thus, annotating one of them will most likely rule out the presence of the other.

To realize 2DAL, we will answer the following questions in this paper:

1. What is the proper selection strategy for finding the sample-label pairs? To address this issue, we formulate the selection of sample-label pairs as minimizing a derived Multi-Label Bayesian Classification Error Bound. We will demonstrate that selecting sample-label pairs in this way significantly reduces the uncertainty of both the samples and the labels.

2. How can we model the label relationships/correlations? Since the proposed 2DAL strategy utilizes the label dependencies to reduce labeling cost, the underlying classifier should be able to model the corresponding label correlations. Accordingly, we propose a Kernelized Maximum Entropy Model (KMEM) to model such correlations. Furthermore, since the 2DAL strategy only annotates a subset of the labels, we formulate an Expectation-Maximization (EM) [5] algorithm to solve the incomplete labeling problem.

To the best of our knowledge, we are the first to present a study of active learning at the granularity of sample-label pairs, with both theoretical analysis and empirical results on real-world data sets. The rest of the paper is organized as follows. In section 2, we present the 2DAL selection strategy used in the proposed active learning algorithm. We also show that the traditional active learning formulation is a special case of 2DAL when there is only one label. After that, a Kernelized Maximum Entropy Model is proposed in section 3 to model the label correlations. In addition, an Expectation-Maximization (EM) algorithm is given in this section to solve the incomplete labeling problem. In section 4 we evaluate the proposed 2DAL in comparison with a state-of-the-art one-dimensional active learning approach on two real-world data sets. Finally, we conclude in section 5.

2. Two-Dimensional Active Learning

In this section, we start by detailing the underlying idea of the proposed 2DAL strategy in the multi-label setting from the perspective of information theory. Then, a Bayesian error bound is derived that states the expected classification error given a selected sample-label pair. The proposed 2DAL strategy is then deduced by selecting the sample-label pairs which optimize this bound.

2.1. The proposed 2DAL strategy

Figure 2 illustrates the proposed 2DAL strategy. Different from the typical binary active learning formulation that selects the most informative samples for annotation, we jointly select both the samples and labels simultaneously. The underlying assumption is that different labels of a certain sample make different contributions to minimizing the expected classification error of the to-be-trained classifier, and that annotating a portion of well-selected labels may provide sufficient information for learning the classifier. As shown in Figure 2, this strategy trades off between the annotation labor and the learning performance along two dimensions, i.e., the sample and label dimensions. In essence, multi-label classifiers do have uncertainty along different labels as well as different samples. Traditional active learning algorithms can be seen as a one-dimensional active selection approach, which only reduces the sample uncertainty. In contrast, 2DAL is a two-dimensional active learning strategy, which selects the most "informative" sample-label pairs to reduce the uncertainty along the dimensions of both sample and label.
More specifically, along the label dimension all of the labels interact through their correlations. Therefore, once partial labels are annotated, the remaining unlabeled concepts can be inferred based on the label correlations. Theoretically, the label correlations are connected to the expected Bayesian error bound (see the lemma and theorem in section 2.2), and thus these label correlations can help to reduce the prediction errors on the testing set during the active learning procedure. This approach saves much of the labor of fully annotating multiple labels, and it is therefore far more efficient when the number of labels is huge. For instance, an image may be associated with hundreds or thousands of concepts, so a full annotation strategy would pay a large labor cost for even a single image. On the contrary, 2DAL only selects the most informative labels for annotation. In the following section, we derive such a two-dimensional selection criterion based on a Bayesian classification error bound in the multi-label setting.

On the other hand, it is worth noting that, as illustrated in Figure 2, during the learning process some samples may lack some labels since only a part of the labels are annotated. This is different from the traditional active learning algorithm. In section 3.2, we will address how to learn the classification model from such incomplete labels.

2.2. Multi-labeled Bayesian error bound

The 2DAL learner requests annotations on the basis of sample-label pairs which, once incorporated into the training set, are expected to result in the lowest generalization error. Here we first derive a Multi-Labeled Bayesian Error Bound for the case where a selected sample-label pair is labeled under the multi-label setting; 2DAL then iteratively selects the pairs that minimize this bound.

Before we move further, we first define some notation. Each sample x has m labels yi (1 ≤ i ≤ m), each of which indicates whether its corresponding semantic concept occurs. As stated before, in each 2DAL iteration some of these labels have already been annotated while others have not. Let U(x) ≜ {i | (x, yi) is an unlabeled sample-label pair} denote the set of indices of the unlabeled part and L(x) ≜ {i | (x, yi) is a labeled sample-label pair} denote the labeled part. Note that L(x) can be the empty set ∅, which indicates that no label has been annotated for x. Let P(y|x) be the conditional label distribution, where y ∈ {0, 1}^m is the complete label vector, and let P(x) be the marginal sample distribution.

First, we establish a Bayesian error bound for classifying one unlabeled yi once ys is actively selected for annotation. This error bound originates from the equivocation bound given in [7], and we extend it to the multi-label setting so that it can handle sample-label pairs.

Lemma 1. Given a sample x and its labeled and unlabeled parts U(x) and L(x), once ys is selected for labeling (but not yet annotated), the Bayesian classification error E(yi | ys; yL(x), x) for an unlabeled yi, i ∈ U(x), is bounded as

    (1/2) H(yi | ys; yL(x), x) − ε ≤ E(yi | ys; yL(x), x) ≤ (1/2) H(yi | ys; yL(x), x)        (1)

where

    H(yi | ys; yL(x), x) = Σ_{t,r ∈ {0,1}} { −P(yi = t, ys = r | yL(x), x) log P(yi = t | ys = r; yL(x), x) }

is the conditional entropy of yi given the selected part ys (both yi and ys are random variables since they have not been labeled) and yL(x) is the known labeled part; ε = (1/2) log(5/4) is a constant. This lemma is proven in the appendix.

Remark 1. It is worth noting that this bound does not depend on the true label of the selected ys. In fact, before the annotator gives the label of ys, the true value of ys is unknown. However, no matter whether ys turns out to be 1 or 0, the bound always holds.

Based on this lemma, we can obtain the following theorem, which bounds the multi-label error:

Theorem 1. (Multi-labeled Bayesian classification error bound) Under the conditions of Lemma 1, the Bayesian classification error bound E(y | ys; yL(x), x) for sample x over all the labels y is

    E(y | ys; yL(x), x) ≤ (1/(2m)) Σ_{i=1}^m { H(yi | yL(x), x) − MI(yi; ys | yL(x), x) }        (2)

where MI(yi; ys | yL(x), x) is the mutual information between the random variables yi and ys given the known labeled part yL(x).

Proof.

    E(y | ys; yL(x), x)
      =(1) (1/m) Σ_{i=1}^m E(yi | ys; yL(x), x)
      ≤(2) (1/(2m)) Σ_{i=1}^m H(yi | ys; yL(x), x)
      =(3) (1/(2m)) Σ_{i=1}^m { H(yi | yL(x), x) − MI(yi; ys | yL(x), x) }        (3)

where (2) directly follows from Lemma 1, and (3) uses the relationship between mutual information and entropy: MI(X; Y) = H(X) − H(X|Y).

We are concerned with pool-based active learning, i.e., a large pool P sampled from P(x) is available to the learner, and the proposed 2DAL then selects the most informative sample-label pairs from this pool. We first write the expected Bayesian classification error over all samples in P before selecting a sample-label pair (xs, ys):

    E^b(P) = (1/|P|) Σ_{x ∈ P} E(y | yL(x), x)        (4)
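To make the quantities in Lemma 1 and Theorem 1 concrete, the following minimal sketch (ours, not from the paper) takes a 2×2 joint posterior P(yi, ys | yL(x), x), as it would be estimated by the current classifier, computes the conditional entropy, the mutual information and the Bayes error, and numerically checks the bound of Eqn. 1. It assumes base-2 logarithms, under which ε = (1/2) log(5/4) is the exact maximal gap between (1/2)H(p) and min{p, 1−p}.

```python
import numpy as np

# Illustrative sketch: joint[t, r] = P(y_i = t, y_s = r | y_L(x), x), a 2x2 table
# assumed to come from the current model. All logarithms are base 2.

def conditional_entropy(joint):
    """H(y_i | y_s) from the 2x2 joint table."""
    p_ys = joint.sum(axis=0)                    # marginal of y_s
    cond = joint / p_ys                         # P(y_i = t | y_s = r)
    return float(-(joint * np.log2(cond)).sum())

def entropy(p):
    """Entropy of a discrete distribution p (base 2)."""
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint):
    """MI(y_i; y_s) = H(y_i) - H(y_i | y_s)."""
    return entropy(joint.sum(axis=1)) - conditional_entropy(joint)

def bayes_error(joint):
    """E(y_i | y_s) = sum_r P(y_s = r) * min_t P(y_i = t | y_s = r)."""
    p_ys = joint.sum(axis=0)
    cond = joint / p_ys
    return float((p_ys * cond.min(axis=0)).sum())

rng = np.random.default_rng(0)
joint = rng.random((2, 2))
joint /= joint.sum()                            # random 2x2 joint distribution

H = conditional_entropy(joint)
E = bayes_error(joint)
eps = 0.5 * np.log2(5.0 / 4.0)                  # the constant in Lemma 1
assert 0.5 * H - eps <= E + 1e-12 and E <= 0.5 * H + 1e-12   # Eqn. (1)
print(H, mutual_information(joint), E)
```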
We can use the above classification error on the pool to estimate the expected error over the full distribution P(x), i.e., E_{P(x)}[E(y | yL(x), x)] = ∫ P(x) E(y | yL(x), x) dx, because the pool not only provides a finite set of samples but also an estimate of P(x). After selecting the pair (xs, ys), the expected Bayesian classification error over the pool P is

    E^a(P) = (1/|P|) { E(y | ys; yL(xs), xs) + Σ_{x ∈ P\{xs}} E(y | yL(x), x) }
           = (1/|P|) { E(y | ys; yL(xs), xs) − E(y | yL(xs), xs) + Σ_{x ∈ P} E(y | yL(x), x) }        (5)

Therefore, the reduction of the expected Bayesian classification error after selecting (xs, ys) over the whole pool P is

    ΔE(P) = E^b(P) − E^a(P)        (6)

Thus our goal is to select the best (xs*, ys*) that maximizes the above expected error reduction. That is,

    (xs*, ys*) = arg max_{xs ∈ P, ys ∈ U(xs)} ΔE(P) = arg min_{xs ∈ P, ys ∈ U(xs)} −ΔE(P)        (7)

Applying Lemma 1 and Theorem 1, we have

    −ΔE(P) = E^a(P) − E^b(P)
      =(1) (1/|P|) { E(y | ys; yL(xs), xs) − E(y | yL(xs), xs) + Σ_{x ∈ P} E(y | yL(x), x) } − (1/|P|) Σ_{x ∈ P} E(y | yL(x), x)
      = (1/|P|) { E(y | ys; yL(xs), xs) − E(y | yL(xs), xs) }
      ≤(2) (1/|P|) { (1/(2m)) Σ_{i=1}^m H(yi | yL(xs), xs) − (1/(2m)) Σ_{i=1}^m MI(yi; ys | yL(xs), xs) − (1/m) Σ_{i=1}^m E(yi | yL(xs), xs) }
      ≤(3) (1/|P|) { (1/(2m)) Σ_{i=1}^m H(yi | yL(xs), xs) − (1/(2m)) Σ_{i=1}^m MI(yi; ys | yL(xs), xs) − (1/m) Σ_{i=1}^m [ (1/2) H(yi | yL(xs), xs) − ε ] }
      = (1/|P|) { ε − (1/(2m)) Σ_{i=1}^m MI(yi; ys | yL(xs), xs) }        (8)

The equality (1) comes from Eqns. 4 and 5. The first inequality (2) follows from Theorem 1, and the second inequality (3) comes from the lower bound of Lemma 1.

Consequently, by minimizing the obtained Bayesian error bound (8), we select the sample-label pair for annotation according to

    (xs*, ys*) = arg min_{xs ∈ P, ys ∈ U(xs)} (1/|P|) { ε − (1/(2m)) Σ_{i=1}^m MI(yi; ys | yL(xs), xs) }
               = arg max_{xs ∈ P, ys ∈ U(xs)} Σ_{i=1}^m MI(yi; ys | yL(xs), xs)        (9)

2.3. Further Discussions

1. As we discussed in section 2.1, the proposed 2DAL approach is an active learning algorithm along two dimensions, which reduces not only sample uncertainty but also label uncertainty. The selection strategy in Eqn. 9 reflects these two targets well. The last term in Eqn. 9 can be rewritten as

       Σ_{i=1}^m MI(yi; ys | yL(xs), xs)
         = MI(ys; ys | yL(xs), xs) + Σ_{i=1, i≠s}^m MI(yi; ys | yL(xs), xs)
         = H(ys | yL(xs), xs) + Σ_{i=1, i≠s}^m MI(yi; ys | yL(xs), xs)        (10)

   As we can see, the objective selection function for 2DAL is divided into two parts: H(ys | yL(xs), xs) and Σ_{i=1, i≠s}^m MI(yi; ys | yL(xs), xs). The former entropy measures the uncertainty of the selected pair (xs*, ys*) itself, and this is consistent with the typical one-dimensional active learning algorithm, i.e., selecting the most uncertain samples near the classification boundary [10][9]. On the other hand, the latter mutual information term measures the statistical redundancy between the selected label and the rest. By maximizing these mutual information terms, 2DAL provides information for the inference of the other labels and thereby helps reduce their label uncertainty. Therefore, the obtained strategy confirms our motivation of selecting the most informative sample-label pairs to reduce the uncertainties along both the sample and label dimensions. Note that when there is only one label for each sample, Eqn. 10 reduces to H(ys | xs). The selection criterion then becomes the same as the traditional binary-based criterion, i.e., to select the most uncertain sample for annotation [9][14].

2. When computing the mutual information terms in Eqn. 9, we need the distribution P(y|x). The true distribution is unknown, but we can estimate it using the current learner. As stated in [13], such an approximation is reasonable because the most useful labeling is usually consistent with the learner's prior belief over the majority (but not all) of the unlabeled pairs.

3. It is worth pointing out that the posterior P(y|x) is crucial for modeling the label correlations. If we assume independence among the different labels, i.e., P(y|x) = Π_{i=1}^m P(yi|x), then the mutual information terms become MI(yi; ys | yL(xs), xs) = 0 for i ≠ s. In this case, the selection criterion reduces to (xs*, ys*) = arg max_{xs ∈ P, ys ∈ U(xs)} H(ys | yL(xs), xs), that is, to select the most uncertain sample-label pair. Such a criterion neglects the label correlations and is less efficient in reducing label uncertainty. Therefore, a statistical method that can model the label correlations needs to be adopted. We introduce such a Bayesian model in the following section; a minimal sketch of the selection rule in Eqn. 9 itself is given below.
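Eqn. 9 can be turned into a direct search over candidate pairs. The following is a minimal sketch (ours, not the authors' implementation) assuming a helper posterior(x, labeled) that returns the model's joint posterior over the currently unlabeled labels of x as a dictionary mapping 0/1 assignment tuples to probabilities (e.g., obtained from the KMEM of section 3 by enumeration); pool, labeled_sets and unlabeled_sets are likewise illustrative names.

```python
import numpy as np

# Illustrative sketch of the selection rule in Eqn. 9.
# posterior(x, labeled) -> {assignment tuple over U(x): probability}

def pairwise_marginal(post, unlabeled, i, s):
    """2x2 joint P(y_i, y_s | y_L(x), x) from the posterior table."""
    joint = np.zeros((2, 2))
    pi, ps = unlabeled.index(i), unlabeled.index(s)
    for assignment, prob in post.items():
        joint[assignment[pi], assignment[ps]] += prob
    return joint

def mutual_information(joint):
    """MI(y_i; y_s) from a 2x2 joint table (base-2 logs)."""
    px, py = joint.sum(axis=1, keepdims=True), joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px * py)[mask])).sum())

def select_pair(pool, labeled_sets, unlabeled_sets, posterior):
    """Return the (x, s) maximizing sum_i MI(y_i; y_s | y_L(x), x) over the pool."""
    best, best_score = None, -np.inf
    for x in pool:
        unlabeled = sorted(unlabeled_sets[x])
        post = posterior(x, labeled_sets[x])
        for s in unlabeled:
            # labeled indices contribute zero MI, so summing over U(x) suffices
            score = sum(mutual_information(pairwise_marginal(post, unlabeled, i, s))
                        for i in unlabeled)
            if score > best_score:
                best, best_score = (x, s), score
    return best
```

Note that the i = s term in the inner sum equals H(ys | yL(x), x), so this sketch automatically reproduces the decomposition of Eqn. 10 into an uncertainty part plus a label-redundancy part.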
3. Maximum Entropy Model and EM Variant

In the above 2DAL strategy, we have indicated that a statistical model is needed to measure the label correlations. However, common multi-label classifiers, such as the one-against-rest encoded binary SVM, tackle the classification of multi-labeled samples in an independent manner. These models neglect the label correlations and hence do not fit our target. In this section, we introduce a multi-labeled Bayesian classifier in which the relations among different labels are well modeled.

3.1. Kernelized maximum entropy model

The principle of the Maximum Entropy Model (MEM) is to model all that is known and to assume nothing about the unknown. A traditional single-label MEM suffers from the same problem as the binary SVM above, and [16] extends the single-labeled MEM to the multi-labeled case. This model is linear and can be effective on a set of samples that vary linearly. However, it fails to capture the structure of the feature space if the variations among the samples are nonlinear. Image classification is exactly such a case, since features are extracted from image categories that vary in appearance, illumination conditions and complex background clutter. Therefore, a nonlinear version of this MEM is required to classify images based on their nonlinear feature structure. Moreover, [16] does not address the problem brought about by incomplete labels. We first introduce the model in [16] and then extend it to a nonlinear case by incorporating a kernel function into the model. This extension is used as the underlying classifier in 2DAL.

Let Q̃(x, y) and Q(x, y) denote the empirical and the model distribution, respectively. The optimal multi-label model can be obtained by solving the following formulation [16]:

    P̂ = arg max_P H(x, y | Q) = arg min_P ⟨log P(y|x)⟩_Q
    s.t.  ⟨yi⟩_Q = ⟨yi⟩_Q̃ + ηi,
          ⟨yi yj⟩_Q = ⟨yi yj⟩_Q̃ + θij,   1 ≤ i < j ≤ m,
          ⟨yi xl⟩_Q = ⟨yi xl⟩_Q̃ + φil,   1 ≤ i ≤ m, 1 ≤ l ≤ d,
          Σ_y P(y|x) = 1        (11)

where H(x, y | Q) is the entropy of x and y under distribution Q, and ⟨·⟩_Q denotes the expectation with respect to distribution Q. d is the dimension of the feature vector x and xl represents its l-th element. ηi, θij and φil are estimation errors following Gaussian distributions, which serve to smooth the MEM and improve the model's generalization ability. By modeling the pair-wise label correlations, the obtained model reveals the underlying label correlations. Formulation 11 can be solved by Lagrange multiplier algorithms, and the obtained posterior probability is P̂(y|x) = (1/Z(x)) exp(yᵀ(b + Ry + Wx)), where Z(x) = Σ_y exp(yᵀ(b + Ry + Wx)) is the partition function, and the parameters b, W and R are Lagrange multipliers that need to be determined. The optimal parameters can be found by minimizing the Lagrangian:

    L(b, R, W) = −⟨log P̂(y|x)⟩_Q̃ + (λb/(2n))||b||₂² + (λR/(2n))||R||F² + (λW/(2n))||W||F²
               = ⟨−yᵀ(b + Ry + Wx) + log Z(x)⟩_Q̃ + (λb/(2n))||b||₂² + (λR/(2n))||R||F² + (λW/(2n))||W||F²        (12)

where ||·||F denotes the Frobenius norm and n is the number of samples in the training set.

Now we extend the above multi-labeled MEM to a nonlinear one so that the powerful kernel method can be adopted. A transformation ψ maps samples into a target space in which the kernel function k(x′, x) gives the inner product. We can rewrite the multi-labeled MEM as P̂(y|x) = (1/Z(x)) exp(yᵀ(b + Ry) + yᵀK(W, x)). According to the Representer Theorem, the optimal weighting vector of the single-labeled problem is a linear combination of the samples. In the proposed multi-labeled setting, the mapped weighting matrix ψ(W) can still be written as a linear combination of the ψ(xi), except that the combination coefficients are vectors instead of scalars, i.e.

    ψ(W) = Σ_{i=1}^n θ(xi) ψᵀ(xi)
         = [ θ(x1) θ(x2) ··· θ(xn) ] · [ ψ(x1) ψ(x2) ··· ψ(xn) ]ᵀ
         = Θ · [ ψ(x1) ψ(x2) ··· ψ(xn) ]ᵀ        (13)

where the summation is taken over the samples in the training set {xi}_{i=1}^n, θ(xi) is an m × 1 coefficient vector, and Θ is an m × n matrix in which each row contains the weighting coefficients for one label. Accordingly, we have

    K(W, x) = ψ(W) · ψ(x) = Θ · [ k(x1, x) ··· k(xn, x) ]ᵀ = Θ · k(x)        (14)

and so

    P̂(y|x) = (1/Z(x)) exp(yᵀ(b + Ry) + yᵀK(W, x)) = (1/Z(x)) exp(yᵀ(b + Ry + Θk(x)))        (15)

where k(x) = [ k(x1, x) ··· k(xn, x) ]ᵀ is an n × 1 vector that can be seen as a new representation of sample x. Correspondingly, with the identity ||ψ(W)||F² = tr(ψ(W)ψ(W)ᵀ) = tr(ΘKΘᵀ), the Lagrangian function in Eqn. 12 can be rewritten as

    L(b, R, Θ) = −⟨log P̂(y|x)⟩_Q̃ + (λb/(2n))||b||₂² + (λR/(2n))||R||F² + (λW/(2n)) tr(ΘKΘᵀ)        (16)

where K = [k(xi, xj)]_{n×n} is the kernel matrix. We call the above model the Kernelized Maximum Entropy Model (KMEM). By minimizing Eqn. 16, we can estimate the optimal parameters of KMEM.
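As a concrete illustration of Eqn. 15, the sketch below (ours, not the authors' code) evaluates the KMEM posterior by brute-force enumeration of y ∈ {0,1}^m, which is feasible for the label-set sizes used in this paper (m = 6 and m = 14). The names b, R, Theta and the kernel vector k(x) follow the notation above, but the values in the toy usage are random placeholders rather than fitted parameters.

```python
import itertools
import numpy as np

def kmem_posterior(b, R, Theta, k_x):
    """Return all label vectors y in {0,1}^m with their probabilities P(y|x) under Eqn. 15."""
    m = b.shape[0]
    ys = np.array(list(itertools.product([0, 1], repeat=m)), dtype=float)   # (2^m, m)
    linear = b + Theta @ k_x                                   # b + Theta k(x), shape (m,)
    scores = ys @ linear + np.einsum('ki,ij,kj->k', ys, R, ys)  # y^T (b + Theta k(x)) + y^T R y
    scores -= scores.max()                                      # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()                                        # divide by the partition function Z(x)
    return ys, probs

# Toy usage with random placeholder parameters (illustration only):
rng = np.random.default_rng(0)
m, n = 6, 20
ys, probs = kmem_posterior(rng.normal(size=m), rng.normal(size=(m, m)),
                           rng.normal(size=(m, n)), rng.normal(size=n))
marginal = probs[ys[:, 0] == 1].sum()   # model marginal P(y_1 = 1 | x)
```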
3.2. EM based approach for incomplete labels

Given the partially labeled training set constructed by 2DAL (see Figure 2), we can handle the incomplete labels by integrating out the unlabeled part to yield the marginal distribution of the labeled part, P̂(yL(x) | x) = Σ_{yU(x)} P̂(yU(x), yL(x) | x). Substituting it for P̂(y|x) in Eqn. 16, we obtain:

    L(b, R, Θ) = −⟨ log Σ_{yU(x)} P̂(yU(x), yL(x) | x) ⟩_Q̃ + (λb/(2n))||b||₂² + (λR/(2n))||R||F² + (λW/(2n)) tr(ΘKΘᵀ)        (17)

By minimizing Eqn. 17, we can find the optimal parameters for KMEM. However, it is difficult to minimize directly. Instead, we use the Expectation-Maximization (EM) algorithm [5] to solve this optimization problem:

E-Step: Given the parameter estimates bt, Rt, Θt at the current step t, the T-function (i.e., the expectation of the Lagrangian in Eqn. 16 under the current parameters, given the labeled part) can be written as

    T(b, R, Θ | bt, Rt, Θt) = ⟨ −E_{U(x)|L(x); bt, Rt, Θt} [ log P̂(yU(x), yL(x) | x; b, R, Θ) ] ⟩_Q̃
                              + (λb/(2n))||b||₂² + (λR/(2n))||R||F² + (λW/(2n)) tr(ΘKΘᵀ)        (18)

where E_{U(x)|L(x); bt, Rt, Θt} is the expectation operator under the current estimated conditional probability P̂(yU(x) | yL(x), x; bt, Rt, Θt).

M-Step: Update the parameters by minimizing the T-function:

    bt+1, Rt+1, Θt+1 = arg min_{b, R, Θ} T(b, R, Θ | bt, Rt, Θt)        (19)

The derivatives of the T-function with respect to its parameters b, R, Θ are

    ∂T/∂bi  = ⟨ ⟨yi⟩_Q − E_{yi|L(x); bt, Rt, Θt}[yi] ⟩_Q̃ + (λb/n) bi
    ∂T/∂Rij = ⟨ ⟨yi yj⟩_Q − E_{yi, yj|L(x); bt, Rt, Θt}[yi yj] ⟩_Q̃ + (λR/n) Rij
    ∂T/∂Θil = ⟨ ⟨yi k(xl, x)⟩_Q − E_{yi|L(x); bt, Rt, Θt}[yi k(xl, x)] ⟩_Q̃ + (λW/n) Σ_{k=1}^n Θik k(xk, xl)        (20)

Given these derivatives, we can use efficient gradient descent methods (e.g., LMVM [1]) to minimize Eqn. 18.

4. Experiments

In this section, we evaluate the proposed 2DAL strategy on two real-world data sets. The first is a natural scene set with six image categories. The second is a biological data set whose genes belong to 14 different functional classes. Both data sets are publicly available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. We compare the proposed 2DAL with state-of-the-art active learning approaches.

Figure 3. The mutual information between different concepts in the Scene data set. [Matrix of pairwise mutual information among Beach, Sunset, Fall foliage, Field, Mountain and Urban.]

Class           Total    Class                      Total
Beach           369      Beach+Mountain             38
Sunset          364      Foliage+Mountain           13
Foliage         360      Field+Mountain             75
Field           327      Field+Foliage+Mountain     1
Beach+Field     1        Urban                      405
Foliage+Field   23       Beach+Urban                19
Mountain        405
Table 1. Description of the Scene data set.

4.1. Natural scene data set

This natural scene data set was first used in previous research on the multi-labeled image scene classification problem [2]. It contains 2,407 natural images belonging to one or more of six natural scene categories: beach, sunset, fall foliage, field, mountain, and urban. Since the data set is multi-labeled, it contains 14,442 sample-label pairs. Each sample has been assigned at most three positive labels. Table 1 describes the multi-label distribution of this set. We can see that 177 samples have more than one positive label. Although this number is not large, it does not imply that the label correlation is low. In fact, the statistical correlations between the labels are determined not only by the correlations between positive labels but also by those between negative labels, as well as between positive and negative ones. In Figure 3, we illustrate the mutual information calculated over the whole data set. According to information theory, the mutual information captures all of these kinds of correlations among the positive/negative labels. From this illustration, the correlations between the labels are obvious. Note that the mutual information computed here is not the one used by 2DAL in Eqn. 9; in Eqn. 9, the mutual information is calculated from the statistical model KMEM.

For the features used in this experiment, an image is first converted into the CIE Luv color space, and the first and second color moments (mean and variance) are then extracted over a 7 × 7 grid on the image. The result is a 49 × 2 × 3 = 294 dimensional feature vector [2].
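For illustration, a rough sketch of this 294-dimensional color-moment feature might look as follows (ours, not the authors' code). The Luv conversion itself is assumed to have been done beforehand, e.g. with OpenCV (cv2.cvtColor(img, cv2.COLOR_BGR2Luv)) or scikit-image (skimage.color.rgb2luv).

```python
import numpy as np

def color_moment_feature(img_luv, grid=7):
    """First and second color moments (mean, variance) over a grid x grid layout."""
    h, w, _ = img_luv.shape
    rows = np.array_split(np.arange(h), grid)
    cols = np.array_split(np.arange(w), grid)
    feats = []
    for r in rows:
        for c in cols:
            block = img_luv[np.ix_(r, c)]                      # one of the 49 grid cells
            feats.append(block.reshape(-1, 3).mean(axis=0))    # 3 channel means
            feats.append(block.reshape(-1, 3).var(axis=0))     # 3 channel variances
    return np.concatenate(feats)                               # 49 * 2 * 3 = 294 values

feature = color_moment_feature(np.random.rand(240, 320, 3))    # placeholder Luv image
assert feature.shape == (294,)
```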
In this experiment, we compare the following three active learning strategies:

1. The proposed 2DAL strategy: using the proposed sample-label pair selection criterion of Section 2.2, with KMEM as the underlying classifier.

2. One-dimensional active learning (1DAL): using the mean-max loss active learning strategy proposed in previous work [11] on multi-label active learning. As stated in Section 1, this strategy selects only along the sample dimension and does not take advantage of the label correlations to reduce human labeling cost. Therefore, when a sample is selected, all of its labels have to be annotated.

3. Random strategy (RND): selecting the sample-label pairs at random. For a fair comparison with the proposed 2DAL, we also use KMEM as the classifier.

We use the average F1 score over all labels for performance evaluation, i.e., F1 = 2rp/(r + p), where p and r are precision and recall respectively. For the Scene data set, we use 241 (10%) images as the initial training set. In each iteration, 60 sample-label pairs are selected by 2DAL. Note that 1DAL requests annotation on the basis of samples rather than sample-label pairs, so in each iteration it selects 10 images and annotates all six labels, or equivalently 60 image-label pairs. The average F1 score is then computed over all the remaining unlabeled data. In Figure 4(a), we show the performance of the three strategies against the total number of selected sample-label pairs. The proposed 2DAL has the best performance over all iterations, and as the number of selected pairs increases, the improvement becomes more and more significant. Table 2 compares the F1 scores after 100 active learning iterations over all six scene categories. The proposed 2DAL outperforms the other strategies on all categories. In particular, the improvement is obvious on "Urban". Such an improvement is obtained by considering its significant correlations with the other categories (see Figure 3 for an illustration of these label correlations) during the active learning procedure. This confirms that 2DAL can clearly improve the classification performance.

Figure 4. The performance of the three active learning strategies over two real-world data sets: (a) Scene, (b) Yeast. [Average F1 score versus the total number of selected sample-label pairs for 2DAL, 1DAL and RND.]

Class          2DAL     1DAL     RND
Beach          0.9523   0.8652   0.6744
Sunset         0.9916   0.9421   0.9002
Fall Foliage   0.9887   0.9338   0.8927
Field          0.9588   0.8813   0.8071
Mountain       0.7806   0.6457   0.6122
Urban          0.8534   0.6162   0.6856
Table 2. F1 scores after 100 iterations on the six scene categories.

4.2. Gene data set

The second data set is the Yeast data set [11], which consists of micro-array expression data and phylogenetic profiles for 2,417 genes; each gene belongs to one or more of 14 different functional classes. As a multi-labeled gene data set, it contains 33,838 sample-label pairs. Each sample is annotated with at most 11 positive labels. A detailed description of this biological data set can be found in [6].

In the experiment, 242 (10%) genes with their labels are used as the initial training set. In each iteration, 140 sample-label pairs are selected. Similar to section 4.1, 1DAL selects 14 samples and annotates all of their labels, which is equivalent to 140 sample-label pairs. Figure 4(b) illustrates the performance of the three strategies on this data set.

From the above two experiments, we observe:

1. Given a fixed number of annotations, 2DAL outperforms 1DAL over all the active learning iterations. This is because the former considers both sample and label uncertainty when selecting sample-label pairs, while 1DAL only considers the sample uncertainty. Therefore, the informative label correlations associated with each sample help to reduce the expensive human labor needed to construct the labeled pool.

2. The proposed 2DAL gives good performance on diverse data sets, ranging from natural scene images to gene data. This is an important characteristic of a good algorithm for real-world applications.
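For reference, interpreting the "average F1 over all labels" above as a macro-average, the metric can be computed as in the following sketch (illustrative only; not the authors' evaluation code). It assumes binary ground-truth and prediction matrices of shape (num_samples, num_labels).

```python
import numpy as np

def average_f1(y_true, y_pred):
    """Macro-average of F1 = 2rp/(r+p) over the label dimension."""
    f1s = []
    for k in range(y_true.shape[1]):
        t, p = y_true[:, k], y_pred[:, k]
        tp = np.sum((t == 1) & (p == 1))
        precision = tp / max(np.sum(p == 1), 1)    # guard against empty predictions
        recall = tp / max(np.sum(t == 1), 1)       # guard against empty ground truth
        f1s.append(0.0 if precision + recall == 0 else
                   2 * precision * recall / (precision + recall))
    return float(np.mean(f1s))
```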
5. Conclusion

In this paper, we proposed an efficient two-dimensional active learning (2DAL) strategy for multi-labeled image classification. The 2DAL strategy selects the sample-label pairs to annotate along both the sample and the label dimensions. In contrast to traditional one-dimensional binary active learning algorithms, 2DAL only needs to annotate a subset of the labels associated with a selected sample and is thus much more efficient. Furthermore, we showed that the traditional active learning formulation is a special case of 2DAL when there is only one label. Extensive experiments on two widely used data sets have shown that, for a given number of required annotations, the proposed 2DAL strategy outperforms other state-of-the-art sample selection strategies.

Appendix

Here we give the proof of Lemma 1.

Proof. Since the selected ys can take on two values {0, 1}, there are two possible posterior distributions for the unlabeled yi, i.e., P(yi | ys = 0; yL(x), x) and P(yi | ys = 1; yL(x), x). If ys = 1 holds, the Bayesian classification error is [7]:

    E(yi | ys = 1; yL(x), x) = min{ P(yi = 1 | ys = 1; yL(x), x), P(yi = 0 | ys = 1; yL(x), x) }        (21)

Given the inequality (1/2)H(p) − ε ≤ min{p, 1 − p} ≤ (1/2)H(p), with ε = (1/2) log(5/4) (see Figure 5), we have

    (1/2) H(yi | ys = 1; yL(x), x) − ε ≤ E(yi | ys = 1; yL(x), x) ≤ (1/2) H(yi | ys = 1; yL(x), x)        (22)

Similarly, if ys = 0 holds,

    (1/2) H(yi | ys = 0; yL(x), x) − ε ≤ E(yi | ys = 0; yL(x), x) ≤ (1/2) H(yi | ys = 0; yL(x), x)        (23)

Therefore, the Bayesian classification error bound given the selected ys can be computed as:

    E(yi | ys; yL(x), x)
      = P(ys = 1 | yL(x), x) E(yi | ys = 1; yL(x), x) + P(ys = 0 | yL(x), x) E(yi | ys = 0; yL(x), x)
      ≤ (1/2) P(ys = 1 | yL(x), x) H(yi | ys = 1; yL(x), x) + (1/2) P(ys = 0 | yL(x), x) H(yi | ys = 0; yL(x), x)
      = (1/2) H(yi | ys; yL(x), x)        (24)

The last equality follows from the definition of conditional entropy. Similarly,

    E(yi | ys; yL(x), x)
      = P(ys = 1 | yL(x), x) E(yi | ys = 1; yL(x), x) + P(ys = 0 | yL(x), x) E(yi | ys = 0; yL(x), x)
      ≥ (1/2) P(ys = 1 | yL(x), x) { H(yi | ys = 1; yL(x), x) − 2ε } + (1/2) P(ys = 0 | yL(x), x) { H(yi | ys = 0; yL(x), x) − 2ε }
      = (1/2) H(yi | ys; yL(x), x) − ε        (25)

Figure 5. Illustration of the inequality (1/2)H(p) − ε ≤ min{p, 1 − p} ≤ (1/2)H(p), with ε = (1/2) log(5/4). [Plot of (1/2)H(p) and min{p, 1 − p} versus p, with the maximum gap ε marked.]

References

[1] S. Benson, L. C. McInnes, J. Moré, T. Munson, and J. Sarich. TAO user manual (revision 1.9). Technical Report ANL/MCS-TM-242, Mathematics and Computer Science Division, Argonne National Laboratory, 2007. http://www.mcs.anl.gov/tao.
[2] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9), 2004.
[3] K. Brinker. On active learning in multi-label classification. In From Data and Information Analysis to Knowledge Engineering, Studies in Classification, Data Analysis, and Knowledge Organization. Springer, 2006.
[4] E. Y. Chang, S. Tong, K. Goh, and C. Chang. Support vector machine concept-dependent active learning for image retrieval. IEEE Transactions on Multimedia, 2005.
[5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (Series B), 39(1), 1977.
[6] A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In Proc. of NIPS, 2002.
[7] M. E. Hellman and J. Raviv. Probability of error, equivocation, and the Chernoff bound. IEEE Transactions on Information Theory, 1970.
[8] S. C. H. Hoi and M. R. Lyu. A semi-supervised active learning framework for image retrieval. In Proc. of IEEE CVPR, 2005.
[9] F. Jing, M. Li, and H.-J. Zhang. Entropy-based active learning with support vector machine for content-based image retrieval. In Proc. of IEEE International Conference on Multimedia and Expo (ICME), 2004.
[10] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell. Active learning with Gaussian processes for object categorization. In Proc. of IEEE ICCV, 2007.
[11] X. Li, L. Wang, and E. Sung. Multi-label SVM active learning for image classification. In Proc. of ICIP, 2004.
[12] G.-J. Qi, X.-S. Hua, Y. Rui, J.-H. Tang, T. Mei, and H.-J. Zhang. Correlative multi-label video annotation. In Proc. of ACM Multimedia, 2007.
[13] N. Roy and A. McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proc. of ICML, 2001.
[14] S. Tong and E. Y. Chang. Support vector machine active learning for image retrieval. In Proc. of ACM Multimedia, 2001.
[15] R. Yan, J. Yang, and A. Hauptmann. Automatically labeling video data using multi-class active learning. In Proc. of IEEE ICCV, 2003.
[16] S. Zhu, X. Ji, W. Xu, and Y. Gong. Multi-labelled classification using maximum entropy method. In Proc. of ACM SIGIR, 2005.
