
to determine the probability (or confidence) of a label. In
most cases, it relies on the prior knowledge of the human
experts, which is a highly subjective and variable process.
As a result, the problem of learning from probabilistic
labels has not been extensively studied to date. Fortunately,
although not a probability by definition, $d_x^y$ still shares the
same constraints with probability, i.e., $d_x^y \in [0, 1]$ and
$\sum_y d_x^y = 1$. Thus, many theories and methods in statistics
can be applied to label distributions.
It is also worthwhile to distinguish description degree
from the concept membership used in fuzzy classification [42].
Membership is a truth value that may range between
completely true and completely false. It is designed to
handle the status of partial truth that often appears in the
nonnumeric linguistic variables. For example, the age 25
might have a membership of 0.7 to the linguistic category
“young” and 0.3 to “middle age.” But for a particular face,
its association with the chronological age 25 will be either
completely true or completely false. On the other hand,
description degree reflects the ambiguity of the class
description of the instance, i.e., one class label may only
partially describe the instance. For example, due to the
appearance similarity of the neighboring ages, both the
chronological age 25 and the neighboring ages 24 and
26 can be used to describe the appearance of a 25-year-old
face. For each of 24, 25, and 26, it is completely true that it
can be used to describe the face (in the sense of appearance).
Each age’s description degree indicates how much the age
contributes to the full class description of the face.
The prior label distribution assigned to a face image at
the chronological age $\alpha$ should satisfy the following two
properties: 1) the description degree of $\alpha$ is the highest in
the label distribution, which ensures the leading position
of the chronological age in the class description;
2) the description degree of the other ages decreases with the
increase of the distance away from $\alpha$, which makes the ages
closer to the chronological age contribute more to the class
description. While there are many possibilities, Fig. 3
shows two kinds of prior label distributions for the images
at the chronological age $\alpha$, i.e., the Gaussian distribution
and the triangle distribution. Note that the age $y$ is
regarded as a discrete class label in this paper, while both
the Gaussian and triangle distributions are defined by
continuous density functions $p(y)$. Directly letting $d_x^y = p(y)$
might induce $\sum_y d_x^y \neq 1$. Thus, a normalization process
$d_x^y = p(y) / \sum_y p(y)$ is required to ensure $\sum_y d_x^y = 1$.
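As a quick illustration of this normalization step, the sketch below samples both densities of Fig. 3 at integer ages and renormalizes the samples into valid label distributions. The width parameters (a Gaussian $\sigma = 2$ and a triangle half-width of 3) are illustrative choices, not values fixed by the text:

```python
import math

def normalize(density):
    """Turn sampled density values p(y) into description degrees
    d_x^y = p(y) / sum_y p(y), so the degrees sum to 1."""
    total = sum(density)
    return [p / total for p in density]

def gaussian_prior(ages, alpha, sigma=2.0):
    # Unnormalized Gaussian centered at the chronological age alpha.
    return normalize([math.exp(-((y - alpha) ** 2) / (2 * sigma ** 2)) for y in ages])

def triangle_prior(ages, alpha, half_width=3.0):
    # Triangle density peaking at alpha, reaching 0 at alpha +/- half_width.
    return normalize([max(0.0, 1.0 - abs(y - alpha) / half_width) for y in ages])

ages = list(range(20, 31))
for d in (gaussian_prior(ages, 25), triangle_prior(ages, 25)):
    assert abs(sum(d) - 1.0) < 1e-9      # sum-to-one constraint holds
    assert max(d) == d[ages.index(25)]   # highest degree at the chronological age
```

Either prior satisfies the two required properties: the peak sits at the chronological age, and the degrees fall off with distance from it.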
3 LEARNING FROM LABEL DISTRIBUTIONS
3.1 Problem Formulation
As mentioned before, many theories and methods from
statistics can be borrowed to deal with label distributions.
First of all, the description degree $d_x^y$ can be represented
in the form of a conditional probability, i.e., $d_x^y = P(y \mid x)$.
This can be interpreted as follows: given an instance $x$, the
probability of the presence of $y$ is equal to its description
degree. Then, the problem of label distribution learning can
be formulated as follows:
Let $\mathcal{X} = \mathbb{R}^q$ denote the input space and $\mathcal{Y} = \{y_1, y_2, \ldots, y_c\}$
denote the finite set of possible class labels. Given a training
set $S = \{(x_1, D_1), (x_2, D_2), \ldots, (x_n, D_n)\}$, where $x_i \in \mathcal{X}$ is an
instance and $D_i = \{d_{x_i}^{y_1}, d_{x_i}^{y_2}, \ldots, d_{x_i}^{y_c}\}$ is the label distribution
associated with $x_i$, the goal of label distribution learning is
to learn a conditional probability mass function $p(y \mid x)$ from
$S$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$.
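To make the formulation concrete, a toy training set $S$ in this form might look as follows; the feature values and description degrees are fabricated purely for illustration, with $q = 4$ and $c = 3$:

```python
# A toy training set S = {(x_1, D_1), ..., (x_n, D_n)}: each x_i is a
# feature vector in R^q and each D_i a label distribution over c labels.
q, c = 4, 3
S = [
    ([0.1, 0.5, -0.2, 0.8], [0.7, 0.2, 0.1]),   # instance 1
    ([0.3, -0.1, 0.4, 0.0], [0.1, 0.6, 0.3]),   # instance 2
]
for x, D in S:
    assert len(x) == q and len(D) == c
    assert all(0.0 <= d <= 1.0 for d in D)      # each degree in [0, 1]
    assert abs(sum(D) - 1.0) < 1e-9             # degrees sum to 1
```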
For the problem of age estimation, suppose the same
shape of prior label distribution (e.g., Fig. 3) is assigned to
each face image; then the highest description degree for
each image will be the same, say, $p_{\max}$. Since the
description degree of the chronological age should always
be the highest in the label distribution, for a face image $x_\alpha$
at the chronological age $\alpha$, the label distribution learner
should output
$$p(\alpha \mid x_\alpha) = p_{\max}, \tag{1}$$
$$p(\alpha + \Delta \mid x_\alpha) = p_{\max} - \epsilon_\Delta, \tag{2}$$
where $\epsilon_\Delta \in [0, 1]$ is the description degree difference from
$p_{\max}$ when the age changes to a neighboring age $\alpha + \Delta$.
Similarly, for a face image $x_{\alpha+\Delta}$ at the chronological age
$\alpha + \Delta$:
$$p(\alpha + \Delta \mid x_{\alpha+\Delta}) = p_{\max}. \tag{3}$$
As mentioned before, the faces at close ages are quite
similar, i.e., $x_{\alpha+\Delta} \approx x_\alpha$; thus,
$$p(\alpha + \Delta \mid x_{\alpha+\Delta}) \approx p(\alpha + \Delta \mid x_\alpha). \tag{4}$$
So, $\epsilon_\Delta$ is a small positive number, which indicates that
$p(\alpha + \Delta \mid x_\alpha)$ is just a little smaller than $p(\alpha \mid x_\alpha)$. Note
that the above analysis does not depend on any particular
form of the prior label distribution except that it must
satisfy the two properties mentioned in Section 2. This
shows that, when applied to age estimation, label distribution
learning tends to learn the similarity among the
neighboring ages, no matter what the (reasonable) prior
label distribution might be.
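The size of this difference can be checked numerically for a concrete prior. Assuming a discretized Gaussian prior with an illustrative width of $\sigma = 2$ (a choice not fixed by the text), the degree drop between the chronological age and its immediate neighbor indeed comes out as a small positive number:

```python
import math

def discrete_gaussian(ages, alpha, sigma=2.0):
    # Gaussian density sampled at integer ages, renormalized to sum to 1.
    w = [math.exp(-((y - alpha) ** 2) / (2 * sigma ** 2)) for y in ages]
    s = sum(w)
    return {y: v / s for y, v in zip(ages, w)}

d = discrete_gaussian(range(18, 33), alpha=25)
p_max = d[25]             # highest description degree, at the chronological age
eps_1 = p_max - d[26]     # description degree drop for a neighbor one year away
assert 0 < eps_1 < 0.05   # small positive, as the argument above predicts
```

A wider prior (larger $\sigma$) makes the drop smaller, i.e., the neighboring ages are treated as even more similar.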
Suppose $p(y \mid x)$ is a parametric model $p(y \mid x; \theta)$, where
$\theta$ is the vector of the model parameters. Given the training
set $S$, the goal of label distribution learning is to find the $\theta$
that can generate a distribution similar to $D_i$ given the
instance $x_i$. If the Kullback-Leibler divergence is used to
measure the difference between two distributions, then the
best model parameter vector $\theta^*$ is determined by
$$\theta^* = \arg\min_\theta \sum_i \sum_j d_{x_i}^{y_j} \ln \frac{d_{x_i}^{y_j}}{p(y_j \mid x_i; \theta)} = \arg\max_\theta \sum_i \sum_j d_{x_i}^{y_j} \ln p(y_j \mid x_i; \theta). \tag{5}$$
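The equivalence of the two criteria in (5) holds because the entropy term $\sum_j d_{x_i}^{y_j} \ln d_{x_i}^{y_j}$ does not depend on $\theta$, so minimizing the KL divergence and maximizing the weighted log-likelihood select the same model. This can be verified on a toy example (the target and candidate distributions below are fabricated for illustration):

```python
import math

def kl_objective(D, P):
    # sum_j d_j * ln(d_j / p_j): KL divergence from model P to target D.
    return sum(d * math.log(d / p) for d, p in zip(D, P) if d > 0)

def loglik_objective(D, P):
    # sum_j d_j * ln p_j: the equivalent maximization criterion in (5).
    return sum(d * math.log(p) for d, p in zip(D, P) if d > 0)

D = [0.2, 0.5, 0.3]            # target label distribution
candidates = [                 # three candidate model outputs
    [0.1, 0.6, 0.3],
    [0.2, 0.5, 0.3],
    [0.4, 0.4, 0.2],
]
best_kl = min(candidates, key=lambda P: kl_objective(D, P))
best_ll = max(candidates, key=lambda P: loglik_objective(D, P))
assert best_kl == best_ll == [0.2, 0.5, 0.3]   # both criteria pick the same model
```

By Gibbs' inequality, both criteria are optimized exactly when the model output matches the target distribution, which is what the assertion confirms.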
It is interesting to examine the traditional learning
paradigms under the optimization criterion shown in (5).
For single-label learning (see Fig. 2a), $d_{x_i}^{y_j} = \mathrm{Kr}(y_j, y(x_i))$,
GENG ET AL.: FACIAL AGE ESTIMATION BY LEARNING FROM LABEL DISTRIBUTIONS 2403
Fig. 3. Typical label distributions for the image at the chronological
age $\alpha$: (a) Gaussian distribution and (b) triangle distribution.