Heatmap Regression via Randomized Rounding
Abstract—Heatmap regression has become the mainstream methodology for deep learning-based semantic landmark localization, including facial landmark localization and human pose estimation. Though heatmap regression is robust to large variations in pose, illumination, and occlusion in unconstrained settings, it usually suffers from a sub-pixel localization problem. Specifically, since the activation point indices in heatmaps are always integers, a quantization error appears when heatmaps are used as the representation of numerical coordinates. Previous methods to overcome the sub-pixel localization problem usually rely on high-resolution heatmaps. As
a result, there is always a trade-off between achieving localization accuracy and computational cost, where the computational complexity
of heatmap regression depends on the heatmap resolution in a quadratic manner. In this paper, we formally analyze the quantization
error of vanilla heatmap regression and propose a simple yet effective quantization system to address the sub-pixel localization problem.
The proposed quantization system induced by the randomized rounding operation 1) encodes the fractional part of numerical coordinates
into the ground truth heatmap using a probabilistic approach during training; and 2) decodes the predicted numerical coordinates
from a set of activation points during testing. We prove that the proposed quantization system for heatmap regression is unbiased
and lossless. Experimental results on popular facial landmark localization datasets (WFLW, 300W, COFW, and AFLW) and human
pose estimation datasets (MPII and COCO) demonstrate the effectiveness of the proposed method for efficient and accurate semantic
landmark localization. Code is available at http://github.com/baoshengyu/H3R.
Index Terms—Semantic landmark localization, heatmap regression, quantization error, randomized rounding.
1 INTRODUCTION

Semantic landmark localization, which aims to predict the numerical coordinates of a set of pre-defined semantic landmarks in a given image or video, has a variety of applications in computer and robot vision, including facial landmark detection [1], [29], [2], hand landmark detection [3], [4], human pose estimation [5], [6], [22], and household object pose estimation [7], [10].

Fig. 1. An intuitive example of the heatmap representation for the numerical coordinate. Given the numerical coordinate x_i = (x_i, y_i), we then have the corresponding heatmap h_i with the maximum activation point located at the position (x_i, y_i).

In vanilla heatmap regression, 1) during training, the ground truth numerical coordinates are first quantized to generate the ground truth heatmap; and 2) during testing, the predicted numerical coordinates are decoded from the maximum activation point in the predicted heatmap. However, typical quantization operations such as floor, round, and ceil discard the fractional part of the ground truth numerical coordinates, making it difficult to reconstruct the fractional part even from the optimal predicted heatmap. The error induced by the transformation between numerical coordinates and the heatmap is known as the quantization error. To address this problem, we introduce a new quantization system that forms a lossless transformation between the numerical coordinates and the heatmap. In our approach, during training, the proposed quantization system uses a set of activation points, and the fractional part of the numerical coordinate is encoded as the activation probabilities of different activation points. During testing, the fractional part can then be reconstructed from the activation probabilities of the top k maximum activation points in the predicted heatmap. To achieve this, we introduce a new quantization operation called randomized rounding, or random-round, which is widely used in combinatorial optimization to convert fractional solutions into integer solutions with provable approximation guarantees [27], [28]. Furthermore, the proposed method can be implemented in a few lines of source code, making it a plug-and-play replacement for the quantization system of existing heatmap regression methods.

In this paper, we address the problem of quantization error in heatmap regression. The remainder of the paper is structured as follows. In the preliminaries, we briefly review two typical semantic landmark localization methods, coordinate regression and heatmap regression. In the methods, we first formally introduce the problem of quantization error by decomposing the prediction error into the heatmap error and the quantization error. We then discuss the quantization bias in vanilla heatmap regression and prove a tight upper bound on its quantization error. To address the quantization error, we devise a new quantization system and theoretically prove that it is unbiased and lossless. We also discuss uncertainty in heatmap prediction as well as unbiased annotation when forming a robust semantic landmark localization system. In the experimental section, we demonstrate the effectiveness of the proposed method on popular facial landmark localization and human pose estimation datasets.

2 RELATED WORK

In this section, we briefly review recent work on coordinate regression and heatmap regression for semantic landmark localization, especially in facial landmark localization applications.

2.1 Coordinate Regression

Coordinate regression has been widely and successfully used in semantic landmark localization under constrained settings, where it usually relies on simple yet effective features [30], [31], [32], [33]. To improve the performance of coordinate regression for semantic landmark localization in the wild, several methods have been proposed by using cascade refinement [1], [34], [35], [36], [37], [38], parametric/non-parametric shape models [39], [29], [36], [40], multi-task learning [41], [42], [43], and novel loss functions [44], [45].

2.2 Heatmap Regression

The success of deep learning has prompted the use of heatmap regression for semantic landmark localization, especially for robust and accurate facial landmark localization [2], [46], [47], [23] and human pose estimation [19], [48], [6], [49], [50], [51], [52]. Existing heatmap regression methods either rely on large input images or empirical compensations during inference to mitigate the problem of quantization error [6], [53], [54], [22]. For example, a simple yet effective compensation method known as "shift a quarter to the second maximum activation point" has been widely used in many state-of-the-art heatmap regression methods [6], [55], [22].

Several methods have been developed to address the problem of quantization error in three aspects: 1) jointly predicting the heatmap and the offset in a multi-task manner [56]; 2) encoding and decoding the fractional part of numerical coordinates via a modulated 2D Gaussian distribution [24], [26]; and 3) exploring differentiable transformations between the heatmap and the numerical coordinates [57], [20], [58], [59]. Specifically, [24] generates a fractional-part-sensitive ground truth heatmap for video-based face alignment, which is known as fractional heatmap regression. Under the assumption that the predicted heatmap follows a 2D Gaussian distribution, [26] decodes the fractional part of numerical coordinates from the modulated predicted heatmap. The soft-argmax operation is differentiable [60], [61], [62], [59], and has been intensively explored in human pose estimation [57], [20].
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 3
each landmark, since different numerical coordinates usually have different fractional parts. As a result, it significantly increases the training load of the heatmap regression model. For example, given 98 landmarks per face image, a kernel size of 11 × 11, and a mini-batch of 16 training samples, we have to run (3) 98 × 16 × 11 × 11 = 189,728 times in each training iteration. To address this issue, existing heatmap regression methods usually first quantize numerical coordinates into integers, so that a standard kernel matrix can be shared for efficient ground truth heatmap generation [6], [2], [55], [23]. However, the above-mentioned heatmap regression methods suffer from the inherent drawback of failing to encode the fractional part of numerical coordinates. Therefore, how to efficiently encode the fractional information in numerical coordinates remains challenging. Furthermore, during the inference stage, the predicted numerical coordinates x̂_i obtained by the decode operation in (2) are also integers. As a result, typical heatmap regression methods fail to efficiently address the fractional part of the numerical coordinate during both training and inference, resulting in localization error.

To analyze the localization error caused by the quantization system in heatmap regression, we decompose the localization error into the heatmap error and the quantization error as follows:

  E_loc = ‖x̂_i − x^g_i‖₂ = ‖x̂_i − x^opt_i + x^opt_i − x^g_i‖₂
        ≤ ‖x̂_i − x^opt_i‖₂ + ‖x^opt_i − x^g_i‖₂,   (8)

where the first term is the heatmap error, the second term is the quantization error, and x^opt_i indicates the numerical coordinate decoded from the optimal predicted heatmap. Generally, the heatmap error corresponds to the error in heatmap prediction, i.e., ‖ĥ_i − h^g_i‖₂, and the quantization error indicates the error caused by both the encode and decode operations. If there is no heatmap error, the localization error originates entirely from the quantization system, i.e.,

  E_loc = ‖x̂_i − x^g_i‖₂ = ‖x^opt_i − x^g_i‖₂.   (9)

The generalizability of deep neural networks for heatmap prediction, i.e., the heatmap error, is beyond the scope of this paper, and we do not consider it in the analysis of the quantization error.

To obtain integer coordinates for the generation of the ground truth heatmap, typical integer quantization operations such as floor, round, and ceil have been widely used in previous heatmap regression methods. To unify the quantization error induced by different integer operations, we first introduce a unified integer quantization operation. Given a downsampling stride s > 1 and a threshold t ∈ [0, 1], the coordinate x ∈ ℕ can be quantized according to its fractional part ε = x/s − ⌊x/s⌋, i.e.,

  q(x, s, t) = ⌊x/s⌋       if ε < t,
               ⌊x/s⌋ + 1   otherwise.   (10)

That is, for the integer quantization operations floor, round, and ceil, we have t = 1.0, t = 0.5, and t = 0, respectively. Furthermore, when the downsampling stride s > 1, the decode operation in (2) becomes

  x̂_i ∈ s · arg max_x {ĥ_i(x)}.   (11)

A vanilla quantization system for heatmap regression can then be formed by the encode operation in (10) and the decode operation in (11). When applied to a vector or a matrix, the integer quantization operation defined in (10) is applied element-wise.

4.2 Quantization Error

In this subsection, we first correct the bias in the vanilla quantization system to form an unbiased vanilla quantization system. With the unbiased quantization system, we then provide a tight upper bound on the quantization error of vanilla heatmap regression.

Let ε_x denote the fractional part of x^g_i/s and ε_y denote the fractional part of y^g_i/s. Given the downsampling stride of the heatmap s > 1, we have

  ε_x = x^g_i/s − ⌊x^g_i/s⌋,
  ε_y = y^g_i/s − ⌊y^g_i/s⌋.   (12)

Under the assumption of a "perfect" heatmap prediction model, i.e., no heatmap error, ĥ_i(x) = h^g_i(x), the predicted numerical coordinates are

  x̂_i/s = ⌊x^g_i/s⌋       if ε_x < t,
          ⌊x^g_i/s⌋ + 1   otherwise,
Fig. 3. An intuitive example of the encode operation via randomized rounding. When the downsampling stride s > 1, the ground truth activation point (x^g_i/s, y^g_i/s) usually does not correspond to a single pixel in the heatmap. Therefore, we introduce the randomized rounding operation to assign the ground truth activation point to a set of alternative activation points, where the activation probability depends on the fractional part of the ground truth numerical coordinates.
  ŷ_i/s = ⌊y^g_i/s⌋       if ε_y < t,
          ⌊y^g_i/s⌋ + 1   otherwise.

If the data samples satisfy the i.i.d. assumption and the fractional parts ε_x, ε_y ~ U(0, 1), the bias of x̂_i as an estimator of x^g_i can be evaluated as

  E(x̂_i/s − x^g_i/s) = E(1{ε_x < t}(−ε_x) + 1{ε_x ≥ t}(1 − ε_x)) = 0.5 − t.

Considering that x^g_i and y^g_i are independent variables, the quantization bias of the vanilla quantization system is

  E(x̂_i/s − x^g_i/s) = (E(x̂_i/s − x^g_i/s), E(ŷ_i/s − y^g_i/s)) = (0.5 − t, 0.5 − t).

Therefore, only the encode operation in (10) with t = 0.5, i.e., the round operation, is unbiased. Furthermore, for any t ∈ [0, 1] in the encode operation (10), we can correct the bias of the encode operation with a shift in the decode operation, i.e.,

  x̂_i ∈ s · (arg max_x {ĥ_i(x)} + t − 0.5).   (13)

For simplicity, we use the round operation, i.e., t = 0.5, to form an unbiased quantization system as our baseline. Though the vanilla quantization system defined by (10) and (13) is unbiased, it causes non-invertible localization error. An intuitive explanation is that the encode operation in (10) directly discards the fractional part of the ground truth numerical coordinates, making it impossible for the decode operation to accurately reconstruct the numerical coordinates.

Theorem 1. Given an unbiased quantization system defined by the encode operation in (10) and the decode operation in (13), the quantization error is tightly upper bounded, i.e.,

  ‖x̂_i − x^g_i‖₂ ≤ √2 · s/2,

where s > 1 indicates the downsampling stride of the heatmap.

Proof. In Appendix.

From Theorem 1, we know that the vanilla quantization system defined by (10) and (13) causes non-invertible quantization error and that the upper bound of the quantization error depends linearly on the downsampling stride of the heatmap. As a result, given a heatmap regression model, it will cause extremely large localization error for large faces in the original input image, making it a significant problem in many important face-related applications such as face makeup, face swapping, and face reenactment.

4.3 Randomized Rounding

In vanilla heatmap regression, each numerical coordinate corresponds to a single activation point in the heatmap, while the indices of the activation point are all integers. As a result, the fractional part of the numerical coordinate is usually ignored during the encode process, which is an inherent drawback of heatmap regression for sub-pixel localization. To retain the fractional information when using heatmap representations, we utilize multiple activation points around the ground truth activation point. Inspired by the randomized rounding method [27], we address the quantization error in vanilla heatmap regression using a probabilistic approach. Specifically, we encode the fractional part of the numerical coordinate into different activation points with different activation probabilities. An intuitive example is shown in Fig. 3.

We describe the proposed quantization system as follows. Given the ground truth numerical coordinate x^g_i = (x^g_i, y^g_i) and a downsampling stride of the heatmap s > 1, the ground truth activation point in the heatmap is (x^g_i/s, y^g_i/s), whose entries are usually floating-point numbers, so we are unable to find the corresponding pixel in the heatmap. If we ignore the fractional part (ε_x, ε_y) using a typical integer quantization operation, e.g., round, the ground truth activation point will be approximated by one of the activation points around it, i.e., (⌊x^g_i/s⌋, ⌊y^g_i/s⌋), (⌊x^g_i/s⌋ + 1, ⌊y^g_i/s⌋), (⌊x^g_i/s⌋, ⌊y^g_i/s⌋ + 1), or (⌊x^g_i/s⌋ + 1, ⌊y^g_i/s⌋ + 1). However, this process is not invertible. To address this, we randomly assign the ground truth activation point to one of the alternative activation points around it, with the activation probability determined by the fractional part of the ground truth activation point:
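The bias formula and the bound in Theorem 1 are easy to check numerically. The following sketch (our own, with an arbitrarily chosen stride) verifies that the empirical bias of the encode/decode round trip matches 0.5 − t, and that with the unbiased round operation the per-axis error never exceeds s/2, which gives the √2·s/2 bound for a 2D coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)
s = 4.0
x = rng.uniform(0.0, 1000.0, size=100_000)  # fractional parts ~ U(0, 1)
base = np.floor(x / s)
frac = x / s - base

for t in (0.0, 0.5, 1.0):  # ceil-, round-, and floor-style thresholds
    x_hat = s * np.where(frac < t, base, base + 1)
    bias = np.mean(x_hat - x) / s
    # empirical bias matches the derived value 0.5 - t
    assert abs(bias - (0.5 - t)) < 0.01

# for the unbiased round operation (t = 0.5), |x_hat - x| <= s/2 per axis,
# hence ||x_hat - x||_2 <= sqrt(2) * s / 2 for a 2-D coordinate (Theorem 1)
x_hat = s * np.where(frac < 0.5, base, base + 1)
assert np.max(np.abs(x_hat - x)) <= s / 2 + 1e-9
```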
  P{h^g_i(⌊x^g_i/s⌋, ⌊y^g_i/s⌋) = 1} = (1 − ε_x)(1 − ε_y),
  P{h^g_i(⌊x^g_i/s⌋ + 1, ⌊y^g_i/s⌋) = 1} = ε_x(1 − ε_y),
  P{h^g_i(⌊x^g_i/s⌋, ⌊y^g_i/s⌋ + 1) = 1} = (1 − ε_x)ε_y,
  P{h^g_i(⌊x^g_i/s⌋ + 1, ⌊y^g_i/s⌋ + 1) = 1} = ε_x ε_y.   (14)

To implement the encode scheme in (14) in conjunction with current mini-batch stochastic gradient descent training algorithms for deep learning models, we introduce a new integer quantization operation via randomized rounding, i.e., random-round:

  q(x, s) = ⌊x/s⌋       if ε < t, t ~ U(0, 1),
            ⌊x/s⌋ + 1   otherwise.   (15)

Given the encode operation in (15), if we do not consider the heatmap error, we have the activation probability at x:

  ĥ_i(x) = P{h^g_i(x) = 1}.   (16)

As a result, the fractional part of the ground truth numerical coordinate (ε_x, ε_y) can be reconstructed from the predicted heatmap via the activation probabilities of all activation points, i.e.,

  x̂_i = s · Σ_{x ∈ X^g_i} ĥ_i(x) · x,   (17)

where X^g_i indicates the set of activation points around the ground truth activation point, i.e.,

  X^g_i = {(⌊x^g_i/s⌋, ⌊y^g_i/s⌋), (⌊x^g_i/s⌋ + 1, ⌊y^g_i/s⌋),
           (⌊x^g_i/s⌋, ⌊y^g_i/s⌋ + 1), (⌊x^g_i/s⌋ + 1, ⌊y^g_i/s⌋ + 1)}.   (18)

Theorem 2. Given the encode operation in (15) and the decode operation in (17), we have that 1) the encode operation is unbiased; and 2) the quantization system is lossless, i.e., there is no quantization error.

Proof. In Appendix.

From Theorem 2, we know that the quantization system defined by the encode operation in (15) and the decode operation in (17) is unbiased and lossless.

The fractional information of the numerical coordinate (ε_x, ε_y) is well captured by the randomized rounding operation, allowing us to reconstruct the ground truth numerical coordinate x^g_i without quantization error. However, during the inference phase, the ground truth numerical coordinate x^g_i is unavailable and heatmap error always exists in practice, making it difficult to identify the proper set of ground truth activation points X^g_i. In this section, we describe how to form a set of alternative activation points in practice.

We introduce two activation point selection methods as follows. The first solution is to estimate all activation points via the points around the maximum activation point. As shown in Fig. 4, given the maximum activation point, we have four different sets of alternative activation points, X^g_{i1}, X^g_{i2}, X^g_{i3}, and X^g_{i4}. Therefore, given the predicted heatmap in practice, we take the risk of choosing an incorrect set of alternative activation points. To find a robust set of alternative activation points, we may use all nine activation points around the maximum activation point, i.e.,

  X^g_i = X^g_{i1} ∪ X^g_{i2} ∪ X^g_{i3} ∪ X^g_{i4}.   (19)

Another solution is to generalize the argmax operation to the argtopk operation, i.e., we decode the predicted heatmap ĥ_i to obtain the numerical coordinate x̂_i according to the top k largest activation points,

  X^g_i = arg topk_x (ĥ_i(x)).   (20)

If there is no heatmap error, the two sets of alternative activation points presented above, i.e., those in (18) and (19), give the same result under the decode operation in (17). Specifically, we find that the activation points in (19) achieve comparable performance to the activation points in (20) when k = 9. For simplicity, unless otherwise mentioned, we use the set of alternative activation points defined by (20) in this paper. Furthermore, when we take the heatmap error into consideration, the value of k forms a trade-off in the selection of activation points, i.e., a larger k is more robust to activation point selection whilst also increasing the risk of noise from the heatmap error. See more discussion in Section 4.5 and the experimental results in Section 5.5.
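The encode/decode pair above can be sketched as follows. This is our own illustrative NumPy sketch (function names and the renormalization of the top-k activations for imperfect heatmaps are our assumptions); it also demonstrates the losslessness of Theorem 2 on the ideal heatmap ĥ_i(x) = P{h^g_i(x) = 1}.

```python
import numpy as np

def encode_random_round(xg, yg, s, shape, rng):
    """Random-round encode, Eqs. (14)-(15): on each axis, activate floor+1
    with probability equal to that axis' fractional part, floor otherwise."""
    ex, ey = xg / s - np.floor(xg / s), yg / s - np.floor(yg / s)
    ix = int(xg // s) + int(rng.random() < ex)
    iy = int(yg // s) + int(rng.random() < ey)
    h = np.zeros(shape)
    h[iy, ix] = 1.0  # binary ground truth heatmap, indexed as (row=y, col=x)
    return h

def decode_topk(hp, s, k):
    """Decode via Eqs. (17) and (20): expected coordinate over the top-k
    activation points, with activations renormalized to sum to one (exact
    probabilities already do; renormalizing hedges against heatmap error)."""
    flat = np.argsort(hp, axis=None)[-k:]
    ys, xs = np.unravel_index(flat, hp.shape)
    w = hp.reshape(-1)[flat]
    w = w / w.sum()
    return s * float(np.sum(w * xs)), s * float(np.sum(w * ys))

# losslessness (Theorem 2): decoding the ideal heatmap recovers the
# ground truth coordinate exactly
xg, yg, s = 13.7, 22.1, 4.0
ex, ey = xg / s - np.floor(xg / s), yg / s - np.floor(yg / s)
x0, y0 = int(xg // s), int(yg // s)
h = np.zeros((16, 16))
h[y0, x0], h[y0, x0 + 1] = (1 - ex) * (1 - ey), ex * (1 - ey)
h[y0 + 1, x0], h[y0 + 1, x0 + 1] = (1 - ex) * ey, ex * ey
print(decode_topk(h, s, k=4))  # -> approximately (13.7, 22.1)
```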
TABLE 2
Comparison with state-of-the-art methods on the WFLW dataset.

TABLE 3
Comparison with state-of-the-art methods on the 300W dataset.
All face images are selected from the WIDER Face dataset [74], which contains face images with large variations in scale, expression, pose, and occlusion.
• 300W [75]. 300W contains 3,148 training images, including 337 images from AFW [30], 2,000 images from the training set of HELEN [76], and 811 images from the training set of LFPW [39]. For testing, there are four different settings: 1) common: 554 images, including 330 and 224 images from the test sets of HELEN and LFPW, respectively; 2) challenge: 135 images from IBUG; 3) full: 689 images as the combination of common and challenge; and 4) private: 600 indoor/outdoor images. All images are manually annotated with 68 facial landmarks.
• COFW [34]. COFW contains 1,852 images, including 1,345 training and 507 testing images. All images are manually annotated with 29 facial landmarks.
• AFLW [77]. AFLW contains 24,386 face images, including 20,000 images for training and 4,386 images for testing. For testing, there are two settings: 1) full: all 4,386 testing images; and 2) front: 1,314 frontal images selected from the full set. All images are manually annotated with 21 facial landmarks. For fair comparison, we use 19 facial landmarks, i.e., the landmarks on the two ears are ignored.

5.2 Evaluation Metrics

We use the normalized mean error (NME) as the evaluation metric in this paper, i.e.,

  NME = E(‖x̂_i − x^g_i‖₂ / d),   (23)

where d indicates the normalization distance. For fair comparison, we report the performance on WFLW, 300W, and COFW using two normalization methods: the inter-pupil distance (the distance between the eye centers) and the inter-ocular distance (the distance between the outer eye corners). We report the performance on AFLW using the size of the face bounding box as the normalization distance, i.e., d = √(w · h), where w and h indicate the width and height of the face bounding box, respectively.

5.3 Implementation Details

We implement the proposed heatmap regression method for facial landmark detection using PyTorch [78]. Following the practice in [23], we use HRNet [22] as our backbone network, which is an efficient counterpart of ResNet [79], U-Net [80], and Hourglass [6] for semantic landmark localization. Unless otherwise mentioned, we use HRNet-W18 as the backbone network in our experiments. All face
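The NME in (23) is straightforward to compute per face. A small sketch with our own helper name and made-up coordinates:

```python
import numpy as np

def nme(pred, gt, d):
    """Normalized mean error, Eq. (23): mean Euclidean distance between
    predicted and ground truth landmarks, divided by the normalization
    distance d (inter-pupil/inter-ocular distance, or sqrt(w * h) on AFLW)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)) / d)

pred = np.array([[10.0, 10.0], [20.0, 24.0]])  # two predicted landmarks
gt = np.array([[10.0, 13.0], [16.0, 24.0]])    # ground truth landmarks
print(nme(pred, gt, d=35.0))  # -> 0.1, i.e. 10% NME
```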
images are cropped and resized to 256 × 256 pixels, and the downsampling stride of the feature map is 4 pixels. For training, we perform widely used data augmentation for facial landmark detection as follows. We horizontally flip all training images with probability 0.5 and randomly change the brightness (±0.125), contrast (±0.5), and saturation (±0.5) of each image. We then randomly rotate the image (±30°), rescale the image (±0.25), and translate the image (±16 pixels). We also randomly erase a rectangular region in the training image [81]. All our models are initialized from weights pretrained on ImageNet [82]. We use the Adam optimizer [83] with batch size 16. The learning rate starts from 0.001 and is divided by 10 every 60 epochs, with 150 training epochs in total. During the testing phase, we horizontally flip the testing images as data augmentation and average the predictions.

The results in Table 4 show that the proposed method outperforms recent state-of-the-art methods by a clear margin on the COFW dataset. AFLW captures a wide range of face poses, including both frontal and non-frontal faces. As shown in Table 5, the proposed method achieves consistent improvements for both frontal and non-frontal faces, suggesting robustness across different face poses.

5.5 Ablation Studies

To better understand the proposed quantization system in different settings, we perform ablation studies on the most challenging dataset, WFLW [64].
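The geometric augmentations described in Section 5.3 must transform the landmarks together with the image. A minimal sketch of the horizontal flip (our own helper; a real facial-landmark pipeline must additionally swap symmetric landmark indices, e.g. left/right eye, which is omitted here):

```python
import numpy as np

def random_hflip(img, pts, p=0.5, rng=None):
    """Flip an HxWxC image left-right with probability p and mirror the
    landmark x-coordinates accordingly (pts is an Nx2 array of (x, y))."""
    rng = rng or np.random.default_rng()
    if rng.random() < p:
        img = img[:, ::-1].copy()
        pts = pts.copy()
        pts[:, 0] = img.shape[1] - 1 - pts[:, 0]
    return img, pts
```

For example, with p = 1.0 a landmark at x = 2 in an 8-pixel-wide image moves to x = 5.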
TABLE 4
Comparison with state-of-the-art methods on the COFW dataset.

Method       NME (%) Inter-ocular   NME (%) Inter-pupil
SHN [71]     -                      5.60
LAB [64]     -                      5.58
DFCE [72]    -                      5.27
3DDE [65]    -                      5.11
Wing [45]    -                      5.07
HRNet [23]   3.45                   -
AWing [21]   -                      4.94
H3R (ours)   3.15                   4.55
TABLE 5
Comparison with state-of-the-art methods on the AFLW dataset.

Fig. 6. The influence of different input resolutions.
TABLE 6
The influence of different numbers of alternative activation points when using the binary heatmap (NME, %).

Resolution  k=1    k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10   k=11   k=12   best
512 × 512   3.980  3.946  3.930  3.923  3.916  3.912  3.909  3.906  3.903  3.901  3.898  3.897  3.890 (k=25)
384 × 384   3.932  3.875  3.855  3.850  3.842  3.837  3.833  3.830  3.828  3.827  3.826  3.826  3.825 (k=14)
256 × 256   4.005  3.881  3.836  3.832  3.819  3.815  3.810  3.808  3.807  3.807  3.807  3.808  3.807 (k=9)
128 × 128   4.637  4.164  4.029  4.023  3.997  3.991  3.988  3.989  3.991  3.994  3.998  4.002  3.988 (k=7)
TABLE 7
The influence of different numbers of alternative activation points when using the Gaussian heatmap (NME, %).

σ      k=1    k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10   k=11   k=12   best
0.0    4.235  4.102  4.069  4.065  4.090  4.202  4.427  4.775  5.219  5.719  6.256  6.802  4.065 (k=4)
0.5    4.162  4.045  4.010  4.008  3.991  3.990  3.988  3.992  3.998  4.014  4.047  4.108  3.988 (k=7)
1.0    4.037  3.928  3.898  3.908  3.871  3.877  3.876  3.876  3.871  3.861  3.855  3.857  3.855 (k=11)
1.5    4.032  3.927  3.901  3.918  3.873  3.879  3.885  3.882  3.878  3.864  3.857  3.860  3.855 (k=18)
2.0    4.086  3.983  3.953  3.969  3.923  3.931  3.937  3.931  3.925  3.909  3.902  3.907  3.894 (k=25)
TABLE 8
The influence of different face "bounding box" annotation policies.

The influence of different numbers of alternative activation points. In the proposed quantization system, the activation probability indicates the distance between an activation point and the ground truth activation point (x^g_i/s, y^g_i/s). If there is no heatmap error, the alternative activation points in (20) give the same result as those in (18). If the heatmap error cannot be ignored, there is a trade-off on the number of alternative activation points: 1) a small k increases the risk of missing the ground truth alternative activation points; 2) a large k introduces noise from irrelevant activation points, especially under large heatmap error. We demonstrate the performance of the proposed method with different numbers of alternative activation points in Table 6. Specifically, we see that 1) when using a high input resolution, the best performance is achieved with a relatively large k; and 2) the performance is smooth near the optimal number of alternative activation points, making it easy to find a proper k on validation data. As introduced in Section 3, binary heatmaps can be seen as a special case of Gaussian heatmaps with standard deviation σ = 0. Considering that Gaussian heatmaps have been widely used in semantic landmark localization applications, we generalize the proposed quantization system to Gaussian heatmaps and demonstrate the influence of different numbers of alternative activation points in Table 7. Specifically, we see that 1) when applying the proposed quantization system to the model using the Gaussian heatmap, it achieves comparable performance to the model using the binary heatmap; and 2) the optimal number of alternative activation points increases with the standard deviation σ.

The influence of different "bounding box" annotation policies. For facial landmark detection, a reference bounding box is required to indicate the position of the facial area. However, there is a performance gap when using different reference bounding boxes [21]. A comparison between two widely used "bounding box" annotation policies is shown in Fig. 7, and we introduce the two policies as follows:

• P1: This annotation policy is usually used in semantic landmark localization tasks, especially in facial landmark localization. Specifically, the rectangular area of the bounding box tightly encloses a set of pre-defined facial landmarks.
• P2: This annotation policy has been widely used in face detection datasets [74]. The bounding box contains the areas of the forehead, chin, and cheek. For occluded faces, the bounding box is estimated by the human annotator based on the scale of occlusion.

Fig. 7. A comparison between different "bounding box" annotation policies. P1: the yellow bounding boxes. P2: the blue bounding boxes.
Fig. 8. Qualitative results from the test split of the WFLW dataset (best viewed in color). Blue: the ground truth facial landmarks. Yellow: the predicted facial landmarks. The first row shows some "good" cases and the second row shows some "bad" cases.
TABLE 9
Results on the MPII Human Pose dataset. In each block, the first row with "-" indicates the baseline method, i.e., k = 1; the second row with "*" indicates the compensation method, i.e., "shift a quarter to the second maximum activation point".

Backbone    Input      H3R  Head  Shoulder  Elbow  Wrist  Hip   Knee  Ankle  Mean  [email protected]
ResNet-50   256 × 256  -    96.3  95.2      89.0   82.9   88.2  83.7  79.4   88.4  29.8
ResNet-50   256 × 256  *    96.4  95.3      89.0   83.2   88.4  84.0  79.6   88.5  34.0
ResNet-50   256 × 256  ✓    96.3  95.2      88.8   83.4   88.5  84.3  79.8   88.6  34.9
ResNet-101  256 × 256  -    96.7  95.8      89.3   84.2   87.9  84.2  80.7   88.9  30.0
ResNet-101  256 × 256  *    96.9  95.9      89.5   84.4   88.4  84.5  80.7   89.1  34.0
ResNet-101  256 × 256  ✓    96.7  96.0      89.3   84.4   88.5  84.3  80.6   89.1  35.0
ResNet-152  256 × 256  -    97.0  95.8      89.9   84.7   89.1  85.4  81.3   89.5  31.0
ResNet-152  256 × 256  *    97.0  95.9      90.0   85.0   89.2  85.3  81.3   89.6  35.0
ResNet-152  256 × 256  ✓    96.8  95.9      90.1   84.9   89.3  85.3  81.3   89.6  36.2
HRNet-W32   256 × 256  -    97.0  95.7      90.1   86.3   88.4  86.8  82.9   90.1  32.8
HRNet-W32   256 × 256  *    97.1  95.9      90.3   86.4   89.1  87.1  83.3   90.3  37.7
HRNet-W32   256 × 256  ✓    97.1  96.1      90.8   86.1   89.2  86.4  82.6   90.3  39.3
TABLE 10
Results on the COCO validation set. In each block, the first row with "-" indicates the baseline, i.e., k = 1; the second row with "*" indicates the compensation method, i.e., "shift a quarter to the second maximum activation point", which can be seen as a special case of H3R with k = 2.
7 CONCLUSION

In this paper, we address the problem of sub-pixel localization for heatmap-based semantic landmark localization. We formally analyze quantization error in vanilla heatmap regression and propose a new quantization system via a randomized rounding operation, which we prove is unbiased and lossless. Experiments on facial landmark localization and human pose estimation datasets demonstrate the effectiveness of the proposed quantization system for efficient and accurate sub-pixel localization.

ACKNOWLEDGEMENT

Dr. Baosheng Yu is supported by ARC project FL-170100117.

REFERENCES

[1] Y. Sun, X. Wang, and X. Tang, "Deep convolutional network cascade for facial point detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3476–3483.
[2] A. Bulat and G. Tzimiropoulos, "How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks)," in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1021–1030.
[3] A. Sinha, C. Choi, and K. Ramani, "Deephand: Robust hand pose estimation by completing a matrix imputed with deep features," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[4] U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz, "Hand pose estimation via latent 2.5d heatmap regression," in European Conference on Computer Vision (ECCV), 2018, pp. 118–134.
[5] A. Toshev and C. Szegedy, "Deeppose: Human pose estimation via deep neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1653–1660.
[6] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in European Conference on Computer Vision (ECCV), 2016, pp. 483–499.
[7] A. Saxena, J. Driemeyer, and A. Y. Ng, "Robotic grasping of novel objects using vision," International Journal of Robotics Research (IJRR), vol. 27, no. 2, pp. 157–173, 2008. [Online]. Available: https://doi.org/10.1177/0278364907087172
[8] Y. Sun, Y. Chen, X. Wang, and X. Tang, "Deep learning face representation by joint identification-verification," in Neural Information Processing Systems (NIPS), 2014, pp. 1988–1996.
[9] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner, "Face2face: Real-time face capture and reenactment of rgb videos," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2387–2395.
[10] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese, "Densefusion: 6d object pose estimation by iterative dense fusion," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[11] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "Arcface: Additive angular margin loss for deep face recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[12] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang, "Spindle net: Person re-identification with human body region guided feature decomposition and fusion," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1077–1085.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 13
[13] L. Zheng, Y. Huang, H. Lu, and Y. Yang, "Pose-invariant embedding for deep person re-identification," IEEE Transactions on Image Processing (TIP), vol. 28, no. 9, pp. 4500–4509, 2019.
[14] Y.-C. Chen, X. Shen, and J. Jia, "Makeup-go: Blind reversion of portrait edit," in IEEE International Conference on Computer Vision (ICCV), 2017.
[15] H. Chang, J. Lu, F. Yu, and A. Finkelstein, "Pairedcyclegan: Asymmetric style transfer for applying and removing makeup," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 40–48.
[16] C. Cao, Y. Weng, S. Lin, and K. Zhou, "3d shape regression for real-time facial animation," ACM Transactions on Graphics (TOG), vol. 32, no. 4, pp. 1–10, 2013.
[17] C. Cao, Q. Hou, and K. Zhou, "Displaced dynamic expression regression for real-time facial tracking and animation," ACM Transactions on Graphics (TOG), vol. 33, no. 4, pp. 1–10, 2014.
[18] J. Thies, M. Zollhöfer, M. Nießner, L. Valgaerts, M. Stamminger, and C. Theobalt, "Real-time expression transfer for facial reenactment," ACM Transactions on Graphics (TOG), vol. 34, no. 6, 2015.
[19] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, "Joint training of a convolutional network and a graphical model for human pose estimation," in Neural Information Processing Systems (NIPS), 2014, pp. 1799–1807.
[20] A. Nibali, Z. He, S. Morgan, and L. Prendergast, "Numerical coordinate regression with convolutional neural networks," arXiv preprint arXiv:1801.07372, 2018.
[21] X. Wang, L. Bo, and L. Fuxin, "Adaptive wing loss for robust face alignment via heatmap regression," in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 6971–6981.
[22] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5693–5703.
[23] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao, "Deep high-resolution representation learning for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.
[24] Y. Tai, Y. Liang, X. Liu, L. Duan, J. Li, C. Wang, F. Huang, and Y. Chen, "Towards highly accurate and stable face alignment for high-resolution videos," in AAAI Conference on Artificial Intelligence (AAAI), vol. 33, 2019, pp. 8893–8900.
[25] W. Li, Z. Wang, B. Yin, Q. Peng, Y. Du, T. Xiao, G. Yu, H. Lu, Y. Wei, and J. Sun, "Rethinking on multi-stage networks for human pose estimation," arXiv preprint arXiv:1901.00148, 2019.
[26] F. Zhang, X. Zhu, H. Dai, M. Ye, and C. Zhu, "Distribution-aware coordinate representation for human pose estimation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 7093–7102.
[27] P. Raghavan and C. D. Thompson, "Randomized rounding: a technique for provably good algorithms and algorithmic proofs," Combinatorica, vol. 7, no. 4, pp. 365–374, 1987.
[28] B. Korte and J. Vygen, Combinatorial Optimization. Springer, 2012, vol. 2.
[29] X. Cao, Y. Wei, F. Wen, and J. Sun, "Face alignment by explicit shape regression," International Journal of Computer Vision (IJCV), vol. 107, no. 2, pp. 177–190, 2014.
[30] X. Zhu and D. Ramanan, "Face detection, pose estimation, and landmark localization in the wild," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2879–2886.
[31] X. Xiong and F. De la Torre, "Supervised descent method and its applications to face alignment," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 532–539.
[32] S. Ren, X. Cao, Y. Wei, and J. Sun, "Face alignment at 3000 fps via regressing local binary features," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1685–1692.
[33] V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1867–1874.
[34] X. P. Burgos-Artizzu, P. Perona, and P. Dollár, "Robust face landmark estimation under occlusion," in IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1513–1520.
[35] J. Zhang, S. Shan, M. Kan, and X. Chen, "Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment," in European Conference on Computer Vision (ECCV), 2014, pp. 1–16.
[36] S. Zhu, C. Li, C. Change Loy, and X. Tang, "Face alignment by coarse-to-fine shape searching," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4998–5006.
[37] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik, "Human pose estimation with iterative error feedback," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4733–4742.
[38] V. Belagiannis and A. Zisserman, "Recurrent human pose estimation," in IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2017, pp. 468–475.
[39] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar, "Localizing parts of faces using a consensus of exemplars," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 35, no. 12, pp. 2930–2940, 2013.
[40] X. Miao, X. Zhen, X. Liu, C. Deng, V. Athitsos, and H. Huang, "Direct shape regression networks for end-to-end face alignment," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5040–5049.
[41] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, "Facial landmark detection by deep multi-task learning," in European Conference on Computer Vision (ECCV), 2014, pp. 94–108.
[42] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters (SPL), vol. 23, no. 10, pp. 1499–1503, 2016.
[43] R. Ranjan, V. M. Patel, and R. Chellappa, "Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 41, no. 1, pp. 121–135, 2017.
[44] R. Girshick, "Fast r-cnn," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
[45] Z.-H. Feng, J. Kittler, M. Awais, P. Huber, and X.-J. Wu, "Wing loss for robust facial landmark localisation with convolutional neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2235–2245.
[46] A. Bulat and G. Tzimiropoulos, "Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources," in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3706–3714.
[47] D. Merget, M. Rock, and G. Rigoll, "Robust facial landmark detection via a fully-convolutional local-global context network," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 781–790.
[48] T. Pfister, J. Charles, and A. Zisserman, "Flowing convnets for human pose estimation in videos," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1913–1921.
[49] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele, "Deepcut: Joint subset partition and labeling for multi person pose estimation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4929–4937.
[50] A. Newell, Z. Huang, and J. Deng, "Associative embedding: End-to-end learning for joint detection and grouping," in Neural Information Processing Systems (NIPS), 2017, pp. 2277–2287.
[51] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang, "Learning feature pyramids for human pose estimation," in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1281–1290.
[52] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy, "Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model," in European Conference on Computer Vision (ECCV), 2018, pp. 269–286.
[53] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4724–4732.
[54] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2d pose estimation using part affinity fields," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7291–7299.
[55] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, "Cascaded pyramid network for multi-person pose estimation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7103–7112.
[56] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy, "Towards accurate multi-person pose estimation in the wild," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4903–4911.
[57] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei, "Integral human pose regression," in European Conference on Computer Vision (ECCV), 2018, pp. 529–545.
[58] D. C. Luvizon, D. Picard, and H. Tabia, "2d/3d pose estimation and action recognition using multitask deep learning," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5137–5146.
[59] D. C. Luvizon, H. Tabia, and D. Picard, "Human pose regression by combining indirect part detection and contextual information," Computers & Graphics, vol. 85, pp. 15–22, 2019.
[60] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, "Lift: Learned invariant feature transform," in European Conference on Computer Vision (ECCV), 2016, pp. 467–483.
[61] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," Journal of Machine Learning Research (JMLR), vol. 17, no. 1, pp. 1334–1373, 2016.
[62] J. Thewlis, H. Bilen, and A. Vedaldi, "Unsupervised learning of object landmarks by factorized spatial embeddings," in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5916–5925.
[63] W. Wu and S. Yang, "Leveraging intra and inter-dataset variations for robust face alignment," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 150–159.
[64] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou, "Look at boundary: A boundary-aware face alignment algorithm," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2129–2138.
[65] R. Valle, J. M. Buenaposada, A. Valdés, and L. Baumela, "Face alignment using a 3d deeply-initialized ensemble of regression trees," Computer Vision and Image Understanding (CVIU), vol. 189, p. 102846, 2019.
[66] A. Dapogny, K. Bailly, and M. Cord, "Decafa: Deep convolutional cascade for face alignment in the wild," in IEEE International Conference on Computer Vision (ICCV), 2019.
[67] S. Qian, K. Sun, W. Wu, C. Qian, and J. Jia, "Aggregation via separation: Boosting facial landmark detector with semi-supervised style translation," in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 10153–10163.
[68] A. Kumar, T. K. Marks, W. Mou, Y. Wang, M. Jones, A. Cherian, T. Koike-Akino, X. Liu, and C. Feng, "Luvli face alignment: Estimating landmarks' location, uncertainty, and visibility likelihood," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8236–8246.
[69] X. Dong, Y. Yan, W. Ouyang, and Y. Yang, "Style aggregated network for facial landmark detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 379–388.
[70] M. Kowalski, J. Naruniec, and T. Trzcinski, "Deep alignment network: A convolutional neural network for robust face alignment," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 88–97.
[71] J. Yang, Q. Liu, and K. Zhang, "Stacked hourglass network for robust facial landmark localisation," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 79–87.
[72] R. Valle, J. M. Buenaposada, A. Valdés, and L. Baumela, "A deeply-initialized coarse-to-fine ensemble of regression trees for face alignment," in European Conference on Computer Vision (ECCV), 2018, pp. 585–601.
[73] X. Zou, S. Zhong, L. Yan, X. Zhao, J. Zhou, and Y. Wu, "Learning robust facial landmark detection via hierarchical structured ensemble," in IEEE International Conference on Computer Vision (ICCV), 2019.
[74] S. Yang, P. Luo, C.-C. Loy, and X. Tang, "Wider face: A face detection benchmark," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5525–5533.
[75] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, "300 faces in-the-wild challenge: The first facial landmark localization challenge," in IEEE International Conference on Computer Vision Workshops (ICCVW), 2013, pp. 397–403.
[76] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang, "Interactive facial feature localization," in European Conference on Computer Vision (ECCV), 2012, pp. 679–692.
[77] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof, "Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization," in IEEE International Conference on Computer Vision Workshops (ICCVW), 2011, pp. 2144–2151.
[78] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "Pytorch: An imperative style, high-performance deep learning library," in Neural Information Processing Systems (NeurIPS), 2019, pp. 8024–8035.
[79] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[80] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer, 2015, pp. 234–241.
[81] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, "Random erasing data augmentation," in AAAI Conference on Artificial Intelligence (AAAI), 2020.
[82] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[83] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[84] J. Naruniec, L. Helminger, C. Schroers, and R. Weber, "High-resolution neural face swapping for visual effects," Eurographics Symposium on Rendering, vol. 39, no. 4, 2020.
[85] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2d human pose estimation: New benchmark and state of the art analysis," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 3686–3693.
[86] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in European Conference on Computer Vision (ECCV), 2014, pp. 740–755.

Baosheng Yu received a B.E. from the University of Science and Technology of China in 2014, and a Ph.D. from The University of Sydney in 2019. He is currently a Research Fellow in the School of Computer Science and the Faculty of Engineering at The University of Sydney, NSW, Australia. His research interests include machine learning, computer vision, and deep learning.

Dacheng Tao (F'15) is the President of the JD Explore Academy and a Senior Vice President of JD.com. He is also an advisor and chief scientist of the digital science institute in the University of Sydney. He mainly applies statistics and mathematics to artificial intelligence and data science, and his research is detailed in one monograph and over 200 publications in prestigious journals and proceedings at leading conferences. He received the 2015 Australian Scopus-Eureka Prize, the 2018 IEEE ICDM Research Contributions Award, and the 2021 IEEE Computer Society McCluskey Technical Achievement Award. He is a fellow of the Australian Academy of Science, AAAS, and ACM.
APPENDIX A
PROOFS OF THEOREM 1 AND THEOREM 2

Theorem 1. Given the quantization system defined by the encode operation in (10) and the decode operation in (13), the quantization error is tightly upper bounded, i.e.,
$$\|\mathbf{x}_i^p - \mathbf{x}_i^g\|_2 \le \sqrt{2}\,s/2,$$
where $s > 1$ indicates the downsampling stride of the heatmap.

Proof. Given the ground truth numerical coordinate $\mathbf{x}_i^g = (x_i^g, y_i^g)$, the predicted numerical coordinate $\mathbf{x}_i^p = (x_i^p, y_i^p)$, and the downsampling stride of the heatmap $s > 1$, if there is no heatmap error, we then have
$$h_i^p(\mathbf{x}) = h_i^g(\mathbf{x}),$$
where $h_i^p(\mathbf{x})$ and $h_i^g(\mathbf{x})$ indicate the predicted heatmap and the ground truth heatmap, respectively. Therefore, according to the decode operation in (13), the predicted numerical coordinate recovers each ground truth coordinate only up to the nearest multiple of the stride, i.e., $|x_i^p - x_i^g| \le s/2$ and $|y_i^p - y_i^g| \le s/2$. We thus have
$$\|\mathbf{x}_i^p - \mathbf{x}_i^g\|_2 \le \sqrt{(s/2)^2 + (s/2)^2} = \sqrt{2}\,s/2,$$
where the bound is attained when both fractional parts equal $1/2$. $\square$

Theorem 2. The quantization system defined by the encode operation in (15), i.e., random-round, and the decode operation in (17) is unbiased and lossless.

Proof. Given the ground truth numerical coordinate $\mathbf{x}_i^g = (x_i^g, y_i^g)$, the predicted numerical coordinate $\mathbf{x}_i^p = (x_i^p, y_i^p)$, and the downsampling stride of the heatmap $s > 1$, let $\epsilon_x = x_i^g/s - \lfloor x_i^g/s \rfloor$ and $\epsilon_y = y_i^g/s - \lfloor y_i^g/s \rfloor$ denote the fractional parts. With the threshold $t$ sampled uniformly from $(0, 1)$, we then have
$$\mathbb{E}(q(x_i^g, s)) = \mathbb{E}\big(P\{\epsilon_x < t\}\lfloor x_i^g/s\rfloor + P\{\epsilon_x \ge t\}(\lfloor x_i^g/s\rfloor + 1)\big) = \lfloor x_i^g/s\rfloor(1-\epsilon_x) + (\lfloor x_i^g/s\rfloor + 1)\epsilon_x = x_i^g/s.$$
Similarly, we have $\mathbb{E}(q(y_i^g, s)) = y_i^g/s$. Considering that $x_i^p$ and $y_i^p$ are independent variables, we thus have
$$\mathbb{E}(q(\mathbf{x}_i^g, s)) = (\mathbb{E}(q(x_i^g, s)), \mathbb{E}(q(y_i^g, s))) = (x_i^g/s, y_i^g/s).$$
Therefore, the encode operation in (15), i.e., random-round, is an unbiased encode operation for heatmap regression.

We then prove that the quantization system is lossless as follows. For the decode operation in (17), if there is no heatmap error, we then have
$$P\{\mathbf{x}_i^p/s = (\lfloor x_i^g/s\rfloor, \lfloor y_i^g/s\rfloor)\} = (1-\epsilon_x)(1-\epsilon_y),$$
$$P\{\mathbf{x}_i^p/s = (\lfloor x_i^g/s\rfloor + 1, \lfloor y_i^g/s\rfloor)\} = \epsilon_x(1-\epsilon_y),$$
$$P\{\mathbf{x}_i^p/s = (\lfloor x_i^g/s\rfloor, \lfloor y_i^g/s\rfloor + 1)\} = (1-\epsilon_x)\epsilon_y,$$
$$P\{\mathbf{x}_i^p/s = (\lfloor x_i^g/s\rfloor + 1, \lfloor y_i^g/s\rfloor + 1)\} = \epsilon_x\epsilon_y.$$
We can thus reconstruct the fractional part of $\mathbf{x}_i^g$ in expectation, i.e.,
$$(x_i^p/s,\, y_i^p/s) = \sum_{\mathbf{x}_i^p} P\{\mathbf{x}_i^p\}\,\mathbf{x}_i^p/s = (x_i^g/s,\, y_i^g/s),$$
i.e., the decode operation in (17) is lossless for heatmap regression if there is no heatmap error. $\square$

However, heatmap prediction performance will be influenced by the number of training samples: increasing the number of training samples improves the model generalizability from the learning theory perspective. Therefore, we perform experiments to evaluate the influence of the proposed method when using different numbers of training samples in practice. As shown in Table 11, we find that 1) the proposed method delivers consistent improvements when using different numbers of training samples; and 2) increasing the number of training samples significantly improves the performance of heatmap regression models with low-resolution input images.

The influence of different backbone networks. If we do not take the heatmap prediction error into consideration, the quantization error in heatmap regression is caused by the downsampling of heatmaps: 1) the downsampling of input images and 2) the downsampling of CNN feature maps. Though the analysis of heatmap prediction error is out of the scope of this paper, we perform some experiments to demonstrate the influence of different feature maps from the backbone networks in practice. Specifically, we perform experiments using the following two settings: 1) upsampling the feature maps from HRNet [22]; or 2) using the feature maps from U-shape backbone networks, i.e., U-Net [80]. As shown in Table 12, we see that 1) directly upsampling the feature maps achieves comparable performance with the baseline method; 2) U-Net performs better than HRNet-W18 when using a small input resolution (e.g., 64 × 64 pixels), while it is significantly worse than HRNet when using a large input resolution (e.g., 256 × 256 pixels); and 3) U-Net contains more parameters and requires much more computation than HRNet when using the same input resolution. It would be interesting to further explore more efficient U-shape networks for low-resolution heatmap-based semantic landmark localization.

TABLE 12
Comparison of different backbone networks and feature maps on WFLW dataset.

The influence of different types of heatmap. When using a low input resolution, the heatmap regression model achieves better performance with the binary heatmap. We demonstrate the differences between the binary heatmap and the Gaussian heatmap in Figure 9. Specifically, the Gaussian heatmap improves the robustness of heatmap prediction, while at the risk of increasing the uncertainty on the maximum activation point in the predicted heatmap. Therefore, when training very efficient heatmap regression models using a low input resolution, we recommend the binary heatmap.

TABLE 13
The influence of different types of heatmap when using the heatmap regression model with different input resolutions.

Fig. 9. An intuitive example of the ground truth heatmap on WFLW dataset using different σ. We plot all heatmaps h_1^g, ..., h_98^g into a single figure for better visualization.

The qualitative comparison between the vanilla quantization system and the proposed quantization system. We provide some demo images for facial landmark detection using both the baseline method (i.e., k = 1) and the proposed quantization method (i.e., k = 9) to demonstrate the effectiveness of the proposed method for accurate semantic landmark localization. As shown in Fig. 10, we see that the black landmarks are closer to the blue landmarks than the yellow landmarks, especially when using low-resolution models (e.g., 64 × 64 pixels).

Fig. 10. Qualitative results from the test split of the WFLW dataset (best viewed in color). Blue: the ground truth facial landmarks. Yellow: the predicted facial landmarks by the vanilla quantization system (i.e., k = 1). Black: the predicted facial landmarks by the proposed quantization system (i.e., k = 9). For the above three rows, we use heatmap regression models with input resolutions of 256 × 256, 128 × 128, and 64 × 64 pixels, respectively.
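The encode/decode pair analyzed in Appendix A can also be checked numerically. The sketch below (with illustrative names of our own, not the released H3R API) spreads the fractional part of a coordinate over the four neighboring integer activation points with exactly the probabilities used in the lossless proof, and recovers the sub-pixel coordinate by taking the expectation over those activation points:

```python
import math

def encode(x, y, s):
    """Spread (x/s, y/s) over the four neighboring integer points.

    Returns {(ix, iy): weight}; the weights are the probabilities
    (1-ex)(1-ey), ex(1-ey), (1-ex)ey, and ex*ey from the proof
    of Theorem 2, where ex/ey are the fractional parts.
    """
    fx, fy = x / s, y / s
    ix, iy = math.floor(fx), math.floor(fy)
    ex, ey = fx - ix, fy - iy
    return {
        (ix, iy): (1 - ex) * (1 - ey),
        (ix + 1, iy): ex * (1 - ey),
        (ix, iy + 1): (1 - ex) * ey,
        (ix + 1, iy + 1): ex * ey,
    }

def decode(points, s):
    """Expectation over activation points, mapped back by the stride s."""
    x = sum(w * px for (px, _), w in points.items())
    y = sum(w * py for (_, py), w in points.items())
    return x * s, y * s

# Lossless round trip for a sub-pixel coordinate with stride s = 4:
xp, yp = decode(encode(10.3, 7.9, 4.0), 4.0)
print(round(xp, 6), round(yp, 6))  # 10.3 7.9
```

The round trip is exact (up to floating point) for any coordinate, which is the practical content of the lossless claim: no information about the fractional part is destroyed by quantizing to integer heatmap indices.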
TABLE 14
Results on COCO validation set. The input resolution is 256 × 192 pixels. The first row indicates the performance when using the compensation method "shift a quarter to the second maximum activation point".
TABLE 15
Results on MPII validation set. The input resolution is 256 × 256 pixels. The first row indicates the performance when using the compensation method "shift a quarter to the second maximum activation point".
Backbone k Head Shoulder Elbow Wrist Hip Knee Ankle Mean [email protected]
HRNet-W32 - 97.1 95.9 90.3 86.4 89.1 87.1 83.3 90.3 37.7
HRNet-W32 1 97.033 95.703 90.131 86.312 88.402 86.802 82.924 90.055 32.808
HRNet-W32 2 97.033 95.856 90.302 86.416 88.818 87.004 83.325 90.247 35.912
HRNet-W32 3 97.135 95.975 90.302 86.347 89.250 86.741 83.018 90.271 37.668
HRNet-W32 4 97.237 95.907 90.643 86.141 89.164 86.862 83.089 90.294 36.997
HRNet-W32 5 97.203 95.839 90.677 86.141 89.147 87.124 82.994 90.315 38.774
HRNet-W32 6 97.101 95.839 90.677 86.262 89.181 86.923 83.089 90.310 38.548
HRNet-W32 7 97.135 95.822 90.609 86.313 89.199 86.943 83.113 90.312 37.913
HRNet-W32 8 97.033 95.822 90.557 86.175 89.147 86.922 82.876 90.245 38.129
HRNet-W32 9 97.067 95.822 90.694 86.329 89.112 86.942 82.876 90.284 38.457
HRNet-W32 10 97.033 95.873 90.626 86.141 89.372 86.862 82.995 90.297 39.024
HRNet-W32 11 97.101 95.907 90.592 86.244 89.406 86.842 82.853 90.302 39.139
HRNet-W32 12 97.101 95.822 90.660 86.175 89.475 86.761 82.900 90.297 38.996
HRNet-W32 13 97.101 95.822 90.694 86.141 89.475 86.741 82.829 90.289 38.855
HRNet-W32 14 97.033 95.822 90.728 86.141 89.337 86.640 82.806 90.250 38.616
HRNet-W32 15 97.169 95.754 90.609 86.124 89.337 86.801 82.475 90.211 38.798
HRNet-W32 16 97.101 95.771 90.694 86.141 89.199 86.882 82.569 90.226 38.691
HRNet-W32 17 97.067 95.822 90.609 86.142 89.285 86.821 82.617 90.224 39.053
HRNet-W32 18 97.033 95.822 90.626 86.004 89.250 86.801 82.522 90.193 39.048
HRNet-W32 19 97.033 95.839 90.626 85.953 89.250 86.822 82.593 90.198 39.058
HRNet-W32 20 96.999 95.788 90.626 86.039 89.389 86.801 82.664 90.226 39.014
HRNet-W32 21 96.999 95.771 90.677 85.935 89.337 86.862 82.451 90.187 38.855
HRNet-W32 22 96.999 95.856 90.626 86.090 89.268 86.741 82.475 90.198 38.842
HRNet-W32 23 96.999 95.890 90.609 86.038 89.164 86.801 82.499 90.187 39.037
HRNet-W32 24 96.862 95.754 90.592 86.021 89.233 86.842 82.333 90.146 39.089
HRNet-W32 25 96.930 95.754 90.506 86.055 89.302 86.721 82.309 90.135 39.126
HRNet-W32 30 96.862 95.788 90.438 86.004 89.372 86.660 82.144 90.101 38.759
HRNet-W32 35 96.623 95.873 90.523 85.936 89.354 86.701 81.884 90.075 38.964
HRNet-W32 40 96.385 95.856 90.523 85.816 89.406 86.721 81.908 90.049 38.618
HRNet-W32 45 96.317 95.669 90.506 85.746 89.285 86.660 82.026 89.990 38.579
HRNet-W32 50 96.214 95.686 90.506 85.678 89.268 86.439 81.648 89.899 38.584
HRNet-W32 55 96.044 95.669 90.404 85.335 89.337 86.439 81.506 89.815 38.496
HRNet-W32 60 95.805 95.533 90.302 85.147 89.147 86.419 81.270 89.672 38.332
HRNet-W32 65 95.703 95.618 90.251 85.079 89.164 86.419 81.081 89.646 38.218
HRNet-W32 70 95.498 95.584 90.165 85.215 89.302 86.378 80.963 89.633 38.197
HRNet-W32 75 95.259 95.533 90.029 85.147 89.147 86.318 80.467 89.487 38.054
HRNet-W32 80 95.020 95.516 90.046 84.907 89.216 86.197 80.255 89.404 37.684
HRNet-W32 85 94.679 95.465 89.927 84.925 89.337 86.258 80.042 89.357 37.463
HRNet-W32 90 94.407 95.431 89.790 84.907 89.233 86.197 79.735 89.251 37.208
HRNet-W32 95 94.065 95.414 89.773 84.890 89.302 86.097 79.263 89.165 36.719
HRNet-W32 100 93.827 95.482 89.603 84.737 89.320 86.076 78.578 89.032 36.180
HRNet-W32 105 93.520 95.448 89.603 84.599 89.199 85.996 77.964 88.884 35.532
HRNet-W32 110 93.008 95.448 89.518 84.273 89.147 85.835 77.114 88.657 34.770
HRNet-W32 115 92.565 95.296 89.501 84.085 89.112 85.794 76.405 88.483 33.352
HRNet-W32 120 92.121 95.262 89.398 83.930 89.060 85.533 75.272 88.238 31.814
HRNet-W32 125 91.473 95.160 89.160 83.554 88.991 85.472 74.162 87.939 30.065
HRNet-W32 130 90.825 94.939 89.211 83.023 89.095 85.089 72.910 87.609 27.528
HRNet-W32 135 90.246 94.667 89.006 82.577 88.852 84.807 71.209 87.164 24.757
HRNet-W32 140 89.529 94.463 88.853 82.046 88.679 83.800 69.792 86.659 21.788
HRNet-W32 145 88.404 94.124 88.717 81.310 88.645 82.773 67.997 86.053 19.136
HRNet-W32 150 87.756 93.886 88.614 80.146 88.575 81.383 66.084 85.373 16.586
HRNet-W32 175 82.401 90.829 86.927 71.515 86.014 70.504 56.826 80.109 9.204
HRNet-W32 200 68.008 87.075 82.325 62.146 79.176 58.737 41.120 72.009 5.988
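The k column in the tables above controls how many of the highest activation points the decoder averages over. A minimal sketch of such a top-k expectation decode is given below (illustrative only; the released code may differ in details such as score normalization):

```python
import numpy as np

def decode_topk(heatmap, k, stride=4.0):
    """Average the coordinates of the top-k activation points,
    weighted by their renormalized scores, then map back to
    image space via the downsampling stride."""
    flat = heatmap.ravel()
    idx = np.argsort(flat)[-k:]           # indices of the k largest scores
    w = flat[idx] / flat[idx].sum()       # renormalize scores to weights
    ys, xs = np.unravel_index(idx, heatmap.shape)
    return float((xs * w).sum()) * stride, float((ys * w).sum()) * stride

# With k=2 and scores 0.75/0.25 split between x=3 and x=4, the decoded
# x-coordinate lands at 3.25 cells: the fractional part is recovered.
hm = np.zeros((8, 8)); hm[2, 3] = 0.75; hm[2, 4] = 0.25
print(decode_topk(hm, k=2, stride=1.0))  # (3.25, 2.0)
```

This also illustrates the trend in the tables: a moderate k lets the expectation recover the fractional part, while a very large k pulls in unrelated low-score activations and degrades accuracy.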