

Heatmap Regression via Randomized Rounding


Baosheng Yu and Dacheng Tao, Fellow, IEEE

Abstract—Heatmap regression has become the mainstream methodology for deep learning-based semantic landmark localization,
including facial landmark localization and human pose estimation. Though heatmap regression is robust to large variations in pose,
illumination, and occlusion in unconstrained settings, it usually suffers from a sub-pixel localization problem. Specifically, considering that
the activation point indices in heatmaps are always integers, quantization error thus appears when using heatmaps as the representation
of numerical coordinates. Previous methods to overcome the sub-pixel localization problem usually rely on high-resolution heatmaps. As
a result, there is always a trade-off between achieving localization accuracy and computational cost, where the computational complexity
of heatmap regression depends on the heatmap resolution in a quadratic manner. In this paper, we formally analyze the quantization
error of vanilla heatmap regression and propose a simple yet effective quantization system to address the sub-pixel localization problem.
The proposed quantization system induced by the randomized rounding operation 1) encodes the fractional part of numerical coordinates
into the ground truth heatmap using a probabilistic approach during training; and 2) decodes the predicted numerical coordinates
from a set of activation points during testing. We prove that the proposed quantization system for heatmap regression is unbiased
and lossless. Experimental results on popular facial landmark localization datasets (WFLW, 300W, COFW, and AFLW) and human
pose estimation datasets (MPII and COCO) demonstrate the effectiveness of the proposed method for efficient and accurate semantic
landmark localization. Code is available at http://github.com/baoshengyu/H3R.

Index Terms—Semantic landmark localization, heatmap regression, quantization error, randomized rounding.

1 INTRODUCTION

SEMANTIC landmarks are sets of points or pixels in images containing rich semantic information. They reflect the intrinsic structure or shape of objects such as human faces [1], [2], hands [3], [4], bodies [5], [6], and household objects [7]. Semantic landmark localization is fundamental in computer and robot vision [8], [9], [7], [10]. For example, semantic landmark localization can be used to register correspondences between spatial positions and semantics (semantic alignment), which is extremely useful in visual recognition tasks such as face recognition [8], [11] and person re-identification [12], [13]. Therefore, robust and efficient semantic landmark localization is extremely important in applications requiring accurate semantic landmarks, including robotic grasping [7], [10] and facial analysis applications such as face makeup [14], [15], animation [16], [17], and reenactment [18], [9].

Coordinate regression and heatmap regression are two widely-used methods for deep learning-based semantic landmark localization [1], [19]. Rather than directly regressing the numerical coordinates with a fully-connected layer, heatmap-based methods aim to predict a heatmap in which the maximum activation point corresponds to the semantic landmark in the input image. An intuitive example of the heatmap representation is shown in Fig. 1. Due to the effective spatial generalization of the heatmap representation, heatmap regression is robust to large variations in pose, illumination, and occlusion in unconstrained settings [19], [20]. Heatmap regression has performed particularly well in semantic landmark localization tasks including facial landmark detection [2], [21] and human pose estimation [6], [22]. Despite this promise, heatmap regression suffers from an inherent drawback, namely that the indices of the activation points in heatmaps are always integers. Vanilla heatmap-based methods therefore fail to predict the numerical coordinates with sub-pixel precision. Sub-pixel localization is nevertheless important in real-world scenarios, where the fractional part of numerical coordinates originates from: 1) the input image being captured by a low-resolution camera and/or at a relatively large distance; and 2) the heatmap usually being at a much lower resolution than the input image due to the downsampling stride of convolutional neural networks. As a result, low-resolution heatmaps significantly degrade heatmap regression performance. Considering that the computational cost of convolutional neural networks usually depends quadratically on the resolution of the input image or the feature map, there is a trade-off between the localization accuracy and the computational cost for heatmap regression [6], [23], [24], [25], [26]. Furthermore, the downsampling stride of the heatmap is not always equal to the downsampling stride of the feature map: given an original image of 512 × 512 pixels, a heatmap regression model with an input size of 128 × 128 pixels, and a feature map with a downsampling stride of 4 pixels, the heatmap has a size of 32 × 32 pixels, i.e., the downsampling stride of the heatmap is s = 16 pixels. For simplicity, we do not distinguish between these two settings and address the quantization error in a unified manner. Unless otherwise mentioned, we refer to s > 1 as the downsampling stride of the heatmap.

Baosheng Yu is with The University of Sydney, Australia. E-mail: [email protected].
Dacheng Tao is with JD Explore Academy, China and The University of Sydney, Australia. E-mail: [email protected].
Corresponding author: Dacheng Tao.

Fig. 1. An intuitive example of the heatmap representation for the numerical coordinate. Given the numerical coordinate x_i = (x_i, y_i), we then have the corresponding heatmap h_i with the maximum activation point located at the position (x_i, y_i).

In vanilla heatmap regression, 1) during training, the ground truth numerical coordinates are first quantized to generate the ground truth heatmap; and 2) during testing, the predicted numerical coordinates are decoded from the maximum activation point in the predicted heatmap. However, typical quantization operations such as floor, round, and ceil discard the fractional part of the ground truth numerical coordinates, making it difficult to reconstruct the fractional part even from the optimal predicted heatmap. This error, induced by the transformation between numerical coordinates and the heatmap, is known as the quantization error. To address the problem of quantization error, here we introduce a new quantization system that forms a lossless transformation between the numerical coordinates and the heatmap. In our approach, during training, the proposed quantization system uses a set of activation points, and the fractional part of the numerical coordinate is encoded as the activation probabilities of different activation points. During testing, the fractional part can then be reconstructed according to the activation probabilities of the top k maximum activation points in the predicted heatmap. To achieve this, we introduce a new quantization operation called randomized rounding, or random-round, which is widely used in combinatorial optimization to convert fractional solutions into integer solutions with provable approximation guarantees [27], [28]. Furthermore, the proposed method can easily be implemented in a few lines of source code, making it a plug-and-play replacement for the quantization system of existing heatmap regression methods.

In this paper, we address the problem of quantization error in heatmap regression. The remainder of the paper is structured as follows. In the preliminaries, we briefly review two typical semantic landmark localization methods, coordinate regression and heatmap regression. In the methods, we first formally introduce the problem of quantization error by decomposing the prediction error into the heatmap error and the quantization error. We then discuss quantization bias in vanilla heatmap regression and prove a tight upper bound on the quantization error in vanilla heatmap regression. To address quantization error, we devise a new quantization system and theoretically prove that the proposed quantization system is unbiased and lossless. We also discuss uncertainty in heatmap prediction as well as unbiased annotation when forming a robust semantic landmark localization system. In the experimental section, we demonstrate the effectiveness of our proposed method on popular facial landmark detection datasets (WFLW, 300W, COFW, and AFLW) and human pose estimation datasets (MPII and COCO).

2 RELATED WORK

Semantic landmark localization, which aims to predict the numerical coordinates of a set of pre-defined semantic landmarks in a given image or video, has a variety of applications in computer and robot vision, including facial landmark detection [1], [29], [2], hand landmark detection [3], [4], human pose estimation [5], [6], [22], and household object pose estimation [7], [10]. In this section, we briefly review recent works on coordinate regression and heatmap regression for semantic landmark localization, especially in facial landmark localization applications.

2.1 Coordinate Regression

Coordinate regression has been widely and successfully used in semantic landmark localization under constrained settings, where it usually relies on simple yet effective features [30], [31], [32], [33]. To improve the performance of coordinate regression for semantic landmark localization in the wild, several methods have been proposed using cascade refinement [1], [34], [35], [36], [37], [38], parametric/non-parametric shape models [39], [29], [36], [40], multi-task learning [41], [42], [43], and novel loss functions [44], [45].

2.2 Heatmap Regression

The success of deep learning has prompted the use of heatmap regression for semantic landmark localization, especially for robust and accurate facial landmark localization [2], [46], [47], [23] and human pose estimation [19], [48], [6], [49], [50], [51], [52]. Existing heatmap regression methods either rely on large input images or on empirical compensations during inference to mitigate the problem of quantization error [6], [53], [54], [22]. For example, a simple yet effective compensation method known as "shift a quarter to the second maximum activation point" has been widely used in many state-of-the-art heatmap regression methods [6], [55], [22].

Several methods have been developed to address the problem of quantization error in three aspects: 1) jointly predicting the heatmap and the offset in a multi-task manner [56]; 2) encoding and decoding the fractional part of numerical coordinates via a modulated 2D Gaussian distribution [24], [26]; and 3) exploring differentiable transformations between the heatmap and the numerical coordinates [57], [20], [58], [59]. Specifically, [24] generates a fractional-part-sensitive ground truth heatmap for video-based face alignment, which is known as fractional heatmap regression. Under the assumption that the predicted heatmap follows a 2D Gaussian distribution, [26] decodes the fractional part of numerical coordinates from the modulated predicted heatmap. The soft-argmax operation is differentiable [60], [61], [62], [59], and has been intensively explored in human pose estimation [57], [20].
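For concreteness, the quarter-shift compensation mentioned above is commonly implemented as a small post-processing step on the predicted heatmap. The following is a minimal sketch (Python/NumPy, our illustration rather than any particular released implementation): the argmax location is shifted by 0.25 pixel toward the higher-valued neighbor along each axis and then scaled back by the downsampling stride.

```python
import numpy as np

def decode_with_quarter_shift(heatmap, stride=4.0):
    """Argmax decoding plus the 'shift a quarter toward the second maximum
    activation point' heuristic used by many heatmap regression methods."""
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    px, py = float(x), float(y)
    if 0 < x < w - 1:  # shift 0.25 pixel toward the larger horizontal neighbor
        px += 0.25 * np.sign(heatmap[y, x + 1] - heatmap[y, x - 1])
    if 0 < y < h - 1:  # shift 0.25 pixel toward the larger vertical neighbor
        py += 0.25 * np.sign(heatmap[y + 1, x] - heatmap[y - 1, x])
    return stride * px, stride * py
```

Such compensations reduce, but do not remove, the quantization error analyzed in Section 4, which motivates the lossless quantization system proposed in this paper.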

3 PRELIMINARIES

In this section, we introduce two widely-used semantic landmark localization methods, coordinate regression and heatmap regression. For simplicity, we use facial landmark detection as an intuitive example.

Coordinate Regression. Given a face image, semantic landmark detection aims to find the numerical coordinates of a set of pre-defined facial landmarks x_i = (x_i, y_i), where i = 1, 2, ..., K indicates the index of a facial landmark (e.g., a set of five pre-defined facial landmarks can be left eye, right eye, nose, left mouth corner, and right mouth corner). It is natural to train a model (e.g., a deep neural network) to directly regress the numerical coordinates of all facial landmarks. The coordinate regression model can then be optimized via a typical regression criterion such as the mean squared error (MSE) or the mean absolute error (MAE). For the MSE criterion (also the L2 loss), we have

  L(x_i^p, x_i^g) = ‖x_i^p − x_i^g‖_2^2,    (1)

where x_i^p and x_i^g indicate the predicted and the ground truth numerical coordinates, respectively. When using the MAE criterion (also the L1 loss), the loss function L can be defined in a similar way to (1).

Heatmap Regression. Heatmaps (also known as confidence maps) are simple yet effective representations of semantic landmark locations. Given the numerical coordinate x_i of the i-th semantic landmark, it corresponds to a specific heatmap h_i as shown in Fig. 1. For simplicity, we assume h_i is the same size as the input image in this section and leave the problem of quantization error to the next section. With the heatmap representation, the problem of semantic landmark localization can be translated into heatmap regression via two heatmap subroutines: 1) encode (from the ground truth numerical coordinate x_i^g to the ground truth heatmap h_i^g); and 2) decode (from the predicted heatmap h_i^p to the predicted numerical coordinate x_i^p). The main deep learning-based heatmap regression framework for semantic landmark localization is shown in Fig. 2. Specifically, during the inference stage, given a predicted heatmap h_i^p, the value h_i^p(x) ∈ [0, 1] indicates the confidence score that the i-th landmark is located at coordinate x ∈ N². We can then decode the predicted numerical coordinate x_i^p from the predicted heatmap h_i^p using the argmax operation, i.e.,

  x_i^p = (x_i^p, y_i^p) ∈ arg max_x { h_i^p(x) }.    (2)

Therefore, with the decode operation in (2), the problem of semantic landmark localization can be solved by training a deep model to predict the heatmap h_i^p.

To train a heatmap regression model, the ground truth heatmap h_i^g is indispensable, i.e., we need to encode the ground truth coordinate x_i^g into the ground truth heatmap h_i^g. We introduce two widely-used methods to generate the ground truth heatmap, the Gaussian heatmap and the binary heatmap, as follows. Given the ground truth coordinate x_i^g, the ground truth Gaussian heatmap can be generated by sampling and normalizing from a bivariate normal distribution N(x_i^g, Σ), i.e., the ground truth heatmap h_i^g at location x ∈ N² can be evaluated as

  h_i^g(x) = exp( −(1/2) (x − x_i^g)ᵀ Σ⁻¹ (x − x_i^g) ),    (3)

where Σ is the covariance matrix (a positive semi-definite matrix) and σ > 0 is the standard deviation in both directions, i.e.,

  Σ = [ σ²  0
        0   σ² ].    (4)

When σ → 0, the ground truth heatmap can be generated by assigning a positive value at the ground truth numerical coordinate x_i^g, i.e.,

  h_i^g(x) = 1 if x = x_i^g, and 0 otherwise.    (5)

Specifically, when σ → 0, the ground truth heatmap defined in (5) is also known as the binary heatmap.

Given the ground truth heatmap, the heatmap regression model can then be optimized using typical pixel-wise regression criteria such as MSE, MAE, or Smooth-L1 [44]. Specifically, for Gaussian heatmaps, the heatmap regression model is usually optimized with the pixel-wise MSE criterion, i.e.,

  L(h_i^p, h_i^g) = E ‖h_i^p(x) − h_i^g(x)‖_2^2.    (6)

When using the MAE/Smooth-L1 criteria, the loss function can be defined in a similar way to (6). For the binary heatmap, the heatmap regression model can also be optimized with the pixel-wise cross-entropy criterion, i.e.,

  L(h_i^p, h_i^g) = E( L_CE(h_i^p(x), h_i^g(x)) ),    (7)

where L_CE indicates the cross-entropy criterion with a softmax function as the activation/normalization function. A comprehensive review of different loss functions for semantic landmark localization is beyond the scope of this paper, but we refer interested readers to [45] for coordinate regression and [21] for heatmap regression. Unless otherwise mentioned, we use the MSE criterion for the Gaussian heatmap and the cross-entropy criterion for the binary heatmap in this paper.
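To make the two subroutines concrete, the following is a minimal sketch of the vanilla encode/decode pair in (2)-(5) (Python/NumPy; the function names are ours and not taken from the released code):

```python
import numpy as np

def encode_gaussian(coord, size, sigma=1.5):
    """Eqs. (3)-(4): Gaussian heatmap centered at coord = (x, y)."""
    gx, gy = np.meshgrid(np.arange(size[1]), np.arange(size[0]))
    d2 = (gx - coord[0]) ** 2 + (gy - coord[1]) ** 2
    return np.exp(-0.5 * d2 / sigma ** 2)

def decode_argmax(heatmap):
    """Eq. (2): the maximum activation point as the predicted coordinate."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return float(x), float(y)

h = encode_gaussian((10.0, 20.0), size=(64, 64))
print(decode_argmax(h))  # (10.0, 20.0); a fractional target such as (10.3, 20.7)
                         # would be decoded as the nearest integer pixel
```

The last comment already hints at the quantization error analyzed in the next section.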
4 METHOD

In this section, we first introduce the quantization system in heatmap regression and then formulate the quantization error in a unified way by correcting the quantization bias in a vanilla quantization system. Lastly, we devise a new quantization system via randomized rounding to address the problem of quantization error.

4.1 Quantization System

Heatmap regression for semantic landmark localization usually contains two key components: 1) heatmap prediction; and 2) transformation between the heatmap and the numerical coordinates. The quantization system in heatmap regression is the combination of the encode and decode operations. During training, when the ground truth numerical coordinates x_i^g are floating-point numbers, we need to calculate a specific Gaussian kernel matrix using (3) for

each landmark, since different numerical coordinates usually have different fractional parts. As a result, this significantly increases the training load of the heatmap regression model. For example, given 98 landmarks per face image, a kernel size of 11 × 11, and a mini-batch of 16 training samples, we have to evaluate (3) 98 × 16 × 11 × 11 = 189,728 times in each training iteration. To address this issue, existing heatmap regression methods usually first quantize the numerical coordinates into integers, so that a standard kernel matrix can be shared for efficient ground truth heatmap generation [6], [2], [55], [23]. However, these existing heatmap regression methods usually suffer from the inherent drawback of failing to encode the fractional part of the numerical coordinates. Therefore, how to efficiently encode the fractional information in numerical coordinates remains challenging. Furthermore, during the inference stage, the predicted numerical coordinates x_i^p obtained by the decode operation in (2) are also integers. As a result, typical heatmap regression methods usually fail to efficiently address the fractional part of the numerical coordinate during both training and inference, resulting in localization error.

Fig. 2. The main deep learning-based heatmap regression framework for semantic landmark localization. Specifically, during the inference stage, we decode the predicted heatmaps h^p = (h_1^p, ..., h_K^p) to obtain the predicted numerical coordinates x^p = (x_1^p, ..., x_K^p); during the training stage, we encode the ground truth numerical coordinates x^g = (x_1^g, ..., x_K^g) to generate the ground truth heatmaps h^g = (h_1^g, ..., h_K^g).

To analyze the localization error caused by the quantization system in heatmap regression, we formulate the localization error as the sum of the heatmap error and the quantization error as follows:

  E_loc = ‖x_i^p − x_i^g‖_2 = ‖x_i^p − x_i^opt + x_i^opt − x_i^g‖_2
        ≤ ‖x_i^p − x_i^opt‖_2 + ‖x_i^opt − x_i^g‖_2,    (8)
          (heatmap error)      (quantization error)

where x_i^opt indicates the numerical coordinate decoded from the optimal predicted heatmap. Generally, the heatmap error corresponds to the error in heatmap prediction, i.e., ‖h_i^p − h_i^g‖_2, and the quantization error indicates the error caused by both the encode and decode operations. If there is no heatmap error, the localization error originates entirely from the quantization system, i.e.,

  E_loc = ‖x_i^p − x_i^g‖_2 = ‖x_i^opt − x_i^g‖_2.    (9)

The generalizability of deep neural networks for heatmap prediction, i.e., the heatmap error, is beyond the scope of this paper; we do not consider the heatmap error in the analysis of the quantization error.

To obtain integer coordinates for the generation of the ground truth heatmap, typical integer quantization operations such as floor, round, and ceil have been widely used in previous heatmap regression methods. To unify the quantization error induced by different integer operations, we first introduce a unified integer quantization operation as follows. Given a downsampling stride s > 1 and a threshold t ∈ [0, 1], the coordinate x ∈ N can be quantized according to its fractional part ε = x/s − ⌊x/s⌋, i.e.,

  q(x, s, t) = ⌊x/s⌋ if ε < t, and ⌊x/s⌋ + 1 otherwise.    (10)

That is, for the integer quantization operations floor, round, and ceil, we have t = 1.0, t = 0.5, and t = 0, respectively. Furthermore, when the downsampling stride is s > 1, the decode operation in (2) becomes

  x_i^p ∈ s · arg max_x { h_i^p(x) }.    (11)

A vanilla quantization system for heatmap regression can then be formed by the encode operation in (10) and the decode operation in (11). When applied to a vector or a matrix, the integer quantization operation defined in (10) is an element-wise operation.
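A minimal sketch of the unified quantization in (10) and the vanilla decode in (11), written for a single scalar coordinate (Python/NumPy; an illustration under the paper's definitions rather than the released implementation):

```python
import numpy as np

def quantize(x, s, t):
    """Eq. (10): threshold the fractional part of x/s at t.
    t = 1.0, 0.5, 0.0 recover floor, round, and ceil, respectively."""
    frac = x / s - np.floor(x / s)
    return np.floor(x / s) + (frac >= t)

def decode_vanilla(heatmap, s):
    """Eq. (11): scale the integer argmax location back by the stride s."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return s * x, s * y

print(quantize(10.3, 4, 0.5), quantize(10.3, 4, 1.0), quantize(10.3, 4, 0.0))
# 3.0 (round), 2.0 (floor), 3.0 (ceil), since 10.3 / 4 = 2.575
```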
4.2 Quantization Error

In this subsection, we first correct the bias in the vanilla quantization system to form an unbiased vanilla quantization system. With the unbiased quantization system, we then provide a tight upper bound on the quantization error of vanilla heatmap regression.

Let ε_x denote the fractional part of x_i^g/s and ε_y denote the fractional part of y_i^g/s. Given the downsampling stride of the heatmap s > 1, we then have

  ε_x = x_i^g/s − ⌊x_i^g/s⌋,
  ε_y = y_i^g/s − ⌊y_i^g/s⌋.    (12)

Under the assumption of a "perfect" heatmap prediction model, i.e., no heatmap error, h_i^p(x) = h_i^g(x), we then have the predicted numerical coordinates

  x_i^p/s = ⌊x_i^g/s⌋ if ε_x < t, and ⌊x_i^g/s⌋ + 1 otherwise,
  y_i^p/s = ⌊y_i^g/s⌋ if ε_y < t, and ⌊y_i^g/s⌋ + 1 otherwise.

If the data samples satisfy the i.i.d. assumption and the fractional parts ε_x, ε_y ∼ U(0, 1), the bias of x_i^p as an estimator of x_i^g can be evaluated as

  E(x_i^p/s − x_i^g/s) = E( 1{ε_x < t}(−ε_x) + 1{ε_x ≥ t}(1 − ε_x) ) = 0.5 − t.

Considering that x_i^g and y_i^g are independent variables, we thus have the quantization bias of the vanilla quantization system as follows:

  E(x_i^p/s − x_i^g/s) = ( E(x_i^p/s − x_i^g/s), E(y_i^p/s − y_i^g/s) ) = (0.5 − t, 0.5 − t).

Therefore, only the round operation, i.e., t = 0.5, makes the encode operation in (10) unbiased. Furthermore, for any t ∈ [0, 1] in the encode operation (10), we can correct the bias of the encode operation with a shift in the decode operation, i.e.,

  x_i^p ∈ s · ( arg max_x { h_i^p(x) } + t − 0.5 ).    (13)

For simplicity, we use the round operation, i.e., t = 0.5, to form an unbiased quantization system as our baseline. Though the vanilla quantization system defined by (10) and (13) is unbiased, it causes non-invertible localization error. An intuitive explanation is that the encode operation in (10) directly discards the fractional part of the ground truth numerical coordinates, thus making it impossible for the decode operation to accurately reconstruct the numerical coordinates.

Theorem 1. Given an unbiased quantization system defined by the encode operation in (10) and the decode operation in (13), the quantization error is tightly upper bounded, i.e.,

  ‖x_i^p − x_i^g‖_2 ≤ √2 · s / 2,

where s > 1 indicates the downsampling stride of the heatmap.

Proof. In Appendix.

From Theorem 1, we know that the vanilla quantization system defined by (10) and (13) causes non-invertible quantization error and that the upper bound of the quantization error depends linearly on the downsampling stride of the heatmap. As a result, for a given heatmap regression model, it causes extremely large localization error for large faces in the original input image, making it a significant problem in many important face-related applications such as face makeup, face swapping, and face reenactment.
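The unbiasedness for every t and the per-axis error bound behind Theorem 1 are easy to check numerically. The following is a small Monte-Carlo sketch (Python/NumPy, one coordinate axis, under the no-heatmap-error assumption); it is an illustration of the claim, not a substitute for the proof in the appendix:

```python
import numpy as np

rng = np.random.default_rng(0)
s = 4.0                                       # downsampling stride
xg = rng.uniform(0.0, 64.0, size=200_000)     # ground truth coordinates

for t in (0.0, 0.5, 1.0):                     # ceil, round, floor
    frac = xg / s - np.floor(xg / s)
    xq = np.floor(xg / s) + (frac >= t)       # encode, Eq. (10)
    xp = s * (xq + t - 0.5)                   # bias-corrected decode, Eq. (13)
    bias = np.mean(xp / s - xg / s)           # expected to be ~0 for every t
    worst = np.abs(xp - xg).max()             # expected to be <= s/2 per axis
    print(f"t={t}: bias={bias:+.4f}, max|error|={worst:.3f}, s/2={s / 2}")
```

The per-axis worst case s/2 combines over the two independent axes into the √2 · s/2 bound stated in Theorem 1.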

Fig. 3. An intuitive example of the encode operation via randomized rounding. When the downsampling stride s > 1, the ground truth activation point (x_i^g/s, y_i^g/s) usually does not correspond to a single pixel in the heatmap. Therefore, we introduce the randomized rounding operation to assign the ground truth activation point to a set of alternative activation points, where the activation probability depends on the fractional part of the ground truth numerical coordinates.

4.3 Randomized Rounding

In vanilla heatmap regression, each numerical coordinate corresponds to a single activation point in the heatmap, while the indices of the activation point are all integers. As a result, the fractional part of the numerical coordinate is usually ignored during the encode process, making it an inherent drawback of heatmap regression for sub-pixel localization. To retain the fractional information when using heatmap representations, we utilize multiple activation points around the ground truth activation point. Inspired by the randomized rounding method [27], we address the quantization error in vanilla heatmap regression using a probabilistic approach. Specifically, we encode the fractional part of the numerical coordinate into different activation points with different activation probabilities. An intuitive example is shown in Fig. 3.

We describe the proposed quantization system as follows. Given the ground truth numerical coordinate x_i^g = (x_i^g, y_i^g) and a downsampling stride of the heatmap s > 1, the ground truth activation point in the heatmap is (x_i^g/s, y_i^g/s), whose coordinates are usually floating-point numbers, so we are unable to find a corresponding pixel in the heatmap. If we ignore the fractional part (ε_x, ε_y) using a typical integer quantization operation, e.g., round, the ground truth activation point will be approximated by one of the activation points around it, i.e., (⌊x_i^g/s⌋, ⌊y_i^g/s⌋), (⌊x_i^g/s⌋ + 1, ⌊y_i^g/s⌋), (⌊x_i^g/s⌋, ⌊y_i^g/s⌋ + 1), or (⌊x_i^g/s⌋ + 1, ⌊y_i^g/s⌋ + 1). However, this process is not invertible. To address this, we randomly assign the ground truth activation point to one of the alternative activation points around it, and the activation probability is determined by the fractional part of the ground truth activation point as follows:

  P{ h_i^g(⌊x_i^g/s⌋,     ⌊y_i^g/s⌋)     = 1 } = (1 − ε_x)(1 − ε_y),
  P{ h_i^g(⌊x_i^g/s⌋ + 1, ⌊y_i^g/s⌋)     = 1 } = ε_x (1 − ε_y),
  P{ h_i^g(⌊x_i^g/s⌋,     ⌊y_i^g/s⌋ + 1) = 1 } = (1 − ε_x) ε_y,
  P{ h_i^g(⌊x_i^g/s⌋ + 1, ⌊y_i^g/s⌋ + 1) = 1 } = ε_x ε_y.    (14)

To achieve the encode scheme in (14) in conjunction with current mini-batch stochastic gradient descent training algorithms for deep learning models, we introduce a new integer quantization operation via randomized rounding, i.e., random-round:

  q(x, s) = ⌊x/s⌋ if ε < t, and ⌊x/s⌋ + 1 otherwise, where t ∼ U(0, 1).    (15)

Given the encode operation in (15), if we do not consider the heatmap error, we then have the activation probability at x:

  h_i^p(x) = P{ h_i^g(x) = 1 }.    (16)

As a result, the fractional part of the ground truth numerical coordinate (ε_x, ε_y) can be reconstructed from the predicted heatmap via the activation probabilities of all activation points, i.e.,

  x_i^p = s · Σ_{x ∈ X_i^g} h_i^p(x) · x,    (17)

where X_i^g indicates the set of activation points around the ground truth activation point, i.e.,

  X_i^g = { (⌊x_i^g/s⌋, ⌊y_i^g/s⌋), (⌊x_i^g/s⌋ + 1, ⌊y_i^g/s⌋), (⌊x_i^g/s⌋, ⌊y_i^g/s⌋ + 1), (⌊x_i^g/s⌋ + 1, ⌊y_i^g/s⌋ + 1) }.    (18)

Theorem 2. Given the encode operation in (15) and the decode operation in (17), we have that 1) the encode operation is unbiased; and 2) the quantization system is lossless, i.e., there is no quantization error.

Proof. In Appendix.

From Theorem 2, we know that the quantization system defined by the encode operation in (15) and the decode operation in (17) is unbiased and lossless.

Fig. 4. An example of different possible sets of alternative activation points.

4.4 Activation Points Selection

The fractional information of the numerical coordinate (ε_x, ε_y) is well captured by the randomized rounding operation, allowing us to reconstruct the ground truth numerical coordinate x_i^g without quantization error. However, during the inference phase, the ground truth numerical coordinate x_i^g is unavailable and heatmap error always exists in practice, making it difficult to identify the proper set of ground truth activation points X_i^g. In this section, we describe how to form a set of alternative activation points in practice.

We introduce two activation point selection methods as follows. The first solution is to estimate all activation points via the points around the maximum activation point. As shown in Fig. 4, given the maximum activation point, there are four different sets of alternative activation points, X_i^{g1}, X_i^{g2}, X_i^{g3}, and X_i^{g4}. Therefore, given the predicted heatmap in practice, we take a risk of choosing an incorrect set of alternative activation points. To find a robust set of alternative activation points, we may use all nine activation points around the maximum activation point, i.e.,

  X_i^g = X_i^{g1} ∪ X_i^{g2} ∪ X_i^{g3} ∪ X_i^{g4}.    (19)

Another solution is to generalize the argmax operation to the argtopk operation, i.e., we decode the predicted heatmap h_i^p to obtain the numerical coordinate x_i^p according to the top k largest activation points,

  X_i^g = arg topk_x ( h_i^p(x) ).    (20)

If there is no heatmap error, the two solutions presented above, i.e., the alternative activation points in (18) and (19), are equal to each other when using the decode operation in (17). Specifically, we find that the activation points in (19) achieve comparable performance to the activation points in (20) when k = 9. For simplicity, unless otherwise mentioned, we use the set of alternative activation points defined by (20) in this paper. Furthermore, when we take the heatmap error into consideration, the value of k forms a trade-off in the selection of activation points, i.e., a larger k is more robust to activation point selection whilst also increasing the risk of noise from the heatmap error. See more discussion in Section 4.5 and the experimental results in Section 5.5.
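Putting (15), (17), and (20) together, a minimal sketch of the proposed encode/decode pair for a single landmark might look as follows (Python/NumPy; an illustration under the assumptions above, while the released PyTorch code at the repository given in the abstract is the reference implementation):

```python
import numpy as np

def random_round_encode(coord, heatmap_size, stride, rng):
    """Eq. (15): activate one pixel, chosen with probability given by the
    fractional part of coord/stride, so that E[heatmap] encodes Eq. (14)."""
    heatmap = np.zeros(heatmap_size, dtype=np.float32)
    xs = coord / stride
    base = np.floor(xs)
    frac = xs - base                                   # (eps_x, eps_y)
    offset = (rng.uniform(size=2) < frac).astype(int)  # random-round per axis
    x, y = (base + offset).astype(int)
    heatmap[y, x] = 1.0                                # binary ground truth
    return heatmap

def expectation_decode(heatmap, stride, k=9):
    """Eqs. (17) and (20): expectation over the top-k activation points,
    with the activations renormalized to sum to one."""
    flat = heatmap.ravel()
    topk = np.argpartition(flat, -k)[-k:]
    ys, xs = np.unravel_index(topk, heatmap.shape)
    w = flat[topk]
    w = w / w.sum()
    return stride * np.array([np.dot(w, xs), np.dot(w, ys)])

rng = np.random.default_rng(0)
coord = np.array([37.3, 18.6])
avg = np.mean([random_round_encode(coord, (32, 32), 4.0, rng)
               for _ in range(10_000)], axis=0)
print(expectation_decode(avg, 4.0))   # approximately [37.3, 18.6]
```

Averaging many randomly rounded ground truth heatmaps recovers the expectation in (14), which is why the expectation-based decode reconstructs the fractional part.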

4.5 Discussion

In this subsection, we provide some insights into the proposed quantization system with respect to: 1) the influence of human annotators on the proposed quantization system in practice; and 2) the underlying explanation behind the widely used empirical compensation method "shift a quarter to the second maximum activation point".

Unbiased Annotation. We assume the "ground truth numerical coordinates" are always accurate, while in practice the ground truth numerical coordinates are usually labelled by human annotators at the risk of annotation bias. Given an input image, the ground truth numerical coordinates x_i^g can be obtained by clicking a specific pixel in the image, which is a simple but effective annotation pipeline provided by most image annotation tools. For sub-pixel numerical coordinates, especially in low-resolution input images, the annotators may click any one of the possible pixels around the ground truth numerical coordinates due to human visual uncertainty. As shown in Fig. 5, clicking any one of the four possible pixels causes annotation error, which corresponds to the fractional part of the ground truth numerical coordinate (ε¹_x, ε¹_y) = (x_i^g − ⌊x_i^g⌋, y_i^g − ⌊y_i^g⌋). Given enough data samples, if the annotators click the pixel according to the following distribution, i.e.,

  P{ (⌊x_i^g⌋,     ⌊y_i^g⌋)     } = (1 − ε¹_x)(1 − ε¹_y),
  P{ (⌊x_i^g⌋ + 1, ⌊y_i^g⌋)     } = ε¹_x (1 − ε¹_y),
  P{ (⌊x_i^g⌋,     ⌊y_i^g⌋ + 1) } = (1 − ε¹_x) ε¹_y,
  P{ (⌊x_i^g⌋ + 1, ⌊y_i^g⌋ + 1) } = ε¹_x ε¹_y,    (21)

the fractional part can then be well captured by the heatmap regression model, and we refer to this as an unbiased annotation. If we take the downsampling stride into consideration, (ε_x, ε_y) is then a joint result of both the downsampling of the heatmap and the annotation process, i.e.,

  (ε_x, ε_y) ∝ (ε¹_x/s, ε¹_y/s) + (s − 1).    (22)

On the one hand, if the heatmap regression model uses a low input resolution (or a large downsampling stride s ≫ 1), the fractional part (ε_x, ε_y) mainly comes from the downsampling of the heatmap; on the other hand, if the heatmap regression model uses a high input resolution, the annotation process will also have a significant influence on the heatmap regression. Therefore, when using a high input resolution model in practice, a diverse set of human annotators helps reduce the bias in the annotation process.

Fig. 5. An example of unbiased human annotation for semantic landmark localization.

Empirical Compensation. "Shift a quarter to the second maximum activation point" has become an effective and widely used empirical compensation method for heatmap regression [6], [55], [22], but it still lacks a proper explanation. We thus provide an intuitive explanation according to the proposed quantization system. The proposed quantization system encodes the ground truth numerical coordinates into multiple activation points, and the activation probability of each activation point is decided by the fractional part, i.e., the activation probability indicates the distance between the activation point and the ground truth activation point. Therefore, the ground truth activation point is closer to the i-th maximum activation point than to the (i+1)-th maximum activation point. We report the averaged activation probabilities for the top k activation points on the WFLW dataset in Table 1.

TABLE 1
The activation probabilities of the top k activation points.

            k=1     k=2     k=3     k=4
h_i^p(x)    0.44    0.26    0.17    0.13
NME (%)     6.45    5.07    4.71    4.68

We find that the marginal improvement decreases as the number of activation points increases, i.e., the second maximum activation point provides the largest improvement to the reconstruction of the fractional part. This observation partially explains the effectiveness of the compensation method "shift a quarter to the second maximum activation point", which can be seen as a special case of the proposed method (20) with k = 2.

Furthermore, the proposed quantization system shares the same motivation as bilinear interpolation. Specifically, bilinear interpolation usually aims to find the value of an unknown function f(x, y) given its neighbors f(x1, y1), f(x1, y2), f(x2, y1), and f(x2, y2). For the proposed quantization system, we have f(x, y) = (x, y), which indicates the location of the landmark. Specifically, if there is no heatmap error, we then have x1 = ⌊x_i^g/s⌋, x2 = ⌊x_i^g/s⌋ + 1, y1 = ⌊y_i^g/s⌋, and y2 = ⌊y_i^g/s⌋ + 1. If we take heatmap error into consideration, the ground truth activation points are usually unknown. Therefore, the number of alternative activation points also controls the trade-off between the robustness of the quantization system and the risk of noise from the heatmap error (see also Section 4.4).

5 FACIAL LANDMARK DETECTION

In this section, we perform facial landmark detection experiments. We first introduce widely used facial landmark detection datasets. We then describe the implementation details of our proposed method. Finally, we present our experimental results on different datasets and perform comprehensive ablation studies on the most challenging dataset.

5.1 Datasets

We use four widely used facial landmark detection datasets:

• WFLW [64]. WFLW contains 10,000 face images, including 7,500 training images and 2,500 testing images, with 98 manually annotated facial landmarks.

TABLE 2
Comparison with State-of-the-Arts on WFLW dataset.

NME (%), Inter-ocular
Method          test    pose    expression    illumination    make-up    occlusion    blur
ESR [29] 11.13 25.88 11.47 10.49 11.05 13.75 12.20
SDM [31] 10.29 24.10 11.45 9.32 9.38 13.03 11.28
CFSS [36] 9.07 21.36 10.09 8.30 8.74 11.76 9.96
DVLN [63] 6.08 11.54 6.78 5.73 5.98 7.33 6.88
LAB [64] 5.27 10.24 5.51 5.23 5.15 6.79 6.32
Wing [45] 5.11 8.75 5.36 4.93 5.41 6.37 5.81
3DDE [65] 4.68 8.62 5.21 4.65 4.60 5.77 5.41
DeCaFA [66] 4.62 8.11 4.65 4.41 4.63 5.74 5.38
HRNet [23] 4.60 7.86 4.78 4.57 4.26 5.42 5.36
AVS [67] 4.39 8.42 4.68 4.24 4.37 5.60 4.86
LUVLi [68] 4.37 - - - - - -
AWing [21] 4.21 7.21 4.46 4.23 4.02 4.99 4.82
H3R (ours) 3.81 6.45 4.07 3.70 3.66 4.48 4.30

TABLE 3
Comparison with State-of-the-Arts on 300W dataset.

                NME (%), Inter-ocular                       NME (%), Inter-pupil
Method          private  full  common  challenge            private  full  common  challenge
SAN [69] - 3.98 3.34 6.60 - - - -
DAN [70] 4.30 3.59 3.19 5.24 - 5.03 4.42 7.57
SHN [71] 4.05 - - - - 4.68 4.12 7.00
LAB [64] - 3.49 2.98 5.19 - 4.12 3.42 6.98
Wing [45] - - - - - 4.04 3.27 7.18
DeCaFA [66] - 3.39 2.93 5.26 - - - -
DFCE [72] 3.88 3.24 2.76 5.22 - 4.55 3.83 7.54
AVS [67] - 3.86 3.21 6.49 - 4.54 3.98 7.21
HRNet [23] 3.85 3.32 2.87 5.15 - - - -
HG-HSLE [73] - 3.28 2.85 5.03 - 4.59 3.94 7.24
LUVLi [68] - 3.23 2.76 5.16 - - - -
3DDE [65] 3.73 3.13 2.69 4.92 - 4.39 3.73 7.10
AWing [21] 3.56 3.07 2.72 4.52 - 4.31 3.77 6.52
H3R (ours) 3.48 3.02 2.65 4.58 5.07 4.24 3.67 6.60

All face images are selected from the WIDER Face dataset [74], which contains face images with large variations in scale, expression, pose, and occlusion.
• 300W [75]. 300W contains 3,148 training images, including 337 images from AFW [30], 2,000 images from the training set of HELEN [76], and 811 images from the training set of LFPW [39]. For testing, there are four different settings: 1) common: 554 images, including 330 and 224 images from the test sets of HELEN and LFPW, respectively; 2) challenge: 135 images from IBUG; 3) full: 689 images as the combination of common and challenge; and 4) private: 600 indoor/outdoor images. All images are manually annotated with 68 facial landmarks.
• COFW [34]. COFW contains 1,852 images, including 1,345 training and 507 testing images. All images are manually annotated with 29 facial landmarks.
• AFLW [77]. AFLW contains 24,386 face images, including 20,000 images for training and 4,836 images for testing. For testing, there are two settings: 1) full: all 4,836 images; and 2) front: 1,314 frontal images selected from the full set. All images are manually annotated with 21 facial landmarks. For fair comparison, we use 19 facial landmarks, i.e., the landmarks on the two ears are ignored.

5.2 Evaluation Metrics

We use the normalized mean error (NME) as the evaluation metric in this paper, i.e.,

  NME = E( ‖x_i^p − x_i^g‖_2 / d ),    (23)

where d indicates the normalization distance. For fair comparison, we report the performance on WFLW, 300W, and COFW using the two normalization methods, inter-pupil distance (the distance between the eye centers) and inter-ocular distance (the distance between the outer eye corners). We report the performance on AFLW using the size of the face bounding box as the normalization distance, i.e., the normalization distance can be evaluated by d = √(w · h), where w and h indicate the width and height of the face bounding box, respectively.
the landmarks on two ears are ignored. W18 as the backbone network in our experiments. All face

images are cropped and resized to 256 × 256 pixels, and the downsampling stride of the feature map is 4 pixels. For training, we perform widely-used data augmentation for facial landmark detection as follows. We horizontally flip all training images with probability 0.5 and randomly change the brightness (±0.125), contrast (±0.5), and saturation (±0.5) of each image. We then randomly rotate the image (±30°), rescale the image (±0.25), and translate the image (±16 pixels). We also randomly erase a rectangular region in the training image [81]. All our models are initialized from the weights pretrained on ImageNet [82]. We use the Adam optimizer [83] with batch size 16. The learning rate starts from 0.001 and is divided by 10 every 60 epochs, with 150 training epochs in total. During the testing phase, we horizontally flip testing images as data augmentation and average the predictions.

TABLE 4
Comparison with State-of-the-Arts on COFW dataset.

                  NME (%)
Method            Inter-ocular    Inter-pupil
SHN [71]          -               5.60
LAB [64]          -               5.58
DFCE [72]         -               5.27
3DDE [65]         -               5.11
Wing [45]         -               5.07
HRNet [23]        3.45            -
AWing [21]        -               4.94
H3R (ours)        3.15            4.55

TABLE 5
Comparison with State-of-the-Arts on AFLW dataset.

                  NME (%)
Method            full     front
DFCE [72]         2.12     -
3DDE [65]         2.01     -
SAN [69]          1.91     1.85
LAB [64]          1.85     1.62
Wing [45]         1.65     -
HRNet [23]        1.57     1.46
LUVLi [68]        1.39     1.19
H3R (ours)        1.27     1.11

5.4 Comparison with Current State-of-the-Art

To demonstrate the effectiveness of the proposed method, we compare it with recent state-of-the-art facial landmark detection methods as follows. As shown in Table 2, the proposed method outperforms recent state-of-the-art methods on the most challenging dataset, WFLW, with a clear margin in all different settings. For the 300W dataset, we try to report the performances under different settings for fair comparison. As shown in Table 3, the proposed method achieves comparable performances in all different settings. Specifically, LAB [64] uses boundary information as auxiliary supervision; compared to Wing [45], which uses coordinate regression for semantic landmark localization, the heatmap-based methods usually achieve better performance on the challenge subset. As shown in Table 4, the proposed method outperforms recent state-of-the-art methods with a clear margin on the COFW dataset. AFLW captures a wide range of different face poses, including both frontal faces and non-frontal faces. As shown in Table 5, the proposed method achieves consistent improvements for both frontal and non-frontal faces, suggesting robustness across different face poses.

5.5 Ablation Studies

To better understand the proposed quantization system in different settings, we perform ablation studies on the most challenging dataset, WFLW [64].

Fig. 6. The influence of different input resolutions.

The influence of different input resolutions. Heatmap regression models use a fixed input resolution, e.g., 256 × 256 pixels, but training and testing images usually cover a wide range of resolutions, e.g., most faces in the WFLW dataset have an inter-ocular distance of between 30 and 120 pixels. Therefore, we compare the proposed method with the baseline using input resolutions from 64 × 64 pixels to 512 × 512 pixels, i.e., heatmap resolutions from 16 × 16 pixels to 128 × 128 pixels. In Fig. 6, the proposed method significantly improves heatmap regression performance when using a low input resolution. The increasing number of high-resolution images/videos in real-world applications is a challenge with respect to the computational cost and device memory needed to overcome the problem of sub-pixel localization by increasing the input resolution of deep learning-based heatmap regression models. For example, in the film industry, it has sometimes become necessary to swap the appearance of a target actor and a source actor to generate higher-fidelity video frames in visual effects, especially when an actor is unavailable for some scenes [84]. The manipulation of actor faces in video frames relies on accurate localization of different facial landmarks and is performed at megapixel resolution, inducing a huge computational cost for extensive frame-by-frame animation. Therefore, instead of using high-resolution input images, our proposed method delivers another efficient solution for accurate semantic landmark localization.

TABLE 6
The influence of different numbers of alternative activation points when using binary heatmap.

              NME (%)
Resolution    k=1    k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10   k=11   k=12   best
512 × 512     3.980  3.946  3.930  3.923  3.916  3.912  3.909  3.906  3.903  3.901  3.898  3.897  3.890 (k=25)
384 × 384     3.932  3.875  3.855  3.850  3.842  3.837  3.833  3.830  3.828  3.827  3.826  3.826  3.825 (k=14)
256 × 256     4.005  3.881  3.836  3.832  3.819  3.815  3.810  3.808  3.807  3.807  3.807  3.808  3.807 (k=9)
128 × 128     4.637  4.164  4.029  4.023  3.997  3.991  3.988  3.989  3.991  3.994  3.998  4.002  3.988 (k=7)

TABLE 7
The influence of different numbers of alternative activation points when using Gaussian heatmap.

              NME (%)
              k=1    k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10   k=11   k=12   best
σ = 0.0       4.235  4.102  4.069  4.065  4.090  4.202  4.427  4.775  5.219  5.719  6.256  6.802  4.065 (k=4)
σ = 0.5       4.162  4.045  4.010  4.008  3.991  3.990  3.988  3.992  3.998  4.014  4.047  4.108  3.988 (k=7)
σ = 1.0       4.037  3.928  3.898  3.908  3.871  3.877  3.876  3.876  3.871  3.861  3.855  3.857  3.855 (k=11)
σ = 1.5       4.032  3.927  3.901  3.918  3.873  3.879  3.885  3.882  3.878  3.864  3.857  3.860  3.855 (k=18)
σ = 2.0       4.086  3.983  3.953  3.969  3.923  3.931  3.937  3.931  3.925  3.909  3.902  3.907  3.894 (k=25)

TABLE 8
The influence of different face "bounding box" annotation policies.

                                        NME (%), Inter-ocular
Policy (training / testing)     test    pose    expression    illumination    make-up    occlusion    blur
P1 P1 3.81 6.45 4.07 3.70 3.66 4.48 4.30
P2 P2 3.95 6.74 4.17 3.89 3.81 4.73 4.51
P1 P2 5.34 9.68 5.24 5.25 5.57 6.55 6.14
P2 P1 4.04 6.90 4.24 3.99 3.90 4.87 4.65

The influence of different numbers of alternative activation points. In the proposed quantization system, the activation probability indicates the distance between an activation point and the ground truth activation point (x_i^g/s, y_i^g/s). If there is no heatmap error, the alternative activation points in (20) give the same result as those in (18). If the heatmap error cannot be ignored, there is a trade-off on the number of alternative activation points: 1) a small k increases the risk of missing the ground truth alternative activation points; and 2) a large k introduces noise from irrelevant activation points, especially for large heatmap error. We demonstrate the performance of the proposed method using different numbers of alternative activation points in Table 6. Specifically, we see that 1) when using a high input resolution, the best performance is achieved with a relatively large k; and 2) the performance is smooth near the optimal number of alternative activation points, making it easy to find a proper k on validation data. As introduced in Section 3, binary heatmaps can be seen as a special case of Gaussian heatmaps with standard deviation σ = 0. Considering that Gaussian heatmaps have been widely used in semantic landmark localization applications, we generalize the proposed quantization system to Gaussian heatmaps and demonstrate the influence of different numbers of alternative activation points in Table 7. Specifically, we see that 1) when applying the proposed quantization system to the model using the Gaussian heatmap, it achieves comparable performance to the model using the binary heatmap; and 2) the optimal number of alternative activation points increases with the standard deviation σ.

The influence of different "bounding box" annotation policies. For facial landmark detection, a reference bounding box is required to indicate the position of the facial area. However, there is a performance gap when using different reference bounding boxes [21]. A comparison between two widely used "bounding box" annotation policies is shown in Fig. 7, and we introduce the two policies as follows:

• P1: This annotation policy is usually used in semantic landmark localization tasks, especially in facial landmark localization. Specifically, the rectangular area of the bounding box tightly encloses the set of pre-defined facial landmarks.
• P2: This annotation policy has been widely used in face detection datasets [74]. The bounding box contains the areas of the forehead, chin, and cheeks. For occluded faces, the bounding box is estimated by the human annotator based on the scale of the occlusion.

Fig. 7. A comparison between different "bounding box" annotation policies. P1: the yellow bounding boxes. P2: the blue bounding boxes.

Fig. 8. Qualitative results from the test split of the WFLW dataset (best viewed in color). Blue: the ground truth facial landmarks. Yellow: the predicted facial landmarks. The first row shows some "good" cases and the second row shows some "bad" cases.

We demonstrate the experimental results using different annotation policies in Table 8. Specifically, we find that the policy P1 usually achieves better results, possibly because the occluded forehead (e.g., hair) introduces additional variations to the face bounding boxes when using the policy P2. Furthermore, the model trained using the policy P2 is more robust to a different bounding box policy during testing, suggesting its robustness to inaccurate bounding boxes from face detection algorithms.

Qualitative Analysis. As shown in Fig. 8, we present some "good" and "bad" facial landmark detection examples according to NME. For the good cases presented in the first row, most images are of high quality; for the bad cases in the second row, most images contain heavy blurring and/or occlusion, making it difficult to accurately identify the contours of different facial parts.

6 HUMAN POSE ESTIMATION

In this section, we perform human pose estimation experiments to further demonstrate the effectiveness of the proposed quantization system for accurate semantic landmark localization.

6.1 Datasets

We perform experiments on two popular human pose estimation datasets:

• MPII [85]: The MPII Human Pose dataset contains around 28,821 images with 40,522 person instances, of which 11,701 images are for testing and the remaining 17,120 images are for training. Following the experimental setup in [22], we use 22,246 person instances for training and evaluate the performance on the MPII validation set with 2,958 person instances, which is a held-out subset of the MPII training set.
• COCO [86]: The COCO dataset contains over 200,000 images and 250,000 person instances, in which each person instance is labeled with 17 keypoints. Following the experimental setup in [22], we evaluate the proposed method on the validation set with 5,000 images.

6.2 Implementation Details

We utilize a recent state-of-the-art heatmap regression method for human pose estimation, HRNet [22], as our baseline. Specifically, the proposed quantization system can be easily integrated into most heatmap regression models, and we have made the source code of human pose estimation based on the HRNet baseline publicly available. For the MPII Human Pose dataset, we use the standard evaluation metric, the head-normalized probability of correct keypoints or PCKh [85]. Specifically, a correct keypoint should fall within α · l pixels of the ground truth position, where l indicates the normalization distance and α ∈ [0, 1] is the matching threshold. For fair comparison, we apply two different matching thresholds, [email protected] and [email protected], where the smaller matching threshold, α = 0.1, indicates a more strict evaluation metric for accurate semantic landmark localization [22]. For the COCO dataset, we use the standard evaluation metrics, averaged precision (AP) and averaged recall (AR), where the object keypoint similarity (OKS) is used as the similarity measure between the ground truth objects and the predicted objects [86].

6.3 Results

The experimental results on the MPII dataset are shown in Table 9. Specifically, when using a coarse evaluation metric, [email protected], both the proposed method and the compensation method achieve comparable performance to the baseline method, suggesting that the quantization error is trivial in coarse semantic landmark localization; when using a more strict evaluation metric, [email protected], the compensation method, which can be seen as a special case of H3R with k = 2, significantly improves the baseline, e.g., from 32.8 to 37.7, while the proposed method H3R further improves the performance from 37.7 to 39.3. The experimental results on the COCO dataset are shown in Table 10. Specifically, the proposed method clearly improves the averaged precision (AP) in different settings, and the major improvements on AP come from: 1) a strict evaluation metric, e.g., AP.75; and 2) large/medium person instances, i.e., AP(M) and AP(L). Furthermore, we also find that the improvement decreases when increasing the input resolution, e.g., from 0.674 to 0.720 for 192 × 128 (+0.046), from 0.723 to 0.750 for 256 × 192 (+0.027), and from 0.748 to 0.762 for 384 × 288 (+0.014).

TABLE 9
Results on MPII Human Pose dataset. In each block, the first row with "-" indicates the baseline method, i.e., k = 1; the second row with "*"
indicates the compensation method, i.e., "shift a quarter to the second maximum activation point".

Backbone Input H3R Head Shoulder Elbow Wrist Hip Knee Ankle Mean [email protected]
ResNet-50 256 ˆ 256 - 96.3 95.2 89.0 82.9 88.2 83.7 79.4 88.4 29.8
ResNet-50 256 ˆ 256 * 96.4 95.3 89.0 83.2 88.4 84.0 79.6 88.5 34.0
ResNet-50 256 ˆ 256 X 96.3 95.2 88.8 83.4 88.5 84.3 79.8 88.6 34.9
ResNet-101 256 ˆ 256 - 96.7 95.8 89.3 84.2 87.9 84.2 80.7 88.9 30.0
ResNet-101 256 ˆ 256 * 96.9 95.9 89.5 84.4 88.4 84.5 80.7 89.1 34.0
ResNet-101 256 ˆ 256 X 96.7 96.0 89.3 84.4 88.5 84.3 80.6 89.1 35.0
ResNet-152 256 ˆ 256 - 97.0 95.8 89.9 84.7 89.1 85.4 81.3 89.5 31.0
ResNet-152 256 ˆ 256 * 97.0 95.9 90.0 85.0 89.2 85.3 81.3 89.6 35.0
ResNet-152 256 ˆ 256 X 96.8 95.9 90.1 84.9 89.3 85.3 81.3 89.6 36.2
HRNet-W32 256 ˆ 256 - 97.0 95.7 90.1 86.3 88.4 86.8 82.9 90.1 32.8
HRNet-W32 256 ˆ 256 * 97.1 95.9 90.3 86.4 89.1 87.1 83.3 90.3 37.7
HRNet-W32 256 ˆ 256 X 97.1 96.1 90.8 86.1 89.2 86.4 82.6 90.3 39.3

TABLE 10
Results on COCO validation set. In each block, the first row with "-" indicates the baseline, i.e., k = 1; the second row with "*" indicates the
compensation method, i.e., "shift a quarter to the second maximum activation point", which can be seen as a special case of H3R with k = 2.

Backbone Input H3R AP AP.5 AP.75 AP(M) AP(L) AR AR.5 AR.75 AR(M) AR(L)


HRNet-W32 192 ˆ 128 - 0.674 0.890 0.771 0.648 0.732 0.739 0.932 0.828 0.700 0.795
HRNet-W32 192 ˆ 128 * 0.710 0.892 0.792 0.682 0.771 0.770 0.933 0.844 0.732 0.827
HRNet-W32 192 ˆ 128 X 0.720 0.892 0.797 0.691 0.784 0.777 0.933 0.846 0.739 0.834
HRNet-W32 256 ˆ 192 - 0.723 0.904 0.811 0.690 0.788 0.782 0.941 0.859 0.741 0.841
HRNet-W32 256 ˆ 192 * 0.744 0.905 0.819 0.708 0.810 0.798 0.942 0.865 0.757 0.858
HRNet-W32 256 ˆ 192 X 0.750 0.906 0.820 0.715 0.817 0.802 0.942 0.865 0.761 0.861
HRNet-W48 256 ˆ 192 - 0.730 0.904 0.817 0.693 0.798 0.788 0.943 0.864 0.745 0.852
HRNet-W48 256 ˆ 192 * 0.751 0.906 0.822 0.715 0.818 0.804 0.943 0.867 0.762 0.864
HRNet-W48 256 ˆ 192 X 0.756 0.906 0.825 0.718 0.825 0.806 0.941 0.868 0.763 0.869
HRNet-W32 384 ˆ 288 - 0.748 0.904 0.826 0.712 0.816 0.802 0.941 0.871 0.759 0.864
HRNet-W32 384 ˆ 288 * 0.758 0.906 0.825 0.720 0.827 0.809 0.943 0.869 0.767 0.871
HRNet-W32 384 ˆ 288 X 0.762 0.905 0.830 0.725 0.833 0.812 0.942 0.873 0.769 0.874
HRNet-W48 384 ˆ 288 - 0.753 0.907 0.823 0.712 0.823 0.804 0.941 0.867 0.759 0.869
HRNet-W48 384 ˆ 288 * 0.763 0.908 0.829 0.723 0.834 0.812 0.942 0.871 0.767 0.876
HRNet-W48 384 ˆ 288 X 0.765 0.907 0.829 0.724 0.838 0.814 0.941 0.871 0.769 0.878

7 CONCLUSION
In this paper, we address the problem of sub-pixel localization for heatmap-based semantic landmark localization. We formally analyze quantization error in vanilla heatmap regression and propose a new quantization system via a randomized rounding operation, which we prove is unbiased and lossless. Experiments on facial landmark localization and human pose estimation datasets demonstrate the effectiveness of the proposed quantization system for efficient and accurate sub-pixel localization.

ACKNOWLEDGEMENT
Dr. Baosheng Yu is supported by ARC project FL-170100117.

REFERENCES
[1] Y. Sun, X. Wang, and X. Tang, "Deep convolutional network cascade for facial point detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3476–3483.
[2] A. Bulat and G. Tzimiropoulos, "How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks)," in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1021–1030.
[3] A. Sinha, C. Choi, and K. Ramani, "Deephand: Robust hand pose estimation by completing a matrix imputed with deep features," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[4] U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz, "Hand pose estimation via latent 2.5d heatmap regression," in European Conference on Computer Vision (ECCV), 2018, pp. 118–134.
[5] A. Toshev and C. Szegedy, "Deeppose: Human pose estimation via deep neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1653–1660.
[6] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in European Conference on Computer Vision (ECCV), 2016, pp. 483–499.
[7] A. Saxena, J. Driemeyer, and A. Y. Ng, "Robotic grasping of novel objects using vision," International Journal of Robotics Research (IJRR), vol. 27, no. 2, pp. 157–173, 2008. [Online]. Available: https://doi.org/10.1177/0278364907087172
[8] Y. Sun, Y. Chen, X. Wang, and X. Tang, "Deep learning face representation by joint identification-verification," in Neural Information Processing Systems (NIPS), 2014, pp. 1988–1996.
[9] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner, "Face2face: Real-time face capture and reenactment of rgb videos," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2387–2395.
[10] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese, "Densefusion: 6d object pose estimation by iterative dense fusion," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[11] J. Deng, J. Guo, X. Niannan, and S. Zafeiriou, "Arcface: Additive angular margin loss for deep face recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[12] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang, "Spindle net: Person re-identification with human body region guided feature decomposition and fusion," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1077–1085.
[13] L. Zheng, Y. Huang, H. Lu, and Y. Yang, "Pose-invariant embedding for deep person re-identification," IEEE Transactions on Image Processing (TIP), vol. 28, no. 9, pp. 4500–4509, 2019.
[14] Y.-C. Chen, X. Shen, and J. Jia, "Makeup-go: Blind reversion of portrait edit," in IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[15] H. Chang, J. Lu, F. Yu, and A. Finkelstein, "Pairedcyclegan: Asymmetric style transfer for applying and removing makeup," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 40–48.
[16] C. Cao, Y. Weng, S. Lin, and K. Zhou, "3d shape regression for real-time facial animation," ACM Transactions on Graphics (TOG), vol. 32, no. 4, pp. 1–10, 2013.
[17] C. Cao, Q. Hou, and K. Zhou, "Displaced dynamic expression regression for real-time facial tracking and animation," ACM Transactions on Graphics (TOG), vol. 33, no. 4, pp. 1–10, 2014.
[18] J. Thies, M. Zollhöfer, M. Nießner, L. Valgaerts, M. Stamminger, and C. Theobalt, "Real-time expression transfer for facial reenactment," ACM Transactions on Graphics (TOG), vol. 34, no. 6, pp. 183–1, 2015.
[19] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, "Joint training of a convolutional network and a graphical model for human pose estimation," in Neural Information Processing Systems (NIPS), 2014, pp. 1799–1807.
[20] A. Nibali, Z. He, S. Morgan, and L. Prendergast, "Numerical coordinate regression with convolutional neural networks," arXiv preprint arXiv:1801.07372, 2018.
[21] X. Wang, L. Bo, and L. Fuxin, "Adaptive wing loss for robust face alignment via heatmap regression," in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 6971–6981.
[22] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5693–5703.
[23] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao, "Deep high-resolution representation learning for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.
[24] Y. Tai, Y. Liang, X. Liu, L. Duan, J. Li, C. Wang, F. Huang, and Y. Chen, "Towards highly accurate and stable face alignment for high-resolution videos," in AAAI Conference on Artificial Intelligence (AAAI), vol. 33, 2019, pp. 8893–8900.
[25] W. Li, Z. Wang, B. Yin, Q. Peng, Y. Du, T. Xiao, G. Yu, H. Lu, Y. Wei, and J. Sun, "Rethinking on multi-stage networks for human pose estimation," arXiv preprint arXiv:1901.00148, 2019.
[26] F. Zhang, X. Zhu, H. Dai, M. Ye, and C. Zhu, "Distribution-aware coordinate representation for human pose estimation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 7093–7102.
[27] P. Raghavan and C. D. Tompson, "Randomized rounding: a technique for provably good algorithms and algorithmic proofs," Combinatorica, vol. 7, no. 4, pp. 365–374, 1987.
[28] B. Korte, J. Vygen, B. Korte, and J. Vygen, Combinatorial optimization. Springer, 2012, vol. 2.
[29] X. Cao, Y. Wei, F. Wen, and J. Sun, "Face alignment by explicit shape regression," International Journal of Computer Vision (IJCV), vol. 107, no. 2, pp. 177–190, 2014.
[30] X. Zhu and D. Ramanan, "Face detection, pose estimation, and landmark localization in the wild," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2879–2886.
[31] X. Xiong and F. De la Torre, "Supervised descent method and its applications to face alignment," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 532–539.
[32] S. Ren, X. Cao, Y. Wei, and J. Sun, "Face alignment at 3000 fps via regressing local binary features," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1685–1692.
[33] V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1867–1874.
[34] X. P. Burgos-Artizzu, P. Perona, and P. Dollár, "Robust face landmark estimation under occlusion," in IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1513–1520.
[35] J. Zhang, S. Shan, M. Kan, and X. Chen, "Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment," in European Conference on Computer Vision (ECCV), 2014, pp. 1–16.
[36] S. Zhu, C. Li, C. Change Loy, and X. Tang, "Face alignment by coarse-to-fine shape searching," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4998–5006.
[37] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik, "Human pose estimation with iterative error feedback," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4733–4742.
[38] V. Belagiannis and A. Zisserman, "Recurrent human pose estimation," in IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2017, pp. 468–475.
[39] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar, "Localizing parts of faces using a consensus of exemplars," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 35, no. 12, pp. 2930–2940, 2013.
[40] X. Miao, X. Zhen, X. Liu, C. Deng, V. Athitsos, and H. Huang, "Direct shape regression networks for end-to-end face alignment," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5040–5049.
[41] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, "Facial landmark detection by deep multi-task learning," in European Conference on Computer Vision (ECCV), 2014, pp. 94–108.
[42] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters (SPL), vol. 23, no. 10, pp. 1499–1503, 2016.
[43] R. Ranjan, V. M. Patel, and R. Chellappa, "Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 41, no. 1, pp. 121–135, 2017.
[44] R. Girshick, "Fast r-cnn," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
[45] Z.-H. Feng, J. Kittler, M. Awais, P. Huber, and X.-J. Wu, "Wing loss for robust facial landmark localisation with convolutional neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2235–2245.
[46] A. Bulat and G. Tzimiropoulos, "Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources," in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3706–3714.
[47] D. Merget, M. Rock, and G. Rigoll, "Robust facial landmark detection via a fully-convolutional local-global context network," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 781–790.
[48] T. Pfister, J. Charles, and A. Zisserman, "Flowing convnets for human pose estimation in videos," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1913–1921.
[49] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele, "Deepcut: Joint subset partition and labeling for multi person pose estimation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4929–4937.
[50] A. Newell, Z. Huang, and J. Deng, "Associative embedding: End-to-end learning for joint detection and grouping," in Neural Information Processing Systems (NIPS), 2017, pp. 2277–2287.
[51] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang, "Learning feature pyramids for human pose estimation," in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1281–1290.
[52] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy, "Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model," in European Conference on Computer Vision (ECCV), 2018, pp. 269–286.
[53] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4724–4732.
[54] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2d pose estimation using part affinity fields," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7291–7299.
[55] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, "Cascaded pyramid network for multi-person pose estimation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7103–7112.
[56] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy, "Towards accurate multi-person pose estimation in the wild," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4903–4911.
[57] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei, "Integral human pose regression," in European Conference on Computer Vision (ECCV), 2018, pp. 529–545.
[58] D. C. Luvizon, D. Picard, and H. Tabia, "2d/3d pose estimation and action recognition using multitask deep learning," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5137–5146.
[59] D. C. Luvizon, H. Tabia, and D. Picard, "Human pose regression by combining indirect part detection and contextual information," Computers & Graphics, vol. 85, pp. 15–22, 2019.
[60] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, "Lift: Learned invariant feature transform," in European Conference on Computer Vision (ECCV), 2016, pp. 467–483.
[61] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," Journal of Machine Learning Research (JMLR), vol. 17, no. 1, pp. 1334–1373, 2016.
[62] J. Thewlis, H. Bilen, and A. Vedaldi, "Unsupervised learning of object landmarks by factorized spatial embeddings," in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5916–5925.
[63] W. Wu and S. Yang, "Leveraging intra and inter-dataset variations for robust face alignment," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 150–159.
[64] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou, "Look at boundary: A boundary-aware face alignment algorithm," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2129–2138.
[65] R. Valle, J. M. Buenaposada, A. Valdés, and L. Baumela, "Face alignment using a 3d deeply-initialized ensemble of regression trees," Computer Vision and Image Understanding (CVIU), vol. 189, p. 102846, 2019.
[66] A. Dapogny, K. Bailly, and M. Cord, "Decafa: Deep convolutional cascade for face alignment in the wild," in IEEE International Conference on Computer Vision (ICCV), October 2019.
[67] S. Qian, K. Sun, W. Wu, C. Qian, and J. Jia, "Aggregation via separation: Boosting facial landmark detector with semi-supervised style translation," in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 10153–10163.
[68] A. Kumar, T. K. Marks, W. Mou, Y. Wang, M. Jones, A. Cherian, T. Koike-Akino, X. Liu, and C. Feng, "Luvli face alignment: Estimating landmarks' location, uncertainty, and visibility likelihood," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8236–8246.
[69] X. Dong, Y. Yan, W. Ouyang, and Y. Yang, "Style aggregated network for facial landmark detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 379–388.
[70] M. Kowalski, J. Naruniec, and T. Trzcinski, "Deep alignment network: A convolutional neural network for robust face alignment," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 88–97.
[71] J. Yang, Q. Liu, and K. Zhang, "Stacked hourglass network for robust facial landmark localisation," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 79–87.
[72] R. Valle, J. M. Buenaposada, A. Valdés, and L. Baumela, "A deeply-initialized coarse-to-fine ensemble of regression trees for face alignment," in European Conference on Computer Vision (ECCV), 2018, pp. 585–601.
[73] X. Zou, S. Zhong, L. Yan, X. Zhao, J. Zhou, and Y. Wu, "Learning robust facial landmark detection via hierarchical structured ensemble," in IEEE International Conference on Computer Vision (ICCV), October 2019.
[74] S. Yang, P. Luo, C.-C. Loy, and X. Tang, "Wider face: A face detection benchmark," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5525–5533.
[75] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, "300 faces in-the-wild challenge: The first facial landmark localization challenge," in IEEE International Conference on Computer Vision Workshops (ICCVW), 2013, pp. 397–403.
[76] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang, "Interactive facial feature localization," in European Conference on Computer Vision (ECCV), 2012, pp. 679–692.
[77] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof, "Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization," in IEEE International Conference on Computer Vision Workshops (ICCVW), 2011, pp. 2144–2151.
[78] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "Pytorch: An imperative style, high-performance deep learning library," in Neural Information Processing Systems (NeurIPS), H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 8024–8035.
[79] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[80] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer, 2015, pp. 234–241.
[81] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, "Random erasing data augmentation," AAAI Conference on Artificial Intelligence (AAAI), 2020.
[82] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[83] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[84] J. Naruniec, L. Helminger, C. Schroers, and R. Weber, "High-resolution neural face swapping for visual effects," Eurographics Symposium on Rendering, vol. 39, no. 4, 2020.
[85] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2d human pose estimation: New benchmark and state of the art analysis," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 3686–3693.
[86] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in European Conference on Computer Vision (ECCV), 2014, pp. 740–755.

Baosheng Yu received a B.E. from the University of Science and Technology of China in 2014, and a Ph.D. from The University of Sydney in 2019. He is currently a Research Fellow in the School of Computer Science and the Faculty of Engineering at The University of Sydney, NSW, Australia. His research interests include machine learning, computer vision, and deep learning.

Dacheng Tao (F'15) is the President of the JD Explore Academy and a Senior Vice President of JD.com. He is also an advisor and chief scientist of the digital science institute in the University of Sydney. He mainly applies statistics and mathematics to artificial intelligence and data science, and his research is detailed in one monograph and over 200 publications in prestigious journals and proceedings at leading conferences. He received the 2015 Australian Scopus-Eureka Prize, the 2018 IEEE ICDM Research Contributions Award, and the 2021 IEEE Computer Society McCluskey Technical Achievement Award. He is a fellow of the Australian Academy of Science, AAAS, and ACM.
APPENDIX A
PROOFS OF THEOREM 1 AND THEOREM 2

Theorem 1. Given an unbiased quantization system defined by the encode operation in (10) and the decode operation in (13), we then have that the quantization error is tightly upper bounded, i.e.,
$$\|\mathbf{x}_i^p - \mathbf{x}_i^g\|_2 \le \sqrt{2}\,s/2,$$
where $s > 1$ indicates the downsampling stride of the heatmap.

Proof. Given the ground truth numerical coordinate $\mathbf{x}_i^g = (x_i^g, y_i^g)$, the predicted numerical coordinate $\mathbf{x}_i^p = (x_i^p, y_i^p)$, and the downsampling stride of the heatmap $s > 1$, if there is no heatmap error, we then have
$$h_i^p(\mathbf{x}) = h_i^g(\mathbf{x}),$$
where $h_i^p(\mathbf{x})$ and $h_i^g(\mathbf{x})$ indicate the predicted heatmap and the ground truth heatmap, respectively. Therefore, according to the decode operation in (13), we have the predicted numerical coordinate as
$$x_i^p/s = \begin{cases} \lfloor x_i^g/s \rfloor + t - 0.5 & \text{if } \epsilon_x < t, \\ \lfloor x_i^g/s \rfloor + t + 0.5 & \text{otherwise}, \end{cases}
\qquad
y_i^p/s = \begin{cases} \lfloor y_i^g/s \rfloor + t - 0.5 & \text{if } \epsilon_y < t, \\ \lfloor y_i^g/s \rfloor + t + 0.5 & \text{otherwise}, \end{cases}$$
where $\epsilon_x = x_i^g/s - \lfloor x_i^g/s \rfloor$ and $\epsilon_y = y_i^g/s - \lfloor y_i^g/s \rfloor$. The quantization error of the vanilla quantization system can then be evaluated as follows:
$$|x_i^p/s - x_i^g/s| = \begin{cases} |t - \epsilon_x - 0.5| & \text{if } \epsilon_x < t, \\ |t - \epsilon_x + 0.5| & \text{otherwise}, \end{cases}
\qquad
|y_i^p/s - y_i^g/s| = \begin{cases} |t - \epsilon_y - 0.5| & \text{if } \epsilon_y < t, \\ |t - \epsilon_y + 0.5| & \text{otherwise}. \end{cases}$$
The maximum quantization error $|x_i^p - x_i^g| = s/2$ is achieved when $\epsilon_x = t$; similarly, the maximum quantization error $|y_i^p - y_i^g| = s/2$ is achieved when $\epsilon_y = t$. Considering that $x_i^p$ and $y_i^p$ are linearly independent variables, we thus have
$$\|\mathbf{x}_i^p - \mathbf{x}_i^g\|_2 = \sqrt{(x_i^p - x_i^g)^2 + (y_i^p - y_i^g)^2} \le \sqrt{2}\,s/2.$$
The maximum quantization error is achieved with $\epsilon_x = \epsilon_y = t$. That is, the quantization error in the vanilla quantization system is tightly upper bounded by $\sqrt{2}\,s/2$.

Theorem 2. Given the encode operation in (15) and the decode operation in (17), we then have that 1) the encode operation is unbiased; and 2) the quantization system is lossless, i.e., there is no quantization error.

Proof. Given the ground truth numerical coordinate $\mathbf{x}_i^g = (x_i^g, y_i^g)$, the predicted numerical coordinate $\mathbf{x}_i^p = (x_i^p, y_i^p)$, and the downsampling stride of the heatmap $s > 1$, we then have
$$\mathbb{E}\big(q(x_i^g, s)\big) = \mathbb{E}\big(P\{\epsilon_x < t\}\,\lfloor x_i^g/s \rfloor + P\{\epsilon_x \ge t\}\,(\lfloor x_i^g/s \rfloor + 1)\big) = \lfloor x_i^g/s \rfloor (1 - \epsilon_x) + (\lfloor x_i^g/s \rfloor + 1)\,\epsilon_x = x_i^g/s.$$
Similarly, we have $\mathbb{E}(q(y_i^g, s)) = y_i^g/s$. Considering that $x_i^p$ and $y_i^p$ are linearly independent variables, we thus have
$$\mathbb{E}\big(q(\mathbf{x}_i^g, s)\big) = \big(\mathbb{E}(q(x_i^g, s)),\, \mathbb{E}(q(y_i^g, s))\big) = (x_i^g/s,\, y_i^g/s).$$
Therefore, the encode operation in (15), i.e., random-round, is an unbiased encode operation for heatmap regression.

We then prove that the quantization system is lossless as follows. For the decode operation in (17), if there is no heatmap error, we then have
$$P\{\mathbf{x}_i^p/s = (\lfloor x_i^g/s \rfloor,\, \lfloor y_i^g/s \rfloor)\} = (1 - \epsilon_x)(1 - \epsilon_y),$$
$$P\{\mathbf{x}_i^p/s = (\lfloor x_i^g/s \rfloor + 1,\, \lfloor y_i^g/s \rfloor)\} = \epsilon_x (1 - \epsilon_y),$$
$$P\{\mathbf{x}_i^p/s = (\lfloor x_i^g/s \rfloor,\, \lfloor y_i^g/s \rfloor + 1)\} = (1 - \epsilon_x)\,\epsilon_y,$$
$$P\{\mathbf{x}_i^p/s = (\lfloor x_i^g/s \rfloor + 1,\, \lfloor y_i^g/s \rfloor + 1)\} = \epsilon_x \epsilon_y.$$
We can reconstruct the fractional part of $\mathbf{x}_i^g$, i.e.,
$$\begin{aligned}
(x_i^p/s,\, y_i^p/s) &= \sum_{\mathbf{x}_i^p} P\{\mathbf{x}_i^p\}\,\mathbf{x}_i^p/s \\
&= (\lfloor x_i^g/s \rfloor,\, \lfloor y_i^g/s \rfloor)(1 - \epsilon_x)(1 - \epsilon_y) + (\lfloor x_i^g/s \rfloor + 1,\, \lfloor y_i^g/s \rfloor)\,\epsilon_x(1 - \epsilon_y) \\
&\quad + (\lfloor x_i^g/s \rfloor,\, \lfloor y_i^g/s \rfloor + 1)(1 - \epsilon_x)\,\epsilon_y + (\lfloor x_i^g/s \rfloor + 1,\, \lfloor y_i^g/s \rfloor + 1)\,\epsilon_x \epsilon_y \\
&= (x_i^g/s,\, y_i^g/s).
\end{aligned}$$
That is, $(x_i^p, y_i^p) = (x_i^g, y_i^g)$, i.e., there is no quantization error.
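As a quick numerical sanity check of both theorems, the sketch below simulates the vanilla threshold-rounding encode/decode and the randomized-rounding encode on random sub-pixel coordinates. It is an illustration written for this appendix (the stride s, the threshold t, and the 1-D simplification, where the four neighbouring activation points of the 2-D proof reduce to two, are assumptions), not code from the released repository.

```python
import numpy as np

rng = np.random.default_rng(0)
s = 4.0   # downsampling stride of the heatmap (assumed value)
t = 0.5   # rounding threshold of the vanilla system (assumed value)

# ---- Theorem 1: vanilla quantization error is bounded by sqrt(2) * s / 2 ----
xy = rng.uniform(0, 64, size=(100_000, 2))           # ground-truth coordinates
eps = xy / s - np.floor(xy / s)                       # fractional parts
m = np.floor(xy / s) + (eps >= t)                     # encoded integer indices
xy_hat = s * (m + t - 0.5)                            # decoded coordinates
err = np.linalg.norm(xy_hat - xy, axis=1)
print(err.max(), np.sqrt(2) * s / 2)                  # max error approaches the bound 2.83

# ---- Theorem 2: randomized rounding is unbiased, and taking the expectation
# ---- over the neighbouring activation points reconstructs the fractional part.
x = 13.37                                             # a single sub-pixel coordinate
e = x / s - np.floor(x / s)
up = rng.uniform(size=1_000_000) < e                  # random-round encode (round up w.p. e)
print(s * (np.floor(x / s) + up).mean())              # -> approx 13.37 (unbiased)
probs = np.array([1 - e, e])                          # exact activation probabilities
points = np.array([np.floor(x / s), np.floor(x / s) + 1])
print(s * np.dot(probs, points))                      # -> exactly 13.37 (lossless decode)
```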
APPENDIX B
EXPERIMENTS
In this section, we provide additional experimental results on facial landmark detection and human pose estimation.

B.1 Facial Landmark Detection

TABLE 11
The influence of different numbers of training samples when using different input resolutions. In each cell, the first number indicates the performance of the baseline method and the second number indicates the performance of the proposed method.

NME (%)
#Samples 256 × 256 128 × 128 64 × 64
256 5.67/5.46 6.02/5.47 7.74/6.20
1024 4.79/4.63 5.27/4.77 7.13/5.43
4096 4.14/3.97 4.81/4.17 6.57/4.76
7500 4.00/3.81 4.62/3.99 6.50/4.62

The influence of different numbers of training samples. The proposed quantization system does not rely on any assumption about the number of training samples, and is lossless for heatmap regression if there is no heatmap error. However, heatmap prediction performance will be influenced by the number of training samples: increasing the number of training samples improves the model generalizability from the learning theory perspective. Therefore, we perform experiments to evaluate the influence of the proposed method when using different numbers of training samples in practice. As shown in Table 11, we find that 1) the proposed method delivers consistent improvements when using different numbers of training samples; and 2) increasing the number of training samples significantly improves the performance of heatmap regression models with low-resolution input images.
TABLE 12
Comparison of different backbone networks and feature maps on WFLW dataset.

NME (%), Inter-ocular
Backbone Input Heatmap FLOPs #Params test pose expression illumination make-up occlusion blur
HRNet-W18 256 ˆ 256 64 ˆ 64 4.84G 9.69M 3.81 6.45 4.07 3.70 3.66 4.48 4.30
HRNet-W18 256 ˆ 256 256 ˆ 256 4.98G 9.69M 3.91 6.83 4.09 3.82 3.70 4.59 4.41
U-Net 256 ˆ 256 256 ˆ 256 60.55G 31.46M 4.93 9.30 5.08 4.74 4.82 6.43 5.72
HRNet-W18 128 ˆ 128 32 ˆ 32 1.21G 9.69M 3.99 6.78 4.26 3.89 3.84 4.65 4.46
HRNet-W18 128 ˆ 128 128 ˆ 128 1.24G 9.69M 4.06 6.89 4.41 3.97 3.95 4.78 4.52
U-Net 128 ˆ 128 128 ˆ 128 15.14G 31.46M 4.17 7.10 4.45 4.06 4.03 5.00 4.73
HRNet-W18 64 ˆ 64 16 ˆ 16 0.30G 9.69M 4.64 7.77 5.05 4.52 4.59 5.31 4.96
HRNet-W18 64 ˆ 64 64 ˆ 64 0.31G 9.69M 4.61 7.70 5.00 4.44 4.58 5.28 4.94
U-Net 64 ˆ 64 64 ˆ 64 3.79G 31.46M 4.37 7.18 4.69 4.26 4.30 5.12 4.80

The influence of different backbone networks. If we do not take the heatmap prediction error into consideration, the quantization error in heatmap regression is caused by the downsampling of heatmaps: 1) the downsampling of input images and 2) the downsampling of CNN feature maps. Though the analysis of heatmap prediction error is out of the scope of this paper, we perform some experiments to demonstrate the influence of different feature maps from the backbone networks in practice. Specifically, we perform experiments using the following two settings: 1) upsampling the feature maps from HRNet [22]; or 2) using the feature maps from U-shape backbone networks, i.e., U-Net [80]. As shown in Table 12, we see that 1) directly upsampling the feature maps achieves comparable performance with the baseline method; 2) U-Net performs better than HRNet-W18 when using a small input resolution (e.g., 64 × 64 pixels), while it is significantly worse than HRNet when using a large input resolution (e.g., 256 × 256 pixels); and 3) U-Net contains more parameters and requires much more computation than HRNet when using the same input resolution. It would be interesting to further explore more efficient U-shape networks for low-resolution heatmap-based semantic landmark localization.

TABLE 13
The influence of different types of heatmap when using the heatmap regression model with different input resolutions.

NME (%)
Heatmap 256 × 256 128 × 128 64 × 64
Gaussian 3.86 4.21 4.99
Binary 3.81 3.99 4.62

The influence of different types of heatmap. We perform some experiments to demonstrate the influence of using different types of heatmap, i.e., the Gaussian heatmap and the binary heatmap. As shown in Table 13, 1) when using a large input resolution, the heatmap regression model using either the Gaussian heatmap or the binary heatmap achieves comparable performance; and 2) when using a low input resolution, the heatmap regression model achieves better performance with the binary heatmap. We demonstrate the differences between the binary heatmap and the Gaussian heatmap in Figure 9. Specifically, the Gaussian heatmap improves the robustness of heatmap prediction, while at the risk of increasing the uncertainty on the maximum activation point in the predicted heatmap. Therefore, when training very efficient heatmap regression models using a low input resolution, we recommend the binary heatmap.

Fig. 9. An intuitive example of the ground truth heatmap on the WFLW dataset using different $\sigma$. We plot all heatmaps $h_1^g, \ldots, h_{98}^g$ into a single figure for better visualization.

The qualitative comparison between the vanilla quantization system and the proposed quantization system. We provide some demo images for facial landmark detection using both the baseline method (i.e., k = 1) and the proposed quantization method (i.e., k = 9) to demonstrate the effectiveness of the proposed method for accurate semantic landmark localization. As shown in Fig. 10, we see that the black landmarks are closer to the blue landmarks than the yellow landmarks, especially when using low-resolution models (e.g., 64 × 64 pixels).

B.2 Human Pose Estimation
To utilize the proposed method for accurate semantic landmark localization, only one hyper-parameter k, i.e., the number of activation points, needs to be chosen. To better understand the effectiveness of the proposed method for human pose estimation, we perform ablation studies using different numbers of alternative activation points on both the MPII and COCO datasets. As shown in Table 14 and Table 15, we find that the proposed method achieves comparable performance when k is between 10 and 25, making it easy to choose a proper k on the validation data for human pose estimation applications.
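For reference, a minimal sketch of such an ablation is given below: a top-k decoding routine, written as a score-weighted average over the k largest activation points, is applied to a batch of heatmaps for several values of k. The normalization of the selected scores, the stride value, and the use of random arrays in place of real predicted heatmaps are assumptions made purely for illustration; this is not the released implementation.

```python
import numpy as np

def decode_topk_2d(heatmap, k, stride=4.0):
    """Decode a single-channel heatmap into a numerical (x, y) coordinate by
    averaging the k largest activation points, weighted by their scores."""
    h, w = heatmap.shape
    flat = heatmap.ravel()
    idx = np.argpartition(flat, -k)[-k:]   # indices of the k largest scores
    scores = flat[idx]
    weights = scores / scores.sum()        # assumed normalization of the scores
    ys, xs = np.unravel_index(idx, (h, w))
    return stride * np.dot(weights, xs), stride * np.dot(weights, ys)

# Toy sweep over k on random "predicted" heatmaps (mechanics only, no real data).
rng = np.random.default_rng(0)
heatmaps = rng.random(size=(16, 64, 48))   # a batch of 64x48 heatmaps
for k in (1, 2, 9, 25):
    coords = [decode_topk_2d(hm, k) for hm in heatmaps]
    print(k, np.round(np.mean(coords, axis=0), 2))   # mean decoded coordinate per k
```

In practice, the loop over k would be run on the validation split and the value giving the best localization metric (e.g., NME or AP) would be selected, as described above.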
Fig. 10. Qualitative results from the test split of the WFLW dataset (best viewed in color). Blue: the ground truth facial landmarks. Yellow: the facial landmarks predicted by the vanilla quantization system (i.e., k = 1). Black: the facial landmarks predicted by the proposed quantization system (i.e., k = 9). For the above three rows, we use heatmap regression models with input resolutions of 256 × 256, 128 × 128, and 64 × 64 pixels, respectively.

TABLE 14
Results on the COCO validation set. The input resolution is 256 × 192 pixels. The first row indicates the performance when using the compensation method "shift a quarter to the second maximum activation point".

Backbone k AP AP.5 AP.75 AP(M) AP(L) AR AR.5 AR.75 AR(M) AR(L)


HRNet-W32 - 0.744 0.905 0.819 0.708 0.810 0.798 0.942 0.865 0.757 0.858
HRNet-W32 1 0.723 0.904 0.811 0.690 0.788 0.782 0.941 0.859 0.741 0.841
HRNet-W32 2 0.738 0.905 0.817 0.702 0.805 0.793 0.942 0.864 0.752 0.854
HRNet-W32 3 0.743 0.905 0.819 0.708 0.810 0.797 0.942 0.865 0.756 0.857
HRNet-W32 4 0.743 0.904 0.819 0.709 0.809 0.797 0.941 0.866 0.756 0.857
HRNet-W32 5 0.747 0.904 0.819 0.712 0.814 0.800 0.940 0.866 0.759 0.859
HRNet-W32 6 0.747 0.905 0.820 0.711 0.814 0.799 0.941 0.865 0.759 0.859
HRNet-W32 7 0.745 0.905 0.820 0.709 0.812 0.798 0.941 0.864 0.757 0.857
HRNet-W32 8 0.746 0.905 0.819 0.711 0.813 0.799 0.942 0.863 0.759 0.858
HRNet-W32 9 0.746 0.905 0.819 0.710 0.812 0.798 0.942 0.863 0.758 0.857
HRNet-W32 10 0.749 0.905 0.820 0.713 0.815 0.801 0.943 0.864 0.760 0.860
HRNet-W32 11 0.750 0.906 0.820 0.715 0.817 0.802 0.942 0.865 0.761 0.861
HRNet-W32 12 0.750 0.906 0.821 0.714 0.817 0.801 0.942 0.865 0.760 0.861
HRNet-W32 13 0.749 0.906 0.821 0.713 0.816 0.800 0.942 0.865 0.760 0.860
HRNet-W32 14 0.749 0.905 0.820 0.713 0.816 0.800 0.941 0.864 0.760 0.860
HRNet-W32 15 0.749 0.906 0.820 0.713 0.817 0.800 0.942 0.863 0.760 0.860
HRNet-W32 16 0.749 0.905 0.820 0.713 0.816 0.800 0.941 0.863 0.760 0.860
HRNet-W32 17 0.750 0.905 0.821 0.714 0.817 0.801 0.942 0.863 0.761 0.860
HRNet-W32 18 0.750 0.905 0.820 0.713 0.817 0.801 0.942 0.864 0.760 0.860
HRNet-W32 19 0.750 0.904 0.820 0.714 0.816 0.801 0.941 0.864 0.760 0.860
HRNet-W32 20 0.749 0.904 0.820 0.713 0.816 0.800 0.940 0.864 0.760 0.859
HRNet-W32 21 0.747 0.904 0.819 0.712 0.813 0.799 0.940 0.862 0.758 0.858
HRNet-W32 22 0.749 0.904 0.819 0.713 0.815 0.799 0.941 0.863 0.759 0.859
HRNet-W32 23 0.749 0.904 0.819 0.713 0.816 0.800 0.941 0.863 0.760 0.860
HRNet-W32 24 0.749 0.904 0.820 0.713 0.817 0.800 0.941 0.863 0.759 0.860
HRNet-W32 25 0.749 0.905 0.820 0.713 0.817 0.800 0.941 0.864 0.760 0.860
HRNet-W32 30 0.749 0.905 0.819 0.714 0.818 0.800 0.941 0.862 0.759 0.859
HRNet-W32 35 0.748 0.902 0.819 0.712 0.816 0.799 0.940 0.862 0.758 0.859
HRNet-W32 40 0.746 0.901 0.818 0.710 0.814 0.797 0.939 0.861 0.756 0.857
HRNet-W32 45 0.746 0.901 0.815 0.710 0.814 0.796 0.939 0.860 0.755 0.857
HRNet-W32 50 0.746 0.901 0.814 0.710 0.816 0.797 0.938 0.859 0.755 0.857
HRNet-W32 75 0.741 0.899 0.811 0.705 0.810 0.792 0.936 0.856 0.751 0.853
HRNet-W32 100 0.734 0.897 0.805 0.697 0.803 0.785 0.934 0.850 0.743 0.846
HRNet-W32 125 0.717 0.893 0.795 0.682 0.785 0.770 0.930 0.841 0.727 0.831
HRNet-W32 150 0.651 0.889 0.755 0.629 0.702 0.713 0.925 0.809 0.680 0.762
TABLE 15
Results on the MPII validation set. The input resolution is 256 × 256 pixels. The first row indicates the performance when using the compensation method "shift a quarter to the second maximum activation point".

Backbone k Head Shoulder Elbow Wrist Hip Knee Ankle Mean [email protected]
HRNet-W32 - 97.1 95.9 90.3 86.4 89.1 87.1 83.3 90.3 37.7
HRNet-W32 1 97.033 95.703 90.131 86.312 88.402 86.802 82.924 90.055 32.808
HRNet-W32 2 97.033 95.856 90.302 86.416 88.818 87.004 83.325 90.247 35.912
HRNet-W32 3 97.135 95.975 90.302 86.347 89.250 86.741 83.018 90.271 37.668
HRNet-W32 4 97.237 95.907 90.643 86.141 89.164 86.862 83.089 90.294 36.997
HRNet-W32 5 97.203 95.839 90.677 86.141 89.147 87.124 82.994 90.315 38.774
HRNet-W32 6 97.101 95.839 90.677 86.262 89.181 86.923 83.089 90.310 38.548
HRNet-W32 7 97.135 95.822 90.609 86.313 89.199 86.943 83.113 90.312 37.913
HRNet-W32 8 97.033 95.822 90.557 86.175 89.147 86.922 82.876 90.245 38.129
HRNet-W32 9 97.067 95.822 90.694 86.329 89.112 86.942 82.876 90.284 38.457
HRNet-W32 10 97.033 95.873 90.626 86.141 89.372 86.862 82.995 90.297 39.024
HRNet-W32 11 97.101 95.907 90.592 86.244 89.406 86.842 82.853 90.302 39.139
HRNet-W32 12 97.101 95.822 90.660 86.175 89.475 86.761 82.900 90.297 38.996
HRNet-W32 13 97.101 95.822 90.694 86.141 89.475 86.741 82.829 90.289 38.855
HRNet-W32 14 97.033 95.822 90.728 86.141 89.337 86.640 82.806 90.250 38.616
HRNet-W32 15 97.169 95.754 90.609 86.124 89.337 86.801 82.475 90.211 38.798
HRNet-W32 16 97.101 95.771 90.694 86.141 89.199 86.882 82.569 90.226 38.691
HRNet-W32 17 97.067 95.822 90.609 86.142 89.285 86.821 82.617 90.224 39.053
HRNet-W32 18 97.033 95.822 90.626 86.004 89.250 86.801 82.522 90.193 39.048
HRNet-W32 19 97.033 95.839 90.626 85.953 89.250 86.822 82.593 90.198 39.058
HRNet-W32 20 96.999 95.788 90.626 86.039 89.389 86.801 82.664 90.226 39.014
HRNet-W32 21 96.999 95.771 90.677 85.935 89.337 86.862 82.451 90.187 38.855
HRNet-W32 22 96.999 95.856 90.626 86.090 89.268 86.741 82.475 90.198 38.842
HRNet-W32 23 96.999 95.890 90.609 86.038 89.164 86.801 82.499 90.187 39.037
HRNet-W32 24 96.862 95.754 90.592 86.021 89.233 86.842 82.333 90.146 39.089
HRNet-W32 25 96.930 95.754 90.506 86.055 89.302 86.721 82.309 90.135 39.126
HRNet-W32 30 96.862 95.788 90.438 86.004 89.372 86.660 82.144 90.101 38.759
HRNet-W32 35 96.623 95.873 90.523 85.936 89.354 86.701 81.884 90.075 38.964
HRNet-W32 40 96.385 95.856 90.523 85.816 89.406 86.721 81.908 90.049 38.618
HRNet-W32 45 96.317 95.669 90.506 85.746 89.285 86.660 82.026 89.990 38.579
HRNet-W32 50 96.214 95.686 90.506 85.678 89.268 86.439 81.648 89.899 38.584
HRNet-W32 55 96.044 95.669 90.404 85.335 89.337 86.439 81.506 89.815 38.496
HRNet-W32 60 95.805 95.533 90.302 85.147 89.147 86.419 81.270 89.672 38.332
HRNet-W32 65 95.703 95.618 90.251 85.079 89.164 86.419 81.081 89.646 38.218
HRNet-W32 70 95.498 95.584 90.165 85.215 89.302 86.378 80.963 89.633 38.197
HRNet-W32 75 95.259 95.533 90.029 85.147 89.147 86.318 80.467 89.487 38.054
HRNet-W32 80 95.020 95.516 90.046 84.907 89.216 86.197 80.255 89.404 37.684
HRNet-W32 85 94.679 95.465 89.927 84.925 89.337 86.258 80.042 89.357 37.463
HRNet-W32 90 94.407 95.431 89.790 84.907 89.233 86.197 79.735 89.251 37.208
HRNet-W32 95 94.065 95.414 89.773 84.890 89.302 86.097 79.263 89.165 36.719
HRNet-W32 100 93.827 95.482 89.603 84.737 89.320 86.076 78.578 89.032 36.180
HRNet-W32 105 93.520 95.448 89.603 84.599 89.199 85.996 77.964 88.884 35.532
HRNet-W32 110 93.008 95.448 89.518 84.273 89.147 85.835 77.114 88.657 34.770
HRNet-W32 115 92.565 95.296 89.501 84.085 89.112 85.794 76.405 88.483 33.352
HRNet-W32 120 92.121 95.262 89.398 83.930 89.060 85.533 75.272 88.238 31.814
HRNet-W32 125 91.473 95.160 89.160 83.554 88.991 85.472 74.162 87.939 30.065
HRNet-W32 130 90.825 94.939 89.211 83.023 89.095 85.089 72.910 87.609 27.528
HRNet-W32 135 90.246 94.667 89.006 82.577 88.852 84.807 71.209 87.164 24.757
HRNet-W32 140 89.529 94.463 88.853 82.046 88.679 83.800 69.792 86.659 21.788
HRNet-W32 145 88.404 94.124 88.717 81.310 88.645 82.773 67.997 86.053 19.136
HRNet-W32 150 87.756 93.886 88.614 80.146 88.575 81.383 66.084 85.373 16.586
HRNet-W32 175 82.401 90.829 86.927 71.515 86.014 70.504 56.826 80.109 9.204
HRNet-W32 200 68.008 87.075 82.325 62.146 79.176 58.737 41.120 72.009 5.988
