Video-Based Estimation of Pain Indicators in Dogs
Abstract
Dog owners are typically capable of recognizing behavioral cues that reveal subjective states of their dogs, such as pain. However, automatic recognition of the pain state is very challenging. This paper proposes a novel video-based, two-stream deep neural network approach for this problem. We extract and preprocess body keypoints, and compute features from both the keypoints and the RGB representation over the video. We propose an approach to deal with self-occlusions and missing keypoints. We also present a unique video-based dog behavior dataset, collected by veterinary professionals and annotated for the presence of pain, and we report good classification results with the proposed approach. This study is one of the first works on machine learning based estimation of the dog pain state.
1 Introduction
Interpretation of the behavior of animals is critical to understanding their well-being. In this paper, we propose an approach for the automatic estimation of pain states of dogs, using the video modality. This problem is important for several potential applications, including long-term automatic monitoring of dogs, assessment and diagnosis aids for veterinary clinicians, and early warning systems for non-experts. It is challenging because dogs come in different breeds with very different appearances, and exhibit different behavioral coping strategies.
Animal pain estimation from video uses facial expressions of animals, as well as cues extracted from the body and the posture of the animal. While a comprehensive approach for automatic dog pain estimation is missing from the literature, there are works on sheep, mice, pigs, and horses. Each animal requires a separate data collection effort, a different analysis approach, and a different pain annotation scheme.
In this paper, we propose a pipeline for pain estimation in dogs, incorporat-
ing a number of off-the-shelf and in-house developed tools. We also introduce
a database of dog videos, annotated by veterinary experts, to evaluate our ap-
proach. Pain estimation in dogs is an under-researched problem, and there are
limited resources; we hope that our approach will establish a good baseline, and
there will be more research to follow. We make all our code and annotations
available publicly¹.
2 Related Work
Automated pain estimation for animals focuses on indicators from the body and
the face of the animal. We first discuss pose estimation in this section, and then
move to pain estimation.
¹The raw video data cannot be shared publicly; we can support uses permitted by the informed consent forms, such as federated learning approaches. Please contact the second author if you are interested in the data.
would require bounding box detection as a first step to reduce background
interference.
Figure 1: Image-based representations for the dog. (1) Body key-joints, (2) outline boundary, (3) instance segmentation, (4) bounding box.
Body key-joints are the main representation for pose estimation approaches, as they depict the geometrical configuration of multiple body parts. Some pose estimation methods first use a bounding box to locate a single animal [1–3], or apply instance segmentation to extract the shape of the animal body. Combining the advantages of the above four representation methods can help an algorithm keep track of the target animal and exclude noise from the background.
Recent approaches for animal pose estimation, such as DeepLabCut [4] and LEAP [5], use a convolutional neural network (CNN) to extract features directly from animal images and to convert them into the key-joint representation. To train the CNNs, a human-annotated training set is used. To deal with the lack of large amounts of human-annotated data, transfer learning is the preferred solution. In DeepLabCut, a ResNet model [6] pre-trained on an object recognition task is used to enable learning animal-specific estimators from only hundreds of annotated training samples. Additionally, in [7], active learning is used to reduce the number of required annotations: the system actively suggests which images to annotate in each round. On the other hand, semi-supervised learning is used in [8], where a larger set of unannotated images is used jointly with a smaller set of annotated images to perform pose estimation. Apart from these three main approaches that can be combined (i.e. transfer learning, active learning, and semi-supervised
learning), data augmentation is also used to increase the number of available training samples.
In this work, we use the Deep High-Resolution Network (HRNet), a state-of-the-art framework for 2D pose estimation that has demonstrated superior performance on a number of case studies [9]. This approach uses an innovative network architecture, described in Section 3.1.
The pose estimation approaches mentioned above are generic and can be used with different animals. The DeepPoseKit method, for example, is applied to vinegar fly, locust, and zebra pose estimation problems [7]. However, popular human pose estimators use additional information from the human skeleton structure [10]. In a similar way, animal-specific skeleton models can be used to improve pose detection.
A number of approaches have been used or proposed for pose estimation in dogs and other canines. Modern object detection algorithms, trained on large datasets such as MS COCO [11], typically contain a “dog” class. These can be leveraged to find a bounding box (or a more detailed segmentation) for detecting dogs in an image. The work by Tsai and Huang [12] introduced a multi-stage pose estimation algorithm, which uses a Mask R-CNN model [13] to generate the contour mask map of the dog. A Fast R-CNN model [14] then performs posture recognition on the contour mask image and key part recognition on the original image. The skeleton keypoints are obtained by jointly analyzing the candidate bounding boxes and the animal pose.
In the RGBD-Dog approach [15], an RGB-D dog dataset with landmark ground truth, generated with a 3D motion capture system, is used. The authors applied a stacked hourglass network [16] to predict a set of 2D heatmaps for a given depth image, from which the 3D coordinates of the body joints can be determined. A Hierarchical Gaussian Process Latent Variable Model (H-GPLVM) [17] is applied to refine the predicted 3D joint positions and to prevent pose ambiguities. However, this method requires depth data, which is not always available.
In this work, we use the HRNet-W32 model obtained from [18], where W32 denotes the width of the high-resolution network in the last three layers. The model is initially trained on the AnimalPose dataset [19], which contains pose annotations for five animal categories (dog, cat, cow, horse, and sheep), with over 6,000 instances in more than 4,000 images and 20 annotated body keypoints. We describe the experiments in Section 5.
Automatic, computer vision based pain estimation has not been extensively
researched for dogs, but there is important work for other animals, such as
horses [21, 22], mice [23], rabbits [24] and sheep [25] (see [26] for a recent sur-
vey). These works mostly focus on the face of the animal, and use validated
clinical scales that rely on a series of observations to score the presence of pain
indicators. These observations can include the position of the ears, the visibility
of sclera in the eyes, muscle tension in certain areas like the mouth, or presence
of specific behaviors, such as baring the teeth in horses, which are exhibited
when the animal is in pain.
There is some related work on emotion estimation for dogs, which is rele-
vant as a basis for pain estimation. In [27], three emotional states (growl, sleep, and smile) are automatically estimated from sequences of
dog images, based on ears, eyes, mouth and head features. In [28], DeepLabCut
is used to estimate anger, fear, happiness, and relaxation in dogs, based on a
custom dataset of 400 images, evenly distributed over these four classes.
The main challenges for computer vision based estimation of dog pain are
the existence of many different dog breeds, which implies different sizes and
colors that change the appearance, the presence of subtle cues, such as muscle
contractions that are hidden beneath the coat, and the difficulties that plague
ordinary face and body analysis, such as pose and illumination variations. Most
importantly, there is a lack of data with pain annotations.
3 Methodology
Our proposed approach for detecting the visual indicators of a dog’s pain state consists of three main parts. The off-the-shelf model for pose estimation is discussed in Section 3.1. The feature extraction and pre-processing are described in Section 3.2, followed by the pose normalization. In Section 3.4, we propose a two-stream deep learning architecture for pain detection. Figure 2 depicts our proposed system, which takes a short, 8-frame video clip (sampled over 4 seconds) as input, estimates the dog’s pose, and produces a pain score in the 0–1 range, which can be treated as the probability that the video contains a dog in pain.
Figure 2: Overall pipeline of the proposed method.
the model input (see Figure 4 for an example of the image pre-processing).
As stated before, we use the HRNet [9] approach for pose estimation. The majority of existing pose estimation methods, such as SimpleBaseline [31] and DeepLabCut [32], adopt an encoder-decoder structure composed of a series of convolutional networks that downsample the input image from high- to low-resolution feature maps, and then apply transposed convolution layers to recover the original high-resolution feature map. HRNet, on the other hand, is composed of a central stem network that maintains a high-resolution representation throughout the process, lower-resolution branch networks added at each stage of the stem, and parallel connections between the multi-resolution branch networks. Additionally, HRNet implements multi-scale fusion between the parallel branch networks, in order to facilitate information exchange throughout the process. The last layer of the stem network outputs the keypoint estimates (see Figure 3).
Since HRNet is a top-down pose estimation algorithm, it requires the approximate location of the target object in the image and preferably takes closely cropped images as inputs. The bounding box output by the YOLOv5 model is extended by 10% to ensure that the entire dog is included, and is provided to HRNet for further processing. We use the HRNet-W32 model obtained from [18], trained on the AnimalPose dataset [19], at this stage. Figure 4 illustrates the processing of a frame for bounding box localization and keypoint extraction.
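As an illustration, this bounding-box expansion step can be sketched as follows; it is a minimal example assuming boxes in (x1, y1, x2, y2) pixel coordinates and a 10% margin on each side, with function names that are ours rather than part of any released code.

```python
def expand_box(box, img_w, img_h, margin=0.10):
    """Expand an (x1, y1, x2, y2) bounding box by a relative margin,
    clipped to the image borders. Coordinates are in pixels."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    x1 = max(0.0, x1 - margin * w)
    y1 = max(0.0, y1 - margin * h)
    x2 = min(float(img_w), x2 + margin * w)
    y2 = min(float(img_h), y2 + margin * h)
    return x1, y1, x2, y2

def crop_for_hrnet(frame, box):
    """Crop the expanded detector box from an HxWx3 frame array
    before passing it to the pose estimator."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = expand_box(box, w, h)
    return frame[int(y1):int(y2), int(x1):int(x2)]
```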
Figure 3: The structure of HRNet. The horizontal and vertical axes correspond
to the depth of the network and the scale of the feature maps, respectively
(Adapted from [9]).
We deal with missing keypoints using body symmetry and by inference based on observed keypoints over time. The central concepts are as follows:
1. If some keypoints are missing from a frame, they may be recovered and interpolated by using the neighbouring frames in the sequence. Note that this would only work in an “offline” mode.
2. If, after the first step, more than nine keypoints are missing, or the spine keypoints are missing, the frame is discarded from further processing.
3. Each leg has three keypoints (i.e. elbow, knee, and paw). If one keypoint
of a leg is missing, it can be inferred according to the position of the other
two keypoints and the spine points.
4. If more than one keypoint is missing in one leg, they can be inferred by exploiting the left-right body symmetry, or the front-back symmetry.
This approach for resolving missing data is simple and effective, and necessary
for subsequent processing of the body pose. In comparison to the head and spine
keypoints, the leg keypoints convey more detailed information about the pose.
Thus, the essence of this approach is to compensate for the missing keypoints
induced by self-occlusion.
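A minimal sketch of the first two rules (temporal interpolation and frame rejection) is given below; it assumes keypoints stored as a (T, J, 2) array with NaN marking missing detections, and the function names and threshold handling are ours.

```python
import numpy as np

def interpolate_missing(kps):
    """kps: (T, J, 2) keypoint array with NaN where a point is missing.
    Linearly interpolate each coordinate over time (offline mode)."""
    T, J, _ = kps.shape
    t = np.arange(T)
    out = kps.copy()
    for j in range(J):
        for d in range(2):
            v = out[:, j, d]
            ok = ~np.isnan(v)
            if 0 < ok.sum() < T:   # something observed, something missing
                out[:, j, d] = np.interp(t, t[ok], v[ok])
    return out

def keep_frame(frame_kps, spine_idx, max_missing=9):
    """Rule 2: reject a frame if more than `max_missing` keypoints, or any
    spine keypoint, are still missing after interpolation."""
    missing = np.isnan(frame_kps).any(axis=1)   # (J,) boolean mask
    return missing.sum() <= max_missing and not missing[spine_idx].any()
```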
Our pose estimation step defines a mapping from the RGB image to a body keypoint graph:
$$ s_i(t) \;\xrightarrow{\text{pose detector}}\; p_i(t), \quad \forall t = 1, \ldots, T_i, \qquad (1) $$
where $p_i(t)$ denotes the posture for sample $i$ at time frame $t$, and $T_i$ denotes the total sequence length for sample $i$. The whole dataset contains $N$ samples. In particular, $p_i(t)$ consists of a set of 2D coordinates:
$$ p_i(t) = \big( x_j(t), y_j(t) \big)_{i,\; j \in J}, \qquad (2) $$
The dependence on $t$ is omitted in the following for simplicity. In this paper, we use the middle point between the tail end and the neck as the root point. Thus, the root-point-centered coordinates are defined as:
$$ \bar{p}_i = \big( \bar{x}_j, \bar{y}_j \big)_{i,\; j \in J}. \qquad (4) $$
Furthermore, in order to normalize the size of the dog within the image, we re-scale the keypoint coordinates $\bar{p}_i$:
$$ \bar{p}_i = \frac{\bar{p}_i}{\lVert v_{\text{neck,tailend}} \rVert}, \qquad (5) $$
where $v_{\text{neck,tailend}}$ is the vector between the neck and the tail-end landmarks.
The transformation and normalization of the landmarks serves two purposes: the placement of the dog within the image no longer affects the processing, and the distance of the dog to the camera, as well as its size, become less important. One may assume that, if much more training data become available in the future, clustering in this normalized space, followed by cluster-specific models, may improve the modeling.
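The root-centering and scale normalization of Eqs. (4) and (5) can be sketched as follows; the landmark indices are placeholders, and the small epsilon guarding against division by zero is our addition.

```python
import numpy as np

NECK, TAIL_END = 5, 6  # placeholder landmark indices

def normalize_pose(kps):
    """kps: (J, 2) keypoints of one frame.
    Center on the root point (midpoint of neck and tail end) and scale by
    the neck-to-tail-end distance."""
    root = 0.5 * (kps[NECK] + kps[TAIL_END])
    centered = kps - root                               # Eq. (4): root-centered coordinates
    scale = np.linalg.norm(kps[NECK] - kps[TAIL_END])   # ||v_{neck,tailend}||
    return centered / (scale + 1e-8)                    # Eq. (5): size normalization
```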
$$
\begin{aligned}
i_t &= \sigma\!\left(W_i h_{t-1} + I_i x_t + b_i\right) \\
f_t &= \sigma\!\left(W_f h_{t-1} + I_f x_t + b_f\right) \\
o_t &= \sigma\!\left(W_o h_{t-1} + I_o x_t + b_o\right) \\
\tilde{c}_t &= \tanh\!\left(W_c h_{t-1} + I_c x_t + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned} \qquad (6)
$$
where $i_t$, $f_t$, $o_t$, and $c_t$ are the input gate, forget gate, output gate, and cell state vector, respectively, $\tilde{c}_t$ is the cell input activation vector, and $h_t$ is the hidden state of the LSTM units. $\sigma$ represents the logistic sigmoid activation function, and $\odot$ represents element-wise multiplication. The weight matrices $W_\ast$ act on the previous hidden state, the matrices $I_\ast$ act on the current input, and $b_\ast$ are the bias vectors.
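In practice, the keypoint branch can be realized with standard LSTM layers. The sketch below (PyTorch) processes an 8-frame sequence of flattened, normalized keypoints; the class itself is ours, with 20 joints as in AnimalPose and the four layers and 128 hidden units mentioned in Figure 5 and Table 3.

```python
import torch.nn as nn

class KeypointBranch(nn.Module):
    """LSTM branch over flattened keypoint coordinates (illustrative sizes)."""
    def __init__(self, num_joints=20, hidden=128, layers=4):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_joints * 2, hidden_size=hidden,
                            num_layers=layers, batch_first=True)

    def forward(self, kps):               # kps: (B, T, J, 2)
        B, T = kps.shape[:2]
        h, _ = self.lstm(kps.reshape(B, T, -1))
        return h                          # (B, T, hidden) hidden-state sequence
```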
Figure 5: Our proposed two-stream model for dog pain estimation. The left branch is a four-layer LSTM model, which is responsible for processing the keypoint information. The right branch is a four-layer convolutional LSTM model, which is responsible for processing the RGB images. The yellow and gray blocks are the max pooling and batch normalization layers. Black blocks represent the time attention layers. The green block is the fully connected layer. We fuse the two branches by concatenating the feature vectors of their last layers.
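The fusion step can be sketched as a concatenation of the two branches' final feature vectors followed by a fully connected layer (PyTorch); the feature dimensions here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Concatenate the final feature vectors of the keypoint (LSTM) and
    RGB (C-LSTM) branches and predict a pain probability."""
    def __init__(self, kp_dim=128, rgb_dim=256):
        super().__init__()
        self.fc = nn.Linear(kp_dim + rgb_dim, 1)

    def forward(self, kp_feat, rgb_feat):         # (B, kp_dim), (B, rgb_dim)
        fused = torch.cat([kp_feat, rgb_feat], dim=1)
        return torch.sigmoid(self.fc(fused))      # pain score in [0, 1]
```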
3.4.2 Convolutional LSTM
Traditional LSTM variants use fully connected layers in every state transition and gate computation; this kind of LSTM is also called the fully connected LSTM (FC-LSTM). The FC-LSTM has the ability to extract temporal information from sequences, which makes it suitable for time series problems [36–39]. However, the major drawback of the FC-LSTM is that its fully connected layers contain too much redundancy for visual inputs. In order to address this problem, Shi et al. [40] introduced the Convolutional LSTM (C-LSTM), which has convolutional structures in both the input-to-state and state-to-state transitions.
The C-LSTM can process spatio-temporal sequences more effectively: a three-dimensional tensor can be fed into the network without flattening it into a one-dimensional vector [41, 42]. All the inputs $x_1, \ldots, x_t$, cell outputs $c_1, \ldots, c_t$, hidden states $h_1, \ldots, h_t$, and gates $i_t$, $f_t$, $o_t$ of the C-LSTM are 3D tensors, whose last two dimensions are spatial dimensions (rows and columns). This model can downsample the input image to extract high-dimensional features and capture information across the video frames. We use multiple C-LSTM layers in our pipeline to process the RGB frames of the video. The equations of the C-LSTM are given in Eq. 7, reproduced from [40].
$$
\begin{aligned}
f_t &= \sigma_g\!\left(W_f \ast x_t + U_f \ast h_{t-1} + V_f \circ c_{t-1} + b_f\right) \\
i_t &= \sigma_g\!\left(W_i \ast x_t + U_i \ast h_{t-1} + V_i \circ c_{t-1} + b_i\right) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh\!\left(W_c \ast x_t + U_c \ast h_{t-1} + b_c\right) \\
o_t &= \sigma_g\!\left(W_o \ast x_t + U_o \ast h_{t-1} + V_o \circ c_t + b_o\right) \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned} \qquad (7)
$$
Here * denotes the convolution operator, and ◦ denotes the Hadamard product
(element-wise product). W, U, V and b are the weight matrices and bias vector
parameters that need to be learned during training.
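A single C-LSTM cell following Eq. 7 can be sketched as below (PyTorch); the peephole terms V ∘ c are implemented as learnable element-wise weights, and the kernel size and class name are our assumptions rather than the exact configuration used in our model.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One C-LSTM cell implementing Eq. 7 (with peephole connections)."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        # Convolutions producing the stacked pre-activations of f, i, c, o
        self.conv_x = nn.Conv2d(in_ch, 4 * hid_ch, k, padding=p)
        self.conv_h = nn.Conv2d(hid_ch, 4 * hid_ch, k, padding=p, bias=False)
        # Peephole weights V_f, V_i, V_o (element-wise, broadcast over space)
        self.v_f = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))
        self.v_i = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))
        self.v_o = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))

    def forward(self, x, state):
        h, c = state                                   # each (B, hid_ch, H, W)
        zf, zi, zc, zo = torch.chunk(self.conv_x(x) + self.conv_h(h), 4, dim=1)
        f = torch.sigmoid(zf + self.v_f * c)           # forget gate
        i = torch.sigmoid(zi + self.v_i * c)           # input gate
        c_new = f * c + i * torch.tanh(zc)             # cell state update
        o = torch.sigmoid(zo + self.v_o * c_new)       # output gate (uses c_t)
        h_new = o * torch.tanh(c_new)
        return h_new, c_new
```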
The alignment model assigns a score $\alpha_{ij}$ to the pair formed by the context vector element at position $j$ and the annotation at position $i$. The unnormalized score $e_{ij}$ is computed as the dot product between the LSTM hidden state $v_{i-1}$ and the $j$-th annotation. In our model, both the context vectors and the annotations are the hidden-state sequence of the last LSTM layer, which yields a self-attention model. In [37], the alignment score $\alpha$ is parameterized by a feed-forward network with a single hidden layer, and this network is jointly trained with the other parts of the model.
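A dot-product self-attention over the hidden-state sequence of the last LSTM layer can be sketched as follows; this reflects the dot-product scoring described above rather than the feed-forward parameterization of [37].

```python
import torch
import torch.nn.functional as F

def temporal_attention(h):
    """h: (B, T, D) hidden states of the last LSTM layer.
    Scores e_ij are dot products between hidden states; each output position
    is a weighted sum of the sequence (self-attention over time)."""
    e = torch.bmm(h, h.transpose(1, 2))   # (B, T, T) alignment scores
    alpha = F.softmax(e, dim=-1)          # normalized attention weights
    return torch.bmm(alpha, h)            # (B, T, D) context vectors
```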
4 Datasets
There are a few datasets with dog images available for training pose estimation models, but no datasets are available for assessing pain indicators. The dataset we introduce in this paper is unique in that respect. We use two other datasets for assessing the pose estimation method used in this paper, namely StanfordExtra [44] and TigDog [45].
StanfordExtra is an image-based dog dataset with annotated 2D keypoints, obtained by labeling part of the existing Stanford Dogs dataset [46], which consists of 20,580 dog images taken “in the wild” from 120 different dog breeds. The dataset contains various poses, with variation in environmental occlusions, interactions with humans or other animals, and partial views. For each image, there are 19 candidate body keypoints, but only the points that can be observed are labeled. We randomly selected 6 of the 120 dog breeds, and used 60 images per breed for comparison experiments. The selected breeds are Dandie Dinmont, Tibetan Terrier, Bluetick, Rhodesian Ridgeback, Brittany Spaniel, and Brabancon Griffon.
TigDog is a video-based dataset with no annotated keypoints, sourced from the YouTube-Objects dataset [47]. For most videos, the camera is not static and follows the dog, which is typically centered in the image. Figure 6 shows some example images from these two datasets.
The Anonymous University Dog Behavior Dataset (AUDBD) used in our experiments was collected by the Veterinary Medicine Department of ANONYMOUS University, in a group that performs research into dog behavior (citations removed for blind review). The complete dataset includes 61 videos, each
corresponding to an individual dog. The overall video duration is six hours and
45 minutes. There are 23 videos, totaling three hours and nine minutes, labeled as the “pain” condition, and 38 videos, totaling three hours and 36 minutes, labeled as the “non-pain” condition.

Figure 6: Example images from the StanfordExtra (top) and TigDog (bottom) datasets.
Pain can be associated with wear and tear of tissues, injuries to the bones or a joint, or damage to muscles, ligaments, tendons, or other soft tissues (musculoskeletal); these cases are marked as “orthopedic pain” in the dataset. Other types of pain can be related to irritated or damaged nerves, and these are marked as “neurological pain”. The labels of the dataset are provided by veterinary experts according to the dogs’ medical records. Every sample in the pain set is a real dog patient with a physical problem that could potentially cause pain symptoms; however, not all frames in the videos will exhibit all pain indicators.
We extract and process frames from the videos at two frames per second, in clips of four seconds of video each. Each clip is thus composed of eight frames. In order to have enough data to train our models, we use a two-frame (i.e. one-second) overlap between subsequent clips. This way, the dataset contains a total of 12,435 clips (see the dataset overview in Table 1).
Table 1: The overall statistics of the AUDBD dataset. Frames are sampled at two FPS, and clips consist of eight frames with a two-frame overlap. Duration is shown in hh:mm:ss.
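The clip extraction described above amounts to a sliding window over the 2-fps frame sequence; a minimal sketch with the 8-frame window and two-frame overlap (i.e. a stride of six frames) follows, where the function name is ours.

```python
def make_clips(frames, clip_len=8, overlap=2):
    """Split a list of frames (sampled at 2 fps) into clips of `clip_len`
    frames, with `overlap` frames shared by consecutive clips
    (stride = clip_len - overlap)."""
    stride = clip_len - overlap
    return [frames[s:s + clip_len]
            for s in range(0, len(frames) - clip_len + 1, stride)]

# A 60-second video sampled at 2 fps yields 120 frames,
# and make_clips(list(range(120))) produces 19 clips.
```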
5 Experimental Results
5.1 Pose estimation
We measure the performance of the pre-trained HRNet on our dog pain dataset, as well as on StanfordExtra and TigDog. We use the Percentage of Correct Keypoints (PCK) measure, which is the percentage of detected keypoints falling within a normalized distance of the manually annotated ground truth. We use the square root of the bounding box area as the normalization factor. The results are shown in Table 2.
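The PCK measure can be computed as sketched below; predicted and ground-truth keypoints are assumed to be (J, 2) arrays with a visibility mask, the square root of the bounding-box area is the normalization factor as described above, and the threshold value is an illustrative assumption.

```python
import numpy as np

def pck(pred, gt, visible, bbox_area, thr=0.1):
    """Percentage of Correct Keypoints for one image.
    A detected keypoint is correct if its distance to the ground truth is
    below thr * sqrt(bbox_area); only annotated (visible) points count."""
    dists = np.linalg.norm(pred - gt, axis=1)            # (J,) distances
    correct = dists[visible] < thr * np.sqrt(bbox_area)
    return correct.mean() if visible.any() else np.nan
```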
We can see that the pre-trained HRNet has good performance in general, and performs better on the StanfordExtra dataset than on TigDog and AUDBD, since StanfordExtra is visually more similar to the AnimalPose dataset, on which HRNet was trained.
The model with the highest average F1 score and accuracy is C-LSTM+LSTM128, which has an average F1 score of 76.3% ± 4.4 and an average accuracy of 77.0% ± 4.6.
Table 3: Results (% F1-score and accuracy) for different models. RGB: images, KP: keypoints, OF: optical flow. 32, 64, and 128 denote the number of hidden units in the LSTM models used to process the keypoint sequence.
where Z is the product of the width and height of the feature map. The importance of feature map (layer output channel) $k$ for a class $c$ is then weighted according to the corresponding gradient magnitude. The weighted feature maps are passed through a ReLU function to focus on the features that have a positive impact on the class of interest:
$$ L^{c}_{\text{Grad-CAM}} = \mathrm{ReLU}\!\Big( \underbrace{\textstyle\sum_k \alpha_k^c A^k}_{\text{linear combination}} \Big) \qquad (12) $$
The region with the highest magnitude is considered the most important for the classification decision. The output map is up-sampled with interpolation to the size of the original image and superimposed on it as a heatmap.
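Given the feature maps A^k of a convolutional layer and the gradients of the class score with respect to them, the Grad-CAM map of Eq. 12 can be sketched as follows; obtaining the gradients (e.g. via backpropagation hooks) is omitted, and the normalization for visualization is our addition.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """feature_maps, gradients: (K, H, W) arrays for one input.
    The weights alpha_k^c are the spatially averaged gradients; the map is
    the ReLU of their linear combination with the feature maps (Eq. 12)."""
    alpha = gradients.mean(axis=(1, 2))              # (K,) averaged gradients
    cam = np.tensordot(alpha, feature_maps, axes=1)  # sum_k alpha_k * A^k -> (H, W)
    cam = np.maximum(cam, 0)                         # ReLU
    return cam / (cam.max() + 1e-8)                  # normalized for display
```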
Figure 7 shows two dogs in the orthopedic pain condition. The top sequence in Figure 7 is a case that is correctly classified as pain by our ConvLSTM+LSTM model, with a confidence score of 0.95. According to the heatmap, the lower body parts and the head are the regions on which the model focuses most. This is reasonable, since a dog with orthopedic pain would be unstable in its walking motion, and this would show up in its leg movements and head position. We note that the algorithm does not seem to be affected much by the image background, and is able to follow the temporal patterns of the dog.
In the bottom sequence of Figure 7, we see an example sequence where the dog is in pain, but the algorithm classifies it incorrectly as “no pain”, with a low confidence score. From the heatmap, we can see that the algorithm is focusing on the back of the dog, and the legs are not clearly visible from this angle. Therefore, the algorithm cannot reliably detect the pain expression.
Figure 7: Saliency maps for a true positive (top) and a false negative (bottom)
case belonging to the pain class.
6 Conclusions
Recognizing pain-related behaviors in dogs automatically is a challenging problem. Clinical pain assessment methods, such as the Glasgow composite measure pain scale, use multiple behavioral indicators, such as “vocalization, attention to wound, mobility, response to touch, demeanor and posture/activity”, for the assessment [50]. In this paper, we have barely scratched the surface of behavior analysis for this task, and integrated computer vision based tools into a solution that uses a small number of cues. Furthermore, we have worked only on samples with orthopedic and neuropathic pain; other categories, such as organic pain, were not included.
Our results show that both spatio-temporal features in images and keypoint-based descriptions of the motion patterns are useful for the visual analysis. While data restrictions and the risk of overfitting prevented us from fine-tuning pose estimators, we show that generic tools have good performance for this task. Contrary to most work on animal pain estimation, we have not focused on the facial region. This remains future work, but would require further clinical tools, such as validated grimace scales for labeling.
Ethical Impact Statement
The proposed approach cannot be used as a clinical assessment tool for dog pain estimation without further tests and validation. No dogs were harmed in this research, no pain was induced in the subjects, and ethics committee approval was obtained for the preparation of the dog video dataset.
References
[1] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler,
and K. Murphy, “Towards accurate multi-person pose estimation in the
wild,” in Proc. CVPR, 2017, pp. 4903–4911.
[2] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, “RMPE: Regional multi-person
pose estimation,” in Proc. CVPR, 2017, pp. 2334–2343.
[3] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, “Cascaded
pyramid network for multi-person pose estimation,” in Proc. CVPR, 2018,
pp. 7103–7112.
[11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. ECCV. Springer, 2014, pp. 740–755.
[12] M.-F. Tsai and J.-Y. Huang, “Predicting canine posture with smart camera networks powered by the artificial intelligence of things,” IEEE Access, vol. 8, pp. 220848–220857, 2020.
[13] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc.
ICCV, 2017, pp. 2980–2988.
[14] R. Girshick, “Fast R-CNN,” in Proc. ICCV, 2015, pp. 1440–1448.
[25] F. Pessanha, M. Mahmoud, and K. McLennan, “Towards automatic moni-
toring of disease progression in sheep: A hierarchical model for sheep facial
expressions analysis from video,” in Proc. IEEE FG, 2020.
[26] S. Broomé, M. Feighelstein, A. Zamansky, G. C. Lencioni, P. H. Andersen,
F. Pessanha, M. Mahmoud, H. Kjellström, and A. A. Salah, “Going deeper
than tracking: a survey of computer-vision based recognition of animal pain
and affective states,” arXiv preprint arXiv:2206.08405, 2022.
[27] V. Franzoni, A. Milani, G. Biondi, and F. Micheli, “A preliminary work on
dog emotion recognition,” in IEEE/WIC/ACM International Conference
on Web Intelligence-Companion Volume, 2019, pp. 91–96.
[30] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proc. CVPR, 2017, pp. 7263–7271.
[31] B. Xiao, H. Wu, and Y. Wei, “Simple baselines for human pose estimation and tracking,” in Proc. ECCV, 2018, pp. 466–481.
[32] T. Nath, A. Mathis, A. C. Chen, A. Patel, M. Bethge, and M. W. Mathis, “Using DeepLabCut for 3D markerless pose estimation across species and behaviors,” Nature Protocols, vol. 14, no. 7, pp. 2152–2176, 2019. [Online]. Available: https://doi.org/10.1038/s41596-019-0176-0
[38] A. Graves, “Generating sequences with recurrent neural networks,” arXiv
preprint arXiv:1308.0850, 2013.
[39] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venu-
gopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional
networks for visual recognition and description,” in Proc. CVPR, 2015, pp.
2625–2634.
[40] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in Proc. NeurIPS, 2015, pp. 802–810.
[46] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li, “Novel dataset for
fine-grained image categorization: Stanford dogs,” in Proc. CVPRW, 2011.
[47] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari, “Learning
object class detectors from weakly annotated video,” in Proc. CVPR, 2012,
pp. 3282–3289.
[48] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the Kinetics dataset,” in Proc. CVPR, 2017, pp. 6299–6308.
[49] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proc. ICCV, 2017, pp. 618–626.