Video-Based Estimation of Pain Indicators in Dogs
Abstract
Dog owners are typically capable of recognizing behavioral cues that reveal subjective states of their dogs, such as pain. However, automatic recognition of the pain state is very challenging. This paper proposes a novel video-based, two-stream deep neural network approach for this problem. We extract and preprocess body keypoints, and compute features from both the keypoints and the RGB representation over the video. We propose an approach to deal with self-occlusions and missing keypoints. We also present a unique video-based dog behavior dataset, collected by veterinary professionals and annotated for the presence of pain, and we report good classification results with the proposed approach. This study is one of the first works on machine learning based estimation of the dog pain state.
1 Introduction
Interpretation of the behavior of animals is critical to understanding their well-being. In this paper, we propose an approach for the automatic estimation of pain states of dogs, using the video modality. This problem is important for several potential applications, including long-term automatic monitoring of dogs, assessment and diagnosis aids for veterinary clinicians, and early warning systems for non-experts. It is challenging because dogs come in different breeds with very different appearances, and exhibit different behavioral coping strategies.
Animal pain estimation from video uses facial expressions of animals, as well as cues extracted from the body and the posture of the animal. While a comprehensive approach for automatic dog pain estimation is missing from the literature, there are works on sheep, mice, pigs, and horses. Each animal requires a separate data collection effort, a different analysis approach, and a different pain annotation scheme.
In this paper, we propose a pipeline for pain estimation in dogs, incorporat-
ing a number of off-the-shelf and in-house developed tools. We also introduce
a database of dog videos, annotated by veterinary experts, to evaluate our ap-
proach. Pain estimation in dogs is an under-researched problem, and there are
limited resources; we hope that our approach will establish a good baseline, and
there will be more research to follow. We make all our code and annotations
available publicly¹.
2 Related Work
Automated pain estimation for animals focuses on indicators from the body and
the face of the animal. We first discuss pose estimation in this section, and then
move to pain estimation.
¹The raw video data cannot be shared publicly; we can support uses permitted by the informed consent forms, such as federated learning approaches. Please contact the second author if you are interested in the data.
would require bounding box detection as a first step to reduce background
interference.
Figure 1: Image-based representations for the dog. (1) Body key-joints, (2) outline boundary, (3) instance segmentation, (4) bounding box.
Body key-joints are the main representation for pose estimation approaches, as they depict the geometrical configuration of multiple body parts. Some pose estimation methods first use a bounding box to locate a single animal [1–3], or apply instance segmentation to extract the shape of the animal body. Combining the advantages of the above four representation methods can help an algorithm keep track of the target animal and exclude noise from the background.
Recent approaches for animal pose estimation, such as DeepLabCut [4] and LEAP [5], use a convolutional neural network (CNN) to extract features directly from animal images and to convert them into the key-joint representation. To train the CNNs, a human-annotated training set is used. To deal with the lack of large amounts of human-annotated data, transfer learning is the preferred solution. In DeepLabCut, a ResNet model [6] pre-trained on an object recognition task is used to enable learning animal-specific estimators from only hundreds of annotated training samples. Additionally, in [7], active learning is used to reduce the number of required annotations: the system actively suggests which images to annotate in each round. On the other hand, semi-supervised learning is used in [8], where a larger set of unannotated images is used jointly with a smaller set of annotated images to perform pose estimation. Apart from these three main approaches that can be combined (i.e. transfer learning, active learning, and semi-supervised
learning), data augmentation is also used to increase the number of available training samples.
In this work, we use the Deep High-Resolution Network (HRNet), a state-of-the-art framework for 2D pose estimation that has demonstrated superior performance on a number of case studies [9]. This approach uses an innovative network architecture, described in Section 3.1.
The pose estimation approaches mentioned above are generic and can be used with different animals. The DeepPoseKit method, for example, is applied to vinegar fly, locust, and zebra pose estimation problems [7]. However, popular human pose estimators use additional information from the human skeleton structure [10]. In a similar way, animal-specific skeleton models can be used to improve pose detection.
A number of approaches have been used or proposed for pose estimation in dogs and other canines. Modern object detection algorithms, trained on large datasets such as MS COCO [11], typically contain a “dog” class. These can be leveraged to find a bounding box (or a more detailed segmentation) for detecting dogs in an image. The work by Tsai and Huang [12] introduced a multi-stage pose estimation algorithm, which uses a Mask R-CNN model [13] to generate the contour mask map of the dog. A Fast R-CNN model [14] then performs posture recognition on the contour mask image and key part recognition on the original image. The skeleton keypoints are obtained by jointly analyzing the candidate bounding boxes and the animal pose.
In the RGBD-Dog approach [15], an RGB-D dog dataset with landmark ground truth, generated with a 3D motion capture system, is used. The authors applied a stacked hourglass network [16] to predict a set of 2D heatmaps for a given depth image, from which the 3D coordinates of the body joints can be determined. A Hierarchical Gaussian Process Latent Variable Model (H-GPLVM) [17] is applied to refine the predicted 3D joint positions and to prevent pose ambiguities. However, this method requires depth data, which is not always available.
In this work, we use the HRNet-W32 model obtained from [18], where W32 denotes the width of the high-resolution network in the last three layers. The model is initially trained on the AnimalPose dataset [19], which contains pose annotations for five animal categories (dog, cat, cow, horse, and sheep), with over 6,000 instances in more than 4,000 images and 20 annotated body keypoints. We describe the experiments in Section 5.
Automatic, computer vision based pain estimation has not been extensively
researched for dogs, but there is important work for other animals, such as
horses [21, 22], mice [23], rabbits [24] and sheep [25] (see [26] for a recent sur-
vey). These works mostly focus on the face of the animal, and use validated
clinical scales that rely on a series of observations to score the presence of pain
indicators. These observations can include the position of the ears, the visibility
of sclera in the eyes, muscle tension in certain areas like the mouth, or presence
of specific behaviors, such as baring the teeth in horses, which are exhibited
when the animal is in pain.
There is some related work on emotion estimation for dogs, which is rele-
vant as a basis for pain estimation. In [27], three emotional states (growl, sleep, and smile) are automatically estimated from sequences of
dog images, based on ears, eyes, mouth and head features. In [28], DeepLabCut
is used to estimate anger, fear, happiness, and relaxation in dogs, based on a
custom dataset of 400 images, evenly distributed over these four classes.
The main challenges for computer vision based estimation of dog pain are
the existence of many different dog breeds, which implies different sizes and
colors that change the appearance, the presence of subtle cues, such as muscle
contractions that are hidden beneath the coat, and the difficulties that plague
ordinary face and body analysis, such as pose and illumination variations. Most
importantly, there is a lack of data with pain annotations.
3 Methodology
Our proposed approach for detecting the visual indicators of a dog’s pain state consists of three main parts. The off-the-shelf model for pose estimation is discussed in Section 3.1. The feature extraction and pre-processing are described in Section 3.2, followed by the pose normalization. In Section 3.4, we propose a two-stream deep learning architecture for pain detection. Figure 2 depicts our proposed system, which takes a short, 8-frame video clip (sampled over 4 seconds) as input, estimates the dog’s pose, and produces a pain score in the 0–1 range, which can be treated as the probability that the video contains a dog in pain.
Figure 2: Overall pipeline of the proposed method.
the model input (see Figure 4 for an example of the image pre-processing).
As stated before, we use the HRNet [9] approach for pose estimation. The majority of existing pose estimation methods, such as SimpleBaseline [31] and DeepLabCut [32], adopt an encoder-decoder structure composed of a series of convolutional networks that downsample the input image from high- to low-resolution feature maps, and then apply transposed convolution layers to recover the original high-resolution feature map. HRNet, on the other hand, is composed of a central stem network that maintains a high-resolution representation throughout the process, lower-resolution branch networks added at each stage of the stem, and parallel connections between the multi-resolution branch networks. Additionally, HRNet implements multi-scale fusion between the parallel branch networks, in order to facilitate information exchange throughout the process. The last layer of the stem network outputs the keypoint estimates (see Figure 3).
Since HRNet is a top-down pose estimation algorithm, it requires the approximate location of the target object in the image and preferably takes closely cropped images as inputs. The bounding box output by the YOLOv5 model is extended by 10% to ensure that the entire dog is included, and is provided to HRNet for further processing. We use the HRNet-W32 model obtained from [18], trained on the AnimalPose dataset [19], at this stage. Figure 4 illustrates the processing of a frame for bounding box localization and keypoint extraction.
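As an illustration, this bounding-box expansion step can be sketched as follows; it is a minimal example assuming boxes in (x1, y1, x2, y2) pixel coordinates and a 10% margin on each side, with function names that are ours rather than part of any released code.

```python
def expand_box(box, img_w, img_h, margin=0.10):
    """Expand an (x1, y1, x2, y2) bounding box by a relative margin,
    clipped to the image borders. Coordinates are in pixels."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    x1 = max(0.0, x1 - margin * w)
    y1 = max(0.0, y1 - margin * h)
    x2 = min(float(img_w), x2 + margin * w)
    y2 = min(float(img_h), y2 + margin * h)
    return x1, y1, x2, y2

def crop_for_hrnet(frame, box):
    """Crop the expanded detector box from an HxWx3 frame array
    before passing it to the pose estimator."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = expand_box(box, w, h)
    return frame[int(y1):int(y2), int(x1):int(x2)]
```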
Figure 3: The structure of HRNet. The horizontal and vertical axes correspond
to the depth of the network and the scale of the feature maps, respectively
(Adapted from [9]).
We deal with missing keypoints using body symmetry and by inference based on observed keypoints over time. The central concepts are as follows:
1. If some keypoints are missing from a frame, they may be recovered and interpolated by using the neighbouring frames in the sequence. Note that this would only work in an “offline” mode.
2. If, after the first step, more than nine keypoints are missing, or the spine keypoints are missing, the frame is discarded from further processing.
3. Each leg has three keypoints (i.e. elbow, knee, and paw). If one keypoint
of a leg is missing, it can be inferred according to the position of the other
two keypoints and the spine points.
4. If more than one keypoint is missing in one leg, they can be inferred by exploiting the left-right body symmetry, or the front-back symmetry.
This approach for resolving missing data is simple and effective, and necessary
for subsequent processing of the body pose. In comparison to the head and spine
keypoints, the leg keypoints convey more detailed information about the pose.
Thus, the essence of this approach is to compensate for the missing keypoints
induced by self-occlusion.
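A minimal sketch of the first two rules (temporal interpolation and frame rejection) is given below; it assumes keypoints stored as a (T, J, 2) array with NaN marking missing detections, and the function names and threshold handling are ours.

```python
import numpy as np

def interpolate_missing(kps):
    """kps: (T, J, 2) keypoint array with NaN where a point is missing.
    Linearly interpolate each coordinate over time (offline mode)."""
    T, J, _ = kps.shape
    t = np.arange(T)
    out = kps.copy()
    for j in range(J):
        for d in range(2):
            v = out[:, j, d]
            ok = ~np.isnan(v)
            if 0 < ok.sum() < T:   # something observed, something missing
                out[:, j, d] = np.interp(t, t[ok], v[ok])
    return out

def keep_frame(frame_kps, spine_idx, max_missing=9):
    """Rule 2: reject a frame if more than `max_missing` keypoints, or any
    spine keypoint, are still missing after interpolation."""
    missing = np.isnan(frame_kps).any(axis=1)   # (J,) boolean mask
    return missing.sum() <= max_missing and not missing[spine_idx].any()
```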
Our pose estimation step defines a mapping from the RGB image to a body keypoint graph:
$$ s_i(t) \;\xrightarrow{\text{pose detector}}\; p_i(t), \quad \forall t = 1, \ldots, T_i, \qquad (1) $$
where $p_i(t)$ denotes the posture for sample $i$ at time frame $t$, and $T_i$ denotes the total sequence length for sample $i$. The whole dataset contains $N$ samples. In particular, $p_i(t)$ consists of a set of 2D coordinates:
$$ p_i(t) = \big( x_j(t), y_j(t) \big)_{i,\; j \in J}, \qquad (2) $$
The dependence on $t$ is omitted in the following for simplicity. In this paper, we use the middle point between the tail end and the neck as the root point. Thus, the root-point-centered coordinates are defined as:
$$ \bar{p}_i = \big( \bar{x}_j, \bar{y}_j \big)_{i,\; j \in J}. \qquad (4) $$
Furthermore, in order to normalize the size of the dog within the image, we re-scale the keypoint coordinates $\bar{p}_i$:
$$ \bar{p}_i = \frac{\bar{p}_i}{\lVert v_{\text{neck,tailend}} \rVert}, \qquad (5) $$
where $v_{\text{neck,tailend}}$ is the vector between the neck and the tail-end landmarks.
The transformation and normalization of the landmarks serves two purposes: the placement of the dog within the image no longer affects the processing, and the distance of the dog to the camera, as well as its size, become less important. One may assume that, if much more training data become available in the future, clustering in this normalized space, followed by cluster-specific models, may improve the modeling.
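The root-centering and scale normalization of Eqs. (4) and (5) can be sketched as follows; the landmark indices are placeholders, and the small epsilon guarding against division by zero is our addition.

```python
import numpy as np

NECK, TAIL_END = 5, 6  # placeholder landmark indices

def normalize_pose(kps):
    """kps: (J, 2) keypoints of one frame.
    Center on the root point (midpoint of neck and tail end) and scale by
    the neck-to-tail-end distance."""
    root = 0.5 * (kps[NECK] + kps[TAIL_END])
    centered = kps - root                               # Eq. (4): root-centered coordinates
    scale = np.linalg.norm(kps[NECK] - kps[TAIL_END])   # ||v_{neck,tailend}||
    return centered / (scale + 1e-8)                    # Eq. (5): size normalization
```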
$$
\begin{aligned}
i_t &= \sigma\!\left(W_i h_{t-1} + I_i x_t + b_i\right) \\
f_t &= \sigma\!\left(W_f h_{t-1} + I_f x_t + b_f\right) \\
o_t &= \sigma\!\left(W_o h_{t-1} + I_o x_t + b_o\right) \\
\tilde{c}_t &= \tanh\!\left(W_c h_{t-1} + I_c x_t + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned} \qquad (6)
$$
where $i_t$, $f_t$, $o_t$, and $c_t$ are the input gate, forget gate, output gate, and cell state vector, respectively, $\tilde{c}_t$ is the cell input activation vector, and $h_t$ is the hidden state of the LSTM units. $\sigma$ represents the logistic sigmoid activation function, and $\odot$ represents element-wise multiplication. The weight matrices $W_\ast$ act on the previous hidden state, the matrices $I_\ast$ act on the current input, and $b_\ast$ are the bias vectors.
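In practice, the keypoint branch can be realized with standard LSTM layers. The sketch below (PyTorch) processes an 8-frame sequence of flattened, normalized keypoints; the class itself is ours, with 20 joints as in AnimalPose and the four layers and 128 hidden units mentioned in Figure 5 and Table 3.

```python
import torch.nn as nn

class KeypointBranch(nn.Module):
    """LSTM branch over flattened keypoint coordinates (illustrative sizes)."""
    def __init__(self, num_joints=20, hidden=128, layers=4):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_joints * 2, hidden_size=hidden,
                            num_layers=layers, batch_first=True)

    def forward(self, kps):               # kps: (B, T, J, 2)
        B, T = kps.shape[:2]
        h, _ = self.lstm(kps.reshape(B, T, -1))
        return h                          # (B, T, hidden) hidden-state sequence
```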
Figure 5: Our proposed two-stream model for dog pain estimation. The left branch is a four-layer LSTM model, which is responsible for processing the keypoint information. The right branch is a four-layer convolutional LSTM model, which is responsible for processing the RGB images. The yellow and gray blocks are the max pooling and batch normalization layers. Black blocks represent the time attention layers. The green block is the fully connected layer. We fuse the two branches by concatenating the feature vectors of their last layers.
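The fusion step can be sketched as a concatenation of the two branches' final feature vectors followed by a fully connected layer (PyTorch); the feature dimensions here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Concatenate the final feature vectors of the keypoint (LSTM) and
    RGB (C-LSTM) branches and predict a pain probability."""
    def __init__(self, kp_dim=128, rgb_dim=256):
        super().__init__()
        self.fc = nn.Linear(kp_dim + rgb_dim, 1)

    def forward(self, kp_feat, rgb_feat):         # (B, kp_dim), (B, rgb_dim)
        fused = torch.cat([kp_feat, rgb_feat], dim=1)
        return torch.sigmoid(self.fc(fused))      # pain score in [0, 1]
```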
3.4.2 Convolutional LSTM
Traditional LSTM variants use fully connected layers in every state transition and gate computation; this kind of LSTM is also called the fully connected LSTM (FC-LSTM). The FC-LSTM has the ability to extract temporal information from sequences, which makes it suitable for time series problems [36–39]. However, the major drawback of the FC-LSTM is that its fully connected layers contain too much redundancy for visual inputs. In order to address this problem, Shi et al. [40] introduced the Convolutional LSTM (C-LSTM), which has convolutional structures in both the input-to-state and state-to-state transitions.
The C-LSTM can process spatio-temporal sequences more effectively: a three-dimensional tensor can be fed into the network without flattening it into a one-dimensional vector [41, 42]. All the inputs $x_1, \ldots, x_t$, cell outputs $c_1, \ldots, c_t$, hidden states $h_1, \ldots, h_t$, and gates $i_t$, $f_t$, $o_t$ of the C-LSTM are 3D tensors, whose last two dimensions are spatial dimensions (rows and columns). This model can downsample the input image to extract high-dimensional features and capture information across the video frames. We use multiple C-LSTM layers in our pipeline to process the RGB frames of the video. The equations of the C-LSTM are given in Eq. 7, reproduced from [40].
$$
\begin{aligned}
f_t &= \sigma_g\!\left(W_f \ast x_t + U_f \ast h_{t-1} + V_f \circ c_{t-1} + b_f\right) \\
i_t &= \sigma_g\!\left(W_i \ast x_t + U_i \ast h_{t-1} + V_i \circ c_{t-1} + b_i\right) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh\!\left(W_c \ast x_t + U_c \ast h_{t-1} + b_c\right) \\
o_t &= \sigma_g\!\left(W_o \ast x_t + U_o \ast h_{t-1} + V_o \circ c_t + b_o\right) \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned} \qquad (7)
$$
Here * denotes the convolution operator, and ◦ denotes the Hadamard product
(element-wise product). W, U, V and b are the weight matrices and bias vector
parameters that need to be learned during training.
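A single C-LSTM cell following Eq. 7 can be sketched as below (PyTorch); the peephole terms V ∘ c are implemented as learnable element-wise weights, and the kernel size and class name are our assumptions rather than the exact configuration used in our model.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One C-LSTM cell implementing Eq. 7 (with peephole connections)."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        # Convolutions producing the stacked pre-activations of f, i, c, o
        self.conv_x = nn.Conv2d(in_ch, 4 * hid_ch, k, padding=p)
        self.conv_h = nn.Conv2d(hid_ch, 4 * hid_ch, k, padding=p, bias=False)
        # Peephole weights V_f, V_i, V_o (element-wise, broadcast over space)
        self.v_f = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))
        self.v_i = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))
        self.v_o = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))

    def forward(self, x, state):
        h, c = state                                   # each (B, hid_ch, H, W)
        zf, zi, zc, zo = torch.chunk(self.conv_x(x) + self.conv_h(h), 4, dim=1)
        f = torch.sigmoid(zf + self.v_f * c)           # forget gate
        i = torch.sigmoid(zi + self.v_i * c)           # input gate
        c_new = f * c + i * torch.tanh(zc)             # cell state update
        o = torch.sigmoid(zo + self.v_o * c_new)       # output gate (uses c_t)
        h_new = o * torch.tanh(c_new)
        return h_new, c_new
```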
The alignment model assigns a score $\alpha_{ij}$ to the pair formed by the context vector element at position $j$ and the annotation at position $i$. The unnormalized score $e_{ij}$ is computed as the dot product between the LSTM hidden state $v_{i-1}$ and the $j$-th annotation. In our model, both the context vectors and the annotations are the hidden-state sequence of the last LSTM layer, which yields a self-attention model. In [37], the alignment score $\alpha$ is parameterized by a feed-forward network with a single hidden layer, and this network is jointly trained with the other parts of the model.
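A dot-product self-attention over the hidden-state sequence of the last LSTM layer can be sketched as follows; this reflects the dot-product scoring described above rather than the feed-forward parameterization of [37].

```python
import torch
import torch.nn.functional as F

def temporal_attention(h):
    """h: (B, T, D) hidden states of the last LSTM layer.
    Scores e_ij are dot products between hidden states; each output position
    is a weighted sum of the sequence (self-attention over time)."""
    e = torch.bmm(h, h.transpose(1, 2))   # (B, T, T) alignment scores
    alpha = F.softmax(e, dim=-1)          # normalized attention weights
    return torch.bmm(alpha, h)            # (B, T, D) context vectors
```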
4 Datasets
There are a few datasets with dog images available for training pose estimation models, but no datasets are available for assessing pain indicators. The dataset we introduce in this paper is unique in that respect. We use two other datasets for assessing the pose estimation method used in this paper, namely StanfordExtra [44] and TigDog [45].
StanfordExtra is an image-based dog dataset with annotated 2D keypoints, obtained by labeling part of the existing Stanford Dogs dataset [46], which consists of 20,580 dog images taken “in the wild” from 120 different dog breeds. The dataset contains various poses, with variation in environmental occlusions, interactions with humans or other animals, and partial views. For each image, there are 19 candidate body keypoints, but only the points that can be observed are labeled. We randomly selected 6 of the 120 dog breeds, and used 60 images per breed for comparison experiments. The selected breeds are Dandie Dinmont, Tibetan Terrier, Bluetick, Rhodesian Ridgeback, Brittany Spaniel, and Brabancon Griffon.
TigDog is a video-based dataset with no annotated keypoints, sourced from the YouTube-Objects dataset [47]. For most videos, the camera is not static and follows the dog, which is typically centered in the image. Figure 6 shows some example images from these two datasets.
The Anonymous University Dog Behavior Dataset (AUDBD) used in our experiments was collected by the Veterinary Medicine Department of ANONYMOUS University, in a group that performs research into dog behavior (citations removed for blind review). The complete dataset includes 61 videos, each
corresponding to an individual dog. The overall video duration is six hours and
45 minutes. There are 23 videos, totaling three hours and nine minutes, labeled as the “pain” condition, and 38 videos, totaling three hours and 36 minutes, labeled as the “non-pain” condition.

Figure 6: Example images from the StanfordExtra (top) and TigDog (bottom) datasets.
Pain can be associated with wear and tear of tissues, injuries to the bones or a joint, or damage to muscles, ligaments, tendons, or other soft tissues (musculoskeletal); these cases are marked as “orthopedic pain” in the dataset. Other types of pain can be related to irritated or damaged nerves, and these are marked as “neurological pain”. The labels of the dataset are provided by veterinary experts according to the dogs’ medical records. Every sample in the pain set is a real dog patient with a physical problem that could potentially cause pain symptoms; however, not all frames in the videos will exhibit all pain indicators.
We extract and process frames from the videos at two frames per second, in clips of four seconds of video each. Each clip is thus composed of eight frames. In order to have enough data to train our models, we use a two-frame (i.e. one-second) overlap between subsequent clips. This way, the dataset contains a total of 12,435 clips (see the dataset overview in Table 1).
Table 1: The overall statistics of the AUDBD dataset. Frames are sampled at two FPS, and clips consist of eight frames with a two-frame overlap. Duration is shown in hh:mm:ss.
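The clip extraction described above amounts to a sliding window over the 2-fps frame sequence; a minimal sketch with the 8-frame window and two-frame overlap (i.e. a stride of six frames) follows, where the function name is ours.

```python
def make_clips(frames, clip_len=8, overlap=2):
    """Split a list of frames (sampled at 2 fps) into clips of `clip_len`
    frames, with `overlap` frames shared by consecutive clips
    (stride = clip_len - overlap)."""
    stride = clip_len - overlap
    return [frames[s:s + clip_len]
            for s in range(0, len(frames) - clip_len + 1, stride)]

# A 60-second video sampled at 2 fps yields 120 frames,
# and make_clips(list(range(120))) produces 19 clips.
```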
5 Experimental Results
5.1 Pose estimation
We measure the performance of the pre-trained HRNet on our dog pain dataset, as well as on StanfordExtra and TigDog. We use the Percentage of Correct Keypoints (PCK) measure, which is the percentage of detected keypoints falling within a normalized distance of the manually annotated ground truth. We use the square root of the bounding box area as the normalization factor. The results are shown in Table 2.
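The PCK measure can be computed as sketched below; predicted and ground-truth keypoints are assumed to be (J, 2) arrays with a visibility mask, the square root of the bounding-box area is the normalization factor as described above, and the threshold value is an illustrative assumption.

```python
import numpy as np

def pck(pred, gt, visible, bbox_area, thr=0.1):
    """Percentage of Correct Keypoints for one image.
    A detected keypoint is correct if its distance to the ground truth is
    below thr * sqrt(bbox_area); only annotated (visible) points count."""
    dists = np.linalg.norm(pred - gt, axis=1)            # (J,) distances
    correct = dists[visible] < thr * np.sqrt(bbox_area)
    return correct.mean() if visible.any() else np.nan
```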
We can see that the pre-trained HRNet has good performance in general, and performs better on the StanfordExtra dataset than on TigDog and AUDBD, since StanfordExtra is visually more similar to the AnimalPose dataset, on which HRNet was trained.
The model with the highest average F1 score and accuracy is C-LSTM+LSTM128, which has an average F1 score of 76.3% ± 4.4 and an average accuracy of 77.0% ± 4.6.
Table 3: Results (% F1-score and accuracy) for different models. RGB: images, KP: keypoints, OF: optical flow. 32, 64, and 128 denote the number of hidden units in the LSTM models used to process the keypoint sequence.
where Z is the product of the width and height of the feature map. The importance of feature map (layer output channel) $k$ for a class $c$ is then weighted according to the corresponding gradient magnitude. The weighted feature maps are passed through a ReLU function to focus on the features that have a positive impact on the class of interest:
$$ L^{c}_{\text{Grad-CAM}} = \mathrm{ReLU}\!\Big( \underbrace{\textstyle\sum_k \alpha_k^c A^k}_{\text{linear combination}} \Big) \qquad (12) $$
The region with the highest magnitude is considered the most important for the classification decision. The output map is up-sampled with interpolation to the size of the original image and superimposed on it as a heatmap.
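Given the feature maps A^k of a convolutional layer and the gradients of the class score with respect to them, the Grad-CAM map of Eq. 12 can be sketched as follows; obtaining the gradients (e.g. via backpropagation hooks) is omitted, and the normalization for visualization is our addition.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """feature_maps, gradients: (K, H, W) arrays for one input.
    The weights alpha_k^c are the spatially averaged gradients; the map is
    the ReLU of their linear combination with the feature maps (Eq. 12)."""
    alpha = gradients.mean(axis=(1, 2))              # (K,) averaged gradients
    cam = np.tensordot(alpha, feature_maps, axes=1)  # sum_k alpha_k * A^k -> (H, W)
    cam = np.maximum(cam, 0)                         # ReLU
    return cam / (cam.max() + 1e-8)                  # normalized for display
```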
Figure 7 shows two dogs in the orthopedic pain condition. The top sequence in Figure 7 is a case that is correctly classified as pain by our ConvLSTM+LSTM model, with a confidence score of 0.95. According to the heatmap, the lower body parts and the head are the regions on which the model focuses most. This is reasonable, since a dog with orthopedic pain would be unstable in its walking motion, and this would show up in its leg movements and head position. We note that the algorithm does not seem to be affected much by the image background, and is able to follow the temporal patterns of the dog.
In the bottom sequence of Figure 7, we see an example sequence where the dog is in pain, but the algorithm classifies it incorrectly as “no pain”, with a low confidence score. From the heatmap, we can see that the algorithm is focusing on the back of the dog, and the legs are not clearly visible from this angle. Therefore, the algorithm cannot reliably detect the pain expression.
Figure 7: Saliency maps for a true positive (top) and a false negative (bottom)
case belonging to the pain class.
6 Conclusions
Recognizing pain-related behaviors in dogs automatically is a challenging problem. Clinical pain assessment methods, such as the Glasgow composite measure pain scale, use multiple behavioral indicators, such as “vocalization, attention to wound, mobility, response to touch, demeanor and posture/activity”, for the assessment [50]. In this paper, we have barely scratched the surface of behavior analysis for this task, and integrated computer vision based tools into a solution that uses a small number of cues. Furthermore, we have worked only on samples with orthopedic and neuropathic pain; other categories, such as organic pain, were not included.
Our results show that both spatio-temporal features in images and keypoint-based descriptions of the motion patterns are useful for the visual analysis. While data restrictions and the risk of overfitting prevented us from fine-tuning pose estimators, we show that generic tools have good performance for this task. Contrary to most work on animal pain estimation, we have not focused on the facial region. This remains future work, but would require further clinical tools, such as validated grimace scales for labeling.
Ethical Impact Statement
The proposed approach cannot be used as a clinical assessment tool for dog pain estimation without further tests and validation. No dogs were harmed in this research, no pain was induced in the subjects, and ethics committee approval was obtained for the preparation of the dog video dataset.
References
[1] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler,
and K. Murphy, “Towards accurate multi-person pose estimation in the
wild,” in Proc. CVPR, 2017, pp. 4903–4911.
[2] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, “RMPE: Regional multi-person
pose estimation,” in Proc. CVPR, 2017, pp. 2334–2343.
[3] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, “Cascaded
pyramid network for multi-person pose estimation,” in Proc. CVPR, 2018,
pp. 7103–7112.
[11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. ECCV. Springer, 2014, pp. 740–755.
[12] M.-F. Tsai and J.-Y. Huang, “Predicting canine posture with smart camera networks powered by the artificial intelligence of things,” IEEE Access, vol. 8, pp. 220848–220857, 2020.
[13] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc.
ICCV, 2017, pp. 2980–2988.
[14] R. Girshick, “Fast R-CNN,” in Proc. ICCV, 2015, pp. 1440–1448.
[25] F. Pessanha, M. Mahmoud, and K. McLennan, “Towards automatic moni-
toring of disease progression in sheep: A hierarchical model for sheep facial
expressions analysis from video,” in Proc. IEEE FG, 2020.
[26] S. Broomé, M. Feighelstein, A. Zamansky, G. C. Lencioni, P. H. Andersen,
F. Pessanha, M. Mahmoud, H. Kjellström, and A. A. Salah, “Going deeper
than tracking: a survey of computer-vision based recognition of animal pain
and affective states,” arXiv preprint arXiv:2206.08405, 2022.
[27] V. Franzoni, A. Milani, G. Biondi, and F. Micheli, “A preliminary work on
dog emotion recognition,” in IEEE/WIC/ACM International Conference
on Web Intelligence-Companion Volume, 2019, pp. 91–96.
[30] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proc. CVPR, 2017, pp. 7263–7271.
[31] B. Xiao, H. Wu, and Y. Wei, “Simple baselines for human pose estimation and tracking,” in Proc. ECCV, 2018, pp. 466–481.
[32] T. Nath, A. Mathis, A. C. Chen, A. Patel, M. Bethge, and M. W. Mathis, “Using DeepLabCut for 3D markerless pose estimation across species and behaviors,” Nature Protocols, vol. 14, no. 7, pp. 2152–2176, 2019. [Online]. Available: https://doi.org/10.1038/s41596-019-0176-0
[38] A. Graves, “Generating sequences with recurrent neural networks,” arXiv
preprint arXiv:1308.0850, 2013.
[39] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venu-
gopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional
networks for visual recognition and description,” in Proc. CVPR, 2015, pp.
2625–2634.
[40] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in Proc. NeurIPS, 2015, pp. 802–810.
[46] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li, “Novel dataset for
fine-grained image categorization: Stanford dogs,” in Proc. CVPRW, 2011.
[47] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari, “Learning
object class detectors from weakly annotated video,” in Proc. CVPR, 2012,
pp. 3282–3289.
[48] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the Kinetics dataset,” in Proc. CVPR, 2017, pp. 6299–6308.
[49] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proc. ICCV, 2017, pp. 618–626.