Fall Detection

This document proposes a motion and region aware adversarial learning framework for fall detection using thermal imaging. The framework contains two 3D convolutional autoencoders that reconstruct thermal video sequences and optical flow inputs. A technique tracks the region of interest in the thermal videos and introduces a region-based difference constraint. The autoencoders and a discriminator are trained jointly in an adversarial manner, where a larger reconstruction error from the autoencoders indicates a fall. Experiments on a thermal fall detection dataset show this approach outperforms standard baselines.


Motion and Region Aware Adversarial Learning for Fall Detection with Thermal Imaging

Vineet Mehta*, Abhinav Dhall†*, Sujata Pal*, Shehroz S. Khan+
*Indian Institute of Technology Ropar, India; †Monash University, Australia; +KITE – University Health Network, Canada

arXiv:2004.08352v2 [cs.CV] 24 Oct 2020

Abstract: Automatic fall detection is a vital technology for ensuring the health and safety of people. Home-based camera systems for fall detection often put people's privacy at risk. Thermal cameras can partially or fully obfuscate facial features, thus preserving the privacy of a person. Another challenge is that falls occur far less often than the normal activities of daily living. Because falls occur rarely, it is non-trivial to train learning algorithms under the resulting class imbalance. To handle these problems, we formulate fall detection as anomaly detection within an adversarial framework using thermal imaging. We present a novel adversarial network that comprises two-channel 3D convolutional autoencoders, which reconstruct the thermal data and the optical flow input sequences, respectively. We introduce a technique to track the region of interest, a region-based difference constraint, and a joint discriminator to compute the reconstruction error. A larger reconstruction error indicates the occurrence of a fall. Experiments on a publicly available thermal fall dataset show superior results compared to the standard baseline.

Index Terms: Fall detection, adversarial learning, thermal

I. INTRODUCTION

Automatic detection of falls is important due to the possibility of severe injury, high cost to the health system, and the psychological effect of a fall. However, because falls occur rarely, traditional supervised machine learning classifiers are ill-suited to this problem [1]. There are also challenges in collecting realistic fall data, as it can put people's lives in danger [1]. Therefore, in many realistic settings, there may be few or no fall data available during training. Due to these skewed data situations, we formulate fall detection as an anomaly detection problem [2]. In this setting, a classifier is trained on only normal activities; during testing, both normal and fall samples are presented to the classifier.

Another challenge in video-based fall detection is preserving the person's privacy, which traditional RGB cameras cannot provide [3]. Thus, detecting falls in videos without explicitly knowing a person's identity is important for the usability of such systems in the real world. Thermal imaging can partially or fully obfuscate a person's identity and has been used in other fall detection applications [2]–[4].

Most recent works have focused on reconstruction-based networks for fall detection using autoencoders [2] and adversarial learning [5]. The adversarial learning framework presents a unique opportunity to train a network not only to mimic the normal activities through a generator but also to discriminate them from abnormal events through a discriminator ([5], [6]). For video-based anomaly detection, the generator is normally some variant of an autoencoder, and the discriminator is a feed-forward neural network, both of which are trained in an adversarial manner. The most successful previous works have mostly focused on learning spatio-temporal features by using 3D Convolutional Autoencoders (3DCAE) and 3D Convolutional Neural Networks (3DCNN) [5].

The performance of video-based fall detection may be marred by differences in background. This may become more prominent in thermal cameras, where the intensities may change due to differences in heat (e.g., when a person enters the scene). Therefore, it is important to focus on the region around the person. The relative motion of the person and the objects around them can also provide useful information to detect falls. Region and motion-based methods ([7], [8]) have shown superior performance in action recognition tasks. Therefore, we hypothesize that learning spatio-temporal features that utilize region and motion awareness in video sequences would improve the detection of falls when trained in an adversarial manner. To this end, we propose a motion and region aware adversarial framework, which consists of two separate channels optimized jointly. The first channel takes the thermal video sequence (with the extracted region of interest) as input, and the second channel takes the corresponding optical flow. The outputs from both channels are combined to give a discriminative score for adversarial learning. We assume that joint training of the thermal and optical flow channels can facilitate the learning of both motion and region-based discriminatory features.

II. RELATED WORK

There is scarce literature on detecting falls in videos in an adversarial manner with thermal cameras. We now review studies that closely match our work.

Fall Detection: With progress in economical camera sensors, several works [9]–[11] use RGB cameras for data capture. One major limitation of RGB sensors is the lack of privacy, as the identity of the subject is not preserved. To overcome this limitation, Vadivelu et al. [3] presented one of the first fall detection works on thermal data. Further, Nogas et al. [4] proposed using thermal cameras and a recurrent Convolutional AutoEncoder (CAE) for fall detection. Motivated by these works, we also use thermal camera data in our experiments. Readers are also pointed to recent surveys [1], [12] on fall detection for more insight into the different techniques proposed in the literature. Most of the recent works on fall detection using thermal and depth cameras formulate it as anomaly detection.
Anomaly Detection: Given the rare nature of fall events, we follow the works in abnormal event detection, which is conceptually similar to our work. Many of the recent anomaly detection methods ([13], [14]) are based on the one-class classification paradigm, in which the distribution of normal events is learned using autoencoders and deviations from the learned distribution are detected as anomalies at test time. Hasan et al. [13] learnt the normal motion patterns in videos using hand-crafted features and a CAE. Ravanbakhsh et al. [14] proposed an adversarial approach for abnormal event detection that generates flow from video and vice versa. Khan et al. [15] proposed the use of a 3DCAE for abnormal event detection applied to fall detection. Sabokrou et al. [16] proposed an end-to-end adversarial network that consists of a generator, which reconstructs the input with added noise, and a discriminator, which discriminates the reconstructed output from the actual input. Further, Khan et al. [5] extend the work of Sabokrou et al. [16] from a single image to a sequence of images for fall detection using a spatio-temporal adversarial learning framework.

Since a fall is a spatio-temporal change in a subject's pose, a limitation of that work is that the motion information is not explicitly added into the network. We build upon the adversarial learning work of Khan et al. [5] and propose a two-channel network, with one channel explicitly learning the motion in the form of optical flow while the other takes raw video frames as input. Our proposed approach can handle situations where a person may not be present in a frame, which may reduce the false positive rate.

III. METHODS

Our proposed adversarial framework consists of two channels. The input to the first channel is a window of thermal frames, and the input to the second channel is a window of optical flow frames. Each channel consists of (i) a 3DCAE to reconstruct the input window and (ii) a 3DCNN to discriminate the reconstruction from the original window of frames; the two channels are joined by a single neuron to form a joint discriminator (Fig. 1). We trained this framework using only Activities of Daily Living (ADL) from thermal frames. We perform person tracking and extract the Region of Interest (ROI) from the thermal and the optical flow frames for motion and region-based reconstruction.

Fig. 1. The proposed adversarial network: the top channel takes a window of thermal frames and the bottom channel takes a window of optical flow frames as input.

A. Adversarial Framework

1) 3DCAE-3DCNN: Our 3DCAE architecture is similar to that of Khan et al. [5]. We extend their network by adding a channel that takes optical flow as input (see Table I). We use 3D filters of 3 × 3 with a temporal depth of 5 in all layers of the 3DCAE, following Khan et al. [5]. The operations are the same in the flow 3DCAE except for the second deconvolution layer, which uses filters of 2 × 2 with a temporal depth of 4 to reconstruct the odd-length temporal depth.

TABLE I
CONFIGURATION OF THE 3DCAE.

Block     Thermal 3DCAE                  Flow 3DCAE
Input     (8, 64, 64, 1)                 (7, 64, 64, 1)
Encoder   3D Conv - (8, 64, 64, 16)      3D Conv - (7, 64, 64, 16)
          3D Conv - (8, 32, 32, 8)       3D Conv - (7, 32, 32, 8)
          3D Conv - (4, 16, 16, 8)       3D Conv - (4, 16, 16, 8)
          3D Conv - (2, 8, 8, 8)         3D Conv - (2, 8, 8, 8)
Decoder   3D Deconv - (4, 16, 16, 8)     3D Deconv - (4, 16, 16, 8)
          3D Deconv - (8, 32, 32, 8)     3D Deconv - (7, 32, 32, 8)
          3D Deconv - (8, 64, 64, 16)    3D Deconv - (7, 64, 64, 16)
          3D Deconv - (8, 64, 64, 1)     3D Deconv - (7, 64, 64, 1)

The architecture of the 3DCNN is the same as the encoder of the 3DCAE, followed by a single neuron with a sigmoid function that outputs the probability that a sequence of frames is original rather than reconstructed. Batch normalization is used in all the layers of the 3D discriminator except for the input layer. LeakyReLU activation is used in all hidden layers, with the negative slope coefficient set to 0.2 ([5], [17]).

2) Adversarial Learning: In this section, we first explain the general adversarial training as described in the work of Khan et al. [5]. This model consists of a 3DCAE (denoted R), which takes an input sequence I of window size T and reconstructs it as an output sequence O; O is then fed to a 3DCNN (denoted D), which R tries to fool. R and D are trained using the standard GAN loss:

$$L_{R+D} = \mathbb{E}_{I \sim p}[\lg D(I)] + \mathbb{E}_{O \sim p}[\lg(1 - D(O))] \tag{1}$$

We combine the adversarial loss with the Mean Squared Error (MSE) loss, which is used only for R and is defined as

$$L_R = \mathbb{E}[(I - O)^2] \tag{2}$$

The total loss function minimized when training R is defined as:

$$L = L_{R+D} + \lambda L_R \tag{3}$$

where $\lambda$ is a positive hyperparameter that weights the loss. The notations used for the thermal and flow networks are $(I_T, O_T, R_T, D_T)$ and $(I_F, O_F, R_F, D_F)$, respectively.

B. ROI Extraction

The performance of video-based fall detection methods may be affected by background artifacts. This situation can get worse with a thermal camera, because changes in heat can alter the background and the pixel intensity of frames in a video sequence [2]. Therefore, we reconstruct only the region where the person is present, which is not affected much by changes in background objects and intensity. We perform person tracking using an object detector and image processing techniques to localize the person in an image.

1) Person Detection: To the best of our knowledge, there are no pre-trained, publicly available deep learning-based models for person detection specifically for thermal images. To this end, we used the Region-based Fully Convolutional Network (R-FCN) [18] trained on the COCO dataset [19]. As the TSF dataset contains only one subject in a frame, the bounding box with the highest confidence score is selected. There are no false proposals by the detector; however, the localized bounding box is found to fluctuate in size and position, which degrades the prediction by the tracking method.

2) Contour Box Localization: In thermal images, a person may appear brighter than the background due to differences in the heat emitted by a person and by objects. Therefore, Otsu thresholding [20] is applied to the thermal image to separate dark
background, as shown in Fig. 2. The thresholded image may still contain bright background objects. We find the contours [21] on the thresholded image after applying morphological operations and select the biggest contour on the basis of its enclosed area. The smallest box containing that contour blob is chosen as a candidate for the person bounding box.

Fig. 2. Contour box localization process (Section III-B2).

3) Tracking: We apply Kalman filtering on the top-left and bottom-right coordinates of the bounding box with the constant velocity assumption. The tracker is initialized with the person detector and predicts the bounding box for the next frame. We compare the predicted box with the person detection bounding box (if detected in the next frame) to check whether the tracker drifts. We use a counter (age) to track the number of continuous tracker predictions without detection. In the case of no detection, the tracker's age is increased, and when the age of the tracker exceeds a limit of 20, the tracker is stopped. The Intersection over Union (IoU) is used in many tracking methods to match the bounding boxes. However, IoU is small when one box is large compared to the other, which could happen due to bad localization by the detector. Therefore, we also used other criteria, such as the ratio of areas and checking whether one box is a subset of the other. At a particular instant, there could be at most three possible candidates for the person localization: the Detect, Contour, and Track boxes. The Detect box confirms the presence of the person, but it does not fit the person in most cases. Therefore, the Contour box and the Track box are used to improve the overall tracking. Algorithm 1 describes the whole tracking method.

4) Region based reconstruction: We remove the frames in which a person is not localized after tracking, and the rest of the frames are masked by their corresponding bounding box (named the ROI mask). For region-based reconstruction, the 3DCAE is fed with the window of masked frames, and a region-based reconstruction loss $L_{ROI}$ (instead of $L_R$) is used:

$$L_{ROI} = \mathbb{E}_{ROI}[(ROI(I) - ROI(O))^2] \tag{4}$$

where $ROI(X)$ represents the masking of frames in window $X$ with the corresponding ROI masks, and the expectation is taken over pixels inside the ROI.

C. Motion Constraint and Reconstruction

Besides the appearance-based constraint on R, we incorporate motion in the fall detection system in two ways:

1) Difference Constraint: Mathieu et al. [22] compute difference and gradient-based losses for future frame prediction, which increase the sharpness of the predicted frame. We adapted a similar technique to add an additional loss term in the thermal 3DCAE, which is based on the MSE of the difference frames of I and O. A difference frame is a residual map computed by subtracting two consecutive frames. We mask the difference frames by their respective ROI, which is the union of the ROIs of the two frames used to compute the difference frame. The difference loss is defined as:

$$L_{Diff} = \mathbb{E}_{ROI}[(ROI(DF(I)) - ROI(DF(O)))^2] \tag{5}$$

where $DF(X)$ represents the difference frames for the window $X$. Therefore, the final loss for R in (3) with $L_{ROI}$ and $L_{Diff}$ is defined as:

$$L = L_{R+D} + \lambda_S L_{ROI} + \lambda_D L_{Diff} \tag{6}$$
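The reconstruction terms of Eqs. (4)-(6) can be illustrated with a minimal NumPy sketch. It assumes that a window is an array of shape (T, H, W) and that the ROI masks are boolean arrays of the same shape; the function names and default weights are hypothetical and are not taken from the authors' implementation. The adversarial term $L_{R+D}$ is deliberately left out, so only the weighted reconstruction part of Eq. (6) is computed.

```python
import numpy as np


def roi_mse(inp, out, masks):
    """Eq. (4): squared error averaged only over pixels inside the ROI."""
    inside = np.asarray(masks, dtype=bool)
    if not inside.any():
        return 0.0  # assumption: an empty ROI contributes no loss
    return float(((inp - out) ** 2)[inside].mean())


def difference_frames(window):
    """Residual maps of consecutive frames: (T, H, W) -> (T-1, H, W)."""
    return window[1:] - window[:-1]


def diff_loss(inp, out, masks):
    """Eq. (5): ROI-masked MSE between difference frames of input and output.

    The ROI of a difference frame is the union of the ROIs of the two
    frames it was computed from.
    """
    masks = np.asarray(masks, dtype=bool)
    union = np.logical_or(masks[1:], masks[:-1])
    return roi_mse(difference_frames(inp), difference_frames(out), union)


def reconstruction_terms(inp, out, masks, lam_s=1.0, lam_d=1.0):
    """Weighted reconstruction part of Eq. (6), without the adversarial term."""
    return lam_s * roi_mse(inp, out, masks) + lam_d * diff_loss(inp, out, masks)
```

A reconstruction that is off by a constant inside the ROI is penalized by the ROI term but not by the difference term, since a constant offset cancels when consecutive frames are subtracted; this is one way to see why the two terms carry different information.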
In Eq. (6), $\lambda_S$ and $\lambda_D$ are positive constants that weight the loss terms.

2) Optical Flow Reconstruction: Liu et al. [23] use a constraint on the optical flow of the ground truth and reconstructed images for future frame prediction. They used a CNN-based model for flow estimation, which can facilitate backpropagation for the optical flow loss, but these models are trained on RGB images and may generate noise due to temperature changes in a thermal video. Therefore, we train a spatio-temporal adversarial network $(R_F, D_F)$ for flow reconstruction, which takes a window of optical flow frames as input. We compute the dense optical flow for each pair of consecutive frames [24]. We stack the flow in the x and y directions and the magnitude to form a 3-dimensional image (similar to [25]).

The flow images are masked with their ROI to remove noise due to temperature variations. As defined earlier for the difference frame, the ROI for a flow image is the union of the ROIs of the two thermal frames used to compute the optical flow.

Algorithm 1: Tracking Algorithm
Input : Frame
Output: FinalBox
FinalBox=None;
DetectBox=Detector.GetLocalization(Frame);
if DetectBox is not None then
    ContourBox=GetBiggestContourBox(Frame);
    if Tracker is not None then
        TrackBox=Tracker.GetCurrentBox();
        if TrackBox matches with DetectBox then
            DetectBox=BoxSelection(DetectBox,TrackBox);
        else
            Tracker=None;
        end
    end
    if ContourBox matches with DetectBox then
        DetectBox=BoxSelection(DetectBox,ContourBox);
    end
    FinalBox=DetectBox;
    if Tracker is not None then
        Tracker.KalmanFilter(DetectBox);
        Tracker.Losses=0;
    else
        Tracker=InitializeTracker(DetectBox);
    end
else
    if Tracker is not None then
        TrackBox=Tracker.Predict();
        ContourBox=GetBiggestContourBox(Frame);
        if ContourBox matches with TrackBox then
            Tracker.Losses+=0.5;
            FinalBox=ContourBox;
            Tracker.KalmanFilter(ContourBox);
        else
            Tracker.CurrentBox=TrackBox;
            FinalBox=TrackBox;
            Tracker.Losses+=1;
        end
    end
end
if Tracker is not None and Tracker.Losses greater than MaxAge then
    Tracker=None;
    FinalBox=None;
end
Return FinalBox;

D. Thermal and Optical Flow Fusion

We propose a joint adversarial network for thermal and optical flow reconstruction, which consists of two 3DCAEs and a single joint discriminator. In the joint discriminator network $D_{TF}$, the individual 3DCNN networks $D_T$ and $D_F$ are used, and their two individual sigmoid neurons are replaced by a single sigmoid neuron (see Figure 1). For region-based reconstruction, the joint loss function is defined as:

$$L = L_{R+D} + \lambda_T L_{T_{ROI}} + \lambda_F L_{F_{ROI}} \tag{7}$$

where $L_{T_{ROI}}$ and $L_{F_{ROI}}$ are the region-based reconstruction losses for thermal and optical flow, respectively. The hyperparameters $\lambda_T$ and $\lambda_F$ are their corresponding positive weighting constants. The joint loss function with region-based reconstruction and the difference constraint can be written as:

$$L = L_{R+D} + \lambda_{T_S} L_{T_{ROI}} + \lambda_{T_D} L_{T_{Diff}} + \lambda_F L_{F_{ROI}} \tag{8}$$

where $L_{T_{Diff}}$ (Eq. (5)) and $L_{T_{ROI}}$ (Eq. (4)) are the difference constraint and the region-based reconstruction loss of the thermal channel, respectively.

IV. DETECTING UNSEEN FALLS

The strategy to detect unseen falls is shown in Figure 3 (derived from [2]). All the frames in the video, $Fr_i$, are broken down into windows of length T = 8 using the sliding window method with stride = 1. For the $i$-th window $I_i$, the 3DCAE outputs a reconstruction of this window, $O_i$. The reconstruction error ($R_{i,j}$) between the $j$-th frame of $I_i$ and $O_i$ can be calculated as (similar to Eq. (2))

$$R_{i,j} = \mathbb{E}[(I_{i,j} - O_{i,j})^2] \tag{9}$$

For the region-based reconstruction models, it can be defined (similar to Eq. (4)) as follows:

$$R_{i,j} = \mathbb{E}_{ROI}[(ROI(I_{i,j}) - ROI(O_{i,j}))^2] \tag{10}$$

There are two ways to detect unseen falls, at the frame level or at the window level, which are described below.

A. Frame Level Anomaly Score

In frame level anomaly detection, the reconstruction error ($R_{i,j}$) (obtained from the 3DCAE) is computed for every $j$-th frame across different windows. The average ($C^j_\mu$) and
standard deviation ($C^j_\sigma$) of the $R_{i,j}$ across different windows are used as an anomaly score for the $j$-th frame as follows [2]:

$$C^j_\mu = \begin{cases} \frac{1}{j} \sum_{i=1}^{j} R_{i,j} & j < T \\ \frac{1}{T} \sum_{i=j-T+1}^{j} R_{i,j} & j \geq T \end{cases} \qquad C^j_\sigma = \begin{cases} \sqrt{\frac{1}{j} \sum_{i=1}^{j} (R_{i,j} - C^j_\mu)^2} & j < T \\ \sqrt{\frac{1}{T} \sum_{i=j-T+1}^{j} (R_{i,j} - C^j_\mu)^2} & j \geq T \end{cases} \tag{11}$$

Fig. 3. Temporal sliding window method, showing the error ($R_{i,j}$) per frame ($Fr_j$) with T = 8 (Figure Source: [5]).

B. Window Level Anomaly Score

For window level anomaly detection, the anomaly score is calculated for the entire window. For a particular window $W_i$ of length T, the mean of the reconstruction error of all the T frames ($W^i_\mu$) and their standard deviation ($W^i_\sigma$) are used as an anomaly score [5]:

$$W^i_\mu = \frac{1}{T} \sum_{j=i}^{T+i-1} R_{i,j}, \qquad W^i_\sigma = \sqrt{\frac{1}{T} \sum_{j=i}^{T+i-1} (R_{i,j} - W^i_\mu)^2} \tag{12}$$

When using the difference constraint in thermal reconstruction models, we also compute another window level anomaly score using the reconstruction error of the difference frames (Eq. (5)), by taking the mean and standard deviation of the reconstruction error over T-1 frames (similar to Eq. (12) with window length T-1). Similarly, we compute the window level anomaly scores for the optical flow reconstruction models using the reconstruction error of the optical flow frames.

For region-based reconstruction models with the difference constraint, we name the anomaly scores computed using thermal frames the ROI-score and the anomaly scores computed using difference frames the Diff-score. For fusion models, we use the terms Thermal ROI-score, Flow ROI-score and Thermal Diff-score to compare the different anomaly scores calculated for the same model. We define the tolerance ($\alpha$) as the number of fall frames required in a window to set the ground-truth label of the entire window as a fall. We varied $\alpha$ from 1 to T to understand its impact on the results.

V. EXPERIMENTS

Dataset: We use the publicly available TSF dataset [3], which contains 44 thermal videos of resolution 640x480. There are 9 videos with normal ADL and 35 videos containing falls and other normal activities. The ADL frames include different scenarios such as an empty room, a person entering a room, sitting on a chair, or lying in bed, whereas the fall frames include a person falling from a chair or bed, or falling while walking. Some of the ADL and fall frames are shown in Fig. 4. The ADL videos contain a total of 22,116 frames.

Fig. 4. TSF dataset [3] samples: top-left shows an empty room frame; bottom-left shows a fall while walking; the right column shows ADL frames (walking and lying on a bed, respectively).

Data Processing: We perform person tracking on the thermal videos; when a person is not detected for 10 consecutive frames, we break the video and create sub videos, resulting in 22 sub videos, which contain 12,454 frames from the ADL videos for training. For ROI computation, the detection threshold is empirically set to 0.3. We apply the sliding window method to the sub videos with stride = 1 to create windows of length T = 8, which creates a total of 12,300 windows from the ADL sub videos. We normalize each frame to the range [−1, 1] [17] and then resize it to 64x64.

ROI processing: For the input to region-based networks, we normalize the pixels inside the ROI and mask the frame with the ROI mask. We set the pixels outside the ROI to −1. Finally, an ROI-masked frame contains pixel values in the range [−1, 1] inside the ROI and −1 outside the ROI.

Optical flow image pre-processing: Each channel of a flow image is re-scaled to the range [−1, 1] by min-max normalization. For region-based reconstruction in flow images, we perform ROI masking as described earlier for the normalized thermal image, where the ROI of the optical flow frame is the union of the ROIs of the corresponding consecutive frames.
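The pre-processing steps above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, thermal frames are assumed to be 8-bit images with values in [0, 255], the ROI is assumed to be an axis-aligned box (x1, y1, x2, y2), and the resize to 64x64 is omitted because it depends on an image library.

```python
import numpy as np


def normalize_frame(frame, lo=0.0, hi=255.0):
    """Linearly map pixel values from [lo, hi] to [-1, 1]."""
    frame = np.asarray(frame, dtype=np.float32)
    return 2.0 * (frame - lo) / (hi - lo) - 1.0


def box_mask(shape, box):
    """Boolean mask that is True inside box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    mask = np.zeros(shape, dtype=bool)
    mask[y1:y2, x1:x2] = True
    return mask


def mask_roi(frame, mask, fill=-1.0):
    """Keep pixels inside the ROI and set all other pixels to `fill`."""
    return np.where(mask, frame, fill)


def minmax_flow(flow):
    """Rescale each channel of an (H, W, C) flow image to [-1, 1]."""
    flow = np.asarray(flow, dtype=np.float32)
    lo = flow.min(axis=(0, 1), keepdims=True)
    hi = flow.max(axis=(0, 1), keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant channels
    return 2.0 * (flow - lo) / span - 1.0


def sliding_windows(frames, length=8, stride=1):
    """Stack overlapping windows: (N, ...) -> (num_windows, length, ...)."""
    frames = np.asarray(frames)
    starts = range(0, len(frames) - length + 1, stride)
    if len(starts) == 0:
        return np.empty((0, length) + frames.shape[1:], dtype=frames.dtype)
    return np.stack([frames[s:s + length] for s in starts])


# ROI of a flow frame: union of the ROIs of its two source frames.
flow_roi = np.logical_or(box_mask((64, 64), (10, 10, 30, 40)),
                         box_mask((64, 64), (12, 10, 32, 40)))
```

With stride 1, a sub video of N frames yields N − T + 1 windows. For 22 sub videos with 12,454 frames in total and T = 8, this gives 12,454 − 22 × 7 = 12,300 windows, which agrees with the count reported above.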
Fig. 5. Qualitative analysis: the middle frame of the input and output window reconstructed by different models. The top row images [(a)-(e)] and bottom row images [(f)-(j)] are the inputs and corresponding outputs of the thermal data channel and flow magnitude channel, respectively. Panels: (a) 3DCAE input; (b) 3DCAE output; (c) ROI-3DCAE input; (d) ROI-3DCAE output; (e) ROI-3DCAE masked output; (f) 3DCAE flow input; (g) 3DCAE flow output; (h) ROI-3DCAE flow input; (i) ROI-3DCAE flow output; (j) ROI-3DCAE masked flow output.

A. Network Implementations

We train two adversarial models, one each for thermal window (Thermal-3DCAE) and optical flow window (Flow-3DCAE) reconstruction. For region-based reconstruction, we train two adversarial models on thermal data: one with ROI masking and the ROI loss (Eq. (4)) described earlier (Thermal-ROI-3DCAE), and the other with the addition of the difference constraint in the region-based reconstruction (Thermal-Diff-ROI-3DCAE). We train one adversarial model for optical flow with region-based reconstruction, named Flow-ROI-3DCAE. For the fusion networks, we train two models: one without the difference constraint (Fusion-ROI-3DCAE) and one with it (Fusion-Diff-ROI-3DCAE).

We use the SGD optimizer with a learning rate of 0.0002 for the 3DCNN discriminator and the Adadelta optimizer for the 3DCAE in all the adversarial models. All the models are trained for 300 epochs. The hyperparameters (λ's) used for the weighted loss (Eqs. (3), (6) and (7)) are varied between three values [0.1, 1 and 10]. We found that large values of these hyperparameters led to mode collapse. The best hyperparameter setting for Thermal-Diff-ROI-3DCAE and Fusion-Diff-ROI-3DCAE has all the constants equal to 1, whereas the best setting for the rest of the models has all the constants equal to 0.1. The full code of our implementation is available at https://github.com/ivineetm007/Fall-detection.

B. Evaluation Metrics

For assessing the performance of detecting falls as anomalies, the Area Under Curve (AUC) of the Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves is used. The latter is used to specifically focus on the detection of the minority 'fall' class. We compute the anomaly scores at the frame and window level, each of which comes in two types: mean and standard deviation anomaly scores. We calculate and compare the AUC of the ROC and PR curves using all these scores.

C. Ablation Studies

1) 3DCAE: The thermal input and the output reconstructed by Thermal-3DCAE can be seen in Fig. 5. The basic techniques to utilize the ROI in deep learning models are resizing and ROI pooling. We also train two different Thermal-3DCAE models by changing the thermal input: (1) resizing the input ROI to 64x64, and (2) ROI pooling to a 64x64 dimension. We observe that these techniques increase the false positives, and the results for them are not reported. We argue that resizing the ROI leads to geometric distortions and introduces false motion on the borders even if the subject is not moving.

Optical Flow: As described earlier, optical flow images contain noise due to temperature variation, and as a result the reconstruction is also noisy (Fig. 5(g)).

Fig. 6. Frame level anomaly score for a fall video.

2) ROI-3DCAE: The output of Thermal-ROI-3DCAE and Flow-ROI-3DCAE is shown in Fig. 5. We observe that the
region-based method improves the reconstruction quality in the ROI region, as the model learns to reconstruct only the ROI (see Fig. 5 (c), (d), and (e)). Similar behaviour is observed for Flow-ROI-3DCAE (see Fig. 5 (h), (i) and (j)).

3) Difference constraint: To understand the impact of the difference constraint, we compare the results computed at window level (see Fig. 7) for the AUC of ROC and PR. Comparing the Diff score and the ROI score of Thermal-Diff-ROI-3DCAE, we found that the results using the Diff score are better than those using the ROI score in all the plots, which suggests that the Diff score is more suitable for window level analysis. We also compare the results of Thermal-ROI-3DCAE with the Diff score of Thermal-Diff-ROI-3DCAE and found that the addition of the constraint increases the AUC for both the ROC and PR curves. Similar behaviour was observed when comparing Fusion-ROI-3DCAE and Fusion-Diff-ROI-3DCAE. This suggests that the difference constraint makes the model more discriminative in the temporal direction and increases the overall performance.

Fig. 7. Plots of AUC values of the ROC and PR curves computed using window level anomaly scores with the variation in tolerance: $W_\mu$ (Left) and $W_\sigma$ (Right).

4) Fusion models: To understand the effect of fusion, we first compare the results of Fusion-ROI-3DCAE with the ROI-based models at frame level, as shown in Table II; we found that there are minor improvements in frame-level results. We then compare the window level results (Fig. 7) of Thermal-ROI-3DCAE and Flow-ROI-3DCAE with the Thermal-ROI score and Flow-ROI score of Fusion-ROI-3DCAE, respectively. We observe a small increment in the results for the thermal ROI score and a substantial increment in the results for the flow ROI score, which indicates that joint learning improved the learning of the flow reconstructor.

TABLE II
AUC OF ROC AND PR BASED FRAME LEVEL ANOMALY COMPARISON.

Method                                         ROC Cµ   ROC Cσ   PR Cµ   PR Cσ
Thermal-3DCAE                                  0.88     0.90     0.47    0.48
Thermal-ROI-3DCAE                              0.89     0.92     0.55    0.57
Thermal-Diff-ROI-3DCAE (ROI score)             0.90     0.92     0.57    0.56
Fusion-ROI-3DCAE (Thermal ROI score)           0.90     0.93     0.56    0.58
Fusion-Diff-ROI-3DCAE (Thermal ROI score)      0.90     0.93     0.57    0.57

Qualitative Analysis: We use the frame-based anomaly scores to visualize the performance of the proposed method (see Fig. 6). Although our model is able to detect fall events, we observed a false peak when the person enters the room, and our tracking method misses some frames when the person enters.

VI. RESULTS

The AUC values of the ROC and PR curves computed using frame level anomaly scores are shown in Table II. For window level results, we plot the AUC values with the variation of the tolerance ($\alpha$) from 1 to 8 (see Fig. 7). The frame level and window level results can be summarized as:

1) There is an improvement in the AUC of the ROC and PR curves from region-based reconstruction, which confirms the importance of region awareness.
2) Addition of the difference constraint increases the AUC values using window level scores, which indicates its importance for the learning of spatio-temporal autoencoders.
3) Fusion models lead to an increase in the performance.

Comparison with the existing methods: In the previous works using DSTCAE-C3D [2], Conv-LSTM AE [4] and
3DCAE-3DCNN [5], AUC values of ROC computed using frame-level anomaly score are reported
(see Table III ‘All frames’ column). The previous methods do not perform person tracking in
the video, due to which the number of frames used for training and testing is different.
Therefore, we train and test these methods using the frames on which a person is localized
by our tracking method. The comparison of these methods with the proposed model on tracked
frames only is shown in Table III and summarized as:
1) AUC values of previous methods decrease when only tracked frames are used. Furthermore,
the empty frames in train and test set videos are similar. Low reconstruction error on these
frames may give high AUC values in previous methods (Table III, ‘All frames’) during testing.
2) The proposed method achieves similar or better AUC of ROC than previous methods and
higher AUC of PR against all previous methods. The method focuses on the region in the frame
where a person is present; therefore, it can facilitate learning of background agnostic
models.

TABLE III
COMPARISON WITH THE PREVIOUS METHODS BASED ON AUC OF ROC AND PR CURVE CALCULATED ON FRAME
LEVEL ANOMALY SCORES.
Method                    All frames    Tracked frames
                          ROC           ROC           PR
                          Cµ    Cσ      Cµ    Cσ      Cµ    Cσ
Conv-LSTM AE [4]          0.76  0.83    0.63  0.73    0.26  0.37
DSTCAE-C3D [2]            0.93  0.97    0.85  0.90    0.46  0.53
3DCAE-3DCNN [5]           0.95  0.95    0.90  0.88    0.47  0.48
Fusion-Diff-ROI-3DCAE     -     -       0.90  0.93    0.57  0.57

VII. CONCLUSION AND FUTURE WORK

Fall detection is a non-trivial problem due to large imbalance in data; thus, we formulate
it as an anomaly detection problem. In this formulation, the model is trained only on normal
ADL and predicts whether a test sample is a normal ADL or a fall. Building upon the
advantages of the adversarial learning paradigm, we present a two channel adversarial
learning framework to learn spatio-temporal features by extracting the ROI and its generated
optical flow, followed by a joint discriminator. We note that the introduction of the
person-ROI and the difference loss function increases the performance. The major improvement
in comparison to previous methods is the increase in AUC of the PR curve. The optical flow
information is also useful in the network, and the fused method performs better than the raw
thermal analysis only. In the future, we plan to extend the proposed techniques to detect
falls using multiple camera modalities, including depth and IP cameras.

REFERENCES

[1] S. S. Khan and J. Hoey, “Review of fall detection techniques: A data availability perspective,” Medical engineering & physics, vol. 39, pp. 12–22, 2017.
[2] J. Nogas, S. S. Khan, and A. Mihailidis, “Deepfall: Non-invasive fall detection with deep spatio-temporal convolutional autoencoders,” Journal of Healthcare Informatics Research, vol. 4, no. 1, pp. 50–70, 2020.
[3] S. Vadivelu, S. Ganesan, O. R. Murthy, and A. Dhall, “Thermal imaging based elderly fall detection,” in Asian Conference on Computer Vision. Springer, 2016, pp. 541–553.
[4] J. Nogas, S. S. Khan, and A. Mihailidis, “Fall detection from thermal camera using convolutional lstm autoencoder,” in Proceedings of the 2nd workshop on Aging, Rehabilitation and Independent Assisted Living, IJCAI Workshop, 2018.
[5] S. S. Khan, J. Nogas, and A. Mihailidis, “Spatio-temporal adversarial learning for detecting unseen falls,” Pattern Analysis and Applications, pp. 1–11, 2020.
[6] T. Schlegl, P. Seeböck, S. M. Waldstein, G. Langs, and U. Schmidt-Erfurth, “f-anogan: Fast unsupervised anomaly detection with generative adversarial networks,” Medical image analysis, vol. 54, pp. 30–44, 2019.
[7] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei, “Recurrent tubelet proposal and recognition networks for action detection,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 303–318.
[8] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[9] G. Baldewijns, G. Debard, G. Mertes, B. Vanrumste, and T. Croonenborghs, “Bridging the gap between real-life data and simulated data by providing a highly realistic fall dataset for evaluating camera-based fall detection algorithms,” Healthcare technology letters, vol. 3, no. 1, pp. 6–11, 2016.
[10] K. Sehairi, F. Chouireb, and J. Meunier, “Elderly fall detection system based on multiple shape features and motion analysis,” in 2018 International Conference on Intelligent Systems and Computer Vision (ISCV). IEEE, 2018, pp. 1–8.
[11] S. Ezatzadeh and M. R. Keyvanpour, “Vifa: an analytical framework for vision-based fall detection in a surveillance environment,” Multimedia Tools and Applications, vol. 78, no. 18, pp. 25515–25537, 2019.
[12] L. Ren and Y. Peng, “Research of fall detection and fall prevention technologies: A systematic review,” IEEE Access, vol. 7, pp. 77702–77722, 2019.
[13] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, “Learning temporal regularity in video sequences,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 733–742.
[14] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni, and N. Sebe, “Abnormal event detection in videos using generative adversarial nets,” in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 1577–1581.
[15] S. S. Khan, M. E. Karg, D. Kulić, and J. Hoey, “Detecting falls with x-factor hidden markov models,” Applied Soft Computing, vol. 55, pp. 168–177, 2017.
[16] M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli, “Adversarially learned one-class classifier for novelty detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3379–3388.
[17] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
[18] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via region-based fully convolutional networks,” in Advances in neural information processing systems, 2016, pp. 379–387.
[19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[20] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Transactions on Systems, Man, and Cybernetics, pp. 62–66, 1979.
[21] S. Suzuki et al., “Topological structural analysis of digitized binary images by border following,” Computer Vision, Graphics, and Image Processing, pp. 32–46, 1985.
[22] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” arXiv preprint arXiv:1511.05440, 2015.
[23] W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for anomaly detection–a new baseline,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6536–6545.
[24] G. Farnebäck, “Two-frame motion estimation based on polynomial expansion,” in Scandinavian conference on Image analysis. Springer, 2003, pp. 363–370.
[25] G. Gkioxari and J. Malik, “Finding action tubes,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
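
The frame-level evaluation reported in Tables II and III (AUC of the ROC and PR curves computed on the anomaly scores Cµ and Cσ) can be illustrated with a minimal sketch. It is not the authors' code. It assumes that Cµ and Cσ are the mean and standard deviation of a frame's reconstruction error across the overlapping windows that contain it, in the style of the cross-context scores of [2]. It also assumes that average precision is an acceptable estimator of the area under the PR curve; the paper may use a different estimator. The function names `cross_context_scores`, `roc_auc` and `average_precision`, and all numbers in the example, are hypothetical.

```python
from statistics import mean, pstdev


def cross_context_scores(window_errors, num_frames, stride=1):
    """Aggregate per-window, per-frame reconstruction errors into frame scores.

    window_errors[i][j] is the error of the j-th frame inside window i;
    window i is assumed to start at frame i * stride, and every frame is
    assumed to be covered by at least one window.
    Returns (c_mu, c_sigma): per-frame mean and standard deviation.
    """
    per_frame = [[] for _ in range(num_frames)]
    for i, errors in enumerate(window_errors):
        for j, err in enumerate(errors):
            per_frame[i * stride + j].append(err)
    c_mu = [mean(errs) for errs in per_frame]
    c_sigma = [pstdev(errs) for errs in per_frame]
    return c_mu, c_sigma


def roc_auc(labels, scores):
    """AUC of ROC as the probability that a fall frame (label 1) scores
    higher than a normal frame (label 0); ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the PR curve (average precision).

    Frames are ranked by decreasing score; tied scores keep input order,
    which is a simplification.
    """
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    true_pos = 0
    total = 0.0
    for rank, k in enumerate(order, start=1):
        if labels[k] == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / sum(labels)


# Hypothetical example: 4 windows of 3 frames, stride 1, 6 frames in total.
window_errors = [
    [0.1, 0.2, 0.3],
    [0.2, 0.5, 0.9],
    [0.6, 0.8, 0.4],
    [0.7, 0.3, 0.1],
]
labels = [0, 0, 1, 1, 0, 0]  # 1 marks a fall frame
c_mu, c_sigma = cross_context_scores(window_errors, num_frames=6)
print("ROC AUC (C_mu):", roc_auc(labels, c_mu))
print("PR AUC (C_mu):", average_precision(labels, c_mu))
print("ROC AUC (C_sigma):", roc_auc(labels, c_sigma))
```

Because falls are rare, the PR-based figure is the more sensitive of the two to false alarms on normal frames, which is consistent with the paper reporting its main improvement in AUC of the PR curve.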
