Anomalous Behavior Detection Using Spatio Temporal Feature
by
Jannatun Nahar
18101291
Zarin Tasnim Promi
18101589
Jannatul Ferdous
18101565
Fatin Ishrak
21301716
Ridah Khurshid
18101683
1. The thesis submitted is our own original work, completed while pursuing our degree at Brac
University.
3. The thesis does not contain material which has been accepted, or submitted,
for any other degree or diploma at a university or other institution.
Ridah Khurshid
18101683
Approval
The thesis/project titled “Anomalous behavior detection using Spatio temporal Feature and 3D CNN model for Surveillance” submitted by
1. Jannatun Nahar (18101291)
2. Zarin Tasnim Promi (18101589)
3. Jannatul Ferdous (18101565)
4. Fatin Ishrak (21301716)
5. Ridah Khurshid (18101683)
of Spring 2022 has been accepted as satisfactory in partial fulfillment of the requirement for the degree of B.Sc. in Computer Science in January 2022.
Examining Committee:
Supervisor: (Member)
Ethics Statement
We thus declare that this thesis is based on the findings of our research. All other
sources of information have been acknowledged in the text. This thesis has not been
previously submitted, in whole or in part, to any other university or institute for
the granting of any degree.
Abstract
Anomalous and violent action detection has become an increasingly relevant topic
and an active research domain of computer vision and video processing within the past
few years. Many solutions have been proposed by researchers, and the field keeps
attracting new researchers to contribute to this domain. Furthermore, the widespread use of
cameras for security purposes in big modern cities has also allowed researchers
to examine a vast amount of information so that autonomous monitoring can be executed.
Adding effective automated violence detection to video surveillance or multimedia
content monitoring technologies (CCTV) would make the task of security operators,
law-enforcement organizations, and those who are in charge of social media
activity monitoring much easier. In this paper we present a new deep learning framework for
determining whether a video is violent or not, based on an adapted version of
DenseNet and a bidirectional convolutional LSTM module that allows the extraction
of discriminative spatio-temporal features. In addition, an ablation study
of the input frames was carried out, comparing dense optical flow and adjacent
frames. Throughout the paper, we analyze various strategies to detect violence and
their classification in use. Furthermore, in this paper we detect violence using
spatio-temporal features with a 3D CNN, a DL violence detection framework
that is especially suitable for crowded places. Finally, we used embedded devices such as the Jetson
Nano to feed the datasets, test our model, and evaluate it. We want a warning sent
to the local police station or security agency as soon as a violent activity is detected
so that urgent preventive measures can be taken. We have worked with various
benchmark datasets; on one dataset, multiple models achieved a test accuracy
of 100 percent. Furthermore, for a different dataset our
models have shown 99.50% and 97.50% accuracy rates. We also performed a cross-dataset
experiment, in which the models still showed reasonably good results of higher than 60%. The
overall results we obtained suggest that our system is a viable solution to anomalous
behavior detection.
Dedication
We would like to dedicate this thesis to our loving parents and all of the wonder-
ful academics we met and learned from while obtaining our Bachelor’s degree and
especially our beloved supervisor, Dr. Amitabha Chakrabarty.
Acknowledgement
First and foremost, we would like to express our gratitude to our Almighty for
allowing us to conduct our research, put out our best efforts, and finish it. Second,
we would like to express our gratitude to our supervisor, Dr. Amitabha Chakrabarty
sir, for his input, support, advice, and participation in the project. We are grateful
for his great supervision, which enabled us to complete our research effectively.
Furthermore, we would like to express our gratitude to our faculty colleagues, family,
and friends who guided us with kindness, inspiration, and advice. Last but not
least, we are grateful to BRAC University for allowing us to do this research and
for allowing us to complete our Bachelor’s degree with it.
Table of Contents
Declaration i
Approval ii
Abstract iv
Dedication v
Acknowledgment vi
List of Figures ix
List of Tables xi
Nomenclature xii
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Research Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Literature Review 7
2.1 Human Action Recognition(HAR) . . . . . . . . . . . . . . . . . . . . 7
2.2 Different State-of-the-art Methods . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Spatio-temporal Texture Model . . . . . . . . . . . . . . . . . 9
2.2.3 Classification of Violence Detection Techniques . . . . . . . . 10
2.2.4 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3 Proposed Method 20
3.1 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Model Justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Optical Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.2 DenseNet Convolutional 3D . . . . . . . . . . . . . . . . . . . 24
3.2.3 Multi-Head Self-Attention . . . . . . . . . . . . . . . . . . . . 24
3.2.4 Bidirectional Convolutional LSTM 3D . . . . . . . . . . . . . 25
3.2.5 Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4 Datasets 26
4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Conclusion 43
Bibliography 48
List of Figures
1.1 Different scenarios where real time violence detection will be applica-
ble and corresponding scenes with violence that should be detected
(A) Interior video surveillance (B) Traffic video surveillance (C) police
body cameras These use cases provide the motivation for this thesis:
the flexibility to rapidly and accurately detect violence in real-time
in multiple settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Human Action Recognition General Model . . . . . . . . . . . . . . 5
5.5 Gun detected . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.6 Violent Action Detected . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.7 side by side comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.8 Comparison based on FPS . . . . . . . . . . . . . . . . . . . . . . . . 42
5.9 Human Action Recognition Subsystems . . . . . . . . . . . . . . . . . 43
List of Tables
Nomenclature
The following list describes several symbols & abbreviations that will be used later within
the body of the document.
DL Deep Learning
HFs Hockey Fights
KNN K-Nearest Neighbour
MFs Movie Fights
VFs Violent Flows
Chapter 1
Introduction
In the present era, the problem of identifying human movement in video has
come to play a significant role in the field of computer vision. However, the detection
of violent conduct has received less attention compared to other human activities.
Increasing threats have led the world to use CCTVs to monitor people everywhere in cities
and towns. To ensure the safety of citizens, crime monitoring systems are required for
situations such as human attacks and fights. Before all else, we concentrate on the
strategies for recognising and detecting violence in surveillance recordings. One of our
goals is to determine whether, and then when, violence happens in a video. Various
strategies for recognising and detecting violence have been proposed in the last decade.
For individual fight detection, Datta et al. [3] used motion trajectory data together
with limb orientation information. Nevertheless, one of the disadvantages of this
method is that it necessitates exact segmentation, which is hard to achieve in real-world
videos. Over recent years, with the evolution of technology at the forefront of computer
vision, a large number of new techniques have arisen, attracting the interest of researchers
due to their variety of applications [19][40]. In South Korea, for example, about 954,261
CCTVs were installed in public spaces in 2017, up 12.9 percent over the previous year [51].
This is the motivation for focusing on video-based violence detection methods
to ensure security in public places. Missed events and ignored dangers are issues
that develop when operators fail to notice objects or actions after 20 minutes of
watching a CCTV system. This necessitates advancements in automated systems
for the identification of violent acts or the detection of guns. Safety has always been
a concern in all aspects of daily life. Currently, socioeconomic differences,
as well as the global economic crisis, have resulted in a rise in violence, as well as in the
recording and dissemination of such acts. As a result, it is critical to build automated
systems to detect these acts and increase security teams' responsiveness, as well as to
support the moderation of social media content and the examination of multimedia data for age
restriction.
Violence detection has been a popular topic in recent human activity recognition
research, particularly in video surveillance. One of the difficulties with human
activity recognition in general is the classification of human action in real time,
almost instantaneously after the action has taken place. This difficulty escalates
when dealing with surveillance video for a variety of reasons: the standard of
surveillance footage is diminished, lighting is not always guaranteed, and there is
generally no contextual information that can be used to ease the detection of actions
and the classification of violent versus non-violent behavior. Furthermore, for violent scene
recognition to be helpful in real-world surveillance applications, the identification of
violence must be swift in order to allow for prompt intervention and resolution.
In addition to poor video quality, violence can occur in any given setting at any time
of day; therefore, a solution must be robust enough to detect violence irrespective of the
conditions. Some video surveillance settings where violence detection is often
applied include the interior and exterior of buildings, traffic, and police body
cameras.
Figure 1.1: Different scenarios where real-time violence detection will be applicable
and corresponding scenes with violence that should be detected: (A) interior video
surveillance, (B) traffic video surveillance, (C) police body cameras. These use cases
provide the motivation for this thesis: the flexibility to rapidly and accurately detect
violence in real time in multiple settings.
1.1 Motivation
Despite the large corpus of study on the subject, there is currently no commercial
method for detecting violence that combines AI technology and human operators. In
terms of job quality, such a method is undoubtedly necessary. With it, the operators
who must view this type of video would be less stressed, allowing them to focus on more
productive activities in some circumstances. Above all, it is a matter of operators not
being able to accomplish the task accurately and effectively owing to a fundamental
limitation: the number of videos that must be observed concurrently and with the
requisite focus. The number of false positives that cause the system to be deactivated,
as well as the occurrence of false negatives that indicate functionality failures, are the key
roadblocks to developing a video surveillance system capable of automatically identifying
violence. This is why a more accurate approach is presented here, which has been tested
on multiple datasets. Over and above that, because violence can take many forms, drawing
general inferences from a single dataset is problematic. In this sense, we believe that
conducting a cross-dataset analysis is crucial to see whether training performed on one
dataset can yield good results on another, and whether it is viable to put the model into
production or whether it has to be refined for each scenario.
Action recognition from visual data is one of the most challenging fields of research in the
development of advanced and smart cities, particularly in surveillance applications.
For law enforcement authorities to prevent crime, anomalous activity recognition is
critical. Unusual activities include those that are harmful to human life or property,
such as accidents, destruction of property, breaking the law, or criminal actions such
as fighting or theft. To evaluate the algorithms developed in the early stages of
activity recognition research, datasets containing activities conducted by a single actor
under controlled circumstances were used. The focus of this research has since shifted
to uncontrolled, realistic video datasets, which present more challenges for event
detection, including image noise, inter- and intra-class variance, occlusion,
posture changes, camera motion, and so on [41].
is good, but the computational complexity is very high due to extracting the optical
flow. Therefore, these methods face challenges over large-scale datasets and real-time
monitoring. Alternatively, other 3D CNN architectures, such as two-stream
3D ConvNet [35], pseudo-3D CNNs [30], and MiCT-Net [43], are used to tackle
the problem of the expensive calculation of spatio-temporal characteristics. Such 3D
CNNs can extract spatio-temporal features directly, which improves classification
performance while significantly increasing time complexity.
• Lastly, after obtaining the trained deep learning model, we will optimize it
using the DeepStream SDK, Nvidia's toolkit. To deploy on these types of devices
with the HAR model, we are using development boards such as the Jetson Nano.
The trained model will be translated into an alternative representation based on
its trained parameters and topology using these toolkits (a minimal export sketch
follows this list).
Moreover, we will implement emerging lightweight activity recognition
techniques that can be easily integrated into image sensors and IoT
systems for cost-effective surveillance.
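As a hedged illustration of this conversion step, the sketch below exports a saved Keras model to ONNX with the tf2onnx package so that it can then be built into a TensorRT engine on the Jetson. The file names, input shape, and the use of tf2onnx (rather than the DeepStream tooling itself) are assumptions made for illustration.

```python
# Minimal sketch: export a trained Keras HAR model to ONNX so it can be
# optimized with TensorRT/DeepStream on a Jetson Nano. "violence_model.h5"
# and the 16x224x224x3 clip shape are placeholder assumptions.
import tensorflow as tf
import tf2onnx

model = tf.keras.models.load_model("violence_model.h5")
spec = (tf.TensorSpec((1, 16, 224, 224, 3), tf.float32, name="clip"),)
tf2onnx.convert.from_keras(model, input_signature=spec,
                           output_path="violence_model.onnx")
# On the Jetson, the ONNX file can then be converted to a TensorRT engine,
# e.g. with:  trtexec --onnx=violence_model.onnx --saveEngine=violence.engine
```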
Over the past decade, different approaches have been proposed for HAR design methodology.
HAR systems are complex and are composed of several subsystems. Our contributions are:
• A system that uses DenseNet, a multi-head self-attention mechanism, and
Bi-ConvLSTM to recognise violent occurrences in real time.
• A cross-dataset study, run to see if we could attain a high accuracy rate.
This report aims at constructing a surveillance system that will detect anomalous
behaviour using spatio-temporal features and a 3D CNN model. The selected
datasets were utilized to evaluate the model in this study. The report contains the
experimental outcomes, highlights the significance and relevance of the findings, and
discusses the suggested model's strengths and flaws.
Chapter 1 - The introduction states the importance and the applications of
violence detection and human action recognition, especially in video surveillance
systems.
Chapter 2 - In the Literature Review section, we have gone through other researchers'
work and examined their approaches. In the literature review, we focused on the
methodologies and the models they have used to perform their research. We have
also acknowledged the outcomes of previous researchers' work.
Chapter 3 - Proposed Method states the models we have used to get our desired
result. It also describes the architecture of the model.
Chapter 4 - In the Datasets chapter, we describe the datasets that we have used throughout
our work.
Chapter 2
Literature Review
Since the development of deep learning, Human Action Recognition in videos has gained
popularity in the computer vision field for classifying images and detecting objects.
Many surveys of deep learning approaches have been carried out by researchers [24]
[50]. Here, we will discuss some major deep learning approaches for video sequence data
and analyze them with respect to both temporal and spatial features. First of
all, we will look at Human Action Recognition (HAR) in action.
Figure 2.1: Types of Human Action Recognition
When we receive a video, the preprocessing step starts by cleaning the noise in the video.
For human action recognition, a moving object is detected by extracting the shape of a
human from the background image over a series of video frames, observing when it changes
position. Objects are detected once the object detection algorithm finishes, and object
classification is then used for feature extraction. There are various types of object
classification, such as movement-characteristic classification and shape classification,
and various types of tracking methods, such as the background subtraction method, the
optical flow method, block matching, the time difference method, and active contour model
methods. A minimal background subtraction sketch is given below.
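As a minimal illustration of the background subtraction step mentioned above, the following sketch uses OpenCV's MOG2 subtractor to extract moving silhouettes from a video; the input path and area threshold are assumptions, not values from this thesis.

```python
# Minimal sketch of background subtraction with OpenCV's MOG2 subtractor.
# "input.avi" is a placeholder path.
import cv2

cap = cv2.VideoCapture("input.avi")
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)               # foreground mask for this frame
    mask = cv2.medianBlur(mask, 5)               # suppress speckle noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    moving = [c for c in contours if cv2.contourArea(c) > 500]
    # "moving" now holds the candidate human/object silhouettes for this frame
cap.release()
```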
2.2 Different State-of-the-art Methods
Many types of techniques and methods have been developed to detect crucial events and
dangerous activities in videos. Particular approaches are offered in these strategies, each
of which works with a different set of input parameters. Different qualities of videos,
such as appearance, movement, and flow, are used as parameters.
Many researchers have turned their attention to the computer vision field, as it has a
wide range of applications for analyzing images and videos. Object detection and
activity recognition have become top choices because of their necessity. The basic
architecture is shown here:
This is a high-performing method for crowd violence detection [22]. It is a sensitive
model and detects sudden changes in crowd motion. The STT is composed of
spatio-temporal volumes (STV), which transform the video frames from a 2D
mechanism to a 3D model analysis and slide a window along the time axis. HRF
is then used to compose the STT feature space. Some features of STT are as follows:
Figure 2.4: Spatio-Temporal Texture Model
The use of computer vision to recognize aggressive actions in surveillance has become
a renowned topic in the field of action detection [39]. VDT is divided into
three groups: using machine learning, using SVM, and using deep learning. Among them,
SVM and deep learning are widely used due to their success rate and accuracy over
benchmark datasets. A short review of each category follows:
VDT using ML
Figure 2.5: The general overview of an approach illustrating the two main phases
of the system. The upper part of the figure gives the main steps performed during
training (i.e., coarse and fine-level model generation), while the lower part shows
the main steps of testing (i.e., execution). (DT: Dense Trajectory, BoAW: Bag-of-
Audio-Words, BoMW: Bag-of-Motion-Words, CSC: Coarse Sub-Concept, FSC: Fine
Sub-Concept)[17]
Lagrangian theory provides a set of tools to analyze long-term, non-local motion
information in the computer vision field. A particular Lagrangian technique [31] has been
presented on the basis of this theory for crowd violence detection and automatic
recognition in video sequence data. Spatio-temporal, model-based Lagrangian direction
fields are used as the base features, combined with background motion compensation,
appearance, and long-term motion information. The results show that the addition of
Lagrangian theory is an important cue for detecting violence, and the classification
performance exceeded state-of-the-art methods such as ViF, HOG + BoW, and two-stream
CNN in terms of AUC and accuracy. Some of the most widely used techniques are listed
in Table 2.1.
| Method | Object Detection Method | Feature Extraction Method | Classification Method | Scene Type | Accuracy % |
|---|---|---|---|---|---|
| Motion blob acceleration measure vector method for fighting from video [19] | Ellipse detection method | Acceleration measure vector (AMV) as an algorithm to find the acceleration | Spatio-temporal features used for classification | Both crowded and less crowded | Near 90% |
| RIMOC method focuses on speed and direction of an object on the basis of HOF (Histogram of Optical Flow) [34] | Covariance matrix method | Spatio-temporal vector (STV) method | STV uses supervised learning | Both crowded and uncrowded | 97% for normal situations; 82% on a train-station dataset |
| The method includes two-step detection of violence and faces in video by using the ViF descriptor and normalization algorithms | Horn-Schunck, CUDA method and KLT face detector | ViF object recognition method for histograms | Interpolation classification | Less crowded | 97% at a rate of 35 f/s; 14% at lower frame rates |
| SVM method for recognition based on statistical theory without decoding of video frames [45] | Region motion and descriptor | Macro-block technique for feature extraction | Vector normalization method for video classification | Crowded | 96.1% |
| Detecting fights with motion blobs [46] | Binarization of images | Spatio-temporal method to extract blobs | Classify on the basis of blob length; the largest is considered fighting | Crowded | Depends upon dataset, 70% to 98% |
| Kinetic framework analyzing posture to recognize abnormal activities at ATMs with 3D cameras [26] | Posture recognition using logistic regression | Joint angles for acquiring posture | Gradient descent method for classification | Less crowded | 85% to 91% |
| Lagrangian fields of direction and bag-of-words framework to recognize violence in videos [53] | Global compensation of object motion | Lagrangian theory and STP method to extract motion features | Late fusion for classification | Crowded | 91% to 94% |
| A simple approach: video preprocessing, then feature extraction and recognition of normal and abnormal events | Gaussian Mixture Model | Different formulas applied to consecutive frames to extract the required features | Rule-based classification using a default threshold | Less crowded | Up to 90% |
SVM is a supervised learning algorithm used to resolve classification problems. In SVM,
data points are plotted in an n-dimensional space and separated into
two classes.
It is a robust, kernel-based technique. A kernel is a function that maps data into
a high-dimensional space in which the problem can be solved. A lack of clarity in the
results is one of the drawbacks of SVM [5]. Fast Violence Detection (FVT), Detecting
Violence in Videos using Subclasses, Human Violence Recognition and Detection
(HVRD), Violence Detection using Oriented Violent Flow, Robust Abnormal Human
Activity Recognition, Framework for High-Level Activity Analysis, Real Time
Violence Detection, and Automated Detection of Fighting Styles are some of the widely
used SVM-based methods (a minimal kernel-SVM sketch follows this paragraph).
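As a small, hedged illustration of a kernel SVM classifier of this kind (not the exact pipeline of any of the cited works), the sketch below trains an RBF-kernel SVM on randomly generated per-clip feature vectors with scikit-learn.

```python
# Minimal sketch of a kernel SVM classifier for violence detection. X stands
# for per-clip feature vectors (e.g. pooled spatio-temporal descriptors) and
# y for violent/non-violent labels; the data here is random and illustrative.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))          # 200 clips, 128-dim descriptors
y = rng.integers(0, 2, size=200)         # 0 = non-violent, 1 = violent

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # RBF kernel lifts the data
clf.fit(X_tr, y_tr)                             # into a high-dimensional space
print("test accuracy:", clf.score(X_te, y_te))
```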
The FVT model uses the BoW framework, which is specifically used for recognizing
fights: spatio-temporal features are extracted from the video frames and
fed in, ensuring about 90% accuracy. Using this method infused with SIFT and MoSIFT,
it became 15 times faster with an increase of 12% in accuracy [7]. In terms of AUC
and accuracy, the suggested descriptor outperforms state-of-the-art descriptors such
as HOG, Histogram of Optical Flow (HOF), the combination of HOG and HOF (HNF),
MoSIFT, and SIFT.
GMOF for surveillance hasn't received nearly as much attention as action recognition.
The framework is robust and fast according to [23]. HVRD used Improved Fisher
Vectors (IFV) with spatio-temporal positions. The IFV formulas are reformulated
and a summed-area-table data structure is employed to speed up the method.
Spatio-temporal features are extracted from videos with the help of Improved Dense
Trajectories (IDT). After that, HOG represents the video using IFV. A
linear SVM model is used as the classifier. According to [40], martial arts are classified by
the Automated Detection of Fighting Styles method. Mainly, it used combined KNN and SVM
models to outperform existing methods in terms of accuracy (Table 2.2).
This technique uses CNN-based categorization [18]. DL is based on
neural networks, which is also the approach followed in this paper. It is mainly
data-driven and extracts attributes using multiple convolutional layers. Some of
the methods in this category are discussed below:
capture local spatio-temporal data, allowing for local motion analysis in video, whose
performance is better than other state-of-the-art methods such as ViF+OViF and three
stream+LSTM in terms of accuracy.
The Fight Recognition Method targets tasks like aggressive actions, which are studied less
than other actions. The main requirement for these detectors is efficiency, which means
that these methods should be computationally fast. To achieve high accuracy, it
employs a 3D CNN together with a hand-crafted spatio-temporal feature.
VDT using Keras, Convolutional 3D, Convolutional LSTM 2D and Convolutional 3D Transpose
Keras is an open-source Python toolkit that helps to create and examine DL models
and is easy to use. It runs on top of Theano and TensorFlow, two efficient
numerical computing frameworks, and allows us to design and train neural network
models with just a few lines of code. The actual neural network model is represented
by the Keras model [47]. Keras offers two ways to build models: a basic and
straightforward Sequential API and a more versatile and sophisticated Functional API.

On the other hand, a 3D convolution is a form of convolution in which the kernel slides
in three dimensions rather than two, as in 2D convolutions. Medical imaging is an
example of a use case in which a model is built utilizing 3D picture slices. 3D CNNs are
employed when extracting features in three dimensions or establishing a connection
between three dimensions.

A ConvLSTM layer is an LSTM layer in which the input and recurrent transformations are
2D convolutions; otherwise it is comparable to a conventional LSTM. The CNN LSTM is an
LSTM architecture built particularly for sequential prediction problems with spatial
inputs such as pictures or videos. In Keras, we can make a CNN LSTM model by first
defining the CNN layer or layers, then wrapping them in a TimeDistributed layer, and
afterwards defining the LSTM and output layers. There are two ways to describe the model,
both of which are equivalent and differ only in preference: we may define the CNN model
first and then enclose the whole series of CNN layers in a TimeDistributed layer to join
it to the LSTM model. A minimal Keras sketch of this pattern is given at the end of this
subsection.

A 3D transposed convolution operator is applied over a multi-plane input. The transposed
convolution spreads each input element across the output planes by multiplying it
element-wise with a learnable kernel. This module may be thought of as the gradient of
Conv3D with respect to its input. It is also referred to as a deconvolution or a
fractionally strided convolution. Convolution is used to extract relevant attributes from
an input stream; it can be accomplished using a wide range of filters in image processing.
Moreover, convolutions also exist in three dimensions; they are a generalization of the
2D convolution. In 3D convolution, the filter depth is less than the depth of the input
layer (kernel size × channel size). As a result, the kernel is capable of moving in all
three directions (height, width, and channel of the image). The element-wise
multiplication and addition produce one number at each location, and since the filter
slides through a 3D space, the output numbers are also organized in 3D. The result is a
3D data set (Table 2.3).
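The following is a minimal Keras sketch of the TimeDistributed CNN + LSTM pattern described above; the clip length, frame size, and layer widths are illustrative assumptions, not the configuration of any cited work.

```python
# Minimal sketch: a per-frame CNN wrapped in TimeDistributed, followed by an
# LSTM over the frame features and a sigmoid violent/non-violent output.
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.GlobalAveragePooling2D(),
])

model = models.Sequential([
    layers.TimeDistributed(cnn, input_shape=(16, 64, 64, 3)),  # 16-frame clips
    layers.LSTM(64),                        # temporal modelling over frame features
    layers.Dense(1, activation="sigmoid"),  # violent / non-violent
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```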
In this approach, a few methods have been developed by researchers around the world.
Datta et al. [3], for example, employed a person's limb orientation and trajectory
motion information to detect irregularities. An HHMM has been suggested by Nguyen
et al. [4]; the utilization of the HHMM and its structure was their main contribution.
Numerous researchers attempted to incorporate audio and video techniques for
the identification of HAR for violence. For example, Mahadevan et al. [8] developed a
framework which recognized blood and flames, together with the degree of motion
and sound, to identify violence. Some other techniques were adopted by Hassner et al.
[10], working with flow vector magnitudes, in other words the violent flow descriptor (ViF).
With the help of an SVM, the ViF descriptor has been classified for crowded scenes
to determine violent or non-violent acts. Moreover, Huang et al. [13] showed
a method that takes only the statistical properties of the optical flow field in
video data to determine violence, where crowded scenes are grouped into
normal and abnormal behavior classes using SVM. A Gaussian model of optical
flow for region extraction and an orientation histogram of optical
flow (OHOF) have been utilized by Zhang et al. [23] to detect violence from the
video stream of a surveillance system, classified by a linear SVM. Also, Gao et al.
[20] proposed a technique which captures motion magnitude as well as orientation
information, both utilizing oriented violent flow descriptors (OViF).
Many methods have been developed over the last decade. Chen et al. [6], in addition
to the Harris corner detector, STIP [6], and MoSIFT [9], [15], have used spatio-temporal
interest points to detect violence. To identify violent and aberrant crowds,
Lloyd et al. [29] created new descriptors known as gray-level co-occurrence texture
measures (GLCM), in which changing crowd textures are recorded by temporal
frames. Additionally, Fu et al. [25] developed a model to distinguish a fight scene
whose function is to look at a set of attributes supported by motion
analysis, using three attributes: motion acceleration, motion
magnitude, and the motion region. These attributes are known as motion
signals, which are acquired by aggregating the motion region. Sudhakaran et al. [33]
used an LSTM and local structure differences to build the model, with the help
of encoding the changes in the videos. A histogram of optical flow magnitude and
orientation (HOMO) was proposed by Mahmoodi et al. [48]. Fenil et
al. [44] gave a structure based on histogram of oriented gradient (HoG) features
from each frame. They utilized the features to train a bidirectional LSTM
(BD-LSTM), which guarantees forward and backward data access to contain
information about violent scenes. The methods described above attempted to address a
variety of issues in violence detection, such as camera viewpoints, complicated crowd
patterns, and intensity changes. However, when variation occurs within the physique in
violence detection, for example, they did not capture the discriminative and useful
traits by extracting them. Viewpoint, considerable mutual occlusion, and scale
all contribute to these differences [42]. Furthermore, ViF is unable to detect the
difference between two flow vectors, limiting accuracy when the movement
of a vector with the same magnitude and different direction is present in two frames
differing by one pixel.
Figure 2.7: The framework of the proposed violence detection method
| Method | Object Detection Method | Feature Extraction Method | Scene Type | Accuracy % |
|---|---|---|---|---|
| Real time detection of violence in crowded scenes [4] | ViF descriptor | Bag of features | Crowded | 88% |
| Bag-of-words framework using acceleration | Background method | Ellipse estimation | Less crowded | Approx. 90% |
| Multi-model features framework on the basis of subclasses [10], Image CNN and ImageNet | — | GoogleNet for feature extraction | Less crowded | 98% |
| Extended form of FV (Improved Fisher Vector) and sliding windows to determine the occurrence of violent purpose [13] | Spatial pyramids and grids for object detection | Spatio-temporal grid technique for feature extraction | Crowded | 96%-99% using different data sets |
| Violence detection using oriented violent flow [23] | Optical flow method | Combination of ViF and OViF descriptors | Crowded | 90% |
| AEI and HOG combined framework to recognize abnormal events in visual motions [6] | AEI technique for background subtraction | HOG and spatio-temporal methods to extract features | Both crowded and less crowded | 94%-95% |
| Framework including preprocessing, activity detection and image retrieval; identifies abnormal events and images from database images [20] | Optical flow and temporal difference for object detection; CBIR method for retrieving images | Gaussian function for video analysis | Less crowded | 97% |
| Late fusion method for temporal perception layers to detect high-level activities, using multiple cameras from 1 to N [15] | A motion vector method to identify from multiple cameras in 2D | SGT MtPL method | Less crowded | 98% |
| Bi-channel | A motion vector method to identify from multiple cameras in two dimensions | SGT MtPL method | Less crowded | 98% |
| Convolutional neural network for real-time detection [29] | ImageNet method of object detection | VGG-f model for feature extraction | Crowded | 91%-94% |
| Solve the detection problem by dividing the objective in a deep and clear format using COV-Net [25] | Movement detection and TR of model | BoW approach | Less crowded | 96% |
Chapter 3
Proposed Method
3.1 Model Architecture
Each video is transformed from RGB to optical flow, and the final classification is
performed by a fully connected network on a robustly generated video encoding, mainly
to recognize violence in videos. The optical flow is encoded by a dense network as a
stack of feature maps. First, the feature maps go through a multi-head self-attention
layer. After that, they pass through a bidirectional ConvLSTM layer, in which attention
mechanisms are applied in the forward temporal direction as well as in the backward
temporal direction. This is essentially a spatio-temporal encoder which extracts the
necessary spatial as well as temporal features from each video. Finally, the encoded
features are fed to a four-layer classifier, which determines whether the video is
violent or not. A minimal end-to-end sketch of this pipeline is given below.
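The following is a minimal, illustrative Keras sketch of this pipeline. It is not the exact configuration of the proposed model: the clip size, the small Conv3D backbone standing in for the DenseNet-3D encoder, the projection sizes, and the single-unit sigmoid output layer are simplifying assumptions.

```python
# Minimal sketch of the pipeline: 3D conv encoder -> multi-head self-attention
# over the temporal sequence -> bidirectional ConvLSTM -> dense classifier.
from tensorflow.keras import layers, models

T, H, W, C = 16, 64, 64, 2   # assumed clip length, frame size, flow channels

inputs = layers.Input(shape=(T, H, W, C))
# Small 3D convolutional backbone standing in for the DenseNet-3D encoder (3.2.2)
x = layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu")(inputs)
x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
x = layers.Conv3D(64, (3, 3, 3), padding="same", activation="relu")(x)
x = layers.MaxPooling3D(pool_size=(2, 2, 2))(x)          # -> (8, 16, 16, 64)

t, h, w, f = x.shape[1], x.shape[2], x.shape[3], x.shape[4]
# Multi-head self-attention over the flattened feature-map sequence (3.2.3)
seq = layers.Reshape((t, h * w * f))(x)
seq = layers.Dense(256)(seq)                              # compress each time step
attn = layers.MultiHeadAttention(num_heads=4, key_dim=32)(seq, seq)
attn = layers.Dense(h * w * f)(attn)                      # restore spatial layout
attn = layers.Reshape((t, h, w, f))(attn)

# Bidirectional convolutional LSTM: forward and backward temporal passes (3.2.4)
x = layers.Bidirectional(layers.ConvLSTM2D(32, (3, 3), padding="same"))(attn)
x = layers.GlobalAveragePooling2D()(x)

# Fully connected classifier; the last layer is simplified to one sigmoid unit (3.2.5)
for units in (1024, 128, 16):
    x = layers.Dense(units, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```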
3.2 Model Justification
As shown in the figure below, the blocks composing the model architecture have shown
flourishing test results in the field of human action recognition as well as violent
action recognition. For video classification, the 3D DenseNet variant has
been used. The bidirectional recurrent convolutional block allows feature analysis
in the forward and backward temporal directions, and thus the block improves
efficiency when recognizing violent actions. The architecture mainly combines three
ingredients: human action recognition, a combination of convolutional networks, and
bidirectional convolutional recurrent blocks. The model really helped us in developing
our proposal, as it is based on blocks that recognize human actions. A detailed
description of the ViolentNet architecture is given below:
3.2.1 Optical Flow
The pattern of apparent motion of a visual object between two consecutive frames,
created by the movement of an object or camera, is known as optical flow. It is a
two-dimensional vector field in which each vector is a displacement vector indicating
the movement of points from one frame to the next. Structure from Motion, video
compression, and video stabilization are just a few of the uses of optical flow.
Figure 3.3: optical Flow
Here, we consider a pixel I(x, y, t) in the first frame, where a new dimension, time, has
been added. Previously we simply worked with still images, so there was no need
for time. In the next frame, after time dt, the pixel moves by distance (dx, dy). We may
say
I(x, y, t) = I(x + dx, y + dy, t + dt) (3.1)
because those pixels are the same and the intensity does not vary. Taking the first-order
Taylor series approximation of the right-hand side, removing common terms, and dividing
by dt produces the following equation:
fx u + fy v + ft = 0 (3.2)
fx = ∂f/∂x (3.3)
fy = ∂f/∂y (3.4)
u = dx/dt (3.5)
v = dy/dt (3.6)
The equation above is known as the optical flow equation. In it we can find the image
gradients fx and fy; similarly, ft represents the gradient over time. The Taylor-expansion
step behind it is sketched below.
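For completeness, the step from (3.1) to (3.2) is a first-order Taylor expansion; the following short derivation is a sketch added for clarity (writing I for the image intensity, so that fx = ∂I/∂x, fy = ∂I/∂y, ft = ∂I/∂t), not text from the original formulation.

```latex
% Sketch of the step from (3.1) to (3.2): first-order Taylor expansion of the
% right-hand side of the brightness-constancy assumption.
\begin{align}
I(x+dx,\, y+dy,\, t+dt)
  &\approx I(x,y,t)
   + \frac{\partial I}{\partial x}\,dx
   + \frac{\partial I}{\partial y}\,dy
   + \frac{\partial I}{\partial t}\,dt \\
0 &= \frac{\partial I}{\partial x}\,dx
   + \frac{\partial I}{\partial y}\,dy
   + \frac{\partial I}{\partial t}\,dt
   && \text{(subtract } I(x,y,t)\text{ using (3.1))} \\
0 &= f_x u + f_y v + f_t
   && \text{(divide by } dt,\ u = \tfrac{dx}{dt},\ v = \tfrac{dy}{dt}\text{)}
\end{align}
```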
Farneback Method
The Farneback algorithm builds an image pyramid, with each level having a lower
resolution than the one before it. When we select a pyramid level greater than 1,
the algorithm can track points at various resolution levels, starting with the lowest.
Computing a dense optical flow using Gunnar Farneback's algorithm involves several
functions and parameters. The Farneback Gaussian option estimates optical flow
using a Gaussian winsize×winsize filter instead of a box filter of the same
size; typically, this alternative gives a more exact flow than a box filter at
the cost of slower speed, and winsize for a Gaussian window should usually be set to
a larger value to achieve the same level of robustness. Using the algorithm, the function
finds an optical flow for each pixel of the previous frame.
On the other hand, the minEigThreshold parameter works on the minimum eigenvalue of a
2x2 normal matrix of the optical flow equations (this matrix is called a spatial gradient
matrix), divided by the number of pixels in a window; if this value is less than
minEigThreshold, the corresponding feature is filtered out and its flow is not processed,
allowing for the removal of bad points and a performance boost.
This is one of the inputs of our network. A frame sequence is generated by this algorithm
in which the pixels that moved most between consecutive frames are
represented with greater intensity. Motion has been the most vital component in violent
clips; besides that, the main components are contact and speed. The pixels tend to
move a lot during a particular segment in comparison to the
other segments of the video, and they also tend to form a cluster in a
particular portion of the frame. Applying the algorithm, along with the optical flow we
obtain a two-channel matrix in which the magnitude as well as the direction
are included. The direction corresponds mainly to the hue value of a picture, which is
used for visualization purposes only, and the magnitude corresponds to the value plane.
We chose dense optical flow over sparse optical flow because dense optical flow generates
flow vectors for the entire frame, up to one flow vector per pixel, whereas sparse optical
flow only generates flow vectors for certain features, such as some pixels that portray
the edges/seams of an object within the frame.
Dense optical flow is used as an input to the model because, in deep learning models
like our proposed one, the features are learned in an unsupervised way, and having a
wide range of features available is actually much better. A minimal sketch of computing
the dense flow is given below.
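As a hedged illustration of this input, the sketch below computes dense Farneback flow with OpenCV and converts direction and magnitude into the hue and value planes of an HSV image, as described above. The file name and parameter values are illustrative assumptions, not the exact settings used in this work.

```python
# Minimal sketch of the dense Farneback optical flow input, including the
# magnitude/direction-to-HSV visualization. "clip.avi" is a placeholder path.
import cv2
import numpy as np

cap = cv2.VideoCapture("clip.avi")
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
hsv = np.zeros_like(prev)
hsv[..., 1] = 255                                 # full saturation

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # args: prev, next, flow, pyr_scale, levels, winsize, iterations,
    #       poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv[..., 0] = ang * 180 / np.pi / 2            # direction -> hue
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # magnitude -> value
    vis = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)     # flow visualization frame
    prev_gray = gray
cap.release()
```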
3.2.2 DenseNet Convolutional 3D
The structure of DenseNet was originally built to process images, with 2D convolutional
layers. However, DenseNet can be modified to work with videos. The modifications are
as follows.
DenseNet works layer by layer, and the layers are connected
in a feed-forward fashion [27]. The reduction layers of DenseNet are MaxPool2D and
AveragePool2D with pool sizes (2,2) and (7,7). Instead of them, MaxPool3D
and AveragePool3D were used, with sizes (2,2,2) and (7,7,7). The basis
of the DenseNet structure is the dense blocks; each dense block concatenates the feature
maps of a layer with those of all preceding layers.
In our suggested system we have used four dense blocks, all of different
sizes. The dense blocks consist of a sequence of layers that follow the pattern
of batch normalization and 3D convolution. The main reason for using DenseNet is its
efficient reuse of feature maps: DenseNet works more efficiently than
ResNet or Inception, achieving high performance while generating a lower number of
feature maps and parameters.
3.2.3 Multi-Head Self-Attention
• dimension of queries dq = 32
• dimension of values dv = 32
• dimension of keys dk = 32
The improvements we got from this layer will be discussed in Chapter 5. The multi-head
self-attention mechanism forms new relations among features, helping to determine whether
the action is violent or not.
3.2.4 Bidirectional Convolutional LSTM 3D
This system has two states, a forward state and a backward state, and the generated output
can receive data from both. This module is well known for its ability to look back
in a video; it splits the components of the recurrent layers into positive and negative
time directions [28]. Usually in neural networks the temporal features are captured but
the spatial features sometimes disappear. In order to avoid such a situation we proposed a
model with convolutional layers instead of fully connected layers; here the ConvLSTM is
able to observe both spatial and temporal features and lets us obtain data from both.
The bidirectional convolutional system is an advanced ConvLSTM which can look backward
and forward in a video, which gives the system an overall better outcome.
3.2.5 Classifier
The classifier is made up of fully connected layers. Each layer has nodes ordered
in a definite manner: 1024, 128, 16, and 2. So there are four fully connected layers.
The ReLU activation function is used by the hidden layers, while the Sigmoid activation
function is engaged in the last layer, whose output is a binary predictor verifying
whether an action category is violent or not. The self-attention mechanism has achieved
a very high success rate in determining the relevance of words in natural language
processing and text analysis.
Chapter 4
Datasets
4.1 Datasets
In our thesis we have used the most widely accepted benchmark datasets: the
Hockey Fights dataset, the Movie Fights dataset, the Violent Flows dataset, and the
RWF-2000 dataset. These are well balanced and labeled, and they were split with an
(80-20)% ratio for training and testing purposes respectively. Besides, these datasets
cover mainly indoor scenes, outdoor scenes, and a few weather conditions.
This dataset contains an equal number of violent and non-violent actions during
professional hockey games, with two players often involved in close body contact,
and contains 1000 videos extracted from games of the NHL, the USA's
National Hockey League. This dataset contains indoor scenarios.
Figure 4.2: Indoor scenes
This dataset includes violent and non-violent events from action movies, with a
collection of around 200 clips. It has both indoor and outdoor scenes but no
varying weather conditions.
This is a collection of 246 videos captured in places that include crowd violence.
This dataset mostly covers mass violence occurring outdoors, and some scenes
contain weather conditions as well.
This dataset contains 1000 violent videos of real-life street fighting events
and 1000 non-violent videos of normal daily life activities. It is a collection of raw
surveillance videos from YouTube. To work with it more efficiently, the videos have been
sliced into 5-second chunks at 30 frames per second, and each clip is labeled as violent
or non-violent. Duplicate material appearing in both the training and validation sets
was removed. In the end it contains 2000 clips and 300,000 frames as a dataset
for violent action detection. A minimal slicing sketch is given below.
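As an illustration of this slicing step, the sketch below cuts a long video into consecutive 5-second clips at 30 fps with OpenCV. The file names and codec are placeholders, and this is not the dataset authors' actual preprocessing script.

```python
# Minimal sketch: slice a long surveillance video into 5-second clips at 30 fps.
import cv2

FPS, SECONDS = 30, 5
CLIP_LEN = FPS * SECONDS

cap = cv2.VideoCapture("raw.mp4")
fourcc = cv2.VideoWriter_fourcc(*"XVID")
clip_idx, frame_idx, writer = 0, 0, None

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % CLIP_LEN == 0:          # start a new 5-second clip
        if writer is not None:
            writer.release()
        h, w = frame.shape[:2]
        writer = cv2.VideoWriter(f"clip_{clip_idx:04d}.avi", fourcc, FPS, (w, h))
        clip_idx += 1
    writer.write(frame)
    frame_idx += 1

if writer is not None:
    writer.release()
cap.release()
```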
Figure 4.4: Real world videos captured by surveillance cameras with large diversity
Figure 4.5: Crowd violence (246 videos captured in places), Movies Fight (200 videos
extracted from action movies) and Hockey Fight (1k videos extracted from crowded
hockey games)
For RWF-2000, the data preprocessing is done with Python scripts; to
convert the videos, a tensor is used whose shape declares the number of frames and the
image height and width. Its last channel dimension has three layers for the
RGB components and two layers for the optical flow.
Figure 4.6: Detection of violence and non violence
To extract the features, the images are converted from RGB to grayscale for the optical
flow. The dimensions used when converting a video to a Python array are a
height and width of 224×224, with only one channel since the image is gray. After padding,
two channels are created: one normal channel and one empty
channel. Lastly, before converting to NumPy, an empty array channel is created to
convert the gray channel back to RGB. After that, the color is converted back
from BGR to RGB and reshaped into three channels (red, green, blue),
keeping the 224×224 size intact. For the final preprocessed data, five channels
are created with the same height and width. A minimal sketch of building this
five-channel tensor is given below.
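The following is a minimal, illustrative sketch of assembling a five-channel clip tensor (three RGB channels plus two optical flow channels per frame) at 224×224, as described above. The file name, channel ordering, and Farneback parameters are assumptions for illustration rather than the exact preprocessing script used here.

```python
# Minimal sketch: build a (num_frames, 224, 224, 5) tensor of RGB + dense flow.
import cv2
import numpy as np

def load_clip(path, size=(224, 224)):
    cap = cv2.VideoCapture(path)
    frames, prev_gray = [], None
    while True:
        ok, bgr = cap.read()
        if not ok:
            break
        bgr = cv2.resize(bgr, size)
        rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
        gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
        if prev_gray is None:
            flow = np.zeros(size[::-1] + (2,), dtype=np.float32)  # no flow for frame 0
        else:
            flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
        frames.append(np.concatenate([rgb.astype(np.float32) / 255.0, flow], axis=-1))
        prev_gray = gray
    cap.release()
    return np.stack(frames)          # shape: (num_frames, 224, 224, 5)

clip = load_clip("clip.avi")         # "clip.avi" is a placeholder path
print(clip.shape)
```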
The Hockey Fights dataset: here the frames are isolated from each clip into
distinct batches of frames, and then all available augmentation procedures are applied
to each dataset individually, such as removing black borders from the Hockey
dataset. Images are subtracted pixel by pixel from adjacent frames, which form the input
of the encoder model. This was done, instead of using the raw pixels from each frame, to
include spatial movement in the input clips.
For the Movie Fights dataset, the preprocessing method is extraction of the frames from
each shot; this extraction is then fed into an algorithm trained for violence recognition.
In comparison to other videos, such as surveillance footage, movies can usually be readily
sampled into shots due to their hierarchical structure. This hierarchical structure aids
in analyzing the entire film at the shot level, and thus a histogram-based method is used
to segment the movie into shots. A saliency map is created for all of the frames in a
shot, and they are then compared using the maximum number of non-zero pixels divided by
the total number of pixels in the frame.
With this dataset we faced some issues: due to dark surroundings, fast movement
of objects, illumination blur, and other factors, much of the footage obtained by
surveillance cameras in public locations may not have high picture quality. Some
instances include only a portion of a person being visible in the frame, crowds and
chaos, detecting objects from a far distance, brief actions, low resolution, etc.
Therefore, to feed the model we have saved all the datasets with the Audio Video
Interleave (AVI) extension. We tried to avoid these obstacles by removing blurry pictures
and using the RGB components to get the most accurate result possible; for the
RWF-2000 and UCF Crime datasets, the levels of accuracy we got are 86.75% and 99%
respectively for violence detection.
Chapter 5
In this chapter, the implementation of our proposed model for detecting anomalous
behavior in surveillance systems is described. The model was implemented and tested
using Python, TensorFlow, Keras, and OpenCV. The implementation part is divided into
four sections: Training Methodology, Training Metrics, Ablation Study, and
Experimentation on Cross Dataset.
5.1 Implementation
This section summarizes the training process and assesses the ablation study, from
which the importance of the self-attention mechanism that we propose can be understood.
Moreover, the cross-dataset experimentation is implemented to assess how well the notion
of anomalous behavior generalizes.
5.1.1 Training Methodology
In this stage, the weights of all neurons in the model were randomly initialized. Each
input pixel value was normalized to the range 0 to 1. To determine the input video
sequence length, the average number of frames over all videos of every dataset was
calculated. Extra frames were removed if an input video consisted of more than the
average number of frames; to maintain the average, if a video had fewer frames than the
average, the last frames were repeated until the average was reached. Moreover, the
frames were enlarged to the standard size for Keras pre-trained models, which is
224 × 224 × 3. The following parameters were chosen: 10⁻⁴ as the base learning rate, 12
clips as the batch size, and 150 epochs. The weight decay was set to 0.1. In addition,
the default setup of the Adam optimizer was employed. For the last layer of the
classifier, Binary Cross-Entropy was taken as the loss function and the Sigmoid function
was chosen as the activation. On an Nvidia GT 730 GPU, the CUDA toolbox was utilized to
extract deep features for the tests, where the operating
system was Windows 10 and the processor was an Intel Core i7. A three-fold
cross-validation method was used, and to test the performance on the dataset a
random-permutation cross-validator was used. Two types of input were used to conduct the
implementation: one is the optical flow and the other is adjacent-frame differencing,
which can be called the pseudo-optical flow. The temporal component was implicitly
expressed in both inputs, but in different ways. The two input methods were used to see
which kind of input performs better and generates the best outcome. The pseudo-optical
flow was created by subtracting two consecutive frames.
On each pair of adjacent frames, a matrix subtraction was performed over the sequence of
frames. In this way, any variation between two successive frames was expressed at the
pixel level. Three violent situations were transformed into both input formats
individually. The fundamental distinction between the two approaches is that the input
method using adjacent-frame differencing represents pixels that have not moved between
two successive frames (f0 ... fx): a pixel turns black when its value does not change in
both frames, as the two frames cancel each other, and it is not taken into account
whether that pixel moved or not. In the optical flow method, the pixels that turn black
represent those pixels that never moved between two successive frames. A minimal
differencing sketch is given below.
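A minimal sketch of this adjacent-frame subtraction, under the assumption that the clip is already loaded as a grayscale NumPy array, is given below.

```python
# Minimal sketch of the "pseudo-optical flow" input: element-wise subtraction
# of adjacent grayscale frames, so unchanged pixels become black. The clip
# array and its shape are illustrative assumptions.
import numpy as np

def pseudo_optical_flow(gray_clip):
    """gray_clip: array of shape (num_frames, H, W), values in [0, 255]."""
    diffs = np.abs(gray_clip[1:].astype(np.int16) -
                   gray_clip[:-1].astype(np.int16))
    return diffs.astype(np.uint8)        # shape: (num_frames - 1, H, W)

clip = np.random.randint(0, 256, size=(64, 224, 224), dtype=np.uint8)
print(pseudo_optical_flow(clip).shape)   # (63, 224, 224)
```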
5.1.2 Metrics
To measure the performance and efficiency of the model, the following training metrics
were used:
Train accuracy: the number of correctly classified samples made by the proposed
algorithm on the training set on which the training was built, as a fraction of the total
number of classifications.
Figure 5.1: Train Accuracy
Test accuracy: the number of correctly classified samples made by the proposed
algorithm on new cases, divided by the total number of classifications.
Inference time during test: the average delay time when making single predictions on the
test dataset.
5.1.3 Ablation Study
A two-fold experiment was designed for the ablation investigation: to compare the use of
both input methods (optical flow and the neighboring-frame removal technique) and to
examine the self-attention mechanism in the model and its importance. When the
self-attention block is removed, the DenseNet is connected directly to the Bi-ConvLSTM.
5.1.4 Experimentation on Cross Dataset
The goal of the cross-dataset experimentation is to see how accurately a model trained
on one dataset can evaluate examples from another dataset. One of the major goals to be
observed is whether the concept of violence learned by the model is generic enough to
appropriately detect violence on different datasets. In this research, two types of
cross-dataset configuration have been examined. In the first type, the model is trained
on one dataset and then tested on a different one; in the second configuration, training
combines three datasets and testing is done on the remaining one. A minimal sketch of
the first protocol is given below.
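The following is a minimal sketch of the first (pairwise) configuration. The helpers load_dataset() and build_model() are hypothetical placeholders for the data loading and the model of Chapter 3; only the Keras fit/evaluate calls are standard.

```python
# Minimal sketch of the pairwise cross-dataset protocol: train on one dataset,
# evaluate on every other. load_dataset() and build_model() are hypothetical.
datasets = ["HF", "MF", "VF", "RWF-2000"]

for train_name in datasets:
    X_train, y_train = load_dataset(train_name)   # hypothetical helper -> (X, y)
    model = build_model()                         # hypothetical: model from Chapter 3
    model.fit(X_train, y_train, batch_size=12, epochs=150, verbose=0)
    for test_name in datasets:
        if test_name == train_name:
            continue
        X_test, y_test = load_dataset(test_name)
        loss, acc = model.evaluate(X_test, y_test, verbose=0)
        print(f"train {train_name} -> test {test_name}: {acc:.4f}")
```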
5.2 Results
In this section we discuss the comparison with the state of the art and the
cross-dataset experimentation, together with the results obtained from the
ablation study. In addition, this segment provides the implementation results
after running our proposed model for violence detection in surveillance. Python,
TensorFlow, and Keras were used to run the tests on unclassified input data and
to generate the results.
Although a more powerful core network has been used here than in any prior study, we
thought it would be interesting to watch the improvement in performance obtained by
modifying the input of the network (the optical flow versus the pseudo-optical flow) and
by engaging the attention mechanism. We found two major benefits, an improvement in
accuracy and a shorter inference time, after measuring the two versions of the proposed
model (optical flow and pseudo-optical flow) and comparing them to the ones without the
self-attention module. We can see a consistent improvement in accuracy as well as in
inference time in Table 5.1 shown below. The reason for the improvement is that the
bidirectional convolutional recurrent layer takes longer when applied directly to the
feature maps generated by the CNN than when applied after the attention layer
concatenated to the sequence; it was thus visible that the inference time required with
the attention mechanism was less than without it.
| Dataset | Input Method | Test accuracy (with attention) | Test accuracy (without attention) | Inference time (with attention) | Inference time (without attention) |
|---|---|---|---|---|---|
| HF | Features extracted with optical flow | 99.10±0.6% | 98.90±1.0% | 0.1398±0.0025s | 0.1627±0.0035s |
| HF | Neighboring frame removal method | 97.40±1.0% | 97.30±1.0% | 0.1398±0.0024s | 0.1627±0.0034s |
| MF | Features extracted with optical flow | 100.00±0.0% | 100.00±0.0% | 0.1917±0.0093s | 0.2018±0.0045s |
| MF | Neighboring frame removal method | 100.00±0.0% | 100.00±0.0% | 0.1917±0.0093s | 0.2018±0.0045s |
| VF | Features extracted with optical flow | 96.90±0.5% | 94.00±1.0% | 0.2971±0.0030s | 0.3115±0.0073s |
| VF | Neighboring frame removal method | 94.81±0.5% | 92.51±0.5% | 0.2993±0.0030s | 0.3115±0.0073s |
| RWF-2000 | Features extracted with optical flow | 95.61±0.6% | 93.50±1.0% | 0.2768±0.0020s | 0.3029±0.0059s |
| RWF-2000 | Neighboring frame removal method | 94.20±0.8% | 92.30±0.8% | 0.2777±0.0020s | 0.3029±0.0059s |
The variation in accuracy was smallest when the dataset consisted of around fifty frames
on average (HF and MF), but when it reached around a hundred frames on average (VF
and RWF-2000), the model with self-attention outperformed the others by two
points. The inference time decreased on each dataset; it decreased by between 4% (VF) and
16% (HF) relative to the model without attention. Lastly, looking at the results for all
datasets, we saw that pseudo-optical flow without attention and optical flow with
attention were both successful in all dataset environments except for the MF dataset.
The gains were 2 points on HF, 4.4 points on VF, and 3.4 points on RWF-2000,
respectively.
5.2.2 Cross-Dataset Experimentation Results
Two logical findings emerged from the cross-dataset experiments: firstly, there was only
a minor connection between different datasets, together with a better understanding of
the various environments; secondly, a better grasp of the concept was revealed when
several datasets were combined for training. We can also see that the results did not
exhibit successful generalization for pairings of datasets that were significantly
dissimilar in terms of the sort of violence scenario. When training on MF and testing on
VF, the results show a low accuracy rate of 52.33% because the environments are
completely different, but when we trained in the opposite direction it showed a slightly
higher accuracy of 60.03%. The best result was found for the HF+RWF-2000+VF combination,
where the test accuracy was 81.52%. Overall, the RWF-2000 dataset gave us the best
results in every aspect.
The results we got while testing in different environments are shown in Table 5.2
| Training | Testing | Test accuracy (Optical Flow) | Test accuracy (Pseudo-Optical Flow) |
|---|---|---|---|
| HF | MF | 65.19±0.34% | 64.87±0.41% |
| HF | VF | 62.57±0.33% | 61.23±0.22% |
| HF | RWF-2000 | 58.23±0.24% | 57.37±0.22% |
| MF | HF | 54.93±0.33% | 53.51±0.12% |
| MF | VF | 52.33±0.34% | 51.78±0.30% |
| MF | RWF-2000 | 56.73±0.19% | 55.81±0.20% |
| VF | HF | 65.17±0.59% | 64.77±0.49% |
| VF | MF | 60.03±0.24% | 59.49±0.16% |
| VF | RWF-2000 | 58.77±0.49% | 58.33±0.27% |
| RWF-2000 | HF | 69.25±0.27% | 68.87±0.14% |
| RWF-2000 | MF | 75.83±0.17% | 74.65±0.22% |
| RWF-2000 | VF | 67.85±0.32% | 66.69±0.22% |
| HF+MF+VF | RWF-2000 | 70.09±0.19% | 69.85±0.14% |
| HF+MF+RWF-2000 | VF | 76.10±0.20% | 75.69±0.14% |
| HF+RWF-2000+VF | MF | 81.52±0.09% | 80.50±0.05% |
5.3 State of the Art Comparison
When the studies were completed, we found that the optical flow input produced better results than the pseudo-optical flow input. The table below shows the results for the training and testing process.
Here the optical flow highlights the spatio-temporal features of the videos and works better than the pseudo-optical flow because it achieves a larger reduction of the loss function.
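The thesis does not spell out the exact dense optical flow implementation at this point, so the following is only an assumed sketch of one common way to obtain such input: OpenCV's Farnebäck method applied to consecutive frames of a clip.

import cv2
import numpy as np

def dense_flow_stack(video_path, resize=(224, 224)):
    """Return per-frame dense optical flow for one video as an array."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise ValueError(f"could not read {video_path}")
    prev = cv2.cvtColor(cv2.resize(prev, resize), cv2.COLOR_BGR2GRAY)
    flows = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(cv2.resize(frame, resize), cv2.COLOR_BGR2GRAY)
        # 2-channel (dx, dy) displacement field for every pixel
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
        prev = gray
    cap.release()
    return np.stack(flows)          # shape: (frames-1, H, W, 2)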
Comparing our method with the state of the art in Table 5.3, it shows better results than previous studies, even those that do not include cross-validation and keep a small number of specifications. The best results were obtained on the MF and HF datasets, where the violence is person-to-person; the MF dataset in particular was the most consistent and least demanding. The approach also performed well in scenes with large-scale crowd violence.
Table 5.3: State-of-the-art comparison

Dataset  | Object detection method   | Feature extraction method  | Train accuracy | Training loss | Test accuracy (Violence) | Test accuracy (Non-Violence) | System usage
HF       | 3D Bi-LSTM (proposed)     | Optical flow               | 100%    | 1.30×10⁻⁵ | 99.50%  | 100.00% | 68%
HF       | LSTM-AE                   | Neighboring frames removal | 97%     | 1.35×10⁻⁵ | 99%     | 97.00%  | 85%
HF       | RNN with ConvLSTM cells   | FeedForward                | 100%    | 1.22×10⁻⁵ | 98.00%  | 100.00% | 93%
HF       | 2D Conv-LSTM              | Back-propagation method    | 93%     | 1.53×10⁻⁵ | 91.00%  | 92.80%  | 73%
MF       | 3D Bi-LSTM (proposed)     | Optical flow               | 100%    | 1.18×10⁻⁵ | 100%    | 100%    | 67%
MF       | LSTM-AE                   | Neighboring frames removal | 99.00%  | 1.39×10⁻⁵ | 98.40%  | 99.00%  | 82%
MF       | RNN with ConvLSTM cells   | FeedForward                | 100%    | 1.19×10⁻⁵ | 100%    | 100%    | 87%
MF       | 2D Conv-LSTM              | Back-propagation method    | 95.00%  | 1.62×10⁻⁵ | 94.20%  | 95.00%  | 70%
VF       | 3D Bi-LSTM (proposed)     | Optical flow               | 98%     | 1.50×10⁻⁴ | 97.00%  | 96.00%  | 71%
VF       | LSTM-AE                   | Neighboring frames removal | 97.00%  | 2.94×10⁻⁴ | 95.00%  | 94.00%  | 83%
VF       | RNN with ConvLSTM cells   | FeedForward                | 99%     | 1.34×10⁻⁴ | 98.00%  | 98.80%  | 91%
VF       | 2D Conv-LSTM              | Back-propagation method    | 90%     | 3.10×10⁻⁴ | 89.00%  | 89.90%  | 73%
RWF-2000 | 3D Bi-LSTM (proposed)     | Optical flow               | 96%     | 3.10×10⁻⁴ | 96.00%  | 95.00%  | 65%
RWF-2000 | LSTM-AE                   | Neighboring frames removal | 95%     | 7.3×10⁻⁴  | 94.00%  | 93.00%  | 80%
RWF-2000 | RNN with ConvLSTM cells   | FeedForward                | 98%     | 2.90×10⁻⁴ | 96.00%  | 97.60%  | 93%
RWF-2000 | 2D Conv-LSTM              | Back-propagation method    | 92%     | 8.4×10⁻⁴  | 91.00%  | 92.00%  | 69%
In a hockey game the players move constantly and rarely make physical contact. Our proposed model was highly successful at following the movements in this dataset, learning temporal features from the vigorous movement at the moment it occurred.
The VF dataset posed a different generalization problem from the hockey fight dataset. Its videos show large-scale violent activities at protests, concerts, and other gatherings, where many things happen at the same time. Because the viewpoints are far from the action, many people appear at low resolution; the distance between the event and the observing camera makes the subjects tiny, and detecting motion from a single video makes it much harder to tell whether an action is violent, which is in fact difficult even for humans. In addition, environments with mass events contain unusual situations, such as a crowd trying to catch a baseball, that can look like the start of a fight.
Because the RWF-2000 dataset is so diverse, it is also challenging to generalize the notion of violence there. The RWF-2000 scenes are not topic-specific, unlike the other three datasets, and the heterogeneity of the dataset becomes even more visible in the non-violence category, where the actions differ greatly from scene to scene.
Our model achieved state-of-the-art test accuracy on several of these datasets, and in comparison to other models it was in a decent position. For the HF, MF, VF, and RWF-2000 datasets, various models had been proposed before ours, and some of them already reach 100% test accuracy, which leaves no room for improvement on those benchmarks.
Our approach surpassed the nearest competitor by more than 2 points on the VF dataset. Prior to ours, the RWF-2000 dataset had been evaluated with only one model, which reported a test accuracy of 92.00 percent using hold-out validation. Our model scored 95.60 percent, an improvement over the current state of the art.
The test accuracy with the optical flow input was noticeably higher than with the pseudo-optical flow input on every dataset except MF, where both input methods reached the same value. Even when comparing our model against others that used a hold-out validation procedure, our approach clearly outperformed the competition.
However, the RNN-based Conv-LSTM has a slightly better accuracy score, but it also has greater system usage than our proposed model. We use the Jetson Nano developer board to run our model for real-time surveillance, and in this setting we have to trade some accuracy for resources. Our model requires 10.70 GFLOPs, whereas the higher-accuracy RNN-based Conv-LSTM requires 40.92 GFLOPs, nearly four times the compute, and therefore more time and resources. We therefore chose our proposed model as the one best suited for implementation on the Jetson Nano with a satisfactory accuracy rate.
In the following two figures, the Jetson Nano keeps only the guns in red and removes all other features of the picture, as specified by our model, which increases the violence detection rate and also increases the fps.
Figure 5.5: Gun detected
Our approach is not limited to detecting violence; we also want to deploy it in smart cities. We therefore created CCTV feeds so that the culprit can later be identified. The real-time CCTV-feed version of our model, with violence detection, is shown in the figures.
For a better understanding of our approach, the figure below shows a side-by-side comparison of the original CCTV footage, which would be monitored by the proper authorities, and the underlying processing performed by the Jetson Nano to detect the violence.
In this section we compare our model with other models, after implementation on the Jetson Nano, in terms of performance. To give a clear picture, the comparison is based on the FPS of the video. As the figure shows, our model outperformed the rest of the models in terms of fps.
Figure 5.8: Comparison based on FPS
The figure shows that our model achieved 12 fps while detecting violence, the highest among the compared models. The second best, at 10 fps, is the LSTM autoencoder, which has a lower accuracy rate. The RNN-based approach with ConvLSTM cells between the encoder and decoder runs at 7 to 9 fps, which is lower than our proposed method.
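A hedged sketch of how per-model FPS can be measured on the Jetson Nano is shown below; the model, the input shape, and the number of timed runs are assumptions rather than the exact benchmarking script used in the thesis.

import time
import numpy as np

def measure_fps(model, frames_per_clip=30, runs=100):
    """Time repeated inference on a dummy clip and report video frames per second."""
    dummy = np.random.rand(1, frames_per_clip, 224, 224, 2).astype("float32")
    model.predict(dummy)                      # warm-up run (graph build / caches)
    start = time.perf_counter()
    for _ in range(runs):
        model.predict(dummy)
    elapsed = time.perf_counter() - start
    return runs * frames_per_clip / elapsed   # frames processed per second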
After trading a little accuracy for performance, our model outperformed the other models at detecting violence in real time with satisfactory accuracy and performance.
The ViolenceNet architecture classifies videos as violent or non-violent. Our model works with CCTV footage consisting of different videos of the same length: the video segments are processed with the dense optical flow technique, and ViolenceNet takes this as input and classifies whether each segment is violent or not.
Figure 5.9: Human Action Recognition Subsystems
This system classifies each fragment based on the optical flow computed from a camera in the CCTV system. Violent features are represented by the red box and non-violent features by the blue box.
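As an illustrative sketch only (not the exact deployment code), the classification stage described above can be organized as follows, where model and the per-frame optical flow array are assumed to come from the earlier training and preprocessing steps.

import numpy as np

SEGMENT_LEN = 30   # frames per segment (assumed)

def classify_segments(flow_frames, model, threshold=0.5):
    """flow_frames: array of shape (N, H, W, 2) holding per-frame dense optical flow."""
    labels = []
    for start in range(0, len(flow_frames) - SEGMENT_LEN + 1, SEGMENT_LEN):
        segment = flow_frames[start:start + SEGMENT_LEN][np.newaxis]  # add batch dim
        score = float(model.predict(segment)[0, 0])                   # violence score
        labels.append("violent" if score >= threshold else "non-violent")
    return labels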
Conclusion