AI Video Analytics for Surveillance Robots
Abstract
In our end-of-studies internship, we address the challenge of deploying AI-powered video analytics in video management software and intelligent detection by robots for surveillance applications. In this project we intend to use such networks to obtain good results for the detection task. Using video input from visible-light cameras, the process is built on the tiny YOLO-v3 architecture. Moreover, the main objective is the integration of such models into video management software so that they can be used in industry, which is a challenging task. To demonstrate the effectiveness of our proposed model, the detection process is applied in real time on the PGuard robot manufactured by Enova ROBOTICS, the host company of this project. Overall, this work can be summarized in two stages. The first stage consists of developing a person detection model; the second stage consists of integrating the object detection model into the PGuard robot by building a pipeline for the deployment phase and then developing a plugin in the video management software that can trigger alarms based on detections in the monitored area. Keywords: computer vision, deep learning, object detection, YOLO, visible and thermal cameras, deployment, intelligent video analytics, video processing service.
Acknowledgement
The internship opportunity that I had with Enova Robotics was a great chance for learning and professional development. I am grateful for having had the chance to meet so many wonderful people and professionals who guided me through this internship period. I would like to express my gratitude to my supervisors for their continuous support and encouragement throughout my graduation project. This project would not have been possible without their assistance. I want to use this opportunity to express my deepest gratitude and special thanks to Mr. Amir Ismail, my professional supervisor, who guided me throughout this project and took the time to listen and advise, making me feel confident. His support has been invaluable in helping me develop both personally and professionally. I also wish to place on record my deepest sense of gratitude to my university supervisor, Mr. Sami Achour, for the help and advice concerning the missions mentioned in this report, which he gave me during the various follow-ups, for the time he devoted to us throughout this period, and for his contribution to the development of this report.
Contents i
List of Figures iv
List of Tables vi
General Introduction 1
1 General presentation 3
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 General context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Presentation of the host organization . . . . . . . . . . . . . . . . . . . 3
1.2.1 Enova Robotics . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Sectors of activity . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Project context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 Definitions and concepts . . . . . . . . . . . . . . . . . . . . . . 7
1.3.2 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.3 Existing solutions . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.3.1 VEER . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.3.2 Openpath . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.3.3 BriefCam . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.4 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Project management methodology . . . . . . . . . . . . . . . . . . . . . 12
1.4.1 Unified Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.2 Two Tracks Unified Process . . . . . . . . . . . . . . . . . . . . 12
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Theoretical study 15
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 Deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 Realization 44
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.1 People detection model . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.1.1 Development environment . . . . . . . . . . . . . . . . . . . . . 44
4.1.2 Data source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.2.1 COCO dataset . . . . . . . . . . . . . . . . . . . . . . 45
4.1.2.2 Data preparation . . . . . . . . . . . . . . . . . . . . . 46
4.1.3 Detection phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1.3.1 Implementation and configuration details . . . . . . . 47
4.1.3.2 Training and testing . . . . . . . . . . . . . . . . . . . 48
4.2 Deployment phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.1 Hardware tools . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.2 Software tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2.1 JetPack SDK . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2.2 DeepStream SDK . . . . . . . . . . . . . . . . . . . . . 51
4.2.2.3 TensorRT . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.3 VPS Toolkit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2.4 Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Integration into Milestone XProtect . . . . . . . . . . . . . . . . . . . 55
4.3.1 General presentation of MIP SDK . . . . . . . . . . . . . . . . . 55
4.3.2 Development environment . . . . . . . . . . . . . . . . . . . . . 56
4.3.2.1 Hardware environment . . . . . . . . . . . . . . . . . . 56
4.3.2.2 Software environment and technologies . . . . . . . . . 57
4.3.3 VMS Plugin development . . . . . . . . . . . . . . . . . . . . . 59
4.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.1 Authentication interface . . . . . . . . . . . . . . . . . . . . . . 60
4.4.2 Milestone XProtect Smart Client home interface . . . . . . . . . 60
4.4.3 Plugin interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.4 Alarm manager interface . . . . . . . . . . . . . . . . . . . . . . 62
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
General conclusion 64
General Introduction
Within the next ten years, robotics is expected to reach into every aspect of life. This technology has the potential to change the way we live and work and to raise standards. Over time, its influence will keep growing, and so will the interaction between robots and humans. Between the 1960s and 1990s, most robots and robotic systems were limited to industrial applications. Today, robotics is expected to have a huge impact on many sectors such as the military industry, health care, customer service, transport, and logistics.
Robotics and artificial intelligence are increasingly being used in security, as they are able to provide a high level of security while also reducing costs. For example, robots can be used to patrol and monitor areas, while artificial intelligence can be used to analyze raw data from security cameras and to identify potential threats. Artificial intelligence is therefore being used to create more sophisticated and effective surveillance systems.
Despite the promise of machine learning and artificial intelligence, more than 80% of data science projects never make it to production. Having the right machine learning models and services is not enough to do machine learning at scale; they must also be placed in a secure, operationally performant, fully featured, cost-effective system with the right access controls in order to achieve the desired business results.
It is in this context that our end-of-studies project, hosted by the company Enova Robotics, fits. It aims to set up a solution for integrating AI-powered video analytics applications into a video management system in order to improve security and efficiency and to reduce cost. This solution specifically targets the Pearl Guard (PGuard), a mobile, intelligent, and autonomous security robot. Its main mission is to secure industrial sites and detect intrusions. The Pearl Guard can track an intruder by transmitting the intruder's precise location in real time along with a video stream and a thermal stream.
This report is divided into four chapters. The first chapter presents the general
context of the project, in which we focus on the work to be carried out and the study of related works. The second chapter presents the theoretical study, which is an introduction to deep learning and computer vision, and also presents the state of the art of object detection. The third chapter reveals the preliminary requirements of the study to answer the problem of the project, as well as the conceptual study, by providing diagrams that better illustrate the proposed solution. The last chapter exposes the various human-machine interfaces developed as well as some details of the implementation and results of our solution. We end this report with a conclusion and perspectives.
Chapter 1
General presentation
Introduction
In the first chapter, we start with a brief overview of the host organization. Then,
we highlight the context and issues of the project and describe its objectives. Finally,
we define the management methodology we used.
• Mini-Lab:
Mini-Lab as shown in the figure 1.3 is a robot designed by teachers for teachers.
It is a medium-sized robot optimized for indoor applications. It was born after an
experience of more than a decade in teaching and research in the field of mobile
robotics. Mini-Lab offers the best compromise between robustness and economic
competitiveness. For a complete teaching experience, Mini-Lab comes with pre-
defined labs and can be simulated on Matlab and Gazebo. Its control architecture
is open-source and is based on the Robot Operating System (ROS).
• Ogy:
Ogy as shown in the figure 1.4 is a home security robot developed by Enova Robotics in partnership with other Tunisian companies: Chifco and OPCMA. It is a companion robot that protects and monitors the home and has the ability to communicate with connected objects while acting in real time.
• Covea:
Covea as represented in the figure 1.5 is a telepresence robot designed to help el-
derly people at home. It provides continuous monitoring and tele-vigilance. It is
therefore a way to facilitate follow-up for doctors and allow them to act with peo-
ple who are far away. This model has been deployed in one of the main Tunisian
hospitals that cares for patients with COVID-19, in order to limit contact be-
tween caregivers and patients and to improve remote and synchronous exchanges
between patients and their families.
• AGV:
AGV as shown in the figure 1.6 is an autonomous mobile cart designed to trans-
port payload from warehouses to production lines in a wide range of industries.
The AGV is used by manufacturers to automate their internal transport and lo-
gistics. Depending on the customer’s business demand, custom top modules can
be mounted on the AGV to support different types of payloads.
Our survey of existing products focuses on intelligent video analytics platforms that integrate with a video management system. We chose to study the following systems:
1.3.3.1 VEER
• Strengths:
VEER offers a number of features that make it a valuable tool for businesses, including customer segmentation, the creation and management of customer profiles, and integration with other business applications. Among its features and services are gun detection and alerting, people counting, speed detection for traffic monitoring, etc.
• Weaknesses:
There are several potential weaknesses of the VEER solution, which include:
1. Lack of flexibility: The VEER solution is designed to be used in a specific way,
and it may not be possible to adapt it to different situations or needs.
2. Limited scalability: The solution may not be able to scale up to meet the needs
of a larger organization.
3. Dependence on technology: The solution is reliant on technology, and if there
are technical problems, the solution may not be able to function properly.
4. Lack of customization: The solution may not be able to be customized to meet
the specific needs of an organization.
1.3.3.2 Openpath
Openpath can be integrated with multiple leading video security solutions for greater
visibility and powerful, real-time analytics [8].
For more efficient risk reduction, video management software (VMS) combined with
physical access control creates a unified solution with best-in-class technology partners.
Integrating Openpath and remote video management software is a smarter approach to
IP surveillance because it provides visual context for all access control events.
• Strengths:
There are numerous benefits of an integrated video surveillance and access control
system.
1. Improved safety: Reduce loss-prevention issues by visualizing credentialed
users and controlling zone capacity with video occupancy tracking features.
2. Remote monitoring and management: Visually confirm or query entrances
from anywhere using a phone or tablet and no additional servers. With remote
lock and unlock capabilities, you can solve problems and manage access events.
3. Intelligent response: Context-driven security benefits from real-time alerts and
reporting, which leads to better decision-making and issue response.
4. Reduced administrative and IT hassle: Provisioning and installation are simple,
with no drain on IT resources.
• Weaknesses:
1. Potential security risks: The solution may pose security risks, as it stores
sensitive information in a cloud based service.
2. Lack of customization: The Openpath solution may not be adaptable to an
organization’s specific requirements.
1.3.3.3 BriefCam
BriefCam is a video content analytics platform that enables users to quickly and
easily identify, review and analyze video footage [9]. The platform provides a range of
features and tools that make it easy to search and review video footage, as well as to
identify and track objects and people in video footage.
• Strengths:
1. Faster filtering and sorting experience.
2. Security: customers can define their own database credentials.
3. Multilingual: multiple languages are available in the user interface.
4. Easy to use: to help users better navigate the BriefCam system, interactive exercises were added to the user training in their portal.
• Weaknesses:
There are several weaknesses of the BriefCam video content analytics platform
that should be considered before using it.
1. The platform is not able to process video in real-time, which can be a problem
if you need to analyze footage as it happens.
2. The platform is not very accurate when it comes to identifying objects and
people in video footage, which can lead to false positives or false negatives when
trying to detect specific activity.
3. The BriefCam video content analytics platform is quite expensive, which may
make it unaffordable for some users.
1.3.4 Objective
The objective of this internship is to build a solution for integrating and deploying
deep learning models with a video management system that enables us to receive video
streams from different cameras of the robot, to be then forwarded to the intelligent video
analytics that will be responsible for the video processing part. The video processing
service should be able for example to detect specific objects such as persons, cars, etc
from different cameras using deep learning models. Then, it will return metadata to
the video management software to be read and analyzed. In proposed solution, we will
focus on object detection to test our pipeline. In fact, the aim of object detection is to
analyze the video sequences and trigger alarms to the user when something occurs. In
this manner, a bounding box should be drawn around each detected object all along its
presence in the video with a class name above it. We will focus on people detection in
our solution as an example of object detection. The alarm will be shown in the smart
client interface and the recorded videos will be displayed and saved. This could help us
determine the state of the monitored area.
• Architecture-centric:
Different views are used to describe the system architecture. The architect pro-
ceeds in stages, starting by defining a simplified architecture that meets the prior-
ity requirements, then defining the subsystems more precisely from the simplified
architecture found earlier.
• Risk-focused:
Identify risks and maintain a list of risks throughout the project.
system exchange, producing the specifications and modeling the context (the system is a black box, the actors surround it and are connected to it, and on the axis that binds an actor to the system we place the messages that the two exchange, with their direction).
The process then revolves around three essential parts:
• A technical branch:
The technical branch capitalizes on technical know-how and/or technical con-
straints. The techniques developed for the system are independent of the functions
to be performed.
• A functional branch:
The functional branch capitalizes on the knowledge of the business of the company.
This branch captures functional needs, which produces a model focused on the
business of end users.
• An implementation phase:
The implementation phase consists of bringing the two branches together, allowing
application design to be carried out and finally the delivery of a solution adapted
to the needs.
Conclusion
This chapter is the cornerstone of our project: we defined the scope of our study, followed by the problem statement, in order to specify our objectives. Indeed, the issues and problems faced enabled us to prepare a good design for the improvements that we will add to the proposed solution to meet our needs. In the following chapter we present the theoretical study, which consists of an introduction to deep learning and computer vision as well as the state of the art of object detection.
Chapter 2
Theoretical study
Introduction
The objective of this chapter is to define and introduce the field of deep learning
and then present the state-of-the-art of object detection. We have divided this chapter
into three sections. In section one, we introduce the field of deep learning, where we
describe its general concept and how neural networks learn. Section two refers to the
computer vision part in which we define the concept of object detection. Finally, in
the last section, we present the state of the art of object detection and the different
architectures.
A network may have three types of layers: input layers that take raw input from the
domain, hidden layers that take input from another layer and pass output to another
layer, and output layers that make a prediction as shown in the figure 2.2.
• Activation for hidden layers: In the hidden layers of a neural network, a dif-
ferentiable nonlinear activation function is typically used. As a result, the model
can learn more complex functions than a network trained with a linear activation
function. The picture 2.3 presents the Sigmoid, TanH and ReLu functions.
• Activation for output layers: There are three activation functions you may
want to consider for use in the output layer:
– Linear
– Logistic (Sigmoid)
– Softmax
The figure 2.4 represents the softmax activation function. This function is well suited to classification problems, especially multi-class classification problems, because it returns a "confidence score" for each class. Because we are dealing with probabilities, the softmax outputs always add up to 1.
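To make this concrete, here is a minimal sketch in Python/NumPy (not taken from our implementation) that computes the softmax of a vector of raw class scores and verifies that the outputs sum to 1:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])   # raw scores for three classes
probs = softmax(scores)
print(probs)        # approximately [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```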
Minimizing the loss function means better performance on all of the training samples. Backpropagation is the algorithm for efficiently computing the gradient of this loss, and it is at the heart of how a neural network learns.
The majority of deep neural networks are feed-forward, which means data only flows from input to output in one direction. Backpropagation, in contrast, propagates the error backwards through the network: it allows us to calculate and attribute each neuron's contribution to the error, so that the weights can be fine-tuned appropriately.
2.1.5 Optimizers
Optimizers are algorithms used to minimize an error function (loss function) or to
maximize the efficiency of production. Optimizers are mathematical functions that are
dependent on the model’s learnable parameters including weights and biases. They help
us know how to change the weights and learning rate of a neural network to reduce the
losses. We are going to introduce two types of optimizers.
The question is how to choose an optimizer. If the data is sparse, adaptive methods such as Adagrad, Adadelta, RMSprop, and Adam are recommended. In many cases, RMSprop, Adadelta, and Adam give similar results; Adam essentially adds bias correction and momentum on top of RMSprop. Finally, as the gradients become sparse, Adam tends to perform better than RMSprop.
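As an illustration of what an optimizer actually does with the gradients, the following sketch implements a single Adam update step in plain NumPy; the hyper-parameter values are the commonly used defaults, not values taken from our training:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: returns the new weights and updated moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (adaptive scaling)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: one parameter vector, a fake gradient, first update step (t = 1).
w = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
w, m, v = adam_step(w, np.array([0.5, -0.2, 0.1]), m, v, t=1)
```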
• A fully connected layer that uses the output of the convolution process to predict
the image’s class using the features extracted in previous stages.
2.2.3 Deployment
Deep learning models are able to learn complex patterns in data and make pre-
dictions about new data. Deploying deep learning models can help organizations to
automate tasks, improve decision making, and gain insights from data. It’s one of the
last steps in the machine learning process, and it is also one of the most cumbersome and time-consuming phases.
There are many ways to deploy deep learning models. Some popular methods are:
1. Using a cloud service: this is a popular method for deploying deep learning
models. The model is stored on a cloud service such as Amazon AWS or Google Cloud
Platform. Once the model is deployed, you will need to make sure it is accessible to your
users. This can be done by creating an API or by using a tool such as AWS Lambda
and finally it needs to be monitored to make sure it is working correctly. However, there are a few reasons why deploying an AI model on the cloud may not always be recommended. First, hosting can be expensive. Second, it can be difficult to manage and monitor the model on the cloud. Finally, there is a risk that the model may be compromised.
2. On the edge: edge AI refers to AI algorithms that are run locally on a device
such as an edge node. Many applications require the deployment of deep learning at the
edge for real-time inference and privacy issues. In terms of network bandwidth, network
latency, and power consumption, it significantly reduces the cost of communicating with
the cloud. Edge devices, on the other hand, have limited memory, computing resources,
and processing power. This requires the optimization of a deep learning network for
embedded deployment.
One-stage methods prioritize inference speed, and example models include YOLO, SSD,
and RetinaNet. Two-stage methods prioritize detection accuracy, and example models
include R-CNN, Fast R-CNN, Faster R-CNN, and Cascade R-CNN. Section one refers
to two-stage methods while section two presents one-stage methods.
2.3.1.1 R-CNN
The Region-based Convolutional Neural Network (R-CNN) was one of the first models to apply deep learning to the task of object detection. The algorithm was published in 2014. The R-CNN series are based on the concept of region proposals.
They are used to localize the object in an image. As described in the figure 2.7, the
overall pipeline is composed of three stages:
• Generate region proposals: the model must draw candidates of objects in the
image, independent from the category using a selective search algorithm. The
proposal regions are cropped and resized from the image.
• The second stage is a fully convolutional neural network that computes features
from each cropped and resized region.
• The final stage is a fully connected layer whose features are fed to SVM classifiers that classify each region, together with a regressor that refines the region proposal bounding boxes.
The selective search algorithm is used to generate region proposals. To avoid the prob-
lem of selecting a large number of regions, Ross Girshick et al. proposed a method in
which selective search is used to extract only 2000 regions from the image, which they referred to as region proposals [3]. Selective Search is composed of 3 steps:
• Generate an initial sub-segmentation of the image to produce many candidate regions.
• Use a greedy algorithm to recursively combine similar regions into larger ones.
• Use the generated regions to produce the final candidate region proposals.
R-CNN suffers from several problems. First, training the network takes a huge amount of time, since 2000 region proposals must be classified per image. Second, it cannot be used in real time because each test image takes approximately 47 seconds to process. Finally, the selective search algorithm is a fixed algorithm, so no learning happens at that stage, which may result in the generation of poor candidate region proposals.
We can conclude that selective search is a slow and time-consuming process that
affects the performance of a network. As a result, Shaoqing Ren et al. developed
an object detection algorithm that does away with the selective search algorithm and
allows the network to learn region proposals [4]. Fast R-CNN likewise, we pass the
input image directly to the convolutional neural network but instead of applying on
the convolutional feature map the selective search algorithm to identify the region
proposals, another network is applied for the prediction of the region proposals. Then,
the predicted region proposals are reshaped using a RoI pooling layer which is used for
the image classification within the proposed region and predict the offset values for the
bounding boxes.
The picture 2.9 describes the components of Faster R-CNN.
It is composed of 3 parts:
• Region proposal network: a small network slides over the convolutional feature map and proposes candidate object regions.
• Region of interest pooling: the proposed regions have different sizes on the feature map, so RoI pooling is applied to bring all the features to a fixed size.
• Classifier and regressor: classify each region and draw a bounding box around the object that was classified.
The picture 2.10 shows the difference in test-time speed between the different models.
It is clearly shown that Faster R-CNN is much faster than its predecessors. Therefore,
it can even be used for real-time object detection.
SSD (Single Shot MultiBox Detector) discretizes the output space of bounding boxes into a series of default boxes at varied aspect ratios and scales per feature map position, requiring only one shot to detect multiple objects present in an image.
SSD has a base VGG-16 network followed by multibox convolutional layers. The VGG-16 base network in SSD is a standard CNN architecture for high-quality image classification, but without the final classification layers; it is used for feature extraction. Additional convolutional layers are then added for detection.
SSD does not split the image into a grid like YOLO, but it predicts the offsets of the
predefined anchor boxes for every location in the feature map. In this case, each box
has a fixed size and position in relation to its corresponding cell. The framework of the
SSD is seen in figure 2.11.
The final bounding box prediction is calculated by adding the predicted offset over the
default boxes and the regression loss is calculated to correct the offset on the basis of
the ground-truth bounding box matched. In the situation where no predicted boundary
box can be mapped to any of the ground truth boundary boxes, the regression loss is
fixed to zero.
W. Liu et al.[5] have proposed an SSD detection framework that follows an output
discretization of bounding boxes into a default set of boxes (similar to Faster-RCNN
anchor boxes) for every feature map location. The default boxes have various aspect
ratios and scales to better match any object shape in the image. In addition, SSD
combines feature map predictions at multiple scales to further manage the scales of
objects relative to the image. SSD is a one-stage approach that removes the necessity
of an object proposal step, which makes it simpler and faster than the Faster-RCNN
approach.
The term ”You Only Look Once” is abbreviated as YOLO[6]. It’s a real-time object
detection algorithm that employs neural networks. This algorithm is well-known for
its trade-off between speed, accuracy, and ability to learn. It has been used to identify
traffic signals, persons, parking meters, and animals in a variety of applications. It
employs CNN to detect objects in real-time. A single algorithm run is used to predict
the entire image. At the same time, the CNN is used to predict multiple bounding
boxes and class probabilities.
The three techniques used by the YOLO algorithm are as follows:
• Residual blocks: the image is divided into various grid cells, where the grid has dimensions S x S.
• Bounding box regression: for each grid cell, the network predicts the bounding boxes and their dimensions directly as a regression problem.
• Intersection Over Union (IOU): assuming there are two bounding boxes, one green and the other blue, where the blue box represents the predicted box and the green box represents the actual (ground-truth) box, IOU measures how well the predicted box overlaps the actual box; YOLO uses it to keep only predicted boxes that match the real objects well.
The final detection will be made up of distinct bounding boxes that exactly suit the
objects. The YOLO algorithm’s several steps are illustrated in the figure below.
Each of the S x S grid cells predicts B bounding boxes. The model produces a confidence score for each bounding box indicating the probability that the cell contains an object [7]. The score for each bounding box does not classify the type of object or say which object matches the bounding box; it simply gives a confidence value, or probability, reflecting the quality of the bounding box surrounding an object in each cell [8]. It is based on the intersection over union (IOU) between the predicted box and the ground-truth box, defined as:
confidence = Pr(Object) x IOU(truth, pred)
where Pr(Object) is the probability of having an object. If no object exists in the cell, the confidence score must be zero. The model also produces four numbers that represent the position and dimensions of the predicted bounding box: its center (x, y) and its width (w) and height (h). YOLO, a unified
one-step detection model proposed by J. Redmon et al. [9], reassigns object classifiers
to the object detection task. In YOLO, a single detection model is used for predicting
both object class probabilities and bounding box parameters in a single forward pass.
When compared to Faster-RCNN, this allows YOLO to save time. An illustration of
YOLO detection model is shown in figure 2.13
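To make the IOU used in the confidence score above concrete, the following sketch computes it in Python for two boxes given as (x1, y1, x2, y2) corner coordinates (a convention chosen for this example; YOLO itself predicts centre coordinates, width and height):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.1428...
```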
YOLO v1 [9] is the first version of YOLO detectors that uses the Darknet framework
which is trained on the ImageNet dataset. The use of YOLO v1 was relatively limited.
The issue is primarily caused by the inability of this version to detect small objects
in images. The second version of YOLO was released [10] at the end of 2016. Its improvements mainly concern faster execution time, and it includes more advanced components. More precisely, YOLO v2 improves the stability of the neural network, and increasing the input image size improved the Mean Average Precision (mAP) by up to 4%. Furthermore, YOLO v2 addressed the first version's problem of missed detections of small objects by dividing the image into 13 x 13 grid cells. This enables YOLO v2 to recognize and locate smaller objects in the image while remaining equally effective with larger objects. YOLOv3 is the third version of the YOLO family. It performs predictions at 3 scales to enable multi-scale detection, as in the FPN detector. The third version has become one of the most popular object detectors for the following reasons: for starters, YOLOv3 is more efficient in terms of speed, as it runs much faster than previous YOLO versions. In addition,
the third version of the YOLO algorithm is much more precise in the detection of small
objects. It also makes the identification of classes more exact.
The YOLOv3-tiny model is a simplified version of the YOLOv3 model, with lower mean average precision but more frames per second, as shown in table 2.1 below.
When speaking of architectures, there are three principal blocks: backbone, neck, and head. For YOLOv3-tiny we have the following: YOLOv3 uses Darknet-53 as a backbone, which uses many 1x1 and 3x3 convolution kernels
to extract features. YOLOv3 tiny reduces the number of convolutional layers, its basic
structure has 13 convolutional layers and 6 Max pooling layers, and then features are
extracted by using a small number of 1x1 and 3x3 convolutional layers [12]. To achieve
dimensional reduction, YOLOv3-tiny uses the pooling layer instead of YOLOv3’s con-
volutional layer with a step size of 2 [13]. However, its convolutional layer structure
still uses the same structure of Convolution2D + BatchNormalization + LeakyRelu as
YOLOv3. The image is first fed into Darknet53 for feature extraction before being fed
into the feature pyramid network for feature fusion. The results are then generated by
the YOLO layer. In a one-stage detector, the head’s role is to make the final prediction,
which consists of a vector with the bounding box’s width, height, class label, and class
probability.
The network structure of YOLOv3 tiny is shown in Figure 2.14 .
Conclusion
This chapter introduced the deep learning and computer vision domains. Also it
provided an overview of existing solutions in the field of object detection that use CNN
architectures. Two types of algorithms of detection were presented: Two-Stage and
One-Stage object detection algorithms with different models.
Chapter 3
Preliminary and conceptual study
Introduction
As explained in the first chapter, the requirements correspond to the functionalities offered by our solution. We describe the functional requirements by actor as well as the non-functional requirements that our project must meet. This phase is essential in the life cycle of any software, and more precisely with the Y (2TUP) methodology. We then present our proposed solution. Finally, we detail the different design elements of our proposed solution in the conceptual study section.
The system must be able to work with live video streams in real time and to integrate with other security systems.
The system has the potential to replace human security guards in a number of ways.
- First, video analytics can be used to automatically identify and track people and objects in a video feed. This means that it can monitor an area for security purposes without a human guard having to watch the footage constantly.
- Second, video analytics can be used to raise an alarm in the event of a security breach. This can be done by detecting unusual patterns of movement or by recognizing specific objects that are not supposed to be in a particular area.
- Third, video analytics can be used to provide detailed reports of events that have occurred in an area. This can be used to investigate a security breach or to track the movements of people and objects over time.
Overall, video analytics has the potential to replace human security guards in a number of ways: it is more accurate than human guards, it can monitor a larger area, and it can provide detailed reports of events.
The process would be as follows: when an alarm is triggered, a 25-second recording is started on the alarm's source camera, including 5 seconds of pre-alarm footage, for further review by a human.
Moreover, building deep learning models capable of detection or tracking is not sufficient; the difficult task is to be able to use them in industry as well. This means being able to take the models and put them into a production environment so that they can be used by others. As a result, the main task is to build a pipeline and a workflow for deploying any kind of model so that it is ready to use. The integration part is therefore a necessity in any computer vision project, and it is challenging.
- Actors: represent the users of the system. An actor can be a person, an organiza-
tion, or a piece of software.
-Use cases: represent the actions that users can perform with the system. A use
case should be a complete and self-contained description of an action that a user can
take.
- Associations: represent the relationships between actors and use cases. An asso-
ciation indicates that an actor can perform a use case.
- Generalizations: represent inheritance relationships between actors or use cases.
A generalization indicates that one actor or use case is a more specific type of another
actor or use case.
• Global use cases diagram A use case highlights a functionality, i.e. an in-
teraction between the actor and the system. The use cases delimit the system,
its functions and relations with its environment. They therefore represent a way
of using the system and make it possible to describe its functional requirements.
The following figure 3.1 shows the general use case diagram.
- For our application, reliability and speed are two interdependent constraints: the satisfaction of one affects the other. Our solution must guarantee a minimum error rate while reducing the response time as much as possible.
- In addition, we must cover all possible cases in order to avoid any exception. Finally, our solution must imperatively ensure security during the exchange of information between the VMS and the VPS, and guarantee that no unauthorized party can access the video streams from the monitored sites.
3.2.2 Constraints
In this part, we will describe in details the technical constraints that our solution
must follow. First of all, our company uses a VMS called XProtect Milestone so it’s a
necessity to integrate our solution in this video management software. In the section
below, we will present some VMS alternatives.
There are many video management software (VMS) solutions available on the market today. Some of the most popular are Milestone XProtect, Genetec Security Center, and Hikvision NVR. Each has its own unique set of features and benefits, so it is important to
choose the one that is right for your specific needs.
• Hikvision NVR [26]: is a cost-effective VMS that is well suited to small businesses. It offers many of the same features as the other two VMSs, but it is more affordable. Hikvision NVR is also very easy to use, so it is a good choice for less experienced users.
- One of the main strengths of Hikvision’s VMS is its ease of use. The system
is designed to be user-friendly and can be easily navigated by users of all levels
of experience. The VMS also offers a wide range of customization options, which
allows users to tailor the system to their specific needs.
• Genetec Security Center [27]: is a comprehensive VMS that is ideal for large
organizations. It offers advanced features such as video analytics, license plate
recognition, and integration with other security systems. Genetec Security Cen-
ter is also very scalable and can be customized to meet the specific needs of your
business.
- GSC has a number of features that make it a powerful VMS, such as the abil-
ity to manage and monitor multiple security devices from a single interface, the
ability to create rules and alerts to notify users of potential security threats, and
the ability to integrate with other Genetec security products.
- However, GSC also has some weaknesses. One weakness is that it is a very com-
plex system, which can make it difficult to use and configure. Another weakness
is that it is not as widely compatible with third-party security devices as some
other VMSs.
- One of the main strengths of XProtect Milestone VMS is its scalability. The
software can be used to manage small surveillance systems with just a few cam-
eras, or large enterprise-level systems with hundreds or even thousands of cameras.
Additionally, the software is easy to install and configure, and users can be up
and running in just a few minutes.
- On the other hand, the software requires a fair amount of technical knowledge to install and configure, which can be a barrier for some users.
For these reasons, we chose to work with Milestone XProtect, since it is the most scalable and the best-suited software for our needs. It offers a great deal of flexibility and customization. Basically, it provides an MIP SDK that enables us to develop our own features without the vendor's intervention. It also has a robust feature set that includes video analytics and alarm management. Furthermore, the host company had previously developed the PGMS, a "PGuard Management System", which it has recently integrated into the Milestone XProtect VMS. This management system allows, on the one hand, the management and control of the various functionalities of the PGuard robot, and on the other hand, the control of the PGuard through devices connected to the interface.
Second, we develop the pipeline of the video processing service. The VPS [30] currently enables the analysis of live video streams: live video is passed into the environment where the VPS is deployed for analysis, and the video analysis application returns metadata and a video as output, which are injected back into the VMS. Finally, the plugin integrated with the video management software reads the metadata returned by the analytics application to generate alarms for the user when something occurs in the monitored area; a recorded video is also made available to the end user for review. The figure 3.6 shows the IVA, our VPS and the flow of video streams.
The activity diagram allows us to graphically represent the behavior of a method or the routing of use cases. The following figure 3.8 represents a complete scenario of using our plugin in Milestone XProtect and navigating within it.
The agent must launch the Milestone VMS by first checking the servers (Milestone
XProtect Management server and Milestone XProtect Recording server). Once the
server is launched, the agent can access the plug-in.
If the robot is connected as well as the cameras, the agent can view the video feeds
from the cameras, and he can have an idea about the current status of the robot in
real-time. The agent can activate intrusion detection to ensure optimal surveillance of
the area where the robot performs its patrols by means of artificial intelligence. In case
of intrusion detection, an alarm is generated to notify security agents in real time; the agent can finally stop this by disabling the functionality.
This is the first step that takes place when an application is launched. According to
the following figure 3.9, the user must launch Milestone and enter his connection parameters; the system then verifies the entered data with Milestone's XProtect Management server. If the data is valid, the "home interface" is displayed; in case of error, an error message is shown.
The sequence diagram presented by the following figure 3.10 is relative to the step
of starting the automatic surveillance. After the authentication of the user, and the
connection verification of the robot and the cameras, the user can open the Smart Client to access the plug-in, where the system activates the "Start" button. Once this button is clicked, the system sends the video streams from the IP cameras of the robot to a VPService and then reads the metadata in real time so it can be analyzed to determine whether a person is detected at that moment or not. If a person is detected, an alarm is triggered in the user interface together with a video recording at that specific time.
Conclusion
In this chapter we have detailed the main functionalities of our application, which we illustrated with use case diagrams to ensure a better understanding and a good mastery of our work, as well as the conceptual study of our solution, enriched among other things by a deployment diagram, sequence diagrams and an activity diagram, reflecting the static and dynamic aspects. In the next chapter we will present the realization and implementation of our project.
Chapter 4
Realization
Introduction
This last part of our project is very critical, since it puts into practice all the theory studied in the previous chapters. In this last chapter we first present the hardware and software environment in which our project was developed, indicating the technologies used. Then, we present the data used for building our model. Finally, we conclude this chapter with the interfaces of our solution.
• Google Colab: a web IDE for Python that enables machine learning with storage in the cloud. It is a Google product.
• All the images in the dataset should contain at least one person.
• The persons in the images should be of different sizes, scales, shapes and heights.
For example, some of the images may contain very close persons that can be
appreciated perfectly in the image, but other persons may be very far away. Also,
images should contain persons from different camera angles.
To sum up, the dataset should cover persons in the different situations in which they can be found in real life (weather conditions, camera status, etc.), so that our model, which is intended for real-world applications, can detect them under all circumstances. The dataset for person detection used in our work is discussed in the next paragraphs.
The dataset that we used for training our custom model is the MS COCO dataset [31]. MS COCO (Microsoft Common Objects in Context) is a large-scale image dataset of 328,000 samples of commonplace objects and humans with 80 object categories. Annotations from the dataset can be used to train machine learning models to recognize, classify, and describe things.
Data preparation is the process of cleaning and organizing data so that it can be
used for analysis. This may involve removing invalid data, filling in missing values, and
reformatting data so that it is consistent. Data preparation is an important step before
creating deep learning models because it can improve the accuracy and quality of the
results.
Therefore, we used techniques such as labeling with the help of the Roboflow platform [45]. We started with 5019 images, and after the data cleaning procedure we ended up with 4670 images, split as follows:
Also, to make sure that all images have the same size, we resized them to 416x416. In order for data to be useful, it must be properly labeled. This process is known as data labeling, and it is a critical part of any data-driven project. With the help of Roboflow, we annotated all our images.
By the end of this process, we obtain a .txt file that has the same name as the corresponding image and contains the class and the bounding box annotations in YOLO format, as shown in the figure 4.2 below.
Box coordinates (x center, y center, width, height) must be normalized (from 0–1),
which means they must be divided by the image width and height.
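As an illustration, here is a small sketch (with hypothetical pixel values, not taken from our dataset) that converts a pixel-coordinate box into a normalized YOLO annotation line:

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-coordinate box into a YOLO annotation line."""
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# A person box of 100x200 pixels in the top-left corner of a 416x416 image.
print(to_yolo_line(0, 0, 0, 100, 200, 416, 416))
# -> "0 0.120192 0.240385 0.240385 0.480769"
```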
• Create and edit the configuration file yolov3-tiny-custom.cfg as shown in table 4.1. The number of filters in the convolutional layers preceding each YOLO layer is computed as filters = (nb_classes + 5) x 3; with a single class this gives (1 + 5) x 3 = 18. Setting the batch size to 64 means that 64 images are used for every training iteration.
Table 4.1: Configuration of yolov3-tiny-custom.cfg
Parameter        Value
Batch            64
Subdivisions     16
Width            416
Height           416
Max batches      12000
Filters          18
Classes          1
Learning rate    0.001
• Create obj.data and obj.names as a pre-training step: obj.names contains the class names, in our case the single class "person" (see the sketch after this list). The figure 4.3 represents the obj.data file, which contains the number of classes and the locations of the train and validation samples.
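The sketch below shows what these two Darknet files typically contain and how they could be generated; the paths are illustrative assumptions, not the exact layout used in our project:

```python
# Sketch only: file paths below are assumptions and must be adapted.
with open("obj.names", "w") as f:
    f.write("person\n")                     # one class name per line

with open("obj.data", "w") as f:
    f.write("classes = 1\n")                # number of classes
    f.write("train = data/train.txt\n")     # list of training image paths
    f.write("valid = data/valid.txt\n")     # list of validation image paths
    f.write("names = obj.names\n")          # file containing the class names
    f.write("backup = backup/\n")           # folder where weights are saved
```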
Now that our dataset is ready, we can launch the training command to start the training.
The training results were as follows: mAP = 40% and average loss = 2.
After the training process finishes, the weights are saved in the backup folder. This file is used in the testing phase. As for the test command, we pass the obj.data file, the custom config file and the best weights obtained from the training, as shown in the figure 4.5.
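For reference, such commands typically follow the standard Darknet (AlexeyAB fork) syntax; the sketch below shows one way to launch them from Python, with file names given as assumptions rather than the exact ones used in figures 4.4 and 4.5:

```python
import subprocess

# Training from pre-trained convolutional weights, computing mAP during training.
subprocess.run(["./darknet", "detector", "train",
                "data/obj.data", "cfg/yolov3-tiny-custom.cfg",
                "yolov3-tiny.conv.11", "-map"], check=True)

# Testing the best weights obtained during training on a single image.
subprocess.run(["./darknet", "detector", "test",
                "data/obj.data", "cfg/yolov3-tiny-custom.cfg",
                "backup/yolov3-tiny-custom_best.weights", "person.jpg"], check=True)
```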
The NVIDIA JetPack SDK[34] is the most complete platform for creating AI appli-
cations. It contains the most recent Jetson OS images, as well as libraries and APIs,
samples, developer tools, and documentation. DeepStream for streaming video analyt-
ics, Isaac for robotics, and Riva for conversational AI are all supported by JetPack SDK.
It is a software development kit for the NVIDIA Jetson embedded supercomputers. The
SDK provides a complete Linux environment for developing CUDA applications as well
as support for installing the latest version of the CUDA toolkit. We chose to install JetPack version 4.5 since it supports DeepStream 5.1, which we will be working with.
The DeepStream SDK is built on top of the GStreamer multimedia framework, and developers can create custom plug-ins to extend the functionality of the framework. The GStreamer framework is an open source project.
In fact, the GStreamer-based DeepStream reference application as shown in the figure
4.10 consists of a collection of GStreamer plugins that encapsulate low-level APIs to
create an entire graph. The reference application can accept input from a wide range
of sources, including cameras, RTSP input, encoded file input, and it also supports
multiple streams and sources. NVIDIA has implemented and made available a number
of GStreamer plugins as part of the DeepStream SDK, including:
4.2.2.3 TensorRT
TensorRT[37] is a high performance inference engine for Nvidia GPUs. It enables de-
velopers to optimize models for deployment on embedded, mobile, and server platforms.
TensorRT can also be used to improve the performance of deep learning applications
such as image classification, object detection, and machine translation.
The orange arrows show how video and metadata from a real camera flow through the
system. A Camera Driver captures video and metadata, which is then transmitted to
be optionally recorded on disk and provided as a live stream to the Smart Client. In
an XProtect VMS, this is how the typical flow of video and metadata works.
The VPS Driver sends video to a VP Service, an ASP.NET Core 3.1 web application, which processes it using a GStreamer pipeline. The video and/or metadata output from
the GStreamer pipeline can optionally be transmitted back into the XProtect VMS via
the VPS Driver. This is represented in the diagram by the blue arrows. The VPS
Driver receives the feed from the VP Service and exposes it via a new camera device
and a new metadata device. The video and/or metadata feeds can now be used exactly
as if they had originated from a physical camera device. You can, for example, record
the feed and view both the current stream and the recorded data in the Smart Client.
4.2.4 Deployment
The first step is to install JetPack version 4.5 on the Jetson Nano and to install DeepStream SDK 5.1. Second, we create two text files named respectively 'deepstream_app_config_yoloV3_tiny.txt' and 'config_infer_primary_yoloV3_tiny.txt', which contain parameters about our custom model and the paths to our custom yolov3-tiny configuration file, labels file and best weights. We change the batch size and subdivisions to 1 for the test.
The nvdsinfer_custom_impl_yolo folder, which contains the YOLO custom implementation files, should be rebuilt using the command in the figure 4.13.
Also, in order to decode the bounding boxes correctly, we created a new parse-bbox function in this folder, which the deepstream_app_config file calls. This function depends on the anchors and masks defined in our custom model's config file. We should also specify the number of classes, which in our case is one class.
After preparing our files, we run the command in the figure 4.14, which builds a TensorRT engine file and runs the application on a specific video whose location we specify in the deepstream_app_config file.
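As a sketch of these two steps (folder and file names are assumptions that must match the files described above), the rebuild and run commands can be launched as follows:

```python
import subprocess

# 1) Rebuild the custom YOLO bounding-box parser library.
subprocess.run(["make", "-C", "nvdsinfer_custom_impl_Yolo"], check=True)

# 2) Run the reference application; on the first run DeepStream serializes a
#    TensorRT engine file, then processes the video set in the config file.
subprocess.run(["deepstream-app", "-c", "deepstream_app_config_yoloV3_tiny.txt"],
               check=True)
```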
The last step is to run the VPService on Linux using the command make run. As shown in the figure 4.15, we specified the port on which our service listens by editing the 'appsettings.Production.json' file. We also modified the GStreamer plugins, which are written in C++: the "vpsdeepstream" plugin, in which we specified the path of the DeepStream application we created, and the "vpsnvdstoonvif" plugin at the end of the pipeline, which converts the bounding boxes created by the DeepStream elements into ONVIF format and passes them along as vpsonvifmeta. This allows XProtect to understand them and display the bounding boxes in the XProtect Smart Client.
• Protocol integration:
A basic integration method that is especially well suited for the integration of
non-Windows applications.
• Component integration:
Allows you to implement MIP components into your application, which is useful
when using Milestone libraries in a Windows-based application.
• Plug-in integration:
The most refined method of integration. This allows us to integrate our plugin
into the Milestone XProtect Management environment and to run our plugin as
an integral part of XProtect software and its client applications.
The development machine is an Acer ConceptD 500 computer, shown in the figure 4.17, whose characteristics are the following:
• Storage: 2 TB
• Milestone VMS: one of the most widely used systems in the world; it provides the MIP SDK that allows us to develop our plugin within the Milestone system.
• Setting the VPS configuration in the Management Client plugin by adding a VPS Service URL to a source camera or a group; in our case we add it to a group since the robot has 4 cameras. As a result, 4 VPS hardware devices are created. Each VPS hardware device has 2 devices: a camera device and a metadata device, which is set as related metadata for the source camera.
• Defining the analytics event that will be sent on port 9090 to the Event server. For better security, you can specify which network addresses are allowed to send the analytics event.
• Defining the alarm in the Management Client. The alarm sources are the 4
cameras, and the triggering event is the one we have defined. We set the alarm
priority as high. Basically, the alarm is generated by the Event server and appears
in the Smart Client.
The plugin is responsible for reading, in real time, the metadata of each VPS metadata device (four in our case), analyzing it, and checking whether a person has been detected. If so, an analytics event in XML format is sent to the Event server over a TCP/IP connection. The source of the analytics event is the camera on which the person was detected. As a result, a recording is started on that source camera.
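As a rough sketch of this mechanism (written in Python for brevity, whereas the actual plugin is implemented in C# with the MIP SDK, and with an illustrative XML payload rather than the exact Milestone analytics event schema), the idea is simply to open a TCP connection to the Event server on port 9090 and write the XML document:

```python
import socket

# Hypothetical payload: element names are illustrative only; the real schema is
# the analytics event XML defined by Milestone.
xml_event = """<?xml version="1.0" encoding="utf-8"?>
<AnalyticsEvent>
  <EventHeader>
    <Name>PersonDetected</Name>
    <Source><Name>Robot camera 1</Name></Source>
  </EventHeader>
  <Description>Person detected by the VPS metadata reader</Description>
</AnalyticsEvent>
"""

EVENT_SERVER = ("127.0.0.1", 9090)   # address of the XProtect Event server (assumption)

with socket.create_connection(EVENT_SERVER, timeout=5) as sock:
    sock.sendall(xml_event.encode("utf-8"))
```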
4.4 Implementation
This section is dedicated to the presentation of the main interfaces of our solution
mainly the plugin.
The figure 4.26 represents the plugin that we developed, integrated into the Milestone XProtect VMS. As shown in the figure, our plugin has two buttons: the "Start" button starts the monitoring of the area and triggers alarms whenever a suspected person is detected, while the "Stop" button stops the monitoring process.
• Start button: launches a parallel thread that reads the metadata of all streams received from the VPS; if a bounding box of a human is found, an Analytic event[40] is sent to the Event server[38], which is responsible for a variety of functions relating to events, alerts, and maps, as well as third-party integrations via the MIP SDK, in order to trigger an alarm that activates the recording on the related device for a definite period (a sketch of this monitoring loop is given after this list).
• Stop button: stops the monitoring of the robot's cameras, which means the surveillance and the alarms are disabled.
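As a rough illustration of this Start/Stop behaviour (the real plugin relies on the MIP SDK metadata providers, not on the types shown here), the following C++ sketch runs a worker thread that periodically polls the four metadata devices and raises an event when a person bounding box is found. readLatestMetadata and sendAnalyticsEvent are hypothetical stand-ins for the SDK calls.

#include <atomic>
#include <chrono>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the MIP SDK metadata and event APIs used by the plugin.
struct BoundingBox { std::string label; };
std::vector<BoundingBox> readLatestMetadata(int /*deviceIndex*/) { return {}; }
void sendAnalyticsEvent(int /*deviceIndex*/) { /* e.g. post the XML event shown earlier */ }

class AreaMonitor {
public:
    // "Start" button: launch a polling thread over the four VPS metadata devices.
    void start() {
        running_ = true;
        worker_ = std::thread([this] {
            while (running_) {
                for (int device = 0; device < 4; ++device) {
                    for (const auto& box : readLatestMetadata(device)) {
                        if (box.label == "person") {
                            sendAnalyticsEvent(device);  // the Event server turns this into an alarm
                            break;
                        }
                    }
                }
                std::this_thread::sleep_for(std::chrono::milliseconds(200));
            }
        });
    }

    // "Stop" button: disable monitoring and join the worker thread.
    void stop() {
        running_ = false;
        if (worker_.joinable()) worker_.join();
    }

private:
    std::atomic<bool> running_{false};
    std::thread worker_;
};

int main() {
    AreaMonitor monitor;
    monitor.start();
    std::this_thread::sleep_for(std::chrono::seconds(2));  // monitor for a short while
    monitor.stop();
    return 0;
}

The 200 ms polling interval and the label string are arbitrary choices made for the sketch.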
When testing our model on the robot's cameras, we observed that it can also detect people on the thermal camera even though it was trained only on visible images, which highlights the effectiveness of our model. Figure 4.28 shows how our model detects on the thermal camera.
Conclusion
This last chapter was devoted to presenting the steps followed to realize our solution and the results of our project. We described the different environments and techniques adopted, followed by a presentation of the interfaces of our platform through some screenshots.
General conclusion
Our end-of-studies internship was carried out within the startup "Enova Robotics". It allowed us to set up a solution for integrating AI-powered video analytics into Milestone XProtect, helping the security guard perform his task correctly and comfortably through a video management system.
To achieve our goal, we started by studying the existing solutions. Then we explored the methodology necessary to carry out our project.
We also extracted the functional requirements from the specifications, developed the preliminary design, and finished with the detailed design. Finally, we presented the work carried out as well as the different environments and techniques adopted.
In addition to the advantages on the technical side, this project was an opportunity
to improve and develop our communication skills and allowed us to better integrate
into the professional world.
Bibliography
[1] Wei Fang, Lin Wang, and Peiming Ren. Tinier-yolo: A real-time object detection
method for constrained environments. IEEE Access, PP:1–1, 12 2019.
[3] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich fea-
ture hierarchies for accurate object detection and semantic segmentation. CoRR,
abs/1311.2524, 2013.
[4] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-
CNN: towards real-time object detection with region proposal networks. CoRR,
abs/1506.01497, 2015.
[5] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed,
Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox detector.
CoRR, abs/1512.02325, 2015.
[6] Juan Du. Understanding of object detection based on CNN family and YOLO.
Journal of Physics: Conference Series, 1004:012029, apr 2018.
[7] J. Lin and M. Sun. A yolo-based traffic counting system. In 2018 Conference on
Technologies and Applications of Artificial Intelligence (TAAI), pages 82–85, Los
Alamitos, CA, USA, dec 2018. IEEE Computer Society.
[8] Aleksa Ćorović, Velibor Ilić, Siniša Đurić, Mališa Marijan, and Bogdan Pavković.
The real-time detection of traffic participants using yolo algorithm. In 2018 26th
Telecommunications Forum (TELFOR), pages 1–4, 2018.
[9] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look
once: Unified, real-time object detection. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June 2016.
[10] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. CoRR,
abs/1804.02767, 2018.
[11] Upesh Nepal and Hossein Eslamiat. Comparing yolov3, yolov4 and yolov5 for
autonomous landing spot detection in faulty uavs. Sensors, 22(2):464, 2022.
[12] Tao Li, Yitao Ma, and Tetsuo Endoh. A systematic study of tiny yolo3 inference:
Toward compact brainware processor with less memory and logic gate. IEEE
Access, PP:1–1, 08 2020.
[13] Dong Xiao, Feng Shan, Ze Li, Ba Tuan Le, Xiwen Liu, and Xuerao Li. A target
detection model based on improved tiny-yolov3 under the environment of mining
truck. IEEE Access, 7:123757–123764, 2019.
[14] Wen-Hui Chen, Han-Yang Kuo, Yu-Chen Lin, and Cheng-Han Tsai. A lightweight
pedestrian detection model for edge computing systems. In International Sympo-
sium on Distributed Computing and Artificial Intelligence, pages 102–112. Springer,
2020.
Webliography
[1] https://www.enovarobotics.eu/
[2] https://enovarobotics.eu/pguard/
[3] https://enovarobotics.eu/minilab/
[4] https://www.enovarobotics.eu/other-products/
[5] https://www.enovarobotics.eu/agv/
[6] https://www.veertec.com/
[7] https://www.milestonesys.com/marketplace/veertec-ltd/zen-analytics-platform/
[8] https://www.openpath.com/vms
[9] https://www.briefcam.com/resources/videos/
[10] https://en.wikipedia.org/wiki/Unified_Process
[11] https://fr.wikipedia.org/wiki/Two_Tracks_Unified_Process
[12] https://www.ibm.com/cloud/blog/ai-vs-machine-learning-vs-deep-learning-vs-neural-networks
[13] https://towardsdatascience.com/everything-you-need-to-know-about-neural-networks-and-backpropagation-machine-learning-made-easy-e5285bc2be3a
[14] https://www.researchgate.net/figure/Activation-function-a-Sigmoid-b-tanh-c-ReLU_fig6_342831065
[15] https://towardsdatascience.com/softmax-activation-function-explained-a7e1bc3ad60
[16] https://sebastianraschka.com/faq/docs/gradient-optimization.html
[17] https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c
[18] https://medium.com/@RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-deep-learning-99760835f148
[19] https://wansook0316.github.io/ds/dl/2020/09/02/computer-vision-02-RCNN.html
[20] https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e
[21] https://www.researchgate.net/figure/Faster-R-CNN-Architecture-9_fig1_324549019
[22] https://www.researchgate.net/figure/Comparison-of-test-time-speed-of-different-R-CNN_fig7_340712186
[23] https://m.researching.cn/articles/OJ60cb5c99374ba95f/figureandtable
[24] https://pyimagesearch.com/2018/11/12/yolo-object-detection-with-opencv/
[25] https://www.researchgate.net/figure/Comparison-of-different-methods-in-FPS-and-mAP_tbl1_345653552
[26] https://us.hikvision.com/en/partners/technology-partners/vms
[27] https://www.genetec.com/products/unified-security/security-center
[28] https://www.milestonesys.com/solutions/platform/video-management-software/
[29] https://doc.developer.milestonesys.com/html/index.html
[30] https://doc.developer.milestonesys.com/html/gettingstarted/intro_vps_toolkit.html
[31] https://cocodataset.org/#home
[32] https://content.milestonesys.com/search/media/?field=metaproperty_Assettype&value=CompanyLogo&field=metaproperty_Assetcategory&value=Logo&multiple=true&filterType=add&hideFilter=false&filterkey=savedFilters
[33] https://developer.nvidia.com/embedded/jetson-nano-developer-kit
[34] https://developer.nvidia.com/embedded/jetpack
[35] https://developer.nvidia.com/deepstream-sdk
[36] https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_Intro.html
[37] https://developer.nvidia.com/tensorrt
[38] https://doc.milestonesys.com/latest/en-US/system/sad/sad_servercomponents.htm
[39] https://doc.milestonesys.com/latest/en-US/standard_features/sf_mc/sf_mcnodes/sf_5rulesandevents/mc_rulesandeventsexplained_rulesandevents.htm
[40] https://doc.milestonesys.com/latest/en-US/standard_features/sf_mc/sf_mcnodes/sf_5rulesandevents/mc_analyticsevents_rulesandevents.htm
[41] https://doc.milestonesys.com/latest/en-US/system/security/hardeningguide/hg_recordingserver.htm
[42] https://docs.nvidia.com/metropolis/deepstream/4.0.2/dev-guide/index.html#page/DeepStream%20Development%20Guide/deepstream_app_architecture.html
[43] https://www.crowdhuman.org/
[44] https://voxel51.com/docs/fiftyone/integrations/coco.html
[45] https://app.roboflow.com/
[46] https://www.youtube.com/watch?v=o9K6GDBnByk