
INTEGRATED PROJECT REPORT

ON
OBJECT DETECTION USING DEEP LEARNING
Submitted in Partial fulfillment of the requirements
For the award of the degree of
BACHELOR OF TECHNOLOGY
IN

ARTIFICIAL INTELLIGENCE AND DATA SCIENCE


Submitted by

Deepak Choudhary
(04215611921)

Under the Supervision of
Ms. Naina Devi, Department of AI & DS

Department of Artificial Intelligence and Data Science


Dr. AKHILESH DAS GUPTA INSTITUTE OF TECHNOLOGY & MANAGEMENT
(A Unit of BBD Group)
Approved by AICTE and Affiliated with GGSIP University
FC-26, Shastri Park, New Delhi-110 053
DECLARATION

I, Deepak Choudhary, hereby declare that this submission is my own work and that, to
the best of my knowledge and belief, it contains no material previously published or
written by another person, nor material which to a substantial extent has been accepted
for the award of any other degree of the university or any other institute of higher
learning, except where due acknowledgment has been made in the text.

Signature:
Name: Deepak Choudhary
Roll No: 04215611921
Date:
CERTIFICATE

This is to certify that the Project Report entitled "Object Detection Using Deep
Learning", which is submitted by Deepak Choudhary in partial fulfillment of the
requirement for the award of the degree of B. Tech. in the Department of Artificial
Intelligence and Data Science of Dr. Akhilesh Das Gupta Institute of Technology &
Management (ADGITM), formerly known as Northern India Engineering College
(NIEC), New Delhi, is a record of the candidate's own work carried out by him
under my supervision. The matter embodied in this thesis is original and has not
been submitted for the award of any other degree.

Date: Supervisor: Ms. Naina Devi


ACKNOWLEDGEMENT

It gives me a great sense of pleasure to present the report of the B. Tech


project undertaken during the final year of B. Tech. I owe a special debt of gratitude
to Ms. Naina Devi for her constant support and guidance throughout the
course of this work. Her sincerity, thoroughness and perseverance have been a
constant source of inspiration, and it is only through her cognizant efforts that this
endeavor has seen the light of day. I also take the opportunity to
acknowledge the contribution of all faculty members of the
department for their kind assistance and cooperation during the development
of the project. Last but not the least, I acknowledge my friends for their
contribution to the completion of the project.

Signature:
Name: Deepak Choudhary
Roll No : 04215611921
Date:
ABSTRACT

Object detection is closely connected with the field of computer vision.
It enables recognizing instances of different objects in images
and video recordings, identifying the different characteristics of images
and producing an intelligent and effective understanding of pictures, much like
human vision does. In this report, we start with a concise introduction to deep
learning and famous object detection systems such as CNN (Convolutional Neural
Network), R-CNN, Fast R-CNN, Faster R-CNN and YOLO (You Only Look Once).
We then focus on our proposed object detection model architecture,
along with certain enhancements and modifications. Conventional models
struggle to recognize small objects in pictures; our proposed model gives the
right outcome with precision.
TABLE OF CONTENTS
Page

DECLARATION .................................................................................................................. .ii


CERTIFICATE ..................................................................................................................... .iii
ACKNOWLEDGEMENTS.......................................................................................................iv
ABSTRACT.......................................................................................................................... v
LIST OF TABLES................................................................................................................... vi
LIST OF FIGURES.............................................................................................................. …vii
LIST OF SYMBOLS ........................................................................................................... …viii
LIST OF ABBREVIATIONS .................................................................................................... ix
CHAPTER 1: INTRODUCTION............................................................................................. 1
1.1 MACHINE LEARNING ……………………………………………………………………………………………….3
1.2. DEEP LEARNING…………............................................................................................... 5
CHAPTER 2: LITERATURE SURVEY ..................................................................................... 8
CHAPTER 3: METHODOLOGY AND TECHNOLOGY............................................................ 17
3.1.CNN....................................................................................................................... 19
3.2. R-CNN..........................................................................................................................20
3.3. FAST R-CNN……………………………………………………………………………………………………………21
3.4. FASTER R-CNN…………………………………………………………………………………………………………22
3.5. YOLO………………………………………………………………………………………………………………………23
CHAPTER 4: RESULT ANALYSIS AND DISCUSSION ............................................................ 24
CHAPTER 5: CONCLUSIONS AND FUTURE WORK ............................................................ 30
5.1. CONCLUSION............................................................................................................. 36
5.2. FUTURE WORK ............................................................................................................ 37
APPENDIX A: RESEARCH PAPER........................................................................................ 38
APPENDIX B: SURVEY DATA OR TYPICAL PART OF SOURCE CODE ………………………………... 39
REFERENCES...................................................................................................................... 40
LIST OF SYMBOLS

Σ (sigma) - Summation over a range or sequence

Π (capital pi) - Product operation
∈ (belongs to) - Element belongs to a set
∃ (exists) - There exists an element
∀ (for all) - Statement holds true for all
= (equal) - Equality
≠ (not equal) - Inequality
LIST OF ABBREVIATIONS

CNN - Convolutional Neural Network


R-CNN - Region-based Convolutional Neural Network
ROI - Region of Interest
SSD - Single Shot MultiBox Detector
YOLO - You Only Look Once
FPN - Feature Pyramid Network
ML- Machine Learning
DL- Deep Learning
CHAPTER 1: INTRODUCTION

1.1 MACHINE LEARNING


Machine Learning can be defined as the field of study that deals with giving
computers the capability to learn without the requirement to hardcode every use
scenario. As the name suggests, it is the ability given to a machine to behave as
closely as possible to a human, with its learning ability. Machine learning is being
put to use in a multitude of areas in today's world. It focuses on creating
programs that can progressively learn by accessing data and performing the logic
by themselves.
Data analysis has so far been characterized by trial and error, an approach that
becomes close to impossible as data sets grow larger and more heterogeneous. This
is where machine learning provides an advantage, by finding smart alternatives
for analyzing large volumes of data. Machine learning is able to provide accurate
and dependable results and analysis through the use of efficient, fast algorithms
and real-time data processing by data-driven models.
The aim of machine learning is to enable computers to learn by themselves,
automatically, without the need for intervention by a user or human assistance,
and to refine their actions according to changes that may occur. This process begins
with data instances such as examples, instructions and experience being used to
detect patterns in data pools, so that refined decisions can be made based on the
provided examples and data instances.
There are countless uses for machine learning. Similarly, there are a multitude of
algorithms for machine learning. They vary in complexity based on several factors.
A few commonly used models are:
● Support Vector Machines: This is a type of algorithm that involves recognizing
correlations, usually between a couple of variables, and making predictions of
future points based on the detected correlations/patterns.
● Decision trees: This model observes certain points of action, and using this,
detects an optimal course to arrive at the outcome.
● K-means clustering: This model works by grouping data points into clusters
based on similar characteristics (a minimal sketch follows this list).
● Neural networks: These models use huge amounts of data that they are trained
on, to detect patterns and correlate variables in order to process data that they
would encounter in the future.
● Reinforcement Learning: This type of algorithm works by re-iterating models
over several trials to finish a process. Rewards and penalizations are given to
favorable and undesired outcomes respectively, till the process is optimized by
the algorithm.
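As a concrete illustration of one of the models above, the following is a minimal
sketch of K-means clustering in Python, assuming the scikit-learn library (an
assumption; this library is not otherwise part of the project):

    # Minimal K-means clustering sketch (scikit-learn is assumed to be installed).
    import numpy as np
    from sklearn.cluster import KMeans

    # Six 2-D points forming two visually separable groups.
    points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                       [8.0, 8.1], [7.9, 8.3], [8.2, 7.9]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)           # cluster index assigned to each point
    print(kmeans.cluster_centers_)  # centroid of each group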

Standard workflow of a machine learning process


Methods involved in a machine learning process

1.2 DEEP LEARNING


Deep learning is a branch of AI that deals with emulating the learning
process a human being would use to obtain information and acquire nuanced
skills. It is, at its core, a method to automate the prediction-based analysis that
human cognition is able to perform, without the need to hardcode every scenario,
by understanding how different courses of logic lead to different outcomes.
A traditional machine learning algorithm is a linear process of learning. Deep
learning algorithms introduce a hierarchical stacking method, based on the
complexity of the data and its abstraction.
Evolution of deep learning

To understand the process of deep learning, let us assume that we have some
images, each of which contains an object from one of four categories (for
the sake of this example), and our requirement is that the algorithm detect the
category of the object in any of the images. We first create a data set by labeling
the images, so that the network can be trained using this set. The network then
starts to recognize unique features and correlate them with a category.
Successive layers use data from preceding layers and pass this on to the next.
The complexity of learning and the level of detail increase as data moves from layer to
layer. It is to be noted that the network learns directly from the acquired data, i.e.,
the user has no influence on the details that are learned by the algorithm.
Structure of Neural Networks

In instances where deep learning is used, the process that occurs is almost
the same, which would go as follows: the algorithm acquires the data, this
data is subjected to transformations that are non-linear. Learning is done
through the transformations and the output is acquired as a model. This
carries on for several layers, through a multitude of trials, till a reliable and
accurate output is arrived at.

Difference between deep learning and machine learning


The magnitude of specificity provided by the user must be very
high when an ML algorithm is used, since the computer cannot interpret what
it has to search for at lower levels of intricacy. Providing such high
levels of accuracy is daunting, as it requires manual input, and the rate of
success is entirely reliant on the user's ability to provide accurate
distinctions. This is where the utility of DL lies: it can derive feature
sets on its own, without the need for manual input or control, with high levels
of accuracy. Not only is this process expeditious, it is also more reliable and
precise.

Deep Neural Networks

A DL algorithm follows a course of operation very similar to what can be
found in the brain, i.e., a network of neurons. This is why DL networks are also
known as deep neural networks. Here we can see an instance of bio-mimicry,
since the path of operation of the algorithm is structured like a set of neurons
interlinked with one another, such that each layer takes the output of the
previous one, and so on. Unique characteristics of the data are interpreted by
unique layers, the collective of which forms the entire network.
To extend the understanding of neural networks further, let us assume a
scenario where the ML program is designed to distinguish handwritten
letters. The layers could be sequenced in several ways; for example, the
initial layer would handle the recognition of grayscale percentage and
chromaticity, a successive layer could handle the recognition of the structure
of the letter based on its contours, and another successive layer could handle
recognition of an overall resemblance to a known letter. This procession
continues through all the layers, and at the end an output is obtained: the
probability that the written letter is any one from a to z.
Learning is done sequentially. The network determines how to perceive
individual characteristics by progressively changing the importance of
certain characteristic data while passing through layers. A certain "weight" is
attributed to every link, and its value can be changed to modify
the link's importance. Ultimately, the closeness of the output to the actual
data is determined after every learning instance.
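To make the idea of adjustable link weights concrete, here is a minimal sketch of a
single sigmoid neuron whose two link weights are nudged by gradient descent to reduce
the squared error against a target (illustrative only; the values and learning rate
are arbitrary assumptions, not from the report):

    # One neuron, two input links: the "weights" below play the role of the
    # link importances described above, adjusted after every learning instance.
    import numpy as np

    x = np.array([0.5, 0.8])   # inputs arriving over two links
    w = np.array([0.1, -0.2])  # current link weights
    target = 1.0
    lr = 0.1                   # learning rate

    for _ in range(100):
        y = 1.0 / (1.0 + np.exp(-np.dot(w, x)))  # sigmoid activation
        grad = (y - target) * y * (1.0 - y) * x  # gradient of 0.5*(y-target)^2
        w -= lr * grad                           # modify each link's importance

    print(w, y)  # the weights have moved so that y is closer to the target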
CHAPTER 2: LITERATURE SURVEY

S.No. 1
Title: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Authors: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
Year: Jan 2016
Description: Faster R-CNN is the third version of the original Region-Based
Convolutional Neural Network. It uses a Region Proposal Network to render the
algorithm faster than its predecessors.

S.No. 2
Title: You Only Look Once: Unified, Real-Time Object Detection
Authors: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
Year: June 2015
Description: YOLO (You Only Look Once) uses a CNN to detect, track and localize
objects in videos. YOLO divides the image into SxS grids and generates anchor
boxes for each grid, along with a confidence score for each object present in the
image. It can process up to 45 fps on a Titan X Pascal GPU. It uses 24
convolutional layers to perform feature extraction.

S.No. 3
Title: YOLOv3: An Incremental Improvement
Authors: Joseph Redmon, Ali Farhadi
Year: 2018
Description: This improved version of YOLO uses 53 convolutional layers instead of
24 to make faster and more accurate predictions. It uses the GPU in a more
efficient manner to make faster calculations.

S.No. 4
Title: Detection of Indian Traffic Signs
Authors: Indumathi K., Gnana Abinaya, Thangamani K., Ashok Deva A.
Year: 2016
Description: It uses CLAHE (Contrast Limited Adaptive Histogram Equalization) to
preprocess the image, and Integral Channel Features and Aggregate Channel
Features for traffic sign detection.
CHAPTER 3: METHODOLOGY AND TECHNOLOGY
3.1 CNN: CONVOLUTIONAL NEURAL NETWORK

Convolutional neural networks are DL algorithms capable of detecting features in
images given as input. CNNs can attribute weights to the various features of an image in
order to distinguish them. Unlike basic algorithms, which require manual feature
engineering, CNNs can learn filters and features by themselves, so less pre-processing
is needed.

Structure of a typical CNN

Since CNNs are neural networks, their sequencing is influenced by the structure of
neurons found in the visual cortex of the brain. Each neuron responds only within the
receptive field in which it operates and is triggered. These neurons are found in bundles,
which together form the entire area of the visual cortex.
Usually CNNs consist of

i. Convolutional layers – the number of layers is dependent upon how deep the network is.

ii. Activation layers – common activation functions such as ReLU and Leaky ReLU
introduce non-linearity and keep the outputs well-behaved, preventing practically
impossible outcomes or random excitations.

iii. Pooling layers – tensor size is reduced by the use of these.
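A minimal sketch of such a layer stack, assuming Python with the Keras API of
TensorFlow (an assumption; the project itself relies on OpenCV), could look as follows:

    # Small CNN with the three layer types listed above: convolutional,
    # activation (ReLU) and pooling layers, ending in class scores.
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(64, 64, 3)),            # RGB input image
        layers.Conv2D(16, (3, 3), padding="same"),  # convolutional layer
        layers.ReLU(),                              # activation layer
        layers.MaxPooling2D((2, 2)),                # pooling layer reduces tensor size
        layers.Conv2D(32, (3, 3), padding="same"),
        layers.ReLU(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),     # scores for 10 classes
    ])
    model.summary()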

3.2 R-CNN

R-CNN stands for Region-based Convolutional Neural Network. It uses a methodology
known as Selective Search to look for regions containing objects in an image. Selective
Search looks for patterns in an image and classifies objects based on them; patterns are
understood by the network by monitoring varying scales, colors, textures and enclosures.
Initially the network takes in an image, and Selective Search generates sub-segmentations
based on these patterns. Similar regions are then merged to produce bigger regions,
typically on the basis of attributes like color and shape. The regions generated are
reshaped to the ConvNet's size requirements and sent to convolutional networks for
object detection. The convolutional network then extracts features from those regions,
and SVMs (Support Vector Machines) classify those features into different classes.
Finally, a bounding box regressor predicts the areas of the identified objects.
Image Processes involved in R-CNN
Image data flow in R-CNN
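Selective Search itself is available in the contrib build of OpenCV; the following
sketch generates region proposals with it (assuming the opencv-contrib-python
package is installed, and using a hypothetical file name image.jpg):

    # Region proposal sketch using OpenCV's Selective Search implementation.
    import cv2

    img = cv2.imread("image.jpg")  # hypothetical input file
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()   # trades some recall for speed
    rects = ss.process()               # each rect is (x, y, w, h)
    print("proposals:", len(rects))    # typically thousands per image

    for (x, y, w, h) in rects[:100]:   # draw the first 100 proposals
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)
    cv2.imwrite("proposals.jpg", img)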

Limitations of R-CNN

Training an R-CNN model is expensive and slow due to the following aspects of the network:

→ Extracting 2000 region proposals per image on the basis of Selective Search.

→ Feature extraction must be run for every region of every image, so for N images the
network performs N x 2000 feature extraction passes.

→ The entire process of object detection has 3 models wrapped in it:
I. a CNN for extracting features from the image,
II. an SVM classifier for object classification,
III. a regressor model for refining the bounding boxes.

The complexity of the processes involved in object detection renders R-CNN very slow and
computationally expensive to train. It typically takes 40–50 seconds to perform
object detection on a single image.

3.3 FAST R-CNN

This is an improvement over R-CNN designed to make it faster. Instead of running the CNN
2000 times per image, once for each of the 2000 regions, the CNN is run a single time to
generate a feature map covering all 2000 regions. From this convolutional feature map,
the different regions of interest are extracted, and an ROI pooling layer reshapes the
extracted region-specific feature maps before feeding them into the fully connected layer.

Instead of 3 models like R-CNN, Fast R-CNN uses a single model which extracts
characteristics from the different regions, divides them into classes, and then
returns the bounding boxes for the distinguished classes simultaneously.
Image data flow in Fast R-CNN

Limitations of Fast R-CNN

Even though it’s faster than R-CNN, it uses region proposal methods like selective search
which makes the process of detecting objects slow, in other words not fast enough to stand as a
robust Object detection network. Fast R-CNN still takes about 2 seconds to make predictions
about the objects in the image.

3.4 FASTER R-CNN

Faster R-CNN is the modified version of Fast R-CNN in which the ROIs are generated by a
Region Proposal Network (RPN). The RPN takes feature maps as input and outputs object
proposals together with objectness scores.
Image data flow in Faster R-CNN
→ Faster R-CNN takes in an input image and passes it to the CNN, which generates
the image's feature map.

→ The RPN is applied to those maps, returning the object proposals and their objectness
scores. The RPN slides windows over the feature map, and 'k' anchor boxes of different
dimensions are generated at each position. For each anchor box, two things are predicted:
o the probability that the anchor holds an object (the object is not classified at this step);
o a bounding box regressor to better adjust the anchor so that the object fits inside
the box.
→ Next, the proposals are taken in and cropped so that each proposal contains an object.
An ROI pooling layer is applied to all these proposals to bring them down to the same
size, extracting a fixed-size feature map for each one.
→ Ultimately, the resized proposals pass to the fully connected layer containing the
'SoftMax' layer, with a linear regression layer on top, to classify the objects and
output the bounding boxes.
ROI Pooling Layer Working in Faster R-CNN

Limitations of Faster R-CNN

All of R-CNN's versions (including Faster R-CNN) look at different regions of a particular
image sequentially, as opposed to taking in the whole image at once. This results in two
complications:

→ Several passes are required through an image to extract all objects.


→ The performance of each system in the network depends upon the performance of the
previous systems, so an error caused by improper generation of feature maps
corrupts the detection process downstream.

3.5 YOLO: YOU ONLY LOOK ONCE

This is a system focused on real-time detection of multiple objects and instances.
Unlike RPN-based detectors, where the same regions of an image are processed several
times, the image is passed through the network only once and the output is obtained,
in a structure similar to an FCNN [input: NxN; output: SxS]. The image is divided into
a grid of size SxS, and every grid cell is assigned to predict a unique object. A certain
number of boundary boxes is assigned to each cell. The drawback of YOLO is that since
each grid cell can detect only one object, the proximity of objects determines whether
they are all detected.

For every individual cell in the grid:

● 'B' boundary boxes are predicted, and every box has a box confidence score,

● regardless of the number of boxes B, only one object can be detected per cell,

● 'C' conditional class probabilities are predicted.
Grid size: 7x7 [SxS]; Boundary boxes: 2; No. of classes: 20

Contents of a boundary box: x, y, w, h and a box confidence score. The confidence score
reflects "objectness": the likelihood that the box encompasses an object and the accuracy
of the box. The dimensions of the bounding box (w, h) are normalized by those of the
image, and x, y are offsets within the cell (note that x, y, w, h all lie within 0–1).
Conditional class probabilities: 20 per cell, each the probability that an object
belongs to a certain class.

Prediction shape for YOLO: (S, S, B x 5 + C) = (7, 7, 2 x 5 + 20) = (7, 7, 30).
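The arithmetic behind this shape can be made concrete with a short sketch (illustrative
only, using random values in place of real network output) that slices the (7, 7, 30)
prediction tensor into its components:

    # Slicing a YOLO prediction tensor of shape (S, S, B*5 + C) into its parts.
    import numpy as np

    S, B, C = 7, 2, 20
    pred = np.random.rand(S, S, B * 5 + C)          # stands in for network output

    boxes = pred[..., :B * 5].reshape(S, S, B, 5)   # x, y, w, h, confidence per box
    classes = pred[..., B * 5:]                     # 20 conditional class probabilities

    # class confidence score = box confidence * conditional class probability
    class_conf = boxes[..., 4:5] * classes[:, :, None, :]
    print(class_conf.shape)                         # (7, 7, 2, 20)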
Structure of YOLO

YOLO is built on the idea of a CNN that predicts a tensor of size (7, 7, 30). The spatial
dimension reduces to 7x7, with every location having 1024 output channels. Two FC layers
perform linear regression to produce the 7x7x2 boundary box predictions, and boxes with
class confidence scores above 0.25 are kept as final predictions. The class confidence
score (per predicted box) is calculated as the box confidence score multiplied by the
conditional class probability, so both classification and localization are needed to
calculate the confidence.


Design of the network

YOLO consists of 24 convolutional layers and 2 FC layers. In certain layers, 1x1 reduction
layers are used to reduce the depth of the feature maps. The tensor shape of the last
convolutional layer is (7, 7, 1024); this tensor is flattened, and the 2 FC layers, acting
as a linear regression, produce a 7x7x30 output which is reshaped to (7, 7, 30): 2 boundary
box predictions at each location.

Fast YOLO uses 9 convolutional layers and shallower feature maps.

Loss function

YOLO predicts several bounding boxes for every cell, which would result in the generation
of false positives. To counter this, only one bounding box is made responsible for each
object: the box with the largest IoU with the ground truth is chosen, making every
predictor more effective at guessing specific sizes, aspect ratios, etc.

Sum-squared error between the predictions and the ground truth is used to compute the
loss, which comprises classification, localization and confidence losses.
Classification loss

It is the squared error of the conditional class probabilities for each class, counted only for cells that contain an object.

Localization loss

This measures the error in the predicted bounding box locations and dimensions, counting
only the box responsible for the object.

Different weights should be given to boxes of different sizes, since the same absolute
error (say, 3 pixels) matters far more for a small box than for a large one. YOLO partly
compensates for this by using the square roots of width and height in place of the
absolute values. The loss is multiplied by λcoord to further emphasize bounding box
accuracy (5 is the default).

Confidence loss

This measures the objectness of each box, with one term for boxes in which an object
has been detected and another for boxes in which none has.

Since most boxes do not contain objects, a class imbalance is created: the model would
otherwise be trained to detect background more often than objects. To counter this,
λnoobj is used as a factor to weigh down the no-object loss (0.5 is the standard).

Conglomerated Loss Function

The final loss sums the localization, confidence and classification terms above.
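The individual formulas appeared as figures in the original report and did not survive
extraction; reconstructed from the definitions in the YOLO paper (a reconstruction, not
the report's own rendering), the combined loss in LaTeX notation is:

    \begin{aligned}
    \mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B}
        \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2 \right] \\
    &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B}
        \mathbb{1}_{ij}^{\text{obj}} \left[ (\sqrt{w_i}-\sqrt{\hat{w}_i})^2
        + (\sqrt{h_i}-\sqrt{\hat{h}_i})^2 \right] \\
    &+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i-\hat{C}_i)^2
     + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B}
        \mathbb{1}_{ij}^{\text{noobj}} (C_i-\hat{C}_i)^2 \\
    &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
        \sum_{c\in\text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
    \end{aligned}

The first two lines are the localization loss (weighted by λcoord = 5), the third line
contains the confidence losses for responsible and non-responsible boxes (the latter
weighted by λnoobj = 0.5), and the last line is the classification loss.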


3.6 YOLOv3

YOLOv2 used a custom CNN known as Darknet-19, with 19 layers from the original network
plus 11 additional layers, making a 30-layer network for object detection. However, even
with a 30-layer architecture, YOLOv2 struggled with detecting small objects, which was
attributed to the loss of fine-grained features as the input passed through each pooling
layer. To compensate, identity mappings were used, concatenating features from preceding
layers to recover low-level features.
Even after all this, it lacked several elements important for a stable object detection
algorithm, such as residual blocks, skip connections and upsampling layers.

These corrections were made in a new version of YOLO, known as YOLOv3.

YOLOv3 uses a variant called Darknet-53, which has 53 convolutional layers trained on
ImageNet for the purpose of classification, with an additional 53 layers stacked onto it
to make it a full-fledged network that performs both classification and detection. As a
result, YOLOv3 is slower than the second version but considerably more accurate than its
predecessors.
Structure of YOLOv3

The main difference between YOLOv3 and its predecessors is that it makes predictions at 3
different scales. The initial detection is performed at the 82nd layer; if an input image
of 416x416 is fed into the network, the feature map acquired there is of size 13x13. The
other two detections happen at the 94th layer, yielding a feature map of dimensions
26x26x255, and at the 106th layer, yielding a feature map of dimensions 52x52x255. The
detection at the 82nd layer is responsible for detecting large objects, the detection at
the 106th layer for small objects, and the 94th layer, with its 26x26 map, sits in
between, detecting medium-sized objects.

This varied detection scale makes YOLOv3 better at detecting small objects than its
predecessors.
YOLOv3 uses 9 anchor boxes to localize objects, 3 for each detection scale. The anchor
boxes are assigned in descending order: the largest 3 to the first detection layer, which
detects large objects, the next 3 to the medium-sized objects' detection layer, and the
final 3 to the small objects' detection layer.

Because it detects at 3 different scales with 3 anchors each, YOLOv3 predicts roughly 10x
the number of bounding boxes used by YOLOv2. For an input image of size 416x416, YOLOv2
would predict 13x13x5 = 845 boxes, whereas YOLOv3 predicts 13x13x3 + 26x26x3 + 52x52x3 = a
whopping 10,647 boxes.

The loss function was also modified in YOLOv3. In the YOLOv2 loss function, the last 3
terms penalize: the objectness score predicted by the model for the bounding boxes
responsible for predicting objects; the objectness of bounding boxes having no objects
(the second-to-last term); and the class prediction scores for the bounding boxes that
predict objects (the final term).

In YOLOv2 these terms were calculated using the mean squared error, while in YOLOv3 they
were changed to logistic regression, since this offers a better fit.

YOLOv3 performs multi-label classification for objects detected in images and videos. In
YOLOv2, a softmax is applied over all the class scores, and the class with the maximum
score is assigned to the object. This rests on the assumption that classes are mutually
exclusive, which does not always hold: an object belonging to the class Car, for instance,
may also belong to the class Vehicle. The alternative approach used here is logistic
regression to predict the class scores, with a threshold for predicting multiple labels;
every class whose score exceeds the threshold is assigned to the box.
YOLOv3 was benchmarked against popular state-of-the-art detectors like RetinaNet-50 and
RetinaNet-101 on the COCO mAP 50 benchmark, where 50 refers to how well the predicted
bounding boxes must align with the ground truth bounding boxes. This metric of evaluating
CNNs is known as IOU, Intersection Over Union, and 50 corresponds to 0.5 on the IOU scale:
if a prediction overlaps the ground truth by less than 0.5 IOU, it is classified as a
mislocalization and counted as a false positive.

YOLOv3 is both fast and accurate. When measured at 50 mAP, it is on par with RetinaNet-50
and RetinaNet-101, but it is almost 4x faster than those two models. In benchmarks where
the accuracy metric is stricter (COCO 75), the boxes need to be more closely aligned with
the ground truth boxes, and this is where RetinaNet zooms past YOLO in terms of accuracy.
Benchmark scores of YOLOv3 and other networks against COCO mAP 50

Benchmark scores of object detection networks against COCO Dataset

These metrics for the different models show that the mAP (mean Average Precision) of
YOLOv3 is 57.9 on the COCO 50 benchmark and 34.4 on the COCO 75 benchmark. RetinaNet is
better at detecting small objects, but YOLOv3 is much faster than all versions of
RetinaNet.

YOLO uses a technique known as Non-Maximal Suppression (NMS) to eliminate duplicate
detections of the same object. It essentially retains only the bounding box with the
highest confidence score. The initial step is to discard all bounding boxes whose
confidence score is below the threshold set for detected objects; if the threshold is set
to 0.55, only bounding boxes with confidence scores of 0.55 or more are retained.

YOLO also uses an Intersection Over Union metric to grade the algorithm's accuracy. IOU is
simply the ratio of the area of intersection between the predicted box and the ground
truth box to the area of the union of the two boxes. After removing bounding boxes with
detection probabilities below the NMS confidence threshold, YOLO discards every remaining
box whose IOU with a higher-scoring box for the same object exceeds the IOU threshold,
eliminating the remaining duplicates. A sketch of both computations follows.
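The following is a minimal sketch of both steps in plain Python; the thresholds are the
example values from the text (0.55 confidence, 0.5 IOU) and are otherwise assumptions:

    # Sketch of IOU and greedy Non-Maximal Suppression as described above.
    # Boxes are [x1, y1, x2, y2].

    def iou(a, b):
        # area of intersection divided by area of union of two boxes
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def nms(boxes, scores, conf_thresh=0.55, iou_thresh=0.5):
        # 1) discard boxes below the confidence threshold
        keep = [i for i, s in enumerate(scores) if s >= conf_thresh]
        # 2) repeatedly keep the best box and drop its near-duplicates
        keep.sort(key=lambda i: scores[i], reverse=True)
        result = []
        while keep:
            best = keep.pop(0)
            result.append(best)
            keep = [i for i in keep if iou(boxes[best], boxes[i]) < iou_thresh]
        return result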
3.7 IMPLEMENTATION OF YOLOv3

To implement the pre-trained YOLOv3 network, all that is required is the YOLOv3 config
file, which defines the layers and other essential specifics of the network, such as the
number of filters in each layer, the learning rate, classes, stride, input size and
channels for each layer, the output tensor, etc. The config file gives the basic structure
of the model by defining the number of neurons in each layer and the different kinds of
layers. With the help of the config file, one could start training a model on a
pre-existing dataset such as COCO, ImageNet, or the MNIST dataset for handwritten digit
recognition.
80 classes on which YOLOv3 was trained

Training a neural network is a painstaking process which requires large amounts of data
and computational power in terms of graphics processing, and it can take anywhere between
several hours and several weeks to train a model. Training is advisable if the model is
going to be used to detect custom objects not present in the dataset on which it was
originally trained, or if objects need to be detected under a specific class as opposed to
the general classes predicted by the pretrained model.

To make general predictions for objects on which the model was already trained, an
implementation can be performed to get the network up and running. This requires the
config file and the pretrained weights file of the model. The weights file stores the
strengths of the connections between the neurons of one layer and the neurons of the next;
since neurons in successive layers are interconnected, the input received by a neuron
affects the neurons in the following layer. When a model is trained from scratch, random
values are allocated to the weights, and these values are optimized using the loss
function and the optimization function to bring the network's behavior as close as
possible to human perception, or better. When retraining the network, the weights of
selected layers are frozen, based on the amount of data available and the purpose for
which the network is being retrained.
YOLOv3 Config File

For this unmodified implementation of YOLOv3, the config file and the weights file were
taken from the OpenCV (Computer Vision) library. The network was used as it is, since it
served the purpose of detecting objects and tracking them in a video.

A script was written in the Python programming language, with dependencies such as OpenCV,
NumPy, argparse and os, to receive the frames of the video file, preprocess them, send
them to the network, and perform the task of object detection and tracking. A sketch of
its core loop follows.
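This sketch uses OpenCV's dnn module; the file names yolov3.cfg, yolov3.weights and
input.mp4 are placeholders rather than the project's actual paths, and the thresholds are
illustrative:

    # Detection loop sketch with OpenCV's dnn module (file names are placeholders).
    import cv2
    import numpy as np

    net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
    out_names = net.getUnconnectedOutLayersNames()   # the three detection layers

    cap = cv2.VideoCapture("input.mp4")
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Pre-processing: scale pixels to [0,1], resize to 416x416, swap BGR->RGB.
        blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                     swapRB=True, crop=False)
        net.setInput(blob)
        outputs = net.forward(out_names)             # raw detections at 3 scales

        h, w = frame.shape[:2]
        boxes, confidences = [], []
        for output in outputs:
            for det in output:                       # [cx, cy, bw, bh, obj, 80 scores]
                conf = float(det[4]) * float(det[5:].max())
                if conf > 0.5:
                    cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
                    boxes.append([int(cx - bw / 2), int(cy - bh / 2),
                                  int(bw), int(bh)])
                    confidences.append(conf)
        # Duplicate removal with OpenCV's built-in Non-Maximal Suppression.
        for i in np.array(cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)).flatten():
            x, y, bw, bh = boxes[int(i)]
            cv2.rectangle(frame, (x, y), (x + bw, y + bh), (0, 255, 0), 2)
    cap.release()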
Pre-processing of the video
Model Runtime and frame info on the input video

Output video from the model after detection and tracking


CHAPTER 4: RESULT ANALYSIS AND DISCUSSION

Object detection using deep learning has revolutionized computer vision tasks and
enabled a wide range of applications such as autonomous driving, surveillance, and
image recognition. In this analysis and discussion, we will explore the key aspects and
advancements in object detection using deep learning.

1.Convolutional Neural Networks (CNNs): CNNs form the backbone of most object
detection systems. They have shown remarkable performance in learning spatial
hierarchies and capturing complex patterns in images. Popular CNN architectures like
VGG, ResNet, and InceptionNet have been successfully utilized in object detection
frameworks.

2.Region Proposal Methods: One of the challenges in object detection is identifying


potential object regions in an image efficiently. Region proposal methods such
as Selective Search, EdgeBoxes, and the Region Proposal Network used in Faster
R-CNN generate region proposals by employing bottom-up or top-down approaches.
They help reduce the search space and focus computational resources on regions
more likely to contain objects.

3.Single Shot Detectors (SSDs): SSDs revolutionized object detection by introducing a


single-pass approach, eliminating the need for separate region proposal generation.
SSDs use a series of convolutional layers with different scales to detect objects of
various sizes. They achieve high detection accuracy with real-time performance.

4.You Only Look Once (YOLO): The YOLO algorithm is another popular approach
for object detection, known for its speed and accuracy. YOLO divides an image into a
grid and predicts bounding boxes and class probabilities for each grid cell. YOLO
models such as YOLOv3 and YOLOv4 have achieved state-of-the-art performance.
5.Anchor-based and Anchor-free Methods: Anchor-based methods, such as Faster R-
CNN, use predefined anchor boxes to localize objects. These methods match anchor
boxes with ground truth objects during training. In contrast, anchor-free methods, such
as CenterNet and EfficientDet, directly predict object centers and sizes. Anchor-free
methods reduce the complexity and achieve competitive performance.

6.Transfer Learning and Pretrained Models: Deep learning models for object
detection often leverage transfer learning by using pretrained models on large-scale
datasets like ImageNet. Transfer learning allows models to generalize well even with
limited training data and accelerates convergence.

7.Data Augmentation: Data augmentation techniques such as random cropping,


rotation, scaling, and flipping are commonly employed to increase the diversity of
training data. Augmentation helps models generalize better and improves robustness to
variations in object appearance, lighting conditions, and viewpoints (a short sketch
follows this item).
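A short sketch of such augmentations with OpenCV and NumPy (illustrative parameter
ranges; note that for detection tasks the bounding boxes must be transformed along with
the image):

    # Simple image augmentation sketch: random flip, small rotation, random crop.
    import cv2
    import numpy as np

    def augment(img, rng=np.random.default_rng()):
        h, w = img.shape[:2]
        if rng.random() < 0.5:                      # random horizontal flip
            img = cv2.flip(img, 1)
        angle = rng.uniform(-15, 15)                # small random rotation
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        img = cv2.warpAffine(img, M, (w, h))
        ch, cw = int(h * 0.9), int(w * 0.9)         # random crop to 90% size
        y0 = rng.integers(0, h - ch + 1)
        x0 = rng.integers(0, w - cw + 1)
        return cv2.resize(img[y0:y0 + ch, x0:x0 + cw], (w, h))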

8.Evaluation Metrics: Common evaluation metrics for object detection include


Intersection over Union (IoU), Average Precision (AP), and Mean Average Precision
(mAP). These metrics assess the accuracy of object localization and classification
performance at different IoU thresholds.

9.Challenges and Future Directions: Despite significant progress, object detection still
faces challenges in detecting small objects, occlusion handling, and generalization to
novel object categories. Future research focuses on developing more robust and efficient
algorithms, exploring 3D object detection, and addressing ethical considerations.

In conclusion, object detection using deep learning has witnessed remarkable


advancements, enabling accurate and real-time detection of objects in images and
videos. With the continuous evolution of algorithms, architectures, and datasets, we can
expect further breakthroughs in the field, leading to more robust and reliable object
detection systems.
CHAPTER 5: CONCLUSIONS AND FUTURE WORK

5.1. CONCLUSIONS

The intent of this project is to build an implementation of YOLOv3, after comparing it


with the other state-of-the-art object detection algorithms, and ultimately to run the
implementation on data collected on Indian roads. Since there was no pre-existing
dataset of Indian roads, the data had to be collected and the dataset built.
While the implementation was performed and tested on a few sample videos from the
incipient dataset, the process of building the dataset is still an ongoing
task. The videos and images in the dataset will be annotated, organized by classes
and scenes, and uploaded online as an open-source resource to encourage and
support the communities working on object detection and tracking, traffic sign
detection and other autonomous vehicle projects, accelerating the development of
autonomous vehicles on Indian roads.

5.2. FUTURE WORK

In the field of object detection using deep learning, there are several areas that
researchers and practitioners are focusing on for future work. Here are some potential
directions and challenges:

1. Improving detection accuracy: Enhancing the accuracy of object detection


algorithms is an ongoing pursuit. Researchers are exploring novel architectures,
such as advanced convolutional neural networks (CNNs) or transformer-based
models, to achieve better object localization and classification performance.

2. Efficient and lightweight models: While accuracy is crucial, there is also a need
for more efficient and lightweight models that can run on resource-constrained
devices like mobile phones or embedded systems. Future work involves
developing compact architectures, optimizing network parameters, and exploring
techniques like knowledge distillation and model quantization.

3. Small object detection: Detecting small objects accurately remains challenging,


particularly when they appear in cluttered or low-resolution images. Future
research could focus on developing techniques that can effectively handle small-
scale objects, potentially through feature pyramids, attention mechanisms, or
multi-scale approaches.

4. Real-time object detection: Real-time object detection is essential for


applications like autonomous driving or video surveillance. Future work involves
developing faster algorithms, leveraging techniques such as network pruning,
quantization, or architecture optimization to achieve real-time performance
without compromising accuracy.

5. Handling occlusions and crowded scenes: Object detection in crowded scenes


or instances with occlusions is an ongoing research area. Future work aims to
improve detection algorithms to handle occlusions, partial visibility, and
overlapping instances more robustly, possibly through the use of contextual
information, instance-level reasoning, or graph-based approaches.

6. Domain adaptation and transfer learning: Adapting object detection models to


new domains or tasks with limited labeled data is a significant challenge. Future
research could explore techniques such as domain adaptation, transfer learning, or
unsupervised/weakly supervised learning to improve the generalization and
adaptability of object detectors.

7. Interpretable and explainable object detection: Deep learning models are often
regarded as black boxes, making it challenging to understand their decisions.
Future work involves developing techniques to provide interpretability and
explanations for object detection models, enabling better trust, transparency, and
error analysis.

8. Multi-modal object detection: Integrating multiple sensor modalities, such as


RGB images, depth maps, or LiDAR data, can improve object detection
performance. Future research could explore fusion techniques, attention
mechanisms, or multi-modal architectures to leverage complementary
information from different sources.

9. Robustness against adversarial attacks: Deep learning models are vulnerable


to adversarial attacks, where small perturbations to input data can mislead the
model's predictions. Future work aims to develop object detection models that are
more robust and resilient against such attacks, potentially through adversarial
training or defense mechanisms.

10.Privacy and ethical considerations: As object detection becomes more


prevalent, ensuring privacy and addressing ethical concerns become crucial.
Future research could focus on developing privacy-preserving techniques,
exploring fairness and bias mitigation, or incorporating ethical considerations in
the design and deployment of object detection systems.
