Project Report
ON
OBJECT DETECTION USING DEEP LEARNING
Submitted in partial fulfillment of the requirements
for the award of the degree of
BACHELOR OF TECHNOLOGY
IN
ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
Deepak Choudhary
(04215611921)
I, Deepak Choudhary, hereby declare that this submission is my own work and that, to
the best of my knowledge and belief, it contains no material previously published or
written by another person, nor material which to a substantial extent has been accepted
for the award of any other degree of the university or other institute of higher learning,
except where due acknowledgment has been made in the text.
Signature:
Name: Deepak Choudhary
Roll No: 04215611921
Date:
CERTIFICATE
This is to certify that the Project Report entitled "Object Detection Using Deep
Learning", which is submitted by Deepak Choudhary in partial fulfillment of the
requirement for the award of the degree of B. Tech. in the Department of Artificial
Intelligence and Data Science of Dr. Akhilesh Das Gupta Institute of Technology &
Management (ADGITM), formerly known as Northern India Engineering College
(NIEC), New Delhi, is a record of the candidate's own work carried out by him
under my supervision. The matter embodied in this thesis is original and has not
been submitted for the award of any other degree.
Signature:
Name: Deepak Choudhary
Roll No : 04215611921
Date:
ABSTRACT
Object detection is closely connected with the field of computer vision. Object
detection enables recognizing instances of different objects in images and video
recordings. It identifies the distinguishing characteristics of images and produces
an intelligent and effective understanding of pictures, much like human vision
does. In this paper, we start with a concise introduction to deep learning and to
famous object detection systems such as CNN (Convolutional Neural Network),
R-CNN, Fast R-CNN, Faster R-CNN, and YOLO (You Only Look Once). We then
focus on our proposed object detection model architecture, along with certain
improvements and modifications. Conventional models struggle to recognize
small objects in pictures; our proposed model gives the correct outcome with
precision.
TABLE OF CONTENTS
To understand the process of deep learning, let us assume that we have some
images, each of which contains an object from one of four categories (for the
sake of this example), and our requirement is that the algorithm detects the
category of the object in any of the images. We first create a data set by labeling
the images, so that the network can be trained using this set. The network then
starts to recognize unique features and correlate them with a category.
Successive layers use data from preceding layers and pass it on to the next. The
complexity of learning and the level of detail increase as data moves from layer
to layer. Note that the network learns directly from the acquired data, i.e., the
user has no influence on the details that are learned by the algorithm.
Structure of Neural Networks
In instances where deep learning is used, the process is almost always the
same: the algorithm acquires the data, and this data is subjected to non-linear
transformations. Learning happens through these transformations, and the
output is acquired as a model. This carries on for several layers, through a
multitude of trials, until a reliable and accurate output is arrived at.
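To make this concrete, the following is a minimal sketch of such a training loop, assuming PyTorch and a labeled four-category image set laid out as data/<category>/<image>.jpg; the directory layout, architecture and hyperparameters are illustrative, not taken from this report.

```python
# Minimal sketch of the training process described above (PyTorch assumed).
import torch
import torch.nn as nn
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.Resize((64, 64)),
                                transforms.ToTensor()])
dataset = datasets.ImageFolder("data", transform=transform)  # labeled data set
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Successive layers transform the output of the preceding ones.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 128), nn.ReLU(),
    nn.Linear(128, 4),  # one output score per category
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):  # a multitude of trials
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()  # the network learns from the data itself
        optimizer.step()
```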
Convolutional neural networks (CNNs) are deep learning algorithms that are capable of feature
detection on images given as input. A CNN can attribute weights to various features of an image
to distinguish them from one another. CNNs work with smaller amounts of pre-processing:
unlike basic algorithms, which require hand-engineered features, CNNs can learn filters and
features by themselves.
Since CNNs are neural networks, their architecture is influenced by the structure of neurons
found in the visual cortex of the brain. The response of each neuron is conditional on a
receptive field within which it operates and is triggered. These neurons are found as bundles,
which together cover the entire area of the visual cortex.
Usually CNNs consist of:
i. Convolutional layers – the number of layers depends upon how deep the network is.
ii. Activation layers – common activation functions such as ReLU and Leaky ReLU are used
to introduce non-linearity and keep the activations numerically stable, preventing
practically impossible outcomes or random excitations.
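A minimal sketch of such a convolution-plus-activation stack, assuming PyTorch; the layer sizes here are illustrative, not taken from this report.

```python
# A small CNN stack of the kind described above.
import torch.nn as nn

cnn = nn.Sequential(
    # Convolutional layers: the count grows with how deep the network is.
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),                          # activation layer
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.LeakyReLU(negative_slope=0.1),   # Leaky ReLU variant
    nn.MaxPool2d(2),                    # downsample the feature map
)
```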
3.2 R-CNN
R-CNN stands for Region-based Convolutional Neural Network. It uses a methodology
known as Selective Search to look for regions with objects in an image. Selective Search looks
for patterns in an image and proposes regions based on them. Patterns are identified by
monitoring varying scales, colors, textures and enclosures. Initially the network takes in an
image, and Selective Search generates sub-segmentations based on these patterns. Similar
regions are then merged to produce bigger regions, typically on the basis of properties like
color, shape, etc. The regions generated are reshaped to the ConvNet's size requirements and
are sent to convolutional networks for detection of objects. The convolutional network then
extracts features from those regions, and SVMs (Support Vector Machines) classify those
features into the different classes. After that is done, a bounding box regressor refines the
locations of the objects that are identified.
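As an illustration of the proposal step, the sketch below uses the Selective Search implementation shipped with the opencv-contrib-python package; this report does not implement R-CNN itself, so the file name and crop size are assumptions.

```python
# Selective Search region proposals, as used by R-CNN (opencv-contrib assumed).
import cv2

img = cv2.imread("input.jpg")  # hypothetical input image
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # trade proposal quality for speed
rects = ss.process()               # (x, y, w, h) region proposals

# In R-CNN, each proposal is cropped, resized to the ConvNet's input
# size, and passed through the network for feature extraction.
for (x, y, w, h) in rects[:2000]:
    region = cv2.resize(img[y:y + h, x:x + w], (224, 224))
```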
Image Processes involved in R-CNN
Image data flow in R-CNN
Limitations of R-CNN
Training an R-CNN model is expensive and slow: roughly 2000 region proposals must be
passed through the convolutional network for every image, and three separate models (the
CNN, the SVM classifiers and the bounding box regressor) must be trained. The complexity of
these processes renders R-CNN really slow and computationally expensive; it typically takes
40–50 seconds to perform object detection on a single image.
3.3 FAST R-CNN
Fast R-CNN is an improvement over R-CNN designed to make it faster. Instead of running the
CNN 2000 times per image, once for each of the 2000 regions, the CNN is run a single time to
generate a feature map for the whole image. Using this feature map, the different regions of
interest are extracted, and an ROI pooling layer is used to reshape the extracted region-specific
feature maps before they are fed into the fully connected layer.
Instead of three models like R-CNN, Fast R-CNN uses only one model, which performs
extraction of features from the different regions, after which classification is done, and then it
returns the bounding boxes for the distinguished classes simultaneously.
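The ROI pooling idea can be sketched with torchvision's roi_pool operator; the shapes below are illustrative, not taken from this project.

```python
# ROI pooling: variable-size proposals pooled to a fixed-size feature map.
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 32, 32)   # CNN run once over the image
# Region proposals as (batch_index, x1, y1, x2, y2) in feature-map coordinates.
rois = torch.tensor([[0, 0.0, 0.0, 15.0, 15.0],
                     [0, 8.0, 8.0, 31.0, 31.0]])

# Every proposal, whatever its size, becomes a 7x7 map, so all of them
# can be fed into the same fully connected layer.
pooled = roi_pool(feature_map, rois, output_size=(7, 7))
print(pooled.shape)  # torch.Size([2, 512, 7, 7])
```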
Image data flow in Fast R-CNN
Even though it is faster than R-CNN, Fast R-CNN still uses region proposal methods like
Selective Search, which keeps the process of detecting objects slow; in other words, it is not
fast enough to stand as a robust object detection network. Fast R-CNN still takes about 2
seconds to make predictions about the objects in an image.
3.4 FASTER R-CNN
Faster R-CNN is the modified version of Fast R-CNN, in which the generation of ROIs is done
by a Region Proposal Network (RPN). The RPN takes feature maps as input and produces
region proposals along with objectness scores.
Image data flow in Faster R-CNN
→ Faster R-CNN takes in an input image, passes it to the CNN, and lets the CNN generate
the image's feature map.
→ The RPN is applied to those maps, returning the object proposals and their objectness
scores. The RPN slides windows over these feature maps, and 'k' anchor boxes of
different dimensions are generated. For each anchor box, two things are predicted:
  o The probability that the anchor holds an object (it does not classify the object at
this step).
  o A bounding box regressor to adjust the anchor so it fits the object better inside
the box.
→ Next, the proposals are taken in and cropped so that each proposal contains an object. An
ROI pooling layer is applied to all these proposals to bring them down to the same size;
ROI pooling extracts a fixed-size pooled map for each anchor.
→ Ultimately, the resized proposals pass to the fully connected layer containing the
'SoftMax' layer, with a linear regression layer on top, to classify the proposals and output
the bounding boxes.
ROI Pooling Layer Working in Faster R-CNN
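For reference, a pretrained Faster R-CNN can be exercised in a few lines using torchvision's detection models; this is a usage sketch under that assumption, not a re-implementation of the network described above.

```python
# Running a pretrained Faster R-CNN from torchvision.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Older torchvision versions use pretrained=True instead of weights="DEFAULT".
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 416, 416)   # a dummy RGB image with values in [0, 1]
with torch.no_grad():
    out = model([image])[0]       # one result dict per input image

# Bounding boxes, class labels and confidence scores, as described above.
print(out["boxes"].shape, out["labels"].shape, out["scores"].shape)
```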
All of R-CNN's versions (including Faster R-CNN) look at different regions of a particular
image sequentially, as opposed to taking in the whole image at once. This results in two
complications: the network never sees the global context of the image, so background patches
can be mistaken for objects, and processing many regions per image keeps detection
comparatively slow.
3.5 YOLO
YOLO is a system focused on real-time detection of several objects and instances. It is
structured like a fully convolutional neural network (FCNN): unlike RPN-based methods, where
the same regions of an image are processed several times, the image is passed through the
network once and the output is obtained [input: NxN; output: SxS]. The image is divided into a
grid of size SxS, and every grid cell is assigned to predict a unique object. A certain number of
boundary boxes is assigned to each grid cell. The drawback of YOLO is that, since each grid
cell can detect only one object, the proximity of the objects determines whether they are all
detected or not.
● 'B' boundary boxes are predicted per grid cell, and every box has a box confidence score.
Contents of a boundary box: x, y, w, h and the box confidence score. The confidence score
reflects the "objectness": how likely it is that the box encompasses an object, and how accurate
the box is. The dimensions of the bounding box (w, h) are normalized by those of the image,
and x, y are cell offset values (note that x, y, w, h all lie within 0–1). Each cell also predicts
conditional class probabilities (20 per cell, for the 20 PASCAL VOC classes): the probability of
an object belonging to a certain class.
Prediction shape for YOLO: (S, S, B×5 + C) = (7, 7, 2×5 + 20) = (7, 7, 30).
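To make the tensor layout concrete, the following NumPy sketch splits a (7, 7, 30) prediction into its box and class components; the tensor here is random, for shape illustration only.

```python
# Decomposing YOLO's (S, S, B*5 + C) output tensor.
import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)         # (7, 7, 30)

boxes = pred[..., :B * 5].reshape(S, S, B, 5)  # x, y, w, h, confidence per box
class_probs = pred[..., B * 5:]                # 20 conditional class probs per cell

# Class confidence score per box = box confidence x conditional class prob.
box_conf = boxes[..., 4:5]                             # (7, 7, 2, 1)
class_scores = box_conf * class_probs[:, :, None, :]   # (7, 7, 2, 20)
```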
Structure of YOLO
YOLO is built on the idea of using a CNN to predict a tensor of size (7, 7, 30). The spatial
dimension reduces to 7x7, with every location having 1024 output channels. Two fully
connected layers are used to make the 7x7x2 boundary box predictions, on which linear
regression is performed. Only boxes with confidence scores above 0.25 are finally used to make
predictions. The class confidence score is calculated per predicted box as:
class confidence score = box confidence score × conditional class probability.
YOLO consists of 24 convolutional layers and 2 fully connected layers. In certain layers, 1x1
reduction layers are utilized to reduce the depth of the feature maps. The tensor shape of the
last convolutional layer is (7, 7, 1024), and this tensor is flattened. The 2 fully connected layers
act as a linear regression to give a 7x7x30 output, which is reshaped to (7, 7, 30), giving 2
bounding box predictions at each location.
Loss function
YOLO works by predicting several bounding boxes for every cell, which would result in the
generation of false positives. To counter this, we require only one bounding box to correspond
to each object. For this reason, the predictor with the largest IoU with the ground truth is
chosen, which makes each predictor more effective at predicting specific sizes, aspect ratios,
etc.
Sum-squared error between the predictions and the ground truth is used to compute the
loss. The loss function comprises classification, localization and confidence losses.
Classification loss
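Written out as in the original YOLO paper (where $\mathbb{1}_i^{\text{obj}}$ is 1 if an object appears in cell $i$ and 0 otherwise), this term is:

$$\mathcal{L}_{\text{class}} = \sum_{i=0}^{S^2} \mathbb{1}_i^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2$$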
Localization loss
The error in the predicted bounding box locations and dimensions is measured, where only the
box responsible for the object is counted.
Boxes of different sizes should be weighted differently, since the same absolute error (say, 3
pixels) matters much more for a small box than for a large one. YOLO partly compensates for
this by using the square roots of the width and height in place of the absolute values. The loss
is multiplied by λcoord to further emphasize bounding box accuracy (5 is the default).
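The corresponding term from the original YOLO paper, with $\mathbb{1}_{ij}^{\text{obj}}$ indicating that predictor $j$ in cell $i$ is responsible for the object, is:

$$\mathcal{L}_{\text{loc}} = \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]$$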
Confidence loss
Most boxes do not contain objects, which creates a class imbalance: the model would be
trained to detect background more often than objects. To counter this, λnoobj is used as a
factor to weigh down the loss from boxes that contain no object (0.5 is the standard).
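In the original YOLO paper this term is written as (with $C_i$ the box confidence):

$$\mathcal{L}_{\text{conf}} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2$$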
3.6 YOLOv2 AND YOLOv3
YOLOv2 used a custom CNN known as Darknet-19, with 19 layers from the original network
and 11 additional layers, making it a 30-layer network for object detection. However, even with
a 30-layer architecture, YOLOv2 struggled with detecting small objects, and this was attributed
to the loss of fine-grained features as the input passed through each pooling layer. To
compensate for this, identity mappings were used, concatenating features from preceding
layers to retain low-level features.
Even after all this, YOLOv2 lacked several elements that make an object detection algorithm
stable, such as residual blocks, skip connections and upsampling layers.
These corrections were made in a new version of YOLO, known as YOLOv3.
YOLOv3 uses a variant of Darknet called Darknet-53, which has 53 convolutional layers
trained on ImageNet for the purpose of classification, with an additional 53 layers stacked onto
it (106 layers in total) to make a full-fledged network that performs both classification and
detection. As a result, YOLOv3 is slower than the second version but a lot more accurate than
its predecessors.
Structure of YOLOv3
The main difference between YOLOv3 and its predecessors is that it makes predictions at 3
different scales. The initial detection is performed at the 82nd layer: if an input image of
416x416 is fed into the network, the feature map acquired there is of size 13x13. The other two
scales at which detections happen are the 94th layer, yielding a feature map of dimensions
26x26x255, and the 106th layer, where the final detection happens, resulting in a feature map of
dimension 52x52x255. The detection at the 82nd layer is responsible for the detection of large
objects, the detection at the 106th layer is responsible for detecting small objects, and the 94th
layer, with a dimension of 26x26, stays in between these two, detecting medium-sized objects.
This varied detection scale makes YOLOv3 better at detecting small objects than its
predecessors.
YOLOv3 uses 9 anchor boxes to localize objects, 3 for each detection scale. The anchor
boxes are assigned in descending order: the largest 3 boxes go to the first detection layer,
which detects large objects, the next 3 to the medium-sized objects' detection layer, and the
final 3 to the small objects' detection layer.
YOLOv3 predicts more than 10x the number of bounding boxes predicted by YOLOv2, since
YOLOv3 detects at 3 different scales. For instance, for an image of input size 416x416,
YOLOv2 would predict 13x13x5 = 845 boxes, whereas YOLOv3 would predict 13x13x3 +
26x26x3 + 52x52x3 = a whopping 10,647 boxes.
The terms of the loss function in YOLOv2 were calculated using mean squared error, while in
YOLOv3 the objectness and class terms were changed to logistic regression, since this model
offers a better fit than the previous one.
YOLOv3 performs multi-label classification for objects detected in images and videos. In
YOLOv2, a softmax was applied to all the class scores, and the class with the maximum score
was assigned to the object. This rests on the assumption that if an object belongs to one class, it
cannot be a part of another class, which does not always hold: an object belonging to the class
Car may, for instance, also belong to the class Vehicle. The alternative approach used in
YOLOv3 is logistic regression to predict the class scores, together with a threshold for
predicting multiple labels. Classes with scores greater than the threshold are then assigned to
the box.
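A small sketch of this sigmoid-plus-threshold multi-label assignment, with NumPy assumed; the class names and raw scores below are made up for illustration.

```python
# Multi-label class assignment via independent logistic (sigmoid) scores.
import numpy as np

classes = ["car", "vehicle", "person"]
logits = np.array([2.3, 1.7, -3.0])      # raw class outputs for one box

probs = 1.0 / (1.0 + np.exp(-logits))    # sigmoid: one score per class
labels = [c for c, p in zip(classes, probs) if p > 0.5]
print(labels)   # ['car', 'vehicle'] -- one box, multiple labels
```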
YOLOv3 was benchmarked against popular state-of-the-art detectors like RetinaNet50 and
RetinaNet101 on the COCO mAP 50 benchmark, where 50 refers to how well the predicted
bounding boxes must align with the ground truth bounding boxes. The underlying metric is
known as IOU, Intersection Over Union, and 50 corresponds to 0.5 on the IOU scale. If a
prediction's IOU with the ground truth is less than 0.5, it is counted as a mislocalization and
classified as a false positive.
YOLOv3 is really fast and accurate. When measured at mAP 50, it is on par with
RetinaNet50 and RetinaNet101, but it is almost 4x faster than those two models. In benchmarks
where the accuracy requirement is stricter (COCO mAP 75), the boxes need to be more closely
aligned with the ground truth boxes, and here RetinaNet zooms past YOLO in terms of
accuracy.
Benchmark scores of YOLOv3 and other networks against COCO mAP 50
These are the metrics for different models on different benchmarks, and it is quite observable
that the mAP (mean Average Precision) for YOLOv3 is 57.9 on the COCO 50 benchmark and
34.4 on the COCO 75 benchmark. RetinaNet is better at detecting small objects, but YOLOv3
is much faster than all versions of RetinaNet.
YOLO also uses the Intersection Over Union metric to grade the algorithm's accuracy. IOU is
the simple ratio of the area of intersection between the predicted box and the ground truth box
to the area of their union. After removing bounding boxes with detection probability lower than
the confidence threshold, YOLO applies non-maximum suppression (NMS): of the boxes that
overlap each other with an IOU greater than the NMS threshold, only the highest-scoring one is
kept, which eliminates duplicate detections.
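The two steps can be sketched as follows, with NumPy assumed; the (x1, y1, x2, y2) box format and the threshold values are illustrative.

```python
# IOU and greedy non-maximum suppression, as described above.
import numpy as np

def iou(a, b):
    # Area of intersection over area of union of two boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.4):
    # Drop low-confidence boxes, then greedily keep the best remaining box
    # and discard any other box that overlaps it by more than iou_thresh.
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_thresh]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```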
3.7 IMPLEMENTATION OF YOLOv3
To implement the pre-trained YOLOv3 network, all that is required is the config file of
YOLOv3, which defines the layers and other essential specifics of the network, such as the
number of filters in each layer, the learning rate, classes, stride, input size and channels for
each layer, the output tensor, etc. The config file gives the basic structure of the model by
defining the number of neurons in each layer and the different kinds of layers. With the help of
the config file, one can start training a model on a pre-existing dataset such as COCO,
ImageNet, or the MNIST dataset for handwritten digit detection.
The 80 classes on which YOLOv3 was trained
Training a neural network is a painstaking process which requires tons of data and
computational power in terms of graphics processing hardware, and it can take anywhere
between several hours and several weeks to train a model. Training is advisable if the model is
going to be used to detect custom objects which are not present in the dataset on which it was
originally trained, or if objects need to be detected under a specific class as opposed to making
general predictions about which class an object falls under.
To make general predictions, detecting objects on which the model was already trained, an
implementation can be performed to get the network up and running. This requires the config
file and the pretrained weights file of the model. The pretrained weights file holds the
numerical values of the connections between the neurons of one layer and the next; in other
words, the weight of each neuron with respect to the others, since neurons in successive layers
are interconnected and the input received by a neuron affects the neurons in the next layer.
When the model is trained from scratch, random values are allocated for the weights, and these
values are optimized using the loss function and the optimization function to bring the weights
closest to making the network function similar to human perception, or better! For modifying
the weights while retraining the network, the weights of selected layers are frozen, based on the
amount of data available and the purpose for which the network is being retrained.
YOLOv3 Config File
For this unmodified implementation of YOLOv3, the config file and the weights file were
taken from the library OpenCV (Computer Vision Library). The network was used as is, since
it already served the purpose of detecting objects and tracking them in a video.
A script was written in the programming language Python, using dependencies such as
OpenCV, NumPy, argparse, os, etc., to receive the frames of the video file, preprocess them,
send them to the network, and perform the task of object detection and tracking.
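A minimal sketch of such a script follows, assuming OpenCV's dnn module and the standard YOLOv3 file names (yolov3.cfg, yolov3.weights, coco.names) plus a hypothetical input video; the project's actual script is not reproduced here.

```python
# YOLOv3 video inference with OpenCV's dnn module.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
out_layers = net.getUnconnectedOutLayersNames()  # the 3 detection layers
classes = open("coco.names").read().strip().split("\n")

cap = cv2.VideoCapture("input.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Pre-processing: scale pixels to [0, 1], resize to 416x416, BGR -> RGB.
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(out_layers)  # detections at the 3 scales

    h, w = frame.shape[:2]
    boxes, confidences, class_ids = [], [], []
    for output in outputs:
        for det in output:  # det = [cx, cy, bw, bh, objectness, class scores...]
            scores = det[5:]
            class_id = int(np.argmax(scores))
            conf = float(scores[class_id])
            if conf > 0.5:
                cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
                boxes.append([int(cx - bw / 2), int(cy - bh / 2),
                              int(bw), int(bh)])
                confidences.append(conf)
                class_ids.append(class_id)

    # Non-maximum suppression removes duplicate detections of one object.
    for i in np.array(cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)).flatten():
        x, y, bw, bh = boxes[i]
        cv2.rectangle(frame, (x, y), (x + bw, y + bh), (0, 255, 0), 2)
        cv2.putText(frame, classes[class_ids[i]], (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cv2.imshow("detections", frame)
    if cv2.waitKey(1) == 27:  # Esc quits
        break
cap.release()
cv2.destroyAllWindows()
```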
Pre-processing of the video
Model Runtime and frame info on the input video
Object detection using deep learning has revolutionized computer vision tasks and
enabled a wide range of applications such as autonomous driving, surveillance, and
image recognition. In this analysis and discussion, we will explore the key aspects and
advancements in object detection using deep learning.
1. Convolutional Neural Networks (CNNs): CNNs form the backbone of most object
detection systems. They have shown remarkable performance in learning spatial
hierarchies and capturing complex patterns in images. Popular CNN architectures like
VGG, ResNet, and InceptionNet have been successfully utilized in object detection
frameworks.
2. You Only Look Once (YOLO): The YOLO algorithm is another popular approach
for object detection, known for its speed and accuracy. YOLO divides an image into a
grid and predicts bounding boxes and class probabilities for each grid cell. YOLO
models such as YOLOv3 and YOLOv4 have achieved state-of-the-art performance.
3. Anchor-based and Anchor-free Methods: Anchor-based methods, such as Faster R-
CNN, use predefined anchor boxes to localize objects. These methods match anchor
boxes with ground truth objects during training. In contrast, anchor-free methods, such
as CenterNet and EfficientDet, directly predict object centers and sizes. Anchor-free
methods reduce the complexity and achieve competitive performance.
4. Transfer Learning and Pretrained Models: Deep learning models for object
detection often leverage transfer learning by using pretrained models on large-scale
datasets like ImageNet. Transfer learning allows models to generalize well even with
limited training data and accelerates convergence.
5. Challenges and Future Directions: Despite significant progress, object detection still
faces challenges in detecting small objects, occlusion handling, and generalization to
novel object categories. Future research focuses on developing more robust and efficient
algorithms, exploring 3D object detection, and addressing ethical considerations.
5.1. CONCLUSIONS
In the field of object detection using deep learning, there are several areas that
researchers and practitioners are focusing on for future work. Here are some potential
directions and challenges:
1. Efficient and lightweight models: While accuracy is crucial, there is also a need
for more efficient and lightweight models that can run on resource-constrained
devices like mobile phones or embedded systems. Future work involves
developing compact architectures, optimizing network parameters, and exploring
techniques like knowledge distillation and model quantization.
2. Interpretable and explainable object detection: Deep learning models are often
regarded as black boxes, making it challenging to understand their decisions.
Future work involves developing techniques to provide interpretability and
explanations for object detection models, enabling better trust, transparency, and
error analysis.