
REAL TIME OBJECT DETECTION USING DEEP LEARNING

A PROJECT REPORT

Submitted by

S. DHANALAKSHMI (730923401002)

in partial fulfillment for the award of the degree

of

M.E Applied Electronics

in

ELECTRONICS AND COMMUNICATION ENGINEERING

EXCEL ENGINEERING COLLEGE

(An Autonomous Institution, affiliated to Anna University Chennai and Approved by

AICTE, New Delhi)

KOMARAPALAYAM – 637303

MAY-2025
BONAFIDE CERTIFICATE

Certified that this project report titled “REAL TIME OBJECT DETECTION USING DEEP LEARNING” is the bonafide work of S. DHANALAKSHMI (730923401002), who carried out the project under my supervision.

SIGNATURE
Dr. T. C. Kalaiselvi, M.E., Ph.D.,
HEAD OF THE DEPARTMENT
Professor, Department of Electronics and
Communication Engineering (ECE),
Excel Engineering College,
Komarapalayam – 637303.

SIGNATURE
Dr. R. Dinesh, M.E., Ph.D.,
ASSOCIATE PROFESSOR
Department of Electronics and
Communication Engineering (ECE),
Excel Engineering College,
Komarapalayam – 637303.

Submitted for the viva-voce examination held on ………………

Internal Examiner External Examiner


ACKNOWLEDGEMENT

It is with great pride that we express our gratitude to our


institution, “EXCEL ENGINEERING COLLEGE” for providing us
the opportunity to do this project.
We are greatly indebted to our Chairman Prof. Dr. A. K.
NATESAN, M.Com., MBA, M.Phil., Ph.D., FTA, AKS and Vice
Chairman Dr. N. MATHAN KARTHICK, M.B.B.S., M.H.Sc.
(Diabetology), AKS of EXCEL GROUP OF INSTITUTIONS, for
providing all the facilities for successfully completing the project.
We express our heartfelt thanks to the Principal, Dr. K.
BOMMANNA RAJA, M.E., Ph.D., for his valuable suggestions in our
entire endeavor.
We thank the Head of the Department, Dr. T. C. KALAISELVI, M.E., Ph.D., of Electronics and Communication Engineering for her guidance and constant inspiration in completing this project.
We wish to express our heartfelt thanks and sincere acknowledgement to our supervisor, Dr. R. DINESH, M.E., Ph.D., for his encouragement and dedicated guidance.
We express our everlasting and loving thanks to our parents for their kind help and support, which have contributed to the fruitful completion of the project.
ABSTRACT

Real time object detection is a vast, vibrant and complex area of computer vision. If there is a single object to be detected in an image, it is known as image localization, and if there are multiple objects in an image, then it is object detection. This detects the semantic objects of a class in digital images and videos. The applications of real time object detection include tracking objects, video surveillance, pedestrian detection, people counting, self-driving cars, face detection, ball tracking in sports and many more. Convolutional Neural Networks are a representative tool of deep learning used to detect objects with OpenCV (Open Source Computer Vision), a library of programming functions mainly aimed at real time computer vision.

Keywords: Computer Vision, Deep Learning, Convolutional Neural Networks.

TABLE OF CONTENT

CHAPTER NO    TITLE    PAGE NO
ABSTRACT i
LIST OF TABLES ii

LIST OF FIGURES iv

LIST OF ABBREVIATIONS v
1 INTRODUCTION 1

1.1 Project Objective 1


1.2 Motivation 2
1.3 Object Detection 2
1.4 Digital Image Processing 3
2 LITERATURE REVIEW 10
2.1 YOLO - Redmon et al 10
2.2 SSD - Liu et al 10
2.3 Faster R-CNN - Ren et al 10
2.4 EfficientDet - Tan et al 11
2.5 CenterNet - Zhou et al 11
2.6 Vision Transformers for Object Detection - 11
Dosovitskiy et al
3 DEEP LEARNING 12
3.1 Introduction 12

3.2 Feedforward and Feedback Networks 13

3.3 Weighted Sum 14

3.4 Threshold Function 15

3.5 Sigmoid Function 15


3.6 Hyperbolic Tangent Function 15

3.7 Rectifier Function 15

4 CONVOLUTION NEURAL NETWORKS 16


4.1 Introduction 16

4.2 Artificial Neural Networks 18

4.3 CNN Architecture 19

4.4 Convolution Layers 21

4.5 Fully Connected Layers 24

4.6 Types of Caffe Models 26

4.7 Learning Feature Hierarchy 30

5 OPEN COMPUTER VISION 32


5.1 Introduction 32

5.2 OpenCV applications 34

5.3 Libraries in OpenCV 35


6 RESULTS AND DISCUSSIONS 41
6.1 Introduction 41

6.2 Input Layer 41


6.3 Convolution Layers 41

6.4 Dense Layers 42


6.5 Output Layers 43
6.6 Conclusion 45
7 REFERENCES 46

LIST OF FIGURES

FIGURE NO    TITLE    PAGE NO
1.1 Digital Image 4

1.2 Types of Image Processing 5

4.1 Artificial Neural Networks 16

4.2 Simple Neural Networks 17

4.3 CNN Architecture 20

4.4 Visual representation of convolution layer 22

4.5 Algorithm 26

5.1 Rectangle feature 39

6.1 Forward & Backward propagation 43

6.2 Input & Output 44

LIST OF ABBREVIATIONS

ADAS Advanced Driver Assistance System


R-CNN Region-based Convolutional Neural Network

VGG Visual Geometry Group

CPU Central Processing Unit

GPU Graphics Processing Unit

TPU Tensor Processing Unit

SSD Single Shot Detector

YOLO You Only Look Once

RPN Region Proposal Network

BiFPN Bi-Directional Feature Pyramid Network

ANN Artificial Neural Network

CNN Convolutional Neural Network

RBF Radial Basis Function

SVM Support Vector Machine


FCN Fully Convolutional Network

SGD Stochastic Gradient Descent

CHAPTER 1
INTRODUCTION

1.1 Project Objective:


The motive of object detection is to recognize and locate all known objects in a scene, preferably in 3D space; recovering the pose of objects in 3D is very important for robotic control systems. Imparting intelligence to machines and making robots more and more autonomous and independent has been a sustaining technological dream for mankind. It is our dream to let robots take on tedious, boring, or dangerous work so that we can commit our time to more creative tasks. Unfortunately, the intelligent part still seems to be lagging behind. In real life, to achieve this goal, besides hardware development, we need software that can give a robot the intelligence to do the work and act independently. One of the crucial components in this regard is vision, apart from other types of intelligence such as learning and cognitive thinking. A robot cannot be truly intelligent if it cannot see and adapt to a dynamic environment.

The searching or recognition process in a real time scenario is very difficult. So far, no effective solution has been found for this problem. Despite a lot of research in this area, the methods developed so far are not efficient, require long training times, are not suitable for real time applications, and are not scalable to a large number of classes. Object detection is relatively simpler if the machine is looking for one particular object. However, recognizing all the objects inherently requires the skill to differentiate one object from another, even though they may be of the same type. Such a problem is very difficult for machines if they do not know about the various possibilities of objects.

1.2 Motivation:
Blind people do lead a normal life with their own style of doing things, but they definitely face trouble due to inaccessible infrastructure and social challenges. The biggest challenge for a blind person, especially one with a complete loss of vision, is to navigate around places. Obviously, blind people roam easily around their house without any help because they know the position of everything in the house. Blind people have a tough time finding objects around them. So we decided to build a REAL TIME OBJECT DETECTION system. We became interested in this project after going through a few papers in this area. As a result, we are highly motivated to develop a system that recognizes objects in the real time environment.
1.3 OBJECT DETECTION

Object Detection is the process of finding and recognizing real-world object instances such as cars, bikes, TVs, flowers, and humans in images or videos. An object detection technique lets you understand the details of an image or a video as it allows for the recognition, localization, and detection of multiple objects within an image.
It is usually utilized in applications like image retrieval, security, surveillance, and advanced driver assistance systems (ADAS). Object detection can be done in many ways:
 Feature Based Object Detection
 Viola Jones Object Detection
 SVM Classifications with HOG Features
 Deep Learning Object Detection

Object detection from video is the major task in video surveillance applications these days. The object detection technique is used to identify required objects in video sequences and to cluster the pixels of these objects. The detection of an object in a video sequence plays a major role in several applications, especially video surveillance.
Object detection in a video stream can be done by processes like pre-processing, segmentation, foreground and background extraction, and feature extraction.
Humans can easily detect and identify objects present in an image. The human visual system is fast and accurate and can perform complex tasks like identifying multiple objects with little conscious thought. With the availability of large amounts of data, faster GPUs, and better algorithms, we can now easily train computers to detect and classify multiple objects within an image with high accuracy.

DIGITAL IMAGE PROCESSING


Digital image processing is an area characterized by the need for extensive experimental work to establish the feasibility of proposed solutions to a given problem. An important characteristic underlying the design of image processing systems is the significant level of testing and experimentation that is typically required before arriving at an acceptable solution. This characteristic implies that the ability to formulate approaches and quickly prototype candidate solutions generally plays a major role in reducing the cost and time required to arrive at a viable system implementation.

WHAT IS AN IMAGE?

An image is represented as a two-dimensional function f(x, y), where x and y are spatial coordinates, and the amplitude of f at any pair of coordinates (x, y) is known as the intensity of the image at that point.

Fig. 1.1 digital image

Processing on an image:
Processing on an image can be of three types. They are low-level, mid-level, and high-level.

Low-level Processing:
 Preprocessing to remove noise.
 Contrast enhancement.
 Image sharpening

Medium Level Processing :


 Segmentation.
 Edge detection
 Object extraction.

High Level Processing:


 Image analysis
 Scene interpretation

Why Image Processing?
Since the digital image is invisible, it must be prepared for viewing on one or more output devices (laser printer, monitor, etc.). The digital image can be optimized for the application by enhancing the appearance of the structures within it.

There are three types of image processing used. They are:


 Image to Image transformation
 Image to Information transformations
 Information to Image transformations

Fig. 1.2 Types of Image Processing

Pixel:
A pixel is the smallest element of an image. Each pixel corresponds to any one value. In an 8-bit gray scale image, the value of a pixel lies between 0 and 255. Each pixel stores a value proportional to the light intensity at that particular location. It is indicated in either pixels per inch or dots per inch.
Resolution:
Resolution can be defined in many ways, such as pixel resolution, spatial resolution, temporal resolution, and spectral resolution. In pixel resolution, the term resolution refers to the total number of pixels in a digital image. For example, if an image has M rows and N columns, then its resolution can be defined as M × N. The higher the pixel resolution, the higher the quality of the image.
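To make the pixel and resolution ideas above concrete, here is a short illustrative Python/OpenCV sketch; the file name sample.jpg is an assumption and not part of the project code.

import cv2

# Illustrative sketch only; assumes a local image file named "sample.jpg".
image = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)   # 8-bit grayscale image

rows, cols = image.shape                                  # pixel resolution: M rows x N columns
print("Resolution (M x N):", rows, "x", cols)
print("Pixel value at (0, 0):", image[0, 0])              # an intensity between 0 and 255
print("Minimum / maximum intensity:", image.min(), image.max())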

RELATED TECHNOLOGY:
R-CNN
R-CNN is a progressive visual object detection system that combines bottom-up region proposals with rich features computed by a convolutional neural network.
R-CNN uses region proposal methods to first generate potential bounding boxes in an image and then runs a classifier on these proposed boxes.
SINGLE SHOT MULTIBOX DETECTOR
SSD discretizes the output space of bounding boxes into a set
of default boxes over different aspect ratios and scales per feature map
location. At the time of prediction the network generates scores for the
presence of each object category in each default box and generates
adjustments to the box to better match the object shape.
Additionally, the network combines predictions from multiple feature
maps with different resolutions to naturally handle objects of various
sizes.

ALEXNET
AlexNet is a convolutional neural network used for classification; its architecture has 5 convolutional layers, 3 fully connected layers and 1 softmax layer with 1000 outputs.
YOLO
YOLO is a real-time object detection system. It applies one neural network to the complete image, dividing the image into regions, and predicts bounding boxes and probabilities for every region. Predicted probabilities are the basis on which these bounding boxes are weighted. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the full detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
VGG
The VGG network is another convolutional neural network architecture used for image classification.
MOBILENETS
To build lightweight deep neural networks, MobileNets are used. They are based on a streamlined architecture that uses depth-wise separable convolutions. MobileNet uses 3×3 depth-wise separable convolutions, which need roughly 8 times less computation than standard convolutions at only a small reduction in accuracy. Applications and use cases include object detection, fine-grained classification, face attributes and large-scale localization.
TENSORFLOW
TensorFlow is an open source software library for high performance numerical computation. Due to its versatile design, it allows simple deployment of computation across a range of platforms (CPUs, GPUs, TPUs), from desktops to clusters of servers to mobile and edge devices. TensorFlow was designed and developed by researchers and engineers from the Google Brain team within Google’s AI organization; it comes with robust support for machine learning and deep learning, and its versatile numerical computation core is used across several other scientific domains. TensorFlow makes it easy to construct, train and deploy object detection models, and it also provides a collection of detection models pre-trained on the COCO dataset, the KITTI dataset, and the Open Images dataset. One of these detection models is the combination of the Single Shot Detector (SSD) and MobileNet architectures, which is fast, efficient and does not need huge computational capability to accomplish object detection.
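As a hedged sketch of how such an SSD + MobileNet model can be run through OpenCV's DNN module, the snippet below loads a Caffe prototxt/caffemodel pair and prints the confident detections for one image. The file names, input image and the 0.5 confidence threshold are assumptions, not the report's exact configuration.

import cv2
import numpy as np

# Hedged sketch: the model file names are assumptions; any compatible
# MobileNet-SSD Caffe prototxt/caffemodel pair can be substituted.
net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")

image = cv2.imread("input.jpg")                     # assumed input image
h, w = image.shape[:2]

# MobileNet-SSD expects 300x300 inputs with these scaling constants.
blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)),
                             0.007843, (300, 300), 127.5)
net.setInput(blob)
detections = net.forward()                          # shape: (1, 1, N, 7)

for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    if confidence > 0.5:                            # keep only confident detections
        class_id = int(detections[0, 0, i, 1])
        box = (detections[0, 0, i, 3:7] * np.array([w, h, w, h])).astype(int)
        print("class", class_id, "confidence", round(float(confidence), 2), "box", box)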
APPLICATION OF OBJECT DETECTION
The major applications of Object Detection are:
FACIAL RECOGNITION
“DeepFace” is a deep learning facial recognition system developed to identify human faces in a digital image. It was designed and developed by a group of researchers at Facebook. Google also has its own facial recognition system in Google Photos, which automatically separates all the photos according to the person in the image.
There are various components involved in facial recognition; it focuses on various aspects like the eyes, nose, mouth and the eyebrows for recognizing a face.
PEOPLE COUNTING
People counting is also a part of object detection, which can be used for various purposes like finding a person or a criminal; it is also used for analysing store performance or crowd statistics during festivals. This process is considered a difficult one as people move out of the frame quickly.

INDUSTRIAL QUALITY CHECK


Object detection also plays an important role in industrial processes to identify or recognize products. Finding a particular object through visual examination is a basic task involved in multiple industrial processes like sorting, inventory management, machining, quality management, packaging and so on. Inventory management can be very difficult as items are hard to track in real time. Automatic object counting and localization allows improving inventory accuracy.
SELF DRIVING CARS
Self-driving cars are the most promising future technology, but the working behind them can be very complex, as they combine a variety of techniques to perceive their surroundings, including radar, laser light, GPS, odometry, and computer vision. Advanced control systems interpret this sensory information to identify appropriate navigation paths, as well as obstacles. Object detection that happens at very fast speed is a big step towards driverless cars.
SECURITY
Object detection plays a vital role in the field of security; it takes part in major areas such as Apple's Face ID or the retina scans seen in sci-fi movies. Governments also widely use this application to access security feeds and match them with their existing databases to find criminals, or to detect objects like car number plates involved in criminal activities. The applications are limitless.

CHAPTER-2
LITERATURE REVIEW

2.1. YOLO (You Only Look Once) - Redmon et al. (2016)

YOLO introduced a real-time object detection model that uses a


single neural network to predict bounding boxes and class probabilities directly from images. Unlike previous methods, YOLO processes the entire image in one go, making it significantly faster. The model achieved real-time performance at up to 45 FPS on standard GPUs and demonstrated good accuracy on the PASCAL VOC dataset. It struggles, however, with small object detection and overlapping objects.

2.2. SSD (Single Shot MultiBox Detector) - Liu et al. (2016)

SSD enhances real-time detection by using multiple feature


maps at different scales to detect objects. It eliminates the need for region proposals, improving speed and efficiency. It introduced default boxes and feature pyramids to improve small object detection. SSD achieves a good balance between accuracy and speed, although performance degrades when detecting very small objects.

2.3. Faster R-CNN - Ren et al. (2015)

Faster R-CNN improved upon R-CNN by introducing the


Region Proposal Network (RPN), which generates region proposals
directly from feature maps, significantly reducing computation time.
It improved accuracy over YOLO and SSD, making it one of the most accurate object detection models, but it is not as fast as YOLO or SSD, making it less suitable for real-time applications.

2.4. EfficientDet - Tan et al. (2020)

EfficientDet utilizes a Bi-directional Feature Pyramid


Network (BiFPN) and compound scaling to balance accuracy and
efficiency. It achieved state-of-the-art performance with better accuracy while maintaining high speed, at the cost of a more complex architecture that requires careful tuning of scaling factors.

2.5. CenterNet - Zhou et al. (2019)

CenterNet eliminates anchor boxes by detecting objects as


keypoints, simplifying the architecture and improving inference speed. It reduces computational overhead while maintaining high accuracy, but struggles with detecting small and occluded objects.

2.6. Vision Transformers (ViTs) for Object Detection - Dosovitskiy et


al. (2020)

Introduced the use of transformer-based architectures for


object detection, leveraging self-attention mechanisms to improve feature representation. They achieved high accuracy on large-scale datasets and demonstrated strong performance in handling occlusions, but they require significant computational resources, making real-time deployment challenging.

CHAPTER 3
DEEP LEARNING
INTRODUCTION
Deep learning is a machine learning technique. It teaches a
computer to filter inputs through layers to learn how to predict and
classify information. Observations can be in the form of images, text, or
sound. The inspiration for deep learning is the way that the human brain
filters information. Its purpose is to mimic how the human brain works to
create some real magic. In the human brain, there are about 100 billion
neurons. Each neuron connects to about 100,000 of its neighbors. We’re
kind of recreating that, but in a way and at a level that works for
machines. In our brains, a neuron has a body, dendrites, and an axon. The
signal from one neuron travels down the axon and transfers to the
dendrites of the next neuron. That connection where the signal passes is
called a synapse. Neurons by themselves are kind of
useless. But when you have lots of them, they work together to create
some serious magic.That’s the idea behind a deep learning algorithm!
You get input from observation and you put your input into one layer.
That layer creates an output which in turn becomes the input for the next
layer, and so on. This happens over and over until your final output signal!
The neuron (node) gets a signal or signals ( input values), which pass
through the neuron. That neuron delivers the output signal.

Think of the input layer as your senses: the things you see, smell,
and feel, for example. These are independent variables for one single
observation. This information is broken down into numbers and the bits
of binary data that a computer can use. You’ll need to either standardize
or normalize these variables so that they’re within the same range. They

use many layers of nonlinear processing units for feature extraction and
transformation. Each successive layer uses the output of the previous
layer for its input. What they learn forms a hierarchy of concepts. In this
hierarchy, each level learns to transform its input data into a more and
more abstract and composite representation. That means that for an image,
for example, the input might be a matrix of pixels. The first layer might
encode the edges and compose the pixels. The next layer might compose
an arrangement of edges. The next layer might encode a nose and eyes.
The next layer might recognize that the image contains a face, and so on.

What happens inside the neuron?


The input node takes in information in a numerical form. The
information is presented as an activation value where each node is given a
number. The higher the number, the greater the activation. Based on the
connection strength (weights) and transfer function, the activation value
passes to the next node. Each of the nodes sums the activation values that
it receives (it calculates the weighted sum) and modifies that sum based
on its transfer function. Next, it applies an activation function. An activation function is a function that’s applied to this particular neuron. From that, the neuron understands if it needs to pass along a signal or not.
Each of the synapses gets assigned weights, which are crucial to
Artificial Neural Networks (ANNs). Weights are how ANNs learn. By
adjusting the weights, the ANN decides to what extent signals get passed
along. When you’re training your network, you’re deciding how the
weights are adjusted.

Feedforward and feedback networks


A feedforward network is a network that contains inputs, outputs, and
hidden layers. The signals can only travel in one direction (forward).

Input data passes into a layer where calculations are performed. Each
processing element computes based upon the weighted sum of its inputs.
The new values become the new input values that feed the next layer
(feed-forward). This continues through all the layers and determines the
output. Feedforward networks are often used in, for example, data mining.

A feedback network (for example, a recurrent neural network)


has feedback paths. This means that they can have signals traveling in
both directions using loops. All possible connections between neurons are
allowed. Since loops are present in this type of network, it becomes a
non-linear dynamic system which changes continuously until it reaches a
state of equilibrium. Feedback networks are often used in optimization
problems where the network looks for the best arrangement of
interconnected factors.

Weighted Sum
Inputs to a neuron can either be features from a training set or
outputs from the neurons of a previous layer. Each connection between
two neurons has a unique synapse with a unique weight attached. If you
want to get from one neuron to the next, you have to travel along the
synapse and pay the “toll” (weight). The neuron then applies an activation
function to the sum of the weighted inputs from each incoming synapse.
It passes the result on to all the neurons in the next layer. When we talk
about updating weights in a network, we’re talking about adjusting the
weights on these synapses.

A neuron’s input is the sum of weighted outputs from all the


neurons in the previous layer. Each input is multiplied by the weight
associated with the synapse connecting the input to the current neuron. If

there are 3 inputs or neurons in the previous layer, each neuron in the
current layer will have 3 distinct weights: one for each synapse.

Threshold Function
This is a step function. If the summed value of the input is below a certain threshold, the function passes on 0. If it’s equal to or above the threshold, then it passes on 1. It’s a very rigid, straightforward, yes or no function.

Sigmoid function
This function is used in logistic regression. Unlike the threshold function, it’s a smooth, gradual progression from 0 to 1. It’s useful in the output layer, especially when predicting probabilities.

Hyperbolic Tangent Function


This function is very similar to the sigmoid function. But unlike the
sigmoid function which goes from 0 to 1, the value goes below zero, from
-1 to 1. Even though this isn’t a lot like what happens in a brain, this
function gives better results when it comes to training neural networks.
Neural networks sometimes get “stuck” during training with the sigmoid
function. This happens when there’s a lot of strongly negative input that
keeps the output near zero, which messes with the learning process.

Rectifier function
This might be the most popular activation function in the universe of
neural networks. It’s the most efficient and biologically plausible. Even
though it has a kink, it’s smooth and gradual after the kink at 0. This
means, for example, that your output would be either “no” or a
percentage of “yes.” This function doesn’t require normalization or other
complicated calculations.
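The following small NumPy sketch (illustrative values only, not from the report) shows a neuron's weighted sum followed by the four activation functions just described.

import numpy as np

# Sketch of a neuron's weighted sum and the four activation functions above.
def weighted_sum(inputs, weights):
    return np.dot(inputs, weights)           # sum of input_i * weight_i

def threshold(x):
    return np.where(x >= 0, 1.0, 0.0)        # rigid yes/no step function

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # smooth progression from 0 to 1

def tanh(x):
    return np.tanh(x)                        # like sigmoid, but ranges from -1 to 1

def relu(x):
    return np.maximum(0.0, x)                # rectifier: 0 below the kink, linear above

inputs = np.array([0.5, -1.2, 3.0])          # example observation (3 features)
weights = np.array([0.4, 0.1, -0.6])         # example synapse weights
z = weighted_sum(inputs, weights)
print(z, threshold(z), sigmoid(z), tanh(z), relu(z))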

CHAPTER 4
CONVOLUTIONAL NEURAL NETWORKS

INTRODUCTION TO CONVOLUTIONAL NEURAL NETWORKS


(CNN)

Artificial Neural Networks


The idea of ANNs is based on the belief that the working of the human brain, by making the right connections, can be imitated using silicon and wires as living neurons and dendrites.

Fig. 4.1 Artificial Neural Networks

The human brain is composed of 86 billion nerve cells called neurons. They are connected to thousands of other cells by axons. Stimuli from the external environment or inputs from sensory organs are accepted by dendrites. These inputs create electric impulses, which quickly travel through the neural network. A neuron can then either send the message to other neurons to handle the issue or not send it forward.

ANNs are composed of multiple nodes, which imitate


biological neurons of human brain. The neurons are connected by links
and they interact with each other. The nodes can take input data and

perform simple operations on the data. The result of these operations is
passed to other neurons. The output at each node is called its activation or
node value. Each link is associated with a weight. ANNs are capable of
learning, which takes place by altering weight values.

Neural network:
A neural network is a network or circuit of neurons, or in a modern sense,
an artificial neural network, composed of artificial neurons or nodes.
Thus a neural network is either a biological neural network, made up of
real biological neurons, or an artificial neural network, for solving artificial intelligence (AI) problems. The connections of the biological neuron are modeled as weights. A positive weight reflects an excitatory connection, while negative values mean inhibitory connections. All inputs are modified by a weight and summed. This activity is referred to as a linear combination. Finally, an activation function controls the amplitude of the output; for example, an acceptable range of output is usually between 0 and 1, or it could be -1 and 1. Such a network can derive conclusions from a complex and seemingly unrelated set of information.

Fig. 4.2 A simple neural network

A deep neural network (DNN) is an artificial neural network
(ANN) with multiple layers between the input and output layers. The
DNN finds the correct mathematical manipulation to turn the input into
the output, whether it be a linear relationship or a non-linear relationship.

CONVOLUTIONAL NEURAL NETWORKS:


Convolutional Neural Networks are very similar to ordinary
Neural Networks from the previous chapter: they are made up of neurons
that have learnable weights and biases. Each neuron receives some inputs,
performs a dot product and optionally follows it with a non-linearity.
Convolutional Neural Networks (CNNs) are analogous to traditional
ANNs in that they are comprised of neurons that self-optimise through
learning. Each neuron will still receive an input and perform an operation (such as a scalar product followed by a non-linear function) - the basis of countless ANNs. From the input raw image vectors to the final output of the class score, the entire network will still express a single perceptive score function (the weights). The last layer will contain loss
functions associated with the classes, and all of the regular tips and tricks
developed for traditional ANNs still apply.
The only notable difference between CNNs and traditional
ANNs is that CNNs are primarily used in the field of pattern recognition
within images. This allows us to encode image-specific features into the
architecture, making the network more suited for image-focused tasks -
whilst further reducing the parameters required to set up the model. One
of the largest limitations of traditional forms of ANN is that they tend to
struggle with the computational complexity required to compute image
data. Common machine learning benchmarking datasets such as the
MNIST database of handwritten digits are suitable for most forms of

ANN, due to its relatively small image dimensionality of just 28 × 28. With this dataset, a single neuron in the first hidden layer will contain 784 weights (28 × 28 × 1, bearing in mind that MNIST is normalised to just black and white values), which is manageable for most forms of ANN. If you consider a more substantial coloured image input of 64 × 64, the number of weights on just a single neuron of the first layer increases substantially to 12,288. Also take into account that, to deal with this scale of input, the network will need to be a lot larger than one used to classify colour-normalised MNIST digits; then you will understand the drawbacks of using such models.

CNN ARCHITECTURE:

CNNs are feedforward networks in that information flow takes place in


one direction only, from their inputs to their outputs. Just as artificial
neural networks (ANN) are biologically inspired, so are CNNs. The
visual cortex in the brain, which consists of alternating layers of simple
and complex cells (Hubel & Wiesel, 1959, 1962), motivates their
architecture.

CNN architectures come in several variations; however, in general, they


consist of convolutional and pooling (or subsampling) layers, which are
grouped into modules. Either one or more fully connected layers, as in a
standard feedforward neural network, follow these modules. Modules are
often stacked on top of each other to form a deep model. Figure 4.3 illustrates a typical CNN architecture for a toy image classification task. An image is
input directly to the network, and this is followed by several stages of
convolution and pooling. Thereafter, representations from these
operations feed one or more fully connected layers.

Finally, the last fully connected layer outputs the class label. Despite this being the most popular base architecture found in the literature, several architecture changes have been proposed in recent years with the objective of improving image classification accuracy or reducing computation costs. For the remainder of this section, we merely fleetingly introduce the standard CNN architecture.

Fig. 4.3 CNN Architecture

OVERALL ARCHITECTURE:
CNNs are comprised of three types of layers. These are convolutional layers, pooling layers and fully-connected layers. When these layers are stacked, a CNN architecture has been formed. A simplified CNN architecture for MNIST classification, comprised of just five layers (input, convolution with ReLU, pooling, fully-connected with ReLU, and a fully-connected output), illustrates the idea. The basic functionality of such an example CNN can be broken down into four key areas.
1. As found in other forms of ANN, the input layer will hold the pixel values of the image.
2. The convolutional layer will determine the output of neurons which are connected to local regions of the input through the calculation of the scalar product between their weights and the region connected to the input volume. The rectified linear unit (commonly shortened to ReLU) aims to apply an 'elementwise' activation function such as sigmoid to the output of the activation produced by the previous layer.
3. The pooling layer will then simply perform downsampling along the spatial dimensionality of the given input, further reducing the number of parameters within that activation.
4. The fully-connected layers will then perform the same duties found in standard ANNs and attempt to produce class scores from the activations, to be used for classification. It is also suggested that ReLU may be used between these layers to improve performance.
Through this simple method of transformation, CNNs are able to transform the original input layer by layer using convolutional and downsampling techniques to produce class scores for classification and regression purposes. However, it is important to note that simply understanding the overall architecture of a CNN will not suffice. The creation and optimisation of these models can take quite some time, and can be quite confusing. We will now explore in detail the individual layers, detailing their hyperparameters and connectivities.
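As an illustrative sketch of this five-layer pattern, a minimal model can be written with the Keras API of TensorFlow (mentioned in Chapter 1). The layer sizes below are assumptions chosen for MNIST-sized 28 × 28 inputs, not the project's exact network.

import tensorflow as tf

# Minimal sketch of: input -> convolution w/ ReLU -> pooling -> fully-connected w/ ReLU -> output.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),               # input layer holds the pixel values
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),  # convolutional layer + ReLU
    tf.keras.layers.MaxPooling2D((2, 2)),                   # pooling layer downsamples spatially
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),          # fully-connected layer + ReLU
    tf.keras.layers.Dense(10, activation="softmax"),        # class scores for 10 digit classes
])

model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()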

CONVOLUTIONAL LAYERS:

The convolutional layers serve as feature extractors, and thus
they learn the feature representations of their input images. The neurons
in the convolutional layers are arranged into feature maps. Each neuron in
a feature map has a receptive field, which is connected to a neighborhood
of neurons in the previous layer via a set of trainable weights, sometimes
referred to as a filter bank. Inputs are convolved with the learned weights
in order to compute a new feature map, and the convolved results are sent
through a nonlinear activation function.
All neurons within a feature map have weights that are
constrained to be equal; however, different feature maps within the same
convolutional layer have different weights so that several features can be
extracted at each location. As the name implies, the convolutional layer plays a vital role in how CNNs operate. The layer's parameters focus around the use of learnable kernels. These kernels are usually small in spatial dimensionality, but spread along the entirety of the depth of the input. When the data hits a convolutional layer, the layer convolves each filter across the spatial dimensionality of the input to produce a 2D activation map. These activation maps can be visualised. As we glide through the input, the scalar product is calculated for each value in that kernel. From this, the network will learn kernels that 'fire' when they see a specific feature at a given spatial position of the input. These are commonly known as activations.

Fig. 4.4 Visual representation of a convolutional layer

The centre element of the kernel is placed over the input vector,
of which is then calculated and replaced with a weighted sum of itself and
any nearby pixels. Every kernel will have a corresponding activation map,
of which will be stacked along the depth dimension to form the full
output volume from the convolutional layer. As we alluded to earlier, training ANNs on inputs such as images results in models which are too big to train effectively. This comes down to the fully-connected manner of standard ANN neurons, so to mitigate against this, every neuron in a convolutional layer is only connected to a small region of the input volume. The dimensionality of this region is commonly referred to as the receptive field size of the neuron. The magnitude of the connectivity through the depth is nearly always equal to the depth of the input.
For example, if the input to the network is an image of size 64 × 64 × 3 (an RGB coloured image with a dimensionality of 64 × 64) and we set the receptive field size as 6 × 6, we would have a total of 108 weights on each neuron within the convolutional layer (6 × 6 × 3, where 3 is the magnitude of connectivity across the depth of the volume). To put this into perspective, a standard fully-connected neuron seen in other forms of ANN would contain 12,288 weights each. Convolutional layers are also able to significantly reduce the complexity of the model through the optimisation of their output. They are optimised through three hyperparameters: the depth, the stride and the zero-padding.

The depth of the output volume produced by the


convolutional layers can be manually set through the number of neurons within the layer that connect to the same region of the input. This can be seen with other forms of ANNs, where all of the neurons in the hidden layer are directly connected to every single neuron beforehand. Reducing this hyperparameter can significantly minimise the total number of neurons of the network, but it can also significantly reduce the capabilities of the model.
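The 6 × 6 × 3 receptive field example above can be made concrete with a small NumPy sketch; the toy data, stride 1 and absence of zero-padding are illustrative assumptions.

import numpy as np

# One convolutional neuron sliding its 6 x 6 x 3 receptive field across a 64 x 64 x 3 input.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))            # toy RGB input
kernel = rng.random((6, 6, 3))             # one learnable filter: 6*6*3 = 108 weights
print("weights per neuron:", kernel.size)  # -> 108

out_h, out_w = 64 - 6 + 1, 64 - 6 + 1      # valid convolution, stride 1, no zero-padding
activation_map = np.zeros((out_h, out_w))
for y in range(out_h):
    for x in range(out_w):
        region = image[y:y + 6, x:x + 6, :]             # local region of the input
        activation_map[y, x] = np.sum(region * kernel)  # scalar product with the kernel
print("activation map shape:", activation_map.shape)    # -> (59, 59)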

Fully Connected Layers


Several convolutional and pooling layers are usually stacked on
top of each other to extract more abstract feature representations in
moving through the network. The fully connected layers that follow these
layers interpret these feature representations and perform the function of
high-level reasoning. For classification problems, it is standard to use the softmax operator on top of a DCNN. While early success was enjoyed by using radial basis functions (RBFs) as the classifier on top of the convolutional towers, later work found that replacing the softmax operator with a support vector machine (SVM) leads to improved classification accuracy. The fully-connected layer contains neurons which are directly connected to the neurons in the two adjacent layers, without being connected to any layers within them. This is analogous to the way that neurons are arranged in traditional forms of ANN.

Despite the relatively small number of layers required to form a


CNN, there is no set way of formulating a CNN architecture. That being
said, it would be idiotic to simply throw a few layers together and
expect it to work. Through reading of related literature it is obvious that
much like other forms of ANNs, CNNs tend to follow a common
architecture. This common architecture is illustrated in Figure 2, where
convolutional layers are stacked, followed by pooling layers in a repeated
manner before feeding forward to fully-connected layers.

Convolutional Neural Networks differ from other forms of Artificial Neural Network in that, instead of focusing on the entirety of the problem domain, knowledge about the specific type of input is exploited. This in turn allows for a much simpler network architecture to be set up.
This paper has outlined the basic concepts of
Convolutional Neural Networks, explaining the layers required to build
one and detailing how best to structure the network in most image
analysis tasks.

Research in the field of image analysis using neural networks has


somewhat slowed in recent times. This is partly due to the incorrect belief
surrounding the level of complexity and knowledge required to begin
modelling these superbly powerful machine learning algorithms. The
authors hope that this paper has in some way reduced this confusion, and
made the field more accessible to beginners.

Training
CNNs, and ANNs in general, use learning algorithms to adjust
their free parameters in order to attain the desired network output. The
most common algorithm used for this purpose is backpropagation.
Backpropagation computes the gradient of an objective function to
determine how to adjust a network’s parameters in order to minimize
errors that affect performance. A commonly experienced problem with
training CNNs, and in particular DCNNs, is overfitting, which is poor
performance on a held-out test set after the network is trained on a small
or even large training set. This affects the model’s ability to generalize on
unseen data and is a major challenge for DCNNs that can be assuaged by
regularization.

Fig. 4.5 Algorithm

Caffe Model
Caffe is a deep learning framework, and it was used for the implementation to provide the following things in an object detection system.
•Expression: Models and optimizations are defined as plaintext schemas
in the caffe model unlike others which use codes for this purpose.
•Speed: for research and industry alike speed is crucial for state-of-the-art
models and massive data [11].
•Modularity: Flexibility and extensibility are required for new tasks and different settings.
•Openness: Common code, reference models, and reproducibility are the
basic requirements of scientific and applied progress.
Types of Caffe Models
Open Pose
OpenPose is the first real-time multi-person system which can jointly detect human body, hand, and facial keypoints (in total 130 keypoints) on single images.
Fully Convolutional Networks for Semantic Segmentation
The fully convolutional networks (FCNs) are the reference implementation of the models and code for the PAMI FCN and CVPR FCN papers.
Cnn-vis
Cnn-vis is an open-source tool that lets you use convolutional neural
networks to generate images. It has taken inspiration from Google's recent Inceptionism blog post.
Speech Recognition
Speech Recognition with the caffe deep learning framework.
DeconvNet
Learning Deconvolution Network for Semantic Segmentation.

Coupled Face Generation


This is the open source repository for the Coupled Generative Adversarial
Network (CoupledGAN or CoGAN) work. These models are compatible with Caffe master, unlike earlier FCNs that required a prerelease branch (note: this reference edition of the models remains ongoing and not all of the models have yet been ported to master).
Codes for Fast Image Retrieval
To create hash-like binary codes, it provides an effective framework for fast image retrieval.
SegNet and Bayesian SegNet
SegNet is a real-time semantic segmentation architecture for scene
understanding.
Deep Hand
It gives pre-trained CNN models.
DeepYeast
DeepYeast is an 11-layer convolutional neural network trained on images of yeast cells carrying fluorescent proteins with different subcellular localizations.
Python vs other languages for Object Detection
Object detection is a domain-specific variation of the machine learning prediction problem. Intel's OpenCV library, which is implemented in C/C++, has its interfaces available in a very wide range of programming environments like C#, Matlab, Octave, R, Python and so on. Why Python code is a much better option than other language code for object detection:
 More compact and readable code
 Python uses zero-based indexing
 Dictionary (hashes) support provided
 Simple and elegant object-oriented programming
 Free and open
 Multiple functions can be packaged in one module
 More choices in graphics packages and toolsets
Supervised learning also plays an important role.
The utility of unsupervised pre-training is usually evaluated on the basis of what performance is achieved after supervised fine-tuning. This paper reviews and discusses the fundamentals of learning as well as supervised learning for classification models, and also talks about the mini-batch stochastic gradient descent algorithm that is used to fine-tune many of the models.
Object Classification in Moving Object Detection
Object classification works on shape, motion, color and texture. The classification can be done under various categories like plants, objects, animals, humans, etc. The key concept of object classification is tracking objects and analysing their features.

Shape-Based
A mixture of image-based and scene-based object parameters, such as image blob (binary large object) area, the aspect ratio of the blob bounding box and camera zoom, is given as input to this detection system. Classification is performed on the basis of the blob at each and every frame. The results are kept in the histogram.
Motion-Based
When a simple image is given as an input with no objects in motion, this classification isn't required. In general, non-rigid articulated human motion shows a periodic property; therefore this has been used as a powerful clue for classification of moving objects. Based on this useful clue, human motion is distinguished from the motion of other objects.
Color-Based
Though color isn't an appropriate measure alone for detecting and tracking objects, the low processing cost of colour-based algorithms makes colour a very good feature to be exploited. As an example, the colour-histogram-based technique is employed for detection of vehicles in real time. A colour histogram describes the colour distribution in a given region and is robust against partial occlusions.
Texture-Based
The texture-based approaches, with the assistance of texture pattern recognition, work just like motion-based approaches. They provide higher accuracy by using overlapping local contrast normalization, but may need more time, which can be improved using some fast techniques.
PROPOSED WORK
The authors have applied real-time object detection using deep learning and OpenCV to work with video streams and video files. This is accomplished using the efficient Open Computer Vision library. Implementation of the proposed strategy includes a Caffe model based on Google image scenery; Caffe offers the model definitions, optimization settings and pre-trained weights [4]. Prerequisites include Python 3.7, OpenCV 4 packages and NumPy to complete this task of object detection. NumPy is the elementary package for scientific computing with Python. It contains, among other things, a strong N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, and helpful linear algebra, Fourier transform, and random number capabilities. NumPy works in the backend to provide statistical information on the resemblance of an object with the image scenery caffemodel database. Object clusters can be created according to the fuzzy value provided by NumPy. This project can detect live objects from videos and images.
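A hedged sketch of the real-time loop described here follows: frames are read from a camera (or a video file) and passed through a Caffe detection model with OpenCV's DNN module. The model file names are assumptions, as in the earlier single-image example.

import cv2

# Hedged sketch of a real-time detection loop over a video stream.
net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")
capture = cv2.VideoCapture(0)              # 0 = default camera; a video file path also works

while True:
    grabbed, frame = capture.read()
    if not grabbed:
        break
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()
    # ...filter and draw the boxes whose confidence exceeds a threshold, as shown earlier...
    cv2.imshow("Real time object detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):   # press 'q' to stop
        break

capture.release()
cv2.destroyAllWindows()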

LEARNING FEATURE HIERARCHY:


A feature hierarchy is learned all the way from pixels to classifier: one layer extracts features from the output of the previous layer, and all layers are trained jointly.
Zero-One Loss
The models given in these deep learning tutorials are largely used for
classification. The major aim of training a classifier is to reduce the number of errors (zero-one loss) on unseen examples.
Negative Log-Likelihood Loss
Optimizing the zero-one loss for large models (thousands or millions of parameters) is prohibitively expensive (computationally) because it isn't differentiable. Instead, maximization of the log-likelihood of the classifier given all the labels in a training set is performed [14]. The likelihood of the correct class and the number of right predictions are not equal, but they are pretty similar from the point of view of a randomly initialized classifier. As the likelihood and zero-one loss are different objectives, we should always see that they are correlated on the validation set, but sometimes one will rise while the other falls, or vice-versa.
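A tiny NumPy illustration of the difference between the two objectives follows; the probabilities and labels are made-up example values.

import numpy as np

# Zero-one loss vs. negative log-likelihood for a small batch of predictions.
probs = np.array([[0.7, 0.2, 0.1],     # predicted class probabilities (3 classes)
                  [0.1, 0.5, 0.4],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 2, 2])           # correct class indices

predictions = probs.argmax(axis=1)
zero_one_loss = np.mean(predictions != labels)                        # fraction of wrong predictions
nll_loss = -np.mean(np.log(probs[np.arange(len(labels)), labels]))    # differentiable surrogate

print("zero-one loss:", zero_one_loss)
print("negative log-likelihood:", round(float(nll_loss), 4))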
Stochastic Gradient Descent

Ordinary gradient descent is a simple algorithm in which we repeatedly make small steps downward on an error surface defined by a loss function of some parameters. For the purpose of ordinary gradient descent, we consider that the training data is rolled into the loss function. Stochastic gradient descent (SGD) operates on the basis of similar principles as ordinary gradient descent, but it proceeds more quickly by estimating the gradient from just a few examples at a time instead of the complete training set. In its purest form, we use just one example at a time to estimate the gradient.
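The following NumPy sketch illustrates mini-batch stochastic gradient descent on a toy linear least-squares problem; the data, learning rate and batch size are illustrative assumptions, not values from the report.

import numpy as np

# Mini-batch SGD on a toy linear regression problem with a squared-error loss.
rng = np.random.default_rng(1)
X = rng.random((1000, 3))                      # toy training set: 1000 examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.standard_normal(1000)

w = np.zeros(3)                                # parameters to learn
learning_rate, batch_size = 0.1, 32

for epoch in range(20):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]  # gradient estimated from a few examples only
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= learning_rate * grad              # small step downward on the error surface

print("learned weights:", np.round(w, 3))      # close to [2.0, -1.0, 0.5]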
Caffe is a deep learning framework, or we can say a library, made with expression, speed and modularity in mind. It was put out by Berkeley Artificial Intelligence Research and created by Yangqing Jia. There are many deep learning or machine learning frameworks for computer vision, like TensorFlow, Theano, Keras and SVM [2]. But why exactly do we implement detection in Caffe? The reason is its expressive architecture: we can easily switch between CPU and GPU while training on a GPU machine, and models and optimization for our problem are defined by configuration without hard coding. It supports extensible code since Caffe is an open source library; it has been forked by over twenty thousand developers on GitHub since its birth, and it offers a coding platform in extensible languages like Python and C++. The next reason is speed: for training neural networks, speed is the primary constraint. Caffe can process over a million images in a single day with a standard GPU, that is, milliseconds per image, whereas the same dataset of a million images can take weeks for Theano and Keras. Caffe is the fastest convolutional neural network implementation presently available. Community, as mentioned earlier: since it is an open source library, a huge amount of research is powered by Caffe and every single day something new is coming out of it.

CHAPTER 5
OPEN COMPUTER VISION

5.1 INTRODUCTION
OpenCV stands for Open Source Computer Vision Library and is an open source computer vision and machine learning software library. The purpose of the creation of OpenCV was to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products [6]. It becomes very easy for businesses to utilize and modify the code with OpenCV as it is a BSD-licensed product. It is a rich library, as it contains 2500 optimized algorithms, which include a comprehensive set of both classic and state-of-the-art computer vision and machine learning algorithms. These algorithms can be used for various functions such as detecting and recognizing faces, identifying objects, classifying human actions in videos, tracking camera movements, and tracking moving objects. They can extract 3D models of objects, produce 3D point clouds from stereo cameras, stitch images together to produce a high-resolution image of a complete scene, find similar images from an image database, remove red eyes from images taken with flash, follow eye movements, recognize scenery and establish markers to overlay it with augmented reality.

Officially launched in 1999 the OpenCV project was initially an Intel


Research initiative to advance CPU-intensive applications, part of a series of projects including real-time ray tracing and 3D display walls. The main contributors to the project included a number of optimization experts in Intel Russia, as well as Intel's Performance Library Team. In the early days of OpenCV, the goals of the project were described as:
 Advance vision research by providing not only open but also
optimized code for basic vision infrastructure. No more reinventing
the wheel.
 Disseminate vision knowledge by providing a common infrastructure
that developers could build on, so that code would be more readily
readable and transferable.
 Advance vision-based commercial applications by making portable,
performance-optimized code available for free – with a license that
did not require code to be open or free itself.

The first alpha version of OpenCV was released to the public at the IEEE
Conference on Computer Vision and Pattern Recognition in 2000, and
five betas were released between 2001 and 2005. The first 1.0 version
was released in 2006. A version 1.1 "pre-release" was released in October
2008. The second major release of OpenCV was in October 2009.
OpenCV 2 includes major changes to the C++ interface, aiming at easier,
more type-safe patterns, new functions, and better implementations for
existing ones in terms of performance (especially on multi-core systems).
Official releases now occur every six months and development is now
done by an independent Russian team supported by commercial
corporations.

In August 2012, support for OpenCV was taken over by a non-


profit foundation OpenCV.org, which maintains a developer and user site.
In May 2016, Intel signed an agreement to acquire Itseez, a leading developer of OpenCV. OpenCV (Open Source Computer Vision) is a library of programming functions mainly aimed at real-time computer
vision. Originally developed by Intel, it was later supported by Willow
Garage then Itseez (which was later acquired by Intel). The library is
cross-platform and free for use under the open-source BSD license. It has
C++, Python, Java and MATLAB interfaces and supports Windows,
Linux, Android and Mac OS. OpenCV leans mostly towards real-time
vision applications and takes advantage of MMX and
SSE instructions when available. Full-featured CUDA and OpenCL interfaces are being actively developed right now. There are over 500
algorithms and about 10 times as many functions that compose or support
those algorithms. OpenCV is written natively in C++ and has a templated
interface that works seamlessly with STL containers.

OpenCV's application areas include :


 2D and 3D feature toolkits
 Egomotion estimation
 Facial recognition system
 Gesture recognition
 Human–computer interaction (HCI)
 Mobile robotics
 Motion understanding
 Object identification
 Segmentation and recognition
 Stereopsis stereo vision: depth perception from 2 cameras
 Structure from motion (SFM)
 Motion tracking
 Augmented reality

To support some of the above areas, OpenCV includes a statistical


machine learning library that contains :

 Boosting Decision tree learning
 Gradient boosting trees
 Expectation-maximization algorithm
 k-nearest neighbor algorithm
 Naive Bayes classifier
 Artificial neural networks
 Random forest
 Support vector machine (SVM)
 Deep neural networks (DNN)

Libraries in OpenCV
Numpy:
NumPy is an acronym for "Numeric Python" or "Numerical Python". It is
an open source extension module for Python, which provides fast
precompiled functions for mathematical and numerical routines.
Furthermore, NumPy enriches the programming language Python with
powerful data structures for efficient computation of multi-dimensional
arrays and matrices. The implementation is designed to handle even very
large matrices and arrays. Besides that, the module supplies a large
library of high-level mathematical functions to operate on these matrices
and arrays.

It is the fundamental package for scientific computing with Python. It
contains various features, including these important ones:
 A powerful N-dimensional array object
 Sophisticated (broadcasting) functions
 Tools for integrating C/C++ and Fortran code
 Useful linear algebra, Fourier Transform, and random number
capabilities.

Numpy Array:
A numpy array is a grid of values, all of the same type, and is indexed by
a tuple of nonnegative integers. The number of dimensions is the rank of
the array; the shape of an array is a tuple of integers giving the size of the
array along each dimension.
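As a minimal illustration of these properties (rank, shape, and indexing by a tuple of non-negative integers), the sketch below creates a small rank-2 array; the values themselves are arbitrary:

import numpy as np

# A rank-2 array (matrix) holding values of a single type
a = np.array([[1, 2, 3],
              [4, 5, 6]], dtype=np.uint8)

print(a.ndim)    # 2      -> the rank (number of dimensions)
print(a.shape)   # (2, 3) -> size along each dimension
print(a[1, 2])   # 6      -> indexed by a tuple of non-negative integers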
SciPy:
SciPy (Scientific Python) is often mentioned in the same breath with
NumPy. SciPy extends the capabilities of NumPy with further useful
functions for minimization, regression, Fourier
transformation and many others. NumPy is based on two earlier Python
modules dealing with arrays. One of these is Numeric. Numeric is, like
NumPy, a Python module for high-performance numeric computing, but it is
obsolete nowadays. Another predecessor of NumPy is Numarray, which is a
complete rewrite of Numeric but is deprecated as well. NumPy is a merger
of the two, i.e. it is built on the code of Numeric and the features of
Numarray.
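As a small example of the kind of functionality SciPy layers on top of NumPy, the sketch below minimizes a simple quadratic with scipy.optimize.minimize; the objective function and starting point are arbitrary choices for illustration:

import numpy as np
from scipy import optimize

# Minimize f(x) = (x - 3)^2 starting from x = 0
result = optimize.minimize(lambda x: (x - 3.0) ** 2, x0=np.array([0.0]))

print(result.x)   # approximately [3.]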

The Python Alternative to MATLAB:

Python in combination with NumPy, SciPy and Matplotlib can be used as a
replacement for MATLAB. The combination of NumPy, SciPy and
Matplotlib is a free (meaning both "free" as in "free beer" and "free" as in
"freedom") alternative to MATLAB. Even though MATLAB has a huge
number of additional toolboxes available, NumPy has the advantage that
Python is a more modern and complete programming language and - as
we have said already before - is open source. SciPy adds even more
MATLAB-like functionalities to Python. Python is rounded out in the
direction of MATLAB with the module Matplotlib, which provides
MATLAB-like plotting functionality.
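A short sketch of this MATLAB-like plotting style, using only standard NumPy and Matplotlib calls (the plotted function is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

# MATLAB-like plotting through the pyplot state-machine interface
x = np.linspace(0, 2 * np.pi, 200)
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.ylabel("amplitude")
plt.legend()
plt.show()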

Haar Cascade Classifier in OpenCV
The algorithm needs a lot of positive images (images of faces) and
negative images (images without faces) to train the classifier. Then we
need to extract features from them. For this, the Haar features shown in
the image below are used. They are just like our convolutional kernel.
Each feature is a single value obtained by subtracting the sum of pixels
under the white rectangle from the sum of pixels under the black
rectangle. All possible sizes and locations of each kernel are then used
to calculate plenty of features (just imagine how much computation this
needs: even a 24x24 window results in over 160,000 features). For each
feature calculation, we need to find the sum of pixels under the white and
black rectangles. To solve this, the authors introduced the integral
image, which simplifies the calculation of the sum of pixels, however
large the region, to an operation involving just four values. This makes
the computation very fast.

The final classifier is a weighted sum of these weak classifiers. They are
called weak because each one alone cannot classify the image, but together
they form a strong classifier. The paper says that even 200 features
provide detection with 95% accuracy; the final setup had around 6000
features (a reduction from 160,000+ features to 6000, which is a big
gain). So now, for an image, we take each 24x24 window and apply the 6000
features to it to check whether it is a face or not. Isn't that a little
inefficient and time consuming? Yes, it is, and the authors have a good
solution for it. In an image, most of the image region is non-face region,
so it is a better idea to have a simple method to check whether a window
is not a face region. If it is not, discard it in a single shot and do not
process it again; instead, focus on regions where there can be a face.
This way, we spend more time checking possible face regions.
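The following sketch shows how such a trained cascade is typically applied through OpenCV's Python interface; it assumes the pre-trained frontal-face cascade file shipped with the opencv-python package, and the input file name is a placeholder:

import cv2

# Load a pre-trained frontal-face cascade bundled with opencv-python
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("sample.jpg")                       # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Slide the trained cascade over the image at multiple scales
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Draw a rectangle around each accepted window
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detected.jpg", img)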

In the detection phase of the Viola–Jones object detection
framework, a window of the target size is moved over the input image,
and for each subsection of the image the Haar-like feature is calculated.
This difference is then compared to a learned threshold that separates
non-objects from objects. Because such a Haar-like feature is only a weak
learner or classifier (its detection quality is slightly better than random
guessing) a large number of Haar-like features are necessary to describe
an object with sufficient accuracy. In the Viola–Jones object detection
framework, the Haar-like features are therefore organized in something
called a classifier cascade to form a strong learner or classifier. The key
advantage of a Haar-like feature over most other features is its calculation
speed. Due to the use of integral images, a Haar-like feature of any size
can be calculated in constant time (approximately 60 microprocessor
instructions for a 2-rectangle feature).

Rectangular Haar-like features


A simple rectangular Haar-like feature can be defined as the
difference of the sum of pixels of areas inside the rectangle, which can be
at any position and scale within the original image. This modified feature
set is called a 2-rectangle feature. Viola and Jones also defined
3-rectangle features and 4-rectangle features. The values indicate certain
characteristics of a particular area of the image. Each feature type can
indicate the existence (or absence) of certain characteristics in the image,
such as edges or changes in texture. For example, a 2-rectangle feature
can indicate where the border lies between a dark region and a light
region.

Fast Computation of Haar-like features:
One of the contributions of Viola and Jones was to use summed-
area tables, which they called integral images. Integral images can be
defined as two-dimensional lookup tables in the form of a matrix with the
same size of the original image. Each element of the integral image
contains the sum of all pixels located on the up-left region of the original
image (in relation to the element's position).

Sum = I(C) + I(A) – I(B) – I(D)

Fig: 5.1 2-Rectangle feature
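A minimal NumPy sketch of this four-corner lookup is given below; the corner labelling follows the equation above, the toy image values are arbitrary, and the indexing works for interior rectangles (those not touching the top or left border):

import numpy as np

def integral_image(img):
    # Each entry holds the sum of all pixels above and to the left (inclusive)
    return img.cumsum(axis=0).cumsum(axis=1)

img = np.arange(16, dtype=np.int64).reshape(4, 4)   # toy 4x4 "image"
ii = integral_image(img)

# Sum over the rectangle covering rows 1..2 and cols 1..3 (inclusive),
# using the four-corner rule Sum = I(C) + I(A) - I(B) - I(D)
r0, r1, c0, c1 = 1, 2, 1, 3
A = ii[r0 - 1, c0 - 1]   # just above and left of the rectangle
B = ii[r0 - 1, c1]       # just above its right edge
C = ii[r1, c1]           # bottom-right corner
D = ii[r1, c0 - 1]       # just left of its bottom edge
assert C + A - B - D == img[r0:r1 + 1, c0:c1 + 1].sum()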

OpenCV has a modular structure, which means that the package includes
several shared or static libraries. The following modules are available:
 Core functionality (core) - a compact module defining basic data
structures, including the dense multi-dimensional array Mat and basic
functions used by all other modules.
 Image Processing (imgproc) - an image processing module that includes
linear and non-linear image filtering, geometrical image transformations
(resize, affine and perspective warping, generic table-based remapping),
color space conversion, histograms, and so on.
 Video Analysis (video) - a video analysis module that includes motion
estimation, background subtraction, and object tracking algorithms.
 Camera Calibration and 3D Reconstruction (calib3d) - basic
multiple-view geometry algorithms, single and stereo camera
calibration, object pose estimation, stereo correspondence algorithms,
and elements of 3D reconstruction.
 2D Features Framework (features2d) - salient feature detectors,
descriptors, and descriptor matchers.
 Object Detection (objdetect) - detection of objects and instances of
the predefined classes (for example, faces, eyes, mugs, people, cars,
and so on).
 High-level GUI (highgui) - an easy-to-use interface to simple UI
capabilities.
 Video I/O (videoio) - an easy-to-use interface to video capturing
and video codecs.
 Some other helper modules, such as FLANN and Google test
wrappers, Python bindings, and others.

CHAPTER 6
RESULTS AND DISCUSSIONS

INTRODUCTION TO IMPLEMENTATION OF PROBLEM


The Model
Deep learning is a popular technique used in computer vision. We chose
Convolutional Neural Network (CNN) layers as building blocks to create
our model architecture. CNNs are known to imitate how the human brain
works when analyzing visuals.

A typical architecture of a convolutional neural network contains an input
layer, some convolutional layers, some dense layers (also called
fully-connected layers), and an output layer. These are linearly stacked
layers ordered in sequence.
Input Layer
The input layer has pre-determined, fixed dimensions, so the image must
be pre-processed before it can be fed into the layer. We used OpenCV, a
computer vision library, for object detection in the video. OpenCV
provides pre-trained filters and uses AdaBoost to quickly find and crop
the object. The cropped object is then converted to grayscale using
cv2.cvtColor and resized to 48-by-48 pixels with cv2.resize. This step
greatly reduces the dimensions compared to the original RGB format with
three colour dimensions (3, 48, 48). The pipeline ensures every image can
be fed into the input layer as a (1, 48, 48) numpy array.
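A sketch of this pre-processing step is shown below; the helper name is illustrative, and the scaling to the [0, 1] range is an assumed (but common) normalization not spelled out above:

import cv2
import numpy as np

def preprocess(crop_bgr):
    """Convert a cropped BGR detection into the (1, 48, 48) input format."""
    gray = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2GRAY)   # drop the three colour channels
    small = cv2.resize(gray, (48, 48))                  # match the fixed input dimensions
    arr = small.astype(np.float32) / 255.0              # assumed normalization to [0, 1]
    return arr.reshape(1, 48, 48)                       # (1, 48, 48) numpy array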

Convolutional Layers
The numpy array is passed into the Convolution2D layer, where we specify
the number of filters as one of the hyperparameters. The set of filters is
unique, with randomly generated weights. Each filter, with a (3, 3)
receptive field, slides across the original image with shared weights to
create a feature map.

Convolution generates feature maps that represent how pixel values are
enhanced, for example for edge and pattern detection. A feature map is
created by applying filter 1 across the entire image. Other filters are
applied one after another, creating a set of feature maps.

Pooling is a dimension reduction technique usually applied after one or
several convolutional layers. It is an important step when building CNNs,
as adding more convolutional layers can greatly affect computational time.
We used a popular pooling method called MaxPooling2D that slides (2, 2)
windows across the feature map, keeping only the maximum pixel value. The
pooled pixels form an image with the number of pixels reduced by a factor
of 4.
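The sketch below wires one such convolution and pooling stage together in Keras; the filter count (32) and the channels-last input shape (48, 48, 1) are assumptions made for illustration, not values fixed by the text above:

from tensorflow.keras import Input, Model, layers

# One convolution + pooling stage: 32 filters with a (3, 3) receptive field,
# followed by (2, 2) max-pooling that keeps only the maximum value per window.
inp = Input(shape=(48, 48, 1))
x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inp)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)     # 48x48 feature maps -> 24x24

print(Model(inp, x).output_shape)                # (None, 24, 24, 32)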

Dense Layers
The dense layer (also called a fully-connected layer) is inspired by the
way neurons transmit signals through the brain. It takes a large number of
input features and transforms them through layers connected with trainable
weights.

These weights are trained by forward propagation of training data and then
backward propagation of its errors. Back propagation starts by evaluating
the difference between the prediction and the true value, and
back-calculates the weight adjustment needed in every preceding layer. We
can control the training speed and the complexity of the architecture by
tuning hyperparameters such as the learning rate and network density.

As we feed in more data, the network is able to gradually make adjustments
until errors are minimized. Essentially, the more layers/nodes we add to
the network, the better it can pick up signals.

Fig: 6.1

As good as this may sound, the model also becomes increasingly prone to
overfitting the training data. One method to prevent overfitting and
generalize to unseen data is to apply dropout. Dropout randomly selects a
portion (usually less than 50%) of nodes and sets their weights to zero
during training. This method can effectively control the model's
sensitivity to noise during training while maintaining the necessary
complexity of the architecture.

Output Layer
The output layer in a CNN, as mentioned previously, is a fully connected
layer, where the input from the preceding layers is flattened and
transformed so that the output corresponds to the number of classes
desired by the network.
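Putting the pieces together, a minimal Keras sketch of the stacked architecture described in this chapter might look as follows; the filter counts, dense-layer width, dropout rate and number of output classes are assumptions for illustration, not the exact values used in the project:

from tensorflow.keras import layers, models

NUM_CLASSES = 7   # hypothetical number of target classes

model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),                 # pre-processed grayscale input
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                # flatten feature maps for the dense layers
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),                             # zero out a portion of nodes during training
    layers.Dense(NUM_CLASSES, activation="softmax"), # output layer: one score per class
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()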

RESULTS

Input and output frames of the object detector.

CONCLUSION:
Deep learning based object detection has been a research hotspot in recent
years. This project starts from generic object detection pipelines, which
provide base architectures for other related tasks. With their help, three
other common tasks, namely object detection, face detection and pedestrian
detection, can be accomplished. This was achieved by combining two things:
object detection with deep learning and OpenCV, and efficient, threaded
video streams with OpenCV. Camera sensor noise and lighting conditions can
change the result, as they can create problems in recognizing the object.
The end result is a deep learning based object detector that can process
around 6-8 FPS.

REFERENCES:

Bruckner, Daniel. "Ml-o-scope: A Diagnostic Visualization System for Deep
Machine Learning Pipelines." Technical Report No. UCB/EECS-2014-99,
University of California, Berkeley, Department of Electrical Engineering
and Computer Sciences, 2014.

Saleh, Imad, Mehdi Ammi, and Samuel Szoniecky, eds. Challenges of the
Internet of Things: Technique, Use, Ethics. John Wiley & Sons, 2018.

Petrov, Yordan. Improving Object Detection by Exploiting Semantic
Relations Between Objects. MS thesis, Universitat Politècnica de
Catalunya, 2017.

Nikouei, Seyed Yahya, et al. "Intelligent Surveillance as an Edge Network
Service: From Harr-Cascade, SVM to a Lightweight CNN." arXiv preprint
arXiv:1805.00331 (2018).

Thakar, Kartikey, et al. "Implementation and Analysis of Template Matching
for Image Registration on DevKit-8500D." Optik - International Journal for
Light and Electron Optics 130 (2017): 935-944.

Bradski, Gary, and Adrian Kaehler. Learning OpenCV: Computer Vision with
the OpenCV Library. O'Reilly Media, Inc., 2008.

Howard, Andrew G., et al. "MobileNets: Efficient Convolutional Neural
Networks for Mobile Vision Applications." arXiv preprint arXiv:1704.04861
(2017).

Kong, Tao, et al. "RON: Reverse Connection with Objectness Prior Networks
for Object Detection." 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), IEEE, 2017.

Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." European Conference
on Computer Vision, Springer, Cham, 2016.

Veiga, Francisco José Lopes. "Image Processing for Detection of Vehicles
in Motion." 2018.

Zhang, Huaizheng, Han Hu, Guanyu Gao, Yonggang Wen, and Kyle Guan.
"DeepQoE: A Unified Framework for Learning to Predict Video QoE." 2018
IEEE International Conference on Multimedia and Expo (ICME), pp. 1-6,
2018.

Tang, Shijian, and Ye Yuan. "Object Detection Based on Convolutional
Neural Network."

Manikandan, R. P. S., and A. M. Kalpana. "A Study on Feature Selection in
Big Data." 2017 International Conference on Computer Communication and
Informatics (ICCCI), pp. 1-5, 2017.

Warde-Farley, David. "Feedforward Deep Architectures for Classification
and Synthesis." 2018.

Singh, Shilpi, et al. "An Analytic Approach for 3D Shape Descriptor for
Face Recognition." International Journal of Electrical, Electronics,
Computer Science & Engineering (IJEECSE), Special Issue - ICSCAAIT-2018,
E-ISSN: 2348-2273, P-ISSN: 2454-1222, pp. 138-140.