M.E. REPORT
A PROJECT REPORT
Submitted by
S. DHANALAKSHMI (730923401002)
KOMARAPALAYAM – 637303
MAY-2025
BONAFIDE CERTIFICATE
SIGNATURE
Dr. R. Dinesh, M.E., Ph.D.,
HEAD OF THE DEPARTMENT
Professor, Department of Electronics and Communication Engineering (ECE),
Excel Engineering College,
Komarapalayam – 637303.

SIGNATURE
Dr. T. C. Kalaiselvi, M.E., Ph.D.,
ASSOCIATE PROFESSOR
Department of Electronics and Communication Engineering (ECE),
Excel Engineering College,
Komarapalayam – 637303.
TABLE OF CONTENTS
LIST OF FIGURES iv
LIST OF ABBREVIATIONS v
1 INTRODUCTION 1
3.4 Threshold Function 15
LIST OF FIGURES
4.5 Algorithm 26
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
1.2 MOTIVATION
Blind people lead a normal life with their own style of doing things, but they definitely face trouble due to inaccessible infrastructure and social challenges. The biggest challenge for a blind person, especially one with complete loss of vision, is to navigate around places. Blind people can move easily around their own house without any help, because they know the position of everything in it; however, they have a tough time finding objects in unfamiliar surroundings. So we decided to build a REAL-TIME OBJECT DETECTION system. We became interested in this project after going through a few papers in this area. As a result, we are highly motivated to develop a system that recognizes objects in a real-time environment.
1.3 OBJECT DETECTION
Object detection from video is a major task in video surveillance applications these days. Object detection techniques are used to identify required objects in video sequences and to cluster the pixels of these objects. The detection of an object in a video sequence plays a major role in several applications, especially video surveillance.
Object detection in a video stream can be done by processes like pre-processing, segmentation, foreground and background extraction, and feature extraction.
Humans can easily detect and identify objects present in an
image. The human visual system is fast and accurate and can perform
complex tasks like identifying multiple objects with little conscious
thought. With the availability of large amounts of data, faster GPUs, and
better algorithms, we can now easily train computers to detect and classify
multiple objects within an image with high accuracy.
WHAT IS AN IMAGE?
An image is represented as a two-dimensional function f(x, y), where x and y are spatial coordinates, and the amplitude of f at any pair of coordinates (x, y) is called the intensity of the image at that point.
Processing on images:
Processing on an image can be of three types: low-level, mid-level and high-level.
Low-level Processing:
Preprocessing to remove noise.
Contrast enhancement.
Image sharpening.
Why Image Processing?
Since the digital image is invisible until displayed, it must be prepared for viewing on one or more output devices (laser printer, monitor, etc.). The digital image can be optimized for the application by enhancing the appearance of the structures within it.
Pixel:
A pixel is the smallest element of an image. Each pixel corresponds to a single value. In an 8-bit grayscale image, the value of a pixel lies between 0 and 255. Each pixel stores a value proportional to the light intensity at that particular location. Pixel density is indicated in either pixels per inch or dots per inch.
Resolution:
Resolution can be defined in many ways, such as pixel resolution, spatial resolution, temporal resolution and spectral resolution. In pixel resolution, the term refers to the total number of pixels in a digital image. For example, if an image has M rows and N columns, then its resolution can be defined as M × N. The higher the pixel resolution, the higher the quality of the image.
RELATED TECHNOLOGY:
R-CNN
R-CNN is a progressive visual object detection system that combines bottom-up region proposals with rich features computed by a convolutional neural network. R-CNN uses region proposal methods to first generate potential bounding boxes in an image and then runs a classifier on these proposed boxes.
SINGLE SHOT MULTIBOX DETECTOR
SSD discretizes the output space of bounding boxes into a set
of default boxes over different aspect ratios and scales per feature map
location. At the time of prediction the network generates scores for the
presence of each object category in each default box and generates
adjustments to the box to better match the object shape.
Additionally, the network combines predictions from multiple feature
maps with different resolutions to naturally handle objects of various
sizes.
ALEXNET
AlexNet is a convolutional neural network used for classification, which has 5 convolutional layers, 3 fully connected layers and 1 softmax layer with 1000 outputs in its architecture.
YOLO
YOLO is a real-time object detection system. It applies a single neural network to the complete image, dividing the image into regions and predicting bounding boxes and probabilities for every region. These bounding boxes are weighted by the predicted probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
VGG
The VGG network is another convolutional neural network architecture used for image classification.
MOBILENETS
MobileNets are used to build lightweight deep neural networks. They are based on a streamlined architecture that uses depth-wise separable convolutions. MobileNet uses 3×3 depth-wise separable convolutions, which require 8 to 9 times less computation than standard convolutions at only a small reduction in accuracy. Applications and use cases include object detection, fine-grained classification, face attributes and large-scale geo-localization.
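The computation saving can be checked with a rough back-of-the-envelope calculation using the standard cost formulas for convolutions; the kernel size, channel counts and feature-map size below are illustrative assumptions, not values from this report:

    # Cost of a standard convolution: DK*DK*M*N*DF*DF multiply-adds.
    # Cost of a depthwise separable one: DK*DK*M*DF*DF + M*N*DF*DF.
    DK, M, N, DF = 3, 256, 256, 32  # 3x3 kernel, 256 in/out channels, 32x32 map

    standard = DK * DK * M * N * DF * DF
    separable = DK * DK * M * DF * DF + M * N * DF * DF

    print("Reduction: %.1fx" % (standard / separable))  # about 8.7x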
TENSOR FLOW
TensorFlow is an open source software library for high performance numerical computation. Its versatile design allows simple deployment of computation across a range of platforms (CPUs, GPUs, TPUs), from desktops to clusters of servers to mobile and edge devices. TensorFlow was designed and developed by researchers and engineers from the Google Brain team within Google's AI organization. It comes with strong support for machine learning and deep learning, and its versatile numerical computation core is used across many other scientific domains. TensorFlow makes it easy to construct, train and deploy object detection models, and it provides a collection of detection models pre-trained on the COCO dataset, the KITTI dataset, and the Open Images dataset. One of the many detection models is the combination of the Single Shot Detector (SSD) and MobileNets architecture, which is fast, efficient and does not need huge computational capability to accomplish object detection.
APPLICATION OF OBJECT DETECTION
The major applications of Object Detection are:
FACIAL RECOGNITION
“DeepFace” is a deep learning facial recognition system developed to identify human faces in a digital image. It was designed and developed by a group of researchers at Facebook. Google also has its own facial recognition system in Google Photos, which automatically separates all the photos according to the person in the image. Facial recognition involves various components, or one could say it focuses on various aspects, like the eyes, nose, mouth and eyebrows, for recognizing a face.
PEOPLE COUNTING
People counting is also a part of object detection, and it can be used for various purposes such as finding a person or a criminal; it is also used for analysing store performance or crowd statistics during festivals. This process is considered difficult, as people move out of the frame quickly.
CHAPTER-2
LITERATURE REVIEW
2.4. EfficientDet - Tan et al. (2020)
CHAPTER 3
DEEP LEARNING
INTRODUCTION
Deep learning is a machine learning technique. It teaches a
computer to filter inputs through layers to learn how to predict and
classify information. Observations can be in the form of images, text, or
sound. The inspiration for deep learning is the way that the human brain
filters information. Its purpose is to mimic how the human brain works to
create some real magic. In the human brain, there are about 100 billion
neurons. Each neuron connects to about 100,000 of its neighbors. We’re
kind of recreating that, but in a way and at a level that works for
machines. In our brains, a neuron has a body, dendrites, and an axon. The
signal from one neuron travels down the axon and transfers to the
dendrites of the next neuron. That connection where the signal passes is
called a synapse. Neurons by themselves are kind of useless. But when you have lots of them, they work together to create some serious magic. That's the idea behind a deep learning algorithm! You get input from an observation, and you put your input into one layer. That layer creates an output, which in turn becomes the input for the next layer, and so on. This happens over and over until your final output signal. The neuron (node) gets a signal or signals (input values), which pass through the neuron. That neuron delivers the output signal.
Think of the input layer as your senses: the things you see, smell,
and feel, for example. These are independent variables for one single
observation. This information is broken down into numbers and the bits
of binary data that a computer can use. You’ll need to either standardize
or normalize these variables so that they're within the same range. Deep learning networks use many layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output of the previous layer as its input. What the layers learn forms a hierarchy of concepts: each level learns to transform its input data into a more and more abstract and composite representation. That means that for an image, for example, the input might be a matrix of pixels. The first layer might encode the edges and compose the pixels. The next layer might compose an arrangement of edges. The next layer might encode a nose and eyes. The next layer might recognize that the image contains a face, and so on.
Input data passes into a layer where calculations are performed. Each
processing element computes based upon the weighted sum of its inputs.
The new values become the new input values that feed the next layer
(feed-forward). This continues through all the layers and determines the
output. Feedforward networks are often used in, for example, data mining.
Weighted Sum
Inputs to a neuron can either be features from a training set or
outputs from the neurons of a previous layer. Each connection between
two neurons has a unique synapse with a unique weight attached. If you
want to get from one neuron to the next, you have to travel along the
synapse and pay the “toll” (weight). The neuron then applies an activation
function to the sum of the weighted inputs from each incoming synapse.
It passes the result on to all the neurons in the next layer. When we talk
about updating weights in a network, we’re talking about adjusting the
weights on these synapses.
If there are 3 inputs or neurons in the previous layer, each neuron in the current layer will have 3 distinct weights: one for each synapse.
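As a minimal sketch of this weighted sum (the input and weight values are made up for illustration):

    import numpy as np

    inputs = np.array([0.5, 0.3, 0.2])    # signals from the previous layer
    weights = np.array([0.4, 0.7, -0.2])  # one weight per synapse
    bias = 0.1

    z = np.dot(inputs, weights) + bias    # the weighted sum fed to the activation
    print(z)                              # 0.47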
Threshold function
This is a step function. If the summed value of the inputs is below a certain threshold, the function passes on 0. If it's equal to or more than the threshold, it passes on 1. It's a very rigid, straightforward, yes-or-no function.
Sigmoid function
This function is used in logistic regression. Unlike the threshold function, it's a smooth, gradual progression from 0 to 1. It's especially useful in the output layer, where we are predicting probabilities.
Rectifier function
This might be the most popular activation function in the universe of neural networks. It's efficient and biologically plausible. Even though it has a kink at 0, it's smooth and gradual after the kink. This means, for example, that your output would be either "no" or a percentage of "yes." This function doesn't require normalization or other complicated calculations.
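A minimal NumPy sketch of the three activation functions described above:

    import numpy as np

    def threshold(z):                   # rigid yes-or-no step function
        return np.where(z >= 0, 1, 0)

    def sigmoid(z):                     # smooth progression from 0 to 1
        return 1.0 / (1.0 + np.exp(-z))

    def relu(z):                        # rectifier: zero below the kink at 0
        return np.maximum(0.0, z)

    z = np.array([-2.0, 0.0, 2.0])
    print(threshold(z), sigmoid(z), relu(z))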
CHAPTER 4
CONVOLUTIONAL NEURAL NETWORKS
Fig: 4.1
An artificial neural network consists of connected nodes (neurons) that take input data and perform simple operations on the data. The result of these operations is passed to other neurons. The output at each node is called its activation or node value. Each link is associated with a weight. ANNs are capable of learning, which takes place by altering the weight values.
Neural network:
A neural network is a network or circuit of neurons or, in a modern sense, an artificial neural network composed of artificial neurons or nodes. Thus a neural network is either a biological neural network, made up of real biological neurons, or an artificial neural network for solving artificial intelligence (AI) problems. The connections of the biological neuron are modeled as weights. A positive weight reflects an excitatory connection, while negative values mean inhibitory connections. All inputs are modified by a weight and summed. This activity is referred to as a linear combination. Finally, an activation function controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or it could be -1 and 1. Such a network can derive conclusions from a complex and seemingly unrelated set of information.
A deep neural network (DNN) is an artificial neural network
(ANN) with multiple layers between the input and output layers. The
DNN finds the correct mathematical manipulation to turn the input into
the output, whether it be a linear relationship or a non-linear relationship.
The MNIST dataset of handwritten digits is suited to most forms of ANN due to its relatively small image dimensionality of just 28 × 28. With this dataset, a single neuron in the first hidden layer will contain 784 weights (28 × 28 × 1, where the 1 reflects that MNIST is normalised to black and white values), which is manageable for most forms of ANN. If you consider a more substantial coloured image input of 64 × 64, the number of weights on just a single neuron of the first layer increases substantially to 12,288. Also take into account that, to deal with this scale of input, the network will need to be a lot larger than one used to classify colour-normalised MNIST digits; then you will understand the drawbacks of using such models.
CNN ARCHITECTURE:
Finally, the last fully connected layer outputs the class label. Despite this
being the most popular base architecture found in the literature, several
architecture changes have been proposed in recent years with the
objective of improving image classification accuracy or reducing
computation costs. For the remainder of this section, we briefly introduce the standard CNN architecture.
Fig: 4.3
OVERALL ARCHITECTURE:
CNNs are comprised of three types of layers: convolutional layers, pooling layers and fully connected layers. When these layers are stacked, a CNN architecture has been formed. A simplified CNN architecture for MNIST classification is illustrated in the figure above (Fig. 4.3: a simple CNN architecture, comprised of just five layers: input, convolution with ReLu, pooling, fully-connected with ReLu, and fully-connected output).
The basic functionality of the example CNN above can be broken down into four key areas.
1. As found in other forms of ANN, the input layer will hold the pixel values of the image.
2. The convolutional layer will determine the output of neurons which are connected to local regions of the input, through the calculation of the scalar product between their weights and the region connected to the input volume. The rectified linear unit (commonly shortened to ReLu) aims to apply an elementwise activation function such as sigmoid to the output of the activation produced by the previous layer.
3. The pooling layer will then simply perform downsampling along the spatial dimensionality of the given input, further reducing the number of parameters within that activation.
4. The fully-connected layers will then perform the same duties found in standard ANNs and attempt to produce class scores from the activations, to be used for classification. It is also suggested that ReLu may be used between these layers to improve performance.
Through this simple method of transformation, CNNs are able to transform the original input layer by layer using convolutional and downsampling techniques to produce class scores for classification and regression purposes. However, it is important to note that simply understanding the overall architecture of a CNN will not suffice. The creation and optimisation of these models can take quite some time, and can be quite confusing. We will now explore in detail the individual layers, detailing their hyperparameters and connectivities.
CONVOLUTIONAL LAYERS:
The convolutional layers serve as feature extractors, and thus
they learn the feature representations of their input images. The neurons
in the convolutional layers are arranged into feature maps. Each neuron in
a feature map has a receptive field, which is connected to a neighborhood
of neurons in the previous layer via a set of trainable weights, sometimes
referred to as a filter bank. Inputs are convolved with the learned weights
in order to compute a new feature map, and the convolved results are sent
through a nonlinear activation function.
All neurons within a feature map have weights that are
constrained to be equal; however, different feature maps within the same
convolutional layer have different weights so that several features can be
extracted at each location. As the name implies, the convolutional layer
plays a vital role in how CNNs operate. The layer's parameters centre around the use of learnable kernels. These kernels are usually small in spatial dimensionality, but spread along the entirety of the depth of the input. When the data hits a convolutional layer, the layer convolves each filter across the spatial dimensionality of the input to produce a 2D activation map. These activation maps can be visualised. As we glide through the input, the scalar product is calculated for each value in the kernel. From this, the network will learn kernels that 'fire' when they see a specific feature at a given spatial position of the input. These are commonly known as activations.
The centre element of the kernel is placed over the input vector, which is then calculated and replaced with a weighted sum of itself and any nearby pixels. Every kernel will have a corresponding activation map, and these are stacked along the depth dimension to form the full output volume from the convolutional layer. As we alluded to earlier, training ANNs on inputs such as images results in models which are too big to train effectively. This comes down to the fully connected manner of standard ANN neurons, so to mitigate against this, every neuron in a convolutional layer is only connected to a small region of the input volume. The dimensionality of this region is commonly referred to as the receptive field size of the neuron. The magnitude of the connectivity through the depth is nearly always equal to the depth of the input.
For example, if the input to the network is an image of size 64 × 64 × 3 (an RGB-coloured image with a dimensionality of 64 × 64) and we set the receptive field size as 6 × 6, we would have a total of 108 weights on each neuron within the convolutional layer (6 × 6 × 3, where 3 is the magnitude of connectivity across the depth of the volume). To put this into perspective, a standard neuron seen in other forms of ANN would contain 12,288 weights each. Convolutional layers are also able to significantly reduce the complexity of the model through the optimisation of their output. This output is optimised through three hyperparameters: the depth, the stride, and the zero-padding.
Adjusting these hyperparameters reduces the spatial dimensionality and complexity of the network, but it can also significantly reduce the capabilities of the model.
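The interaction of these hyperparameters with the output size can be sketched with the usual formula (W - F + 2P) / S + 1, where W is the input size, F the receptive field, P the zero-padding and S the stride; the values below are illustrative:

    def conv_output_size(w, f, stride, padding):
        # Spatial size of the output volume: (W - F + 2P) / S + 1.
        return (w - f + 2 * padding) // stride + 1

    # 64x64 input with a 6x6 receptive field (as in the example above):
    print(conv_output_size(64, 6, stride=1, padding=0))  # 59
    print(conv_output_size(64, 6, stride=2, padding=0))  # 30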
Instead of focusing on the entirety of the problem domain, knowledge about the specific type of input is exploited. This in turn allows for a much simpler network architecture to be set up.
This section has outlined the basic concepts of Convolutional Neural Networks, explaining the layers required to build one and detailing how best to structure the network for most image analysis tasks.
Training
CNNs, and ANNs in general, use learning algorithms to adjust
their free parameters in order to attain the desired network output. The
most common algorithm used for this purpose is backpropagation.
Backpropagation computes the gradient of an objective function to
determine how to adjust a network’s parameters in order to minimize
errors that affect performance. A commonly experienced problem with
training CNNs, and in particular DCNNs, is overfitting, which is poor
performance on a held-out test set after the network is trained on a small
or even large training set. This affects the model’s ability to generalize on
unseen data and is a major challenge for DCNNs that can be assuaged by
regularization.
Fig: 4.5
Caffe Model
Caffe is a deep learning framework that was used for the implementation of the object detection system; it provides the following:
• Expression: models and optimizations are defined as plaintext schemas in Caffe, unlike other frameworks which use code for this purpose.
• Speed: for research and industry alike, speed is crucial for state-of-the-art models and massive data [11].
• Modularity: flexibility and extensibility are majorly required for new tasks and different settings.
• Openness: common code, reference models, and reproducibility are the basic requirements of scientific and applied progress.
Types of Caffe Models
Open Pose
OpenPose is the first real-time multi-person system that can jointly detect human body, hand, and facial keypoints (130 keypoints in total) on single images.
Fully Convolutional Networks for Semantic Segmentation
Fully Convolutional Networks (FCNs) are the reference implementation of the models and code for the PAMI FCN and CVPR FCN papers.
Cnn-vis
Cnn-vis is an open-source tool that lets you use convolutional neural
networks to generate images. It takes inspiration from Google's Inceptionism blog post.
Speech Recognition
Speech Recognition with the caffe deep learning framework.
DeconvNet
Learning Deconvolution Network for Semantic Segmentation.
Other languages for Object Detection:
Object detection is a domain-specific variation of the machine learning prediction problem. Intel's OpenCV library, which is implemented in C/C++, has its interfaces available in a wide range of programming environments such as C#, Matlab, Octave, R, Python and so on. Python code is a better option than other languages for object detection because it is more compact and readable, and because Python offers:
Zero-based indexing.
Dictionary (hash) support.
Simple and elegant object-oriented programming.
Free and open source licensing.
Multiple functions can be packaged in one module.
More choices in graphics packages and toolsets.
Supervised learning also plays an important role.
The utility of unsupervised pre-training is usually evaluated on the basis of what performance is achieved after supervised fine-tuning. This report reviews and discusses the fundamentals of learning as well as supervised learning for classification models, and also talks about the mini-batch stochastic gradient descent algorithm that is used to fine-tune many of the models.
Object Classification in Moving Object Detection
Object classification works on shape, motion, color and texture. The classification can be done under various categories, like plants, objects, animals, humans, etc. The key concept of object classification is tracking objects and analysing their features.
Shape-Based
A mixture of image-based and scene-based object parameters, such as image blob (binary large object) area, the aspect ratio of the blob bounding box, and camera zoom, is given as input to this detection system. Classification is performed on the basis of the blob at each and every frame. The results are kept in a histogram.
Motion-Based
When a simple image with no objects in motion is given as input, this classification isn't required. In general, non-rigid articulated human motion shows a periodic property, so this has been used as a powerful cue for the classification of moving objects. Based on this useful cue, human motion is distinguished from the motion of other objects.
Color-Based
Although color alone is not a suitable measure for detecting and tracking objects, the low processing cost of color-based algorithms makes color a very good feature to exploit. As an example, the color-histogram-based technique is employed for the detection of vehicles in real time. A color histogram describes the color distribution in a given region and is robust against partial occlusions.
Texture-Based
The texture-based approaches, with the assistance of texture pattern recognition, work much like motion-based approaches. They provide higher accuracy by using overlapping local contrast normalisation, but may need more time, which can be improved using some fast techniques.
PROPOSED WORK
Authors have applied real-time object detection using deep learning and OpenCV to work with video streams and video files. This is accomplished using the efficient Open Computer Vision (OpenCV) library. The implementation of the proposed strategy includes a Caffe model based on Google image scenery; Caffe offers the model definitions, optimization settings and pre-trained weights [4].
Prerequisites include Python 3.7, OpenCV 4 and NumPy to complete this task of object detection. NumPy is the fundamental package for scientific computing with Python. It contains, among other things, a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, and useful linear algebra, Fourier transform, and random number capabilities. NumPy works in the backend to provide statistical information on the resemblance of an object to the image-scenery caffemodel database. Object clusters can be created according to the fuzzy value provided by NumPy. This project can detect live objects from videos and images, as sketched below.
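A minimal sketch of such a detection pipeline with OpenCV's dnn module is given below; the prototxt/caffemodel file names, the 300×300 input size and the scaling constants are assumptions based on the commonly distributed MobileNet-SSD release, not values taken from this report:

    import cv2
    import numpy as np

    # Load the (assumed) MobileNet-SSD Caffe model definition and weights.
    net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                                   "MobileNetSSD_deploy.caffemodel")

    frame = cv2.imread("input.jpg")   # or a frame grabbed from a video stream
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()        # shape (1, 1, N, 7): id, label, conf, box

    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        if confidence > 0.5:          # keep only confident detections
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            x1, y1, x2, y2 = box.astype(int)
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)

    cv2.imwrite("output.jpg", frame)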
Ordinary gradient descent is a simple algorithm in which we repeatedly take small steps downward on an error surface defined by a loss function of some parameters. For the purpose of ordinary gradient descent, we consider that the training data is rolled into the loss function. Stochastic gradient descent (SGD) operates on the basis of similar principles as ordinary gradient descent, but proceeds more quickly by estimating the gradient from just a few examples at a time instead of the complete training set. In its purest form, we use just one example at a time to estimate the gradient, as in the sketch below.
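A minimal sketch of that loop, fitting a single parameter with one example per step (the data points and the squared-error gradient are illustrative assumptions):

    import random

    def loss_grad(w, example):
        x, y = example
        return 2 * (w * x - y) * x      # gradient of the squared error (w*x - y)^2

    data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
    w, lr = 0.0, 0.01

    for epoch in range(100):
        random.shuffle(data)            # visit the examples in random order
        for example in data:
            w -= lr * loss_grad(w, example)  # step from just one example

    print(w)                            # approaches 2.0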
Caffe is a deep learning framework, or we can say a library, made with expression, speed and modularity in mind. It was built by Berkeley AI Research and created by Yangqing Jia. There are many deep learning and machine learning frameworks for computer vision, such as TensorFlow, Theano, Keras and SVM [2]. But the reason we implemented the system with Caffe is its expressive architecture: we can easily switch between CPU and GPU while training on a GPU machine, and models and optimization for our problem are defined by configuration without hard coding. It supports extensible code, since Caffe is an open source library: it has been forked by over twenty thousand developers on GitHub since its birth, and it offers a coding platform in extensible languages like Python and C++. The next reason is speed: for training neural networks, speed is the primary constraint. Caffe can process over a million images in a single day with a standard mid-range GPU, that is, milliseconds per image, whereas the same dataset of a million images can take weeks with Theano and Keras; Caffe is among the fastest convolutional neural network frameworks available. The last reason is community: as mentioned earlier, since it is an open source library, a huge amount of research is powered by Caffe, and every single day something new comes out of it.
CHAPTER 5
OPEN COMPUTER VISION
5.1 INTRODUCTION
OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library. The purpose of the creation of OpenCV was to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products [6]. It becomes very easy for businesses to utilize and modify the code with OpenCV, as it is a BSD-licensed product. It is a rich library, as it contains more than 2500 optimized algorithms, including a comprehensive set of both classic and state-of-the-art computer vision and machine learning algorithms. These algorithms can be used for various functions such as detecting and recognizing faces, identifying objects, classifying human actions in videos, tracking camera movements, tracking moving objects, extracting 3D models of objects, producing 3D point clouds from stereo cameras, stitching images together to produce a high-resolution image of a complete scene, finding similar images in an image database, removing red eyes from images taken with flash, following eye movements, and recognizing scenery and establishing markers to overlay it with augmented reality.
Among the main contributors to the project were a number of optimization experts in Intel Russia, as well as Intel's Performance Library Team. In the early days of OpenCV, the goals of the project were described as:
Advance vision research by providing not only open but also
optimized code for basic vision infrastructure. No more reinventing
the wheel.
Disseminate vision knowledge by providing a common infrastructure
that developers could build on, so that code would be more readily
readable and transferable.
Advance vision-based commercial applications by making portable,
performance-optimized code available for free – with a license that
did not require code to be open or free itself.
The first alpha version of OpenCV was released to the public at the IEEE
Conference on Computer Vision and Pattern Recognition in 2000, and
five betas were released between 2001 and 2005. The first 1.0 version
was released in 2006. A version 1.1 "pre-release" was released in October
2008. The second major release of OpenCV was in October 2009.
OpenCV 2 includes major changes to the C++ interface, aiming at easier,
more type-safe patterns, new functions, and better implementations for
existing ones in terms of performance (especially on multi-core systems).
Official releases now occur every six months and development is now
done by an independent Russian team supported by commercial
corporations.
OpenCV is a library of programming functions mainly aimed at real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage, then Itseez (which was later acquired by Intel). The library is cross-platform and free for use under the open-source BSD license. It has C++, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS. OpenCV leans mostly towards real-time vision applications and takes advantage of MMX and SSE instructions when available. Full-featured CUDA and OpenCL interfaces are being actively developed. There are over 500 algorithms and about 10 times as many functions that compose or support those algorithms. OpenCV is written natively in C++ and has a templated interface that works seamlessly with STL containers.
OpenCV includes a statistical machine learning library that contains:
Boosting
Decision tree learning
Gradient boosting trees
Expectation-maximization algorithm
k-nearest neighbor algorithm
Naive Bayes classifier
Artificial neural networks
Random forest
Support vector machine (SVM)
Deep neural networks (DNN)
Libraries in OpenCV
Numpy:
NumPy is an acronym for "Numeric Python" or "Numerical Python". It is
an open source extension module for Python, which provides fast
precompiled functions for mathematical and numerical routines.
Furthermore, NumPy enriches the programming language Python with
powerful data structures for efficient computation of multi-dimensional
arrays and matrices. The implementation is aimed even at huge matrices and arrays. Besides that, the module supplies a large library of high-level mathematical functions to operate on these matrices and arrays.
Numpy Array:
A numpy array is a grid of values, all of the same type, and is indexed by
a tuple of nonnegative integers. The number of dimensions is the rank of
the array; the shape of an array is a tuple of integers giving the size of the
array along each dimension.
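As a small illustration of rank and shape:

    import numpy as np

    a = np.array([[1, 2, 3],
                  [4, 5, 6]])  # a grid of values, all of the same type

    print(a.shape)             # (2, 3): the size along each dimension
    print(a.ndim)              # 2: the rank of the array
    print(a[1, 2])             # 6: indexed by a tuple of nonnegative integers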
SciPy:
SciPy (Scientific Python) is often mentioned in the same breath with
NumPy. SciPy extends the capabilities of NumPy with further useful
functions for minimization, regression, Fourier
transformation and many others. NumPy is based on two earlier Python modules dealing with arrays. One of these is Numeric. Numeric is, like NumPy, a Python module for high-performance numeric computing, but it is obsolete nowadays. Another predecessor of NumPy is Numarray, which is a complete rewrite of Numeric but is deprecated as well. NumPy is a merger of the two, i.e. it is built on the code of Numeric and the features of Numarray.
Haar Cascade Classifier in OpenCv
The algorithm needs a lot of positive images (images of faces) and negative images (images without faces) to train the classifier. Then we need to extract features from them. For this, Haar features are used. They are just like our convolutional kernel: each feature is a single value obtained by subtracting the sum of pixels under the white rectangle from the sum of pixels under the black rectangle. All possible sizes and locations of each kernel are then used to calculate plenty of features. (Just imagine how much computation it needs: even a 24x24 window results in over 160,000 features.) For each feature calculation, we need to find the sum of pixels under the white and black rectangles. To solve this, the authors introduced integral images, which simplify the calculation of the sum of pixels, however large the number of pixels may be, to an operation involving just four pixels. Nice, isn't it? It makes things super-fast.
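A minimal sketch of using one of the pre-trained Haar cascades that ship with OpenCV (the image file name and the detectMultiScale parameters are illustrative assumptions):

    import cv2

    # Load the frontal-face cascade bundled with opencv-python.
    path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    face_cascade = cv2.CascadeClassifier(path)

    img = cv2.imread("people.jpg")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # cascades work on grayscale

    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)

    cv2.imwrite("faces.jpg", img)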
In the detection phase of the Viola–Jones object detection
framework, a window of the target size is moved over the input image,
and for each subsection of the image the Haar-like feature is calculated.
This difference is then compared to a learned threshold that separates
non-objects from objects. Because such a Haar-like feature is only a weak
learner or classifier (its detection quality is slightly better than random
guessing) a large number of Haar-like features are necessary to describe
an object with sufficient accuracy. In the Viola–Jones object detection
framework, the Haar-like features are therefore organized in something
called a classifier cascade to form a strong learner or classifier. The key
advantage of a Haar-like feature over most other features is its calculation
speed. Due to the use of integral images, a Haar-like feature of any size
can be calculated in constant time (approximately 60 microprocessor
instructions for a 2-rectangle feature).
Fast Computation of Haar-like features:
One of the contributions of Viola and Jones was to use summed-area tables, which they called integral images. Integral images can be defined as two-dimensional lookup tables in the form of a matrix with the same size as the original image. Each element of the integral image contains the sum of all pixels located in the up-left region of the original image (in relation to the element's position).
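A minimal NumPy sketch of this idea: after two cumulative sums, the sum of any rectangular region needs only four lookups.

    import numpy as np

    img = np.arange(16, dtype=np.int64).reshape(4, 4)
    ii = img.cumsum(axis=0).cumsum(axis=1)  # integral image
    ii = np.pad(ii, ((1, 0), (1, 0)))       # zero row/column avoids edge cases

    def rect_sum(ii, r1, c1, r2, c2):
        # Sum of img[r1:r2, c1:c2] from just four elements of the integral image.
        return ii[r2, c2] - ii[r1, c2] - ii[r2, c1] + ii[r1, c1]

    print(rect_sum(ii, 1, 1, 3, 3))  # 30
    print(img[1:3, 1:3].sum())       # 30, computed directly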
OpenCV has a modular structure, which means that the package includes
several shared or static libraries. The following modules are available:
Core functionality (core) - a compact module defining basic data structures, including the dense multi-dimensional array Mat, and basic functions used by all other modules.
Image Processing (imgproc) - an image processing module that includes linear and non-linear image filtering, geometrical image transformations (resize, affine and perspective warping, generic table-based remapping), color space conversion, histograms, and so on.
Video Analysis (video) - a video analysis module that includes motion estimation, background subtraction, and object tracking algorithms.
Camera Calibration and 3D Reconstruction (calib3d) - basic multiple-view geometry algorithms, single and stereo camera calibration, object pose estimation, stereo correspondence algorithms, and elements of 3D reconstruction.
2D Features Framework (features2d) - salient feature detectors, descriptors, and descriptor matchers.
Object Detection (objdetect) - detection of objects and instances of predefined classes (for example, faces, eyes, mugs, people, cars, and so on).
High-level GUI (highgui) - an easy-to-use interface to simple UI capabilities.
Video I/O (videoio) - an easy-to-use interface to video capturing and video codecs.
Some other helper modules, such as FLANN and Google test wrappers, Python bindings, and others.
CHAPTER 6
RESULTS AND DISCUSSIONS
Convolutional Layers
The NumPy array gets passed into the Convolution2D layer, where we specify the number of filters as one of the hyperparameters. The set of filters is unique, with randomly generated weights. Each filter, with a (3, 3) receptive field, slides across the original image with shared weights to create a feature map.
Dense Layers
The dense layer (aka fully connected layer) is inspired by the way neurons transmit signals through the brain. It takes a large number of input features and transforms them through layers connected with trainable weights.
As we feed in more data, the network is able to gradually make adjustments until errors are minimized. Essentially, the more layers/nodes we add to the network, the better it can pick up signals.
Fig: 6.1
Output layer
The output layer in a CNN, as mentioned previously, is a fully connected layer, where the input from the other layers is flattened and transformed to produce the number of classes desired by the network, as in the sketch below.
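Putting the pieces of this chapter together, here is a minimal Keras sketch of such a network (the layer sizes, input shape and 10-class output are illustrative assumptions):

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),               # pixel values of the image
        layers.Conv2D(32, (3, 3), activation="relu"),  # filters with (3, 3) receptive fields
        layers.MaxPooling2D((2, 2)),                   # spatial downsampling
        layers.Flatten(),                              # flatten for the dense layers
        layers.Dense(64, activation="relu"),           # fully connected (dense) layer
        layers.Dense(10, activation="softmax"),        # output layer: one score per class
    ])
    model.compile(optimizer="sgd", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()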
RESULTS
(Sample input and output images)
CONCLUSION:
Deep learning based object detection has been a research hotspot in recent years. This project starts from generic object detection pipelines, which provide base architectures for other related tasks. With their help, three other common tasks, namely object detection, face detection and pedestrian detection, can be accomplished. The authors accomplished this by combining two things: object detection with deep learning and OpenCV, and efficient, threaded video streams with OpenCV. Camera sensor noise and lighting conditions can change the result, as they can create problems in recognizing the object. The end result is a deep learning based object detector that can process around 6-8 FPS.
REFERENCES:
Kong, Tao, et al. "RON: Reverse Connection with Objectness Prior Networks for Object Detection." 2017.
International Journal of Electrical, Electronics and Computer Science & Engineering (IJEECSE), Special Issue - ICSCAAIT-2018, E-ISSN: 2348-2273, P-ISSN: 2454-1222, pp. 138-140.