DL-UNIT_4
LeNet:
The LeNet architecture, developed by Yann LeCun and his colleagues in the late 1980s and early 1990s, is one of the earliest convolutional neural networks and has substantially influenced the field of deep learning, particularly image recognition. Originally designed to recognize handwritten and machine-printed characters, LeNet was a groundbreaking model at the time of its inception.
LeNet's significance in deep learning cannot be overstated. It was one of the first
demonstrations that convolutional neural networks (CNNs) could be successfully applied to
visual pattern recognition. LeNet introduced several key concepts that are now standard in CNN
architectures, including the use of multiple convolutional and pooling layers, local receptive
fields, shared weights, and the backpropagation algorithm for training the network.
These innovations have paved the way for the development of more complex and deeper
networks, which are the backbone of modern artificial intelligence systems in various
applications ranging from autonomous vehicles to medical diagnosis. The principles laid down
by LeNet have not only survived but have been expanded upon, leading to the development of
more sophisticated deep learning frameworks that continue to push the boundaries of what
machines can learn and achieve.
The development of LeNet was influenced by a series of advancements and the increasing
interest in neural networks during the late 1980s. Prior to LeNet, neural networks had primarily
been limited to fully connected architectures that lacked the ability to process spatial data
efficiently. The introduction of backpropagation in the 1980s by Rumelhart, Hinton, and
Williams provided a robust method for training deep neural networks, but these networks still
struggled with tasks like image recognition due to the high dimensionality and variability of
image data.
During this period, there was a significant interest in finding solutions that could effectively
reduce dimensionality and learn invariant features directly from the data. The concept of using
localized receptive fields, shared weights, and spatial hierarchies in neural networks was
inspired by studies of the visual cortex in animals, suggesting that these biological processes
could be mimicked to improve machine perception.
Late 1980s: Yann LeCun begins foundational work on convolutional neural networks at AT&T
Bell Labs, leading to the development of the initial LeNet models.
1989: The first iteration, LeNet-1, is introduced, employing backpropagation for training
convolutional layers.
1998: LeNet-5, the most notable version, is detailed in the seminal paper "Gradient-Based
Learning Applied to Document Recognition." This iteration is optimized for digit recognition and
demonstrates practical applications.
2000s: LeNet's success inspires further research and adaptations in various fields beyond digit
recognition, such as medical imaging and object recognition.
2010s and Beyond: LeNet's principles influence the development of more advanced CNN
architectures like AlexNet and ResNet, solidifying its legacy in the field of deep learning.
The LeNet architecture consists of several layers that progressively extract and condense information from input images. Here is a description of each layer of the LeNet architecture, followed by a code sketch of the full stack:
Input Layer: Accepts 32x32 pixel images, often zero-padded if original images are smaller.
First Convolutional Layer (C1): Consists of six 5x5 filters, producing six feature maps of 28x28
each.
First Pooling Layer (S2): Applies 2x2 average pooling, reducing feature maps' size to 14x14.
Second Convolutional Layer (C3): Uses sixteen 5x5 filters, but with sparse connections,
outputting sixteen 10x10 feature maps.
Second Pooling Layer (S4): Further reduces feature maps to 5x5 using 2x2 average pooling.
First Fully Connected Layer (C5): Fully connected with 120 nodes.
Second Fully Connected Layer (F6): Fully connected with 84 nodes.
Output Layer: Softmax or Gaussian activation that outputs probabilities across 10 classes.
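As a concrete illustration of the layer stack described above, here is a minimal PyTorch sketch (the choice of framework is an assumption; any deep learning library would do). It uses dense connections in C3 rather than the original sparse connection table, and tanh activations, so it is an approximation of LeNet-5 rather than a faithful reproduction of the 1998 model.

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # C1: 32x32 input -> 6 feature maps of 28x28
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),      # S2: 6 x 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # C3: 16 x 10x10 (dense connections for simplicity)
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),      # S4: 16 x 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),       # C5
            nn.Tanh(),
            nn.Linear(120, 84),               # F6
            nn.Tanh(),
            nn.Linear(84, num_classes),       # output layer (softmax applied by the loss)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: a batch of four 32x32 grayscale images yields one score per digit class.
logits = LeNet5()(torch.randn(4, 1, 32, 32))
print(logits.shape)  # torch.Size([4, 10])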
Applications of LeNet
LeNet's architecture, originally developed for digit recognition, has proven versatile and
foundational, influencing a variety of applications beyond its initial scope. Here are some
notable applications and adaptations:
Handwritten Character Recognition: Beyond recognizing digits, LeNet has been adapted to
recognize a broad range of handwritten characters, including alphabets from various languages.
This adaptation has been crucial for applications such as automated form processing and
handwriting-based authentication systems.
Object Recognition in Images: The principles of LeNet have been extended to more complex
object recognition tasks. Modified versions of LeNet are used in systems that need to recognize
objects in photos and videos, such as identifying products in a retail setting or vehicles in traffic
management systems.
Document Classification: LeNet can be adapted for document classification by recognizing and
learning from the textual and layout features of different document types. This application is
particularly useful in digital document management systems where automatic categorization of
documents based on their content and layout can significantly enhance searchability and
retrieval.
Medical Image Analysis: Adaptations of LeNet have been applied in the field of medical image
analysis, such as identifying abnormalities in radiographic images, segmenting biological
features in microscopic images, and diagnosing diseases from patterns in medical imagery.
These applications demonstrate the potential of convolutional neural networks in supporting
diagnostic processes and enhancing the accuracy of medical evaluations.
AlexNet:
The original AlexNet paper's primary result was that the depth of the model was essential for its high performance; this depth made training computationally expensive, but feasible thanks to the use of graphics processing units (GPUs).[1]
Its three authors, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, formed team SuperVision and submitted AlexNet to the ImageNet Large Scale Visual Recognition Challenge on September 30, 2012.[2] The network achieved a top-5 error of 15.3%, more than 10.8 percentage points better than that of the runner-up.
The architecture influenced a large body of subsequent work in deep learning, especially in applying neural networks to computer vision.
AlexNet was one of the first architectures to use GPUs to boost training performance. It consists of 5 convolutional layers, 3 max-pooling layers, 2 normalization layers, 2 fully connected layers, and 1 softmax layer. Each convolutional layer consists of convolution filters followed by a non-linear activation function called ReLU. The pooling layers perform max pooling, and the input size is fixed because of the fully connected layers. The input size is usually quoted as 224x224x3, but because of the padding involved it effectively works out to 227x227x3. In total, AlexNet has over 60 million parameters.
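Below is a single-GPU PyTorch sketch of the layer counts just described (the original network split its feature maps across two GPUs). The filter sizes follow the commonly cited single-stream description of AlexNet and should be read as an illustration rather than an exact reproduction of the paper's configuration.

import torch
import torch.nn as nn

class AlexNet(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),    # conv1
            nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),                   # norm1, pool1
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),  # conv2
            nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),                   # norm2, pool2
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True), # conv3
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True), # conv4
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True), # conv5
            nn.MaxPool2d(3, stride=2),                                            # pool3
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),  # class scores; softmax is applied by the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# A 227x227x3 input (see the note on input size above) maps to 1000 class scores.
print(AlexNet()(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 1000])

# Total parameters come to a little over 60 million, matching the figure quoted above.
print(sum(p.numel() for p in AlexNet().parameters()))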
Key Features:
Data augmentation was carried out with techniques such as flipping, jittering, cropping, and colour normalization.
AlexNet was trained on GTX 580 GPUs with only 3 GB of memory each, which could not fit the entire network. So the network was split across 2 GPUs, with half of the neurons (feature maps) on each GPU.
As the model had to train 60 million parameters (which is quite a lot), it was prone to
overfitting. According to the paper, the usage of Dropout and Data Augmentation significantly
helped in reducing overfitting. The first and second fully connected layers in the architecture
thus used a dropout of 0.5 for the purpose. Artificially increasing the number of images through
data augmentation helped in the expansion of the dataset dynamically during runtime, which
helped the model generalize better.
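As a rough sketch of these two remedies, the snippet below builds an augmentation pipeline with torchvision transforms and a classifier head with dropout of 0.5 before the first two fully connected layers, as described above. The crop size and jitter strengths are illustrative choices, not the paper's exact settings.

import torch.nn as nn
from torchvision import transforms

# Augmentation pipeline: random crops, flips, colour jittering, and colour normalization.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(227),                         # random cropping
    transforms.RandomHorizontalFlip(),                  # flipping
    transforms.ColorJitter(0.4, 0.4, 0.4),              # colour jittering
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],    # colour normalization (ImageNet statistics)
                         std=[0.229, 0.224, 0.225]),
])

# Dropout with p=0.5 applied before the first two fully connected layers.
classifier_head = nn.Sequential(
    nn.Dropout(p=0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Dropout(p=0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),
)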
Another distinguishing factor was the use of the ReLU activation function instead of tanh or sigmoid, which resulted in roughly six times faster training. Deep learning networks usually employ the ReLU non-linearity to achieve faster training, because tanh and sigmoid saturate at higher activation values, which slows down gradient-based learning.
AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, is a landmark model that won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012. It introduced several innovative ideas that shaped the future of CNNs.
AlexNet Architecture:
AlexNet consists of 8 layers, including 5 convolutional layers and 3 fully connected layers. It
uses traditional stacked convolutional layers with max-pooling in between. Its deep network
structure allows for the extraction of complex features from images.
The architecture employs overlapping pooling layers to reduce spatial dimensions while
retaining the spatial relationships among neighbouring features.
Activation function: AlexNet uses the ReLU activation function and dropout regularization,
which enhance the model’s ability to capture non-linear relationships within the data.
AlexNet is a relatively shallow network compared to GoogLeNet. It has eight layers, which makes it simpler to train and less prone to overfitting on smaller datasets.
In 2012, AlexNet produced ground-breaking results in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It greatly outperformed prior approaches and paved the way for the resurgence of deep learning in computer vision.
AlexNet introduced several architectural improvements, including the use of rectified linear units (ReLU) as activation functions, overlapping pooling, and dropout regularisation. These strategies helped improve performance and generalisation.
ZF-Net:
To understand ZFNet, we first have to look at what a convolutional neural network and ImageNet are.
A convolutional neural network is a special kind of multi-layer neural network designed to extract visual patterns from a given image. Since images are formed of pixels, a convolutional neural network tries to capture the important pixel values through a process known as convolution.
The ImageNet project is a large database designed for use in visual object recognition research.
The ImageNet project runs an annual competition, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where software programs compete to correctly classify objects and scenes.
It is imperative to know about AlexNet before coming to ZFNet. AlexNet was primarily designed by Alex Krizhevsky and published with Ilya Sutskever and Geoffrey Hinton. It won first place in the ImageNet Large Scale Visual Recognition Challenge in 2012 by achieving a top-5 error of 15.3%, which was 10.8 percentage points lower than that of the runner-up. AlexNet was considered a massive jump in the accuracy of neural networks.
Matthew D. Zeiler and Rob Fergus introduced ZFNet, which is named after their surnames (Zeiler and Fergus). ZFNet was a slight improvement over AlexNet, and it won the 2013 ILSVRC. Their work visualized how each layer of AlexNet performs and which parameters can be tuned to achieve greater accuracy.
Some Key Features of ZFNet architecture
· Convolutional Layers:
Convolutional filters are applied in these layers to extract important features; ZFNet consists of multiple such convolutional layers.
· Max-Pooling Layers:
Max-pooling layers are used to downsample the spatial dimensions of the feature maps. They use the maximum as the aggregation function.
· ReLU Activation:
ReLU is used after each convolutional layer to introduce non-linearity into the model, which is crucial for learning complex patterns. It rectifies the feature maps, ensuring that the activations are always non-negative.
· Fully Connected Layers:
In the latter part of the ZFNet architecture, fully connected (dense) layers are used to learn patterns from the extracted features. The activation function used in these neurons is ReLU.
· SoftMax Activation:
SoftMax activation is used in the last layer to obtain the probabilities of the image belonging to each of the 1000 classes, as in the sketch that follows this list.
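Below is a simplified PyTorch sketch of the layer pattern listed above: stacked convolutions with ReLU and max pooling, followed by dense layers and a 1000-way classifier. The first layer uses the 7x7, stride-2 filters that Zeiler and Fergus are commonly credited with adopting in place of AlexNet's 11x11, stride-4 filters; the remaining filter counts are illustrative rather than ZFNet's exact configuration.

import torch
import torch.nn as nn

zfnet_like = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=2, padding=1),                  # 224 -> 110 -> 55
    nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=2, padding=1),                  # 55 -> 26 -> 13
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=2),                             # 13 -> 6
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),   # scores for the 1000 classes; softmax turns them into probabilities
)

print(zfnet_like(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])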
VGG-16:
The VGG-16 model is a convolutional neural network (CNN) architecture that was proposed by
the Visual Geometry Group (VGG) at the University of Oxford. It is characterized by its depth,
consisting of 16 layers, including 13 convolutional layers and 3 fully connected layers. VGG-16 is
renowned for its simplicity and effectiveness, as well as its ability to achieve strong
performance on various computer vision tasks, including image classification and object
recognition. The model’s architecture features a stack of convolutional layers followed by max-
pooling layers, with progressively increasing depth. This design enables the model to learn
intricate hierarchical representations of visual features, leading to robust and accurate
predictions. Despite its simplicity compared to more recent architectures, VGG-16 remains a
popular choice for many deep learning applications due to its versatility and excellent
performance.
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual competition in
computer vision where teams tackle tasks including object localization and image classification.
VGG16, proposed by Karen Simonyan and Andrew Zisserman in 2014, achieved top ranks in
both tasks, detecting objects from 200 classes and classifying images into 1000 categories.
VGG Architecture:
The VGG-16 architecture is a deep convolutional neural network (CNN) designed for image
classification tasks. It was introduced by the Visual Geometry Group at the University of Oxford.
VGG-16 is characterized by its simplicity and uniform architecture, making it easy to understand
and implement.
The VGG-16 configuration consists of 16 weight layers: 13 convolutional layers and 3 fully connected layers. These layers are organized into blocks, with each block containing multiple convolutional layers followed by a max-pooling layer for downsampling.
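A compact PyTorch sketch of this block structure is shown below: five convolutional blocks (all 3x3 filters, 13 convolutions in total) separated by 2x2 max pooling, followed by three fully connected layers. The channel progression (64, 128, 256, 512, 512) matches the standard VGG-16 description; the dropout placement is an assumption borrowed from the usual training setup.

import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

vgg16 = nn.Sequential(
    vgg_block(3, 64, 2),      # block 1: 224 -> 112
    vgg_block(64, 128, 2),    # block 2: 112 -> 56
    vgg_block(128, 256, 3),   # block 3: 56 -> 28
    vgg_block(256, 512, 3),   # block 4: 28 -> 14
    vgg_block(512, 512, 3),   # block 5: 14 -> 7
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 1000),    # 1000 ImageNet classes
)

print(vgg16(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])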
GoogLeNet:
GoogLeNet (or Inception V1) was proposed by researchers at Google (with the collaboration of various universities) in 2014 in the research paper titled “Going Deeper with Convolutions”.
This architecture was the winner of the ILSVRC 2014 image classification challenge. It provided a significant decrease in error rate compared to previous winners AlexNet (winner of ILSVRC 2012) and ZF-Net (winner of ILSVRC 2013), and a significantly lower error rate than VGG (the 2014 runner-up). This architecture uses techniques such as 1×1 convolutions in the middle of the architecture and global average pooling.
Features of GoogLeNet:
The GoogLeNet architecture is very different from previous state-of-the-art architectures such as AlexNet and ZF-Net. It uses many different methods, such as 1×1 convolution and global average pooling, that enable it to build a deeper architecture. Some of these methods are discussed below:
1×1 convolution: The Inception architecture uses 1×1 convolutions as a bottleneck. These convolutions are used to decrease the number of parameters (weights and biases) and the amount of computation in the architecture. By reducing the computation per layer, the depth of the architecture can be increased. For example, suppose we want to perform a 5×5 convolution with 48 filters on a 14×14×480 feature map. Applied directly, this costs about (14×14×48) × (5×5×480) ≈ 112.9 million multiplications. If we first apply a 1×1 convolution with 16 filters as an intermediate step and then the 5×5 convolution with 48 filters, the cost drops to about (14×14×16) × 480 + (14×14×48) × (5×5×16) ≈ 1.5 million + 3.8 million ≈ 5.3 million multiplications.
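The same reduction shows up in the parameter counts. The short PyTorch snippet below compares a direct 5×5 convolution with the 1×1-bottleneck version from the example above (the 14×14×480 input and the 16-filter bottleneck are simply the illustrative numbers used there).

import torch
import torch.nn as nn

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

# Direct 5x5 convolution: 480 input channels -> 48 output channels.
direct = nn.Conv2d(480, 48, kernel_size=5, padding=2)

# 1x1 bottleneck (480 -> 16 channels) followed by the same 5x5 convolution.
bottleneck = nn.Sequential(
    nn.Conv2d(480, 16, kernel_size=1),
    nn.Conv2d(16, 48, kernel_size=5, padding=2),
)

x = torch.randn(1, 480, 14, 14)                # a 14x14x480 feature map, as in the example above
assert direct(x).shape == bottleneck(x).shape  # both produce 48 output channels at 14x14

print("direct 5x5 parameters:", count_params(direct))      # 576,048
print("1x1 + 5x5 parameters: ", count_params(bottleneck))  # 26,944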
GoogLeNet is a 22-layer deep convolutional neural network that’s a variant of the Inception
Network, a Deep Convolutional Neural Network developed by researchers at Google.
Today GoogLeNet is used for other computer vision tasks such as face detection and
recognition, adversarial training etc.
When designing a deep learning model, one needs to decide which convolution filter size to use (1×1, 3×3, or 5×5), as this affects the model's learning and performance, and also when to max-pool the layers. The inception module, the key innovation introduced by a team of Google researchers, solved this problem creatively: instead of choosing a single filter size and deciding when to perform a max-pooling operation, it applies multiple convolution filters in parallel and combines their outputs.
Stacking multiple convolution filters together instead of just one increases the parameter count
many times. However, GoogLeNet demonstrated by using the inception module that depth and
width in a neural network could be increased without exploding computations. We will
investigate the inception module in depth.
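Here is a PyTorch sketch of an inception module with dimensionality reduction: parallel 1×1, 3×3, and 5×5 convolution branches plus a pooling branch, with 1×1 bottlenecks and the branch outputs concatenated along the channel axis. The channel counts in the example correspond to the "inception (3a)" block reported for GoogLeNet, used here purely for illustration.

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    # Inception module with dimensionality reduction.
    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_reduce, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_reduce, c3, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_reduce, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_reduce, c5, 5, padding=2), nn.ReLU(inplace=True),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # All branches preserve the spatial size, so their outputs can be concatenated channel-wise.
        return torch.cat(
            [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)], dim=1
        )

# Channel counts of the "inception (3a)" block, applied to a 28x28x192 feature map.
block = InceptionModule(192, 64, 96, 128, 16, 32, 32)
out = block(torch.randn(1, 192, 28, 28))
print(out.shape)  # torch.Size([1, 256, 28, 28]) -> 64 + 128 + 32 + 32 channels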
ResNet:
After the first CNN-based architecture (AlexNet) won the ImageNet 2012 competition, every subsequent winning architecture used more layers in a deep neural network to reduce the error rate. This works for a small number of layers, but as the number of layers increases, a common problem in deep learning appears: the vanishing/exploding gradient. This causes the gradients to become zero or too large. Thus, as we increase the number of layers, the training and test error rates also increase.
In the experiments reported by the ResNet authors, a plain 56-layer CNN gives a higher error rate on both the training and testing datasets than a 20-layer CNN architecture. After analyzing the error rates further, the authors concluded that this degradation is caused by the vanishing/exploding gradient problem.
ResNet, which was proposed in 2015 by researchers at Microsoft Research, introduced a new architecture called the Residual Network.
Residual Network: In order to solve the problem of the vanishing/exploding gradient, this architecture introduced the concept of Residual Blocks. In this network, we use a technique called skip connections. A skip connection connects the activations of a layer to later layers by skipping some layers in between; this forms a residual block. ResNets are made by stacking these residual blocks together.
The approach behind this network is that instead of having the layers learn the underlying mapping directly, we allow the network to fit a residual mapping. So, instead of the initial mapping H(x), we let the network fit F(x) := H(x) − x, which gives the original mapping as H(x) = F(x) + x.
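A minimal PyTorch sketch of a basic residual block is given below. The block computes F(x) with two 3×3 convolutions (with batch normalization, as in the common ResNet formulation) and adds the input x back via the shortcut connection; the channel count and input size in the example are arbitrary.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # A basic residual block: the output is F(x) + x, so the layers only learn the residual F(x).
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                        # the skip (shortcut) connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))     # F(x)
        out = out + identity                # H(x) = F(x) + x
        return self.relu(out)

# The block preserves the input shape, so residual blocks can be stacked freely.
x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])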
ResNet (short for Residual Network) is a type of neural network architecture introduced in 2015
by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun from Microsoft Research. It was
designed to solve the problem of vanishing gradients in deep neural networks, which hindered
their performance on large-scale image recognition tasks.
This section discusses the ResNet architecture in detail, including its history, key features, and applications in various domains.
The ResNet architecture is usually divided into a stem followed by four stages of residual blocks of different depths. The stem comprises a single convolutional layer followed by max pooling, which reduces the spatial dimensions of the input. The first stage of residual blocks uses 64 filters, while the later stages use 128, 256, and 512 filters, respectively. The final part of the network consists of global average pooling and a fully connected layer that produces the output.
Residual learning is a concept that was introduced in the ResNet architecture to tackle the vanishing gradient problem. In traditional deep neural networks, each layer applies a set of transformations to the input to obtain the output. ResNet introduces residual connections that enable the network to learn residual mappings, which are the differences between the output and the input of a layer.
The residual connections are formed by adding the input to the output of a layer, which allows the gradients to flow directly through the network without being attenuated. This enables the network to learn the residual mapping using a shortcut connection that bypasses the layer's transformation.
ResNet Architecture
The ResNet architecture consists of several layers, each containing residual blocks. A residual
block is a set of layers that perform a set of transformations on the input to obtain the output
and includes a shortcut connection that adds the input to the output.
The ResNet architecture has several variants, including ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. The number in each variant corresponds to the number of layers in the network. For example, ResNet-50 has 50 layers, while ResNet-152 has 152 layers.
The ResNet-50 architecture is one of the most popular variants, and it consists of five stages,
each containing several residual blocks. The first stage consists of a convolutional layer followed
by a max-pooling layer, which reduces the spatial dimensions of the input.
The second stage contains three residual blocks, each containing three convolutional layers (a 1×1, a 3×3, and a 1×1 convolution) and a shortcut connection. The third, fourth, and fifth stages contain four, six, and three residual blocks, respectively. Each block in these stages likewise contains several convolutional layers and a shortcut connection.
The output of the last stage is fed into a global average pooling layer, which reduces the spatial dimensions of the feature maps to a single value for each channel. The output of the global average pooling layer is then fed into a fully connected layer with softmax activation, which produces the final output of the network.
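For reference, the stock ResNet-50 is available in torchvision, so the stage structure above can be inspected directly. The snippet below assumes a reasonably recent torchvision (older releases use the pretrained= argument instead of weights=).

import torch
from torchvision.models import resnet50

# Instantiate ResNet-50 without pretrained weights and inspect it.
model = resnet50(weights=None)

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.1f}M")   # roughly 25.6M for ResNet-50

# A dummy 224x224 RGB image produces scores for the 1000 ImageNet classes.
logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])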
Applications
ResNet has achieved state-of-the-art results on various computer vision tasks, including image classification, object detection, and semantic segmentation. In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015, the ResNet-152 architecture achieved a top-5 error rate of 3.57%, significantly better than the previous state of the art.
Benefits of ResNet
ResNet has several benefits that make it a popular choice for deep learning applications:
Deeper networks
ResNet enables the construction of deeper neural networks, with more than a hundred layers, which was previously impractical due to the vanishing gradient problem. The residual connections allow the network to learn better representations and optimize the gradient flow, making it easier to train deeper networks.
Improved accuracy
The deeper representations enabled by residual connections translate into higher accuracy on image recognition benchmarks such as ImageNet.
Faster convergence
ResNet enables faster convergence during training, thanks to the residual connections that
allow for better gradient flow and optimization. This results in faster training and better
convergence to the optimal solution.
Transfer learning
ResNet is suitable for transfer learning, allowing the network to reuse previously learned features for new tasks. This is especially useful in scenarios where the amount of labeled data is limited, as the pre-trained ResNet can be fine-tuned on the new dataset to achieve good performance.
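A minimal fine-tuning sketch along these lines is shown below: a pretrained torchvision ResNet-50 has its backbone frozen and its final fully connected layer replaced for a hypothetical 5-class task (the class count, learning rate, and weight enum are illustrative assumptions, with torchvision >= 0.13 assumed).

import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# Load an ImageNet-pretrained ResNet-50.
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)

# Freeze the pretrained backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a trainable head for a hypothetical 5-class task.
model.fc = nn.Linear(model.fc.in_features, 5)

# Only the new head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)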
Drawbacks of ResNet
Despite its numerous benefits, ResNet has a few drawbacks that should be considered:
Complexity
ResNet is a complex architecture that requires more memory and computational resources
than shallower networks. This can be a limitation in scenarios with limited resources, such as
mobile devices or embedded systems.
Overfitting
ResNet can be prone to overfitting, especially when the network is too deep or when the dataset is small. This can be mitigated by regularization techniques, such as dropout, or by using smaller networks with fewer layers.
Interpretability
ResNet's interpretability can be challenging, as the network learns complex and abstract representations that are difficult to understand. This can be a limitation in scenarios where interpretability is crucial, such as medical diagnosis or fraud detection.