ML Unit 3 Notes
The term " Neural Network" is derived from Biological neural networks that develop the
structure of a human brain. Similar to the human brain that has neurons interconnected to
one
anothe
r,
artifici
al
neural
networks also
have neurons
that are
interconnected to one another in various layers of the networks. These neurons are known
as nodes.
There are around 100 billion neurons in the human brain. Each neuron has association points somewhere in the range of 1,000 to 100,000. In the human brain, data is stored in a distributed manner, and we can extract more than one piece of this data from memory in parallel when necessary. We can say that the human brain is made up of incredibly amazing parallel processors.
We can understand the artificial neural network with an example. Consider a digital logic gate that takes an input and gives an output, such as an "OR" gate, which takes two inputs. If one or both of the inputs are "On," then the output is "On." If both inputs are "Off," then the output is "Off." Here the output depends only on the input.
Our brain does not perform the same task: the relationship between outputs and inputs keeps changing, because the neurons in our brain are "learning."
Input Layer:
The input layer accepts the raw input data provided to the network and passes it on to the hidden layers.
Hidden Layer:
The hidden layer sits between the input and output layers. It performs all the calculations to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which
finally results in output that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum of the inputs
and includes a bias. This computation is represented in the form of a transfer function.
Perceptrons
The perceptron is a building block of an Artificial Neural Network. It was invented by Frank Rosenblatt in 1957 to perform certain calculations for detecting features in input data. The perceptron is a linear machine learning algorithm used for supervised learning of binary classifiers. The algorithm enables neurons to learn and process training elements one at a time. The following discussion covers the perceptron and its basic functions, starting with a basic introduction.
The perceptron model is also treated as one of the simplest types of artificial neural networks; it is a supervised learning algorithm for binary classifiers. We can consider it a single-layer neural network with four main parameters: input values, weights and bias, net sum, and an activation function.
Input Nodes or Input Layer:
This is the primary component of the perceptron, which accepts the initial data into the system for further processing. Each input node contains a real numerical value.
Weight and Bias:
The weight parameter represents the strength of the connection between units and is another important component of the perceptron. Weight is directly proportional to the strength of the associated input neuron in deciding the output. Further, bias can be considered the intercept in a linear equation.
Activation Function:
These are the final and important components that help to determine whether the neuron
will fire or not. Activation Function can be considered primarily as a step function.
Sign function
Step function, and
Sigmoid function
The data scientist chooses the activation function based on the problem statement and the desired form of the output. The activation function may differ (e.g., sign, step, or sigmoid) across perceptron models, depending on whether the learning process is slow or suffers from vanishing or exploding gradients.
Step-1
In the first step, multiply all input values with their corresponding weight values and then add them to determine the weighted sum. A special term called the bias 'b' is added to this weighted sum to improve the model's performance. Mathematically, the weighted sum is calculated as follows:
∑ wi*xi + b
Step-2
In the second step, an activation function is applied to the above-mentioned weighted sum, which gives us an output that is either binary or a continuous value, as follows:
Y = f(∑ wi*xi + b)
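As a brief illustrative sketch of these two steps (the OR-gate weights and bias below are assumed example values, not taken from the notes), the computation can be written in Python as:

```python
import numpy as np

def perceptron_output(x, w, b):
    # Step 1: weighted sum of inputs plus bias, i.e. sum(wi * xi) + b
    weighted_sum = np.dot(w, x) + b
    # Step 2: step activation function f applied to the weighted sum
    return 1 if weighted_sum > 0 else 0

# Example: weights and bias chosen so the perceptron behaves like an OR gate
w = np.array([1.0, 1.0])
b = -0.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron_output(np.array(x), w, b))
```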
Based on the layers, perceptron models are divided into two types. These are as follows:
Single Layer Perceptron Model:
This is one of the simplest types of artificial neural networks (ANN). A single-layer perceptron model consists of a feed-forward network and also includes a threshold transfer function inside the model. The main objective of the single-layer perceptron model is to analyze linearly separable objects with binary outcomes.
In a single-layer perceptron model, the algorithm has no prior recorded data, so it begins with randomly allocated weight parameters. It then computes the weighted sum of all inputs. If the total sum is more than a pre-determined value (the threshold), the model gets activated and shows the output value as +1.
If the outcome matches the pre-determined or threshold value, the performance of the model is considered satisfactory and no weight change is demanded. However, this model runs into discrepancies when multiple input values are fed into it. Hence, to obtain the desired output and minimize errors, some changes to the weights are necessary, as sketched below.
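A minimal sketch of such weight adjustments using the classic perceptron learning rule (the learning rate, epoch count, and AND-gate data are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=10):
    # Perceptron learning rule: nudge weights and bias whenever a sample is misclassified
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0
            error = target - pred          # 0 if correct, +1 or -1 if wrong
            w += lr * error * xi           # move the weights toward the desired output
            b += lr * error
    return w, b

# Linearly separable toy data: the AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print("weights:", w, "bias:", b)
```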
Multi-Layer Perceptron Model:
Like the single-layer perceptron model, a multi-layer perceptron model has the same basic structure but a greater number of hidden layers. The multi-layer perceptron model is trained using the Backpropagation algorithm, which executes in two stages as follows:
Forward Stage: Activations propagate from the input layer through the hidden layers and terminate at the output layer.
Backward Stage: In the backward stage, weight and bias values are modified as per the model's requirement. The error between the actual and desired output originates at the output layer and is propagated backward to the input layer.
Perceptron Function
The perceptron function f(x) is obtained by multiplying the input x with the learned weight coefficient w and adding the bias b. Mathematically, we can express it as follows:
f(x) = 1 if w·x + b > 0
f(x) = 0 otherwise
Characteristics of Perceptron
The output of a perceptron can only be a binary number (0 or 1) due to the hard-limit transfer function.
A perceptron can only be used to classify linearly separable sets of input vectors. If the input vectors are not linearly separable, it is not easy to classify them correctly.
Activation Functions
The activation function also helps to normalize the output of any input into a range such as -1 to 1. An activation function must be efficient and should reduce computation time, because a neural network is sometimes trained on millions of data points.
Without an activation function, a neural network becomes a linear regression model. By introducing an activation function, the neural network performs a non-linear transformation of the input and becomes suitable for solving problems like image classification, sentence prediction, or language translation.
A neuron basically computes a weighted sum of its inputs, and this sum is then passed through an activation function to get an output.
Y = ∑ (weights * input) + bias
Here Y can be anything in the range -infinity to +infinity, so we have to bound the output to get the desired prediction or generalized results.
Without an activation function, the weights and bias perform only a linear transformation, and the neural network is just a linear regression model. A linear equation is a polynomial of degree one, which is simple to solve but limited in its ability to solve complex problems or model higher-degree relationships. In contrast, adding an activation function to the neural network applies a non-linear transformation to the input and makes it capable of solving complex problems such as language translation and image classification.
In addition, activation functions are differentiable, which allows backpropagation to compute the gradients of the loss function with respect to the weights in the neural network.
Types of Activation Functions
Linear Function
The final activation of the last layer is nothing more than a linear function of the input from the first layer, regardless of how many layers there are, if they are all linear in nature. The range is -inf to +inf.
Uses: The linear activation function is applied only at the output layer.
If we differentiate a linear function to bring in non-linearity, the result will no longer depend on the input "x"; the derivative becomes constant, and our algorithm won't exhibit any novel behaviour.
Equation : f(x) = x
Range : (-infinity to infinity)
It doesn't help with the complexity or the various parameters of the usual data that is fed to neural networks.
Sigmoid Function
It is non-linear in nature. Notice that the curve is very steep for X values between -2 and 2; small changes in x in this region cause significant shifts in the value of Y. The value range is 0 to 1.
Uses: The sigmoid function is typically employed in the output layer of a binary classification, where the result may only be 0 or 1. Since the value of the sigmoid function ranges only from 0 to 1, the result can be predicted as 1 if the value is greater than 0.5 and 0 otherwise.
Tanh Function
Value Range :- -1 to +1
Nature :- non-linear
Uses :- Usually used in hidden layers of a neural network, as its values lie between -1 and 1; hence the mean of the hidden layer activations comes out to be 0 or very close to it. This helps in centering the data by bringing the mean close to 0, which makes learning for the next layer much easier.
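A short sketch of the sigmoid and tanh curves described above (the sample x values are illustrative):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real value into (0, 1); steepest for x roughly between -2 and 2
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
print(sigmoid(x))   # values between 0 and 1
print(np.tanh(x))   # values between -1 and 1, centred around 0
```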
ReLU Function
Currently, the ReLU is the most widely employed activation function globally, since practically all convolutional neural networks and deep learning systems use it. Both the function and its derivative are monotonic.
However, the problem is that all negative values instantly become zero, which reduces the model's capacity to fit or learn from the data effectively: any negative input to the ReLU activation function immediately becomes zero, which affects the resulting map by handling negative values improperly.
ReLU stands for Rectified Linear Unit. It is the most widely used activation function, chiefly implemented in the hidden layers of a neural network.
Equation :- A(x) = max(0, x). It gives an output x if x is positive and 0 otherwise.
Value Range :- [0, inf)
Nature :- non-linear, which means we can easily backpropagate the errors and have multiple layers of neurons being activated by the ReLU function.
Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. Only a few neurons are activated at a time, making the network sparse and efficient for computation. In simple words, ReLU learns much faster than the sigmoid and tanh functions.
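A minimal sketch of the ReLU function A(x) = max(0, x) (the sample values are illustrative):

```python
import numpy as np

def relu(x):
    # A(x) = max(0, x): passes positive values through and zeroes out negatives
    return np.maximum(0, x)

x = np.array([-3.0, -1.0, 0.0, 2.0, 5.0])
print(relu(x))   # [0. 0. 0. 2. 5.] -- every negative input becomes 0
```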
Softmax Function
The softmax function is also a type of sigmoid function, but it is handy when we are trying to handle multi-class classification problems.
Nature :- non-linear
Uses :- Usually used when trying to handle multiple classes. The softmax function is commonly found in the output layer of image classification problems. The softmax function squeezes the outputs for each class between 0 and 1 and also divides by the sum of the outputs.
Output :- The softmax function is ideally used in the output layer of the classifier, where we are actually trying to attain the probabilities that define the class of each input.
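A small sketch of the softmax computation (the three logit values are made-up examples):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability, exponentiate, then divide by the sum
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw scores for three classes
probs = softmax(logits)
print(probs, probs.sum())            # probabilities between 0 and 1 that sum to 1
```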
The basic rule of thumb is: if you really don't know what activation function to use, simply use ReLU, as it is a general-purpose activation function for hidden layers and is used in most cases these days.
If your output is for binary classification, the sigmoid function is a very natural choice for the output layer.
If your output is for multi-class classification, Softmax is very useful for predicting the probabilities of each class.
Artificial Neural Networks (ANN)
Artificial Neural Networks (ANN) are algorithms based on brain function and are used to model complicated patterns and prediction problems. The Artificial Neural Network (ANN) is a deep learning method that arose from the concept of the human brain's biological neural networks. The development of ANN was the result of an attempt to replicate the workings of the human brain. The workings of ANN are extremely similar to those of biological neural networks, although they are not identical. The ANN algorithm accepts only numeric and structured data.
The network architecture has three kinds of layers: the input layer, the hidden layer (one or more), and the output layer. Because of the numerous layers, it is sometimes referred to as an MLP (Multi-Layer Perceptron).
This model captures the presence of non-linear relationships between the inputs. It contributes to the conversion of the input into a more usable output.
The core component of ANNs is artificial neurons. Each neuron receives inputs from several other neurons, multiplies them by assigned weights, adds them, and passes the sum on to one or more neurons. Some artificial neurons apply an activation function to the output before passing it to the next layer.
At its core, this might sound like a very trivial math operation. But when you place
hundreds, thousands and millions of neurons in multiple layers and stack them up on top of
each other, you’ll obtain an artificial neural network that can perform very complicated
tasks, such as classifying images or recognizing speech.
Artificial neural networks are composed of an input layer, which receives data from outside
sources (data files, images, hardware sensors, microphone…), one or more hidden layers
that process the data, and an output layer that provides one or more data points
based on the function of the network. For instance, a neural network that detects persons,
cars and animals will have an output layer with three nodes. A network that
classifies bank transactions between safe and fraudulent will have a single output.
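As an illustrative sketch of such an architecture in Keras (the input size of 64 features, the hidden width, and the three output classes are assumptions, not values from the notes):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(32, activation="relu", input_shape=(64,)),  # hidden layer processing the input features
    Dense(3, activation="softmax"),                   # one output node per class (e.g. person, car, animal)
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```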
Artificial neural networks start by assigning random values to the weights of the
connections between neurons. The key for the ANN to perform its task correctly
and accurately is to adjust these weights to the right numbers. But finding the right
weights is not very easy, especially when you’re dealing with multiple layers and
thousands of neurons.
This calibration is done by “training” the network with annotated examples. For instance, if
you want to train the image classifier mentioned above, you provide it with multiple photos,
each labeled with its corresponding class (person, car or animal). As you provide it with
more and more training examples, the neural network gradually adjusts its weights to
map each input to the correct outputs.
Basically, what happens during training is that the network adjusts itself to glean specific
patterns from the data. Again, in the case of an image classifier network, when you train the
AI model with quality examples, each layer detects a specific class of features. For
instance, the first layer might detect horizontal and vertical edges, the next layers
might detect corners and round shapes. Further down the network, deeper layers will start
to pick out more advanced features such as faces and objects.
Backpropagation Algorithm
Backpropagation is an algorithm that propagates the errors from the output nodes back to the input nodes. Therefore, it is simply referred to as the backward propagation of errors. It is used in many neural network applications in data mining, such as character recognition and signature verification.
The backpropagation algorithm in a neural network computes the gradient of the loss function for a single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define how the gradient is used. It generalizes the computation in the delta rule.
Backpropagation Algorithm:
Step 1: Inputs X arrive through the preconnected path.
Step 2: The inputs are modeled using real weights W. Weights are usually chosen randomly.
Step 3: Calculate the output of each neuron from the input layer to the hidden layer to the
output layer.
Step 4: Calculate the error in the outputs
Backpropagation Error= Actual Output – Desired Output
Step 5: From the output layer, go back to the hidden layer to adjust the weights to reduce
the error.
Step 6: Repeat the process until the desired output is achieved.
Training Algorithm:
Step 5: Each hidden unit Zj (j = 1 to p) sums its weighted input signals to calculate its net input,
z_inj = v0j + Σi xi vij,
applies the activation function zj = f(z_inj), and sends this signal to all units in the layer above (the output units).
Step 6: Each output unit Yk (k = 1 to m) sums its weighted input signals, y_ink = w0k + Σj zj wjk, and applies the activation function to compute its output signal, yk = f(y_ink).
Backpropagation of Error:
Step 7: Each output unit Yk receives a target tk and computes its error term,
δk = (tk – yk) f′(y_ink).
Its weight correction term is Δwjk = α δk zj, and the bias correction term is Δw0k = α δk.
Step 8: Each hidden unit Zj (j = 1 to p) sums its delta inputs from the units in the layer above,
δ_inj = Σk δk wjk,
and computes its error term δj = δ_inj f′(z_inj). For each hidden unit Zj, update its bias and weights (i = 0 to n): the weight correction term is Δvij = α δj xi and the bias correction term is Δv0j = α δj, so that
v0j(new) = v0j(old) + Δv0j and vij(new) = vij(old) + Δvij.
Step 9: Test the stopping condition. The stopping condition can be the minimization of
error, number of epochs.
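A compact sketch of these feed-forward and backpropagation steps for one hidden layer of sigmoid units (the XOR data, layer sizes, and learning rate are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # toy inputs
t = np.array([[0], [1], [1], [0]], dtype=float)              # toy targets (XOR)

V, v0 = rng.normal(scale=0.5, size=(2, 4)), np.zeros((1, 4)) # input -> hidden weights v_ij and biases v_0j
W, w0 = rng.normal(scale=0.5, size=(4, 1)), np.zeros((1, 1)) # hidden -> output weights w_jk and bias w_0k
alpha = 0.5                                                  # learning rate

for epoch in range(10000):
    # Feed-forward: z_j = f(z_in_j), y_k = f(y_in_k)
    z = sigmoid(X @ V + v0)
    y = sigmoid(z @ W + w0)
    # Backpropagation of error: delta_k = (t_k - y_k) f'(y_in_k), delta_j = delta_in_j f'(z_in_j)
    delta_k = (t - y) * y * (1 - y)
    delta_j = (delta_k @ W.T) * z * (1 - z)
    # Weight and bias corrections: delta_w_jk = alpha * delta_k * z_j, delta_v_ij = alpha * delta_j * x_i
    W += alpha * z.T @ delta_k
    w0 += alpha * delta_k.sum(axis=0, keepdims=True)
    V += alpha * X.T @ delta_j
    v0 += alpha * delta_j.sum(axis=0, keepdims=True)

print(np.round(y, 2))  # the outputs should move toward the targets [0, 1, 1, 0]
```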
Types of Backpropagation
There are two types of backpropagation networks: static backpropagation, used with feed-forward networks that map a static input to a static output (e.g., character recognition), and recurrent backpropagation, used with recurrent networks, where activations are fed forward until a fixed value is reached.
Advantages:
It is fast, simple, and easy to program.
It has no parameters to tune apart from the number of inputs.
It does not require any prior knowledge about the network.
Disadvantages:
It is sensitive to noisy data and irregularities. Noisy data can lead to inaccurate
results.
Performance is highly dependent on input data.
Spending too much time training.
The matrix-based approach is preferred over a mini-batch.
Convolutional Neural Networks (CNN)
Convolutional Neural Networks, also called ConvNets, are simply neural networks that share their parameters. Suppose that there is an image, represented as a cuboid with a length, a width, and a depth, where the depth is given by the Red, Green, and Blue channels.
Now assume that we take a small patch of this image and run a small neural network on it, with k outputs represented vertically. When we slide this small neural network over the whole image, it results in another image with a different width, height, and depth. Instead of just the R, G, B channels, we now have more channels, but with a smaller width and height; this is the concept of Convolution. If the patch size were the same as that of the image, it would be a regular neural network. It is because of this small patch that we have fewer weights.
Working of CNN
Generally, a Convolutional Neural Network has three layers, which are as follows:
Convolutional Layer: Applies a set of learnable filters (feature detectors) to the input to produce feature maps.
Pooling Layer: Reduces the spatial size of the feature maps while keeping the most important information.
Locally Connected (Fully Connected) Layer: A regular neural network layer that receives an input from the preceding layer, computes the class scores, and results in a 1-dimensional array whose size equals the number of classes.
We start with an input image to which we apply multiple feature detectors, also called filters, to create the feature maps; this constitutes a Convolution layer. Then, on top of that layer, we apply the ReLU (Rectified Linear Unit) to increase non-linearity in our images. Next, we apply a Pooling layer to the Convolutional layer, so that from every feature map we create a pooled feature map; the main purpose of the pooling layer is to make sure that we have spatial invariance in our images. It also helps to reduce the size of our images and avoid overfitting of our data. After that, we flatten all of our pooled feature maps into one long vector or column of values and feed these values into our artificial neural network. Lastly, we feed it into the locally connected layer to achieve the final output, as sketched in the example below.
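An illustrative Keras sketch of that Convolution -> ReLU -> Pooling -> Flatten -> fully connected pipeline (the input shape, filter counts, and number of classes are assumptions):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=(64, 64, 3)),  # feature detectors + ReLU
    MaxPooling2D(pool_size=(2, 2)),                                              # pooled feature maps
    Flatten(),                                                                   # one long vector of values
    Dense(128, activation="relu"),                                               # artificial neural network part
    Dense(10, activation="softmax"),                                             # class scores
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```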
Pooling Layers
The pooling operation involves sliding a two-dimensional filter over each channel of the feature map and summarising the features lying within the region covered by the filter.
For a feature map having dimensions nh x nw x nc, the dimensions of the output obtained after a pooling layer are
((nh - f) / s + 1) x ((nw - f) / s + 1) x nc
where f is the size of the filter and s is the stride length.
Pooling layers are used to reduce the dimensions of the feature maps. Thus, it
reduces the number of parameters to learn and the amount of computation
performed in the network.
The pooling layer summarises the features present in a region of the feature map
generated by a convolution layer. So, further operations are performed on
summarised features instead of precisely positioned features generated by the
convolution layer. This makes the model more robust to variations in the position of
the features in the input image.
Max Pooling
Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output after the max-pooling layer is a feature map containing the most prominent features of the previous feature map.
This can be achieved using the MaxPooling2D layer in Keras as follows:
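(The snippet below is a minimal sketch; the 2x2 pool size and the toy 4x4 feature map are illustrative assumptions.)

```python
import numpy as np
from tensorflow.keras.layers import MaxPooling2D

# A toy 4x4 single-channel feature map, shaped (batch, height, width, channels)
feature_map = np.arange(1, 17, dtype="float32").reshape(1, 4, 4, 1)

# A 2x2 pooling window with stride 2 keeps the maximum of each non-overlapping patch
pooled = MaxPooling2D(pool_size=(2, 2), strides=2)(feature_map)
print(pooled.numpy().reshape(2, 2))   # [[ 6.  8.] [14. 16.]]
```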
Average Pooling
Average pooling computes the average of the elements present in the region of feature map
covered by the filter. Thus, while max pooling gives the most prominent feature in
a particular patch of the feature map, average pooling gives the average of features present
in a patch.
Global Pooling
Global pooling reduces each channel in the feature map to a single value. Thus, an nh x nw x nc feature map is reduced to a 1 x 1 x nc feature map. This is equivalent to using a filter of dimensions nh x nw, i.e., the dimensions of the feature map. Further, it can be either global max pooling or global average pooling.
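A brief sketch comparing average pooling and global pooling on the same toy feature map (the values are illustrative assumptions):

```python
import numpy as np
from tensorflow.keras.layers import AveragePooling2D, GlobalAveragePooling2D, GlobalMaxPooling2D

feature_map = np.arange(1, 17, dtype="float32").reshape(1, 4, 4, 1)   # shape (batch, nh, nw, nc)

avg = AveragePooling2D(pool_size=(2, 2))(feature_map)   # averages each 2x2 patch
gap = GlobalAveragePooling2D()(feature_map)             # one value per channel: mean of all 16 entries
gmp = GlobalMaxPooling2D()(feature_map)                 # one value per channel: max of all 16 entries

print(avg.numpy().reshape(2, 2))   # [[ 3.5  5.5] [11.5 13.5]]
print(gap.numpy(), gmp.numpy())    # [[8.5]] [[16.]]
```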
In convolutional neural networks (CNNs), the pooling layer is a common type of layer that
is typically added after convolutional layers. The pooling layer is used to reduce the
spatial dimensions (i.e., the width and height) of the feature maps, while preserving the
depth (i.e., the number of channels).
1. The pooling layer works by dividing the input feature map into a set of
non- overlapping regions, called pooling regions. Each pooling region is then
transformed into a single output value, which represents the presence of a particular
feature in that region. The most common types of pooling operations are max
pooling and average pooling.
2. In max pooling, the output value for each pooling region is simply the
maximum value of the input values within that region. This has the effect of
preserving the most salient features in each pooling region, while discarding
less relevant information. Max pooling is often used in CNNs for object
recognition tasks, as it helps to identify the most distinctive features of an object,
such as its edges and corners.
3. In average pooling, the output value for each pooling region is the average
of the input values within that region. This has the effect of preserving more
information than max pooling, but may also dilute the most salient features.
Average pooling is often used in CNNs for tasks such as image segmentation and
object detection, where a more fine-grained representation of the input is required.
Pooling layers are typically used in conjunction with convolutional layers in a CNN,
with each pooling layer reducing the spatial dimensions of the feature maps,
while the convolutional layers extract increasingly complex features from the input.
The resulting feature maps are then passed to a fully connected layer, which
performs the final classification or regression task.
Advantages of Pooling Layer:
1. Dimensionality reduction: The main advantage of pooling layers is that they help
in reducing the spatial dimensions of the feature maps. This reduces the
computational cost and also helps in avoiding overfitting by reducing the number of
parameters in the model.
2. Translation invariance: Pooling layers are also useful in achieving
translation invariance in the feature maps. This means that the position of an object
in the image does not affect the classification result, as the same features are
detected regardless of the position of the object.
3. Feature selection: Pooling layers can also help in selecting the most important
features from the input, as max pooling selects the most salient features and average
pooling preserves more information.
Disadvantages of Pooling Layer:
1. Information loss: One of the main disadvantages of pooling layers is that they
discard some information from the input feature maps, which can be important
for the final classification or regression task.
2. Over-smoothing: Pooling layers can also cause over-smoothing of the feature
maps, which can result in the loss of some fine-grained details that are
important for the final classification or regression task.
3. Hyperparameter tuning: Pooling layers also introduce hyperparameters such as
the size of the pooling regions and the stride, which need to be tuned in order to
achieve optimal performance. This can be time-consuming and requires some
expertise in model building.
Recurrent Neural Networks (RNN)
RNNs have the same input and output architecture as any other deep neural architecture. However, differences arise in the way information flows from input to output. Unlike deep neural networks, where we have a different weight matrix for each dense layer, in an RNN the weights across the network remain the same. The network calculates a hidden state ht for every input xt, using the following formulas:
ht = σ(U·xt + W·ht-1 + b)
Yt = O(V·ht + c)
Hence
Yt = f(xt, ht-1, W, U, V, b, c)
Here the state matrix S has element si as the state of the network at time step i.
The parameters W, U, V, b, c are shared across time steps.
The Recurrent Neural Network consists of multiple fixed activation function units, one
for each time step. Each unit has an internal state which is called the hidden state of the
unit. This hidden state signifies the past knowledge that the network currently holds
at a given time step. This hidden state is updated at every time step to signify the change
in the knowledge of the network about the past. The hidden state is updated using the
following recurrence relation:
ht = f(ht-1, xt)
where:
ht -> current hidden state
ht-1 -> previous hidden state
xt -> input at time step t
Yt -> output at time step t
These parameters are updated using backpropagation. However, since an RNN works on sequential data, we use an updated form of backpropagation known as Backpropagation Through Time (BPTT).
Training through RNN
1. A single-time step of the input is provided to the network.
2. Then calculate its current state using a set of current input and the previous state.
3. The current ht becomes ht-1 for the next time step.
4. One can go through as many time steps as the problem requires and join the information from all the previous states.
5. Once all the time steps are completed the final current state is used to calculate the
output.
6. The output is then compared to the actual (target) output and the error is generated.
7. The error is then back-propagated through the network to update the weights, and hence the network (RNN) is trained using backpropagation through time, as sketched below.
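A small sketch of this per-time-step update, ht = σ(U·xt + W·ht-1 + b), with illustrative dimensions and random toy data:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2              # illustrative dimensions

U = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden weights
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden weights (shared across steps)
V = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output weights
b, c = np.zeros(hidden_size), np.zeros(output_size)

sequence = rng.normal(size=(5, input_size))                 # 5 time steps of toy input
h = np.zeros(hidden_size)                                   # initial hidden state

for x_t in sequence:
    h = sigmoid(U @ x_t + W @ h + b)                        # ht = sigma(U xt + W ht-1 + b)
    y = V @ h + c                                           # Yt = V ht + c
print(y)                                                    # output after the final time step
```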
Confusion Matrix
o For 2 prediction classes of a classifier, the matrix is a 2*2 table; for 3 classes, it is a 3*3 table, and so on.
o The matrix is divided into two dimensions, predicted values and actual values, along with the total number of predictions.
o Predicted values are the values predicted by the model, and actual values are the true values for the given observations.
o It looks like the below table:

                  Actual: Yes            Actual: No
Predicted: Yes    True Positive (TP)     False Positive (FP)
Predicted: No     False Negative (FN)    True Negative (TN)
o True Negative: The model has predicted No, and the real or actual value was also No.
o True Positive: The model has predicted Yes, and the actual value was also Yes.
o False Negative: The model has predicted No, but the actual value was Yes; it is also called a Type-II error.
o False Positive: The model has predicted Yes, but the actual value was No; it is also called a Type-I error.
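A short sketch of building such a matrix with scikit-learn (the actual and predicted label lists are made-up examples):

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels for a binary classifier (Yes = 1, No = 0)
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(actual, predicted)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```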
Precision and Recall
While building any machine learning model, the first thing that comes to our mind is how we can build an accurate and 'good fit' model, and what challenges will come up during the entire procedure. Precision and Recall are two of the most important yet confusing concepts in machine learning. They are performance metrics used for pattern recognition and classification. These concepts are essential to building a good machine learning model that gives precise and accurate results. Some models in machine learning require more precision and some require more recall, so it is important to know the balance between precision and recall, or simply the precision-recall trade-off.
Accuracy
It's the ratio of the correctly labeled subjects to the whole pool of subjects. Accuracy answers the following question: how many subjects did we correctly label out of all the subjects?
Accuracy = (TP+TN)/(TP+FP+FN+TN)
numerator: all correctly labeled subjects (all trues)
denominator: all subjects
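Continuing the made-up example from the confusion-matrix sketch above, accuracy (together with precision and recall) can be computed from those counts:

```python
# TP/TN/FP/FN counts taken from the made-up confusion-matrix example above
tp, tn, fp, fn = 3, 3, 1, 1

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # correctly labeled subjects out of all subjects
precision = tp / (tp + fp)                    # of all predicted Yes, how many were actually Yes
recall    = tp / (tp + fn)                    # of all actual Yes, how many did the model catch
print(accuracy, precision, recall)            # 0.75 0.75 0.75
```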