ANN Material

Module 1 covers the fundamentals of neural learning, detailing how artificial neural networks (ANNs) mimic human brain processing to solve problems through interconnected neurons. It explains the structure of neural networks, including layers, weights, biases, and activation functions, as well as the history and types of learning algorithms such as supervised, unsupervised, and reinforced learning. Additionally, it discusses the importance of data preprocessing and feature engineering in preparing data for machine learning models.

Module 1

Fundamentals of Neural Learning

Introduction
Neural Computers mimic certain processing capabilities of the human brain.
- Neural Computing is an information processing paradigm, inspired by biological
systems, composed of a large number of highly interconnected processing elements
(neurons) working in unison to solve specific problems.
- Artificial Neural Networks (ANNs), like people, learn by example.
- An ANN is configured for a specific application, such as pattern recognition or data
classification, through a learning process.
- Learning in biological systems involves adjustments to the synaptic connections
that exist between the neurons.

Definition of Neural Networks

A Neural Network or Neural Net is a system of interconnected processing units
called neurons.

Artificial Neural Networks (ANN) or Neural Networks are an integral part of Artificial
Intelligence and the foundation of Deep Learning. An ANN is a computational
architecture consisting of neurons that mathematically represent how a biological
neural network operates to identify and recognize relationships within the data.

Essentially, neural networks are non-linear machine learning models, which can be
used for both supervised and unsupervised learning. Neural networks are also seen
as a set of algorithms, modeled loosely on the human brain and built to identify
patterns.

The Basic Concept of Artificial Neural Networks

An artificial neural network (ANN) is a computing system designed to simulate how
the human brain analyzes and processes information. It is the foundation of
artificial intelligence (AI) and solves problems that would prove impossible or
difficult by human or statistical standards.

Artificial Neural Networks are primarily designed to mimic and simulate the
functioning of the human brain. Using a mathematical structure, an ANN is
constructed to replicate the behavior of biological neurons.

A human brain has a decision-making process: it sees or gets exposed to
information through the five sense organs; this information gets stored, the brain
correlates the registered piece of information with any previous learning, and makes
certain decisions accordingly.

The concept of ANN follows the same process as that of a natural neural net. The
objective of ANN is to make the machines or systems understand and ape how a
human brain makes a decision and then ultimately takes action.
What is Neural Net ?
• A neural net is an artificial representation of the human brain that tries to
simulate its learning process. An artificial neural network (ANN) is often called a
"Neural Network" or simply Neural Net (NN).
• Traditionally, the term neural network referred to a network of biological
neurons in the nervous system that process and transmit information.
• Artificial neural network is an interconnected group of artificial neurons that uses
a mathematical model or computational model for information processing based on
a connectionist approach to computation.
• The artificial neural networks are made of interconnecting artificial neurons which
may share some properties of biological neural networks.
• Artificial Neural network is a network of simple processing elements (neurons)
which can exhibit complex global behavior, determined by the connections between
the processing elements and element parameters.

Neural networks follow a different paradigm for computing.

The von Neumann machines are based on the processing/memory abstraction of
human information processing.
The neural networks are based on the parallel architecture of biological brains.
■ Neural networks are a form of multiprocessor computer system, with
- simple processing elements,
- a high degree of interconnection,
- simple scalar messages, and
- adaptive interaction between elements.

History

McCulloch and Pitts (1943) are generally recognized as the designers of the first
neural network. They combined many simple processing units together that could
lead to an overall increase in computational power. They suggested many ideas,
such as: a neuron has a threshold level and once that level is reached the neuron
fires. This is still the fundamental way in which ANNs operate. The McCulloch and
Pitts network had a fixed set of weights.
Hebb (1949) developed the first learning rule: if two neurons are active at
the same time then the strength of the connection between them should be increased.

Neural networks regained importance in 1985-86. The researchers Parker and
LeCun discovered a learning algorithm for multi-layer networks called back
propagation that could solve problems that were not linearly separable.
Biological Neuron Model
The human brain consists of a large number (more than a billion) of neural cells that
process information. Each cell works like a simple processor. It is the massive
interaction between all cells and their parallel processing that makes the brain's
abilities possible.
Dendrites are branching fibers that extend from the cell body or soma. Soma or cell
body of a neuron contains the nucleus and other structures that support chemical
processing and the production of neurotransmitters.
Axon is a singular fiber that carries information away from the soma to the synaptic
sites of other neurons (dendrites and somas), muscles, or glands.
Axon hillock is the site of summation for incoming information. At any moment, the
collective influence of all neurons that conduct impulses to a given neuron will
determine whether or not an action potential will be initiated at the axon hillock and
propagated along the axon.
Myelin Sheath consists of fat-containing cells that insulate the axon from electrical
activity. This insulation acts to increase the rate of transmission of signals. A gap
exists between each myelin sheath cell along the axon. Since fat inhibits the
propagation of electricity, the signals jump from one gap to the next.
Nodes of Ranvier are the gaps (about 1 μm) between the myelin sheath cells along
the axon. Since fat serves as a good insulator, the myelin sheaths speed the rate of
transmission of an electrical impulse along the axon.
Synapse is the point of connection between two neurons or a neuron and a muscle
or a gland. Electrochemical communication between neurons takes place at these
junctions.
Terminal Buttons of a neuron are the small knobs at the end of an axon that release
chemicals called neurotransmitters.

Information flow in a Neural Cell

Dendrites receive activation from other neurons.
Soma processes the incoming activations and converts them into output activations.
Axons act as transmission lines to send activation to other neurons.
Synapses, the junctions, allow signal transmission between the axons and dendrites.
The process of transmission is by diffusion of chemicals called neurotransmitters.

Key Components of Neural Networks

a. Neurons (Nodes):

 Neurons are the basic units of a neural network.
 Each neuron takes inputs, applies weights and biases, performs a
computation, and passes the output through an activation function.

b. Layers:

Neural networks are composed of three types of layers:

1. Input Layer: Receives raw data and passes it to the next layer.
2. Hidden Layers: Perform computations and feature transformations using
weights, biases, and activation functions.
3. Output Layer: Produces the final result (e.g., classification, regression value).

c. Weights and Biases:

 Weights: Represent the strength of connections between neurons.
 Biases: Allow the model to shift the activation function, enabling better
learning.

d. Activation Functions:

Functions that introduce non-linearity into the model, allowing it to solve complex
problems. Common types:

 Sigmoid: σ(x) = 1 / (1 + e^(−x)) (used for probabilities).
 ReLU (Rectified Linear Unit): f(x) = max(0, x) (common in deep learning).
 Tanh: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)) (outputs values between -1 and 1).
 Softmax: Used for multi-class classification problems.
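The following is a minimal NumPy sketch of these activation functions; the function names and the example input vector are illustrative only.

import numpy as np

def sigmoid(x):
    # Squashes inputs into (0, 1); useful for probabilities.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Passes positive values through, zeroes out negatives.
    return np.maximum(0.0, x)

def tanh(x):
    # Zero-centered squashing into (-1, 1).
    return np.tanh(x)

def softmax(z):
    # Converts a score vector into a probability distribution.
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), relu(x), tanh(x), softmax(x))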

Model of Artificial Neuron

[Figure: model of an artificial neuron - the inputs are scaled by weights (w1, w2, ...),
summed, and passed through an activation function to produce the output.]
Neural Network Architectures

An Artificial Neural Network (ANN) is a data processing system consisting of a large
number of simple, highly interconnected processing elements (artificial neurons) in a
network structure that can be represented using a directed graph G, an ordered
2-tuple (V, E), consisting of a set V of vertices and a set E of edges.
- The vertices may represent neurons (input/output) and the edges may represent
the synaptic links that carry the weights.

[Figure: example directed graph with vertices V = { v1, v2, v3, v4, v5 } and edges
E = { e1, e2, e3, e4, e5 }.]

Single Layer Feed-forward Network

The Single Layer Feed-forward Network consists of a single layer of weights,
where the inputs are directly connected to the outputs via a series of
weights. The synaptic links carrying weights connect every input to every
output, but not the other way round. This way it is considered a network of
feed-forward type. The sum of the products of the weights and the inputs is
calculated in each neuron node, and if the value is above some threshold
(typically 0) the neuron fires and takes the activated value (typically 1);
otherwise it takes the deactivated value (typically -1).

[Figure: single layer feed-forward network with weights w11, w12, ..., wnm
connecting n inputs to m outputs.]
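As a sketch of the computation just described, with hypothetical weights and inputs and thresholding at 0 to produce +1/-1 outputs:

import numpy as np

def single_layer_forward(x, W, threshold=0.0):
    # Weighted sum for each output neuron: one column of W per output.
    net = x @ W
    # Fire (+1) if the net input exceeds the threshold, otherwise -1.
    return np.where(net > threshold, 1, -1)

x = np.array([0.5, -1.0, 2.0])          # n = 3 inputs (illustrative values)
W = np.array([[ 0.2, -0.4],             # weights w11, w12
              [ 0.7,  0.1],             # weights w21, w22
              [-0.3,  0.5]])            # weights w31, w32 (m = 2 outputs)
print(single_layer_forward(x, W))       # e.g. [-1  1]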
Multi Layer Feed-forward Network

As the name suggests, it consists of multiple layers. The architecture of this
class of network, besides having the input and the output layers, also has
one or more intermediary layers called hidden layers. The computational
units of the hidden layers are known as hidden neurons.

Recurrent Networks

The Recurrent Networks differ from the feed-forward architecture. A recurrent
network has at least one feedback loop.

[Figure: recurrent network with input layer neurons xi, a hidden layer, and an
output layer producing y1 ... ym, with feedback connections.]

Classification of Learning Algorithms

The figure below indicates the hierarchical representation of the learning
algorithms. These algorithms are explained in the subsequent sections.

[Figure: hierarchy of neural network learning algorithms -
Supervised Learning (error based): Stochastic, Least Mean Square, Back Propagation;
Reinforced Learning (output based);
Unsupervised Learning: Hebbian, Competitive.]

Supervised Learning

A teacher is present during the learning process and presents the expected output.
Every input pattern is used to train the network.
The learning process is based on comparison between the network's computed
output and the correct expected output, generating an "error".
The "error" generated is used to change network parameters, which results in
improved performance.

Unsupervised Learning

- No teacher is present.
- The expected or desired output is not presented to the network.
- The system learns on its own by discovering and adapting to the
structural features in the input patterns.
Reinforced Learning
- A teacher is present but does not present the expected or
desired output; it only indicates whether the computed output is
correct or incorrect.
- The information provided helps the network in its learning process.
- A reward is given for a correct answer computed and a penalty for a
wrong answer.

Regression vs. Classification in Machine Learning

Regression and Classification algorithms are Supervised Learning algorithms. Both
algorithms are used for prediction in Machine Learning and work with
labeled datasets. But the difference between the two is how they are used for
different machine learning problems.

The main difference between Regression and Classification algorithms is that
Regression algorithms are used to predict continuous values such as price,
salary, age, etc., while Classification algorithms are used to predict/classify
discrete values such as Male or Female, True or False, Spam or Not Spam, etc.
Classification:

Classification is a process of finding a function which helps in dividing the dataset
into classes based on different parameters. In Classification, a computer program is
trained on the training dataset and, based on that training, it categorizes the data
into different classes.

Types of ML Classification Algorithms:

Classification Algorithms can be further divided into the following types:

o Logistic Regression
o K-Nearest Neighbours
o Support Vector Machines
o Kernel SVM
o Naive Bayes
o Decision Tree Classification
o Random Forest Classification

Regression:

Regression is a process of finding the correlations between dependent and
independent variables. It helps in predicting continuous variables such as
the prediction of market trends, prediction of house prices, etc.

The task of the Regression algorithm is to find the mapping function to map the
input variable(x) to the continuous output variable(y).
Example: Suppose we want to do weather forecasting, so for this, we will use the
Regression algorithm. In weather prediction, the model is trained on the past data,
and once the training is completed, it can easily predict the weather for future days.

Types of Regression Algorithm:

o Simple Linear Regression
o Multiple Linear Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression

Difference between Regression and Classification

Regression Algorithm vs. Classification Algorithm:
- In Regression, the output variable must be of continuous nature or real value;
  in Classification, the output variable must be a discrete value.
- The task of the regression algorithm is to map the input value (x) to a
  continuous output variable (y); the task of the classification algorithm is to
  map the input value (x) to a discrete output variable (y).
- Regression algorithms are used with continuous data; classification algorithms
  are used with discrete data.
- In Regression, we try to find the best fit line, which can predict the output
  more accurately; in Classification, we try to find the decision boundary, which
  can divide the dataset into different classes.
- Regression algorithms can be used to solve regression problems such as weather
  prediction, house price prediction, etc.; classification algorithms can be used
  to solve classification problems such as identification of spam emails, speech
  recognition, identification of cancer cells, etc.
- Regression algorithms can be further divided into Linear and Non-linear
  Regression; classification algorithms can be divided into Binary Classifiers and
  Multi-class Classifiers.
What is Data Preprocessing?

Data preprocessing is the process of cleaning and preparing the raw data to enable
feature engineering. After getting large volumes of data from sources like
databases, object stores, and data lakes, engineers prepare the data so data
scientists can create features. This includes basic cleaning, crunching, and joining
different sets of raw data. In an operational environment this preprocessing would
run as an ETL job for batch processing, or it could be part of a stream processing
pipeline for live data.

Once the data is ready for the data scientist, the feature engineering part begins.

Data Preprocessing
Normalization: Normalization is the process of scaling numeric features to a
standard range, typically between 0 and 1. This ensures that all features contribute
equally to the model, preventing one dominant feature from overshadowing others.
Encoding: Categorical data, such as gender or country names, needs to be
converted into numerical format for machine learning algorithms. Encoding
techniques like one-hot encoding or label encoding transform categorical variables
into a format that algorithms can understand.
Handling Missing Data: Dealing with missing data is essential for robust model
performance. Strategies include removing rows with missing values, imputing
missing values with statistical measures, or using advanced techniques like
machine learning-based imputation.
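A minimal scikit-learn sketch of these three steps (mean imputation, min-max normalization, and one-hot encoding); the column names and toy values are illustrative only.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Toy dataset: a numeric column with a missing value and a categorical column.
df = pd.DataFrame({"age": [25.0, 32.0, None, 51.0],
                   "country": ["IN", "US", "IN", "UK"]})

# Handling missing data: impute the missing age with the column mean.
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()

# Normalization: scale the numeric feature into the range [0, 1].
df["age"] = MinMaxScaler().fit_transform(df[["age"]]).ravel()

# Encoding: one-hot encode the categorical feature into numeric columns.
onehot = OneHotEncoder().fit_transform(df[["country"]]).toarray()

print(df)
print(onehot)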

What is Feature Engineering?

Feature engineering is the creation of features from raw data. Feature engineering
includes:

 Determining the required features for the ML model
 Analysis for understanding statistics, distribution, implementing one-hot
encoding and imputation, and more. Tools like Python and Python libraries are
used.
 Preparing features for ML model consumption
 Building the models
 Testing if the features achieve what is needed
 Repeating the preparation and testing process, by running experiments with
different features, adding, removing and changing features. During the
process, the data scientist might find out data is missing from the sources.
The data scientist will request preprocessing again from the data engineer.
 Deployment to the ML pipeline

Feature Engineering

Creation of Derived Features: Feature engineering involves creating new
features that enhance the predictive power of the model. For example, extracting
the day of the week from a date or creating interaction terms between existing
features can provide valuable information.
Dimensionality Reduction: High-dimensional datasets may suffer from the curse of
dimensionality, leading to increased computational complexity and potential
overfitting. Techniques like Principal Component Analysis (PCA) help reduce
dimensionality while preserving essential information.
Handling Outliers: Outliers can distort model training, and addressing them is
crucial. Techniques such as trimming, winsorizing, or transforming features can
mitigate the impact of outliers on model performance.

Feature extraction:

 The feature extraction algorithms transform the data onto a new feature space.

 It is used when it is important to derive useful information from the data, so that
creating a new feature subspace does not adversely affect the model.

 Used to improve the predictive performance of the models.

Two categories of Feature extraction:

1. Linear
It assumes that the data falls on a linear subspace or that classes of data can be
distinguished linearly.

2. Non-linear
It assumes that the pattern of data is more complex and exists on a non-linear
sub-manifold.

Unsupervised Feature Extraction:

They mostly concentrate on the variation and distribution of data.

1. PCA:

 Linear unsupervised method.

 The aim of PCA is to find orthogonal directions which represent the data with the
least error.

 PCA tries to maximize the variance of the projected data in order to find the most
variant orthonormal directions of the data.

 The desired directions are the eigenvectors of the covariance matrix of the data.
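A small NumPy sketch of this idea, computing the principal directions as eigenvectors of the covariance matrix of centred data; the random toy data and the choice of 2 retained components are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # toy data: 200 samples, 5 features

# Centre the data and compute its covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors of the covariance matrix give the orthonormal principal directions;
# the largest eigenvalues correspond to the most variant directions.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]     # keep the top-2 directions

# Project the data onto the reduced 2-dimensional subspace.
X_reduced = Xc @ components
print(X_reduced.shape)                 # (200, 2)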

2. Kernel Principal Component Analysis:

 KPCA finds the non-linear subspace of the data, which is useful if the data pattern
is not linear.

 Kernel PCA uses the kernel method, which maps data to a higher dimensional
space.

 Kernel PCA relies on the blessing of dimensionality by using kernels, i.e., it
assumes that in higher dimensions the representation or discrimination of data is
easier.

There are so many other unsupervised feature extraction techniques like:

1. Dual PCA

2. Multidimensional Scaling

3. Isomap

4. Locally linear embedding

5. Laplacian Eigenmap

6. Maximum variance unfolding

7. Autoencoders and Neural Networks

8. T-distributed stochastic neighbor embedding


Curse of Dimensionality

The Curse of Dimensionality in Machine Learning arises when working with high-
dimensional data, leading to increased computational complexity, overfitting, and
spurious correlations. Techniques like dimensionality reduction, feature selection,
and careful model design are essential for mitigating its effects and improving
algorithm performance. Navigating this challenge is crucial for unlocking the
potential of high-dimensional datasets and ensuring robust machine-learning
solutions.

What is the Curse of Dimensionality?


 The Curse of Dimensionality refers to the phenomenon where the efficiency and
effectiveness of algorithms deteriorate rapidly as the dimensionality of the data
increases, because the amount of data needed to sample the space grows
exponentially with the number of dimensions.
 In high-dimensional spaces, data points become sparse, making it challenging to
discern meaningful patterns or relationships due to the vast amount of data
required to adequately sample the space.
 The Curse of Dimensionality significantly impacts machine learning algorithms in
various ways. It leads to increased computational complexity, longer training
times, and higher resource requirements. Moreover, it escalates the risk of
overfitting and spurious correlations, hindering the algorithms' ability to
generalize well to unseen data.
How to Overcome the Curse of Dimensionality?
To overcome the curse of dimensionality, you can consider the following
strategies:

Dimensionality Reduction Techniques:

 Feature Selection: Identify and select the most relevant features from the
original dataset while discarding irrelevant or redundant ones. This reduces the
dimensionality of the data, simplifying the model and improving its efficiency.
 Feature Extraction: Transform the original high-dimensional data into a lower-
dimensional space by creating new features that capture the essential
information. Techniques such as Principal Component Analysis (PCA) and t-
distributed Stochastic Neighbor Embedding (t-SNE) are commonly used for
feature extraction.

Data Preprocessing:

 Normalization: Scale the features to a similar range to prevent certain features
from dominating others, especially in distance-based algorithms.
 Handling Missing Values: Address missing data appropriately through imputation
or deletion to ensure robustness in the model training process.

Training the classifiers

Training Before Dimensionality Reduction: Train a Random Forest classifier
(clf_before) on the original scaled features (X_train_scaled) without dimensionality
reduction.
Evaluation Before Dimensionality Reduction: Make predictions (y_pred_before) on
the test set (X_test_scaled) using the classifier trained before dimensionality
reduction, and calculate the accuracy (accuracy_before) of the model.
Training After Dimensionality Reduction: Train a new Random Forest classifier
(clf_after) on the reduced feature set (X_train_pca) after dimensionality reduction.
Evaluation After Dimensionality Reduction: Make predictions (y_pred_after) on the
test set (X_test_pca) using the classifier trained after dimensionality reduction, and
calculate the accuracy (accuracy_after) of the model.
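A hedged scikit-learn sketch that follows these four steps; the dataset (the built-in digits data), the scaler, and the choice of 20 principal components are assumptions, since the original code is not shown here.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale the features.
scaler = StandardScaler().fit(X_train)
X_train_scaled, X_test_scaled = scaler.transform(X_train), scaler.transform(X_test)

# Training and evaluation before dimensionality reduction.
clf_before = RandomForestClassifier(random_state=42).fit(X_train_scaled, y_train)
y_pred_before = clf_before.predict(X_test_scaled)
accuracy_before = accuracy_score(y_test, y_pred_before)

# Reduce the features with PCA, then train and evaluate again.
pca = PCA(n_components=20).fit(X_train_scaled)
X_train_pca, X_test_pca = pca.transform(X_train_scaled), pca.transform(X_test_scaled)
clf_after = RandomForestClassifier(random_state=42).fit(X_train_pca, y_train)
y_pred_after = clf_after.predict(X_test_pca)
accuracy_after = accuracy_score(y_test, y_pred_after)

print(f"accuracy before PCA: {accuracy_before:.3f}, after PCA: {accuracy_after:.3f}")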

Polynomial Curve Fitting


Any pattern recognition or machine learning task can be primarily divided into two
categories: Supervised and Unsupervised Learning. In a supervised machine
learning problem, we have the input and the corresponding desired output. For
any supervised learning problem, the aim of the pattern recognition algorithm is to
come up with an algorithm or model which can predict the output given the input.
Based on the output, the supervised learning problem can be divided into two
categories: Classification (when we have a finite number of discrete outputs)
and Regression (if the desired output consists of one or more continuous variables).
For an unsupervised learning problem, we don't have the desired output for the
input variables. The goal of pattern recognition is: to discover groups of similar
examples within the data (clustering), to determine the distribution of data within
the input space (density estimation), or to project the data from a high-dimensional
space to a low-dimensional space, etc.

For simplicity, we can generate the data for this task from the function sin(2πx)
with random Gaussian noise included in the target variable, i.e., for any input x,
the target is t = sin(2πx) + ε. Let the training set consist of N samples with inputs
x = (x1, x2, …, xN)^T and the corresponding target variables t = (t1, t2, …, tN)^T.
Let the polynomial function used for the prediction, whose order is M, be:

y(x, w) = w0 + w1 x + w2 x^2 + … + wM x^M = Σ_{j=0}^{M} wj x^j

This polynomial function is linear with respect to the coefficients w. The goal of
the pattern recognition task is to minimize the error in predicting t. Or we can say
that we have to minimize some error function which should encode how much we
deviated from the actual value while doing the prediction. One common
choice of error function is:

E(w) = (1/2) Σ_{n=1}^{N} [y(xn, w) − tn]^2

The error function is quadratic in w, and hence taking its derivative with respect to
w and equating it to 0 gives us a unique solution w* for the problem.

One of the important parameters in deciding how well the solution will perform on
the unseen data is the order of the polynomial function M. As shown in the figure
below, if we keep on increasing M, we will get a perfect fit on the training data,
driving the training error to E(w*) = 0 (called overfitting), but the prediction
on unseen data will be flawed. The best fit polynomial seems to be the one which
has order M = 3.
Based on these coefficients, one of the techniques which can be used to
compensate for the problem of overfitting is regularization, which involves adding a
penalty term to the error function that discourages the coefficients from getting
larger in magnitude. The modified error function is given as:

E~(w) = (1/2) Σ_{n=1}^{N} [y(xn, w) − tn]^2 + (λ/2) ‖w‖^2

where ‖w‖^2 = w^T w = w0^2 + w1^2 + … + wM^2. This
technique is also called a shrinkage method, and a quadratic regularizer is called
ridge regression.

Another way to reduce overfitting, or to use more complex models for prediction, is
by increasing the sample size of the training data. The same
order M = 9 polynomial is fit on N = 15 and N = 100 data points and the
results are shown in the left and the right figure below. It can be seen that
increasing the number of data points reduces the problem of overfitting.
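A brief NumPy sketch of this setup under the stated assumptions (targets from sin(2πx) plus Gaussian noise, polynomial order M, ridge penalty λ); the specific values N = 15, M = 9, and λ = 1e-3 are illustrative.

import numpy as np

rng = np.random.default_rng(1)
N, M, lam = 15, 9, 1e-3

# Training data: t = sin(2*pi*x) + Gaussian noise.
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

# Design matrix with columns x^0, x^1, ..., x^M.
Phi = np.vander(x, M + 1, increasing=True)

# Regularized least-squares (ridge) solution:
# w = (Phi^T Phi + lam * I)^-1 Phi^T t
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

# Training error E(w) = 0.5 * sum_n (y(x_n, w) - t_n)^2
E = 0.5 * np.sum((Phi @ w - t) ** 2)
print("coefficients:", np.round(w, 2), "training error:", round(E, 4))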

Model Complexity in Machine Learning


What is Model Complexity?
Model complexity is a measure of how well a model can capture the
underlying patterns in the data. In the context of machine learning, model
complexity is often associated with the number of parameters in a model and its
ability to fit both the training data and generalize to new, unseen data.
There are two main aspects of model complexity:
1. Simple Models: Simple models have few parameters, making them less flexible;
they therefore struggle to capture the complexity of the underlying patterns in
the data, leading to underfitting, where the model performs poorly on the
training data as well as on unseen data.
2. Complex Models: Complex models have a larger number of parameters, allowing
them to represent more intricate relationships in the data. While complex
models may perform well on the training data, they tend to overfit.
Modelling complexity can be influenced by several factors:
1. Number of Features: The more attributes or features your model scrutinizes, the
higher its complexity is likely to be. Too many features can potentially magnify
noise and result in overfitting.
2. Model Algorithm: The nature of the algorithm used influences the complexity of
the model. For instance, decision trees are considerably simpler than neural
networks.
3. Hyperparameters: Settings such as the learning rate, number of hidden layers,
and regularization parameters can influence the complexity of a machine
learning model.
Why Model Complexity is Important?
Finding the optimal model complexity is important because:
1. Bias-Variance Tradeoff: Model complexity is closely related to the bias-variance
tradeoff. Simple models may have high bias (systematic errors), while complex
models may have high variance (sensitivity to small fluctuations in the training
data). Finding the right level of complexity involves managing this tradeoff to
achieve good predictive performance.
2. Computational Resources: Complex models often require more computational
resources for training and inference. The choice of model complexity may be
influenced by practical considerations such as available computing power and
time constraints.
3. Interpretability: Simple models are often more interpretable, making it easier to
understand and explain their decision-making processes. In some cases,
interpretability is crucial, especially in sensitive applications where decisions
impact individuals' lives.
How to Avoid Model Complexity and Overfitting?
Addressing model complexity and overfitting is critical to achieving robust
machine learning models. Here are some strategies:
1. Regularization: Regularization techniques introduce penalties for complexity in
the loss function of the model, which discourages learning overly complex model
parameters and thus reduces overfitting. L1 and L2 regularization are common
methods to control the magnitude of coefficients, preventing the model from
becoming overly complex.
2. Cross-validation: Cross-validation is a technique that assesses model
generalization and provides a realistic measure of how well the model is likely to
perform on unseen data, helping to assess its level of complexity and overfitting.
3. Reducing Features: By minimizing the number of input features, we could lower
the complexity, and thus, prevent overfitting.
4. Use of Ensemble Models: Combining predictions from multiple diverse models
can often lead to better performance and reduced risk of overfitting compared to
relying on a single model. This is because individual models may have unique
strengths and weaknesses, and averaging their predictions can lead to a more
robust and generalizable result.
5. Early Stopping: By monitoring the validation error during training, we can stop
the training process when the validation error starts to increase, even if the
training error continues to decrease. This prevents the model from learning
irrelevant patterns in the training data that could lead to overfitting.
6. Split the dataset into training and testing Data: Splitting your dataset is crucial
because it ensures the model doesn't simply memorize the training data and can
generalize to unseen examples.
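A short scikit-learn sketch of two of these strategies, L2 regularization and cross-validation, on a toy regression problem; the generated dataset and the alpha values compared are assumptions for illustration.

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Toy regression data with many features but little informative signal,
# which makes an under-regularized model prone to overfitting.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

# Compare model complexity via the L2 penalty strength alpha:
# a larger alpha shrinks the coefficients and reduces complexity.
for alpha in [0.01, 1.0, 10.0]:
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5, scoring="r2")
    print(f"alpha={alpha:5.2f}  mean CV R^2 = {scores.mean():.3f}")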
Multivariate Non-linear Functions in Neural Networks

1 Linearity
A neural network is only non-linear if you squash the output signal from the nodes
with a non-linear activation function. A complete neural network (with non-linear
activation functions) is an arbitrary function approximator.

Bonus: It should be noted that if you are using linear activation functions in multiple
consecutive layers, you could just as well have pruned them down to a single layer
due to them being linear. (The weights would be changed to more extreme values).
Creating a network with multiple layers using linear activation functions would not
be able to model more complicated functions than a network with a single layer.
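The following NumPy sketch illustrates this point numerically: two consecutive linear (identity-activation) layers with hypothetical random weights are exactly equivalent to one linear layer whose weight matrix is their product.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # a batch of 4 inputs with 3 features

W1 = rng.normal(size=(3, 5))       # first linear layer (no non-linearity)
W2 = rng.normal(size=(5, 2))       # second linear layer (no non-linearity)

two_layers = (x @ W1) @ W2         # output of the two stacked linear layers
one_layer = x @ (W1 @ W2)          # a single layer with the combined weights

# The two computations agree, so stacking linear layers adds no expressive power.
print(np.allclose(two_layers, one_layer))   # True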

2 Activation signal
The squashed output signal could very well be interpreted as the
strength of this signal (biologically speaking), though it might be incorrect to
interpret the output strength as an equivalent of confidence as in fuzzy logic.

3 Non-linear activation functions

The input signals along with their respective weights form a
linear combination. The non-linearity comes from the selection of activation
functions. Remember that a linear function is drawn as a line; sigmoid, tanh, ReLU
and so on cannot be drawn as a single straight line.

Why do we need non-linear activation functions?

Most functions and classification tasks are probably best described by non-linear
functions. If we decided to use linear activation functions, we would end up with a
much coarser approximation of a complex function.

Universal approximators

1. What Are Multivariate Non-linear Functions?

 Multivariate non-linear functions involve multiple variables and exhibit non-
linear relationships between their inputs and outputs.
 In neural networks, these functions model complex relationships between
features in the data that cannot be captured by simple linear equations.

2. Role in Neural Networks

 Neural networks use non-linear activation functions to transform linear
combinations of inputs into non-linear mappings.
 This non-linearity allows the network to approximate complex decision
boundaries and solve problems like image recognition, speech processing,
and natural language tasks.
3. Mathematical Formulation

Given input features x1, x2, …, xn, weights w1, w2, …, wn, and bias b:

z = w1 x1 + w2 x2 + … + wn xn + b

 The non-linear transformation f(z) is applied to produce:

y = f(z)

4. Common Non-linear Activation Functions

 Sigmoid: For probabilities (e.g., binary classification). f(z) = 1 / (1 + e^(−z))
 Tanh: For zero-centered outputs. f(z) = (e^z − e^(−z)) / (e^z + e^(−z))
 ReLU: For introducing sparsity. f(z) = max(0, z)
 Softmax: For multi-class classification. f(zi) = e^(zi) / Σ_{j=1}^{N} e^(zj)

5. Importance of Non-linearity

Without non-linearity:

 Neural networks become equivalent to a linear regression model, regardless
of the number of layers.
 They cannot capture the intricate patterns in data required for tasks like
image or speech recognition.

With non-linearity:

 Networks can approximate any continuous function, as stated by the
Universal Approximation Theorem.

Bayes’ Theorem
Bayes' Theorem finds the probability of an event occurring given the probability of
another event that has already occurred. Bayes' theorem is stated mathematically
as the following equation:

P(A|B) = P(B|A) P(A) / P(B)

where A and B are events and P(B) ≠ 0.
 Basically, we are trying to find the probability of event A, given that event B is
true. Event B is also termed as evidence.
 P(A) is the priori of A (the prior probability, i.e. the probability of the event before
the evidence is seen). The evidence is an attribute value of an unknown
instance (here, it is event B).
 P(B) is the Marginal Probability: the probability of the evidence.
 P(A|B) is the a posteriori probability of A, i.e. the probability of the event after the
evidence is seen.
 P(B|A) is the Likelihood probability, i.e. the likelihood that a hypothesis will come
true based on the evidence.
Now, with regards to our dataset, we can apply Bayes' theorem in the following way:

P(y|X) = P(X|y) P(y) / P(X)

where y is the class variable and X is a dependent feature vector (of size n) where:

X = (x1, x2, x3, ….., xn)

Just to be clear, an example of a feature vector and corresponding class variable can
be: (refer to the 1st row of the dataset)
X = (Rainy, Hot, High, False)
y = No
Assumption of Naive Bayes
The fundamental Naive Bayes assumption is that each feature makes an
independent and equal contribution to the outcome. More specifically:
 Feature independence: The features of the data are conditionally independent of
each other, given the class label.
 Continuous features are normally distributed: If a feature is continuous, then it is
assumed to be normally distributed within each class.
 Discrete features have multinomial distributions: If a feature is discrete, then it
is assumed to have a multinomial distribution within each class.
 Features are equally important: All features are assumed to contribute equally to
the prediction of the class label.
 No missing data: The data should not contain any missing values.
Naive Bayes Classifiers
Naive Bayes classifiers are a family of algorithms based on Bayes' Theorem.
Despite the "naive" assumption of feature independence, these classifiers are
widely utilized for their simplicity and efficiency in machine learning. This section
delves into theory, implementation, and applications, shedding light on their
practical utility despite oversimplified assumptions.
Why is it Called Naive Bayes?
The "Naive" part of the name indicates the simplifying assumption made by the
Naive Bayes classifier. The classifier assumes that the features used to describe
an observation are conditionally independent, given the class label. The "Bayes"
part of the name refers to Reverend Thomas Bayes, an 18th-century statistician
and theologian who formulated Bayes' theorem.
Consider a fictional dataset that describes the weather conditions for playing a
game of golf. Given the weather conditions, each tuple classifies the conditions as
fit ("Yes") or unfit ("No") for playing golf.
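A hedged scikit-learn sketch of a Naive Bayes classifier on a tiny, made-up version of such a weather dataset; the label encoding, the example rows, and the choice of CategoricalNB are assumptions for illustration.

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Tiny made-up weather dataset, label-encoded:
# outlook: 0=Rainy, 1=Overcast, 2=Sunny; temp: 0=Hot, 1=Mild, 2=Cool;
# humidity: 0=High, 1=Normal; windy: 0=False, 1=True.
X = np.array([[0, 0, 0, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0],
              [2, 1, 0, 0],
              [2, 2, 1, 0],
              [2, 2, 1, 1],
              [1, 2, 1, 1],
              [0, 1, 0, 0]])
y = np.array(["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"])

# Fit a categorical Naive Bayes model: P(y|X) is proportional to P(y) * prod_i P(x_i|y).
clf = CategoricalNB().fit(X, y)

# Predict for a new day: Sunny, Cool, High humidity, Windy.
print(clf.predict([[2, 2, 0, 1]]), clf.predict_proba([[2, 2, 0, 1]]))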
Advantages of Naive Bayes Classifier
 Easy to implement and computationally efficient.
 Effective in cases with a large number of features.
 Performs well even with limited training data.
 It performs well in the presence of categorical features.
 For numerical features, the data is assumed to come from normal distributions.
Disadvantages of Naive Bayes Classifier
 Assumes that features are independent, which may not always hold in real-world
data.
 Can be influenced by irrelevant attributes.
 May assign zero probability to unseen events, leading to poor generalization.
Applications of Naive Bayes Classifier
 Spam Email Filtering: Classifies emails as spam or non-spam based on features.
 Text Classification: Used in sentiment analysis, document categorization, and
topic classification.
 Medical Diagnosis: Helps in predicting the likelihood of a disease based on
symptoms.
 Credit Scoring: Evaluates creditworthiness of individuals for loan approval.
 Weather Prediction: Classifies weather conditions based on various factors.

Decision boundary
In a statistical-classification problem with two classes, a decision
boundary or decision surface is a hypersurface that partitions the underlying vector
space into two sets, one for each class. The classifier will classify all the points on
one side of the decision boundary as belonging to one class and all those on the
other side as belonging to the other class.

A decision boundary is a hypersurface in machine learning that delineates the
boundaries of classes. It is the region of the feature space where the model's
prediction shifts from one class to another.

A decision boundary is the region of a problem space in which the output label of
a classifier is ambiguous.[1]

If the decision surface is a hyperplane, then the classification problem is linear, and
the classes are linearly separable.

Decision boundaries are not always clear cut. That is, the transition from one class
in the feature space to another is not discontinuous, but gradual. This effect is
common in fuzzy logic based classification algorithms, where membership in one
class or another is ambiguous.

Decision boundaries can be approximations of optimal stopping boundaries.[2] The
decision boundary is the set of points of that hyperplane that pass through
zero.[3] For example, the angle between a vector and points in a set must be zero
for points that are on or close to the decision boundary.

In neural networks and support vector models

In the case of backpropagation-based artificial neural networks or perceptrons, the
type of decision boundary that the network can learn is determined by the number
of hidden layers the network has. If it has no hidden layers, then it can only learn
linear problems. If it has one hidden layer, then it can learn any continuous
function on compact subsets of Rn, as shown by the universal approximation
theorem, and thus it can have an arbitrary decision boundary.

In particular, support vector machines find a hyperplane that separates the feature
space into two classes with the maximum margin. If the problem is not originally
linearly separable, the kernel trick can be used to turn it into a linearly separable
one, by increasing the number of dimensions. Thus a general hypersurface in a
small dimension space is turned into a hyperplane in a space with much larger
dimensions.
Neural networks try to learn the decision boundary which minimizes the empirical
error, while support vector machines try to learn the decision boundary which
maximizes the empirical margin between the decision boundary and data points.
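A small scikit-learn sketch contrasting the two cases just described: a linear SVM finds a maximum-margin hyperplane, while the kernel trick (here an RBF kernel) handles data that is not linearly separable; the toy datasets are assumptions for illustration.

from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC

# Linearly separable data: a linear SVM finds the maximum-margin hyperplane.
X_lin, y_lin = make_blobs(n_samples=100, centers=2, random_state=0)
linear_svm = SVC(kernel="linear").fit(X_lin, y_lin)
print("linear boundary:  w =", linear_svm.coef_, " b =", linear_svm.intercept_)

# Non-linearly separable data (concentric circles): the RBF kernel implicitly maps
# the points to a higher-dimensional space where a separating hyperplane exists.
X_circ, y_circ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
rbf_svm = SVC(kernel="rbf").fit(X_circ, y_circ)
print("RBF kernel accuracy:", rbf_svm.score(X_circ, y_circ))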

Types of Decision Boundaries


The complexity of the model and the characteristics used determine the kind of
decision boundary learned by a machine learning method. Common decision
boundaries in machine learning include the following:

Linear
A linear decision boundary is a line that demarcates one feature space class from
another.

Non-linear
A non-linear decision boundary is a curve or surface that delineates a set of
categories. Learning non-linear decision boundaries is possible in non-linear models
like decision trees, support vector machines, and neural networks.

Piecewise Linear
Linear segments are joined together to produce a piecewise linear curve, which is
the piecewise linear decision boundary. Piecewise linear decision boundaries may be
learned by both decision trees and random forests.

Clustering
The boundaries between groups of data points in a feature space are called
“clustering decision boundaries.” K-means and DBSCAN are only two examples
of clustering algorithms whose decision limits may be learned.

Probabilistic
A data point’s likelihood of belonging to one group or another is represented by a
border called a probabilistic decision boundary. Probabilistic models may be trained
to learn probabilistic decision boundaries, including Naive Bayes and Gaussian
Mixture Models.

Risk minimization

Risk Minimization refers to the process of minimizing the expected risk in learning
tasks. It involves the analysis and minimization of both empirical risk and functional
risk, with the goal of finding the optimal solution.
We can interpret learning as the outcome of the minimization of the expected risk.
In the previous section, we have analyzed different kinds of loss function, that are
used to construct the expected and the empirical risk. In this section, we discuss
the minimization of the risk with the main purpose of understanding the role of the
chosen loss in the optimal solution. In general, we would like to minimize the
expected risk E(f) as given by definition 2.1.1–(9). This can be carried out by an
elegant analysis based on variational calculus that is given in Section 2.5. Here, we
will concentrate on the minimization of the empirical risk, which is what is used in
the real world. Interestingly, the study of the empirical risk will also indicate the
structure of the minimum of the functional risk.
We begin by discussing the case of regression. Let us consider the class of loss
functions defined by Eq. 2.1.1–(7). We discuss the cases p = 0, 1, 2, and +∞, which
are the most commonly adopted. We will also assume that the marginal probability
density p1 is strictly positive: p1 > 0. The process of learning is converted into the
problem of minimizing

(1) E = E‖Y − f(X)‖_p^p = (1/p) ∫_{X×Y} |y − f(x)|^p dP(x, y).

Empirical Risk Minimization (ERM)

The Empirical Risk Minimization (ERM) principle is a learning paradigm which
consists in selecting the model with minimal average error over the training set.
This so-called training error can be seen as an estimate of the risk (due to the law of
large numbers), hence the alternative name of empirical risk.

By minimizing the empirical risk, we hope to obtain a model with a low value of the
risk. The larger the training set size is, the closer to the true risk the empirical risk
is.

If we were to apply the ERM principle without more care, we would end up learning
by heart, which we know is bad. This issue is more generally related to
the overfitting phenomenon, which can be avoided by restricting the space of
possible models when searching for the one with minimal error. The most severe
and yet common restriction is encountered in the contexts of linear
classification or linear regression. Another approach consists in controlling the
complexity of the model by regularization.

There are five basic techniques of risk management:

Avoidance.
Retention.
Spreading.
Loss Prevention and Reduction.
Transfer (through Insurance and Contracts).
Avoidance: Many times it is not possible to completely avoid risk but the possibility
should not be overlooked. For example, at the height of a thunderstorm, Physical
Plant may not release vehicles for travel until the weather begins to clear, thus
avoiding the risk of auto accidents during severe weather. Some buildings on
campus have had repeated water problems in some areas. By not allowing storage
of records or supplies in those areas, some water damage claims may be avoided.
Retention: At times, based on the likely frequency and severity of the risks
presented, retaining the risk or a portion of the risk may be cost-effective even
though other methods of handling the risk are available. For example, the
University retains the risk of loss to fences, signs, gates and light poles because of
the difficulty of enumerating and evaluating all of these types of structures. When
losses occur, the cost of repairs is absorbed by the campus maintenance budget,
except for those situations involving the negligence of a third party. Although
insurance is available, the University retains the risk of loss to most University
personal property.
Spreading: It is possible to spread the risk of loss to property and persons.
Duplication of records and documents and then storing the duplicate copies in a
different location is an example of spreading risk. A small fire in a single room can
destroy the entire records of a department's operations. Placing people in a large
number of buildings instead of a single facility will help spread the risk of potential
loss of life or injury.
Loss Prevention and Reduction: When risk cannot be avoided, the effect of loss can
often be minimized in terms of frequency and severity. For example, Risk
Management encourages the use of security devices on certain audio visual
equipment to reduce the risk of theft. The University requires the purchase of
health insurance by students who are studying abroad, so that they might avoid
the risk of financial difficulty, should they incur medical expenses in another
country.
Transfer: In some cases risk can be transferred to others, usually by contract.
When outside organizations use University facilities for public events, they must
provide evidence of insurance and name the University as an additional insured
under their policy, thereby transferring the risk of the event from the University to
the facility user. The purchase of insurance is also referred to as a risk transfer
since the policy actually shifts the financial risk of loss, contractually, from the
insured entity to the insurance company. Insurance should be the last option and
used only after all other techniques have been evaluated.
Contracts: Often vendors and service providers will attempt through a contract to
release themselves from all liability for their actions relating to the contract. These
are often referred to as "hold harmless or indemnification" clauses. Due to the
complexity of interpreting these provisions, the President has delegated
contracting authority for the University solely to staff in Contracts & Procurement.
The Office of University Risk Management reviews contracts and agreements as
requested by Contracts & Procurement to identify and assess risks, evaluate
insurance standards, and review hold harmless and indemnification provisions. The
Chancellor's Office requires that the University obtain in most instances not only a
Certificate of Insurance, but also an Endorsement. Collecting these documents is
often the most time consuming aspect of the contracting process.
Density estimation

In statistics, probability density estimation or simply density estimation is the
construction of an estimate, based on observed data, of an unobservable
underlying probability density function. The unobservable density function is
thought of as the density according to which a large population is distributed; the
data are usually thought of as a random sample from that population.

Example

We will consider records of the incidence of diabetes. The following is quoted
verbatim from the data set description:

A population of women who were at least 21 years old, of Pima Indian heritage
and living near Phoenix, Arizona, was tested for diabetes mellitus according
to World Health Organization criteria. The data were collected by the US National
Institute of Diabetes and Digestive and Kidney Diseases. We used the 532
complete records.[2][3]
In this example, we construct three density estimates for "glu"
(plasma glucose concentration), one conditional on the presence of diabetes, the
second conditional on the absence of diabetes, and the third not conditional on
diabetes. The conditional density estimates are then used to construct the
probability of diabetes conditional on "glu".
The "glu" data were obtained from the MASS package[4] of the R programming
language. Within R, ?Pima.tr and ?Pima.te give a fuller account of the data.

The mean of "glu" in the diabetes cases is 143.1 and the standard deviation is
31.26. The mean of "glu" in the non-diabetes cases is 110.0 and the standard
deviation is 24.29. From this we see that, in this data set, diabetes cases are
associated with greater levels of "glu". This will be made clearer by plots of the
estimated density functions.

The first figure shows density estimates of p(glu | diabetes=1), p(glu | diabetes=0),
and p(glu). The density estimates are kernel density estimates using a Gaussian
kernel. That is, a Gaussian density function is placed at each data point, and the
sum of the density functions is computed over the range of the data.
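A minimal SciPy sketch of a Gaussian kernel density estimate of this kind; the sample values are made up, since the Pima "glu" data is not reproduced here.

import numpy as np
from scipy.stats import gaussian_kde

# Made-up plasma glucose readings standing in for the "glu" variable.
glu = np.array([95., 105., 110., 118., 125., 140., 143., 155., 168., 180.])

# Gaussian KDE: a Gaussian bump is placed at each data point and the bumps
# are summed (and normalized) to form the density estimate.
kde = gaussian_kde(glu)

grid = np.linspace(80, 200, 5)
print(np.round(kde(grid), 5))   # estimated density at a few grid points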

From the density of "glu" conditional on diabetes, we can obtain the probability of
diabetes conditional on "glu" via Bayes' rule. For brevity, "diabetes" is abbreviated

"db." in this formula.

The second figure shows the estimated posterior probability p(diabetes=1 | glu).
From these data, it appears that an increased level of "glu" is associated with
diabetes.

Application and purpose

A very natural use of density estimates is in the informal investigation of the
properties of a given set of data. Density estimates can give a valuable indication of
such features as skewness and multimodality in the data. In some cases they will
yield conclusions that may then be regarded as self-evidently true, while in others
all they will do is to point the way to further analysis and/or data collection.[5]
An important aspect of statistics is often the presentation of data back to the client
in order to provide explanation and illustration of conclusions that may possibly
have been obtained by other means. Density estimates are ideal for this purpose,
for the simple reason that they are fairly easily comprehensible to non-
mathematicians.
More examples illustrating the use of density estimates for exploratory and
presentational purposes, including the important case of bivariate data, are given in
the literature.[7]

Density estimation is also frequently used in anomaly detection or novelty
detection:[8] if an observation lies in a very low-density region, it is likely to be an
anomaly or a novelty.

In hydrology the histogram and estimated density function of rainfall and river
discharge data, analysed with a probability distribution, are used to gain insight
into their behaviour and frequency of occurrence.[9] An example is shown in the
blue figure.
Parametric Methods
Parametric methods are statistical techniques that rely on specific assumptions
about the underlying distribution of the population being studied. These methods
typically assume that the data follows a known Probability distribution, such as the
normal distribution, and estimate the parameters of this distribution using the
available data.
The basic idea behind the parametric method is that there is a set of fixed
parameters that determine a probability model, which is used in machine
learning as well. Parametric methods are those methods for which we know a priori
that the population is normal, or, if not, that we can easily approximate it using
a Normal Distribution, which is possible by invoking the Central Limit Theorem.
Parameters for using the normal distribution are as follows:
 Mean
 Standard Deviation
Ultimately, whether a method is classified as parametric depends entirely on
the presumptions that are made about a population.

Assumptions for Parametric Methods

Parametric methods require several assumptions about the data:


 Normality: The data follows a normal (Gaussian) distribution.
 Homogeneity of variance: The variance of the population is the same across all
groups.
 Independence: Observations are independent of each other.

What are Parametric Methods?

 Statistical Tests:
o t-test: Tests for the difference between the means of two independent
groups.
o ANOVA: Tests for the difference between the means of three or more
groups.
o F-test: Compares the variances of two groups.
o Chi-square test: Tests for relationships between categorical variables.
o Correlation analysis: Measures the strength and direction of the linear
relationship between two continuous variables.
 Machine Learning Models:
o Linear regression: Predicts a continuous outcome based on a linear
relationship with one or more independent variables.
o Logistic regression: Predicts a binary outcome (e.g., yes/no) based on a set
of independent variables.
o Naive Bayes: Classifies data points based on Bayes’ theorem and assuming
independence between features.
o Hidden Markov Models: Models sequential data with hidden states and
observable outputs.

Advantages of Parametric Methods

 More powerful: When the assumptions are met, parametric tests are generally
more powerful than non-parametric tests, meaning they are more likely to
detect a real effect when it exists.
 More efficient: Parametric tests require smaller sample sizes than non-
parametric tests to achieve the same level of power.
 Provide estimates of population parameters: Parametric methods provide
estimates of the population mean, variance, and other parameters, which can be
used for further analysis.

Disadvantages of Parametric Methods

 Sensitive to assumptions: If the assumptions of normality, homogeneity of
variance, and independence are not met, parametric tests can be invalid and
produce misleading results.
 Limited flexibility: Parametric methods are limited to the specific probability
distribution they are based on.
 May not capture complex relationships: Parametric methods are not well-suited
for capturing complex non-linear relationships between variables.

Applications of Parametric Methods

Parametric methods are widely used in various fields, including:


 Biostatistics: Comparing the effectiveness of different treatments.
 Social sciences: Investigating relationships between variables.
 Finance: Estimating risk and return of investments.
 Engineering: Analyzing the performance of systems.

Maximum Likelihood Method

In statistics, maximum likelihood estimation (MLE) is a method
of estimating the parameters of an assumed probability distribution, given some
observed data. This is achieved by maximizing a likelihood function so that, under
the assumed statistical model, the observed data is most probable. The point in
the parameter space that maximizes the likelihood function is called the maximum
likelihood estimate.[1] The logic of maximum likelihood is both intuitive and flexible,
and as such the method has become a dominant means of statistical inference.
Principles

The maximum likelihood method is the most popular technique for deriving
estimators. It is based on the likelihood function which, for an observed sample x, is
defined as the probability (or density) of x expressed as a function of θ; in symbols
L(θ)=∏i=1nf(xi;θ)
This function provides a measure of plausibility of each possible value of θ on the
basis of the observed data. Then, the method at issue consists of
estimating θ through the value of θ which maximizes L(θ) since this corresponds to
the parameter value for which the observed sample is most likely. The estimate
found in this way, that is,
θ̂ = θ̂(x) such that L(θ̂) = sup_{θ ∈ Θ} L(θ)
The Maximum-Likelihood Approach
The maximum-likelihood approach is, far and away, the preferred approach to
correcting for non-response bias, and it is the one advocated by Little and Rubin.
The maximum-likelihood approach begins by writing down a probability distribution
that defines the likelihood of the observed sample as a function of population and
distribution parameters θ. If x1 and x2 represent responses to two different survey
questions by a single individual, the likelihood associated with a complete response
may be expressed as f(x1, x2; θ), where f is the joint probability density
function of x1 and x2. For individuals who only report x1, the likelihood associated
with x1 is ∫_{−∞}^{∞} f(x1, x2; θ) dx2, which can, under the assumption of joint normality, be
simplified to a more convenient form. In this way, a likelihood function is specified
that includes terms corresponding to each observation, whether completely or only
partially observed. The likelihood objective is then maximized with respect to θ,
which produces estimates of the desired characteristics, enjoying all the well-known
properties of maximum-likelihood estimation.
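To make the idea concrete, here is a minimal sketch (an addition, not from the source) that estimates the mean and standard deviation of a normal model by minimizing the negative log-likelihood with SciPy; the sample values and the starting point are made up for illustration.

import numpy as np
from scipy.optimize import minimize

# Hypothetical observed sample (made up for this illustration)
x = np.array([4.2, 5.1, 4.8, 5.6, 4.9, 5.3, 4.7])

def neg_log_likelihood(params):
    # params = (mu, sigma) of an assumed normal distribution
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    n = len(x)
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum((x - mu)**2) / (2 * sigma**2)

# Maximizing L(theta) is equivalent to minimizing -log L(theta)
result = minimize(neg_log_likelihood, x0=[1.0, 1.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print("MLE estimates:", mu_hat, sigma_hat)   # close to the sample mean and std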
Bayesian inference
Bayesian inference is a method of statistical inference in which Bayes' theorem is
used to calculate a probability of a hypothesis, given prior evidence, and update it
as more information becomes available. Fundamentally, Bayesian inference uses
a prior distribution to estimate posterior probabilities. Bayesian inference is an
important technique in statistics, and especially in mathematical statistics. Bayesian
updating is particularly important in the dynamic analysis of a sequence of data.
Bayesian inference has found application in a wide range of activities,
including science, engineering, philosophy, medicine, sport, and law. In the
philosophy of decision theory, Bayesian inference is closely related to subjective
probability, often called "Bayesian probability".
Introduction to Bayes' rule
Bayesian inference derives the posterior probability as a consequence of
two antecedents: a prior probability and a "likelihood function" derived from
a statistical model for the observed data. Bayesian inference computes the
posterior probability according to Bayes' theorem:

P(H | E) = P(E | H) · P(H) / P(E)

where H is the hypothesis, E is the observed evidence, P(H) is the prior probability,
P(E | H) is the likelihood, and P(H | E) is the posterior probability.
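As a small self-contained illustration (an addition, not from the source), the sketch below applies Bayes' rule over a discrete grid of hypotheses for the bias of a coin; the uniform prior and the observed counts are assumptions made up for the example.

import numpy as np

# Discrete grid of hypotheses for the probability of heads
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)          # uniform prior

# Hypothetical evidence: 7 heads in 10 tosses (assumed for illustration)
heads, tosses = 7, 10
likelihood = theta**heads * (1 - theta)**(tosses - heads)

# Bayes' rule: posterior is proportional to likelihood * prior, then normalized
posterior = likelihood * prior
posterior /= posterior.sum()

print("Posterior mode:", theta[np.argmax(posterior)])   # near 0.7
print("Posterior mean:", np.sum(theta * posterior))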

***Thank You***
Module 2

SINGLE LAYER NETWORKS


Linear Discriminant Analysis

Linear Discriminant Analysis (LDA), also known as Normal Discriminant Analysis or
Discriminant Function Analysis, is a dimensionality reduction technique primarily
utilized in supervised classification problems. It facilitates the modeling of
distinctions between groups, effectively separating two or more classes. LDA
operates by projecting features from a higher-dimensional space into a lower-
dimensional one. In machine learning, LDA serves as a supervised learning
algorithm specifically designed for classification tasks, aiming to identify a linear
combination of features that optimally segregates classes within a dataset.
For example, suppose we have two classes that we need to separate efficiently.
Classes can have multiple features. Using only a single feature to classify them
may result in some overlap between the classes, so we keep increasing the
number of features until the classes can be separated properly.

Assumptions of LDA
LDA assumes that the data has a Gaussian distribution and that
the covariance matrices of the different classes are equal. It also assumes that the
data is linearly separable, meaning that a linear decision boundary can accurately
classify the different classes.
Suppose we have two sets of data points belonging to two different classes that
we want to classify. As shown in the given 2D graph, when the data points are
plotted on the 2D plane, there’s no straight line that can separate the two classes
of data points completely. Hence, in this case, LDA (Linear Discriminant Analysis)
is used which reduces the 2D graph into a 1D graph in order to maximize the
separability between the two classes.

Linearly Separable Dataset


Here, Linear Discriminant Analysis uses both axes (X and Y) to create a new axis
and projects data onto a new axis in a way to maximize the separation of the two
categories and hence, reduces the 2D graph into a 1D graph.

Two criteria are used by LDA to create a new axis:


1. Maximize the distance between the means of the two classes.
2. Minimize the variation within each class.

The perpendicular distance between the line and points

How does LDA work?


LDA works by projecting the data onto a lower-dimensional space that maximizes
the separation between the classes. It does this by finding a set of linear
discriminants that maximize the ratio of between-class variance to within-class
variance. In other words, it finds the directions in the feature space that best
separates the different classes of data.

Mathematical Intuition Behind LDA

Let’s suppose we have two classes and d-dimensional samples x1, x2, …, xn, where
n1 samples come from class c1 and n2 samples come from class c2.
If xi is a data point, then its projection onto the line represented by the unit vector
v can be written as vᵀxi.
Let μ1 and μ2 be the means of the samples of classes c1 and c2 before projection,
and let μ̃1 denote the mean of the samples of class c1 after projection. It can be
calculated as:

μ̃1 = (1/n1) ∑_{xi ∈ c1} vᵀxi = vᵀμ1
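A minimal sketch of this projection using scikit-learn's LinearDiscriminantAnalysis (scikit-learn is already used elsewhere in these notes); the toy two-class data below is made up for illustration.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy 2-D data for two classes (made up for illustration)
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],    # class 0
              [6.0, 5.5], [6.5, 6.0], [7.0, 5.8]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)    # project the 2-D points onto the new 1-D axis

print("Projected values:\n", X_1d)
print("Predicted classes:", lda.predict(X))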

Extensions to LDA

1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of
variance (or covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs
are used such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the
estimate of the variance (actually covariance), moderating the influence of
different variables on LDA.
Linear Separability
Linear separability refers to data points in binary classification problems that can
be separated by a linear decision boundary. If the data points can be separated
using a line, linear function, or flat hyperplane, they are considered linearly
separable.
 Linear separability is an important concept in neural networks. If the points of
the two classes in n-dimensional space can be separated by a hyperplane, the
problem is said to be linearly separable.
 For two-dimensional inputs, if there exists a line (whose equation is
w1x1 + w2x2 + b = 0) that separates all samples of one class from the other
class, then an appropriate perceptron can be derived from the equation of the
separating line. Such classification problems are called "linearly separable",
i.e., separable by a linear combination of the inputs.
 The logical AND gate is a two-dimensional example of a linearly separable
problem.

Linear Separability as Mathematics:

Linear separability is introduced in the context of linear algebra and optimization
theory. It refers to the capacity of a hyperplane to divide two classes of data
points in a high-dimensional space.
Consider a set of data points in a p-dimensional space, where p is the number of
features or variables that characterize each point.
The hyperplane can be represented mathematically by a linear function
w1x1 + w2x2 + … + wpxp + b = 0, where x1, …, xp are the features of a data point
and w1, …, wp are the corresponding weights. If two categories can be separated
by such a linear (first-degree) boundary, i.e., an equation of the form y = ax + b in
two dimensions, the data are said to be linearly separable.

Methods for checking linear separability:

1. Visual Inspection: If a distinct straight line or plane divides the various groups, it
can be visually examined by plotting the data points in a 2D or 3D space. The
data may be linearly separable if such a boundary can be seen.
2. Perceptron Learning Algorithm: This binary linear classifier divides the input into
two classes by learning a separating hyperplane iteratively. The data are linearly
separable if the method finds a separating hyperplane and converges. If not, it is
not.
3. Support vector machines: SVMs are a widely used classification technique that can
handle linearly separable data. They identify the separating hyperplane that
maximizes the margin between the two classes. The data can be linearly
separated if such a margin greater than zero exists.
4. Kernel methods: The data can be transformed into a higher-dimensional space
using this family of techniques, where it might then be linearly separable. The
original data is also linearly separable if the converted data is linearly separable.
5. Quadratic programming: Finding the separation hyperplane that reduces the
classification error can be done using quadratic programming. If a solution is
found, the data can be separated linearly.

Checking Linear separability

 Import the necessary libraries
 Define a custom dataset
 Build and train the linear model
 Predict from new input

from sklearn import svm
import numpy as np

# Making the dataset
X = np.array([[1, 2], [2, 3], [3, 1], [4, 3]])
Y = np.array([0, 0, 1, 1])

# Now let's train the SVM model
model = svm.SVC(kernel='linear')
model.fit(X, Y)

# Let's predict for new input
n_data = np.array([[5, 2], [2, 1]])
pred = model.predict(n_data)
print(pred)

Output: the predicted class labels for the two new input points.

Convert Non-separable data to separable:

 Import the necessary libraries
 Create the non-linear dataset
 Plot the dataset using matplotlib
from sklearn.datasets import make_circles
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import numpy as np

# First, let's create a non-linear dataset
x_val, y_val = make_circles(n_samples=50, factor=0.5)

# Now let's plot and see our dataset
plt.scatter(x_val[:, 0], x_val[:, 1], c=y_val, cmap='plasma')
plt.show()

Output: a scatter plot of two concentric circles, one class inside the other.
Apply kernel trick to map data into higher-dimensional space

 Apply the kernel trick to map the data into a higher-dimensional space
 Fit an SVM on the mapped data
 Plot the decision boundary in the mapped space
 Plot the mapped data

# Apply kernel trick to map data into higher-dimensional space
x_new = np.vstack((x_val[:, 0]**2, x_val[:, 1]**2)).T

# Now fit SVM on the mapped data
svm = SVC(kernel='linear')
svm.fit(x_new, y_val)

# Plot decision boundary in the mapped space
w = svm.coef_
a = -w[0][0] / w[0][1]
x = np.linspace(0, 1)
y = a * x - (svm.intercept_[0]) / w[0][1]
plt.plot(x, y, 'k-')

# Plot the mapped data
plt.scatter(x_new[:, 0], x_new[:, 1], c=y_val, cmap='plasma')
plt.show()

Output: in the squared-feature space the two classes become linearly separable, and the plot shows the separating line.
Least Square Method

Least Square method is a fundamental mathematical technique widely used
in data analysis, statistics, and regression modeling to identify the best-fitting
curve or line for a given set of data points. This method ensures that the overall
error is reduced, providing a highly accurate model for predicting future data
trends.
In statistics, when the data can be represented on a cartesian plane by using the
independent and dependent variable as the x and y coordinates, it is called scatter
data. This data might not be useful in making interpretations or predicting the
values of the dependent variable for the independent variable. So, we try to get
an equation of a line that fits best to the given data points with the help of
the Least Square Method.
In this section, we will learn the least square method, its formula, and solved
examples on it.

Least Square Method Definition

Least Squares method is a statistical technique used to find the equation of best-
fitting curve or line to a set of data points by minimizing the sum of the squared
differences between the observed values and the values predicted by the model.
Formula for Least Square Method
Least Square Method formula is used to find the best-fitting line through a set of
data points. For a simple linear regression, which is a line of the form y = mx + c,
where y is the dependent variable, x is the independent variable, m is the slope of
the line, and c is the y-intercept, the formulas to calculate the slope (m) and
intercept (c) of the line are:
1. Slope (m) Formula: m = [n(∑xy) − (∑x)(∑y)] / [n(∑x²) − (∑x)²]
2. Intercept (c) Formula: c = [(∑y) − m(∑x)] / n
Where:
 n is the number of data points,
 ∑xy is the sum of the product of each pair of x and y values,
 ∑x is the sum of all x values,
 ∑y is the sum of all y values,
 ∑x² is the sum of the squares of the x values.
The steps to find the line of best fit by using the least square method is discussed
below:
 Step 1: Denote the independent variable values as xi and the dependent ones as
yi.
 Step 2: Calculate the average values of xi and yi as X and Y.
 Step 3: Presume the equation of the line of best fit as y = mx + c, where m is
the slope of the line and c represents the intercept of the line on the Y-axis.
 Step 4: The slope m can be calculated from the following formula:
m = [Σ (X – xi)×(Y – yi)] / Σ(X – xi)²
 Step 5: The intercept c is calculated from the following formula:
c = Y – mX
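These steps can be carried out directly in NumPy; the sketch below uses a small made-up dataset purely for illustration.

import numpy as np

# Hypothetical data points (made up for illustration)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

X_mean, Y_mean = x.mean(), y.mean()

# Slope and intercept from the least-squares formulas above
m = np.sum((X_mean - x) * (Y_mean - y)) / np.sum((X_mean - x)**2)
c = Y_mean - m * X_mean

print("Line of best fit: y = %.3f x + %.3f" % (m, c))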

Limitations of the Least Square Method

The Least Square method assumes that the data is evenly distributed and doesn’t
contain any outliers for deriving a line of best fit. But, this method doesn’t provide
accurate results for unevenly distributed data or for data containing outliers.
The Perceptron - Artificial Neural Networks
The Perceptron is one of the simplest artificial neural network architectures,
introduced by Frank Rosenblatt in 1957. It is primarily used for binary
classification.
At that time, traditional methods like Statistical Machine Learning and
Conventional Programming were commonly used for predictions. Despite being
one of the simplest forms of artificial neural networks, the Perceptron model
proved to be highly effective in solving specific classification problems, laying the
groundwork for advancements in AI and machine learning.
What is Perceptron?
Perceptron is a type of neural network that performs binary classification that
maps input features to an output decision, usually classifying data into one of two
categories, such as 0 or 1.
Perceptron consists of a single layer of input nodes that are fully connected to a
layer of output nodes. It is particularly good at learning linearly separable
patterns. It utilizes a variation of artificial neurons called Threshold Logic Units
(TLU), which were first introduced by McCulloch and Walter Pitts in the 1940s. This
foundational model has played a crucial role in the development of more advanced
neural networks and machine learning algorithms.

Types of Perceptron

1. Single-Layer Perceptron: a type of perceptron that is limited to learning linearly
separable patterns. It is effective for tasks where the data can be divided into
distinct categories by a straight line. While powerful in its simplicity, it
struggles with more complex problems where the relationship between inputs
and outputs is non-linear.
2. Multi-Layer Perceptrons possess enhanced processing capabilities, as they consist
of two or more layers and are adept at handling more complex patterns and
relationships within the data.
Basic Components of Perceptron
A Perceptron is composed of key components that work together to process
information and make predictions.
 Input Features: The perceptron takes multiple input features, each representing
a characteristic of the input data.
 Weights: Each input feature is assigned a weight that determines its influence on
the output. These weights are adjusted during training to find the optimal
values.
 Summation Function: The perceptron calculates the weighted sum of its inputs,
combining them with their respective weights.
 Activation Function: The weighted sum is passed through the Heaviside step
function, comparing it to a threshold to produce a binary output (0 or 1).
 Output: The final output is determined by the activation function, often used
for binary classification tasks.
 Bias: The bias term helps the perceptron make adjustments independent of the
input, improving its flexibility in learning.
 Learning Algorithm: The perceptron adjusts its weights and bias using a learning
algorithm, such as the Perceptron Learning Rule, to minimize prediction errors.
How does Perceptron work?
A weight is assigned to each input node of a perceptron, indicating the importance
of that input in determining the output. The Perceptron’s output is calculated as a
weighted sum of the inputs, which is then passed through an activation function to
decide whether the Perceptron will fire.
The weighted sum is computed as:
z = w1x1 + w2x2 + … + wnxn = XᵀW
The output is then obtained by applying the step (Heaviside) function:
h(z) = 0 if z < threshold, and h(z) = 1 if z ≥ threshold

A perceptron consists of a single layer of Threshold Logic Units (TLU), with each
TLU fully connected to all input nodes.

In a fully connected layer, also known as a dense layer, all neurons in one layer
are connected to every neuron in the previous layer.
The output of the fully connected layer is computed as:
f_{W,b}(X) = h(XW + b)
where X is the input, W is the weight matrix for the input neurons, b is the bias,
and h is the step function.
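A minimal from-scratch sketch of the perceptron and its learning rule in NumPy, trained on the AND-gate data mentioned earlier; the learning rate and the number of epochs are assumptions chosen for the example.

import numpy as np

# AND-gate data: linearly separable, as discussed above
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)     # weights
b = 0.0             # bias
lr = 0.1            # learning rate (assumed)

def step(z):
    # Heaviside step activation
    return 1 if z >= 0 else 0

for epoch in range(10):               # number of passes over the data (assumed)
    for xi, target in zip(X, y):
        z = np.dot(w, xi) + b         # weighted sum
        error = target - step(z)
        w += lr * error * xi          # perceptron learning rule
        b += lr * error

print("weights:", w, "bias:", b)
print("predictions:", [step(np.dot(w, xi) + b) for xi in X])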
Fisher's linear discriminant

The terms Fisher's linear discriminant and LDA are often used interchangeably,
although Fisher's original article[2] actually describes a slightly different
discriminant, which does not make some of the assumptions of LDA such
as normally distributed classes or equal class covariances.
Fisher defined the separation between the two classes as the ratio of the variance
between the classes to the variance within the classes of the projected data. This
measure is, in some sense, a measure of the signal-to-noise ratio for the class
labelling. It can be shown that the maximum separation occurs when

w ∝ (Σ1 + Σ2)⁻¹ (μ1 − μ2)

where μ1, μ2 are the class means and Σ1, Σ2 are the class covariance matrices.
Be sure to note that the vector w is the normal to the discriminant hyperplane. As
an example, in a two-dimensional problem, the line that best divides the two groups
is perpendicular to w.

Otsu's method is related to Fisher's linear discriminant, and was created to binarize
the histogram of pixels in a grayscale image by optimally picking the black/white
threshold that minimizes intra-class variance and maximizes inter-class variance
within/between grayscales assigned to black and white pixel classes.

Gradient-Based Strategy

Gradient-based strategy is commonly known as Gradient boosting, which is a
fundamental machine learning technique used by many gradient boosting
algorithms like LightGBM to optimize and enhance the performance of predictive
models. In a gradient-based strategy, multiple weak learners (commonly decision
trees) are combined to achieve a high-performance model. There are some key
processes associated with a gradient-based strategy, which are listed below:
 Gradient Descent: In the gradient-based strategy, the optimization algorithm
(usually gradient descent) is used to minimize a loss function that measures the
difference between predicted values and actual target values.
 Iterative Learning: The model iteratively updates its predictions for each step by
calculating gradients (slopes) of the loss function with respect to the model's
parameters. These gradients are calculated to know the right way to minimize
the loss.
 Boosting: In gradient boosting, weak learners (decision trees) are trained
sequentially where each tree attempting to correct the errors made by the
previous ones and the final prediction is the combination of predictions from all
the trees.

Benefits of Gradient-based strategy

We can get several benefits in our predictive model if we utilize a gradient-based
strategy, which are listed below:
1. Model Accuracy: Gradient boosting, including LightGBM, is known for its high
predictive accuracy; it is capable of capturing complex relationships in the
data by iteratively refining the model.
2. Robustness: The ensemble nature of gradient boosting makes it robust
against the overfitting problem. Each new tree focuses on the mistakes of the
previous trees, which reduces the risk of capturing noise in the data.
3. Flexibility: Gradient boosting has an in-built mechanism to handle various types of
data, including both numerical and categorical features, which makes it suitable
for a wide range of machine learning tasks.
4. Interpretability: While ensemble models can be complex, they can offer
interpretability through feature importance rankings, which can be used in
conjunction with interpretability tools like SHAP values to understand model
decisions.

Learning Rate Decay

Learning rate decay is a technique used in machine learning models,
especially deep neural networks. It is sometimes referred to as learning rate
scheduling or learning rate annealing. Throughout the training phase, it entails
gradually lowering the learning rate. Learning rate decay is used to gradually
adjust the learning rate, usually by lowering it, to facilitate the optimization
algorithm's more rapid convergence to a better solution. This method tackles
problems that are frequently linked to a fixed learning rate, such as oscillations
and sluggish convergence.
Learning rate decay can be accomplished by a variety of techniques, such as step
decay, exponential decay, and 1/t decay. Degradation strategy selection is based
on the particular challenge and architecture. When training deep learning models,
learning rate decay is a crucial hyperparameter that, when used properly, can
result in faster training, better convergence, and increased model performance.
How Learning Rate Decay works
Learning rate decay is like driving a car towards a parking spot. At first, you drive
fast to reach the spot quickly. As you get closer, you slow down to park accurately.
In machine learning, the learning rate determines how much the model changes
based on the mistakes it makes. If it's too high, the model might miss the best fit;
too low, and it's too slow. Learning rate decay starts with a higher learning rate,
letting the model learn fast. As training progresses, the rate gradually decreases,
making the model adjustments more precise. This ensures the model finds a good
solution efficiently. Different methods reduce the rate in various ways, either
stepwise or smoothly, to optimize the training process.

Mathematical representation of Learning rate decay

A basic learning rate decay plan can be mathematically represented as follows:

Assume that the starting learning rate is \eta_{0} and that the learning rate at
epoch t is \eta_{t}.
A typical decay schedule is based on a constant decay rate \alpha, where
\alpha \in (0,1), applied at regular intervals (e.g., every n epochs):
\eta_{t} = \frac{\eta_{0}}{1 + \alpha \cdot t}
Where,
 \eta_{t} is the learning rate at epoch t.
 \eta_{0} is the initial learning rate at the start of training.
 \alpha is the fixed decay rate, typically a small positive value, such as 0.1 or
0.01.
 t is the current epoch during training.
 The learning rate \eta_{t} decreases as t increases, leading to smaller step size
as training progresses.

Basic decay schedules

In order to enhance the convergence of machine learning models, learning rate
decay schedules are utilized to gradually lower the learning rate during training.
Here are a few simple schedules for learning rate decay:
 Step Decay: In step decay, after a predetermined number of training epochs, the
learning rate is decreased by a specified factor (decay rate). The mathematical
formula for step decay is:

lr = lr_{initial} * drop \;rate^{\frac{epoch}{step\;size}}
 Exponential Decay: The learning rate is progressively decreased over time
by exponential decay. At each epoch, a factor is used to adjust the learning rate.
The mathematical formula for Exponential decay is:
lr = lr_{initial}\; * \; e^{-decay\;rate \;* \;epoch}
 Inverse Time Decay: A factor inversely proportional to the number of epochs is
used to reduce the learning rate through inverse decay. The mathematical
formula for Inverse Time decay is:
lr = lr_{initial} \; * \; \frac{1}{1 + decay * epoch}
 Polynomial Decay: When a polynomial function, usually a power of the epoch
number, is followed, polynomial decay lowers the learning rate.The
mathematical formula for Polynomial decay is:
lr = lr_{initial} \; * \; \left ( 1 - \frac{epoch}{max\;epoch}\right )^{power}
Steps Needed to implement Learning Rate Decay
 Set Initial Learning Rate: Start by establishing a base learning rate. It shouldn't
be too high to cause drastic updates, nor too low to stall the learning process.
 Choose a Decay Method: Common methods include exponential decay, step
decay, or inverse time decay. The choice depends on your specific machine
learning problem.
 Implement the Decay: Apply the chosen decay method after a set number of
epochs, or based on the performance of the model.
 Monitor and Adjust: Keep an eye on the model's performance. If it's not
improving, you might need to adjust the decay rate or the method.
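Two of the schedules above can be written as plain Python functions, as in the sketch below; the initial learning rate and decay constant are assumed values chosen for illustration.

import math

initial_lr = 0.1      # assumed starting learning rate
decay_rate = 0.05     # assumed decay constant

def inverse_time_decay(epoch):
    # lr = lr_initial * 1 / (1 + decay * epoch)
    return initial_lr / (1 + decay_rate * epoch)

def exponential_decay(epoch):
    # lr = lr_initial * e^(-decay_rate * epoch)
    return initial_lr * math.exp(-decay_rate * epoch)

for epoch in (0, 10, 50, 100):
    print(epoch, round(inverse_time_decay(epoch), 4), round(exponential_decay(epoch), 4))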

Momentum-based Gradient Optimizer introduction

Gradient Descent is an optimization technique used in Machine Learning
frameworks to train different models. The training process consists of an objective
function (or the error function), which determines the error a Machine Learning
model has on a given dataset.
While training, the parameters of this algorithm are initialized to random values.
As the algorithm iterates, the parameters are updated such that we reach closer
and closer to the optimal value of the function.
However, Adaptive Optimization Algorithms are gaining popularity due to their
ability to converge swiftly. All these algorithms, in contrast to the conventional
Gradient Descent, use statistics from the previous iterations to robustify the
process of convergence.
Momentum-based Gradient Optimizer is a technique used in optimization
algorithms, such as Gradient Descent, to accelerate the convergence of the
algorithm and overcome local minima. In the Momentum-based Gradient
Optimizer, a fraction of the previous update is added to the current update, which
creates a momentum effect that helps the algorithm to move faster towards the
minimum.
The momentum term can be viewed as a moving average of the gradients. The
larger the momentum term, the smoother the moving average, and the more
resistant it is to changes in the gradients. The momentum term is typically set to a
value between 0 and 1, with a higher value resulting in a more stable optimization
process.
The update rule for the Momentum-based Gradient Optimizer can be expressed as
follows:

v = beta * v - learning_rate * gradient
parameters = parameters + v

// Where v is the velocity vector, beta is the momentum term,
// learning_rate is the step size,
// gradient is the gradient of the cost function with respect to the parameters,
// and parameters are the parameters of the model.

Momentum-based Optimization:

An Adaptive Optimization Algorithm uses exponentially weighted averages of
gradients over previous iterations to stabilize the convergence, resulting in quicker
optimization. For example, in most real-world applications of Deep Neural
Networks, the training is carried out on noisy data. It is, therefore, necessary to
reduce the effect of noise when the data are fed in batches during Optimization.
This problem can be tackled using Exponentially Weighted Averages (or
Exponentially Weighted Moving Averages).

Implementing Exponentially Weighted Averages:

In order to approximate the trends in a noisy dataset of size N:
θ0, θ1, θ2, …, θN, we maintain a set of parameters v0, v1, v2, …, vN. As we iterate
through all the values in the dataset, we calculate these parameters as follows:

On iteration t: get the next θt and set vt = β·vt−1 + (1 − β)·θt

This algorithm averages the value of vt over roughly its previous 1/(1 − β)
iterations. This averaging ensures that only the trend is retained and the
noise is averaged out. This method is used as a strategy in momentum-based
gradient descent to make it robust against noise in data samples, resulting in
faster training.
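The following NumPy sketch applies the momentum update described above to a simple one-dimensional quadratic loss; the loss function, learning rate, and momentum value are assumptions chosen for illustration.

import numpy as np

# Toy quadratic loss L(w) = (w - 3)^2 with gradient 2(w - 3)
grad = lambda w: 2 * (w - 3.0)

w = 0.0            # parameter, initialized arbitrarily
v = 0.0            # velocity (exponentially weighted average of gradients)
lr = 0.1           # learning rate (assumed)
beta = 0.9         # momentum term (assumed)

for step in range(100):
    v = beta * v - lr * grad(w)   # accumulate velocity
    w = w + v                     # update the parameter with the velocity

print("w after training:", w)     # approaches the minimum at w = 3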

Learning Rate in Neural Network

What is the Learning Rate?


Learning rate is a hyperparameter that controls how much to change the model in
response to the estimated error each time the model weights are updated. It
determines the size of the steps taken towards a minimum of the loss function
during optimization.
In mathematical terms, when using a method like Stochastic Gradient Descent
(SGD), the learning rate (often denoted as α or η) is multiplied by the gradient
of the loss function to update the weights:
w = w – α·∇L(w)
Where:
 w represents the weights,
 α is the learning rate,
 ∇L(w) is the gradient of the loss function with respect to the weights.
Impact of Learning Rate on Model
The learning rate influences the training process of a machine learning model by
controlling how much the weights are updated during training. A well-calibrated
learning rate balances convergence speed and solution quality.
If set too low, the model converges slowly, requiring many epochs and leading to
inefficient resource use. Conversely, a high learning rate can cause the model to
overshoot optimal weights, resulting in instability and divergence of the loss
function. An optimal learning rate should be low enough for accurate convergence
while high enough for reasonable training time. Smaller rates require more
epochs, potentially yielding better final weights, whereas larger rates can cause
fluctuations around the optimal solution.
Imagine learning to play a video game where timing your jumps over obstacles is
crucial. Jumping too early or late leads to failure, but small adjustments can help
you find the right timing to succeed. In machine learning, a low learning rate
results in longer training times and higher costs, while a high learning rate can
cause overshooting or failure to converge. Thus, finding the optimal learning rate
is essential for efficient and effective training.
Techniques for Adjusting the Learning Rate in Neural Networks

1. Fixed Learning Rate

A fixed learning rate is a common optimization approach where a constant
learning rate is selected and maintained throughout the training process. Initially,
parameters are assigned random values, and a cost function is generated based
on these initial values. The algorithm then iteratively improves the parameter
estimations to minimize the cost function. While simple to implement, a fixed
learning rate may not adapt well to the complexities of various training scenarios.

2. Learning Rate Schedules

Learning rate schedules adjust the learning rate based on predefined rules or
functions, enhancing convergence and performance. Some common methods
include:
 Step Decay: The learning rate decreases by a specific factor at designated
epochs or after a fixed number of iterations.
 Exponential Decay: The learning rate is reduced exponentially over time,
allowing for a rapid decrease in the initial phases of training.
 Polynomial Decay: The learning rate decreases polynomially over time, providing
a smoother reduction.

3. Adaptive Learning Rate

Adaptive learning rates dynamically adjust the learning rate based on the model’s
performance and the gradient of the cost function. This approach can lead to
optimal results by adapting the learning rate depending on the steepness of the
cost function curve:
 AdaGrad: This method adjusts the learning rate for each parameter individually
based on historical gradient information, reducing the learning rate for
frequently updated parameters.
 RMSprop: A variation of AdaGrad, RMSprop addresses overly aggressive learning
rate decay by maintaining a moving average of squared gradients to adapt the
learning rate effectively.
 Adam: Combining concepts from both AdaGrad and RMSprop, Adam incorporates
adaptive learning rates and momentum to accelerate convergence.

4. Scheduled Drop Learning Rate

In this technique, the learning rate is decreased by a specified proportion at set
intervals, contrasting with decay techniques where the learning rate continuously
diminishes. This allows for more controlled adjustments during training.

5. Cycling Learning Rate

Cycling learning rate techniques involve cyclically varying the learning rate within
a predefined range throughout the training process. The learning rate fluctuates in
a triangular shape between minimum and maximum values, maintaining a
constant frequency. One popular strategy is the triangular learning rate policy,
where the learning rate is linearly increased and then decreased within a cycle.
This method aims to explore various learning rates during training, helping the
model escape poor local minima and speeding up convergence.

6. Decaying Learning Rate

In this approach, the learning rate decreases as the number of epochs or iterations
increases. This gradual reduction helps stabilize the training process as the model
converges to a minimum.

Cliffs and Higher-Order Instability


Cliffs and Exploding Gradients: Neural networks with many layers often have
very steep regions in their cost surface, resembling cliffs. These result from the
multiplication of several large weights together. On the face of an extremely steep
cliff structure, the gradient update step can move the parameters very far, usually
jumping off the cliff structure altogether.
The cliff can be dangerous whether we approach it from above or from below, but
its most serious consequences can be avoided using the gradient clipping
heuristic. The basic idea is to recall that the gradient specifies only a good
direction, not the optimal step size. The gradient clipping heuristic reduces the
step size so that it is small enough to be unlikely to step outside the region where
the gradient indicates the direction of approximately steepest descent.
Cliff structures are most common in the cost functions for recurrent neural
networks, because such models involve the multiplication of many factors, with one
factor for each time step. Long temporal sequences thus incur an extreme amount
of multiplication.

Identification and catching of Exploding Gradients

Identifying these gradient problems is difficult before the training process has even
started. When the network is a deep recurrent one, we have to continually monitor
the logs and record unexpected jumps in the cost function. This tells us whether
these jumps recur and whether the norm of the gradient is growing exponentially.
The best way to do this is by checking the logs in a visualization dashboard.

Fixation of Exploding Gradients

There are various methods to address the exploding gradients. Below is the list of
some best-practice methods that we can use.
Gradient Clipping

Gradient Clipping is the process that helps maintain numerical stability by
preventing the gradients from growing too large. When training a neural network,
the loss gradients are computed through backpropagation. However, if these
gradients become too large, the updates to the model weights can also become
excessively large, leading to numerical instability. This can result in the model
producing NaN (Not a Number) values or overflow errors, which can be
problematic. This problem is often referred to as 'gradient exploding', it could be
solved by clipping the gradient to the value that we want it to be. Let's thoroughly
discuss gradient clipping.

How does Gradient Clipping work?

Let's discuss the step-by-step description of gradient clipping:


1. Calculate Gradients:
When the model is learning, it's like a student taking an
exam. Backpropagation is like a teacher grading the exam and giving feedback
to the student. It calculates the gradients of the model's parameters with
respect to the loss function, helping the model learn and improve its
performance. So, think of backpropagation as a helpful teacher guiding the
model to success!
2. Compute Gradient Norm:
To measure the magnitude of the gradients, we can use different types of norms
such as the L2 norm (also known as the Euclidean norm) or the L1 norm. These
norms help us to quantify the size of the gradients and understand how fast the
parameters are changing. The L2 norm calculates the square root of the sum of
the squares of the individual gradients, while the L1 norm calculates the sum of
the absolute values of the gradients. By measuring the norm of the gradients,
we can monitor the training process and adjust the learning rate accordingly to
ensure that the model is converging efficiently.
3. Clip Gradients:
If the computed gradient norm exceeds the predefined clip threshold, the
gradients are scaled down to ensure that the norm does not exceed this
threshold. The scaling factor is determined by dividing the clip threshold by the
gradient norm.
 clip_factor = clip_threshold / gradient_norm
 The clipped gradients become clip_factor * gradients.
4. Update Model Parameters:
The clipped gradients are used to update the model parameters. By using the
clipped gradients to update the model parameters, we can prevent the weights
from being updated by excessively large amounts, which can lead to numerical
instability and slow down the training process. This helps to ensure that the
model is learning effectively and converging towards a good solution.
The clip_threshold discussed here is a type of hyperparameter whose value could
be determined by experimenting on the dataset present in front of us.
Types of Gradient Clipping Techniques
There are two different gradient clipping techniques that are used, gradient
clipping by value and gradient clipping by norm, let's discuss them thoroughly.

Clipping by Value:

'Clipping by value' is the most straightforward and effective gradient clipping
technique. In this method the gradients are individually clipped so that they lie in
the predefined range that is mentioned. This technique is done elementwise, so
each component of the gradient vector is clipped individually. In this gradient
clipping technique, the minimum and maximum thresholds are defined, and the
range is set accordingly so that the gradient's value lies in between the minimum
and maximum value.
After the computation of the gradients through backpropagation, inspection of the
gradient component is done, if the gradient component is greater than the
maximum threshold it's value is set to maximum threshold and if the gradient
component is lower than the minimum threshold value mentioned then the value
of the gradient component is set to minimum threshold value and if the value of
the gradient component lies in between the range of minimum and maximum
threshold value then the gradient component is set as it is and not changed.

Clipping by Norm:

In the 'clipping by norm' technique of gradient clipping the gradients are clipped if
their norm (or their size) is greater than the specified threshold value. In contrast
to the 'clipping by value' here in this case the values of the gradients greater than
or less than the threshold values are not set to the threshold values. It makes sure
that the norm of the updated gradients remains small and manageable, and the
learning process is more stable. There are different types of 'clipping by norm'
techniques let's explore them one by one.
 L2 Norm Clipping:
In this form of norm clipping, the gradient is clipped if its L2 norm (Euclidean
norm) exceeds the predefined threshold value. The L2 norm is calculated as the
square root of the sum of the squared values of its components. Considering the
gradient vector as g = [\nabla\theta_1, \nabla\theta_2, ..., \nabla\theta_n], where
g_i is the gradient with respect to the i^{th} parameter of the model and n is the
total number of model parameters, the L2 norm is represented as:
||\nabla\theta||_2 = \sqrt{\Sigma_{i=1}^n \nabla\theta_i^2}
Now, if the L2 norm exceeds the threshold value the upgraded gradient after
clipping of the components becomes:
\nabla\theta = \frac{threshold}{||\nabla\theta||_2}.\nabla\theta
 L1 Norm Clipping:
L1 norm clipping is similar to the L2 norm gradient clipping technique; here the
gradient is rescaled if its L1 norm exceeds the threshold that we define in
alignment with our specific requirements. The L1 norm of a gradient is the
sum of the absolute values of all its components. Therefore, the L1 norm is
represented as:
||\nabla\theta||_1 = \Sigma_{i=1}^n|\nabla\theta_i|
If the L1 norm exceeds the threshold value, the upgraded gradient after clipping
becomes:
\nabla\theta = \frac{threshold}{||\nabla\theta||_1}.\nabla\theta
Gradient clipping by norm provides a more global control over the gradients, and it
is often used to address the exploding gradient problem.
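A small NumPy sketch of clipping by L2 norm; the threshold and the example gradient vector are made-up values for illustration.

import numpy as np

def clip_by_l2_norm(grad, threshold):
    # Rescale the whole gradient vector if its L2 norm exceeds the threshold
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = (threshold / norm) * grad
    return grad

g = np.array([3.0, 4.0])                   # example gradient with L2 norm 5
print(clip_by_l2_norm(g, threshold=1.0))   # rescaled to norm 1: [0.6, 0.8]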

Necessity of Gradient Clipping

Gradient Clipping is a crucial step in the training of neural networks since it helps
in addressing the issue of exploding gradients. The exploding gradients problem
arises when the gradients in the backpropagation process become excessively
large, causing instability in the training of the model. Below are some of the
most important points that explain the necessity of gradient clipping in neural
network training.
1. Stability of Training: During the training of neural network, the optimization
algorithm adjusts the value of model parameters with the help of obtained
gradients. If the gradients are too large, the weights of the model are
updated by a large amount, causing the model to oscillate and diverge instead of
converging to an optimal solution. Gradient clipping limits the size of
the gradient and eliminates this instability in the model.
2. Improving Generalization: Large gradients might cause the model to overfit to
the training data, which in turn might capture more noise and makes the model
bad at generalization. Gradient clipping removes this hindrance and makes the
model generalize better on new data preventing extreme updates.
3. Convergence to Optimal Solution: Exploding gradients prevent the model from
converging to an optimal solution and instead produce unstable modelling of the
data. By clipping the gradient values we reduce the possibility of instability, and
the model gets better at navigating the parameter space, enabling consistent
progress toward the optimal solution.
4. Compatibility with Activation Function: Some of the activation functions such as
'Sigmoid' and 'tanh' functions are sensitive to large input. Gradient clipping
ensures the gradient passed through the activation function is within a
reasonable range which also helps in removing undesirable behavior like
saturation.
5. Mitigating Vanishing Gradient Problem: Sometimes the gradient of the loss
function with respect to the weights become extremely small which causes the
weight to stop updating or even halt the process of training the model. Norm-
based gradient clipping helps in preventing the vanishing gradient problem by
maintaining the range of value for the gradient that is effective for training the
model.

Gradient Clipping in Keras

Keras supports gradient clipping for every optimization algorithm, with the same
clipping applied to all layers in the model. Gradient clipping may be used with
an optimization algorithm, for example stochastic gradient descent, by passing an
extra argument when configuring the optimization algorithm.
We can use two types of gradient clipping.

 Gradient norm scaling

 Gradient value clipping
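Both options are passed as extra arguments to the optimizer; a brief sketch with Keras SGD is shown below (the learning rate and clipping thresholds are assumed values chosen for illustration).

from tensorflow.keras.optimizers import SGD

# Gradient norm scaling: rescale the gradient vector if its L2 norm exceeds 1.0
opt_norm = SGD(learning_rate=0.01, clipnorm=1.0)

# Gradient value clipping: clip each gradient component to the range [-0.5, 0.5]
opt_value = SGD(learning_rate=0.01, clipvalue=0.5)

# Either optimizer is then passed to model.compile(optimizer=..., loss=...)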

Second Order Derivatives

The Second Order Derivative is defined as the derivative of the first derivative of
the given function. The first-order derivative at a given point gives us the
information about the slope of the tangent at that point or the instantaneous rate
of change of a function at that point.
Second-Order Derivative gives us the idea of the shape of the graph of a given
function. The second derivative of a function f(x) is usually denoted as f”(x). It is
also denoted by D2y or y2 or y” if y = f(x).

Let y = f(x)
Then, dy/dx = f'(x)
If f'(x) is differentiable, we may differentiate (1) again w.r.t x. Then, the left-hand
side becomes d/dx(dy/dx) which is called the second order derivative of y w.r.t x.
Second Order Derivatives Overview

Definition: The second order derivative of a function f(x) is the derivative of its
first derivative f′(x). It measures how the rate of change of f′(x) itself changes.

Notation: Denoted as f′′(x), d²y/dx², or d²f/dx².

Significance:
– Concavity: f′′(x) > 0 indicates the graph of f(x) is concave up; f′′(x) < 0
indicates concave down.
– Inflection Points: Points where f′′(x) = 0 may indicate a change in concavity,
known as inflection points.

Basic Rules:
– Constant Rule: f(x) = c ⇒ f′′(x) = 0
– Power Rule: f(x) = x^n ⇒ f′′(x) = n(n−1)x^(n−2)
– Exponential Rule: f(x) = e^x ⇒ f′′(x) = e^x
– Logarithmic Rule: f(x) = ln(x) ⇒ f′′(x) = −1/x²

Second Order Derivatives Examples


Example 1: Find d2y/dx2, if y = x3?
Given that, y = x3
Then, first derivative will be
dy/dx = d/dx (x3) = 3x2
Again, we will differentiate further to find its second derivative:
d2y/dx2 = d/dx (dy/dx)
= d/dx (3x2)
= 6x

Example 2: y = logx, Find d2y/dx2?


Solution:
Given that, y = logx
Then first derivative will be,
dy/dx = d/dx (logx)
= (1 / x)
Again, we will further differentiate to find its second derivative,
d2y/dx2 = d/dx (dy/dx)
= d/dx (1 / x) (from first derivative)
= -1 / x2
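Both examples can be checked with symbolic differentiation, for instance with SymPy as in the short sketch below.

import sympy as sp

x = sp.Symbol('x', positive=True)

print(sp.diff(x**3, x, 2))       # second derivative of x^3   -> 6*x
print(sp.diff(sp.log(x), x, 2))  # second derivative of log x -> -1/x**2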

Polyak Averaging Technique:

Polyak averaging addresses the noise and oscillation in an optimizer's final iterates
by maintaining a running average of the model parameters throughout the training
process. Instead of using the parameters
from the final iteration as the solution, Polyak averaging computes a weighted
average of the model parameters obtained during all iterations. This averaging
smoothens out the parameter updates and helps in reducing the impact of noise
and oscillations, leading to a more stable and accurate final model.

Polyak Averaging provides numerous advantages to optimization algorithms. One
crucial benefit is its ability to mitigate overfitting. Overfitting occurs when an
algorithm becomes excessively intricate, fitting the noise in the data rather than the
actual pattern. By computing the average of recent parameters, Polyak Averaging
smooths the algorithm's trajectory, minimizing the risk of overfitting.
Objective Function and Optimization Algorithm:
Suppose you have an optimization problem, such as training a machine learning
model. In this context, you have a loss function that you want to minimize, typically
representing the difference between your model's predictions and the actual target
values.
You employ an optimization algorithm, like stochastic gradient descent (SGD), Adam,
or RMSprop, to update the model's parameters iteratively.
Noisy or Fluctuating Objective Function:
In real-world scenarios, the objective function may not be smooth and can exhibit
fluctuations due to various factors like noisy data, batch sampling variability, or
inherent randomness in the problem.
Optimization algorithms can sometimes get stuck in local minima or exhibit oscillatory
behavior when dealing with such non-smooth and fluctuating functions.
Parameter Averaging:
Instead of considering only the final set of parameters obtained after a fixed number of
optimization iterations, Polyak averaging introduces the concept of maintaining two
sets of parameters:
Current Parameters: These are the parameters that the optimization algorithm actively
updates during each iteration.
Averaged Parameters: These are the parameters obtained by averaging the current
parameters over multiple iterations.
Advantages of Polyak Averaging:
Improved Generalization: Averaging the parameters helps in finding a solution that
generalizes better to unseen data. It reduces the risk of overfitting by producing a
smoother and more stable model.
Reduced Sensitivity to Hyperparameters: Polyak averaging can make the model less
sensitive to the learning rate and other hyperparameters, as the averaging process
mitigates the impact of abrupt parameter changes.
Enhanced Robustness: By reducing the impact of noisy updates, Polyak averaging
makes the training process more robust, especially in scenarios where the data is
noisy or the optimization landscape is complex.
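A minimal sketch of keeping averaged parameters alongside the current ones during gradient descent; the toy quadratic loss and the learning rate are assumptions made for illustration.

import numpy as np

grad = lambda w: 2 * (w - 3.0)   # gradient of a toy loss (w - 3)^2

w = 0.0          # current parameters, updated by the optimizer
w_avg = 0.0      # averaged parameters (Polyak average)
lr = 0.1         # learning rate (assumed)

for t in range(1, 101):
    w = w - lr * grad(w)                 # ordinary gradient descent step
    w_avg = w_avg + (w - w_avg) / t      # running average of all iterates

print("final iterate:", w)
print("Polyak average:", w_avg)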
Applications of Polyak Averaging
Polyak averaging is particularly useful in deep learning, where finding an optimal
set of parameters is challenging due to the high dimensionality of the model. By
providing a stable and accurate solution, Polyak averaging contributes significantly
to the training of deep neural networks.

Local Minima
A Local Minima point is a point on any function where the function attains its
minimum value within a certain interval. A point x = a of a function f(x) is called
a Local minimum if the value of f(a) is less than or equal to all nearby values of f(x).
Mathematically, if f(a) ≤ f(a − h) and f(a) ≤ f(a + h) where h > 0 is small, then a
is called the Local minimum point.
Definition of Local Maxima and Local Minima
Local Maxima and Minima indicate where a function attains its highest and lowest
output values within a neighbourhood, giving an idea of its behaviour. Local Minima
and Local Maxima are also called Local Extrema.

Local Maxima

A Local Maxima point is a point on any function where the function attains its
maximum value within a certain interval. A point x = a of a function f(x) is called
a Local maximum if the value of f(a) is greater than or equal to all nearby values of
f(x).
Terms Related to Local Maxima and Local Minima
Important terminology related to Local Maxima and Minima are discussed below:

Maximum Value

If a function attains its maximum output value at an input value x, that value of x is
called a maximum point. If the maximum holds only within a specific range, that
point is called a Local Maxima.

Absolute Maximum

If a function attains its maximum output value at an input value x over the entire
range of the function, that value of x is called the Absolute Maximum.

Minimum Value
If a function attains its minimum output value at an input value x, that value of x is
called a minimum point. If the minimum holds only within a specific range, that
point is called a Local Minima.

Absolute Minimum

If a function attains its minimum output value at an input value x over the entire
range of the function, that value of x is called the Absolute Minimum.

Point of Inversion

If a value of x within the range of the given function yields neither the highest nor
the lowest output, it is called a Point of Inversion.
Properties of Local Maxima and Minima
Understanding the properties of local maxima and minima can help in their
identification:
1. If a function f(x) is continuous in its domain, it must have at least one maximum
or minimum between any two points where the function values are equal.
2. Local maxima and minima occur alternately; between two minima, there must
be a maximum, and vice versa.
3. If f(x) approaches infinity as x approaches the endpoints of the interval and has
only one critical point within the interval, that critical point is an extremum.
Solved Examples on Local Maxima and Local Minima
Example 1: Analyze the Local Maxima and Local Minima of the function f(x) =
2x3 – 3x2 – 12x + 5 by using the first derivative test.
Solution:
Given function is f(x) = 2x3 – 3x2 – 12x + 5
The first derivative of the function is f'(x) = 6x2 – 6x – 12, which is used to find
the critical points.
To find the critical point, f'(x) = 0;
6x2 – 6x – 12 = 0
6(x2 – x – 2) = 0
6(x + 1)(x – 2) = 0
Hence, critical points are x = -1, and x = 2.
Analyze the first derivative at points on either side of the critical
point x = -1, say {-2, 0}:
f'(-2) = 6(4 + 2 – 2) = 6(4) = +24 and f'(0) = 6(0 + 0 – 2) = 6(-2) = -12
The sign of the derivative is positive to the left of x = -1 and negative
to the right. Hence, x = -1 is a Local Maxima.
Now analyze the first derivative at points on either side of the critical
point x = 2, say {1, 3}:
f'(1) = 6(1 – 1 – 2) = 6(-2) = -12 and f'(3) = 6(9 – 3 – 2) = 6(4) = +24
The sign of the derivative is negative to the left of x = 2 and positive
to the right. Hence, x = 2 is a Local Minima.
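The same critical points and their nature can be verified quickly with SymPy (a verification sketch added here, not part of the original example).

import sympy as sp

x = sp.Symbol('x')
f = 2*x**3 - 3*x**2 - 12*x + 5

critical = sp.solve(sp.diff(f, x), x)       # -> [-1, 2]
print("critical points:", critical)
for c in critical:
    # positive second derivative => local minimum, negative => local maximum
    print(c, sp.diff(f, x, 2).subs(x, c))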

***Thank You***
Module 3

Multi-Layer Perceptron (MLP)

A Multi-Layer Perceptron (MLP) is one of the most widely used types of neural
networks.

Multi-Layer Perceptron (MLP) is an artificial neural network widely used for
solving classification and regression tasks. In this module, we will explore the
concept of MLP in depth and demonstrate how to implement it in Python using the
TensorFlow library.

What is a Multilayer Perceptron?


A Multi-Layer Perceptron (MLP) consists of fully connected dense layers that
transform input data from one dimension to another. It is called “multi-layer”
because it contains an input layer, one or more hidden layers, and an output layer.
The purpose of an MLP is to model complex relationships between inputs and
outputs, making it a powerful tool for various machine learning tasks.
The key components of Multi-Layer Perceptron includes:
 Input Layer: Each neuron (or node) in this layer corresponds to an input
feature. For instance, if you have three input features, the input layer will have
three neurons.
 Hidden Layers: An MLP can have any number of hidden layers, with each layer
containing any number of nodes. These layers process the information received
from the input layer.
 Output Layer: The output layer generates the final prediction or result. If there
are multiple outputs, the output layer will have a corresponding number of
neurons.

Working of Multi-Layer Perceptron


Let’s delve into the working of the multi-layer perceptron and its key mechanisms:
forward propagation, the loss function, backpropagation, and optimization.
Step 1: Forward Propagation
In forward propagation, the data flows from the input layer to the output layer,
passing through any hidden layers. Each neuron in the hidden layers processes the
input as follows:
1. Weighted Sum: The neuron computes the weighted sum of the inputs:
z = ∑i wi·xi + b
Where:
o xi is the input feature.
o wi is the corresponding weight.
o b is the bias term.
2. Activation Function: The weighted sum z is passed through an activation
function to introduce non-linearity. Common activation functions include:
Sigmoid: σ(z) = 1/(1 + e^(−z))
ReLU (Rectified Linear Unit): f(z) = max(0, z)
Tanh (Hyperbolic Tangent): tanh(z) = 2/(1 + e^(−2z)) − 1
Step 2: Loss Function
Once the network generates an output, the next step is to calculate the loss using
a loss function. In supervised learning, this compares the predicted output to the
actual label.
For a classification problem, the commonly used binary cross-entropy loss function
is:
L = −(1/N) ∑ [ yi·log(ŷi) + (1 − yi)·log(1 − ŷi) ]
Where:
 yi is the actual label.
 ŷi is the predicted label.
 N is the number of samples.
For regression problems, the mean squared error (MSE) is often used:
MSE = (1/N) ∑ (yi − ŷi)²
Step 3: Backpropagation
The goal of training an MLP is to minimize the loss function by adjusting the
network’s weights and biases. This is achieved through backpropagation:
1. Gradient Calculation: The gradients of the loss function with respect to each
weight and bias are calculated using the chain rule of calculus.
2. Error Propagation: The error is propagated back through the network, layer by
layer.
3. Gradient Descent: The network updates the weights and biases by moving in the
opposite direction of the gradient to reduce the loss: w = w − η·(∂L/∂w)
Where:
o w is the weight.
o η is the learning rate.
o ∂L/∂w is the gradient of the loss function with respect to the weight.
Step 4: Optimization
MLPs rely on optimization algorithms to iteratively refine the weights and biases
during training. Popular optimization methods include:
Stochastic Gradient Descent (SGD): Updates the weights based on a single sample
or a small batch of data: w = w − η·(∂L/∂w)
Adam Optimizer: An extension of SGD that incorporates momentum and adaptive
learning rates for more efficient training:
mt = β1·mt−1 + (1 − β1)·gt,   vt = β2·vt−1 + (1 − β2)·gt²
w = w − η·m̂t / (√v̂t + ε),   with m̂t = mt / (1 − β1^t) and v̂t = vt / (1 − β2^t)
Here, gt represents the gradient at time t, and β1, β2 are decay rates.
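Since the text above promises a TensorFlow implementation, the following is a hedged sketch of a small MLP in Keras; the layer sizes, number of classes, and the random training data are illustrative assumptions rather than part of the original material.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Illustrative MLP: 4 input features, two hidden layers, 3 output classes (assumed sizes)
model = models.Sequential([
    layers.Input(shape=(4,)),
    layers.Dense(16, activation='relu'),    # hidden layer 1
    layers.Dense(8, activation='relu'),     # hidden layer 2
    layers.Dense(3, activation='softmax')   # output layer: class probabilities
])

# Adam optimizer and a cross-entropy loss, as described in the steps above
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Dummy data just to make the sketch runnable
X_train = np.random.rand(100, 4).astype('float32')
y_train = np.random.randint(0, 3, size=(100,))
model.fit(X_train, y_train, epochs=5, batch_size=32, verbose=0)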
Feed-forward Network Mappings

What is a Feedforward Neural Network?


A Feedforward Neural Network (FNN) is a type of artificial neural network where
connections between the nodes do not form cycles. This characteristic
differentiates it from recurrent neural networks (RNNs). The network consists of an
input layer, one or more hidden layers, and an output layer. Information flows in
one direction—from input to output—hence the name "feedforward."

Structure of a Feedforward Neural Network

1. Input Layer: The input layer consists of neurons that receive the input data.
Each neuron in the input layer represents a feature of the input data.
2. Hidden Layers: One or more hidden layers are placed between the input and
output layers. These layers are responsible for learning the complex patterns in
the data. Each neuron in a hidden layer applies a weighted sum of inputs
followed by a non-linear activation function.
3. Output Layer: The output layer provides the final output of the network. The
number of neurons in this layer corresponds to the number of classes in a
classification problem or the number of outputs in a regression problem.
Mathematical Representation:
For a given input x, the network output is y = f(Wx + b)

where:

o x = input vector
o W = weight matrix
o b = bias vector
o f = activation function (like Sigmoid, ReLU, etc.)

Activation Functions
Activation functions introduce non-linearity into the network, enabling it to learn
and model complex data patterns. Common activation functions include:
 Sigmoid: σ(x) = 1 / (1 + e−x)
 Tanh: tanh(x) = (ex − e−x) / (ex + e−x)
 ReLU (Rectified Linear Unit): ReLU(x) = max(0, x)
 Leaky ReLU: Leaky ReLU(x) = max(0.01x, x)
Training a Feedforward Neural Network
Training a Feedforward Neural Network involves adjusting the weights of the
neurons to minimize the error between the predicted output and the actual output.
This process is typically performed using backpropagation and gradient descent.
1. Forward Propagation: During forward propagation, the input data passes
through the network, and the output is calculated.
2. Loss Calculation: The loss (or error) is calculated using a loss function such as
Mean Squared Error (MSE) for regression tasks or Cross-Entropy Loss for
classification tasks.
3. Backpropagation: In backpropagation, the error is propagated back through
the network to update the weights. The gradient of the loss function with respect
to each weight is calculated, and the weights are adjusted using gradient
descent.
Evaluation of a Feedforward Neural Network
Evaluating the performance of the trained model involves several metrics:
 Accuracy: The proportion of correctly classified instances out of the total
instances.
 Precision: The ratio of true positive predictions to the total predicted positives.
 Recall: The ratio of true positive predictions to the actual positives.
 F1 Score: The harmonic mean of precision and recall, providing a balance
between the two.
 Confusion Matrix: A table used to describe the performance of a classification
model, showing the true positives, true negatives, false positives, and false
negatives.
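As an illustration, these metrics could be computed with scikit-learn (scikit-learn and the label arrays below are assumptions for the sketch, not part of the text):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1]   # actual labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions (illustrative)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))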

Threshold Units

The Threshold Logic Units (TLUs)

Threshold Logic Units (TLUs), also known as Linear Threshold Units
(LTUs), were initially proposed by Frank Rosenblatt in the late 1950s as the basic
units of Perceptrons.
These computational units are based on a threshold activation
function, i.e., they apply a step function to the weighted sum of their
inputs; the inputs and outputs are now numbers (instead of the binary
on/off values used in the McCulloch-Pitts model).
[Figure: Threshold Logic Unit – Image Source: O'Reilly]
There are two common step functions used in TLUs: Heaviside and Sign.

TLUs are capable of defining linear decision boundaries in input space. In


two-dimensional space, for example, a single TLU can separate points into two
classes (binary classification) using a straight line.
It simply computes a linear combination of the inputs: if the result exceeds a
threshold, it outputs the positive class or else outputs the negative class (just like a
Logistic Regression classifier). Training a TLU in this case means finding the right
values for weights.

 Used in Perceptrons for binary classification.
 Limitation: Threshold units are linear, so they can't solve non-linear
problems like XOR. This is why MLPs use non-linear activation functions.
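A minimal NumPy sketch of a single TLU with a Heaviside step output (the weights and threshold are illustrative assumptions; here they happen to implement an AND-like decision boundary):

import numpy as np

def tlu(x, w, b):
    # Threshold Logic Unit: step function applied to the weighted sum
    z = np.dot(w, x) + b          # linear combination of the inputs
    return 1 if z >= 0 else 0     # Heaviside step: positive class if z >= threshold

w = np.array([1.0, 1.0])          # illustrative weights
b = -1.5                          # threshold of 1.5 expressed as a negative bias

print(tlu(np.array([1, 1]), w, b))   # 1 -> point lies above the decision line
print(tlu(np.array([0, 1]), w, b))   # 0 -> point lies below the decision line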

2.3 Sigmoidal Units

A sigmoid function is any mathematical function whose graph has a characteristic


S-shaped or sigmoid curve.

A sigmoid function is a bounded, differentiable, real function that is defined for all
real input values and has a non-negative derivative at each point and exactly
one inflection point.

A common example of a sigmoid function is the logistic function, which is defined by
the formula:[1]
σ(x) = 1 / (1 + e−x)
Properties:

o Outputs values between (0, 1).


o Used in binary classification problems.
o Unlike threshold units, they are smooth and differentiable, making them
useful for gradient-based learning.

Properties

In general, a sigmoid function is monotonic, and has a first derivative which is bell
shaped. Conversely, the integral of any continuous, non-negative, bell-shaped
function (with one local maximum and no local minimum, unless degenerate) will be
sigmoidal. Thus the cumulative distribution functions for many common probability
distributions are sigmoidal. One such example is the error function, which is related
to the cumulative distribution function of a normal distribution.

A sigmoid function is convex for values less than a particular point, and it
is concave for values greater than that point: in many of the examples here, that
point is 0.

Examples

 Logistic function
 Hyperbolic tangent (shifted and scaled version of the logistic function, above)

 Arctangent function

 Gudermannian function

 Error function

Applications
Many natural processes, such as those of complex system learning curves, exhibit a
progression from small beginnings that accelerates and approaches a climax over
time. When a specific mathematical model is lacking, a sigmoid function is often
used.[6]

The van Genuchten–Gupta model is based on an inverted S-curve and applied to the
response of crop yield to soil salinity.

Examples of the application of the logistic S-curve to the response of crop yield
(wheat) to both the soil salinity and depth to water table in the soil are shown
in modeling crop response in agriculture.

In artificial neural networks, sometimes non-smooth functions are used instead for
efficiency; these are known as hard sigmoids.

Weight-space Symmetries

In the context of deep learning, weight-space symmetry means that non-
identifiable models are invariant to permutations of the units in their weight layers.
This symmetry holds because in deep learning there are generally not enough training
samples to rule out all parameter settings but one; there usually exists a large
number of possible weight combinations for a given dataset that yield similar model
performance.
Weight-space symmetry is a property of neural network landscapes that describes
how permutation symmetries give rise to multiple equivalent global minima in the
weight space. This property can have implications for training dynamics, and can
also be used to uncover a model's underlying structure

 Weight-space symmetry can also give rise to first-order saddle points on the path
between the global minima.
 A challenging problem in machine learning is to process weight-space features,
which involves transforming or extracting information from the weights and
gradients of a neural network.
 The weight space is a concatenation of all the weight and biases.
 The symmetry group acts on each one of those independently, which is called a
direct-sum of representations.
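A small NumPy sketch of this permutation symmetry: permuting the hidden units of a two-layer network, together with the corresponding rows and columns of its weight matrices, leaves the output unchanged. The network sizes and values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                   # input vector
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)     # hidden layer (4 units)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)     # output layer

def forward(W1, b1, W2, b2, x):
    h = np.tanh(W1 @ x + b1)      # hidden activations
    return W2 @ h + b2            # network output

perm = np.array([2, 0, 3, 1])     # an arbitrary permutation of the 4 hidden units
y_original = forward(W1, b1, W2, b2, x)
y_permuted = forward(W1[perm], b1[perm], W2[:, perm], b2, x)

print(np.allclose(y_original, y_permuted))   # True: the two weight settings are equivalent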

2.5 Error Back-propagation

Backpropagation in Neural Network

Backpropagation is a powerful algorithm in deep learning, primarily used to train


artificial neural networks, particularly feed-forward networks. It works iteratively,
minimizing the cost function by adjusting weights and biases.
In each epoch, the model adapts these parameters, reducing loss by following the
error gradient. Backpropagation often utilizes optimization algorithms like
gradient descent or stochastic gradient descent. The algorithm computes the
gradient using the chain rule from calculus, allowing it to effectively navigate
complex layers in the neural network to minimize the cost function.
Working of Backpropagation Algorithm
The Backpropagation algorithm involves two main steps: the Forward Pass and
the Backward Pass.

How Does the Forward Pass Work?

In the forward pass, the input data is fed into the input layer. These inputs,
combined with their respective weights, are passed to hidden layers.
For example, in a network with two hidden layers (h1 and h2 as shown in Fig. (a)),
the output from h1 serves as the input to h2. Before applying an activation
function, a bias is added to the weighted inputs.
Each hidden layer applies an activation function like ReLU (Rectified Linear Unit),
which returns the input if it’s positive and zero otherwise. This adds non-linearity,
allowing the model to learn complex relationships in the data. Finally, the outputs
from the last hidden layer are passed to the output layer, where an activation
function, such as softmax, converts the weighted outputs into probabilities for
classification.

How Does the Backward Pass Work?


In the backward pass, the error (the difference between the predicted and actual
output) is propagated back through the network to adjust the weights and biases.
One common method for error calculation is the Mean Squared Error (MSE), given
by:
MSE = (Predicted Output − Actual Output)²
Once the error is calculated, the network adjusts weights using gradients, which
are computed with the chain rule. These gradients indicate how much each weight
and bias should be adjusted to minimize the error in the next iteration. The
backward pass continues layer by layer, ensuring that the network learns and
improves its performance. The activation function, through its derivative, plays a
crucial role in computing these gradients during backpropagation.
Example of Backpropagation in Machine Learning
Let’s walk through an example of backpropagation in machine learning. Assume
the neurons use the sigmoid activation function for the forward and backward
pass. The target output is 0.5, and the learning rate is 1.

Error Calculation
To calculate the error, we can use the formula below:
Errorj = ytarget − y5
Error = 0.5 − 0.67 = −0.17
Using this error value, we will be backpropagating.

Backpropagation

1. Calculating Gradients
The change in each weight is calculated as:
Δwij = η × δj × Oj
Where:
 δj is the error term for each unit,
 η is the learning rate.
2. Output Unit Error
For the output unit:
δ5 = y5(1 − y5)(ytarget − y5)
   = 0.67(1 − 0.67)(−0.17) = −0.0376
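A tiny Python check of the numbers in this example (y5 = 0.67 and the target 0.5 come from the text; the rest is just arithmetic):

y5 = 0.67          # network output from the example
y_target = 0.5     # desired target output

error = y_target - y5                       # -0.17
delta5 = y5 * (1 - y5) * (y_target - y5)    # sigmoid derivative times the error
print(round(error, 2), round(delta5, 4))    # -0.17 and -0.0376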

Advantages of Backpropagation for Neural Network Training


The key benefits of using the backpropagation algorithm are:
 Ease of Implementation: Backpropagation is beginner-friendly, requiring no
prior neural network knowledge, and simplifies programming by adjusting
weights via error derivatives.
 Simplicity and Flexibility: Its straightforward design suits a range of tasks,
from basic feedforward to complex convolutional or recurrent networks.
 Efficiency: Backpropagation accelerates learning by directly updating weights
based on error, especially in deep networks.
 Generalization: It helps models generalize well to new data, improving
prediction accuracy on unseen examples.
 Scalability: The algorithm scales efficiently with larger datasets and more
complex networks, making it ideal for large-scale tasks.
Challenges with Backpropagation
While backpropagation is powerful, it does face some challenges:
1. Vanishing Gradient Problem: In deep networks, the gradients can become
very small during backpropagation, making it difficult for the network to learn.
This is common when using activation functions like sigmoid or tanh.
2. Exploding Gradients: The gradients can also become excessively large,
causing the network to diverge during training.
3. Overfitting: If the network is too complex, it might memorize the training data
instead of learning general patterns.

Module 3: Radial Basis Function Networks (RBFNs)

Radial Basis Function (RBF) Neural Networks are a specialized type of Artificial
Neural Network (ANN) used primarily for function approximation tasks. Known for
their distinct three-layer architecture and universal approximation capabilities,
RBF Networks offer faster learning speeds and efficient performance in
classification and regression problems. This article delves into the workings,
architecture, and applications of RBF Neural Networks.
What are Radial Basis Functions?
Radial Basis Functions (RBFs) are a special category of feed-forward neural
networks comprising three layers:
1. Input Layer: Receives input data and passes it to the hidden layer.
2. Hidden Layer: The core computational layer where RBF neurons process the
data.
3. Output Layer: Produces the network’s predictions, suitable for classification or
regression tasks.
How Do RBF Networks Work?
RBF Networks are conceptually similar to K-Nearest Neighbor (k-NN) models,
though their implementation is distinct. The fundamental idea is that an item's
predicted target value is influenced by nearby items with similar predictor variable
values. Here’s how RBF Networks operate:
1. Input Vector: The network receives an n-dimensional input vector that needs
classification or regression.
2. RBF Neurons: Each neuron in the hidden layer represents a prototype vector
from the training set. The network computes the Euclidean distance between
the input vector and each neuron's center.
3. Activation Function: The Euclidean distance is transformed using a Radial
Basis Function (typically a Gaussian function) to compute the neuron’s activation
value. This value decreases exponentially as the distance increases.
4. Output Nodes: Each output node calculates a score based on a weighted sum
of the activation values from all RBF neurons. For classification, the category
with the highest score is chosen.
Key Characteristics of RBFs
 Radial Basis Functions: These are real-valued functions dependent solely on
the distance from a central point. The Gaussian function is the most commonly
used type.
 Dimensionality: The network's dimensions correspond to the number of predictor
variables.
 Center and Radius: Each RBF neuron has a center and a radius (spread). The
radius affects how broadly each neuron influences the input space.
Architecture of RBF Networks
The architecture of an RBF Network typically consists of three layers:
Input Layer
 Function: After receiving the input features, the input layer sends them straight
to the hidden layer.
 Components: It is made up of the same number of neurons as the
characteristics in the input data. One feature of the input vector corresponds to
each neuron in the input layer.
Hidden Layer
 Function: This layer uses radial basis functions (RBFs) to conduct the non-linear
transformation of the input data.
 Components: Neurons in the hidden layer apply the RBF to the incoming data.
The Gaussian function is the RBF that is most frequently utilized.
 RBF Neurons: Every neuron in the hidden layer has a spread parameter (σ) and
a center, which are also referred to as prototype vectors. The spread parameter
modulates the distance between the center of an RBF neuron and the input
vector, which in turn determines the neuron's output.
Output Layer
 Function: The output layer uses weighted sums to integrate the hidden layer
neurons' outputs to create the network's final output.
 Components: It is made up of neurons that combine the outputs of the hidden
layer in a linear fashion. To reduce the error between the network's predictions
and the actual target values, the weights of these combinations are changed
during training.
Training Process of radial basis function neural network

An RBF neural network must be trained in three stages: choosing the centers,
determining the spread parameters, and training the output weights.
Step 1: Selecting the Centers
 Techniques for Center Selection: Centers can be picked at random from the
training data or by applying techniques such as k-means clustering.
 K-Means Clustering: This widely used center-selection technique groups the
input data into k clusters; the centers of these clusters are employed as the
centers of the RBF neurons.
Step 2: Determining the Spread Parameters
 The spread parameter (σ) governs each RBF neuron's area of effect and
establishes the width of the RBF.
 Calculation: The spread parameter can be manually adjusted for each neuron
or set as a constant for all neurons. Setting σ based on the separation between
the centers is a popular method, frequently accomplished with the help of a
heuristic such as dividing the greatest distance between centers by the square root
of twice the number of centers.
Step 3: Training the Output Weights
 Linear Regression: The objective of linear regression techniques, which are
commonly used to estimate the output layer weights, is to minimize the error
between the anticipated output and the actual target values.
 Pseudo-Inverse Method: One popular technique for determining the weights is
to utilize the pseudo-inverse of the hidden-layer output matrix.
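A hedged Python sketch of this three-stage procedure (k-means for the centers, the distance heuristic for σ, and the pseudo-inverse for the output weights); the dataset, the number of centers, and the helper function are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

# Illustrative 1-D regression data: y = sin(x) plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

# Step 1: choose the centers with k-means
n_centers = 10
centers = KMeans(n_clusters=n_centers, n_init=10, random_state=0).fit(X).cluster_centers_

# Step 2: set the spread with the heuristic sigma = d_max / sqrt(2 * n_centers)
d_max = np.max(np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1))
sigma = d_max / np.sqrt(2 * n_centers)

def rbf_design(X, centers, sigma):
    # Gaussian RBF activation for every (sample, center) pair
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    return np.exp(-dists ** 2 / (2 * sigma ** 2))

# Step 3: train the output weights with the pseudo-inverse (linear least squares)
Phi = rbf_design(X, centers, sigma)
w = np.linalg.pinv(Phi) @ y

print("training MSE:", np.mean((y - Phi @ w) ** 2))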

Assessment Questions
1. Define feed-forward network mapping and explain its significance in MLPs.
2. Explain the role of backpropagation in training MLPs.
3. What is weight-space symmetry, and how does it affect neural network
training?
4. Describe the role of the radial basis function in RBFNs.
5. How can you train an RBF network using K-means clustering?

***Thank You***

Module 4
ERROR FUNCTIONS
Error function
In mathematics, the error function (also called the Gauss error function), often
denoted by erf, is a function erf: ℂ → ℂ defined as:
erf(z) = (2/√π) ∫₀ᶻ e^(−t²) dt

The integral here is a complex contour integral which is path-independent
because the integrand e^(−t²) is holomorphic on the whole complex plane. In many
applications, the function argument is a real number, in which case the function
value is also real.

In some old texts, the error function is defined without the factor of 2/√π.
This nonelementary integral is a sigmoid function that occurs often
in probability, statistics, and partial differential equations.

Two closely related functions are the complementary error function, erfc(z) = 1 − erf(z), and the imaginary error function, erfi(z) = −i·erf(iz).

The name "error function" and its abbreviation erf were proposed by J. W. L.
Glaisher in 1871 on account of its connection with "the theory of Probability, and
notably the theory of Errors."[3] The error function complement was also discussed
by Glaisher in a separate publication in the same year.[4] There, it was applied to the
"law of facility" of errors, whose density is given by the normal distribution.
Applications
When the results of a series of measurements are described by a normal
distribution with standard deviation σ and expected value 0, then erf(a/(σ√2)) is the
probability that the error of a single measurement lies between −a and +a, for
positive a. This is useful, for example, in determining the bit error rate of a digital
communication system.

The error and complementary error functions occur, for example, in solutions of
the heat equation when boundary conditions are given by the Heaviside step
function.
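For instance, the probability that such an error lies within one standard deviation (a = σ) can be computed with Python's standard-library erf:

import math

sigma = 1.0
a = sigma                                   # ask for |error| <= one standard deviation
p = math.erf(a / (sigma * math.sqrt(2)))    # erf(a / (sigma * sqrt(2)))
print(p)                                    # ~0.6827, the familiar 68% of a normal distribution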

Sum of Squares
The sum of squares means the sum of the squares of the given numbers. In
statistics, it is the sum of the squares of the variation of a dataset. For this, we need
to find the mean of the data and find the variation of each data point from the
mean, square them and add them. In algebra, the sum of the square of two
numbers is determined using the (a + b)² identity. We can also find the sum of
squares of the first n natural numbers using a formula. The formula can be derived
using the principle of mathematical induction. We do these basic arithmetic
operations which are required in statistics and algebra. There are different
techniques to find the sum of squares of given numbers.
In this article, we will discuss the different sum of squares formulas. To calculate the
sum of two or more squares in an expression, the sum of squares formula is used.
Also, the sum of squares formula is used to describe how well the data being
modeled is represented by a model. Let us learn these along with a few solved
examples in the upcoming sections for a better understanding.

What is the Sum of Squares?

The sum of squares in statistics is a tool that is used to evaluate the dispersion of a
dataset. To evaluate this, we take the sum of the square of the variation of each
data point. In algebra, we find the sum of squares of two numbers using
the algebraic identity of (a + b)2. Also, in mathematics, we find the sum of squares
of n natural numbers using a specific formula which is derived using the principle of
mathematical induction. Let us now discuss the formulas of finding the sum of
squares in different areas of mathematics.
Sum of Squares Formula

The sum of squares formula in statistics is used to describe how well the data being
modeled is represented by a model. It shows the dispersion of the dataset. To
calculate the sum of two or more squares in an expression, the sum
of squares formula is used. Thus, a few sums of squares formulas are,
 In statistics: Sum of squares of n data points = ∑i=1n (xi − x̄)²
 In algebra: Sum of squares = a² + b² = (a + b)² − 2ab
 Sum of squares of n natural numbers formula: 1² + 2² + 3² + ... + n² = [n(n+1)
(2n+1)] / 6
Where,
 ∑ = represents sum
 xi = each value in the set
 x̄ = mean of the values
 xi – x̄ = deviation from the mean value
 (xi – x̄)2 = square of the deviation
 a, b = arbitrary numbers
 n = number of terms in the series
Let a and b be the two numbers. Assuming the squares of a and b are a² and b². The
sum of the squares of a and b is a² + b². We could obtain a formula using the
known algebraic identity (a + b)² = a² + b² + 2ab. Subtracting 2ab from both
sides we can conclude that a² + b² = (a + b)² − 2ab. Let a, b, c be the 3 numbers for
which we are supposed to find the sum of squares. The sum of their squares is a² +
b² + c². Using the known algebraic identity (a + b + c)² = a² + b² + c² + 2ab + 2bc
+ 2ca, we can evaluate that a² + b² + c² = (a + b + c)² − 2ab − 2bc − 2ca.
In statistics, the sum of squares error (SSE) is the difference between the observed
value and the predicted value. It is also called the residual sum of squares, as
it is the sum of the squares of the residuals, that is, the deviations of the predicted values
from the actual values. The formula for the sum of squares error is given by,
SSE = ∑i=1n (yi − f(xi))², where yi is the ith value of the variable to be predicted, f(xi)
is the predicted value, and xi is the ith value of the explanatory variable.
We can also evaluate the sum of squares error (SSE) by subtracting the sum of
squares regression (SSR) from the sum of squares total (SST), that is, SSE = SST -
SSR

Important Notes on Sum of Squares


 The sum of squares in statistics is a tool that is used to evaluate the dispersion of
a dataset.
 SSE = SST - SSR
 Sum of squares of n data points = ∑i=1n (xi − x̄)²
Steps to Find Sum of Squares

The total sum of squares can be calculated in statistics using the following steps:
 Step 1: In the dataset, count the number of data points.
 Step 2: Calculate the mean of the data.
 Step 3: Subtract each data point from the mean.
 Step 4: Determine the square of the difference determined in step 3.
 Step 5: Add the squares determined in step 4.
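A short Python sketch of these steps on a small made-up dataset:

data = [2, 4, 4, 4, 5, 5, 7, 9]          # Step 1: the data points (illustrative)
mean = sum(data) / len(data)             # Step 2: mean of the data (here 5.0)
deviations = [x - mean for x in data]    # Step 3: subtract the mean from each point
squares = [d ** 2 for d in deviations]   # Step 4: square each difference
total_ss = sum(squares)                  # Step 5: add the squares
print(mean, total_ss)                    # 5.0 and 32.0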

Sum of Squares Examples

Example 1: Using the sum of squares formula, find the value of 4² + 6².
Solution: To find: the value of 4² + 6²
Given: a = 4, b = 6
Using the sum of squares formula a² + b² = (a + b)² − 2ab, we have
4² + 6² = (4 + 6)² − 2(4)(6)
= 100 − 2(24)
= 100 − 48
= 52
Answer: The value of 4² + 6² is 52.
Example 2: Calculate the sum of the following series: 1² + 2² + 3² + ... + 100²
Solution:
To Find: Sum of the series
Using the sum of squares formula for n terms, 1² + 2² + 3² + ... + n² = [n(n+1)(2n+1)] / 6
Given: n = 100
= [100(100+1)(2×100+1)] / 6
= (100 × 101 × 201) / 6
= 338350
Answer: The sum of the given series is 338350.

Minkowski distance
The Minkowski distance or Minkowski metric is a metric in a normed vector
space which can be considered as a generalization of both the Euclidean
distance and the Manhattan distance. It is named after the German
mathematician Hermann Minkowski.
Definition

The Minkowski distance of order p between two points X = (x1, ..., xn) and
Y = (y1, ..., yn) is defined as:
D(X, Y) = ( ∑i=1n |xi − yi|^p )^(1/p)
For p ≥ 1 the Minkowski distance is a metric as a result of the Minkowski
inequality. When p < 1, the distance between (0,0) and (1,1) is 2^(1/p) > 2, but the
point (0,1) is at a distance 1 from both of these points. Since this violates the
triangle inequality, for p < 1 it is not a metric. However, a metric can be obtained
for these values by simply removing the exponent 1/p. The resulting metric is also
an F-norm.
Minkowski distance is typically used with p being 1 or 2, which correspond to
the Manhattan distance and the Euclidean distance, respectively.[2] In the limiting
case of p reaching infinity, we obtain the Chebyshev distance:
D∞(X, Y) = maxi |xi − yi|

The Minkowski distance can also be viewed as a multiple of the power mean of the
component-wise differences between P and Q

Applications
The Minkowski metric is very useful in the field of machine learning and AI. Many
popular machine learning algorithms use specific distance metrics such as the
aforementioned to compare the similarity of two data points. Depending on the
nature of the data being analyzed, various metrics can be used. The Minkowski
metric is most useful for numerical datasets where you want to determine the
similarity of size between multiple datapoint vectors.
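A small NumPy sketch comparing Minkowski distances for different values of p on two made-up points:

import numpy as np

def minkowski(x, y, p):
    # Minkowski distance of order p between two vectors
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(minkowski(a, b, 1))        # p = 1: Manhattan distance, 5.0
print(minkowski(a, b, 2))        # p = 2: Euclidean distance, ~3.606
print(np.max(np.abs(a - b)))     # p -> infinity: Chebyshev distance, 3.0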

Input-dependent variance

Input-dependent variance is a concept that arises in a variety of contexts,
including signal processing, wireless signal strength, and regression problems. It
refers to situations where the variance of the noise or of the target variable
changes with the value of the input (also known as heteroscedasticity).

Conditional probability distribution


In probability theory and statistics, the conditional probability distribution is a
probability distribution that describes the probability of an outcome given the
occurrence of a particular event. Given two jointly distributed random variables
Y and X, the conditional probability distribution of X given Y is the probability
distribution of X when Y is known to be a particular value; in some cases the
conditional probabilities may be expressed as functions containing the unspecified
value x of X as a parameter. When both X and Y are categorical variables,
a conditional probability table is typically used to represent the conditional
probability. The conditional distribution contrasts with the marginal distribution of a
random variable, which is its distribution without reference to the value of the other
variable.

If the conditional distribution of Y given X is a continuous distribution, then


its probability density function is known as the conditional density function.
[1] The properties of a conditional distribution, such as the moments, are often
referred to by corresponding names such as the conditional mean and conditional
variance.

More generally, one can refer to the conditional distribution of a subset of a set of
more than two variables; this conditional distribution is contingent on the values of
all the remaining variables, and if more than one variable is included in the subset
then this conditional distribution is the conditional joint distribution of the included
variables.

Conditional discrete distributions

For discrete random variables, the conditional probability mass function of Y given
X = x can be written according to its definition as:
pY|X(y | x) = P(Y = y | X = x) = P(X = x, Y = y) / P(X = x),  provided P(X = x) > 0.


Posterior probability
The posterior probability is a type of conditional probability that results
from updating the prior probability with information summarized by
the likelihood via an application of Bayes' rule. From an epistemological
perspective, the posterior probability contains everything there is to know about an
uncertain proposition (such as a scientific hypothesis, or parameter values), given
prior knowledge and a mathematical model describing the observations available at
a particular time. After the arrival of new information, the current posterior
probability may serve as the prior in another round of Bayesian updating.

Definition in the distributional case


In Bayesian statistics, the posterior probability is the probability of the
parameters θ given the evidence X, and is denoted p(θ | X). It contrasts with
the likelihood function, which is the probability of the evidence given the
parameters, p(X | θ).

The posterior probability is therefore proportional to the product Likelihood · Prior
probability:
p(θ | X) ∝ p(X | θ) · p(θ)
Example
Suppose there is a school with 60% boys and 40% girls as students. The girls wear
trousers or skirts in equal numbers; all boys wear trousers. An observer sees a
(random) student from a distance; all the observer can see is that this student is
wearing trousers. What is the probability this student is a girl? The correct answer
can be computed using Bayes' theorem.
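A short Python check of this example using Bayes' theorem (the probabilities follow directly from the percentages stated above):

p_girl, p_boy = 0.4, 0.6                  # prior probabilities of observing a girl or a boy
p_trousers_given_girl = 0.5               # girls wear trousers or skirts in equal numbers
p_trousers_given_boy = 1.0                # all boys wear trousers

p_trousers = p_trousers_given_girl * p_girl + p_trousers_given_boy * p_boy   # 0.8
posterior = p_trousers_given_girl * p_girl / p_trousers                      # Bayes' rule
print(posterior)                          # 0.25 -> a 25% chance the student is a girl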

Calculation
The posterior probability distribution of one random variable given the value of
another can be calculated with Bayes' theorem by multiplying the prior probability
distribution by the likelihood function, and then dividing by the normalizing
constant, as follows:
P(θ | X) = P(X | θ) · P(θ) / P(X)

Cross-entropy for classification


Computer vision and deep learning literature usually say one should
use binary_crossentropy for a binary (two-class) problem
and categorical_crossentropy for more than two classes.

categorical_crossentropy:

 accepts only one correct class per sample


 will take "only" the true neuron and make the crossentropy calculation with
that neuron
binary_crossentropy:

 accepts many correct classes per sample


 will do the crossentropy calculation for "all neurons", considering that each
neuron can be two classes, 0 and 1.
A 2-class problem can be modeled as:

 2-neuron output with only one correct class: softmax + categorical_crossentropy


 1-neuron output, one class is 0, the other is 1: sigmoid + binary_crossentropy

Now notice how binary crossentropy has two terms per output neuron: one for
considering 1 as the correct class, and another for considering 0 as the
correct class.
Categorical Cross-Entropy in Multi-Class Classification
Categorical Cross-Entropy (CCE), also known as softmax loss or log loss, is one of
the most commonly used loss functions in machine learning, particularly for
classification problems. It measures the difference between the predicted
probability distribution and the actual (true) distribution of classes. The function
helps a machine learning model determine how far its predictions are from the
true labels and guides it in learning to make more accurate predictions.

Understanding Categorical Cross-Entropy


Categorical cross-entropy is used when you have more than two classes in your
classification problem (multi-class classification). It measures the difference
between two probability distributions: the predicted probability distribution and
the true distribution, which is represented by a one-hot encoded vector.
In a one-hot encoded vector, the correct class is represented as "1" and all other
classes as "0." Categorical cross-entropy penalizes predictions based on how
confident the model is about the correct class.
Mathematical Representation of Categorical Cross-Entropy
The categorical cross-entropy formula is expressed as:
L = − ∑i=1C yi · log(ŷi)
where C is the number of classes, yi is the one-hot encoded true label, and ŷi is the
predicted probability for class i.

Example : Calculating Categorical Cross-Entropy

Let's break down the categorical cross-entropy calculation with a mathematical


example using the following true labels and predicted probabilities.
We have 3 samples, each belonging to one of 3 classes (Class 1, Class 2, or Class
3). The true labels are one-hot encoded.
1. True Labels (y_true):
Example 1: Class 2 → [0, 1, 0]
Example 2: Class 1 → [1, 0, 0]
Example 3: Class 3 → [0, 0, 1]
2. Predicted Probabilities (y_pred):
Example 1: [0.1, 0.8, 0.1]
Example 2: [0.7, 0.2, 0.1]
Example 3: [0.2, 0.3, 0.5]
Step-by-Step Calculation
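Since the original step-by-step numbers are not reproduced here, the following hedged Python sketch carries out the calculation for the three samples above: only the predicted probability of the true class enters each sample's loss, and the per-sample losses are then averaged.

import numpy as np

y_true = np.array([[0, 1, 0],             # Example 1: Class 2
                   [1, 0, 0],             # Example 2: Class 1
                   [0, 0, 1]])            # Example 3: Class 3
y_pred = np.array([[0.1, 0.8, 0.1],
                   [0.7, 0.2, 0.1],
                   [0.2, 0.3, 0.5]])

# Per-sample loss: minus the log of the probability assigned to the true class
per_sample = -np.sum(y_true * np.log(y_pred), axis=1)
print(per_sample)            # approximately [0.223, 0.357, 0.693]
print(per_sample.mean())     # average categorical cross-entropy, about 0.424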
How Categorical Cross-Entropy Works
To understand how CCE works, let's break it down:
1. Prediction of Probabilities: The model outputs probabilities for each class.
These probabilities are the likelihood of a data point belonging to each class.
Typically, this is done using a softmax function, which converts raw scores into
probabilities.
2. Comparison with True Class: Categorical cross-entropy compares the
predicted probabilities with the actual class labels (one-hot encoded).
3. Calculation of Loss: The logarithm of the predicted probability for the correct
class is taken, and the loss function penalizes the model based on how far the
prediction was from the actual class.
Application of Categorical Cross-Entropy in Multi-Class Classification
Categorical cross-entropy is essential in multi-class classification, where a
model must classify an instance into one of several classes. For example, in an
image classification task, the model might need to identify whether an image is of
a cat, dog, or bird. CCE helps the model adjust its weights during training to make
better predictions.
It's important to note that the CCE loss function assumes that each data point
belongs to exactly one class. If you have a problem where a data point can belong
to multiple classes simultaneously, binary cross-entropy would be a better
choice.
Differences Between Categorical and Binary Cross-Entropy
While both binary and categorical cross-entropy are used to calculate loss in
classification problems, they differ in use cases and how they handle multiple
classes:
 Binary Cross-Entropy is used for binary classification problems where there are
only two possible outcomes (e.g., "yes" or "no").
 Categorical Cross-Entropy is used for multi-class classification where there
are three or more categories, and the model assigns probabilities to each.
Hyperparameters Optimization methods
What are the Hyperparameters?

Hyperparameters are those parameters that we set before training. Hyperparameters
have a major impact on the accuracy and efficiency of the trained model. Therefore
they need to be set carefully to get better and more efficient results. Let's discuss
some hyperparameter optimization methods.
Hyperparameters Optimization Technique

Exhaustive Search Methods

Let's first discuss some exhaustive search methods for optimizing
hyperparameters.
 Grid Search: In Grid Search, the possible values of hyperparameters are
defined in the set. Then these sets of possible values of hyperparameters are
combined by using Cartesian product and form a multidimensional grid. Then we
try all the parameters in the grid and select the hyperparameter setting with the
best result.
 Random Search: This is another variant of Grid Search in which, instead of
trying all the points in the grid, we try random points. This solves a couple of
problems with Grid Search, such as not needing to expand the search
space exponentially every time we add a new hyperparameter.

Drawback:
Random Search and Grid Search are easy to implement and can run in parallel but
here are few drawbacks of these algorithm:
 If the hyperparameter search space is large, it takes a lot of time and
computational power to optimize the hyperparameter.
 There is no guarantee that these algorithms will find the optimal configuration
if the sampling of the search space is not done meticulously.
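As an illustration of grid search, a hedged scikit-learn sketch; the estimator, parameter grid, and dataset are assumptions chosen for the example, not prescribed by the text.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

# Candidate values; their Cartesian product forms the multidimensional grid
param_grid = {
    'hidden_layer_sizes': [(16,), (32,), (32, 16)],
    'alpha': [1e-4, 1e-3, 1e-2],              # L2 regularization strength
    'learning_rate_init': [0.001, 0.01],
}

search = GridSearchCV(MLPClassifier(max_iter=1000, random_state=0),
                      param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)    # hyperparameter setting with the best cross-validation score
print(search.best_score_)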

Bayesian Optimization:

Instead of random guessing, in Bayesian optimization we use our previous
evaluations to guide the choice of hyperparameters. These results are used to form a
probabilistic model mapping hyperparameters to a probability of a score on the
objective function, P(score | hyperparameters).
This function is also called the "surrogate" of the objective function. It is much
easier to optimize than the objective function itself. Below are the steps for applying
Bayesian optimization for hyperparameter optimization:
1. Build a surrogate probability model of the objective function
2. Find the hyperparameters that perform best on the surrogate
3. Apply these hyperparameters to the original objective function
4. Update the surrogate model by using the new results
5. Repeat steps 2–4 until n number of iteration
Other Hyperparameter Estimation Algorithms:

Hyperband:

The underlying principle of this algorithm is that if a hyperparameter configuration


is destined to be the best after a large number of iterations, it is more likely to
perform in the top half of configurations after a small number of iterations. Below
is step-by-step implementation of Hyperband.
 Randomly sample n hyperparameter sets in the search space.
 After k iterations, evaluate the validation loss of these hyperparameters.
 Discard the half of the lowest-performing hyperparameters.
 Run the good ones for k more iterations, evaluate, and discard the bottom
half.
 Repeat until we have only one hyperparameter configuration left.
Drawbacks:
If the number of samples is large, some well-performing hyperparameter sets that
require more time to converge may be discarded early in the optimization.

Population based Training (PBT):

Population based Training (PBT) starts similar to random based training by training
many models in parallel. But rather than the networks training independently, it
uses information from the remainder of the population to refine the
hyperparameters and direct computational resources to models which show
promise. This takes its inspiration from genetic algorithms where each member of
the population, referred to as a worker, can exploit information from the rest of
the population. for instance, a worker might copy the model parameters from a far
better performing worker. It also can explore new hyperparameters by changing
the present values randomly.

Bayesian Optimization and HyperBand (BOHB):

BOHB (Bayesian Optimization and HyperBand) is a combination of the Hyperband


algorithm and Bayesian optimization. First, it uses Hyperband's capability to sample
many configurations with a small budget to explore the hyperparameter search space
quickly and efficiently and to obtain promising configurations early; then it
uses the Bayesian optimizer's predictive capability to propose sets of
hyperparameters that are close to the optimum. This algorithm can also be run in
parallel (as Hyperband can), which overcomes a strong drawback of Bayesian
optimization.
Gradient Descent Algorithm
A gradient is nothing but a derivative that describes how the output of a
function changes with a small variation in its inputs.

What is Gradient Descent?

Gradient Descent stands as a cornerstone orchestrating the intricate dance of


model optimization. At its core, it is a numerical optimization algorithm that aims
to find the optimal parameters—weights and biases—of a neural network by
minimizing a defined cost function.
Gradient Descent (GD) is a widely used optimization algorithm in machine learning
and deep learning that minimises the cost function of a neural network model
during training. It works by iteratively adjusting the weights or parameters of the
model in the direction of the negative gradient of the cost function until the
minimum of the cost function is reached.
Gradient Descent is a fundamental optimization algorithm in machine
learning used to minimize the cost or loss function during model training.
 It iteratively adjusts model parameters by moving in the direction of the
steepest decrease in the cost function.
 The algorithm calculates gradients, representing the partial derivatives of the
cost function concerning each parameter.
Gradient Descent Implementation
Import the necessary libraries

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

Set the input and output data

# set random seed for reproducibility


torch.manual_seed(42)

# set number of samples


num_samples = 1000

# create random features with 2 dimensions


x = torch.randn(num_samples, 2)

# create the true weights and bias for the linear regression model
true_weights = torch.tensor([1.3, -1.0])
true_bias = torch.tensor([-3.5])

# Target variable: y = x·w + b (no transpose needed for a 1-D weight tensor)
y = x @ true_weights + true_bias

# Plot the dataset


fig, ax = plt.subplots(1, 2, sharey=True)
ax[0].scatter(x[:,0],y)
ax[1].scatter(x[:,1],y)
ax[0].set_xlabel('X1')
ax[0].set_ylabel('Y')
ax[1].set_xlabel('X2')
ax[1].set_ylabel('Y')
plt.show()

Output: two scatter plots of the target Y against the features X1 and X2.
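The snippet above only prepares the data; a minimal sketch of the gradient-descent loop it builds toward could look like the following (the learning rate, number of epochs, and the plain manual update are assumptions, not taken from the text). It reuses the x and y tensors created above.

# Learnable parameters, tracked for automatic differentiation
weights = torch.randn(2, requires_grad=True)
bias = torch.zeros(1, requires_grad=True)

learning_rate = 0.1
num_epochs = 100

for epoch in range(num_epochs):
    y_pred = x @ weights + bias                  # forward pass: linear model
    loss = torch.mean((y_pred - y) ** 2)         # mean squared error cost

    loss.backward()                              # compute gradients of the loss

    with torch.no_grad():                        # gradient descent update step
        weights -= learning_rate * weights.grad
        bias -= learning_rate * bias.grad
        weights.grad.zero_()                     # reset gradients for the next epoch
        bias.grad.zero_()

print(weights, bias)   # should approach the true weights [1.3, -1] and bias -3.5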

*** Thank You ***

Module 5
LEARNING AND GENERALIZATION

Bias and Variance in Machine Learning

Bias is one type of error that occurs due to wrong assumptions about data such as
assuming data is linear when in reality, data follows a complex function. On the
other hand, variance gets introduced with high sensitivity to variations in training
data. This also is one type of error since we want to make our model robust
against noise. There are two types of error in machine learning. Reducible error
and Irreducible error. Bias and Variance come under reducible error.
What is Bias?
Bias is simply defined as the inability of the model to capture the true relationship,
because of which there is some difference or error between the model's predicted
value and the actual value. These differences between actual or expected values and
the predicted values are known as error, bias error, or error due to bias. Bias is a
systematic error that occurs due to wrong assumptions in the machine learning process.
Let Y be the true value of a parameter, and let Ŷ be an estimator of Y based
on a sample of data. Then, the bias of the estimator Ŷ is given by:
Bias(Ŷ) = E[Ŷ] − Y

 Low Bias: Low bias value means fewer assumptions are taken to build the
target function. In this case, the model will closely match the training dataset.
 High Bias: High bias value means more assumptions are taken to build the
target function. In this case, the model will not match the training dataset
closely.
The high-bias model will not be able to capture the dataset trend. It is considered
as the underfitting model which has a high error rate. It is due to a very simplified
algorithm.
For example, a linear regression model may have a high bias if the data has a non-
linear relationship.

Ways to reduce high bias in Machine Learning:

 Use a more complex model: One of the main reasons for high bias is a very
simplified model; it will not be able to capture the complexity of the data. In
such cases, we can make our model more complex by increasing the number of
hidden layers in the case of a deep neural network, or we can use a more
complex model like polynomial regression for non-linear datasets, a CNN for image
processing, or an RNN for sequence learning.
 Increase the number of features: Adding more features to the training
dataset will increase the complexity of the model and improve its ability to
capture the underlying patterns in the data.
 Reduce Regularization of the model: Regularization techniques such as L1 or
L2 regularization can help to prevent overfitting and improve the generalization
ability of the model. If the model has a high bias, reducing the strength of
regularization or removing it altogether can help to improve its performance.
 Increase the size of the training data: Increasing the size of the training
data can help to reduce bias by providing the model with more examples to
learn from the dataset.
What is Variance?
Variance is the measure of spread in data from its mean position. In machine
learning variance is the amount by which the performance of a predictive model
changes when it is trained on different subsets of the training data. More
specifically, variance is the variability of the model that how much it is sensitive to
another subset of the training dataset. i.e. how much it can adjust on the new
subset of the training dataset.
Let Y be the actual values of the target variable, and Ŷ be the predicted
values of the target variable. Then the variance of a model can be measured as
the expected value of the square of the difference between the predicted values and
the expected value of the predicted values:
Variance = E[ (Ŷ − E[Ŷ])² ]

Variance errors are either low or high-variance errors.


 Low variance: Low variance means that the model is less sensitive to changes
in the training data and can produce consistent estimates of the target function
with different subsets of data from the same distribution. Combined with high bias,
this is the case of underfitting, when the model fails to generalize on both training
and test data.
 High variance: High variance means that the model is very sensitive to
changes in the training data and can result in significant changes in the estimate
of the target function when trained on different subsets of data from the same
distribution. This is the case of overfitting when the model performs well on the
training data but poorly on new, unseen test data. It fits the training data too
closely that it fails on the new training dataset.

Ways to Reduce Variance in Machine Learning:

 Cross-validation: By splitting the data into training and testing sets multiple
times, cross-validation can help identify if a model is overfitting or underfitting
and can be used to tune hyperparameters to reduce variance.
 Feature selection: Choosing only the relevant features will decrease the
model's complexity and can reduce the variance error.
 Regularization: We can use L1 or L2 regularization to reduce variance in
machine learning models
 Ensemble methods: It will combine multiple models to improve generalization
performance. Bagging, boosting, and stacking are common ensemble methods
that can help reduce variance and improve generalization performance.
 Simplifying the model: Reducing the complexity of the model, such as
decreasing the number of parameters or layers in a neural network, can also
help reduce variance and improve generalization performance.
 Early stopping: Early stopping is a technique used to prevent overfitting by
stopping the training of the deep learning model when the performance on the
validation set stops improving.
Different Combinations of Bias-Variance
There can be four combinations between bias and variance.
 High Bias, Low Variance: A model with high bias and low variance is said to
be underfitting.
 High Variance, Low Bias: A model with high variance and low bias is said to
be overfitting.
 High-Bias, High-Variance: A model has both high bias and high variance,
which means that the model is not able to capture the underlying patterns in the
data (high bias) and is also too sensitive to changes in the training data (high
variance). As a result, the model will produce inconsistent and inaccurate
predictions on average.
 Low Bias, Low Variance: A model that has low bias and low variance means
that the model is able to capture the underlying patterns in the data (low bias)
and is not too sensitive to changes in the training data (low variance). This is the
ideal scenario for a machine learning model, as it is able to generalize well to
new, unseen data and produce consistent and accurate predictions. In
practice, however, this ideal is rarely achieved perfectly.

Bias Variance Tradeoff


If the algorithm is too simple (a hypothesis with a linear equation) then it may be in a
high-bias, low-variance condition and thus be error-prone. If the algorithm fits too
complex a model (a hypothesis with a high-degree equation) then it may be in a
high-variance, low-bias condition. In the latter case, the model will not perform well
on new entries. There is something between both of these conditions, known as the
Bias-Variance Trade-off. This tradeoff in complexity is why there is a tradeoff between
bias and variance: an algorithm can't be more complex and less complex at the
same time. Graphically, the ideal tradeoff is the model complexity at which the total
error (bias² plus variance) is at its minimum.
What are Bias and Variance?
Bias refers to the errors which occur when we try to fit a statistical model on real-
world data which does not fit perfectly well into some mathematical model. If we
use too simplistic a model to fit the data, then we will most probably face
the situation of High Bias, which refers to the case when the model is unable to
learn the patterns in the data at hand and hence performs poorly.
Variance implies the error value that occurs when we try to make predictions by
using data that is not previously seen by the model. There is a situation known
as high variance that occurs when the model learns noise that is present in the
data.

Regularization
Regularization introduces a penalty for more complex models, effectively reducing
their complexity and encouraging the model to learn more generalized patterns.
This method strikes a balance between underfitting and overfitting, where
underfitting occurs when the model is too simple to capture the underlying trends
in the data, leading to both training and validation accuracy being low.
Role Of Regularization
Regularization is a technique used to prevent overfitting by adding a
penalty term to the loss function, discouraging the model from assigning too much
importance to individual features or coefficients.
Let's explore the role of regularization in more detail:
1. Complexity Control: Regularization helps control model complexity by
preventing overfitting to training data, resulting in better generalization to new
data.
2. Preventing Overfitting: One way to prevent overfitting is to use
regularization, which penalizes large coefficients and constrains their
magnitudes, thereby preventing a model from becoming overly complex and
memorizing the training data instead of learning its underlying patterns.
3. Balancing Bias and Variance: Regularization can help balance the trade-off
between model bias (underfitting) and model variance (overfitting) in machine
learning, which leads to improved performance.
4. Feature Selection: Some regularization methods, such as L1 regularization
(Lasso), promote sparse solutions that drive some feature coefficients to zero.
This automatically selects important features while excluding less important
ones.
5. Handling Multicollinearity: When features are highly correlated
(multicollinearity), regularization can stabilize the model by reducing coefficient
sensitivity to small data changes.
6. Generalization: Regularized models learn underlying patterns of data for better
generalization to new data, instead of memorizing specific examples.
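A hedged scikit-learn sketch contrasting plain linear regression with L2 (Ridge) and L1 (Lasso) regularization; the synthetic dataset and the penalty strengths are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Illustrative data: only the first two of five features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)      # no penalty
ridge = Ridge(alpha=1.0).fit(X, y)      # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)      # L1 penalty can drive some coefficients to zero

print("OLS  :", np.round(ols.coef_, 3))
print("Ridge:", np.round(ridge.coef_, 3))
print("Lasso:", np.round(lasso.coef_, 3))   # note the sparsity induced by L1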

Overfitting is a phenomenon that occurs when a Machine Learning model is


constrained to the training set and not able to perform well on unseen data. That
is when our model learns the noise in the training data as well. This is the case
when our model memorizes the training data instead of learning the patterns in it.
Underfitting on the other hand is the case when our model is not able to learn
even the basic patterns available in the dataset. In the case of underfitting, the
model is unable to perform well even on the training data, hence we cannot expect
it to perform well on the validation data. This is the case when we are supposed to
increase the complexity of the model or add more features to the feature set.
Train Neural Networks With Noise to Reduce Overfitting
Neural networks have revolutionized artificial intelligence but they often fall into
the trap of overfitting which may potentially reduce the model’s accuracy and
reliability.
Training Neural Networks With Noise
In the context of the neural network, noise can be defined as random or unwanted
data that interrupts the model’s ability to detect the target patterns or
relationships. In some instances, noise can adversely impact the efficient learning
capability of a model which tends to provide decreased performance and reduce
the model’s accuracy.
However, adding a little noise can improve neural network performance.
Introducing randomness during training, known as noise injection, acts as an
effective regularizer for the models.
Noise Injection Techniques
Data augmentation is one of the effective techniques that is used to inject the
noise into the input. Perhaps, data augmentation can significantly reduce the
generalization error that often occurs in machine learning techniques.
When we have adequate training data, our machine learning model can generalize
better. However, in the real world the amount of data we have is often limited,
which restricts the machine learning model's ability to generalize. To resolve this
kind of issue, we introduce artificial data, generally known as noise, into the
training set.
Gaussian noise is one of the most used techniques in data augmentation to
inject noise into input data which helps to reduce the overfitting. It has a zero
mean and a controllable standard deviation, allowing you to adjust the intensity of the
noise. It's typically added to the input variables before feeding them to the
network.
 The type and amount of noise added are crucial hyperparameters. Too little
noise has minimal impact, while too much can make learning difficult.
Experimentation is needed to find the optimal settings.
 Noise Injection Timing: Noise is typically only added during training. The
model should be evaluated and used for predictions on clean data without any
noise injection.

Alternative Noise Injection Techniques

Alternatively, the Gaussian noise can be injected into input variables, activations,
weights, gradients, and outputs.
 Injecting noise into activations: Here the noise is injected directly into the
activation layer, permitting the injected noise to be utilized by the network at any
point in time during the forward pass through the network. Injecting noise into an
activation layer is very helpful when we have a very deep neural network, as it
helps the network regularize well and prevents overfitting. The output layer can
inject the noise by itself with the help of a noisy activation function.
 Injecting noise to weights: In the context of recurrent neural
networks, adding noise to the weights is one of the beneficial techniques to
regularize the model. When the noise is injected into the weights it generally
encourages the stability in the function being learned by the neural network.
This is an efficient injecting method because it directly injects the noise into
weights rather than injecting noise into input or output layers in the neural
network.
 Injecting noise to gradients: Instead of focusing on the structure of the input
domain, injecting noise to the gradients primarily centers on enhancing the
robustness of the optimization process. Just like gradient descent, the amount of
noise can begin high while training and can also generally decrease over time.
When we have a deep neural network, injecting noise into a gradient is one of
the most effective methods to be noticed.

Benefits of Adding Random Noise

 Prevents overfitting: When we introduce noise into the training process, it


adds variability to the data, making the individual data points less distinct from
each other, so the network does not try to fit each data point exactly. This prevents
the network from fitting the training samples too closely and hence it mitigates
overfitting.
 Low generalization error: The presence of noise discourages the network
from memorizing the specific training samples and encourages the network to
learn the generalizable features from the data leading to low generalization
error.
 Improved performance: Injecting noise during training can significantly improve the generalization performance of the model. In addition, noise injection has a regularization effect that can improve the model's robustness.
 Serves as data augmentation: Noise injection acts as a data augmentation technique, adding random noise to the input variables during training. Because each input is transformed differently every time it is presented to the model, this helps keep the model from overfitting.
Implementation: Training a Neural Network with Noise
In the example below, a neural network is trained on the MNIST dataset with noise injection for regularization. The model starts with an Input layer of shape 784, representing the flattened dimension of the MNIST images. During training, Gaussian noise with a standard deviation of 0.1 is added to the input data.
 Training is performed on the training dataset with noisy inputs.
 Validation is conducted on the testing dataset to evaluate model performance.

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, GaussianNoise
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

def build_model(input_shape, num_classes):
    inputs = Input(shape=input_shape)
    # GaussianNoise is only active during training; it is a no-op at inference time
    noisy_inputs = GaussianNoise(0.1)(inputs)
    x = Dense(128, activation='relu')(noisy_inputs)
    x = Dense(64, activation='relu')(x)
    outputs = Dense(num_classes, activation='softmax')(x)
    model = Model(inputs=inputs, outputs=outputs)
    return model

input_shape = (784,)
num_classes = 10
model = build_model(input_shape, num_classes)
model.compile(optimizer=Adam(), loss=SparseCategoricalCrossentropy(),
              metrics=['accuracy'])
# Load the dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Preprocess the data: flatten the 28x28 images and scale pixels to [0, 1]
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0
history = model.fit(x_train, y_train, batch_size=32, epochs=10,
                    validation_data=(x_test, y_test))
Output:
Epoch 1/10
1875/1875 [==============================] - 13s 6ms/step -
loss: 0.2555 - accuracy: 0.9247 - val_loss: 0.1313 - val_accuracy: 0.9601
Epoch 2/10
1875/1875 [==============================] - 8s 5ms/step -
loss: 0.1173 - accuracy: 0.9643 - val_loss: 0.0953 - val_accuracy: 0.9702
Epoch 3/10
1875/1875 [==============================] - 10s 5ms/step -
loss: 0.0847 - accuracy: 0.9740 - val_loss: 0.0919 - val_accuracy: 0.9728
Epoch 4/10
1875/1875 [==============================] - 9s 5ms/step -
loss: 0.0688 - accuracy: 0.9780 - val_loss: 0.0803 - val_accuracy: 0.9745
Epoch 5/10
1875/1875 [==============================] - 9s 5ms/step -
loss: 0.0563 - accuracy: 0.9825 - val_loss: 0.0771 - val_accuracy: 0.9768
Epoch 6/10
1875/1875 [==============================] - 9s 5ms/step -
loss: 0.0483 - accuracy: 0.9844 - val_loss: 0.0843 - val_accuracy: 0.9746
Epoch 7/10
1875/1875 [==============================] - 8s 4ms/step -
loss: 0.0423 - accuracy: 0.9859 - val_loss: 0.0796 - val_accuracy: 0.9756
Epoch 8/10
1875/1875 [==============================] - 9s 5ms/step -
loss: 0.0363 - accuracy: 0.9875 - val_loss: 0.0860 - val_accuracy: 0.9766
Epoch 9/10
1875/1875 [==============================] - 9s 5ms/step -
loss: 0.0353 - accuracy: 0.9884 - val_loss: 0.0740 - val_accuracy: 0.9790
Epoch 10/10
1875/1875 [==============================] - 8s 4ms/step -
loss: 0.0302 - accuracy: 0.9900 - val_loss: 0.0715 - val_accuracy: 0.9811
Soft weight sharing

Soft weight sharing is a regularization technique in which the distribution of a network's weights is modeled as a mixture of Gaussians. The resulting penalty term encourages the weights to cluster around a small number of shared values, reducing the effective number of free parameters and hence the complexity of the model.

Soft Weight-Sharing for Neural Network Compression

The success of deep learning in numerous application domains has created the desire to run and train deep networks on mobile devices. This, however, conflicts with their computationally, memory- and energy-intensive nature, leading to a growing interest in compression. Soft weight sharing supports compression because, once the weights have clustered, they can be stored compactly by quantizing them to the shared cluster values.
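A minimal sketch of the soft weight-sharing penalty (the function name is hypothetical, and the mixture parameters are treated as given here, whereas in the full method they are learned together with the weights): the penalty is the negative log-likelihood of all network weights under a mixture of Gaussians, added to the task loss.

import math
import tensorflow as tf

def soft_weight_sharing_penalty(weights, means, log_stds, mixing_logits):
    # weights: 1-D tensor containing all trainable weights of the network
    # means, log_stds, mixing_logits: 1-D tensors, one entry per mixture component
    w = tf.reshape(weights, [-1, 1])            # shape (n_weights, 1)
    stds = tf.exp(log_stds)
    mix = tf.nn.softmax(mixing_logits)          # mixing proportions, sum to 1
    # Log-density of every weight under every Gaussian component (broadcasted)
    log_norm = -tf.math.log(stds) - 0.5 * math.log(2.0 * math.pi)
    log_probs = -0.5 * tf.square((w - means) / stds) + log_norm
    # Log mixture density per weight, then negative log-likelihood over all weights
    log_mixture = tf.reduce_logsumexp(log_probs + tf.math.log(mix), axis=1)
    return -tf.reduce_sum(log_mixture)          # add (scaled) to the task loss

The penalty pulls the weights toward the component means; after training, each weight can be replaced by its nearest component mean, which is what makes the representation compressible.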

Growing and pruning algorithms

Growing and pruning algorithms are used to optimize the structure of neural networks: growing adds neurons or connections during training, while pruning removes those that contribute little, improving the performance and efficiency of machine learning models.

Pruning decision trees


Decision tree pruning is a critical technique in machine learning used to optimize
decision tree models by reducing overfitting and improving generalization to new
data. In this guide, we'll explore the importance of decision tree pruning,
its types, implementation, and its significance in machine learning model
optimization.
What is Decision Tree Pruning?
Decision tree pruning is a technique used to prevent decision trees from
overfitting the training data. Pruning aims to simplify the decision tree by
removing parts of it that do not provide significant predictive power, thus
improving its ability to generalize to new data.
Decision Tree Pruning removes unwanted nodes from an overfitted decision tree to make it smaller in size, which results in faster, more accurate, and more effective predictions.
Types Of Decision Tree Pruning
There are two main types of decision tree pruning: Pre-Pruning and Post-
Pruning.

Pre-Pruning (Early Stopping)

Sometimes the growth of the decision tree can be stopped before it gets too complex; this is called pre-pruning. It helps prevent overfitting to the training data, which would otherwise result in poor performance when the tree is exposed to new data.

Some common pre-pruning techniques include:


 Maximum Depth: Limit the maximum depth of the decision tree.
 Minimum Samples per Leaf: Set a minimum threshold for the number of samples in each leaf node.
 Minimum Samples per Split: Specify the minimum number of samples needed to split a node.
 Maximum Features: Restrict the number of features considered for splitting.
A short code sketch of these settings is given below.
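A minimal scikit-learn sketch of the pre-pruning settings above (the dataset and the particular parameter values are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: constrain tree growth up front through hyperparameters
tree = DecisionTreeClassifier(
    max_depth=4,           # Maximum Depth
    min_samples_leaf=5,    # Minimum Samples per Leaf
    min_samples_split=10,  # Minimum Samples per Split
    max_features=2,        # Maximum Features considered at each split
    random_state=0,
)
tree.fit(X, y)
print("Depth:", tree.get_depth(), "Leaves:", tree.get_n_leaves())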
Post-Pruning (Reducing Nodes)

After the tree is fully grown, post-pruning involves removing branches or nodes to
improve the model's ability to generalize. Some common post-pruning techniques
include:
 Cost-Complexity Pruning (CCP): This method assigns a cost to each subtree based on its accuracy and complexity, then selects the subtree with the lowest cost.
 Reduced Error Pruning: Removes branches that do not significantly affect the overall accuracy.
 Minimum Impurity Decrease: Prunes nodes if the decrease in impurity (Gini impurity or entropy) is below a certain threshold.
A sketch of cost-complexity pruning with scikit-learn is given below.
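A minimal sketch of cost-complexity pruning with scikit-learn (the dataset, the train/test split, and the use of test accuracy to pick the pruning strength are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compute the sequence of effective alphas along the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit one pruned tree per alpha and keep the one that scores best on held-out data
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    score = pruned.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score
print("Best ccp_alpha:", best_alpha, "test accuracy:", best_score)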
Region growing
Region growing is a simple region-based image segmentation method. It is also
classified as a pixel-based image segmentation method since it involves the
selection of initial seed points.

This approach to segmentation examines neighboring pixels of initial seed points and determines whether the pixel neighbors should be added to the region. The process is iterated on, in the same manner as general data clustering algorithms. A general discussion of the region growing algorithm is described below.

Region-based segmentation

The main goal of segmentation is to partition an image into regions. Some segmentation methods, such as thresholding, achieve this goal by looking for the boundaries between regions based on discontinuities in grayscale or color properties. Region-based segmentation is a technique for determining the regions directly. The basic formulation is: partition the image R into subregions R1, R2, ..., Rn such that
 the union of all the Ri makes up the entire image R;
 each Ri is a connected region;
 the regions are disjoint, i.e., Ri and Rj share no pixels for i ≠ j;
 a chosen logical predicate Q(Ri) is TRUE for every region (the pixels within a region share the property being tested);
 Q(Ri ∪ Rj) is FALSE for any two adjacent regions Ri and Rj (adjacent regions differ in that property).

Region Growing Algorithm

A basic region-growing algorithm based on 8-connectivity can be summarized as follows (a Python sketch is given after the steps):
 Find all connected components in the seed array S(x, y) and erode each
connected component to one pixel, labeling all such pixels as 1. All other pixels
in S are labeled 0.
 Form an image fo such that, at a pair of coordinates (x, y), fo(x, y) = 1 if the
input image satisfies the given predicate Q at those coordinates; otherwise, fo(x,
y) = 0.
 Let g be an image formed by appending to each seed point in S all the 1-valued
points in fo that are 8-connected to that seed point.
 Label each connected component in g with a different region label (e.g., 1, 2,
3, ...). This is the segmented image obtained by region growing.
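A simple Python/NumPy sketch of region growing from a single seed point, using 8-connectivity; the predicate Q used here (absolute intensity difference from the seed value within a threshold) and the threshold value are illustrative assumptions:

import numpy as np
from collections import deque

def region_grow(image, seed, threshold=10):
    # image: 2-D grayscale array; seed: (row, col) tuple of the starting pixel
    h, w = image.shape
    segmented = np.zeros((h, w), dtype=np.uint8)
    seed_value = float(image[seed])
    queue = deque([seed])
    segmented[seed] = 1
    # 8-connected neighborhood offsets
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    while queue:
        y, x = queue.popleft()
        for dy, dx in offsets:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not segmented[ny, nx]:
                # Predicate Q: intensity close enough to the seed value
                if abs(float(image[ny, nx]) - seed_value) <= threshold:
                    segmented[ny, nx] = 1
                    queue.append((ny, nx))
    return segmented

Multiple seeds can be handled by calling the function once per seed (or per connected seed component) and assigning a different region label to each result.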
Advantages

 Can correctly separate regions that have the same properties we define.
 Can give good segmentation results for images that have clear edges.
 Simple concept: only a small number of seed points are needed to represent the property we want; the region is then grown from them.
 We can choose the seed points and the membership criteria ourselves.
 Multiple criteria can be used at the same time.
 Theoretically efficient, since each pixel is visited a bounded number of times.
Disadvantages

 Unless the image has had a threshold function applied, a continuous path of points related in color may exist, connecting any two points in the image.
 In practice, the random memory-access pattern slows down the algorithm, so adaptation may be needed.
Committees and Networks
A committee machine is a type of artificial neural network using a divide and
conquer strategy in which the responses of multiple neural networks (experts) are
combined into a single response.[1] The combined response of the committee
machine is supposed to be superior to those of its constituent experts.
Types

Static structures
In this class of committee machines, the responses of several predictors (experts)
are combined by means of a mechanism that does not involve the input signal,
hence the designation static. This category includes the following methods:

 Ensemble averaging
In ensemble averaging, the outputs of the different predictors are linearly combined (for example, averaged) to produce an overall output; a short sketch is given after this list.

 Boosting
In boosting, a weak algorithm is converted into one that achieves arbitrarily high
accuracy.
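A minimal sketch of ensemble averaging (the expert outputs below are made-up class probabilities used only to illustrate the combination step):

import numpy as np

def ensemble_average(expert_probs):
    # expert_probs: list of arrays, each of shape (n_samples, n_classes)
    # Uniformly average the experts' probability outputs
    return np.mean(np.stack(expert_probs, axis=0), axis=0)

# Hypothetical softmax outputs from three trained experts
p1 = np.array([[0.7, 0.3], [0.2, 0.8]])
p2 = np.array([[0.6, 0.4], [0.3, 0.7]])
p3 = np.array([[0.8, 0.2], [0.1, 0.9]])
combined = ensemble_average([p1, p2, p3])
print(combined.argmax(axis=1))  # committee decision for each sample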
Dynamic structures
In this second class of committee machines, the input signal is directly involved in
actuating the mechanism that integrates the outputs of the individual experts into
an overall output, hence the designation dynamic. There are two kinds of dynamic
structures:

 Mixture of experts
In mixture of experts, the individual responses of the experts are non-linearly
combined by means of a single gating network.

 Hierarchical mixture of experts


In hierarchical mixture of experts, the individual responses of the individual experts
are non-linearly combined by means of several gating networks arranged in a
hierarchical fashion.

Mixture of experts
Mixture of experts (MoE) is a machine learning technique where multiple
expert networks (learners) are used to divide a problem space into homogeneous
regions. MoE represents a form of ensemble learning.
Basic theory
MoE always has the following components, though they are implemented and combined differently according to the problem being solved:
 a set of expert networks, each producing its own output for a given input;
 a weighting (gating) function that assigns an input-dependent weight to each expert;
 a rule for combining the experts' outputs, weighted by the gating function, into a single overall output.

Both the experts and the weighting function are trained by minimizing some loss
function, generally via gradient descent. There is much freedom in choosing the
precise form of experts, the weighting function, and the loss function.
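A minimal Keras sketch of this structure (the layer sizes, number of experts, and input dimension are illustrative assumptions): several small expert networks and a softmax gating network whose outputs weight the experts' predictions.

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_moe(input_dim=784, num_experts=4, num_classes=10):
    inputs = layers.Input(shape=(input_dim,))

    # Each expert is a small feed-forward network producing class probabilities
    expert_outputs = [
        layers.Dense(num_classes, activation='softmax')(
            layers.Dense(64, activation='relu')(inputs))
        for _ in range(num_experts)
    ]

    # Gating (weighting) network: one weight per expert, summing to 1 for each input
    gate = layers.Dense(num_experts, activation='softmax')(inputs)

    # Combine the experts' outputs as the weighted sum given by the gate
    stacked = layers.Lambda(lambda t: tf.stack(t, axis=1))(expert_outputs)            # (batch, experts, classes)
    combined = layers.Lambda(lambda t: tf.einsum('be,bec->bc', t[0], t[1]))([gate, stacked])

    return Model(inputs, combined)

model = build_moe()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

Training this model with gradient descent updates the experts and the gating network jointly, which is the behaviour described in the paragraph above.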

Meta-pi network
The meta-pi network, reported by Hampshire and Waibel, uses the weighted sum of the experts' outputs, f(x) = Σ_i w_i(x) f_i(x), as its output, where the weights w_i(x) are produced by a gating (softmax) network and f_i(x) are the expert outputs. The model is trained by performing gradient descent on the mean-squared error loss between the combined output f(x) and the target.

The experts may be arbitrary functions.

In their original publication, they were solving the problem of classifying phonemes in a speech signal from 6 different Japanese speakers, 2 females and 4 males. They trained 6 experts, each being a "time-delayed neural network"[4] (essentially a multilayered convolution network over the mel spectrogram). They found that the resulting mixture of experts dedicated 5 experts to 5 of the speakers, but the 6th (male) speaker did not have a dedicated expert; instead, his voice was classified by a linear combination of the experts for the other 3 male speakers.

***Thank You ***
