ANN Material
Introduction
Neural Computers mimic certain processing capabilities of the human brain.
- Neural Computing is an information processing paradigm, inspired by biological nervous systems, composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems.
- Artificial Neural Networks (ANNs), like people, learn by example.
- An ANN is configured for a specific application, such as pattern recognition or data
classification, through a learning process.
- Learning in biological systems involves adjustments to the synaptic connections
that exist between the neurons.
Essentially, neural networks are non-linear machine learning models that can be used for both supervised and unsupervised learning. Neural networks are also seen as a set of algorithms, modeled loosely on the human brain, that are built to identify patterns.
Artificial Neural Networks are primarily designed to mimic and simulate the functioning of the human brain: a mathematical structure is constructed to replicate biological neurons.
The concept of an ANN follows the same process as that of a natural neural net. The objective of an ANN is to make machines or systems understand and imitate how a human brain makes a decision and then ultimately takes action.
What is a Neural Net?
• A neural net is an artificial representation of the human brain that tries to
simulate its learning process. An artificial neural network (ANN) is often called a
"Neural Network" or simply Neural Net (NN).
• Traditionally, the term neural network referred to a network of biological neurons in the nervous system that process and transmit information.
• Artificial neural network is an interconnected group of artificial neurons that uses
a mathematical model or computational model for information processing based on
a connectionist approach to computation.
• The artificial neural networks are made of interconnecting artificial neurons which
may share some properties of biological neural networks.
• Artificial Neural network is a network of simple processing elements (neurons)
which can exhibit complex global behavior, determined by the connections between
the processing elements and element parameters.
History
McCulloch and Pitts (1943) are generally recognized as the designers of the first neural network. They combined many simple processing units, which together could lead to an overall increase in computational power. They suggested many ideas, such as: a neuron has a threshold level and once that level is reached the neuron fires. This is still the fundamental way in which ANNs operate. The McCulloch and Pitts network had a fixed set of weights.
Hebb (1949) developed the first learning rule: if two neurons are active at the same time, then the strength of the connection between them should be increased.
a. Neurons (Nodes):
b. Layers:
1. Input Layer: Receives raw data and passes it to the next layer.
2. Hidden Layers: Perform computations and feature transformations using
weights, biases, and activation functions.
3. Output Layer: Produces the final result (e.g., classification, regression value).
c. Weights and Biases:
Learnable parameters that scale and shift the inputs to each neuron; they are adjusted during training.
d. Activation Functions:
Functions that introduce non-linearity into the model, allowing it to solve complex problems. Common types include sigmoid, tanh, and ReLU.
Neural Network Architectures
[Figure: a network represented as a graph, with vertices V = {v1, v2, v3, v4, v5} and edges E = {e1, e2, e3, e4, e5}.]
[Figure: a multi-layer feed-forward network, with inputs x1, …, xl, connection weights w11, …, wnm, and outputs y1, …, ym.]
[Figure: a recurrent network.]
Supervised Learning
- A teacher is present: the desired or expected output is presented to the network together with the input.
- The network computes its output, compares it with the desired output, and adjusts its weights accordingly.
Unsupervised Learning
- No teacher is present.
- The expected or desired output is not presented to the network.
- The system learns on its own by discovering and adapting to the structural features in the input patterns.
Reinforced Learning
- A teacher is present but does not present the expected or desired output; it only indicates whether the computed output is correct or incorrect.
- The information provided helps the network in its learning process.
- A reward is given for a correct answer and a penalty for a wrong answer.
Classification:
o Logistic Regression
o K-Nearest Neighbours
o Support Vector Machines
o Kernel SVM
o Naive Bayes
o Decision Tree Classification
o Random Forest Classification
Regression:
The task of the Regression algorithm is to find the mapping function to map the
input variable(x) to the continuous output variable(y).
Example: Suppose we want to do weather forecasting, so for this, we will use the
Regression algorithm. In weather prediction, the model is trained on the past data,
and once the training is completed, it can easily predict the weather for future days.
Data Preprocessing
Data preprocessing is the process of cleaning and preparing raw data to enable feature engineering. After getting large volumes of data from sources like databases, object stores, and data lakes, engineers prepare it so data scientists can create features. This includes basic cleaning, crunching, and joining of different sets of raw data. In an operational environment this preprocessing would run as an ETL job for batch processing, or it could be part of a streaming pipeline for live data. Once the data is ready for the data scientist, the feature engineering part follows.
Normalization: Normalization is the process of scaling numeric features to a
standard range, typically between 0 and 1. This ensures that all features contribute
equally to the model, preventing one dominant feature from overshadowing others.
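As a quick illustration, here is a minimal min-max scaling sketch in Python; the feature matrix below is invented for the example:

import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features.
X = np.array([[50.0, 0.2],
              [80.0, 0.5],
              [110.0, 0.9]])

# Min-max normalization: rescale each column to the [0, 1] range.
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)

print(X_scaled)  # every column now lies between 0 and 1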
Encoding: Categorical data, such as gender or country names, needs to be
converted into numerical format for machine learning algorithms. Encoding
techniques like one-hot encoding or label encoding transform categorical variables
into a format that algorithms can understand.
Handling Missing Data: Dealing with missing data is essential for robust model
performance. Strategies include removing rows with missing values, imputing
missing values with statistical measures, or using advanced techniques like
machine learning-based imputation.
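A short pandas sketch covering both ideas, assuming a toy DataFrame with a categorical "country" column and a numeric "age" column containing a missing value:

import pandas as pd

# Toy data: one categorical feature and one numeric feature with a gap.
df = pd.DataFrame({
    "country": ["India", "USA", "India"],
    "age": [25, None, 31],
})

# One-hot encoding: each category becomes its own 0/1 column.
df = pd.get_dummies(df, columns=["country"])

# Imputation: fill the missing age with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

print(df)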
Feature Engineering
Feature engineering is the creation of features from raw data. Feature engineering includes:
Feature extraction:
Feature extraction algorithms transform the data onto a new feature space. They are used when it is important to derive useful information from the data, so that working in the new feature subspace does not hurt the model.
1. Linear
It assumes that the data falls on a linear subspace, or that classes of data can be distinguished linearly.
2. Non-linear
It assumes that the pattern of the data is more complex and exists on a non-linear sub-manifold.
1. PCA:
The aim of PCA is to find orthogonal directions which represent the data with the least error.
PCA tries to maximize the variance of the projected data to find the most variant orthonormal directions.
The desired directions are the eigenvectors of the covariance matrix of the data.
2. Kernel PCA (KPCA):
KPCA finds a non-linear subspace of the data, which is useful if the data pattern is not linear.
Kernel PCA uses the kernel method, which maps data to a higher-dimensional space.
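A minimal numpy sketch of PCA via the eigendecomposition of the covariance matrix; the data matrix is invented for illustration:

import numpy as np

# Toy data: rows are samples, columns are features.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# Center the data and form the covariance matrix.
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)

# Principal directions are the eigenvectors of the covariance matrix;
# eigenvalues give the variance captured along each direction.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
components = eigvecs[:, order]

# Project the data onto the first (most variant) direction.
X_reduced = Xc @ components[:, :1]
print(X_reduced)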
1. Dual PCA
2. Multidimensional Scaling
3. Isomap
5. Laplacian Eigenmap
The Curse of Dimensionality in Machine Learning arises when working with high-
dimensional data, leading to increased computational complexity, overfitting, and
spurious correlations. Techniques like dimensionality reduction, feature selection,
and careful model design are essential for mitigating its effects and improving
algorithm performance. Navigating this challenge is crucial for unlocking the
potential of high-dimensional datasets and ensuring robust machine-learning
solutions.
Feature Selection: Identify and select the most relevant features from the
original dataset while discarding irrelevant or redundant ones. This reduces the
dimensionality of the data, simplifying the model and improving its efficiency.
Feature Extraction: Transform the original high-dimensional data into a lower-
dimensional space by creating new features that capture the essential
information. Techniques such as Principal Component Analysis (PCA) and t-
distributed Stochastic Neighbor Embedding (t-SNE) are commonly used for
feature extraction.
Example: Polynomial Curve Fitting
For simplicity, we can generate the data for this task from the function sin(2πx), with random Gaussian noise included in the target variable; i.e., for any input x, the target is t = sin(2πx) + ε. Let the training set consist of N samples with inputs x = (x1, x2, …, xN)ᵀ and the corresponding target variables t = (t1, t2, …, tN)ᵀ. Let the polynomial function used for the prediction, whose order is M, be:
y(x, w) = w0 + w1·x + w2·x² + … + wM·x^M = Σ_{j=0}^{M} wj·x^j
This polynomial function is linear with respect to the coefficients w. The goal of the pattern recognition task is to minimize the error in predicting t; in other words, we have to minimize some error function which encodes how much we deviate from the actual value while making the prediction. One common choice of error function is:
E(w) = (1/2) Σ_{n=1}^{N} [y(xn, w) − tn]²
The error function is quadratic in w, so taking its derivative w.r.t. w and equating it to 0 gives a unique solution w* for the problem.
One of the important parameters in deciding how well the solution will perform on unseen data is the order M of the polynomial function. As shown in the figure below, if we keep increasing M, we will get a perfect fit on the training data with training error E(w*) = 0 (called overfitting), but the prediction on unseen data will be flawed. The best-fit polynomial seems to be the one of order M = 3.
Based on these coefficients, one of the techniques which can be used to compensate for the problem of overfitting is regularization, which involves adding a penalty term to the error function that discourages the coefficients from growing large in magnitude. The modified error function is given as:
E~(w) = (1/2) Σ_{n=1}^{N} [y(xn, w) − tn]² + (λ/2)·‖w‖²
Another way to reduce overfitting, or to allow complex models to be used for prediction, is to increase the sample size of the training data. The same order M = 9 polynomial is fit on N = 15 and N = 100 data points, and the results are shown in the left and right figures below. It can be seen that increasing the number of data points reduces the problem of overfitting.
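A small numpy sketch of this setup, fitting the order-M polynomial by regularized least squares; the sample size, noise level, and λ below are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 15, 9, 1e-3

# Training data: t = sin(2*pi*x) + Gaussian noise.
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=N)

# Design matrix with columns x^0, x^1, ..., x^M.
Phi = np.vander(x, M + 1, increasing=True)

# Minimizing E~(w) gives the normal equations (Phi^T Phi + lam*I) w = Phi^T t.
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

# Predict at a new input.
print(np.vander([0.5], M + 1, increasing=True) @ w)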
1 Linearity
A neural network is only non-linear if you squash the output signal from the nodes
with a non-linear activation function. A complete neural network (with non-linear
activation functions) is an arbitrary function approximator.
Bonus: It should be noted that if you are using linear activation functions in multiple
consecutive layers, you could just as well have pruned them down to a single layer
due to them being linear. (The weights would be changed to more extreme values).
Creating a network with multiple layers using linear activation functions would not
be able to model more complicated functions than a network with a single layer.
2 Activation signal
The squashed output signal could very well be interpreted as the strength of this signal (biologically speaking), though it might be incorrect to interpret the output strength as an equivalent of confidence as in fuzzy logic.
Universal approximators
5. Importance of Non-linearity
Without non-linearity: stacked layers collapse into a single linear mapping, so the network can only model linear functions.
With non-linearity: the network can approximate arbitrarily complex functions, which is what makes it a universal approximator.
Bayes’ Theorem
Bayes’ Theorem finds the probability of an event occurring given the probability of
another event that has already occurred. Bayes’ theorem is stated mathematically
as the following equation:
P(A|B) = P(B|A) · P(A) / P(B)
where A and B are events and P(B) ≠ 0.
Basically, we are trying to find probability of event A, given the event B is true.
Event B is also termed as evidence.
P(A) is the prior probability of A, i.e. the probability of the event before evidence is seen. The evidence is an attribute value of an unknown instance (here, it is event B).
P(B) is the marginal probability: the probability of the evidence.
P(A|B) is the posterior probability of A, i.e. the probability of the event after the evidence is seen.
P(B|A) is the likelihood, i.e. the probability that the evidence occurs given that the hypothesis is true.
Now, with regards to our dataset, we can apply Bayes’ theorem in following way:
P(y|X) = P(X|y) · P(y) / P(X)
where y is the class variable and X is a dependent feature vector (of size n):
X = (x1, x2, x3, …, xn)
Just to be clear, an example of a feature vector and corresponding class variable can be (refer to the first row of the dataset):
X = (Rainy, Hot, High, False)
y = No
Assumption of Naive Bayes
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. More specifically:
Feature independence: The features of the data are conditionally independent of
each other, given the class label.
Continuous features are normally distributed: If a feature is continuous, then it is
assumed to be normally distributed within each class.
Discrete features have multinomial distributions: If a feature is discrete, then it
is assumed to have a multinomial distribution within each class.
Features are equally important: All features are assumed to contribute equally to
the prediction of the class label.
No missing data: The data should not contain any missing values.
Naive Bayes Classifiers
Naive Bayes classifiers are a family of algorithms based on Bayes’ Theorem.
Despite the “naive” assumption of feature independence, these classifiers are
widely utilized for their simplicity and efficiency in machine learning. The article
delves into theory, implementation, and applications, shedding light on their
practical utility despite oversimplified assumptions.
Why it is Called Naive Bayes?
The “Naive” part of the name indicates the simplifying assumption made by the
Naïve Bayes classifier. The classifier assumes that the features used to describe
an observation are conditionally independent, given the class label. The “Bayes”
part of the name refers to Reverend Thomas Bayes, an 18th-century statistician
and theologian who formulated Bayes’ theorem.
Consider a fictional dataset that describes the weather conditions for playing a
game of golf. Given the weather conditions, each tuple classifies the conditions as
fit (“Yes”) or unfit (“No”) for playing golf. Here is a tabular representation of our dataset.
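Since the table itself is not reproduced above, here is a hedged sketch of the Naive Bayes computation with invented counts standing in for the golf table:

# Invented class priors and per-feature likelihoods (14 rows assumed).
p_yes, p_no = 9 / 14, 5 / 14

# Assumed conditional probabilities for X = (Outlook=Rainy, Humidity=High).
p_rainy_given_yes, p_rainy_given_no = 2 / 9, 3 / 5
p_high_given_yes, p_high_given_no = 3 / 9, 4 / 5

# Naive Bayes: multiply the class prior by the per-feature likelihoods
# (conditional independence assumption), then compare unnormalized scores.
score_yes = p_yes * p_rainy_given_yes * p_high_given_yes
score_no = p_no * p_rainy_given_no * p_high_given_no

print("Predict:", "Yes" if score_yes > score_no else "No")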
Advantages of Naive Bayes Classifier
Easy to implement and computationally efficient.
Effective in cases with a large number of features.
Performs well even with limited training data.
It performs well in the presence of categorical features.
For numerical features, the data is assumed to come from normal distributions.
Disadvantages of Naive Bayes Classifier
Assumes that features are independent, which may not always hold in real-world
data.
Can be influenced by irrelevant attributes.
May assign zero probability to unseen events, leading to poor generalization.
Applications of Naive Bayes Classifier
Spam Email Filtering: Classifies emails as spam or non-spam based on features.
Text Classification: Used in sentiment analysis, document categorization, and
topic classification.
Medical Diagnosis: Helps in predicting the likelihood of a disease based on
symptoms.
Credit Scoring: Evaluates creditworthiness of individuals for loan approval.
Weather Prediction: Classifies weather conditions based on various factors.
Decision boundary
In a statistical-classification problem with two classes, a decision
boundary or decision surface is a hypersurface that partitions the underlying vector
space into two sets, one for each class. The classifier will classify all the points on
one side of the decision boundary as belonging to one class and all those on the
other side as belonging to the other class.
A decision boundary is the region of a problem space in which the output label of
a classifier is ambiguous.[1]
If the decision surface is a hyperplane, then the classification problem is linear, and
the classes are linearly separable.
Decision boundaries are not always clear cut. That is, the transition from one class
in the feature space to another is not discontinuous, but gradual. This effect is
common in fuzzy logic based classification algorithms, where membership in one
class or another is ambiguous.
In particular, support vector machines find a hyperplane that separates the feature
space into two classes with the maximum margin. If the problem is not originally
linearly separable, the kernel trick can be used to turn it into a linearly separable
one, by increasing the number of dimensions. Thus a general hypersurface in a
small dimension space is turned into a hyperplane in a space with much larger
dimensions.
Neural networks try to learn the decision boundary which minimizes the empirical
error, while support vector machines try to learn the decision boundary which
maximizes the empirical margin between the decision boundary and data points.
Linear
A linear decision boundary is a line that demarcates one feature space class from
another.
Non-linear
A non-linear decision boundary is a curve or surface that separates the classes.
Learning non-linear decision boundaries is possible in non-linear models like decision
trees, support vector machines, and neural networks.
Piecewise Linear
Linear segments are joined together to produce a piecewise linear curve, which is
the piecewise linear decision boundary. Piecewise linear decision boundaries may be
learned by both decision trees and random forests.
Clustering
The boundaries between groups of data points in a feature space are called clustering decision boundaries. K-means and DBSCAN are two examples of clustering algorithms whose decision boundaries may be learned.
Probabilistic
A data point’s likelihood of belonging to one group or another is represented by a
border called a probabilistic decision boundary. Probabilistic models may be trained
to learn probabilistic decision boundaries, including Naive Bayes and Gaussian
Mixture Models.
Risk minimization
Risk Minimization refers to the process of minimizing the expected risk in learning
tasks. It involves the analysis and minimization of both empirical risk and functional
risk, with the goal of finding the optimal solution.
We can interpret learning as the outcome of the minimization of the expected risk.
In the previous section, we have analyzed different kinds of loss function, that are
used to construct the expected and the empirical risk. In this section, we discuss
the minimization of the risk with the main purpose of understanding the role of the
chosen loss in the optimal solution. In general, we would like to minimize the
expected risk E(f) as given by definition 2.1.1–(9). This can be carried out by an
elegant analysis based on variational calculus that is given in Section 2.5. Here, we
will concentrate on the minimization of the empirical risk, which is what is used in the real world. Interestingly, the study of the empirical risk will also indicate the structure of the minimum of the functional risk.
We begin discussing the case of regression. Let us consider the class of loss
functions defined by Eq. 2.1.1–(7). We discuss the cases p=0,1,2, and +∞, which
are more commonly adopted. We will also assume that the marginal probability
density p1 is strictly positive: p1>0. The process of learning is converted into the
problem of minimizing
(1) E = E‖Y − f(X)‖_p^p = (1/p) ∫_{X×Y} |y − f(x)|^p dP(x, y)
By minimizing the empirical risk, we hope to obtain a model with a low value of the
risk. The larger the training set size is, the closer to the true risk the empirical risk
is.
If we were to apply the ERM principle without more care, we would end up learning
by heart, which we know is bad. This issue is more generally related to
the overfitting phenomenon, which can be avoided by restricting the space of
possible models when searching for the one with minimal error. The most severe
and yet common restriction is encountered in the contexts of linear
classification or linear regression. Another approach consists in controlling the
complexity of the model by regularization.
Example
A population of women who were at least 21 years old, of Pima Indian heritage
and living near Phoenix, Arizona, was tested for diabetes mellitus according
to World Health Organization criteria. The data were collected by the US National
Institute of Diabetes and Digestive and Kidney Diseases. We used the 532
complete records.[2][3]
In this example, we construct three density estimates for "glu"
(plasma glucose concentration), one conditional on the presence of diabetes, the
second conditional on the absence of diabetes, and the third not conditional on
diabetes. The conditional density estimates are then used to construct the
probability of diabetes conditional on "glu".
The "glu" data were obtained from the MASS package[4] of the R programming
language. Within R, ?Pima.tr and ?Pima.te give a fuller account of the data.
The mean of "glu" in the diabetes cases is 143.1 and the standard deviation is
31.26. The mean of "glu" in the non-diabetes cases is 110.0 and the standard
deviation is 24.29. From this we see that, in this data set, diabetes cases are
associated with greater levels of "glu". This will be made clearer by plots of the
estimated density functions.
The first figure shows density estimates of p(glu | diabetes=1), p(glu | diabetes=0),
and p(glu). The density estimates are kernel density estimates using a Gaussian
kernel. That is, a Gaussian density function is placed at each data point, and the
sum of the density functions is computed over the range of the data.
From the density of "glu" conditional on diabetes, we can obtain the probability of
diabetes conditional on "glu" via Bayes' rule. For brevity, "diabetes" is abbreviated "db." in the figure.
The second figure shows the estimated posterior probability p(diabetes=1 | glu).
From these data, it appears that an increased level of "glu" is associated with
diabetes.
In hydrology the histogram and estimated density function of rainfall and river
discharge data, analysed with a probability distribution, are used to gain insight
in their behaviour and frequency of occurrence.[9] An example is shown in the
blue figure.
Parametric Methods
Parametric methods are statistical techniques that rely on specific assumptions
about the underlying distribution of the population being studied. These methods
typically assume that the data follows a known Probability distribution, such as the
normal distribution, and estimate the parameters of this distribution using the
available data.
The basic idea behind the parametric method is that there is a set of fixed parameters that determine a probability model, which is used in machine learning as well. Parametric methods are those methods for which we know a priori that the population is normal, or if not, we can easily approximate it using a normal distribution, which is possible by invoking the Central Limit Theorem.
Parameters for using the normal distribution are as follows:
Mean
Standard Deviation
Ultimately, whether a method is classified as parametric depends entirely on the assumptions made about the population.
Statistical Tests:
o t-test: Tests for the difference between the means of two independent
groups.
o ANOVA: Tests for the difference between the means of three or more
groups.
o F-test: Compares the variances of two groups.
o Chi-square test: Tests for relationships between categorical variables.
o Correlation analysis: Measures the strength and direction of the linear
relationship between two continuous variables.
Machine Learning Models:
o Linear regression: Predicts a continuous outcome based on a linear
relationship with one or more independent variables.
o Logistic regression: Predicts a binary outcome (e.g., yes/no) based on a set
of independent variables.
o Naive Bayes: Classifies data points based on Bayes’ theorem and assuming
independence between features.
o Hidden Markov Models: Models sequential data with hidden states and
observable outputs.
More powerful: When the assumptions are met, parametric tests are generally
more powerful than non-parametric tests, meaning they are more likely to
detect a real effect when it exists.
More efficient: Parametric tests require smaller sample sizes than non-
parametric tests to achieve the same level of power.
Provide estimates of population parameters: Parametric methods provide
estimates of the population mean, variance, and other parameters, which can be
used for further analysis.
The maximum likelihood method is the most popular technique for deriving
estimators. It is based on the likelihood function which, for an observed sample x, is
defined as the probability (or density) of x expressed as a function of θ; in symbols
L(θ) = ∏_{i=1}^{n} f(x_i; θ)
This function provides a measure of plausibility of each possible value of θ on the
basis of the observed data. Then, the method at issue consists of
estimating θ through the value of θ which maximizes L(θ) since this corresponds to
the parameter value for which the observed sample is most likely. The estimate
found in this way, that is,
θ̂ = θ̂(x) such that L(θ̂) = sup_{θ∈Θ} L(θ),
is called the maximum likelihood estimate.
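As a concrete instance, a tiny numpy sketch of maximum likelihood for a Gaussian sample, where maximizing L(θ) in closed form yields the sample mean and the (biased) sample variance; the observations are invented:

import numpy as np

# Invented observations assumed drawn i.i.d. from N(mu, sigma^2).
x = np.array([4.8, 5.1, 5.3, 4.9, 5.2])

mu_hat = x.mean()                        # MLE of the mean
sigma2_hat = ((x - mu_hat) ** 2).mean()  # MLE of the variance (1/n form)

print(mu_hat, sigma2_hat)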
The Maximum-Likelihood Approach
The maximum-likelihood approach is, far and away, the preferred approach to
correcting for non-response bias, and it is the one advocated by Little and Rubin.
The maximum-likelihood approach begins by writing down a probability distribution
that defines the likelihood of the observed sample as a function of population and
distribution parameters θ. If x1 and x2 represent responses to two different survey
questions by a single individual, the likelihood associated with a complete response
may be expressed as f(x1, x2; θ), where f is the joint probability density
function of x1 and x2. For individuals who only report x1, the likelihood associated
with x1 is ∫_{−∞}^{∞} f(x1, x2; θ) dx2, which can, under the assumption of joint normality, be
simplified to a more convenient form. In this way, a likelihood function is specified
that includes terms corresponding to each observation, whether completely or only
partially observed. The likelihood objective is then maximized with respect to θ,
which produces estimates of the desired characteristics, enjoying all the well-known
properties of maximum-likelihood estimation.
Bayesian inference
Bayesian inference is a method of statistical inference in which Bayes' theorem is
used to calculate a probability of a hypothesis, given prior evidence, and update it
as more information becomes available. Fundamentally, Bayesian inference uses
a prior distribution to estimate posterior probabilities. Bayesian inference is an
important technique in statistics, and especially in mathematical statistics. Bayesian
updating is particularly important in the dynamic analysis of a sequence of data.
Bayesian inference has found application in a wide range of activities,
including science, engineering, philosophy, medicine, sport, and law. In the
philosophy of decision theory, Bayesian inference is closely related to subjective
probability, often called "Bayesian probability".
Introduction to Bayes' rule
Bayesian inference derives the posterior probability as a consequence of
two antecedents: a prior probability and a "likelihood function" derived from
a statistical model for the observed data. Bayesian inference computes the
posterior probability according to Bayes' theorem:
P(H | E) = P(E | H) · P(H) / P(E),
where H stands for the hypothesis and E for the evidence.
***Thank You***
Module 2
Assumptions of LDA
LDA assumes that the data has a Gaussian distribution and that
the covariance matrices of the different classes are equal. It also assumes that the
data is linearly separable, meaning that a linear decision boundary can accurately
classify the different classes.
Suppose we have two sets of data points belonging to two different classes that
we want to classify. As shown in the given 2D graph, when the data points are
plotted on the 2D plane, there’s no straight line that can separate the two classes
of data points completely. Hence, in this case, LDA (Linear Discriminant Analysis)
is used which reduces the 2D graph into a 1D graph in order to maximize the
separability between the two classes.
Let’s suppose we have two classes and d-dimensional samples x1, x2, …, xn, where:
n1 samples coming from the class (c1) and n2 coming from the class (c2).
If xi is a data point, then its projection onto the line represented by the unit vector v can be written as vᵀxi.
Let’s consider μ1 and μ2 to be the means of the samples of classes c1 and c2 respectively before projection, and μ̃1 to denote the mean of the samples of class c1 after projection; it can be calculated by:
μ̃1 = (1/n1) Σ_{xi ∈ c1} vᵀxi = vᵀμ1
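A compact numpy sketch of this projection idea, computing the Fisher direction as v ∝ S_W⁻¹(μ1 − μ2); this follows the standard two-class formulation, and the sample points are made up:

import numpy as np

# Made-up 2-D samples for the two classes c1 and c2.
X1 = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0]])
X2 = np.array([[9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0]])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter matrix S_W = S1 + S2.
S1 = (X1 - mu1).T @ (X1 - mu1)
S2 = (X2 - mu2).T @ (X2 - mu2)
Sw = S1 + S2

# Fisher direction, normalized to a unit vector v.
v = np.linalg.solve(Sw, mu1 - mu2)
v = v / np.linalg.norm(v)

# Projections v^T x of the samples onto the line.
print(X1 @ v, X2 @ v)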
Extensions to LDA
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of
variance (or covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs
are used such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the
estimate of the variance (actually covariance), moderating the influence of
different variables on LDA.
Linear Separability
Linear separability refers to data points in binary classification problems that can be separated using a linear decision boundary. If the data points can be separated using a line, linear function, or flat hyperplane, they are considered linearly separable.
Linear separability is an important concept in neural networks. If the points in n-dimensional space can be separated by a hyperplane (w1x1 + w2x2 + … + wnxn + b = 0), then they are said to be linearly separable.
For two-dimensional inputs, if there exists a line (whose equation is w1x1 + w2x2 + b = 0) that separates all samples of one class from the other class, then an appropriate perceptron can be derived from the equation of the separating line. Such classification problems are called "linearly separable", i.e., separable by a linear combination of the inputs.
The logical AND gate example shown below illustrates a two-dimensional example of a linearly separable problem.
1. Visual Inspection: If a distinct straight line or plane divides the various groups, it
can be visually examined by plotting the data points in a 2D or 3D space. The
data may be linearly separable if such a boundary can be seen.
2. Perceptron Learning Algorithm: This binary linear classifier divides the input into
two classes by learning a separating hyperplane iteratively. The data are linearly
separable if the method finds a separating hyperplane and converges. If not, it is
not.
3. Support vector machines: SVMs are a well-liked classification technique that can
handle data that can be separated linearly. To optimize the margin between the
two classes, they identify the separating hyperplane. The data can be linearly
separated if the margin is bigger than zero.
4. Kernel methods: The data can be transformed into a higher-dimensional space
using this family of techniques, where it might then be linearly separable. The
original data is also linearly separable if the converted data is linearly separable.
5. Quadratic programming: Finding the separation hyperplane that reduces the
classification error can be done using quadratic programming. If a solution is
found, the data can be separated linearly.
# Making a small dataset of 2-D points with binary labels
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 1], [4, 3]])
Y = np.array([0, 0, 1, 1])
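Continuing the snippet above, a small sketch of check #2 (perceptron convergence) using scikit-learn, assuming it is available; perfect training accuracy on this toy set implies a separating line exists:

from sklearn.linear_model import Perceptron

clf = Perceptron(max_iter=1000, tol=None)
clf.fit(X, Y)

# 1.0 here means every point is classified correctly,
# i.e. the perceptron found a separating line.
print("training accuracy:", clf.score(X, Y))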
Least Square Method
The least squares method is a statistical technique used to find the equation of the best-fitting curve or line for a set of data points by minimizing the sum of the squared differences between the observed values and the values predicted by the model.
Formula for Least Square Method
The least square method formula is used to find the best-fitting line through a set of data points. For a simple linear regression, which is a line of the form y = mx + c, where y is the dependent variable, x is the independent variable, m is the slope of the line, and c is the y-intercept, the formulas to calculate the slope (m) and intercept (c) of the line are derived from the following equations:
1. Slope (m) Formula: m = [n(∑xy) − (∑x)(∑y)] / [n(∑x²) − (∑x)²]
2. Intercept (c) Formula: c = [(∑y) − m(∑x)] / n
Where:
n is the number of data points,
∑xy is the sum of the product of each pair of x and y values,
∑x is the sum of all x values,
∑y is the sum of all y values,
∑x2 is the sum of the squares of x values.
The steps to find the line of best fit using the least squares method are discussed below:
Step 1: Denote the independent variable values as xi and the dependent ones as
yi.
Step 2: Calculate the average values of xi and yi as X and Y.
Step 3: Presume the equation of the line of best fit as y = mx + c, where m is
the slope of the line and c represents the intercept of the line on the Y-axis.
Step 4: The slope m can be calculated from the following formula:
m = [Σ (X – xi)(Y – yi)] / [Σ (X – xi)²]
Step 5: The intercept c is calculated from the following formula:
c = Y – mX
The Least Square method assumes that the data is evenly distributed and doesn’t
contain any outliers for deriving a line of best fit. But, this method doesn’t provide
accurate results for unevenly distributed data or for data containing outliers.
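A direct numpy translation of the slope and intercept formulas above, on a few invented points:

import numpy as np

# Invented data points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])
n = len(x)

# Slope and intercept from the least-squares formulas.
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
c = (np.sum(y) - m * np.sum(x)) / n

print(f"best fit: y = {m:.3f}x + {c:.3f}")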
The Perceptron-Artificial Neural Networks
The Perceptron is one of the simplest artificial neural network architectures,
introduced by Frank Rosenblatt in 1957. It is primarily used for binary
classification.
At that time, traditional methods like Statistical Machine Learning and
Conventional Programming were commonly used for predictions. Despite being
one of the simplest forms of artificial neural networks, the Perceptron model
proved to be highly effective in solving specific classification problems, laying the
groundwork for advancements in AI and machine learning.
What is Perceptron?
A perceptron is a type of neural network that performs binary classification, mapping input features to an output decision, usually classifying data into one of two categories, such as 0 or 1.
Perceptron consists of a single layer of input nodes that are fully connected to a
layer of output nodes. It is particularly good at learning linearly separable
patterns. It utilizes a variation of artificial neurons called Threshold Logic Units
(TLU), which were first introduced by Warren McCulloch and Walter Pitts in the 1940s. This
foundational model has played a crucial role in the development of more advanced
neural networks and machine learning algorithms.
Types of Perceptron
A perceptron consists of a single layer of Threshold Logic Units (TLU), with each
TLU fully connected to all input nodes.
In a fully connected layer, also known as a dense layer, all neurons in one layer
are connected to every neuron in the previous layer.
The output of the fully connected layer is computed as:
f_{W,b}(X) = h(XW + b)
where X is the input, W is the weight matrix for the input neurons, b is the bias, and h is the step function.
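A tiny numpy sketch of this computation with a step function as h; the weights and bias are arbitrary values chosen so that the single TLU happens to implement a logical AND:

import numpy as np

def step(z):
    # Heaviside step: 1 where z >= 0, else 0.
    return (z >= 0).astype(int)

# Arbitrary weights (2 inputs -> 1 TLU) and bias.
W = np.array([[0.5], [0.5]])
b = np.array([-0.7])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# f(X) = h(XW + b)
print(step(X @ W + b).ravel())  # -> [0 0 0 1], the AND truth table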
Fisher's linear discriminant
The terms Fisher's linear discriminant and LDA are often used interchangeably,
although Fisher's original article[2] actually describes a slightly different
discriminant, which does not make some of the assumptions of LDA such
as normally distributed classes or equal class covariances.
This measure is, in some sense, a measure of the signal-to-noise ratio for the class labelling. It can be shown that the maximum separation occurs when
w ∝ (Σ0 + Σ1)⁻¹ (μ1 − μ0)
Be sure to note that the vector w is the normal to the discriminant hyperplane. As an example, in a two-dimensional problem, the line that best divides the two groups is perpendicular to w.
Otsu's method is related to Fisher's linear discriminant, and was created to binarize
the histogram of pixels in a grayscale image by optimally picking the black/white
threshold that minimizes intra-class variance and maximizes inter-class variance
within/between grayscales assigned to black and white pixel classes.
Gradient-Based Strategy
Momentum-based Optimization:
Momentum accumulates an exponentially decaying average of past gradients in a velocity term v and applies it to the parameter update:

v = beta * v - learning_rate * gradient
parameters = parameters + v

Here beta is the momentum coefficient (commonly around 0.9).
Learning rate schedules adjust the learning rate based on predefined rules or functions, enhancing convergence and performance. Some common methods include the following (a short sketch follows the list):
Step Decay: The learning rate decreases by a specific factor at designated
epochs or after a fixed number of iterations.
Exponential Decay: The learning rate is reduced exponentially over time,
allowing for a rapid decrease in the initial phases of training.
Polynomial Decay: The learning rate decreases polynomially over time, providing
a smoother reduction.
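A minimal sketch of these three decay schedules as plain Python functions; the constants are arbitrary illustrations:

import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    # Multiply the rate by `drop` every `every` epochs.
    return lr0 * (drop ** (epoch // every))

def exponential_decay(lr0, epoch, k=0.05):
    # lr0 * exp(-k * epoch): rapid reduction early in training.
    return lr0 * math.exp(-k * epoch)

def polynomial_decay(lr0, epoch, max_epochs=100, power=2.0):
    # Smooth polynomial reduction toward zero at max_epochs.
    return lr0 * (1 - epoch / max_epochs) ** power

for e in (0, 10, 50):
    print(e, step_decay(0.1, e), exponential_decay(0.1, e), polynomial_decay(0.1, e))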
Adaptive learning rates dynamically adjust the learning rate based on the model’s
performance and the gradient of the cost function. This approach can lead to
optimal results by adapting the learning rate depending on the steepness of the
cost function curve:
AdaGrad: This method adjusts the learning rate for each parameter individually
based on historical gradient information, reducing the learning rate for
frequently updated parameters.
RMSprop: A variation of AdaGrad, RMSprop addresses overly aggressive learning
rate decay by maintaining a moving average of squared gradients to adapt the
learning rate effectively.
Adam: Combining concepts from both AdaGrad and RMSprop, Adam incorporates
adaptive learning rates and momentum to accelerate convergence.
Cycling learning rate techniques involve cyclically varying the learning rate within
a predefined range throughout the training process. The learning rate fluctuates in
a triangular shape between minimum and maximum values, maintaining a
constant frequency. One popular strategy is the triangular learning rate policy,
where the learning rate is linearly increased and then decreased within a cycle.
This method aims to explore various learning rates during training, helping the
model escape poor local minima and speeding up convergence.
In this approach, the learning rate decreases as the number of epochs or iterations
increases. This gradual reduction helps stabilize the training process as the model
converges to a minimum.
Identifying these gradient problems is difficult before the training process has even started. We have to continually monitor the logs and record unexpected jumps in the cost function when the network is a deep recurrent one. This tells us whether the jumps are recurrent and whether the norm of the gradient is growing exponentially. The best way to do this is by checking logs in a visualization dashboard.
There are various methods to address the exploding gradients. Below is the list of
some best-practice methods that we can use.
Gradient Clipping
Clipping by Value:
In clipping by value, each gradient component whose value exceeds the specified threshold (or falls below its negative) is set to the threshold value itself.
Clipping by Norm:
In the 'clipping by norm' technique of gradient clipping, the gradients are clipped if their norm (their size) is greater than the specified threshold value. In contrast to 'clipping by value', individual gradient values above or below the threshold are not set to the threshold; instead, the whole gradient vector is rescaled. This makes sure that the norm of the updated gradients remains small and manageable, and the learning process is more stable. There are different types of 'clipping by norm' techniques; let's explore them one by one.
L2 Norm Clipping:
In this form of norm clipping, the gradient is clipped if its L2 norm (Euclidean norm) exceeds the predefined threshold value. The L2 norm is calculated as the square root of the sum of the squared components. Consider the gradient vector g = [∇θ1, ∇θ2, …, ∇θn], where ∇θi is the gradient with respect to the i-th parameter of the model and n is the total number of model parameters. The L2 norm is then:
‖∇θ‖2 = √( Σ_{i=1}^{n} (∇θi)² )
Now, if the L2 norm exceeds the threshold value, the updated gradient after clipping becomes:
∇θ ← (threshold / ‖∇θ‖2) · ∇θ
L1 Norm Clipping:
The L1 norm technique of gradient clipping is similar to the L2 norm technique: the gradient is clipped if its L1 norm exceeds the threshold that we defined in alignment with our specific requirements. The L1 norm of a gradient is the sum of the absolute values of all its components:
‖∇θ‖1 = Σ_{i=1}^{n} |∇θi|
If the L1 norm exceeds the threshold value, the updated gradient after clipping becomes:
∇θ ← (threshold / ‖∇θ‖1) · ∇θ
Gradient clipping by norm provides a more global control over the gradients, and it
is often used to address the exploding gradient problem.
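A short numpy sketch of clipping by norm, following the rescaling formulas above; the threshold and gradient values are invented:

import numpy as np

def clip_by_norm(grad, threshold, order=2):
    # Rescale the whole gradient vector if its norm exceeds the threshold.
    norm = np.linalg.norm(grad, ord=order)
    if norm > threshold:
        grad = (threshold / norm) * grad
    return grad

g = np.array([3.0, 4.0])               # invented gradient, L2 norm = 5
print(clip_by_norm(g, threshold=1.0))  # rescaled to norm 1: [0.6 0.8]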
Gradient clipping is a crucial step in the training of neural networks since it helps in addressing the issue of exploding gradients. The exploding gradients problem arises when the gradients in the backpropagation process become excessively large in value, causing instability in the training of the model. Now we will see some of the most important points that explain the necessity of gradient clipping in neural network model training.
1. Stability of Training: During the training of a neural network, the optimization algorithm adjusts the model parameters with the help of the obtained gradients. If the gradients are too large, the weights of the model update by a large value, causing the model to oscillate and diverge instead of converging to an optimal solution. Gradient clipping limits the size of the gradient, eliminating this source of instability in the model.
2. Improving Generalization: Large gradients might cause the model to overfit to the training data, capturing noise and making the model bad at generalization. By preventing extreme updates, gradient clipping removes this hindrance and helps the model generalize better on new data.
3. Convergence to Optimal Solution: Exploding gradients prevent the model from converging to an optimal solution and instead produce more unstable modelling of the data. By clipping the gradient values we suspend the possibility of instability, and the model gets better at navigating the parameter space, enabling consistent progress toward the optimal solution.
4. Compatibility with Activation Function: Some of the activation functions such as
'Sigmoid' and 'tanh' functions are sensitive to large input. Gradient clipping
ensures the gradient passed through the activation function is within a
reasonable range which also helps in removing undesirable behavior like
saturation.
5. Mitigating Vanishing Gradient Problem: Sometimes the gradients of the loss function with respect to the weights become extremely small, which causes the weights to stop updating or even halts the training process. Norm-based gradient clipping helps keep the gradient values in a range that is effective for training the model.
The Second Order Derivative is defined as the derivative of the first derivative of
the given function. The first-order derivative at a given point gives us the
information about the slope of the tangent at that point or the instantaneous rate
of change of a function at that point.
Second-Order Derivative gives us the idea of the shape of the graph of a given
function. The second derivative of a function f(x) is usually denoted as f”(x). It is
also denoted by D2y or y2 or y” if y = f(x).
Let y = f(x).
Then, dy/dx = f′(x) … (1)
If f′(x) is differentiable, we may differentiate (1) again w.r.t. x. The left-hand side becomes d/dx(dy/dx), which is called the second order derivative of y w.r.t. x.
Second Order Derivatives Overview
Notation: f′′(x), d²y/dx², or d²f/dx².
Basic Rules:
– Constant Rule: f(x) = c ⇒ f′′(x) = 0
– Power Rule: f(x) = xⁿ ⇒ f′′(x) = n(n − 1)xⁿ⁻²
– Exponential Rule: f(x) = eˣ ⇒ f′′(x) = eˣ
– Logarithmic Rule: f(x) = ln(x) ⇒ f′′(x) = −1/x²
Local Minima
A Local Minima point is a point on any function where the function attains its
minimum value within a certain interval. A point x = a of a function f is called a local minimum if the value f(a) is less than or equal to the values of f(x) near a. Mathematically, if f(a) ≤ f(a − h) and f(a) ≤ f(a + h), where h > 0 is small, then a is called a local minimum point.
Definition of Local Maxima and Local Minima
Local maxima and minima are the points where a function attains its highest and lowest output values within some interval; they give an idea of the function's bounds. Local minima and local maxima are together called local extrema.
Local Maxima
A local maximum point is a point on any function where the function attains its maximum value within a certain interval. A point x = a of a function f is called a local maximum if the value f(a) is greater than or equal to the values of f(x) near a.
Terms Related to Local Maxima and Local Minima
Important terminology related to Local Maxima and Minima are discussed below:
Maximum Value
If a function gives its maximum output value for an input value of x, that output is called the maximum value. If it holds only within a specific range, then that point is called a local maximum.
Absolute Maximum
If a function gives its maximum output value for an input value of x along the entire range of the function, that value is called the absolute maximum.
Minimum Value
If a function gives its minimum output value for an input value of x, that output is called the minimum value. If it holds only within a specific range, then that point is called a local minimum.
Absolute Minimum
If a function gives its minimum output value for an input value of x along the entire range of the function, that value is called the absolute minimum.
Point of Inflection
If a point x within the range of the given function shows neither the highest nor the lowest output, it is called a point of inflection.
Properties of Local Maxima and Minima
Understanding the properties of local maxima and minima can help in their
identification:
1. If a function f(x) is continuous in its domain, it must have at least one maximum
or minimum between any two points where the function values are equal.
2. Local maxima and minima occur alternately; between two minima, there must
be a maximum, and vice versa.
3. If f(x) approaches infinity as x approaches the endpoints of the interval and has
only one critical point within the interval, that critical point is an extremum.
Solved Examples on Local Maxima and Local Minima
Example 1: Analyze the Local Maxima and Local Minima of the function f(x) =
2x3 – 3x2 – 12x + 5 by using the first derivative test.
Solution:
Given function is f(x) = 2x3 – 3x2 – 12x + 5
The first derivative of the function is f′(x) = 6x² − 6x − 12; it will be used to find the critical points.
To find the critical points, set f′(x) = 0:
6x² − 6x − 12 = 0
6(x² − x − 2) = 0
6(x + 1)(x − 2) = 0
Hence, the critical points are x = −1 and x = 2.
Analyze the first derivative at points immediately around the critical point x = −1. The points are {−2, 0}.
f′(−2) = 6(4 + 2 − 2) = 6(4) = +24 and f′(0) = 6(0 + 0 − 2) = 6(−2) = −12
The sign of the derivative is positive towards the left of x = −1 and negative towards the right. Hence, x = −1 is a local maximum.
Let us now analyze the first derivative at points immediately around the critical point x = 2. The points are {1, 3}.
f′(1) = 6(1 − 1 − 2) = 6(−2) = −12 and f′(3) = 6(9 − 3 − 2) = 6(4) = +24
The sign of the derivative is negative towards the left of x = 2 and positive towards the right. Hence, x = 2 is a local minimum.
***Thank You***
Module 3
A Multi-Layer Perceptron (MLP) is one of the most widely used types of neural
networks.
Step 3: Backpropagation
The goal of training an MLP is to minimize the loss function by adjusting the
network’s weights and biases. This is achieved through backpropagation:
1. Gradient Calculation: The gradients of the loss function with respect to each
weight and bias are calculated using the chain rule of calculus.
2. Error Propagation: The error is propagated back through the network, layer by
layer.
3. Gradient Descent: The network updates the weights and biases by moving in the
opposite direction of the gradient to reduce the loss: w = w − η · ∂L/∂w
Where:
o w is the weight.
o η is the learning rate.
o ∂L/∂w is the gradient of the loss function with respect to the weight.
Step 4: Optimization
MLPs rely on optimization algorithms to iteratively refine the weights and biases
during training. Popular optimization methods include:
Stochastic Gradient Descent (SGD): Updates the weights based on a single sample
or a small batch of data: w = w − η · ∂L/∂w
Adam Optimizer: An extension of SGD that incorporates momentum and adaptive
learning rates for more efficient training:
m_t = β1·m_{t−1} + (1 − β1)·g_t
v_t = β2·v_{t−1} + (1 − β2)·g_t²
w = w − η · m̂_t / (√(v̂_t) + ε), where m̂_t and v̂_t are the bias-corrected moment estimates.
Here, g_t represents the gradient at time t, and β1, β2 are decay rates.
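A minimal single-parameter Adam step in numpy, following the standard update; the defaults below are the commonly used values and the objective f(w) = w² is invented for the demo:

import numpy as np

def adam_step(w, g, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    # Update biased first and second moment estimates.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    # Bias correction, then the parameter update.
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2 (gradient 2w), starting from w = 1.0.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(w)  # close to the minimum at 0 (within roughly the step size)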
Feed-forward Network Mappings
1. Input Layer: The input layer consists of neurons that receive the input data.
Each neuron in the input layer represents a feature of the input data.
2. Hidden Layers: One or more hidden layers are placed between the input and
output layers. These layers are responsible for learning the complex patterns in
the data. Each neuron in a hidden layer applies a weighted sum of inputs
followed by a non-linear activation function.
3. Output Layer: The output layer provides the final output of the network. The
number of neurons in this layer corresponds to the number of classes in a
classification problem or the number of outputs in a regression problem.
Mathematical Representation:
For a given input x, the network output is y = f(Wx + b)
where:
o x = input vector
o W = weight matrix
o b = bias vector
o f = activation function (like Sigmoid, ReLU, etc.)
Activation Functions
Activation functions introduce non-linearity into the network, enabling it to learn
and model complex data patterns. Common activation functions include:
Sigmoid: σ(x) = 1 / (1 + e⁻ˣ)
Tanh: tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
ReLU (Rectified Linear Unit): ReLU(x) = max(0, x)
Leaky ReLU: LeakyReLU(x) = max(0.01x, x)
Training a Feedforward Neural Network
Training a Feedforward Neural Network involves adjusting the weights of the
neurons to minimize the error between the predicted output and the actual output.
This process is typically performed using backpropagation and gradient descent.
1. Forward Propagation: During forward propagation, the input data passes
through the network, and the output is calculated.
2. Loss Calculation: The loss (or error) is calculated using a loss function such as
Mean Squared Error (MSE) for regression tasks or Cross-Entropy Loss for
classification tasks.
3. Backpropagation: In backpropagation, the error is propagated back through
the network to update the weights. The gradient of the loss function with respect
to each weight is calculated, and the weights are adjusted using gradient
descent.
Evaluation of Feedforward neural network
Evaluating the performance of the trained model involves several metrics:
Accuracy: The proportion of correctly classified instances out of the total
instances.
Precision: The ratio of true positive predictions to the total predicted positives.
Recall: The ratio of true positive predictions to the actual positives.
F1 Score: The harmonic mean of precision and recall, providing a balance
between the two.
Confusion Matrix: A table used to describe the performance of a classification
model, showing the true positives, true negatives, false positives, and false
negatives.
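As a quick illustration, these metrics computed from an invented binary confusion matrix:

# Invented confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + fp + fn + tn)   # 0.85
precision = tp / (tp + fp)                   # 0.80
recall = tp / (tp + fn)                      # ~0.889
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)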
Threshold Units
A sigmoid function is a bounded, differentiable, real function that is defined for all
real input values and has a non-negative derivative at each point and exactly
one inflection point.
Properties
In general, a sigmoid function is monotonic, and has a first derivative which is bell
shaped. Conversely, the integral of any continuous, non-negative, bell-shaped
function (with one local maximum and no local minimum, unless degenerate) will be
sigmoidal. Thus the cumulative distribution functions for many common probability
distributions are sigmoidal. One such example is the error function, which is related
to the cumulative distribution function of a normal distribution.
A sigmoid function is convex for values less than a particular point, and it
is concave for values greater than that point: in many of the examples here, that
point is 0.
Examples
Logistic function
Hyperbolic tangent (shifted and scaled version of the logistic function, above)
Arctangent function
Gudermannian function
Error function
Applications
Many natural processes, such as those of complex system learning curves, exhibit a
progression from small beginnings that accelerates and approaches a climax over
time. When a specific mathematical model is lacking, a sigmoid function is often
used.[6]
The van Genuchten–Gupta model is based on an inverted S-curve and applied to the
response of crop yield to soil salinity.
Examples of the application of the logistic S-curve to the response of crop yield
(wheat) to both the soil salinity and depth to water table in the soil are shown
in modeling crop response in agriculture.
In artificial neural networks, sometimes non-smooth functions are used instead for
efficiency; these are known as hard sigmoids.
Weight-space Symmetries
In the context of deep learning, weight-space symmetry means that non-identifiable models are invariant to random permutations in their weight layers. This symmetry holds since in deep learning there are generally not enough training samples to rule out all parameter settings but one; there usually exists a large number of possible weight combinations for a given dataset that yield similar model performance.
Weight-space symmetry is a property of neural network landscapes that describes
how permutation symmetries give rise to multiple equivalent global minima in the
weight space. This property can have implications for training dynamics, and can also be used to uncover a model's underlying structure.
Weight-space symmetry can also give rise to first-order saddle points on the path
between the global minima.
A challenging problem in machine learning is to process weight-space features,
which involves transforming or extracting information from the weights and
gradients of a neural network.
The weight space is a concatenation of all the weights and biases.
The symmetry group acts on each one of those independently, which is called a
direct-sum of representations.
In the forward pass, the input data is fed into the input layer. These inputs,
combined with their respective weights, are passed to hidden layers.
For example, in a network with two hidden layers (h1 and h2 as shown in Fig. (a)),
the output from h1 serves as the input to h2. Before applying an activation
function, a bias is added to the weighted inputs.
Each hidden layer applies an activation function like ReLU (Rectified Linear Unit),
which returns the input if it’s positive and zero otherwise. This adds non-linearity,
allowing the model to learn complex relationships in the data. Finally, the outputs
from the last hidden layer are passed to the output layer, where an activation
function, such as softmax, converts the weighted outputs into probabilities for
classification.
Error Calculation
To calculate the error, we can use the formula below:
Error_j = y_target − y5
Error = 0.5 − 0.67 = −0.17
Using this error value, we will be backpropagating.
Backpropagation
1. Calculating Gradients
The change in each weight is calculated as:
Δw_ij = η × δ_j × O_j
Where:
δ_j is the error term for each unit,
η is the learning rate.
2. Output Unit Error
For the output unit y5:
δ5 = y5(1 − y5)(y_target − y5)
= 0.67(1 − 0.67)(−0.17) = −0.0376
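A two-line check of this arithmetic in Python, using the values from the example:

y5, y_target = 0.67, 0.5

# Error term for a sigmoid output unit: delta = y(1 - y)(t - y).
delta5 = y5 * (1 - y5) * (y_target - y5)
print(round(delta5, 4))  # -0.0376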
Radial Basis Function (RBF) Neural Networks are a specialized type of Artificial
Neural Network (ANN) used primarily for function approximation tasks. Known for
their distinct three-layer architecture and universal approximation capabilities,
RBF Networks offer faster learning speeds and efficient performance in
classification and regression problems. This article delves into the workings,
architecture, and applications of RBF Neural Networks.
What are Radial Basis Functions?
Radial Basis Function (RBF) networks are a special category of feed-forward neural networks comprising three layers:
1. Input Layer: Receives input data and passes it to the hidden layer.
2. Hidden Layer: The core computational layer where RBF neurons process the
data.
3. Output Layer: Produces the network’s predictions, suitable for classification or
regression tasks.
How Do RBF Networks Work?
RBF Networks are conceptually similar to K-Nearest Neighbor (k-NN) models,
though their implementation is distinct. The fundamental idea is that an item's
predicted target value is influenced by nearby items with similar predictor variable
values. Here’s how RBF Networks operate:
1. Input Vector: The network receives an n-dimensional input vector that needs
classification or regression.
2. RBF Neurons: Each neuron in the hidden layer represents a prototype vector
from the training set. The network computes the Euclidean distance between
the input vector and each neuron's center.
3. Activation Function: The Euclidean distance is transformed using a Radial
Basis Function (typically a Gaussian function) to compute the neuron's activation
value. This value decreases exponentially as the distance increases (see the
sketch after this list).
4. Output Nodes: Each output node calculates a score based on a weighted sum
of the activation values from all RBF neurons. For classification, the category
with the highest score is chosen.
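A minimal sketch of step 3 above, using the Gaussian form φ(x) = exp(−‖x − c‖² / (2σ²)); the center, spread, and input below are assumed values:
import numpy as np

def rbf_activation(x, c, sigma):
    dist = np.linalg.norm(x - c)                 # Euclidean distance to center
    return np.exp(-dist**2 / (2 * sigma**2))     # decays as distance grows

x = np.array([1.0, 2.0])
c = np.array([1.5, 1.0])
print(rbf_activation(x, c, sigma=1.0))           # approaches 1 near the center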
Key Characteristics of RBFs
Radial Basis Functions: These are real-valued functions dependent solely on
the distance from a central point. The Gaussian function is the most commonly
used type.
Dimensionality: The network's dimensions correspond to the number of predictor
variables.
Center and Radius: Each RBF neuron has a center and a radius (spread). The
radius affects how broadly each neuron influences the input space.
Architecture of RBF Networks
The architecture of an RBF Network typically consists of three layers:
Input Layer
Function: After receiving the input features, the input layer sends them straight
to the hidden layer.
Components: It is made up of the same number of neurons as there are features
in the input data; each neuron in the input layer corresponds to one feature of
the input vector.
Hidden Layer
Function: This layer uses radial basis functions (RBFs) to conduct the non-linear
transformation of the input data.
Components: Neurons in the hidden layer apply the RBF to the incoming data.
The Gaussian function is the RBF that is most frequently utilized.
RBF Neurons: Every neuron in the hidden layer has a spread parameter (σ) and
a center, which is also referred to as a prototype vector. The spread parameter
controls how quickly a neuron's output falls off as the distance between its
center and the input vector grows.
Output Layer
Function: The output layer uses weighted sums to integrate the hidden layer
neurons' outputs to create the network's final output.
Components: It is made up of neurons that combine the outputs of the hidden
layer in a linear fashion. To reduce the error between the network's predictions
and the actual target values, the weights of these combinations are changed
during training.
Training Process of radial basis function neural network
An RBF neural network is trained in three stages: choosing the centers,
determining the spread parameters, and training the output weights.
Step 1: Selecting the Centers
Techniques for Center Selection: Centers can be picked at random from the
training data or by applying techniques such as k-means clustering.
K-Means Clustering: In this widely used center-selection technique, the input
data is grouped into k clusters, and the centers of these clusters are employed
as the centers for the RBF neurons.
Step 2: Determining the Spread Parameters
The spread parameter (σ) governs each RBF neuron's area of effect and
establishes the width of the RBF.
Calculation: The spread parameter can be manually adjusted for each neuron
or set as a constant for all neurons. A popular method is to set σ based on the
separation between the centers, frequently with a heuristic such as dividing the
maximum distance between centers by the square root of twice the number of
centers: σ = d_max / √(2k).
Step 3: Training the Output Weights
Linear Regression: The objective of linear regression techniques, which are
commonly used to estimate the output layer weights, is to minimize the error
between the anticipated output and the actual target values.
Pseudo-Inverse Method: One popular technique for determining the weights is
to utilize the pseudo-inverse of the hidden-layer output matrix.
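Putting the three steps together, here is a hedged end-to-end sketch on synthetic data: k-means for the centers, the σ = d_max/√(2k) heuristic for the spread, and the pseudo-inverse for the output weights (data, k, and seed are assumed for illustration):
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Step 1: choose centers with k-means
k = 10
centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_

# Step 2: spread via the heuristic sigma = d_max / sqrt(2k)
d_max = max(np.linalg.norm(c1 - c2) for c1 in centers for c2 in centers)
sigma = d_max / np.sqrt(2 * k)

# Step 3: hidden-layer activations, then output weights by pseudo-inverse
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (200, k)
Phi = np.exp(-dists**2 / (2 * sigma**2))
w = np.linalg.pinv(Phi) @ y            # least-squares output weights

y_pred = Phi @ w
print("train MSE:", np.mean((y - y_pred)**2))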
Assessment Questions
1. Define feed-forward network mapping and explain its significance in MLPs.
2. Explain the role of backpropagation in training MLPs.
3. What is weight-space symmetry, and how does it affect neural network
training?
4. Describe the role of the radial basis function in RBFNs.
5. How can you train an RBF network using K-means clustering?
***Thank You***
Module 4
ERROR FUNCTIONS
Error function
In mathematics, the error function (also called the Gauss error function), often
denoted by erf, is a function erf: ℂ → ℂ defined as:
erf(z) = (2/√π) ∫₀^z e^(−t²) dt
In some old texts, the error function is defined without the factor of 2/√π.
This nonelementary integral is a sigmoid function that occurs often
in probability, statistics, and partial differential equations.
The name "error function" and its abbreviation erf were proposed by J. W. L.
Glaisher in 1871 on account of its connection with "the theory of Probability, and
notably the theory of Errors."[3] The error function complement was also discussed
by Glaisher in a separate publication in the same year.[4] For the "law of facility" of
errors whose density is given by f(x) = (c/π)^(1/2) e^(−cx²) (the normal
distribution), Glaisher calculates the probability of an error lying between p and q
as:
(c/π)^(1/2) ∫ₚ^q e^(−cx²) dx = ½(erf(q√c) − erf(p√c))
Applications
When the results of a series of measurements are described by a normal
distribution with standard deviation σ and expected value 0, then erf (a/σ √2) is the
probability that the error of a single measurement lies between −a and +a, for
positive a. This is useful, for example, in determining the bit error rate of a digital
communication system.
The error and complementary error functions occur, for example, in solutions of
the heat equation when boundary conditions are given by the Heaviside step
function.
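For instance, the measurement-error probability described above is a one-liner with Python's standard library (σ and a below are assumed values):
from math import erf, sqrt

# probability that a zero-mean Gaussian error with std sigma lies in [-a, +a]
sigma = 1.0
a = 2.0
p = erf(a / (sigma * sqrt(2)))
print(p)   # ~0.9545 for a = 2*sigma, the familiar two-sigma rule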
Sum of Squares
The sum of squares means the sum of the squares of the given numbers. In
statistics, it is the sum of the squares of the variation of a dataset. For this, we need
to find the mean of the data and find the variation of each data point from the
mean, square them, and add them. In algebra, the sum of the squares of two
numbers is determined using the (a + b)² identity. We can also find the sum of
squares of the first n natural numbers using a formula. The formula can be derived
using the principle of mathematical induction. We do these basic arithmetic
operations which are required in statistics and algebra. There are different
techniques to find the sum of squares of given numbers.
In this article, we will discuss the different sum of squares formulas. To calculate the
sum of two or more squares in an expression, the sum of squares formula is used.
Also, the sum of squares formula is used to describe how well the data being
modeled is represented by a model. Let us learn these along with a few solved
examples in the upcoming sections for a better understanding.
The sum of squares in statistics is a tool that is used to evaluate the dispersion of a
dataset. To evaluate this, we take the sum of the square of the variation of each
data point. In algebra, we find the sum of squares of two numbers using
the algebraic identity of (a + b)2. Also, in mathematics, we find the sum of squares
of n natural numbers using a specific formula which is derived using the principle of
mathematical induction. Let us now discuss the formulas of finding the sum of
squares in different areas of mathematics.
Sum of Squares Formula
The sum of squares formula in statistics is used to describe how well the data being
modeled is represented by a model. It shows the dispersion of the dataset. To
calculate the sum of two or more squares in an expression, the sum
of squares formula is used. Thus, a few sums of squares formulas are,
In statistics: Sum of squares of n data points = ∑ᵢ₌₁ⁿ (xᵢ − x̄)²
In algebra: Sum of squares = a² + b² = (a + b)² − 2ab
Sum of squares of n natural numbers: 1² + 2² + 3² + ... + n² = [n(n + 1)(2n + 1)] / 6
Where,
∑ = represents sum
xi = each value in the set
x̄ = mean of the values
xi – x̄ = deviation from the mean value
(xi – x̄)2 = square of the deviation
a, b = arbitrary numbers
n = number of terms in the series
Let a and b be the two numbers, with squares a² and b². The sum of the squares
of a and b is a² + b². We can obtain a formula using the known algebraic identity
(a + b)² = a² + b² + 2ab. Subtracting 2ab from both sides, we conclude that
a² + b² = (a + b)² − 2ab. Let a, b, c be the 3 numbers for which we are supposed
to find the sum of squares. The sum of their squares is a² + b² + c². Using the
known algebraic identity (a + b + c)² = a² + b² + c² + 2ab + 2bc + 2ca, we can
evaluate that a² + b² + c² = (a + b + c)² − 2ab − 2bc − 2ca.
In statistics, the sum of squares error (SSE) is the sum of the squared differences
between the observed values and the predicted values. It is also called the
residual sum of squares, as it is the sum of the squares of the residuals, that is,
the deviations of the predicted values from the actual values. The formula for the
sum of squares error is given by,
SSE = ∑ᵢ₌₁ⁿ (yᵢ − f(xᵢ))², where yᵢ is the ith value of the variable to be predicted,
f(xᵢ) is the predicted value, and xᵢ is the ith value of the explanatory variable.
We can also evaluate the sum of squares error (SSE) by subtracting the sum of
squares regression (SSR) from the total sum of squares (SST), that is,
SSE = SST − SSR.
The total sum of squares can be calculated in statistics using the following steps
(a short code sketch follows the list):
Step 1: In the dataset, count the number of data points.
Step 2: Calculate the mean of the data.
Step 3: Subtract the mean from each data point.
Step 4: Determine the square of the difference determined in step 3.
Step 5: Add the squares determined in step 4.
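The five steps, as a minimal sketch on illustrative data:
data = [2.0, 4.0, 6.0, 8.0]

n = len(data)                                # Step 1: count data points
mean = sum(data) / n                         # Step 2: mean of the data
deviations = [x - mean for x in data]        # Step 3: subtract the mean
squares = [d**2 for d in deviations]         # Step 4: square the differences
sst = sum(squares)                           # Step 5: add them up
print(sst)                                   # 20.0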
Example 1: Using the sum of squares formula, find the value of 4² + 6².
Solution: To find: the value of 4² + 6²
Given: a = 4, b = 6
Using the sum of squares formula a² + b² = (a + b)² − 2ab, we have
4² + 6² = (4 + 6)² − 2(4)(6)
= 100 − 2(24)
= 100 − 48
= 52
Answer: The value of 4² + 6² is 52.
Example 2: Calculate the sum of the series 1² + 2² + 3² + ... + 100²
Solution:
To find: the sum of the series
Using the sum of squares formula for n terms, 1² + 2² + 3² + ... + n² = [n(n + 1)(2n + 1)] / 6
Given: n = 100
= [100(100 + 1)(2 × 100 + 1)] / 6
= (100 × 101 × 201) / 6
= 338350
Answer: The sum of the given series is 338350.
Minkowski distance
The Minkowski distance or Minkowski metric is a metric in a normed vector
space which can be considered as a generalization of both the Euclidean
distance and the Manhattan distance. It is named after the German
mathematician Hermann Minkowski.
Definition
For two points P = (p₁, ..., pₙ) and Q = (q₁, ..., qₙ), the Minkowski distance of
order p is defined as:
D(P, Q) = (∑ᵢ₌₁ⁿ |pᵢ − qᵢ|^p)^(1/p)
Setting p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. The
Minkowski distance can also be viewed as a multiple of the power mean of the
component-wise differences between P and Q.
Applications
The Minkowski metric is very useful in the field of machine learning and AI. Many
popular machine learning algorithms use specific distance metrics such as the
aforementioned to compare the similarity of two data points. Depending on the
nature of the data being analyzed, various metrics can be used. The Minkowski
metric is most useful for numerical datasets where you want to determine the
similarity of size between multiple datapoint vectors.
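A hedged sketch of the metric as a plain Python function, checking the two special cases named above:
def minkowski(P, Q, p):
    return sum(abs(a - b)**p for a, b in zip(P, Q)) ** (1 / p)

P, Q = (0.0, 0.0), (3.0, 4.0)
print(minkowski(P, Q, 1))   # 7.0  (Manhattan distance)
print(minkowski(P, Q, 2))   # 5.0  (Euclidean distance)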
Input-dependent variance
More generally, one can refer to the conditional distribution of a subset of a set of
more than two variables; this conditional distribution is contingent on the values of
all the remaining variables, and if more than one variable is included in the subset
then this conditional distribution is the conditional joint distribution of the included
variables.
For discrete random variables, the conditional probability mass function of Y given
X = x can be written according to its definition as:
p(y | x) = P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
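A minimal sketch of this definition on an assumed joint pmf of (X, Y):
# joint pmf P(X = x, Y = y), values assumed for illustration
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}

x = 0
p_x = sum(p for (xi, yi), p in joint.items() if xi == x)          # P(X = 0) = 0.4
cond = {yi: p / p_x for (xi, yi), p in joint.items() if xi == x}  # p(y | x)
print(cond)   # {0: 0.25, 1: 0.75}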
Posterior probability
The posterior probability is a type of conditional probability that results
from updating the prior probability with information summarized by
the likelihood via an application of Bayes' rule. From an epistemological
perspective, the posterior probability contains everything there is to know about an
uncertain proposition (such as a scientific hypothesis, or parameter values), given
prior knowledge and a mathematical model describing the observations available at
a particular time. After the arrival of new information, the current posterior
probability may serve as the prior in another round of Bayesian updating.
Calculation
The posterior probability distribution of one random variable given the value of
another can be calculated with Bayes' theorem by multiplying the prior probability
distribution by the likelihood function, and then dividing by the normalizing
constant, as follows:
p(θ | x) = p(x | θ) p(θ) / p(x)
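A numeric sketch of this calculation (the prevalence and test rates below are assumed for illustration):
prior = 0.01                       # P(disease)
likelihood = 0.95                  # P(positive | disease)
false_pos = 0.10                   # P(positive | no disease)

# normalizing constant: total probability of a positive result
evidence = likelihood * prior + false_pos * (1 - prior)
posterior = likelihood * prior / evidence     # P(disease | positive)
print(round(posterior, 4))                    # ~0.0876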
categorical_crossentropy:
For a single sample, categorical cross-entropy is CCE = −∑ᵢ yᵢ log(ŷᵢ), where y is
the one-hot true distribution and ŷ the predicted probabilities. Binary
cross-entropy, BCE = −[y log(ŷ) + (1 − y) log(1 − ŷ)], has two terms: one for
considering 1 as the correct class, and another for considering 0 as the correct
class.
Categorical Cross-Entropy in Multi-Class Classification
Categorical Cross-Entropy (CCE), also known as softmax loss or log loss, is one of
the most commonly used loss functions in machine learning, particularly for
classification problems. It measures the difference between the predicted
probability distribution and the actual (true) distribution of classes. The function
helps a machine learning model determine how far its predictions are from the
true labels and guides it in learning to make more accurate predictions.
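A minimal sketch of categorical cross-entropy for a single sample (the probabilities below are assumed):
import numpy as np

# CCE = -sum_i y_i * log(p_i), with y one-hot and p the predicted distribution
y_true = np.array([0, 1, 0])                  # true class is index 1
y_pred = np.array([0.2, 0.7, 0.1])            # predicted probabilities

cce = -np.sum(y_true * np.log(y_pred))
print(cce)   # -log(0.7) ≈ 0.3567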
Drawback:
Random Search and Grid Search are easy to implement and can run in parallel, but
these algorithms have a few drawbacks:
If the hyperparameter search space is large, it takes a lot of time and
computational power to optimize the hyperparameter.
There is no guarantee that these algorithms will find the optimal
hyperparameters if the sampling is not done meticulously.
Bayesian Optimization: Builds a probabilistic surrogate model of the objective
function and uses it to choose the most promising hyperparameters to evaluate
next, making the search more sample-efficient than random or grid search.
Hyperband: Speeds up random search through adaptive resource allocation and
early stopping: many configurations are tried with a small budget, and only the
promising ones are allocated more resources.
Population Based Training (PBT) starts similarly to random search, by training
many models in parallel. But rather than the networks training independently, it
uses information from the rest of the population to refine the hyperparameters
and to direct computational resources to models which show promise. It takes its
inspiration from genetic algorithms, where each member of the population,
referred to as a worker, can exploit information from the rest of the population.
For instance, a worker might copy the model parameters from a better-performing
worker. It can also explore new hyperparameters by randomly perturbing the
current values.
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
# input features (x was undefined in the original snippet; assumed here)
torch.manual_seed(42)
x = torch.randn(100, 2)
# true weights and bias for the linear regression model
true_weights = torch.tensor([1.3, -1.0])
true_bias = torch.tensor([-3.5])
# Target variable: a linear combination of the inputs plus the bias
y = x @ true_weights + true_bias
# visualize the first feature against the target
plt.scatter(x[:, 0], y)
plt.show()
Output: a scatter plot of the generated data.
Module 5
LEARNING AND GENERALIZATION
Bias is one type of error that occurs due to wrong assumptions about data, such
as assuming data is linear when in reality it follows a complex function. On the
other hand, variance gets introduced with high sensitivity to variations in training
data. This is also a type of error, since we want to make our model robust
against noise. There are two types of error in machine learning: reducible error
and irreducible error. Bias and variance come under reducible error.
What is Bias?
Bias is simply defined as the inability of the model to fit the data, because of
which some difference or error occurs between the model's predicted value and
the actual value. These differences between the actual or expected values and the
predicted values are known as error, bias error, or error due to bias. Bias is a
systematic error that occurs due to wrong assumptions in the machine learning
process.
Let Y be the true value of a parameter, and let Ŷ be an estimator of Y based
on a sample of data. Then, the bias of the estimator Ŷ is given by:
Bias(Ŷ) = E[Ŷ] − Y
Low Bias: Low bias value means fewer assumptions are taken to build the
target function. In this case, the model will closely match the training dataset.
High Bias: High bias value means more assumptions are taken to build the
target function. In this case, the model will not match the training dataset
closely.
The high-bias model will not be able to capture the dataset trend. It is considered
as the underfitting model which has a high error rate. It is due to a very simplified
algorithm.
For example, a linear regression model may have a high bias if the data has a non-
linear relationship.
Use a more complex model: One of the main reasons for high bias is an overly
simplified model that cannot capture the complexity of the data. In such cases,
we can make our model more complex by increasing the number of hidden layers
in the case of a deep neural network, or we can use a more complex model like
polynomial regression for non-linear datasets, CNNs for image processing, and
RNNs for sequence learning.
Increase the number of features: Adding more features to the training dataset
will increase the complexity of the model and improve its ability to capture the
underlying patterns in the data.
Reduce Regularization of the model: Regularization techniques such as L1 or
L2 regularization can help to prevent overfitting and improve the generalization
ability of the model. If the model has a high bias, reducing the strength of
regularization or removing it altogether can help to improve its performance.
Increase the size of the training data: Increasing the size of the training
data can help to reduce bias by providing the model with more examples to
learn from the dataset.
What is Variance?
Variance is the measure of spread in data from its mean position. In machine
learning variance is the amount by which the performance of a predictive model
changes when it is trained on different subsets of the training data. More
specifically, variance is the variability of the model that how much it is sensitive to
another subset of the training dataset. i.e. how much it can adjust on the new
subset of the training dataset.
Let Y be the actual values of the target variable, and Ŷ be the predicted values
of the target variable. Then the variance of a model can be measured as the
expected value of the square of the difference between the predicted values and
the expected value of the predicted values:
Variance = E[(Ŷ − E[Ŷ])²]
Cross-validation: By splitting the data into training and testing sets multiple
times, cross-validation can help identify if a model is overfitting or underfitting
and can be used to tune hyperparameters to reduce variance.
Feature selection: Choosing only the relevant features decreases the model's
complexity and can reduce the variance error.
Regularization: We can use L1 or L2 regularization to reduce variance in
machine learning models.
Ensemble methods: It will combine multiple models to improve generalization
performance. Bagging, boosting, and stacking are common ensemble methods
that can help reduce variance and improve generalization performance.
Simplifying the model: Reducing the complexity of the model, such as
decreasing the number of parameters or layers in a neural network, can also
help reduce variance and improve generalization performance.
Early stopping: Early stopping is a technique used to prevent overfitting by
stopping the training of the deep learning model when the performance on the
validation set stops improving.
Different Combinations of Bias-Variance
There can be four combinations between bias and variance.
High Bias, Low Variance: A model with high bias and low variance is said to
be underfitting.
High Variance, Low Bias: A model with high variance and low bias is said to
be overfitting.
High-Bias, High-Variance: A model has both high bias and high variance,
which means that the model is not able to capture the underlying patterns in the
data (high bias) and is also too sensitive to changes in the training data (high
variance). As a result, the model will produce inconsistent and inaccurate
predictions on average.
Low Bias, Low Variance: A model that has low bias and low variance means
that the model is able to capture the underlying patterns in the data (low bias)
and is not too sensitive to changes in the training data (low variance). This is the
ideal scenario for a machine learning model, as it is able to generalize well to
new, unseen data and produce consistent and accurate predictions. In practice,
however, achieving both low bias and low variance simultaneously is rarely
possible, which is why we speak of a bias-variance trade-off (illustrated in the
sketch below).
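As a hedged illustration of these combinations (synthetic data; degrees chosen for contrast), a degree-1 polynomial fit to non-linear data underfits (high bias), while a degree-15 fit chases the noise (high variance):
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30))[:, None]
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 30)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    mse = np.mean((model.predict(x) - y) ** 2)
    print(f"degree {degree}: train MSE = {mse:.4f}")
# degree 1 leaves a large training error (underfitting); degree 15 fits the
# training points much more closely but would generalize poorly to new data.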
Regularization
Regularization introduces a penalty for more complex models, effectively reducing
their complexity and encouraging the model to learn more generalized patterns.
This method strikes a balance between underfitting and overfitting, where
underfitting occurs when the model is too simple to capture the underlying trends
in the data, leading to both training and validation accuracy being low.
Role Of Regularization
Regularization is a technique used to prevent overfitting by adding a penalty
term to the loss function, discouraging the model from assigning too much
importance to individual features or coefficients.
Let's explore the role of regularization in more detail:
1. Complexity Control: Regularization helps control model complexity by
preventing overfitting to training data, resulting in better generalization to new
data.
2. Preventing Overfitting: One way to prevent overfitting is to use
regularization, which penalizes large coefficients and constrains their
magnitudes, thereby preventing a model from becoming overly complex and
memorizing the training data instead of learning its underlying patterns.
3. Balancing Bias and Variance: Regularization can help balance the trade-off
between model bias (underfitting) and model variance (overfitting) in machine
learning, which leads to improved performance.
4. Feature Selection: Some regularization methods, such as L1 regularization
(Lasso), promote sparse solutions that drive some feature coefficients to zero.
This automatically selects important features while excluding less important
ones (see the sketch after this list).
5. Handling Multicollinearity: When features are highly correlated
(multicollinearity), regularization can stabilize the model by reducing coefficient
sensitivity to small data changes.
6. Generalization: Regularized models learn underlying patterns of data for better
generalization to new data, instead of memorizing specific examples.
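A short scikit-learn sketch of points 2 and 4 on synthetic data: L2 (Ridge) shrinks the coefficients, while L1 (Lasso) drives irrelevant ones to exactly zero (data and alpha values are assumed for illustration):
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 100)  # only 2 useful features

print(Ridge(alpha=1.0).fit(X, y).coef_)   # all coefficients shrunk, none zero
print(Lasso(alpha=0.5).fit(X, y).coef_)   # irrelevant coefficients set to zero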
Alternatively, Gaussian noise can be injected into input variables, activations,
weights, gradients, and outputs.
Injecting noise into activations: Here the noise is injected directly into the
activation layer, permitting the injected noise to be utilized by the network at
any point during the forward pass. Injecting noise into an activation layer is
very helpful when we have a very deep neural network, since it helps the
network regularize well and prevents overfitting. The output layer can inject
noise by itself with the help of a noisy activation function.
Injecting noise into weights: In the context of recurrent neural networks,
adding noise to the weights is a beneficial regularization technique. When noise
is injected into the weights, it generally encourages stability in the function
being learned by the neural network. This is an efficient injection method
because it adds the noise directly to the weights rather than to the input or
output layers of the network.
Injecting noise into gradients: Instead of focusing on the structure of the input
domain, injecting noise into the gradients primarily centers on enhancing the
robustness of the optimization process. As with learning-rate schedules in
gradient descent, the amount of noise can begin high during training and
generally decreases over time. For deep neural networks, injecting noise into
the gradients is one of the most effective of these methods.
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, GaussianNoise
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
# build_model was not defined in the original snippet; a minimal version
# that injects Gaussian noise at the input is assumed here
def build_model(input_shape, num_classes):
    inputs = Input(shape=input_shape)
    x = GaussianNoise(0.1)(inputs)   # noise-injection layer (active in training)
    x = Dense(128, activation='relu')(x)
    outputs = Dense(num_classes, activation='softmax')(x)
    return Model(inputs, outputs)
input_shape = (784,)
num_classes = 10
model = build_model(input_shape, num_classes)
model.compile(optimizer=Adam(), loss=SparseCategoricalCrossentropy(),
              metrics=['accuracy'])
#Load the dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
#Preprocessing the data
x_train = x_train.reshape(-1, 784).astype('float32')/255.0
x_test = x_test.reshape(-1, 784).astype('float32')/255.0
history= model.fit(x_train, y_train, batch_size=32, epochs=10,
validation_data=(x_test, y_test))
Output:
Epoch 1/10
1875/1875 [==============================] - 13s 6ms/step -
loss: 0.2555 - accuracy: 0.9247 - val_loss: 0.1313 - val_accuracy: 0.9601
Epoch 2/10
1875/1875 [==============================] - 8s 5ms/step -
loss: 0.1173 - accuracy: 0.9643 - val_loss: 0.0953 - val_accuracy: 0.9702
Epoch 3/10
1875/1875 [==============================] - 10s 5ms/step -
loss: 0.0847 - accuracy: 0.9740 - val_loss: 0.0919 - val_accuracy: 0.9728
Epoch 4/10
1875/1875 [==============================] - 9s 5ms/step -
loss: 0.0688 - accuracy: 0.9780 - val_loss: 0.0803 - val_accuracy: 0.9745
Epoch 5/10
1875/1875 [==============================] - 9s 5ms/step -
loss: 0.0563 - accuracy: 0.9825 - val_loss: 0.0771 - val_accuracy: 0.9768
Epoch 6/10
1875/1875 [==============================] - 9s 5ms/step -
loss: 0.0483 - accuracy: 0.9844 - val_loss: 0.0843 - val_accuracy: 0.9746
Epoch 7/10
1875/1875 [==============================] - 8s 4ms/step -
loss: 0.0423 - accuracy: 0.9859 - val_loss: 0.0796 - val_accuracy: 0.9756
Epoch 8/10
1875/1875 [==============================] - 9s 5ms/step -
loss: 0.0363 - accuracy: 0.9875 - val_loss: 0.0860 - val_accuracy: 0.9766
Epoch 9/10
1875/1875 [==============================] - 9s 5ms/step -
loss: 0.0353 - accuracy: 0.9884 - val_loss: 0.0740 - val_accuracy: 0.9790
Epoch 10/10
1875/1875 [==============================] - 8s 4ms/step -
loss: 0.0302 - accuracy: 0.9900 - val_loss: 0.0715 - val_accuracy: 0.9811
Soft weight sharing
Soft weight sharing is a regularization technique, introduced by Nowlan and
Hinton (1992), in which the distribution of the network's weights is modeled as a
mixture of Gaussians whose parameters are learned along with the weights.
Weights are softly pulled toward the centers of the mixture components,
encouraging groups of weights to take on similar values and thereby reducing the
effective complexity of the network.
Sometimes the growth of a decision tree can be stopped before it gets too
complex; this is called pre-pruning. It is important for preventing overfitting of
the training data, which would result in poor performance when the model is
exposed to new data.
After the tree is fully grown, post-pruning involves removing branches or nodes to
improve the model's ability to generalize. Some common post-pruning techniques
include:
Cost-Complexity Pruning (CCP): This method assigns a cost to each subtree
based on its accuracy and complexity, then selects the subtree with the lowest
cost (a scikit-learn sketch follows this list).
Reduced Error Pruning: Removes branches that do not significantly affect the
overall accuracy.
Minimum Impurity Decrease: Prunes nodes if the decrease in impurity (Gini
impurity or entropy) is below a certain threshold.
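A brief scikit-learn sketch of cost-complexity pruning (dataset and α values chosen for illustration): a larger ccp_alpha penalizes complexity more heavily, so the selected subtree is smaller:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for alpha in (0.0, 0.01, 0.05):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    print(f"ccp_alpha={alpha}: {tree.get_n_leaves()} leaves")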
Region growing
Region growing is a simple region-based image segmentation method. It is also
classified as a pixel-based image segmentation method, since it involves the
selection of initial seed points; a minimal sketch follows the advantages and
disadvantages below.
Advantages
Can correctly separate regions that have the properties we define.
Can provide good segmentation results for images that have clear edges.
Simple concept: only a small number of seed points is needed to represent the
property we want; the region is then grown from them.
Can determine the seed points and the growth criteria we want.
Can choose multiple criteria at the same time.
Theoretically efficient, since each pixel is visited a bounded number of times.
Disadvantages
Unless the image has had a threshold function applied, a continuous path of
points related by color may exist, connecting any two points in the image.
In practice, random memory access slows down the algorithm, so adaptation
might be needed.
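A minimal region-growing sketch (image, seed, and threshold are assumed): starting from a seed pixel, 4-connected neighbours whose intensity is within a threshold of the seed are absorbed, and each pixel is visited at most once:
import numpy as np
from collections import deque

def region_grow(img, seed, thresh):
    h, w = img.shape
    region = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    region[seed] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1)):
            if (0 <= nr < h and 0 <= nc < w and not region[nr, nc]
                    and abs(int(img[nr, nc]) - int(img[seed])) <= thresh):
                region[nr, nc] = True        # each pixel enters the queue once
                queue.append((nr, nc))
    return region

img = np.array([[10, 11, 50], [12, 11, 52], [49, 51, 50]], dtype=np.uint8)
print(region_grow(img, seed=(0, 0), thresh=5).astype(int))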
Committees and Networks
A committee machine is a type of artificial neural network using a divide and
conquer strategy in which the responses of multiple neural networks (experts) are
combined into a single response.[1] The combined response of the committee
machine is supposed to be superior to those of its constituent experts.
Committee machine
Types
Static structures
In this class of committee machines, the responses of several predictors (experts)
are combined by means of a mechanism that does not involve the input signal,
hence the designation static. This category includes the following methods:
Ensemble averaging
In ensemble averaging, outputs of different predictors are linearly combined to
produce an overall output.
Boosting
In boosting, a weak learning algorithm is converted into one that achieves
arbitrarily high accuracy.
Dynamic structures
In this second class of committee machines, the input signal is directly involved in
actuating the mechanism that integrates the outputs of the individual experts into
an overall output, hence the designation dynamic. There are two kinds of dynamic
structures:
Mixture of experts
In mixture of experts, the individual responses of the experts are non-linearly
combined by means of a single gating network.
Mixture of experts
Mixture of experts (MoE) is a machine learning technique where multiple
expert networks (learners) are used to divide a problem space into homogeneous
regions. MoE represents a form of ensemble learning.
Basic theory
MoE always has the following components, though they are implemented and
combined differently according to the problem being solved:
experts f₁, ..., fₙ, each taking the same input x and producing outputs
f₁(x), ..., fₙ(x);
a weighting (gating) function w, which takes the input x and produces a vector
of non-negative weights w(x)₁, ..., w(x)ₙ;
the combined output, typically the weighted sum ∑ᵢ w(x)ᵢ fᵢ(x).
Both the experts and the weighting function are trained by minimizing some loss
function, generally via gradient descent. There is much freedom in choosing the
precise form of experts, the weighting function, and the loss function.
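A minimal sketch of these components (sizes and weight values are arbitrary): two linear experts and a softmax gating function, combined as the gate-weighted sum:
import numpy as np

rng = np.random.default_rng(0)
W_experts = rng.normal(size=(2, 3))     # one weight row per expert
W_gate = rng.normal(size=(2, 3))        # gating-network weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe(x):
    expert_outputs = W_experts @ x      # f_i(x) for each expert i
    gates = softmax(W_gate @ x)         # w(x), non-negative, sums to 1
    return gates @ expert_outputs       # sum_i w(x)_i * f_i(x)

print(moe(rng.normal(size=3)))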
Meta-pi network
The meta-pi network, reported by Hampshire and Waibel, uses a gating network g
to combine the experts' outputs as f(x) = ∑ᵢ g(x)ᵢ fᵢ(x). The model is trained by
performing gradient descent on the mean-squared error loss.