1. Historical Trends in Deep Learning
The evolution of deep learning has been shaped by decades of research, breakthroughs in
theory, advancements in hardware, and the availability of large datasets. Its roots can be
traced back to the 1940s with the introduction of the McCulloch-Pitts neuron, a simple
mathematical model of a biological neuron. This was followed by the development of the
Perceptron in the 1950s by Frank Rosenblatt, which demonstrated that machines could
learn from data. However, enthusiasm declined in the 1970s due to limitations in the
perceptron’s capability to solve non-linearly separable problems, such as the XOR
problem. A major resurgence occurred in the 1980s with the introduction of
backpropagation, a learning algorithm that allowed multi-layer neural networks to be
trained effectively. Despite this progress, deep learning faced another slowdown due to
limited computational power and insufficient data. The early 2000s marked a turning
point, with renewed interest driven by advances in GPU computing, large-scale datasets,
and improved algorithms. The breakthrough came in 2012 when a deep convolutional
neural network, AlexNet, won the ImageNet competition by a wide margin, showcasing
the power of deep architectures in visual recognition tasks. This success triggered a wave
of innovations in architectures, such as VGG, ResNet, and Transformers, expanding deep
learning’s impact across domains like computer vision, natural language processing, and
speech recognition. Over the years, the field has moved toward deeper, wider, and more
efficient models, along with the use of self-supervised learning, transfer learning, and
foundation models. Today, deep learning stands at the forefront of artificial intelligence,
with ongoing research aimed at improving interpretability, robustness, and efficiency.
2. McCulloch-Pitts Neuron
The McCulloch-Pitts neuron, proposed in 1943 by Warren McCulloch and Walter Pitts,
represents the earliest mathematical model of a biological neuron and laid the foundation
for artificial neural networks. It is a simple binary computational model designed to
mimic the functioning of a single neuron in the human brain. The model receives
multiple binary inputs, each representing the presence (1) or absence (0) of a signal.
These inputs are summed and compared against a fixed threshold. If the total input
exceeds the threshold, the neuron “fires” and outputs a 1; otherwise, it outputs a 0.
Mathematically, it can be represented as a threshold logic unit without the use of weights
or learning mechanisms. Although rudimentary, the McCulloch-Pitts neuron was capable
of computing basic logical functions such as AND, OR, and NOT, making it a crucial
stepping stone in understanding how complex computation can arise from simple units.
However, due to its limitations—such as the inability to learn from data or solve non-
linearly separable problems like XOR—it was later extended and improved upon in
models like the Perceptron. Despite its simplicity, the McCulloch-Pitts neuron remains
historically significant as the first formal attempt to bridge neuroscience and
computation, inspiring the development of modern neural network models.
Algorithm of McCulloch-Pitts Neuron
Inputs:
- A list of binary inputs: x1, x2, ..., xn (each input is either 0 or 1)
- A threshold value: θ (theta), a positive integer
Steps:
1. Initialize sum = 0
2. For each input xi (where i = 1 to n):
   - Add xi to sum
3. Compare the total sum with the threshold:
   - If sum >= θ: Output = 1
   - Else: Output = 0
Return:
Output (either 0 or 1)
Example – Implementing AND Gate with McCulloch-Pitts Neuron
Let:
Input x1 = 1
Input x2 = 1
Threshold θ = 2
Steps:
1. sum = x1 + x2 = 1 + 1 = 2
2. Since sum (2) >= threshold (2), the neuron fires:
Output = 1
Other combinations:
x1 = 1, x2 = 0 → sum = 1 → Output = 0
x1 = 0, x2 = 1 → sum = 1 → Output = 0
x1 = 0, x2 = 0 → sum = 0 → Output = 0
Thus, the McCulloch-Pitts neuron correctly performs the logic of an AND gate.
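The same algorithm can be expressed as a short Python sketch (the function name mp_neuron is an illustrative choice, not part of the original model):

def mp_neuron(inputs, theta):
    # McCulloch-Pitts unit: fire (1) when the sum of binary inputs
    # reaches the threshold theta; otherwise stay silent (0).
    return 1 if sum(inputs) >= theta else 0

# AND gate: with theta = 2, both inputs must be 1 for the neuron to fire.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", mp_neuron([x1, x2], theta=2))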
3. Thresholding Logic
Thresholding logic is a fundamental concept in both biological and artificial neural
computation, where a decision is made based on whether an input signal surpasses a
predefined limit known as the threshold. In its simplest form, thresholding logic involves
summing input signals and comparing the result to a fixed threshold value. If the sum
equals or exceeds the threshold, the system activates or produces an output of 1;
otherwise, the output is 0. This binary decision-making process mimics the behavior of
neurons in the brain, which fire only when their inputs collectively reach a certain
activation level. Thresholding logic is the core mechanism behind early neural models
like the McCulloch-Pitts neuron and underlies the implementation of basic logic gates
such as AND, OR, and NOT. It plays a critical role in pattern recognition, classification,
and decision-making systems by enabling machines to differentiate between input
patterns based on activation criteria. Although modern neural networks use more
complex activation functions, thresholding remains a foundational idea that helps in
understanding how discrete decisions are made in both artificial and biological contexts.
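A small illustration of this idea, assuming the same summation rule as the McCulloch-Pitts unit above: changing only the threshold turns one and the same unit into different logic gates.

def threshold_unit(inputs, theta):
    # Output 1 when the summed input reaches the threshold.
    return 1 if sum(inputs) >= theta else 0

# For two binary inputs, theta = 2 realizes AND and theta = 1 realizes OR.
print(threshold_unit([1, 0], theta=2))  # 0 (AND of 1 and 0)
print(threshold_unit([1, 0], theta=1))  # 1 (OR of 1 and 0)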
4. Perceptron
The Perceptron is a fundamental model in the history of artificial neural networks,
introduced by Frank Rosenblatt in 1958 as a computational algorithm designed to mimic
the learning ability of the human brain. It is a supervised learning model used primarily
for binary classification tasks. A perceptron consists of a single layer of artificial neurons,
each of which receives multiple input signals, applies corresponding weights, sums them,
and passes the result through an activation function—typically a step function. If the
weighted sum exceeds a predefined threshold, the output is 1; otherwise, it is 0. Unlike
the earlier McCulloch-Pitts neuron, the perceptron is capable of learning by adjusting its
weights during training using an error-correction rule. The learning algorithm updates
weights iteratively based on the difference between the actual and desired output,
allowing the model to minimize classification errors over time. Despite its effectiveness
in solving linearly separable problems (like AND and OR), the perceptron cannot solve
problems involving non-linear decision boundaries, such as the XOR problem.
Nevertheless, the perceptron laid the groundwork for more advanced neural network
models, including multi-layer networks and modern deep learning architectures, making
it a pivotal innovation in the development of artificial intelligence.
5. Perceptron Learning Algorithm
The Perceptron Learning Algorithm is one of the earliest and simplest algorithms for
training a single-layer neural network for binary classification. It was introduced by
Frank Rosenblatt and is used to find the optimal weights of a perceptron based on labeled
training data. The perceptron processes each input vector by computing a weighted sum
and comparing it against a threshold. If the result is above the threshold, the output is 1;
otherwise, it is 0. The learning algorithm updates the weights whenever the output does
not match the target value, gradually reducing the classification error over time.
Inputs:
- A set of training examples: {(x1, y1), (x2, y2), ..., (xn, yn)}
  - Each input xi is a vector of features: [xi1, xi2, ..., xim]
  - Each target yi is either 0 or 1
- Learning rate: η (eta), typically a small positive number (e.g., 0.1)
- Initial weights: [w1, w2, ..., wm] and bias b (often set to 0)
Algorithm Steps:
1. Initialize weights (w1, w2, ..., wm) and bias b to small random values or zeros.
2. For each training sample (xi, yi):
   a. Compute the weighted sum:
      net = (w1 * xi1) + (w2 * xi2) + ... + (wm * xim) + b
   b. Apply the activation function (step function):
      If net >= 0, then predicted output ŷ = 1
      Else, predicted output ŷ = 0
   c. If the prediction is incorrect, update the weights and bias:
      - For each weight wj:
        wj = wj + η * (yi - ŷ) * xij
      - Update the bias:
        b = b + η * (yi - ŷ)
3. Repeat steps 2a to 2c for all samples across multiple passes (epochs) until the
weights stabilize (converge) or a maximum number of epochs is reached.
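These steps can be sketched in Python as follows, using the AND-gate truth table as a stand-in training set (names such as train_perceptron are illustrative assumptions):

def train_perceptron(samples, eta=0.1, epochs=20):
    # samples: list of (feature_vector, target) pairs with targets 0 or 1.
    m = len(samples[0][0])
    w = [0.0] * m   # weights initialized to zeros
    b = 0.0         # bias initialized to zero
    for _ in range(epochs):
        for x, y in samples:
            net = sum(wj * xj for wj, xj in zip(w, x)) + b
            y_hat = 1 if net >= 0 else 0
            err = y - y_hat
            if err != 0:  # update only on misclassification
                w = [wj + eta * err * xj for wj, xj in zip(w, x)]
                b += eta * err
    return w, b

# Learn the linearly separable AND function.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
print(train_perceptron(data))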
6. Representation Power of MLPs
Multilayer Perceptrons (MLPs) are powerful models in the field of deep learning due to
their ability to approximate complex functions. At their core, MLPs consist of multiple
layers of neurons where each layer applies a linear transformation followed by a non-
linear activation function. This combination enables MLPs to model highly non-linear
relationships between inputs and outputs. One of the most important theoretical results
about MLPs is the Universal Approximation Theorem, which states that a feedforward
neural network with at least one hidden layer containing a finite number of neurons can
approximate any continuous function on a compact domain, given suitable activation
functions. This means that MLPs have the potential to learn a wide variety of tasks
including classification, regression, and pattern recognition. The representation power of
an MLP increases with the number of hidden units and layers, allowing it to capture more
intricate data patterns. However, simply increasing the size of an MLP doesn't guarantee
better performance; it also requires proper training, regularization, and sufficient data.
Overall, the ability of MLPs to represent complex functions makes them a foundational
model in neural network-based machine learning.
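As a concrete illustration of this representational power, the XOR function that no single perceptron can compute is captured by a one-hidden-layer MLP with hand-picked weights; the weights below are one standard choice, shown for illustration rather than learned by training:

def step(z):
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    # Hidden layer: one unit computes OR, the other computes NAND.
    h1 = step(x1 + x2 - 0.5)    # OR
    h2 = step(-x1 - x2 + 1.5)   # NAND
    # Output layer: AND of the two hidden units yields XOR.
    return step(h1 + h2 - 1.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_mlp(a, b))  # prints 0, 1, 1, 0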
7. Sigmoid Neurons
Sigmoid neurons are a fundamental component of artificial neural networks, especially in
the context of binary classification problems. They use the sigmoid activation function,
defined as σ(z) = 1 / (1 + e^(-z)), where z is the weighted sum of inputs plus a bias term
(z = w·x + b). This function maps any real-valued number into a range between 0 and 1,
making it especially useful for predicting probabilities. One of the key advantages of the
sigmoid function is that it is smooth and differentiable, which allows for efficient
learning through gradient descent and backpropagation. Biologically, it mimics the way
real neurons activate gradually, firing more strongly with higher input stimuli. By
introducing non-linearity, sigmoid neurons enable networks to learn complex patterns
that linear models cannot. However, they are also known to suffer from the vanishing
gradient problem: for large positive or negative inputs, the gradient becomes very small,
which can hinder learning in deep networks. Despite this drawback, sigmoid neurons are
still foundational in neural network theory and are especially central to logistic regression
models. The output of a sigmoid neuron can be interpreted as the probability that the
input belongs to a particular class. Historically, sigmoid neurons were widely used before
more advanced activation functions like ReLU became standard. Their derivative, σ'(z) =
σ(z)(1 - σ(z)), plays a critical role in adjusting weights during the training process of
neural networks.
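A minimal Python sketch of the sigmoid and its derivative as defined above, showing how the gradient peaks at z = 0 and vanishes for large |z|:

import math

def sigmoid(z):
    # σ(z) = 1 / (1 + e^(-z)) maps any real z into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    # σ'(z) = σ(z) * (1 - σ(z)), largest at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10, -2, 0, 2, 10):
    print(z, round(sigmoid(z), 4), round(sigmoid_derivative(z), 6))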
8. Gradient Descent
Gradient Descent is a fundamental optimization algorithm used in machine learning and
deep learning to minimize a loss or cost function by iteratively adjusting the model’s
parameters. The main idea is to find the direction in which the function decreases most
rapidly; this is done by computing the gradient (partial derivatives) of the cost function
with respect to each parameter. Starting from some initial values, the algorithm updates
each parameter in the opposite direction of the gradient, scaled by a factor called the
learning rate. This process continues until the algorithm converges to a minimum, ideally
the global minimum of the cost function. Gradient Descent is essential in training models
like neural networks, where manually finding optimal weights is infeasible due to high
dimensionality and complex surfaces.
Gradient Descent Algorithm
Algorithm: Gradient Descent
Input:
- A differentiable cost function J(θ)
- Learning rate α
- Initial parameters θ
- Convergence criteria (e.g., small change in cost or max number of iterations)
Steps:
1. Initialize the parameters θ randomly or with some guess.
2. Repeat until convergence:
   - Compute the gradient:
     ∇J(θ) = [∂J(θ)/∂θ₁, ∂J(θ)/∂θ₂, ..., ∂J(θ)/∂θₙ]
   - Update the parameters using:
     θ := θ - α × ∇J(θ)
3. Return the optimized parameters θ.
Description of the Algorithm
Gradient Descent is an iterative optimization process that helps minimize a given cost
function, typically representing the error of a machine learning model. The cost function
measures how well the model performs; a lower value means better accuracy. At each
step, the algorithm calculates the slope (or gradient) of the cost function with respect to
each parameter (like weights in a neural network). The gradient shows the direction of
the steepest increase in cost, so by moving in the opposite direction, we reduce the cost.
The step size is controlled by a value called the learning rate α, which must be chosen
carefully: too large a value may overshoot the minimum, while too small a value may slow down convergence.
The process repeats until the parameters settle around a minimum point where the model
performs optimally. Gradient Descent is widely used because it is simple, effective, and
scalable to large datasets and high-dimensional models.
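The algorithm is easy to sketch in Python for a one-parameter example; the cost J(θ) = (θ - 3)², with gradient 2(θ - 3) and minimum at θ = 3, is an assumed toy function for illustration:

def gradient_descent(grad, theta, alpha=0.1, max_iters=1000, tol=1e-6):
    # Step against the gradient until the update becomes negligible.
    for _ in range(max_iters):
        step = alpha * grad(theta)
        theta -= step
        if abs(step) < tol:  # convergence criterion
            break
    return theta

# Minimize J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
print(gradient_descent(grad=lambda t: 2 * (t - 3), theta=0.0))  # ~3.0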
9. Feedforward Neural Networks
A Feedforward Neural Network (FNN) is a fundamental type of artificial neural
network architecture in which connections between the nodes do not form cycles.
It consists of an input layer, one or more hidden layers, and an output layer. Each
layer is made up of units called neurons, which are inspired by biological neurons.
In a feedforward network, data flows in one direction—from the input layer
through the hidden layers to the output layer. Each neuron in a layer is connected
to every neuron in the subsequent layer, and each connection is associated with a
numerical weight. When data is input into the network, it is multiplied by these
weights and passed through a non-linear activation function such as the sigmoid,
ReLU (Rectified Linear Unit), or tanh function. The output of each neuron
becomes the input to the neurons in the next layer. The final layer produces the
network’s prediction or output. During training, a learning algorithm like
backpropagation is used along with an optimization method such as gradient descent to
adjust the weights by minimizing the error between the predicted output and the
actual target. This learning process continues until the network’s performance
reaches a satisfactory level. FNNs are widely used for tasks such as classification,
regression, and pattern recognition due to their simplicity and ability to
approximate complex functions.
Feedforward Algorithm for Neural Networks
Input:
• Input feature vector X = [x1, x2, ..., xn]
• Weight matrices for each layer: W1, W2, ..., WL (where L is the number of layers)
• Bias vectors for each layer: b1, b2, ..., bL
• Activation function f (e.g., sigmoid, tanh, ReLU)
Output:
• Output vector Y_hat (predicted output)
Algorithm Steps:
1. Initialize input layer:
- Set A0 = X (This is the input to the first layer)
2. For each layer l = 1 to L:
- Compute the linear combination:
Zl = Wl * Al-1 + bl
- Apply activation function:
Al = f(Zl)
3. Output of the final layer:
- Y_hat = AL
Example for 3-layer Network (1 hidden layer + output):
Let:
• Input layer: X
• Hidden layer: W1, b1, activation f1
• Output layer: W2, b2, activation f2
Then:
Z1 = W1 * X + b1
A1 = f1(Z1)
Z2 = W2 * A1 + b2
Y_hat = f2(Z2)
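The same forward pass can be written as a short NumPy sketch; the layer sizes and random weights below are arbitrary stand-ins for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Arbitrary sizes: 3 inputs, 4 hidden units, 1 output.
X = rng.random(3)
W1, b1 = rng.random((4, 3)), rng.random(4)
W2, b2 = rng.random((1, 4)), rng.random(1)

Z1 = W1 @ X + b1      # linear combination, hidden layer
A1 = sigmoid(Z1)      # activation, hidden layer
Z2 = W2 @ A1 + b2     # linear combination, output layer
Y_hat = sigmoid(Z2)   # network prediction
print(Y_hat)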
10. Representation Power of Feedforward Neural Networks
Feedforward Neural Networks (FNNs) possess remarkable representational power,
making them highly effective in approximating complex functions. At their core, FNNs
consist of layers of interconnected neurons where information flows in one direction—
from the input layer, through one or more hidden layers, to the output layer. Each neuron
applies a non-linear activation function (such as sigmoid, tanh, or ReLU) to a weighted
sum of its inputs, allowing the network to model non-linear relationships. The true
strength of FNNs lies in the Universal Approximation Theorem, which states that a
feedforward neural network with just one hidden layer containing a finite number of
neurons can approximate any continuous function on a compact domain, given appropriate
weights and activation functions. This means that FNNs are capable of learning and
representing highly complex mappings between input and output spaces, including those
with intricate patterns or high-dimensional data. The depth and width of a network further
influence its ability to capture subtle structures in data; deeper networks (with more
layers) often yield more compact representations of complex functions than shallow ones.
However, this power comes with a trade-off—training deeper networks can be
computationally intensive and susceptible to issues like vanishing gradients. Nonetheless,
with sufficient data, proper initialization, and training strategies such as backpropagation
and optimization algorithms (e.g., gradient descent), feedforward networks serve as
foundational tools in modern machine learning for tasks ranging from classification and
regression to feature extraction and representation learning.
Summary of the Difference

Feature              | FNNs                                       | MLPs
---------------------+--------------------------------------------+--------------------------------------------
Definition           | Any forward-only neural network            | Fully connected forward-only neural network
Scope                | Broad (includes CNNs, MLPs, etc.)          | Narrow (only dense-layered structures)
Connectivity         | May not be fully connected                 | Always fully connected
Representation Power | Universal function approximation (general) | Universal function approximation (specific)
Use Case Examples    | CNNs, shallow nets, deep dense nets        | Standard classification/regression models