Feedforward
In multilayer feed forward networks, the network consists of a set of sensory units
(source nodes) that constitute the input layer, one or more hidden layers of
computation nodes, and an output layer of computation nodes. The input signal
propagates through the network in a forward direction, on a layer-by-layer basis.
These neural networks are commonly referred to as multilayer perceptrons (MLPs).
1. The model of each neuron in the network includes a nonlinear activation function.
A commonly used form of nonlinearity that satisfies this requirement is a sigmoidal
nonlinearity defined by the logistic function:
yj = 1 / (1 + exp(-vj))
where vj is the induced local field (i.e., the weighted sum of all synaptic inputs plus the
bias) of neuron j, and yj is the output of the neuron. The presence of nonlinearities is
important because otherwise the input-output relation of the network could be reduced to
that of a single-layer perceptron. (A short code sketch of this nonlinearity follows this list.)
2. The network contains one or more layers of hidden neurons that are not part of the input
or output of the network. These hidden neurons enable the network to learn complex tasks
by extracting progressively more meaningful features from the input patterns (vectors).
3. The network exhibits high degrees of connectivity, determined by the synapses of
the network. A change in the connectivity of the network requires a change in the
population of synaptic connections or their weights.
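As a small illustration of the sigmoidal nonlinearity in point 1, the following Python/NumPy sketch computes the output of a single neuron from its induced local field; the input, weight, and bias values are made up for the example.

import numpy as np

def logistic(v):
    # Logistic (sigmoid) nonlinearity: maps the induced local field v into (0, 1).
    return 1.0 / (1.0 + np.exp(-v))

# Illustrative values: the induced local field of neuron j is the weighted
# sum of its synaptic inputs plus the bias.
x = np.array([0.5, -1.2, 0.8])   # inputs to neuron j
w = np.array([0.4, 0.3, -0.6])   # synaptic weights wji
b = 0.1                          # bias of neuron j

vj = np.dot(w, x) + b            # induced local field vj
yj = logistic(vj)                # output yj of neuron j
print(vj, yj)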
Feed-forward networks have the following characteristics:
1. Perceptrons are arranged in layers, with the first layer taking in inputs and the
last layer producing outputs. The middle layers have no connection with the external
world, and hence are called hidden layers.
2. Each perceptron in one layer is connected to every perceptron in the next layer.
Hence information is constantly "fed forward" from one layer to the next, and this
explains why these networks are called feed-forward networks.
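The layer-by-layer forward propagation of the input signal can be sketched in a few lines of Python/NumPy. This is only an illustrative sketch: the layer sizes and the random weights are assumptions, not something specified in the text.

import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)

# A 3-4-2 multilayer perceptron: 3 source nodes, one hidden layer of 4 neurons,
# and 2 output neurons. Every neuron in one layer is connected to every neuron
# in the next layer, so each layer is a full weight matrix plus a bias vector.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input layer  -> hidden layer
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # hidden layer -> output layer

def forward(x):
    # Propagate the input signal through the network, layer by layer.
    h = logistic(W1 @ x + b1)   # hidden-layer function signals
    y = logistic(W2 @ h + b2)   # output-layer function signals
    return y

print(forward(np.array([0.5, -1.0, 2.0])))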
Each hidden or output neuron of a multilayer perceptron performs two computations:
1. The computation of the function signal appearing at the output of a neuron, which
is expressed as a continuous nonlinear function of the input signal and the synaptic
weights associated with that neuron.
2. The computation of an estimate of the gradient vector (i.e., the gradients of the
error surface with respect to the weights connected to the inputs of a neuron), which
is needed for the backward pass through the network (see the sketch below).
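For a single sigmoid output neuron, both computations can be sketched as follows. The numbers are illustrative, and the derivative y*(1 - y) is specific to the logistic activation assumed here.

import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

# Illustrative inputs, weights, bias, and desired response for one output neuron j.
y_in = np.array([0.2, 0.7, 0.1])    # function signals arriving from the previous layer
w_j  = np.array([0.5, -0.3, 0.8])   # synaptic weights feeding neuron j
b_j, d_j = 0.05, 1.0

# 1. Forward pass: the function signal at the output of neuron j.
v_j = np.dot(w_j, y_in) + b_j
y_j = logistic(v_j)

# 2. Backward pass: an estimate of the gradient of the squared error
#    E = 0.5 * e_j**2 with respect to the weights feeding neuron j.
e_j     = d_j - y_j                  # error signal
delta_j = e_j * y_j * (1.0 - y_j)    # local gradient (logistic derivative is y*(1-y))
grad_w  = -delta_j * y_in            # dE/dw_ji for each incoming weight
print(y_j, grad_w)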
What is Backpropagation?
• Over an epoch of N training examples, the cost to be minimized is the average squared error
Eav = (1/(2N)) Σn=1..N Σj ej²(n)
where the error signal ej(n) pertains to output neuron j for training example n.
• The error ej(n) = dj(n) - yj(n) is the difference between dj(n), the jth element of
the desired response vector d(n), and yj(n), the corresponding network output.
• Here the inner summation with respect to j is performed over all the neurons in the
output layer of the network, whereas the outer summation with respect to n is
performed over the entire training set in the epoch at hand.
• For a learning-rate parameter η, the adjustment applied to the synaptic weight wji
connecting neuron i to neuron j is defined by the delta rule:
Δwji = -η * ∂Eav/∂wji
(a numeric sketch of the error measure and this update follows below).
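A minimal sketch of the averaged squared error and the delta-rule adjustment, assuming a squared-error cost; the desired responses, outputs, local gradient, and learning rate below are illustrative values only.

import numpy as np

# Illustrative desired responses d[n, j] and network outputs y[n, j]
# for N = 4 training examples and 2 output neurons.
d = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([[0.8, 0.3], [0.2, 0.6], [0.7, 0.4], [0.1, 0.9]])

e = d - y                                    # error signals ej(n)
E_av = 0.5 * np.mean(np.sum(e**2, axis=1))   # inner sum over output neurons j,
                                             # outer average over the epoch (n)

# Delta rule for one synaptic weight wji, written in the equivalent per-neuron
# form delta_wji = eta * delta_j * y_i, with illustrative values.
eta     = 0.1    # learning-rate parameter
delta_j = 0.05   # local gradient of neuron j (from the backward pass)
y_i     = 0.7    # output of the presynaptic neuron i
delta_wji = eta * delta_j * y_i
print(E_av, delta_wji)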
• To get a brief overview of what neural networks are: a neural network is simply a
collection of neurons (also known as activations) connected through various layers.
• It attempts to learn the mapping from input data to output data when provided with a training set.
• Training the neural network then facilitates the predictions it makes on test data
drawn from the same distribution.
• This mapping is attained by a set of trainable parameters called weights, distributed over
different layers.
• The weights are learned by the backpropagation algorithm whose aim is to minimize a loss
function.
• A loss function measures how distant the predictions made by the network are from the
actual values.
• Every layer in a neural network is followed by an activation layer that performs some
additional operations on the neurons.
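Putting these pieces together, a tiny end-to-end training loop might look like the sketch below. The XOR data set, layer sizes, learning rate, and number of epochs are assumptions made purely for illustration.

import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

# Toy training set (XOR): inputs X and desired responses D.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 2)), np.zeros((1, 4))   # input  -> hidden (4 neurons)
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # hidden -> output (1 neuron)
eta = 0.5                                            # learning-rate parameter

for epoch in range(5000):
    # Forward pass: compute the function signals layer by layer.
    H = logistic(X @ W1.T + b1)      # hidden activations
    Y = logistic(H @ W2.T + b2)      # network outputs

    # Backward pass: local gradients for the output and hidden layers.
    E = D - Y                        # error signals
    delta_out = E * Y * (1 - Y)
    delta_hid = (delta_out @ W2) * H * (1 - H)

    # Delta-rule updates, averaged over the training set.
    W2 += eta * delta_out.T @ H / len(X)
    b2 += eta * delta_out.mean(axis=0, keepdims=True)
    W1 += eta * delta_hid.T @ X / len(X)
    b1 += eta * delta_hid.mean(axis=0, keepdims=True)

loss = 0.5 * np.mean(np.sum((D - Y) ** 2, axis=1))
print(loss, Y.round(2).ravel())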
The Universal Approximation Theorem
• Mathematically speaking, any neural network architecture aims at finding a
mathematical function y = f(x) that can map attributes (x) to outputs (y).
• The accuracy of this function i.e. mapping differs depending on the distribution of
the dataset and the architecture of the network employed.
• The function f(x) can be arbitrarily complex.
• The Universal Approximation Theorem tells us that neural networks have a kind
of universality, i.e. no matter what f(x) is, there is a network that can
approximate it to the desired accuracy. This result holds for any number of
inputs and outputs.
• If we observe the neural network above, with the input attributes provided
as height and weight, our job is to predict the gender of the person.
• If we exclude all the activation layers from the above network, we realize that h₁
is a linear function of both weight and height with parameters w₁, w₂, and the
bias term b₁.
• Therefore mathematically,
h₁ = w₁*weight + w₂*height + b₁
Similarly,
h₂ = w₃*weight + w₄*height + b₂
• Going along these lines, we realize that o₁ is also a linear function of h₁ and h₂,
and therefore depends linearly on the input attributes weight and height as well.
• An activation layer is applied right after a linear layer in the Neural Network to
provide non-linearities.
• Non-linearities help Neural Networks perform more complex tasks.
• An activation layer operates on the activations (h₁, h₂ in this case) and modifies them
according to the activation function provided for that particular activation layer.
• Activation functions are generally non-linear except for the identity function.
• Some commonly used activation functions are ReLU, sigmoid, softmax, etc.
• With the introduction of non-linearities along with the linear terms, it becomes possible
for a neural network to model any given function approximately, given appropriate
parameters (w₁, w₂, b₁, etc. in this case).
• With suitable training, the parameters converge to appropriate values (a small numeric
sketch of this example follows this list).
• You can get better acquainted mathematically with the Universal Approximation
theorem from here.
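To make the height/weight example concrete, the sketch below (referenced in the list above) computes h₁, h₂, and o₁ with and without an activation layer; all parameter values are invented for illustration.

import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Illustrative input attributes and parameters for the height/weight example.
weight, height = 70.0, 1.75
w1, w2, b1 = 0.02, -0.5, 0.1
w3, w4, b2 = -0.01, 0.8, -0.2
w5, w6, b3 = 0.6, 0.4, 0.0

# Without activation layers, every node is a linear function of the inputs...
h1 = w1 * weight + w2 * height + b1
h2 = w3 * weight + w4 * height + b2
o1 = w5 * h1 + w6 * h2 + b3          # ...so o1 is still linear in weight and height.

# With a non-linear activation (ReLU here) applied after each linear layer,
# the output is no longer a linear function of the input attributes.
h1_a = relu(w1 * weight + w2 * height + b1)
h2_a = relu(w3 * weight + w4 * height + b2)
o1_a = sigmoid(w5 * h1_a + w6 * h2_a + b3)
print(o1, o1_a)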
PRACTICAL AND DESIGN ISSUES OF BACK PROPAGATION
LEARNING
• The universal approximation theorem is important from a theoretical viewpoint,
because it provides the necessary mathematical tool for the viability of feed
forward networks with a single hidden layer as a class of approximate solutions.
• Without such a theorem, we could conceivably be searching for a solution that
cannot exist.
• However, the theorem is not constructive, that is, it does not actually specify how
to determine a multilayer perceptron with the stated approximation properties.
• The universal approximation theorem assumes that the continuous function to be
approximated is given and that a hidden layer of unlimited size is available for the
approximation.
• Both of these assumptions are violated in most practical applications of multilayer
perceptrons.
• The problem with multilayer perceptrons using a single hidden layer is that the
neurons therein tend to interact with each other globally.
• In complex situations this interaction makes it difficult to improve the
approximation at one point without worsening it at some other point.
• With two hidden layers, on the other hand, the approximation process becomes more
manageable. In particular, we may proceed as follows:
1. Local features are extracted in the first hidden layer. Specifically, some neurons
in the first hidden layer are used to partition the input space into regions, and other
neurons in that layer learn the local features characterizing those regions.
2. Global features are extracted in the second hidden layer. Specifically, a neuron in
the second hidden layer combines the outputs of neurons in the first hidden layer
operating on a particular region of the input space, and thereby learns the global
features for that region and outputs zero elsewhere.
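As a purely structural sketch of this two-hidden-layer strategy (the layer sizes and random weights are assumptions, and the local/global division of labour between the two hidden layers is something training would have to produce, not something enforced here):

import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(2)

# A multilayer perceptron with two hidden layers: the first hidden layer is free
# to learn local features of regions of the input space, and the second hidden
# layer to combine them into global features for those regions.
sizes = [3, 8, 4, 1]                 # input, hidden 1, hidden 2, output
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    for W, b in zip(Ws, bs):
        x = logistic(W @ x + b)      # propagate layer by layer
    return x

print(forward(np.array([0.2, -0.4, 1.0])))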