AI17-Neural Networks

Artificial Neural Networks (ANNs) are computational models inspired by biological neural systems, designed to understand intelligent behavior through interconnected units. Learning in ANNs can be achieved through algorithms like perceptron and backpropagation, allowing for the representation of complex functions and patterns. The structure of neural networks can vary, including single-layer and multi-layer configurations, with considerations for overfitting and optimal architecture during the learning process.


Artificial Neural Networks

Neural Networks
• Analogy to biological neural systems
• Attempt to understand natural biological systems through computational modeling
• Intelligent behavior as an “emergent” property of large number of simple units rather
than from explicitly encoded symbolic rules and algorithms
• A neural network is just a collection of units connected together; the properties of the
network are determined by its topology and the properties of the “neurons”
• Researchers in AI and statistics became interested in the more abstract properties of
neural networks, such as their ability to perform distributed computation, to tolerate
noisy inputs, and to learn
• Hence they aimed to create artificial neural networks. (Other names include
connectionism, parallel distributed processing, and neural computation.)
Real Neuron

• A neuron is a cell in the brain whose principal function is the collection, processing, and dissemination of electrical signals
• Brain's information-processing capacity is from networks of such neurons
Neural Network Learning
• Learning approach based on modeling adaptation in biological neural systems
• Perceptron: Initial algorithm for learning simple neural networks (single layer)
developed in the 1950’s.
• Backpropagation: More complex algorithm for learning multi-layer neural
networks developed in the 1980’s.
Units in neural networks
• Neural networks are composed of nodes or units connected by directed links
• A link from unit i to unit j serves to propagate the activation ai from i to j
• Each link has a numeric weight, Wi,j associated with it, which determines the strength and sign of
the connection
• Each unit j first computes a weighted sum of its inputs (including a bias weight w0,j on a fixed bias input a0):

inj = Σi wi,j ai

• an activation function g is then applied to this sum to derive the output:

aj = g(inj) = g(Σi wi,j ai)


• The activation function g is typically either a hard threshold, in which case the
unit is called a perceptron, or a logistic function, in which case the term sigmoid
perceptron is sometimes used.
• Both of these nonlinear activation functions ensure the important property that
the entire network of units can represent a nonlinear function.
• Logistic activation function has the added advantage of being differentiable.
• The activation function g is designed to meet two desiderata:
• we want the unit to be "active" (near +1) when the "right" inputs are given, and "inactive"
(near 0) when the "wrong" inputs are given
• the activation needs to be nonlinear; otherwise the entire neural network collapses into a
simple linear function
• the bias weight W0,j sets the actual threshold for the unit, in the sense that the unit is
activated when the weighted sum of "real" inputs (i.e., excluding the bias input) exceeds W0,j
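As a minimal sketch of a single unit's computation (weighted sum followed by an activation function), assuming a logistic activation and a fixed bias input of −1 so that the bias weight acts as the threshold (the weights below are illustrative, not from the slides):

```python
import math

def logistic(x):
    # Logistic (sigmoid) activation: differentiable, output in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(w, inputs):
    # w[0] is the bias weight W0,j, paired with a fixed bias input a0 = -1,
    # so the unit is "active" when the weighted sum of real inputs exceeds w[0]
    in_j = -w[0] + sum(wi * ai for wi, ai in zip(w[1:], inputs))
    return logistic(in_j)

# With threshold 1.5 and unit weights, the unit acts like AND on 0/1 inputs
print(unit_output([1.5, 1.0, 1.0], [1, 1]))  # > 0.5 ("active")
print(unit_output([1.5, 1.0, 1.0], [0, 1]))  # < 0.5 ("inactive")
```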
Network structures
• acyclic or feed-forward networks
• has connections only in one direction
• Every node receives input from “upstream” nodes and delivers output to “downstream”
nodes; there are no loops
• represents a function of its current input
• it has no internal state other than the weights themselves
• cyclic or recurrent networks
• feeds its outputs back into its own inputs
• means that the activation levels of the network form a dynamical system that may reach
a stable state or exhibit oscillations or even chaotic behavior
• the response of the network to a given input depends on its initial state, which may
depend on previous inputs
• can support short-term memory
• Feed-forward networks are usually arranged in layers, such that each unit receives
input only from units in the immediately preceding layer.
• single layer networks, which have no hidden units
• multilayer networks, which have one or more layers of hidden units that are not directly connected to
the outputs of the network
• neural networks can be used in cases where multiple outputs are appropriate
• A neural network can be used for classification or regression
• For Boolean classification with continuous outputs (e.g., with sigmoid units), it is
traditional to have a single output unit, with a value over 0.5 interpreted as one
class and a value below 0.5 as the other
• For k-way classification, one could divide the single output unit's range into k
portions, but it is more common to have k separate output units, with the value of
each one representing the relative likelihood of that class given the current input
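With k separate output units, the predicted class is simply the most active unit. A trivial sketch (the function name is mine):

```python
def classify_k_way(outputs):
    # outputs: activations of the k output units; the most active unit wins
    return max(range(len(outputs)), key=lambda k: outputs[k])

print(classify_k_way([0.2, 0.7, 0.1]))  # class 1 has the highest activation
```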
Single layer feed-forward neural networks

a3 = g(w0,3 + w1,3 a1 + w2,3 a2)
   = g(w0,3 + w1,3 x1 + w2,3 x2)
a4 = g(w0,4 + w1,4 a1 + w2,4 a2)
   = g(w0,4 + w1,4 x1 + w2,4 x2)

A simple two-input, two-output perceptron network.
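The two equations above can be sketched directly in Python (logistic activation; the weight values are illustrative, not from the slides):

```python
import math

def g(x):
    # Logistic activation
    return 1.0 / (1.0 + math.exp(-x))

def single_layer(x1, x2, w):
    # w maps (source, destination) unit indices to weights; source 0 is the bias term
    a3 = g(w[(0, 3)] + w[(1, 3)] * x1 + w[(2, 3)] * x2)
    a4 = g(w[(0, 4)] + w[(1, 4)] * x1 + w[(2, 4)] * x2)
    return a3, a4

# Illustrative weights (not from the slides)
w = {(0, 3): -0.5, (1, 3): 1.0, (2, 3): 1.0,
     (0, 4):  0.5, (1, 4): -1.0, (2, 4): 1.0}
a3, a4 = single_layer(1.0, 0.0, w)
```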
Multi-layer feed-forward neural networks

A neural network with two inputs, one hidden layer of two units, and one output unit.

a5 = g(w0,5 + w3,5 a3 + w4,5 a4)
   = g(w0,5 + w3,5 g(w0,3 + w1,3 a1 + w2,3 a2) + w4,5 g(w0,4 + w1,4 a1 + w2,4 a2))
   = g(w0,5 + w3,5 g(w0,3 + w1,3 x1 + w2,3 x2) + w4,5 g(w0,4 + w1,4 x1 + w2,4 x2))
Multi-layer feed-forward neural networks

A neural network with two inputs, one hidden layer of two units, and one output layer of two units.

a5 = g(w0,5 + w3,5 a3 + w4,5 a4)
   = g(w0,5 + w3,5 g(w0,3 + w1,3 a1 + w2,3 a2) + w4,5 g(w0,4 + w1,4 a1 + w2,4 a2))
   = g(w0,5 + w3,5 g(w0,3 + w1,3 x1 + w2,3 x2) + w4,5 g(w0,4 + w1,4 x1 + w2,4 x2))
a6 = g(w0,6 + w3,6 a3 + w4,6 a4)
   = g(w0,6 + w3,6 g(w0,3 + w1,3 a1 + w2,3 a2) + w4,6 g(w0,4 + w1,4 a1 + w2,4 a2))
   = g(w0,6 + w3,6 g(w0,3 + w1,3 x1 + w2,3 x2) + w4,6 g(w0,4 + w1,4 x1 + w2,4 x2))
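The nested expressions above are just a forward pass through the hidden layer and then the output layer. A minimal sketch, assuming a logistic activation and made-up weights:

```python
import math

def g(x):
    # Logistic activation
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forward(x1, x2, w):
    # w maps (source, destination) unit indices to weights; source 0 is the bias term
    a3 = g(w[(0, 3)] + w[(1, 3)] * x1 + w[(2, 3)] * x2)   # hidden unit 3
    a4 = g(w[(0, 4)] + w[(1, 4)] * x1 + w[(2, 4)] * x2)   # hidden unit 4
    a5 = g(w[(0, 5)] + w[(3, 5)] * a3 + w[(4, 5)] * a4)   # output unit 5
    a6 = g(w[(0, 6)] + w[(3, 6)] * a3 + w[(4, 6)] * a4)   # output unit 6
    return a5, a6

# Illustrative weights (not from the slides)
w = {(0, 3): 0.0, (1, 3): 1.0, (2, 3): 1.0,
     (0, 4): 0.0, (1, 4): 1.0, (2, 4): -1.0,
     (0, 5): 0.0, (3, 5): 1.0, (4, 5): 1.0,
     (0, 6): 0.0, (3, 6): -1.0, (4, 6): 1.0}
a5, a6 = mlp_forward(1.0, 0.0, w)
```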
Single layer feed-forward neural networks
(perceptrons)
• A network with all the inputs connected directly to the outputs is called a single-
layer neural network , or a perceptron network.
• Each output unit is independent of the others: each weight affects only one
of the outputs
• With a threshold activation function, we can view the perceptron as representing
a Boolean function
• it can represent some quite "complex" Boolean functions (e.g., the majority function) very compactly,
but cannot represent others (e.g., XOR, which is not linearly separable)

• defines a hyperplane in the input space, so the perceptron returns 1 if and only if the input is
on one side of that hyperplane
• depending on the type of activation function used, the training process will be
either the perceptron learning rule or the gradient descent rule for logistic
regression
• linearly separable functions constitute just a small fraction of all Boolean
functions
• Each cycle through the examples is called an epoch.
• Epochs are repeated until some stopping criterion is reached-typically, that the
weight changes have become very small.
Re-visiting weight updates
• Perceptron learning rule (threshold activation):

wi ← wi + α (y − hw(x)) × xi

• Sigmoid perceptron learning rule (logistic activation; the extra factor is the derivative g′(in)):

wi ← wi + α (y − hw(x)) × g′(in) × xi

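The perceptron learning rule, wi ← wi + α (y − h(x)) × xi, can be sketched as a simple training loop (illustrative code, not from the slides):

```python
def threshold(x):
    # Hard threshold activation: "active" (1) when the weighted sum is non-negative
    return 1 if x >= 0 else 0

def train_perceptron(examples, alpha=1, epochs=50):
    # examples: list of (inputs, target); a fixed bias input of 1 is prepended
    n = len(examples[0][0])
    w = [0] * (n + 1)
    for _ in range(epochs):           # each pass over the examples is one epoch
        for x, y in examples:
            xb = [1] + list(x)
            h = threshold(sum(wi * xi for wi, xi in zip(w, xb)))
            # Perceptron learning rule: w_i <- w_i + alpha * (y - h) * x_i
            w = [wi + alpha * (y - h) * xi for wi, xi in zip(w, xb)]
    return w

# Logical AND is linearly separable, so the rule converges on it
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(data)
```

Here the stopping criterion is simply a fixed number of epochs; in practice one would stop when the weight changes become very small, as the slides note.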
Expressiveness of MLPs
• The advantage of adding hidden layers is that it enlarges the space of hypotheses
that the network can represent
• With more hidden units, we can produce more bumps of different sizes in more
places
• With a single, sufficiently large hidden layer, it is possible to represent any
continuous function of the inputs with arbitrary accuracy
• With two layers, even discontinuous functions can be represented
• Unfortunately, for any particular network structure, it is harder to characterize
exactly which functions can be represented and which ones cannot.
• The problem of choosing the right number of hidden units in advance is still not
well understood
Learning in multilayer networks
• One minor complication arises in multilayer networks: interactions among the
learning problems when the network has multiple outputs.
• In such cases, we should think of the network as implementing a vector function
hw rather than a scalar function; for example, the network returns a vector [a5, a6].
• the target output will be a vector y.
• Whereas a perceptron network decomposes into m separate learning problems for an m-
output problem, this decomposition fails in a multilayer network.
• For example, both a5 and a6 depend on all of the input-layer weights, so updates to those
weights will depend on errors in both a5 and a6.
• this dependency is very simple in the case of any loss function that is additive
across the components of the error vector y − hw(x).
• For the L2 loss, we have, for any weight w,

∂/∂w Loss(w) = ∂/∂w |y − hw(x)|² = ∂/∂w Σk (yk − ak)² = Σk ∂/∂w (yk − ak)²

where the index k ranges over nodes in the output layer


• The major complication comes from the addition of hidden layers to the network.
• Whereas the error y − hw at the output layer is clear, the error at the hidden
layers seems mysterious because the training data do not say what value the
hidden nodes should have.
• It turns out that we can back-propagate the error from the output layer to the
hidden layers. The back-propagation process emerges directly from a derivation
of the overall error gradient.
• Idea is that hidden node j is "responsible" for some fraction of the error in each
of the output nodes to which it connects
• Δk values are divided according to the strength of the connection between the
hidden node and the output node and are propagated back to provide the Δj
values for the hidden layer
• We have multiple output units, so let Errk be the kth component of the error
vector y − hw.
• Let us define a modified error Δk = Errk × g′(ink), so that the weight-update rule
becomes

wj,k ← wj,k + α × aj × Δk

• The propagation rule for the Δ values is the following:

Δj = g′(inj) Σk wj,k Δk

• the weight-update rule for the weights between the inputs and the hidden layer
is essentially identical to the update rule for the output layer:

wi,j ← wi,j + α × ai × Δj

• The back-propagation process can be summarized as follows:
• The gradient of the loss with respect to weights connecting the hidden layer to the
output layer will be zero except for weights wj,k that connect to the kth output
unit. For those weights, we have

∂Lossk/∂wj,k = −2 (yk − ak) × g′(ink) × aj = −2 aj Δk

(the constant factor 2 can be absorbed into the learning rate α)
• To obtain the gradient with respect to the wi,j weights connecting the input layer
to the hidden layer, we have to expand out the activations aj and reapply the
chain rule.
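As a minimal sketch of these update rules for the two-input, two-hidden-unit, two-output network (assuming logistic activations; the weight layout, weight values, and variable names are my own, not from the slides):

```python
import math

def g(x):
    # Logistic activation
    return 1.0 / (1.0 + math.exp(-x))

def g_prime(in_x):
    # Derivative of the logistic function: g'(in) = g(in) * (1 - g(in))
    gx = g(in_x)
    return gx * (1.0 - gx)

def backprop_step(x, y, wh, wo, alpha=0.5):
    # wh[j]: weights into hidden unit j, [w0, w1, w2] (index 0 is the bias)
    # wo[k]: weights into output unit k, [w0, w3, w4] (index 0 is the bias)
    xb = [1.0] + list(x)                       # prepend fixed bias input
    in_h = [sum(wi * ai for wi, ai in zip(row, xb)) for row in wh]
    a_h = [1.0] + [g(v) for v in in_h]         # bias activation for output layer
    in_o = [sum(wi * ai for wi, ai in zip(row, a_h)) for row in wo]
    a_o = [g(v) for v in in_o]

    # Output layer: Delta_k = Err_k * g'(in_k)
    delta_o = [(yk - ak) * g_prime(ik) for yk, ak, ik in zip(y, a_o, in_o)]
    # Hidden layer: Delta_j = g'(in_j) * sum_k w_j,k * Delta_k
    delta_h = [g_prime(in_h[j]) *
               sum(wo[k][j + 1] * delta_o[k] for k in range(len(wo)))
               for j in range(len(wh))]

    # Weight updates: w <- w + alpha * activation * Delta
    for k, row in enumerate(wo):
        for j in range(len(row)):
            row[j] += alpha * a_h[j] * delta_o[k]
    for j, row in enumerate(wh):
        for i in range(len(row)):
            row[i] += alpha * xb[i] * delta_h[j]
    return a_o                                 # outputs before this update

# Repeated steps on one (made-up) example drive the outputs toward the target
wh = [[0.1, 0.2, -0.1], [-0.2, 0.1, 0.3]]
wo = [[0.05, 0.2, -0.3], [-0.1, 0.4, 0.1]]
for _ in range(100):
    out = backprop_step((1.0, 0.0), (1.0, 0.0), wh, wo)
```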
Learning neural network structures
• How to find the best network structure?
• neural networks are subject to overfitting when there are too many parameters in the model
• For fully connected networks, the only choices to be made concern the number of hidden
layers and their sizes
• try several and keep the best
• cross-validation techniques are needed
• if the network is not fully connected, we need an effective search method through the very
large space of possible connection topologies
• the optimal brain damage algorithm begins with a fully connected network and removes connections
from it
• After the network is trained for the first time, an information-theoretic approach identifies an
optimal selection of connections that can be dropped
• The network is then retrained, and if its performance has not decreased then the process is repeated
• It is also possible to remove units that are not contributing much to the result
Learning neural network structures …
• Several algorithms have been proposed for growing a larger network from a
smaller one.
• Tiling algorithm
• The idea is to start with a single unit that does its best to produce the correct output on as
many of the training examples as possible.
• Subsequent units are added to take care of the examples that the first unit got wrong.
• The algorithm adds only as many units as are needed to cover all the examples.
