
UNIT - II

ARTIFICIAL NEURAL NETWORKS


Syllabus:
• Artificial Neural Networks-1: Introduction, neural network representation, appropriate problems for neural network learning, perceptrons, multilayer networks and the back-propagation algorithm.
• Artificial Neural Networks-2: Remarks on the back-propagation algorithm, an illustrative example: face recognition, advanced topics in artificial neural networks.
• Evaluating Hypotheses: Motivation, estimating hypothesis accuracy, basics of sampling theory, a general approach for deriving confidence intervals, difference in error of two hypotheses, comparing learning algorithms.
TOPICS
1. Artificial Neural Networks
1.1. Neurons and Biological Motivation
1.2. Neural Network Representations
1.3. Problems for Neural Network Learning

2. Perceptrons
2.1. Representational Power of Perceptrons
2.2. The Perceptron Training Rule

3. Multilayer Networks and the Backpropagation Algorithm
3.1. A Differentiable Threshold Unit
3.2. The Backpropagation Algorithm
1. Artificial Neural Networks

• Neural network learning methods provide a robust approach to approximating real-valued, discrete-valued and vector-valued target functions.
• ANN learning is well matched to certain types of problems, such as:
  - interpreting complex real-world sensor data,
  - speech recognition,
  - interpreting visual scenes,
  - learning robot control strategies.
• The backpropagation algorithm is one of the most practically successful ANN learning algorithms, applied to problems such as:
  - learning to recognize handwritten characters,
  - learning to recognize faces,
  - learning to recognize spoken words.
• The backpropagation algorithm uses gradient descent to tune network parameters to best fit a training set of input-output pairs.
1.1. Neurons and Biological Motivation

• The study of artificial neural networks (ANNs) has been inspired by the observation that biological learning systems are built of very complex, nonlinear, parallel interconnections of neurons.
• The human body is made up of trillions of cells. Cells of the nervous system, called nerve cells or neurons, are specialized to carry "messages" through an electrochemical process.
• Neurons have specialized cell parts called dendrites and axons. Dendrites bring electrical signals to the cell body, and axons take information away from the cell body.
• In other words, dendrites carry input into the neuron cell, and axons carry output from that cell, which may in turn serve as input to another neuron.
[Figures: the structure of a neuron; communication between two neurons.]
• The human brain is estimated to contain an interconnected network of approximately 10^11 neurons, each connected, on average, to 10^4 others.
• Neuron activity is typically excited or inhibited through connections to other neurons.
• Signals can be transmitted unchanged or they can be altered by synapses. A synapse can increase or decrease the strength of the connection from neuron to neuron and cause excitation or inhibition of a subsequent neuron. This is where information is stored.
• ANNs are only loosely motivated by biological neural systems; there are many complexities of biological neural systems that are not modeled by ANNs.
• The information processing abilities of biological neural systems must follow from highly parallel processes operating on representations that are distributed over many neurons. One motivation for ANNs is to capture this kind of highly parallel computation based on distributed representations.
• Historically, two groups of researchers have worked with artificial neural networks:
  - those using ANNs to study and model biological learning processes, and
  - those aiming to obtain highly effective machine learning algorithms.
1.2. Neural Network Representations

• A prototypical example of ANN learning is ALVINN (Autonomous Land Vehicle In a Neural Network), a perception system that uses a learned ANN to steer an autonomous vehicle driving at normal speeds on public highways.
• Input: the input to the neural network is a 30 x 32 grid of pixel intensities obtained from a forward-pointed camera mounted on the vehicle.
• Output: the network output is the direction in which the vehicle is steered.
• The ANN is trained to mimic the observed steering commands of a human driving the vehicle for approximately 5 minutes. ALVINN has used its learned networks to successfully drive at speeds up to 70 miles per hour and for distances of 90 miles on public highways.
• The figures below illustrate the neural network representation used in one version of the ALVINN system, which is typical of the representations used by many ANN systems.
[Figures 1-3: the ALVINN system - vehicle photo, network diagram, and learned-weight visualizations.]
• The ALVINN system uses backpropagation to learn to steer an autonomous vehicle (photo at top) driving at speeds up to 70 miles per hour.
• The diagram on the left (figure 1) shows how the image from a forward-mounted camera is mapped to 960 neural network inputs, which are fed forward to 4 hidden units, connected to 30 output units.
• Figure 3 shows the 30 x 32 weights into one hidden unit displayed in the large matrix, with white blocks indicating positive weights and black blocks indicating negative weights.
• As can be seen from the output weights, activation of this particular hidden unit encourages a turn toward the left.
• Each ANN is composed of a collection of perceptrons grouped in layers. A typical structure is shown in the figure below.
• Note the three kinds of layers: input, intermediate (called the hidden layer) and output. Several hidden layers can be placed between the input and output layers.

[Figure: a typical layered network with input, hidden and output layers.]
1.3. Problems for Neural Network Learning

• ANN learning is well-suited to problems in which the training data corresponds to noisy, complex sensor data. It is also applicable to problems for which more symbolic representations are used.
• The backpropagation (BP) algorithm is the most commonly used ANN learning technique. It is appropriate for problems with the following characteristics:
• Instances are represented by many attribute-value pairs.
• The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes.
• The training examples may contain errors.
• Long training times are acceptable.
• Fast evaluation of the learned target function may be required.
• The ability of humans to understand the learned target function is not important.
2. Perceptrons

• One type of ANN system is based on a unit called a perceptron, illustrated in the figure below.

[Figure: a perceptron.]

• An ANN consists of perceptrons. Each perceptron receives inputs, processes them, and delivers a single output.
• A perceptron takes a vector of real-valued inputs, calculates a linear combination of these inputs, and then outputs 1 if the result is greater than some threshold and -1 otherwise:

    o(x1, ..., xn) = 1 if w0 + w1 x1 + w2 x2 + ... + wn xn > 0, and -1 otherwise

• Each wi is a real-valued constant, or weight, that determines the contribution of input xi to the perceptron output.
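As a concrete illustration, here is a minimal Python sketch of the perceptron's output computation. The function name and the convention of folding the threshold into a bias weight w0 (with an implicit input of 1) are our own choices, not part of the slides:

```python
# Minimal perceptron sketch: weights[0] is the bias w0, inputs are x1..xn.
def perceptron_output(weights, inputs):
    s = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if s > 0 else -1   # threshold at zero: output is 1 or -1
```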
2.1. Representational Power of Perceptrons

• We can view the perceptron as representing a hyperplane decision surface in the n-dimensional space of instances (i.e., points).
• The perceptron outputs a 1 for instances lying on one side of the hyperplane and outputs a -1 for instances lying on the other side.
• Sets of examples that can be separated by a single perceptron are called linearly separable sets of examples.
• A single perceptron can be used to represent many boolean functions.
• For example, if we assume boolean values of 1 (true) and -1 (false), then one way to use a two-input perceptron to implement the AND function is to set the weights w0 = -0.8 and w1 = w2 = 0.5.
• This perceptron can be made to represent the OR function instead by altering the threshold to w0 = -0.3.
• Unfortunately, some boolean functions cannot be represented by a single perceptron, such as the XOR function. Such training examples are called linearly nonseparable.
• AND function (linearly separable):

Training examples:
  x1  x2  output
  0   0   -1
  0   1   -1
  1   0   -1
  1   1    1

Test results (with w0 = -0.8, w1 = w2 = 0.5):
  x1  x2  Σwi·xi  output
  0   0   -0.8    -1
  0   1   -0.3    -1
  1   0   -0.3    -1
  1   1    0.2     1

Decision hyperplane: w0 + w1 x1 + w2 x2 = 0, i.e., -0.8 + 0.5 x1 + 0.5 x2 = 0.
• OR function (linearly separable): the two-input perceptron implements the OR function when we set the weights w0 = -0.3, w1 = w2 = 0.5.

Training examples:
  x1  x2  output
  0   0   -1
  0   1    1
  1   0    1
  1   1    1

Test results (with w0 = -0.3, w1 = w2 = 0.5):
  x1  x2  Σwi·xi  output
  0   0   -0.3    -1
  0   1    0.2     1
  1   0    0.2     1
  1   1    0.7     1

Decision hyperplane: w0 + w1 x1 + w2 x2 = 0, i.e., -0.3 + 0.5 x1 + 0.5 x2 = 0.
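To make the two tables above easy to check, here is a small sketch that evaluates both weight settings; it reuses the perceptron_output helper sketched earlier, and the variable names are our own:

```python
# Check the AND and OR weight settings from the tables above.
and_w = [-0.8, 0.5, 0.5]   # w0, w1, w2 for AND
or_w  = [-0.3, 0.5, 0.5]   # w0, w1, w2 for OR
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2,
              perceptron_output(and_w, [x1, x2]),   # AND column
              perceptron_output(or_w, [x1, x2]))    # OR column
```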
• XOR function (not linearly separable): it is impossible to implement the XOR function with a single perceptron.

Training examples:
  x1  x2  output
  0   0   -1
  0   1    1
  1   0    1
  1   1   -1

• A two-layer network of perceptrons can represent the XOR function.


2.2. The Perceptron Training Rule

• Let us begin by understanding how to learn the weights for a single perceptron. These learning rules are important to ANNs because they provide the basis for learning networks of many units.
• Several algorithms are known to solve this learning problem. Here we consider two:
  - the perceptron rule, and
  - the delta rule.
• The Perceptron Rule:
  - One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example.
  - This process is repeated until the perceptron classifies all training examples correctly.
  - Weights are modified at each step according to the perceptron training rule, which revises the weight wi associated with input xi according to:

    wi ← wi + Δwi, where Δwi = η (t - o) xi
In the above equation:
  t is the target output for the current training example,
  o is the output generated by the perceptron, and
  η is a positive constant called the learning rate.

• The role of the learning rate is to moderate the degree to which weights are changed at each step.
• If t = -1 and o = 1, then weights associated with positive xi will be decreased rather than increased.
• If t = 1 and o = -1, then weights associated with positive xi will be increased rather than decreased.
• Example: if xi = 0.8, η = 0.1, t = 1, and o = -1, then the weight update will be Δwi = η(t - o)xi = 0.1(1 - (-1))0.8 = 0.16.
• The above learning procedure can be proven to converge within a finite number of applications of the perceptron training rule to a weight vector that correctly classifies all training examples, provided the training examples are linearly separable and a sufficiently small η is used. If the data are not linearly separable, convergence is not assured.
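The following Python sketch implements the perceptron training rule as described above, reusing the perceptron_output helper from earlier. The loop structure and the run-until-no-mistakes stopping condition are our own choices, and the loop only terminates if the data are linearly separable, as the convergence result requires:

```python
# Perceptron training rule sketch: repeat until every example is classified
# correctly (assumes linearly separable data, per the convergence theorem).
def train_perceptron(examples, n_inputs, eta=0.1):
    weights = [0.0] * (n_inputs + 1)           # weights[0] is the bias w0
    converged = False
    while not converged:
        converged = True
        for inputs, t in examples:             # examples: (inputs, target)
            o = perceptron_output(weights, inputs)
            if o != t:                         # misclassified: update weights
                converged = False
                weights[0] += eta * (t - o)            # bias input is 1
                for i, x in enumerate(inputs, start=1):
                    weights[i] += eta * (t - o) * x    # Δwi = η(t - o)xi
    return weights
```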
• Gradient Descent and the Delta Rule:
  - The perceptron rule finds a successful weight vector when the training examples are linearly separable, but it can fail to converge if the examples are not linearly separable.
  - A second training rule, called the delta rule, is designed to overcome this difficulty.
  - The key idea of the delta rule is to use gradient descent to search the space of possible weight vectors to find the weights that best fit the training examples.
  - This rule is important because gradient descent provides the basis for the backpropagation algorithm, and because it can serve as the basis for learning algorithms that must search through hypothesis spaces containing many different types of continuously parameterized hypotheses.
• The delta training rule is best understood by considering the task of training an unthresholded perceptron; that is, a linear unit for which the output o is given by

    o(x) = w0 + w1 x1 + ... + wn xn

• In order to derive a weight learning rule for linear units, let us begin by specifying a measure for the training error of a hypothesis (weight vector), relative to the training examples. Although there are many ways to define this error, one common measure is

    E(w) = 1/2 Σd∈D (td - od)²

  where D is the set of training examples, td is the target output for training example d, and od is the output of the linear unit for training example d.
[Figure: the error E of different hypotheses. For a linear unit with two weights, the hypothesis space H is the w0, w1 plane, and the error surface is a parabola with a single global minimum.]
• Gradient descent search determines a weight vector that minimizes E by starting with an arbitrary initial weight vector, then repeatedly modifying it in small steps.
• At each step, the weight vector is altered in the direction that produces the steepest descent along the error surface.
• This process continues until the global minimum error is reached.
• It will then update the weights as wi ← wi + Δwi.
• The gradient descent algorithm for training linear units is as follows: pick an initial random weight vector; apply the linear unit to all training examples, then compute Δwi for each weight according to the update equation; update each weight wi by adding Δwi; then repeat this process.
• Derivation of the Gradient-Descent Rule:
  - How can we calculate the direction of steepest descent along the error surface?
  - This can be done by computing the derivative of E with respect to each component of the weight vector w.
  - This vector derivative is called the gradient of E with respect to w, written ∇E(w).
• A linear unit's output o is given by

    o = w · x                                          -------- (1)

• In order to derive a weight learning rule for linear units, we specify a measure for the training error of a hypothesis. One common measure that will turn out to be especially convenient is

    E(w) = 1/2 Σd∈D (td - od)²                         -------- (2)

• The vector derivative called the gradient of E with respect to the vector w = <w0, ..., wn> is written

    ∇E(w) = [∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn]              -------- (3)
• Since the gradient specifies the direction of steepest increase of E, the training rule for gradient descent is

    w ← w + Δw, where Δw = -η ∇E(w)                    -------- (4)

• Here η is a positive constant called the learning rate. The negative sign is present because we want to move the weight vector in the direction that decreases E.
• Equation (4) can also be written in its component form as

    wi ← wi + Δwi, where Δwi = -η ∂E/∂wi               -------- (5)
• This makes it clear that steepest descent is achieved by altering each component wi of the weight vector in proportion to ∂E/∂wi. The vector of ∂E/∂wi derivatives that form the gradient can be obtained by differentiating E from equation (2):

    ∂E/∂wi = ∂/∂wi [ 1/2 Σd (td - od)² ]
           = 1/2 Σd 2(td - od) ∂/∂wi (td - od)
           = Σd (td - od) ∂/∂wi (td - w · xd)
           = Σd (td - od)(-xid)                        -------- (6)

  where xid denotes the single input component xi for training example d.
• Substituting equation (6) into equation (5) yields the weight update rule for gradient descent:

    Δwi = η Σd∈D (td - od) xid                         -------- (7)
[Figure: the Gradient-Descent algorithm pseudocode; summarized on the next slide.]
Gradient-Descent Algorithm

• To summarize, the gradient descent algorithm for training linear units is as follows:
  - Pick an initial random weight vector.
  - Apply the linear unit to all training examples, then compute Δwi for each weight.
  - Update each weight wi by adding Δwi, then repeat this process.
• Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, provided a sufficiently small learning rate is used.
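A compact Python sketch of this batch procedure follows. The update implements Δwi = η Σd (td - od) xid from equation (7); the fixed iteration count and zero initialization are our own simplifications:

```python
# Batch gradient descent for a linear unit (delta rule, batch form).
def linear_output(weights, inputs):
    # weights[0] is the bias w0
    return weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))

def train_linear_unit(examples, n_inputs, eta=0.05, iterations=1000):
    weights = [0.0] * (n_inputs + 1)
    for _ in range(iterations):                # fixed count: our simplification
        delta = [0.0] * len(weights)
        for inputs, t in examples:             # accumulate over ALL examples
            err = t - linear_output(weights, inputs)
            delta[0] += eta * err              # bias input is 1
            for i, x in enumerate(inputs, start=1):
                delta[i] += eta * err * x      # Δwi = η Σd (td - od) xid
        weights = [w + d for w, d in zip(weights, delta)]
    return weights
```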
Stochastic Gradient Descent

• There are two key practical difficulties in applying gradient descent:
  - converging to a local minimum can sometimes be quite slow, and
  - if there are multiple local minima in the error surface, there is no guarantee that the procedure will find the global minimum.
• One common variation on gradient descent intended to alleviate these difficulties is called incremental gradient descent, or alternatively stochastic gradient descent.
• The idea behind stochastic gradient descent is to approximate the gradient descent search by updating weights incrementally, following the calculation of the error for each individual example.
• The modified training rule updates each weight after each individual example d:

    Δwi = η (t - o) xi

• It is as if gradient descent is performed with respect to an error function defined for each individual training example d:

    Ed(w) = 1/2 (td - od)²
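For contrast with the batch version above, here is a stochastic (incremental) variant in Python, reusing the linear_output helper; the epoch count is again our own choice:

```python
# Stochastic gradient descent: update weights after every single example.
def train_linear_unit_sgd(examples, n_inputs, eta=0.05, epochs=100):
    weights = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for inputs, t in examples:
            err = t - linear_output(weights, inputs)
            weights[0] += eta * err                    # bias input is 1
            for i, x in enumerate(inputs, start=1):
                weights[i] += eta * err * x            # Δwi = η(t - o)xi
    return weights
```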
Remarks

• Perceptron training rule:
  1. Updates weights based on the error in the thresholded perceptron output.
  2. Converges after a finite number of iterations to a hypothesis that perfectly classifies the training data, provided the training examples are linearly separable.
• Delta rule:
  1. Updates weights based on the error in the unthresholded linear combination of inputs.
  2. Converges only asymptotically toward the minimum-error hypothesis, possibly requiring unbounded time, but converges regardless of whether the training data are linearly separable.
3. Multilayer Networks

• The kind of multilayer networks learned by the backpropagation algorithm are capable of expressing a rich variety of nonlinear decision surfaces.
• A multilayer network can represent highly nonlinear decision surfaces that are much more expressive than linear decision surfaces.
• For example, in figure 3.0 the speech recognition task involves distinguishing among 10 possible vowels, all spoken in the context of "h-d" (i.e., "hid", "had", "head", "hood", etc.).
[Figure 3.0: decision regions of a multilayer network trained on vowel data.]

• The network input consists of two parameters, F1 and F2, obtained from a spectral analysis of the sound. The 10 network outputs correspond to the 10 possible vowel sounds. The plot on the right illustrates the highly nonlinear decision surface.
3.1. A Differentiable Threshold Unit

• Like the perceptron, the sigmoid unit first computes a linear combination of its inputs, then applies a threshold to the result. In the case of the sigmoid unit, however, the threshold output is a continuous function of its input:

    o = σ(w · x), where σ(x) = 1 / (1 + e^(-x))

• The sigmoid function σ(x) is also called the logistic function.
• An interesting property: the output ranges between 0 and 1, increasing monotonically with its input, and its derivative is easily expressed in terms of its output: dσ(x)/dx = σ(x)(1 - σ(x)).
• We can derive gradient descent rules to train:
  - one sigmoid unit, and
  - multilayer networks of sigmoid units, which gives us the backpropagation algorithm.

[Figure: the sigmoid threshold unit.]
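A one-function Python sketch of the sigmoid unit, used by the backpropagation sketch later in this unit (the helper names are our own):

```python
import math

def sigmoid(x):
    """Logistic function: output in (0, 1); derivative is sigmoid(x)*(1-sigmoid(x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_unit_output(weights, inputs):
    # weights[0] is the bias w0; the unit smoothly thresholds its net input.
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return sigmoid(net)
```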
3.2. The Backpropagation Algorithm

• The BP algorithm learns the weights for a multilayer network, given a network with a fixed set of units and interconnections. It employs gradient descent to attempt to minimize the squared error between the network output values and the target values for these outputs.
• Considering networks with multiple output units rather than single units, we redefine E to sum the errors over all of the network output units:

    E(w) = 1/2 Σd∈D Σk∈outputs (tkd - okd)²

  where outputs is the set of output units in the network, and tkd and okd are the target and output values associated with the kth output unit and training example d.

• The learning problem in backpropagation is to search a large hypothesis space defined by all possible weight values for all the units in the network.
• In a multilayer network the error surface can have multiple local minima.
• Gradient descent is guaranteed only to converge toward some local minimum, not necessarily to the global minimum error.
[Figure: the Backpropagation algorithm pseudocode.]

• The algorithm here applies to layered feedforward networks containing two layers of sigmoid units, with units at each layer connected to all units from the preceding layer.
• Notation:
  - An index (integer) is assigned to each node in the network, where a 'node' is either an input to the network or the output of some unit in the network.
  - xji denotes the input from node i to unit j, and wji denotes the corresponding weight.
  - δn denotes the error term associated with unit n. It plays a role similar to the quantity (t - o) in the delta training rule, where δn = -∂E/∂netn.
Working of the algorithm:

• The algorithm starts by constructing a network with the desired number of hidden and output units.
• All network weights are initialized to small random values.
• The main loop of the algorithm then repeatedly iterates over the training examples.
• For each training example, it applies the network to the example, calculates the error of the network output for this example, and then updates all the weights in the network.
• The gradient descent weight-update rule

    wji ← wji + Δwji, where Δwji = η δj xji

  updates each weight in proportion to the learning rate η, the input value xji to which the weight is applied, and the error δj in the output of the unit.

• The exact form of δj follows from the derivation of the weight-tuning rule.
• δk is computed for each network output unit k as:

    δk = ok (1 - ok)(tk - ok)

  The factor ok(1 - ok) is the derivative of the sigmoid squashing function.
• The δh value for each hidden unit h is calculated as:

    δh = oh (1 - oh) Σk∈outputs wkh δk

• Training examples provide target values tk only for network outputs; no target values are directly available to indicate the error of the hidden units' values.

• The error term for hidden unit h is therefore calculated by summing the error terms δk for each output unit influenced by h, weighting each of the δk's by wkh, the weight from hidden unit h to output unit k.
• This weight characterizes the degree to which hidden unit h is 'responsible for' the error in output unit k.
• The algorithm updates weights incrementally, following the presentation of each training example.
• To compute the true gradient of E, one would instead sum the δj xji values over all training examples before altering the weight values.
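Putting the pieces together, here is a minimal, self-contained Python sketch of one incremental weight-update pass of backpropagation for a two-layer network of sigmoid units. The list-of-lists weight layout and function names are our own assumptions, not the slides' notation:

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def _unit_output(weights, inputs):
    # weights[0] is the bias w0
    return _sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], inputs)))

# One stochastic backpropagation update for a two-layer sigmoid network.
# hidden_w[h] and output_w[k] are weight lists whose element 0 is the bias.
def backprop_update(hidden_w, output_w, inputs, targets, eta=0.05):
    # 1. Forward pass: compute hidden and output unit activations.
    hidden_o = [_unit_output(w, inputs) for w in hidden_w]
    output_o = [_unit_output(w, hidden_o) for w in output_w]

    # 2. Error term for each output unit k: δk = ok(1 - ok)(tk - ok).
    delta_k = [o * (1 - o) * (t - o) for o, t in zip(output_o, targets)]

    # 3. Error term for each hidden unit h: δh = oh(1 - oh) Σk wkh δk.
    delta_h = [oh * (1 - oh) *
               sum(output_w[k][h + 1] * delta_k[k]      # +1 skips the bias
                   for k in range(len(output_w)))
               for h, oh in enumerate(hidden_o)]

    # 4. Update every weight: wji <- wji + η δj xji.
    for k, w in enumerate(output_w):
        w[0] += eta * delta_k[k]                        # bias input is 1
        for h, x in enumerate(hidden_o, start=1):
            w[h] += eta * delta_k[k] * x
    for h, w in enumerate(hidden_w):
        w[0] += eta * delta_h[h]
        for i, x in enumerate(inputs, start=1):
            w[i] += eta * delta_h[h] * x
```

Calling backprop_update once per training example inside a loop gives the main loop the slides describe; all δ values are computed before any weight is changed.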

• The weight-update loop in backpropagation may be iterated thousands of times in a typical application. The termination condition may be:
  i) after a fixed number of iterations through the loop, or
  ii) once the error on the training examples falls below some threshold, or
  iii) once the error on a separate validation set of examples meets some criterion.
• The choice of termination condition plays an important role: too few iterations can fail to reduce the error sufficiently, and too many can lead to overfitting the training data.
Adding Momentum

• We can alter the weight-update rule Δwji = η δj xji in the algorithm by making the weight update on the nth iteration depend partially on the update that occurred during the (n-1)th iteration, as follows:

    Δwji(n) = η δj xji + α Δwji(n - 1)

  where Δwji(n) is the weight update performed during the nth iteration, and 0 ≤ α < 1 is a constant called the momentum.
• On the right-hand side of the equation, the first term is just the weight-update rule of the backpropagation algorithm; the second term is new and is called the momentum term.
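In code, momentum only requires remembering the previous update for each weight. A hypothetical sketch of the modified update for a single weight (names are illustrative):

```python
# Momentum sketch for one weight: Δw(n) = η·(δj·xji) + α·Δw(n-1).
def momentum_update(w, delta_term, prev_update, eta=0.05, alpha=0.9):
    """delta_term is δj * xji; returns (new weight, this iteration's update)."""
    update = eta * delta_term + alpha * prev_update   # second term: momentum
    return w + update, update
```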
Learning in Arbitrary Acyclic Networks

• Backpropagation easily generalizes to feedforward networks of arbitrary depth.
• The weight update rule Δwji = η δj xji is still used, with a small change in how the δ values are computed.
• The general equation for computing δr for a unit r in layer m is:

    δr = or (1 - or) Σs∈layer(m+1) wsr δs

  That is, the δr value for a unit r in layer m is computed from the δ values at the next deeper layer m + 1.

• We can generalize the algorithm to any directed acyclic graph, regardless of whether the network units are arranged in uniform layers or not.
• The rule for calculating δ for any internal unit is:

    δr = or (1 - or) Σs∈Downstream(r) wsr δs

  where Downstream(r) is the set of units immediately downstream from unit r in the network, i.e., all units whose inputs include the output of unit r.
Derivation of the Backpropagation Rule

• We will derive the weight-tuning rule of the backpropagation algorithm; in particular, the stochastic gradient descent rule it implements.
• Stochastic gradient descent involves iterating through the training examples one at a time, for each training example d descending the gradient of the error Ed with respect to this single example.
• For each training example d, every weight wji is updated by adding to it Δwji:

    Δwji = -η ∂Ed/∂wji

• Here Ed is the error on training example d, summed over all output units in the network:

    Ed(w) = 1/2 Σk∈outputs (tk - ok)²

  where outputs is the set of output units in the network, tk is the target value of unit k for training example d, and ok is the output of unit k for training example d.
Subscripts and variables used in the notation of the stochastic gradient descent rule:

  i = the ith input
  j = the jth unit of the network
  xji = the ith input to unit j
  wji = the weight associated with the ith input to unit j
  netj = Σi wji xji (the weighted sum of inputs for unit j)
  oj = the output computed by unit j
  tj = the target output for unit j
  σ = the sigmoid function
  outputs = the set of units in the final layer of the network
  Downstream(j) = the set of units whose immediate inputs include the output of unit j

• In order to implement the stochastic gradient rule, we derive an expression for ∂Ed/∂wji.
• The weight wji can influence the rest of the network only through netj. Using the chain rule:

    ∂Ed/∂wji = (∂Ed/∂netj)(∂netj/∂wji) = (∂Ed/∂netj) xji

• Now we have to derive a convenient expression for ∂Ed/∂netj.
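The slides stop the derivation here. As a sketch of the next step, following the definitions above, the case where j is an output unit works out as follows (in LaTeX):

```latex
% Case 1: unit j is an output unit, so net_j influences E_d only through o_j.
\frac{\partial E_d}{\partial net_j}
  = \frac{\partial E_d}{\partial o_j}\,\frac{\partial o_j}{\partial net_j}
  = \underbrace{-(t_j - o_j)}_{\partial E_d/\partial o_j}\;
    \underbrace{o_j(1 - o_j)}_{\sigma'(net_j)}
% Hence \delta_j = -\partial E_d/\partial net_j = o_j(1 - o_j)(t_j - o_j),
% which is exactly the output-unit error term used by the algorithm.
```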
ADVANCED TOPICS IN ARTIFICIAL NEURAL NETWORKS

• Alternative Error Functions:
  - Gradient descent can be performed for any function E that is differentiable with respect to the parameterized hypothesis space. While the basic BP algorithm defines E in terms of the sum of squared errors of the network, other definitions have been suggested in order to incorporate other constraints into the weight-tuning rule.
  - Examples of alternative definitions of E include:
    1. Adding a penalty term for weight magnitude. We can add a term to E that increases with the magnitude of the weight vector. This causes the gradient descent search to seek weight vectors with small magnitudes, thereby reducing the risk of overfitting. One way to do this is to redefine E as

    E(w) = 1/2 Σd∈D Σk∈outputs (tkd - okd)² + γ Σi,j wji²

    2. Adding a term for errors in the slope, or derivative, of the target function. In some cases, training information may be available regarding desired derivatives of the target function, as well as desired values.
    3. Minimizing the cross entropy of the network with respect to the target values. Consider learning a probabilistic function, such as predicting whether a loan applicant will pay back a loan based on attributes such as the applicant's age and bank balance. Although the training examples exhibit only boolean target values, the underlying target function might be best modeled by outputting the probability that the given applicant will repay the loan, rather than attempting to output the actual 1 and 0 value for each input instance. Probability estimates are given by the network that minimizes the cross entropy, defined as

    -Σd∈D [ td log od + (1 - td) log(1 - od) ]

    4. Altering the effective error function can also be accomplished by weight sharing, or "tying together" weights associated with different units or inputs. For example, in an application of neural networks to speech recognition, the various units that receive input from different portions of the time window are forced to share weights. Such weight sharing is typically implemented by first updating each of the shared weights separately within each unit that uses the weight, then replacing each instance of the shared weight by the mean of their values, as shown in the sketch below.
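A tiny Python sketch of that mean-tying step, under our own toy representation (a list holding the separately-updated copies of one shared weight):

```python
# Weight sharing sketch: after separate per-unit updates, tie the shared
# weight back together by replacing every copy with the mean of all copies.
def tie_shared_weight(updated_copies):
    mean = sum(updated_copies) / len(updated_copies)
    return [mean] * len(updated_copies)
```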
Alternative Error Minimization Procedures

• While gradient descent is one of the most general search methods for finding a hypothesis to minimize the error function, it is not always the most efficient. It is not uncommon for BP to require tens of thousands of iterations through the weight-update loop when training complex networks. For this reason, a number of alternative weight optimization algorithms have been proposed and explored.
• One optimization method, known as line search, involves a different approach to choosing the distance for the weight update. Once a line is chosen that specifies the direction of the update, the update distance is chosen by finding the minimum of the error function along this line.
• A second method, which builds on the idea of line search, is called the conjugate gradient method. Here, a sequence of line searches is performed to search for a minimum in the error surface. On the first step in this sequence, the direction chosen is the negative of the gradient. On each subsequent step, a new direction is chosen so that the component of the error gradient that has just been made zero remains zero.
Recurrent Networks

• Recurrent networks are artificial neural networks that apply to time series data and that use outputs of network units at time t as the input to other units at time t + 1. In this way, they support a form of directed cycles in the network.
• Consider the time series prediction task of predicting the next day's stock market average y(t + 1) based on the current day's economic indicators x(t). Given a time series of such data, one obvious approach is to train a feedforward network to predict y(t + 1) as its output, based on the input values x(t). Such a network is shown in Figure 4.11(a).
• One limitation of such a network is that the prediction of y(t + 1) depends only on x(t) and cannot capture possible dependencies of y(t + 1) on earlier values of x.

• If we wish the network to consider an arbitrary window of time in the past when predicting y(t + 1), then a different solution is required. The recurrent network shown in Figure 4.11(b) provides one such solution: the input value c(t) to the network at one time step is simply copied from the value of unit b on the previous time step. This implements a recurrence relation.
• Many other network topologies can also be used to represent recurrence relations. For example, we could have inserted several layers of units between the input and unit b, and we could have added several context units in parallel where we added the single units b and c.
• Figure 4.11(c) shows the data flow of the recurrent network "unfolded" in time. Here we have made several copies of the recurrent network, replacing the feedback loop by connections between the various copies. This large unfolded network contains no cycles; therefore, the weights in the unfolded network can be trained directly using BP.
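As a minimal illustration of the recurrence relation described above (a toy representation of our own, not the actual network of Figure 4.11), the context input at each step is simply the previous step's unit output:

```python
# Toy recurrence sketch: c(t) is copied from the value of unit b at t-1.
# Reuses the sigmoid_unit_output helper sketched earlier; b takes two inputs.
def run_recurrent(xs, b_unit_weights, c0=0.0):
    """xs is the input series x(0), x(1), ...; returns b's outputs over time."""
    c, outputs = c0, []
    for x in xs:
        b = sigmoid_unit_output(b_unit_weights, [x, c])  # inputs: x(t), c(t)
        outputs.append(b)
        c = b                                            # c(t+1) = b(t)
    return outputs
```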

Dynamically Modifying Network Structure

• A variety of methods have been proposed to dynamically grow or shrink the number of network units and interconnections in an attempt to improve generalization accuracy and training efficiency.
• One idea is to begin with a network containing no hidden units, then grow the network as needed by adding hidden units until the training error is reduced to some acceptable level. The CASCADE-CORRELATION algorithm is one such algorithm.
• A second idea for dynamically altering network structure is to take the opposite approach. Instead of beginning with the simplest possible network and adding complexity, we begin with a complex network and prune it as we find that certain connections are inessential. One way to decide whether a particular weight is inessential is to see whether its value is close to zero. A second way, which appears to be more successful in practice, is to consider the effect that a small variation in the weight has on the error E. The effect on E of varying w can be taken as a measure of the salience of the connection.