Unit 2
05/01/2025
TOPICS
1. Artificial Neural Networks
1.1. Neurons and Biological Motivation
1.2. Neural Network Representations
1.3. Problems for Neural Network Learning
2. Perceptrons
2.1. Representational Power of Perceptrons
2.2. The Perceptron Training Rule
3. Multilayer Networks
Example application domains include speech recognition and the interpretation of visual scenes.
1.1. Neurons and Biological Motivation
Figure: the structure of a biological neuron.
The study of ANNs is motivated in part by the attempt to understand and model biological learning processes, and in part by the goal of obtaining highly effective machine learning algorithms.
1.2. Neural Network Representations
Figures 1, 2, and 3: images of the ALVINN system.
The ALVINN system uses BACKPROPAGATION to learn to
steer an autonomous vehicle (photo at top) driving at
speeds up to 70 miles per hour.
The diagram on the left (figure 1) shows how the image from a forward-mounted camera is mapped to 960 neural network inputs, which are fed forward to 4 hidden units, connected to 30 output units.
Figure 3 shows the 30 × 32 weights into one hidden unit, displayed in a large matrix, with white blocks indicating positive weights and black blocks indicating negative weights. As can be seen from these output weights, activation of this particular hidden unit encourages a turn toward the left.
1.3. Problems for Neural Network Learning
2. Perceptrons

A Perceptron
An ANN is built from units called perceptrons. Each perceptron receives a vector of inputs, computes a linear combination of these inputs, and delivers a single output: 1 if the combination exceeds a threshold, and −1 otherwise.
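As a concrete illustration, a single perceptron's output computation can be sketched as follows (a minimal Python sketch; the threshold-at-zero convention with a bias weight w0 on a constant input x0 = 1 is assumed, and the AND weights are illustrative, not from the text):

```python
# A minimal sketch of a single perceptron (assumed threshold at 0,
# with weights[0] acting as a bias weight on a constant input x0 = 1).
def perceptron_output(weights, inputs):
    """Return 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else -1

# Example: a perceptron implementing logical AND of two {-1, 1}-valued inputs.
and_weights = [-0.8, 0.5, 0.5]
print(perceptron_output(and_weights, [1, 1]))    # → 1 (fires only when both inputs are 1)
print(perceptron_output(and_weights, [1, -1]))   # → -1
```

The bias weight plays the role of the (negated) threshold, so the decision rule can be written uniformly as a weighted sum compared against zero.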
2.1. Representational Power of Perceptrons
2.2. The Perceptron Training Rule

The perceptron training rule revises the weight wi associated with input xi according to
wi ← wi + Δwi, where Δwi = η (t − o) xi.
Here t is the target output for the current training example, o is the output generated by the perceptron, and η is a positive constant called the learning rate.
If t = −1 and o = 1, then weights associated with positive xi will be decreased rather than increased.
If t = 1 and o = −1, then weights associated with positive xi will be increased rather than decreased.
Example: if xi = 0.8, η = 0.1, t = 1, and o = −1, then the weight update is Δwi = η (t − o) xi = 0.1(1 − (−1))0.8 = 0.16.
The above learning procedure can be proven to converge within a finite number of applications of the perceptron training rule to a weight vector that correctly classifies all training examples, provided the training examples are linearly separable and provided a sufficiently small η is used. If the data are not linearly separable, convergence is not assured.
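The training procedure above can be sketched as follows (an illustrative Python sketch; the function name, the learning rate of 0.1, and the epoch cap are assumptions, not from the text):

```python
# A sketch of the perceptron training rule: w_i <- w_i + eta*(t - o)*x_i,
# iterated over the training examples until all are classified correctly.
def train_perceptron(examples, eta=0.1, max_epochs=100):
    """examples: list of (inputs, target) pairs with target in {-1, 1}."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)              # w[0] is the bias weight
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if net > 0 else -1
            if o != t:
                errors += 1
                w[0] += eta * (t - o)              # bias input is 1
                for i, xi in enumerate(x):
                    w[i + 1] += eta * (t - o) * xi
        if errors == 0:              # converged: all examples classified
            break
    return w

# Linearly separable data (logical OR on {-1, 1} inputs) converges quickly.
or_data = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), 1)]
w = train_perceptron(or_data)
```

Because the OR examples are linearly separable, the convergence theorem quoted above guarantees this loop terminates with a weight vector classifying all four examples correctly.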
Δw = −η ∇E(w) ----------- (4)
where ∇E(w) is the gradient of E with respect to the weight vector w.
Here η is a positive constant called the learning rate. The negative sign is present because we want to move the weight vector in the direction that decreases E.
Equation (4) can also be written in its component form as
Δwi = −η ∂E/∂wi ----------- (5)
Differentiating the training error E = ½ Σd∈D (td − od)² with respect to each wi gives ∂E/∂wi = Σd∈D (td − od)(−xid), so the weight update rule of equation (5) becomes
Δwi = η Σd∈D (td − od) xid ----------- (6)
Gradient-Descent Algorithm:
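A minimal sketch of the gradient-descent procedure for a single unthresholded linear unit, using the batch update Δwi = η Σd (td − od) xid (all names, the learning rate, and the example data are illustrative assumptions):

```python
# Batch gradient descent for an unthresholded linear unit.
# Weight updates are accumulated over all examples, then applied.
def gradient_descent(examples, eta=0.05, epochs=500):
    """examples: list of (inputs, target); trains a linear unit o = w·x."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                        # w[0] is the bias weight
    for _ in range(epochs):
        delta = [0.0] * (n + 1)
        for x, t in examples:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            delta[0] += eta * (t - o)          # bias input is 1
            for i, xi in enumerate(x):
                delta[i + 1] += eta * (t - o) * xi
        w = [wi + d for wi, d in zip(w, delta)]   # update after a full pass
    return w

# Fit the linear target t = 1 + 2*x; gradient descent recovers the weights.
data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0)]
w = gradient_descent(data)
```

Note the contrast with the perceptron rule: the error here is measured on the unthresholded output o, which makes E differentiable in w.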
Remarks
Delta rule:
1. Updates weights based on the error in the un-thresholded linear combination of inputs.
2. Converges only asymptotically toward the minimum-error hypothesis, possibly requiring unbounded time, but converges regardless of whether the training data are linearly separable.
3. Multilayer Networks
Figure 3.0 shows a multilayer network learning to recognize vowel sounds.
The network input consists of two parameters, F1 and F2, obtained from a spectral analysis of the sound. The 10 network outputs correspond to the 10 possible vowel sounds. The plot on the right illustrates the highly nonlinear decision surface learned by the network.
A Differentiable Threshold Unit:
The sigmoid unit computes its output as o = σ(net), where net = Σi wi xi and σ(y) = 1 / (1 + e^−y).
Interesting property: dσ(y)/dy = σ(y) (1 − σ(y)).
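The sigmoid σ(y) = 1/(1 + e^−y) and its derivative property dσ/dy = σ(y)(1 − σ(y)) can be checked numerically with a short sketch (names are illustrative; a central-difference approximation is used for the check):

```python
import math

# The sigmoid unit's squashing function and its closed-form derivative.
def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative(y):
    s = sigmoid(y)
    return s * (1.0 - s)          # the property dσ/dy = σ(y)(1 - σ(y))

# Numerical check of the property at y = 0.5 via central differences.
h = 1e-6
numeric = (sigmoid(0.5 + h) - sigmoid(0.5 - h)) / (2 * h)
print(abs(numeric - sigmoid_derivative(0.5)) < 1e-6)   # → True
```

This closed-form derivative is what makes the backpropagation weight updates below cheap to compute: the gradient at each unit needs only the unit's own output.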
The Backpropagation (BP) Algorithm:
The BP algorithm learns the weights for a multilayer network, given a network with a fixed set of units and interconnections. It employs gradient descent to attempt to minimize the squared error between the network output values and the target values for these outputs.
Because we are considering networks with multiple output units rather than a single unit, we redefine E to sum the errors over all of the network output units:
E(w) = ½ Σd∈D Σk∈outputs (tkd − okd)²
where outputs is the set of output units in the network, and tkd and okd are the target and output values associated with the kth output unit and training example d.
δh = oh (1 − oh) Σk∈outputs wkh δk
Training examples provide target values tk only for network outputs; no target values are directly available to indicate the error of hidden units' values. Instead, the error term δh for hidden unit h is computed by summing the error terms δk of the output units influenced by h, weighting each δk by wkh, the weight from hidden unit h to output unit k.
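These updates can be sketched for a single hidden layer of sigmoid units (an illustrative sketch of stochastic-gradient backpropagation; the network sizes, learning rate, epoch count, and the XOR example are assumptions, not from the text):

```python
import math, random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def train_bp(examples, n_in, n_hidden, n_out, eta=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    # Small random initial weights; index [unit][input], input 0 is the bias.
    w_h = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            xb = [1.0] + list(x)                              # forward pass
            h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_h]
            hb = [1.0] + h
            o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_o]
            # Output error terms: δ_k = o_k (1 - o_k)(t_k - o_k)
            d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # Hidden error terms: δ_h = o_h (1 - o_h) Σ_k w_kh δ_k
            d_h = [h[j] * (1 - h[j]) * sum(w_o[k][j + 1] * d_o[k] for k in range(n_out))
                   for j in range(n_hidden)]
            for k in range(n_out):                            # w <- w + η δ x
                for j in range(n_hidden + 1):
                    w_o[k][j] += eta * d_o[k] * hb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_h[j][i] += eta * d_h[j] * xb[i]
    return w_h, w_o

def predict(w_h, w_o, x):
    xb = [1.0] + list(x)
    hb = [1.0] + [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_h]
    return [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_o]

# XOR is not representable by a single perceptron, but this multilayer
# network can typically learn it (gradient descent may find local minima).
xor = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
w_h, w_o = train_bp(xor, 2, 3, 1)
```

Note how the hidden-unit line implements the δh formula above exactly: each hidden error is the unit's derivative oh(1 − oh) times the weighted sum of downstream output errors.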
The δr value for a unit r in layer m is computed from the δ values at the next deeper layer m + 1:
δr = or (1 − or) Σs∈layer m+1 wsr δs
Subscripts and variables used in the notation of the stochastic gradient descent rule:
xji = the ith input to unit j
wji = the weight associated with the ith input to unit j
netj = Σi wji xji (the weighted sum of inputs for unit j)
oj = the output computed by unit j
tj = the target output for unit j
σ = the sigmoid function
outputs = the set of units in the final layer of the network
Downstream(j) = the set of units whose immediate inputs include the output of unit j
Now we have to derive a convenient expression for ∂Ed/∂netj.
ADVANCED TOPICS IN ARTIFICIAL NEURAL
NETWORKS
Alternative Error Functions:
Gradient descent can be performed for any function E that is differentiable with respect to the parameterized hypothesis space. While the basic BP algorithm defines E in terms of the sum of squared errors of the network, other definitions have been suggested in order to incorporate other constraints into the weight-tuning rule. Examples of alternative definitions of E include:
1. Adding a penalty term for weight magnitude. We can add a term to E that increases with the magnitude of the weight vector. This causes the gradient descent search to seek weight vectors with small magnitudes, thereby reducing the risk of overfitting. One way to do this is to redefine E as
E(w) = ½ Σd∈D Σk∈outputs (tkd − okd)² + γ Σj,i wji²
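For illustration, the penalized error of this weight-decay variant can be computed as follows (function and variable names are assumptions; residuals are the (t − o) differences flattened over examples and output units):

```python
# Penalized error: E = ½ Σ (t - o)² + γ Σ w².
# The added γ Σ w² term grows with weight magnitude, so gradient descent
# prefers small weights — each update shrinks weights slightly ("weight decay").
def penalized_error(residuals, weights, gamma):
    squared = 0.5 * sum(r * r for r in residuals)
    penalty = gamma * sum(w * w for w in weights)
    return squared + penalty

print(penalized_error([1.0, -2.0], [0.5, -0.5], gamma=0.1))  # → 2.55
```

With γ = 0 this reduces to the ordinary sum-of-squared-errors criterion; larger γ trades training-set fit for smaller weights.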
2. Adding a term for errors in the slope, or derivative of the target function.
In some cases, training information may be available regarding desired
derivatives of the target function, as well as desired values.
3. Minimizing the cross entropy of the network with respect to the target values. Consider learning a probabilistic function, such as predicting whether a loan applicant will pay back a loan based on attributes such as the applicant's age and bank balance. Although the training examples exhibit only boolean target values, the underlying target function might be best modeled by outputting the probability that the given applicant will repay the loan, rather than attempting to output the actual 1 and 0 value for each input instance. The best probability estimates are given by the network that minimizes the cross entropy, defined as
−Σd∈D td log od + (1 − td) log(1 − od)
where od is the probability estimate output by the network for training example d, and td is the boolean target value for d.
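The cross-entropy error above can be sketched directly (illustrative names; written as the negated sum so that smaller values are better):

```python
import math

# Cross entropy: -Σ_d [t_d log o_d + (1 - t_d) log(1 - o_d)],
# for boolean targets t_d and network output probabilities o_d in (0, 1).
def cross_entropy(targets, outputs):
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(targets, outputs))

# A confident correct prediction costs less than an uncertain one.
print(cross_entropy([1, 0], [0.9, 0.1]) < cross_entropy([1, 0], [0.6, 0.4]))  # → True
```

Minimizing this quantity rewards output probabilities that concentrate on the observed boolean outcomes, which is why it suits the loan-repayment style of probabilistic target described above.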
Alternative Error Minimization Procedures
While gradient descent is one of the most general search methods for finding a
hypothesis to minimize the error function, it is not always the most efficient. It
is not uncommon for BP to require tens of thousands of iterations through the
weight update loop when training complex networks. For this reason, a number
of alternative weight optimization algorithms have been proposed and
explored.
One optimization method, known as line search, involves a different approach to choosing the distance for the weight update. Once a line is chosen that specifies the direction of the update, the update distance is chosen by finding the minimum of the error function along this line.
A second method, which builds on the idea of line search, is called the conjugate gradient method. Here, a sequence of line searches is performed to search for a minimum in the error surface. On the first step in this sequence, the direction chosen is the negative of the gradient. On each subsequent step, a new direction is chosen so that the component of the error gradient that has just been made zero remains zero.
Recurrent Networks
Dynamically Modifying Network Structure:
A variety of methods have been proposed to dynamically grow or shrink
the number of network units and interconnections in an attempt to
improve generalization accuracy and training efficiency.
• One idea is to begin with a network containing no hidden units, then grow the network as needed by adding hidden units until the training error is reduced to some acceptable level. The CASCADE-CORRELATION algorithm is one such algorithm.
• A second idea for dynamically altering network structure is to take the opposite approach. Instead of beginning with the simplest possible network and adding complexity, we begin with a complex network and prune it as we find that certain connections are inessential. One way to decide whether a particular weight is inessential is to see whether its value is close to zero. A second way, which appears to be more successful in practice, is to consider the effect that a small variation in the weight has on the error E. The effect on E of varying w can be taken as a measure of the salience of the connection.