Overview of Madaline Neural Networks
Overview of Madaline Neural Networks
Fundamental developments in feedfonvard artificial neural net- Widrow devised a reinforcement learning algorithm
works from the past thirty years are reviewed. The central theme of called “punish/reward” or ”bootstrapping” [IO], [ I l l in the
this paper is a description of the history, origination, operating
characteristics, and basic theory of several supervised neural net- mid-1960s.This can be used to solve problems when uncer-
work training algorithms including the Perceptron rule, the LMS tainty about the error signal causes supervised training
algorithm, three Madaline rules, and the backpropagation tech- methods t o be impractical. A related reinforcement learn-
nique. These methods were developed independently, but with ing approach was later explored in a classic paper by Barto,
the perspective of history they can a / / be related to each other. The
Sutton, and Anderson o n the “credit assignment” problem
concept underlying these algorithms is the “minimal disturbance
principle,” which suggests that during training it is advisable to [12]. Barto et al.’s technique is also somewhat reminiscent
inject new information into a network in a manner that disturbs of Albus’s adaptive CMAC, a distributed table-look-up sys-
stored information to the smallest extent possible. tem based o n models of human memory [13], [14].
In the 1970s Grossberg developed his Adaptive Reso-
I. INTRODUCTION nance Theory (ART), a number of novel hypotheses about
the underlying principles governing biological neural sys-
This year marks the 30th anniversary of the Perceptron tems [15]. These ideas served as the basis for later work by
rule and the L M S algorithm, two early rules for training Carpenter and Grossberg involving three classes of ART
adaptive elements. Both algorithms were first published in architectures: ART 1 [16], ART 2 [17], and ART 3 [18]. These
1960. In the years following these discoveries, many new are self-organizing neural implementations of pattern clus-
techniques have been developed in the field of neural net- tering algorithms. Other important theory on self-organiz-
works, and the discipline is growing rapidly. One early ing systems was pioneered by Kohonen with his work o n
development was Steinbuch’s Learning Matrix [I], a pattern feature maps [19], [201.
recognition machine based o n linear discriminant func- In the early 1980s, Hopfield and others introduced outer
tions. At the same time, Widrow and his students devised product rules as well as equivalent approaches based o n
Madaline Rule I (MRI), the earliest popular learning rule for the early work of Hebb [21] for training a class of recurrent
neural networks with multiple adaptive elements [2]. Other (signal feedback) networks now called Hopfield models [22],
early work included the “mode-seeking” technique of [23]. More recently, Kosko extended some of the ideas of
Stark, Okajima, and Whipple [3]. This was probably the first Hopfield and Grossberg t o develop his adaptive Bidirec-
example of competitive learning in the literature, though tional Associative Memory (BAM) [24], a network model
it could be argued that earlierwork by Rosenblatt on “spon- employing differential as well as Hebbian and competitive
taneous learning” [4], [5] deserves this distinction. Further learning laws. Other significant models from the past de-
pioneering work o n competitive learning and self-organi- cade include probabilistic ones such as Hinton, Sejnowski,
zation was performed in the 1970s by von der Malsburg [6] and Ackley‘s Boltzmann Machine [25], [26] which, to over-
and Grossberg [7l. Fukushima explored related ideas with simplify, is a Hopfield model that settles into solutions by
his biologically inspired Cognitron and Neocognitron a simulated annealing process governed by Boltzmann sta-
models [8], [9]. tistics. The Boltzmann Machine i s trained by a clever two-
phase Hebbian-based technique.
Manuscript received September 12,1989; revised April 13,1990. While these developments were taking place, adaptive
This work was sponsored by SDI0 Innovative Science and Tech- systems research at Stanford traveled an independent path.
nologyofficeand managed by ONR under contract no. N00014-86-
K-0718, by the Dept. of the Army Belvoir RD&E Center under con- After devising their Madaline I rule, Widrow and his stu-
tracts no. DAAK70-87-P-3134andno. DAAK-70-89-K-0001,by a grant dents developed uses for the Adaline and Madaline. Early
from the Lockheed Missiles and Space Co., by NASA under con- applications included, among others, speech and pattern
tract no. NCA2-389, and by Rome Air Development Center under recognition [27], weather forecasting [28], and adaptive con-
contract no. F30602-88-D-0025,subcontract no. E-21-T22-S1. trols [29]. Work then switched to adaptive filtering and
The authors are with the Information Systems Laboratory,
Department of Electrical Engineering, Stanford University, Stan- adaptive signal processing [30] after attempts t o develop
ford, CA 94305-4055, USA. learning rules for networks with multiple adaptive layers
IEEE Log Number 9038824. were unsuccessful. Adaptive signal processing proved t o
1
many orders of magnitude faster than current estimates of Input
with various ambiguities must be designed into the soft- Fig. 1. Adaptive linear combiner.
ware, even a relatively simple problem can quickly become
unmanageable. The slow progress over the past 25 years or
so in machinevision and otherareasofartificial intelligence continuous analog values or binary values. The weights are
i s testament to the difficulties associated with solving essentially continuously variable, and can take on negative
ambiguous and computationally intensive problems on von as well as positive values.
Neumann computers and related architectures. During the training process, input patterns and corre-
Thus, there i s some reason t o consider attacking certain sponding desired responses are presented to the linear
problems by designing naturally parallel computers, which combiner. An adaptation algorithm automatically adjusts
process information and learn by principles borrowed from the weights so that the output responses to the input pat-
the nervous systems of biological creatures. This does not terns will be as close as possible to their respective desired
necessarily mean we should attempt to copy the brain part reponses. In signal processing applications, the most pop-
for part. Although the bird served to inspire development ular method for adapting the weights is the simple LMS
of the airplane, birds do not have propellers, and airplanes (least mean square) algorithm [58], [59], often called the
do not operate by flapping feathered wings. The primary Widrow-Hoff delta rule [42]. This algorithm minimizes the
parallel between biological nervous systems and artificial sum of squares of the linear errors over the training set. The
neural networks is that each typically consists of a large linear error t k i s defined to be the difference between the
number of simple elements that learn and are able to col- desired response dk and the linear output sk, during pre-
lectively solve complicated and ambiguous problems. sentation k . Having this error signal is necessary for adapt-
Today, most artificial neural network research and appli- ing the weights. When the adaptive linear combiner i s
cation is accomplished by simulating networks on serial embedded in a multi-element neural network, however, an
computers. Speed limitations keep such networks rela- error signal i s often notdirectlyavailableforeach individual
tively small, but even with small networks some surpris- linear combiner and more complicated procedures must
ingly difficult problems have been tackled. Networks with be devised for adapting the weight vectors. These proce-
fewer than 150 neural elements have been used success- dures are the main focus of this paper.
fully in vehicular control simulations [50], speech genera-
tion [51], [52], and undersea mine detection [49]. Small net- B. A Linear Classifier-The Single Threshold Element
works have also been used successfully in airport explosive The basic building block used in many neural networks
detection [53], expert systems [54], [55], and scores of other is the "adaptive linear element," or Adaline3 [58] (Fig. 2).
applications. Furthermore, efforts to develop parallel neural This i s an adaptive threshold logic element. It consists of
network hardware are meeting with some success, and such an adaptive linear combiner cascaded with a hard-limiting
hardware should be available in the future for attacking quantizer, which is used t o produce a binary 1 output,
more difficult problems, such as speech recognition [56], Yk = sgn (sk). The bias weight wokwhich i s connected t o a
[57l. constant input xo = + I , effectively controls the threshold
Whether implemented in parallel hardware or simulated level of the quantizer.
on a computer, all neural networks consist of a collection In single-element neural networks, an adaptivealgorithm
of simple elements that work together to solve problems. (such as the LMS algorithm, or the Perceptron rule) i s often
A basic building block of nearly all artificial neural net- used to adjust the weights of the Adaline so that it responds
works, and most other adaptive systems, is the adaptive lin- correctly to as many patterns as possible in a training set
ear combiner. that has binary desired responses. Once the weights are
adjusted, the responses of the trained element can be tested
by applying various input patterns. If the Adaline responds
A. The Adaptive Linear Combiner
correctly with high probability to input patterns that were
The adaptive linear combiner i s diagrammed in Fig. 1. Its not included in the training set, it i s said that generalization
output i s a linear combination of i t s inputs. I n a digital has taken place. Learning and generalization are among the
implementation, this element receives at time k an input most useful attributes of Adalines and neural networks.
signal vector or input pattern vector X k = [x,, x l t , xzk, Linear Separability: With n binary inputs and one binary
. . , x,]' and a desired response dk,a special input used
1
to effect learning. The components of the input vector are 31n the neural network literature, such elements are often
weighted by a set of coefficients, the weight vector Wk = referred to as "adaptive neurons." However, in a conversation
between David Hubel of Harvard Medical School and Bernard Wid-
[wok,wlk,wZt, * . . , w,~]'. The sum of the weighted inputs row, Dr. Hubel pointed out that the Adaline differs from the bio-
is then computed, producing a linear output, the inner logical neuron in that it contains not only the neural cell body, but
product sk = XLWk. The components of X k may be either also the input synapses and a mechanism for training them.
Linear
therefore
output
w1 x, - -.WO
Binary x2= --
output w2 w2
(+L-11
Figure 4 graphs this linear relation, which comprises a
separating line having slope and intercept given by
W
slope = -2
w2
' - _ - - - _ _ - - _ _ _ - - _ _ _ _ _ _ - - I
intercept = -3. (3)
'k-1- w2
Desired Response Input
(training signal) The three weights determine slope, intercept, and the side
Fig. 2. Adaptive linear element (Adaline). of the separating line that corresponds t o a positive output.
The opposite side of the separating line corresponds t o a
negative output. For Adalines with four weights, the sep-
output, a single Adaline of the type shown in Fig. 2 is capa- arating boundary is a plane; with more than four weights,
ble of implementing certain logic functions. There are 2" the boundary i s a hyperplane. Note that if the bias weight
possible input patterns. A general logic implementation i s zero, the separating hyperplane will be homogeneous-
would be capable of classifying each pattern as either + I it will pass through the origin i n pattern space.
or -1, in accord with the desired response. Thus, there are As sketched in Fig. 4, the binary input patterns are clas-
22' possible logic functions connecting n inputs t o a single sified as follows:
binary output. A single Adaline is capable of realizing only
asmall subset of thesefunctions, known as the linearlysep- (+I, +I) + +I
arable logic functions or threshold logic functions [60]. ( + I , -1) + +I
These are the set of logic functions that can be obtained
with all possible weight variations. (-1, -1) -+ +I
Figure3 shows atwo-input Adalineelement. Figure4 rep- (-1, + I ) + -1 (4)
resents all possible binary inputs to this element with four
large dots in pattern vector space. I n this space, the com- This is an example of a linearly separable function. An
ponentsof the input pattern vector liealongthecoordinate example of a function which i s not linearly separable is the
axes. The Adaline separates input patterns into two cate- two-input exclusive NOR function:
gories, depending o n the values of the weights. A critical ( + I , +I) + +I
(+I, -1) -+ -1
Xok= +1
(-1, -1) + +I
(-1, +I) + -1 (5)
Nosinglestraight lineexiststhat can achievethisseparation
of the input patterns; thus, without preprocessing, no sin-
'k
gle Adaline can implement the exclusive NOR function.
With two inputs, a single Adaline can realize 14 of the 16
Fig. 3. Two-input Adaline. possible logic functions. With many inputs, however, only
a small fraction of all possible logic functions i s realizable,
that is, linearly separable. Combinations of elements or net-
Separating Line
works of elements can be used t o realize functions that are
xz = 3 x , - "0 not linearly separable.
w2 WZ
Capacity o f Linear C/assifiers:The number of training pat-
terns or stimuli that an Adalinecan learn tocorrectlyclassify
i s an important issue. Each pattern and desired output com-
bination represents an inequalityconstraint o n the weights.
It i s possible to have inconsistencies in sets of simultaneous
inequalities just as with simultaneous equalities. When the
inequalities (that is, the patterns) are determined at ran-
dom, the number that can be picked before an inconsis-
tency arises i s a matter of chance.
In their 1964 dissertations [61], [62], T. M. Cover and R. J.
Brown both showed that the average number of random
Fig. 4. Separating line in pattern space. patterns with random binary desired responses that can be
Input
Pattern
VeCtOl
-
X Binary
Np/Nw--Ratio of Input Patterns to Weights Xl output
Fig. 5. Probability that an Adaline can separate a training Y
pattern set as a function of the ratio NJN,.
(+1,-1)
& py
Fig. 7. Elliptical separating boundary for realizing a func-
tion which i s not linearly separable. Input
Pattern
X
xiT--,
, output
The polynomial approach offers great simplicity and
beauty.Through it onecan realizeawidevarietyofadaptive
nonlinear discriminant functions by adapting only a single
Adaline element. Several methods have been developed for
training the polynomial discriminant function. Specht
developed a very efficient noniterative (that is, single pass x1
through the training set) training procedure: the polyno- Fig. 8. Two-Adaline form of Madaline.
mial discriminant method (PDM), which allows the poly-
nomial discriminant function t o implement a nonpara-
metric classifier based on the Bayes decision rule. Other lines are connected to an A N D logic device to provide an
methods for training the system include iterative error-cor- output.
rection rules such as the Perceptron and a-LMS rules, and With weights suitably chosen, the separating boundary
iterative gradient-descent procedures such as the w-LMS in pattern space for the system of Fig. 8 would be as shown
and SER (also called RLS) algorithms [30]. Gradient descent in Fig. 9. This separating boundary implements the exclu-
with a single adaptive element is typically much faster than sive NOR function of (5).
with a layered neural network. Furthermore, as we shall see,
when the single Adaline is trained by a gradient descent
procedure, it will converge to a unique global solution.
Separating
Lines ,\
After the polynomial discriminant function has been
trained byagradient-descent procedure, theweights of the
Adaline will represent an approximation to the coefficients
in a multidimensional Taylor series expansion of thedesired
response function. Likewise, if appropriate trigonometric
terms are used in place of the polynomial preprocessor, the
Adaline's weight solution will approximate the terms in the
(truncated) multidimensional Fourier series decomposi-
tion of a periodic version of the desired response function.
o u t p u t = +1
The choice of preprocessing functions determines how well
a network will generalize for patterns outside the training
set. Determining "good" functions remains a focus of cur-
rent research [73], [74]. Experience seems to indicate that Fig. 9. Separating lines for Madaline of Fig. 8.
unless the nonlinearities are chosen with care to suit the
problem at hand, often better generalization can be
obtained from networks with more than one adaptive layer. Madalines were constructed with many more inputs, with
In fact,onecan view multilayer networks assingle-layer net- many more Adaline elements in the first layer, and with var-
works with trainable preprocessors which are essentially ious fixed logic devices such as AND, OR, and majority-vote-
self-optimizing. taker elements in the second layer. Those three functions
(Fig. IO) are all threshold logic functions. The given weight
Madaline I valueswill implement these threefunctions, but theweight
choices are not unique.
One of the earliest trainable layered neural networks with
multiple adaptive elements was the Madaline I structure of
Feedforward Networks
Widrow [2] and Hoff (751. Mathematical analyses of Mada-
line I were developed in the Ph.D. theses of Ridgway [76], The Madalines of the 1960s had adaptive first layers and
Hoff [75], and Glanz [77]. I n the early 1960s, a 1000-weight fixed threshold functions in the second (output) layers [76],
~ ~~
A great deal of theoretical and experimental work has whereK,and &are positive numberswhich aresmall terms
been directed toward determining the capacity of both if the network i s large with few outputs relative to the num-
Adalines and Hopfield networks [87]-[90]. Somewhat less ber of inputs and hidden elements.
theoretical work has been focused o n the pattern capacity It is easy t o show that Eq. (8) also bounds the number of
of multilayer feedforward networks, though some knowl- weights needed to ensure that N, patterns can be learned
edge exists about the capacity of two-layer networks. Such with probability 1/2, except in this case the lower bound o n
results are of particular interest because the two-layer net- +
N, becomes: (N,N, - .1)/(1 log, (N,)). It follows that Eq.
work is surprisingly powerful. With a sufficient number of (9) also serves t o bound the statistical capacity C, of a two-
hidden elements, a signum network with two layers can layer signum network.
implement any Boolean function.’ Equally impressive is the It is interesting to note that the capacity bounds (9)
power of the two-layer sigmoid network. Given a sufficient encompass the deterministic capacity for the single-layer
number of hidden Adaline elements, such networks can networkcomprisinga bankof N,Adalines. In thiscaseeach
implement any continuous input-output mapping to arbi- Adaline would have N,/N, weights, so the system would
trary accuracy [92]-[94]. Although two-layer networks are have a deterministic pattern capacity of NN /, ., AS N,
quite powerful, it i s likely that some problems can be solved becomes large, the statistical capacity also approaches
more efficiently by networks with more than two layers. N,/N, (for N, finite). Until further theory o n feedforward
Nonfinite-order predicate mappings (such as the connect- network capacity is developed, it seems reasonable to use
edness problem [95]) can often be computed by small net- the capacity results from the single-layer network t o esti-
works using signal feedback [96]. mate that of multilayer networks.
In the mid-I960s, Cover studied the capacity of a feed- Little i s known about the number of binary patterns that
forward signum networkwith an arbitrary number of layersg layered sigmoid networks can learn to classify correctly.
and a single output element [61], [97. He determined a lower The pattern capacityof sigmoid networks cannot be smaller
bound o n the minimum number of weights N, needed to than that of signum networks of equal size, however,
enable such a network t o realize any Boolean function because as the weights of a sigmoid network grow toward
defined over an arbitrary set of Nppatterns in general posi- infinity, it becomes equivalent t o a signum network with
tion. Recently, Baum extended Cover’s result to multi-out- aweight vector in the same direction. Insight relating to the
put networks, and also used a construction argument to capabilities and operating principles of sigmoid networks
find corresponding upper bounds for the special case of can be winnowed from the literature [99]-[loll.
thetwo-layer signum network[98l.Consideratwo-layerfully A network’s capacity i s of little utility unless it i s accom-
connected feedforward network of signum Adalines that panied by useful generalizations to patterns not presented
has Nxinput components (excluding the bias inputs) and during training. In fact, if generalization is not needed, we
N,output components. If this network is required to learn can simply store the associations in a look-up table, and will
to map any set containing Np patterns that are in general have little need for a neural network. The relationship
position to any set of binary desired response vectors (with between generalization and pattern capacity represents a
N, components), it follows from Baum’s results” that the fundamental trade-off in neural network applications:
minimum requisite number of weights N,can be bounded the Adaline’s inability t o realize all functions i s i n a sense
by a strength rather than the fatal flaw envisioned by some crit-
ics of neural networks [95], because it helps limit the capac-
NYNP
1 + l0g,(Np)
5 N, < N
(”- +
N x 1
1 (N, + N, + 1) + N,.
(8)
ity of the device and thereby improves its ability t o gen-
eralize.
For good generalization, the training set should contain
a number of patterns at least several times larger than the
From Eq. (8), it can be shown that for a two-layer feedfor-
network‘s capacity (i.e., Np >> N,IN,). This can be under-
ward networkwith several times as many inputs and hidden
stood intuitively by noting that if the number of degrees of
elements as outputs (say, at least 5 times as many), the deter-
freedom in a network (i.e., N), i s larger than the number
ministic pattern capacity is bounded below by something
of constraints associated with the desired response func-
slightly smaller than N,/N,. It also follows from Eq. (8) that
tion (i.e., N,N,), the training procedure will be unable to
the pattern capacityof any feedforward network with a large
completely constrain the weights in the network. Appar-
ratio of weights to outputs (that is, N,IN, at least several
ently, this allows effects of initial weight conditions t o inter-
thousand) can be bounded above by a number of some-
fere with learned information and degrade the trained net-
what larger than (N,/Ny) log, (Nw/Ny).Thus, the determin-
istic pattern capacity C, of a two-layer network can be work’s ability to generalize. A detailed analysis of
generalization performance of signum networks as a func-
bounded by
tion of training ” set size i s described in 11021.
-
Nw
N,
N,
- K, IC, 5 - log,
N,
(%) + K2 (9)
A Nonlinear Classifier Application
‘This can be seen by noting that any Boolean function can be
written in the sum-of-products form [91], and that such an expres-
Neural networks have been used successfully in a wide
sion can be realized with a two-laver network bv usingthe first-laver range of applications. To gain Some insight about how
Adalines to implement AND gates, while using t h g second-layer neural networks are trained and what they can be used t o
Adalines to implement OR gates. compute, it is instructive t o consider Sejnowski and Rosen-
’Actually, the network can bean arbitrary feedforward structure berg,s 1986 NETtalk demonstration [521. With the
and need not be layered.
‘ q h e uDDer bound used here isB ~loose~bound:~minimum ’ ~ exception of work on the traveling salesman problem with
number iibden nodes 5 N, rNJN,1 < N,(NJN, + 1). Hopfield networks [103], this was the first neural network
~
The a-LMS Algorithm: The a-LMS algorithm or Widrow- X = input pattern vector
Hoff delta rule applied to the adaptation of a single Adaline A ~
(Fig. 2) embodies the minimal disturbance principle. The W = next weight vector
weight update equation for the original form of the algo-
rithm can be written as
-Awk = weight vector
change
€k dk - w,'x,. (11) selects A w k t o be collinear with Xk, the desired error cor-
Changing the weights yields a corresponding change in the rection is achieved with a weight change of the smallest
error: possible magnitude. When adapting to respond properly
to a new input pattern, the responses to previous training
AEk = A(dk - W&) = -xiAwk. (12) patterns are therefore minimally disturbed, on the average.
I n accordance with the a-LMS rule of Eq. (IO), the weight The a-LMS algorithm corrects error, and if all input pat-
change i s as follows: terns are all of equal length, it minimizes mean-square error
[30]. The algorithm i s best known for this property.
(13)
B. Nonlinear Rules
Combining Eqs. (12) and (13), we obtain The a-LMS algorithm is a linear rule that makes error cor-
rections that are proportional to the error. It i s known [I051
that in some cases this linear rule may fail t o separate train-
ing patterns that are linearly separable. Where this creates
difficulties, nonlinear rules may be used. I n the next sec-
Therefore, theerror i s reduced byafactorof aastheweights tions,wedescribeearlynonlinear rules,which weredevised
are changed while holding the input pattern fixed. Pre-
by Rosenblatt [106], [5] and Mays [IOS]. These nonlinear rules
senting a new input pattern starts the next adaptation cycle.
also make weight vector changes collinear with the input
The next error is then reduced by a factor of cy, and the pro-
pattern vector (the direction which causes minimal dis-
cess continues. The initial weight vector is usually chosen
turbance), changes that are based on the linear error but
to be zero and is adapted until convergence. In nonsta-
are not directly proportional to it.
tionary environments, the weights are generally adapted
The Perceptron Learning Rule: The Rosenblatt a-Percep-
continually.
tron [106], [ 5 ] ,diagrammed in Fig. 13, processed input pat-
The choice of a controls stability and speed of conver-
gence [30]. For input pattern vectors independent over time, Fixed Random
stability i s ensured for most practical purposes if Inputs lo
Adaptive
x 1 Element
o<cy<2. (15)
Making a greater than 1 generally does not make sense,
Analog-
since the error would be overcorrected. Total error cor- Valued
Retina
rection comes with a = 1. A practical range for a is Input
Patterns
0.1 < a < 1.0. (16)
This algorithm i s self-normalizing in the sense that the
choice of a does not depend on the magnitude of the input
signals. The weight update i s collinear with the input pat-
tern and of a magnitude inversely proportional to IXk)2.With I \ Desired Response
(+1,-11
Element
binary *I inputs, IXkl2 is equal to the number of weights Sparse Random Fixed Threshold
Connections Elements
and does not vary from pattern to pattern. If the binary
Fig. 13. Rosenblatt's a-Perceptron.
inputs are the usual 1 and 0, no adaptation occurs for
weights with 0 inputs, while with *I inputs, all weights are
adapted each cycle and convergence tends to be faster. For terns with a first layer of sparse randomly connected fixed
this reason, the symmetric inputs +I and -1 are generally logic devices. The outputs of the fixed first layer fed a sec-
preferred. ond layer, which consisted of a single adaptive linear
Figure12 providesageometrical pictureof howthea-LMS threshold element. Other than the convention that i t s input
rule works. I n accord with Eq. (13), wk+,equals wk added signals were {I, 0 } binary, and that no bias weight was
t o AWk, and AWk i s parallel with the input pattern vector included, this element is equivalentto the Adaline element.
xk. From Eq. (12),the change in error is equal t o the negative The learning rule for the a-Perceptron is very similarto LMS,
dot product of x k and A",. Since the cy-LMS algorithm but its behavior i s in fact quite different.
d, [+L.ll
Desired Respanse Input
one presentation. If thetraining set is separable, thisvariant
has all the characteristics of the fixed-increment version
(training signal)
with a set to 1, except that it usually reaches a solution i n
Fig. 14. The adaptive threshold element of the Perceptron. fewer presentations.
Mays's Algorithms: I n his Ph.D. thesis [105], Mays
described an "increment adaptation" rule14and a "modi-
use of the "quantizer error" z k , defined to be the difference fied relaxation adaptation" rule. The fixed-increment ver-
between the desired response and the output of the quan-
sion of the Perceptron rule i s a special case of the increment
tizer adaptation rule.
zk d k - Yk. (17) lncreinent adaptation i n i t s general form involves the use
of a "dead zone" for the linear output s k , equal t o ky about
The Perceptron rule, sometimes called the Perceptron zero. All desired responses are +I (refer t o Fig. 14). If the
convergence procedure, does not adapt the weights if the linear output s k falls outside the dead zone ( 1 s k ( 2 y), adap-
output decision Y k i s correct, that is, if z k = 0. If the output tation follows a normalized variant of the fixed-increment
decision disagrees with the binary desired response d k , Perceptron rule (with a / ( X k I 2used i n place of a).If the linear
however, adaptation i s effected by adding the input vector output falls within the dead zone, whether or not the output
to the weight vector when the error z k i s positive, or sub- response y k is correct, the weights are adapted by the nor-
tracting the input vector from the weight vector when the malized variant of the Perceptron rule as though the output
error & i s negative. Thus, half the product of the input vec- response Y k had been incorrect. The weight update rule for
tor and the quantizer error gk i s added to the weight vector. Mays's increment adaptation algorithm can be written
The Perceptron rule i s identical t o the a-LMS algorithm, mathematically as
except that with the Perceptron rule, half of the quantizer
error &/2 is used in place of the normalized linear error E k /
I&)' of the ct-LMS rule. The Perceptron rule i s nonlinear,
in contrast to the LMS rule, which i s linear (compare Figs.
2 and 14). Nonetheless, the Perceptron rule can be written
in a form very similar to the a-LMS rule of Eq. (IO):
where F k i s the quantizer error of Eq. (17).
w k + , = w k + ff ' X k .
2
(18) With the dead zone y = 0, Mays's increment adaptation
algorithm reduces t o a normalized version of the Percep-
Rosenblatt normally set a to one. In contrast t o a-LMS,
12Thisresults because the length of the weight vector decreases
thechoiceof ctdoesnotaffectthestabilityof theperceptron with each adaptation that does not cause the linear output sk to
algorithm, and it affects convergence time only if the initial change sign and assume a magnitude greater than that before
weight vector i s nonzero. Also, while a-LMS can be used adaptation. Although there are exceptions, for most problems this
with either analog or binary desired responses, Rosen- situation occursonly rarely if theweight vector is much longer than
the weight increment vector.
blatt's rule can be used only with binary desired responses.
13Theterms "fixed-increment" and "absolute correction" are due
The Perceptron rule stops adapting when the training to Nilsson [46]. Rosenblatt referred to methods of these types,
patterns are correctly separated. There is no restraining respectively, as quantized and nonquantized learning rules.
force controlling the magnitude of the weights, however. 14Theincrement adaptation rule was proposed by others before
The direction of the weight vector, not its magnitude, deter- Mays, though from a different perspective [107].
i"
ity element (MAJ). Because the second-layer logic element
if Fk = o and [Ski 2 y
is fixed and known, it i s possible t o determine which first-
wk+l = (20) layer Adalines can be adapted to correct an output error.
xk
wk + c q 7 otherwise The Adalines in the first layer assist each other in solving
IXkl
problems by automatic load-sharing.
where zk is the quantizer error of Eq. (17). One procedurefortrainingthe network in Fig. 15follows.
If the dead zone y is set t o 00, this algorithm reduces to A pattern i s presented, and if the output response of the
the a-LMS algorithm (IO).Mays showed that, for dead zone majority element matches the desired response, no adap-
0 < y < 1 and learning rate 0 < a 5 2, this algorithm will tation takes place. However, if, for instance, the desired
converge and separate any linearly separable input set in response i s +I and three of the five Adalines read -1 for
a finite number of steps. If the training set is not linearly agiven input pattern,oneof the latterthreemust beadapted
separable, this algorithm performs much like Mays's incre- to the +I state. The element that i s adapted by MRI i s the
ment adaptation rule. onewhose linearoutputsk isclosesttozero-theonewhose
Mays's two algorithms achieve similar pattern separation analog response i s closest t o the desired response. If more
results. The choice of a does not affect stability, although of the Adalines were originally in the -1 state, enough of
it does affect convergence time. The two rules differ i n their them are adapted to the +I state to make the majority deci-
convergence properties but there i s no consensus on which sion equal + I . The elements adapted are those whose lin-
i s the better algorithm. Algorithms like these can be quite ear outputs are closest to zero. A similar procedure i s fol-
useful, and we believe that there are many more t o be lowed when the desired response i s -1. When adapting a
invented and analyzed. given element, the weight vector can be moved in the LMS
The a-LMS algorithm, the Perceptron procedure, and direction far enough to reverse the Adaline's output (abso-
Mays's algorithms can all be used for adapting the single lute correction, or "fast" learning), or it can be adapted by
Adaline element or they can be incorporated into proce- the small increment determined by the a-LMS algorithm
dures for adapting networks of such elements. Multilayer (statistical, or "slow" learning). The one desired response
network adaptation procedures that use some of these d k i s used for all Adalines that are adapted. The procedure
algorithms are discussed in the following. can also be modified toallow oneof Mays'srulesto be used.
In that event, for the case we have considered (majority out-
V. ERROR-CORRECTION
RULES-MULTI-ELEMENTNETWORKS put element), adaptations take place if at least half of the
Adalines either have outputs differing from the desired
The algorithms discussed next are the Widrow-Hoff
responseor haveanalog outputswhich are in thedead zone.
Madaline rule from the early 1960s, now called Madaline
By setting the dead zone of Mays's increment adaptation
Rule I (MRI),and MadalineRule II(MRll),developed byWid-
rule to zero, the weights can also be adapted by Rosen-
row and Winter in 1987.
blatt's Perceptron rule.
Differences in initial conditions and the results of sub-
A. Madaline Rule I
sequent adaptation cause the various elements to take
The M R I rule allows the adaptation of a first layer of hard- "responsibility" for certain parts of the training problem.
limited (signum) Adaline elements whose outputs provide The basic principle of load sharing i s summarized thus:
inputs t o a second layer, consisting of a single fixed-thresh- Assign responsibility to the Adaline or Adalines that can
old-logic element which may be, for example, the OR gate, most easily assume it.
~-
~
the conditions under which the additional complexity is Defining the vector P as the crosscorrelation between the
warranted are not generally known. The discussion that fol- desired response (a scalar) and the X-vector” then yields
lows i s restricted t o minimization of MSE by the method of
steepest descent [116], [117]. More sophisticated learning
procedures usuallyrequiremanyofthesamecomputations The input correlation matrix R i s defined in terms of the
used in the basic steepest-descent procedure. ensemble average
Adaptation of a network by steepest-descent starts with
an arbitrary initial value WOfor the system’s weight vector. R P E[XkXL]
The gradient of the MSE function i s measured and the
weight vector i s altered in the direction corresponding t o Xlk
...
the negative of the measured gradient. This procedure i s
XlkXlk
...
repeated, causing the M S E t o be successively reduced o n
average and causing the weight vector t o approach a locally
optimal value.
XnkXlk
The method of steepest descent can be described by the
relation This matrix i s real, symmetric, and positive definite, or in
rare cases, positive semi-definite. The MSE [k can thus be
wk+l = wk + +Vk) (21) expressed as
where p i s a parameter that controls stability and rate of
convergence, and Vk i s the value of the gradient at a point
on the M S E surface corresponding t o W = w k . = €[di] - 2PTWk + WLRWk. (27)
To begin, we derive rules for steepest-descent minimi- Note that the MSE is a quadratic function of the weights.
zation of the MSE associated with a single Adaline element. It i s a convex hyperparaboloidal surface, a function that
These rules are then generalized t o apply t o full-blown never goes negative. Figure 17 shows a typical MSE surface
neural networks. Like error-correction rules, the most prac-
tical and efficient steepest-descent rules typicallyworkwith
one pattern at a time. They minimize mean-square error,
approximately, averaged over the entire set of training pat-
terns.
A. Linear Rules
Steepest-descent rules for the single threshold element
are said t o be linear if weight changes are proportional t o
the linear error, the difference between the desired
response dk and the linear output of the element sk.
Mean-SquareError Surface o f the Linear Combiner: In this
section we demonstrate that the M S E surface of the linear
combiner of Fig. 1 is a quadratic function of the weights,
and thus easily traversed by gradient descent. Fig. 17. Typical mean-square-errorsurface of a linear com-
Let the input pattern Xk and the associated desired biner.
response dk be drawn from a statistically stationary pop-
ulation. During adaptation, the weight vector varies so that for a linear combiner with two weights. The position of a
even with stationary inputs, the output sk and error ek will point o n the grid in this figure represents the value of the
generally be nonstationary. Care must be taken in defining Adaline’s two weights. The height of the surface at each
the M S E since it is time-varying. The only possibility i s an point represents M S E over the training set when the Ada-
ensemble average, defined below. line’sweightsarefixed atthevaluesassociatedwith thegrid
At the k t h iteration, let theweight vector be wk. Squaring point. Adjusting theweights involvesdescending along this
and expanding Eq. (11) yields surface toward the unique minimum point (“the bottom of
the bowl”) by the method of steepest descent.
€: = (dk - XLWk)’ (22) The gradient Vk of the M S E function with W = wk i s
obtained by differentiating Eq. (27):
= d i - 2dkxIwk + W~xkX~Wk. (23)
E[E;]w= wk = f [ d i l - 2E[dkXi]Wk
-
This i s a linear function of the weights. The optimal weight mean to W * , the optimal Wiener solution discussed above.
vector W * , generally called the Wiener weight vector, i s A proof of this can be found in [30].
obtained from Eq. (28) by setting the gradient to zero: In the p-LMS algorithm, and other iterative steepest-
descent procedures, use of the instantaneous gradient i s
W * = R-’P. (29)
perfectly justified if the step size i s small. For small p , Wwill
This i s a matrix form of the Wiener-Hopf equation [118]- remain essentially constant over a relatively small number
[120]. In the next section we examine p-LMS, an algorithm of training presentationsK. The total weight change during
which enables us to obtain an accurate estimateof W * with- this period will be proportional to
out first computing R - ’ and P.
Thep-LMSA1gorithm:The p-LMS algorithm works by per-
forming approximate steepest descent on the M S E surface
in weight space. Because it is a quadratic function of the
weights, this surface is convex and has a unique (global)
minimum.” An instantaneous gradient based upon the
square of the instantaneous linear error is
(35)
“-‘=I
- ae2
- aw, i weights follow the true gradient. It i s shown in [30] that the
instantaneous gradient i s an unbiased estimate of the true
gradient.
Comparison of p-LMS and a-LMS: We have now pre-
sented two forms of the LMS algorithm, a-LMS (IO) in Sec-
tion IV-A and p-LMS (33) in the last section. They are very
L M S works by using this crude gradient estimate in place similar algorithms, both using the LMS instantaneous gra-
of the true gradient v k of Eq. (28). Making this replacement dient. a-LMS is self-normalizing, with the parameter a
into Eq. (21) yields determining the fraction of the instantaneous error to be
corrected with each adaptation. p-LMS is a constant-coef-
ficient linear algorithm which i s considerably easier to ana-
lyze than a-LMS. Comparing the two, the a-LMS algorithm
The instantaneous gradient is used because it is readily
i s like thep-LMS algorithm with acontinuallyvariable learn-
available from a single data sample. The true gradient i s
ing constant. Although a-LMS is somewhat more difficult
generally difficult to obtain. Computing it would involve
to implement and analyze, it has been demonstrated exper-
averaging the instantaneous gradients associated with all
imentally to be a better algorithm than p-LMS when the
patterns in the training set. This i s usually impractical and
eigenvalues of the input autocorrelation matrix Rare highly
almost always inefficient.
disparate, giving faster convergence for a given level of gra-
Performing the differentiation in Eq. (31) and replacing
dient noise” propagated into the weights. It will be shown
the linear error by definition (11)gives
next that p-LMS has the advantage that it will always con-
verge in the mean to the minimum MSE solution, while
a-LMS may converge to a somewhat biased solution.
We begin with a-LMS of Eq. (IO):
O < p < L (34) We define a -new- training set of pattern vectors and desired
trace [RI
responses {xk, a k } by normalizing elements of the original
where trace [RI = C(diagona1elements of R) is the average
training set as f o I I o ~ s , ’ ~
signal power of the X-vectors, that is, € ( X J X ) . With p set
within this range,17 the p-LMS algorithm converges in the
Input Pattern
vector Weight Vector
is the input correlation matrix of the normalized training
set and the vector
o<a<2. (44) sk = X L W k .
(45)
= w k + 2 / . b c k sgm' (sk) x k . (54)
Algorithm (54) i s the backpropagation algorithm for the
We shall adapt this Adaline with the objective of mini- sigmoid Adaline element. The backpropagation name
mizing the mean square of the sigmoid error i k , de- makes more sense when the algorithm is utilized in a lay-
Since As i s small,
a(tanh ( s k ) )
sgm‘ (sk) =
ask
~~
Input Pattern Weight
Vector Vector
Non-Quadratic MSE
Desired Response
Fig. 24. Example MSE surface of signum error.
Fig. 21. The linear, sigmoid, and signum errors of the Ada-
line.
than minimizing the mean square of the signum error. Only
To demonstrate the nature of the square error surfaces the linear error i s guaranteed t o have an M S E surface with
associated with these three types of error, a simple exper- a unique global minimum (assuming invertible R-matrix).
imentwith a two-input Adalinewas performed. The Adaline The other M S E surfaces can have local optima [122], [123].
was driven by a typical set of input patterns and their asso- I n nonlinear neural networks, gradient methods gener-
ciated binary { +I, -1) desired responses. The sigmoid ally work better with sigmoid rather than signum nonlin-
function used was the hyperbolic tangent. The weights earities. Smooth nonlinearities are required by the M R l l l
could have been adapted t o minimize the mean-square and backpropagation techniques. Moreover, sigmoid net-
error of E , i , or E. The M S E surfaces of € [ ( E ) ~ ] , € [ ( E ) 2 ] , E [ ( : ) * ] works are capable of forming internal representations that
plotted as functions of the two weight values, are shown are more complex than simple binarycodes and, thus, these
in Figs. 22, 23, and 24, respectively. networks can often form decision regions that are more
sophisticated than those associated with similar signum
networks. In fact, if a noiseless infinite-precision sigmoid
Adaline could be constructed, it would be able t o convey
an infinite amount of information at each time step. This
i s in contrast to the maximum Shannon information capac-
ity of one bit associated with each binary element.
The signum does have some advantages over the sigmoid
in that it is easier to implement in hardware and much sim-
pler to compute o n a digital computer. Furthermore, the
outputs of signums are binary signals which can be effi-
ciently manipulated by digital computers. In a signum net-
work with binary inputs, for instance, the output of each
linear combiner can be computed without performing
weight multiplications. This involves simply adding
together the values of weights with + I inputs and sub-
Fig. 22. Example MSE surface of linear error. tracting from this the values of all weights that are con-
nected t o -1 inputs.
Sometimes a signum i s used in an Adaline t o produce
decisive output decisions. The error probability is then pro-
portional to the mean square of the output error :. To min-
imize this error probability approximately, one can easily
minimize E [ ( E ) ~ ] instead of directly minimizing [58].
However, with only a little more computation one could
minimize and typically come much closer to the
objective of minimizing €[(E)2]. The sigmoid can therefore
be used in training the weights even when the signum i s
used to form the Adaline output, as i n Fig. 21.
VII. STEEPEST-DESCENT
RULES-MULTI-ELEMENT
NETWORKS
We now study rules for steepest-descent minimization
of the M S E associated with entire networks of sigmoid Ada-
Fig. 23. Example MSE surface of sigmoid error.
line elements. Like their single-element counterparts, the
most practical and efficient steepest-descent rules for multi-
Although the above experiment i s not all encompassing, element networks typically work with one pattern presen-
we can infer from it that minimizing the mean square of the tation at a time. We will describe two steepest-descent rules
linear error is easy and minimizing the mean square of the for multi-element sigmoid networks, backpropagation and
sigmoid error i s more difficult, but typically much easier Madaline Rule Ill.
A. Backpropagation for Networks In the network example shown in Fig. 25, the sum square
error i s given by
The publication of the backpropagation technique by
Rumelhart et al. [42] has unquestionably been the most E2 = (d, - yJ2 + (d2 - y2)2
influential development in the field of neural networks dur-
ing the past decade. In retrospect, the technique seems where we now suppress the time index k for convenience.
simple. Nonetheless, largely because early neural network In its simplest form, backpropagation training begins by
research dealt almost exclusively with hard-limiting non- presenting an input pattern vector X t o the network, sweep-
linearities, the idea never occurred to neural network ing forward through the system to generate an output
researchers throughout the 1960s. response vector Y, and computing the errors at each out-
put.The next step involvessweeping theeffectsof theerrors
The basic concepts of backpropagation are easily
backward through the network t o associate a “square error
grasped. Unfortunately, these simple ideas are often
obscured by relatively intricate notation, so formal deri- derivative” 6 with each Adaline, computing a gradient from
each 6, and finally updating the weights of each Adaline
vations of the backpropagation rule are often tedious. We
based upon the corresponding gradient. A new pattern is
present an informal derivation of the algorithm and illus-
then presented and the process i s repeated. The initial
trate how it works for the simple network shown in Fig. 25.
weight values are normally set t o small random numbers.
The backpropagation technique i s a nontrivial general-
ization of the single sigmoid Adaline case of Section VI-B. The algorithm will not work properly with multilayer net-
works if the initial weights are either zero or poorlychosen
When applied t o multi-element networks, the backprop-
nonzero
agation technique adjusts the weights in the direction
We can get some idea about what i s involved in the cal-
opposite the instantaneous error gradient:
culations associated with the backpropagation algorithm
by examining the network of Fig. 25. Each of the five large
circles represents a linear combiner, as well as some asso-
ciated signal paths for error backpropagation, and the cor-
responding adaptive machinery for updating the weights.
This detail is shown in Fig. 26. The solid lines in these dia-
grams represent forward signal paths through the network,
“)
awmk
20Recently,Nguyen has discovered that a more sophisticated
choice of initial weight values in hidden layers can lead to reduced
Now, however, wk is a long rn-component vector of all
problems with local optima and dramatic increases in network
weights i n the entire network. The instantaneous sum training speed [IOO]. Experimental evidence suggests that it i s
squared error €2 i s the sum of the squares of the errors at advisable to choose the initial weights of each hidden layer in a
each of the N, outputs of the network. Thus quasi-random manner, which ensures that at each position in a
layer’s input space the outputs of all but a few of its Adalines will
besaturated, whileensuringthateachAdaline in the layer i s unsat-
urated in some region of its input space. When this method i s used,
the weights in the output layer are set to small random values.
~ _ _
We note that the second term is zero. Accordingly,
(74)
"In Fig. 25, all notation follows the convention that superscripts
within parentheses indicate the layer number of the associated
Adaline or input node, while subscripts identify the associated Referring to Fig. 25, we can trace through the circuit t o
Adaline(s) within a layer. verify that 6 7 ) is computed in accord with Eqs. (86) and (87).
~~
I
The easiest way t o find values of 6 for all the Adaline ele- might appear to play the same role in backpropagation as
ments i n the network i s t o follow the schematic diagram of that played by the error in the p-LMS algorithm. However,
Fig. 25. 6 k i s not an error. Adaptation of the given Adaline i s effected
Thus, the procedure for finding 6('), the square-error to reduce the squared output error e;, not tik of the given
derivative associated with a given Adaline in hidden layer Adaline or of any other Adaline i n the network. The objec-
I, involves respectively multiplying each derivative 6 ( ' + ' ) tive i s not t o reduce the 6 k ' S of the network, but t o reduce
associated with each element in the layer immediately E', at the network output.
downstream from a given Adaline by the weight that con- It i s interesting to examine the weight updates that back-
nects it t o the given Adaline. These weighted square-error propagation imposes on the Adalineelements in theoutput
derivatives are then added together, producing an error layer. Substituting Eq. (77) into Eq. (94) reveals the Adaline
term E ( ' ) , which, in turn, is multiplied bysgm'(s(')), thederiv- which provides output y1 in Fig. 25 is updated by the rule
ative of the given Adaline's sigmoid function at its current
operating point. If a network has more than two layers, this w k + l = w k + 2pe:'sgm' (Sy))Xk. (95)
process of backpropagating the instantaneous square-error
derivatives from one layer to the immediately preceding This rule turns out to be identical to the single Adaline ver-
layer is successively repeated until a square-error derivative sion (54) of the backpropagation rule. This i s not surprising
6 is computed for each Adaline i n the network. This i s easily since the output Adaline is provided with both input signals
shown at each layer by repeating the chain rule argument and desired responses, so i t s training circumstance i s the
associated with Eq. (81). same as that experienced by an Adaline trained in isolation.
We now have a general method for finding a derivative There are many variants of the backpropagation algo-
6 for each Adaline element i n the network. The next step rithm. Sometimes, the size of p i s reduced during training
to diminish the effects of gradient noise in the weights.
i s t o use these 6's t o obtain the corresponding gradients.
Consider an Adalinesomewhere in the networkwhich,dur- Another extension is the momentum technique [42] which
ing presentation k, has a weight vector w k , an input vector involves including in theweightchangevectorAWkof each
Adaline a term proportional t o the corresponding weight
x k , and a linear output s k = W L X k .
change from the previous iteration. That is, Eq. (94) is
The instantaneous gradient for this Adaline element i s
replaced by a pair of equations:
6, =
at ;
- A w k = 2p(1 - ??)6,x, f qAwk_-( (96)
awk'
v
A ae2 at', as where the momentum constant 0 I9 < 1 i s in practice usu-
k - awk ask aw,' ally set to something around 0.8 or 0.9.
The momentum technique low-pass filters the weight
Note that w k and Xk are independent so updates and thereby tends to resist erratic weight changes
caused either by gradient noise or high spatial frequencies
(90) i n the M S E surface. The factor (1 - 7) i n Eq. (96) is included
to give the filter a DC gain of unity so that the learning rate
Therefore, p does not need t o be stepped down as the momentum con-
stant 9 i s increased. A momentum term can also be added
(91) to the update equations of other algorithms discussed in
this paper. A detailed analysis of stability issues associated
For this element, with momentum updating for the p-LMS algorithm, for
instance, has been described by Shynk and Roy [124].
I n our experience, the momentum technique used alone
is usually of little value. We have found, however, that it i s
Accordingly, often useful to apply the technique in situations that require
relatively "clean"22 gradient estimates. One case i s a nor-
6, = -26kXk. (93) malized weight update equation which makes the net-
Updating the weights of the Adaline element using the work's weight vector move the same Euclidean distance
method of steepest descent with the instantaneous gra- with each iteration. This can be accomplished by replacing
dient is a process represented by Eq. (96) and (97) with
~
T-
tered. The weights move by the same amount whether the
surfaceis flat or inclined. It i s reminiscentof a-LMS because
the gradient term i n the weight update equation is nor-
malized by a time-varying factor. The weight update rule initial state
could be further modified by including terms from both
-I
techniques associated with Eqs. (96) through (99). Other
methods for speeding u p backpropagation training include
Fahlman’s popular quickprop method [125], as well as the
delta-bar-delta approach reported in an excellent paper by
Jacobs [126].23
One of the most promising new areas of neural network research involves backpropagation variants for training various recurrent (signal feedback) networks. Recently, backpropagation rules have been derived for training recurrent networks to learn static associations [127], [128]. More interesting is the on-line technique of Williams and Zipser [129], which allows a wide class of recurrent networks to learn dynamic associations and trajectories. A more general and computationally viable variant of this technique has been advanced by Narendra and Parthasarathy [104]. These on-line methods are generalizations of a well-known steepest-descent algorithm for training linear IIR filters [130], [30].
An equivalent technique that is usually far less computationally intensive but best suited for off-line computation [37], [42], [131], called "backpropagation through time," has been used by Nguyen and Widrow [50] to enable a neural network to learn without a teacher how to back up a computer-simulated trailer truck to a loading dock (Fig. 27). This is a highly nonlinear steering task and it is not yet known how to design a controller to perform it. Nevertheless, with just 6 inputs providing information about the current position of the truck, a two-layer neural network with only 26 Adalines was able to learn of its own accord to solve this problem. Once trained, the network could successfully back up the truck from any initial position and orientation in front of the loading dock.

Fig. 27. Example truck backup sequence (initial state to final state).
Figs. 29 and 30 show two such slices of the MSE surface from a typical learning problem involving, respectively, an untrained sigmoidal network and a trained one. The first surface resulted from varying two first-layer weights of an untrained network. The second surface resulted from varying the same two weights after the network was fully trained. The two surfaces are similar, but the second one has a deeper minimum which was carved out by the backpropagation learning process. Figs. 31 and 32 resulted from varying a different set of two weights in the same network. Fig. 31 is the result from varying a first-layer weight and a third-layer weight in the untrained network, whereas Fig. 32 is the surface that resulted from varying the same two weights after the network was trained.

VIII. SUMMARY

This year is the 30th anniversary of the publication of the Perceptron rule by Rosenblatt and the LMS algorithm by Widrow and Hoff. It has also been 16 years since Werbos first published the backpropagation algorithm. These learning rules and several others have been studied and compared. Although they differ significantly from each other, they all belong to the same "family."

A distinction was drawn between error-correction rules and steepest-descent rules. The former includes the Perceptron rule, Mays' rules, the α-LMS algorithm, the original Madaline I rule of 1962, and the Madaline II rule. The latter includes the μ-LMS algorithm, the Madaline III rule, and the backpropagation algorithm. Fig. 33 categorizes the learning rules that have been studied.
Fig. 33. Learning rules. (Tree categorizing the rules studied by layered network versus single element and nonlinear versus linear: MRIII and Backprop; MRIII and Backprop; μ-LMS; MRI and MRII; Perceptron and Mays; α-LMS.)

Although these algorithms have been presented as established learning rules, one should not gain the impression that they are perfect and frozen for all time. Variations are possible for every one of them. They should be regarded as substrates upon which to build new and better rules. There is a tremendous amount of invention waiting "in the wings." We look forward to the next 30 years.

REFERENCES

[1] K. Steinbuch and V. A. W. Piske, "Learning matrices and their applications," IEEE Trans. Electron. Comput., vol. EC-12, pp. 846-862, Dec. 1963.
[2] B. Widrow, "Generalization and information storage in networks of adaline 'neurons'," in Self-Organizing Systems 1962, M. Yovitz, G. Jacobi, and G. Goldstein, Eds. Washington, DC: Spartan Books, 1962, pp. 435-461.
[3] L. Stark, M. Okajima, and G. H. Whipple, "Computer pattern recognition techniques: Electrocardiographic diagnosis," Commun. Ass. Comput. Mach., vol. 5, pp. 527-532, Oct. 1962.
[4] F. Rosenblatt, "Two theorems of statistical separability in the perceptron," in Mechanization of Thought Processes: Proceedings of a Symposium held at the National Physical Laboratory, Nov. 1958, vol. 1, pp. 421-456. London: HM Stationery Office, 1959.
[5] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington, DC: Spartan Books, 1962.
[6] C. von der Malsburg, "Self-organizing of orientation sensitive cells in the striate cortex," Kybernetik, vol. 14, pp. 85-100, 1973.
[7] S. Grossberg, "Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors," Biolog. Cybernetics, vol. 23, pp. 121-134, 1976.
[8] K. Fukushima, "Cognitron: A self-organizing multilayered neural network," Biolog. Cybernetics, vol. 20, pp. 121-136, 1975.
[9] —, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biolog. Cybernetics, vol. 36, pp. 193-202, 1980.
[10] B. Widrow, "Bootstrap learning in threshold logic systems," presented at the American Automatic Control Council (Theory Committee), IFAC Meeting, London, England, June 1966.
[11] B. Widrow, N. K. Gupta, and S. Maitra, "Punish/reward: Learning with a critic in adaptive threshold systems," IEEE Trans. Syst., Man, Cybernetics, vol. SMC-3, pp. 455-465, Sept. 1973.
[12] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybernetics, vol. SMC-13, pp. 834-846, 1983.
[13] J. S. Albus, "A new approach to manipulator control: the ..."
[14] "... manipulators using a general learning algorithm," IEEE J. Robotics Automat., vol. RA-3, pp. 157-165, Apr. 1987.
[15] S. Grossberg, "Adaptive pattern classification and universal recoding, II: Feedback, expectation, olfaction, and illusions," Biolog. Cybernetics, vol. 23, pp. 187-202, 1976.
[16] G. A. Carpenter and S. Grossberg, "A massively parallel architecture for a self-organizing neural pattern recognition machine," Computer Vision, Graphics, and Image Processing, vol. 37, pp. 54-115, 1983.
[17] —, "ART 2: Self-organization of stable category recognition codes for analog output patterns," Applied Optics, vol. 26, pp. 4919-4930, Dec. 1, 1987.
[18] —, "ART 3 hierarchical search: Chemical transmitters in self-organizing pattern recognition architectures," in Proc. Int. Joint Conf. on Neural Networks, vol. 2, pp. 30-33, Wash., DC, Jan. 1990.
[19] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biolog. Cybernetics, vol. 43, pp. 59-69, 1982.
[20] —, Self-Organization and Associative Memory. New York: Springer-Verlag, 2d ed., 1988.
[21] D. O. Hebb, The Organization of Behavior. New York: Wiley, 1949.
[22] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proc. Natl. Acad. Sci., vol. 79, pp. 2554-2558, Apr. 1982.
[23] —, "Neurons with graded response have collective computational properties like those of two-state neurons," Proc. Natl. Acad. Sci., vol. 81, pp. 3088-3092, May 1984.
[24] B. Kosko, "Adaptive bidirectional associative memories," Appl. Optics, vol. 26, pp. 4947-4960, Dec. 1, 1987.
[25] G. E. Hinton, T. J. Sejnowski, and D. H. Ackley, "Boltzmann machines: Constraint satisfaction networks that learn," Tech. Rep. CMU-CS-84-119, Carnegie-Mellon University, Dept. of Computer Science, 1984.
[26] G. E. Hinton and T. J. Sejnowski, "Learning and relearning in Boltzmann machines," in Parallel Distributed Processing, vol. 1, ch. 7, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, MA: M.I.T. Press, 1986.
[27] L. R. Talbert et al., "A real-time adaptive speech-recognition system," Tech. rep., Stanford University, 1963.
[28] M. J. C. Hu, Application of the Adaline System to Weather Forecasting. Thesis, Tech. Rep. 6775-1, Stanford Electron. Labs., Stanford, CA, June 1964.
[29] B. Widrow, "The original adaptive neural net broom-balancer," Proc. IEEE Intl. Symp. Circuits and Systems, pp. 351-357, Phila., PA, May 4-7, 1987.
[30] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[31] B. Widrow, P. Mantey, L. Griffiths, and B. Goode, "Adaptive antenna systems," Proc. IEEE, vol. 55, pp. 2143-2159, Dec. 1967.
[32] B. Widrow, "Adaptive inverse control," Proc. 2d Intl. Fed. of Automatic Control Workshop, pp. 1-5, Lund, Sweden, July 1-3, 1986.
[33] B. Widrow et al., "Adaptive noise cancelling: Principles and applications," Proc. IEEE, vol. 63, pp. 1692-1716, Dec. 1975.
[34] R. W. Lucky, "Automatic equalization for digital communication," Bell Syst. Tech. J., vol. 44, pp. 547-588, Apr. 1965.
[35] R. W. Lucky et al., Principles of Data Communication. New York: McGraw-Hill, 1968.
[36] M. M. Sondhi, "An adaptive echo canceller," Bell Syst. Tech. J., vol. 46, pp. 497-511, Mar. 1967.
[37] P. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University, Cambridge, MA, Aug. 1974.
[38] Y. le Cun, "A theoretical framework for back-propagation," in Proc. 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, Eds., June 17-26, pp. 21-28. San Mateo, CA: Morgan Kaufmann.
[39] D. Parker, "Learning-logic," Invention Report 581-64, File 1, Office of Technology Licensing, Stanford University, Stanford, CA, Oct. 1982.
[40] —, "Learning-logic," Technical Report TR-47, Center for ...