
30 Years of Adaptive Neural Networks:

Perceptron, Madaline, and Backpropagation


BERNARD WIDROW, FELLOW, IEEE, AND MICHAEL A. LEHR

Fundamental developments in feedforward artificial neural networks from the past thirty years are reviewed. The central theme of this paper is a description of the history, origination, operating characteristics, and basic theory of several supervised neural network training algorithms including the Perceptron rule, the LMS algorithm, three Madaline rules, and the backpropagation technique. These methods were developed independently, but with the perspective of history they can all be related to each other. The concept underlying these algorithms is the "minimal disturbance principle," which suggests that during training it is advisable to inject new information into a network in a manner that disturbs stored information to the smallest extent possible.

Manuscript received September 12, 1989; revised April 13, 1990. This work was sponsored by the SDIO Innovative Science and Technology Office and managed by ONR under contract no. N00014-86-K-0718, by the Dept. of the Army Belvoir RD&E Center under contracts no. DAAK70-87-P-3134 and no. DAAK-70-89-K-0001, by a grant from the Lockheed Missiles and Space Co., by NASA under contract no. NCA2-389, and by Rome Air Development Center under contract no. F30602-88-D-0025, subcontract no. E-21-T22-S1.
The authors are with the Information Systems Laboratory, Department of Electrical Engineering, Stanford University, Stanford, CA 94305-4055, USA.
IEEE Log Number 9038824.

I. INTRODUCTION

This year marks the 30th anniversary of the Perceptron rule and the LMS algorithm, two early rules for training adaptive elements. Both algorithms were first published in 1960. In the years following these discoveries, many new techniques have been developed in the field of neural networks, and the discipline is growing rapidly. One early development was Steinbuch's Learning Matrix [1], a pattern recognition machine based on linear discriminant functions. At the same time, Widrow and his students devised Madaline Rule I (MRI), the earliest popular learning rule for neural networks with multiple adaptive elements [2]. Other early work included the "mode-seeking" technique of Stark, Okajima, and Whipple [3]. This was probably the first example of competitive learning in the literature, though it could be argued that earlier work by Rosenblatt on "spontaneous learning" [4], [5] deserves this distinction. Further pioneering work on competitive learning and self-organization was performed in the 1970s by von der Malsburg [6] and Grossberg [7]. Fukushima explored related ideas with his biologically inspired Cognitron and Neocognitron models [8], [9].

Widrow devised a reinforcement learning algorithm called "punish/reward" or "bootstrapping" [10], [11] in the mid-1960s. This can be used to solve problems when uncertainty about the error signal causes supervised training methods to be impractical. A related reinforcement learning approach was later explored in a classic paper by Barto, Sutton, and Anderson on the "credit assignment" problem [12]. Barto et al.'s technique is also somewhat reminiscent of Albus's adaptive CMAC, a distributed table-look-up system based on models of human memory [13], [14].

In the 1970s Grossberg developed his Adaptive Resonance Theory (ART), a number of novel hypotheses about the underlying principles governing biological neural systems [15]. These ideas served as the basis for later work by Carpenter and Grossberg involving three classes of ART architectures: ART 1 [16], ART 2 [17], and ART 3 [18]. These are self-organizing neural implementations of pattern clustering algorithms. Other important theory on self-organizing systems was pioneered by Kohonen with his work on feature maps [19], [20].

In the early 1980s, Hopfield and others introduced outer product rules as well as equivalent approaches based on the early work of Hebb [21] for training a class of recurrent (signal feedback) networks now called Hopfield models [22], [23]. More recently, Kosko extended some of the ideas of Hopfield and Grossberg to develop his adaptive Bidirectional Associative Memory (BAM) [24], a network model employing differential as well as Hebbian and competitive learning laws. Other significant models from the past decade include probabilistic ones such as Hinton, Sejnowski, and Ackley's Boltzmann Machine [25], [26] which, to oversimplify, is a Hopfield model that settles into solutions by a simulated annealing process governed by Boltzmann statistics. The Boltzmann Machine is trained by a clever two-phase Hebbian-based technique.

While these developments were taking place, adaptive systems research at Stanford traveled an independent path. After devising their Madaline I rule, Widrow and his students developed uses for the Adaline and Madaline. Early applications included, among others, speech and pattern recognition [27], weather forecasting [28], and adaptive controls [29]. Work then switched to adaptive filtering and adaptive signal processing [30] after attempts to develop learning rules for networks with multiple adaptive layers were unsuccessful. Adaptive signal processing proved to be a fruitful avenue for research with applications involving adaptive antennas [31], adaptive inverse controls [32], adaptive noise cancelling [33], and seismic signal processing [30]. Outstanding work by Lucky and others at Bell Laboratories led to major commercial applications of adaptive filters and the LMS algorithm to adaptive equalization in high-speed modems [34], [35] and to adaptive echo cancellers for long-distance telephone and satellite circuits [36]. After 20 years of research in adaptive signal processing, the work in Widrow's laboratory has once again returned to neural networks.

The first major extension of the feedforward neural network beyond Madaline I took place in 1971 when Werbos developed a backpropagation training algorithm which, in 1974, he first published in his doctoral dissertation [37].¹ Unfortunately, Werbos's work remained almost unknown in the scientific community. In 1982, Parker rediscovered the technique [39] and in 1985, published a report on it at M.I.T. [40]. Not long after Parker published his findings, Rumelhart, Hinton, and Williams [41], [42] also rediscovered the technique and, largely as a result of the clear framework within which they presented their ideas, they finally succeeded in making it widely known.

¹We should note, however, that in the field of variational calculus the idea of error backpropagation through nonlinear systems existed centuries before Werbos first thought to apply this concept to neural networks. In the past 25 years, these methods have been used widely in the field of optimal control, as discussed by Le Cun [38].

The elements used by Rumelhart et al. in the backpropagation network differ from those used in the earlier Madaline architectures. The adaptive elements in the original Madaline structure used hard-limiting quantizers (signums), while the elements in the backpropagation network use only differentiable nonlinearities, or "sigmoid" functions.² In digital implementations, the hard-limiting quantizer is more easily computed than any of the differentiable nonlinearities used in backpropagation networks. In 1987, Widrow, Winter, and Baxter looked back at the original Madaline I algorithm with the goal of developing a new technique that could adapt multiple layers of adaptive elements using the simpler hard-limiting quantizers. The result was Madaline Rule II [43].

²The term "sigmoid" is usually used in reference to monotonically increasing "S-shaped" functions, such as the hyperbolic tangent. In this paper, however, we generally use the term to denote any smooth nonlinear function at the output of a linear adaptive element. In other papers, these nonlinearities go by a variety of names, such as "squashing functions," "activation functions," "transfer characteristics," or "threshold functions."

David Andes of U.S. Naval Weapons Center of China Lake, CA, modified Madaline II in 1988 by replacing the hard-limiting quantizers in the Adaline with sigmoid functions, thereby inventing Madaline Rule III (MRIII). Widrow and his students were first to recognize that this rule is mathematically equivalent to backpropagation.

The outline above gives only a partial view of the discipline, and many landmark discoveries have not been mentioned. Needless to say, the field of neural networks is quickly becoming a vast one, and in one short survey we could not hope to cover the entire subject in any detail. Consequently, many significant developments, including some of those mentioned above, are not discussed in this paper. The algorithms described are limited primarily to those developed in our laboratory at Stanford, and to related techniques developed elsewhere, the most important of which is the backpropagation algorithm. Section II explores fundamental concepts, Section III discusses adaptation and the minimal disturbance principle, Sections IV and V cover error correction rules, Sections VI and VII delve into steepest-descent rules, and Section VIII provides a summary.

Information about the neural network paradigms not discussed in this paper can be obtained from a number of other sources, such as the concise survey by Lippmann [44], and the collection of classics by Anderson and Rosenfeld [45]. Much of the early work in the field from the 1960s is carefully reviewed in Nilsson's monograph [46]. A good view of some of the more recent results is presented in Rumelhart and McClelland's popular three-volume set [47]. A paper by Moore [48] presents a clear discussion about ART 1 and some of Grossberg's terminology. Another resource is the DARPA Study report [49] which gives a very comprehensive and readable "snapshot" of the field in 1988.

II. FUNDAMENTAL CONCEPTS

Today we can build computers and other machines that perform a variety of well-defined tasks with celerity and reliability unmatched by humans. No human can invert matrices or solve systems of differential equations at speeds rivaling modern workstations. Nonetheless, many problems remain to be solved to our satisfaction by any man-made machine, but are easily disentangled by the perceptual or cognitive powers of humans, and often lower mammals, or even fish and insects. No computer vision system can rival the human ability to recognize visual images formed by objects of all shapes and orientations under a wide range of conditions. Humans effortlessly recognize objects in diverse environments and lighting conditions, even when obscured by dirt, or occluded by other objects. Likewise, the performance of current speech-recognition technology pales when compared to the performance of the human adult who easily recognizes words spoken by different people, at different rates, pitches, and volumes, even in the presence of distortion or background noise.

The problems solved more effectively by the brain than by the digital computer typically have two characteristics: they are generally ill defined, and they usually require an enormous amount of processing. Recognizing the character of an object from its image on television, for instance, involves resolving ambiguities associated with distortion and lighting. It also involves filling in information about a three-dimensional scene which is missing from the two-dimensional image on the screen. An infinite number of three-dimensional scenes can be projected into a two-dimensional image. Nonetheless, the brain deals well with this ambiguity, and using learned cues usually has little difficulty correctly determining the role played by the missing dimension.

As anyone who has performed even simple filtering operations on images is aware, processing high-resolution images requires a great deal of computation. Our brains accomplish this by utilizing massive parallelism, with millions and even billions of neurons in parts of the brain working together to solve complicated problems. Because solid-state operational amplifiers and logic gates can compute many orders of magnitude faster than current estimates of the computational speed of neurons in the brain, we may soon be able to build relatively inexpensive machines with the ability to process as much information as the human brain. This enormous processing power will do little to help us solve problems, however, unless we can utilize it effectively. For instance, coordinating many thousands of processors, which must efficiently cooperate to solve a problem, is not a simple task. If each processor must be programmed separately, and if all contingencies associated with various ambiguities must be designed into the software, even a relatively simple problem can quickly become unmanageable. The slow progress over the past 25 years or so in machine vision and other areas of artificial intelligence is testament to the difficulties associated with solving ambiguous and computationally intensive problems on von Neumann computers and related architectures.

Thus, there is some reason to consider attacking certain problems by designing naturally parallel computers, which process information and learn by principles borrowed from the nervous systems of biological creatures. This does not necessarily mean we should attempt to copy the brain part for part. Although the bird served to inspire development of the airplane, birds do not have propellers, and airplanes do not operate by flapping feathered wings. The primary parallel between biological nervous systems and artificial neural networks is that each typically consists of a large number of simple elements that learn and are able to collectively solve complicated and ambiguous problems.

Today, most artificial neural network research and application is accomplished by simulating networks on serial computers. Speed limitations keep such networks relatively small, but even with small networks some surprisingly difficult problems have been tackled. Networks with fewer than 150 neural elements have been used successfully in vehicular control simulations [50], speech generation [51], [52], and undersea mine detection [49]. Small networks have also been used successfully in airport explosive detection [53], expert systems [54], [55], and scores of other applications. Furthermore, efforts to develop parallel neural network hardware are meeting with some success, and such hardware should be available in the future for attacking more difficult problems, such as speech recognition [56], [57].

Whether implemented in parallel hardware or simulated on a computer, all neural networks consist of a collection of simple elements that work together to solve problems. A basic building block of nearly all artificial neural networks, and most other adaptive systems, is the adaptive linear combiner.

A. The Adaptive Linear Combiner

The adaptive linear combiner is diagrammed in Fig. 1. Its output is a linear combination of its inputs. In a digital implementation, this element receives at time k an input signal vector or input pattern vector X_k = [x_{0k}, x_{1k}, x_{2k}, ..., x_{nk}]^T and a desired response d_k, a special input used to effect learning. The components of the input vector are weighted by a set of coefficients, the weight vector W_k = [w_{0k}, w_{1k}, w_{2k}, ..., w_{nk}]^T. The sum of the weighted inputs is then computed, producing a linear output, the inner product s_k = X_k^T W_k. The components of X_k may be either continuous analog values or binary values. The weights are essentially continuously variable, and can take on negative as well as positive values.

[Fig. 1. Adaptive linear combiner.]

During the training process, input patterns and corresponding desired responses are presented to the linear combiner. An adaptation algorithm automatically adjusts the weights so that the output responses to the input patterns will be as close as possible to their respective desired responses. In signal processing applications, the most popular method for adapting the weights is the simple LMS (least mean square) algorithm [58], [59], often called the Widrow-Hoff delta rule [42]. This algorithm minimizes the sum of squares of the linear errors over the training set. The linear error ε_k is defined to be the difference between the desired response d_k and the linear output s_k during presentation k. Having this error signal is necessary for adapting the weights. When the adaptive linear combiner is embedded in a multi-element neural network, however, an error signal is often not directly available for each individual linear combiner and more complicated procedures must be devised for adapting the weight vectors. These procedures are the main focus of this paper.
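A minimal sketch of the adaptive linear combiner of Fig. 1, computing the linear output s_k = X_k^T W_k and the linear error ε_k = d_k - s_k exactly as defined above; the numerical values and function names are illustrative, not taken from the paper.

    # Adaptive linear combiner sketch: linear output and linear error.
    def linear_output(x, w):
        """Inner product of the input pattern x with the weight vector w."""
        return sum(xi * wi for xi, wi in zip(x, w))

    def linear_error(x, w, d):
        """Difference between the desired response d and the linear output."""
        return d - linear_output(x, w)

    # Example values (illustrative): x[0] = +1 serves as the bias input.
    x_k = [1.0, 0.5, -1.0]                  # input pattern vector
    w_k = [0.2, 0.4, -0.1]                  # weight vector (w[0] is the bias weight)
    d_k = 1.0                               # desired response
    print(linear_output(x_k, w_k))          # linear output s_k
    print(linear_error(x_k, w_k, d_k))      # linear error

This is the quantity that the adaptation algorithms described later drive toward zero, pattern by pattern.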
B. A Linear Classifier - The Single Threshold Element

The basic building block used in many neural networks is the "adaptive linear element," or Adaline³ [58] (Fig. 2). This is an adaptive threshold logic element. It consists of an adaptive linear combiner cascaded with a hard-limiting quantizer, which is used to produce a binary ±1 output, y_k = sgn(s_k). The bias weight w_{0k}, which is connected to a constant input x_0 = +1, effectively controls the threshold level of the quantizer.

³In the neural network literature, such elements are often referred to as "adaptive neurons." However, in a conversation between David Hubel of Harvard Medical School and Bernard Widrow, Dr. Hubel pointed out that the Adaline differs from the biological neuron in that it contains not only the neural cell body, but also the input synapses and a mechanism for training them.

[Fig. 2. Adaptive linear element (Adaline).]

In single-element neural networks, an adaptive algorithm (such as the LMS algorithm, or the Perceptron rule) is often used to adjust the weights of the Adaline so that it responds correctly to as many patterns as possible in a training set that has binary desired responses. Once the weights are adjusted, the responses of the trained element can be tested by applying various input patterns. If the Adaline responds correctly with high probability to input patterns that were not included in the training set, it is said that generalization has taken place. Learning and generalization are among the most useful attributes of Adalines and neural networks.

Linear Separability: With n binary inputs and one binary output, a single Adaline of the type shown in Fig. 2 is capable of implementing certain logic functions. There are 2^n possible input patterns. A general logic implementation would be capable of classifying each pattern as either +1 or -1, in accord with the desired response. Thus, there are 2^(2^n) possible logic functions connecting n inputs to a single binary output. A single Adaline is capable of realizing only a small subset of these functions, known as the linearly separable logic functions or threshold logic functions [60]. These are the set of logic functions that can be obtained with all possible weight variations.

Figure 3 shows a two-input Adaline element. Figure 4 represents all possible binary inputs to this element with four large dots in pattern vector space. In this space, the components of the input pattern vector lie along the coordinate axes. The Adaline separates input patterns into two categories, depending on the values of the weights. A critical thresholding condition occurs when the linear output s equals zero:

    s = x_1 w_1 + x_2 w_2 + w_0 = 0,    (1)

therefore

    x_2 = -(w_1/w_2) x_1 - w_0/w_2.    (2)

[Fig. 3. Two-input Adaline.]

[Fig. 4. Separating line in pattern space.]

Figure 4 graphs this linear relation, which comprises a separating line having slope and intercept given by

    slope = -w_1/w_2,
    intercept = -w_0/w_2.    (3)

The three weights determine slope, intercept, and the side of the separating line that corresponds to a positive output. The opposite side of the separating line corresponds to a negative output. For Adalines with four weights, the separating boundary is a plane; with more than four weights, the boundary is a hyperplane. Note that if the bias weight is zero, the separating hyperplane will be homogeneous; it will pass through the origin in pattern space.
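As a worked check of Eqs. (1)-(3), the sketch below computes the slope and intercept of the separating line from an assumed two-input Adaline weight vector and classifies the four binary patterns; the particular weight values are an illustrative choice (they happen to realize the linearly separable function tabulated in (4) just below), not values given in the paper.

    # Separating line of a two-input Adaline, Eqs. (1)-(3).
    w0, w1, w2 = 1.0, 1.0, -1.0             # bias weight and the two pattern weights (assumed)

    slope = -w1 / w2                        # Eq. (3)
    intercept = -w0 / w2                    # Eq. (3)
    print("separating line: x2 =", slope, "* x1 +", intercept)

    def adaline_output(x1, x2):
        """Hard-limited output y = sgn(s), with s from Eq. (1)."""
        s = x1 * w1 + x2 * w2 + w0
        return 1 if s >= 0 else -1

    for x1, x2 in [(+1, +1), (+1, -1), (-1, -1), (-1, +1)]:
        print((x1, x2), "->", adaline_output(x1, x2))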
As sketched in Fig. 4, the binary input patterns are classified as follows:

    (+1, +1) → +1
    (+1, -1) → +1
    (-1, -1) → +1
    (-1, +1) → -1    (4)

This is an example of a linearly separable function. An example of a function which is not linearly separable is the two-input exclusive NOR function:

    (+1, +1) → +1
    (+1, -1) → -1
    (-1, -1) → +1
    (-1, +1) → -1    (5)

No single straight line exists that can achieve this separation of the input patterns; thus, without preprocessing, no single Adaline can implement the exclusive NOR function. With two inputs, a single Adaline can realize 14 of the 16 possible logic functions. With many inputs, however, only a small fraction of all possible logic functions is realizable, that is, linearly separable. Combinations of elements or networks of elements can be used to realize functions that are not linearly separable.
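The count of 14 realizable functions can be checked by brute force. The sketch below (an illustration, not from the paper) enumerates every truth table over the four ±1 input patterns and searches a small grid of integer weights, which is sufficient for the two-input case; only the exclusive OR and exclusive NOR fail.

    # Brute-force check that 14 of the 16 two-input logic functions are linearly separable.
    from itertools import product

    patterns = [(+1, +1), (+1, -1), (-1, +1), (-1, -1)]

    def realizable(targets):
        """True if some Adaline weights (w0, w1, w2) reproduce the given truth table."""
        for w0, w1, w2 in product(range(-2, 3), repeat=3):
            ok = True
            for (x1, x2), t in zip(patterns, targets):
                s = w0 + w1 * x1 + w2 * x2
                if s == 0 or (1 if s > 0 else -1) != t:
                    ok = False
                    break
            if ok:
                return True
        return False

    separable = [t for t in product([+1, -1], repeat=4) if realizable(t)]
    print(len(separable), "of 16 two-input functions are linearly separable")   # prints 14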
Capacity of Linear Classifiers: The number of training patterns or stimuli that an Adaline can learn to correctly classify is an important issue. Each pattern and desired output combination represents an inequality constraint on the weights. It is possible to have inconsistencies in sets of simultaneous inequalities just as with simultaneous equalities. When the inequalities (that is, the patterns) are determined at random, the number that can be picked before an inconsistency arises is a matter of chance.

In their 1964 dissertations [61], [62], T. M. Cover and R. J. Brown both showed that the average number of random patterns with random binary desired responses that can be absorbed by an Adaline is approximately equal to twice the number of weights.⁴ This is the statistical pattern capacity C_s of the Adaline. As reviewed by Nilsson [46], both theses included an analytic formula describing the probability that such a training set can be separated by an Adaline (i.e., it is linearly separable). The probability is a function of N_p, the number of input patterns in the training set, and N_w, the number of weights in the Adaline, including the threshold weight, if used:

    P(N_p, N_w) = 2^{1-N_p} \sum_{i=0}^{N_w-1} \binom{N_p-1}{i}   for N_p > N_w,
    P(N_p, N_w) = 1   for N_p ≤ N_w.    (6)

⁴Underlying theory for this result was discovered independently by a number of researchers including, among others, Winder [63], Cameron [64], and Joseph [65].

In Fig. 5 this formula was used to plot a set of analytical curves, which show the probability that a set of N_p random patterns can be trained into an Adaline as a function of the ratio N_p/N_w. Notice from these curves that as the number of weights increases, the statistical pattern capacity of the Adaline C_s = 2N_w becomes an accurate estimate of the number of responses it can learn.

[Fig. 5. Probability that an Adaline can separate a training pattern set as a function of the ratio N_p/N_w, for N_w = 2, 5, and 15.]

Another fact that can be observed from Fig. 5 is that a problem is guaranteed to have a solution if the number of patterns is equal to (or less than) half the statistical pattern capacity; that is, if the number of patterns is equal to the number of weights. We will refer to this as the deterministic pattern capacity C_d of the Adaline. An Adaline can learn any two-category pattern classification task involving no more patterns than that represented by its deterministic capacity, C_d = N_w.
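The probability in Eq. (6) is easy to evaluate numerically. The short sketch below (illustrative) computes it for the weight counts plotted in Fig. 5 and shows the sharp transition near N_p = 2N_w that makes C_s = 2N_w a good capacity estimate as the number of weights grows.

    # Evaluate the separability probability of Eq. (6).
    from math import comb

    def prob_separable(n_p, n_w):
        """Probability that n_p random patterns with random binary desired
        responses can be separated by an Adaline with n_w weights (Eq. (6))."""
        if n_p <= n_w:
            return 1.0
        return 2.0 ** (1 - n_p) * sum(comb(n_p - 1, i) for i in range(n_w))

    for n_w in (2, 5, 15):                  # the three curves of Fig. 5
        row = [round(prob_separable(r * n_w, n_w), 3) for r in (1, 2, 3, 4)]
        print("N_w =", n_w, " P at N_p/N_w = 1, 2, 3, 4:", row)

Note that the probability is exactly 1/2 at N_p = 2N_w, which is the statistical capacity quoted in the text.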
Both the statistical and deterministic capacity results depend upon a mild condition on the positions of the input patterns: the patterns must be in general position with respect to the Adaline.⁵ If the input patterns to an Adaline are continuous valued and smoothly distributed (that is, pattern positions are generated by a distribution function containing no impulses), general position is assured. The general position assumption is often invalid if the pattern vectors are binary. Nonetheless, even when the points are not in general position, the capacity results represent useful upper bounds.

⁵Patterns are in general position with respect to an Adaline with no threshold weight if any subset of pattern vectors containing no more than N_w members forms a linearly independent set or, equivalently, if no set of N_w or more input points in the N_w-dimensional pattern space lie on a homogeneous hyperplane. For the more common case involving an Adaline with a threshold weight, general position means that no set of N_w or more patterns in the (N_w - 1)-dimensional pattern space lie on a hyperplane not constrained to pass through the origin [61], [46].

The capacity results apply to randomly selected training patterns. In most problems of interest, the patterns in the training set are not random, but exhibit some statistical regularities. These regularities are what make generalization possible. The number of patterns that an Adaline can learn in a practical problem often far exceeds its statistical capacity because the Adaline is able to generalize within the training set, and learns many of the training patterns before they are even presented.

C. Nonlinear Classifiers

The linear classifier is limited in its capacity, and of course is limited to only linearly separable forms of pattern discrimination. More sophisticated classifiers with higher capacities are nonlinear. Two types of nonlinear classifiers are described here. The first is a fixed preprocessing network connected to a single adaptive element, and the other is the multi-element feedforward neural network.

Polynomial Discriminant Functions: Nonlinear functions of the inputs applied to the single Adaline can yield nonlinear decision boundaries. Useful nonlinearities include the polynomial functions. Consider the system illustrated in Fig. 6 which contains only linear and quadratic input functions. The critical thresholding condition for this system is

    s = w_0 + x_1 w_1 + x_1^2 w_{11} + x_1 x_2 w_{12} + x_2^2 w_{22} + x_2 w_2 = 0.    (7)

[Fig. 6. Adaline with inputs mapped through nonlinearities.]

With proper choice of the weights, the separating boundary in pattern space can be established as shown, for example, in Fig. 7. This represents a solution for the exclusive NOR function of (5). Of course, all of the linearly separable functions are also realizable. The use of such nonlinearities can be generalized for more than two inputs and for higher degree polynomial functions of the inputs. Some of the first work in this area was done by Specht [66]-[68] at Stanford in the 1960s when he successfully applied polynomial discriminants to the classification and analysis of electrocardiographic signals. Work on this topic has also been done by Barron and Barron [69]-[71] and by Ivakhnenko [72] in the Soviet Union.

[Fig. 7. Elliptical separating boundary for realizing a function which is not linearly separable.]
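To make Eq. (7) concrete, the sketch below maps the two inputs through the quadratic terms of Fig. 6 and uses hand-picked weights so that a single Adaline realizes the exclusive NOR of (5). The weight values are an assumption for illustration, not ones given in the paper; only the cross-term weight is needed, since x_1 x_2 = +1 exactly when the two inputs agree.

    # Polynomial-discriminant Adaline realizing the exclusive NOR of Eq. (5).
    # Feature order follows Eq. (7): bias, x1, x1^2, x1*x2, x2^2, x2.
    def features(x1, x2):
        return [1.0, x1, x1 * x1, x1 * x2, x2 * x2, x2]

    w = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]      # illustrative weights: only the x1*x2 term is used

    def output(x1, x2):
        s = sum(fi * wi for fi, wi in zip(features(x1, x2), w))
        return 1 if s >= 0 else -1

    for x1, x2 in [(+1, +1), (+1, -1), (-1, -1), (-1, +1)]:
        print((x1, x2), "->", output(x1, x2))   # +1, -1, +1, -1 : exclusive NOR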
The polynomial approach offers great simplicity and beauty. Through it one can realize a wide variety of adaptive nonlinear discriminant functions by adapting only a single Adaline element. Several methods have been developed for training the polynomial discriminant function. Specht developed a very efficient noniterative (that is, single pass through the training set) training procedure: the polynomial discriminant method (PDM), which allows the polynomial discriminant function to implement a nonparametric classifier based on the Bayes decision rule. Other methods for training the system include iterative error-correction rules such as the Perceptron and α-LMS rules, and iterative gradient-descent procedures such as the μ-LMS and SER (also called RLS) algorithms [30]. Gradient descent with a single adaptive element is typically much faster than with a layered neural network. Furthermore, as we shall see, when the single Adaline is trained by a gradient descent procedure, it will converge to a unique global solution.

After the polynomial discriminant function has been trained by a gradient-descent procedure, the weights of the Adaline will represent an approximation to the coefficients in a multidimensional Taylor series expansion of the desired response function. Likewise, if appropriate trigonometric terms are used in place of the polynomial preprocessor, the Adaline's weight solution will approximate the terms in the (truncated) multidimensional Fourier series decomposition of a periodic version of the desired response function. The choice of preprocessing functions determines how well a network will generalize for patterns outside the training set. Determining "good" functions remains a focus of current research [73], [74]. Experience seems to indicate that unless the nonlinearities are chosen with care to suit the problem at hand, often better generalization can be obtained from networks with more than one adaptive layer. In fact, one can view multilayer networks as single-layer networks with trainable preprocessors which are essentially self-optimizing.

Madaline I

One of the earliest trainable layered neural networks with multiple adaptive elements was the Madaline I structure of Widrow [2] and Hoff [75]. Mathematical analyses of Madaline I were developed in the Ph.D. theses of Ridgway [76], Hoff [75], and Glanz [77]. In the early 1960s, a 1000-weight Madaline I was built out of hardware [78] and used in pattern recognition research. The weights in this machine were memistors, electrically variable resistors developed by Widrow and Hoff which are adjusted by electroplating a resistive link [79].

Madaline I was configured in the following way. Retinal inputs were connected to a layer of adaptive Adaline elements, the outputs of which were connected to a fixed logic device that generated the system output. Methods for adapting such systems were developed at that time. An example of this kind of network is shown in Fig. 8. Two Adalines are connected to an AND logic device to provide an output.

[Fig. 8. Two-Adaline form of Madaline.]

With weights suitably chosen, the separating boundary in pattern space for the system of Fig. 8 would be as shown in Fig. 9. This separating boundary implements the exclusive NOR function of (5).

[Fig. 9. Separating lines for Madaline of Fig. 8.]
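A minimal sketch of the two-Adaline Madaline of Fig. 8: two Adalines feed a fixed AND element. The weights below are one illustrative choice that produces two separating lines implementing the exclusive NOR of (5); they are not the specific values shown in Fig. 8.

    # Two-Adaline Madaline (as in Fig. 8): two Adalines followed by a fixed AND element.
    def sgn(s):
        return 1 if s >= 0 else -1

    def adaline(x1, x2, w0, w1, w2):
        return sgn(w0 + w1 * x1 + w2 * x2)

    def madaline_xnor(x1, x2):
        y1 = adaline(x1, x2, 1.0,  1.0, -1.0)   # first separating line (assumed weights)
        y2 = adaline(x1, x2, 1.0, -1.0,  1.0)   # second separating line (assumed weights)
        return min(y1, y2)                      # AND of two ±1 signals

    for x1, x2 in [(+1, +1), (+1, -1), (-1, -1), (-1, +1)]:
        print((x1, x2), "->", madaline_xnor(x1, x2))   # +1, -1, +1, -1

Each Adaline carves off one of the two "mixed" corners of the pattern square, and the AND element keeps only the region between the two lines, which is exactly the boundary sketched in Fig. 9.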
Madalines were constructed with many more inputs, with many more Adaline elements in the first layer, and with various fixed logic devices such as AND, OR, and majority-vote-taker elements in the second layer. Those three functions (Fig. 10) are all threshold logic functions. The given weight values will implement these three functions, but the weight choices are not unique.

[Fig. 10. Fixed-weight Adaline implementations of AND, OR, and MAJ logic functions.]

Feedforward Networks

The Madalines of the 1960s had adaptive first layers and fixed threshold functions in the second (output) layers [76], [46]. The feedforward neural networks of today often have many layers, and usually all layers are adaptive. The backpropagation networks of Rumelhart et al. [47] are perhaps the best-known examples of multilayer networks. A fully connected three-layer⁶ feedforward adaptive network is illustrated in Fig. 11. In a fully connected layered network, each Adaline receives inputs from every output in the preceding layer.

⁶In Rumelhart et al.'s terminology, this would be called a four-layer network, following Rosenblatt's convention of counting layers of signals, including the input layer. For our purposes, we find it more useful to count only layers of computing elements. We do not count as a layer the set of input terminal points.

[Fig. 11. Three-layer adaptive neural network.]

During training, the response of each output element in the network is compared with a corresponding desired response. Error signals associated with the output elements are readily computed, so adaptation of the output layer is straightforward. The fundamental difficulty associated with adapting a layered network lies in obtaining "error signals" for hidden-layer Adalines, that is, for Adalines in layers other than the output layer. The backpropagation and Madaline III algorithms contain methods for establishing these error signals.

There is no reason why a feedforward network must have the layered structure of Fig. 11. In Werbos's development of the backpropagation algorithm [37], in fact, the Adalines are ordered and each receives signals directly from each input component and from the output of each preceding Adaline. Many other variations of the feedforward network are possible. An interesting area of current research involves a generalized backpropagation method which can be used to train "high-order" or "sigma-pi" networks that incorporate a polynomial preprocessor for each Adaline [47], [80].

One characteristic that is often desired in pattern recognition problems is invariance of the network output to changes in the position and size of the input pattern or image. Various techniques have been used to achieve translation, rotation, scale, and time invariance. One method involves including in the training set several examples of each exemplar transformed in size, angle, and position, but with a desired response that depends only on the original exemplar [78]. Other research has dealt with various Fourier and Mellin transform preprocessors [81], [82], as well as neural preprocessors [83]. Giles and Maxwell have developed a clever averaging approach, which removes unwanted dependencies from the polynomial terms in high-order threshold logic units (polynomial discriminant functions) [74] and high-order neural networks [80]. Other approaches have considered Zernike moments [84], graph matching [85], spatially repeated feature detectors [9], and time-averaged outputs [86].

Capacity of Nonlinear Classifiers

An important consideration that should be addressed when comparing various network topologies concerns the amount of information they can store.⁷ Of the nonlinear classifiers mentioned above, the pattern capacity of the Adaline driven by a fixed preprocessor composed of smooth nonlinearities is the simplest to determine. If the inputs to the system are smoothly distributed in position, the outputs of the preprocessing network will be in general position with respect to the Adaline. Thus, the inputs to the Adaline will satisfy the condition required in Cover's Adaline capacity theory. Accordingly, the deterministic and statistical pattern capacities of the system are essentially equal to those of the Adaline.

⁷We should emphasize that the information referred to here corresponds to the maximum number of binary input/output mappings a network can achieve with properly adjusted weights, not the number of bits of information that can be stored directly into the network's weights.

The capacities of Madaline I structures, which utilize both the majority element and the OR element, were experimentally estimated by Koford in the early 1960s. Although the logic functions that can be realized with these output elements are quite different, both types of elements yield essentially the same statistical storage capacity. The average number of patterns that a Madaline I network can learn to classify was found to be equal to the capacity per Adaline multiplied by the number of Adalines in the structure. The statistical capacity C_s is therefore approximately equal to twice the number of adaptive weights. Although the Madaline and the Adaline have roughly the same capacity per adaptive weight, without preprocessing the Adaline can separate only linearly separable sets, while the Madaline has no such limitation.

A great deal of theoretical and experimental work has been directed toward determining the capacity of both Adalines and Hopfield networks [87]-[90]. Somewhat less theoretical work has been focused on the pattern capacity of multilayer feedforward networks, though some knowledge exists about the capacity of two-layer networks. Such results are of particular interest because the two-layer network is surprisingly powerful. With a sufficient number of hidden elements, a signum network with two layers can implement any Boolean function.⁸ Equally impressive is the power of the two-layer sigmoid network. Given a sufficient number of hidden Adaline elements, such networks can implement any continuous input-output mapping to arbitrary accuracy [92]-[94]. Although two-layer networks are quite powerful, it is likely that some problems can be solved more efficiently by networks with more than two layers. Nonfinite-order predicate mappings (such as the connectedness problem [95]) can often be computed by small networks using signal feedback [96].

⁸This can be seen by noting that any Boolean function can be written in the sum-of-products form [91], and that such an expression can be realized with a two-layer network by using the first-layer Adalines to implement AND gates, while using the second-layer Adalines to implement OR gates.

In the mid-1960s, Cover studied the capacity of a feedforward signum network with an arbitrary number of layers⁹ and a single output element [61], [97]. He determined a lower bound on the minimum number of weights N_w needed to enable such a network to realize any Boolean function defined over an arbitrary set of N_p patterns in general position. Recently, Baum extended Cover's result to multi-output networks, and also used a construction argument to find corresponding upper bounds for the special case of the two-layer signum network [98]. Consider a two-layer fully connected feedforward network of signum Adalines that has N_x input components (excluding the bias inputs) and N_y output components. If this network is required to learn to map any set containing N_p patterns that are in general position to any set of binary desired response vectors (with N_y components), it follows from Baum's results¹⁰ that the minimum requisite number of weights N_w can be bounded by

    N_y N_p / (1 + log_2 N_p) ≤ N_w < N_y (N_p/N_x + 1)(N_x + N_y + 1) + N_y.    (8)

⁹Actually, the network can be an arbitrary feedforward structure and need not be layered.

¹⁰The upper bound used here is a loose bound: minimum number of hidden nodes ≤ N_y ⌈N_p/N_x⌉ < N_y (N_p/N_x + 1).

From Eq. (8), it can be shown that for a two-layer feedforward network with several times as many inputs and hidden elements as outputs (say, at least 5 times as many), the deterministic pattern capacity is bounded below by something slightly smaller than N_w/N_y. It also follows from Eq. (8) that the pattern capacity of any feedforward network with a large ratio of weights to outputs (that is, N_w/N_y at least several thousand) can be bounded above by a number somewhat larger than (N_w/N_y) log_2(N_w/N_y). Thus, the deterministic pattern capacity C_d of a two-layer network can be bounded by

    N_w/N_y - K_1 ≤ C_d ≤ (N_w/N_y) log_2(N_w/N_y) + K_2    (9)

where K_1 and K_2 are positive numbers which are small terms if the network is large with few outputs relative to the number of inputs and hidden elements.

It is easy to show that Eq. (8) also bounds the number of weights needed to ensure that N_p patterns can be learned with probability 1/2, except in this case the lower bound on N_w becomes (N_y N_p - 1)/(1 + log_2 N_p). It follows that Eq. (9) also serves to bound the statistical capacity C_s of a two-layer signum network.

It is interesting to note that the capacity bounds (9) encompass the deterministic capacity for the single-layer network comprising a bank of N_y Adalines. In this case each Adaline would have N_w/N_y weights, so the system would have a deterministic pattern capacity of N_w/N_y. As N_w becomes large, the statistical capacity also approaches N_w/N_y (for N_y finite). Until further theory on feedforward network capacity is developed, it seems reasonable to use the capacity results from the single-layer network to estimate that of multilayer networks.
In the mid-I960s, Cover studied the capacity of a feed- Little i s known about the number of binary patterns that
forward signum networkwith an arbitrary number of layersg layered sigmoid networks can learn to classify correctly.
and a single output element [61], [97. He determined a lower The pattern capacityof sigmoid networks cannot be smaller
bound o n the minimum number of weights N, needed to than that of signum networks of equal size, however,
enable such a network t o realize any Boolean function because as the weights of a sigmoid network grow toward
defined over an arbitrary set of Nppatterns in general posi- infinity, it becomes equivalent t o a signum network with
tion. Recently, Baum extended Cover’s result to multi-out- aweight vector in the same direction. Insight relating to the
put networks, and also used a construction argument to capabilities and operating principles of sigmoid networks
find corresponding upper bounds for the special case of can be winnowed from the literature [99]-[loll.
thetwo-layer signum network[98l.Consideratwo-layerfully A network’s capacity i s of little utility unless it i s accom-
connected feedforward network of signum Adalines that panied by useful generalizations to patterns not presented
has Nxinput components (excluding the bias inputs) and during training. In fact, if generalization is not needed, we
N,output components. If this network is required to learn can simply store the associations in a look-up table, and will
to map any set containing Np patterns that are in general have little need for a neural network. The relationship
position to any set of binary desired response vectors (with between generalization and pattern capacity represents a
N, components), it follows from Baum’s results” that the fundamental trade-off in neural network applications:
minimum requisite number of weights N,can be bounded the Adaline’s inability t o realize all functions i s i n a sense
by a strength rather than the fatal flaw envisioned by some crit-
ics of neural networks [95], because it helps limit the capac-
NYNP

1 + l0g,(Np)
5 N, < N
(”- +
N x 1
1 (N, + N, + 1) + N,.
(8)
ity of the device and thereby improves its ability t o gen-
eralize.
For good generalization, the training set should contain
a number of patterns at least several times larger than the
From Eq. (8), it can be shown that for a two-layer feedfor-
network‘s capacity (i.e., Np >> N,IN,). This can be under-
ward networkwith several times as many inputs and hidden
stood intuitively by noting that if the number of degrees of
elements as outputs (say, at least 5 times as many), the deter-
freedom in a network (i.e., N), i s larger than the number
ministic pattern capacity is bounded below by something
of constraints associated with the desired response func-
slightly smaller than N,/N,. It also follows from Eq. (8) that
tion (i.e., N,N,), the training procedure will be unable to
the pattern capacityof any feedforward network with a large
completely constrain the weights in the network. Appar-
ratio of weights to outputs (that is, N,IN, at least several
ently, this allows effects of initial weight conditions t o inter-
thousand) can be bounded above by a number of some-
fere with learned information and degrade the trained net-
what larger than (N,/Ny) log, (Nw/Ny).Thus, the determin-
istic pattern capacity C, of a two-layer network can be work’s ability to generalize. A detailed analysis of
generalization performance of signum networks as a func-
bounded by
tion of training ” set size i s described in 11021.
-
Nw
N,
N,
- K, IC, 5 - log,
N,
(%) + K2 (9)
A Nonlinear Classifier Application
‘This can be seen by noting that any Boolean function can be
written in the sum-of-products form [91], and that such an expres-
Neural networks have been used successfully in a wide
sion can be realized with a two-laver network bv usingthe first-laver range of applications. To gain Some insight about how
Adalines to implement AND gates, while using t h g second-layer neural networks are trained and what they can be used t o
Adalines to implement OR gates. compute, it is instructive t o consider Sejnowski and Rosen-
’Actually, the network can bean arbitrary feedforward structure berg,s 1986 NETtalk demonstration [521. With the
and need not be layered.
‘ q h e uDDer bound used here isB ~loose~bound:~minimum ’ ~ exception of work on the traveling salesman problem with
number iibden nodes 5 N, rNJN,1 < N,(NJN, + 1). Hopfield networks [103], this was the first neural network

1422 PROCEEDINGS OF THE IEEE, VOL. 78, NO. 9, SEPTEMBER 1990


application since the 1960s to draw widespread attention. line Rule I l l could be used as well. Likewise, if the sigmoid
NETtalk i s a two-layer feedforward sigmoid network with network was replaced by a similar signum network, Mada-
80 Adalines in the first layer and 26 Adalines in the second line Rule II would also work, although more first-layer Ada-
layer. The network i s trained to convert text into phonet- lines would likely be needed for comparable performance.
ically correct speech, a task well suited t o neural imple- The remainder of this paper develops and compares var-
mentation. The pronunciation of most words follows gen- ious adaptive algorithms for training Adalines and artificial
eral rules based upon spelling and word context, but there neural networks t o solve classification problems such as
are many exceptions and special cases. Rather than pro- NETtalk. These same algorithms can be used to train net-
gramming a system to respond properly to each case, the works for other problems such as those involving nonlinear
network can learn the general rules and special cases by control [SO], system identification [50], [104], signal pro-
example. cessing [30], or decision making [55].
One of the more remarkable characteristics of NETtalk
i s that it learns to pronounce words in stages suggestive of III. ADAPTATION-THE
MINIMAL PRINCIPLE
DISTURBANCE
the learning process in children. When the output of NET-
The iterative algorithms described in this paper are all
talk i s connected t o a voice synthesizer, the system makes
designed in accord with a single underlying principle. These
babbling noises during the early stages of the training pro-
techniques-the two LMS algorithms, Mays‘s rules, and the
cess. As the network learns, it next conquers the general
Perceptron procedurefortrainingasingle Adaline, theMRI
rules and, like a child, tends t o make a lot of errors by using
rulefortrainingthesimpleMadaline,aswell asMRII,MRIII,
these rules even when not appropriate. As the training con-
and backpropagation techniques for training multilayer
tinues, however, the network eventually abstracts the
Madalines-all rely upon the principle of minimal distur-
exceptions and special cases and i s able t o produce intel-
bance: Adapt to reduce the output error for the current
ligible speech with few errors.
training pattern, with minimal disturbance to responses
The operation of NETtalk is surprisingly simple. Its input
already learned. Unless this principle i s practiced, it is dif-
is a vector of seven characters (including spaces) from a
ficult to simultaneously store the required pattern
transcript of text, and its output i s phonetic information
responses. The minimal disturbance principle is intuitive.
corresponding to the pronunciation of the center (fourth)
It was the motivating idea that led t o the discovery of the
character in the seven-character input field. The other six
L M S algorithm and the Madaline rules. I n fact, the LMS
characters provide context, which helps to determine the
algorithm had existed for several months as an error-reduc-
desired phoneme. To read text, the seven-character win-
tion rule before it was discovered that the algorithm uses
dow i s scanned across a document in computer memory
an instantaneous gradient t o follow the path of steepest
and the networkgenerates a sequenceof phonetic symbols
descent and minimizethe mean-squareerrorofthetraining
that can be used to control a speech synthesizer. Each of
set. It was then given the name “LMS” (least mean square)
the seven characters at the network‘s input i s a 29-corn-
algorit h m.
ponent binary vector, with each component representing
adifferent alphabetic character or punctuation mark. A one
IV. ERROR CORRECTION
RULES-SINGLE THRESHOLD
ELEMENT
is placed in the component associated with the represented
character; all other components are set t o zero.’’ As adaptive algorithms evolved, principally two kinds of
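The sparse input coding described above is easy to reproduce. The sketch below builds a 7 × 29 = 203-component binary vector for one text window; the one-hot structure follows the description in the text, but the particular 29-symbol alphabet and its ordering are assumptions for illustration (the paper does not list them).

    # NETtalk-style sparse input coding: 7 characters, each a 29-component one-hot vector.
    ALPHABET = "abcdefghijklmnopqrstuvwxyz ,."      # 29 symbols (assumed set and ordering)
    assert len(ALPHABET) == 29

    def encode_window(window):
        """Concatenate one-hot codes for a 7-character window of text."""
        assert len(window) == 7
        vector = []
        for ch in window.lower():
            code = [0] * len(ALPHABET)
            code[ALPHABET.index(ch)] = 1            # a single one marks the character
            vector.extend(code)
        return vector

    x = encode_window(" hello ")                    # the center (fourth) character is 'l'
    print(len(x), sum(x))                           # 203 components, exactly seven ones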
Thesystem’s26outputscorrespondto23 articulatoryfea- on-line rules have come t o exist. Error-correction rules alter
tures and 3 additional features which encode stress and syl- the weights of a network to correct error in the output
lable boundaries. When training the network, the desired response to the present input pattern. Gradient rules alter
response vector has zeros in all components except those the weights of a network during each pattern presentation
which correspond to the phonetic features associated with by gradient descent with the objective of reducing mean-
the center character in the input field. I n one experiment, square error, averaged over all training patterns. Both types
Sejnowski and Rosenberg had the system scan a 1024-word of rules invoke similar training procedures. Because they
transcript of phonetically transcribed continuous speech. are based upon different objectives, however, they can have
With the presentation of each seven-character window, the significantly different learning characteristics.
system‘s weights were trained by the backpropagation Error-correction rules, of necessity, often tend to be a d
algorithm i n response to the network’s output error. After hoc. They are most often used when training objectives are
roughly 50 presentations of the entire training set, the net- not easilyquantified, orwhen a problem does not lend itself
work was able t o produce accurate speech from data the t o tractable analysis. A common application, for instance,
network had not been exposed to during training. concerns training neural networks that contain discontin-
Backpropagation is not the only technique that might be uous functions. An exception i s the WLMS algorithm, an
used t o train NETtalk. In other experiments, the slower error-correction rule that has proven to be an extremely
Boltzmann learning method was used, and, in fact, Mada- useful technique for finding solutions to well-defined and
tractable linear problems.
We begin with error-correction rules applied initially to
”The input representation often has a considerable impact on single Adaline elements, and then to networks of Adalines.
the successof a network. In NETtalk,the inputs are sparselycoded
in 29 components. One might consider instead choosing a 5-bit A. Linear Rules
binary representation of the 7-bit ASCII code. It should be clear,
however, that in this case the sparse representation helps simplify Linear error-correction rules alter the weights of the
the network’s job of interpreting input characters as 29 distinct
symbols. Usually the appropriate input encoding i s not difficult to adaptive threshold elementwith each pattern presentation
decide. When intuition fails, however, one sometimes must exper- to make an error correction proportional to the error itself.
iment with different encodings to find one that works well. The one linear rule, a-LMS, i s described next.

WIDROW AND LEHR PERCEPTRON, MADALINE, AND BACKPROPACATIO\ 1423

~
The a-LMS Algorithm: The a-LMS algorithm or Widrow- X = input pattern vector
Hoff delta rule applied to the adaptation of a single Adaline A ~

(Fig. 2) embodies the minimal disturbance principle. The W = next weight vector
weight update equation for the original form of the algo-
rithm can be written as
-Awk = weight vector
change

The time index or adaptation cycle number i s k . wk+,i s the


next value of the weight vector, wk is the present value of
the weight vector, and x k i s the present input pattern vector.
The present linear error E k i s defined to be the difference
between the desired response dk and the linear output sk
/
x

Fig. 12. Weight correction by the L M S rule.


= w$k before adaptation:

€k dk - w,'x,. (11) selects A w k t o be collinear with Xk, the desired error cor-
Changing the weights yields a corresponding change in the rection is achieved with a weight change of the smallest
error: possible magnitude. When adapting to respond properly
to a new input pattern, the responses to previous training
AEk = A(dk - W&) = -xiAwk. (12) patterns are therefore minimally disturbed, on the average.
I n accordance with the a-LMS rule of Eq. (IO), the weight The a-LMS algorithm corrects error, and if all input pat-
change i s as follows: terns are all of equal length, it minimizes mean-square error
[30]. The algorithm i s best known for this property.
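A minimal sketch of the α-LMS update of Eq. (10), applied to a single Adaline on a small ±1 training set; the data, the value of α, and the number of passes are illustrative.

    # alpha-LMS sketch, Eq. (10): W <- W + alpha * e * X / |X|^2, with e = d - W^T X (Eq. (11)).
    def alpha_lms_step(w, x, d, alpha=0.5):
        s = sum(wi * xi for wi, xi in zip(w, x))        # linear output s_k
        e = d - s                                       # linear error, Eq. (11)
        norm_sq = sum(xi * xi for xi in x)              # |X_k|^2
        return [wi + alpha * e * xi / norm_sq for wi, xi in zip(w, x)]

    # Train on the linearly separable function of Eq. (4); x[0] = +1 is the bias input.
    training = [([+1, +1, +1], +1), ([+1, +1, -1], +1),
                ([+1, -1, -1], +1), ([+1, -1, +1], -1)]
    w = [0.0, 0.0, 0.0]                                 # start from the zero weight vector
    for _ in range(20):                                 # a few passes over the training set
        for x, d in training:
            w = alpha_lms_step(w, x, d)

    outputs = [1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1 for x, _ in training]
    print(w, outputs)                                   # quantized responses match Eq. (4)

Because the ±1 patterns all have the same length, this run also illustrates the mean-square-error property noted above: the weights settle near the least-squares solution for the linear outputs.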

B. Nonlinear Rules

The α-LMS algorithm is a linear rule that makes error corrections that are proportional to the error. It is known [105] that in some cases this linear rule may fail to separate training patterns that are linearly separable. Where this creates difficulties, nonlinear rules may be used. In the next sections, we describe early nonlinear rules, which were devised by Rosenblatt [106], [5] and Mays [105]. These nonlinear rules also make weight vector changes collinear with the input pattern vector (the direction which causes minimal disturbance), changes that are based on the linear error but are not directly proportional to it.

The Perceptron Learning Rule: The Rosenblatt α-Perceptron [106], [5], diagrammed in Fig. 13, processed input patterns with a first layer of sparse randomly connected fixed logic devices. The outputs of the fixed first layer fed a second layer, which consisted of a single adaptive linear threshold element. Other than the convention that its input signals were {1, 0} binary, and that no bias weight was included, this element is equivalent to the Adaline element. The learning rule for the α-Perceptron is very similar to LMS, but its behavior is in fact quite different.

[Fig. 13. Rosenblatt's α-Perceptron.]
It is interesting to note that Rosenblatt's Perceptron learning rule was first presented in 1960 [106], and Widrow and Hoff's LMS rule was first presented the same year, a few months later [59]. These rules were developed independently in 1959.

The adaptive threshold element of the α-Perceptron is shown in Fig. 14. Adapting with the Perceptron rule makes use of the "quantizer error" ε̃_k, defined to be the difference between the desired response and the output of the quantizer

ε̃_k ≜ d_k - y_k.   (17)

Fig. 14. The adaptive threshold element of the Perceptron (binary {+1, -1} output and binary {+1, -1} desired response used as the training signal).

The Perceptron rule, sometimes called the Perceptron convergence procedure, does not adapt the weights if the output decision y_k is correct, that is, if ε̃_k = 0. If the output decision disagrees with the binary desired response d_k, however, adaptation is effected by adding the input vector to the weight vector when the error ε̃_k is positive, or subtracting the input vector from the weight vector when the error ε̃_k is negative. Thus, half the product of the input vector and the quantizer error ε̃_k is added to the weight vector. The Perceptron rule is identical to the α-LMS algorithm, except that with the Perceptron rule, half of the quantizer error ε̃_k/2 is used in place of the normalized linear error ε_k/|X_k|² of the α-LMS rule. The Perceptron rule is nonlinear, in contrast to the LMS rule, which is linear (compare Figs. 2 and 14). Nonetheless, the Perceptron rule can be written in a form very similar to the α-LMS rule of Eq. (10):

W_{k+1} = W_k + α (ε̃_k/2) X_k.   (18)

Rosenblatt normally set α to one. In contrast to α-LMS, the choice of α does not affect the stability of the Perceptron algorithm, and it affects convergence time only if the initial weight vector is nonzero. Also, while α-LMS can be used with either analog or binary desired responses, Rosenblatt's rule can be used only with binary desired responses.

The Perceptron rule stops adapting when the training patterns are correctly separated. There is no restraining force controlling the magnitude of the weights, however. The direction of the weight vector, not its magnitude, determines the decision function. The Perceptron rule has been proven to be capable of separating any linearly separable set of training patterns [5], [107], [46], [105]. If the training patterns are not linearly separable, the Perceptron algorithm goes on forever, and often does not yield a low-error solution, even if one exists. In most cases, if the training set is not separable, the weight vector tends to gravitate toward zero¹² so that even if α is very small, each adaptation can dramatically affect the switching function implemented by the Perceptron.

This behavior is very different from that of the α-LMS algorithm. Continued use of α-LMS does not lead to an unreasonable weight solution if the pattern set is not linearly separable. Nor, however, is this algorithm guaranteed to separate any linearly separable pattern set. α-LMS typically comes close to achieving such separation, but its objective is different: error reduction at the linear output of the adaptive element.

Rosenblatt also introduced variants of the fixed-increment rule that we have discussed thus far. A popular one was the absolute-correction version of the Perceptron rule.¹³ This rule is identical to that stated in Eq. (18) except the increment size α is chosen with each presentation to be the smallest integer which corrects the output error in one presentation. If the training set is separable, this variant has all the characteristics of the fixed-increment version with α set to 1, except that it usually reaches a solution in fewer presentations.

Mays's Algorithms: In his Ph.D. thesis [105], Mays described an "increment adaptation" rule¹⁴ and a "modified relaxation adaptation" rule. The fixed-increment version of the Perceptron rule is a special case of the increment adaptation rule.

Increment adaptation in its general form involves the use of a "dead zone" for the linear output s_k, equal to ±γ about zero. All desired responses are ±1 (refer to Fig. 14). If the linear output s_k falls outside the dead zone (|s_k| ≥ γ), adaptation follows a normalized variant of the fixed-increment Perceptron rule (with α/|X_k|² used in place of α). If the linear output falls within the dead zone, whether or not the output response y_k is correct, the weights are adapted by the normalized variant of the Perceptron rule as though the output response y_k had been incorrect. The weight update rule for Mays's increment adaptation algorithm can be written mathematically as

W_{k+1} = W_k + α (ε̃_k/2) X_k / |X_k|²   if |s_k| ≥ γ
W_{k+1} = W_k + α d_k X_k / |X_k|²        if |s_k| < γ   (19)

where ε̃_k is the quantizer error of Eq. (17).

¹²This results because the length of the weight vector decreases with each adaptation that does not cause the linear output s_k to change sign and assume a magnitude greater than that before adaptation. Although there are exceptions, for most problems this situation occurs only rarely if the weight vector is much longer than the weight increment vector.

¹³The terms "fixed-increment" and "absolute correction" are due to Nilsson [46]. Rosenblatt referred to methods of these types, respectively, as quantized and nonquantized learning rules.

¹⁴The increment adaptation rule was proposed by others before Mays, though from a different perspective [107].



With the dead zone γ = 0, Mays's increment adaptation algorithm reduces to a normalized version of the Perceptron rule (18). Mays proved that if the training patterns are linearly separable, increment adaptation will always converge and separate the patterns in a finite number of steps. He also showed that use of the dead zone reduces sensitivity to weight errors. If the training set is not linearly separable, Mays's increment adaptation rule typically performs much better than the Perceptron rule because a sufficiently large dead zone tends to cause the weight vector to adapt away from zero when any reasonably good solution exists. In such cases, the weight vector may sometimes appear to meander rather aimlessly, but it will typically remain in a region associated with relatively low average error.

The increment adaptation rule changes the weights with increments that generally are not proportional to the linear error ε_k. The other Mays rule, modified relaxation, is closer to α-LMS in its use of the linear error ε_k (refer to Fig. 2). The desired response and the quantizer output levels are binary ±1. If the quantizer output y_k is wrong or if the linear output s_k falls within the dead zone ±γ, adaptation follows α-LMS to reduce the linear error. If the quantizer output y_k is correct and the linear output s_k falls outside the dead zone, the weights are not adapted. The weight update rule for this algorithm can be written as

W_{k+1} = W_k                            if ε̃_k = 0 and |s_k| ≥ γ
W_{k+1} = W_k + α ε_k X_k / |X_k|²        otherwise   (20)

where ε̃_k is the quantizer error of Eq. (17).

If the dead zone γ is set to ∞, this algorithm reduces to the α-LMS algorithm (10). Mays showed that, for dead zone 0 < γ < 1 and learning rate 0 < α ≤ 2, this algorithm will converge and separate any linearly separable input set in a finite number of steps. If the training set is not linearly separable, this algorithm performs much like Mays's increment adaptation rule.

Mays's two algorithms achieve similar pattern separation results. The choice of α does not affect stability, although it does affect convergence time. The two rules differ in their convergence properties but there is no consensus on which is the better algorithm. Algorithms like these can be quite useful, and we believe that there are many more to be invented and analyzed.

The α-LMS algorithm, the Perceptron procedure, and Mays's algorithms can all be used for adapting the single Adaline element or they can be incorporated into procedures for adapting networks of such elements. Multilayer network adaptation procedures that use some of these algorithms are discussed in the following.

V. ERROR-CORRECTION RULES-MULTI-ELEMENT NETWORKS

The algorithms discussed next are the Widrow-Hoff Madaline rule from the early 1960s, now called Madaline Rule I (MRI), and Madaline Rule II (MRII), developed by Widrow and Winter in 1987.

A. Madaline Rule I

The MRI rule allows the adaptation of a first layer of hard-limited (signum) Adaline elements whose outputs provide inputs to a second layer, consisting of a single fixed-threshold-logic element which may be, for example, the OR gate, AND gate, or majority-vote-taker discussed previously. The weights of the Adalines are initially set to small random values.

Fig. 15. A five-Adaline example of the Madaline I architecture.

Figure 15 shows a Madaline I architecture with five fully connected first-layer Adalines. The second layer is a majority element (MAJ). Because the second-layer logic element is fixed and known, it is possible to determine which first-layer Adalines can be adapted to correct an output error. The Adalines in the first layer assist each other in solving problems by automatic load-sharing.

One procedure for training the network in Fig. 15 follows. A pattern is presented, and if the output response of the majority element matches the desired response, no adaptation takes place. However, if, for instance, the desired response is +1 and three of the five Adalines read -1 for a given input pattern, one of the latter three must be adapted to the +1 state. The element that is adapted by MRI is the one whose linear output s_k is closest to zero, that is, the one whose analog response is closest to the desired response. If more of the Adalines were originally in the -1 state, enough of them are adapted to the +1 state to make the majority decision equal +1. The elements adapted are those whose linear outputs are closest to zero. A similar procedure is followed when the desired response is -1. When adapting a given element, the weight vector can be moved in the LMS direction far enough to reverse the Adaline's output (absolute correction, or "fast" learning), or it can be adapted by the small increment determined by the α-LMS algorithm (statistical, or "slow" learning). The one desired response d_k is used for all Adalines that are adapted. The procedure can also be modified to allow one of Mays's rules to be used. In that event, for the case we have considered (majority output element), adaptations take place if at least half of the Adalines either have outputs differing from the desired response or have analog outputs which are in the dead zone. By setting the dead zone of Mays's increment adaptation rule to zero, the weights can also be adapted by Rosenblatt's Perceptron rule.

Differences in initial conditions and the results of subsequent adaptation cause the various elements to take "responsibility" for certain parts of the training problem. The basic principle of load sharing is summarized thus: Assign responsibility to the Adaline or Adalines that can most easily assume it.



In Fig. 15, the "job assigner," a purely mechanized process, assigns responsibility during training by transferring the appropriate adapt commands and desired response signals to the selected Adalines. The job assigner utilizes linear-output information. Load sharing is important, since it results in the various adaptive elements developing individual weight vectors. If all the weight vectors were the same, there would be no point in having more than one element in the first layer.

When training the Madaline, the pattern presentation sequence should be random. Experimenting with this, Ridgway [76] found that cyclic presentation of the patterns could lead to cycles of adaptation. These cycles would cause the weights of the entire Madaline to cycle, preventing convergence.

The adaptive system of Fig. 15 was suggested by common sense, and was found to work well in simulations. Ridgway found that the probability that a given Adaline will be adapted in response to an input pattern is greatest if that element had taken such responsibility during the previous adapt cycle when the pattern was most recently presented. The division of responsibility stabilizes at the same time that the responses of individual elements stabilize to their share of the load. When the training problem is not perfectly separable by this system, the adaptation process tends to minimize error probability, although it is possible for the algorithm to "hang up" on local optima.

The Madaline structure of Fig. 15 has two layers: the first layer consists of adaptive logic elements, the second of fixed logic. A variety of fixed-logic devices could be used for the second layer. A variety of MRI adaptation rules were devised by Hoff [75] that can be used with all possible fixed-logic output elements. An easily described training procedure results when the output element is an OR gate. During training, if the desired output for a given input pattern is +1, only the one Adaline whose linear output is closest to zero would be adapted if any adaptation is needed, in other words, if all Adalines give -1 outputs. If the desired output is -1, all elements must give -1 outputs, and any giving +1 outputs must be adapted.

The MRI rule obeys the "minimal disturbance principle" in the following sense. No more Adaline elements are adapted than necessary to correct the output decision and any dead-zone constraint. The elements whose linear outputs are nearest to zero are adapted because they require the smallest weight changes to reverse their output responses. Furthermore, whenever an Adaline is adapted, the weights are changed in the direction of its input vector, providing the requisite error correction with minimal weight change.

B. Madaline Rule II

The MRI rule was recently extended to allow the adaptation of multilayer binary networks by Winter and Widrow with the introduction of Madaline Rule II (MRII) [43], [83], [108]. A typical two-layer MRII network is shown in Fig. 16. The weights in both layers are adaptive.

Fig. 16. Typical two-layer Madaline II architecture.

Training with the MRII rule is similar to training with the MRI algorithm. The weights are initially set to small random values. Training patterns are presented in a random sequence. If the network produces an error during a training presentation, we begin by adapting first-layer Adalines. By the minimal disturbance principle, we select the first-layer Adaline with the smallest linear output magnitude and perform a "trial adaptation" by inverting its binary output. This can be done without adaptation by adding a perturbation Δs of suitable amplitude and polarity to the Adaline's sum (refer to Fig. 16). If the output Hamming error is reduced by this bit inversion, that is, if the number of output errors is reduced, the perturbation Δs is removed and the weights of the selected Adaline element are changed by α-LMS in a direction collinear with the corresponding input vector, the direction that reinforces the bit reversal with minimal disturbance to the weights. Conversely, if the trial adaptation does not improve the network response, no weight adaptation is performed.

After finishing with the first element, we perturb and update other Adalines in the first layer which have "sufficiently small" linear-output magnitudes. Further error reductions can be achieved, if desired, by reversing pairs, triples, and so on, up to some predetermined limit. After exhausting possibilities with the first layer, we move on to the next layer and proceed in a like manner. When the final layer is reached, each of the output elements is adapted by α-LMS. At this point, a new training pattern is selected at random and the procedure is repeated. The goal is to reduce Hamming error with each presentation, thereby hopefully minimizing the average Hamming error over the training set. Like MRI, the procedure can be modified so that adaptations follow an absolute correction rule or one of Mays's rules rather than α-LMS. Like MRI, MRII can "hang up" on local optima.
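A minimal sketch (ours) of the first-layer portion of an MRII presentation for a two-layer signum network follows; the function names, the bias handling, and the restriction to single inversions (no pairs or triples) are our simplifications.

```python
import numpy as np

def sgn(v):
    return np.where(v >= 0, 1.0, -1.0)

def hamming_error(W1, W2, x, d, flip=None):
    """Output Hamming error of a two-layer signum Madaline; if flip is given,
    the binary output of that first-layer Adaline is inverted (trial adaptation)."""
    h = sgn(W1 @ x)
    if flip is not None:
        h[flip] = -h[flip]
    y = sgn(W2 @ np.concatenate(([1.0], h)))      # bias input prepended for the second layer
    return int(np.sum(y != d))

def mrii_first_layer_pass(W1, W2, x, d, alpha=0.1):
    """Trial-invert first-layer Adalines, smallest |s| first, and reinforce by α-LMS
    only those inversions that reduce the output Hamming error."""
    for i in np.argsort(np.abs(W1 @ x)):          # minimal-disturbance ordering
        before = hamming_error(W1, W2, x, d)
        if before == 0:
            break
        if hamming_error(W1, W2, x, d, flip=i) < before:
            target = -sgn(W1[i] @ x)              # the inverted binary output
            eps = target - W1[i] @ x
            W1[i] = W1[i] + alpha * eps * x / (x @ x)   # reinforce the reversal along x
    return W1
```

A full MRII presentation would continue with pairs and triples, then adapt the output-layer Adalines directly by α-LMS, as described above.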

VI. STEEPEST-DESCENT RULES-SINGLE THRESHOLD ELEMENT

Thus far, we have described a variety of adaptation rules that act to reduce error with the presentation of each training pattern. Often, the objective of adaptation is to reduce error averaged in some way over the training set. The most common error function is mean-square error (MSE), although in some situations other error criteria may be more appropriate [109]-[111]. The most popular approaches to MSE reduction in both single-element and multi-element networks are based upon the method of steepest descent. More sophisticated gradient approaches such as quasi-Newton [30], [112]-[114] and conjugate gradient [114], [115] techniques often have better convergence properties, but the conditions under which the additional complexity is warranted are not generally known. The discussion that follows is restricted to minimization of MSE by the method of steepest descent [116], [117]. More sophisticated learning procedures usually require many of the same computations used in the basic steepest-descent procedure.

Adaptation of a network by steepest descent starts with an arbitrary initial value W_0 for the system's weight vector. The gradient of the MSE function is measured and the weight vector is altered in the direction corresponding to the negative of the measured gradient. This procedure is repeated, causing the MSE to be successively reduced on average and causing the weight vector to approach a locally optimal value.

The method of steepest descent can be described by the relation

W_{k+1} = W_k + μ(-∇_k)   (21)

where μ is a parameter that controls stability and rate of convergence, and ∇_k is the value of the gradient at a point on the MSE surface corresponding to W = W_k.

To begin, we derive rules for steepest-descent minimization of the MSE associated with a single Adaline element. These rules are then generalized to apply to full-blown neural networks. Like error-correction rules, the most practical and efficient steepest-descent rules typically work with one pattern at a time. They minimize mean-square error, approximately, averaged over the entire set of training patterns.

A. Linear Rules

Steepest-descent rules for the single threshold element are said to be linear if weight changes are proportional to the linear error, the difference between the desired response d_k and the linear output of the element s_k.

Mean-Square-Error Surface of the Linear Combiner: In this section we demonstrate that the MSE surface of the linear combiner of Fig. 1 is a quadratic function of the weights, and thus easily traversed by gradient descent.

Let the input pattern X_k and the associated desired response d_k be drawn from a statistically stationary population. During adaptation, the weight vector varies so that even with stationary inputs, the output s_k and error ε_k will generally be nonstationary. Care must be taken in defining the MSE since it is time-varying. The only possibility is an ensemble average, defined below.

At the kth iteration, let the weight vector be W_k. Squaring and expanding Eq. (11) yields

ε_k² = (d_k - X_k^T W_k)²   (22)
     = d_k² - 2 d_k X_k^T W_k + W_k^T X_k X_k^T W_k.   (23)

Now assume an ensemble of identical adaptive linear combiners, each having the same weight vector W_k at the kth iteration. Let each combiner have individual inputs X_k and d_k derived from stationary ergodic ensembles. Each combiner will produce an individual error ε_k represented by Eq. (23). Averaging Eq. (23) over the ensemble yields

E[ε_k²]|_{W=W_k} = E[d_k²] - 2 E[d_k X_k^T] W_k + W_k^T E[X_k X_k^T] W_k.   (24)

Defining the vector P as the crosscorrelation between the desired response (a scalar) and the X-vector¹⁵ then yields

P ≜ E[d_k X_k].   (25)

The input correlation matrix R is defined in terms of the ensemble average

R ≜ E[X_k X_k^T]   (26)

whose element in row i and column j is E[x_ik x_jk]. This matrix is real, symmetric, and positive definite, or in rare cases, positive semi-definite. The MSE ξ_k can thus be expressed as

ξ_k ≜ E[ε_k²]|_{W=W_k} = E[d_k²] - 2 P^T W_k + W_k^T R W_k.   (27)

Note that the MSE is a quadratic function of the weights. It is a convex hyperparaboloidal surface, a function that never goes negative. Figure 17 shows a typical MSE surface for a linear combiner with two weights. The position of a point on the grid in this figure represents the value of the Adaline's two weights. The height of the surface at each point represents MSE over the training set when the Adaline's weights are fixed at the values associated with the grid point. Adjusting the weights involves descending along this surface toward the unique minimum point ("the bottom of the bowl") by the method of steepest descent.

Fig. 17. Typical mean-square-error surface of a linear combiner.

The gradient ∇_k of the MSE function with W = W_k is obtained by differentiating Eq. (27):

∇_k ≜ ∂ξ/∂W |_{W=W_k} = -2P + 2RW_k.   (28)

¹⁵We assume here that X includes a bias component x_{0k} = +1.
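The quantities in Eqs. (24)-(28) are easy to check numerically. The short sketch below (ours) estimates R and P as averages over a small invented training set and verifies that the quadratic form of Eq. (27) matches the directly measured mean-square error; all variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])    # patterns with bias x0 = +1
d = X @ np.array([0.5, 1.0, -2.0]) + 0.1 * rng.normal(size=200)  # invented desired responses

R = X.T @ X / len(X)               # input correlation matrix, Eq. (26)
P = X.T @ d / len(X)               # crosscorrelation vector, Eq. (25)

def mse(w):
    """Quadratic MSE of Eq. (27)."""
    return np.mean(d**2) - 2 * P @ w + w @ R @ w

w = np.zeros(3)
grad = -2 * P + 2 * R @ w          # gradient of Eq. (28)
print(mse(w), np.mean((d - X @ w)**2))   # the two MSE evaluations agree
w_star = np.linalg.solve(R, P)     # optimal (Wiener) weights, Eq. (29) of the next section
```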

This is a linear function of the weights. The optimal weight vector W*, generally called the Wiener weight vector, is obtained from Eq. (28) by setting the gradient to zero:

W* = R⁻¹P.   (29)

This is a matrix form of the Wiener-Hopf equation [118]-[120]. In the next section we examine μ-LMS, an algorithm which enables us to obtain an accurate estimate of W* without first computing R⁻¹ and P.

The μ-LMS Algorithm: The μ-LMS algorithm works by performing approximate steepest descent on the MSE surface in weight space. Because it is a quadratic function of the weights, this surface is convex and has a unique (global) minimum.¹⁶ An instantaneous gradient based upon the square of the instantaneous linear error is

∇̂_k ≜ ∂ε_k²/∂W_k = [∂ε_k²/∂w_{0k}  ∂ε_k²/∂w_{1k}  ···  ∂ε_k²/∂w_{nk}]^T.   (30)

LMS works by using this crude gradient estimate in place of the true gradient ∇_k of Eq. (28). Making this replacement into Eq. (21) yields

W_{k+1} = W_k + μ(-∇̂_k) = W_k - μ ∂ε_k²/∂W_k.   (31)

The instantaneous gradient is used because it is readily available from a single data sample. The true gradient is generally difficult to obtain. Computing it would involve averaging the instantaneous gradients associated with all patterns in the training set. This is usually impractical and almost always inefficient.

Performing the differentiation in Eq. (31) and replacing the linear error by definition (11) gives

W_{k+1} = W_k - 2με_k ∂(d_k - X_k^T W_k)/∂W_k.   (32)

Noting that d_k and X_k are independent of W_k yields

W_{k+1} = W_k + 2με_k X_k.   (33)

This is the μ-LMS algorithm. The learning constant μ determines stability and convergence rate. For input patterns independent over time, convergence of the mean and variance of the weight vector is ensured [30] for most practical purposes if

0 < μ < 1/trace[R]   (34)

where trace[R] = Σ(diagonal elements of R) is the average signal power of the X-vectors, that is, E(X^T X). With μ set within this range,¹⁷ the μ-LMS algorithm converges in the mean to W*, the optimal Wiener solution discussed above. A proof of this can be found in [30].

In the μ-LMS algorithm, and other iterative steepest-descent procedures, use of the instantaneous gradient is perfectly justified if the step size is small. For small μ, W will remain essentially constant over a relatively small number of training presentations K. The total weight change during this period will be proportional to

-Σ_{i=0}^{K-1} ∇̂_{k+i} ≈ -K ∂ξ/∂W |_{W=W_k}   (35)

where ξ denotes the MSE function. Thus, on average the weights follow the true gradient. It is shown in [30] that the instantaneous gradient is an unbiased estimate of the true gradient.

Comparison of μ-LMS and α-LMS: We have now presented two forms of the LMS algorithm, α-LMS (10) in Section IV-A and μ-LMS (33) in the last section. They are very similar algorithms, both using the LMS instantaneous gradient. α-LMS is self-normalizing, with the parameter α determining the fraction of the instantaneous error to be corrected with each adaptation. μ-LMS is a constant-coefficient linear algorithm which is considerably easier to analyze than α-LMS. Comparing the two, the α-LMS algorithm is like the μ-LMS algorithm with a continually variable learning constant. Although α-LMS is somewhat more difficult to implement and analyze, it has been demonstrated experimentally to be a better algorithm than μ-LMS when the eigenvalues of the input autocorrelation matrix R are highly disparate, giving faster convergence for a given level of gradient noise¹⁸ propagated into the weights. It will be shown next that μ-LMS has the advantage that it will always converge in the mean to the minimum MSE solution, while α-LMS may converge to a somewhat biased solution.

We begin with α-LMS of Eq. (10):

W_{k+1} = W_k + α ε_k X_k / |X_k|².   (36)

Replacing the error with its definition (11) and rearranging terms yields

W_{k+1} = W_k + α (d_k - X_k^T W_k) X_k / |X_k|²   (37)
        = W_k + α (d_k/|X_k| - W_k^T X_k/|X_k|) X_k/|X_k|.   (38)

We define a new training set of pattern vectors and desired responses {X̄_k, d̄_k} by normalizing elements of the original training set as follows,¹⁹

X̄_k ≜ X_k/|X_k|,    d̄_k ≜ d_k/|X_k|.   (39)

¹⁶If the autocorrelation matrix of the pattern vector set has m zero eigenvalues, the minimum MSE solution will be an m-dimensional subspace in weight space [30].

¹⁷Horowitz and Senne [121] have proven that (34) is not sufficient in general to guarantee convergence of the weight vector's variance. For input patterns generated by a zero-mean Gaussian process independent over time, instability can occur in the worst case if μ is greater than 1/(3 trace [R]).

¹⁸Gradient noise is the difference between the gradient estimate and the true gradient.

¹⁹The idea of a normalized training set was suggested by Derrick Nguyen.



Eq. (38) then becomes

W_{k+1} = W_k + α (d̄_k - W_k^T X̄_k) X̄_k.   (40)

This is the μ-LMS rule of Eq. (33) with 2μ replaced by α. The weight adaptations chosen by the α-LMS rule are equivalent to those of the μ-LMS algorithm presented with a different training set, the normalized training set defined by (39). The solution that will be reached by the μ-LMS algorithm is the Wiener solution of this training set

W̄* = R̄⁻¹ P̄   (41)

where

R̄ ≜ E[X̄_k X̄_k^T]   (42)

is the input correlation matrix of the normalized training set and the vector

P̄ ≜ E[d̄_k X̄_k]   (43)

is the crosscorrelation between the normalized input and the normalized desired response. Therefore α-LMS converges in the mean to the Wiener solution of the normalized training set. When the input vectors are binary with ±1 components, all input vectors have the same magnitude and the two algorithms are equivalent. For nonbinary training patterns, however, the Wiener solution of the normalized training set generally is no longer equal to that of the original problem, so α-LMS converges in the mean to a somewhat biased version of the optimal least-squares solution.

The idea of a normalized training set can also be used to relate the stable ranges for the learning constants α and μ in the two algorithms. The stable range for α in the α-LMS algorithm given in Eq. (15) can be computed from the corresponding range for μ given in Eq. (34) by replacing R and μ in Eq. (34) by R̄ and α/2, respectively, and then noting that trace[R̄] is equal to one:

0 < α < 2/trace[R̄], or
0 < α < 2.   (44)

B. Nonlinear Rules

The Adaline elements considered thus far use at their outputs either hard-limiting quantizers (signums), or no nonlinearity at all. The input-output mapping of the hard-limiting quantizer is y_k = sgn(s_k). Other forms of nonlinearity have come into use in the past two decades, primarily of the sigmoid type. These nonlinearities provide saturation for decision making, yet they have differentiable input-output characteristics that facilitate adaptivity. We generalize the definition of the Adaline element to include the possible use of a sigmoid in place of the signum, and then determine suitable adaptation algorithms.

Fig. 18 shows a "sigmoid Adaline" element which incorporates a sigmoidal nonlinearity. The input-output relation of the sigmoid can be denoted by y_k = sgm(s_k). A typical sigmoid function is the hyperbolic tangent:

sgm(s_k) = tanh(s_k).   (45)

Fig. 18. Adaline with sigmoidal nonlinearity.

We shall adapt this Adaline with the objective of minimizing the mean square of the sigmoid error ε̃_k, defined as

ε̃_k ≜ d_k - y_k = d_k - sgm(s_k).   (46)

Backpropagation for the Sigmoid Adaline: Our objective is to minimize E[(ε̃_k)²], averaged over the set of training patterns, by proper choice of the weight vector. To accomplish this, we shall derive a backpropagation algorithm for the sigmoid Adaline element. An instantaneous gradient is obtained with each input vector presentation, and the method of steepest descent is used to minimize error as was done with the μ-LMS algorithm of Eq. (33).

Referring to Fig. 18, the instantaneous gradient estimate obtained during presentation of the kth input vector X_k is given by

∇̂_k = ∂(ε̃_k)²/∂W_k = 2ε̃_k ∂ε̃_k/∂W_k.   (47)

Differentiating Eq. (46) yields

∂ε̃_k/∂W_k = -sgm'(s_k) ∂s_k/∂W_k.   (48)

We may note that

s_k = X_k^T W_k.   (49)

Therefore,

∂s_k/∂W_k = X_k.   (50)

Substituting into Eq. (48) gives

∂ε̃_k/∂W_k = -sgm'(s_k) X_k.   (51)

Inserting this into Eq. (47) yields

∇̂_k = -2ε̃_k sgm'(s_k) X_k.   (52)

Using this gradient estimate with the method of steepest descent provides a means for minimizing the mean-square error even after the summed signal s_k goes through the nonlinear sigmoid. The algorithm is

W_{k+1} = W_k + μ(-∇̂_k)   (53)
        = W_k + 2μ ε̃_k sgm'(s_k) X_k.   (54)

Algorithm (54) is the backpropagation algorithm for the sigmoid Adaline element.
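A minimal sketch (ours) of the single-element backpropagation rule of Eq. (54), specialized to the hyperbolic-tangent sigmoid as in Eq. (56) below:

```python
import numpy as np

def sigmoid_adaline_step(w, x, d, mu=0.05):
    """Backpropagation for a single sigmoid Adaline, Eqs. (46) and (54).

    With sgm(s) = tanh(s), the derivative is sgm'(s) = 1 - tanh(s)**2 = 1 - y**2.
    """
    y = np.tanh(w @ x)                           # sigmoid output
    eps_s = d - y                                # sigmoid error, Eq. (46)
    return w + 2 * mu * eps_s * (1 - y**2) * x   # Eq. (54) with the tanh derivative
```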



The backpropagation name makes more sense when the algorithm is utilized in a layered network, which will be studied below. Implementation of algorithm (54) is illustrated in Fig. 19.

Fig. 19. Implementation of backpropagation for the sigmoid Adaline element.

If the sigmoid is chosen to be the hyperbolic tangent function (45), then the derivative sgm'(s_k) is given by

sgm'(s_k) = ∂(tanh(s_k))/∂s_k = 1 - (tanh(s_k))² = 1 - y_k².   (55)

Accordingly, Eq. (54) becomes

W_{k+1} = W_k + 2μ ε̃_k (1 - y_k²) X_k.   (56)

Madaline Rule III for the Sigmoid Adaline: The implementation of algorithm (54) (Fig. 19) requires accurate realization of the sigmoid function and its derivative function. These functions may not be realized accurately when implemented with analog hardware. Indeed, in an analog network, each Adaline will have its own individual nonlinearities. Difficulties in adaptation have been encountered in practice with the backpropagation algorithm because of imperfections in the nonlinear functions.

To circumvent these problems a new algorithm has been devised by David Andes for adapting networks of sigmoid Adalines. This is the Madaline Rule III (MRIII) algorithm. The idea of MRIII for a sigmoid Adaline is illustrated in Fig. 20. The derivative of the sigmoid function is not used here. Instead, a small perturbation signal Δs is added to the sum s_k, and the effect of this perturbation upon the output y_k and error ε̃_k is noted.

Fig. 20. Implementation of the MRIII algorithm for the sigmoid Adaline element.

An instantaneous estimated gradient can be obtained as follows:

∇̂_k = ∂(ε̃_k)²/∂W_k = (∂(ε̃_k)²/∂s_k) X_k.   (57)

Since Δs is small,

∇̂_k ≈ (Δ(ε̃_k²)/Δs) X_k.   (58)

Another way to obtain an approximate instantaneous gradient by measuring the effects of the perturbation Δs can be obtained from Eq. (57):

∇̂_k ≈ 2ε̃_k (Δε̃_k/Δs) X_k.   (59)

Accordingly, there are two forms of the MRIII algorithm for the sigmoid Adaline. They are based on the method of steepest descent, using the estimated instantaneous gradients:

W_{k+1} = W_k - μ (Δ(ε̃_k²)/Δs) X_k   (60)
W_{k+1} = W_k - 2μ ε̃_k (Δε̃_k/Δs) X_k.   (61)

For small perturbations, these two forms are essentially identical. Neither one requires a priori knowledge of the sigmoid's derivative, and both are robust with respect to natural variations, biases, and drift in the analog hardware. Which form to use is a matter of implementational convenience. The algorithm of Eq. (60) is illustrated in Fig. 20.

Regarding algorithm (61), some changes can be made to establish a point of interest. Note that, in accord with Eq. (46),

ε̃_k = d_k - y_k.   (62)

Adding the perturbation Δs causes a change in ε̃_k equal to

Δε̃_k = -Δy_k.   (63)

Now, Eq. (61) may be rewritten as

W_{k+1} = W_k + 2μ ε̃_k (Δy_k/Δs) X_k.   (64)

Since Δs is small, the ratio of increments may be replaced by a ratio of differentials, finally giving

W_{k+1} = W_k + 2μ ε̃_k (dy_k/ds_k) X_k   (65)
        = W_k + 2μ ε̃_k sgm'(s_k) X_k.   (66)

This is identical to the backpropagation algorithm (54) for the sigmoid Adaline. Thus, backpropagation and MRIII are mathematically equivalent if the perturbation Δs is small, but MRIII is robust, even with analog implementations.

MSE Surfaces of the Adaline: Fig. 21 shows a linear combiner connected to both sigmoid and signum devices. Three errors, the linear error ε, the sigmoid error ε̃, and the signum error ε̂, are designated in this figure. They are:

linear error = ε = d - s
sigmoid error = ε̃ = d - sgm(s)
signum error = ε̂ = d - sgn(sgm(s)) = d - sgn(s).   (67)
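The perturbation form of Eq. (60) is equally compact. The sketch below (ours) estimates the derivative of the squared sigmoid error with respect to the sum by direct measurement, so no knowledge of the sigmoid's derivative is needed; the step and perturbation sizes are arbitrary choices.

```python
import numpy as np

def mriii_adaline_step(w, x, d, mu=0.05, ds=1e-3):
    """MRIII for a single sigmoid Adaline, Eq. (60): perturb the linear sum and
    measure the resulting change in the squared sigmoid error."""
    s = w @ x
    e0 = (d - np.tanh(s))**2             # unperturbed squared error
    e1 = (d - np.tanh(s + ds))**2        # squared error with the sum perturbed by ds
    sample_derivative = (e1 - e0) / ds   # measured ratio, as in Eq. (58)
    return w - mu * sample_derivative * x   # steepest descent along the input vector
```

For small ds the update approaches the backpropagation rule (54), as shown by Eqs. (62)-(66).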

Fig. 21. The linear, sigmoid, and signum errors of the Adaline.

To demonstrate the nature of the square error surfaces associated with these three types of error, a simple experiment with a two-input Adaline was performed. The Adaline was driven by a typical set of input patterns and their associated binary {+1, -1} desired responses. The sigmoid function used was the hyperbolic tangent. The weights could have been adapted to minimize the mean-square error of ε, ε̃, or ε̂. The MSE surfaces of E[(ε)²], E[(ε̃)²], and E[(ε̂)²], plotted as functions of the two weight values, are shown in Figs. 22, 23, and 24, respectively.

Fig. 22. Example MSE surface of linear error.

Fig. 23. Example MSE surface of sigmoid error.

Fig. 24. Example MSE surface of signum error (non-quadratic MSE).

Although the above experiment is not all encompassing, we can infer from it that minimizing the mean square of the linear error is easy and minimizing the mean square of the sigmoid error is more difficult, but typically much easier than minimizing the mean square of the signum error. Only the linear error is guaranteed to have an MSE surface with a unique global minimum (assuming invertible R-matrix). The other MSE surfaces can have local optima [122], [123].

In nonlinear neural networks, gradient methods generally work better with sigmoid rather than signum nonlinearities. Smooth nonlinearities are required by the MRIII and backpropagation techniques. Moreover, sigmoid networks are capable of forming internal representations that are more complex than simple binary codes and, thus, these networks can often form decision regions that are more sophisticated than those associated with similar signum networks. In fact, if a noiseless infinite-precision sigmoid Adaline could be constructed, it would be able to convey an infinite amount of information at each time step. This is in contrast to the maximum Shannon information capacity of one bit associated with each binary element.

The signum does have some advantages over the sigmoid in that it is easier to implement in hardware and much simpler to compute on a digital computer. Furthermore, the outputs of signums are binary signals which can be efficiently manipulated by digital computers. In a signum network with binary inputs, for instance, the output of each linear combiner can be computed without performing weight multiplications. This involves simply adding together the values of weights with +1 inputs and subtracting from this the values of all weights that are connected to -1 inputs.

Sometimes a signum is used in an Adaline to produce decisive output decisions. The error probability is then proportional to the mean square of the output error ε̂. To minimize this error probability approximately, one can easily minimize E[(ε)²] instead of directly minimizing E[(ε̂)²] [58]. However, with only a little more computation one could minimize E[(ε̃)²] and typically come much closer to the objective of minimizing E[(ε̂)²]. The sigmoid can therefore be used in training the weights even when the signum is used to form the Adaline output, as in Fig. 21.

VII. STEEPEST-DESCENT RULES-MULTI-ELEMENT NETWORKS

We now study rules for steepest-descent minimization of the MSE associated with entire networks of sigmoid Adaline elements. Like their single-element counterparts, the most practical and efficient steepest-descent rules for multi-element networks typically work with one pattern presentation at a time. We will describe two steepest-descent rules for multi-element sigmoid networks, backpropagation and Madaline Rule III.
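An experiment of the kind behind Figs. 22-24 is easy to reproduce in outline. The sketch below (ours) sweeps a two-weight Adaline over a grid and records the training-set MSE for the linear, sigmoid, and signum errors; the four-pattern training set is invented purely for illustration.

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])  # invented two-input patterns
d = np.array([1.0, 1.0, -1.0, 1.0])                                 # invented binary desired responses

def mse_surfaces(grid):
    """Training-set MSE at each (w1, w2) grid point for the three error types of Eq. (67)."""
    lin = np.zeros((len(grid), len(grid)))
    sig = np.zeros_like(lin)
    sgn = np.zeros_like(lin)
    for i, w1 in enumerate(grid):
        for j, w2 in enumerate(grid):
            s = X @ np.array([w1, w2])                 # linear outputs
            lin[i, j] = np.mean((d - s)**2)            # quadratic bowl (compare Fig. 22)
            sig[i, j] = np.mean((d - np.tanh(s))**2)   # smooth but non-quadratic (compare Fig. 23)
            sgn[i, j] = np.mean((d - np.sign(s))**2)   # piecewise-constant plateaus (compare Fig. 24)
    return lin, sig, sgn

lin, sig, sgn = mse_surfaces(np.linspace(-3.0, 3.0, 61))
```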



Fig. 25. Example two-layer backpropagation network architecture.

A. Backpropagation for Networks

The publication of the backpropagation technique by Rumelhart et al. [42] has unquestionably been the most influential development in the field of neural networks during the past decade. In retrospect, the technique seems simple. Nonetheless, largely because early neural network research dealt almost exclusively with hard-limiting nonlinearities, the idea never occurred to neural network researchers throughout the 1960s.

The basic concepts of backpropagation are easily grasped. Unfortunately, these simple ideas are often obscured by relatively intricate notation, so formal derivations of the backpropagation rule are often tedious. We present an informal derivation of the algorithm and illustrate how it works for the simple network shown in Fig. 25.

The backpropagation technique is a nontrivial generalization of the single sigmoid Adaline case of Section VI-B. When applied to multi-element networks, the backpropagation technique adjusts the weights in the direction opposite the instantaneous error gradient:

W_{k+1} = W_k + μ(-∇̂_k),   ∇̂_k = ∂ε_k²/∂W_k = [∂ε_k²/∂w_{1k}  ···  ∂ε_k²/∂w_{mk}]^T.   (68)

Now, however, W_k is a long m-component vector of all weights in the entire network. The instantaneous sum squared error ε_k² is the sum of the squares of the errors at each of the N_y outputs of the network. Thus

ε_k² = Σ_{j=1}^{N_y} (d_jk - y_jk)².   (69)

In the network example shown in Fig. 25, the sum square error is given by

ε² = (d_1 - y_1)² + (d_2 - y_2)²   (70)

where we now suppress the time index k for convenience.

In its simplest form, backpropagation training begins by presenting an input pattern vector X to the network, sweeping forward through the system to generate an output response vector Y, and computing the errors at each output. The next step involves sweeping the effects of the errors backward through the network to associate a "square error derivative" δ with each Adaline, computing a gradient from each δ, and finally updating the weights of each Adaline based upon the corresponding gradient. A new pattern is then presented and the process is repeated. The initial weight values are normally set to small random numbers. The algorithm will not work properly with multilayer networks if the initial weights are either zero or poorly chosen nonzero values.²⁰

We can get some idea about what is involved in the calculations associated with the backpropagation algorithm by examining the network of Fig. 25. Each of the five large circles represents a linear combiner, as well as some associated signal paths for error backpropagation, and the corresponding adaptive machinery for updating the weights. This detail is shown in Fig. 26.

²⁰Recently, Nguyen has discovered that a more sophisticated choice of initial weight values in hidden layers can lead to reduced problems with local optima and dramatic increases in network training speed [100]. Experimental evidence suggests that it is advisable to choose the initial weights of each hidden layer in a quasi-random manner, which ensures that at each position in a layer's input space the outputs of all but a few of its Adalines will be saturated, while ensuring that each Adaline in the layer is unsaturated in some region of its input space. When this method is used, the weights in the output layer are set to small random values.
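For reference in what follows, here is a minimal sketch (ours) of the forward sweep and the sum square error of Eqs. (69)-(70) for a two-layer tanh network like Fig. 25; the array shapes and explicit bias handling are our assumptions.

```python
import numpy as np

def forward(W1, W2, x):
    """Forward sweep through a two-layer network of sigmoid Adalines.

    W1 : first-layer weights, one row per hidden Adaline (bias weight first)
    W2 : output-layer weights, one row per output Adaline
    """
    x_aug = np.concatenate(([1.0], x))     # prepend the bias input
    h = np.tanh(W1 @ x_aug)                # hidden-layer responses
    h_aug = np.concatenate(([1.0], h))
    y = np.tanh(W2 @ h_aug)                # output-layer responses
    return x_aug, h_aug, y

def sum_square_error(y, d):
    """Instantaneous sum square error over the network outputs, Eqs. (69)-(70)."""
    return float(np.sum((d - y)**2))
```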

Fig. 26. Detail of linear combiner and associated circuitry in backpropagation network.

The solid lines in these diagrams represent forward signal paths through the network, and the dotted lines represent the separate backward paths that are used in association with calculations of the square-error derivatives δ. From Fig. 25, we see that the calculations associated with the backward sweep are of a complexity roughly equal to that represented by the forward pass through the network. The backward sweep requires the same number of function calculations as the forward sweep, but no weight multiplications in the first layer.

As stated earlier, after a pattern has been presented to the network, and the response error of each output has been calculated, the next step of the backpropagation algorithm involves finding the instantaneous square-error derivative δ associated with each summing junction in the network. The square error derivative associated with the jth Adaline in layer l is defined as²¹

δ_j^(l) ≜ -(1/2) ∂ε²/∂s_j^(l).   (71)

Each of these derivatives in essence tells us how sensitive the sum square output error of the network is to changes in the linear output of the associated Adaline element.

The instantaneous square-error derivatives are first computed for each element in the output layer. The calculation is simple. As an example, below we derive the required expression for δ_1^(2), the derivative associated with the top Adaline element in the output layer of Fig. 25. We begin with the definition of δ_1^(2) from Eq. (71)

δ_1^(2) ≜ -(1/2) ∂ε²/∂s_1^(2).   (72)

Expanding the squared-error term ε² by Eq. (70) yields

δ_1^(2) = -(1/2) ∂[(d_1 - y_1)² + (d_2 - y_2)²]/∂s_1^(2)   (73)
        = -(1/2) ∂(d_1 - y_1)²/∂s_1^(2) - (1/2) ∂(d_2 - y_2)²/∂s_1^(2).   (74)

We note that the second term is zero. Accordingly,

δ_1^(2) = -(1/2) ∂(d_1 - sgm(s_1^(2)))²/∂s_1^(2).   (75)

Observing that d_1 and s_1^(2) are independent yields

δ_1^(2) = (d_1 - sgm(s_1^(2))) ∂(sgm(s_1^(2)))/∂s_1^(2)   (76)
        = (d_1 - sgm(s_1^(2))) sgm'(s_1^(2)).   (77)

We denote the error d_1 - sgm(s_1^(2)) by ε_1^(2). Therefore,

δ_1^(2) = ε_1^(2) sgm'(s_1^(2)).   (78)

Note that this corresponds to the computation of δ_1^(2) as illustrated in Fig. 25. The value of δ associated with the other output element in the figure can be expressed in an analogous fashion. Thus each square-error derivative δ in the output layer is computed by multiplying the output error associated with that element by the derivative of the associated sigmoidal nonlinearity. Note from Eq. (55) that if the sigmoid function is the hyperbolic tangent, Eq. (78) becomes simply

δ_1^(2) = ε_1^(2) (1 - (y_1^(2))²).   (79)

Developing expressions for the square-error derivatives associated with hidden layers is not much more difficult (refer to Fig. 25). We need an expression for δ_1^(1), the square-error derivative associated with the top element in the first layer of Fig. 25. The derivative δ_1^(1) is defined by

δ_1^(1) ≜ -(1/2) ∂ε²/∂s_1^(1).   (80)

Expanding this by the chain rule, noting that ε² is determined entirely by the values of s_1^(2) and s_2^(2), yields

δ_1^(1) = -(1/2) [(∂ε²/∂s_1^(2))(∂s_1^(2)/∂s_1^(1)) + (∂ε²/∂s_2^(2))(∂s_2^(2)/∂s_1^(1))].   (81)

Using the definitions of δ_1^(2) and δ_2^(2), and then substituting expanded versions of the Adaline linear outputs s_1^(2) and s_2^(2), gives

ε_1^(1) ≜ δ_1^(2) w_11^(2) + δ_2^(2) w_21^(2)   (86)
δ_1^(1) = ε_1^(1) sgm'(s_1^(1))   (87)

where w_j1^(2) denotes the weight by which output Adaline j scales the output of the first Adaline in layer 1. Referring to Fig. 25, we can trace through the circuit to verify that δ_1^(1) is computed in accord with Eqs. (86) and (87).

²¹In Fig. 25, all notation follows the convention that superscripts within parentheses indicate the layer number of the associated Adaline or input node, while subscripts identify the associated Adaline(s) within a layer.
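Putting the pieces together, the sketch below (ours) performs one complete backpropagation presentation for the two-layer network: output-layer deltas from Eq. (79), hidden-layer deltas from Eqs. (86)-(87), and the weight update of Eq. (94) below for every Adaline. The vectorized layout is our own.

```python
import numpy as np

def backprop_present(W1, W2, x, d, mu=0.05):
    """One backpropagation training presentation for a two-layer tanh network."""
    x_aug = np.concatenate(([1.0], x))
    h = np.tanh(W1 @ x_aug)                 # hidden-layer outputs
    h_aug = np.concatenate(([1.0], h))
    y = np.tanh(W2 @ h_aug)                 # output responses

    delta2 = (d - y) * (1 - y**2)           # output deltas: error times sigmoid derivative, Eq. (79)
    # Backpropagate through the output weights (bias weights excluded), Eq. (86),
    # then multiply by the hidden-layer sigmoid derivative, Eq. (87).
    delta1 = (W2[:, 1:].T @ delta2) * (1 - h**2)

    W2 = W2 + 2 * mu * np.outer(delta2, h_aug)   # add each input scaled by 2*mu*delta, Eq. (94)
    W1 = W1 + 2 * mu * np.outer(delta1, x_aug)
    return W1, W2
```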


The easiest way to find values of δ for all the Adaline elements in the network is to follow the schematic diagram of Fig. 25.

Thus, the procedure for finding δ^(l), the square-error derivative associated with a given Adaline in hidden layer l, involves respectively multiplying each derivative δ^(l+1) associated with each element in the layer immediately downstream from a given Adaline by the weight that connects it to the given Adaline. These weighted square-error derivatives are then added together, producing an error term ε^(l), which, in turn, is multiplied by sgm'(s^(l)), the derivative of the given Adaline's sigmoid function at its current operating point. If a network has more than two layers, this process of backpropagating the instantaneous square-error derivatives from one layer to the immediately preceding layer is successively repeated until a square-error derivative δ is computed for each Adaline in the network. This is easily shown at each layer by repeating the chain rule argument associated with Eq. (81).

We now have a general method for finding a derivative δ for each Adaline element in the network. The next step is to use these δ's to obtain the corresponding gradients. Consider an Adaline somewhere in the network which, during presentation k, has a weight vector W_k, an input vector X_k, and a linear output s_k = W_k^T X_k. The instantaneous gradient for this Adaline element is

∇̂_k = ∂ε_k²/∂W_k.   (88)

This can be written as

∇̂_k = (∂ε_k²/∂s_k)(∂s_k/∂W_k).   (89)

Note that W_k and X_k are independent so

∂s_k/∂W_k = X_k.   (90)

Therefore,

∇̂_k = (∂ε_k²/∂s_k) X_k.   (91)

For this element,

δ_k = -(1/2) ∂ε_k²/∂s_k.   (92)

Accordingly,

∇̂_k = -2δ_k X_k.   (93)

Updating the weights of the Adaline element using the method of steepest descent with the instantaneous gradient is a process represented by

W_{k+1} = W_k + μ(-∇̂_k) = W_k + 2μ δ_k X_k.   (94)

Thus, after backpropagating all square-error derivatives, we complete a backpropagation iteration by adding to each weight vector the corresponding input vector scaled by the associated square-error derivative. Eq. (94) and the means for finding δ_k comprise the general weight update rule of the backpropagation algorithm.

There is a great similarity between Eq. (94) and the μ-LMS algorithm (33), but one should view this similarity with caution. The quantity δ_k, defined as a squared-error derivative, might appear to play the same role in backpropagation as that played by the error in the μ-LMS algorithm. However, δ_k is not an error. Adaptation of the given Adaline is effected to reduce the squared output error ε_k², not δ_k of the given Adaline or of any other Adaline in the network. The objective is not to reduce the δ_k's of the network, but to reduce ε_k² at the network output.

It is interesting to examine the weight updates that backpropagation imposes on the Adaline elements in the output layer. Substituting Eq. (77) into Eq. (94) reveals that the Adaline which provides output y_1 in Fig. 25 is updated by the rule

W_{k+1} = W_k + 2μ ε_1^(2) sgm'(s_1^(2)) X_k.   (95)

This rule turns out to be identical to the single Adaline version (54) of the backpropagation rule. This is not surprising since the output Adaline is provided with both input signals and desired responses, so its training circumstance is the same as that experienced by an Adaline trained in isolation.

There are many variants of the backpropagation algorithm. Sometimes, the size of μ is reduced during training to diminish the effects of gradient noise in the weights. Another extension is the momentum technique [42] which involves including in the weight change vector ΔW_k of each Adaline a term proportional to the corresponding weight change from the previous iteration. That is, Eq. (94) is replaced by a pair of equations:

ΔW_k = 2μ(1 - η) δ_k X_k + η ΔW_{k-1}   (96)
W_{k+1} = W_k + ΔW_k   (97)

where the momentum constant 0 ≤ η < 1 is in practice usually set to something around 0.8 or 0.9.

The momentum technique low-pass filters the weight updates and thereby tends to resist erratic weight changes caused either by gradient noise or high spatial frequencies in the MSE surface. The factor (1 - η) in Eq. (96) is included to give the filter a DC gain of unity so that the learning rate μ does not need to be stepped down as the momentum constant η is increased. A momentum term can also be added to the update equations of other algorithms discussed in this paper. A detailed analysis of stability issues associated with momentum updating for the μ-LMS algorithm, for instance, has been described by Shynk and Roy [124].

In our experience, the momentum technique used alone is usually of little value. We have found, however, that it is often useful to apply the technique in situations that require relatively "clean"²² gradient estimates. One case is a normalized weight update equation which makes the network's weight vector move the same Euclidean distance with each iteration. This can be accomplished by replacing Eqs. (96) and (97) with

Δ_k = δ_k X_k + η Δ_{k-1}   (98)
W_{k+1} = W_k + 2μ Δ_k / |Δ_k|   (99)

where again 0 < η < 1. The weight updates determined by Eqs. (98) and (99) can help a network find a solution when a relatively flat local region in the MSE surface is encountered.

²²"Clean" gradient estimates are those with little gradient noise.

The weights move by the same amount whether the surface is flat or inclined. It is reminiscent of α-LMS because the gradient term in the weight update equation is normalized by a time-varying factor. The weight update rule could be further modified by including terms from both techniques associated with Eqs. (96) through (99). Other methods for speeding up backpropagation training include Fahlman's popular quickprop method [125], as well as the delta-bar-delta approach reported in an excellent paper by Jacobs [126].²³

One of the most promising new areas of neural network research involves backpropagation variants for training various recurrent (signal feedback) networks. Recently, backpropagation rules have been derived for training recurrent networks to learn static associations [127], [128]. More interesting is the on-line technique of Williams and Zipser [129] which allows a wide class of recurrent networks to learn dynamic associations and trajectories. A more general and computationally viable variant of this technique has been advanced by Narendra and Parthasarathy [104]. These on-line methods are generalizations of a well-known steepest-descent algorithm for training linear IIR filters [130], [30].

An equivalent technique that is usually far less computationally intensive but best suited for off-line computation [37], [42], [131], called "backpropagation through time," has been used by Nguyen and Widrow [50] to enable a neural network to learn without a teacher how to back up a computer-simulated trailer truck to a loading dock (Fig. 27). This is a highly nonlinear steering task and it is not yet known how to design a controller to perform it. Nevertheless, with just 6 inputs providing information about the current position of the truck, a two-layer neural network with only 26 Adalines was able to learn of its own accord to solve this problem. Once trained, the network could successfully back up the truck from any initial position and orientation in front of the loading dock.

Fig. 27. Example truck backup sequence (initial state to final state).

B. Madaline Rule III for Networks

It is difficult to build neural networks with analog hardware that can be trained effectively by the popular backpropagation technique. Attempts to overcome this difficulty have led to the development of the MRIII algorithm. A commercial analog neurocomputing chip based primarily on this algorithm has already been devised [132]. The method described in this section is a generalization of the single Adaline MRIII technique (60). The multi-element generalization of the other single element MRIII rule (61) is described in [133].

Fig. 28. Example two-layer Madaline III architecture.

The MRIII algorithm can be readily described by referring to Fig. 28. Although this figure shows a simple two-layer feedforward architecture, the procedure to be developed will work for neural networks with any number of Adaline elements in any feedforward structure. In [133], we discuss variants of the basic MRIII approach that allow steepest-descent training to be applied to more general network topologies, even those with signal feedback.

Assume that an input pattern X and its associated desired output responses d_1 and d_2 are presented to the network of Fig. 28. At this point, we measure the sum squared output response error ε² = (d_1 - y_1)² + (d_2 - y_2)² = ε_1² + ε_2². We then add a small quantity Δs to a selected Adaline in the network, providing a perturbation to the element's linear sum. This perturbation propagates through the network, and causes a change in the sum of the squares of the errors, Δ(ε²) = Δ(ε_1² + ε_2²). An easily measured ratio is

Δ(ε²)/Δs.   (100)

²³Jacobs's paper, like many other papers in the literature, assumes for analysis that the true gradients rather than instantaneous gradients are used to update the weights, that is, that weights are changed periodically, only after all training patterns are presented. This eliminates gradient noise but can slow down training enormously if the training set is large. The delta-bar-delta procedure in Jacobs's paper involves monitoring changes of the true gradients in response to weight changes. It should be possible to avoid the expense of computing the true gradients explicitly in this case by instead monitoring changes in the outputs of, say, two momentum filters with different time constants.



Below we use this to obtain the instantaneous gradient of ε_k² with respect to the weight vector of the selected Adaline. For the kth presentation, the instantaneous gradient is

    ∇̂_k = ∂ε_k²/∂W_k = (∂ε_k²/∂s_k) X_k.

Replacing the derivative with a ratio of differences yields

    ∇̂_k ≈ (Δε_k²/Δs_k) X_k.                                   (102)

The idea of obtaining a derivative by perturbing the linear output of the selected Adaline element is the same as that expressed for the single element in Section VI-B, except that here the error is obtained from the output of a multi-element network rather than from the output of a single element.
The gradient (102) can be used to optimize the weight vector in accord with the method of steepest descent:

    W_{k+1} = W_k + μ(−∇̂_k) ≈ W_k − μ (Δε_k²/Δs_k) X_k.        (103)

Maintaining the same input pattern, one could either perturb all the elements in the network in sequence, adapting after each gradient calculation, or else the derivatives could be computed and stored to allow all Adalines to be adapted at once. These two MRIII approaches both involve the same weight update equation (103), and if μ is small, both lead to equivalent solutions. With large μ, experience indicates that adapting one element at a time results in convergence after fewer iterations, especially in large networks. Storing the gradients, however, has the advantage that after the initial unperturbed error is measured during a given training presentation, each gradient estimate requires only the perturbed error measurement. If adaptations take place after each error measurement, both perturbed and unperturbed errors must be measured for each gradient calculation. This is because each weight update changes the associated unperturbed error.

C. Comparison of MRIII with MRII

MRIII was derived from MRII by replacing the signum nonlinearities with sigmoids. The similarity of these algorithms becomes evident when comparing Fig. 28, representing MRIII, with Fig. 16, representing MRII.
The MRII network is highly discontinuous and nonlinear. Using an instantaneous gradient to adjust the weights is not possible. In fact, from the MSE surface for the signum Adaline presented in Section VI-B, it is clear that even gradient descent techniques that use the true gradient could run into severe problems with local minima. The idea of adding a perturbation to the linear sum of a selected Adaline element is workable, however. If the Hamming error has been reduced by the perturbation, the Adaline is adapted to reverse its output decision. This weight change is in the LMS direction, along its X-vector. If adapting the Adaline would not reduce network output error, it is not adapted. This is in accord with the minimal disturbance principle. The Adalines selected for possible adaptation are those whose analog sums are closest to zero, that is, the Adalines that can be adapted to give opposite responses with the smallest weight changes. It is useful to note that with binary ±1 desired responses, the Hamming error is equal to 1/4 the sum square error. Minimizing the output Hamming error is therefore equivalent to minimizing the output sum square error.
The MRIII algorithm works in a similar manner. All the Adalines in the MRIII network are adapted, but those whose analog sums are closest to zero will usually be adapted most strongly, because the sigmoid has its maximum slope at zero, contributing to high gradient values. As with MRII, the objective is to change the weights for the given input presentation to reduce the sum square error at the network output. In accord with the minimal disturbance principle, the weight vectors of the Adaline elements are adapted in the LMS direction, along their X-vectors, and are adapted in proportion to their capabilities for reducing the sum square error (the square of the Euclidean error) at the output.

D. Comparison of MRIII with Backpropagation

In Section VI-B, we argued that for the sigmoid Adaline element, the MRIII algorithm (61) is essentially equivalent to the backpropagation algorithm (54). The same argument can be extended to the network of Adaline elements, demonstrating that if Δs is small and adaptation is applied to all elements in the network at once, then MRIII is essentially equivalent to backpropagation. That is, to the extent that the sample derivative Δε_k²/Δs_k from Eq. (103) is equal to the analytical derivative ∂ε_k²/∂s_k from Eq. (91), the two rules follow identical instantaneous gradients, and thus perform identical weight updates.
The backpropagation algorithm requires fewer operations than MRIII to calculate gradients, since it is able to take advantage of a priori knowledge of the sigmoid nonlinearities and their derivative functions. Conversely, the MRIII algorithm uses no prior knowledge about the characteristics of the sigmoid functions. Rather, it acquires instantaneous gradients from perturbation measurements. Using MRIII, tolerances on the sigmoid implementations can be greatly relaxed compared to acceptable tolerances for successful backpropagation.
Steepest-descent training of multilayer networks implemented by computer simulation or by precise parallel digital hardware is usually best carried out by backpropagation. During each training presentation, the backpropagation method requires only one forward computation through the network followed by one backward computation in order to adapt all the weights of an entire network. To accomplish the same effect with the form of MRIII that updates all weights at once, one measures the unperturbed error followed by a number of perturbed error measurements equal to the number of elements in the network. This could require a lot of computation.
If a network is to be implemented in analog hardware, however, experience has shown that MRIII offers strong advantages over backpropagation. Comparison of Fig. 25 with Fig. 28 demonstrates the relative simplicity of MRIII. All the apparatus for backward propagation of error-related signals is eliminated, and the weights do not need to carry signals in both directions (see Fig. 26). MRIII is a much simpler algorithm to build and to understand, and in principle it produces the same instantaneous gradient as the backpropagation algorithm. The momentum technique and most other common variants of the backpropagation algorithm can be applied to MRIII training.
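To make the procedure concrete, the following is a minimal sketch (in Python/NumPy) of the all-at-once form of MRIII described above, applied to a small two-layer network of sigmoid Adalines. It is an illustration under stated assumptions, not the authors' implementation: the names mriii_present, forward, W1, W2, mu, and ds are hypothetical, tanh stands in for the sigmoid, and the layer sizes are arbitrary. One unperturbed sum square error is measured per presentation; each Adaline's linear sum is then perturbed by Δs to obtain a sample derivative Δε²/Δs, and every weight vector is finally adapted along its own input (X-) vector.

```python
import numpy as np

def forward(W1, W2, x):
    """Forward pass through two layers of sigmoid (tanh) Adalines.
    The input x and the hidden-layer output both carry an appended bias of 1."""
    s1 = W1 @ x                        # hidden linear sums
    a1 = np.append(np.tanh(s1), 1.0)   # hidden outputs plus bias input
    s2 = W2 @ a1                       # output linear sums
    return s1, a1, s2, np.tanh(s2)     # outputs

def sse(d, y):
    """Sum square error at the network output."""
    return np.sum((d - y) ** 2)

def mriii_present(W1, W2, x, d, mu=0.05, ds=1e-3):
    """One MRIII-style presentation with all weights adapted at once.
    Gradients are estimated by perturbing each Adaline's linear sum by ds."""
    s1, a1, s2, y = forward(W1, W2, x)
    e0 = sse(d, y)                                  # unperturbed error, measured once

    # Hidden-layer Adalines: perturb s1[j], re-propagate, store the sample derivative.
    grad1 = np.zeros(W1.shape[0])
    for j in range(W1.shape[0]):
        a1_p = a1.copy()
        a1_p[j] = np.tanh(s1[j] + ds)               # only Adaline j is perturbed
        grad1[j] = (sse(d, np.tanh(W2 @ a1_p)) - e0) / ds

    # Output-layer Adalines: perturb s2[i] directly.
    grad2 = np.zeros(W2.shape[0])
    for i in range(W2.shape[0]):
        y_p = y.copy()
        y_p[i] = np.tanh(s2[i] + ds)
        grad2[i] = (sse(d, y_p) - e0) / ds

    # Adapt every weight vector in the LMS direction, along its own input vector.
    W1 -= mu * np.outer(grad1, x)
    W2 -= mu * np.outer(grad2, a1)
    return e0

# Usage sketch: 2 inputs -> 3 hidden Adalines -> 1 output, bias weights included.
rng = np.random.default_rng(0)
W1 = rng.uniform(-0.5, 0.5, (3, 3))    # rows: hidden Adalines; cols: [x1, x2, bias]
W2 = rng.uniform(-0.5, 0.5, (1, 4))    # row: output Adaline; cols: [3 hidden, bias]
patterns = [(np.array([-1., -1., 1.]), np.array([-1.])),
            (np.array([-1.,  1., 1.]), np.array([ 1.])),
            (np.array([ 1., -1., 1.]), np.array([ 1.])),
            (np.array([ 1.,  1., 1.]), np.array([-1.]))]
for _ in range(2000):
    x, d = patterns[rng.integers(len(patterns))]
    mriii_present(W1, W2, x, d)
```

For small Δs the stored sample derivatives approach the analytical derivatives that backpropagation would compute, which is the equivalence argued in the comparison above.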



E. MSE Surfaces of Neural Networks

In Section VI-B, "typical" mean-square-error surfaces of sigmoid and signum Adalines were shown, indicating that sigmoid Adalines are much more conducive to gradient approaches than signum Adalines. The same phenomena result when Adalines are incorporated into multi-element networks. The MSE surfaces of MRII networks are reasonably chaotic and will not be explored here. In this section we examine only MSE surfaces from a typical backpropagation training problem with a sigmoidal neural network.
In a network with more than two weights, the MSE surface is high-dimensional and difficult to visualize. It is possible, however, to look at slices of this surface by plotting the MSE surface created by varying two of the weights while holding all others constant. The surfaces plotted in Figs. 29 and 30 show two such slices of the MSE surface from a typical learning problem involving, respectively, an untrained sigmoidal network and a trained one. The first surface resulted from varying two first-layer weights of an untrained network. The second surface resulted from varying the same two weights after the network was fully trained. The two surfaces are similar, but the second one has a deeper minimum which was carved out by the backpropagation learning process. Figs. 31 and 32 resulted from varying a different set of two weights in the same network. Fig. 31 is the result from varying a first-layer weight and a third-layer weight in the untrained network, whereas Fig. 32 is the surface that resulted from varying the same two weights after the network was trained.

Fig. 29. Example MSE surface of untrained sigmoidal network as a function of two first-layer weights.
Fig. 30. Example MSE surface of trained sigmoidal network as a function of two first-layer weights.
Fig. 31. Example MSE surface of untrained sigmoidal network as a function of a first-layer weight and a third-layer weight.
Fig. 32. Example MSE surface of trained sigmoidal network as a function of a first-layer weight and a third-layer weight.

By studying many plots, it becomes clear that backpropagation and MRIII will be subject to convergence on local optima. The same is true for MRII. The most common remedy for this is the sporadic addition of noise to the weights or gradients. Some of the "simulated annealing" methods [47] do this. Another method involves retraining the network several times using different random initial weight values until a satisfactory solution is found.
Solutions found by people in everyday life are usually not optimal, but many of them are useful. If a local optimum yields satisfactory performance, often there is simply no need to search for a better solution.
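To make the slicing procedure described above concrete, the sketch below evaluates the mean square error of a small sigmoidal network over a grid of values for two selected weights while all other weights are held constant; the resulting array can then be rendered as a surface or contour plot. This is a hypothetical illustration, not the network used to generate Figs. 29-32; the names forward and mse_slice, the tanh nonlinearity, and the layer sizes are assumptions.

```python
import numpy as np

def forward(W1, W2, x):
    """Two-layer sigmoidal (tanh) network; x carries an appended bias of 1."""
    a1 = np.append(np.tanh(W1 @ x), 1.0)
    return np.tanh(W2 @ a1)

def mse_slice(W1, W2, X, D, idx_a, idx_b, grid):
    """Mean square error over a grid of two selected weights.
    idx_a is a (row, col) index into W1 (a first-layer weight) and idx_b a
    (row, col) index into W2 (a later-layer weight); all others stay fixed."""
    surface = np.zeros((len(grid), len(grid)))
    for i, wa in enumerate(grid):
        for j, wb in enumerate(grid):
            W1_t, W2_t = W1.copy(), W2.copy()
            W1_t[idx_a] = wa
            W2_t[idx_b] = wb
            errs = [np.sum((d - forward(W1_t, W2_t, x)) ** 2) for x, d in zip(X, D)]
            surface[i, j] = np.mean(errs)        # MSE averaged over the training set
    return surface

# Usage sketch: vary one first-layer weight and one output-layer weight.
rng = np.random.default_rng(1)
W1 = rng.uniform(-1, 1, (3, 3))
W2 = rng.uniform(-1, 1, (1, 4))
X = [np.array([-1., -1., 1.]), np.array([1., 1., 1.])]
D = [np.array([-1.]), np.array([1.])]
grid = np.linspace(-5, 5, 41)
surf = mse_slice(W1, W2, X, D, (0, 0), (0, 1), grid)   # ready for a contour/surface plot
```

Computing such a slice before and after training shows the deeper minimum carved out by the learning process, as in Figs. 29-32.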

VIII. SUMMARY

This year is the 30th anniversary of the publication of the Perceptron rule by Rosenblatt and the LMS algorithm by Widrow and Hoff. It has also been 16 years since Werbos first published the backpropagation algorithm. These learning rules and several others have been studied and compared. Although they differ significantly from each other, they all belong to the same "family."
A distinction was drawn between error-correction rules and steepest-descent rules. The former includes the Perceptron rule, Mays' rules, the α-LMS algorithm, the original Madaline I rule of 1962, and the Madaline II rule. The latter includes the μ-LMS algorithm, the Madaline III rule, and the backpropagation algorithm. Fig. 33 categorizes the learning rules that have been studied.
Fig. 33. Learning rules. (Tree categorizing the rules as steepest-descent vs. error-correction, layered network vs. single element, and nonlinear vs. linear: MRIII, backprop, μ-LMS, MRI, MRII, Perceptron, Mays, α-LMS.)

Although these algorithms have been presented as established learning rules, one should not gain the impression that they are perfect and frozen for all time. Variations are possible for every one of them. They should be regarded as substrates upon which to build new and better rules. There is a tremendous amount of invention waiting "in the wings." We look forward to the next 30 years.

REFERENCES

[1] K. Steinbuch and V. A. W. Piske, "Learning matrices and their applications," IEEE Trans. Electron. Comput., vol. EC-12, pp. 846-862, Dec. 1963.
[2] B. Widrow, "Generalization and information storage in networks of adaline 'neurons,'" in Self-Organizing Systems 1962, M. Yovitz, G. Jacobi, and G. Goldstein, Eds. Washington, DC: Spartan Books, 1962, pp. 435-461.
[3] L. Stark, M. Okajima, and G. H. Whipple, "Computer pattern recognition techniques: Electrocardiographic diagnosis," Commun. Ass. Comput. Mach., vol. 5, pp. 527-532, Oct. 1962.
[4] F. Rosenblatt, "Two theorems of statistical separability in the perceptron," in Mechanization of Thought Processes: Proceedings of a Symposium held at the National Physical Laboratory, Nov. 1958, vol. 1, pp. 421-456. London: HM Stationery Office, 1959.
[5] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington, DC: Spartan Books, 1962.
[6] C. von der Malsburg, "Self-organizing of orientation sensitive cells in the striate cortex," Kybernetik, vol. 14, pp. 85-100, 1973.
[7] S. Grossberg, "Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors," Biolog. Cybernetics, vol. 23, pp. 121-134, 1976.
[8] K. Fukushima, "Cognitron: A self-organizing multilayered neural network," Biolog. Cybernetics, vol. 20, pp. 121-136, 1975.
[9] -, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biolog. Cybernetics, vol. 36, pp. 193-202, 1980.
[10] B. Widrow, "Bootstrap learning in threshold logic systems," presented at the American Automatic Control Council (Theory Committee), IFAC Meeting, London, England, June 1966.
[11] B. Widrow, N. K. Gupta, and S. Maitra, "Punish/reward: Learning with a critic in adaptive threshold systems," IEEE Trans. Syst., Man, Cybernetics, vol. SMC-3, pp. 455-465, Sept. 1973.
[12] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybernetics, vol. SMC-13, pp. 834-846, 1983.
[13] J. S. Albus, "A new approach to manipulator control: the cerebellar model articulation controller (CMAC)," J. Dyn. Sys., Meas., Contr., vol. 97, pp. 220-227, 1975.
[14] W. T. Miller, III, "Sensor-based control of robotic manipulators using a general learning algorithm," IEEE J. Robotics Automat., vol. RA-3, pp. 157-165, Apr. 1987.
[15] S. Grossberg, "Adaptive pattern classification and universal recoding, II: Feedback, expectation, olfaction, and illusions," Biolog. Cybernetics, vol. 23, pp. 187-202, 1976.
[16] G. A. Carpenter and S. Grossberg, "A massively parallel architecture for a self-organizing neural pattern recognition machine," Computer Vision, Graphics, and Image Processing, vol. 37, pp. 54-115, 1983.
[17] -, "ART 2: Self-organization of stable category recognition codes for analog output patterns," Applied Optics, vol. 26, pp. 4919-4930, Dec. 1, 1987.
[18] -, "ART 3 hierarchical search: Chemical transmitters in self-organizing pattern recognition architectures," in Proc. Intl. Joint Conf. on Neural Networks, vol. 2, pp. 30-33, Wash., DC, Jan. 1990.
[19] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biolog. Cybernetics, vol. 43, pp. 59-69, 1982.
[20] -, Self-Organization and Associative Memory. New York: Springer-Verlag, 2d ed., 1988.
[21] D. O. Hebb, The Organization of Behavior. New York: Wiley, 1949.
[22] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proc. Natl. Acad. Sci., vol. 79, pp. 2554-2558, Apr. 1982.
[23] -, "Neurons with graded response have collective computational properties like those of two-state neurons," Proc. Natl. Acad. Sci., vol. 81, pp. 3088-3092, May 1984.
[24] B. Kosko, "Adaptive bidirectional associative memories," Appl. Optics, vol. 26, pp. 4947-4960, Dec. 1, 1987.
[25] G. E. Hinton, T. J. Sejnowski, and D. H. Ackley, "Boltzmann machines: Constraint satisfaction networks that learn," Tech. Rep. CMU-CS-84-119, Carnegie-Mellon University, Dept. of Computer Science, 1984.
[26] G. E. Hinton and T. J. Sejnowski, "Learning and relearning in Boltzmann machines," in Parallel Distributed Processing, vol. 1, ch. 7, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, MA: M.I.T. Press, 1986.
[27] L. R. Talbert et al., "A real-time adaptive speech-recognition system," Tech. rep., Stanford University, 1963.
[28] M. J. C. Hu, Application of the Adaline System to Weather Forecasting. Thesis, Tech. Rep. 6775-1, Stanford Electron. Labs., Stanford, CA, June 1964.
[29] B. Widrow, "The original adaptive neural net broom-balancer," Proc. IEEE Intl. Symp. Circuits and Systems, pp. 351-357, Phila., PA, May 4-7, 1987.
[30] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[31] B. Widrow, P. Mantey, L. Griffiths, and B. Goode, "Adaptive antenna systems," Proc. IEEE, vol. 55, pp. 2143-2159, Dec. 1967.
[32] B. Widrow, "Adaptive inverse control," Proc. 2d Intl. Fed. of Automatic Control Workshop, pp. 1-5, Lund, Sweden, July 1-3, 1986.
[33] B. Widrow, et al., "Adaptive noise cancelling: Principles and applications," Proc. IEEE, vol. 63, pp. 1692-1716, Dec. 1975.
[34] R. W. Lucky, "Automatic equalization for digital communication," Bell Syst. Tech. J., vol. 44, pp. 547-588, Apr. 1965.
[35] R. W. Lucky, et al., Principles of Data Communication. New York: McGraw-Hill, 1968.
[36] M. M. Sondhi, "An adaptive echo canceller," Bell Syst. Tech. J., vol. 46, pp. 497-511, Mar. 1967.
[37] P. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University, Cambridge, MA, Aug. 1974.
[38] Y. le Cun, "A theoretical framework for back-propagation," in Proc. 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, Eds. June 17-26, pp. 21-28. San Mateo, CA: Morgan Kaufmann.
[39] D. Parker, "Learning-logic," Invention Report 581-64, File 1, Office of Technology Licensing, Stanford University, Stanford, CA, Oct. 1982.
[40] -, "Learning-logic," Technical Report TR-47, Center for Computational Research in Economics and Management Science, M.I.T., Apr. 1985.



[41] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," ICS Report 8506, Institute for Cognitive Science, University of California at San Diego, La Jolla, CA, Sept. 1985.
[42] -, "Learning internal representations by error propagation," in Parallel Distributed Processing, vol. 1, ch. 8, D. E. Rumelhart and J. L. McClelland, Eds., Cambridge, MA: M.I.T. Press, 1986.
[43] B. Widrow, R. G. Winter, and R. Baxter, "Learning phenomena in layered neural networks," Proc. 1st IEEE Intl. Conf. on Neural Networks, vol. 2, pp. 411-429, San Diego, CA, June 1987.
[44] R. P. Lippmann, "An introduction to computing with neural nets," IEEE ASSP Mag., Apr. 1987.
[45] J. A. Anderson and E. Rosenfeld, Eds., Neurocomputing: Foundations of Research. Cambridge, MA: M.I.T. Press, 1988.
[46] N. Nilsson, Learning Machines. New York: McGraw-Hill, 1965.
[47] D. E. Rumelhart and J. L. McClelland, Eds., Parallel Distributed Processing. Cambridge, MA: M.I.T. Press, 1986.
[48] B. Moore, "ART 1 and pattern clustering," in Proc. 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, Eds., June 17-26, 1988, pp. 174-185, San Mateo, CA: Morgan Kaufmann.
[49] DARPA Neural Network Study. Fairfax, VA: AFCEA International Press, 1988.
[50] D. Nguyen and B. Widrow, "The truck backer-upper: An example of self-learning in neural networks," Proc. Intl. Joint Conf. on Neural Networks, vol. 2, pp. 357-363, Wash., DC, June 1989.
[51] T. J. Sejnowski and C. R. Rosenberg, "NETtalk: a parallel network that learns to read aloud," Tech. Rep. JHU/EECS-86/01, Johns Hopkins University, 1986.
[52] -, "Parallel networks that learn to pronounce English text," Complex Systems, vol. 1, pp. 145-168, 1987.
[53] P. M. Shea and V. Lin, "Detection of explosives in checked airline baggage using an artificial neural system," Proc. Intl. Joint Conf. on Neural Networks, vol. 2, pp. 31-34, Wash., DC, June 1989.
[54] D. G. Bounds, P. J. Lloyd, B. Mathew, and G. Waddell, "A multilayer perceptron network for the diagnosis of low back pain," Proc. 2d IEEE Intl. Conf. on Neural Networks, vol. 2, pp. 481-489, San Diego, CA, July 1988.
[55] G. Bradshaw, R. Fozzard, and L. Ceci, "A connectionist expert system that actually works," in Advances in Neural Information Processing Systems I, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1989, pp. 248-255.
[56] N. Mokhoff, "Neural nets making the leap out of lab," Electronic Engineering Times, p. 1, Jan. 22, 1990.
[57] C. A. Mead, Analog VLSI and Neural Systems. Reading, MA: Addison-Wesley, 1989.
[58] B. Widrow and M. E. Hoff, Jr., "Adaptive switching circuits," 1960 IRE Western Electric Show and Convention Record, Part 4, pp. 96-104, Aug. 23, 1960.
[59] -, "Adaptive switching circuits," Tech. Rep. 1553-1, Stanford Electron. Labs., Stanford, CA, June 30, 1960.
[60] P. M. Lewis II and C. Coates, Threshold Logic. New York: Wiley, 1967.
[61] T. M. Cover, Geometrical and Statistical Properties of Linear Threshold Devices. Ph.D. thesis, Tech. Rep. 6107-1, Stanford Electron. Labs., Stanford, CA, May 1964.
[62] R. J. Brown, Adaptive Multiple-Output Threshold Systems and Their Storage Capacities. Thesis, Tech. Rep. 6771-1, Stanford Electron. Labs., Stanford, CA, June 1964.
[63] R. O. Winder, Threshold Logic. Ph.D. thesis, Princeton University, Princeton, NJ, 1962.
[64] S. H. Cameron, "An estimate of the complexity requisite in a universal decision network," Proc. 1960 Bionics Symposium, Wright Air Development Division Tech. Rep. 60-600, pp. 197-211, Dayton, OH, Dec. 1960.
[65] R. D. Joseph, "The number of orthants in n-space intersected by an s-dimensional subspace," Tech. Memorandum 8, Project PARA, Cornell Aeronautical Laboratory, Buffalo, New York, 1960.
[66] D. F. Specht, Generation of Polynomial Discriminant Functions for Pattern Recognition. Ph.D. thesis, Tech. Rep. 6764-5, Stanford Electron. Labs., Stanford, CA, May 1966.
[67] -, "Vectorcardiographic diagnosis using the polynomial discriminant method of pattern recognition," IEEE Trans. Biomed. Eng., vol. BME-14, pp. 90-95, Apr. 1967.
[68] -, "Generation of polynomial discriminant functions for pattern recognition," IEEE Trans. Electron. Comput., vol. EC-16, pp. 308-319, June 1967.
[69] A. R. Barron, "Adaptive learning networks: Development and application in the United States of algorithms related to GMDH," in Self-Organizing Methods in Modeling, S. J. Farlow, Ed., New York: Marcel Dekker Inc., 1984, pp. 25-65.
[70] -, "Predicted squared error: A criterion for automatic model selection," in Self-Organizing Methods in Modeling, S. J. Farlow, Ed. New York: Marcel Dekker Inc., 1984, pp. 87-103.
[71] A. R. Barron and R. L. Barron, "Statistical learning networks: A unifying view," 1988 Symp. on the Interface: Statistics and Computing Science, pp. 192-203, Reston, VA, Apr. 21-23, 1988.
[72] A. G. Ivakhnenko, "Polynomial theory of complex systems," IEEE Trans. Syst., Man, Cybernetics, SMC-1, pp. 364-378, Oct. 1971.
[73] Y. H. Pao, "Functional link nets: Removing hidden layers," AI Expert, pp. 60-68, Apr. 1989.
[74] C. L. Giles and T. Maxwell, "Learning, invariance, and generalization in high-order neural networks," Applied Optics, vol. 26, pp. 4972-4978, Dec. 1, 1987.
[75] M. E. Hoff, Jr., Learning Phenomena in Networks of Adaptive Switching Circuits. Ph.D. thesis, Tech. Rep. 1554-1, Stanford Electron. Labs., Stanford, CA, July 1962.
[76] W. C. Ridgway III, An Adaptive Logic System with Generalizing Properties. Ph.D. thesis, Tech. Rep. 1556-1, Stanford Electron. Labs., Stanford, CA, April 1962.
[77] F. H. Glanz, Statistical Extrapolation in Certain Adaptive Pattern-Recognition Systems. Ph.D. thesis, Tech. Rep. 6767-1, Stanford Electron. Labs., Stanford, CA, May 1965.
[78] B. Widrow, "Adaline and Madaline-1963, plenary speech," Proc. 1st IEEE Intl. Conf. on Neural Networks, vol. 1, pp. 145-158, San Diego, CA, June 23, 1987.
[79] -, "An adaptive 'adaline' neuron using chemical 'memistors,'" Tech. Rep. 1553-2, Stanford Electron. Labs., Stanford, CA, Oct. 17, 1960.
[80] C. L. Giles, R. D. Griffin, and T. Maxwell, "Encoding geometric invariances in higher order neural networks," in Neural Information Processing Systems, D. Z. Anderson, Ed. New York: American Institute of Physics, 1988, pp. 301-309.
[81] D. Casasent and D. Psaltis, "Position, rotation, and scale invariant optical correlation," Appl. Optics, vol. 15, pp. 1795-1799, July 1976.
[82] W. L. Reber and J. Lyman, "An artificial neural system design for rotation and scale invariant pattern recognition," Proc. 1st IEEE Intl. Conf. on Neural Networks, vol. 4, pp. 277-283, San Diego, CA, June 1987.
[83] B. Widrow and R. G. Winter, "Neural nets for adaptive filtering and adaptive pattern recognition," IEEE Computer, pp. 25-39, Mar. 1988.
[84] A. Khotanzad and Y. H. Hong, "Rotation invariant pattern recognition using Zernike moments," Proc. 9th Intl. Conf. on Pattern Recognition, vol. 1, pp. 326-328, 1988.
[85] C. von der Malsburg, "Pattern recognition by labeled graph matching," Neural Networks, vol. 1, pp. 141-148, 1988.
[86] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, "Phoneme recognition using time delay neural networks," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-37, pp. 328-339, Mar. 1989.
[87] C. M. Newman, "Memory capacity in neural network models: Rigorous lower bounds," Neural Networks, vol. 1, pp. 223-238, 1988.
[88] Y. S. Abu-Mostafa and J. St. Jacques, "Information capacity of the Hopfield model," IEEE Trans. Inform. Theory, vol. IT-31, pp. 461-464, 1985.
[89] Y. S. Abu-Mostafa, "Neural networks for computing?" in Neural Networks for Computing, Amer. Inst. of Phys. Conf. Proc. No. 151, J. S. Denker, Ed. New York: American Institute of Physics, 1986, pp. 1-6.
[90] S. S. Venkatesh, "Epsilon capacity of neural networks," in Neural Networks for Computing, Amer. Inst. of Phys. Conf. Proc. No. 151, J. S. Denker, Ed. New York: American Institute of Physics, 1986, pp. 440-445.



[91] J. D. Greenfield, Practical Digital Design Using IC's. 2d ed., New York: Wiley, 1983.
[92] M. Stinchcombe and H. White, "Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions," Proc. Intl. Joint Conf. on Neural Networks, vol. 1, pp. 613-617, Wash., DC, June 1989.
[93] G. Cybenko, "Continuous valued neural networks with two hidden layers are sufficient," Tech. Rep., Dept. of Computer Science, Tufts University, Mar. 1988.
[94] B. Irie and S. Miyake, "Capabilities of three-layered perceptrons," Proc. 2d IEEE Intl. Conf. on Neural Networks, vol. 1, pp. 641-647, San Diego, CA, July 1988.
[95] M. L. Minsky and S. A. Papert, Perceptrons: An Introduction to Computational Geometry. Cambridge, MA: M.I.T. Press, expanded ed., 1988.
[96] M. W. Roth, "Survey of neural network technology for automatic target recognition," IEEE Trans. Neural Networks, vol. 1, pp. 28-43, Mar. 1990.
[97] T. M. Cover, "Capacity problems for linear machines," in Pattern Recognition, L. N. Kanal, Ed. Wash., DC: Thompson Book Co., 1968, pp. 283-289, part 3.
[98] E. B. Baum, "On the capabilities of multilayer perceptrons," J. Complexity, vol. 4, pp. 193-215, Sept. 1988.
[99] A. Lapedes and R. Farber, "How neural networks work," Tech. Rep. LA-UR-88-418, Los Alamos Nat. Laboratory, Los Alamos, NM, 1987.
[100] D. Nguyen and B. Widrow, "Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights," Proc. Intl. Joint Conf. on Neural Networks, San Diego, CA, June 1990.
[101] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals, and Systems, vol. 2, 1989.
[102] E. B. Baum and D. Haussler, "What size net gives valid generalization?" Neural Computation, vol. 1, pp. 151-160, 1989.
[103] J. J. Hopfield and D. W. Tank, "Neural computations of decisions in optimization problems," Biolog. Cybernetics, vol. 52, pp. 141-152, 1985.
[104] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Trans. Neural Networks, vol. 1, pp. 4-27, Mar. 1990.
[105] C. H. Mays, Adaptive Threshold Logic. Ph.D. thesis, Tech. Rep. 1557-1, Stanford Electron. Labs., Stanford, CA, Apr. 1963.
[106] F. Rosenblatt, "On the convergence of reinforcement procedures in simple perceptrons," Cornell Aeronautical Laboratory Report VG-1796-G-4, Buffalo, NY, Feb. 1960.
[107] H. Block, "The perceptron: A model for brain functioning, I," Rev. Modern Phys., vol. 34, pp. 123-135, Jan. 1962.
[108] R. G. Winter, Madaline Rule II: A New Method for Training Networks of Adalines. Ph.D. thesis, Stanford University, Stanford, CA, Jan. 1989.
[109] E. Walach and B. Widrow, "The least mean fourth (LMF) adaptive algorithm and its family," IEEE Trans. Inform. Theory, vol. IT-30, pp. 275-283, Mar. 1984.
[110] E. B. Baum and F. Wilczek, "Supervised learning of probability distributions by neural networks," in Neural Information Processing Systems, D. Z. Anderson, Ed. New York: American Institute of Physics, 1988, pp. 52-61.
[111] S. A. Solla, E. Levin, and M. Fleisher, "Accelerated learning in layered neural networks," Complex Systems, vol. 2, pp. 625-640, 1988.
[112] D. B. Parker, "Optimal algorithms for adaptive neural networks: Second order back propagation, second order direct propagation, and second order Hebbian learning," Proc. 1st IEEE Intl. Conf. on Neural Networks, vol. 2, pp. 593-600, San Diego, CA, June 1987.
[113] A. J. Owens and D. L. Filkin, "Efficient training of the back propagation network by solving a system of stiff ordinary differential equations," Proc. Intl. Joint Conf. on Neural Networks, vol. 2, pp. 381-386, Wash., DC, June 1989.
[114] D. G. Luenberger, Linear and Nonlinear Programming. Reading, MA: Addison-Wesley, 2d ed., 1984.
[115] A. Kramer and A. Sangiovanni-Vincentelli, "Efficient parallel learning algorithms for neural networks," in Advances in Neural Information Processing Systems I, D. S. Touretzky, Ed., pp. 40-48, San Mateo, CA: Morgan Kaufmann, 1989.
[116] R. V. Southwell, Relaxation Methods in Engineering Science. New York: Oxford, 1940.
[117] D. J. Wilde, Optimum Seeking Methods. Englewood Cliffs, NJ: Prentice-Hall, 1964.
[118] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series, with Engineering Applications. New York: Wiley, 1949.
[119] T. Kailath, "A view of three decades of linear filtering theory," IEEE Trans. Inform. Theory, vol. IT-20, pp. 145-181, Mar. 1974.
[120] H. Bode and C. Shannon, "A simplified derivation of linear least squares smoothing and prediction theory," Proc. IRE, vol. 38, pp. 417-425, Apr. 1950.
[121] L. L. Horowitz and K. D. Senne, "Performance advantage of complex LMS for controlling narrow-band adaptive arrays," IEEE Trans. Circuits, Systems, vol. CAS-28, pp. 562-576, June 1981.
[122] E. D. Sontag and H. J. Sussmann, "Backpropagation separates when perceptrons do," Proc. Intl. Joint Conf. on Neural Networks, vol. 1, pp. 639-642, Wash., DC, June 1989.
[123] -, "Backpropagation can give rise to spurious local minima even for networks without hidden layers," Complex Systems, vol. 3, pp. 91-106, 1989.
[124] J. J. Shynk and S. Roy, "The LMS algorithm with momentum updating," ISCAS 88, Espoo, Finland, June 1988.
[125] S. E. Fahlman, "Faster learning variations on backpropagation: An empirical study," in Proc. 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, Eds. June 17-26, 1988, pp. 38-51, San Mateo, CA: Morgan Kaufmann.
[126] R. A. Jacobs, "Increased rates of convergence through learning rate adaptation," Neural Networks, vol. 1, pp. 295-307, 1988.
[127] F. J. Pineda, "Generalization of backpropagation to recurrent neural networks," Phys. Rev. Lett., vol. 18, pp. 2229-2232, 1987.
[128] L. B. Almeida, "A learning rule for asynchronous perceptrons with feedback in a combinatorial environment," Proc. 1st IEEE Intl. Conf. on Neural Networks, vol. 2, pp. 609-618, San Diego, CA, June 1987.
[129] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," ICS Report 8805, Inst. for Cog. Sci., University of California at San Diego, La Jolla, CA, Oct. 1988.
[130] S. A. White, "An adaptive recursive digital filter," Proc. 9th Asilomar Conf. Circuits Syst. Comput., p. 21, Nov. 1975.
[131] B. Pearlmutter, "Learning state space trajectories in recurrent neural networks," in Proc. 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, Eds. June 17-26, 1988, pp. 113-117. San Mateo, CA: Morgan Kaufmann.
[132] M. Holler, et al., "An electrically trainable artificial neural network (ETANN) with 10240 'floating gate' synapses," Proc. Intl. Joint Conf. on Neural Networks, vol. 2, pp. 191-196, Wash., DC, June 1989.
[133] D. Andes, B. Widrow, M. Lehr, and E. Wan, "MRIII: A robust algorithm for training analog neural networks," Proc. Intl. Joint Conf. on Neural Networks, vol. 1, pp. 533-536, Wash., DC, Jan. 1990.

Bernard Widrow (Fellow, IEEE) received the S.B., S.M., and Sc.D. degrees from the Massachusetts Institute of Technology in 1951, 1953, and 1956, respectively.
He was with M.I.T. until he joined the Stanford University faculty in 1959, where he is now a Professor of electrical engineering. He is presently engaged in research and teaching in neural networks, pattern recognition, adaptive filtering, and adaptive control systems. He is associate
editor of the journals Adaptive Control and Signal Processing, Neural Networks, Information Sciences, and Pattern Recognition and coauthor with S. D. Stearns of Adaptive Signal Processing (Prentice Hall).
Dr. Widrow received the S.B., S.M., and Sc.D. degrees from MIT in 1951, 1953, and 1956. He is a member of the American Association of University Professors, the Pattern Recognition Society, Sigma Xi, and Tau Beta Pi. He is a fellow of the American Association for the Advancement of Science. He is president of the International Neural Network Society. He received the IEEE Alexander Graham Bell Medal in 1986 for exceptional contributions to the advancement of telecommunications.

Michael A. Lehr was born in New Jersey on April 18, 1964. He received the B.E.E. degree in electrical engineering at the Georgia Institute of Technology in 1987, graduating top in his class. He received the M.S.E.E. from Stanford University in 1986.
From 1982 to 1984, he worked on two-way radio development at Motorola in Ft. Lauderdale, Florida, and from 1984 to 1987 he was involved with naval sonar system development and test at IBM in Manassas, Virginia. Currently, he is a doctoral candidate in the Department of Electrical Engineering at Stanford University. His research interests include adaptive signal processing and neural networks.
Mr. Lehr holds a General Radiotelephone Operator License (1981) and Radar Endorsement (1982), and is a member of Tau Beta Pi, Eta Kappa Nu, and Phi Kappa Phi.
