Neural Networks, Vol. 4, pp. 251-257, 1991    0893-6080/91 $3.00 + .00
Printed in the USA. All rights reserved.    Copyright © 1991 Pergamon Press plc

ORIGINAL CONTRIBUTION

Approximation Capabilities of Multilayer Feedforward Networks

KURT HORNIK
Technische Universität Wien, Vienna, Austria

(Received 30 January 1990; revised and accepted 25 October 1990)

Abstract—We show that standard multilayer feedforward networks with as few as a single hidden layer and arbitrary bounded and nonconstant activation function are universal approximators with respect to L^p(μ) performance criteria, for arbitrary finite input environment measures μ, provided only that sufficiently many hidden units are available. If the activation function is continuous, bounded and nonconstant, then continuous mappings can be learned uniformly over compact input sets. We also give very general conditions ensuring that networks with sufficiently smooth activation functions are capable of arbitrarily accurate approximation to a function and its derivatives.

Keywords—Multilayer feedforward networks, Activation function, Universal approximation capabilities, Input environment measure, L^p(μ) approximation, Uniform approximation, Sobolev spaces, Smooth approximation.

1. INTRODUCTION

The approximation capabilities of neural network architectures have recently been investigated by many authors, including Carroll and Dickinson (1989), Cybenko (1989), Funahashi (1989), Gallant and White (1988), Hecht-Nielsen (1989), Hornik, Stinchcombe, and White (1989, 1990), Irie and Miyake (1988), Lapedes and Farber (1988), and Stinchcombe and White (1989, 1990). (This list is by no means complete.)

If we think of the network architecture as a rule for computing values at l output units given values at k input units, hence implementing a class of mappings from R^k to R^l, we can ask how well arbitrary mappings from R^k to R^l can be approximated by the network, in particular, if as many hidden units as required for internal representation and computation may be employed.

How to measure the accuracy of approximation depends on how we measure closeness between functions, which in turn varies significantly with the specific problem to be dealt with. In many applications, it is necessary to have the network perform simultaneously well on all input samples taken from some compact input set X in R^k. In this case, closeness is measured by the uniform distance between functions on X, that is,

    ρ_{u,X}(f, g) = sup_{x ∈ X} |f(x) − g(x)|.

In other applications, we think of the inputs as random variables and are interested in the average performance, where the average is taken with respect to the input environment measure μ, where μ(R^k) < ∞. In this case, closeness is measured by the L^p(μ) distances

    ρ_{p,μ}(f, g) = [ ∫_{R^k} |f(x) − g(x)|^p dμ(x) ]^{1/p},

1 ≤ p < ∞, the most popular choice being p = 2, corresponding to mean square error.

Of course, there are many more ways of measuring closeness of functions. In particular, in many applications, it is also necessary that the derivatives of the approximating function implemented by the network closely resemble those of the function to be approximated, up to some order. This issue was first taken up in Hornik et al. (1990), who discuss the sources of the need for smooth functional approximation in more detail. Typical examples arise in robotics (learning of smooth movements) and signal processing (analysis of chaotic time series); for a recent application to problems of nonparametric inference in statistics and econometrics, see Gallant and White (1989).

All papers establishing certain approximation capabilities of multilayer perceptrons thus far have been successful only by making more or less explicit assumptions on the activation function ψ, for example, by assuming ψ to be integrable, sigmoidal, or squashing (sigmoidal and monotone), etc. In this article, we shall demonstrate that these assumptions are unnecessary. We shall show that whenever ψ is bounded and nonconstant, then, for arbitrary input environment measures μ, standard multilayer feedforward networks with activation function ψ can approximate any function in L^p(μ) (the space of all functions on R^k such that ∫_{R^k} |f(x)|^p dμ(x) < ∞) arbitrarily well if closeness is measured by ρ_{p,μ}, provided that sufficiently many hidden units are available.

Similarly, we shall establish that whenever ψ is continuous, bounded and nonconstant, then, for arbitrary compact subsets X of R^k, standard multilayer feedforward networks with activation function ψ can approximate any continuous function on X arbitrarily well with respect to the uniform distance ρ_{u,X}, provided that sufficiently many hidden units are available. Hence, we conclude that it is not the specific choice of the activation function, but rather the multilayer feedforward architecture itself, which gives neural networks the potential of being universal learning machines.

In addition, we significantly improve the results on smooth approximation capabilities of neural nets given in Hornik et al. (1990) by simultaneously relaxing the conditions to be imposed on the activation function and providing results for the previously uncovered case of weighted Sobolev approximation with respect to finite input environment measures which do not have compact support, for example, Gaussian input distributions.

Requests for reprints should be sent to Kurt Hornik, Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität Wien, Wiedner Hauptstraße 8-10/107, A-1040 Wien, Austria.

2. RESULTS

For notational convenience, we shall explicitly formulate our results only for the case where there is only one hidden layer and one output unit. The corresponding results for the general multiple-hidden-layer, multioutput case can easily be deduced from the simple case, cf. Corollaries 2.6 and 2.7 in Hornik et al. (1989).

If there is only one hidden layer and only one output unit, then the set of all functions implemented by such a network with n hidden units is

    𝔑_k^(n)(ψ) = { h : R^k → R | h(x) = ∑_{j=1}^n β_j ψ(a_j'x − θ_j) },

where ψ is the common activation function of the hidden units and ' denotes transpose, so that if a has components α_1, ..., α_k and x has components ξ_1, ..., ξ_k, a'x is the dot product α_1 ξ_1 + ··· + α_k ξ_k. (Output units are always assumed to be linear.) The set of all functions implemented by such a network with an arbitrarily large number of hidden units is

    𝔑_k(ψ) = ∪_{n=1}^∞ 𝔑_k^(n)(ψ).

In what follows, some concepts from modern analysis will be needed. As a reference, we recommend Friedman (1982). For 1 ≤ p < ∞, we write

    ‖f‖_{p,μ} = [ ∫_{R^k} |f(x)|^p dμ(x) ]^{1/p}

so that ρ_{p,μ}(f, g) = ‖f − g‖_{p,μ}. L^p(μ) is the space of all functions f such that ‖f‖_{p,μ} < ∞. A subset S of L^p(μ) is dense in L^p(μ) if for arbitrary f ∈ L^p(μ) and ε > 0 there is a function g ∈ S such that ρ_{p,μ}(f, g) < ε.

Theorem 1: If ψ is bounded and nonconstant, then 𝔑_k(ψ) is dense in L^p(μ) for all finite measures μ on R^k.

C(X) is the space of all continuous functions on X. A subset S of C(X) is dense in C(X) if for arbitrary f ∈ C(X) and ε > 0 there is a function g ∈ S such that ρ_{u,X}(f, g) < ε.

Theorem 2: If ψ is continuous, bounded and nonconstant, then 𝔑_k(ψ) is dense in C(X) for all compact subsets X of R^k.

A k-tuple α = (α_1, ..., α_k) of nonnegative integers is called a multiindex. We then write |α| = α_1 + ··· + α_k for the order of the multiindex α and

    D^α f(x) = ∂^{|α|} f(x) / (∂ξ_1^{α_1} ··· ∂ξ_k^{α_k})

for the corresponding partial derivative of a sufficiently smooth function f of x = (ξ_1, ..., ξ_k) ∈ R^k.

C^m(R^k) is the space of all functions f which, together with all their partial derivatives D^α f of order |α| ≤ m, are continuous on R^k. For all subsets X of R^k and f ∈ C^m(R^k), let

    ‖f‖_{m,u,X} := max_{|α| ≤ m} sup_{x ∈ X} |D^α f(x)|.

A subset S of C^m(R^k) is uniformly m-dense on compacta in C^m(R^k) if for all f ∈ C^m(R^k), for all compact subsets X of R^k, and for all ε > 0 there is a function g = g(f, X, ε) ∈ S such that ‖f − g‖_{m,u,X} < ε.

For f ∈ C^m(R^k), μ a finite measure on R^k and 1 ≤ p < ∞, let

    ‖f‖_{m,p,μ} := [ ∑_{|α| ≤ m} ∫_{R^k} |D^α f|^p dμ ]^{1/p},

and let the weighted Sobolev space C^{m,p}(μ) be defined by

    C^{m,p}(μ) = { f ∈ C^m(R^k) : ‖f‖_{m,p,μ} < ∞ }.

Observe that C^{m,p}(μ) = C^m(R^k) if μ has compact
support. A subset S of C^{m,p}(μ) is dense in C^{m,p}(μ) if for all f ∈ C^{m,p}(μ) and ε > 0 there is a function g = g(f, ε) ∈ S such that ‖f − g‖_{m,p,μ} < ε.

We then have the following results.

Theorem 3: If ψ ∈ C^m(R) is nonconstant and bounded, then 𝔑_k(ψ) is uniformly m-dense on compacta in C^m(R^k) and dense in C^{m,p}(μ) for all finite measures μ on R^k with compact support.

Theorem 4: If ψ ∈ C^m(R) is nonconstant and all its derivatives up to order m are bounded, then 𝔑_k(ψ) is dense in C^{m,p}(μ) for all finite measures μ on R^k.

3. DISCUSSION

The conditions imposed on ψ in our theorems are very general. In particular, they are satisfied by all smooth squashing activation functions, such as the logistic squasher or the arctangent squasher, that have become popular in neural network applications.

Many corollaries can be deduced from our theorems. In particular, as convergence in L^p(μ) implies convergence in μ-measure, we conclude from Theorem 1 that whenever ψ is bounded and nonconstant, all measurable functions on R^k can be approximated by functions in 𝔑_k(ψ) in μ-measure. It follows that (cf. Lemma 2.1 in Hornik et al. [1989]) for arbitrary measurable functions f and ε > 0, we can find a compact subset X_ε of R^k and a function g ∈ 𝔑_k(ψ) such that

    ρ_{u,X_ε}(f, g) < ε.

This substantially improves Theorems 3 and 5 in Cybenko (1989) and Corollary 2.1 in Hornik et al. (1989), and is of basic importance for the use of artificial neural networks in classification and decision problems, cf. Cybenko (1989), Sections 3 and 4.

If the activation function is constant, only constant mappings can be learned, which is definitely not a very interesting case. The continuity assumption in Theorem 2 can be weakened. For example, Theorem 2.4 in Hornik et al. (1989) shows that whenever ψ is a squashing function, then 𝔑_k(ψ) is dense in C(X) for all compact subsets X of R^k. In fact, their method can easily be modified to deliver the same uniform approximation capability whenever ψ has distinct finite limits at ±∞. Whether or not the continuity assumption can entirely be dropped is still an open (and quite challenging) problem.

There are, of course, unbounded functions which are capable of uniform approximation. For example, a simple application of the Stone-Weierstraß theorem (cf. Hornik et al. [1989]) implies that 𝔑_k(exp) is dense in C(X), where of course exp is the standard exponential function. However, our theorems definitely do not remain valid for all unbounded activation functions. If ψ is a polynomial of degree d (d ≥ 1), then 𝔑_k(ψ) is just the space P_d of all polynomials in k variables of degree less than or equal to d. Hence, for all reasonably rich input spaces X or input environment measures μ, 𝔑_k(ψ) cannot be dense in C(X) or L^p(μ), respectively. Also, if the tail behavior of an unbounded function ψ is not compatible with the tail behavior of μ, then x ↦ ψ(a'x − θ) may not be an element of L^p(μ) for most or all nonzero a ∈ R^k.

By allowing for a much larger class of activation functions, Theorem 3 significantly improves the results in Hornik et al. (1990), where the conclusions of Theorem 3 are established under the assumption that there exists some l ≥ m such that ψ ∈ C^l(R) and 0 < ∫_R |D^l ψ| dt < ∞ (l-finiteness). However, many interesting functions, such as all nonconstant periodic functions, are not l-finite. Using Theorem 3 we easily infer that if ψ is a nonconstant finite linear combination of periodic functions in C^m(R) (in particular, if ψ is a nonconstant trigonometric polynomial), then 𝔑_k(ψ) is uniformly m-dense on compacta in C^m(R^k). Other interesting examples that can now be dealt with are functions such as ψ(t) = sin(t)/t (which is not l-finite for any l), or more generally, all functions which are the Fourier transform of some finite signed measure which has finite absolute moments up to order m (such functions are usually not l-finite).

Theorem 4 gives weighted Sobolev-type approximation results for the previously uncovered case of finite input environment measures which are not compactly supported. Using Theorem 4 we may conclude that if ψ is the logistic or arctangent squasher, or a nonconstant trigonometric polynomial, then 𝔑_k(ψ) is dense in C^{m,p}(μ) for all finite measures μ. In particular, we now have a result for inputs that follow a multivariate Gaussian distribution.

The following generalization of our results is immediate: suppose that ψ is unbounded, but that there is a nonconstant and bounded function φ ∈ 𝔑_1(ψ). Then, by Theorem 1, 𝔑_k(φ) is dense in L^p(μ). As 𝔑_k(φ) ⊂ 𝔑_k(ψ), we can state that in this case, 𝔑_k(ψ) contains a subset which is dense in L^p(μ). (Observe that if the support of μ is not compact and ψ is unbounded, we do not necessarily have 𝔑_k(ψ) ⊂ L^p(μ); hence, we cannot simply state that 𝔑_k(ψ) itself is dense in L^p(μ).) Similar considerations apply for the other theorems.

If Ω is an open subset of R^k, let C^m(Ω) be the space of all functions f which, together with all their partial derivatives D^α f of order |α| ≤ m, are continuous on Ω. Let us say that a subset S of C^m(Ω) is uniformly m-dense on compacta in C^m(Ω) if for all f ∈ C^m(Ω), for all compact subsets X of Ω, and for all ε > 0 there is a function g = g(f, X, ε) ∈ S such that ‖f − g‖_{m,u,X} < ε.
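As a concrete aside (ours, not part of the original paper), the contrast between a bounded nonconstant activation and the polynomial counterexample discussed in this section can be checked numerically. The target sin(6x), the weight ranges, and the least-squares fit of the output weights β are illustrative assumptions; ρ_{u,X} is estimated on a finite grid over the compact set X = [−1, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.linspace(-1.0, 1.0, 400)   # grid on the compact input set X = [-1, 1]
f = np.sin(6 * x)                 # continuous target on X

def sup_error(H, y):
    # Least-squares output weights beta for the hidden-layer design matrix H;
    # returns the uniform error on the grid, an estimate of rho_{u,X}(h, f).
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return float(np.max(np.abs(H @ beta - y)))

def design(psi, n):
    # Column j holds psi(a_j * x - theta_j): one hidden unit per column,
    # with randomly drawn input weights a_j and biases theta_j.
    a = rng.uniform(-8.0, 8.0, n)
    theta = rng.uniform(-8.0, 8.0, n)
    return psi(np.outer(x, a) - theta)

# Bounded, nonconstant activation (tanh): the error generally shrinks as n grows.
err_tanh = {n: sup_error(design(np.tanh, n), f) for n in (5, 25, 100)}

# psi(t) = t^2: every network is a polynomial of degree <= 2 in x,
# so adding hidden units cannot help beyond the best quadratic fit.
err_quad = sup_error(design(np.square, 200), f)

print(err_tanh, err_quad)
```

With ψ(t) = t², every network in 𝔑_1(ψ) lies in P_2, so the uniform error cannot fall below that of the best quadratic approximation of sin(6x); the tanh networks, which are covered by Theorem 2, fit the target to high accuracy once enough hidden units are used.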
It is easily seen that under the conditions of Theorem 3, 𝔑_k(ψ) is uniformly m-dense on compacta in C^m(Ω) for all open subsets Ω of R^k. In fact, it suffices to show that whenever f ∈ C^m(Ω) and X is a compact subset of Ω, then we can find a function E_X f ∈ C^m(R^k) satisfying E_X f(x) = f(x) for all x ∈ X. Now, by Problem 3.3.1 in Friedman (1982), we can find a function h ∈ C^∞(R^k) such that h = 1 on X, 0 ≤ h ≤ 1 on Ω\X, and h = 0 outside Ω. Take E_X f = hf on Ω and E_X f = 0 outside Ω.

Suppose that Ω is bounded. Functions f in C^m(Ω) do not necessarily satisfy ‖f‖_{m,u,Ω} < ∞. On the other hand, all functions in C^m(R^k), and hence in particular all functions in 𝔑_k(ψ) if ψ ∈ C^m(R), satisfy ‖g‖_{m,u,X} < ∞ for each compact subset X of R^k. Hence in general, it is not possible to approximate functions in C^m(Ω) by functions in 𝔑_k(ψ) arbitrarily well with respect to ‖·‖_{m,u,Ω}.

However, one might ask whether such approximation is possible for at least all functions in the space C^m(Ω̄) which consists of all functions f ∈ C^m(Ω) for which D^α f is bounded and uniformly continuous on Ω for 0 ≤ |α| ≤ m. The following prominent counterexample shows that this is not always possible. Let k = 1, Ω = (−1,0) ∪ (0,1) and let f = 0 on (−1,0) and f = 1 on (0,1). Then f ∈ C^m(Ω̄), but it is obviously impossible to approximate f by continuous functions on R uniformly over Ω. In fact, we always have ‖f − g‖_{0,u,Ω} ≥ 1/2 for all g ∈ C(R). Roughly speaking, if Ω is bounded, then 𝔑_k(ψ) approximates all functions in C^m(Ω̄) arbitrarily well with respect to ‖·‖_{m,u,Ω} if the geometry of Ω is such that functions f ∈ C^m(Ω̄) can be extended to functions in C^m(R^k). (Cf. also the next paragraph.)

Classical (nonweighted) Sobolev spaces are defined as follows. Let Ω be an open set in R^k, let the input environment measure μ be standard Lebesgue measure on Ω, for functions f ∈ C^m(Ω) let

    ‖f‖_{m,p,Ω} := [ ∑_{|α| ≤ m} ∫_Ω |D^α f|^p dx ]^{1/p},

and let

    H^{m,p}(Ω) = { f ∈ C^m(Ω) : ‖f‖_{m,p,Ω} < ∞ }.

(More precisely, standard Sobolev spaces are defined as the completions of the above H^{m,p}(Ω) with respect to ‖·‖_{m,p,Ω}. The elements of these spaces are not necessarily classically smooth functions, but have generalized derivatives. See, for example, the discussion in Hornik et al. (1990).)

It is easily seen that globally smooth functions on R^k are not dense in H^{m,p}(Ω) (with respect to ‖·‖_{m,p,Ω}) for most domains Ω. In the above example, no function in C^1(R) can approximate f in H^{1,1}(Ω). Arbitrarily close approximations by globally smooth functions on R^k are only possible under certain conditions on the geometry of Ω that somehow exclude the possibility that Ω lies on both sides of part of its boundary. Such conditions are, for example, that Ω has the segment property (Adams, 1975, Theorem 3.18) or that Ω is starshaped with respect to a point (Maz'ja, 1985, Theorem 1.1.6.1). In both cases, it can be shown that C_0^∞(R^k), the space of all functions on R^k with compact support which are infinitely often continuously differentiable, is dense in H^{m,p}(Ω). Hence, if in addition Ω is bounded, 𝔑_k(ψ) is dense in H^{m,p}(Ω) under the conditions of Theorem 3.

If the underlying input environment measure μ is not finite, but is regular in the sense that μ(X) < ∞ for all compact subsets X of R^k (as an example we may take standard Lebesgue measure on R^k), then 𝔑_k(ψ) is dense in all L^p_loc(μ) spaces, 1 ≤ p < ∞, whenever ψ is bounded and nonconstant, improving results in Stinchcombe and White (1989).

Similarly, we can measure closeness of functions in C^m(R^k) by the local weighted Sobolev space distance measure

    ρ_{m,p,loc,μ}(f, g) := ∑_{n=1}^∞ 2^{−n} min(‖f − g‖_{m,p,μ_n}, 1),

where 1 ≤ p ≤ ∞, μ_n is the restriction of μ to some bounded set X_n and the X_n exhaust all of R^k, that is, ∪_{n=1}^∞ X_n = R^k. It follows straightforwardly that, under the conditions of Theorem 3, 𝔑_k(ψ) is dense in C^m(R^k) with respect to ρ_{m,p,loc,μ}.

Concluding Remark

In this article, we established that multilayer feedforward networks are, under very general conditions on the hidden unit activation function, universal approximators provided that sufficiently many hidden units are available. However, it should be emphasized that our results do not mean that all activation functions ψ will perform equally well in specific learning problems. In applications, additional issues such as, for example, minimal redundancy or computational efficiency have to be taken into account as well.

4. PROOFS

In order to establish our theorems, we follow an approach first utilized by Cybenko (1989) that is based on an application of the Hahn-Banach theorem combined with representation theorems for continuous linear functionals on the function spaces under consideration.

Proof of Theorems 1 and 2: As ψ is bounded, 𝔑_k(ψ) is a linear subspace of L^p(μ) for all finite measures μ on R^k. If, for some μ, 𝔑_k(ψ) is not dense in L^p(μ), Corollary 4.8.7 in Friedman (1982) yields that there is a nonzero continuous linear functional Λ on L^p(μ) that vanishes on 𝔑_k(ψ).
As well known (Friedman, 1982, Corollary 4.14.4 and Theorem 4.14.6), Λ is of the form f ↦ Λ(f) = ∫_{R^k} f g dμ with some g in L^q(μ), where q is the conjugate exponent q = p/(p − 1). (For p = 1 we obtain q = ∞; L^∞(μ) is the space of all functions f for which the μ-essential supremum

    ‖f‖_{∞,μ} = inf { N > 0 : μ{x ∈ R^k : |f(x)| > N} = 0 }

is finite, that is, the space of all μ-essentially bounded functions.)

If we write σ(B) = ∫_B g dμ, we find by Hölder's inequality that for all B,

    |σ(B)| = | ∫_{R^k} 1_B g dμ | ≤ ‖1_B‖_{p,μ} ‖g‖_{q,μ} ≤ (μ(R^k))^{1/p} ‖g‖_{q,μ} < ∞,

hence σ is a nonzero finite signed measure on R^k such that Λ(f) = ∫_{R^k} f g dμ = ∫_{R^k} f dσ. As Λ vanishes on 𝔑_k(ψ), we conclude that in particular

    ∫_{R^k} ψ(a'x − θ) dσ(x) = 0

for all a ∈ R^k and θ ∈ R.

Similarly, suppose that ψ is continuous and that for some compact subset X of R^k, 𝔑_k(ψ) is not dense in C(X). Proceeding as in the proof of Theorem 1 in Cybenko (1989), we find that in this case there exists a nonzero finite signed measure σ on R^k (σ is actually concentrated on X) such that

    ∫_{R^k} ψ(a'x − θ) dσ(x) = 0

for all a ∈ R^k, θ ∈ R.

Summing up, in either case we arrive at the following question. Can there exist a nonzero finite signed measure σ on R^k such that ∫_{R^k} ψ(a'x − θ) dσ(x) vanishes for all a ∈ R^k and θ ∈ R? This question was first asked and investigated by Cybenko (1989) who basically gave the following definition.

Definition. A bounded function ψ is called discriminatory if no nonzero finite signed measure σ on R^k exists such that

    ∫_{R^k} ψ(a'x − θ) dσ(x) = 0

for all a ∈ R^k, θ ∈ R.

In Cybenko (1989), it is shown that if ψ is sigmoidal, then ψ is discriminatory. (The proof can trivially be generalized to the case where ψ has distinct and finite limits at ±∞.) However, the following much stronger result is true, which, upon combination with the above arguments, establishes Theorems 1 and 2.

Theorem 5: Whenever ψ is bounded and nonconstant, it is discriminatory.

Proof: Throughout the proof, certain techniques and results from Fourier analysis will be used. As a reference we recommend the excellent book by Rudin (1967).

Suppose that ψ is bounded and nonconstant and that σ is a finite signed measure on R^k such that ∫_{R^k} ψ(a'x − θ) dσ(x) = 0 for all a ∈ R^k and θ ∈ R. Fix u ∈ R^k and let σ_u be the finite signed measure on R induced by the transformation x ↦ u'x, that is, for all Borel sets B of R we have

    σ_u(B) = σ{x ∈ R^k : u'x ∈ B}.

Then, at least for all bounded functions χ on R,

    ∫_{R^k} χ(u'x) dσ(x) = ∫_R χ(t) dσ_u(t).

Hence by assumption,

    ∫_{R^k} ψ(λu'x − θ) dσ(x) = ∫_R ψ(λt − θ) dσ_u(t) = 0

for all λ, θ ∈ R.

To simplify notation, let us write L = L^1(R) for the space of integrable functions on R (with respect to Lebesgue measure) and M = M(R) for the space of finite signed measures on R. For f ∈ L, ‖f‖_L denotes the usual L^1 norm and f̂ the Fourier transform. Similarly, for τ ∈ M, ‖τ‖_M denotes the total variation of τ on R and τ̂ the Fourier transform.

By choosing θ such that ψ(−θ) ≠ 0 and setting λ to zero, we find that in particular ∫_R dσ_u(t) = σ̂_u(0) = 0. For u = 0, σ_0 is concentrated at t = 0 and σ_0{0} = σ̂_0(0) = 0, hence σ_0 = 0. Now suppose u ≠ 0. Pick a function w ∈ L whose Fourier transform has no zero (e.g., take w(t) = exp(−t²)). Consider the integral

    ∫_R ∫_R ψ(λ(s + t) − θ) w(s) ds dσ_u(t).

As

    ∫_R ∫_R |ψ(λ(s + t) − θ)| |w(s)| ds d|σ_u|(t) ≤ ‖w‖_L ‖σ_u‖_M sup_{t ∈ R} |ψ(t)| < ∞,

we may apply Fubini's theorem to obtain

    0 = ∫_R [ ∫_R ψ(λt − (θ − λs)) dσ_u(t) ] w(s) ds
      = ∫_R ∫_R ψ(λ(s + t) − θ) w(s) ds dσ_u(t)
      = ∫_R ψ(λt − θ) d(w * σ_u)(t),

where w * σ_u denotes the convolution of w and σ_u. By Theorem 1.3.5 in Rudin (1967), L is a closed ideal in M, hence in particular w * σ_u is absolutely continuous with respect to Lebesgue measure. Let h ∈ L be the corresponding Radon-Nikodym derivative. Then ĥ = ŵ σ̂_u, hence in particular ĥ(0) = 0.
The above equation is then equivalent to ∫_R ψ(λt − θ) h(t) dt = 0. Let a ≠ 0 and y ∈ R. By first replacing λ by 1/a and θ by −y/a and then performing the change of variables t ↦ at − y, we obtain that for all y ∈ R and for all nonzero real a,

    ∫_R ψ(t) h(at − y) dt = 0.

Let us write M_a h(t) for h(at). The above equation implies that ∫_R ψ(t) f(t) dt vanishes for all f contained in the closed translation-invariant subspace I spanned by the family M_a h, a ≠ 0. By Theorem 7.1.2 in Rudin (1967), I is an ideal in L.

Following the notation in Rudin (1967), let us write Z(f) for the set of all ω ∈ R where the Fourier transform f̂(ω) of f ∈ L vanishes, and if I is an ideal, define Z(I), the zero set of I, as the set of ω where the Fourier transforms of all functions in I vanish.

Suppose that h is nonzero. As the Fourier transform of M_a h at ω is ĥ(ω/a)/a, we find that Z(I) = {0} and in fact, I is precisely the set of all integrable functions f with ∫_R f(t) dt = f̂(0) = 0. To see this, let us first note that for all functions f ∈ I, we trivially have {0} = Z(I) ⊆ Z(f). Conversely, suppose that f has zero integral. As the intersection of the boundaries of Z(I) and Z(f) (again trivially) equals {0} and hence contains no perfect set, Theorem 7.2.4 in Rudin (1967) implies that f ∈ I.

Hence, if h is nonzero, the integral ∫_R ψ(t) f(t) dt vanishes for all integrable functions which have zero integral. It is easily seen that this implies that ψ is constant, which was ruled out by assumption. Hence h = 0 and thus ĥ = ŵ σ̂_u is identically zero, which in turn yields that σ̂_u vanishes identically, because ŵ has no zeros. By the uniqueness Theorem 1.3.7(b) in Rudin (1967), σ_u = 0.

Summing up, we find that σ_u = 0 for all u ∈ R^k. To complete the proof, let σ̂(u) = ∫_{R^k} exp(iu'x) dσ(x) be the Fourier transform of σ at u. Then

    σ̂(u) = ∫_{R^k} exp(iu'x) dσ(x) = ∫_R exp(it) dσ_u(t) = 0,

that is, σ̂ = 0. Again invoking the uniqueness Theorem 1.3.7(b) in Rudin (1967), σ = 0 and the proof of Theorem 5 is complete.

The proofs of the remaining theorems require some additional preparation. For functions f defined on R^k, let ‖f‖_u := sup_{R^k} |f(x)|. Let ω be the familiar function in C^∞(R^k) with support in the unit sphere given by

    ω(x) = c exp(−1/(1 − |x|²))  if |x| < 1,
    ω(x) = 0                     if |x| ≥ 1,

where |x| is the Euclidean length of x and c is a constant chosen in a way that ∫_{R^k} ω(x) dx = 1. For ε > 0, let us write ω_ε(x) = ε^{−k} ω(x/ε).

If f is a locally integrable function on R^k, let J_ε f be the convolution ω_ε * f. The following facts are well known (Adams, 1975, pp. 29ff.).

• J_ε f ∈ C^∞(R^k) with derivatives D^α J_ε f = D^α ω_ε * f.
• ‖J_ε f‖_u ≤ ‖f‖_u. Thus, if f is bounded, then J_ε f(x) is uniformly bounded in x and ε.
• If f is continuous, then J_ε f → f uniformly on compacta as ε → 0.

Similarly, if σ is a locally finite signed measure on R^k, let J_ε σ be the convolution ω_ε * σ, that is,

    J_ε σ(x) = ∫_{R^k} ω_ε(x − y) dσ(y).

Then again, J_ε σ ∈ C^∞(R^k). If σ has compact support, J_ε σ has compact support.

Finally, the following result can easily be established. (The first assertion is a straightforward application of Fubini's theorem using the symmetry of ω_ε, and the second one follows by Lebesgue's bounded convergence theorem.)

Lemma. Suppose that f and σ satisfy one of the two following conditions: (a) f is continuous and σ is a finite signed measure with compact support; (b) f is bounded and continuous and σ is a finite signed measure. Then, if T_y denotes translation by y, that is, T_y f(x) = f(x + y),

    ∫_{R^k} f J_ε σ dx = ∫_{R^k} [ ∫_{R^k} T_y f dσ ] ω_ε(y) dy = ∫_{R^k} J_ε f dσ,

and

    lim_{ε→0} ∫_{R^k} f J_ε σ dx = ∫_{R^k} f dσ.

Proof of Theorem 3: If 𝔑_k(ψ) is not uniformly m-dense on compacta in C^m(R^k), then by the usual dual space argument there exists a collection σ_α, |α| ≤ m, of finite signed measures with support in some compact subset X of R^k such that the functional

    Λ(f) = ∑_{|α| ≤ m} ∫_{R^k} D^α f dσ_α

vanishes on 𝔑_k(ψ), but not identically on C^m(R^k). For ε > 0, define functionals Λ_ε by

    Λ_ε(f) := ∑_{|α| ≤ m} ∫_{R^k} D^α f J_ε σ_α dx.

(All integrals exist because all J_ε σ_α have compact support.) By part (a) of the above lemma, we conclude that

    Λ_ε(f) = ∫_{R^k} [ ∑_{|α| ≤ m} ∫_{R^k} D^α T_y f dσ_α ] ω_ε(y) dy = ∫_{R^k} Λ(T_y f) ω_ε(y) dy

and that

    lim_{ε→0} Λ_ε(f) = Λ(f)

for all f ∈ C^m(R^k). Finally, integration by parts yields that

    Λ_ε(f) = ∫_{R^k} h_ε f dx,  where  h_ε := ∑_{|α| ≤ m} (−1)^{|α|} D^α J_ε σ_α.

Let us write ψ_{a,θ}(x) = ψ(a'x − θ). Suppose that Λ vanishes on 𝔑_k(ψ). As ψ_{a,θ} ∈ 𝔑_k(ψ) for all a ∈ R^k and θ ∈ R, we infer that Λ(ψ_{a,θ}) = 0. Observing that T_y ψ_{a,θ} = ψ_{a,θ−a'y}, we see that Λ(T_y ψ_{a,θ}) = 0 for all a, y ∈ R^k, θ ∈ R. It follows that

    ∫_{R^k} ψ_{a,θ} h_ε dx = Λ_ε(ψ_{a,θ}) = ∫_{R^k} Λ(T_y ψ_{a,θ}) ω_ε(y) dy = 0

for all a ∈ R^k and θ ∈ R. As, by assumption, ψ is bounded and nonconstant, Theorem 5 implies that h_ε = 0. Hence Λ_ε(f) = ∫_{R^k} f h_ε dx vanishes for all functions f ∈ C^m(R^k), which in turn yields that

    Λ(f) = lim_{ε→0} Λ_ε(f) = 0

for all f ∈ C^m(R^k), which was ruled out by assumption. We conclude that, under the conditions of Theorem 3, 𝔑_k(ψ) is uniformly m-dense on compacta in C^m(R^k), establishing the first half of Theorem 3.

The second half of Theorem 3 now follows easily. We have to show that for all f ∈ C^m(R^k) and ε > 0, there is a function g ∈ 𝔑_k(ψ) such that ‖f − g‖_{m,p,μ} < ε. Let X be a compact set containing the support of μ. We find that

    ‖f − g‖_{m,p,μ} ≤ γ ‖f − g‖_{m,u,X},

where γ^p = μ(R^k) #{α : |α| ≤ m}. Hence, if we take g ∈ 𝔑_k(ψ) such that ‖f − g‖_{m,u,X} < ε/γ, which is possible by the first half of Theorem 3 that we just established, we find that ‖f − g‖_{m,p,μ} < ε and the proof of Theorem 3 is complete.

Proof of Theorem 4: The proof of Theorem 4 parallels the one of Theorem 3. Let us write C^{m,u}(R^k) for the space of all functions f ∈ C^m(R^k) which, along with their derivatives up to order m, are bounded, that is,

    C^{m,u}(R^k) = { f ∈ C^m(R^k) : ‖D^α f‖_u < ∞, |α| ≤ m }.

It is easily seen that C^{m,u}(R^k) is a dense subset of C^{m,p}(μ). By assumption, ψ ∈ C^{m,u}(R), hence 𝔑_k(ψ) ⊂ C^{m,u}(R^k) ⊂ C^{m,p}(μ).

If 𝔑_k(ψ) is not dense in C^{m,p}(μ), the usual dual space argument yields the existence of a suitable collection of functions g_α ∈ L^q(μ), |α| ≤ m, where q is the conjugate exponent p/(p − 1), such that the functional

    Λ(f) = ∑_{|α| ≤ m} ∫_{R^k} D^α f g_α dμ

vanishes on 𝔑_k(ψ), but not identically on C^{m,u}(R^k). Now proceed as in the proof of Theorem 3 with the finite signed measures σ_α given by dσ_α = g_α dμ, with C^{m,u}(R^k) replacing C^m(R^k), and using part (b) of the lemma.

REFERENCES

Adams, R. A. (1975). Sobolev spaces. New York: Academic Press.

Carroll, S. M., & Dickinson, B. W. (1989). Construction of neural nets using the Radon transform. In Proceedings of the International Joint Conference on Neural Networks (pp. I:607-611). San Diego: SOS Printing.

Cybenko, G. (1989). Approximation by superposition of a sigmoidal function. Mathematics of Control, Signals and Systems, 2, 303-314.

Friedman, A. (1982). Foundations of modern analysis. New York: Dover Publications.

Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183-192.

Gallant, A. R., & White, H. (1988). There exists a neural network that does not make avoidable mistakes. In IEEE Second International Conference on Neural Networks (pp. I:657-664). San Diego: SOS Printing.

Gallant, A. R., & White, H. (1989). On learning the derivatives of an unknown mapping with multilayer feedforward networks. Preprint.

Hecht-Nielsen, R. (1989). Theory of the back propagation neural network. In Proceedings of the International Joint Conference on Neural Networks (pp. I:593-606). San Diego: SOS Printing.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359-366.

Hornik, K., Stinchcombe, M., & White, H. (1990). Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks.

Irie, B., & Miyake, S. (1988). Capabilities of three layer perceptrons. In IEEE Second International Conference on Neural Networks (pp. I:641-648). San Diego: SOS Printing.

Lapedes, A., & Farber, R. (1988). How neural networks work. Technical Report LA-UR-88-418. Los Alamos, NM: Los Alamos National Laboratory.

Maz'ja, V. G. (1985). Sobolev spaces. New York: Springer Verlag.

Rudin, W. (1967). Fourier analysis on groups. Interscience Tracts in Pure and Applied Mathematics, Vol. 12. New York: Interscience Publishers.

Stinchcombe, M., & White, H. (1989). Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. In Proceedings of the International Joint Conference on Neural Networks (pp. I:613-618). San Diego: SOS Printing.

Stinchcombe, M., & White, H. (1990). Approximating and learning unknown mappings using multilayer feedforward networks with bounded weights. Preprint. San Diego: Department of Economics, University of California.
