Approximation of Functions
Ravi Kothari, Ph.D.
ravi.kothari@ashoka.edu.in
"I think that it is a relatively good approximation to truth (which is much too complicated to allow anything but approximations) that mathematical ideas originate in empirics."
- John von Neumann
Approximation of Functions

Let y = f(x) be given on the interval [x_0, x_2]. Let x_1 be a point such that x_0 < x_1 < x_2.

y_0 = f(x_0),  y_1 = f(x_1),  y_2 = f(x_2)

Let us say we want to approximate f(x). A polynomial of second degree seems appropriate to approximate f(x), i.e.,

    P(x) = a_0 + a_1 x + a_2 x^2                                              (1)

We want to find the coefficients of P(x) such that P(x_0) = y_0, P(x_1) = y_1, and P(x_2) = y_2.

We can of course solve this exactly, since there are 3 equations and 3 unknowns.
Let us approach it differently and construct a polynomial Q_0(x) of second degree such that Q_0(x_0) = 1, Q_0(x_1) = 0, Q_0(x_2) = 0. Likewise Q_1(x_0) = 0, Q_1(x_1) = 1, Q_1(x_2) = 0, and Q_2(x_0) = 0, Q_2(x_1) = 0, and Q_2(x_2) = 1.

The desired polynomials are,

    Q_0(x) = \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)}
    Q_1(x) = \frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)}
    Q_2(x) = \frac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)}                (2)

The interpolating polynomial is then,

    P(x) = y_0 Q_0(x) + y_1 Q_1(x) + y_2 Q_2(x)                               (3)
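As a quick sketch (not part of the original slides), Equations (2) and (3) translate directly into code; the function f and the nodes x_0, x_1, x_2 below are arbitrary illustrative choices.

```python
# A minimal sketch of the quadratic interpolant of Equations (2)-(3).
# The choice of f and of the nodes x0, x1, x2 is purely illustrative.
import numpy as np

def f(x):
    return np.sin(x)                      # example function to interpolate

x0, x1, x2 = 0.0, 1.0, 2.0                # nodes with x0 < x1 < x2
y0, y1, y2 = f(x0), f(x1), f(x2)

def P(x):
    Q0 = (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))   # Q0(x0)=1, 0 at x1, x2
    Q1 = (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))   # Q1(x1)=1, 0 at x0, x2
    Q2 = (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1))   # Q2(x2)=1, 0 at x0, x1
    return y0 * Q0 + y1 * Q1 + y2 * Q2

# The interpolant matches f exactly at the three nodes ...
assert np.allclose([P(x0), P(x1), P(x2)], [y0, y1, y2])
# ... and only approximates it elsewhere
print(P(1.5), f(1.5))
```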
This interpolating polynomial of degree 2 is unique.

Proof:

▶ Let P_1(x) be another polynomial of degree at most 2 with P_1(x_0) = y_0, P_1(x_1) = y_1, and P_1(x_2) = y_2. Then P(x) = P_1(x) for x = x_0, x = x_1, x = x_2.
▶ So, P(x) - P_1(x) is a polynomial of degree at most 2 that vanishes at three distinct values of x.
▶ Since a non-zero polynomial of degree at most 2 has at most 2 roots, P(x) - P_1(x) must be identically 0, i.e., P(x) = P_1(x).

In general, of course, the polynomial P(x) differs from f(x) at values of x other than x_0, x_1, and x_2.
In the general case of n + 1 points x_0 < x_1 < ... < x_n,

    P_n(x) = \sum_{k=0}^{n} f(x_k) \frac{(x - x_0)(x - x_1) \cdots (x - x_{k-1})(x - x_{k+1}) \cdots (x - x_n)}{(x_k - x_0)(x_k - x_1) \cdots (x_k - x_{k-1})(x_k - x_{k+1}) \cdots (x_k - x_n)}        (4)

Weierstrass proved that polynomials can approximate arbitrarily well any continuous real function on an interval.

Equation (4) corresponds to an approximation of f(x) using a superposition of simpler functions.
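A short sketch of Equation (4) for an arbitrary number of nodes; the nodes and the target function below are illustrative only.

```python
# A sketch of the general interpolant of Equation (4) for nodes x_0, ..., x_n.
import numpy as np

def lagrange_interpolant(xs, ys):
    """Return P_n as a callable, given nodes xs and values ys = f(xs)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)

    def P(x):
        total = 0.0
        for k, (xk, yk) in enumerate(zip(xs, ys)):
            others = np.delete(xs, k)                     # all nodes except x_k
            term = np.prod((x - others) / (xk - others))  # k-th basis polynomial
            total += yk * term
        return total

    return P

xs = np.linspace(0.0, np.pi, 6)           # 6 uniformly spaced nodes (example)
P = lagrange_interpolant(xs, np.sin(xs))
print(P(1.0), np.sin(1.0))                # close, but not exact, between nodes
```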
Generalizing

Let f(x) be a real function of a real-valued vector x = [x_1 x_2 ... x_n]^T that is square integrable over the real numbers.

The goal of function approximation is to describe the behavior of f(x) in a compact region S of the input space using a superposition of simpler functions φ_i(x, w), i.e.,

    \hat{f}(\phi(x, w), W) = \sum_{i=1}^{\tilde{n}} W_i \, \phi_i(x, w)        (5)

where the W_i's are real-valued constants such that,

    |f(x) - \hat{f}(\phi(x, w), W)| < \epsilon                                 (6)

and ε can be arbitrarily small.

So we obtain the value of f(x) at x ∈ S based on a combination of simpler or elementary functions {φ_i(x, w)}.
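As an illustration of Equation (5), the sketch below superposes simple basis functions to approximate a target f on a compact region. Gaussian bumps are an arbitrary example choice for the φ_i, and least squares is just one way to pick the W_i; neither is prescribed by the slides.

```python
# A sketch of Equation (5): approximate f by a weighted superposition of
# simple basis functions.  Gaussian bumps are used purely as an example of
# the phi_i; any other basis could be substituted.
import numpy as np

def f(x):
    return np.sin(2 * x) + 0.5 * x         # illustrative target function

centers = np.linspace(0.0, 3.0, 8)         # parameters "w" of the phi_i
width = 0.5

def phi(x, c):
    return np.exp(-((x - c) ** 2) / (2 * width ** 2))

# Sample f on the compact region S = [0, 3]
xs = np.linspace(0.0, 3.0, 50)
Phi = np.stack([phi(xs, c) for c in centers], axis=1)   # 50 x 8 design matrix

# Choose the weights W_i by least squares so that f_hat is close to f on S
W, *_ = np.linalg.lstsq(Phi, f(xs), rcond=None)

def f_hat(x):
    return sum(Wi * phi(x, c) for Wi, c in zip(W, centers))

print(np.max(np.abs(f(xs) - f_hat(xs))))   # approximation error on the samples
```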
Generalizing

There are many possible choices for {φ_i(x, w)}. The polynomial we saw before is one possibility.

We would prefer one set of {φ_i(x, w)} over another if it provides a smaller error for a given number of inputs h, or if it is computationally more efficient.

As a side note, observe that if we make the number of inputs equal to the number of elementary functions {φ_i(x, w)}, i = 1, ..., ñ, then,

    \begin{bmatrix}
      \phi_1(x^{(1)}, w) & \phi_2(x^{(1)}, w) & \cdots & \phi_{\tilde{n}}(x^{(1)}, w) \\
      \phi_1(x^{(2)}, w) & \phi_2(x^{(2)}, w) & \cdots & \phi_{\tilde{n}}(x^{(2)}, w) \\
      \vdots             & \vdots             &        & \vdots                        \\
      \phi_1(x^{(\tilde{n})}, w) & \phi_2(x^{(\tilde{n})}, w) & \cdots & \phi_{\tilde{n}}(x^{(\tilde{n})}, w)
    \end{bmatrix}
    \begin{bmatrix} W_1 \\ W_2 \\ \vdots \\ W_{\tilde{n}} \end{bmatrix}
    =
    \begin{bmatrix} f(x^{(1)}) \\ f(x^{(2)}) \\ \vdots \\ f(x^{(\tilde{n})}) \end{bmatrix}        (7)

▶ ...and W = φ^{-1} f (assuming the inverse exists!)
▶ An important condition must therefore be placed on the elementary functions, i.e., that the inverse of φ exists.
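A sketch of Equation (7): with as many sample points as elementary functions, the weights follow from a square linear system (W = φ^{-1} f when φ is invertible). Monomial basis functions are used here purely as an example, and the system is solved rather than explicitly inverted.

```python
# A sketch of Equation (7) with n_tilde samples and n_tilde basis functions.
# Monomial basis functions phi_i(x) = x**(i-1) are an illustrative choice.
import numpy as np

def f(x):
    return np.cos(x)                        # illustrative target function

n_tilde = 5                                 # number of elementary functions
xs = np.linspace(0.0, 2.0, n_tilde)         # same number of input samples

# Phi[l, i] = phi_i(x^(l));  here phi_i(x) = x**i for i = 0, ..., n_tilde - 1
Phi = np.vander(xs, N=n_tilde, increasing=True)

# Solve Phi W = f rather than forming the inverse explicitly
W = np.linalg.solve(Phi, f(xs))

# The superposition reproduces f exactly at the sample points
f_hat = Phi @ W
print(np.max(np.abs(f_hat - f(xs))))        # ~ 0 (up to round-off)
```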
Geometric Interpretation

Equation (5) describes a projection of f(x) into a set of basis functions {φ_i(x, w)}. The basis functions define a manifold, and \hat{f}(φ(x, w), W) is the image or projection of f(x) in this manifold.
Choices for Elementary Functions

If the elementary functions are not chosen properly, then there will always be an error no matter how large ñ is.

One requirement that we have seen is that φ^{-1}(·) must exist. This condition is met if the elementary functions constitute a basis, i.e., are linearly independent.

Fourier series and wavelets are two widely used bases.

In neural networks, (i) the bases are dependent on the data (as opposed to being fixed), and (ii) the coefficients (weights) are adapted as opposed to analytically computed.
When f(x) is non-linear, there is no natural choice of basis. Volterra expansions, splines, etc. are some bases that have been tried. As we saw before, Weierstrass showed that polynomials are universal approximators.

The difficulty is that either too many terms are required or the approximation is not well behaved.
Multi-Layered Perceptrons (MLP)

Multi-layered perceptrons (MLPs) often use a basis of sigmoidal functions.

The hidden layer changes the basis, and consequently the manifold, and the output layer finds the best projection within the manifold.
Recall,

    h_j^{(l)} = \sigma\left(S_j^{(l)}\right) = \sigma\left(\sum_{k=0}^{n} w_{jk} \, x_k^{(l)}\right)        (8)

    \hat{y}_i^{(l)} = \sigma\left(S_i^{(l)}\right) = \sigma\left(\sum_{j=0}^{\tilde{n}} W_{ij} \, h_j^{(l)}\right)        (9)

where σ(a) = 1/(1 + e^{-a}) is the sigmoid function.

Training is done by optimizing,

    J = \sum_{l=1}^{N} J^{(l)} = \sum_{l=1}^{N} \sum_{i=1}^{m} \left( y_i^{(l)} - \hat{y}_i^{(l)} \right)^2        (10)
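A minimal sketch of the forward pass of Equations (8)-(9) and the cost of Equation (10), assuming a single hidden layer with sigmoid units and bias terms carried by x_0 = 1 and h_0 = 1; the layer sizes and data are illustrative, and the training step (gradient descent on J) is not shown.

```python
# Forward pass of Equations (8)-(9) and squared-error cost of Equation (10).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n, n_hidden, m = 3, 5, 2                    # inputs, hidden units, outputs
rng = np.random.default_rng(0)
w = rng.normal(size=(n_hidden, n + 1))      # hidden weights (incl. bias, k = 0)
W = rng.normal(size=(m, n_hidden + 1))      # output weights (incl. bias, j = 0)

def forward(x):
    x = np.concatenate(([1.0], x))          # x_0 = 1 plays the role of the bias
    h = sigmoid(w @ x)                      # Equation (8)
    h = np.concatenate(([1.0], h))          # h_0 = 1 for the output-layer bias
    return sigmoid(W @ h)                   # Equation (9)

# Squared-error cost of Equation (10) over N patterns
X = rng.normal(size=(4, n))                 # N = 4 input patterns
Y = rng.uniform(size=(4, m))                # corresponding targets
J = sum(np.sum((y - forward(x)) ** 2) for x, y in zip(X, Y))
print(J)
```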
Approximation Capabilities of MLPs

Let f(x) be the function to be approximated. Further, let us assume that l uniformly sampled points (x^{(1)}, x^{(2)}, ..., x^{(l)}) in (a, b) are known. Let y^{(i)} = f(x^{(i)}).

x^{(i+1)} - x^{(i)} = Δx = (b - a)/l, with x^{(1)} - Δx/2 = a and x^{(l)} + Δx/2 = b.
Define a function,

    \zeta(x) = \frac{1}{2}\,\mathrm{sgn}(x) + \frac{1}{2} =
      \begin{cases} 0 & x < 0 \\ \text{undefined} & x = 0 \\ 1 & x > 0 \end{cases}        (11)

Then,

    f(x) \approx \sum_{i=1}^{l} y^{(i)} \left[ \zeta\left(x - x^{(i)} + \frac{\Delta x}{2}\right) - \zeta\left(x - x^{(i)} - \frac{\Delta x}{2}\right) \right]        (12)

Now,

    \zeta\left(x - x^{(i)} + \frac{\Delta x}{2}\right) - \zeta\left(x - x^{(i)} - \frac{\Delta x}{2}\right)
      = \frac{1}{2}\,\mathrm{sgn}\left(x - x^{(i)} + \frac{\Delta x}{2}\right) - \frac{1}{2}\,\mathrm{sgn}\left(x - x^{(i)} - \frac{\Delta x}{2}\right)        (13)
Each term of the summation above can be produced by a pair of units, each computing a shifted sgn(·).

We can replace sgn(·) with a steep sigmoid.

Each vertical bar (bump) can then be produced by a pair of neurons, and we can approximate f(x) to any desired degree of accuracy (see the sketch after this list).

▶ Many authors (e.g., Cybenko, Hornik, and others) have formally shown that the set of functions realized by a neural network with a single hidden layer containing a finite number of neurons is dense in the space of continuous functions on a compact set.
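The sketch below illustrates the construction of Equations (12)-(13) with sgn(·) replaced by a steep sigmoid: each sample point contributes one approximately rectangular bump built from a pair of sigmoidal units. The target f, the interval (a, b), the number of samples l, and the steepness are illustrative choices, not values from the slides.

```python
# Approximating f by a sum of "bumps", each made from a pair of steep sigmoids.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def f(x):
    return np.sin(3 * x) * np.exp(-x)       # example function on (a, b)

a, b, l = 0.0, 2.0, 40
dx = (b - a) / l
xi = a + dx / 2 + dx * np.arange(l)          # l uniformly spaced sample points
yi = f(xi)

steepness = 200.0                            # large slope approximates sgn(.)

def f_hat(x):
    # pair of steep sigmoids per sample point = one approximately rectangular
    # bump of height y^(i) on (x^(i) - dx/2, x^(i) + dx/2)
    left = sigmoid(steepness * (x - (xi - dx / 2)))
    right = sigmoid(steepness * (x - (xi + dx / 2)))
    return np.sum(yi * (left - right))

xs = np.linspace(a, b, 400)
err = max(abs(f(x) - f_hat(x)) for x in xs)
print(err)                                   # shrinks as l grows
```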
Sigmoidal Neuron With 1 Input

[Figure: output of a single sigmoidal neuron with one input, σ(w_0 + w_1 x), for three parameter settings: w_0 = 0.5, w_1 = 0.1; w_0 = -5, w_1 = 0.8; w_0 = -1, w_1 = -0.1]
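A sketch that reproduces the curves in the figure above, assuming the neuron computes σ(w_0 + w_1 x); the plotting range is an arbitrary choice.

```python
# Single-input sigmoidal neuron, sigma(w0 + w1 * x), for the three legend settings.
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

params = [(0.5, 0.1), (-5.0, 0.8), (-1.0, -0.1)]   # (w0, w1) from the legend
xs = np.linspace(-20, 20, 400)

for w0, w1 in params:
    plt.plot(xs, sigmoid(w0 + w1 * xs), label=f"w0 = {w0}, w1 = {w1}")

plt.legend()
plt.show()
```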
Superposition of 4 Sigmoidal Neurons Each With 1 Input

[Figure: the outputs of four individual sigmoidal neurons (Neuron 1, Neuron 2, Neuron 3, ...) and the curve obtained by superposing them]
Classification

Though it is possible to set up a different cost function (e.g., cross entropy) or approach classification in a different way, thresholding the approximated output also results in classification.

For example,

[Figure: the superposition (output) of the network together with a threshold line]
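A minimal sketch of classification by thresholding: a real-valued approximator output is compared against a threshold to produce class labels. The stand-in output function and the threshold value are illustrative, not taken from the slides.

```python
# Classification by thresholding an approximated (real-valued) output.
import numpy as np

def f_hat(x):
    # stand-in for the trained network's output (any real-valued approximator)
    return 0.4 * np.sin(2 * np.pi * x)

threshold = 0.2

xs = np.linspace(0.0, 1.0, 11)
labels = (f_hat(xs) > threshold).astype(int)   # class 1 where output > threshold
print(list(zip(np.round(xs, 2), labels)))
```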