0% found this document useful (0 votes)
60 views

INM Script

The document introduces polynomial interpolation and spline interpolation as numerical methods. Polynomial interpolation finds a polynomial that passes through a set of provided data points. The polynomial can be computed using Lagrange polynomials. Spline interpolation is a generalization that finds a smooth piecewise polynomial curve to fit the data points. Linear and cubic spline interpolation are discussed as examples. Estimates of the interpolation error are also provided.

Uploaded by

madhuyaram
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
60 views

INM Script

The document introduces polynomial interpolation and spline interpolation as numerical methods. Polynomial interpolation finds a polynomial that passes through a set of provided data points. The polynomial can be computed using Lagrange polynomials. Spline interpolation is a generalization that finds a smooth piecewise polynomial curve to fit the data points. Linear and cubic spline interpolation are discussed as examples. Estimates of the interpolation error are also provided.

Uploaded by

madhuyaram
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 66

Introduction to Numerical Methods

Fleurianne Bertrand
2015

original version of this lecture notes: Dr. Steffen Mnzenmaier, Dr. Benjamin Mller

Contents
1 Interpolation
1.1 Polynomial Interpolation . . . . . .
1.2 Spline Interpolation . . . . . . . .
1.2.1 Linear Spline Interpolation
1.2.2 Cubic Spline Interpolation .

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

7
7
10
11
13

2 Numerical Integration
2.1 Newton-Cotes Formulas . . . . . . . . .
2.2 Gauss-Legendre Formulas . . . . . . . .
2.3 Composite Rules . . . . . . . . . . . . .
2.4 Multidimensional numerical integration .

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

21
21
25
26
27

3 Direct Solvers for Linear Equation Systems


3.1 The Gaussian Elimination and LU-Decomposition . . . . . . . . . . . . . . . . . .
3.2 Symmetric Positive Definite Matrices: The Cholesky Decomposition . . . . . . .
3.3 Partial Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

29
29
33
36

4 Iterative Methods for Linear Equations Systems


4.1 Norms and Condition Numbers for Matrix Vector Calculus .
4.2 Iterative Solvers . . . . . . . . . . . . . . . . . . . . . . . . .
4.3 Steepest descent and conjugate gradient method . . . . . .
4.3.1 The steepest descent method . . . . . . . . . . . . .
4.3.2 The conjugate gradient method . . . . . . . . . . . .

.
.
.
.
.

43
43
48
53
54
55

5 Iterative Methods for Nonlinear Equation Systems


5.1 Fixed point iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2 Newtons method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

59
60
62

.
.
.
.

.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

CONTENTS

Introduction
In many applications numerical methods are an essential tool to solve complicated problems. As
models tend to be more sophisticated there arise problems that can not be solved directly. And
even if direct methods are an alternative approach they typically fail to address problems that
arise in large-scale computations: Computational complexity and limitation of memory.
In this lecture we will have a close look at some very basic numerical methods for several
problems that arise in engineering applications. As a motivation we will consider the shallow
water equations that are used to model a free surface water flow under a hydrostatic pressure
assumption which is depicted in figure 1.
Let Rd1 be a domain with boundary = Du DH .
u
|u|u
+ (u) u + gH + gHb + cf
=f
t
H
H
+ (H) u + H( u) = 0
t

(1)

with the additional boundary/initial conditions


u = uD
H = HD

on Du
on DH

u(0, x) = uI (x) on

(2)

H(0, x) = HI (x) on
The problem is then to find a velocity u(t, x) and an elevation H(t, x) above sea ground that
satisfies equations (1) with the additional conditions (2). The additional parameters are the
gravity g, the bottom friction coefficient cf and sources/sinks of momentum f . Solving this
problem gives rise to several difficulties. First this problem can not be solved directly for general boundary conditions geometries. But as real-world applications tend to have complicated
geometries involved a numerical method is necessary. Therefore the need of a discretization of
the partial differential equation is needed. Numerical methods for the spatial discretization, as
"Finite Differences" or "Finite Elements", and the temporal discretization will be introduced in

Hb

Figure 1: Domain for the shallow water equations


5

CONTENTS

Advanced Numerical Methods. These Methods rely heavily on other numerical approaches
which are introduced in this lecture such as:
1. polynomial interpolation: The problem of finding preferably simple functions that interpolate general functions in a given set of points or fitting a set of given data. This
method is introduced in chapter 1. For the upper problem finding an appropriate basis,
which is piecewise polynomial, is a problem of polynomial interpolation.
2. As upper problem is non-linear and finite element methods / variational principles include
(potentially multi-dimensional) integration an efficient way to evaluate general integrals is
needed. These methods are summarized under the term numerical quadrature which is
introduced in chapter 2.
3. Many numerical methods result in the problem of solving a linear equation system of the
form Ax = b. Upper problem is no exception to that and results in several linear equations
systems which might be big but have very few entries that are not zero. The solution of
these problems can be very time and memory consuming. In chapter 3 we will get to know
direct and iterative solvers for linear equation systems.
4. If you solve these linear equations systems another question arises: How prone are your
problems to unavoidable errors in your given data? How do numerical errors propagate in
the used algorithm? These questions are questions of Condition and Stability and will
be approached in chapter 4.
5. Upper problem is obviously nonlinear which complicates the solution process considerably.
How nonlinearities can be adressed and the problem can be lead back to a sequence of
linear problems is the topic of chapter 5: "Nonlinear equation systems".
If we now apply the initial condition depicted in figure 2 and enforce some outflow through the
sea-ground on the right half we get the solution shown in figure 3.

H
Hb

Figure 2: Initial condition for example

Figure 3: Solution over 120s

Chapter 1

Interpolation
Problems of interpolation arise in many applications where one is looking for functions that
interpolate a given set of data. Examples are a given temperature for different times or the
spatial distribution of a pressure field where the data is only given at discrete points.

1.1

Polynomial Interpolation

The general problem in polynomial interpolation is to find a polynomial p(x) of degree n such
that for a set of data (xi , fi ) with i = 0, ...n it holds:
p(xi ) = fi
The space of polynomials of degree n is given by
Pn = {p | p(x) = cn xn + cn1 xn1 + + c1 x + c0 with ci R}
In this definition we do not assume ci 6= 0 and therefore every polynomial of degree n 1 is also a
polynomial of degree n. If one wants to represent/interpolate a function f (x) by an interpolating
polynomial of degree n the condition
p(xi ) = f (xi ) i = 0, ..., n
for a chosen set of interpolation points xi is used. To get the interpolation polynomial we make
use of the Lagrange-Polynomials. These are defined as the solution of the following problem:
Find Li (x) Pn such that
Li (xj ) = ij
with the Kronecker delta be defined by
(
1 i=j
ij =
0 i=
6 j
The solution of this problem is given by
n
Y

Li (x) =

j=0,j6=i

x xj
xi xj

Then the interpolation polynomial p(x) is given by


p(x) =

n
X

fi Li (x).

(1.1)

i=0

Therefore the problem of polynomial interpolation has a solution. The next theorem states that
this solution is unique.
7

CHAPTER 1. INTERPOLATION

Theorem 1.1 Let n + 1 different points xi and values fi (i = 0..., n) be given. Then there exists
exactly one interpolation polynomial p Pn such that p(xi ) = fi .
Proof: As mentioned before (1.1) implies existence of the interpolating polynomial of degree
n. For uniqueness we assume p1 and p2 to be two polynomials of degree n with p1 (xi ) = p2 (xi ).
Therefore the polynomial of degree n given by q(x) = p1 (x) p2 (x) has n + 1 different roots
in xi . But the only polynomial of degree n with n + 1 roots is q(x) = 0 and we therefore have
p1 (x) = p2 (x).

To estimate the error we make use of the following polynomial of degree n + 1:
n+1

n
Y
=
(x xi )
i=0

Theorem 1.2 Let f be (n+1)-times continuously differentiable function, f C n+1 ([a, b]). Then
there exists for every x [a, b] a (a, b), such that for the interpolating polynomial p(x) with
the nodes x0 , x1 , ..., xn it holds:
f (x) p(x) =

n+1 (x)f (n+1) ()


(n + 1)!

Proof: Let x
[a, b] with x
6= xi i = 0, ..., n. Set
F (x) = f (x) p(x)

f (
x) p(
x)
n+1 (x)
n+1 (
x)

Obviously F has at least n + 2 roots in [a, b]:


F (xi ) = f (xi ) p(xi )

f (
x) p(
x)
n+1 (xi ) = 0
n+1 (
x) | {z }
=0

and
F (
x) = f (
x) p(
x)

f (
x) p(
x)
n+1 (
x) = 0
n+1 (
x)

By the theorem of Rolle the derivative of F, F, has at least n + 1 roots in (a, b). By induction it
follows that F (n+1) has at least one root in (a,b). That means that there exists a (a, b) such
that
f (
x) p(
x) (n+1)
0 = F (n+1) () = f (n+1) () p(n+1) ()

()
| {z }
n+1 (
x) | n+1
{z }
=0

= f (
x) p(
x) =

f (n+1) ()
(n + 1)!

=(n+1)!

n+1 (
x)

which concludes the proof.



Remark: A direct consequence of theorem 1.2 is the estimate
|f (x) p(x)|

1
max |f (n+1) ()| |n+1 (x)|
(n + 1)! [a,b]

(1.2)

Using the Lagrange polynomials and equation (1.1) to evaluate an interpolating polynomial in
a point x is very expensive as a straightforward implementation would result in O(n2 ) operations

1.1. POLYNOMIAL INTERPOLATION

for every function evaluation. A more efficient evaluation can be achieved by using the barycentric
formula. First we see that for the Lagrange-polynomials it holds
n+1 (x)
Li (x) =
x xi

n
Y

n+1 (x)
1
=
wi
xi xj
x xi
j=0,j6=i
|
{z
}
:=wi

with the weights wi . Now we can rewrite equation (1.1) as


p(x) =

n
X

fi Li (x) =

i=0

n
X

fi

i=0

X
n+1 (x)
wi
wi = n+1 (x)
fi
x xi
x xi
i=0

Now it might be problematic to evaluatePthis formula very close to some node xi and therefore
wi
we can rewrite it, by using 1 = n+1 (x) ni=0 xx
:
i
Pn

i=0
p(x) = Pn

wi
fi xx
i

wi
i=0 xxi

After an initial computational cost of O(n2 ) operations for the calculation of the weights one
only needs O(n) operations for the evaluation of the polynomial at a point x.
Remark: We will see in exercise 5 that one can efficiently (O(n) operations) add new nodes
to the interpolating polynomial to improve accuracy.
Example: We want to interpolate the following function f : [1, 1] R with
(
0
if 1 x 0
f (x) =
1/x
e
if 0 < x 1
In figure 1.1 the solution is plotted as a continuous line and interpolating polynomials of
degree 5 as dotted line and degree 10 as dashed line. One can clearly see that the error does not
decrease uniformly if the polynomial degree increases.
0.45
exact solution
interpolating polynomial (n=5)
interpolating polynomial (n=10)
0.4

0.35

0.3

0.25

0.2

0.15

0.1

0.05

0.05
1

0.8

0.6

0.4

0.2

0.2

0.4

0.6

Figure 1.1: Example for polynomial interpolation

0.8

10

1.2

CHAPTER 1. INTERPOLATION

Spline Interpolation

The example of the last section showed that we can not expect our polynomial approximation
to converge uniformly if we use equidistant interpolation points. Runge found out about this
phenomenon in 1901. If we take a look at the so called Runge function
f (x) =

1
,
1 + x2

x [5, 5]

(1.3)

and the interpolating polynomials in figure 1.2 we clearly see that the behavior of the interpolating
polynomials of higher degree close to the edges of the Interval is quite bad. Spline interpolation
2.5
exact solution
interpolating polynomial (n=5)
interpolating polynomial (n=10)
interpolating polynomial (n=15)
2

1.5

0.5

0.5
5

Figure 1.2: Runges phenomenon


circumvents this behavior by using piecewise polynomials of lower degree and still achieving good
approximation properties.
Definition 1.3 For the n + 1 points a = x0 < x1 < < xn1 < xn = b a decomposition
T = {[x0 , x1 ], [x1 , x2 ], . . . , [xn1 , xn ]} of the interval [a, b] is given. A Spline of degree k with
respect to T is a k 1-times differentiable function s C k1 ([a, b]) which is a polynomial of
degree k in every interval [xi1 , xi ]. The space of Splines of degree k with respect to T is then
given by Sk,T .
Remark: Two special cases of splines are commonly used: The linear splines S1,T and the
cubic splines S3,T .
Upper definition means that we are able to represent a Spline s Sk,T by

s1 (x) := c1,0 + c1,1 x + + c1,k xk

s2 (x) := c2,0 + c2,1 x + + c2,k xk


s(x) = .
..

sn (x) := cn,0 + cn,1 x + + cn,k xk

x [x0 , x1 )
x [x1 , x2 )
..
.
x [xn1 , xn ]

(1.4)

1.2. SPLINE INTERPOLATION

11

We are not completely free in the choice of the ci,j as we still have to satisfy the continuity and
differentiability conditions. By using representation (1.4) we can formulate them as
s1 (x1 ) = s2 (x1 ), s2 (x2 ) = s3 (x2 ), . . . , sn1 (xn1 ) = sn (xn1 )
s01 (x1 ) = s02 (x1 ), s02 (x2 ) = s03 (x2 ), . . . , s0n1 (xn1 ) = s0n (xn1 )
..
..
.
.
(k1)

s1

(k1)

(x1 ) = s2

(k1)

(x1 ), s2

(k1)

(x2 ) = s3

(1.5)

(k1)

(x2 ), . . . , sn1 (xn1 ) = s(k1)


(xn1 )
n

If a function s(x) can be written as in (1.4) and satisfies (1.5) it follows that s(x) Sk,T .
Lemma 1.4 The dimension of the Splinespace Sk,T is n + k.
Proof: In [x0 , x1 ] we assume an arbitrary polynomial and therefore we have k + 1 degrees of
freedom (see (1.4): c1,0 , . . . , c1,k ). In [x1 , x2 ] we get additional k +1 degrees of freedom, but k are
already predetermined due to (1.5). Therefore we only get 1 additional degree of freedom. This
holds for every interval [xi1 , xi ] with i = 2, . . . , n. Therefore we get dim Sk,T = k + 1 + (n 1) =
n + k.

For the next two subsections we will take a closer look at the commonly used linear and cubic
splines.

1.2.1

Linear Spline Interpolation

The space S1,T consists of piecewise linear and globally continuous functions. An example is
given in figure 1.3.

s 1 `
t ` s2

t 

```t

a = x0

x1

s3

x2

tXs4
X`
t ` s5
```t

x3 x4

x5 = b

Figure 1.3: function in S1,T


As in section 1.1 we have the points
a = x0 < x1 < < xn = b
and a function f (x) (or some data fi ) be given. Our interpolation problem can then be formulated
as: Find a function s S1,T such that s(xi ) = f (xi ) (or s(xi ) = fi ). If we recall the Runge
function (1.3) the problems that arose in using more interpolation points can then be easily
circumvented, as we can see in figure 1.4.
In the case of linear splines we can easily calculate the coefficients for the si (x) in representation (1.4):
f (xi1 )(xi x) + f (xi )(x xi1 )
(1.6)
xi xi1
Remark:
We can define a basis of S1,T in the same way as we have done it for the
polynomial interpolation using Lagrange-polynomials: Let Ci S1,T being the solution of the
interpolation problem: Find Ci S1,T such that
si (x) =

Ci (xj ) = ij

12

CHAPTER 1. INTERPOLATION

2
exact solution
interpolating polynomial (n=10)
interpolating linear spline (n=10)

1.5

0.5

0.5
5

Figure 1.4: Runges phenomenon


These n + 1 splines are obviously linearly independent and therefore form a basis with lemma
1.4. Furthermore we can easily solve upper interpolation problem by using (1.6):

Ci (x) =

xxi1

xi xi1

x [xi1 , xi ]

else

xi+1 x
xi+1 xi

x [xi , xi+1 ]

Analogously to the interpolation polynomial we can define the solution of our general interpolation problem as
n
X
s(x) =
f (xi )Ci (x)
i=0

For the error we get the following consequence of theorem 1.2:


Theorem 1.5 Let f be a 2 times continuously differentiable function, f C 2 ([a, b]). Then for
the interpolating linear spline s S1,T with respect to the interpolation points
a = x0 < x1 < < xn1 < xn = b
it holds
max |f (x) s(x)|
x[xi1 ,xi ]

(xi xi1 )2
max |f 00 (x)|
8
x[xi1 ,xi ]

Proof: See Exercise 3.


Remark: By setting
h = max (xi xi1 )
1in

we get immediatly
max |f (x) s(x)|
x[a,b]

h2
max |f 00 (x)|
8 x[a,b]

1.2. SPLINE INTERPOLATION

1.2.2

13

Cubic Spline Interpolation

As we have seen before as linear spline interpolation circumvents the problem of oscillations the
approximation properties in some regions the approximation properties afar from the edges of
interval tend to be worse than for standard interpolation polynomials. Therefore we will use
piecewise higher order to retain these approximation properties. In figure 1.5 the comparison of
different interpolation techniques shows clearly the advantages of using higher order polynomials
for approximation such as cubic spline interpolants. One has to mention that the space S3,T
is of slightly higher dimension for this decomposition (dim(S3,T ) = 13)) than the other spaces
(dim(P10 ) = dim(S1,T ) = 11) (see Lemma 1.4). Still the advantages clearly outweigh the little
additional cost.

2
interpolaion points
exact solution
interpolating polynomial (n=10)
linear spline
complete cubic spline (n=10)

1.5

interpolaion points
exact solution
interpolating polynomial (n=10)
linear spline
complete cubic spline (n=10)

0.95

0.9

0.85
0.5

0.8

0
0.75

0.5
5

(a) Runge function

0.2

0.2

0.4

0.6

0.8

(b) Runge function (zoomed)

Figure 1.5: Comparison of interpolation strategies with n = 10

Lemma 1.4 also implies that the n+1 interpolation conditions cant be enough to get a unique
solution of the interpolation problem in S3,T : Find a function s S3,T such that s(xi ) = f (xi ).
We need (at least) 2 additional conditions to guarantee uniqueness of this problem. Common
conditions (other conditions might be used as well) are the following:
(i) s0 (a) = f 0 (a) and s0 (b) = f 0 (b) (complete Spline)
(ii) s00 (a) = s00 (b) = 0 (natural Spline)
(iii) s0 (a) = s0 (b) and s00 (a) = s00 (b) (if f (a) = f (b)) (periodic spline)
Conditions (i) - (iii) lead always to a unique interpolating spline. We will now use (1.4) and
(1.5) to derive a linear equation system which has to be solved. As a function in s S3,T is 2
times differentiable we are allowed to evaluate the moments mi := s00 (xi ) for i = 0, ..., n and use
them to represent our cubic spline. Therefore we first calculate the linear spline s00 S1,T which
is due to (1.6) given by
s00 (x) =

s00 (xi1 )(xi x) + s00 (xi )(x xi1 )


mi1 (xi x) + mi (x xi1 )
=
xi xi1
hi

(1.7)

in [xi1 , xi ] with hi = xi xi1 . Here (and for the following parts) we use that the cubic spline is
2 times continuously differentiable. We now integrate this function from xi1 to x and therefore

14

CHAPTER 1. INTERPOLATION

get
0

s (x) = s (xi1 ) +

s00 ()d

x1
x

mi1 (xi ) + mi ( xi1 )


d
hi
x1
hi
(xi x)2
(x xi1 )2
+ mi1 + mi
= s0 (xi1 ) mi1
2hi
2
2hi
= s0 (xi1 ) +

(1.8)

Integrating (1.8) another time from xi1 to x we get


Z x
s(x) = s(xi1 ) +
s0 ()d
x1
Z x
(xi )2
hi
( xi1 )2
s0 (xi1 ) mi1
= s(xi1 ) +
+ mi1 + mi
d
2hi
2
hi
x1
hi
h2
(xi x)3
(x xi1 )3
= s(xi1 ) + (s0 (xi1 ) + mi1 )(x xi1 ) + mi1
mi1 i + mi
2
6hi
6
6hi
(1.9)
Now inserting xi in (1.9) we get
h2
h3
hi
)hi mi1 i + mi i
2
6
6hi
2
2
h
h
= s(xi1 ) + s0 (xi1 )hi + mi1 i + mi i
3
6
Relocating the terms in equation (1.10) we get
s(xi ) = s(xi1 ) + (s0 (xi1 ) + mi1

s0 (xi1 ) =

s(xi ) s(xi1 )
hi
hi
mi1 mi
hi
3
6

(1.10)

(1.11)

and inserting xi in (1.8) one obtains


hi
hi
+ mi
(1.12)
2
2
Now inserting (1.11) for the intervals [xi1 , xi ] and [xi , xi+1 ] in (1.12) using the continuity of
s0 (x) we get for every interior point i = 1, ..., n 1
s0 (xi ) = s0 (xi1 ) + mi1

s(xi+1 ) s(xi )
hi+1
hi+1
s(xi ) s(xi1 )
hi
hi
hi
hi
mi
mi+1
=
mi1 mi + mi1 + mi
hi+1
3
6
hi
3
6
2
2
(1.13)
Repositioning the terms now leads to
mi1

hi
hi + hi+1
hi+1
s(xi+1 ) s(xi ) s(xi ) s(xi1 )
+ mi
+ mi+1
=

=: bi
6
3
6
hi+1
hi

(1.14)

for every i = 1, ..., n 1.


By using the interpolation conditions s(xi ) = f (xi ) the resulting linear equation system
Ax = b with A Rn1n+1 , without additional conditions (i)-(iii), is given by

h1 h1 +h2

h2
0
0
0
...
0
m0
b1
6
3
6
h2
h2 +h3
h3

0
0
0
...
0
6
3
6

m1 b2

..

.
.
.
.
.
.
..
..
..
.. .. ..


(1.15)
.. .. = ..
..
..
..
..
.
. .
.
.
.
.

hn2 +hn1
hn1
0
mn1 bn2
0
...
...
0 hn2
6
3
6
hn1
hn1 +hn
hn
bn1
mn
0
...
...
...
0
6

Now we have to add the additional conditions:

1.2. SPLINE INTERPOLATION

15

(i) By using (1.11) for i = 1 we have


h1
h1
f (x1 ) f (x0 )
f 0 (x0 ) =: ca
+ m1
=
3
6
h1

m0

(1.16)

and by using (1.12) with (1.11) we get


mn1

hn
hn
f (xn ) f (xn1 )
=: cb
+ mn
= f 0 (xn )
6
3
hn

(1.17)

(ii) The conditions for the natural spline are given by


m0 = s00 (x0 ) = 0 ,

mn = s00 (xn ) = 0

(1.18)

(iii) For the periodic spline we have by using (1.11) and (1.12):
m0

h1
h1
hn
hn
s(x1 ) s(x0 ) s(xn ) s(xn1 )

=: p0
+ m1 + mn1
+ mn
=
3
6
6
3
hi
hn

(1.19)

These additional equations lead to the following linear equation systems


(i) For the complete cubic
h
h1
1
0
3
6
h1 h1 +h2
h2
6
3
6

h2 +h3
h2
0
6
3
.
..
..
.

.
..

...
...
0

0
...
...
0
...

spline:

0
h3
6

..

..

.
0
...
...

...
0
0
..
.
..
.

...
0
0

hn2
6

hn2 +hn1
3
hn1
6

..

...
...

hn1
6
hn1 +hn
3
hn
6

0
0
0
..
.
..
.
0
hn
6
hn
3

m0
m1
..
.
..
.
..
.

ca
b1
b2
..
.
..
.

bn2

mn1
bn1

mn
c
b

(1.20)
(ii) For the natural cubic spline we can immediately eliminate m0 and mn as they are 0
h1 +h2
3
h2
6

..
.

0
0

h2
6
h2 +h3
3

0
h3
6

..
..

...
...

.
0
...

0
0
..
.
..
.

...
...
..
.
..
.

hn2
6

hn2 +hn1
3
hn1
6

(iii) For the periodic spline we get


h
h1
1
0
...
3
6
h1 h1 +h2
h2
0
6
3
6

h2
h2 +h3
h3
0
6
3
6
.
..
..
..
.
.

.
.
..
..

...
...
0
0

0
...
...
...
1
...
...

m1
b1
m2 b2

.. ..
. .

.. = ..

0
. .
hn1
mn2 bn2
6
hn1 +hn
mn1
bn1
0
0
..
.

(1.21)

the following linear equation system

hn
hn

p
...
0
0
6
3
m0

0
0
...
0
b
m1
1


0
0
...
0 .
b2

.. .
..
..

..

.
.
..

.
..
..
..
..

.
.
. . .

.
hn2 +hn1
hn1
hn2
bn2
0 .

6
3
6
mn1
hn1
hn1 +hn
hn
bn1

0
6
3
6
mn
0
...
...
0
1

16

CHAPTER 1. INTERPOLATION
which can be rewritten as

h1 +hn
h1
0
6
h31
h1 +h2
h2
6
3
6

h2
h2 +h3
0
6
3
.
..
..
.

.
..

...
...
0
hn
...
...
6

...
0

...
0
0
..
.
..
.

0
...
...

hn2
6

hn2 +hn1
3
hn1
6

h3
6

..

..

.
0
...

..

hn
6

0
0
..
.

m0
m1
..
.
..
.
..
.

p0
b1
b2
..
.
..
.


(1.22)


hn1
bn2

6
mn1
hn1 +hn
b
n1

Example: (i) Let the following decomposition be given


1 1 1 1 3 3
T = {[0, ], [ , ], [ , ], [ , 1]}
4 4 2 2 4 4
and let the interpolation conditions be
s0 (0) = 0,

s(0) = 0,

s(1/4) = 0,

s(1/2) = 1,

s(3/4) = 0,

s(1) = 0,

s0 (1) = 0

With using (1.20) the linear equation system for the moments of the complete interpolating cubic
spline is (with hi = 1/4)

1/12 1/24
m0
0
0
0
0
m1 4
1/24 1/6 1/24 0
0

0 1/24 1/6 1/24 0 m2 = 8


0
0 1/24 1/6 1/24 m3 4
0
0
0 1/24 1/12
0
m4
Solving this equation system leads to

m0
24
m1 48

m=
m2 = 72
m3 48
m4
24
Therefore we can evaluate the spline by using (1.9) and (1.11):
(
0 + (0 + 2 2 3)(x 0) 16(1/4 x)3 + 1/4 + 32(x 0)3
s(x) =
0 + (4 4 + 3 + 6)(x 1/4) + 32(1/2 x)3 1/2 48(x 1/4)3

if x [0, 1/4]
if x [1/4, 1/2]

due to symmetry conditions (s(x) = s(1 x)) we have


(
0 + 9(1 x 1/4) + 32(1/2 (1 x))3 1/2 48(1 x 1/4)3
s(x) =
0 3(1 x 0) 16(1/4 (1 x))3 + 1/4 + 32(1 x 0)3

if x [1/2, 3/4]
if x [3/4, 1]

(ii) For the same decomposition and interpolation conditions we want to calculate the natural
cubic spline. Therefore we can use the linear equation system given in (1.21)


1/6
1/24
0
m1
4
1/24 1/6 1/24 m2 = 8
0 1/24 1/6
m3
4
which leads to the solution

288/7
m1
m = m2 = 480/7
288/7
m3

1.2. SPLINE INTERPOLATION

17

The additional conditions for a natural spline just mean m0 = 0 and m4 = 0 and by using (1.11)
we get analogously to (i)
s0 (0) = 12/7
s0 (1/4) = 24/7
s0 (1/2) = 0
s0 (3/4) = s0 (1/4) = 24/7.
And then by using (1.9):
(
3
0 + (12/7)x + 192
7 (x 0)
s(x) =
0 + (24/7 + 36/7)(x 1/4) +

192
7 (1/2

x)3

3/7 320/7(x

1/4)3

Due to symmetry conditions (s(x) = s(1 x)) we again have


(
60/7(1 x 1/4) + 192 (x 1/2)3 3/7 320/7(3/4 x)3
7
s(x) =
192
12
( /7)(1 x) + 7 (1 x)3

if x [0, 1/4]
if x [1/4, 1/2].

if x [1/2, 3/4]
if x [3/4, 1].

(iii) For the same decomposition and interpolation conditions we want to calculate the periodic cubic spline. Therefore we can use (1.22)

1/6

1/24

1/24

0
1/24

1/6
1/24

0
1/24
1/6
1/24


m0
0
m1 4
0
=
8
1/24 m2
1/6
4
m3
1/24

Solving this equation system leads to


m0
24
m1 48

m=
m2 = 72
m3
48
and therefore with m4 = m0 we see that the periodic and the complete spline for upper conditions
are the same.
The question arises if the conditions (i) - (iii) are enough to guarantee a unique solution. To
prove this we need the following properties for matrices in Rnn :
Definition 1.6 Let A Rnn . If
aii >

n
X

|aij |

j=1,j6=i

for i = 1, ..., n we call a matrix strictly diagonal dominant.


and
Definition 1.7 A symmetric matrix A Rnn is called positive definite if xT Ax > 0 for all
0 6= x Rn .
The following lemma then holds
Lemma 1.8 A symmetric matrix A Rnn is positive definite if and only if all eigenvalues are
> 0.

18

CHAPTER 1. INTERPOLATION

Proof: As A is symmetric one can find an orthonormal basis of eigenvectors of Rn (viT vj = ij ).


Let xT Ax > 0 for all 0 6= x Rn hold. Then for the eigenvector vi to the eigenvalue i it holds:
i =

viT Avi
>0
viT vi

For the other direction (let all i > 0) we write x as


x=

n
X

ci vi

i=1

and then use

n
n
n
X
X
X
xT Ax = (
ci vi )T A(
ci vi ) =
c2i i kvi k2 > 0
i=1

i=1

i=1

which finishes the proof.



Now we first have to prove the following lemma.
Lemma 1.9 Let A Rnn be symmetric. If A is strictly diagonal dominant then A is symmetric
positive definite.
Proof: With lemma 1.8 we only have to show that all eigenvalues are > 0. Let be an
eigenvalue of A and let v be a correspondent eigenvector. Let vi be the entry with the greatest
absolute value. Therefore by taking the i-th row of the product Av we get
(Av)i =

n
X

n
X

aij vj = aii vi +

j=1

aij vj

j=1,j6=i

At the same time, as v is the eigenvector for we have: (Av)i = vi and therefore:
n
X

vi = (Av)i = aii vi +

aij vj

j=1,j6=i
n
X

= aii +

aij

j=1,j6=i

vj
vi

By using the strict diagonal dominance we get


n
X

= aii +

aij

j=1,j6=i
n
X

>

j=1,j6=i
n
X

|aij | +
|aij |

j=1,j6=i

vj
vi
n
X

j=1,j6=i
n
X

aij

vj
vi

vj
|aij | | |
vi
j=1,j6=i
|{z}
1

n
X

j=1,j6=i

|aij |

n
X

|aij |

j=1,j6=i

=0
and therefore we have > 0 for all eigenvalues which proves this lemma.

From this lemma the following theorem follows immediately.

1.2. SPLINE INTERPOLATION

19

Theorem 1.10 For every condition (i)-(iii) there exists exactly one interpolating cubic spline
satisfying this additional condition .
Proof: The matrices arising in the linear equation systems in (1.20), (1.21) and (1.22) are
symmetric and strictly diagonal dominant and therefore with lemma 1.9 positive definite. As a
positive definite matrix is regular the resulting linear equation systems are uniquely solvable.

2
Without proof the following error estimate for the error in the L -norm
Z
kf k2,[a,b] :=

b
2

1/2

(f (x)) dx
a

is stated.
Theorem 1.11 Let f C (4) ([a, b]) be given. For the complete interpolating spline the following
estimates hold
kf 00 s00 k2 h2 kf (4) k2
kf 0 s0 k2 h3 kf (4) k2
kf sk2 h4 kf (4) k2
with h := max hi .
i=1,...,n

20

CHAPTER 1. INTERPOLATION

Chapter 2

Numerical Integration
If one wants to compute the integrals of arbitrary functions there might arise two problems:
First one might not be able to compute the antiderivative of a general function and second if
you are able to compute the antiderivative it might be costly to evaluate it at arbitrary points.
Therefore it is necessary/beneficiary to compute the integral of the form
Z

I(f ) :=

f (x)dx

(2.1)

by using only the known values of f (x) at some quadrature points xi . Let the quadrature points
x0 , x1 , . . . , xn in [a, b] be given. We now want to approximate the integral I(f ) by a so called
quadrature rule Q(f )
Z

f (x)dx Q(f ) = (b a)

I(f ) =
a

n
X

i f (xi )

(2.2)

i=0

where the weights i need to be determined.

2.1

Newton-Cotes Formulas

The easiest approach for upper problem is the usage of the interpolating polynomial:
Z

f (x)dx

I(f ) =
a

Z bX
n
a

n
X

1
f (xi )Li (x)dx = (b a)
f (xi )
ba
i=0
i=0
|

Li (x)dx =: Q(f ) (2.3)


{z
}

=: i

which yields an approximation of the integral I(f ).


Remark: As for every polynomial of degree n the interpolating polynomial is the polynomial itself (see theorem 1.1) this quadrature rule is exact for every polynomial of degree n.
Furthermore the problem:
Find the weights i such that I(p) = Q(p) for every p Pn ([a, b]

(2.4)

is a well posed problem. Upper construction ensures existence of the weights and the uniqueness
follows directly by
Z b
n
X
Lj (x)dx = (b a)
i Lj (xi ) = (b a)j
a

i=0

as Lj Pn . So both approaches lead to the same weights i .


If we take the equidistant quadrature points xi = a + (b a) ni for i = 0, ..., n we will get the
closed Newton-Cotes formulas.
21

22

CHAPTER 2. NUMERICAL INTEGRATION


Examples: For n = 1 we get for the weights 0 and 1 by using (2.3).
Z b
1
0 =
ba a
Z b
1
1 =
ba a

Z
x b x=a+(ba)s 1 s 1
1
dx
=
ds =
ab
2
0 01
Z 1
x a x=a+(ba)s
s0
1
dx
=
ds =
ba
2
0 10

(2.5)

The resulting quadrature rule


1
1
Q(f ) = (b a)( f (a) + f (b))
2
2

(2.6)

is called trapezoidal rule.


Remark: In this example one can see that the weights are independent of a and b. In general
we have the following
1
i =
ba

Li (x)dx

x=a+(ba)s

Z
Li (a + (b a)s)ds =

i (s)ds
L

(2.7)

i (s) being the i-th Lagrangian polynomial on [0, 1] with respect to the interpolation points
with L
j
lj = n . Therefore the weights j are independent of the length (b a) of the integration interval
and the position. As a consequence one can compute the weights by restricting the computations
to the interval [0, 1].
For n = 2 we compute the weights by using (2.7)
1

(s 1/2)(s 1)
1
ds =
(0 1/2)(0 1)
6
(s 0)(s 1)
2
ds =
(1/2 0)(1/2 1)
3
(s 0)(s 1/2)
1
ds =
1
(1 0)(1 /2)
6

(2.8)

ba
a+b
(f (a) + 4f (
) + f (b))
6
2

(2.9)

Z
0 =
0

Z
1 =
0

Z
2 =
0

which leads to the quadrature rule


Q(f ) =

which is called Simpsons rule.


A different, but more straightforward and general approach, for problem (2.4) is the following
(by using (2.7) we restrict ourselves to the interval [0, 1]): Let a basis of Pn ([0, 1]) such as
{B0 , . . . , Bn } be given. So due to the linearity of the integral/sum we get I(p) = Q(p) for every
p Pn ([0, 1]) by enforcing
I(Bi ) = Q(Bi )
(2.10)
as
Z

Z
p(x)dx =

I(p) =
0

n
1X

ci Bi (x)dx =

i=0

n
X
i=0

n
X
i=0

Z
ci
0

Bi (x)dx
{z
}
exact

ci Q(Bi ) = Q(

n
X

ci Bi ) = Q(p)

i=0

So the only equation we have to ensure to hold is given by (2.10). Furthermore we can easily
find a basis of Pn by using the monomials {1, x, x2 , . . . , xn }.

2.1. NEWTON-COTES FORMULAS

23

So for n = 3 a basis would be given by {1, x, x2 , x3 } and the conditions (2.10) for (2.2) lead
to (with (x0 , x1 , x2 , x3 ) = (0, 1/3, 2/3, 1)):
Z

1 dx = 1 = 0 + 1 + 2 + 3
0
Z 1

x dx = 1/2 = 1/31 + 2/32 + 3


Z

0
1

x2 dx = 1/3 = 1/91 + 4/92 + 3

x3 dx = 1/4 = 1/271 + 8/272 + 3

which leads to the linear equation system:

1
0

0
0

1
1/3
1/9
1/27

1
2/3
4/9
8/27


1
0
1
1 1/2
1
=
1 2 1/3
1/4
1
3

Solving this system gives us the weights



1/8
0
1 3/8
=
2 3/8
1/8
3
which leads to the quadrature rule
Q(f ) = Q(f ) =

ba
2(b a)
ba
(f (a) + 3f (a +
) + 3f (a +
) + f (b))
8
3
3

which is called Simpsons 3/8-rule.


Remark: One can even use this approach for more general quadrature rules. Let the
following quadrature rule be given:
Q(f ) = (b a)(0 f 00 (a) + 1 f (a) + 2 f (b) + 3 f 00 (b))
First it has to be noted that although, for x = a + (b a)s, it holds
Z

Z
f (x)dx = (b a)

Z
f (a + (b a)s)ds = (b a)

f(s)ds

for the n-th derivative we have to be a bit more careful. Due to


dx
= (b a)
ds
it holds
dn
1
dn
1
dn
f
(x)
=
f
(a
+
(b

a)s)
=
f (s)
dxn
(b a)n dsn
(b a)n dsn
R1
i be the weights such that Q(
So let
p) = 0 p(s)ds for all p P3 ([0, 1]) and therefore we can

24

CHAPTER 2. NUMERICAL INTEGRATION

write Q(p) as
Z

p(x)dx
a
1

Z
= (b a)

p(s)ds
0

0 p00 (0) +
1 p(0) +
2 p(1) +
3 p00 (1))
= (b a)(
0 p00 (a) +
1 p(a) +
2 p(b) + (b a)2
3 p00 (b)
= (b a)((b a)2
= (b a)(0 p00 (a) + 1 p(a) + 2 p(b) + 3 p00 (b))
= Q(p)
0 , 1 :=
1 , 2 :=
2 and 3 := (b a)2
3 . So the weights are independent
with 0 := (b a)2
of the interval length.
Then problem (2.4) for all p P3 ([a, b]) can be solved on interval [0, 1] and the weights 0
and 3 need to be transformed afterwards.
Z

0
Z 1

0
1

0 + 1
1 + 1
2 + 0
3 =
1 +
2
1 dx = 1 = 0
0 + 0
1 + 1
2 + 0
3 =
2
x dx = 1/2 = 0

0 + 0
1 + 1
2 + 2
3
x2 dx = 1/3 = 2

0 + 0
1 + 1
2 + 6
3
x3 dx = 1/4 = 0

The solution of this equation system is given by


0
1/24
1

1 = /2

2 1/2
3
1/24

and the quadrature rule is then


Q(f ) = (b a)((ba)2/24f 00 (a) + 1/2f (a) + 1/2f (b) (ba)2/24f 00 (b)).

(2.11)

For the error estimate we can we get by using the norm


kf k,[a,b] := max f ()
[a,b]

Theorem 2.1 Let f be (n+1)-times continuously differentiable function, f C n+1 ([a, b]). Then
by using a Newton-Cotes quadrature rule with n + 1 quadrature points we get the following error
estimate
(n+1)
k,[a,b] ba n+2 R n Qn
kf
( n )
| 0 j=0 (t j)dt|
if n is uneven
(n+1)!
|I(f ) Q(f )| kf (n+2) k
R
Q
n
,[a,b] ba n+3

(
)
|
t n (t j)dt| if n is even
(n+2)!

j=0

2.2. GAUSS-LEGENDRE FORMULAS

25

Remark: For small n we get the following numbers:


n=1: |I(f ) Q(f )|
n=2: |I(f ) Q(f )|
n=3: |I(f ) Q(f )|
n=4: |I(f ) Q(f )|
n=5: |I(f ) Q(f )|

(b a)3 (2)
kf k,[a,b]
12
(b a)5 (4)
kf k,[a,b]
2880
(b a)5 (4)
kf k,[a,b]
6480
(b a)7 (6)
kf k,[a,b]
1935360
11(b a)7 (6)
kf k,[a,b]
37800000

Here you can see that the largest jumps in accuracy occur if you use even n. So in general people
try to use even n for general functions as the gain/cost ratio is substantially higher.

2.2

Gauss-Legendre Formulas

The former idea involved the equidistant distribution of quadrature points. If we drop this
assumption and let the quadrature points by itself be free to choose and then consider problem
(2.4) for higher order polynomials we get the Gauss-Legendre formulas. Typically these are
derived on the interval [1, 1] and afterwards transformed to the general interval [a, b] by the
a+b
transformation x = ba
2 t + 2 . Without proofs the quadrature points/weights will be given on
the general interval [a, b]. It is beneficial in this case to define the midpoint m = a+b
2 for the
interval [a, b]. The quadrature rule, for general n, is then given as in (2.2).
Example: Consider the case n = 1. Find a quadrature rule
Q(f ) = (b a)(0 f (x0 ) + 1 f (x1 ))
such that Q(p) = I(p) for all p P3 ([a, b]. Here we have 4 degrees of freedom, namely x0 , x1
and 0 , 1 . The solution of this problem is given by
ba 1

2
3
1
1 =
2

x0 = m
0 =

1
2

x1 = m +

ba 1

2
3

In general we get the following quadrature points weights for different n with h := (b a).
For n = 0 the rule is generally called midpoint-rule: A Gauss-Legendre Quadrature rule should

n
0
1
2

quadrature points
x0
h 1
2 3

error

weights

=m

0 = 1
h 1
2 3

x0 = m
, x1 = m +
q
q
h
3
x0 = m 2 5 , x1 = m , x2 = m + h2 35

0 =
0 =

5
18

1
2

, 1 =

, 1 =

8
18

1
2

, 2 =

5
18

(ba)3
(2)
k,[a,b]
24 kf
(5)
(ba)
(4)
k,[a,b]
540 kf
(ba)7
(6)
k,[a,b]
2016000 kf

Table 2.1: Gauss-Legendre Quadrature


always be the preferred rule one applies. The error is substantially smaller for the same number
of quadrature points and these rules are the commonly used rules.

26

2.3

CHAPTER 2. NUMERICAL INTEGRATION

Composite Rules

All quadrature rules we have seen before do not overcome the problem of needing the higher
derivatives of f (x) to be not "too bad". Furthermore the Newton-Cotes formulas will have
negative weights for larger n, rendering them to be a bad choice for the approximation of I(f ) for
a general f (x). We have seen in chapter 1 that a piecewise interpolation overcomes this problem.
We will use the same idea here and apply the chosen quadrature rule to every subinterval. Let
the decomposition of L subintervals of the interval [a, b]
{[t0 , t1 ], [t1 , t2 ], . . . , [tL1 , tL ]}
be given. This means that ({[t0 , t1 ], [t1 , t2 ], . . . , [tL1 , tL ]}) = [a, b] and (tj1 , tj ) (tk1 , tk ) =
if j 6= k. We then can write I(f ) by
Z b
L Z ti
X
I(f ) =
f (x)dx =
f (x)dx
a

i=1

ti1

and can apply the quadrature rules to every subinterval:


Z b
L Z ti
X
I(f ) =
f (x)dx =
f (x)dx
a

ti1

i=1

t1

=
| t0

t2

f (x)dx + +
f (x)dx +
{z
}
}
| t1 {z

Q1 (f ) on [t0 , t1 ]

Q2 (f ) on [t1 , t2 ]

tL

tL1

f (x)dx
{z
}

QL (f ) on [tL1 , tL ]

= QC (f ) I(f ) on [a, b]
Here we used different quadrature formulas Qi on the intervals [ti1 , ti ]. So one can for example
choose to use Simpsons rule on one interval and the trapezoidal rule on the other if the problem
requires this. In the next examples we will use the same quadrature formula on every subinterval.
Example:
Let the integral I(f ) with a decomposition
{[t0 , t1 ], [t1 , t2 ], . . . , [tL1 , tL ]}
be given. Applying the trapezoidal rule to the subintervals leads to
Z b
L Z ti
X
I(f ) =
f (x)dx =
f (x)dx
a

i=1

ti1

t1

=
t0

t2

Z
f (x)dx
{z
}

(t1 t0 )( 12 f (t0 )+ 12 f (t1 ))

Z
f (x)dx
{z
}

t1

tL

+ +
tL1

(t2 t1 )( 12 f (t1 )+ 12 f (t2 ))

f (x)dx
{z
}

(tL tL1 )( 12 f (tL1 )+ 12 f (tL ))

(t1 t0 )( 21 f (t0 ) + 12 f (t1 ))


(tL tL1 )( 12 f (tL1 ) + 21 f (tL ))
+ +
2
2
and if we have equidistant points with tj tj1 = h we get
Z b
L Z ti
X
I(f ) =
f (x)dx =
f (x)dx

i=1

t1

=
t0

ti1

Z
f (x)dx +
{z
}

h( 12 f (t0 )+ 12 f (t1 ))

t2

t1

Z
f (x)dx
{z
}

h( 12 f (t1 )+ 21 f (t2 ))

tL

+ +
tL1

f (x)dx
{z
}

h( 12 f (tL1 )+ 21 f (tL ))

1
1
h( f (t0 ) + f (t1 ) + + f (tL1 ) + f (tL ))
2
2

2.4. MULTIDIMENSIONAL NUMERICAL INTEGRATION

27

which is called the composite trapezoidal rule.


Applying the midpoint rule (Gauss-Legendre with n = 0) we get:
b

f (x)dx =

I(f ) =
a

L Z
X

t1

t0

f (x)dx

ti1

i=1

ti

Z t2
f (x)dx + +
f (x)dx +
t1
|
{z
}
{z
}

(t1 t0 )f (

t0 +t1
)
2

(t1 t0 )f (

(t2 t1 )f (

t1 +t2
)
2

tL

tL1

f (x)dx
{z
}

(tL tL1 )f (

tL1 +tL
)
2

tL1 + tL
t0 + t1
) + + (tL tL1 )f (
)
2
2

and for equidistant points with tj tj1 = h we get


I(f ) h(f (

t0 + t1
tL1 + tL
) + + f(
))
2
2

which is called the composite midpoint rule.


Remark: Formula (2.11) has a special application: If the cubic (natural,complete,periodic)
interpolating spline s(x) is computed one can easily use formula (2.11) to calculate I(s) without
additional function evaluations as all values are known.
Due to the construction of these composite quadrature rules we can easily transfer the existing
error estimates to the decomposed interval: We have for the composite quadrature rule, with
Qi (f )|[ti1 ,ti ] = Qi (f ) on [[ti1 , ti ], the following estimate:
|I(f ) QC (f )| = |

=|

i=1
L
X

f (x)dx

L
ti1

i=1

ti

L
ti1

f (x)dx Qi (f )|[ti1 ,ti ] |

(2.12)

ti

|
ti1

i=1

Qi (f )|[ti1 ,ti ] |

i=1

ti

L
X

f (x)dx Qi (f )|[ti1 ,ti ] |

and can then apply the known estimates on every subinterval.


Example: Let hi := ti ti1 and let Qi (f ) be Simpsons rule. We then have:
|I(f ) QC (f )|

L
X
(ti ti1 )5
i=1

2880

kf

(4)

k,[ti ,ti1 ]

L
X
h5i
=
kf (4) k,[ti ,ti1 ]
2880

(2.13)

i=1

Assuming equidistant points ti , meaning hi = h for every i = 1, ..., L, we can simplify this:
|I(f ) QC (f )|

L
X
h5
h5
kf (4) k,[ti ,ti1 ] L
kf (4) k,[a,b]
2880
2880

(2.14)

i=1

Equation (2.14) might be very rough compared to (2.13). But even in (2.14) one can see that
by adding more points ti and therefore reducing h will reduce the error estimate considerably
without having to take higher derivatives of f into account.

2.4

Multidimensional numerical integration

In this section an approach for multidimensional quadrature rules is shortly introduced. For the
sake of simplicity only the case of a rectangle/cuboid integration area will be considered. For
more general domains refer to chapter 9.9 in Quarteroni/Sacco/Saleri: Numerical Mathematics.

28

CHAPTER 2. NUMERICAL INTEGRATION

Let the rectangle [a1 , b1 ] [a2 , b2 ] and a continuous function f (x, y) be given. The integral
can then be written as
Z b1 Z b2
Z b1
I(f ) =
f (x, y)dy dx =
F (x)dx = I(F )
a1
a1
| a2 {z
}
=:F (x)

The integral I(F ) can then be approximated by a (composite) quadrature rule. For the sake of
simplicity we drop the composite component of the rule and will only use a quadrature rule
with the quadrature points xi and weights i for i = 0, ..., n.
Z

b1

F (x)dx Qx (F ) = (b1 a1 )

I(F ) =
a1

n
X

i F (xi )

i=0

We then note that F (xi ) is given by


Z

b2

f (xi , y)dy

F (xi ) =
a2

which can also be approximated by a quadrature rule. Again for simplicity we use the same rule
again which leads to
Z

b2

f (xi , y)dy Qy (f ) = (b2 a2 )

F (xi ) =
a2

n
X

j f (xi , yj )

j=0

and for the rectangle:


Z

b1

b2

f (x, y)dydx = (b1 a1 )

I(f ) =
a1

a2

n
X

i (b2 a2 )

i=0

= (b1 a1 )(b2 a2 )

n
X

j f (xi , yj )

j=0
n X
n
X

i j f (xi , yj )

i=0 j=0

The derivation of this quadrature formula makes obvious that this rule, by using the weights
and quadrature points of the closed Newton-Cotes formulas, is exact for polynomials of the form
p(x, y) = p1,n (x)p2,n (y) with pi,n Pn ([ai , bi ]) but not necessarily (if n is uneven) for polynomials
p(x, y) = p1,n+1 (x) (or p(x, y) = p2,n+1 (y)). Furthermore the number of function evaluations
increases to (n+1)2 (for a general d-dimensional cuboid to (n+1)d ). This phenomenon is usually
called the curse of dimensionality.
For the cuboid the derivation of a tensor-product formula is analog. Let the rectangle [a1 , b1 ]
[a2 , b2 ] [a3 , b3 ] and a continuous function f (x, y, z) be given. We can then use the quadrature
formula:
Z b1 Z b2 Z b3
n
n
n
X
X
X
I(f ) =
f (x, y, z)dzdydx = (b1 a1 )
i (b2 a2 )
j (b3 a3 )
k f (xi , yj , zk )
a1

a2

a3

i=0

j=0

= (b1 a1 )(b2 a2 )(b3 a3 )

n X
n X
n
X
i=0 j=0 k=0

k=0

i j k f (xi , yj , zk )

Chapter 3

Direct Solvers for Linear Equation


Systems
We have seen in the previous two chapters that numerical problems might lead to the necessary
solution of linear equation systems. Examples were the calculation of the moments of cubic spline
interpolants or the determination of the weights of quadrature formulas. These linear equation
systems arise in the form:

Ax = b

(3.1)

with A Rnn being invertible.


There are several methods to solve problems like (3.1). One has to differentiate between direct
and iterative methods. Direct methods deliver the exact solution at the end of the process,
whereas iterative methods deliver (better) approximations in every step. We will first take a
closer look at the first mentioned.

3.1

The Gaussian Elimination and LU-Decomposition

The Gaussian Elimination is a commonly used method for solving linear equation systems. The
idea is to transform the matrix A by row operations to an upper triangular matrix. Applying
the same transformations to the vector b we can the get the solution by back-substitution. We
will first start with an example:


2 1 4
7
6
2 4 3 5
5

x =
4
16
7 12 20
0 12 5
3
5
|
| {z }
{z
}
=A

=b

29

30

CHAPTER 3. DIRECT SOLVERS FOR LINEAR EQUATION SYSTEMS

Applying the Gaussian Elimination to the augmented matrix we get:

2 1 4
2 4 3

4
7 12
0 12 5

2 1
0 3

0 9
0 12

2 1
0 3

0 0
0 0

2 1
0 3

0 0
0 0

2
1
6
7
+
5 5

20 16 +
3
5

4 7 6
3 4
1 2 1

4 6 4
+
5 3 5 +

4 7 6
1 2 1

1
1 0 1
1 5 1
+

4 7 6
1 2 1

1 0 1
0 5 0

One then has to solve the linear equation system


6
2 1 4 7
1
0 3 1 2

0 0 1 0 x = 1
0 0 0 5
0

by backward substitution resulting in

x4 = 0,

x3 = 1,

1
x2 = (1 1 1 2 0) = 0,
3

resulting in the solution


1
0

x=
1
0

1
x1 = (6 + 1 0 4 1 7 0) = 1
2

3.1. THE GAUSSIAN ELIMINATION AND LU-DECOMPOSITION


For a general matrix

a11
a21

a31

..
.

A the method looks like:

a
a
a21
a31
a12 a13 . . . a1n
11
11
+
a22 a23 . . . a2n

a32 a33 . . . a3n


+
..
.

an1
11

an1 an2 an3 . . . ann +

a11 a12 a13 . . . a1n


(2)
a
0 a(2) a(2) . . . a(2)
32

(2)
22
23
2n
a22

(2)
(2)
(2)
(2)
+

A := 0 a32 a33 . . . a3n


..
..
.
.
(2)

A(3)

(2)

0 an2 an3

a11 a12 a13


0 a(2) a(2)

22
23

(3)

0
0
a
:=
33
..
.
0

(3)

an3

(2)

31

(2)
an2
(2)
a22

. . . ann

. . . a1n
(2)
. . . a2n

(3)
. . . a3n

..
.
(3)

. . . ann

(3)
an3
(3)
a33

...

A(n)

a11 a12 a13 . . . a1n


0 a(2) a(2) . . . a(2)

22
23
2n

(3)
(3)

0
0
a
.
.
.
a
:=
33
3n =: U
.
..
..
..
..
.
.
.

(n)
0
0
. . . 0 ann

If we apply the same operations to the vector b we get

b1
b(2)
2
(3)
b
b=
3
..
.
(n)

bn

The operations we have seen before can also be done by a matrix-matrix multiplication. The
first step of the Gaussian elimination for example can be done by multiplying A = A(1) from the
left side by the matrix

1
0 0
... 0
l2,1 1 0
. . . 0

l3,1 0 1 0 . . . 0

.
.
.
.
L1 =
.
.
.
.
.
.
.
.

.
.
ln1,1 0 0
. 0
ln,1 0 0
... 1
i1
. Therefore we have
with li,1 := aa11

a11 a12 a13 . . . a1n


0 a(2) a(2) . . . a(2)

22
23
2n

(2)
(2)
(2)
(2)
0 a32 a33 . . . a3n = L1 A(1)
A =

..
..
.
.
(2)
(2)
(2)
0 an2 an3 . . . ann

32

CHAPTER 3. DIRECT SOLVERS FOR LINEAR EQUATION SYSTEMS

In general we can execute the k-th

0
.
.
.
.
.
.
Lk = .
..

.
..

.
..

step by Multiplying A(k) by Lk

0
0
...
.. ..
.
.
...
..
. 1
0
...
..
.
0
1
0
..
.
0 lk+1,k 1
..
..
.
. lk+2,k 0
..
..
..
..
.
.
.
.
0 ... 0
ln,k
0

... ... 0
.
. . . . . . ..
.

. . . ..
.

0 . . . ..
.
0 . . . ..

. . . . ..
.
. .

..
. 0
... 0 1

(k)

with lj,k =

ajk

(k)

akk

for j = k + 1, . . . , n and get:

A(k+1)

a11 a12 a13


...
0 a(2) a(2)
...

22
23
.
.
.
..
..
..
..
.

.
= .
..
(k+1)
.
0 ak+1,k+1
.
.
..
..
.

(k+1)
0
...
0
an,k+1

...
...
..
.

a1n
(2)
a2n
..
.

(k)

(k+1) = Lk A
. . . ak+1,n
..
.

(k+1)
. . . ann

So applying the Li for i = 1, . . . , n 1 we get


U = Ln1 A(n1) = Ln1 Ln2 A(n2) = = Ln1 Ln2 . . . L1 A
with the upper triangular matrix U . It

1 0
..
.
0
.
. ...
.
. .
. .
. .
1
Lk = . .
.. ..

. .
.. ..

. .
.. ..

is easy to check that the inverse of Lk is given by

0
...
... ... 0
.
..
.
...
. . . . . . ..
.

1
0
...
. . . ..
.

0
1
0
0 . . . ..
.
0 lk+1,k 1
0 . . . ..

..
. . . . ..
.
. .
. lk+2,k 0

..
..
..
..
. 0
.
.
.
0 ... 0
ln,k
0 ... 0 1

We then can write A as


A = (Ln1 Ln2 . . . L1 )1 U = L1
L1 . . . L1
n1 U = LU
| 1 2 {z
}
=:L

with

0
..
.

0
..
.

...

...

...

l2,1
...
...

0
...
...
l3,1 l3,2 1
.
.
.

..
..
L = ..
1
0
...
.
.
..
..
..
..
.
.
lk+1,k

.
.
.
.
..
..
..
..
1
ln,1 ln,2 . . . ln,k . . . ln,n1

0
..
.
..

.
..

.
..
.

0
1

3.2. SYMMETRIC POSITIVE DEFINITE MATRICES: THE CHOLESKY DECOMPOSITION33


The decomposition of A into a lower triangular matrix L and an upper triangular matrix U is
called LU decomposition. To solve a linear equation system of the form Ax = b we can make
use of the LU decomposition:
Ax = L |{z}
U x = Lz
=:z

(3.2)

first solve Lz = b
then solve U x = z

After we computed the LU-decomposition in 23 n3 + O(n2 ) operations we can solve the linear equation system by only using forward- and backward-substitution, which requires O(n2 )
operations.
For our example we get

1
1
L1 =
2
0

0
1
0
0

0
0

0
1

0
0
1
0

1 0 0
0 1 0
L2 =
0 3 1
0 4 0

0
0

0
1

1
0
L3 =
0
0

0 0
1 0
0 1
0 1

0
0

0
1

and therefore

1
1
L=
2
0

0
1
3
4

0
0

0
1

0
0
1
1

and U = A(4)

2 1 4 7
0 3 1 2

=
0 0 1 0
0 0 0 5

It is easy to check that A = LU holds. To solve the linear equation system we first solve

1
1

2
0

0
1
3
4

0
0
1
1


6
6
0

0
5 z = 1
z
=
1
16
0
0
1
5

and then solve



2 1 4 7
6
1
0 3 1 2
1
0



0 0 1 0 x = 1 x = 1
0 0 0 5
0
0

3.2

Symmetric Positive Definite Matrices: The Cholesky Decomposition

We can not guarantee existence of the LU decomposition for any invertible matrix A Rnn .
(k)
It is clear that the Gaussian elimination stops if any akk is 0. To prove the following first result
on existence and uniqueness of the LU decomposition we will make use of definition 1.7:
Theorem 3.1 Let A Rnn be symmetric positive definite. Then a LU decomposition A = LU
(k)
exists and when using the Gaussian elimination we have: akk > 0 for k = 1, ..., n

34

CHAPTER 3. DIRECT SOLVERS FOR LINEAR EQUATION SYSTEMS

proof: As A is symmetric and positive definite we have a11 = eT1 Ae1 > 0. Therefore the first
step of the Gaussian elimination can be executed and we get:

a
a
a
a21
a31
an1
a11 a21 a31 . . . an1
11
11
11
a21 a22 a32 . . . an2
+

a31 a32 a33 . . . an3 +

..
..
.
.
an1 an2 an3 . . . ann +

a11 a21 a31 . . . an1


0 a(2) a(2) . . . a(2)

22
32
n2

(2)
(2)
(2)
(2)

0
a
a
.
.
.
a
L1 A = A =
32
33
n3
..
..
.
.
0

(2)

an2

(2)

an3

(2)

. . . ann

As A is symmetric we can now apply the same operations to the columns:


a

an1

11

a31
11

a21
11

+
y

a11 a21 a31 . . . an1


0 a(2) a(2) . . . a(2)

22
32
n2

(2)
(2)

0 a(2)
a
.
.
.
a
32
33
n3

..
..
.
.
(2)
(2)
(2)
0 an2 an3 . . . ann

a11 0
0 ... 0
0 a(2) a(2) . . . a(2)

22
32
n2

(2)
(2)
(2)
T
(2)

0
a
a
.
.
.
a
L1 AL1 =: B =
32
33
n3
..
..
.
.
(2)
(2)
(2)
0 an2 an3 . . . ann
Now the general matrix B (k) after k 1 steps is given by

Lk1 B (k1) LTk1

=B

(k)

a11
(2)

a22

=
0

0
..

.
(k)

(k)
. . . an,k
..
..

.
.

ak,k
..
.
(k)
(k)
an,k . . . ann

and we have (with B (1) = A) for all y Rn :


yT B (k+1) y = yT Lk B (k) LTk y = (LTk y)T B (k) (LTk y)
Therefore as B (1) is symmetric positive definite the same is true for all B (k) with k = 2, ..., n and
we have
(k)
akk = eTk B (k) ek > 0

3.2. SYMMETRIC POSITIVE DEFINITE MATRICES: THE CHOLESKY DECOMPOSITION35


and the k-th step of the Gaussian elimination can be computed.

The proof of this theorem gives us after (n 1) steps

(n1)

Ln1 Ln2 . . . L1 ALT1 LT2

. . . LTn1

(1)

a11

(2)

a22

..

.
(n)

=: D

ann
(i)

with all aii > 0. This means we can rewrite A in the form
1
1
T
T T
A = L1
= LDLT
1 L2 . . . Ln1 DLn1 . . . L2 L1

And by setting
= LD1/2
L
with
q
D1/2

(1)

a11

(2)
a22

..

.
q
(n)
ann

we get
L
T
A = LDLT = LD1/2 D1/2 LT = LD1/2 (LD1/2 )T = L
is a lower triangular matrix with positive entries on the diagonal. This decomposition
Here L
is called the Cholesky decomposition and is named after Andre-Louis Cholesky (1875-1918). To
calculate the Cholesky factorization by simply applying the Gaussian elimination would be too
costly and would not take the symmetry into account. If we write the decomposition in the form

l11
l11
l21 l22

..

.
..
.

ln1 . . . . . . lnn

l21 . . . ln1
a11 a21 . . . an1
.. a

l22
21 a22 . . . an2
.
=
..
. ..
..
.
. .. .
an1 an2 . . . ann
lnn

with the i-th column of L


T
we can compute lki with i k by multiplying the k-th row of L
lk1 li1 + lk2 li2 + + lki lii = aki
i1
X
1

lkj lij ) if i < k


lki = (aki
lii
j=1
v
u
k1
X
u
t

l2 if i = k
lkk = akk
kj
j=1

36

CHAPTER 3. DIRECT SOLVERS FOR LINEAR EQUATION SYSTEMS

This leads to the following algorithm


for k=1 to n
for i=1 to k-1
i1
X
lkj lij )
lki = 1 (aki
lii
j=1

end
v
u
k1
X
u
lkk = takk
l2
kj
j=1

end
The computational cost is
n X
k1
n
X
X
1
(2i 1) +
2(k 1) = n3 + O(n2 )
3
k=1 i=1

k=1

with the additional computation of n square roots.


Example: Let the matrix

16 4
4 4
4
5
1 1

A=
4
1 10 7
4 1 7 9
be given. Applying upper algorithm we get the

4
1
=
L
1
1

Cholesky decomposition

0 0 0
2 0 0

0 3 0
0 2 2

and it holds
L
T
A=L

3.3

Partial Pivoting

As we have already mentioned before the Gaussian elimination can not be applied to any regular
matrix, as there might be a 0 on the diagonal. A simple example is given by


0 1
A=
1 1
where Gaussian elimination of section 3.1 can not be applied without swapping rows(columns).
Furthermore problems might arise if the pivot element is very small:
 20 
10
1
A=
1
1
Using the Gaussian elimination we get


 20

1
0
10
1
L=
and U =
1020 1
0
1 1020

3.3. PARTIAL PIVOTING

37

If one uses a computer with a machine precision of 1016 the difference 1 1020 is 1020 .
Therefore we get the disturbed matrices


 20

1
0
10
1

L=
and U =
1020 1
0
1020
and
U
=
L


1020 1
1
0

Using the LU decomposition for the solution of the LES Ax = (1, 0)T we get
 
1

L |{z}
Ux =
0
=:z

and therefore


z=

and then

1
1020

 
0
x=
1

whereas the true solution is given by



x=

1101 20

1
11020

 
1

A remedy for this phenomenon is to swap the two rows


 20 


1
1
10
1

1020 1
1
1

So instead of solving
 
 20 
1
10
1
x=
0
1
1
we want to solve:


 
0
1
1
x=
1
1020 1

We compute the LU decomposition of this new matrix. Here we get






1
0
1
1
L=
and
U
=
1020 1
0 1 1020
and for the disturbed matrices
=
L
with
U
=
L

1
0
1020 1

=
and U

1
1
20
10
1 + 1020



1 1
0 1

1
1
20
10
1

Using the LU decomposition to solve upper linear equation system we get (by taking machine
precision into account)
 
1
x=
1
The Gaussian elimination with partial pivoting in the k-th step looks like

38

CHAPTER 3. DIRECT SOLVERS FOR LINEAR EQUATION SYSTEMS


(k)

(k)

1. For the step A(k) A(k+1) choose a p {k, . . . , n}, such that |apk | |ajk | for j
{k, . . . , n}
2. Swap the rows k and p

(k)

(k)

with

(k)
a
ij

(k)

akj
= a(k)
pj

(k)
aij

if i=p
if i=k
else

3. Apply the Gaussian elimination step to A(k)


The second step can be applied by using a permutation matrix Pk of the form
k-th column

Pk =

p-th column

1
..

| k-th row

| p-th row

.
1
0 0 ... 0 1
0 1
0
..
..
..
.
.
.
0
1 0
1 0 ... 0 0
1
..

.
1

which is just the unity matrix where column k and row p are swapped. It is easy to see that
Pk1 = Pk . Similar to the LU decomposition before we are able to construct an upper triangular
matrix U by
U = Ln1 Pn1 Ln2 Pn2 Ln3 Pn3 . . . L2 P2 L1 P1 A
To get a LU decomposition we need to construct matrices similar to our Li before to keep the
known structure. It is easy to see that if we multiply Pk+1 (which swaps row k + 1 and pk+1 )
with Lk we get
k + 1-th

1
..
.

lpk+1 ,k 0

lk+2,k
0

..
..

Pk+1 Lk =
.
.

l
0
pk+1 1,k

lk+1,k
1

l
pk+1 +1,k

..

.
ln,k

col.

pk+1 -th col.

| k + 1-th row

| pk+1 -th row

0 ... 0 1
1
0
..
..
.
.
1 0
0 ... 0 0
1
..

.
1

3.3. PARTIAL PIVOTING

39

and then by swapping the k + 1-th and pk+1 -th column we regain the former structure. To swap
column k + 1 and pk+1 we need to multiply Pk+1 from the right.
k + 1-th col.

Pk+1 Lk Pk+1

pk+1 -th col.

1
..

| k + 1-th row

| pk+1 -th row

1
lpk+1 ,k
lk+2,k
..
.
lpk+1 1,k
lk+1,k
lpk+1 +1,k
..
.

1 0 ... 0 0
0 1
0
..
..
..
.
.
.
0
1 0
0 0 ... 0 1
1
..

ln,k

This implies the following construction (Pk = Pk1 )


U = Ln1 Pn1 Ln2 Pn1 Pn1 Pn2 Ln3 (Pn2 Pn1 )(Pn2 Pn1 )1 Pn3 L4 . . . L2 P2 L1 P1 A
| {z }
|
{z
}
=I

=I

= Ln1 Pn1 Ln2 Pn1 Pn1 Pn2 Ln3 Pn2 Pn1 Pn1 Pn2 Pn3 L4 . . . L2 P2 L1 P1 A
| {z } |
{z
}|
{z
}
n1
=:L

n2
=:L

n3
=:L

= ...
n1 L
n2 . . . L
1 (Pn1 . . . P1 ) A
=L
{z
}
|
=:P

with
k = Pn1 . . . Pk+1 Lk Pk+1 . . . Pn1
L
having the same structure as the Lk of section 3.1 with swapped entries. and therefore
n1 L
n2 . . . L
1 )1 U = L
1 L
1 . . . L
1 U = LU
P A = (L
n1
| 1 2 {z
}
=:L

It is clear that this construction works always if A is not singular as there has to be at least one
entry in the column (from row k downwarts) that is not 0. We summarize this in the following
theorem
Theorem 3.2 Let A Rnn be non singular. Then a decomposition
P A = LU
with a permutation matrix P a lower triangular matrix L (with |lij | 1) and an upper triangular
matrix U exists.
Remark: As the additional computational cost for the step k of the elimination process only
consists in finding the element of largest absolute value in the corresponding row (k operations)
and then swapping two rows (2k operations) the overall computational cost of
n1
X
i=1

3k = 3

n(n 1)
= O(n2 )
2

40

CHAPTER 3. DIRECT SOLVERS FOR LINEAR EQUATION SYSTEMS

is negligible.
To solve the linear equation system Ax = b with the given decomposition P A = LU we use
Ax = b P Ax = P b LU x = P b
first solve Lz = P b
then solve U x = z
Example: Let the matrix

    A = [ 12  12  12   0 ]
        [  4   4   5   6 ]
        [  3  -3  21  12 ]
        [  6   4  17   9 ]

be given. In the first step no swap is needed, which means that P_1 = I and L_1 is given by

    L_1 = [   1    0  0  0 ]
          [ -1/3   1  0  0 ]
          [ -1/4   0  1  0 ]
          [ -1/2   0  0  1 ]

A^(2) is then given by

    A^(2) = [ 12  12  12   0 ]
            [  0   0   1   6 ]
            [  0  -6  18  12 ]
            [  0  -2  11   9 ]

Since the pivot in the second column is 0, rows 2 and 3 are swapped,

    Ã^(2) = [ 12  12  12   0 ]
            [  0  -6  18  12 ]
            [  0   0   1   6 ]
            [  0  -2  11   9 ]

and therefore P_2 and L_2 are given by

    P_2 = [ 1  0  0  0 ]        L_2 = [ 1    0   0  0 ]
          [ 0  0  1  0 ]              [ 0    1   0  0 ]
          [ 0  1  0  0 ]              [ 0    0   1  0 ]
          [ 0  0  0  1 ]              [ 0  -1/3  0  1 ]

and A^(3) is given by

    A^(3) = [ 12  12  12   0 ]          swap rows 3 and 4:   Ã^(3) = [ 12  12  12   0 ]
            [  0  -6  18  12 ]                                       [  0  -6  18  12 ]
            [  0   0   1   6 ]                                       [  0   0   5   5 ]
            [  0   0   5   5 ]                                       [  0   0   1   6 ]

with

    P_3 = [ 1  0  0  0 ]        L_3 = [ 1  0    0   0 ]        P = P_3 P_2 P_1 = [ 1  0  0  0 ]
          [ 0  1  0  0 ]              [ 0  1    0   0 ]                          [ 0  0  1  0 ]
          [ 0  0  0  1 ]              [ 0  0    1   0 ]                          [ 0  0  0  1 ]
          [ 0  0  1  0 ]              [ 0  0  -1/5  1 ]                          [ 0  1  0  0 ]

and finally

    U = [ 12  12  12   0 ]
        [  0  -6  18  12 ]
        [  0   0   5   5 ]
        [  0   0   0   5 ]


For the L̃_i we get

    L̃_1 = P_3 P_2 L_1 P_2 P_3 = [   1    0  0  0 ]        L̃_2 = P_3 L_2 P_3 = [ 1    0   0  0 ]
                                 [ -1/4   1  0  0 ]                             [ 0    1   0  0 ]
                                 [ -1/2   0  1  0 ]                             [ 0  -1/3  1  0 ]
                                 [ -1/3   0  0  1 ]                             [ 0    0   0  1 ]

    L̃_3 = L_3 = [ 1  0    0   0 ]
                 [ 0  1    0   0 ]
                 [ 0  0    1   0 ]
                 [ 0  0  -1/5  1 ]

and therefore

    L = [  1    0    0   0 ]
        [ 1/4   1    0   0 ]
        [ 1/2  1/3   1   0 ]
        [ 1/3   0   1/5  1 ]

with

    [ 1  0  0  0 ] [ 12  12  12   0 ]   [  1    0    0   0 ] [ 12  12  12   0 ]
    [ 0  0  1  0 ] [  4   4   5   6 ] = [ 1/4   1    0   0 ] [  0  -6  18  12 ]
    [ 0  0  0  1 ] [  3  -3  21  12 ]   [ 1/2  1/3   1   0 ] [  0   0   5   5 ]
    [ 0  1  0  0 ] [  6   4  17   9 ]   [ 1/3   0   1/5  1 ] [  0   0   0   5 ]
    \____________/ \________________/   \___________________/ \________________/
          P                A                      L                   U

Solving the linear equation system Ax = b with b = (12, 10, 15, 15)^T is done by first solving

    L z = P b = (12, 15, 15, 10)^T    =>    z = (12, 12, 5, 5)^T

and then solving

    U x = z    =>    x = ( 1 )
                         ( 0 )
                         ( 0 )
                         ( 1 )
CHAPTER 3. DIRECT SOLVERS FOR LINEAR EQUATION SYSTEMS

Chapter 4

Iterative Methods for Linear Equations Systems
In this chapter we will take a closer look at iterative methods for the solution of linear equation systems. A very important point when using this kind of method is the speed of convergence. This mostly depends on how prone the given problem, in this case solving the linear equation system, is to errors. Therefore it is beneficial to define a theoretical framework which measures this proneness to errors.

4.1 Norms and Condition Numbers for Matrix Vector Calculus

There are several different forms of errors arising in numerical computations.


- rounding errors: The results of numerical computations are given as floating point numbers. Therefore the machine precision has to be taken into account if one uses the floating point operations (+, -, ·, /).
- data errors: The data given by experiments/computations is generally polluted by measuring errors/computational errors.
- approximation errors: Functions (for example sin(x)) are most of the time approximated
by easier functions, such as interpolating polynomials.
To have reliable numerical calculations one has to examine the effect of these errors on the
accuracy. We will examine the data errors under the assumption that there are no other errors
present.
We will first start with a simple example to emphasize the given problem. Let two straight lines g1(t) and g2(t) be given. The problem we want to examine is given by: calculate the value of the point of intersection t* on the t-axis.
Assume that the two straight lines are almost orthogonal, for example g1(t) = c1 and g2(t) = 20t + c2, and assume that, instead of the exact values c1 = 1 and c2 = -10, we have some disturbed values c̃1 and c̃2 (with an accuracy of ±0.25). In Figure 4.1 (a) one can see that the point of intersection t* lies somewhere between 0.52 and 0.58.
Now let us assume that the two straight lines are almost parallel. As an example we take g1(t) = c1 and g2(t) = 0.1t + c2 and again assume that, instead of the exact values c1 = 1 and c2 = 0, we have some disturbed values c̃1 and c̃2 (again with an accuracy of ±0.25). In Figure 4.1 (b) we can see that the point of intersection t* lies somewhere between 5 and 15.
We can conclude that the error in the solution of our problem is much higher for the same error in the input data if we have almost parallel straight lines. The problem of calculating the abscissa of the point of intersection is worse conditioned than for almost orthogonal straight lines.
[Figure 4.1: Calculation of the point of intersection for disturbed data; (a) (almost) orthogonal straight lines, (b) (almost) parallel straight lines.]

Let g2' be the slope of the straight line g2. We then can calculate the abscissa of the point of intersection by

    t* = (c1 - c2) / g2' = ( 1/g2'   -1/g2' ) ( c1 )  =:  A x
                           \_______________/ ( c2 )
                                 =: A         \____/
                                                =: x

with A ∈ R^{1×2}. In this case the function that computes our solution f(x) is given by

    f : R^2 -> R^1  with  f(x) = A x,

and instead of f(x) one usually computes f(x̃).

We will now get to the case of matrix-vector multiplication. Let a matrix A ∈ R^{m×n} be given. Furthermore let a vector x and a disturbed vector x̃ be given. We then have for the error in the result/output data e_out of the matrix-vector multiplication

    e_out = A x - A x̃ = A (x - x̃) = A e_in

with ein being the error in the input data. For a vector x Rk we can measure the length of
the vector in the so called p-norms with p N or p = :
p
kxkp = p |x1 |p + + |xk |p
(4.1)
with the special cases of the Taxicab norm
kxk1 = |x1 | + + |xk |
the Euclidean norm
kxk2 =

|x1 |2 + + |xk |2

and the maximum norm


kxk = max {|xi |}
i=1,...,k

We will use these norms for measuring the error.
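For illustration (a small snippet added to these notes), the three special cases can be evaluated directly with NumPy:

import numpy as np

x = np.array([3.0, -4.0, 0.0])
print(np.linalg.norm(x, 1))        # Taxicab norm:   |3| + |-4| + |0| = 7
print(np.linalg.norm(x, 2))        # Euclidean norm: sqrt(9 + 16)     = 5
print(np.linalg.norm(x, np.inf))   # maximum norm:   max(3, 4, 0)     = 4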


We now get back to the question how the error in the input data is increased: The absolute condition κ_abs, with respect to a p-norm in the input and a q-norm in the output data, is the smallest number κ such that

    ||e_out||_q = ||A x - A x̃||_q <= κ ||x - x̃||_p = κ ||e_in||_p

holds for all x̃ ∈ R^n. We can rewrite the above statement as

    κ_abs = inf{ κ : ||A x - A x̃||_q <= κ ||x - x̃||_p for all x̃ ∈ R^n }
          = max_{x̃ ≠ x} ||A x - A x̃||_q / ||x - x̃||_p
          = max_{0 ≠ y ∈ R^n} ||A y||_q / ||y||_p =: ||A||_{p,q}                   (4.2)

It is easy to show that ||A||_{p,q} is in fact a norm on R^{m×n} for the linear mapping indicated by A; it is called the (by the vector norms ||·||_p and ||·||_q) induced matrix norm. In the case p = q we write ||A||_p instead of ||A||_{p,p}.
Remark: The definition of an induced matrix norm does not cover all matrix norms. There are some matrix norms that are not induced by a vector norm and for these norms the following definition is of interest: We call a matrix norm ||·||_M consistent with a vector norm ||·||_V if

    ||A x||_V <= ||A||_M ||x||_V                                                   (4.3)

for all x ∈ R^n. Examples for norms that are not induced by vector norms are the Frobenius norm

    ||A||_F := sqrt( sum_{i=1}^m sum_{j=1}^n a_ij^2 )                              (4.4)

or the maximum norm

    ||A||_Max := max_{i,j} |a_ij|.                                                 (4.5)

The Frobenius norm is consistent with the Euclidean norm, whereas the maximum norm is not consistent with any p-norm.
Generally we are more interested in relative errors than in absolute errors, as we want to take the magnitude of the input data/solutions into account. Therefore we get the relative condition of the matrix-vector multiplication by using relative errors: The relative condition, with respect to a p-norm in the input and a q-norm in the output data, is the smallest number κ_rel such that

    ||e_out||_q / ||A x||_q = ||A x - A x̃||_q / ||A x||_q <= κ_rel ||x - x̃||_p / ||x||_p = κ_rel ||e_in||_p / ||x||_p

holds for all x̃ ∈ R^n. As before we can rewrite the above statement as

    κ_rel = inf{ κ : ||A x - A x̃||_q / ||A x||_q <= κ ||x - x̃||_p / ||x||_p for all x̃ ∈ R^n }
          = max_{x̃ ≠ x} ( ||A x - A x̃||_q ||x||_p ) / ( ||x - x̃||_p ||A x||_q )
          = ||A||_{p,q} ||x||_p / ||A x||_q                                        (4.6)

So we see that the relative condition not only depends on A, p and q but on x as well. A very important number is the so-called condition number κ_{p,q}. Therefore we assume A to be quadratic (m = n) and invertible. If we take the relative condition (4.6) and take the maximum over all x (worst case scenario) we get

    κ_{p,q} = max_{0 ≠ x ∈ R^n} ||A||_{p,q} ||x||_p / ||A x||_q
            = ||A||_{p,q} max_{0 ≠ y ∈ R^n} ||A^{-1} y||_p / ||y||_q        (y = A x)
            = ||A||_{p,q} ||A^{-1}||_{q,p}                                          (4.7)

For the condition number κ_{p,q} we write κ_p if p = q. Generally these matrix norms are hard to compute and therefore we will restrict ourselves to the cases p = q with p = 1, p = 2 and p = ∞.
We start with the special case p = 2:


Theorem 4.1 (Spectral norm) For the Euclidean norm ||·||_2 the induced matrix norm

    ||A||_2 = max_{0 ≠ y ∈ R^n} ||A y||_2 / ||y||_2

is given by:

    ||A||_2 = sqrt( max{ λ : λ is eigenvalue of A^T A } )
            = sqrt( max{ λ : λ is eigenvalue of A A^T } )                          (4.8)

If A is symmetric we have

    ||A||_2 = max{ |λ| : λ is eigenvalue of A }.                                   (4.9)

Proof: Let n <= m. We know that A A^T ∈ R^{m×m} has the same eigenvalues as A^T A ∈ R^{n×n} with additional m - n zero eigenvalues. Therefore we only have to show

    ||A||_2 = sqrt( max{ λ : λ is eigenvalue of A^T A } ).

The definition of the ||·||_2 norm of A gives immediately

    ||A||_2^2 = max_{0 ≠ y ∈ R^n} ||A y||_2^2 / ||y||_2^2 = max_{0 ≠ y ∈ R^n} (y^T A^T A y) / (y^T y).

As A^T A is symmetric we can find an orthogonal matrix Q ∈ R^{n×n} such that

    Q^T A^T A Q = Λ = diag(λ_1, ..., λ_n)

with λ_i being the eigenvalues of A^T A. It follows A^T A = Q Λ Q^T as Q^T Q = I. Therefore we get

    ||A||_2^2 = max_{0 ≠ y ∈ R^n} (y^T A^T A y)/(y^T y)
              = max_{0 ≠ y ∈ R^n} (y^T Q Λ Q^T y)/(y^T Q Q^T y)
              = max_{0 ≠ z ∈ R^n} (z^T Λ z)/(z^T z)
              = max_{0 ≠ z ∈ R^n} ( sum_{i=1}^n λ_i z_i^2 ) / ( sum_{i=1}^n z_i^2 )
              = max_{i=1,...,n} λ_i

with z = Q^T y. For m <= n the proof is analogous. For a symmetric matrix A we get (4.9) immediately by using the fact that the eigenvalues of A^2 are the squares of the eigenvalues of A.

Without proof the following theorem states how the matrix norms for p = 1 and p = ∞ can be calculated.

Theorem 4.2 (Maximum absolute column/row sum norm) For the induced matrix norms ||·||_1 and ||·||_∞ we have

    ||A||_1 = max_{0 ≠ y ∈ R^n} ||A y||_1 / ||y||_1 = max_{j=1,...,n} sum_{i=1}^m |a_ij|,

which is called the maximum absolute column sum norm, and

    ||A||_∞ = max_{0 ≠ y ∈ R^n} ||A y||_∞ / ||y||_∞ = max_{i=1,...,m} sum_{j=1}^n |a_ij|,

which is called the maximum absolute row sum norm.


To apply these considerations to linear equation systems

    A x = b

we have to keep in mind that solving this system is equivalent to multiplying b by A^{-1}:

    x = A^{-1} b.

Therefore the absolute condition is given by

    κ_abs = ||A^{-1}||_{p,q}                                                       (4.10)

and the relative condition is given by

    κ_rel = ||A^{-1}||_{p,q} ||b||_p / ||A^{-1} b||_q                              (4.11)

If we again take the maximum over all right hand sides b (the worst case scenario) we get

    κ_{p,q} = ||A||_{p,q} ||A^{-1}||_{q,p}.

It has to be noted that in practice one normally uses p = q with p = 1, p = 2 or p = ∞.
Example: Let p = q = 2 and let the quadratic Hilbert matrix be given:

    H = [  1      1/2      1/3    ...    1/n     ]
        [ 1/2     1/3      1/4    ...  1/(n+1)   ]
        [ 1/3     1/4      1/5    ...  1/(n+2)   ]
        [  :       :        :      .      :      ]
        [ 1/n   1/(n+1)  1/(n+2)  ...  1/(2n-1)  ]

and the right hand side

    b = ( 1, 1/2, 1/3, ..., 1/n )^T.

We now disturb the vector b by going from double precision (error of 2.221 · 10^{-16}) to single precision (error of 1.193 · 10^{-7}). The calculations are all done using double precision. The errors are listed in Table 4.1.

     n    ||b - b̃||_2        ||x - x̃||_2       ||H^{-1}||_2       ||b - b̃||_2/||b||_2   ||x - x̃||_2/||x||_2   κ_2(H)
     4    9.9341 · 10^{-9}    8.1301 · 10^{-5}   1.0341 · 10^{4}    8.3259 · 10^{-9}       8.1301 · 10^{-5}      1.5514 · 10^{4}
     8    1.3154 · 10^{-8}    13.3924            8.9965 · 10^{9}    1.0643 · 10^{-8}       13.3924               1.5258 · 10^{10}
    12    1.3764 · 10^{-8}    1.4870 · 10^{7}    9.0997 · 10^{15}   1.1002 · 10^{-8}       1.4870 · 10^{7}       1.6776 · 10^{16}

                              Table 4.1: Errors and condition numbers
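A sketch of this experiment (added here for illustration; the exact numbers depend on how b is rounded and on the solver, but the growth of the error with κ_2(H) is reproduced) could look as follows:

import numpy as np
from scipy.linalg import hilbert

for n in (4, 8, 12):
    H = hilbert(n)
    b = 1.0 / np.arange(1, n + 1)
    b_tilde = b.astype(np.float32).astype(np.float64)   # rounding to single precision
    x = np.linalg.solve(H, b)
    x_tilde = np.linalg.solve(H, b_tilde)
    rel_in = np.linalg.norm(b - b_tilde) / np.linalg.norm(b)
    rel_out = np.linalg.norm(x - x_tilde) / np.linalg.norm(x)
    print(n, rel_in, rel_out, np.linalg.cond(H, 2))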


4.2 Iterative Solvers

In applications one typically has to handle large linear equation systems with only few non-zero entries (finite element methods, finite differences, finite volumes, ...). The matrices arising in these systems are called sparse matrices. The main problem is that direct solvers might be
- inefficient
- memory consuming
The first point is important if in applications no exact solution is needed and one is only interested in an approximation. In contrast to direct solvers, iterative solvers deliver a sequence of approximations x^(0), x^(1), ... and can be stopped at any time, giving an approximation.
The second point is illustrated in Figure 4.2. The original matrix A is a sparse matrix with only 180 entries that are not zero. The matrices L and U arising from the LU decomposition with partial pivoting (P A = LU) have 451 and 419 non-zero entries. This results in 5 times the memory consumption of the original matrix (or more than 2 times if we only consider U). The problem of fill-in is quite common for sparse matrices and might introduce serious memory problems.

[Figure 4.2: LU decomposition with partial pivoting; (a) A (180 non-zeros), (b) L (451 non-zeros), (c) U (419 non-zeros).]

Let the following linear equation system be given:

    [ a_11  a_12  ...  a_1n ] [ x_1 ]   [ b_1 ]
    [ a_21  a_22  ...  a_2n ] [ x_2 ]   [ b_2 ]
    [  :     :          :   ] [  :  ] = [  :  ]
    [ a_n1  a_n2  ...  a_nn ] [ x_n ]   [ b_n ]

with a given initial guess:

    x^(0) = ( x_1^(0), x_2^(0), ..., x_n^(0) )^T.

Iterative methods are generally based on a fixed point equation of the form
x = Gx + c
where the matrix G and the vector c define the iterative method given by the fixed point iteration
x(k+1) = Gx(k) + c.

(4.12)


The matrix G ∈ R^{n×n} is called the iteration matrix. To derive some first methods we will make use of the following splitting:

    [ a_11 a_12 ... a_1n ]   [   0                  ]   [ a_11              ]   [ 0  a_12  ...    a_1n    ]
    [ a_21 a_22 ... a_2n ] = [ a_21   0             ] + [      a_22         ] + [      0   ...     :      ]
    [  :    :        :   ]   [   :        ...       ]   [           ...     ]   [          ...  a_{n-1,n} ]
    [ a_n1 a_n2 ... a_nn ]   [ a_n1 ... a_{n,n-1} 0 ]   [              a_nn ]   [                   0     ]
    \___________________/    \____________________/    \__________________/    \________________________/
             A                         L                        D                          U

One of the easiest methods is derived in the following way:

    A x = b  <=>  (D + L + U) x = b  <=>  D x = b - (L + U) x  <=>  x = D^{-1} (b - (L + U) x),

where we assumed that all diagonal entries of A are not equal to zero. The iterative method derived by this equation is then given by

    x^(k+1) = D^{-1} (b - (L + U) x^(k))

or equivalently

    D x^(k+1) = b - (L + U) x^(k).

By writing every row separately we get

    a_ii x_i^(k+1) = b_i - sum_{j=1, j≠i}^n a_ij x_j^(k)      for i = 1, ..., n

or

    x_i^(k+1) = (1/a_ii) ( b_i - sum_{j=1, j≠i}^n a_ij x_j^(k) )      for i = 1, ..., n.     (4.13)

This method is called the Jacobi method. It should be mentioned that one is able to calculate x_i^(k+1) for all i simultaneously. Therefore this method is suitable for parallelization. In sequential programming all x_i^(k+1) are computed consecutively. Therefore one might conceive the idea to use the already computed better approximations. If the numbering of the computations is given by increasing i, the upper equation becomes

    x_i^(k+1) = (1/a_ii) ( b_i - sum_{j=1}^{i-1} a_ij x_j^(k+1) - sum_{j=i+1}^n a_ij x_j^(k) )      for i = 1, ..., n.     (4.14)

This method is called the Gauss-Seidel method. It can be rewritten as

    sum_{j=1}^{i} a_ij x_j^(k+1) = b_i - sum_{j=i+1}^n a_ij x_j^(k)      for i = 1, ..., n

and we are now able to put it into the form of equation (4.12):

    (D + L) x^(k+1) = b - U x^(k)    or    x^(k+1) = (D + L)^{-1} (b - U x^(k)).     (4.15)
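As an illustration (a sketch written for this text, not the implementation used in the notes), one sweep of the Jacobi iteration (4.13) and of the Gauss-Seidel iteration (4.14) could be coded in NumPy as follows; the function names and the test matrix in the usage lines are chosen here.

import numpy as np

def jacobi_step(A, b, x):
    """One Jacobi sweep (4.13): every component uses only the old iterate x."""
    D = np.diag(A)
    return (b - A @ x + D * x) / D

def gauss_seidel_step(A, b, x):
    """One Gauss-Seidel sweep (4.14): already updated components are reused."""
    x = x.copy()
    for i in range(len(b)):
        x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x

# usage sketch
A = np.array([[4., 1, 0], [1, 4, 1], [0, 1, 4]])
b = np.array([1., 2, 3])
x = np.zeros(3)
for _ in range(50):
    x = gauss_seidel_step(A, b, x)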


In this case we need to calculate the new entries step by step and an easy implementation of a parallel algorithm is not possible. The solutions we get by using the Gauss-Seidel method are generally better than the ones we get by using the Jacobi method.
One is free to choose the numbering of the computed entries. If one starts first by computing the last entries of x^(k+1) (decreasing i) one gets

    x_i^(k+1) = (1/a_ii) ( b_i - sum_{j=1}^{i-1} a_ij x_j^(k) - sum_{j=i+1}^n a_ij x_j^(k+1) )      for i = 1, ..., n,     (4.16)

which is called the backward Gauss-Seidel method. This can be written as

    (D + U) x^(k+1) = b - L x^(k)    or    x^(k+1) = (D + U)^{-1} (b - L x^(k)).     (4.17)

If we combine these two methods by using one step of the Gauss-Seidel method and then one step of the backward Gauss-Seidel method we get the symmetric Gauss-Seidel method:

    x^(k+1/2) = (D + L)^{-1} (b - U x^(k))
    x^(k+1)   = (D + U)^{-1} (b - L x^(k+1/2))                                       (4.18)

which is equivalent to

    x^(k+1) = (D + U)^{-1} ( b - L (D + L)^{-1} (b - U x^(k)) )
            = (D + U)^{-1} D (D + L)^{-1} b + (D + U)^{-1} L (D + L)^{-1} U x^(k).   (4.19)

Let

    e^(k) = x - x^(k)

be the error in the k-th step. We can then derive for the Jacobi method

    e^(k+1) = x - x^(k+1) = x - D^{-1}(b - (L + U) x^(k)) = x - D^{-1}(A x - (L + U) x^(k))
            = x - D^{-1}((D + L + U) x - (L + U) x^(k)) = -D^{-1}(L + U)(x - x^(k))          (4.20)
            = -D^{-1}(L + U) e^(k).

Analogously we get for the other methods:

    e^(k+1) = -(D + L)^{-1} U e^(k)                         (Gauss-Seidel)                   (4.21)
    e^(k+1) = -(D + U)^{-1} L e^(k)                         (backward Gauss-Seidel)          (4.22)
    e^(k+1) = (D + U)^{-1} L (D + L)^{-1} U e^(k)           (symmetric Gauss-Seidel)         (4.23)

and therefore the error estimates

    ||e^(k+1)|| <= ||D^{-1}(L + U)|| ||e^(k)||                    (Jacobi)                   (4.24)
    ||e^(k+1)|| <= ||(D + L)^{-1} U|| ||e^(k)||                   (Gauss-Seidel)             (4.25)
    ||e^(k+1)|| <= ||(D + U)^{-1} L|| ||e^(k)||                   (backward Gauss-Seidel)    (4.26)
    ||e^(k+1)|| <= ||(D + U)^{-1} L (D + L)^{-1} U|| ||e^(k)||    (symmetric Gauss-Seidel)   (4.27)

for any vector norm ||·|| and the induced matrix norm.

Thus, for all considered methods we obtain an error recursion e^(k+1) = G e^(k) and the estimate

    ||e^(k+1)|| <= ||G|| ||e^(k)||.                                                          (4.28)

Thus, if ||G|| < 1 the error ||e^(k)|| -> 0 for k -> ∞. We will see below that ||G|| < 1 is a sufficient convergence criterion for iterative methods of form (4.12) which follows directly from

another criterion. Under the assumption that ||G|| < 1 one can apply the Banach fixed point theorem (the general statement and a proof will be given in the next chapter!) to obtain the error estimates

    ||e^(k)|| <= ||G|| / (1 - ||G||) ||x^(k) - x^(k-1)||        ("a posteriori")              (4.29)
    ||e^(k)|| <= ||G||^k / (1 - ||G||) ||x^(1) - x^(0)||        ("a priori").                 (4.30)

With these inequalities one can estimate the error without knowing the exact solution.
In the following we motivate a convergence criterion for the iterative method (4.12) which is based on the eigenvalues of G:
We assume that the iteration matrix G has a basis of eigenvectors v^(1), ..., v^(n) to the eigenvalues λ_1, ..., λ_n. This assumption is equivalent to the requirement that the matrix G is diagonalizable. In this case we can express the initial error e^(0) = x - x^(0) uniquely as

    e^(0) = a_1 v^(1) + ... + a_n v^(n) = sum_{i=1}^n a_i v^(i)

with a_i ∈ R, i = 1, ..., n. With the help of e^(k+1) = G e^(k), k = 0, 1, 2, ..., we obtain

    e^(k+1) = G e^(k) = G^2 e^(k-1) = ... = G^{k+1} e^(0),     k = 0, 1, 2, ...

and therefore

    e^(k) = G^k e^(0) = sum_{i=1}^n a_i G^k v^(i) = sum_{i=1}^n a_i λ_i^k v^(i),     k = 0, 1, 2, ....

It is clear that e^(k) -> 0 if the absolute values of all eigenvalues are less than 1, or equivalently, if the largest absolute value of all eigenvalues of G is less than one.
Definition 4.3 The spectral radius of a matrix A ∈ R^{n×n} is defined by

    ρ(A) := max{ |λ| : λ eigenvalue of A }.
One can show that the characterization above is generally valid and also necessary for convergence:
Theorem 4.4 Every iterative method of the form

    x^(k+1) = G x^(k) + c        (cf. (4.12))

with iteration matrix G converges for all initial values x^(0) ∈ R^n to the solution of Ax = b if and only if ρ(G) < 1.
From Theorem 4.4 we can conclude:

Corollary 4.5 Let ||·||_p := ||·||_{p,p} be the (by the vector norm ||·||_p) induced matrix norm. If ||G||_p < 1 then the iterative method (4.12) converges for all initial values x^(0) ∈ R^n to the solution of Ax = b.

Proof:
Let λ be an arbitrary eigenvalue of the iteration matrix G with corresponding eigenvector v ∈ R^n. With the help of the definition of the matrix norm ||·||_p (cp. (4.2)) it follows

    ||G||_p = max_{0 ≠ y ∈ R^n} ||G y||_p / ||y||_p >= ||G v||_p / ||v||_p = |λ| ||v||_p / ||v||_p = |λ|.

Then by Definition 4.3 it holds also ρ(G) <= ||G||_p. Due to the assumption ||G||_p < 1 it follows ρ(G) < 1 and therefore the convergence of (4.12) for an arbitrary initial guess x^(0) by Theorem 4.4.

By Corollary 4.5 it is sufficient to show ||G||_p < 1 for only one matrix norm to ensure convergence of the iterative method.
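For illustration (a small sketch added here, not part of the original notes), the iteration matrices of the Jacobi and Gauss-Seidel methods and their spectral radii can be computed directly; the test matrix below is chosen arbitrarily.

import numpy as np

def iteration_matrices(A):
    """Iteration matrices for the splitting A = L + D + U used above."""
    D = np.diag(np.diag(A))
    L = np.tril(A, -1)
    U = np.triu(A, 1)
    G_jacobi = -np.linalg.solve(D, L + U)        # -D^{-1}(L + U)
    G_gs     = -np.linalg.solve(D + L, U)        # -(D + L)^{-1} U
    return G_jacobi, G_gs

def spectral_radius(G):
    return max(abs(np.linalg.eigvals(G)))

A = np.array([[4., 1, 0], [1, 4, 1], [0, 1, 4]])
for G in iteration_matrices(A):
    print(spectral_radius(G))    # both values are < 1, so both methods converge for this A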
At the end of this section we state some convergence criteria for the Jacobi and the three Gauss-Seidel methods. These criteria have the advantage that they are based on the given matrix A and not on the iteration matrix G. For the first criterion we start with the following definition.

Definition 4.6 A matrix A ∈ R^{n×n} is said to be strictly row diagonally dominant if and only if

    sum_{j=1, j≠i}^n |a_ij| < |a_ii|      for i = 1, ..., n.

A matrix A ∈ R^{n×n} is said to be strictly column diagonally dominant if and only if

    sum_{i=1, i≠j}^n |a_ij| < |a_jj|      for j = 1, ..., n.

Theorem 4.7 If the matrix A is strictly row or strictly column diagonally dominant, then the Jacobi and all Gauss-Seidel methods converge for all initial values x^(0) ∈ R^n.

Proof:
For simplicity we prove the statement only for the Jacobi method. A simple calculation leads to

    G := -D^{-1}(L + U) = [      0       -a_12/a_11   ...   -a_1n/a_11 ]
                          [ -a_21/a_22        0       ...   -a_2n/a_22 ]
                          [      :                     .         :     ]
                          [ -a_n1/a_nn   -a_n2/a_nn   ...        0     ]

for the Jacobi method. Thus we obtain

    ||G||_∞ = max_{i=1,...,n} sum_{j=1}^n |g_ij|
            = max{ (1/|a_11|) sum_{j≠1} |a_1j|, (1/|a_22|) sum_{j≠2} |a_2j|, ..., (1/|a_nn|) sum_{j≠n} |a_nj| }
            = max_{i=1,...,n} (1/|a_ii|) sum_{j=1, j≠i}^n |a_ij| < 1,

due to the assumed strict row diagonal dominance of A. By Corollary 4.5 we obtain the convergence of the Jacobi method.
If the matrix A is strictly column diagonally dominant one gets analogously

    ||(L + U) D^{-1}||_1 = max_{j=1,...,n} sum_{i=1}^n |((L + U) D^{-1})_ij| = max_{j=1,...,n} (1/|a_jj|) sum_{i=1, i≠j}^n |a_ij| < 1.

Since ||A||_1 = ||A^T||_∞ for all matrices A it follows 1 > ||(L + U) D^{-1}||_1 = ||D^{-1}(L + U)^T||_∞. Now -D^{-1}(L + U)^T is the Jacobi iteration matrix corresponding to A^T x = b. Again by Corollary 4.5 the Jacobi method converges for A^T. By Theorem 4.4 it must hold ρ(D^{-1}(L + U)^T) < 1. Since the eigenvalues of A and A^T, and the eigenvalues of AB and BA, are equal for arbitrary matrices A, B ∈ R^{n×n}, it follows

    1 > ρ(D^{-1}(L + U)^T) = ρ((L + U) D^{-1}) = ρ(D^{-1}(L + U)) = ρ(G),

i.e. by Theorem 4.4 the convergence of the Jacobi method for the matrix A.

The next two convergence criteria are based on the positive definiteness of A:

Theorem 4.8 Let A ∈ R^{n×n} be symmetric positive definite. Then the Gauss-Seidel, the backward Gauss-Seidel and the symmetric Gauss-Seidel methods converge for arbitrary initial values x^(0) ∈ R^n.

Note that the symmetry and positive definiteness of A is not sufficient for convergence of the Jacobi method. For this method one needs an additional condition:

Theorem 4.9 Let A = L + D + U ∈ R^{n×n} be symmetric positive definite. Then the Jacobi method converges for arbitrary initial values x^(0) ∈ R^n if and only if the matrix D - L - U is also symmetric positive definite.

4.3 Steepest descent and conjugate gradient method

In this section we consider two iterative methods, the steepest descent and the conjugate gradient method. The steepest descent method is a very simple method and can be used to solve unconstrained minimization problems for real-valued functions. We will use this method for the solution of linear systems of equations Ax = b under the assumption that A ∈ R^{n×n} is a symmetric positive definite matrix. However, we will see exemplarily that this method often converges quite slowly. In contrast, the so-called conjugate gradient method needs at most n iterations using exact arithmetic to achieve the correct solution. This method is therefore efficient and often used in applications.
Before we consider the methods we define a function f : R^n -> R by

    f(x) = (1/2) x^T A x - x^T b                                                   (4.31)

for given symmetric positive definite A ∈ R^{n×n} and given b ∈ R^n. In the following lemma we show that the minimization of f is equivalent to solving Ax = b.

Lemma 4.10 x* ∈ R^n is a solution of Ax = b if and only if x* minimizes f, i.e. f(x*) = min_{x ∈ R^n} f(x).

Proof:
"=>": We have to show that f(x*) <= f(x* + h) for each h ∈ R^n.

    f(x* + h) = (1/2)(x* + h)^T A (x* + h) - (x* + h)^T b
              = (1/2)(x*)^T A x* + (1/2) h^T A x* + (1/2)(x*)^T A h + (1/2) h^T A h - (x*)^T b - h^T b
              = f(x*) + h^T A x* - h^T b + (1/2) h^T A h          [ (1/2) h^T A h > 0 for h ≠ 0 ]
              >= f(x*) + h^T (A x* - b) = f(x*).

"<=": Since x* minimizes f it must hold ∇f(x*) = 0 (necessary condition). Due to ∇f(x) = A x - b we obtain ∇f(x*) = 0 if and only if A x* = b, i.e. x* solves the problem Ax = b.



4.3.1 The steepest descent method

The idea of the steepest descent method is quite simple. Starting from an initial guess x^(0) ∈ R^n we seek successively search directions d^(k) ∈ R^n and α_k ∈ R such that the new iterates

    x^(k+1) := x^(k) + α_k d^(k),     k = 0, 1, 2, ...

become closer to x*. A natural approach for the search direction is to use the steepest descent

    d^(k) = -∇f(x^(k)) = b - A x^(k).

With the residuals r^(k) := b - A x^(k) follows d^(k) = r^(k).
For given x^(k) and d^(k) we determine α_k such that f(x^(k+1)) = f(x^(k) + α_k d^(k)) is minimal, i.e. we minimize f along the straight line x^(k) + α d^(k) with respect to α. Using the chain rule leads to the necessary condition

    ∂/∂α_k f(x^(k) + α_k d^(k)) = ∇f(x^(k) + α_k d^(k))^T d^(k)
                                = (A(x^(k) + α_k d^(k)) - b)^T d^(k)
                                = -(d^(k))^T d^(k) + α_k (d^(k))^T A d^(k)  =!  0.

This condition is equivalent to

    α_k = (d^(k))^T d^(k) / (d^(k))^T A d^(k) = (r^(k))^T r^(k) / (r^(k))^T A r^(k).

The second derivative ∂²/∂α_k² f(x^(k) + α_k d^(k)) = (d^(k))^T A d^(k) > 0 for d^(k) ≠ 0 shows us that we have really found a minimum. We will see in the exercises that the convergence of Algorithm 1 can

Algorithm 1 Steepest descent method
Require: initial guess x^(0) ∈ R^n and tolerance tol
  Set k = 0 and determine r^(0) = b - A x^(0);
  while ||r^(k)|| > tol do
      Determine α_k = (r^(k))^T r^(k) / (r^(k))^T A r^(k);
      Set x^(k+1) = x^(k) + α_k r^(k);
      Set k = k + 1 and determine r^(k) = b - A x^(k);
  end while
become very slow. Moreover, one can prove the error estimate

    ||x^(k) - x*||_A <= ( (κ_2(A) - 1) / (κ_2(A) + 1) )^k ||x^(0) - x*||_A,          (4.32)

where ||x||_A := sqrt(x^T A x) denotes the so-called energy norm on R^n for a given symmetric positive definite matrix A ∈ R^{n×n}. If the condition number κ_2(A) is large, the fraction

    (κ_2(A) - 1) / (κ_2(A) + 1) = (1 - 1/κ_2(A)) / (1 + 1/κ_2(A))

tends to 1 and the estimate (4.32) becomes worse.
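A direct transcription of Algorithm 1 into NumPy could look as follows (a sketch added here; the iteration counter and the maxit guard are additions, not part of the algorithm as stated above):

import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, maxit=10_000):
    """Sketch of Algorithm 1 for symmetric positive definite A."""
    x = np.asarray(x0, dtype=float).copy()
    r = b - A @ x
    k = 0
    while np.linalg.norm(r) > tol and k < maxit:
        alpha = (r @ r) / (r @ (A @ r))
        x = x + alpha * r
        r = b - A @ x
        k += 1
    return x, k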


The conjugate gradient method, which we will consider in the following, is in general more
suitable, since we will see that x is obtained after at most n iterations with exact arithmetic
equal if the condition number is large or not. Due to this property the conjugate gradient method
is also regarded as direct solver.

4.3.2 The conjugate gradient method

The idea of the conjugate gradient method is to use a set of so-called A-conjugated vectors as search directions instead of the residuals r^(k).

Definition 4.11 Let A ∈ R^{n×n} be symmetric positive definite. A set of vectors {p^(0), p^(1), ..., p^(l)} with 0 ≠ p^(j) ∈ R^n is called conjugated with respect to A (or A-conjugated), if

    (p^(i))^T A p^(j) = 0   for i ≠ j.                                              (4.33)

In the following we firstly assume that such a set {p^(0), p^(1), ..., p^(l)} is given. Based on an old approximation x^(k) we set our new approximation again as

    x^(k+1) = x^(k) + α_k p^(k),     k = 0, ..., l,                                 (4.34)

similar to the steepest descent method. Starting with a given initial guess x^(0) ∈ R^n this leads to

    x^(k+1) = x^(k) + α_k p^(k) = ... = x^(0) + α_k p^(k) + ... + α_0 p^(0) = x^(0) + sum_{i=0}^k α_i p^(i),     k = 0, ..., l.     (4.35)
We seek now real numbers α_0, ..., α_k such that

    f(x^(k+1)) = f( x^(0) + sum_{i=0}^k α_i p^(i) ) =: g(α_0, ..., α_k)

is minimized.
As necessary condition for a minimum of g we obtain

    ∂/∂α_j g(α_0, ..., α_k) = ∇f(x^(k+1))^T ∂x^(k+1)/∂α_j = ( A x^(k+1) - b )^T p^(j)
                            = ( A ( x^(0) + sum_{i=0}^k α_i p^(i) ) - b )^T p^(j)
                            = (A x^(0) - b)^T p^(j) + sum_{i=0}^k α_i (p^(i))^T A p^(j)
                            = (p^(j))^T (A x^(0) - b) + α_j (p^(j))^T A p^(j)  =!  0

for j = 0, 1, ..., k. This is equivalent to

    α_j = (p^(j))^T (b - A x^(0)) / (p^(j))^T A p^(j),     j = 0, 1, ..., k.        (4.36)

Since the vectors p^(j), j = 0, ..., k, are A-conjugated we obtain for fixed j ∈ {1, ..., k} the relation

    (p^(j))^T A x^(j) = (p^(j))^T A ( x^(0) + sum_{i=0}^{j-1} α_i p^(i) ) = (p^(j))^T A x^(0).     (4.37)

Inserting it into (4.36) leads immediately to the expression

    α_j = (p^(j))^T (b - A x^(j)) / (p^(j))^T A p^(j) = (p^(j))^T r^(j) / (p^(j))^T A p^(j),     j = 0, 1, ..., k.     (4.38)

Note that due to H_ij = ∂²/∂α_i ∂α_j g(α_0, ..., α_k) = (p^(j))^T A p^(j) δ_ij the Hessian matrix H is positive definite and therefore g has a minimum at (α_0, ..., α_k) with α_j given via (4.38).


Lemma 4.12 For the residual r^(k) = b - A x^(k) it holds

    (p^(j))^T r^(k) = 0   for j = 0, ..., k - 1.

Proof:
By construction of the α_j we know that f(x^(k)) = g(α_0, ..., α_{k-1}) is minimized, i.e. ∂/∂α_j g(α_0, ..., α_{k-1}) = 0 for j = 0, ..., k - 1. With this property we get

    0 = ∂/∂α_j g(α_0, ..., α_{k-1}) = ( A x^(k) - b )^T p^(j) = -(r^(k))^T p^(j) = -(p^(j))^T r^(k)

for j = 0, ..., k - 1.

We come now to the construction of a set of A-conjugated vectors. We set p^(0) := r^(0) = b - A x^(0) and

    p^(j+1) := r^(j+1) + β_{j+1} p^(j),     j = 0, 1, 2, ...,                       (4.39)

depending on the old search direction p^(j). We determine β_{j+1} such that p^(j) and p^(j+1) are A-conjugated, i.e.

    0  =!  (p^(j))^T A p^(j+1) = (p^(j))^T A r^(j+1) + β_{j+1} (p^(j))^T A p^(j).   (4.40)

Obviously this is equivalent to

    β_{j+1} = - (p^(j))^T A r^(j+1) / (p^(j))^T A p^(j).                            (4.41)

With this choice, equation (4.38) and Lemma 4.12 we get

    α_j = (p^(j))^T r^(j) / (p^(j))^T A p^(j) = ( r^(j) + β_j p^(j-1) )^T r^(j) / (p^(j))^T A p^(j) = (r^(j))^T r^(j) / (p^(j))^T A p^(j)     (4.42)

for j = 1, 2, .... Due to the choice p^(0) = r^(0) it also holds α_0 = (p^(0))^T r^(0) / (p^(0))^T A p^(0) = (r^(0))^T r^(0) / (p^(0))^T A p^(0).

We can also determine the residuals recursively as

    r^(k+1) = b - A x^(k+1) = b - A (x^(k) + α_k p^(k)) = r^(k) - α_k A p^(k).      (4.43)

With these choices we obtain the following lemma:

Lemma 4.13 For the residuals r^(k) and the search directions p^(k) it holds

    (r^(k))^T r^(j) = 0,       j = 0, 1, ..., k - 1,
    (p^(k))^T A p^(j) = 0,     j = 0, 1, ..., k - 1.

Proof:
will be done with mathematical induction over k in Exercise 41.

Provided that p^(i) ≠ 0 for i = 0, ..., l, Lemma 4.13 implies that the set {p^(0), ..., p^(l)} is A-conjugated. Moreover we obtain the following corollary:

Corollary 4.14 If {p^(0), ..., p^(l)} is A-conjugated, then the vectors p^(i), i = 0, ..., l, are linearly independent.

Proof:
We start with sum_{i=0}^l λ_i p^(i) = 0 and have to show that λ_i = 0 for all i = 0, ..., l. Multiplying the equation from the left with (p^(j))^T A for j = 0, ..., l leads to

    0 = sum_{i=0}^l λ_i (p^(j))^T A p^(i) = λ_j (p^(j))^T A p^(j)   =>   λ_j = 0,

since (p^(j))^T A p^(j) > 0.

With the help of Lemma 4.13 and (4.41) one can also obtain (cf. Exercise 41) the representation

    β_{j+1} = (r^(j+1))^T r^(j+1) / (r^(j))^T r^(j),     j = 0, 1, 2, ....           (4.44)

Combining (4.42), (4.34), (4.43), (4.44) and (4.39) results in the conjugate gradient algorithm:
Combining (4.42), (4.34), (4.43), (4.44) and (4.39) results in the conjugate gradient algorithm:
Algorithm 2 Conjugate gradient method
Require: initial guess x^(0) ∈ R^n and tolerance tol
  Set k = 0, determine r^(0) = b - A x^(0) and set p^(0) = r^(0);
  while ||r^(k)|| > tol do
      Determine α_k = (r^(k))^T r^(k) / (p^(k))^T A p^(k);
      Set x^(k+1) = x^(k) + α_k p^(k) and r^(k+1) = r^(k) - α_k A p^(k);
      Set β_{k+1} = (r^(k+1))^T r^(k+1) / (r^(k))^T r^(k) and p^(k+1) = r^(k+1) + β_{k+1} p^(k);
      Set k = k + 1;
  end while
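Algorithm 2 can be transcribed almost line by line into NumPy (a sketch added here for illustration; the cap of n iterations reflects the exact-arithmetic property discussed below):

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Sketch of Algorithm 2 for symmetric positive definite A."""
    x = np.asarray(x0, dtype=float).copy()
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for _ in range(len(b)):                  # at most n steps in exact arithmetic
        if np.sqrt(rs_old) <= tol:
            break
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x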
The essential computational costs per iteration step are one matrix-vector multiplication, three vector-vector products, three multiplications with a scalar and three additions of vectors.

Theorem 4.15 Let A ∈ R^{n×n} be symmetric positive definite and {p^(0), ..., p^(n-1)} A-conjugated. Then the exact solution of Ax = b for given b ∈ R^n is obtained after n steps using exact arithmetic.

Proof: By Lemma 4.12 we know for k = n that

    (r^(n))^T p^(j) = 0   for j = 0, ..., n - 1.

Since {p^(0), ..., p^(n-1)} is linearly independent by Corollary 4.14 and thus a basis of R^n, we can conclude (r^(n))^T e_i = 0 <=> (A x^(n) - b)^T e_i = 0 for i = 1, ..., n. This leads to A x^(n) = b, i.e. x^(n) solves Ax = b.

It may happen that A x^(k) = b for k < n in the conjugate gradient method. This happens exactly if r^(k) = 0 respectively p^(k) = 0 during the algorithm. Theorem 4.15 means in particular that the conjugate gradient method needs at most n iterations to obtain the exact solution of Ax = b, provided that rounding errors are excluded.
Moreover, similar to (4.32) one can prove

    ||x^(k) - x*||_A <= 2 ( (sqrt(κ_2(A)) - 1) / (sqrt(κ_2(A)) + 1) )^k ||x^(0) - x*||_A,     (4.45)

i.e. compared to (4.32) only the square root of the condition number κ_2(A) is involved in this estimate.


Chapter 5

Iterative Methods for Nonlinear Equation Systems
Nonlinear equation systems arise in a variety of applications, e.g. elastic or plastic deformation processes in engineering. For the solution of such (nonlinear) problems one is again interested in iterative numerical methods. In this chapter we consider the so-called Banach fixed point theorem and afterwards Newton's method, which is often used in practice.
Before we state the Banach fixed point theorem, we have to repeat some important mathematical terms. In the whole chapter we assume that V is a normed space with norm ||·||_V, e.g. R^n with any vector norm of Section 4.1.
Definition 5.1 A sequence (a_n)_{n∈N} in V is convergent to a ∈ V, if

    for all ε > 0 there exists N := N(ε) ∈ N such that ||a_n - a||_V < ε for n >= N.

In this case we write lim_{n->∞} ||a_n - a||_V = 0 or lim_{n->∞} a_n = a.
A sequence (a_n)_{n∈N} in V is called a Cauchy sequence if

    lim_{m,n->∞} ||a_m - a_n||_V = 0,  respectively
    for all ε > 0 there exists N := N(ε) ∈ N such that ||a_m - a_n||_V < ε for m, n >= N.

If every Cauchy sequence converges in V, V is called complete. A complete normed space is called a Banach space (e.g. R^n).
Definition 5.2 A set E in a normed space V is called open, if

    for all v ∈ E there exists ε := ε(v) > 0 such that for all w ∈ V with ||v - w|| < ε follows w ∈ E.

A set E ⊂ V is called closed, if V \ E is open.
Examples for open sets are the open intervals (a, b) ⊂ R and the cartesian product (a_1, b_1) × ... × (a_n, b_n) ⊂ R^n of open intervals. Similarly, examples for closed sets are the closed intervals [a, b] ⊂ R and the cartesian product [a_1, b_1] × ... × [a_n, b_n] ⊂ R^n of closed intervals. Moreover, one can prove that a normed space V (and in particular a Banach space V) itself is open and closed.

Lemma 5.3 Let A ⊂ V be closed and V a Banach space. Then A is complete.

Proof:
We have to show that every Cauchy sequence in A converges. Thus let (a_n)_{n∈N} be a Cauchy sequence in A. By assumption it holds (a_n)_{n∈N} ⊂ V. Since V is a Banach space there exists a ∈ V with lim_{n->∞} a_n = a. Since A is closed, we have a ∈ A.

5.1 Fixed point iterations

Let Φ : R^n -> R^n be given. We seek x* ∈ R^n with Φ(x*) = x*, so-called fixed points of Φ. The following theorem is called the Banach fixed point theorem and ensures the existence of a unique fixed point x* of Φ under some assumptions.

Theorem 5.4 Let E ⊂ R^n be a closed set and Φ : E -> E a contractive mapping, i.e. there exists L < 1 with

    ||Φ(x) - Φ(y)|| <= L ||x - y||     for all x, y ∈ E.

Then it holds:
a) There exists a unique fixed point x* of Φ in E, i.e. x* ∈ E with Φ(x*) = x*.
b) For every initial guess x^(0) ∈ E the fixed point iteration

    x^(k+1) := Φ(x^(k)),     k = 0, 1, 2, ...,                                      (5.1)

converges to x*.
c) The error estimates

    ||x^(k) - x*|| <= L/(1 - L) ||x^(k) - x^(k-1)||      ("a posteriori"),           (5.2)
    ||x^(k) - x*|| <= L^k/(1 - L) ||x^(1) - x^(0)||      ("a priori")                (5.3)

are valid for k = 1, 2, ....


Proof:
Existence: Let x^(0) ∈ E be arbitrary. For fixed k = 1, 2, ... and arbitrary m ∈ N we get

    ||x^(j+1) - x^(j)|| <= L^{j-k+1} ||x^(k) - x^(k-1)||                             (5.4)

for all j = k, ..., k + m - 1, due to (5.1) and the property that Φ is contractive. In particular we get for j = k the inequality

    ||x^(k+1) - x^(k)|| <= L ||x^(k) - x^(k-1)|| <= ... <= L^k ||x^(1) - x^(0)||     (5.5)

for k = 1, 2, ....
With the help of the triangle inequality and the geometric series this implies

    ||x^(k+m) - x^(k)|| = || sum_{i=1}^m (x^(k+i) - x^(k+i-1)) || <= sum_{i=1}^m ||x^(k+i) - x^(k+i-1)||
                        = sum_{j=k}^{k+m-1} ||x^(j+1) - x^(j)|| <= ||x^(k) - x^(k-1)|| sum_{j=k}^{k+m-1} L^{j-k+1}
                        = ||x^(k) - x^(k-1)|| sum_{j=0}^{m-1} L^{j+1} = L ||x^(k) - x^(k-1)|| sum_{j=0}^{m-1} L^j     (5.6)
                        = L (1 - L^m)/(1 - L) ||x^(k) - x^(k-1)|| <= L^k (1 - L^m)/(1 - L) ||x^(1) - x^(0)|| -> 0

for k -> ∞. This means that (x^(k))_{k∈N} is a Cauchy sequence in E. Since E is closed in the Banach space R^n the limit x* = lim_{k->∞} x^(k) ∈ E exists by Lemma 5.3.


Since Φ is a contractive mapping by assumption (i.e. in particular Lipschitz continuous and thus continuous) it holds

    x* = lim_{k->∞} x^(k) = lim_{k->∞} Φ(x^(k-1)) = Φ( lim_{k->∞} x^(k-1) ) = Φ(x*).

Uniqueness: Let x*, y* ∈ E be two fixed points of Φ with x* ≠ y*. Then

    ||x* - y*|| = ||Φ(x*) - Φ(y*)|| <= L ||x* - y*||,

which implies L >= 1 and contradicts the assumption.
Error estimates: Using (5.6) for m -> ∞ leads on the one hand to

    ||x^(k) - x*|| = ||x* - x^(k)|| = lim_{m->∞} ||x^(k+m) - x^(k)|| <= L/(1 - L) ||x^(k) - x^(k-1)||,     k = 1, 2, ...,

and on the other hand to

    ||x^(k) - x*|| = ||x* - x^(k)|| = lim_{m->∞} ||x^(k+m) - x^(k)|| <= L^k/(1 - L) ||x^(1) - x^(0)||,     k = 1, 2, ....

Lemma 5.5 Let E ⊂ R^n be a convex set, i.e. x, y ∈ E implies tx + (1-t)y ∈ E for all t ∈ [0, 1], and the function Φ : E -> R^n continuously differentiable. Furthermore let L := sup_{x∈E} ||Φ'(x)|| < 1, where ||·|| denotes an induced matrix norm. Then Φ is a contractive mapping.

Proof: Let x, y ∈ E and t ∈ [0, 1]. We define f(t) := Φ(tx + (1-t)y) and obtain with the chain rule

    f'(t) = Φ'(tx + (1-t)y) d/dt (tx + (1-t)y) = Φ'(tx + (1-t)y)(x - y).             (5.7)

We use the fundamental theorem of calculus to obtain

    ||Φ(x) - Φ(y)|| = ||f(1) - f(0)|| = || ∫_0^1 f'(t) dt || <= sup_{t∈[0,1]} ||f'(t)||
                    = sup_{t∈[0,1]} ||Φ'(tx + (1-t)y)(x - y)||
                    <= sup_{t∈[0,1]} ||Φ'(tx + (1-t)y)|| ||x - y||
                    <= sup_{z∈E} ||Φ'(z)|| ||x - y|| = L ||x - y||,

i.e. Φ is a contractive mapping.
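A minimal sketch of the fixed point iteration (5.1) with the a posteriori stopping rule (5.2), added here for illustration: the function, the contraction constant L and the example mapping Φ(x) = cos(x) on [0, 1] (with L = sin(1) < 1) are choices made for this sketch.

import numpy as np

def fixed_point_iteration(phi, x0, L, tol=1e-12, maxit=100):
    """Banach fixed point iteration; phi must be contractive with constant L < 1."""
    x_old = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        x_new = phi(x_old)
        # a posteriori estimate (5.2): L/(1-L) * ||x^(k) - x^(k-1)|| bounds the error
        if L / (1.0 - L) * np.linalg.norm(x_new - x_old) < tol:
            return x_new
        x_old = x_new
    return x_old

print(fixed_point_iteration(np.cos, 0.5, L=np.sin(1.0)))   # fixed point of cos near 0.739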




5.2 Newton's method

In the following we consider Newton's method which is of great importance in practice. Let f : R^n -> R^n be a continuously differentiable function. The aim of Newton's method is to find roots of f, i.e. we seek x* ∈ R^n with f(x*) = 0.
We start with the case n = 1 as motivation: By Taylor expansion (of order 1) about the point x^(k) we obtain

    f(x) ≈ f(x^(k)) + f'(x^(k))(x - x^(k))                                           (5.8)

in a (small) neighborhood of x^(k). The right-hand side of (5.8) is the tangent line of f at x^(k). The idea of Newton's method is now to use the root of (5.8) as new approximation, which leads to

    x^(k+1) = x^(k) - f(x^(k)) / f'(x^(k)),                                          (5.9)

provided that f'(x^(k)) ≠ 0.


An illustration of Newton's method in the one-dimensional case is depicted in Figure 5.1 for the function f(x) = ln(x) (black curve). Here the initial guess x^(0) = 0.2 was chosen. One can observe that the method converges to the exact root x* = 1 of f.

[Figure 5.1: Newton's method for f(x) = ln(x) and x^(0) = 0.2]

For the general n-dimensional case we use analogously the root of the Taylor expansion (of order 1) as new iteration. This leads to

    f'(x^(k)) δ^(k) = -f(x^(k))   with   δ^(k) := x^(k+1) - x^(k),                   (5.10)


where f'(x^(k)) ∈ R^{n×n} now denotes the Jacobi matrix of f in x^(k). If f'(x^(k)) is regular, we obtain a unique δ^(k) ∈ R^n by solving the linear system of equations (5.10). Afterwards we get the new approximation as x^(k+1) = x^(k) + δ^(k).

Algorithm 3 Newton's method
Require: continuously differentiable function f : R^n -> R^n, initial guess x^(0) ∈ R^n and tolerance tol
  Set k = 0;
  while ||f(x^(k))|| > tol do
      Solve f'(x^(k)) δ^(k) = -f(x^(k));
      Set x^(k+1) = x^(k) + δ^(k) and k = k + 1;
  end while
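A sketch of Algorithm 3 in NumPy (added here for illustration; fprime must return the Jacobi matrix, the maxit guard is an addition, and the one-dimensional test case is the example of Figure 5.1):

import numpy as np

def newton(f, fprime, x0, tol=1e-12, maxit=50):
    """Sketch of Algorithm 3: undamped Newton method for f: R^n -> R^n."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    for _ in range(maxit):
        fx = np.atleast_1d(f(x))
        if np.linalg.norm(fx) <= tol:
            break
        delta = np.linalg.solve(np.atleast_2d(fprime(x)), -fx)   # f'(x) delta = -f(x)
        x = x + delta
    return x

print(newton(lambda x: np.log(x), lambda x: 1.0 / x, x0=0.2))    # converges to the root x* = 1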

Remark 5.6 Applying Newton's method to f(x) = Ax - b for given A ∈ R^{n×n} and b ∈ R^n leads with f'(x) = A to A δ^(0) = -(A x^(0) - b) and x^(1) = x^(0) + δ^(0) in the first Newton step (k = 0). Due to

    f(x^(1)) = A x^(1) - b = A x^(0) + A δ^(0) - b = A x^(0) - (A x^(0) - b) - b = 0,

x^(1) is a root of f, i.e. we have obtained the exact solution of Ax = b after one Newton step.

In the following we prove that Newton's method converges (locally quadratically) to a root x* of f, if one chooses the initial guess x^(0) sufficiently close to x*.
Before we prove this convergence result, we state that Newton's method

    x^(k+1) = x^(k) - (f'(x^(k)))^{-1} f(x^(k))                                      (5.11)

is equivalent to the fixed point iteration x^(k+1) = Φ(x^(k)) with

    Φ(x) := x - (f'(x))^{-1} f(x).                                                   (5.12)

Definition 5.7 Let (x_k)_{k∈N} be a sequence in a normed space V converging to x* ∈ V. We call the sequence convergent with order p >= 1, if there exists a positive constant C > 0 such that

    ||x_{k+1} - x*||_V <= C ||x_k - x*||_V^p     for all k ∈ N.

In the case p = 1 we require C < 1 for convergence and say that the sequence converges linearly. For p = 2 we say that the sequence converges quadratically.
Theorem 5.8 Let D ⊂ R^n be open and convex, ||·|| a norm on R^n (with induced matrix norm ||·||), f ∈ C^1(D, R^n), f'(x) invertible for all x ∈ D, ||(f'(x))^{-1}|| <= β for all x ∈ D and f' Lipschitz continuous in D, i.e. there exists γ > 0 with

    ||f'(x) - f'(y)|| <= γ ||x - y||     for all x, y ∈ D.                           (5.13)

Moreover we assume that there exists x* with f(x*) = 0 and choose ρ > 0 and x^(0) ∈ E := K_ρ(x*) := {x ∈ R^n : ||x* - x|| <= ρ} with E ⊂ D, (βγρ)/2 <= 1/2 and βγρ + 2β²γρ (2γρ + ||f'(x^(0))||) < 1.

Then the sequence of Newton iterates (x^(k)) converges to x* and it holds

    ||x^(k+1) - x*|| <= (βγ/2) ||x^(k) - x*||²,                                       (5.14)

i.e. the sequence converges quadratically.


Proof:
We know that Newton's method can be described by x^(k+1) = Φ(x^(k)) with Φ given by (5.12). The idea of the proof is now to show that the conditions of Theorem 5.4 are satisfied. Then the sequence (x^(k)) converges to x*.
We also know that E is (as a closed ball about x*) a closed set. It remains to show that Φ maps from E to E and is contractive.
As first intermediate result we prove that

    ||f(y) - f(x) - f'(x)(y - x)|| <= (γ/2) ||y - x||²     for all x, y ∈ D.         (5.15)

We define g(t) = f(ty + (1-t)x) for t ∈ [0, 1] and obtain similar to (5.7) the derivative g'(t) = f'(ty + (1-t)x)(y - x). We use again the fundamental theorem of calculus and obtain

    f(y) - f(x) = g(1) - g(0) = ∫_0^1 g'(t) dt = ∫_0^1 f'(ty + (1-t)x)(y - x) dt.

Subtracting f'(x)(y - x) = ∫_0^1 f'(x)(y - x) dt leads to

    f(y) - f(x) - f'(x)(y - x) = ∫_0^1 ( f'(ty + (1-t)x) - f'(x) )(y - x) dt.

Using the general inequality || ∫_a^b h(t) dt || <= ∫_a^b ||h(t)|| dt, valid for an arbitrary continuous function h : R -> R^n, and the assumption (5.13) leads to

    ||f(y) - f(x) - f'(x)(y - x)|| = || ∫_0^1 ( f'(ty + (1-t)x) - f'(x) )(y - x) dt ||
                                   <= ∫_0^1 || ( f'(ty + (1-t)x) - f'(x) )(y - x) || dt
                                   <= ∫_0^1 ||f'(ty + (1-t)x) - f'(x)|| ||y - x|| dt
                                   <= ∫_0^1 γ t ||y - x||² dt = (γ/2) ||y - x||²,

i.e. equation (5.15) is proven.

As next step we show that Φ maps from E to E and that Φ is contractive for all x, y ∈ E ⊂ D: We use again (5.12) to obtain

    Φ(x) - Φ(y) = x - (f'(x))^{-1} f(x) - ( y - (f'(y))^{-1} f(y) )
                = x - y - (f'(x))^{-1} f(x) + (f'(y))^{-1} f(y)
                = (f'(x))^{-1} ( f'(x)(x - y) - f(x) + f(y) ) + ( (f'(y))^{-1} - (f'(x))^{-1} ) f(y)
                = (f'(x))^{-1} ( f(y) - f(x) - f'(x)(y - x) ) + (f'(y))^{-1} ( f'(x) - f'(y) ) (f'(x))^{-1} f(y).

We use the triangle inequality, the properties of induced matrix norms, equation (5.15) and the assumptions to conclude

    ||Φ(x) - Φ(y)|| <= ||(f'(x))^{-1} ( f(y) - f(x) - f'(x)(y - x) )|| + ||(f'(y))^{-1} ( f'(x) - f'(y) ) (f'(x))^{-1} f(y)||
                    <= ||(f'(x))^{-1}|| ||f(y) - f(x) - f'(x)(y - x)|| + ||(f'(y))^{-1}|| ||f'(x) - f'(y)|| ||(f'(x))^{-1}|| ||f(y)||
                    <= (βγ/2) ||y - x||² + β²γ ||f(y)|| ||y - x|| = ( (βγ/2) ||y - x|| + β²γ ||f(y)|| ) ||y - x||     (5.16)


for x, y ∈ E.
In particular we obtain for the choice y = x* (i.e. f(y) = f(x*) = 0) and arbitrary x ∈ E

    ||Φ(x) - x*|| = ||Φ(x) - Φ(x*)|| <= (βγ/2) ||x* - x||² <= (βγρ/2) ρ <= ρ/2 <= ρ,     (5.17)

i.e. Φ(x) ∈ E for x ∈ E.

For all x, y ∈ E we know additionally

    ||y - x|| = ||y - x* - (x - x*)|| <= ||y - x*|| + ||x - x*|| <= 2ρ.

Equation (5.12) is equivalent to f(x) = f'(x)(x - Φ(x)). For y ∈ E and thus Φ(y) ∈ E this leads to

    ||f(y)|| = ||f'(y)(y - Φ(y))|| <= ||f'(y)|| ||y - Φ(y)||
             = ||f'(y) - f'(x^(0)) + f'(x^(0))|| ||y - Φ(y)||
             <= ( ||f'(y) - f'(x^(0))|| + ||f'(x^(0))|| ) ||y - Φ(y)||
             <= ( γ ||y - x^(0)|| + ||f'(x^(0))|| ) ||y - Φ(y)||
             <= 2ρ ( 2γρ + ||f'(x^(0))|| ).                                              (5.18)

Inserting (5.18) into (5.16) leads to

    ||Φ(x) - Φ(y)|| <= ( (βγ/2) ||y - x|| + β²γ ||f(y)|| ) ||y - x||
                    <= ( βγρ + 2β²γρ ( 2γρ + ||f'(x^(0))|| ) ) ||x - y||

for all x, y ∈ E, where the factor in front of ||x - y|| is smaller than 1 by assumption, i.e. we know that Φ is a contractive mapping. Thus all assumptions of the Banach fixed point theorem (Theorem 5.4) are satisfied, such that the sequence (x^(k)) of Newton iterates converges to x* in E. Inserting x^(k) as x in (5.17) leads directly to (5.14).

Theorem 5.8 is a local convergence result. The problem is that it is only valid if the initial guess x^(0) is sufficiently close to an exact root x* of f. Otherwise the method can diverge. For instance, applying Newton's method to the function f(x) = arctan(x) with x^(0) = 1.5 diverges (see Table 5.1).

    x^(0)     x^(1)     x^(2)     x^(3)     x^(4)        x^(5)             x^(6)
    1.5000   -1.6941    2.3211   -5.1141   32.2957   -1.5753 · 10^3    3.8950 · 10^6

        Table 5.1: Divergence of Newton's method for f(x) = arctan(x) with x^(0) = 1.5
A remedy for this problem can be reached through so-called globalization strategies, e.g. damping strategies or trust region methods. A damping strategy often used in practice is the line-search strategy, which leads to the following algorithm:

Algorithm 4 Newton method with line-search as damping strategy
Require: continuously differentiable function f : R^n -> R^n, initial guess x^(0) ∈ R^n, tolerance tol and small parameter μ ∈ (0, 0.5) (e.g. μ = 10^{-3})
  Set k = 0;
  while ||f(x^(k))|| > tol do
      Solve f'(x^(k)) δ^(k) = -f(x^(k));
      Set λ = 1;
      while ||f(x^(k) + λ δ^(k))|| > (1 - μλ) ||f(x^(k))|| do
          λ = λ/2;
      end while
      Set x^(k+1) = x^(k) + λ δ^(k) and k = k + 1;
  end while
Applying Algorithm 4 to f(x) = arctan(x) with x^(0) = 1.5 and the choice μ = 10^{-3} converges to the exact root x* = 0 after a few steps (see Table 5.2).

    x^(0)     x^(1)      x^(2)              x^(3)                x^(4)
    1.5000   -0.0970   6.0806 · 10^{-4}   -1.4988 · 10^{-10}      0

    Table 5.2: Convergence of Algorithm 4 for f(x) = arctan(x) with x^(0) = 1.5, μ = 10^{-3}
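A one-dimensional sketch of Algorithm 4 (added for illustration; the acceptance test below uses the damped decrease condition with the parameter μ as stated above, and the arctan example reproduces the behaviour of Table 5.2):

import numpy as np

def damped_newton(f, fprime, x0, tol=1e-12, mu=1e-3, maxit=50):
    """Sketch of Algorithm 4: Newton's method with line-search damping (n = 1)."""
    x = float(x0)
    for _ in range(maxit):
        fx = f(x)
        if abs(fx) <= tol:
            break
        d = -fx / fprime(x)                                  # Newton direction
        lam = 1.0
        while abs(f(x + lam * d)) > (1.0 - mu * lam) * abs(fx):
            lam /= 2.0                                       # halve the step until accepted
        x = x + lam * d
    return x

print(damped_newton(np.arctan, lambda x: 1.0 / (1.0 + x**2), x0=1.5))   # converges to 0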
