INM Script
INM Script
Fleurianne Bertrand
2015
original version of this lecture notes: Dr. Steffen Mnzenmaier, Dr. Benjamin Mller
Contents
1 Interpolation
1.1 Polynomial Interpolation . . . . . .
1.2 Spline Interpolation . . . . . . . .
1.2.1 Linear Spline Interpolation
1.2.2 Cubic Spline Interpolation .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
7
7
10
11
13
2 Numerical Integration
2.1 Newton-Cotes Formulas . . . . . . . . .
2.2 Gauss-Legendre Formulas . . . . . . . .
2.3 Composite Rules . . . . . . . . . . . . .
2.4 Multidimensional numerical integration .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
21
21
25
26
27
29
29
33
36
.
.
.
.
.
43
43
48
53
54
55
59
60
62
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
CONTENTS
Introduction
In many applications numerical methods are an essential tool to solve complicated problems. As
models tend to be more sophisticated there arise problems that can not be solved directly. And
even if direct methods are an alternative approach they typically fail to address problems that
arise in large-scale computations: Computational complexity and limitation of memory.
In this lecture we will have a close look at some very basic numerical methods for several
problems that arise in engineering applications. As a motivation we will consider the shallow
water equations that are used to model a free surface water flow under a hydrostatic pressure
assumption which is depicted in figure 1.
Let Rd1 be a domain with boundary = Du DH .
u
|u|u
+ (u) u + gH + gHb + cf
=f
t
H
H
+ (H) u + H( u) = 0
t
(1)
on Du
on DH
u(0, x) = uI (x) on
(2)
H(0, x) = HI (x) on
The problem is then to find a velocity u(t, x) and an elevation H(t, x) above sea ground that
satisfies equations (1) with the additional conditions (2). The additional parameters are the
gravity g, the bottom friction coefficient cf and sources/sinks of momentum f . Solving this
problem gives rise to several difficulties. First this problem can not be solved directly for general boundary conditions geometries. But as real-world applications tend to have complicated
geometries involved a numerical method is necessary. Therefore the need of a discretization of
the partial differential equation is needed. Numerical methods for the spatial discretization, as
"Finite Differences" or "Finite Elements", and the temporal discretization will be introduced in
Hb
CONTENTS
Advanced Numerical Methods. These Methods rely heavily on other numerical approaches
which are introduced in this lecture such as:
1. polynomial interpolation: The problem of finding preferably simple functions that interpolate general functions in a given set of points or fitting a set of given data. This
method is introduced in chapter 1. For the upper problem finding an appropriate basis,
which is piecewise polynomial, is a problem of polynomial interpolation.
2. As upper problem is non-linear and finite element methods / variational principles include
(potentially multi-dimensional) integration an efficient way to evaluate general integrals is
needed. These methods are summarized under the term numerical quadrature which is
introduced in chapter 2.
3. Many numerical methods result in the problem of solving a linear equation system of the
form Ax = b. Upper problem is no exception to that and results in several linear equations
systems which might be big but have very few entries that are not zero. The solution of
these problems can be very time and memory consuming. In chapter 3 we will get to know
direct and iterative solvers for linear equation systems.
4. If you solve these linear equations systems another question arises: How prone are your
problems to unavoidable errors in your given data? How do numerical errors propagate in
the used algorithm? These questions are questions of Condition and Stability and will
be approached in chapter 4.
5. Upper problem is obviously nonlinear which complicates the solution process considerably.
How nonlinearities can be adressed and the problem can be lead back to a sequence of
linear problems is the topic of chapter 5: "Nonlinear equation systems".
If we now apply the initial condition depicted in figure 2 and enforce some outflow through the
sea-ground on the right half we get the solution shown in figure 3.
H
Hb
Chapter 1
Interpolation
Problems of interpolation arise in many applications where one is looking for functions that
interpolate a given set of data. Examples are a given temperature for different times or the
spatial distribution of a pressure field where the data is only given at discrete points.
1.1
Polynomial Interpolation
The general problem in polynomial interpolation is to find a polynomial p(x) of degree n such
that for a set of data (xi , fi ) with i = 0, ...n it holds:
p(xi ) = fi
The space of polynomials of degree n is given by
Pn = {p | p(x) = cn xn + cn1 xn1 + + c1 x + c0 with ci R}
In this definition we do not assume ci 6= 0 and therefore every polynomial of degree n 1 is also a
polynomial of degree n. If one wants to represent/interpolate a function f (x) by an interpolating
polynomial of degree n the condition
p(xi ) = f (xi ) i = 0, ..., n
for a chosen set of interpolation points xi is used. To get the interpolation polynomial we make
use of the Lagrange-Polynomials. These are defined as the solution of the following problem:
Find Li (x) Pn such that
Li (xj ) = ij
with the Kronecker delta be defined by
(
1 i=j
ij =
0 i=
6 j
The solution of this problem is given by
n
Y
Li (x) =
j=0,j6=i
x xj
xi xj
n
X
fi Li (x).
(1.1)
i=0
Therefore the problem of polynomial interpolation has a solution. The next theorem states that
this solution is unique.
7
CHAPTER 1. INTERPOLATION
Theorem 1.1 Let n + 1 different points xi and values fi (i = 0..., n) be given. Then there exists
exactly one interpolation polynomial p Pn such that p(xi ) = fi .
Proof: As mentioned before (1.1) implies existence of the interpolating polynomial of degree
n. For uniqueness we assume p1 and p2 to be two polynomials of degree n with p1 (xi ) = p2 (xi ).
Therefore the polynomial of degree n given by q(x) = p1 (x) p2 (x) has n + 1 different roots
in xi . But the only polynomial of degree n with n + 1 roots is q(x) = 0 and we therefore have
p1 (x) = p2 (x).
To estimate the error we make use of the following polynomial of degree n + 1:
n+1
n
Y
=
(x xi )
i=0
Theorem 1.2 Let f be (n+1)-times continuously differentiable function, f C n+1 ([a, b]). Then
there exists for every x [a, b] a (a, b), such that for the interpolating polynomial p(x) with
the nodes x0 , x1 , ..., xn it holds:
f (x) p(x) =
Proof: Let x
[a, b] with x
6= xi i = 0, ..., n. Set
F (x) = f (x) p(x)
f (
x) p(
x)
n+1 (x)
n+1 (
x)
f (
x) p(
x)
n+1 (xi ) = 0
n+1 (
x) | {z }
=0
and
F (
x) = f (
x) p(
x)
f (
x) p(
x)
n+1 (
x) = 0
n+1 (
x)
By the theorem of Rolle the derivative of F, F, has at least n + 1 roots in (a, b). By induction it
follows that F (n+1) has at least one root in (a,b). That means that there exists a (a, b) such
that
f (
x) p(
x) (n+1)
0 = F (n+1) () = f (n+1) () p(n+1) ()
()
| {z }
n+1 (
x) | n+1
{z }
=0
= f (
x) p(
x) =
f (n+1) ()
(n + 1)!
=(n+1)!
n+1 (
x)
1
max |f (n+1) ()| |n+1 (x)|
(n + 1)! [a,b]
(1.2)
Using the Lagrange polynomials and equation (1.1) to evaluate an interpolating polynomial in
a point x is very expensive as a straightforward implementation would result in O(n2 ) operations
for every function evaluation. A more efficient evaluation can be achieved by using the barycentric
formula. First we see that for the Lagrange-polynomials it holds
n+1 (x)
Li (x) =
x xi
n
Y
n+1 (x)
1
=
wi
xi xj
x xi
j=0,j6=i
|
{z
}
:=wi
n
X
fi Li (x) =
i=0
n
X
fi
i=0
X
n+1 (x)
wi
wi = n+1 (x)
fi
x xi
x xi
i=0
Now it might be problematic to evaluatePthis formula very close to some node xi and therefore
wi
we can rewrite it, by using 1 = n+1 (x) ni=0 xx
:
i
Pn
i=0
p(x) = Pn
wi
fi xx
i
wi
i=0 xxi
After an initial computational cost of O(n2 ) operations for the calculation of the weights one
only needs O(n) operations for the evaluation of the polynomial at a point x.
Remark: We will see in exercise 5 that one can efficiently (O(n) operations) add new nodes
to the interpolating polynomial to improve accuracy.
Example: We want to interpolate the following function f : [1, 1] R with
(
0
if 1 x 0
f (x) =
1/x
e
if 0 < x 1
In figure 1.1 the solution is plotted as a continuous line and interpolating polynomials of
degree 5 as dotted line and degree 10 as dashed line. One can clearly see that the error does not
decrease uniformly if the polynomial degree increases.
0.45
exact solution
interpolating polynomial (n=5)
interpolating polynomial (n=10)
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0.05
1
0.8
0.6
0.4
0.2
0.2
0.4
0.6
0.8
10
1.2
CHAPTER 1. INTERPOLATION
Spline Interpolation
The example of the last section showed that we can not expect our polynomial approximation
to converge uniformly if we use equidistant interpolation points. Runge found out about this
phenomenon in 1901. If we take a look at the so called Runge function
f (x) =
1
,
1 + x2
x [5, 5]
(1.3)
and the interpolating polynomials in figure 1.2 we clearly see that the behavior of the interpolating
polynomials of higher degree close to the edges of the Interval is quite bad. Spline interpolation
2.5
exact solution
interpolating polynomial (n=5)
interpolating polynomial (n=10)
interpolating polynomial (n=15)
2
1.5
0.5
0.5
5
x [x0 , x1 )
x [x1 , x2 )
..
.
x [xn1 , xn ]
(1.4)
11
We are not completely free in the choice of the ci,j as we still have to satisfy the continuity and
differentiability conditions. By using representation (1.4) we can formulate them as
s1 (x1 ) = s2 (x1 ), s2 (x2 ) = s3 (x2 ), . . . , sn1 (xn1 ) = sn (xn1 )
s01 (x1 ) = s02 (x1 ), s02 (x2 ) = s03 (x2 ), . . . , s0n1 (xn1 ) = s0n (xn1 )
..
..
.
.
(k1)
s1
(k1)
(x1 ) = s2
(k1)
(x1 ), s2
(k1)
(x2 ) = s3
(1.5)
(k1)
If a function s(x) can be written as in (1.4) and satisfies (1.5) it follows that s(x) Sk,T .
Lemma 1.4 The dimension of the Splinespace Sk,T is n + k.
Proof: In [x0 , x1 ] we assume an arbitrary polynomial and therefore we have k + 1 degrees of
freedom (see (1.4): c1,0 , . . . , c1,k ). In [x1 , x2 ] we get additional k +1 degrees of freedom, but k are
already predetermined due to (1.5). Therefore we only get 1 additional degree of freedom. This
holds for every interval [xi1 , xi ] with i = 2, . . . , n. Therefore we get dim Sk,T = k + 1 + (n 1) =
n + k.
For the next two subsections we will take a closer look at the commonly used linear and cubic
splines.
1.2.1
The space S1,T consists of piecewise linear and globally continuous functions. An example is
given in figure 1.3.
s 1 `
t ` s2
t
```t
a = x0
x1
s3
x2
tXs4
X`
t ` s5
```t
x3 x4
x5 = b
Ci (xj ) = ij
12
CHAPTER 1. INTERPOLATION
2
exact solution
interpolating polynomial (n=10)
interpolating linear spline (n=10)
1.5
0.5
0.5
5
Ci (x) =
xxi1
xi xi1
x [xi1 , xi ]
else
xi+1 x
xi+1 xi
x [xi , xi+1 ]
Analogously to the interpolation polynomial we can define the solution of our general interpolation problem as
n
X
s(x) =
f (xi )Ci (x)
i=0
(xi xi1 )2
max |f 00 (x)|
8
x[xi1 ,xi ]
we get immediatly
max |f (x) s(x)|
x[a,b]
h2
max |f 00 (x)|
8 x[a,b]
1.2.2
13
As we have seen before as linear spline interpolation circumvents the problem of oscillations the
approximation properties in some regions the approximation properties afar from the edges of
interval tend to be worse than for standard interpolation polynomials. Therefore we will use
piecewise higher order to retain these approximation properties. In figure 1.5 the comparison of
different interpolation techniques shows clearly the advantages of using higher order polynomials
for approximation such as cubic spline interpolants. One has to mention that the space S3,T
is of slightly higher dimension for this decomposition (dim(S3,T ) = 13)) than the other spaces
(dim(P10 ) = dim(S1,T ) = 11) (see Lemma 1.4). Still the advantages clearly outweigh the little
additional cost.
2
interpolaion points
exact solution
interpolating polynomial (n=10)
linear spline
complete cubic spline (n=10)
1.5
interpolaion points
exact solution
interpolating polynomial (n=10)
linear spline
complete cubic spline (n=10)
0.95
0.9
0.85
0.5
0.8
0
0.75
0.5
5
0.2
0.2
0.4
0.6
0.8
Lemma 1.4 also implies that the n+1 interpolation conditions cant be enough to get a unique
solution of the interpolation problem in S3,T : Find a function s S3,T such that s(xi ) = f (xi ).
We need (at least) 2 additional conditions to guarantee uniqueness of this problem. Common
conditions (other conditions might be used as well) are the following:
(i) s0 (a) = f 0 (a) and s0 (b) = f 0 (b) (complete Spline)
(ii) s00 (a) = s00 (b) = 0 (natural Spline)
(iii) s0 (a) = s0 (b) and s00 (a) = s00 (b) (if f (a) = f (b)) (periodic spline)
Conditions (i) - (iii) lead always to a unique interpolating spline. We will now use (1.4) and
(1.5) to derive a linear equation system which has to be solved. As a function in s S3,T is 2
times differentiable we are allowed to evaluate the moments mi := s00 (xi ) for i = 0, ..., n and use
them to represent our cubic spline. Therefore we first calculate the linear spline s00 S1,T which
is due to (1.6) given by
s00 (x) =
(1.7)
in [xi1 , xi ] with hi = xi xi1 . Here (and for the following parts) we use that the cubic spline is
2 times continuously differentiable. We now integrate this function from xi1 to x and therefore
14
CHAPTER 1. INTERPOLATION
get
0
s (x) = s (xi1 ) +
s00 ()d
x1
x
(1.8)
s0 (xi1 ) =
s(xi ) s(xi1 )
hi
hi
mi1 mi
hi
3
6
(1.10)
(1.11)
s(xi+1 ) s(xi )
hi+1
hi+1
s(xi ) s(xi1 )
hi
hi
hi
hi
mi
mi+1
=
mi1 mi + mi1 + mi
hi+1
3
6
hi
3
6
2
2
(1.13)
Repositioning the terms now leads to
mi1
hi
hi + hi+1
hi+1
s(xi+1 ) s(xi ) s(xi ) s(xi1 )
+ mi
+ mi+1
=
=: bi
6
3
6
hi+1
hi
(1.14)
h1 h1 +h2
h2
0
0
0
...
0
m0
b1
6
3
6
h2
h2 +h3
h3
0
0
0
...
0
6
3
6
m1 b2
..
.
.
.
.
.
.
..
..
..
.. .. ..
(1.15)
.. .. = ..
..
..
..
..
.
. .
.
.
.
.
hn2 +hn1
hn1
0
mn1 bn2
0
...
...
0 hn2
6
3
6
hn1
hn1 +hn
hn
bn1
mn
0
...
...
...
0
6
15
m0
(1.16)
hn
hn
f (xn ) f (xn1 )
=: cb
+ mn
= f 0 (xn )
6
3
hn
(1.17)
mn = s00 (xn ) = 0
(1.18)
(iii) For the periodic spline we have by using (1.11) and (1.12):
m0
h1
h1
hn
hn
s(x1 ) s(x0 ) s(xn ) s(xn1 )
=: p0
+ m1 + mn1
+ mn
=
3
6
6
3
hi
hn
(1.19)
h2 +h3
h2
0
6
3
.
..
..
.
.
..
...
...
0
0
...
...
0
...
spline:
0
h3
6
..
..
.
0
...
...
...
0
0
..
.
..
.
...
0
0
hn2
6
hn2 +hn1
3
hn1
6
..
...
...
hn1
6
hn1 +hn
3
hn
6
0
0
0
..
.
..
.
0
hn
6
hn
3
m0
m1
..
.
..
.
..
.
ca
b1
b2
..
.
..
.
bn2
mn1
bn1
mn
c
b
(1.20)
(ii) For the natural cubic spline we can immediately eliminate m0 and mn as they are 0
h1 +h2
3
h2
6
..
.
0
0
h2
6
h2 +h3
3
0
h3
6
..
..
...
...
.
0
...
0
0
..
.
..
.
...
...
..
.
..
.
hn2
6
hn2 +hn1
3
hn1
6
h2
h2 +h3
h3
0
6
3
6
.
..
..
..
.
.
.
.
..
..
...
...
0
0
0
...
...
...
1
...
...
m1
b1
m2 b2
.. ..
. .
.. = ..
0
. .
hn1
mn2 bn2
6
hn1 +hn
mn1
bn1
0
0
..
.
(1.21)
hn
hn
p
...
0
0
6
3
m0
0
0
...
0
b
m1
1
0
0
...
0 .
b2
.. .
..
..
..
.
.
..
.
..
..
..
..
.
.
. . .
.
hn2 +hn1
hn1
hn2
bn2
0 .
6
3
6
mn1
hn1
hn1 +hn
hn
bn1
0
6
3
6
mn
0
...
...
0
1
16
CHAPTER 1. INTERPOLATION
which can be rewritten as
h1 +hn
h1
0
6
h31
h1 +h2
h2
6
3
6
h2
h2 +h3
0
6
3
.
..
..
.
.
..
...
...
0
hn
...
...
6
...
0
...
0
0
..
.
..
.
0
...
...
hn2
6
hn2 +hn1
3
hn1
6
h3
6
..
..
.
0
...
..
hn
6
0
0
..
.
m0
m1
..
.
..
.
..
.
p0
b1
b2
..
.
..
.
(1.22)
hn1
bn2
6
mn1
hn1 +hn
b
n1
s(0) = 0,
s(1/4) = 0,
s(1/2) = 1,
s(3/4) = 0,
s(1) = 0,
s0 (1) = 0
With using (1.20) the linear equation system for the moments of the complete interpolating cubic
spline is (with hi = 1/4)
1/12 1/24
m0
0
0
0
0
m1 4
1/24 1/6 1/24 0
0
0
0 1/24 1/6 1/24 m3 4
0
0
0 1/24 1/12
0
m4
Solving this equation system leads to
m0
24
m1 48
m=
m2 = 72
m3 48
m4
24
Therefore we can evaluate the spline by using (1.9) and (1.11):
(
0 + (0 + 2 2 3)(x 0) 16(1/4 x)3 + 1/4 + 32(x 0)3
s(x) =
0 + (4 4 + 3 + 6)(x 1/4) + 32(1/2 x)3 1/2 48(x 1/4)3
if x [0, 1/4]
if x [1/4, 1/2]
if x [1/2, 3/4]
if x [3/4, 1]
(ii) For the same decomposition and interpolation conditions we want to calculate the natural
cubic spline. Therefore we can use the linear equation system given in (1.21)
1/6
1/24
0
m1
4
1/24 1/6 1/24 m2 = 8
0 1/24 1/6
m3
4
which leads to the solution
288/7
m1
m = m2 = 480/7
288/7
m3
17
The additional conditions for a natural spline just mean m0 = 0 and m4 = 0 and by using (1.11)
we get analogously to (i)
s0 (0) = 12/7
s0 (1/4) = 24/7
s0 (1/2) = 0
s0 (3/4) = s0 (1/4) = 24/7.
And then by using (1.9):
(
3
0 + (12/7)x + 192
7 (x 0)
s(x) =
0 + (24/7 + 36/7)(x 1/4) +
192
7 (1/2
x)3
3/7 320/7(x
1/4)3
if x [0, 1/4]
if x [1/4, 1/2].
if x [1/2, 3/4]
if x [3/4, 1].
(iii) For the same decomposition and interpolation conditions we want to calculate the periodic cubic spline. Therefore we can use (1.22)
1/6
1/24
1/24
0
1/24
1/6
1/24
0
1/24
1/6
1/24
m0
0
m1 4
0
=
8
1/24 m2
1/6
4
m3
1/24
m0
24
m1 48
m=
m2 = 72
m3
48
and therefore with m4 = m0 we see that the periodic and the complete spline for upper conditions
are the same.
The question arises if the conditions (i) - (iii) are enough to guarantee a unique solution. To
prove this we need the following properties for matrices in Rnn :
Definition 1.6 Let A Rnn . If
aii >
n
X
|aij |
j=1,j6=i
18
CHAPTER 1. INTERPOLATION
viT Avi
>0
viT vi
n
X
ci vi
i=1
n
n
n
X
X
X
xT Ax = (
ci vi )T A(
ci vi ) =
c2i i kvi k2 > 0
i=1
i=1
i=1
n
X
n
X
aij vj = aii vi +
j=1
aij vj
j=1,j6=i
At the same time, as v is the eigenvector for we have: (Av)i = vi and therefore:
n
X
vi = (Av)i = aii vi +
aij vj
j=1,j6=i
n
X
= aii +
aij
j=1,j6=i
vj
vi
= aii +
aij
j=1,j6=i
n
X
>
j=1,j6=i
n
X
|aij | +
|aij |
j=1,j6=i
vj
vi
n
X
j=1,j6=i
n
X
aij
vj
vi
vj
|aij | | |
vi
j=1,j6=i
|{z}
1
n
X
j=1,j6=i
|aij |
n
X
|aij |
j=1,j6=i
=0
and therefore we have > 0 for all eigenvalues which proves this lemma.
From this lemma the following theorem follows immediately.
19
Theorem 1.10 For every condition (i)-(iii) there exists exactly one interpolating cubic spline
satisfying this additional condition .
Proof: The matrices arising in the linear equation systems in (1.20), (1.21) and (1.22) are
symmetric and strictly diagonal dominant and therefore with lemma 1.9 positive definite. As a
positive definite matrix is regular the resulting linear equation systems are uniquely solvable.
2
Without proof the following error estimate for the error in the L -norm
Z
kf k2,[a,b] :=
b
2
1/2
(f (x)) dx
a
is stated.
Theorem 1.11 Let f C (4) ([a, b]) be given. For the complete interpolating spline the following
estimates hold
kf 00 s00 k2 h2 kf (4) k2
kf 0 s0 k2 h3 kf (4) k2
kf sk2 h4 kf (4) k2
with h := max hi .
i=1,...,n
20
CHAPTER 1. INTERPOLATION
Chapter 2
Numerical Integration
If one wants to compute the integrals of arbitrary functions there might arise two problems:
First one might not be able to compute the antiderivative of a general function and second if
you are able to compute the antiderivative it might be costly to evaluate it at arbitrary points.
Therefore it is necessary/beneficiary to compute the integral of the form
Z
I(f ) :=
f (x)dx
(2.1)
by using only the known values of f (x) at some quadrature points xi . Let the quadrature points
x0 , x1 , . . . , xn in [a, b] be given. We now want to approximate the integral I(f ) by a so called
quadrature rule Q(f )
Z
f (x)dx Q(f ) = (b a)
I(f ) =
a
n
X
i f (xi )
(2.2)
i=0
2.1
Newton-Cotes Formulas
The easiest approach for upper problem is the usage of the interpolating polynomial:
Z
f (x)dx
I(f ) =
a
Z bX
n
a
n
X
1
f (xi )Li (x)dx = (b a)
f (xi )
ba
i=0
i=0
|
=: i
(2.4)
is a well posed problem. Upper construction ensures existence of the weights and the uniqueness
follows directly by
Z b
n
X
Lj (x)dx = (b a)
i Lj (xi ) = (b a)j
a
i=0
22
Z
x b x=a+(ba)s 1 s 1
1
dx
=
ds =
ab
2
0 01
Z 1
x a x=a+(ba)s
s0
1
dx
=
ds =
ba
2
0 10
(2.5)
(2.6)
Li (x)dx
x=a+(ba)s
Z
Li (a + (b a)s)ds =
i (s)ds
L
(2.7)
i (s) being the i-th Lagrangian polynomial on [0, 1] with respect to the interpolation points
with L
j
lj = n . Therefore the weights j are independent of the length (b a) of the integration interval
and the position. As a consequence one can compute the weights by restricting the computations
to the interval [0, 1].
For n = 2 we compute the weights by using (2.7)
1
(s 1/2)(s 1)
1
ds =
(0 1/2)(0 1)
6
(s 0)(s 1)
2
ds =
(1/2 0)(1/2 1)
3
(s 0)(s 1/2)
1
ds =
1
(1 0)(1 /2)
6
(2.8)
ba
a+b
(f (a) + 4f (
) + f (b))
6
2
(2.9)
Z
0 =
0
Z
1 =
0
Z
2 =
0
Z
p(x)dx =
I(p) =
0
n
1X
ci Bi (x)dx =
i=0
n
X
i=0
n
X
i=0
Z
ci
0
Bi (x)dx
{z
}
exact
ci Q(Bi ) = Q(
n
X
ci Bi ) = Q(p)
i=0
So the only equation we have to ensure to hold is given by (2.10). Furthermore we can easily
find a basis of Pn by using the monomials {1, x, x2 , . . . , xn }.
23
So for n = 3 a basis would be given by {1, x, x2 , x3 } and the conditions (2.10) for (2.2) lead
to (with (x0 , x1 , x2 , x3 ) = (0, 1/3, 2/3, 1)):
Z
1 dx = 1 = 0 + 1 + 2 + 3
0
Z 1
0
1
1
0
0
0
1
1/3
1/9
1/27
1
2/3
4/9
8/27
1
0
1
1 1/2
1
=
1 2 1/3
1/4
1
3
ba
2(b a)
ba
(f (a) + 3f (a +
) + 3f (a +
) + f (b))
8
3
3
Z
f (x)dx = (b a)
Z
f (a + (b a)s)ds = (b a)
f(s)ds
a)s)
=
f (s)
dxn
(b a)n dsn
(b a)n dsn
R1
i be the weights such that Q(
So let
p) = 0 p(s)ds for all p P3 ([0, 1]) and therefore we can
24
write Q(p) as
Z
p(x)dx
a
1
Z
= (b a)
p(s)ds
0
0 p00 (0) +
1 p(0) +
2 p(1) +
3 p00 (1))
= (b a)(
0 p00 (a) +
1 p(a) +
2 p(b) + (b a)2
3 p00 (b)
= (b a)((b a)2
= (b a)(0 p00 (a) + 1 p(a) + 2 p(b) + 3 p00 (b))
= Q(p)
0 , 1 :=
1 , 2 :=
2 and 3 := (b a)2
3 . So the weights are independent
with 0 := (b a)2
of the interval length.
Then problem (2.4) for all p P3 ([a, b]) can be solved on interval [0, 1] and the weights 0
and 3 need to be transformed afterwards.
Z
0
Z 1
0
1
0 + 1
1 + 1
2 + 0
3 =
1 +
2
1 dx = 1 = 0
0 + 0
1 + 1
2 + 0
3 =
2
x dx = 1/2 = 0
0 + 0
1 + 1
2 + 2
3
x2 dx = 1/3 = 2
0 + 0
1 + 1
2 + 6
3
x3 dx = 1/4 = 0
0
1/24
1
1 = /2
2 1/2
3
1/24
(2.11)
Theorem 2.1 Let f be (n+1)-times continuously differentiable function, f C n+1 ([a, b]). Then
by using a Newton-Cotes quadrature rule with n + 1 quadrature points we get the following error
estimate
(n+1)
k,[a,b] ba n+2 R n Qn
kf
( n )
| 0 j=0 (t j)dt|
if n is uneven
(n+1)!
|I(f ) Q(f )| kf (n+2) k
R
Q
n
,[a,b] ba n+3
(
)
|
t n (t j)dt| if n is even
(n+2)!
j=0
25
(b a)3 (2)
kf k,[a,b]
12
(b a)5 (4)
kf k,[a,b]
2880
(b a)5 (4)
kf k,[a,b]
6480
(b a)7 (6)
kf k,[a,b]
1935360
11(b a)7 (6)
kf k,[a,b]
37800000
Here you can see that the largest jumps in accuracy occur if you use even n. So in general people
try to use even n for general functions as the gain/cost ratio is substantially higher.
2.2
Gauss-Legendre Formulas
The former idea involved the equidistant distribution of quadrature points. If we drop this
assumption and let the quadrature points by itself be free to choose and then consider problem
(2.4) for higher order polynomials we get the Gauss-Legendre formulas. Typically these are
derived on the interval [1, 1] and afterwards transformed to the general interval [a, b] by the
a+b
transformation x = ba
2 t + 2 . Without proofs the quadrature points/weights will be given on
the general interval [a, b]. It is beneficial in this case to define the midpoint m = a+b
2 for the
interval [a, b]. The quadrature rule, for general n, is then given as in (2.2).
Example: Consider the case n = 1. Find a quadrature rule
Q(f ) = (b a)(0 f (x0 ) + 1 f (x1 ))
such that Q(p) = I(p) for all p P3 ([a, b]. Here we have 4 degrees of freedom, namely x0 , x1
and 0 , 1 . The solution of this problem is given by
ba 1
2
3
1
1 =
2
x0 = m
0 =
1
2
x1 = m +
ba 1
2
3
In general we get the following quadrature points weights for different n with h := (b a).
For n = 0 the rule is generally called midpoint-rule: A Gauss-Legendre Quadrature rule should
n
0
1
2
quadrature points
x0
h 1
2 3
error
weights
=m
0 = 1
h 1
2 3
x0 = m
, x1 = m +
q
q
h
3
x0 = m 2 5 , x1 = m , x2 = m + h2 35
0 =
0 =
5
18
1
2
, 1 =
, 1 =
8
18
1
2
, 2 =
5
18
(ba)3
(2)
k,[a,b]
24 kf
(5)
(ba)
(4)
k,[a,b]
540 kf
(ba)7
(6)
k,[a,b]
2016000 kf
26
2.3
Composite Rules
All quadrature rules we have seen before do not overcome the problem of needing the higher
derivatives of f (x) to be not "too bad". Furthermore the Newton-Cotes formulas will have
negative weights for larger n, rendering them to be a bad choice for the approximation of I(f ) for
a general f (x). We have seen in chapter 1 that a piecewise interpolation overcomes this problem.
We will use the same idea here and apply the chosen quadrature rule to every subinterval. Let
the decomposition of L subintervals of the interval [a, b]
{[t0 , t1 ], [t1 , t2 ], . . . , [tL1 , tL ]}
be given. This means that ({[t0 , t1 ], [t1 , t2 ], . . . , [tL1 , tL ]}) = [a, b] and (tj1 , tj ) (tk1 , tk ) =
if j 6= k. We then can write I(f ) by
Z b
L Z ti
X
I(f ) =
f (x)dx =
f (x)dx
a
i=1
ti1
ti1
i=1
t1
=
| t0
t2
f (x)dx + +
f (x)dx +
{z
}
}
| t1 {z
Q1 (f ) on [t0 , t1 ]
Q2 (f ) on [t1 , t2 ]
tL
tL1
f (x)dx
{z
}
QL (f ) on [tL1 , tL ]
= QC (f ) I(f ) on [a, b]
Here we used different quadrature formulas Qi on the intervals [ti1 , ti ]. So one can for example
choose to use Simpsons rule on one interval and the trapezoidal rule on the other if the problem
requires this. In the next examples we will use the same quadrature formula on every subinterval.
Example:
Let the integral I(f ) with a decomposition
{[t0 , t1 ], [t1 , t2 ], . . . , [tL1 , tL ]}
be given. Applying the trapezoidal rule to the subintervals leads to
Z b
L Z ti
X
I(f ) =
f (x)dx =
f (x)dx
a
i=1
ti1
t1
=
t0
t2
Z
f (x)dx
{z
}
Z
f (x)dx
{z
}
t1
tL
+ +
tL1
f (x)dx
{z
}
i=1
t1
=
t0
ti1
Z
f (x)dx +
{z
}
h( 12 f (t0 )+ 12 f (t1 ))
t2
t1
Z
f (x)dx
{z
}
h( 12 f (t1 )+ 21 f (t2 ))
tL
+ +
tL1
f (x)dx
{z
}
h( 12 f (tL1 )+ 21 f (tL ))
1
1
h( f (t0 ) + f (t1 ) + + f (tL1 ) + f (tL ))
2
2
27
f (x)dx =
I(f ) =
a
L Z
X
t1
t0
f (x)dx
ti1
i=1
ti
Z t2
f (x)dx + +
f (x)dx +
t1
|
{z
}
{z
}
(t1 t0 )f (
t0 +t1
)
2
(t1 t0 )f (
(t2 t1 )f (
t1 +t2
)
2
tL
tL1
f (x)dx
{z
}
(tL tL1 )f (
tL1 +tL
)
2
tL1 + tL
t0 + t1
) + + (tL tL1 )f (
)
2
2
t0 + t1
tL1 + tL
) + + f(
))
2
2
=|
i=1
L
X
f (x)dx
L
ti1
i=1
ti
L
ti1
(2.12)
ti
|
ti1
i=1
Qi (f )|[ti1 ,ti ] |
i=1
ti
L
X
L
X
(ti ti1 )5
i=1
2880
kf
(4)
k,[ti ,ti1 ]
L
X
h5i
=
kf (4) k,[ti ,ti1 ]
2880
(2.13)
i=1
Assuming equidistant points ti , meaning hi = h for every i = 1, ..., L, we can simplify this:
|I(f ) QC (f )|
L
X
h5
h5
kf (4) k,[ti ,ti1 ] L
kf (4) k,[a,b]
2880
2880
(2.14)
i=1
Equation (2.14) might be very rough compared to (2.13). But even in (2.14) one can see that
by adding more points ti and therefore reducing h will reduce the error estimate considerably
without having to take higher derivatives of f into account.
2.4
In this section an approach for multidimensional quadrature rules is shortly introduced. For the
sake of simplicity only the case of a rectangle/cuboid integration area will be considered. For
more general domains refer to chapter 9.9 in Quarteroni/Sacco/Saleri: Numerical Mathematics.
28
Let the rectangle [a1 , b1 ] [a2 , b2 ] and a continuous function f (x, y) be given. The integral
can then be written as
Z b1 Z b2
Z b1
I(f ) =
f (x, y)dy dx =
F (x)dx = I(F )
a1
a1
| a2 {z
}
=:F (x)
The integral I(F ) can then be approximated by a (composite) quadrature rule. For the sake of
simplicity we drop the composite component of the rule and will only use a quadrature rule
with the quadrature points xi and weights i for i = 0, ..., n.
Z
b1
F (x)dx Qx (F ) = (b1 a1 )
I(F ) =
a1
n
X
i F (xi )
i=0
b2
f (xi , y)dy
F (xi ) =
a2
which can also be approximated by a quadrature rule. Again for simplicity we use the same rule
again which leads to
Z
b2
F (xi ) =
a2
n
X
j f (xi , yj )
j=0
b1
b2
I(f ) =
a1
a2
n
X
i (b2 a2 )
i=0
= (b1 a1 )(b2 a2 )
n
X
j f (xi , yj )
j=0
n X
n
X
i j f (xi , yj )
i=0 j=0
The derivation of this quadrature formula makes obvious that this rule, by using the weights
and quadrature points of the closed Newton-Cotes formulas, is exact for polynomials of the form
p(x, y) = p1,n (x)p2,n (y) with pi,n Pn ([ai , bi ]) but not necessarily (if n is uneven) for polynomials
p(x, y) = p1,n+1 (x) (or p(x, y) = p2,n+1 (y)). Furthermore the number of function evaluations
increases to (n+1)2 (for a general d-dimensional cuboid to (n+1)d ). This phenomenon is usually
called the curse of dimensionality.
For the cuboid the derivation of a tensor-product formula is analog. Let the rectangle [a1 , b1 ]
[a2 , b2 ] [a3 , b3 ] and a continuous function f (x, y, z) be given. We can then use the quadrature
formula:
Z b1 Z b2 Z b3
n
n
n
X
X
X
I(f ) =
f (x, y, z)dzdydx = (b1 a1 )
i (b2 a2 )
j (b3 a3 )
k f (xi , yj , zk )
a1
a2
a3
i=0
j=0
n X
n X
n
X
i=0 j=0 k=0
k=0
i j k f (xi , yj , zk )
Chapter 3
Ax = b
(3.1)
3.1
The Gaussian Elimination is a commonly used method for solving linear equation systems. The
idea is to transform the matrix A by row operations to an upper triangular matrix. Applying
the same transformations to the vector b we can the get the solution by back-substitution. We
will first start with an example:
2 1 4
7
6
2 4 3 5
5
x =
4
16
7 12 20
0 12 5
3
5
|
| {z }
{z
}
=A
=b
29
30
2 1 4
2 4 3
4
7 12
0 12 5
2 1
0 3
0 9
0 12
2 1
0 3
0 0
0 0
2 1
0 3
0 0
0 0
2
1
6
7
+
5 5
20 16 +
3
5
4 7 6
3 4
1 2 1
4 6 4
+
5 3 5 +
4 7 6
1 2 1
1
1 0 1
1 5 1
+
4 7 6
1 2 1
1 0 1
0 5 0
6
2 1 4 7
1
0 3 1 2
0 0 1 0 x = 1
0 0 0 5
0
x4 = 0,
x3 = 1,
1
x2 = (1 1 1 2 0) = 0,
3
1
0
x=
1
0
1
x1 = (6 + 1 0 4 1 7 0) = 1
2
a11
a21
a31
..
.
a
a
a21
a31
a12 a13 . . . a1n
11
11
+
a22 a23 . . . a2n
an1
11
(2)
22
23
2n
a22
(2)
(2)
(2)
(2)
+
A(3)
(2)
0 an2 an3
22
23
(3)
0
0
a
:=
33
..
.
0
(3)
an3
(2)
31
(2)
an2
(2)
a22
. . . ann
. . . a1n
(2)
. . . a2n
(3)
. . . a3n
..
.
(3)
. . . ann
(3)
an3
(3)
a33
...
A(n)
22
23
2n
(3)
(3)
0
0
a
.
.
.
a
:=
33
3n =: U
.
..
..
..
..
.
.
.
(n)
0
0
. . . 0 ann
b1
b(2)
2
(3)
b
b=
3
..
.
(n)
bn
The operations we have seen before can also be done by a matrix-matrix multiplication. The
first step of the Gaussian elimination for example can be done by multiplying A = A(1) from the
left side by the matrix
1
0 0
... 0
l2,1 1 0
. . . 0
l3,1 0 1 0 . . . 0
.
.
.
.
L1 =
.
.
.
.
.
.
.
.
.
.
ln1,1 0 0
. 0
ln,1 0 0
... 1
i1
. Therefore we have
with li,1 := aa11
22
23
2n
(2)
(2)
(2)
(2)
0 a32 a33 . . . a3n = L1 A(1)
A =
..
..
.
.
(2)
(2)
(2)
0 an2 an3 . . . ann
32
0
.
.
.
.
.
.
Lk = .
..
.
..
.
..
0
0
...
.. ..
.
.
...
..
. 1
0
...
..
.
0
1
0
..
.
0 lk+1,k 1
..
..
.
. lk+2,k 0
..
..
..
..
.
.
.
.
0 ... 0
ln,k
0
... ... 0
.
. . . . . . ..
.
. . . ..
.
0 . . . ..
.
0 . . . ..
. . . . ..
.
. .
..
. 0
... 0 1
(k)
with lj,k =
ajk
(k)
akk
A(k+1)
22
23
.
.
.
..
..
..
..
.
.
= .
..
(k+1)
.
0 ak+1,k+1
.
.
..
..
.
(k+1)
0
...
0
an,k+1
...
...
..
.
a1n
(2)
a2n
..
.
(k)
(k+1) = Lk A
. . . ak+1,n
..
.
(k+1)
. . . ann
1 0
..
.
0
.
. ...
.
. .
. .
. .
1
Lk = . .
.. ..
. .
.. ..
. .
.. ..
0
...
... ... 0
.
..
.
...
. . . . . . ..
.
1
0
...
. . . ..
.
0
1
0
0 . . . ..
.
0 lk+1,k 1
0 . . . ..
..
. . . . ..
.
. .
. lk+2,k 0
..
..
..
..
. 0
.
.
.
0 ... 0
ln,k
0 ... 0 1
with
0
..
.
0
..
.
...
...
...
l2,1
...
...
0
...
...
l3,1 l3,2 1
.
.
.
..
..
L = ..
1
0
...
.
.
..
..
..
..
.
.
lk+1,k
.
.
.
.
..
..
..
..
1
ln,1 ln,2 . . . ln,k . . . ln,n1
0
..
.
..
.
..
.
..
.
0
1
(3.2)
first solve Lz = b
then solve U x = z
After we computed the LU-decomposition in 23 n3 + O(n2 ) operations we can solve the linear equation system by only using forward- and backward-substitution, which requires O(n2 )
operations.
For our example we get
1
1
L1 =
2
0
0
1
0
0
0
0
0
1
0
0
1
0
1 0 0
0 1 0
L2 =
0 3 1
0 4 0
0
0
0
1
1
0
L3 =
0
0
0 0
1 0
0 1
0 1
0
0
0
1
and therefore
1
1
L=
2
0
0
1
3
4
0
0
0
1
0
0
1
1
and U = A(4)
2 1 4 7
0 3 1 2
=
0 0 1 0
0 0 0 5
It is easy to check that A = LU holds. To solve the linear equation system we first solve
1
1
2
0
0
1
3
4
0
0
1
1
6
6
0
0
5 z = 1
z
=
1
16
0
0
1
5
2 1 4 7
6
1
0 3 1 2
1
0
0 0 1 0 x = 1 x = 1
0 0 0 5
0
0
3.2
We can not guarantee existence of the LU decomposition for any invertible matrix A Rnn .
(k)
It is clear that the Gaussian elimination stops if any akk is 0. To prove the following first result
on existence and uniqueness of the LU decomposition we will make use of definition 1.7:
Theorem 3.1 Let A Rnn be symmetric positive definite. Then a LU decomposition A = LU
(k)
exists and when using the Gaussian elimination we have: akk > 0 for k = 1, ..., n
34
proof: As A is symmetric and positive definite we have a11 = eT1 Ae1 > 0. Therefore the first
step of the Gaussian elimination can be executed and we get:
a
a
a
a21
a31
an1
a11 a21 a31 . . . an1
11
11
11
a21 a22 a32 . . . an2
+
..
..
.
.
an1 an2 an3 . . . ann +
22
32
n2
(2)
(2)
(2)
(2)
0
a
a
.
.
.
a
L1 A = A =
32
33
n3
..
..
.
.
0
(2)
an2
(2)
an3
(2)
. . . ann
an1
11
a31
11
a21
11
+
y
22
32
n2
(2)
(2)
0 a(2)
a
.
.
.
a
32
33
n3
..
..
.
.
(2)
(2)
(2)
0 an2 an3 . . . ann
a11 0
0 ... 0
0 a(2) a(2) . . . a(2)
22
32
n2
(2)
(2)
(2)
T
(2)
0
a
a
.
.
.
a
L1 AL1 =: B =
32
33
n3
..
..
.
.
(2)
(2)
(2)
0 an2 an3 . . . ann
Now the general matrix B (k) after k 1 steps is given by
=B
(k)
a11
(2)
a22
=
0
0
..
.
(k)
(k)
. . . an,k
..
..
.
.
ak,k
..
.
(k)
(k)
an,k . . . ann
(n1)
. . . LTn1
(1)
a11
(2)
a22
..
.
(n)
=: D
ann
(i)
with all aii > 0. This means we can rewrite A in the form
1
1
T
T T
A = L1
= LDLT
1 L2 . . . Ln1 DLn1 . . . L2 L1
And by setting
= LD1/2
L
with
q
D1/2
(1)
a11
(2)
a22
..
.
q
(n)
ann
we get
L
T
A = LDLT = LD1/2 D1/2 LT = LD1/2 (LD1/2 )T = L
is a lower triangular matrix with positive entries on the diagonal. This decomposition
Here L
is called the Cholesky decomposition and is named after Andre-Louis Cholesky (1875-1918). To
calculate the Cholesky factorization by simply applying the Gaussian elimination would be too
costly and would not take the symmetry into account. If we write the decomposition in the form
l11
l11
l21 l22
..
.
..
.
ln1 . . . . . . lnn
l21 . . . ln1
a11 a21 . . . an1
.. a
l22
21 a22 . . . an2
.
=
..
. ..
..
.
. .. .
an1 an2 . . . ann
lnn
l2 if i = k
lkk = akk
kj
j=1
36
end
v
u
k1
X
u
lkk = takk
l2
kj
j=1
end
The computational cost is
n X
k1
n
X
X
1
(2i 1) +
2(k 1) = n3 + O(n2 )
3
k=1 i=1
k=1
16 4
4 4
4
5
1 1
A=
4
1 10 7
4 1 7 9
be given. Applying upper algorithm we get the
4
1
=
L
1
1
Cholesky decomposition
0 0 0
2 0 0
0 3 0
0 2 2
and it holds
L
T
A=L
3.3
Partial Pivoting
As we have already mentioned before the Gaussian elimination can not be applied to any regular
matrix, as there might be a 0 on the diagonal. A simple example is given by
0 1
A=
1 1
where Gaussian elimination of section 3.1 can not be applied without swapping rows(columns).
Furthermore problems might arise if the pivot element is very small:
20
10
1
A=
1
1
Using the Gaussian elimination we get
20
1
0
10
1
L=
and U =
1020 1
0
1 1020
37
If one uses a computer with a machine precision of 1016 the difference 1 1020 is 1020 .
Therefore we get the disturbed matrices
20
1
0
10
1
L=
and U =
1020 1
0
1020
and
U
=
L
1020 1
1
0
Using the LU decomposition for the solution of the LES Ax = (1, 0)T we get
1
L |{z}
Ux =
0
=:z
and therefore
z=
and then
1
1020
0
x=
1
1101 20
1
11020
1
1020 1
1
1
So instead of solving
20
1
10
1
x=
0
1
1
we want to solve:
0
1
1
x=
1
1020 1
1
0
1020 1
=
and U
1
1
20
10
1 + 1020
1 1
0 1
1
1
20
10
1
Using the LU decomposition to solve upper linear equation system we get (by taking machine
precision into account)
1
x=
1
The Gaussian elimination with partial pivoting in the k-th step looks like
38
(k)
1. For the step A(k) A(k+1) choose a p {k, . . . , n}, such that |apk | |ajk | for j
{k, . . . , n}
2. Swap the rows k and p
(k)
(k)
with
(k)
a
ij
(k)
akj
= a(k)
pj
(k)
aij
if i=p
if i=k
else
Pk =
p-th column
1
..
| k-th row
| p-th row
.
1
0 0 ... 0 1
0 1
0
..
..
..
.
.
.
0
1 0
1 0 ... 0 0
1
..
.
1
which is just the unity matrix where column k and row p are swapped. It is easy to see that
Pk1 = Pk . Similar to the LU decomposition before we are able to construct an upper triangular
matrix U by
U = Ln1 Pn1 Ln2 Pn2 Ln3 Pn3 . . . L2 P2 L1 P1 A
To get a LU decomposition we need to construct matrices similar to our Li before to keep the
known structure. It is easy to see that if we multiply Pk+1 (which swaps row k + 1 and pk+1 )
with Lk we get
k + 1-th
1
..
.
lpk+1 ,k 0
lk+2,k
0
..
..
Pk+1 Lk =
.
.
l
0
pk+1 1,k
lk+1,k
1
l
pk+1 +1,k
..
.
ln,k
col.
| k + 1-th row
0 ... 0 1
1
0
..
..
.
.
1 0
0 ... 0 0
1
..
.
1
39
and then by swapping the k + 1-th and pk+1 -th column we regain the former structure. To swap
column k + 1 and pk+1 we need to multiply Pk+1 from the right.
k + 1-th col.
Pk+1 Lk Pk+1
1
..
| k + 1-th row
1
lpk+1 ,k
lk+2,k
..
.
lpk+1 1,k
lk+1,k
lpk+1 +1,k
..
.
1 0 ... 0 0
0 1
0
..
..
..
.
.
.
0
1 0
0 0 ... 0 1
1
..
ln,k
=I
= Ln1 Pn1 Ln2 Pn1 Pn1 Pn2 Ln3 Pn2 Pn1 Pn1 Pn2 Pn3 L4 . . . L2 P2 L1 P1 A
| {z } |
{z
}|
{z
}
n1
=:L
n2
=:L
n3
=:L
= ...
n1 L
n2 . . . L
1 (Pn1 . . . P1 ) A
=L
{z
}
|
=:P
with
k = Pn1 . . . Pk+1 Lk Pk+1 . . . Pn1
L
having the same structure as the Lk of section 3.1 with swapped entries. and therefore
n1 L
n2 . . . L
1 )1 U = L
1 L
1 . . . L
1 U = LU
P A = (L
n1
| 1 2 {z
}
=:L
It is clear that this construction works always if A is not singular as there has to be at least one
entry in the column (from row k downwarts) that is not 0. We summarize this in the following
theorem
Theorem 3.2 Let A Rnn be non singular. Then a decomposition
P A = LU
with a permutation matrix P a lower triangular matrix L (with |lij | 1) and an upper triangular
matrix U exists.
Remark: As the additional computational cost for the step k of the elimination process only
consists in finding the element of largest absolute value in the corresponding row (k operations)
and then swapping two rows (2k operations) the overall computational cost of
n1
X
i=1
3k = 3
n(n 1)
= O(n2 )
2
40
is negligible.
To solve the linear equation system Ax = b with the given decomposition P A = LU we use
Ax = b P Ax = P b LU x = P b
first solve Lz = P b
then solve U x = z
Example: Let the Matrix
1/3
1/4
1/2
12 12 12 0
4
4
5
6
A=
3
3
21 12 +
+
6
4 17 9
be given. In the first step no swap is needed which means that P1 = I and L1 is given by
1 0
1 1
3
L1 =
1 0
4
1
2 0
0
0
1
0
0
0
0
1
A(2)
12 12 12 0
0
0
1
6
0
6
18 12
0
2
11
9
A(2)
12 12 12 0
0
1/3
6
18 12
0
0
1
6
0
2
11
9
1
0
P2 =
0
0
0
0
1
0
0
1
0
0
0
0
0
1
1 0
0 1
L2 =
0 0
0 13
0
0
1
0
0
0
0
1
A(3)
12 12 12 0
0
6
18 12
=
0
0
1
6
0
0
5
5
A(3)
12 12 12 0
0
6
18 12
=
0
1/5
0
5
5
0
0
1
6
with
1
0
P3 =
0
0
0
1
0
0
0
0
0
1
0
0
1
0
1
0
L3 =
0
0
0 0
1 0
0 1
0 15
1
0
P = P3 P2 P1 =
0
0
0
0
0
1
and finally
12 12 12 0
0
6
18 12
U =
0
0
5
5
0
0
0
5
0
0
0
1
0
1
0
0
0
0
1
0
41
i we get
For the L
1 0
1 1
1 = P3 P2 L1 P2 P3 = 41
L
2 0
1
3 0
1 0
0 1
3 =
L
0 0
0 0
and therefore
0
1 0
0 1
0
2 = P3 L2 P3 =
L
0 1
0
3
1
0 0
0 0
0 0
1 0
15 1
0
0
1
0
1
1
4
L=
1
2
13
0
1
1
3
0
0
1
0
0
0
0
1
0 0
0 0
1 0
1
5 1
with
1
0
0
0
|
0
0
0
1
0
1
0
0
{z
P
12 12 12 0
1
0
4
5
6 41
0 4
=
3
3
21 12 12
1
6
4 17
9
31
0
}|
{z
} |
A
1 0
1 1
4
Solve
1 1
2
3
13 0
and then
0 0
12 12 12 0
0 0
6
18 12
0
0
0
5
5
3 1 0
0 15 1
0
0
0
5
{z
}
{z
}|
0
1
0 0
12
12
12
0 0
z = P b = 15
z=
5
1 0
15
1
10
5
5 1
12
12 12 12 0
0
6
18 12
x = z = 12
Solve
5
0
0
5
5
5
0
0
0
5
1
0
x=
0
1
42
Chapter 4
4.1
44
1.8
1.8
1.6
1.6
1.4
1.4
1.2
1.2
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0
0.5
0.51
0.52
0.53
0.54
0.55
0.56
0.57
0.58
0.59
0.6
10
12
14
16
Let g20 be the slope of the straight line g2 . We then can calculate the abscissa of the point of
intersection by
c
c1 c2 1
1
1
t =
= g20 g20
= Ax
c
g20
|
{z
} | {z2 }
=:A
=:x
with A R12 . In this case the function that computes our solution f (x) is given by
f : R2 R1 with f (x) = Ax
and instead of f (x) one usually computes f (
x).
We will now get to the case of matrix-vector multiplication. Let a matrix A Rmn be
be given. We then have for the
given. Furthermore let a vector x and a disturbed vector x
error in the result/output data eout of the matrix-vector multiplication
)
eout = Ax A
x = A (x x
| {z }
ein
with ein being the error in the input data. For a vector x Rk we can measure the length of
the vector in the so called p-norms with p N or p = :
p
kxkp = p |x1 |p + + |xk |p
(4.1)
with the special cases of the Taxicab norm
kxk1 = |x1 | + + |xk |
the Euclidean norm
kxk2 =
|x1 |2 + + |xk |2
45
We now get back to the question how the error in the input data is increased: The absolute
condition, with respect to a p-norm in the input and a q-norm in the output data, is the smallest
number such that
kp = kein kp
keout kq = kAx A
xkq kx x
Rn . We can rewrite upper statement as
holds for all x
kp for all x
Rn }
abs = inf{ : kAx A
xkq kx x
kAx A
xk q
kAykq
= max n
= max n
=: kAkp,q
R
kp
x6=x
06=yR
kx x
kykp
(4.2)
It is easy to show that kAkp,q is in fact a norm on Rmn for the by A indicated linear mapping
which is called the (by the vector norms k kp and k kq ) induced matrix norm. In the case p = q
we write kAkp instead of kAkp,p .
Remark: The definition of an induced matrix norm does not cover all matrix norms. There
are some matrix norms that are not induced by a vector norm and for these norms the following
Definition is of interest: We call a matrix norm k kM consistent with a vector norm k kV if
kAxkV kAkM kxkV
(4.3)
for all x Rn . Examples for norms that are not induced by vector norms are the Frobenius
norm
v
uX
n
um X
kAkF := t
a2ij
(4.4)
i=1 j=1
(4.5)
i,j
The Frobenius norm is consistent with the Euclidean norm, whereas the maximum norm is not
consistent with any p-norm.
Generally we are more interested in relative errors than in absolute errors as we want to take
the magnitude of the input data/ solutions into account. Therefore we get the relative condition
of the matrix-vector multiplication by using relative errors: The relative condition, with respect
to a p-norm in the input and a q-norm in the output data, is the smallest number rel such that
kp
keout kq
kAx A
xkq
kx x
kein kp
=
rel
= rel
kAxkq
kAxkq
kxkp
kxkp
Rn . As before we can rewrite upper statement as
holds for all x
kp
kx x
kAx A
xk q
Rn }
rel
for all x
kAxkq
kxkp
kAx A
xkq kxkp
kxkp
= max n
= kAkp,q
R
kp kAxkq
x6=x
kx x
kAxkq
rel = inf{ :
(4.6)
So we see that the relative condition not only depends on A, p and q but on x as well. A very
important number is the so called condition number p,q . Therefore we assume A to be quadratic
(m = n) and invertible. If we take the relative condition (4.6) and take the maximum over all x
(worst case scenario) we get
p,q = max n kAkp,q
06=xR
kxkp
kAxkq
y=Ax
= kAkp,q max n
06=yR
kA1 ykp
= kAkp,q kA1 kq,p
kykq
(4.7)
For the condition number p,q we write p if p = q. Generally these matrix norms are hard to
compute and therefore we will restrict ourselves to the cases p = q and p = 1, p = 2 and p = .
We start with the special case p = 2:
46
Theorem 4.1 (Spectral norm) For the Euclidean norm k k2 the induced matrix norm
kAk2 = max n
06=yR
kAyk2
kyk2
is given by:
q
max{ : is eigenvalue of AT A}
q
= max{ : is eigenvalue of AAT }
kAk2 =
(4.8)
if A is symmetric we have
kAk2 = max{|| : is eigenvalue of A}
(4.9)
Proof: Let n m. We know that AAT Rmm has the same eigenvalues as AT A Rnn
with additional m n zero eigenvalues. Therefore we only have to show
q
kAk2 = max{ : is eigenvalue of AT A}
The definition of the k k2 norm of A gives immediatly
kAk22 = maxn
yR
yT AT Ay
kAyk22
= maxn
2
yR
yT y
kyk2
QT AT AQ = =
.
.
.
m
with i being eigenvalue of AT A.It follows AT A = QQT as QT Q = I. Therefore we get
yT AT Ay
yR
yT y
T
y QQT y
= maxn T
yR
y QQT y
zT z
= maxn T
zR
z z
Pn
i zi2
= maxn Pi=1
n
2
zR
i=1 zi
kAk22 = maxn
= max i
i=1,...,n
with z = QT y. For m n the proof is analogously. For a symmetric matrix A we get (4.9)
immediately by using the fact that the eigenvalues of A2 are the squares of the eigenvalues of A.
Without proof the following theorem states how the matrix norms for p = 1 and p = can
be calculated.
Theorem 4.2 (Maximum absolute column/row sum norm) For the induced matrix norms
k k1 and k k we have
m
kAk1 = max n
06=yR
X
kAyk1
= max
|aij |
j=1,...,n
kyk1
i=1
47
kAk = max n
06=yR
X
kAyk
= max
|aij |
i=1,...,m
kyk
j=1
(4.10)
kbkp
kA1 bkq
(4.11)
If we again take the maximum of all right hand sides b (the worst case scenario) we get
p,q = kAkp,q kA1 kq,p
It has to be noted that in practice one normally uses p = q and p = 1, p = 2 and p = .
Example: Let p = q = 2 and let the quadratic Hilbert-Matrix be given:
1/2
1/3
1/n
1
...
1/2 1/3
1/4
. . . 1/n+1
1/5
. . . 1/n+2
H = 1/3 1/4
..
..
..
..
.
.
.
.
.
.
.
1/n 1/n+1 1/n+2 . . . 1/2n1
and the right hand side
1
1/2
b = 1/3
..
.
1/n
We now disturb the vector b by going from double precision (error of 2.221 1016 ) to single
precision (error of 1.193 107 ). The calculations are all done using double precision. The
errors are listed in table 4.1
n
4
8
12
2
kb bk
9.9341 109
1.3154 108
1.3764 108
k2
kx x
8.1301 105
13.3924
1.4870 107
kH 1 k2
1.0341 104
8.9965 109
9.0997 1015
2
kbbk
/kbk2
109
8.3259
1.0643 108
1.1002 108
kx
xk2/kxk2
105
8.1301
13.3924
1.4870 107
2 (H)
1.5514 104
1.5258 1010
1.6776 1016
48
4.2
Iterative Solvers
In applications one typically has to handle large linear equation systems with only few non-zero
entries (finite element methods, finite differences, finite volumes,...). The matrices arising in this
systems are called sparse matrices.The main problem is that direct solvers might be
- inefficient
- memory consuming
The first point is important if in applications no exact solution is needed and one is only interested in an approximation. In contrast to direct solvers, Iterative solvers deliver a sequence of
approximations x(0) , x(1) , . . . and can be stopped at any time with giving an approximation.
The second point is illustrated in figure 4.2. The original matrix A is a sparse matrix with
only 180 entries that are not zero. The matrices L and U arising from the LU decomposition
with partial pivoting (P A = LU ) have 451 and 419 entries. This results in 5 times the memory
consumption of the original matrix (or more than 2 times if we only consider U ). The problem
of fill in is quite common for sparse matrices and might introduce serious memory problems.
10
10
10
20
20
20
30
30
30
40
40
40
50
50
50
60
60
0
10
20
30
nz = 180
40
50
60
60
0
10
20
30
nz = 451
40
50
60
10
20
30
nz = 419
40
50
60
a11
a21
..
.
a12
a22
..
.
an1 an2
. . . a1n
x1
b1
x2 b2
. . . a2n
.. .. = ..
..
.
. . .
. . . ann
xn
bn
(0)
x1
(0)
x2
=
..
.
x(0)
(0)
xn
Iterative methods are generally based on a fixed point equation of the form
x = Gx + c
where the matrix G and the vector c define the iterative method given by the fixed point iteration
x(k+1) = Gx(k) + c.
(4.12)
49
The matrix G Rnn is called the iteration matrix. To derive some first
use of the following splitting
..
..
..
.
.
.
.
.
.
an1 an2 . . . ann
{z
}
|
A
0 a12
0
a11
a21 0
a22
0
= .
+
+
.
.
.
..
..
..
..
an1 . . . an,n1 0
ann
|
{z
} |
{z
} |
L
...
a1n
..
..
.
.
..
. an1,n
0
{z
}
U
aii xi
= bi
n
X
(k)
aij xj
for i = 1, . . . , n
j=1,j6=i
or
(k+1)
xi
n
X
1
(k)
(bi
aij xj ) for i = 1, . . . , n
aii
(4.13)
j=1,j6=i
This method is called the Jacobi method. It should be mentioned that one is able to calculate
(k+1)
xi
for all i simultaneously. Therefore this method is suitable for parallelization. In sequential
(k+1)
programming all xi
are computed consecutively. Therefore one might conceive the idea to
use the already computed better approximations. If the numbering of the computations is
given by increasing i upper equation becomes
(k+1)
xi
i1
n
X
X
1
(k+1)
(k)
(bi
aij xj
aij xj ) for i = 1, . . . , n
aii
j=1
(4.14)
j=i+1
(k+1)
aij xj
= bi
n
X
(k)
aij xj
for i = 1, . . . , n
j=i+1
and we are now able to put into the form of equation (4.12):
(D + L)x(k+1) = b U x(k) or x(k+1) = (D + L)1 (b U x(k) )
(4.15)
50
In this case we need to calculate the new entries step by step and an easy implementation of
a parallel algorithm is not possible. The solutions we get by using the Gauss-Seidel method are
generally better than the ones we get by using the Jacobi method.
One is free to choose the numbering of the computed entries. If one starts first by computing
the last entries of x(k+1) (decreasing i) one gets
(k+1)
xi
i1
n
X
X
1
(k)
(k+1)
=
(bi
aij xj
aij xj
) for i = 1, . . . , n
aii
j=1
(4.16)
j=i+1
(4.17)
If we combine these two methods by using one step of the Gauss-Seidel method and the next
step of the backward Gauss-Seidel method we get the symmetric Gauss-Seidel method:
x(k+ /2) = (D + L)1 (b U x(k) )
1
(4.18)
which is equivalent to
x
(k+1)
= (D + U )
b L(D + L)
(b U x
(k)
)
(4.19)
Let
e(k) = x x(k)
be the error in the k-th step. We can then derive for the Jacobi method
e(k+1) = x x(k+1) = x D1 (b (L + U )x(k) ) = x D1 (Ax (L + U )x(k) )
= x D1 ((D + L + U )x (L + U )x(k) ) = D1 (L + U )(x x(k) )
(4.20)
= D1 (L + U )e(k)
Analogously we get for the other methods:
e(k+1) = (D + L)1 U e(k)
e(k+1) = (D + U )1 Le(k)
(k+1)
= (D + U )
L(D + L)
(k)
Ue
Gauss-Seidel
(4.21)
backward Gauss-Seidel
(4.22)
symmetric Gauss-Seidel
(4.23)
ke
(k+1)
ke
k k(D + L)
(k)
U kke
k k(D + U )
(k)
Lkke
Jacobi
(4.24)
Gauss-Seidel
(4.25)
backward Gauss-Seidel
(4.26)
symmetric Gauss-Seidel
(4.27)
(4.28)
Thus, if kGk < 1 the error ke(k) k 0 for k . We will see below that kGk < 1 is a
sufficient convergence criterion for iterative methods of form (4.12) which follows directly from
51
another criterion. Under the assumption that kGk < 1 one can apply the Banach fixed point
theorem (the general statement and a proof will be given in the next chapter!) to obtain the
error estimates
kGk
kx(k) x(k1) k ("a posteriori")
1 kGk
kGkk
ke(k) k
kx(1) x(0) k ("a priori").
1 kGk
ke(k) k
(4.29)
(4.30)
With these inequalities one can estimate the error without knowing the exact solution.
In the following we motivate a convergence criterion for the iterative method (4.12) which is
based on the eigenvalues of G:
We assume that the iteration matrix G has a basis of eigenvectors v(1) , . . . , v(n) to the eigenvalues
1 , . . . , n . This assumption is equivalent to the requirement that the matrix A is diagonalizable.
In this case we can express the initial error e(0) = x x(0) uniquely as
(0)
= a1 v
(1)
+ . . . + an v
(n)
n
X
ai v(i)
i=1
k = 0, 1, 2, . . .
and therefore
e(k) = Gk e(0) =
n
X
ai Gk v(i) =
i=1
n
X
ai ki v(i) ,
k = 0, 1, 2, . . . .
i=1
It is clear that e(k) 0 if the absolute values of all eigenvalues are less than 1 or equivalently
the largest absolute value of all eigenvalues of A is less than one.
Definition 4.3 The spectral radius of a matrix A Rnn is defined by
(A) := max{|| : eigenvalue of A}.
One can show that the characterization above is generally valid and also necessary for convergence:
Theorem 4.4 Every iterative method of the form
x(k+1) = Gx(k) + c
(cf. (4.12))
with iteration matrix G converges for all initial values x(0) Rn to the solution of Ax = b if
and only if (G) < 1.
From Theorem 4.4 we can conclude:
Corollary 4.5 Let k kp := k kp,p be the (by the vector norm k kp ) induced matrix norm.
If kGkp < 1 then the iterative method (4.12) converges for all initial values x(0) Rn to the
solution of Ax = b.
Proof:
Let be an arbitrary eigenvalue of the iteration matrix G with corresponding eigenvector v Rn .
With the help of the definition of the matrix norm k kp (cp. (4.2)) it follows
kGkp = max n
06=yR
kGykp
kGvkp
kvkp
=
= ||.
kykp
kvkp
kvkp
52
Then by Definition 4.3 it holds also (G) kGkp . Due to the assumption kGkp < 1 it follows
(G) < 1 and therefore the convergence of (4.12) for an arbitrary initial guess x(0) by Theorem
4.4.
By Corollary 4.5 it is sufficient to show kGkp < 1 for only one matrix norm to ensure
convergence of the iterative method.
At the end of this section we state some convergence criteria for the Jacobi and the three Gauss Seidel methods. These criteria have the advantage that they are based on the given matrix A
and not on the iteration matrix G. For the first criterion we start with the following definition.
Definition 4.6 A matrix A Rnn is said to be strictly row diagonally dominant if and only if
n
X
|aij | < |aii |
for i = 1, . . . , n.
j=1
j6=i
for j = 1, . . . , n.
i=1
i6=j
Theorem 4.7 If the matrix A is strictly row or strictly column diagonally dominant, then the
Jacobi and all Gauss Seidel methods converge for all initial values x(0) Rn .
Proof:
We prove the statement for simplicity only for the Jacobi method. A simple calculation leads to
a12
0
aa1n
a11
11
a21
0 aa2n
a22
22
1
G := D (L + U ) =
.
..
an1
an2
0
ann
ann . . .
for the Jacobi method. Thus we obtain
n
n
n
n
1 X
X
1 X
1 X
kGk = max
|gij | = max
|a1j |,
|a2j |, . . . ,
|anj |
i=1,...,n
|a |
|a22 |
|ann |
j=1
j=1
j=1
11 j=1
j6=1
1
i=1,...,n |aii |
= max
j6=2
j6=n
n
X
|aij | < 1,
j=1
j6=i
{z
<1
due to the assumed strictly row diagonally dominance of A. By Corollary 4.5 we obtain the
convergence of the Jacobi method.
If the matrix A is strictly column diagonally dominant one can get analogously
k(L + U )D
k1 = max
j=1,...,n
n
X
i=1
|((L + U )D
1 X
)ij | = max
|aij | < 1.
j=1,...,n |ajj |
i=1
i6=j
Since kAk1 = kAT k for all matrices A it follows 1 > k(L + U )D1 k1 = kD1 (L + U )T k .
D1 (L + U )T is the Jacobi iteration matrix corresponding to AT x = b. Again by Corollary
53
4.5 the Jacobi method converges for AT . By Theorem 4.4 it must hold (D1 (L + U )T ) < 1.
Since the eigenvalues of A and AT and the eigenvalues of AB and BA are equal for arbitrary
matrices A, B Rnn it follows
1 > (D1 (L + U )T ) = ((L + U )D1 ) = (D1 (L + U )) = (G),
i.e. by Theorem 4.4 the convergence of the Jacobi method for the matrix A.
The next two convergence criteria are based on the positive definiteness of A:
Theorem 4.8 Let A Rnn be symmetric positive definite. Then the Gauss - Seidel, the backward Gauss - Seidel and the symmetric Gauss - Seidel methods converge for arbitrary initial values
x(0) Rn .
Note that the symmetry and positive definiteness of A is not sufficient for convergence of the
Jacobi method. For this method one needs an additional condition:
Theorem 4.9 Let A = L + D + U Rnn be symmetric positive definite. Then the Jacobi
method converges for arbitrary initial values x(0) Rn if and only if the matrix D L U is
also symmetric positive definite.
4.3
In this section we consider two iterative methods, the steepest descent and the conjugate gradient method. The steepest descent method is a very simple method and can be used to solve
unconstrained minimization problems for real - valued functions. We will use this method for
the solution of linear systems of equations Ax = b under the assumption that A Rnn is a
symmetric positive definite matrix. However, we will see exemplarily that this method often
converges quite slowly. In contrast the so - called conjugate gradient method needs at most n iterations using exact arithmetic to achieve the correct solution. This method is therefore efficient
and often used in applications.
Before we consider the methods we define a function f : Rn R by
1
f (x) = xT Ax xT b
2
(4.31)
for given symmetric positive definite A Rnn and given b Rn . In the following lemma we
show that the minimization of f is equivalent to solving Ax = b.
Lemma 4.10 x Rn is a solution of Ax = b if and only if x minimizes f , i.e. f (x ) =
minn f (x).
xR
Proof:
"": We have to show that f (x ) f (x + h) for each h Rn .
1
(x + h)T A(x + h) (x + h)T b
2
1 T
1
1
1
= (x ) Ax + hT Ax + (x )T Ah + hT Ah (x )T b hT b
2
2
2
2
1 T
T
= f (x ) + h Ax h b +
h Ah
2 | {z }
f (x + h) =
> 0 h6 = 0
f (x ) + h (Ax b) = f (x ).
T
54
4.3.1
The idea of the steepest descent method is quite simple. Starting from an initial guess x(0) Rn
we seek successively search directions d(k) Rn and k R such that the new iterations
x(k+1) := x(k) + k d(k) ,
k = 0, 1, 2, . . .
become closer to x . A natural approach for the search direction is to use the steepest descent
d(k) = f (x(k) ) = b Ax(k) .
With the residuals r(k) := b Ax(k) follows d(k) = r(k) .
For given x(k) and d(k) we determine k such that f (x(k+1) ) = f (x(k) + k d(k) ) is minimal,
i.e. we minimize f along the straight line x(k) + k d(k) with respect to k . Using the chain rule
leads to the necessary condition
T
(d(k) )T d(k)
(r(k) )T r(k)
=
.
(d(k) )T Ad(k)
(r(k) )T Ar(k)
Determine k =
Set x(k+1) =
Set k = k + 1 and determine r(k) = b Ax(k) ;
end while
become very slow. Moreover, one can prove the error estimate
kx
where kxkA :=
(k)
x kA
2 (A) 1
2 (A) + 1
k
kx(0) x kA ,
(4.32)
definite matrix A Rnn . If the condition number 2 (A) is large the fraction
2 (A)1
2 (A)+1
1
2 (A)
1+ 1(A)
2
4.3.2
55
The idea of the conjugate gradient method is to use a set of so-called $A$-conjugated vectors as search directions instead of the residuals $r^{(k)}$.
Definition 4.11 Let $A \in \mathbb{R}^{n\times n}$ be symmetric positive definite. A set of vectors $\{p^{(0)}, p^{(1)}, \ldots, p^{(l)}\}$ with $0 \neq p^{(j)} \in \mathbb{R}^n$ is called conjugated with respect to $A$ (or $A$-conjugated), if
\[
\big(p^{(i)}\big)^T A\, p^{(j)} = 0 \quad \text{for } i \neq j. \qquad (4.33)
\]
In the following we first assume that such a set $\{p^{(0)}, p^{(1)}, \ldots, p^{(l)}\}$ is given. Based on an old approximation $x^{(k)}$ we again set our new approximation as
\[
x^{(k+1)} = x^{(k)} + \alpha_k\, p^{(k)}, \qquad k = 0, \ldots, l, \qquad (4.34)
\]
similar to the steepest descent method. Starting with a given initial guess $x^{(0)} \in \mathbb{R}^n$ this leads to
\[
x^{(k+1)} = x^{(k)} + \alpha_k\, p^{(k)} = \ldots = x^{(0)} + \alpha_k\, p^{(k)} + \ldots + \alpha_0\, p^{(0)} = x^{(0)} + \sum_{i=0}^{k} \alpha_i\, p^{(i)}, \qquad k = 0, \ldots, l. \qquad (4.35)
\]
We now seek real numbers $\alpha_0, \ldots, \alpha_k$ such that
\[
f\big(x^{(k+1)}\big) = f\Big(x^{(0)} + \sum_{i=0}^{k} \alpha_i\, p^{(i)}\Big) =: g(\alpha_0, \ldots, \alpha_k)
\]
is minimized.
As necessary condition for a minimum of $g$ we obtain
\[
\begin{aligned}
\frac{\partial}{\partial \alpha_j}\, g(\alpha_0, \ldots, \alpha_k) &= \frac{\partial}{\partial \alpha_j}\, f\big(x^{(k+1)}\big) = \big(Ax^{(k+1)} - b\big)^T p^{(j)} = \Big(A\Big(x^{(0)} + \sum_{i=0}^{k} \alpha_i\, p^{(i)}\Big) - b\Big)^T p^{(j)} \\
&= \big(p^{(j)}\big)^T \big(Ax^{(0)} - b\big) + \sum_{i=0}^{k} \alpha_i\, \big(p^{(i)}\big)^T A\, p^{(j)}
= \big(p^{(j)}\big)^T \big(Ax^{(0)} - b\big) + \alpha_j\, \big(p^{(j)}\big)^T A\, p^{(j)} \overset{!}{=} 0, \qquad j = 0, 1, \ldots, k, \qquad (4.36)
\end{aligned}
\]
where the $A$-conjugacy (4.33) was used in the last step.
Since the vectors $p^{(j)}$, $j = 0, \ldots, k$, are $A$-conjugated, we obtain for fixed $j \in \{1, \ldots, k\}$ the relation
\[
\big(p^{(j)}\big)^T A\, x^{(j)} = \big(p^{(j)}\big)^T A \Big(x^{(0)} + \sum_{i=0}^{j-1} \alpha_i\, p^{(i)}\Big) = \big(p^{(j)}\big)^T A\, x^{(0)}. \qquad (4.37)
\]
Lemma 4.12 Let the coefficients $\alpha_0, \ldots, \alpha_{k-1}$ be chosen according to (4.36). Then the residual $r^{(k)} = b - Ax^{(k)}$ satisfies
\[
\big(r^{(k)}\big)^T p^{(j)} = -\big(Ax^{(k)} - b\big)^T p^{(j)} = -\frac{\partial}{\partial \alpha_j}\, g(\alpha_0, \ldots, \alpha_{k-1}) = 0 \qquad \text{for } j = 0, \ldots, k-1,
\]
i.e. $r^{(k)}$ is orthogonal to all previous search directions.
We come now to the construction of a set of $A$-conjugated vectors. We set $p^{(0)} := r^{(0)} = b - Ax^{(0)}$ and
\[
p^{(j+1)} := r^{(j+1)} + \beta_{j+1}\, p^{(j)}, \qquad j = 0, 1, 2, \ldots, \qquad (4.39)
\]
depending on the old search direction $p^{(j)}$. We determine $\beta_{j+1}$ such that $p^{(j)}$ and $p^{(j+1)}$ are $A$-conjugated, i.e.
\[
0 \overset{!}{=} \big(p^{(j)}\big)^T A\, p^{(j+1)} = \big(p^{(j)}\big)^T A\, r^{(j+1)} + \beta_{j+1}\, \big(p^{(j)}\big)^T A\, p^{(j)}. \qquad (4.40)
\]
This leads to
\[
\beta_{j+1} = -\,\frac{\big(p^{(j)}\big)^T A\, r^{(j+1)}}{\big(p^{(j)}\big)^T A\, p^{(j)}}. \qquad (4.41)
\]
Moreover, (4.36) together with (4.37) yields the step sizes
\[
\alpha_j = \frac{\big(p^{(j)}\big)^T r^{(j)}}{\big(p^{(j)}\big)^T A\, p^{(j)}}, \qquad j = 0, 1, 2, \ldots, \qquad (4.42)
\]
and in particular, because of $p^{(0)} = r^{(0)}$,
\[
\alpha_0 = \frac{\big(p^{(0)}\big)^T r^{(0)}}{\big(p^{(0)}\big)^T A\, p^{(0)}} = \frac{\big(r^{(0)}\big)^T r^{(0)}}{\big(p^{(0)}\big)^T A\, p^{(0)}}. \qquad (4.43)
\]
For the vectors constructed in this way one can show:
Lemma 4.13 Let $r^{(j)} \neq 0$ for $j = 0, \ldots, k$. Then
\[
\big(r^{(k)}\big)^T r^{(j)} = 0, \qquad j = 0, 1, \ldots, k-1,
\]
\[
\big(p^{(k)}\big)^T A\, p^{(j)} = 0, \qquad j = 0, 1, \ldots, k-1.
\]
Proof: This will be done by mathematical induction over $k$ in Exercise 41.
Provided that $p^{(i)} \neq 0$ for $i = 0, \ldots, l$, Lemma 4.13 implies that the set $\{p^{(0)}, \ldots, p^{(l)}\}$ is $A$-conjugated. Moreover we obtain the following corollary:
Corollary 4.14 If $\{p^{(0)}, \ldots, p^{(l)}\}$ is $A$-conjugated, then the vectors $p^{(i)}$, $i = 0, \ldots, l$, are linearly independent.
Proof: We start with $0 = \sum_{i=0}^{l} \lambda_i\, p^{(i)}$ for coefficients $\lambda_0, \ldots, \lambda_l \in \mathbb{R}$. Multiplying with $\big(p^{(j)}\big)^T A$ and using the $A$-conjugacy yields
\[
0 = \sum_{i=0}^{l} \lambda_i\, \big(p^{(j)}\big)^T A\, p^{(i)} = \lambda_j \underbrace{\big(p^{(j)}\big)^T A\, p^{(j)}}_{>\,0} \quad\Longrightarrow\quad \lambda_j = 0 \qquad \text{for } j = 0, \ldots, l,
\]
i.e. the vectors $p^{(0)}, \ldots, p^{(l)}$ are linearly independent.
With the help of Lemma 4.13 and (4.41) one can also obtain (cf. Exercise 41) the representation
\[
\beta_{j+1} = \frac{\big(r^{(j+1)}\big)^T r^{(j+1)}}{\big(r^{(j)}\big)^T r^{(j)}}, \qquad j = 0, 1, 2, \ldots. \qquad (4.44)
\]
Combining (4.42), (4.34), (4.43), (4.44) and (4.39) results in the conjugate gradient algorithm:
Algorithm 2 Conjugate gradient method
Require: initial guess $x^{(0)} \in \mathbb{R}^n$ and tolerance tol
Set $k = 0$, determine $r^{(0)} = b - Ax^{(0)}$ and set $p^{(0)} = r^{(0)}$;
while $\|r^{(k)}\| > $ tol do
  Determine $\alpha_k = \dfrac{(r^{(k)})^T r^{(k)}}{(p^{(k)})^T A\, p^{(k)}}$;
  Set $x^{(k+1)} = x^{(k)} + \alpha_k\, p^{(k)}$ and $r^{(k+1)} = r^{(k)} - \alpha_k\, A p^{(k)}$;
  Set $\beta_{k+1} = \dfrac{(r^{(k+1)})^T r^{(k+1)}}{(r^{(k)})^T r^{(k)}}$ and $p^{(k+1)} = r^{(k+1)} + \beta_{k+1}\, p^{(k)}$;
  Set $k = k + 1$;
end while
The essential computational costs per iteration step are one matrix-vector multiplication, three vector-vector products, three multiplications with a scalar and three additions of vectors.
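For comparison with the steepest descent sketch above, a direct transcription of Algorithm 2 (again only an illustrative Python sketch, not part of the original notes) needs only a handful of iterations on the same ill-conditioned example:

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Sketch of Algorithm 2 (conjugate gradient method) for s.p.d. A."""
    x = x0.astype(float).copy()
    r = b - A @ x
    p = r.copy()
    k = 0
    while np.linalg.norm(r) > tol and k < 10 * len(b):
        Ap = A @ p                          # the single matrix-vector product
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)    # representation (4.44)
        p = r_new + beta * p
        r = r_new
        k += 1
    return x, k

A = np.diag([1.0, 50.0, 400.0])             # same matrix as before
b = np.ones(3)
x, iters = conjugate_gradient(A, b, np.zeros(3))
print(iters, "iterations, error:", np.linalg.norm(x - np.linalg.solve(A, b)))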
Theorem 4.15 Let $A \in \mathbb{R}^{n\times n}$ be symmetric positive definite and $\{p^{(0)}, \ldots, p^{(n-1)}\}$ $A$-conjugated. Then the exact solution of $Ax = b$ for given $b \in \mathbb{R}^n$ is obtained after $n$ steps using exact arithmetic.
Proof: By Lemma 4.12 we know for $k = n$ that
\[
\big(r^{(n)}\big)^T p^{(j)} = 0 \qquad \text{for } j = 0, \ldots, n-1.
\]
Since $\{p^{(0)}, \ldots, p^{(n-1)}\}$ is linearly independent by Corollary 4.14 and thus a basis of $\mathbb{R}^n$, we can conclude $\big(r^{(n)}\big)^T e_i = 0$, i.e. $\big(Ax^{(n)} - b\big)^T e_i = 0$, for $i = 1, \ldots, n$. This leads to $Ax^{(n)} = b$, i.e. $x^{(n)}$ solves $Ax = b$.
It may happen that $Ax^{(k)} = b$ already for some $k < n$ in the conjugate gradient method. This happens precisely when $r^{(k)} = 0$ or, equivalently, $p^{(k)} = 0$ occurs during the algorithm. Theorem 4.15 means in particular that the conjugate gradient method needs at most $n$ iterations to obtain the exact solution of $Ax = b$, provided that rounding errors are excluded.
Moreover, similar to (4.32) one can prove
\[
\|x^{(k)} - x^*\|_A \le 2\left(\frac{\sqrt{\kappa_2(A)} - 1}{\sqrt{\kappa_2(A)} + 1}\right)^{k} \|x^{(0)} - x^*\|_A, \qquad (4.45)
\]
i.e. compared to (4.32) only the square root of the condition number $\kappa_2(A)$ enters this estimate.
Chapter 5

Nonlinear systems of equations

5.1 Fixed point iteration

The central result of this section is the following fixed point theorem.
Theorem 5.4 (Banach fixed point theorem) Let $E \subseteq \mathbb{R}^n$ be a closed set and let $\Phi : E \to E$ be a contraction, i.e. there exists a constant $0 \le L < 1$ with
\[
\|\Phi(x) - \Phi(y)\| \le L\,\|x - y\| \qquad \text{for all } x, y \in E.
\]
Then it holds:
a) There exists a unique fixed point $x^*$ of $\Phi$ in $E$, i.e. $x^* \in E$ with $\Phi(x^*) = x^*$.
b) For every initial guess $x^{(0)} \in E$ the fixed point iteration
\[
x^{(k+1)} := \Phi\big(x^{(k)}\big), \qquad k = 0, 1, 2, \ldots, \qquad (5.1)
\]
converges to $x^*$.
c) The error estimates
\[
\|x^{(k)} - x^*\| \le \frac{L}{1-L}\,\|x^{(k)} - x^{(k-1)}\| \quad \text{("a posteriori")}, \qquad (5.2)
\]
\[
\|x^{(k)} - x^*\| \le \frac{L^k}{1-L}\,\|x^{(1)} - x^{(0)}\| \quad \text{("a priori")} \qquad (5.3)
\]
hold for $k = 1, 2, \ldots$.

Proof:
The iterates satisfy
\[
\|x^{(j+1)} - x^{(j)}\| = \|\Phi(x^{(j)}) - \Phi(x^{(j-1)})\| \le L\,\|x^{(j)} - x^{(j-1)}\| \qquad (5.4)
\]
for all $j = k, \ldots, k + m - 1$, due to (5.1) and the property that $\Phi$ is contractive. In particular we get for $j = k$ the inequality
\[
\|x^{(k+1)} - x^{(k)}\| \le L\,\|x^{(k)} - x^{(k-1)}\| \le \ldots \le L^k\,\|x^{(1)} - x^{(0)}\| \qquad (5.5)
\]
for $k = 1, 2, \ldots$.
With the help of the triangle inequality and the geometric series this implies
\[
\begin{aligned}
\|x^{(k+m)} - x^{(k)}\| &= \Big\|\sum_{i=1}^{m} \big(x^{(k+i)} - x^{(k+i-1)}\big)\Big\| \le \sum_{i=1}^{m} \|x^{(k+i)} - x^{(k+i-1)}\| \le \sum_{j=k}^{k+m-1} L^{\,j-k+1}\, \|x^{(k)} - x^{(k-1)}\| \\
&= \|x^{(k)} - x^{(k-1)}\|\; L \sum_{j=0}^{m-1} L^{j}
= L\,\frac{1 - L^m}{1-L}\,\|x^{(k)} - x^{(k-1)}\| \le L^k\,\frac{1 - L^m}{1-L}\,\|x^{(1)} - x^{(0)}\| \to 0 \qquad (5.6)
\end{aligned}
\]
for $k \to \infty$. This means that $\big(x^{(k)}\big)_{k\in\mathbb{N}}$ is a Cauchy sequence in $E$. Since $E$ is closed in the Banach space $\mathbb{R}^n$, the limit $x^* = \lim_{k\to\infty} x^{(k)} \in E$ exists by Lemma 5.3.
Since $\Phi$ is continuous (being a contraction), the limit $x^*$ is a fixed point:
\[
x^* = \lim_{k\to\infty} x^{(k)} = \lim_{k\to\infty} \Phi\big(x^{(k-1)}\big) = \Phi\Big(\lim_{k\to\infty} x^{(k-1)}\Big) = \Phi(x^*).
\]
The fixed point is unique: for two fixed points $x^*, y^*$ the contraction property gives $\|x^* - y^*\| = \|\Phi(x^*) - \Phi(y^*)\| \le L\,\|x^* - y^*\|$, which forces $x^* = y^*$ because $L < 1$. Finally, letting $m \to \infty$ in (5.6) yields the error estimates
\[
\|x^{(k)} - x^*\| \le L\,\frac{1}{1-L}\,\|x^{(k)} - x^{(k-1)}\|, \qquad k = 1, 2, \ldots,
\]
\[
\|x^{(k)} - x^*\| \le L^k\,\frac{1}{1-L}\,\|x^{(1)} - x^{(0)}\|, \qquad k = 1, 2, \ldots,
\]
i.e. (5.2) and (5.3).
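The fixed point iteration (5.1) together with the a posteriori estimate (5.2) as a stopping criterion can be sketched in a few lines of Python (the contraction $\Phi(x) = \cos(x)$ on $[0,1]$ and all names are illustrative choices, not taken from the notes):

import numpy as np

def fixed_point_iteration(phi, x0, L, tol=1e-12, max_iter=1000):
    """Sketch of the fixed point iteration (5.1); L is a known bound
    for the contraction constant, used in the a posteriori estimate (5.2)."""
    x_old = np.asarray(x0, dtype=float)
    for k in range(1, max_iter + 1):
        x_new = phi(x_old)
        # a posteriori: ||x^(k) - x*|| <= L/(1-L) * ||x^(k) - x^(k-1)||
        if L / (1.0 - L) * np.linalg.norm(x_new - x_old) < tol:
            return x_new, k
        x_old = x_new
    return x_old, max_iter

# Phi(x) = cos(x) maps [0, 1] into itself with |Phi'(x)| <= sin(1) < 1
# (cf. Lemma 5.5 below), so the Banach fixed point theorem applies.
x_star, k = fixed_point_iteration(np.cos, 0.5, L=np.sin(1.0))
print("fixed point:", x_star, "after", k, "iterations")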
Lemma 5.5 Let $E \subseteq \mathbb{R}^n$ be a convex set, i.e. $x, y \in E$ implies $tx + (1-t)y \in E$ for all $t \in [0,1]$, and let the function $\Phi : E \to \mathbb{R}^n$ be continuously differentiable. Furthermore let $L := \sup_{x \in E} \|\Phi'(x)\| < 1$. Then $\Phi$ is contractive on $E$ with contraction constant $L$.
Proof: For $x, y \in E$ define $g(t) := \Phi\big(tx + (1-t)y\big)$ for $t \in [0,1]$. By the chain rule
\[
g'(t) = \Phi'\big(tx + (1-t)y\big)(x - y). \qquad (5.7)
\]
With the fundamental theorem of calculus we obtain
\[
\|\Phi(x) - \Phi(y)\| = \|g(1) - g(0)\| = \Big\|\int_0^1 g'(t)\, dt\Big\| \le \sup_{t\in[0,1]} \big\|\Phi'\big(tx + (1-t)y\big)(x - y)\big\| \le L\,\|x - y\|,
\]
i.e. $\Phi$ is a contraction on $E$.
5.2 Newton's method
In the following we consider Newton's method, which is of great importance in practice. Let $f : \mathbb{R}^n \to \mathbb{R}^n$ be a continuously differentiable function. The aim of Newton's method is to find roots of $f$, i.e. we seek $x^* \in \mathbb{R}^n$ with $f(x^*) = 0$.
We start with the case $n = 1$ as motivation: By Taylor expansion (of order 1) about the point $x^{(k)}$ we obtain
\[
f(x) \approx f\big(x^{(k)}\big) + f'\big(x^{(k)}\big)\big(x - x^{(k)}\big) \qquad (5.8)
\]
in a (small) neighborhood of $x^{(k)}$. The right-hand side of (5.8) is the tangent line of $f$ at $x^{(k)}$. The idea of Newton's method is now to use the root of (5.8) as new approximation, which leads to
\[
x^{(k+1)} = x^{(k)} - \frac{f\big(x^{(k)}\big)}{f'\big(x^{(k)}\big)}, \qquad k = 0, 1, 2, \ldots \qquad (5.9)
\]
Figure 5.1: Newton's method for $f(x) = \ln(x)$ and $x^{(0)} = 0.2$
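The iterates shown in Figure 5.1 can be reproduced with a few lines of Python (a sketch only; the function and starting value are those of the figure, everything else is an illustrative choice):

import math

f = math.log                 # f(x) = ln(x), root x* = 1
df = lambda x: 1.0 / x       # f'(x) = 1/x

x = 0.2                      # x^(0) as in Figure 5.1
for k in range(5):
    x = x - f(x) / df(x)     # Newton step (5.9): x - x*ln(x)
    print(k + 1, x)          # approx. 0.522, 0.861, 0.990, 0.99995, ...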
For the general $n$-dimensional case we analogously use the root of the Taylor expansion (of order 1) as new iterate. This leads to
\[
f'\big(x^{(k)}\big)\, \underbrace{\big(x^{(k+1)} - x^{(k)}\big)}_{=:\ \delta^{(k)}} = -f\big(x^{(k)}\big), \qquad (5.10)
\]
where $f'(x^{(k)}) \in \mathbb{R}^{n\times n}$ now denotes the Jacobian matrix of $f$ at $x^{(k)}$. If $f'(x^{(k)})$ is regular, we obtain a unique $\delta^{(k)} \in \mathbb{R}^n$ by solving the linear system of equations (5.10). Afterwards we get the new approximation as $x^{(k+1)} = x^{(k)} + \delta^{(k)}$.
Algorithm 3 Newton's method
Require: continuously differentiable function $f : \mathbb{R}^n \to \mathbb{R}^n$, initial guess $x^{(0)} \in \mathbb{R}^n$ and tolerance tol
Set $k = 0$;
while $\|f(x^{(k)})\| > $ tol do
  Solve $f'(x^{(k)})\, \delta^{(k)} = -f(x^{(k)})$;
  Set $x^{(k+1)} = x^{(k)} + \delta^{(k)}$ and $k = k + 1$;
end while
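Algorithm 3 translates almost verbatim into code. The following Python sketch (the $2\times 2$ test problem and all names are illustrative assumptions, not part of the original notes) solves the Newton system in each step with a standard linear solver:

import numpy as np

def newton(f, jacobian, x0, tol=1e-12, max_iter=50):
    """Sketch of Algorithm 3 (Newton's method) for f: R^n -> R^n."""
    x = np.asarray(x0, dtype=float)
    k = 0
    while np.linalg.norm(f(x)) > tol and k < max_iter:
        delta = np.linalg.solve(jacobian(x), -f(x))  # solve f'(x) delta = -f(x)
        x = x + delta
        k += 1
    return x, k

# Illustrative test problem: intersection of a circle and a parabola.
f = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0, x[1] - x[0]**2])
jacobian = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]],
                               [-2.0 * x[0], 1.0]])
x, k = newton(f, jacobian, np.array([1.0, 1.0]))
print("root:", x, "after", k, "iterations, residual:", np.linalg.norm(f(x)))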
Remark 5.6 Applying Newton's method to $f(x) = Ax - b$ for given $A \in \mathbb{R}^{n\times n}$ and $b \in \mathbb{R}^n$ leads with $f'(x) = A$ to $A\,\delta^{(0)} = -(Ax^{(0)} - b)$ and $x^{(1)} = x^{(0)} + \delta^{(0)}$ in the first Newton step ($k = 0$). Due to
\[
f\big(x^{(1)}\big) = Ax^{(1)} - b = Ax^{(0)} + A\,\delta^{(0)} - b = Ax^{(0)} - \big(Ax^{(0)} - b\big) - b = 0,
\]
$x^{(1)}$ is a root of $f$, i.e. we have obtained the exact solution of $Ax = b$ after one Newton step.
In the following we prove that Newton's method converges (locally quadratically) to a root $x^*$ of $f$, if one chooses the initial guess $x^{(0)}$ sufficiently close to $x^*$. Before we prove this convergence result, we state that Newton's method
\[
x^{(k+1)} = x^{(k)} - \big(f'(x^{(k)})\big)^{-1} f\big(x^{(k)}\big) \qquad (5.11)
\]
can be interpreted as a fixed point iteration $x^{(k+1)} = \Phi\big(x^{(k)}\big)$ with the iteration function
\[
\Phi(x) := x - \big(f'(x)\big)^{-1} f(x). \qquad (5.12)
\]
Definition 5.7 Let $(x_k)_{k\in\mathbb{N}}$ be a sequence in a normed space $V$ converging to $x^* \in V$. We call the sequence convergent with order $p \ge 1$, if there exists a positive constant $C > 0$ such that
\[
\|x_{k+1} - x^*\|_V \le C\,\|x_k - x^*\|_V^{p} \qquad \text{for all } k \in \mathbb{N}.
\]
In the case $p = 1$ we require $C < 1$ for convergence and say that the sequence converges linearly. For $p = 2$ we say that the sequence converges quadratically.
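The order of convergence can also be estimated experimentally from successive errors via $p \approx \log(e_{k+1}/e_k)/\log(e_k/e_{k-1})$. The following small Python sketch (a hypothetical example, Newton's method for $x^2 - 2 = 0$, chosen here only for illustration) typically prints values close to $2$:

import math

x_star = math.sqrt(2.0)
errors = []
x = 1.0
for _ in range(4):
    x = x - (x * x - 2.0) / (2.0 * x)        # Newton step for f(x) = x^2 - 2
    errors.append(abs(x - x_star))           # error e_k = |x_k - x*|

# experimental order p ~ log(e_{k+1}/e_k) / log(e_k/e_{k-1})
for k in range(1, len(errors) - 1):
    p = math.log(errors[k + 1] / errors[k]) / math.log(errors[k] / errors[k - 1])
    print("estimated order:", p)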
Theorem 5.8 Let $D \subseteq \mathbb{R}^n$ be open and convex, $\|\cdot\|$ a norm on $\mathbb{R}^n$ (with induced matrix norm $\|\cdot\|$), $f \in C^1(D, \mathbb{R}^n)$, $f'(x)$ invertible for all $x \in D$, $\|(f'(x))^{-1}\| \le \beta$ for all $x \in D$ and $f'$ Lipschitz continuous in $D$, i.e. there exists $\gamma > 0$ with
\[
\|f'(x) - f'(y)\| \le \gamma\,\|x - y\| \qquad \text{for all } x, y \in D. \qquad (5.13)
\]
Moreover we assume that there exists $x^*$ with $f(x^*) = 0$ and choose $\rho > 0$ and $x^{(0)} \in E := K_\rho(x^*) := \{x \in \mathbb{R}^n : \|x - x^*\| \le \rho\}$ with $E \subseteq D$, where $\rho$ is so small that
\[
\beta\gamma\rho + \beta^{2}\gamma\rho\,\big(2\gamma\rho + \|f'(x^{(0)})\|\big) < 1.
\]
Then the sequence of Newton iterates $x^{(k)}$ converges to $x^*$ and it holds
\[
\|x^{(k+1)} - x^*\| \le \frac{\beta\gamma}{2}\,\|x^{(k)} - x^*\|^2, \qquad (5.14)
\]
i.e. the sequence converges quadratically.
Proof:
We know that Newton's method can be described by $x^{(k+1)} = \Phi(x^{(k)})$ with $\Phi$ given by (5.12). The idea of the proof is now to show that the conditions of Theorem 5.4 are satisfied; then the sequence $x^{(k)}$ converges to $x^*$. We also know that $E$ is (as a closed ball about $x^*$) a closed set. It remains to show that $\Phi$ maps from $E$ to $E$ and is contractive.
As a first intermediate result we prove that
\[
\|f(y) - f(x) - f'(x)(y - x)\| \le \frac{\gamma}{2}\,\|y - x\|^2 \qquad \text{for all } x, y \in D. \qquad (5.15)
\]
We define $g(t) = f\big(ty + (1-t)x\big)$ for $t \in [0,1]$ and obtain, similar to (5.7), the derivative $g'(t) = f'\big(ty + (1-t)x\big)(y - x)$. We use again the fundamental theorem of calculus and obtain
\[
f(y) - f(x) = g(1) - g(0) = \int_0^1 g'(t)\, dt = \int_0^1 f'\big(ty + (1-t)x\big)(y - x)\, dt.
\]
Subtracting $f'(x)(y - x) = \int_0^1 f'(x)(y - x)\, dt$ leads to
\[
f(y) - f(x) - f'(x)(y - x) = \int_0^1 \big(f'(ty + (1-t)x) - f'(x)\big)(y - x)\, dt.
\]
Using the general inequality $\|\int_a^b h(t)\, dt\| \le \int_a^b \|h(t)\|\, dt$, valid for an arbitrary continuous function $h : \mathbb{R} \to \mathbb{R}^n$, and the assumption (5.13) leads to
\[
\|f(y) - f(x) - f'(x)(y - x)\| = \Big\|\int_0^1 \big(f'(ty + (1-t)x) - f'(x)\big)(y - x)\, dt\Big\| \le \int_0^1 \gamma\, t\,\|y - x\|^2\, dt = \frac{\gamma}{2}\,\|y - x\|^2,
\]
i.e. (5.15) holds.
For $x, y \in E$ we now write
\[
\Phi(x) - \Phi(y) = \big(f'(x)\big)^{-1}\big(f(y) - f(x) - f'(x)(y - x)\big) + \big((f'(y))^{-1} - (f'(x))^{-1}\big) f(y),
\]
and with $\|(f'(x))^{-1}\| \le \beta$, (5.15), (5.13) and $(f'(y))^{-1} - (f'(x))^{-1} = (f'(y))^{-1}\big(f'(x) - f'(y)\big)(f'(x))^{-1}$ we obtain
\[
\|\Phi(x) - \Phi(y)\| \le \frac{\beta\gamma}{2}\,\|y - x\|^2 + \beta^2\gamma\,\|f(y)\|\,\|y - x\|
= \Big(\frac{\beta\gamma}{2}\,\|y - x\| + \beta^2\gamma\,\|f(y)\|\Big)\,\|y - x\| \qquad (5.16)
\]
for $x, y \in E$.
In particular we obtain for the choice $y = x^*$ (i.e. $f(y) = f(x^*) = 0$) and arbitrary $x \in E$
\[
\|\Phi(x) - x^*\| = \|\Phi(x) - \Phi(x^*)\| \le \frac{\beta\gamma}{2}\,\|x - x^*\|^{2} = \underbrace{\frac{\beta\gamma\,\|x - x^*\|}{2}}_{\le\,\beta\gamma\rho/2\,<\,1}\,\|x - x^*\| \le \|x - x^*\| \le \rho, \qquad (5.17)
\]
i.e. $\Phi$ maps from $E$ to $E$. Furthermore, for $y \in E$ we can estimate
\[
\|f(y)\| = \|f(y) - f(x^*)\| \le \big(\|f'(x^{(0)})\| + 2\gamma\rho\big)\,\|y - x^*\| \le \big(\|f'(x^{(0)})\| + 2\gamma\rho\big)\,\rho, \qquad (5.18)
\]
since $\|f'(z)\| \le \|f'(x^{(0)})\| + \gamma\,\|z - x^{(0)}\| \le \|f'(x^{(0)})\| + 2\gamma\rho$ for all $z \in E$ by (5.13). Together with $\|y - x\| \le 2\rho$ and (5.16) this gives
\[
\|\Phi(x) - \Phi(y)\| \le \Big(\frac{\beta\gamma}{2}\,\|y - x\| + \beta^{2}\gamma\,\|f(y)\|\Big)\,\|y - x\|
\le \underbrace{\Big(\beta\gamma\rho + \beta^{2}\gamma\rho\,\big(2\gamma\rho + \|f'(x^{(0)})\|\big)\Big)}_{<\,1}\,\|x - y\|
\]
for all $x, y \in E$ by assumption, i.e. we know that $\Phi$ is a contractive mapping. Thus all assumptions of the Banach fixed point theorem (Theorem 5.4) are satisfied, such that the sequence $x^{(k)}$ of Newton iterates converges to $x^*$ in $E$. Inserting $x^{(k)}$ as $x$ in (5.17) leads directly to (5.14).
Theorem 5.8 is a local convergence result. The problem is that it is only valid if the initial guess $x^{(0)}$ is sufficiently close to an exact root $x^*$ of $f$. Otherwise the method can diverge. For instance, applying Newton's method to the function $f(x) = \arctan(x)$ with $x^{(0)} = 1.5$ leads to divergence (see Table 5.1).
$x^{(0)}$    $x^{(1)}$    $x^{(2)}$    $x^{(3)}$    $x^{(4)}$    $x^{(5)}$                 $x^{(6)}$
1.5000       -1.6941      2.3211       -5.1141      32.2957      $-1.5753 \cdot 10^{3}$    $3.8950 \cdot 10^{6}$

Table 5.1: Divergence of Newton's method for $f(x) = \arctan(x)$ with $x^{(0)} = 1.5$
A remedy for this problem can be achieved through so-called globalization strategies, e.g. damping strategies or trust region methods. A damping strategy that is often used in practice is the line-search strategy, which leads to Algorithm 4.
$x^{(1)}$    $x^{(2)}$                 $x^{(3)}$                    $x^{(4)}$
-0.0970      $6.0806 \cdot 10^{-4}$    $-1.4988 \cdot 10^{-10}$     0

Table 5.2: Convergence of Algorithm 4 for $f(x) = \arctan(x)$ with $x^{(0)} = 1.5$, $\lambda = 10^{-3}$
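Algorithm 4 itself is not reproduced here. The following Python sketch shows one common variant of such a line-search damping (plain step-halving of the Newton correction until the residual norm decreases, with a minimal damping factor of $10^{-3}$); the precise rule used in the original notes may differ. Applied to $f(x) = \arctan(x)$ with $x^{(0)} = 1.5$ it converges, in contrast to the undamped iteration of Table 5.1:

import numpy as np

def damped_newton(f, jacobian, x0, tol=1e-12, lam_min=1e-3, max_iter=50):
    """Hedged sketch of a damped Newton method with step-halving line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx) <= tol:
            break
        delta = np.linalg.solve(jacobian(x), -fx)   # full Newton correction
        lam = 1.0
        # halve the step until the residual norm decreases (or lam_min is reached)
        while np.linalg.norm(f(x + lam * delta)) >= np.linalg.norm(fx) and lam > lam_min:
            lam *= 0.5
        x = x + lam * delta
    return x

f = lambda x: np.array([np.arctan(x[0])])
jacobian = lambda x: np.array([[1.0 / (1.0 + x[0] ** 2)]])
print(damped_newton(f, jacobian, [1.5]))   # converges to the root x* = 0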