
Numerical Linear Algebra I

Jörg Liesen - TU Berlin

Winter Semester 2016/2017


“Thus finite linear systems stand at the heart of all mathematical computa-
tion. Moreover, as science and technology develop, and computers become
more powerful, systems to be handled become larger and require techniques
that are more refined and efficient.”

“In fact, our subject is more than just vectors and matrices, for virtually every-
thing we do carries over to functions and operators. Numerical linear algebra
is really functional analysis, but with the emphasis always on practical algo-
rithmic ideas rather than mathematical technicalities.”

The two quotes above reflect two main characteristics of numerical linear algebra: As a
mathematical field it is closely related to functional analysis, and one of its major driving
forces is the practical requirement to solve linear algebraic problems of rapidly increasing
sizes. The quotes are taken from two excellent books, one modern and one classical, on
numerical linear algebra:

• L. N. Trefethen and D. Bau III, Numerical Linear Algebra, SIAM, 1997,

• A. S. Householder, The Theory of Matrices in Numerical Analysis, Blaisdell, 1964.

Can you guess which quote is taken from which book?


Further excellent books on which this course is based are:

• N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed., SIAM, 2002,

• G. W. Stewart and J.-g. Sun, Matrix Perturbation Theory, Academic Press, 1990.

Many thanks to Carlos Echeverrı́a Serur and Luis Garcı́a Ramos for typing my hand-
written notes in the Winter Semester 2014/2015. Their work forms the basis of the current
version. Thanks also to Davide Fantin, Alexander Hopp, Mathias Klare, Thorsten Lucke,
Ekkehard Schnoor, Olivier Sète, and Jan Zur for careful reading of previous versions and
providing corrections. Please send further corrections to me at liesen@[Link].

Jörg Liesen, Berlin, February 21, 2017

Contents

0 Matrices: Basic Definitions and Matrix Classes 4

1 A Survey of Matrix Decompositions 8

2 Perturbation Theory 20
2.1 Norms, errors, and conditioning . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Perturbation results for matrices and linear algebraic systems . . . . . . . 25

3 Direct Methods for Solving Linear Algebraic Systems 35


3.1 Basics of rounding error analysis . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Stability and cost of the Cholesky decomposition . . . . . . . . . . . . . . 38
3.3 Computing the LU decomposition . . . . . . . . . . . . . . . . . . . . . . . 43

4 Iterative Methods for Solving Linear Algebraic Systems 49


4.1 Classical iterative methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Projection methods based on Krylov subspaces . . . . . . . . . . . . . . . . 53

Chapter 0

Matrices: Basic Definitions and Matrix Classes

Numerical Linear Algebra is concerned with the numerical solution of linear algebraic
problems involving matrices. Major examples are

• linear algebraic systems, Ax = b,

• eigenvalue problems, Ax = λx,

• generalized eigenvalue problems, Ax = λBx,

• least squares problems, minx kb − Axk2 ,

• computing functions of matrices f (A).

In this lecture we will (mostly) consider complex matrices, or matrices over C, i.e.,
matrices of the form  
$$
A = [a_{ij}] = \begin{bmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{bmatrix} \in \mathbb{C}^{n\times m}
$$
with aij ∈ C, i = 1, . . . , n, j = 1, . . . , m. If m = 1 we will usually write Cn (instead of
Cn×1 ). If n = m, we have a square matrix A.
The matrix In := [δij ] ∈ Cn×n is called the identity matrix. Here
$$
\delta_{ij} := \begin{cases} 1, & i = j, \\ 0, & i \neq j, \end{cases}
$$

is the Kronecker delta. When the size is clear or irrelevant we write I. The matrix
0n×m := [0] ∈ Cn×m is called the zero matrix. Usually we just write 0.

If A = [aij ] ∈ Cn×n satisfies
$$
a_{ij} = 0 \ \text{ for } \ \begin{cases} i \neq j \\ i > j \\ i < j \end{cases} \quad \text{then } A \text{ is called} \quad \begin{cases} \text{diagonal,} \\ \text{upper triangular,} \\ \text{lower triangular.} \end{cases}
$$

We sometimes write diagonal matrices as A = diag(a11 , . . . , ann ). An upper or lower


triangular matrix with aii = 1 for i = 1, . . . , n is called unit upper or lower triangular,
respectively.
For A = [aij ] ∈ Cn×m the matrices

$$
A^T = [b_{ij}] \in \mathbb{C}^{m\times n} \ \text{with} \ b_{ij} := a_{ji}, \qquad \text{and} \qquad A^H = [b_{ij}] \in \mathbb{C}^{m\times n} \ \text{with} \ b_{ij} := \overline{a_{ji}}
$$

are called the transpose and Hermitian transpose of A. If A = AT or A = AH , then A is


called symmetric or Hermitian, respectively. Note that if A ∈ Cn×n is Hermitian, then

$$
x^H A x = x^H A^H x = (x^H A x)^H = \overline{x^H A x},
$$

and thus xH Ax ∈ R for all x ∈ Cn .


If A ∈ Cn×n is Hermitian and

xH Ax > 0 for all x ∈ Cn \ {0}, or


xH Ax ≥ 0 for all x ∈ Cn ,

then A is called Hermitian positive definite (HPD) or Hermitian positive semidefinite


(HPSD), respectively. If the reverse inequalities hold, A is Hermitian negative (semi)definite.
If A is HPD (HPSD), we write A > 0 (A ≥ 0). If A, B are Hermitian and A − B is
HPD (HPSD), we write A > B (A ≥ B).
Proposition 0.1. The ordering “≥” defines a Löwner partial ordering on the set of Her-
mitian matrices, i.e., for all Hermitian matrices A, B, C ∈ Cn×n it satisfies:
(1) A ≥ A.

(2) If A ≥ B and B ≥ A, then A = B.

(3) If A ≥ B and B ≥ C, then A ≥ C.


Proof. (1) holds since the matrix A − A = 0 is HPSD. In order to prove (2), suppose that
A ≥ B and B ≥ A. Let D := A − B = [dij ] and let {e1 , . . . , en } be the standard basis of
Cn . For each j = 1, . . . , n we then have

eTj (A − B)ej = eTj Dej = djj ≥ 0, and eTj (B − A)ej = −eTj Dej = −djj ≥ 0,

so that $d_{jj} = 0$ for j = 1, . . . , n. Moreover, since D is Hermitian, we have $d_{ji} = \overline{d_{ij}}$ for all
i, j = 1, . . . , n. Thus,

(ei + ej )T (A − B)(ei + ej ) = (ei + ej )T D(ei + ej ) = dii + dij + dji + djj = 2 Re(dij ) ≥ 0,


(ei + ej )T (B − A)(ei + ej ) = −(ei + ej )T D(ei + ej ) = −2 Re(dij ) ≥ 0,

giving Re(dij ) = 0 for all i, j = 1, . . . , n. A similar argument shows that also Im(dij ) = 0
for all i, j = 1, . . . , n, so that in fact D = A − B = 0, or A = B.
It remains to show (3). If A ≥ B and B ≥ C, then for each x ∈ Cn we have

xH (A − C)x = xH (A − B + B − C)x = xH (A − B)x + xH (B − C)x ≥ 0,

and hence A ≥ C.
In the next lemma we collect some useful properties of HPSD matrices.

Lemma 0.2. If A ∈ Cn×n is HPSD, then the following assertions hold:

(1) λ ≥ 0 for every eigenvalue λ of A.

(2) det(A) ≥ 0.

(3) X H AX is HPSD for all X ∈ Cn×k with rank(X) = k.


 
(4) $A(1{:}k,1{:}k) := \begin{bmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & \ddots & \vdots \\ a_{k1} & \cdots & a_{kk} \end{bmatrix}$ is HPSD for all k = 1, . . . , n.

Proof. (1) If Ax = λx, x 6= 0, then 0 ≤ xH Ax = λxH x, giving λ ≥ 0.

(2) This follows from (1) since det(A) is equal to the product of the eigenvalues of A.

(3) Let y ∈ Cn \{0} and write z := Xy, then z 6= 0 since rank(X) = k, and y H (X H AX)y =
z H Az ≥ 0 since A is HPSD.

(4) This follows from (3) using X = [e1 , . . . , ek ] ∈ Cn×k .

All assertions in Lemma 0.2 also hold for HPD matrices A ∈ Cn×n , when “≥” is replaced
by “>” in (1) and (2).
If for A ∈ Cn×n there exists a matrix B ∈ Cn×n with AB = BA = In , then A is called
nonsingular. Otherwise A is called singular. It is well known that A is nonsingular if and
only if det(A) 6= 0, which holds if and only if rank(A) = n.

Lemma 0.3. For every A ∈ Cn×n the following assertions hold:

(1) If A is nonsingular, then there exists only one matrix B ∈ Cn×n with AB = BA = In .
We call this matrix the inverse of A and denote it by A−1 .

(2) If AB = In or BA = In holds for some B ∈ Cn×n , then A is nonsingular and


B = A−1 .

Proof. (1) Suppose that AB = BA = In and AC = CA = In . Then C = CIn = CAB =


In B = B.
(2) If AB = In , then n = rank(In ) = rank(AB) ≤ rank(A). Hence A is nonsingular
with a unique inverse A−1 . Then A−1 = A−1 In = A−1 AB = In B = B. A similar argument
applies when BA = In .
Item (2) of Lemma 0.3 shows that only one of the equations AB = In or BA = In
needs to be verified in order to show that a given matrix B ∈ Cn×n is the (unique) inverse
of a nonsingular matrix A ∈ Cn×n .
If A ∈ Cn×n satisfies AH A = In , then A is called unitary. Item (2) of Lemma 0.3
implies that A is nonsingular with A−1 = AH , and that also AAH = In .
If we write A = [a1 , . . . , an ] ∈ Cn×n with aj ∈ Cn , j = 1 . . . , n, then AH A = [aH
i aj ] =
In = [δij ] means that the n columns of A are pairwise orthonormal with respect to the
Euclidean inner product on Cn , which is defined by

(v, w) := wH v for all v, w ∈ Cn .

Hence A ∈ Cn×n is unitary if and only if the columns a1 , . . . , an form an orthonormal basis
of Cn with respect to the inner product (·, ·). The equation AAH = In means the same
holds for the (transposed) rows of A.
If A ∈ Cn×m with n > m satisfies AH A = Im , then A has pairwise orthonormal columns
(with respect to (·, ·)), but A is not a unitary matrix. In this case P := AAH is a projection
(i.e., P 2 = P ) with rank(P ) = m.
If A ∈ Cn×n satisfies AH A = AAH then A is called normal. Note that Hermitian and
unitary matrices are normal.

Chapter 1

A Survey of Matrix Decompositions

In the January/February 2000 issue of Computing in Science & Engineering, a joint pub-
lication of the American Institute of Physics and the IEEE Computer Society, a list of
the “Top Ten Algorithms of the Century” was published. Among the Top Ten Algorithms
(ordered by year, so there is no “No. 1 Algorithm”) is

“1951: Alston Householder of Oak Ridge National Laboratory formalizes the


decompositional approach to matrix computations.”

In his introduction to the topic, Stewart [18] wrote that

“the introduction of matrix decomposition into numerical linear algebra revo-


lutionized matrix computations1 ”.

A matrix decomposition is nothing but a factorization of the original matrix into “sim-
pler” factors. Householder started with a systematical analysis of methods for inverting
matrices (or solving linear algebraic systems) from the viewpoint of matrix decomposition
in 1950 [7]. In 1957 he wrote [8]:

“Most, if not all closed [i.e. direct] methods can be classified as methods of
factorizations and methods of modification [...] these methods of factorization
aim to express A as a product of two factors, each of which is readily inverted,
or, equivalently, to find matrices P and Q such that P A = Q and Q is easily
inverted.”

Major advantages of the decompositional approach as stated by Stewart [18] are:


1
Stewart and many others use the term matrix computations as a synonym for numerical linear algebra.
This has been done at least since Stewart’s book Introduction to Matrix Computations, Academic Press,
1973. Stewart wrote in 1987 [16, p. 211]: “It is customary to identify the beginnings of modern numerical
linear algebra with the introduction of the digital computer in the mid nineteen forties. ... Wherever one
chooses to place the beginnings of matrix computations, it is certain that by the mid forties it had entered
an expansive phase, from which it has not yet emerged.” (My emphasis.)

• A matrix decomposition, which is generally expensive to compute, can be reused to
solve new problems involving the original matrix.
• The decompositional approach often shows that apparently different algorithms are
actually computing the same object.
• The decompositional approach facilitates rounding error analysis.
• Many matrix decompositions can be updated, sometimes with great savings in com-
putation.
• By focusing on a few decompositions instead of a host of specific problems, software
developers have been able to produce highly effective matrix packages.
We will now discuss several important matrix decompositions from a mathematical point
of view, i.e., we will be mostly interested in their existence and uniqueness. In later chap-
ters we will derive algorithms for computing the decompositions, analyze their numerical
stability, and apply them in order to solve problems of numerical linear algebra.
Theorem 1.1 (LU decomposition). The following assertions are equivalent for every ma-
trix A ∈ Cn×n :
(1) There exist a unit lower triangular matrix L ∈ Cn×n , a nonsingular diagonal matrix
D ∈ Cn×n , and a unit upper triangular matrix U ∈ Cn×n , such that
A = LDU.

(2) For each k = 1, . . . , n the matrix A(1 : k, 1 : k) ∈ Ck×k is nonsingular.


If (2) holds, then there exists only one set of matrices L, D, U with the properties stated in
(1) and A = LDU .
Proof. (1) =⇒ (2): If A = LDU , where L, D, U have the stated properties, we can consider
any fixed k between 1 and n, and partition
     
$$
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
  = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix}
    \begin{bmatrix} D_{11} & 0 \\ 0 & D_{22} \end{bmatrix}
    \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix},
$$

where A11 = A(1 : k, 1 : k) ∈ Ck×k . We see that A11 = L11 D11 U11 , and since L11 and U11
are unit lower and upper triangular, respectively, we obtain
det(A11 ) = det(L11 ) det(D11 ) det(U11 ) = det(D11 ) 6= 0.
(2) =⇒ (1): Induction on n. For n = 1 we can take L = [1], D = [a11 ], U = [1]. Now
suppose that the statement is true for all matrices up to order n − 1 for some n ≥ 2 and
let A ∈ Cn×n . With the nonsingular matrix A11 := A(1 : n − 1, 1 : n − 1) we can write
$$
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
  = \begin{bmatrix} I_{n-1} & 0 \\ A_{21}A_{11}^{-1} & 1 \end{bmatrix}
    \begin{bmatrix} A_{11} & 0 \\ 0 & s \end{bmatrix}
    \begin{bmatrix} I_{n-1} & A_{11}^{-1}A_{12} \\ 0 & 1 \end{bmatrix}
  =: L_n \begin{bmatrix} A_{11} & 0 \\ 0 & s \end{bmatrix} U_n,
$$
where $s := A_{22} - A_{21}A_{11}^{-1}A_{12} \in \mathbb{C}$ is the Schur complement of $A_{11}$ in $A$. From
$$
0 \neq \det(A) = \underbrace{\det(L_n)}_{=1}\,\underbrace{\det(A_{11})}_{\neq 0}\; s\;\underbrace{\det(U_n)}_{=1},
$$
we get $s \neq 0$. By the induction hypothesis, the matrix $A_{11}$ has a factorization $A_{11} = L_{n-1}D_{n-1}U_{n-1}$, where $L_{n-1}$, $D_{n-1}$ and $U_{n-1}$ have the required properties. Then
$$
A = \underbrace{L_n \begin{bmatrix} L_{n-1} & 0 \\ 0 & 1 \end{bmatrix}}_{=:L}\;
    \underbrace{\begin{bmatrix} D_{n-1} & 0 \\ 0 & s \end{bmatrix}}_{=:D}\;
    \underbrace{\begin{bmatrix} U_{n-1} & 0 \\ 0 & 1 \end{bmatrix} U_n}_{=:U}
$$

is the required decomposition.


Finally, suppose that $A = L_1 D_1 U_1 = L_2 D_2 U_2$ with $L_k = [l_{ij}^{(k)}]$ and $U_k = [u_{ij}^{(k)}]$, $k = 1, 2$, unit lower and upper triangular, respectively. Then $L_2^{-1} L_1 D_1 = D_2 U_2 U_1^{-1}$ is lower and upper triangular, and hence diagonal. From $l_{ii}^{(1)} = l_{ii}^{(2)} = 1 = u_{ii}^{(1)} = u_{ii}^{(2)}$ and the structure of the matrices we immediately obtain $L_2^{-1} L_1 = I_n$ and $U_1 U_2^{-1} = I_n$, hence $L_1 = L_2$, $U_1 = U_2$, and $D_1 = D_2$.
A simple example where the condition (2) in Theorem 1.1 does not hold is given by the
matrix
$$
A = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}.
$$
This matrix does not have an LU decomposition. However, if we exchange (i.e., permute) the rows of A, then the resulting matrix
$$
PA = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}, \qquad P := \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},
$$

has an (obvious) LU decomposition. Allowing row exchanges is also important for the nu-
merical stability of numerical methods for computing an LU decomposition; see Chapter 3.
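In MATLAB the row exchange is performed automatically by lu, which returns a permutation matrix P with PA = LU; the following short sketch (not from the notes) uses the 2 × 2 example above:

```matlab
% lu with partial pivoting swaps the rows of the example matrix.
A = [0 1; 1 1];
[L, U, P] = lu(A);
P                   % here P = [0 1; 1 0], i.e., the two rows are exchanged
norm(P*A - L*U)     % P*A = L*U up to roundoff
```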

Corollary 1.2 (LDLH decomposition.). If A = AH ∈ Cn×n and A(1 : k, 1 : k) is non-


singular for all k = 1, . . . , n, then there exist a uniquely determined unit lower triangular
matrix L ∈ Cn×n and a uniquely determined nonsingular diagonal matrix D ∈ Rn×n , such
that
A = LDLH .

Proof. Let A = LDU be the uniquely determined factorization from Theorem 1.1. Since
A = AH we have A = LDU = (LDU )H = U H DH LH . Here U H and LH are unit lower
and upper triangular, respectively, and DH is diagonal and nonsingular. The uniqueness
of the factorization now implies that U H = L and D = DH , and hence in particular
D ∈ Rn×n .

Corollary 1.3 (Cholesky decomposition). If A ∈ Cn×n is HPD, then there exists a uniquely
determined lower triangular matrix L ∈ Cn×n with positive diagonal elements, such that

A = LLH .

1st Proof. If A is HPD, then by Corollary 1.2 there exists a uniquely determined factorization $A = \widetilde{L} D \widetilde{L}^H$, where $\widetilde{L} \in \mathbb{C}^{n\times n}$ is unit lower triangular and $D = [d_{ij}] \in \mathbb{R}^{n\times n}$ is nonsingular. By (3) in Lemma 0.2 the matrix $D = \widetilde{L}^{-1} A \widetilde{L}^{-H}$ is HPD and hence $d_{ii} > 0$ for $i = 1, \ldots, n$. We set $L := \widetilde{L} D^{1/2}$, where $D^{1/2} := \operatorname{diag}(d_{11}^{1/2}, \ldots, d_{nn}^{1/2}) \in \mathbb{R}^{n\times n}$, then $A = LL^H$.
2nd Proof. Induction on n. If n = 1, we have $A = [a_{11}]$ with $a_{11} > 0$, and set $L := [a_{11}^{1/2}]$. Suppose the statement is true for matrices up to order n − 1 for some n ≥ 2. Let $A \in \mathbb{C}^{n\times n}$ be HPD, and let $A_{n-1} := A(1{:}n-1, 1{:}n-1) \in \mathbb{C}^{(n-1)\times(n-1)}$, which is HPD; cf. (4) in Lemma 0.2. By the induction hypothesis, there exists a uniquely determined lower triangular matrix $L_{n-1} \in \mathbb{C}^{(n-1)\times(n-1)}$ with positive diagonal elements, such that $A_{n-1} = L_{n-1}L_{n-1}^H$. We thus can write
$$
A = \begin{bmatrix} A_{n-1} & b \\ b^H & a_{nn} \end{bmatrix}
  = \begin{bmatrix} I_{n-1} & 0 \\ b^H A_{n-1}^{-1} & 1 \end{bmatrix}
    \begin{bmatrix} A_{n-1} & 0 \\ 0 & s \end{bmatrix}
    \begin{bmatrix} I_{n-1} & A_{n-1}^{-1} b \\ 0 & 1 \end{bmatrix},
$$
where $s := a_{nn} - b^H A_{n-1}^{-1} b$. Taking determinants yields
$$
0 < \det(A) = \det(A_{n-1})\, s,
$$
and hence $s > 0$. Now define $c := L_{n-1}^{-1} b$, then
$$
\|c\|_2^2 = b^H L_{n-1}^{-H} L_{n-1}^{-1} b = b^H A_{n-1}^{-1} b < a_{nn}.
$$
Let $\alpha$ be the positive square root of $a_{nn} - \|c\|_2^2 = s > 0$, then the lower triangular matrix $L := \begin{bmatrix} L_{n-1} & 0 \\ c^H & \alpha \end{bmatrix}$ has positive diagonal elements and satisfies $A = LL^H$.
If $\widetilde{L} := \begin{bmatrix} L_{n-1} & 0 \\ d^H & \beta \end{bmatrix}$ with $\beta > 0$ satisfies $\widetilde{L}\widetilde{L}^H = LL^H$, then
$$
\begin{bmatrix} A_{n-1} & L_{n-1} d \\ d^H L_{n-1}^H & \|d\|_2^2 + \beta^2 \end{bmatrix}
= \begin{bmatrix} A_{n-1} & L_{n-1} c \\ c^H L_{n-1}^H & \|c\|_2^2 + \alpha^2 \end{bmatrix}.
$$
Since $L_{n-1}$ is nonsingular, we must have $d = c$, and hence $\beta^2 = \alpha^2$, giving $\beta = \alpha$, since $\alpha$ and $\beta$ are both positive.
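As an illustration (a sketch, not part of the notes; the HPD test matrix is arbitrary), MATLAB's chol computes exactly this factor; chol(A,'lower') returns the lower triangular L with positive diagonal:

```matlab
% Build an HPD matrix and compute its Cholesky factor A = L*L'.
n = 4;
B = randn(n) + 1i*randn(n);
A = B'*B + n*eye(n);          % HPD by construction
L = chol(A, 'lower');         % lower triangular with positive diagonal
norm(A - L*L')                % of the order of roundoff
```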
Theorem 1.4 (Schur decomposition). For every matrix A ∈ Cn×n there exist a unitary
matrix U ∈ Cn×n and an upper triangular matrix R ∈ Cn×n , such that

A = U RU H ,

i.e., A can be unitarily triangularized.

Proof. Induction on n. If n = 1, set U = I1 , R = A. Suppose the statement is true for
matrices up to order n−1 for some n ≥ 2, and let A ∈ Cn×n . Suppose that λ is an eigenvalue
of A with corresponding unit norm eigenvector x, i.e., Ax = λx with kxk22 = xH x = 1. Let
Y ∈ Cn×(n−1) be any matrix such that X := [x, Y ] ∈ Cn×n is unitary. Then
 
$$
X^H A X = \begin{bmatrix} \lambda & x^H A Y \\ 0 & Y^H A Y \end{bmatrix},
$$

where $Y^H A Y \in \mathbb{C}^{(n-1)\times(n-1)}$. By the induction hypothesis, there exists a unitary matrix $Z \in \mathbb{C}^{(n-1)\times(n-1)}$ such that $Z^H (Y^H A Y) Z = \widetilde{R}$ is upper triangular. Then a straightforward computation shows that $U := [x, YZ] \in \mathbb{C}^{n\times n}$ is unitary, and in particular that $Z^H Y^H A x = \lambda Z^H Y^H x = 0$. Consequently,
$$
U^H A U = \begin{bmatrix} x^H A x & x^H A Y Z \\ Z^H Y^H A x & Z^H Y^H A Y Z \end{bmatrix}
= \begin{bmatrix} \lambda & x^H A Y Z \\ 0 & \widetilde{R} \end{bmatrix} =: R,
$$
which is upper triangular.


The matrix R in this theorem is called a Schur form of A. Its diagonal elements are
the eigenvalues of A. As indicated by the proof, they can be chosen in any order. The
uniqueness of the strictly upper triangular part of R as well as numerous further results
on unitary similarity of matrices are discussed in [15].
The Schur decomposition has been called “[p]erhaps the most fundamentally useful fact
of elementary matrix theory” [6, p. 79]. It has the following important corollary, which
characterizes some fundamental classes of matrices.
Corollary 1.5 (Spectral decomposition of normal matrices).
(1) A ∈ Cn×n is normal if and only if A can be unitarily diagonalized, i.e., there exists a
unitary matrix U such that U H AU is diagonal.
(2) A ∈ Cn×n is Hermitian if and only if A can be unitarily diagonalized and all the
eigenvalues of A are real.
(3) A ∈ Cn×n is unitary if and only if A can be unitarily diagonalized and all eigenvalues
λ of A satisfy |λ| = 1.
Proof. (1) Let A be normal and let $A = U R U^H$ be a Schur decomposition. Then $A^H A = A A^H$ implies that $R^H R = R R^H$. If we write $R = \begin{bmatrix} R_1 & r \\ 0 & \rho \end{bmatrix}$ with $R_1 \in \mathbb{C}^{(n-1)\times(n-1)}$, we obtain the equality
$$
R^H R = \begin{bmatrix} R_1^H R_1 & R_1^H r \\ r^H R_1 & \|r\|_2^2 + |\rho|^2 \end{bmatrix}
= \begin{bmatrix} R_1 R_1^H + r r^H & \overline{\rho}\, r \\ \rho\, r^H & |\rho|^2 \end{bmatrix} = R R^H.
$$
A comparison of the (2, 2) entries shows that r = 0, and hence
$$
R = \begin{bmatrix} R_1 & 0 \\ 0 & \rho \end{bmatrix}, \qquad \text{where } R_1^H R_1 = R_1 R_1^H.
$$

Inductively it follows that R must be diagonal.
On the other hand, if A = U DU H with U unitary and D diagonal, then

AH A = (U DH U H )(U DU H ) = U DH DU H = U DDH U H = AAH .

(2) If A is Hermitian, then A is normal, and hence A = U DU H with a diagonal matrix


D. Now
A = U DU H = AH = U DH U H
shows D = DH .
On the other hand, if A = U DU H with D ∈ Rn×n , then AH = (U DU H )H =
U DH U H = U DU H = A.

(3) If A is unitary, then A is normal, and hence A = U DU H with a unitary matrix U


and a diagonal matrix D. Now

In = AH A = (U DH U H )(U DU H ) = U DH DU H

implies DH D = In , and thus |dii | = 1.


On the other hand, if A = U DU H with a unitary matrix U and DH D = In , then

AH A = (U DH U H )(U DU H ) = In ,

and hence A is unitary.

Note that the decomposition $A = U D U^H$ with $U = [u_1, \ldots, u_n]$ and $D = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$ can be written as
$$
A = \sum_{j=1}^{n} \lambda_j u_j u_j^H.
$$
The matrix $u_j u_j^H$ is Hermitian and satisfies $(u_j u_j^H)^2 = u_j u_j^H$. Thus, a normal matrix can be decomposed into the sum of n rank-one matrices, where each such matrix is an orthogonal projection onto the subspace spanned by an eigenvector of A.
A general matrix A can also be decomposed into the sum of rank-one matrices, but in
general these matrices are not orthogonal projections onto eigenspaces of A.

Theorem 1.6 (Singular value decomposition, SVD). If A ∈ Cn×m has rank r, then there
exist unitary matrices U ∈ Cn×n , V ∈ Cm×m and a diagonal matrix Σ+ = diag(σ1 , . . . , σr )
with σ1 ≥ · · · ≥ σr > 0, such that
$$
A = U \begin{bmatrix} \Sigma_+ & 0 \\ 0 & 0 \end{bmatrix} V^H, \qquad \text{or} \qquad A = \sum_{j=1}^{r} \sigma_j u_j v_j^H. \tag{1.1}
$$

Proof. The matrix AH A ∈ Cm×m is HPSD, since xH AH Ax = kAxk22 ≥ 0 for all x ∈ Cm .
Hence AH A can be unitarily diagonalized with nonnegative real eigenvalues (cf. (1) in
Lemma 0.2 and (2) in Corollary 1.5). Denote the r = rank(A) = rank(AH A) positive
eigenvalues by $\sigma_1^2 \geq \cdots \geq \sigma_r^2 > 0$ and $\Sigma_+^2 := \operatorname{diag}(\sigma_1^2, \ldots, \sigma_r^2)$, and let the unitary diagonalization be
$$
V^H A^H A V = \begin{bmatrix} \Sigma_+^2 & 0 \\ 0 & 0 \end{bmatrix},
$$

for a unitary matrix $V \in \mathbb{C}^{m\times m}$. Then with $V = [V_1, V_2]$, where $V_1 \in \mathbb{C}^{m\times r}$, we see that $V_2^H A^H A V_2 = 0$, giving $AV_2 = 0$. Define $U_1 := A V_1 \Sigma_+^{-1}$, then
$$
U_1^H U_1 = \Sigma_+^{-1} V_1^H A^H A V_1 \Sigma_+^{-1} = I_r.
$$

We therefore can choose a matrix $U_2$ so that $U := [U_1, U_2] \in \mathbb{C}^{n\times n}$ is unitary. Then, by construction, $U_2^H A V_1 = U_2^H U_1 \Sigma_+ = 0$, so that
$$
U^H A V = \begin{bmatrix} U_1^H A V_1 & U_1^H A V_2 \\ U_2^H A V_1 & U_2^H A V_2 \end{bmatrix}
= \begin{bmatrix} \Sigma_+ & 0 \\ 0 & 0 \end{bmatrix},
$$

as required.
The numbers σ1 ≥ · · · ≥ σr > 0 in (1.1) are called the (nonzero) singular values of A,
and the columns of the unitary matrices U and V are called the left and right singular
vectors of A.
As we have seen in the proof, the singular values are the (positive) square roots of
the nonzero eigenvalues of AH A. Thus, the singular values of A are uniquely determined.
Similar to eigenvectors, the singular vectors are not uniquely determined. In particular,
for any $\varphi_1, \ldots, \varphi_r \in \mathbb{R}$ we can write
$$
A = \sum_{j=1}^{r} \sigma_j u_j v_j^H = \sum_{j=1}^{r} \sigma_j (e^{i\varphi_j} u_j)(e^{i\varphi_j} v_j)^H,
$$

where the sum on the right also is an SVD of A.
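A small MATLAB sketch (illustration only, not from the notes; the sizes and phases are arbitrary) that computes an SVD and verifies the phase-rotation non-uniqueness described above:

```matlab
% Rotating the phases of the singular vectors gives another SVD of A.
A = randn(4,3) + 1i*randn(4,3);
[U, S, V] = svd(A);                 % A = U*S*V'
phi = 2*pi*rand(3,1);               % arbitrary phases phi_1, ..., phi_r
D = diag(exp(1i*phi));
U2 = U(:,1:3)*D;  V2 = V*D;         % phase-rotated singular vectors
norm(A - U2*S(1:3,1:3)*V2')         % still reproduces A up to roundoff
```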


The SVD has a long history (see [17]), but its modern form as given in the theorem
above appears to be due to Eckart and Young [2]. Because of its significance, it has been
called the “Swiss Army Knife” as well as the “Rolls Royce” of matrix decompositions [3].

Corollary 1.7 (Polar decomposition). If A ∈ Cn×n is nonsingular, there exist unitary


matrices U1 , U2 ∈ Cn×n and HPD matrices H1 , H2 ∈ Cn×n , such that A = U1 H1 = H2 U2 .

Proof. If A is nonsingular, it has an SVD of the form A = U ΣV H with U, V ∈ Cn×n unitary


and Σ = diag(σ1 , . . . , σn ), where σi > 0 for i = 1, . . . , n. Then A = (U V H )(V ΣV H ) =:
U1 H1 with U1 unitary and H1 HPD. Similarly, A = (U ΣU H )(U V H ) =: H2 U2 , with H2
HPD and U2 unitary.

The decomposition in this result is the matrix analogue of the polar decomposition of
a nonzero complex number z = eiφ ρ = ρeiφ , where ρ = |z| > 0.
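The construction used in the proof can be reproduced in a few lines of MATLAB (an illustrative sketch under the assumption that A is nonsingular, not code from the notes):

```matlab
% Polar decomposition A = U1*H1 of a nonsingular matrix via the SVD.
n = 4;
A = randn(n) + 1i*randn(n);       % nonsingular with probability 1
[U, S, V] = svd(A);
U1 = U*V';                        % unitary factor
H1 = V*S*V';                      % HPD factor
norm(A - U1*H1)                   % of the order of roundoff
norm(U1'*U1 - eye(n))             % U1 is unitary
```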
Theorem 1.8 (QR decomposition). Let A ∈ Cn×m with n ≥ m. Then there exist a
unitary matrix Q ∈ Cn×n and an upper triangular matrix R = [rij ] ∈ Cm×m with rii ≥ 0
for i = 1, . . . , m, such that
$$
Q^H A = \begin{bmatrix} R \\ 0 \end{bmatrix}, \qquad \text{or} \qquad A = Q \begin{bmatrix} R \\ 0 \end{bmatrix}.
$$
If rank(A) = m, then rii > 0, i = 1, . . . , m. Moreover, denoting Q = [Q1 , Q2 ] with
Q1 ∈ Cn×m , the matrices Q1 and R with rii > 0, i = 1, . . . , m, are uniquely determined.
Proof. Induction on m. If m = 1 we have A = [a] ∈ Cn . If a = 0, set Q := I and R := [0].
If $a \neq 0$, let $Q \in \mathbb{C}^{n\times n}$ be a unitary matrix with first column $a/\|a\|_2$. Then
$$
Q^H a = \begin{bmatrix} R \\ 0 \end{bmatrix},
$$
where R = [kak2 ] has the required form.
Now let A have m > 1 columns, and write A = [a, A1 ], where A1 ∈ Cn×(m−1) . If a = 0,
set $Q_1 := I_n$. If $a \neq 0$, let $Q_1 \in \mathbb{C}^{n\times n}$ be a unitary matrix with first column $a/\|a\|_2$. Then
$$
Q_1^H A = \begin{bmatrix} \|a\|_2 & b^T \\ 0 & C \end{bmatrix}
$$
for some $b^T \in \mathbb{C}^{1\times(m-1)}$ and $C \in \mathbb{C}^{(n-1)\times(m-1)}$. By the induction hypothesis, there exists a decomposition $Q_2^H C = \begin{bmatrix} R_2 \\ 0 \end{bmatrix}$, where $Q_2 \in \mathbb{C}^{(n-1)\times(n-1)}$ is unitary and $R_2 \in \mathbb{C}^{(m-1)\times(m-1)}$ is upper triangular with nonnegative diagonal elements. Hence with the unitary matrix
$$
Q := Q_1 \begin{bmatrix} 1 & 0 \\ 0 & Q_2 \end{bmatrix},
$$
we have
$$
Q^H A = \begin{bmatrix} 1 & 0 \\ 0 & Q_2^H \end{bmatrix} Q_1^H A
= \begin{bmatrix} \|a\|_2 & b^T \\ 0 & R_2 \\ 0 & 0 \end{bmatrix} =: \begin{bmatrix} R \\ 0 \end{bmatrix}
$$
in the required form.
If rank(A) = m, then rank(R) = m, which implies rii > 0 for i = 1, . . . , m. If
$$
A = Q_1 R = \widetilde{Q}_1 \widetilde{R},
$$
where $Q_1, \widetilde{Q}_1 \in \mathbb{C}^{n\times m}$ have orthonormal columns and $R, \widetilde{R} \in \mathbb{C}^{m\times m}$ are nonsingular upper triangular matrices with positive diagonal elements, then $R^H R = \widetilde{R}^H \widetilde{R}$, and hence
$$
\underbrace{R\widetilde{R}^{-1}}_{\text{upper triangular}} = \underbrace{R^{-H}\widetilde{R}^{H}}_{\text{lower triangular}} = ((R\widetilde{R}^{-1})^H)^{-1},
$$
which implies that $R\widetilde{R}^{-1} =: D$ is diagonal with positive diagonal elements. But then $D = (D^H)^{-1}$ shows that $D = I_m$, from which we see $R = \widetilde{R}$ and thus $Q_1 = \widetilde{Q}_1$.

Suppose that A = [a1 , . . . , am ] ∈ Cn×m has rank m. Then the (classical) Gram-Schmidt
algorithm yields Q ∈ Cn×m with orthonormal columns and R = [rij ] ∈ Cm×m upper
triangular with rii > 0, i = 1, . . . , m, such that A = QR:
Set $q_1 = a_1/r_{11}$, where $r_{11} = \|a_1\|_2$
for $j = 1, \ldots, m-1$ do
  $\hat{q}_{j+1} = a_{j+1} - \sum_{i=1}^{j} r_{i,j+1} q_i$, where $r_{i,j+1} = (a_{j+1}, q_i)$
  $q_{j+1} = \hat{q}_{j+1}/r_{j+1,j+1}$, where $r_{j+1,j+1} = \|\hat{q}_{j+1}\|_2$
end for
Let us briefly show that the algorithm indeed generates pairwise orthonormal vectors.
Suppose that for some j ∈ {1, . . . , m − 1} the vectors q1 , . . . , qj are pairwise orthonormal,
i.e., $(q_i, q_\ell) = \delta_{i\ell}$ (Kronecker-δ). Then for each $\ell = 1, \ldots, j$ we have
$$
(\hat{q}_{j+1}, q_\ell) = \Big( a_{j+1} - \sum_{i=1}^{j} (a_{j+1}, q_i)\, q_i,\ q_\ell \Big)
= (a_{j+1}, q_\ell) - \sum_{i=1}^{j} (a_{j+1}, q_i)(q_i, q_\ell)
= (a_{j+1}, q_\ell) - (a_{j+1}, q_\ell) = 0.
$$

Thus, q1 , . . . , qj+1 are pairwise orthonormal.
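A direct MATLAB transcription of this classical Gram-Schmidt algorithm might look as follows (a sketch for illustration, not from the notes; the function name cgs is arbitrary and A is assumed to have full column rank):

```matlab
function [Q, R] = cgs(A)
% Classical Gram-Schmidt: A = Q*R with orthonormal columns in Q and
% upper triangular R with positive diagonal entries.
[n, m] = size(A);
Q = zeros(n, m); R = zeros(m, m);
R(1,1) = norm(A(:,1));
Q(:,1) = A(:,1)/R(1,1);
for j = 1:m-1
    R(1:j, j+1) = Q(:,1:j)' * A(:, j+1);        % r_{i,j+1} = (a_{j+1}, q_i)
    qhat = A(:, j+1) - Q(:,1:j) * R(1:j, j+1);  % orthogonalize against q_1,...,q_j
    R(j+1, j+1) = norm(qhat);
    Q(:, j+1) = qhat / R(j+1, j+1);
end
end
```

Calling [Q,R] = cgs(A) and checking norm(A - Q*R) and norm(Q'*Q - eye(size(A,2))) illustrates the result (in exact arithmetic both would be zero).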


There is a close relation between the QR and the Cholesky decomposition. Let $A \in \mathbb{C}^{n\times m}$ have full rank m. If
$$
A = Q \begin{bmatrix} R \\ 0 \end{bmatrix} \qquad \text{with } r_{ii} > 0
$$
is the (uniquely determined) QR decomposition of A, then AH A = RH R is the (uniquely
determined) Cholesky decomposition of the HPD matrix AH A.
On the other hand, let AH A = LLH be the (uniquely determined) Cholesky decom-
position. Then the matrix Q := AL−H satisfies QH Q = L−1 AH AL−H = Im , and hence
A = QLH is the (uniquely determined) QR decomposition of A.
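This relation can be checked numerically with the following MATLAB sketch (illustration only, assuming A has full column rank; MATLAB's qr does not enforce a positive diagonal of R, so the signs are normalized first):

```matlab
% The triangular QR factor of A equals the (conjugate) transposed
% Cholesky factor of A'*A.
A = randn(6,3);
[Q, R] = qr(A, 0);           % economy-size QR, R is 3-by-3 upper triangular
D = diag(sign(diag(R)));     % normalize so that diag(R) > 0
Q = Q*D;  R = D*R;
L = chol(A'*A, 'lower');     % Cholesky factor of the HPD matrix A'*A
norm(R - L')                 % R = L^H up to roundoff
```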
Let A ∈ Cn×n and v ∈ Cn \ {0}. In the Krylov sequence

v, Av, A2 v, . . .

there exists a smallest integer d ≥ 1 such that v, Av, . . . , Ad−1 v are linearly independent,
and v, Av, . . . , Ad v are linearly dependent. This integer d = d(A, v) is called the grade of
the vector v with respect to the matrix A.
It is easy to see that d = 1 holds if and only if v is an eigenvector of A. If

MA (λ) = λm + αm−1 λm−1 + · · · + α0

is the minimal polynomial of A, then MA (A) = 0 and hence MA (A)v = 0 for any vector
v ∈ Cn . Thus,
$$
A^m v = - \sum_{j=0}^{m-1} \alpha_j A^j v,
$$

which shows that v, Av, . . . , Am v are linearly dependent. Consequently, the grade of a
vector can be at most equal to the degree of the minimal polynomial of A (which is less
than or equal to n).
If A = Jn (0) is the Jordan block of size n × n and with eigenvalue 0, and ej is the jth
standard basis vector of Cn , then Aej = ej−1 for every j = 1, . . . , n, where we set e0 = 0.
Hence for each j = 1, . . . , n the vectors
ej , Aej , . . . , Aj−1 ej
are linearly independent, and
ej , Aej , . . . , Aj−1 ej , Aj ej
are linearly dependent (since Aj ej = 0), which shows that d(A, ej ) = j.
Using the same idea and the Jordan canonical form of a matrix A ∈ Cn,n , one can
show that for each j = 1, . . . , m (= degree of A’s minimal polynomial) there exists a vector
vj ∈ Cn with d(A, vj ) = j.
Theorem 1.9 (Arnoldi decomposition). Let A ∈ Cn×n and v ∈ Cn \ {0} be of grade d with
respect to A. Then there exists V ∈ Cn×d with orthonormal columns and an unreduced
upper Hessenberg matrix H = [hij ] ∈ Cd×d , i.e., hij = 0 for i > j + 1 and hi+1,i 6= 0,
i = 1, . . . , d − 1, such that
AV = V H.
Proof. Let $W = [v, Av, \ldots, A^{d-1}v] \in \mathbb{C}^{n\times d}$. By assumption, $\operatorname{rank}(W) = d$. We know that $A^d v \in \operatorname{span}\{v, \ldots, A^{d-1}v\}$, i.e., $A^d v = \sum_{j=0}^{d-1} \gamma_j A^j v$ for some $\gamma_0, \ldots, \gamma_{d-1} \in \mathbb{C}$.
Let $W = [V, \widetilde{V}] \begin{bmatrix} R \\ 0 \end{bmatrix}$ with $V \in \mathbb{C}^{n\times d}$ and $R \in \mathbb{C}^{d\times d}$ be the QR decomposition. Then $W = VR$, and
$$
AW = [Av, \ldots, A^d v] = W \begin{bmatrix} 0 & & & \gamma_0 \\ 1 & \ddots & & \vdots \\ & \ddots & 0 & \vdots \\ & & 1 & \gamma_{d-1} \end{bmatrix}
$$
leads to
$$
AVR = VR \begin{bmatrix} 0 & & & \gamma_0 \\ 1 & \ddots & & \vdots \\ & \ddots & 0 & \vdots \\ & & 1 & \gamma_{d-1} \end{bmatrix},
$$
and hence $AV = VH$, where $V^H V = I_d$, and
$$
H = R \begin{bmatrix} 0 & & & \gamma_0 \\ 1 & \ddots & & \vdots \\ & \ddots & 0 & \vdots \\ & & 1 & \gamma_{d-1} \end{bmatrix} R^{-1}
$$
is unreduced upper Hessenberg.
As we see in the proof of Theorem 1.9, the columns of the matrix V ∈ Cn×d form an
orthonormal basis of the dth Krylov subspace of A and v, which is defined by

Kd (A, v) := span{v, Av, . . . , Ad−1 v}. (1.2)

Since
AKd (A, v) = span{Av, A2 v . . . , Ad v},
and Ad v is by construction a linear combination of v, Av, . . . , Ad−1 v, we see that

AKd (A, v) ⊆ Kd (A, v), (1.3)

i.e., the d-dimensional subspace Kd (A, v) is invariant under A. Moreover, each eigenvalue
of the Hessenberg matrix H = V H AV ∈ Cd×d is an eigenvalue of A.
Since Av, . . . , Ad−1 v are linearly independent, we have

d − 1 ≤ dim(AKd (A, v)) ≤ d.

A strict inclusion occurs in (1.3) if and only if dim(AKd (A, v)) = d − 1. As an example,
consider again A = Jn (0) and v = e1 . Then d = d(A, v) = 1 and

AK1 (A, v) = span{Av} = {0} ⊂ K1 (A, v) = span{e1 }.

If A is nonsingular, we always have dim(AKd (A, v)) = d and hence equality in (1.3). In
order to see this, consider the equation

0 = γ1 Av + γ2 A2 v + · · · + γd Ad v.

Since A is nonsingular, we can multiply from the left with A−1 , which gives

0 = γ1 v + γ2 Av + · · · + γd Ad−1 v.

The linear independence of v, Av, . . . , Ad−1 v implies that γ1 , . . . , γd must be 0, and hence
Av, A2 v, . . . , Ad v are linearly independent.
An orthogonal basis of the Krylov subspace Kd (A, v) can be computed by applying the
Gram-Schmidt algorithm to the matrix [v, Av, . . . , Ad−1 v], which by assumption has full
rank d. The algorithm then reads as follows:
Set $v_1 = v/r_{11}$, where $r_{11} = \|v\|_2$
for $j = 1, \ldots, d-1$ do
  $\hat{v}_{j+1} = A^j v - \sum_{i=1}^{j} r_{i,j+1} v_i$, where $r_{i,j+1} = (A^j v, v_i)$
  $v_{j+1} = \hat{v}_{j+1}/r_{j+1,j+1}$, where $r_{j+1,j+1} = \|\hat{v}_{j+1}\|_2$
end for
As observed by Arnoldi [1], it is numerically superior to replace Aj v in this algorithm
by Avj , i.e., to consider the following variant, which is called the Arnoldi algorithm:

Set $v_1 = v/\|v\|_2$
for $j = 1, \ldots, d-1$ do
  $\tilde{v}_{j+1} = A v_j - \sum_{i=1}^{j} h_{ij} v_i$, where $h_{ij} = (A v_j, v_i)$
  $v_{j+1} = \tilde{v}_{j+1}/h_{j+1,j}$, where $h_{j+1,j} = \|\tilde{v}_{j+1}\|_2$
end for
The vectors v1 , . . . , vd generated by this algorithm still form an orthonormal basis of
Kd (A, v) (in exact arithmetic). In each step j = 1, . . . , d − 1 we have a relation of the form
$$
A v_j = h_{j+1,j} v_{j+1} + \sum_{i=1}^{j} h_{ij} v_i.
$$
Thus, if we rewrite the algorithm in matrix form, we obtain the Arnoldi decomposition
from Theorem 1.9, i.e.,
A[v1 , . . . , vd ] = [v1 , . . . , vd ]H,
where H = [hij ] is unreduced upper Hessenberg. The last column of H is determined
by expressing Avd as a linear combination of v1 , . . . , vd . Note that this column is not
explicitly computed in the above Arnoldi algorithm, since this algorithm terminates at the
step j = d − 1.
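For illustration, a compact MATLAB version of the Arnoldi algorithm could look as follows (a sketch, not from the notes; the function name arnoldi is arbitrary, and it assumes that no breakdown occurs, i.e., that the grade of v exceeds k). For k steps it produces the rectangular relation A V(:,1:k) = V H with a (k+1)-by-k upper Hessenberg H:

```matlab
function [V, H] = arnoldi(A, v, k)
% k steps of the Arnoldi algorithm (classical Gram-Schmidt variant).
n = length(v);
V = zeros(n, k+1); H = zeros(k+1, k);
V(:,1) = v/norm(v);
for j = 1:k
    w = A*V(:,j);
    H(1:j,j) = V(:,1:j)'*w;           % h_{ij} = (A v_j, v_i)
    w = w - V(:,1:j)*H(1:j,j);        % orthogonalize against v_1,...,v_j
    H(j+1,j) = norm(w);
    V(:,j+1) = w/H(j+1,j);            % assumes no breakdown (grade(v) > k)
end
end
```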
The Arnoldi decomposition has the following important special case.
Corollary 1.10 (Lanczos decomposition). If A ∈ Cn×n is Hermitian and v ∈ Cn \ {0} is
of grade d with respect to A, then there exists V ∈ Cn×d with orthonormal columns and
an unreduced Hermitian tridiagonal matrix H = [hij ] ∈ Cd×d , i.e., H = H H , hij = 0 for
|i − j| > 1, hi+1,i 6= 0 6= hi,i+1 , i = 1, . . . , d − 1, such that AV = V H.
Proof. Let AV = V H be the Arnoldi decomposition. Then H = V H AV = V H AH V = H H
shows that H is Hermitian. Since H is unreduced upper Hessenberg, this matrix is in fact
tridiagonal with hi+1,i 6= 0 6= hi,i+1 for i = 1, . . . , d − 1.
Note that if in the decomposition AV = V H the matrix H is tridiagonal, then a
comparison of the jth columns gives
Avj = hj+1,j vj+1 + hjj vj + hj−1,j vj−1 , j = 1, . . . , d,
where we set v0 = vd+1 = 0. Thus,
hj+1,j vj+1 = Avj − hjj vj − hj−1,j vj−1 ,
which means that the vector vj+1 satisfies a 3-term recurrence. In other words, if A is
Hermitian, an orthogonal basis of Kd (A, v) can be generated by a 3-term recurrence, while
for a general matrix A we require the (full) Arnoldi recurrence
$$
h_{j+1,j} v_{j+1} = A v_j - \sum_{i=1}^{j} h_{ij} v_i.
$$

The existence of short (3-term) recurrences for generating orthogonal bases of Krylov
subspaces has been intensively studied since the early 1980s; see [10] for a survey of some
results in this area.

Chapter 2

Perturbation Theory

In this chapter we will give an introduction into the theory of errors in numerical analysis
with a focus on numerical linear algebra problems, and into a field that is called “matrix
perturbation theory”.

2.1 Norms, errors, and conditioning


Errors and perturbations are usually measured using norms on the respective vector spaces.
Definition 2.1 (Norm). A function k · k : V → R is called a norm on a real or complex
vector space V when the following hold:
(1) kvk ≥ 0 for all v ∈ V with equality if and only if v = 0,
(2) kαvk = |α|kvk for all scalars α and v ∈ V,
(3) kv1 + v2 k ≤ kv1 k + kv2 k for all v1 , v2 ∈ V (so-called triangle inequality).
We will mostly use the following types of norms.
Definition 2.2 (Consistent norms). Let | · |, k · k, k · k∗ be norms on Cn×m , Cm×k , Cn×k ,
respectively. These norms are called consistent, when
kABk∗ ≤ |A| · kBk
holds for all A ∈ Cn×m and B ∈ Cm×k . A single norm k · k on Cn×n is called consistent if
kABk ≤ kAk kBk holds for all A, B ∈ Cn×n .
We will now show that for each given (vector) norm on Cn there exists a (matrix) norm
k · k∗ on Cn×n such that the two norms are consistent, and vice versa, when the given
matrix norm is consistent.
If k · k is any norm on Cn , then
$$
\|\cdot\|_* : \mathbb{C}^{n\times n} \to \mathbb{R} \quad \text{with} \quad \|A\|_* := \max_{\|z\|=1} \|Az\| = \max_{z\neq 0} \frac{\|Az\|}{\|z\|} \quad \text{for all } A \in \mathbb{C}^{n\times n},
$$
is a norm on Cn×n (called the matrix norm induced by k · k):

(1) We have kAk∗ ≥ 0 for all A ∈ Cn×n , and kAk∗ = 0 holds if and only if kAzk = 0 for
all z ∈ Cn , which holds if and only if A = 0.

(2) For all A ∈ Cn×n and α ∈ C we have

kαAk∗ = max kαAzk = |α| max kAzk = |α|kAk∗ .


kzk=1 kzk=1

(3) For all A, B ∈ Cn×n we have

kA + Bk∗ = max k(A + B)zk ≤ max (kAzk + kBzk) ≤ kAk∗ + kBk∗ .


kzk=1 kzk=1

The norms k · k and k · k∗ are consistent, since for all x ∈ Cn \ {0} we have

$$
\frac{\|Ax\|}{\|x\|} \leq \max_{z\neq 0} \frac{\|Az\|}{\|z\|} = \|A\|_*,
$$

which implies kAxk ≤ kAk∗ kxk.


On the other hand, if k · k∗ is any consistent norm on Cn×n , we choose a nonzero vector
y ∈ Cn and define the function

k · k : Cn → R with kxk := kxy H k∗ for all x ∈ Cn ,

which is a norm on Cn :

(1) We have kxk = kxy H k∗ ≥ 0 for all x ∈ Cn . Moreover, kxy H k∗ = 0 holds if and only if
xy H = 0. Multiplying from the right by y 6= 0 gives x(y H y) = 0, and since y H y > 0,
we must have x = 0.

(2) For all x ∈ Cn and α ∈ C we have kαxk = kαxy H k∗ = |α|kxy H k∗ = |α|kxk.

(3) For all x1 , x2 ∈ Cn we have kx1 + x2 k = k(x1 + x2 )y H k∗ ≤ kx1 y H k∗ + kx2 y H k∗ =


kx1 k + kx2 k.

Moreover, for each A ∈ Cn×n we have

kAxk = kA(xy H )k∗ ≤ kAk∗ kxy H k∗ = kAk∗ kxk,

where we have used that k · k∗ is consistent.


Next we will look at the general theory of errors in numerical analysis. Consider a
function (or “problem”) f : X → Y, where (X , k · kX ) and (Y, k · kY ) are normed vector
spaces, called the input (or “data”) and output (or “solution”) space, respectively. We
are interested in studying the behavior of f at a particular input point x ∈ X . If yb is an
approximation of y = f (x), for example resulting from a numerical computation, then the
accuracy of this approximation can be measured by

– the absolute forward error $\|\hat{y} - y\|_{\mathcal{Y}}$, or
– the relative forward error $\|\hat{y} - y\|_{\mathcal{Y}} / \|y\|_{\mathcal{Y}}$.

We can also ask which input for the function f yields the output yb, i.e., for which
perturbation ∆x of x we have yb = f (x + ∆x). The quantity

k∆xkX is called the absolute backward error,

and
k∆xkX
is called the relative backward error.
kxkX
This can be illustrated as follows:

[Diagram: the map f sends the input x to the output y = f(x), and the perturbed input x + ∆x to ŷ = f(x + ∆x); the backward error k∆xkX is the distance between the inputs, the forward error kŷ − ykY the distance between the outputs.]

A function (or “problem”) is called well-conditioned at the input x, when small per-
turbations of x lead only to small changes in the resulting function values, i.e., a small
k∆xk implies a small kb
y − yk. Here the word “small” needs to be interpreted in the given
context. A function that is not well-conditioned at x is called ill-conditioned at x.
How can we determine whether a function is well-conditioned? For a motivation we
consider a twice continuously differentiable function f : R → R. For a given x ∈ R let
y = f (x) and suppose that yb = f (x + ∆x). Then by Taylor’s theorem

yb − y = f (x + ∆x) − f (x) = f 0 (x)∆x + O(|∆x|2 ),

if |∆x| is small enough, giving
$$
\frac{\hat{y} - y}{y} = \frac{x f'(x)}{f(x)} \cdot \frac{\Delta x}{x} + O(|\Delta x|^2). \tag{2.1}
$$

The quantity |xf 0 (x)|/|f (x)| measures, for small |∆x|, the relative change in the output for
a given relative change in the input. This quantity is called the relative condition number
of f at x. Using y = f (x) and cancelling |x| on the right hand side, we can rewrite (2.1)
for the absolute instead of the relative quantities, i.e.,

$$
|\hat{y} - y| = |f'(x)|\,|\Delta x| + O(|\Delta x|^2).
$$
The absolute value of the derivative therefore can be considered the (absolute) condition
number of f at x, and the equation (2.1) can be read as

(relative) forward error ≲ (relative) condition number × (relative) backward error. (2.2)

This rule of thumb will appear frequently below.


Let us generalize the idea of the relative condition number to a function f between two
normed vector spaces.

Definition 2.3. The relative condition number of a function f : X → Y at x ∈ X is


defined by
$$
\kappa_f(x) := \lim_{\delta \to 0}\ \sup_{\|\Delta x\| \leq \delta}\ \frac{\|f(x+\Delta x) - f(x)\|_{\mathcal{Y}}}{\|f(x)\|_{\mathcal{Y}}} \cdot \frac{\|x\|_{\mathcal{X}}}{\|\Delta x\|_{\mathcal{X}}}.
$$

Note that the first factor in the definition of κf (x) is the (relative) forward error, and
the second factor is the reciprocal of the (relative) backward error. The value of κf (x) of
course depends on the choice of the norms in the spaces X and Y. If f is differentiable
at x, then
$$
\kappa_f(x) = \frac{\|J_f(x)\|\,\|x\|_{\mathcal{X}}}{\|f(x)\|_{\mathcal{Y}}},
$$
where Jf := [∂fi /∂xj ] is the Jacobian of f , and the norm k · k is induced by the norms on
X and Y.

Example 2.4.

(1) For the function f : R → R with f (x) = αx for some nonzero α ∈ R and the norm
k · k = | · | (absolute value) on R we get

$$
\kappa_f(x) = \frac{|f'(x)|\,|x|}{|f(x)|} = \frac{|\alpha|\,|x|}{|\alpha x|} = 1 \quad \text{for all } x \in \mathbb{R},
$$

and hence this function is perfectly conditioned everywhere. (For x = 0 we have


κf (0) = lim|x|→0 |αx|/|αx| = 1.)

(2) For the function f : R+ → R with f (x) = log(x) and the norm k · k = | · | (absolute value)
on R we get

$$
\kappa_f(x) = \frac{|f'(x)|\,|x|}{|f(x)|} = \frac{|1/x|\,|x|}{|\log(x)|} = \frac{1}{|\log(x)|} \quad \text{for all } x \in \mathbb{R}_+.
$$

Hence κf (1) = +∞, and f is ill-conditioned in the neighborhood of x = 1. In order


to illustrate the ill-conditioning numerically, let

x = 1 + 10−8 , ∆x = 10−10 .

An evaluation in MATLAB gives
$$
\log(x) = 9.999999889225291 \times 10^{-9}, \qquad \log(x+\Delta x) = 1.009999989649433 \times 10^{-8},
$$
and hence
$$
\frac{|\Delta x|}{|x|} = 9.999999900000002 \times 10^{-11}, \qquad \kappa_{\log}(x) = 1.000000011077471 \times 10^{8}, \qquad \frac{|\log(x+\Delta x) - \log(x)|}{|\log(x)|} = 0.010000000837678 \times 10^{0}.
$$
Thus, due to the large condition number of f at x, a small relative perturbation of
the input (or a small relative backward error) leads to a large change in the output
(or a large relative forward error).
Example 2.5. For a given matrix A ∈ Cn×n we consider the function f : Cn → Cn with
f (x) = Ax. By k · k we denote a given norm on Cn as well as the induced matrix norm on
Cn×n . Then the relative condition number of f at x ∈ Cn \ {0} is
$$
\kappa_f(x) = \lim_{\delta\to 0}\ \sup_{\|\Delta x\|\leq\delta}\ \frac{\|A(x+\Delta x) - Ax\|}{\|Ax\|}\cdot\frac{\|x\|}{\|\Delta x\|}
= \lim_{\delta\to 0}\ \sup_{\|\Delta x\|\leq\delta}\ \frac{\|A\Delta x\|}{\|Ax\|}\cdot\frac{\|x\|}{\|\Delta x\|}
= \|A\|\,\frac{\|x\|}{\|Ax\|}.
$$
If A is nonsingular, we can use that
$$
\|A^{-1}\| = \max_{z\neq 0} \frac{\|A^{-1}z\|}{\|z\|} = \max_{y\neq 0} \frac{\|y\|}{\|Ay\|} \geq \frac{\|x\|}{\|Ax\|} \quad \text{for each } x \in \mathbb{C}^n\setminus\{0\},
$$
which gives the bound
κf (x) ≤ kAk kA−1 k for each x ∈ Cn \ {0}.
Note that there exists some vector $\tilde{x} \in \mathbb{C}^n\setminus\{0\}$ such that
$$
\max_{y\neq 0} \frac{\|y\|}{\|Ay\|} = \frac{\|\tilde{x}\|}{\|A\tilde{x}\|}.
$$
For this vector we obtain the equality $\kappa_f(\tilde{x}) = \|A\|\,\|A^{-1}\|$.
Analogously, the relative condition number of the function $g : \mathbb{C}^n \to \mathbb{C}^n$ with $g(x) = A^{-1}x$ at $x \in \mathbb{C}^n\setminus\{0\}$ satisfies
$$
\kappa_g(x) = \|A^{-1}\|\,\frac{\|x\|}{\|A^{-1}x\|} \leq \|A^{-1}\|\,\|A\|,
$$
where we have used that $\|x\|/\|A^{-1}x\| \leq \|A\|$. Again there exists some vector $\tilde{x} \in \mathbb{C}^n\setminus\{0\}$ with $\kappa_g(\tilde{x}) = \|A\|\,\|A^{-1}\|$.

This example motivates the following definition.
Definition 2.6. If A ∈ Cn×n is nonsingular, and k · k is a norm on Cn×n , then

κ(A) := kAk kA−1 k

is called the condition number of A with respect to the norm k · k.


The condition number of matrices occurs naturally in many perturbation results. We
give a few examples in the following section.

2.2 Perturbation results for matrices and linear algebraic systems
We start with a result for the matrix inverse.
Theorem 2.7 (Perturbation of the inverse). Let $A \in \mathbb{C}^{n\times n}$ and $\hat{A} = A + E \in \mathbb{C}^{n\times n}$ be nonsingular. Then for any consistent norm $\|\cdot\|$ on $\mathbb{C}^{n\times n}$ we have
$$
\frac{\|\hat{A}^{-1} - A^{-1}\|}{\|\hat{A}^{-1}\|} \leq \|A^{-1}E\| \leq \kappa(A)\,\frac{\|E\|}{\|A\|}. \tag{2.3}
$$
Moreover, if $\|A^{-1}E\| < 1$, then
$$
\frac{\|\hat{A}^{-1} - A^{-1}\|}{\|A^{-1}\|} \leq \frac{\|A^{-1}E\|}{1 - \|A^{-1}E\|} \leq \frac{\kappa(A)\frac{\|E\|}{\|A\|}}{1 - \kappa(A)\frac{\|E\|}{\|A\|}}. \tag{2.4}
$$
Proof. From $\hat{A}^{-1} - A^{-1} = (I - A^{-1}\hat{A})\hat{A}^{-1} = -A^{-1}E\hat{A}^{-1}$ we obtain
$$
\|\hat{A}^{-1} - A^{-1}\| \leq \|A^{-1}E\|\,\|\hat{A}^{-1}\| \leq \|A^{-1}\|\,\|A\|\,\frac{\|E\|}{\|A\|}\,\|\hat{A}^{-1}\|,
$$
which gives (2.3).
In order to show (2.4), we start with $\hat{A}^{-1} - A^{-1} = -A^{-1}E\hat{A}^{-1}$, which yields
$$
\|\hat{A}^{-1}\| \leq \|A^{-1}\| + \|A^{-1}E\|\,\|\hat{A}^{-1}\|,
$$
and hence
$$
\frac{\|\hat{A}^{-1}\|}{\|A^{-1}\|} \leq \frac{1}{1 - \|A^{-1}E\|} \leq \frac{1}{1 - \kappa(A)\frac{\|E\|}{\|A\|}}.
$$
Now (2.3) gives
$$
\frac{\|\hat{A}^{-1} - A^{-1}\|}{\|A^{-1}\|} \leq \frac{\|\hat{A}^{-1}\|}{\|A^{-1}\|}\,\kappa(A)\,\frac{\|E\|}{\|A\|},
$$
and (2.4) follows immediately.

If we interpret taking the inverse as a function f : Cn×n → Cn×n , f (A) = A−1 , then
b−1 = f (A + E), and kEk is the backward error. Hence the theorem is another instance
A
of our rule of thumb (2.2).

Example 2.8. We consider the matrix
$$
A_\varepsilon = \begin{bmatrix} 1+\varepsilon & 1 \\ 1 & 1 \end{bmatrix},
$$
which is nonsingular for each (real or complex) $\varepsilon \neq 0$, with its inverse given by
$$
A_\varepsilon^{-1} = \begin{bmatrix} 1/\varepsilon & -1/\varepsilon \\ -1/\varepsilon & 1 + 1/\varepsilon \end{bmatrix}.
$$
A MATLAB (R2015b) computation with
$$
\varepsilon = 10^{-6}, \qquad E = \begin{bmatrix} 10^{-7} & 0 \\ 0 & 0 \end{bmatrix}, \qquad \hat{A} = A_\varepsilon + E = A_{\varepsilon + 10^{-7}}
$$
then yields
$$
\frac{\|E\|_2}{\|A\|_2} = 4.999998749999999 \times 10^{-8}, \qquad \kappa_2(A) = \|A\|_2\,\|A^{-1}\|_2 = 4.000002000000751 \times 10^{6},
$$
$$
\frac{\|A^{-1} - \hat{A}^{-1}\|_2}{\|\hat{A}^{-1}\|_2} = 0.099999972500000,
$$

which is an illustration of the bound (2.3). (Note that since the inverses are known explic-
itly, one only needs to compute norms of matrices in this example.)

We will next show how the condition number of a nonsingular matrix is related to the
“distance to singularity” of the matrix.

Lemma 2.9. If k · k is a consistent norm on Cn×n , then for each A ∈ Cn×n we have

kAk ≥ ρ(A),

where ρ(A) := max{ |λ| : λ is an eigenvalue of A} is called the spectral radius of A.

Proof. We have shown above that there exists a norm k · k∗ on Cn so that k · k and k · k∗
are consistent. If Ax = λx, x 6= 0, then

|λ|kxk∗ = kλxk∗ = kAxk∗ ≤ kAkkxk∗ ,

and hence |λ| ≤ kAk.

Theorem 2.10. If A ∈ Cn×n is nonsingular and E ∈ Cn×n is such that A + E ∈ Cn×n is
singular, then for any consistent norm k · k on Cn×n we have

$$
\frac{\|E\|}{\|A\|} \geq \frac{1}{\kappa(A)}.
$$

Proof. If A is nonsingular we can write A + E = A(I + A−1 E). Since A + E is singular, the
matrix I +A−1 E must be singular, and thus −1 must be an eigenvalue of A−1 E. Lemma 2.9
then gives
$$
1 \leq \rho(A^{-1}E) \leq \|A^{-1}E\| \leq \|A^{-1}\|\,\|A\|\,\frac{\|E\|}{\|A\|},
$$
which yields the desired inequality.
This theorem shows that a nonsingular matrix must be perturbed by a matrix with
(relative) norm at least 1/κ(A) in order to make it singular. In short, “well-conditioned
matrices are far from singular”.
Example 2.11. Let A ∈ Cn×n be nonsingular, and let A = U ΣV H be an SVD in the
notation of Theorem 1.6. Then A−1 = V Σ−1 U H , and since the matrix 2-norm, which is
induced by the Euclidean norm on Cn , is unitarily invariant, we easily see that

kAk2 = kΣk2 = σ1 and kA−1 k2 = kΣ−1 k2 = 1/σn ,

and hence κ(A) = σ1 /σn . For the matrix E := −σn un vnH we have kEk2 = σn , and A + E
has rank n − 1 and thus is singular. Moreover,
$$
\frac{\|E\|_2}{\|A\|_2} = \frac{\sigma_n}{\sigma_1} = \frac{1}{\kappa(A)},
$$

and therefore E is a perturbation with minimal (relative) 2-norm such that the perturbed
matrix is singular.
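The construction of Example 2.11 is easily reproduced in MATLAB (an illustrative sketch, not from the notes; the test matrix is arbitrary):

```matlab
% Nearest singular matrix in the 2-norm via the SVD.
A = randn(5);
[U, S, V] = svd(A);
sigma = diag(S);
E = -sigma(end)*U(:,end)*V(:,end)';    % perturbation of 2-norm sigma_n
rank(A + E)                            % n-1 = 4, i.e., A+E is (numerically) singular
norm(E,2)/norm(A,2) - 1/cond(A,2)      % equals 0 up to roundoff
```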
In the next result the vector x
b should be interpreted as an approximate solution of the
given linear algebraic system.
Theorem 2.12 (Residual-based forward error bound). Let A ∈ Cn×n be nonsingular,
x ∈ Cn \ {0} and b = Ax. Then for consistent norms and every x
b ∈ Cn we have

$$
\frac{\|\hat{x} - x\|}{\|x\|} \leq \kappa(A)\,\frac{\|r\|}{\|b\|}, \tag{2.5}
$$

where r := b − Ab
x is the residual and krk/kbk is the relative residual norm.
Proof. Using $x = A^{-1}b$ and the definition of the residual we get
$$
\|\hat{x} - x\| = \|A^{-1}(A\hat{x} - b)\| \leq \|A^{-1}\|\,\|r\| = \kappa(A)\,\frac{\|r\|}{\|A\|}.
$$

Moreover, $\|b\| \leq \|A\|\,\|x\|$ gives
$$
\frac{1}{\|x\|} \leq \frac{\|A\|}{\|b\|},
$$
which implies the desired inequality.
An essential observation to be made in (2.5) is that in case of an ill-conditioned matrix
a small residual norm does not guarantee that the forward error is small as well.
Example 2.13. For a numerical example we consider
$$
A_\varepsilon = \begin{bmatrix} 1+\varepsilon & 1 \\ 1 & 1 \end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad b = Ax = \begin{bmatrix} 2+\varepsilon \\ 2 \end{bmatrix}, \qquad \hat{x} = \begin{bmatrix} 0 \\ 2 \end{bmatrix};
$$

cf. Example 2.8. Then
$$
r = b - A\hat{x} = \begin{bmatrix} \varepsilon \\ 0 \end{bmatrix}, \qquad \|r\|_2 = |\varepsilon|, \qquad \text{while} \qquad \frac{\|x - \hat{x}\|_2}{\|x\|_2} = 1.
$$
0 kxk2

For ε = 10−6 we have κ2 (A) ≈ 4 × 106 . If we then try to solve Aε x = b using y=inv(A)*b
in MATLAB (R2015b), we get the computed approximation
$$
y = \begin{bmatrix} 1.000000000232831 \\ 0.999999999767169 \end{bmatrix} \quad \text{with the relative forward error} \quad \frac{\|x-y\|_2}{\|x\|_2} \approx 2.33 \times 10^{-10}.
$$
Since the machine precision (or unit roundoff) is u ≈ 1.11×10−16 (see Chapter 3), we have lost six significant digits in the computed solution. Similarly, if we compute [L,U]=lu(A) and then y=U\(L\b), we obtain the computed approximation
$$
y = \begin{bmatrix} 1.000000000111022 \\ 0.999999999888978 \end{bmatrix} \quad \text{with the relative forward error} \quad \frac{\|x-y\|_2}{\|x\|_2} \approx 1.11 \times 10^{-10}.
$$
Finally, with [Q,R]=qr(A) and y=R\(Q'*b) we obtain
$$
y = \begin{bmatrix} 1.000000000314019 \\ 0.999999999685981 \end{bmatrix} \quad \text{with the relative forward error} \quad \frac{\|x-y\|_2}{\|x\|_2} \approx 3.14 \times 10^{-10}.
$$
This example suggests the following rule of thumb:
If κ(A) ≈ 10k , then expect a loss of k significant digits in a computed (i.e.,
approximate) solution of Ax = b.
The loss of significant digits (or large relative forward error) is a consequence of the ill-
conditioning of the problem, and hence it is independent of the numerical algorithm that
is used for computing the approximation1 . The best approach to deal with this situation
is to avoid ill-conditioning of the problem in the first place.
1
Of course, with a poor algorithm we will likely lose even more significant digits, while if A−1 is known
explicitly, we may be able to compute x = A−1 b very accurately despite a large κ(A).

In the notation of Theorem 2.12, suppose that $\hat{x}$ is the solution of the linear algebraic system with a perturbed right hand side, i.e.,
$$
A\hat{x} = b + \Delta b.
$$
Then $\Delta b = A\hat{x} - b = -r$, and (2.5) becomes
$$
\frac{\|\hat{x} - x\|}{\|x\|} \leq \kappa(A)\,\frac{\|\Delta b\|}{\|b\|}.
$$

Since only a perturbation of b is considered, k∆bk/kbk is the relative backward error, and
we recognize our rule of thumb (2.2).
We now ask about a perturbation of the matrix A such that an approximate solution
x
b solves the perturbed linear algebraic system.

Theorem 2.14 (Residual-based backward error bound). Let $A \in \mathbb{C}^{n\times n}$, $x \in \mathbb{C}^n$ and $b = Ax$. Let $\hat{x} \in \mathbb{C}^n\setminus\{0\}$, then $(A + E)\hat{x} = b$ for $E := r\hat{x}^H/\|\hat{x}\|_2^2$, and
$$
\frac{\|E\|_2}{\|A\|_2} = \frac{\|r\|_2}{\|A\|_2\,\|\hat{x}\|_2} = \min\left\{ \frac{\|\Delta A\|_2}{\|A\|_2} : (A+\Delta A)\hat{x} = b \right\}.
$$

Proof. For $E = r\hat{x}^H/\|\hat{x}\|_2^2$ we get
$$
(A + E)\hat{x} = A\hat{x} + r\,\frac{\hat{x}^H \hat{x}}{\|\hat{x}\|_2^2} = b, \qquad \text{and} \qquad \|E\|_2 = \frac{\|r\|_2}{\|\hat{x}\|_2}.
$$
If $\Delta A$ is arbitrary with $(A + \Delta A)\hat{x} = b$, then $r = b - A\hat{x} = b - (b - \Delta A\hat{x}) = \Delta A\hat{x}$. Hence $\|r\|_2 \leq \|\Delta A\|_2\,\|\hat{x}\|_2$, giving
$$
\frac{\|\Delta A\|_2}{\|A\|_2} \geq \frac{\|r\|_2}{\|A\|_2\,\|\hat{x}\|_2},
$$
where the lower bound is attained for $\Delta A = E$, since $\|E\|_2 = \|r\|_2/\|\hat{x}\|_2$.
Thus, $E = r\hat{x}^H/\|\hat{x}\|_2^2$ is a matrix with minimal (relative) backward error in the 2-norm so that $\hat{x}$ solves the perturbed system $(A + \Delta A)y = b$.
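A short MATLAB illustration of Theorem 2.14 (a sketch, not from the notes; the matrix and the approximate solution are arbitrary):

```matlab
% Minimal normwise backward perturbation for an approximate solution xhat.
n = 5;
A = randn(n); x = randn(n,1); b = A*x;
xhat = x + 1e-8*randn(n,1);              % some approximate solution
r = b - A*xhat;
E = r*xhat'/norm(xhat)^2;                % E = r*xhat^H/||xhat||_2^2
norm((A+E)*xhat - b)                     % xhat solves the perturbed system exactly
norm(E,2)/norm(A,2) - norm(r)/(norm(A,2)*norm(xhat))   % 0 up to roundoff
```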
We next study the backward error when allowing perturbations in both A and b. Let $x \neq 0$, $b = Ax$, $r = b - A\hat{x}$, and consider any $\alpha \in \mathbb{C}$ and $y \in \mathbb{C}^n$ with $y^H \hat{x} = 1$. Define $\Delta A_\alpha := \alpha r y^H$ and $\Delta b_\alpha := (\alpha - 1) r$, then a simple computation shows that $(A + \Delta A_\alpha)\hat{x} = b + \Delta b_\alpha$, i.e., $\hat{x}$ exactly solves a whole family of perturbed linear algebraic systems parametrized by $\alpha$. For $\alpha = 1$ we have $\Delta A_\alpha = r y^H$ and $\Delta b_\alpha = 0$, and choosing $y = \hat{x}/\|\hat{x}\|_2^2$ (hence $y^H \hat{x} = 1$) we get the matrix E from the previous theorem.
In order to characterize minimum norm backward perturbations we need some addi-
tional theory.

Lemma 2.15. Let $\|\cdot\|$ be a norm on $\mathbb{C}^n$ and define
$$
\|\cdot\|^D : \mathbb{C}^n \to \mathbb{R} \quad \text{with} \quad \|x\|^D := \max_{\|z\|=1} |x^H z| \quad \text{for all } x \in \mathbb{C}^n.
$$

Then k · kD is a norm on Cn which is called the dual norm of k · k.

Proof. Clearly $\|x\|^D \geq 0$ with equality if and only if $x = 0$. For $\lambda \in \mathbb{C}$ we have $\|\lambda x\|^D = \max_{\|z\|=1} |(\lambda x)^H z| = |\lambda|\,\|x\|^D$. If $x_1, x_2 \in \mathbb{C}^n$, then
$$
\|x_1 + x_2\|^D = \max_{\|z\|=1} |(x_1 + x_2)^H z| \leq \max_{\|z\|=1} (|x_1^H z| + |x_2^H z|) \leq \max_{\|z\|=1} |x_1^H z| + \max_{\|z\|=1} |x_2^H z| = \|x_1\|^D + \|x_2\|^D,
$$

which completes the proof.


For the Euclidean norm k · k2 on Cn and each nonzero vector x ∈ Cn we have

$$
\|x\|_2^D = \max_{\|z\|_2=1} |x^H z| \geq \frac{|x^H x|}{\|x\|_2} = \|x\|_2, \qquad
\|x\|_2^D = \max_{\|z\|_2=1} |(z, x)| \leq \max_{\|z\|_2=1} (\|z\|_2\,\|x\|_2) = \|x\|_2,
$$

and thus kxk2 = kxkD 2 , where in the upper bound we have used the Cauchy-Schwarz
inequality. In other words, the dual norm of the Euclidean norm on Cn is the Euclidean
norm itself. More generally, for any $1 \leq p \leq \infty$, the p-norm on $\mathbb{C}^n$ is defined by
$$
\|\cdot\|_p : \mathbb{C}^n \to \mathbb{R} \quad \text{with} \quad \|x\|_p := \Big(\sum_{i=1}^{n} |x_i|^p\Big)^{1/p} \quad \text{for all } x = [x_1, \ldots, x_n]^T \in \mathbb{C}^n,
$$

and one can show that $\|\cdot\|_p^D = \|\cdot\|_q$, where $q = (1 - \tfrac{1}{p})^{-1}$, so that $\tfrac{1}{p} + \tfrac{1}{q} = 1$.
Basic properties of the dual norm are shown in the next result.

Lemma 2.16. For all x, y ∈ Cn we have

|xH y| ≤ kxkD kyk and |xH y| ≤ kxkkykD .

Proof. Both inequalities are obvious for y = 0. If y 6= 0, then for each x ∈ Cn we have

$$
\Big| x^H \frac{y}{\|y\|} \Big| \leq \max_{\|z\|=1} |x^H z| = \|x\|^D, \qquad \text{and hence} \qquad |x^H y| \leq \|x\|^D\,\|y\|.
$$

The second inequality follows from the first by using |xH y| = |y H x|.

For the Euclidean norm k · k2 , both inequalities in this lemma read
|xH y| = |y H x| = |(x, y)| ≤ kxk2 kyk2 ,
which is nothing but the Cauchy-Schwarz inequality in Cn equipped with the Euclidean
inner product.
More generally, for the p- and q-norm on $\mathbb{C}^n$, where $\tfrac{1}{p} + \tfrac{1}{q} = 1$, the second inequality in the lemma reads
$$
|x^H y| = |y^H x| = |(x, y)| \leq \|x\|_p\,\|y\|_q, \qquad \text{or} \qquad
\Big| \sum_{i=1}^{n} x_i \overline{y_i} \Big| \leq \Big(\sum_{i=1}^{n} |x_i|^p\Big)^{1/p} \Big(\sum_{i=1}^{n} |y_i|^q\Big)^{1/q}, \tag{2.6}
$$
which is called the Hölder inequality.
If k · kDD denotes the dual norm of k · kD , i.e.,
$$
\|x\|^{DD} := \max_{\|z\|^D=1} |x^H z| \quad \text{for all } x \in \mathbb{C}^n,
$$

then using the previous lemma gives


$$
\|x\|^{DD} = \max_{\|z\|^D=1} |x^H z| \leq \max_{\|z\|^D=1} (\|x\|\,\|z\|^D) = \|x\|.
$$

With some more effort one can also show the reverse inequality, which gives the following
important theorem.
Theorem 2.17 (Duality Theorem). If k·k is a norm on Cn , k·kD is the dual norm of k·k,
and k · kDD is the dual norm of k · kD , then kxk = kxkDD for all x ∈ Cn , i.e., k · k = k · kDD .
Corollary 2.18. Let k · k be a norm on Cn , let k · kD be the dual norm and let x
b ∈ Cn \ {0}.
Then there exists a vector y ∈ Cn \ {0} with
$$
1 = y^H \hat{x} = \|\hat{x}\|\,\|y\|^D.
$$
Such a vector y is called a dual vector of $\hat{x}$.
Proof. For the given vector $\hat{x}$ we know that
$$
\|\hat{x}\| = \|\hat{x}\|^{DD} = \max_{\|z\|^D=1} |\hat{x}^H z|.
$$
The maximum is attained for some vector $\tilde{z}$ with $\|\tilde{z}\|^D = 1$. Set $y := \tilde{z}/\|\hat{x}\|$, then $\|y\|^D = \|\tilde{z}\|^D/\|\hat{x}\|$, giving
$$
1 = \|\tilde{z}\|^D = \|\hat{x}\|\,\|y\|^D.
$$
Moreover, using $\tilde{z} = \|\hat{x}\|\, y$ we get
$$
\|\hat{x}\| = |\hat{x}^H \tilde{z}| = \|\hat{x}\|\,|\hat{x}^H y| = \|\hat{x}\|\,|y^H \hat{x}|,
$$
so that $|y^H \hat{x}| = 1$. Without loss of generality we can assume that $y^H \hat{x} = 1$, since y can be multiplied by a suitable constant $e^{i\theta}$.

For the Euclidean norm $\|\cdot\|_2$ a dual vector of $\hat{x} \in \mathbb{C}^n\setminus\{0\}$ is given by $y = \hat{x}/\|\hat{x}\|_2^2$, since then $\|y\|_2 = 1/\|\hat{x}\|_2$ and
$$
1 = \frac{\hat{x}^H}{\|\hat{x}\|_2^2}\,\hat{x} = \|\hat{x}\|_2\,\|y\|_2.
$$
kb
Remark 2.19. Corollary 2.18 is a finite dimensional version of the following corollary of
the Hahn-Banach Theorem:
If (X , k · kX ) is a normed linear space, and (X ∗ , k · kX ∗ ) is the dual space with k`kX ∗ =
supkxkX ≤1 |`(x)|, then for each nonzero x0 ∈ X there exists a nonzero `0 ∈ X ∗ with `0 (x0 ) =
k`0 kX ∗ kx0 kX .
Theorem 2.20 (Rigal & Gaches [13]). Let A ∈ Cn×n , x ∈ Cn , b = Ax, and x b ∈ Cn \ {0}.
Let k · k be any norm on Cn as well as the induced matrix norm on Cn×n . Let E ∈ Cn×n
and f ∈ Cn be given, and suppose that y ∈ Cn is a dual vector of xb. Then the normwise
backward error of the approximate solution $\hat{x}$ of $Ax = b$ is given by
$$
\eta_{E,f}(\hat{x}) := \min \{\varepsilon : (A + \Delta A)\hat{x} = b + \Delta b \ \text{with} \ \|\Delta A\| \leq \varepsilon\|E\| \ \text{and} \ \|\Delta b\| \leq \varepsilon\|f\|\}
= \frac{\|r\|}{\|E\|\,\|\hat{x}\| + \|f\|},
$$
and the second equality is attained by the perturbations
$$
\Delta A_{\min} := \frac{\|E\|\,\|\hat{x}\|}{\|E\|\,\|\hat{x}\| + \|f\|}\, r y^H, \qquad
\Delta b_{\min} := -\frac{\|f\|}{\|E\|\,\|\hat{x}\| + \|f\|}\, r.
$$
Proof. We have
$$
(A + \Delta A_{\min})\hat{x} = A\hat{x} + \frac{\|E\|\,\|\hat{x}\|}{\|E\|\,\|\hat{x}\| + \|f\|}\, r \underbrace{y^H \hat{x}}_{=1}
= \frac{\|E\|\,\|\hat{x}\|}{\|E\|\,\|\hat{x}\| + \|f\|}\, b + \frac{\|f\|}{\|E\|\,\|\hat{x}\| + \|f\|}\, A\hat{x}
= b + \Delta b_{\min},
$$
i.e., $\hat{x}$ solves the system that is perturbed by $\Delta A_{\min}$ and $\Delta b_{\min}$.
Next, if $\Delta A$ and $\Delta b$ are arbitrary with $(A + \Delta A)\hat{x} = b + \Delta b$ and $\|\Delta A\| \leq \varepsilon\|E\|$, $\|\Delta b\| \leq \varepsilon\|f\|$, then $r = b - A\hat{x} = \Delta A\hat{x} - \Delta b$, and hence
$$
\|r\| \leq \|\Delta A\|\,\|\hat{x}\| + \|\Delta b\| \leq \varepsilon(\|E\|\,\|\hat{x}\| + \|f\|),
$$
which shows that $\varepsilon$ is bounded from below as
$$
\varepsilon \geq \frac{\|r\|}{\|E\|\,\|\hat{x}\| + \|f\|} =: \varepsilon_{\min}.
$$

Finally, we show that the value εmin is attained by the perturbations ∆bmin , ∆Amin :

$$
\|\Delta b_{\min}\| = \frac{\|f\|}{\|E\|\,\|\hat{x}\| + \|f\|}\,\|r\| = \varepsilon_{\min}\|f\|,
$$
$$
\|\Delta A_{\min}\| = \max_{z\neq 0}\frac{\|\Delta A_{\min} z\|}{\|z\|}
= \frac{\|E\|\,\|\hat{x}\|}{\|E\|\,\|\hat{x}\| + \|f\|}\,\max_{z\neq 0}\frac{\|r y^H z\|}{\|z\|}
= \frac{\|E\|\,\|\hat{x}\|}{\|E\|\,\|\hat{x}\| + \|f\|}\,\|r\|\,\underbrace{\max_{z\neq 0}\frac{|y^H z|}{\|z\|}}_{=1/\|\hat{x}\|} = \varepsilon_{\min}\|E\|,
$$
where we have used that $1 = y^H \hat{x} = \|\hat{x}\|\,\|y\|^D = \|\hat{x}\| \cdot \max_{\|z\|=1} |y^H z|$.
As a special case of Theorem 2.20 consider the Euclidean norm $\|\cdot\|_2 = \|\cdot\|_2^D$, so that $y = \hat{x}/\|\hat{x}\|_2^2$ is a dual vector of $\hat{x}$. Then for E = A and f = 0 we obtain
$$
\eta_{A,0}(\hat{x}) = \frac{\|r\|_2}{\|E\|_2\,\|\hat{x}\|_2}, \qquad \Delta A_{\min} = \frac{r\hat{x}^H}{\|\hat{x}\|_2^2}, \qquad \Delta b_{\min} = 0,
$$
and hence we recover Theorem 2.14.


More generally, if $\|\cdot\|$ is any norm on $\mathbb{C}^n$ and $y \in \mathbb{C}^n$ is a dual vector of $\hat{x}$, we define the matrix
$$
B := b y^H + Z_B (I - \hat{x} y^H),
$$
where $Z_B \in \mathbb{C}^{n\times n}$ is arbitrary. Then a simple computation shows that $B\hat{x} = b$, and hence we have
$$
(A + \Delta A)\hat{x} = b \qquad \text{for} \qquad \Delta A := B - A.
$$
In case of the Euclidean norm, $y = \hat{x}/\|\hat{x}\|_2^2$, and $Z_B = A$, we obtain again the minimum perturbation, i.e., in this case $\Delta A = B - A = r\hat{x}^H/\|\hat{x}\|_2^2$.
In case of the Euclidean norm, y = x xk22 , and ZB = A, we obtain again the minimum
b/kb
perturbation, i.e., in this case ∆A = B − A = rb xH /kb
xk22 .
For E = A and f = b, the resulting quantity

$$
\eta_{A,b}(\hat{x}) := \min\{\varepsilon : (A + \Delta A)\hat{x} = b + \Delta b \ \text{with} \ \|\Delta A\| \leq \varepsilon\|A\|, \ \|\Delta b\| \leq \varepsilon\|b\|\} = \frac{\|r\|}{\|A\|\,\|\hat{x}\| + \|b\|}
$$

is called the normwise relative backward error of the approximation x


b. A numerical method
for solving Ax = b with A ∈ Cn×n nonsingular is called normwise backward stable if
it produces a computed approximation x b of x with ηA,b (b
x) on the order of the machine
precision u (or somewhat larger, when the context allows).
The main point is that a normwise backward stable method yields an exact solution
x
b of a (slightly) perturbed system, i.e., (A + ∆A)b x = b + ∆b with k∆Ak/kAk ≤ ε and
k∆bk/kbk ≤ ε, where ε is “small”. When our original system Ax = b contains uncertainties
(e.g. measurement errors), the perturbed system may just be the system we wanted to
solve!
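For illustration, the normwise relative backward error of a computed solution can be evaluated directly in MATLAB (a sketch, not from the notes; backslash is used as the solver):

```matlab
% Normwise relative backward error eta_{A,b}(xhat) of a computed solution.
n = 100;
A = randn(n); b = randn(n,1);
xhat = A\b;                                      % computed solution
r = b - A*xhat;
eta = norm(r)/(norm(A)*norm(xhat) + norm(b))     % of the order of the unit roundoff
```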

A numerical method for solving Ax = b with A ∈ Cn×n nonsingular is called normwise forward stable when the computed approximation $\hat{x}$ satisfies
$$
\frac{\|\hat{x} - x\|}{\|x\|} = O(\kappa(A)\,u);
$$

again recall our rule of thumb (2.2).


The next result shows that under some reasonable assumptions normwise backward
stability implies normwise forward stability.
Theorem 2.21. Let $A \in \mathbb{C}^{n\times n}$ be nonsingular, $x \in \mathbb{C}^n\setminus\{0\}$, $b = Ax$, $\varepsilon > 0$, and suppose that $\hat{x} \in \mathbb{C}^n$ satisfies
$$
(A + \Delta A)\hat{x} = b + \Delta b,
$$
where the perturbations satisfy $\|\Delta A\| \leq \varepsilon\|E\|$ and $\|\Delta b\| \leq \varepsilon\|f\|$ for some $E \in \mathbb{C}^{n\times n}$ and $f \in \mathbb{C}^n$, respectively. Suppose further that $\varepsilon\|A^{-1}\|\,\|E\| < 1$. Then for consistent norms we have
$$
\frac{\|\hat{x} - x\|}{\|x\|} \leq \frac{\varepsilon\|A^{-1}\|}{1 - \varepsilon\|A^{-1}\|\,\|E\|}\left(\frac{\|f\|}{\|x\|} + \|E\|\right).
$$
In particular, if E = A and f = b, then
$$
\frac{\|\hat{x} - x\|}{\|x\|} \leq \frac{2\varepsilon\kappa(A)}{1 - \varepsilon\kappa(A)}.
$$

Thus, if a numerical method produces $\hat{x}$ with $\eta_{A,b}(\hat{x}) = u$, i.e., the method is normwise backward stable, then
$$
\frac{\|\hat{x} - x\|}{\|x\|} = O(\kappa(A)\,u),
$$
i.e., the method is normwise forward stable.
x − x) = ∆b − ∆Ab
Proof. From A(b b − x = A−1 (∆b − ∆Ax + ∆A(x − x
x we obtain x b)).
−1
Taking norms yields kb
x − xk ≤ kA k(εkf k + εkEkkxk + εkEkkbx − xk), and hence

(1 − εkA−1 kkEk)kbx − xk ≤ εkA−1 k(kf k + kEkkxk)


εkA−1 k
 
kb
x − xk kf k
⇔ ≤ + kEk ,
kxk (1 − εkA−1 kkEk) kxk

where we have used that εkA−1 kkEk < 1.


If E = A and f = b, we use kbk = kAxk ≤ kAkkxk and get

kb
x − xk 2εκ(A)
≤ .
kxk 1 − εκ(A)

Since by assumption εκ(A) < 1, the last expression is on the order O(κ(A)u) if ε = u

34
Chapter 3

Direct Methods for Solving Linear


Algebraic Systems

In this chapter we consider linear algebraic systems Ax = b with a nonsingular matrix


A ∈ Cn×n , so that x = A−1 b is well defined.
Direct methods for solving Ax = b are based (at least implicitly) on a decomposition or
factorization of A into easily invertible factors, and the subsequent solution of the systems
involving these factors. A computed approximation x b of the exact solution x is available
only at the end of this process. Iterative methods, on the other hand, generate a sequence
of intermediate approximations, and can be stopped once a user-specified accuracy of the
approximate solution is attained.
For an example of a direct method we consider an HPD matrix A and its uniquely
determined Cholesky decomposition A = LLH , so that x = (LLH )−1 b = L−H (L−1 b). In
practice ones does not invert the matrices L and LH but rather solves the two triangular
systems
Ly = b and LH x = y,
which in exact arithmetic gives x = L−H y = L−H (L−1 b) = A−1 b. In finite precision
arithmetic, however, all computations are affected by rounding errors. The direct method of
solving Ax = b based on the Cholesky decomposition therefore generates an approximation
x
b of x, and the quality of this approximation depends on the errors made in the computation
of the decomposition, and in solving the two triangular systems.

3.1 Basics of rounding error analysis


In order to analyze the errors in such finite precision computations, we will have a brief
look at the computer arithmetic. In this arithmetic we do not have all real numbers R, but
only a finite subset, the floating point numbers F . For each x ∈ R we denote by f l(x) ∈ F
the closest floating point number to x. The machine precision u determines the maximum
distance between x ∈ R and f l(x) ∈ F . It is assumed that for each x ∈ R there exists

35
some δx ∈ R with |δx | ≤ u, such that

f l(x) = x(1 + δx ) ≤ x(1 + u),

and hence the relative error in the approximation of x by the floating point number f l(x)
satisfies
|x − f l(x)|
≤ u.
|x|
The error made when working with f l(x) instead of the exact number x is called a rounding
error.
For example, in MATLAB1 the command eps shows the spacing between two floating
point numbers:

>> eps
ans =
2.220446049250313e-16

The machine precision u is half this distance, so u ≈ 1.1102 × 10−16 .


We will next describe the standard model for computing with floating point numbers.
For any floating point numbers α, β ∈ F the computed result f l(α ~ β), where ~ is one of
+, −, ∗, ÷, satisfies
f l(α ~ β) = (α ~ β)(1 + δ)
for some δ (depending on α, β) with |δ| ≤ u. Again, the error made when working with
f l(α~β) instead of the exact value α~β is a rounding error, and the relative error satisfies
|α ~ β − f l(α ~ β)|
≤ u.
|α ~ β|
The following MATLAB example shows that rounding errors may occur even when com-
puting with small integers:

>> 1-(1/49)*49
ans =
1.110223024625157e-16

The number x = 49 is the smallest integer for which evaluating 1 − (1/x) · x in MATLAB
(and hence IEEE arithmetic) does not give exactly zero.
The (relative) rounding error in a single computation is bounded by the machine pre-
cision, which is very small. If we perform many computations, however, then the rounding
errors may “add up”, and ultimately lead to an inaccurate final result. It is therefore
important to understand the way in which numerical algorithms are affected by rounding
errors.
1
MATLAB constructs the floating point numbers using the IEEE Standard 754 from 2008. For details
see [Link]

36
A basic but nevertheless useful example, which occurs in many algorithms, is the com-
putation of the inner product of two real vectors, i.e.,
n
X
T
y x= xi y i , x, y ∈ Rn .
i=1

The number y T x can be computed the following algorithm:


s=0
for i = 1, . . . , n do
s = s + xi yi
end for
Let
k
X 
sk = f l xi yi , k = 1, . . . , n,
i=1
denote the computed partial sum of the first k terms. Then there exist |δi | ≤ u such that
s1 = f l(x1 y1 ) = x1 y1 (1 + δ1 ),
s2 = f l(s1 + x2 y2 ) = (s1 + x2 y2 (1 + δ2 ))(1 + δ3 )
= x1 y1 (1 + δ1 )(1 + δ3 ) + x2 y2 (1 + δ2 )(1 + δ3 ).
In order to reduce the technicalities we will write 1 ± δ instead of 1 + δi , where |δ| ≤ u.
Thus,
s2 = x1 y1 (1 ± δ)2 + x2 y2 (1 ± δ)2 ,
s3 = f l(s2 + x3 y3 ) = (s2 + x3 y3 (1 ± δ))(1 ± δ)
= x1 y1 (1 ± δ)3 + x2 y2 (1 ± δ)3 + x3 y3 (1 ± δ)2 ,
and inductively we get
sn = f l(y T x) = x1 y1 (1 ± δ)n + x2 y2 (1 ± δ)n + x3 y3 (1 ± δ)n−1 + · · · + xn yn (1 ± δ)2 . (3.1)
The derivation of an error estimate is based on the following elementary, yet important
result.
Lemma 3.1. If |δi | ≤ u for i = 1, . . . , n, and nu < 1, then
n
Y nu
(1 + δi ) = 1 + θn , where |θn | ≤ =: γn .
i=1
1 − nu

Proof. We have
n n
Y Y nu
(1 + δi ) ≤ (1 + u) = (1 + u)n ≤ 1 + ,
i=1 i=1
1 − nu

where the last inequality can be shown by induction on n under the assumption that
nu < 1.

37
Note that
nu
γn = = nu(1 + nu + (nu)2 + · · · ) = nu + O(n2 u2 ). (3.2)
1 − nu
Using this lemma we can write (3.1) as

f l(y T x) = x1 y1 (1 + θn ) + x2 y2 (1 + θen ) + x3 y3 (1 + θn−1 ) + · · · + xn yn (1 + θ2 ), or


y T x = f l(y T x) − (x1 y1 θn + x2 y2 θen + x3 y3 θn−1 + · · · + xn yn θ2 ).

If nu < 1 and |x| := [|x1 |, . . . , |xn |]T we therefore get the error estimate
n
X
T T
|y x − f l(y x)| ≤ γn |xi yi | = γn |y|T |x| = nu|y|T |x| + O(n2 u2 ). (3.3)
i=1

3.2 Stability and cost of the Cholesky decomposition


We will next study the numerical stability of computing the Cholesky decomposition. For
simplicity of notation, we will consider a real symmetric (rather than complex Hermitian)
positive definite matrix A. Its Cholesky decomposition is given by

l11 l21 · · ·
  
l11 ln1
. .. .. .. 
 l21 . . . . . 
 
T
A = LL =  . . , (3.4)

 .. .. . ..

 . . . ln,n−1 
ln1 · · · ln,n−1 lnn lnn

where L = [lij ] ∈ Rn×n with lii > 0, i = 1 . . . , n, is uniquely determined (cf. Theorem 1.3).
If we equate the columns in (3.4), we immediately obtain the following recursive algorithm
for computing the entries of L:
for j = 1, . . . , nP
do
j−1 2 1/2
ljj = (ajj − k=1 l )
1
Pj−1jk
lij = ljj (aij − k=1 lik ljk ) for i = j + 1, . . . , n
end for
If ljT denotes the jth row of L, then lj is the jth column of LT , and the two steps of
this algorithm can be rewritten as
j j
X X
2
ajj = ljk = ljT lj and aij = lik ljk = liT lj . (3.5)
k=1 k=1

This is, of course, not a recursive algorithm for computing the entries of L. But the
rewritten version clearly shows that (apart from the square root) every step of the recursive

38
algorithm consists of evaluating inner products. Thus, the rounding errors made in every
step can be estimated using (3.3).
Because of the recursive nature of the algorithm, its error analysis must consider the
entries of the computed Cholesky factor L b = [blij ] instead of the exact factor L = [lij ].
Thus, instead of (3.5) we must look at equations of the form
j j
X X
2
ajj = ljk
b ljT b
=b lj and aij = likb
b liT b
ljk = b lj . (3.6)
k=1 k=1

Based on (3.3) and skipping some technical details2 , we then obtain the following result.
Theorem 3.2. If L b is computed by the above algorithm applied to an SPD matrix A ∈
Rn×n , and nu < 1, then
A=L
bLbT + ∆A, where |∆A| ≤ γn+1 |L| bT |.
b |L

The matrix inequality in this theorem is meant entrywise. For any matrix norm k · k
the theorem gives the backward error bound
k|∆A|k ≤ γn+1 k|L||
b L bT |k,

which we will analyze more closely now. For 1 ≤ p ≤ ∞, let k · kp be the norm on Cn×n
induced by the p-norm on Cn , i.e.,
kAkp = max kAxkp .
kxkp =1

Then, in particular,
n
X n
X
1−1/p
kAk1 = max |aij | ≤ n kAkp and kAk∞ = max |aij | ≤ n1/p kAkp .
1≤j≤n 1≤i≤n
i=1 j=1

Moreover, for each A ∈ Cn×n we have


1/p
1−1/p
kAkp ≤ kAk1 kAk∞ ,
which reminds of the Hölder inequality (2.6). Since kAk1 = k|A|k1 and kAk∞ = k|A|k∞ ,
we get
1/p 1−1/p 1/p 1−1/p
k|A|kp ≤ k|A|k1 k|A|k∞ = kAk1 kAk∞ ≤ n2(1−1/p)/p kAkp ,
and in particular k|A|k2 ≤ n1/2 kAk2 .
If M ∈ Rn×n has an SVD of the form M = U ΣV T , then M M T = U Σ2 U T , from which
we see that kM M T k2 = kM k22 . Thus,

k|L||
b L bT |k2 = k|L|k
b 22 ≤ nkLk
b 22 = nkL bLbT k2 = nkA − ∆Ak2 ≤ n(kAk2 + k∆Ak2 )
≤ n(kAk2 + γn+1 k|L||
b L bT |k2 ),
2
As Higham writes, such details are “not hard to see after a little thought, but ... tedious to write
down” [5, p. 142].

39
from which we obtain
bT |k2 ≤ n
k|L||
b L kAk2 , (3.7)
1 − nγn+1
and hence, using (3.2),
bT |k2 ≤ nγn+1
kAk2 = n2 u + O(n3 u2 ) kAk2 .

k∆Ak2 ≤ k|∆A|k2 ≤ γn+1 k|L||
b L
1 − nγn+1
We thus have shown the following backward error result.
Corollary 3.3. The computed Cholesky factor L
b satisfies

bT + ∆A, k∆Ak2 nγn+1


A=L
bL where ≤ = n2 u + O(n3 u2 ).
kAk2 1 − nγn+1
Similar bounds on the relative backward error can be derived for other matrix norms.
All these bounds will be of the form k∆Ak/kAk ≤ p(n)u, where p(n) is a polynomial
of small degree in n. These bounds show that the above algorithm for computing the
Cholesky decomposition is normwise backward stable.
As mentioned above, if we have given the Cholesky factorization A = LLT , we can
compute an approximation of the solution of Ax = b by solving the two nonsingular
triangular systems

(1) Ly = b (hence in exact arithmetic y = L−1 b),


(2) LT x = y (hence in exact arithmetic x = L−T y = A−1 b).

A lower (or upper) triangular system can be solved using forward (or back) substitution.
The forward substitution algorithm used for solving Ly = b can be written as
j−1
1 X
yj = (bj − ljk yk ), j = 1, . . . , n.
ljj k=1

Evaluating the right hand side costs j multiplications P and j − 1 subtractions, and hence
the total cost of the forward substitution algorithm is nj=1 (2j − 1) = n2 .
In finite precision computations each operation is affected by rounding errors, and
hence the algorithm does not yield the exact solution y = [y1 , . . . , yn ]T but a computed
approximation yb = [b y1 , . . . , ybn ]T . Thus, for the rounding error analysis we must consider
the recurrence
j−1
1 X
ybj = (bj − ljk ybk ), j = 1, . . . , n,
ljj k=1

which can also be written as


j
X
bj = ljk ybk , j = 1, . . . , n.
k=1

40
This version shows that the numerical stability analysis of the algorithm can be done as
for the Cholesky algorithm (3.6). Obviously, the same analysis applies to the backward
substitution, i.e., the solution of an upper triangular system, and this leads to the following
result.

Theorem 3.4. If the linear algebraic system T y = b with the lower (or upper) triangular
matrix T ∈ Rn×n and b ∈ Rn is solved by the forward (or back) substitution algorithm as
stated above, then the computed approximation yb satisfies

(T + ∆T )b
y = b, where |∆T | ≤ γn |T |.

Using the same analysis as above one can now show that the relative backward errors
in the forward and back substitution algorithms satisfy bounds of the form k∆T k/kT k ≤
p(n)u, where k · k is an appropriate matrix norm, and p(n) is a small polynomial in n.
Thus, both algorithms are normwise backward stable.
Consequently, when we solve Ax = b with an SPD matrix A ∈ Rn×n in finite precision
arithmetic by first computing A = L bLbT + ∆A, and then computing approximations to
the solutions of Ly
b = b and L bT x = y by forward and back substitution, we obtain an
approximation xb such that

k∆Ak
(A + ∆A)b
x = b, where ≤ p(n)u;
kAk

see, e.g., [5, Theorem 10.4 and equation (10.7)]. Thus, the norm of the residual satisfies

krk = kb − Ab
xk = k∆Ab
xk ≤ p(n)kAkkb
xku,

giving
krk kAkkbxk
≤ p(n)u < p(n)u, (3.8)
kAkkb
xk + kbk kAkkb
xk + kbk
and hence the method is normwise backward stable; cf. the discussion of Theorem 2.20.
If we write the SPD matrix A ∈ Rn×n as
 
a11 aT1
A= ,
a1 A1

we have eT1 Ae1 = a11 > 0 and we can perform the following basic factorization step:
  " 1/2
#   1/2 1/2

a11 aT1 a11 0 1 0 a11 −a11 aT1
A= = −1/2 ,
a1 A1 a11 a1 In−1 0 S1 0 In−1
| {z } | {z }
=:L1 =:LT
1

a1 aT
where S1 := A1 − a11
1
∈ R(n−1)×(n−1) is called Schur complement of a11 in A.

41
 
1 0
The symmetric matrices A and are congurent and hence have the same inertia
0 S1
(i.e., number of positive, negative and zero eigenvalues). Since A is SPD, S1 must be
SPD as well. Hence we can perform the basic factorization step on S1 , which leads to a
factorization of the form  
1 0 0
A = L1 L2 0 1 0  LT2 LT1 ,
0 0 S2
where S2 ∈ R(n−2)×(n−2) is another SPD Schur complement. After n steps we obtain the
Cholesky decomposition

A = (L1 L2 · · · Ln )(LTn · · · LT2 LT1 ) =: LLT .

Note that in the nth step we only take one square root and do not form a Schur
complement. Forming the Schur complement Sj in step j = 1, . . . , n − 1 requires the
following types of operations:
1
(1) α
v with v ∈ Rn−j : n − j multiplications.

(2) ( α1 v)v T : (n−j)(n−j+1)


2
multiplications. (Since this matrix is symmetric, only its upper
(or lower) triangular part needs to be computed.)
(n−j)(n−j+1)
(3) Sj = M −( α1 vv T ) with M ∈ R(n−j)×(n−j) : 2
operations. (Again, this matrix
is symmetric.)

Disregarding the square roots, the total cost for computing the Cholesky decomposition
using the factorization approach described above is
n−1
X n−1
X
[(n − j)(n − j + 1) + (n − j)] = (n − j)(n − j + 2)
j=1 j=1
n−1
X n−1
X n−1
X
2
= k(k + 2) = k +2 k
k=1 k=1 k=1
(n − 1)n(2(n − 1) + 1) (n − 1)n
= +2
6 2
1
= (n(n − 1)(2n − 1) + 6(n2 − n))
6
1 1
= (2n3 + 3n2 − 5n) ≈ n3 (for large n).
6 3
Many applications involve sparse matrices, i.e., matrices with a significant number of
zero entries. Zero entries need not be stored, and they do not take part in numerical
evaluations (multiplication, addition, subtraction). When A is sparse, we also would like
to have a sparse Cholesky factor L.

42
When aij = 0 but lij 6= 0, the element lij is called a fill-in element. When A is SPD,
then P T AP is SPD for any permutation matrix P . An important line of research in the
context of the (sparse) Cholesky decomposition is concerned with finding permutations P
so that the Cholesky factor of P T AP has the least possible number of nonzero entries. In
this context we speak of the sparse Cholesky decomposition.

3.3 Computing the LU decomposition


Let us now consider a general (nonsingular) matrix A ∈ Cn×n . If the leading principal
minors A(1 : k, 1 : k) ∈ Ck×k are nonsingular for all k = 1, . . . , n, then the decomposition

A = LU

with a unit lower triangular matrix L ∈ Cn×n and a nonsingular upper triangular ma-
trix U ∈ Cn×n exists; see Theorem 1.1. Then x = A−1 b = U −1 (L−1 b), so that we can
again compute x (or rather an approximation x b) using the forward and back substitution
algorithms, which are normwise backward stable.
The LU decomposition can be computed by “Gaussian elimination”: At step j =
1, . . . , n − 1, multiples of the j-th row are subtracted from rows j + 1, . . . , n in order to
introduce zeros in the column j below the entry in position (j, j), and the result is the
upper triangular matrix U . Schematically:
     
× × × × × × × × × × × ×
 × × × × 
→ 0 × × × → 0 × × × 
   
A = × × × ×   0 × × ×   0 0 × × 
× × × × 0 × × × 0 0 × ×
 
× × × ×
 0 × × × 
→  0 0 × × =U

0 0 0 ×

Each step j in the process can be considered one left-multiplication of A by a suitable


unit lower triangular matrix Lj . In step j = 1 we have
    
1 0 a11 wH a11 wH 1
−1 = , where S1 := A1 − vwH .
v I n−1 v A1 0 S 1 a 11
| a11 {z }| {z } | {z }
=:L1 =A =:U1

By the assumption on A, we are guaranteed that a11 6= 0 and that the leading principal
minors of S1 are nonsingular. Hence the process can be continued with the matrix S1 .
After n − 1 steps we obtain

Ln−1 · · · L2 L1 A = U or A = (L−1 −1 −1
1 L2 · · · Ln−1 )U =: LU,

43
where L1 , . . . , Ln−1 ∈ Cn×n and hence L = L−1 −1 −1
1 L2 · · · Ln−1 are unit lower triangular, and
U ∈ Cn×n is nonsingular and upper triangular.
Each matrix Lj is of the form
 
Ij−1  
0j
Lj =  1  = In + eTj ,
lj
lj In−j

for some lj ∈ Cn−j , and hence


 
0j
Lj−1 = In − eTj , (3.9)
lj

which shows that L = L−1 −1 −1


1 L2 · · · Ln−1 can be easily computed from L1 , . . . , Ln−1 . More-
over,
          
−1 −2 01 T 02 T 01 T 02
L1 L2 = In − e1 In − e2 = In − e1 − eT2 ,
l1 l2 l1 l2

and inductively we see that


n−1  
X 0j
L = In − eTj . (3.10)
lj
j=1

In [19, pp. 149–151], the simple form (3.9) of the inverse L−1
j and the simple form (3.10)
of the product of these inverses are called (the first) “two strokes of luck” of Gaussian
elimination.
The main cost in the algorithm described above is in forming the Schur complement
matrices Sj ∈ C(n−j)×(n−j) , j = 1, . . . , n − 1. For a nonsymmetric (or non-Hermitian)
matrix this is (approximately) twice as expensive as forming the symmetric (or Hermitian)
Schur complement in the algorithm for computing the Cholesky decomposition. The cost
for computing the LU decomposition (for large n) therefore is approximately 32 n3 .
Assuming that the decomposition A = LU exists, and that the above algorithm runs
to completion, we can perform a rounding error analysis similar to the analysis for the
Cholesky decomposition. This analysis can be based on writing the decomposition in the
(inner product) form
i
X i
X
aij = lik ukj , and aij = lik u
b bkj , (3.11)
k=1 k=1

and it results, analogously to Theorem 3.2, in a componentwise error bound of the form

A=L
bUb + ∆A, where |∆A| ≤ γn |L||
b U b |;

44
see, e.g., [5, Theorem 9.3]. However, on the contrary to the Cholesky decomposition, the
sizes of the entries in the factors |L|
b and |U
b | are not bounded by kAk, and we can not
derive a bound analogous to (3.7).
For example, the matrix
 
ε 1
A= , ε∈ / {0, 1},
1 1

has nonsingular leading principal minors, and the the first step of the algorithm gives
    
1 0 ε 1 ε 1
= ,
−ε−1 1 1 1 0 1 − ε−1

so that     
ε 1 1 0 ε 1
= =: LU.
1 1 ε−1 1 0 1 − ε−1
For |ε| → 0 the largest entries in the factors L and U grow unboundedly, hence kLk
and kU k become arbitrarily large, while the largest entry in the matrix A is 1. The
potential numerical instabilities are illustrated by the following MATLAB computation
using ε = 10−16 :

>> e=1e-16; L=[1 0;1/e 1]; U=[e 1; 0 1-1/e]; L*U


ans =
0.000000000000000 1.000000000000000
1.000000000000000 0

The computed product LU is obviously quite far from the exact matrix A.
In order to control the sizes of the entries in the factors one can use pivoting 3 . We
assume that A is nonsingular. (It will turn out that the nonsingularity assumption on the
leading principal minors is not required when pivoting is used.) After j ≥ 0 steps of the
above algorithm we have


 
Uj
(j)
 ujj ∗ · · · ∗ 
Lj · · · L1 A =  ,
 
.
. .
. .
.
 0 . . . 
(j)
unj ∗ ... ∗

where Uj ∈ Cj×j is upper triangular. (Here U0 is the empty matrix.) Since A is nonsingular,
(j) (j)
at least one of the entries ujj , . . . , unj must be nonzero. In the strategy of partial (or row)
(j)
pivoting, we select an entry ukj of maximum modulus among these entries. We then
3
According to the Merriam-Webster Dictionary, a pivot is “a person, thing, or factor having a major
or central role, function, or effect”. This real-life definition fits well to the mathematical meaning of the
pivot in Gaussian elimination, which is described in this paragraph.

45
exchange the rows j and k, and form the elimination matrix Lj using the submatrix with
the exchanged rows. When forming Lj we divide the first column of the corresponding
(j)
submatrix by the pivot ukj , which has the largest magnitude in that column. Consequently,
all entries in the matrix Lj , and hence all entries in L−1
j , are bounded in modulus by 1.
In matrix notation, the exchange of rows j and k, where 1 ≤ j ≤ k ≤ n, corresponds
to a left-multiplication by the permutation matrix

Pjk := [e1 , . . . , ej−1 , ek , ej+1 , . . . , ek−1 , ej , ek+1 , . . . , en ].

For any nonsingular matrix A ∈ Cn×n , the Gaussian elimination algorithm with partial
pivoting produces a factorization of the form

Ln−1 Pn−1,kn−1 · · · L2 P2,k2 L1 P1,k1 A = U,


(j) (j)
where Lj = [lst ] is unit lower triangular with |lst | ≤ 1 for all s, t, the matrices Pj,k are
permutation matrices, and U is nonsingular and upper triangular.
For example, if 0 < |ε| < 1, then
     
1 0 0 1 ε 1 1 1
= ,
−ε 1 1 0 1 1 0 1−ε
| {z } | {z } | {z } | {z }
=:L1 =P12 =A =:U

or A = P LU , where L = L−1 T −1
1 and P = P12 = P12 .

Note that since 1 < 2 ≤ k2 ≤ n, we have eT1 P2,k2 = eT1 , and therefore
     
01 01
P2,k2 L1 = P2,k2 + P2,k2 T
e1 = In + e eT1 P2,k2 ,
l1 l1

where e
l1 and l1 have the same entries, except for possibly two permuted ones (if 2 < k2 ).
More generally, it follows that

Ln−1 Pn−1,kn−1 · · · L2 P2,k2 L1 P1,k1 = L


en−1 · · · L e1 Pn−1,kn−1 , · · · P2,k2 P1,k1
e2 L
= LP,
e

where L lij ] is unit lower triangular with |e


e = [e lij | ≤ 1 and the same structure as the matrix
L in (3.10), and P is a permutation matrix. In [19, pp. 159–160] this is called the “third
stroke of luck” of Gaussian elimination. From L ePeA = U we now obtain the following
important result.

Theorem 3.5 (LU decomposition with partial pivoting). Each nonsingular matrix A ∈
Cn×n can be factorized A = P LU , where P ∈ Cn×n is a permutation matrix, L = [lij ] ∈
Cn×n is unit lower triangular with |lij | ≤ 1, and U ∈ Cn×n is nonsingular and upper
triangular.

46
Let us write P T A = [e
aij ] = LU , where L = [lij ] ∈ Cn×n is unit lower triangular with
|lij | ≤ 1, and U is upper triangular. Then from Theorem 3.5 we obtain

a1j = l11 u1j


e ⇒ u1j = e a1j ,
a2j = l21 u1j + l22 u2j
e ⇒ |u2j | ≤ 2 max{|e
a1j |, |e
a2j |},
a3j = l31 u1j + l32 u2j + l33 u3j
e ⇒ |u3j | ≤ 4 max{|e
a1j |, |e
a2j |, |e
a3j |},

and inductively we see that

|uij | ≤ 2i−1 max |e


akj |, for all i, j = 1, . . . , n. (3.12)
1≤k≤i

Thus, when using the partial pivoting strategy the entries of the upper triangular factor U
are bounded in terms of the entries of A, where the upper bound contains the (inconvenient)
constant 2i−1 . A closer analysis of this situation is based on the following definition.
Definition 3.6. The growth factor for the (nonsingular) matrix A in the LU factorization
algorithm with partial pivoting as explained above is given by 4
maxi,j |uij |
ρ(A) := .
maxi,j |aij |
One can now show that if the LU factorization algorithm with partial pivoting (and
even without partial pivoting!) runs to completion, the computed factors satisfy
k∆Ak
A = PbL
bUb + ∆A, where ≤ p(n)ρ(A)u,
kAk
and p(n) is some low-degree polynomial in n. Moreover, when subsequently using the
computed LU decomposition for solving the linear algebraic system Ax = b using the
normwise backward stable forward and back substitution algorithms, we obtain a computed
approximation x
b with
k∆Ak
(A + ∆A)b
x = b, where ≤ p(n)ρ(A)u;
kAk
see, e.g., [5, p. 165]. Analogously to (3.8) we now obtain
krk kAkkbxk
≤ p(n)ρ(A)u < p(n)ρ(A)u.
kAkkb
xk + kbk kAkkb
xk + kbk
Apart from the growth factor ρ(A), the backward error bounds for the LU factorization
algorithm with partial pivoting coincide with those for Hermitian positive definite matrices
and the Cholesky decomposition. We see from (3.12) that

ρ(A) ≤ 2n−1
4
The standard notation for the growth factor is unfortunately in conflict with the standard notation
for the spectral radius.

47
for all (nonsingular) matrices A ∈ Cn×n . There are in fact matrices for which this up-
per bound on the growth factor is attained. However, Gaussian elimination with partial
pivoting is “utterly stable in practice” in the sense that matrices A with large growth fac-
tors rarely occur. For discussions of this fact also from a historical point of view, see [19,
pp. 166–170] or [5, Section 9.4].

48
Chapter 4

Iterative Methods for Solving Linear


Algebraic Systems

As indicated in the introdution to Chapter 3, an essential difference between iterative and


direct solution methods for solving linear algebraic systems is that the former generate a
sequence of intermediate approximations (called iterates), while the latter yield an approx-
imation of the exact solution only at the very end of the computation. Using the iterates it
is possible to estimate the error (or residual) norm, which allows to stop the iteration when
a desired accuracy is reached. This can be a significant advantage in practical applications,
where we usually do not require a highly accurate approximation of the exact solution.

4.1 Classical iterative methods


In the following we consider a linear algebraic system Ax = b with A ∈ Cn×n nonsingular.
Many “classical” iterative methods are based on a splitting A = M − N , where the matrix
M should be nonsingular. For a computationally efficient method we also require that
M −1 can be easily computed, or linear algebraic systems with M can be (approximately)
solved at low cost. This is satisfied, for example, when M is a diagonal or a triangular
matrix.
If A = M − N , where M is nonsingular, then Ax = b can be written as (M − N )x = b,
and hence
x = M −1 N x + M −1 b.
This fixed point equation suggests the iterative method

xk+1 = M −1 N xk + M −1 b, k = 0, 1, 2, . . . , (4.1)

where x0 is a given initial approximation.


The kth error of the iteration is given by ek := x − xk . Using

M −1 b = M −1 (M − N )x = x − M −1 N x,

49
we obtain
ek = x − xk = x − (M −1 N xk−1 + x − M −1 N x) = M −1 N ek−1 ,
and by induction
ek = (M −1 N )k e0 , k = 0, 1, 2, . . . . (4.2)
The matrix M −1 N is called the iteration matrix of (4.1). The iteration converges to the
exact solution x, when ek → 0 for k → ∞. For consistent norms k · k we have the error
bound
kek k = k(M −1 N )k e0 k ≤ kM −1 N kk ke0 k.
Thus, if there exists some consistent norm k · k with kM −1 N k < 1, then the iteration (4.1)
converges for any x0 .
More generally, from (4.2) we see that the iteration converges for any given x0 , if and
only if M −1 N → 0 for k → ∞. In order to analyze this situation we consider the Jordan
decomposition
M −1 N = XJX −1 , J = diag(Jd1 (λ1 ), . . . Jdm (λm )).
Then (M −1 N )k = XJ k X −1 , and (M −1 N )k → 0 holds if and only if
J k = diag((Jd1 (λ1 ))k , . . . (Jdm (λm ))k ) → 0.
The kth power of a Jordan block is given by
min{k,d}  
k k
X k k−j
(Jd (λ)) = (λId + Jd (0)) = λ (Jd (0))j
j=0
j
min{k,d}
X 1 k!λk−j
= (Jd (0))j ,
j=0
j! (k − j)!

and thus (Jd (λ))k → 0 holds if and only if


k!λk−j k!|λ|k−j

(k − j)! = (k − j)! → 0

for every j = 0, 1 . . . , min{k, d}. It is clear that this holds when λ = 0. For λ 6= 0 and a
fixed j we divide two consecutive terms of this sequence and obtain, for k ≥ d,
(k + 1)!|λ|k+1−j (k − j)! k+1 k+1 1
k−j
= |λ| ≤ |λ| = |λ| d
.
(k + 1 − j)! k!|λ| k+1−j k+1−d 1 − k+1

The last term approaches 1 for k → ∞. Therefore |λ| < 1 is necessary and sufficient for
(Jd (λ))k → 0. In summary, we have shown the following result.
Theorem 4.1. The iteration (4.1) converges for any initial vector x0 , if and only if the
spectral radius of the iteration matrix satisfies ρ(M −1 N ) < 1. A sufficient condition for
convergence for any initial vector x0 is kM −1 N k < 1 for some consistent norm k · k.

50
For a simple example, suppose that
1 
−1 2
α
M N= 1 .
0 2

Then ρ(M −1 N ) = 12 and (M −1 N )k → 0 for k → ∞. However, for |α|  1 we will have


kM −1 N k  1 for any matrix norm k · k. The iteration (4.1) will therefore converge for any
given x0 , although the norm of the iteration matrix is larger than 1. Recall also Lemma 2.9,
where we have shown that kAk ≥ ρ(A) holds for any consistent norm.
In order to construct specific examples of the iteration (4.1), we write A as

A = L + D + U,

where L and U are the strictly lower and upper triangular parts. This yields the following
classical methods:

• Jacobi method: M = D, N = −(L + U ), hence M −1 N = −D−1 (L + U ) =: RJ .

• Gauss-Seidel method: M = L + D, N = −U , hence M −1 N = −(L + D)−1 U =: RG .

In both cases M −1 exists if and only if A = [aij ] has nonzero diagonal elements.
For the ∞-norm and the Jacobi method we then have
X |aij |
kRJ k∞ = max .
1≤i≤n
j6=i
|aii |

A matrix that satisfies


X |aij |
max <1
1≤i≤n
j6=i
|aii |

is called strictly (row) diagonally dominant. Thus, the Jacobi method converges for any
x0 when applied to such matrices A.
For the Gauss-Seidel method and a strictly (row) diagonally dominant matrix A we
consider the equation RG x = λx, or

−U x = λ(L + D)x ⇔ λDx = (λL − U )x.

Assuming, without loss of generality, that the entry of largest magnitude in the vector x
is x` = 1, we obtain
`−1 n Pn n
j=`+1 |a`j | |a`j |
X X X
|λ||a`` | ≤ |λ| |a`j | + |a`j | ⇒ |λ| ≤ ≤ < 1,
|a`` | − `−1 |a`` |
P
j=1 j=`+1 j=1 |a`j | j=`+1

and thus ρ(RG ) < 1, if A is strictly (row) diagonally dominant.

51
Although forming RG is “more expensive” than forming RJ , we are not guaranteed
that the Gauss-Seidel method performs better than the Jacobi method. For example,
 
1 1 −1
A = 1 1 1 
2 2 1
yields    
0 −1 1 0 −2 2
RJ = −1 0 −1 , RG = 0 2 −3 .
−2 −2 0 0 0 2
Since RJ3 = 0, i.e., RJ is nilpotent, we have ρ(RJ ) = 0, while ρ(RG ) = 2.
If ρ(M −1 N ) is close to 1, then the convergence of (4.1) can be very slow. In order to
improve the speed of convergence we can introduce a (real) relaxation parameter ω > 0
and consider ωAx = ωb instead of Ax = b. We then split
ωA = ω(L + D + U ) = (D + ωL) + (ωU + (ω − 1)D) =: M − N,
which results in the iteration
xk+1 = RSOR (ω)xk + ωM −1 b, k = 0, 1, 2, . . . , (4.3)
where
RSOR (ω) := −(D + ωL)−1 (ωU + (ω − 1)D). (4.4)
In order to form RSOR (ω), we still require that A has nonzero diagonal elements. For ω = 1
this gives the Gauss-Seidel method, and for ω > 1 this method is called the Successive Over
Relaxation (SOR) method. For 0 < ω < 1 the resulting methods are called under-relaxation
methods.
Numerous publications, in particular from the 1950s and 1960s, are concerned with
choosing an “optimal” ω in the sense that ρ(RSOR (ω)) is minimal in different applications.
The following result of Kahan [9] shows that one can restrict the search of an optimal (real
and positive) ω to the interval (0, 2).
Theorem 4.2. The matrix RSOR (ω) in (4.4) satisfies ρ(RSOR (ω)) ≥ |1 − ω|, and hence
the method (4.3)–(4.4) can converge only if ω ∈ (0, 2).
Proof. Let λ1 (ω), . . . , λn (ω) be the eigenvalues of RSOR (ω). Then the determinant multi-
plication theorem yields
n
Y
λj (ω) = det(RSOR (ω)) = det(−(I + ωD−1 L)−1 D−1 ) det(D(ωD−1 U + (ω − 1)I))
j=1

= (−1)n det((I + ωD−1 L)−1 ) det(D−1 ) det(D) det(ωD−1 U + (ω − 1)I)


= (−1)n (ω − 1)n = (1 − ω)n ,
where we have
Qnused that the matrices D−1 L and D−1 U are strictly lower and upper trian-
n
gular. Now j=1 |λj | = |1 − ω| implies ρ(RSOR (ω)) = max1≤j≤n |λj (ω)| ≥ |1 − ω|.

52
The theorem above does not show that ω ∈ (0, 2) is sufficient for the convergence of the
method (4.3)–(4.4). This is sufficient, however, when A is HPD, as shown by the following
theorem, which is a special case of a result of Ostrowski [11]. In particular, the theorem
shows that the Gauss-Seidel method converges for HPD matrices.
Theorem 4.3. If A ∈ Cn×n is HPD, then ρ(RSOR (ω)) < 1 holds for each ω ∈ (0, 2).
More generally, we can consider a splitting A = M − N , a relaxation parameter ω > 0,
and write ωAx = ωb in the equivalent form
x = ((1 − ω)I + ωM −1 N )x + ωM −1 b.
This yields the iteration
xk+1 = R(ω)xk + ωM −1 b, k = 0, 1, 2, . . . ,
where
R(ω) := (1 − ω)I + ωM −1 N.
Now ek = R(ω)k e0 , so that the convergence is determined by ρ(R(ω)). The spectrum of
the iteration matrix is given by
Λ(R(ω)) = (1 − ω) + ωΛ(M −1 N ),
which can be used for determining an optimal ω when Λ(M −1 N ) is (approximately) known.
In all such iterations the convergence is asymptotically (for large k) linear, with the average
“reduction factor” per step given by kR(ω)k.

4.2 Projection methods based on Krylov subspaces


In this section we consider methods that are based on projections onto subspaces. Suppose
that x0 is a given approximation of x = A−1 b. In step k = 1, 2, . . . , of a projection method
we construct an approximation of the form
xk ∈ x0 + Sk , (4.5)
where Sk is a k-dimensional subspace of Cn called the search space (we of course assume
k ≤ n). Since we have k degrees of freedom to construct xk , we require k constraints in
order to determine xk . We impose these on the residual rk = b − Axk and require that
rk ⊥ Ck , (4.6)
where Ck is a k-dimensional subspace of Cn called the constraints space.
Suppose that Sk , Ck ∈ Cn×k represent any bases of Sk and Ck , respectively, then (4.5)-
(4.6) can be written as
xk = x0 + Sk tk for some tk ∈ Ck , (4.7)
and
0 = CkH rk = CkH (b − Ax0 − ASk tk ), or CkH ASk tk = CkH r0 . (4.8)
In order to obtain tk , we thus need to solve a projected system of order k.

53
Lemma 4.4. Let Sk , Ck ∈ Cn×k represent bases of the k-dimensional subspaces Sk , Ck ⊆
Cn . Then the following statements are equivalent

(1) The matrix CkH ASk ∈ Ck×k is nonsingular.

(2) Cn = ASk ⊕ Ck⊥ .

Proof. If CkH ASk is nonsingular, then rank(ASk ) = k and CkH ASk z = 0 implies z = 0.
Hence Cn⊥ ∩ ASk = {0}, so that Ck = ASk ⊕ Ck⊥ .
On the other hand, if Cn = ASk ⊕ Ck⊥ , then ASk has dimension k. Let CkH ASk z = 0
for some z ∈ Ck . Then ASk z ∈ ASk ∩ Ck⊥ and hence ASk z = 0. Since ASk has rank k, we
have z = 0 and CkH ASk is nonsingular.
This lemma shows that the question whether tk in (4.7)–(4.8) is uniquely determined
depends only on A, Sk , Ck but not on the choice of bases for Sk , Ck .

Definition 4.5. If tk in (4.7)–(4.8) is uniquely determined, i.e., CkH ASk is nonsingular,


we call the projection method (4.5)–(4.6) well defined at step k.

Let the projection method be well defined at step k. Then

tk = (CkH ASk )−1 CkH r0


hence
xk = x0 + Sk tk = x0 + Sk (CkH ASk )−1 CkH r0 ,
and
rk = b − Axk = (I − Pk )r0 , where Pk = ASk (CkH ASk )−1 CkH . (4.9)
The matrix Pk is a projection since Pk2 = Pk . For all v ∈ Cn we have

Pk v ∈ ASk , and (I − Pk )v ∈ Ck⊥ ,

and hence Pk projects onto ASk orthogonally to Ck . Equation (4.9) can be written as

r0 = Pk r0 + rk .
|{z} |{z}
∈ASk ∈Ck⊥

If ASk = Ck , then this decomposition is orthogonal since its two components are mutually
orthogonal. In this case Cn = Ck ⊕ Ck⊥ and we call the method an orthogonal projection
method. When ASk 6= Ck , we call the method an oblique projection method.
Note that in an orthogonal projection method we have

kr0 k22 = kPk r0 k22 + krk k22 , or krk k22 = kr0 k22 − kPk r0 k22 .

If ASk ⊆ ASk+1 , then kPk r0 k2 ≤ kPk+1 r0 k2 and hence krk+1 k2 ≤ krk k2 , i.e., the Euclidean
norm of the residual is monotonically decreasing.

54
Theorem 4.6. In the notation established above, a projection method is well defined at
step k, if any of the following conditions hold:

(i) A is HPD and Ck = Sk ,

(ii) A is nonsingular and Ck = ASk .

Proof. (i) If Ck = Sk , then for any bases Ck , Sk we have Ck = Sk Z for some nonsingular
Z ∈ Ck×k , and CkH ASk = Z H SkH ASk , which is nonsingular since SkH ASk is nonsingular
(even HPD).
(ii) Now we have Ck = ASk Z and CkH ASk = Z H SkH AH ASk . This matrix is nonsingular,
since A is nonsingular, and hence SkH AH ASk is HPD.
After studying when the projection method (4.5)–(4.6) is well defined we now study
when the method terminates (in exact arithmetic) with rk = 0.

Lemma 4.7. In the notation established above, let the projection method be well defined
at step k. If r0 ∈ Sk and ASk = Sk , then rk = 0.

Proof. If r0 ∈ Sk and ASk = Sk , then r0 ∈ ASk , and hence rk = r0 − Pk r0 = 0; cf.(4.9).


The lemma gives a sufficient condition for the finite termination property of a projection
process. It motivates to use search spaces Sk with

S1 = span{r0 }, and S1 ⊂ S2 ⊂ S3 ⊂ . . .

such that these spaces automatically satisfy

ASk = Sk for some k.

These properties are guaranteed when

Sk = Kk (A, r0 ) := span{r0 , Ar0 , . . . , Ak−1 r0 }, k ≥ 1,

where Kk (A, r0 ) is the kth Krylov subspace of A and r0 ; see (1.2). In the following result
we collect important properties of the Krylov subspaces.

Lemma 4.8. Let A ∈ Cn×n be nonsingular and r0 ∈ Cn \ {0}.

(i) There exists a uniquely determined d = d(A, r0 ) with 1 ≤ d ≤ n and

K1 (A, r0 ) ⊂ . . . ⊂ Kd (A, r0 ) = Kd+j (A, r0 )

for all j ≥ 1. This d is called the grade of r0 with respect to A. In particular,


dim(Kk (A, r0 )) = k for k = 1, . . . , d.

(ii) If r0 is of grade d with respect to A, then AKd (A, r0 ) = Kd (A, r0 ).

55
(iii) If r0 is of grade d with respect to A, Sk = Kk (A, r0 ) in the projection method (4.5)–
(4.6), and the method is well defined at step d, then rd = 0.

Proof. (i) and (ii) were already shown in Chapter 1; see in particular the discussion of
(1.3). (iii) follows from (ii) and Lemma 4.7.
We can now use this lemma and Theorem 4.6 to obtain the following mathematical
characterization of several well defined Krylov subspace methods.

Theorem 4.9. Consider the projection method (4.5)–(4.6) for solving a linear algebraic
system Ax = b with initial approximation x0 and let r0 = b − Ax0 be of grade d ≥ 1 with
respect to A.

(i) If A is HPD, Sk = Ck = Kk (A, r0 ), k = 1, 2, . . . , then the projection method is well


defined at every step k until it terminates with rd = 0 at step d. It is characterized
by the orthogonality property

rk ⊥ Kk (A, r0 ), or x − xk ⊥A Kk (A, r0 ),

and the equivalent optimality property

kx − xk kA = min kx − zkA .
z∈x0 +Kk (A,r0 )

(Mathematical description of the Conjugate Gradient (CG) method.)

(ii) If A is nonsingular, Sk = Kk (A, r0 ), Ck = ASk = AKk (A, r0 ), k = 1, 2, . . . , then the


projection method is well defined at every step k until it terminates with rd = 0 at
step d. It is characterized by the orthogonality property

rk ⊥ AKk (A, r0 ) or x − xk ⊥AH A Kk (A, r0 )

and the equivalent optimality property

krk k2 = min kb − Azk2


z∈x0 +Kk (A,r0 )

(Mathematical description of the MINRES and GMRES method.)

Proof. It only remains to prove the equivalent optimality properties.


(i) For any z ∈ x0 + Kk (A, r0 ) we have

kx − zk2A = k (x − xk ) + (xk − z) k2A = kx − xk k2A + kxk − zk2A ≥ kx − xk k2A ,


| {z } | {z }
∈Kk (A,r0 )⊥A ∈Kk (A,r0 )

and hence xk minimizes kx − zk2A over x0 + Kk (A, r0 ).


(ii) Exercise.

56
In order to implement the Krylov subspace methods that are characterized in the above
theorem, we need bases of Sk = Kk (A, r0 ) and Ck , for k = 1, 2, . . . . The “canonical” basis
r0 , Ar0 , . . . , Ak−1 r0 of Kk (A, r0 ) should not be used in practical computations, since the
corresponding matrix usually is (very) ill conditioned: For simplicity, assume that A is
diagonalizable, A = X diag(λ1 , . . . , λn ) X −1 , with a single dominant eigenvalue, |λ1 | > |λj |
for j = 2, . . . , n, and suppose that r0 = X[α1 , . . . , αn ]T with α1 6= 0. Then
  
1 α1
 (λ2 /λ1 )k   α2 
Ak r0 = λk1 X    ..  .
  
..
 .   . 
k
(λn /λ1 ) αn

It is easy to see that the sequence

vk = Ak−1 r0 /kAk−1 r0 k, k = 1, 2, . . .

converges towards the (normalized) first column of X, i.e. an eigenvector of A correspond-


ing to the dominant eigenvalue.
For numerical reasons, a well conditioned, at best orthonormal basis of Kk (A, r0 ) is
preferred. Such a basis is generated by the Arnoldi algorithm (see Chapter 1), which
yields the Arnoldi decomposition
AVd = Vd Hd,d
from Theorem 1.9. The matrix Hd,d ∈ Cd×d is unreduced upper Hessenberg, and the
columns of the matrix Vd ∈ Cn×d form an orthonormal basis of the Krylov subspace
Kd (A, r0 ). After k ≤ d − 1 steps of this algorithm we have

AVk = Vk+1 Hk+1,k ,

where Hk+1,k contains the first k + 1 rows and k columns of Hd,d .


Using the Arnoldi decomposition we can implement the projection process (4.5)–(4.6)
or (4.7)–(4.8) with Sk = Kk (A, r0 ) and Ck = ASk (cf. (ii) in Theorem 4.9):

xk = x0 + Vk tk ,
rk = b − Axk = r0 − AVk tk = r0 − Vk+1 Hk+1,k tk
= Vk+1 (kr0 k2 e1 − Hk+1,k tk ).

The orthogonality condition rk ⊥ AKk (A, r0 ) gives

0 = (AVk )H rk = (Vk+1 Hk+1,k )H rk = Hk+1,k


H H
Vk+1 Vk+1 (kr0 k2 e1 − Hk+1,k tk )
H H
⇔ (Hk+1,k Hk+1,k )tk = Hk+1,k (kr0 k2 e1 )
+
⇔ tk = Hk+1,k (kr0 k2 e1 ),
+
where Hk+1,k H
= (Hk+1,k Hk+1,k )−1 Hk+1,k
H
is the Moore-Penrose pseudoinverse of Hk+1,k .

57
The equivalent optimality property is
krk k2 = min kb − Azk2 = min kVk+1 (kr0 k2 e1 − Hk+1,k t)k2
z∈x0 +Kk (A,r0 ) t∈Ck

= min kkr0 k2 e1 − Hk+1,k tk2 ,


t∈Ck

+
and again the unique minimizer is tk = Hk+1,k (kr0 k2 e1 ).
The QR decomposition of the unreduced upper Hessenberg matrix Hk+1,k ∈ Ck+1,k can
2
be computed using k Givens rotations,
  and hence the cost for  this computation is O(k ).
Rk Rk
This gives Gk · · · G1 Hk+1,k = , or Hk+1,k = Qk , where Rk ∈ Ck×k is upper
0 0
triangular and nonsingular, and Qk = GH H
1 · · · Gk ∈ C
(k+1)×(k+1)
is unitary.
Hence
krk k2 = min kkr0 k2 e1 − Hk+1,k tk2
t∈Ck
 
H Rk
= min kkr0 k2 Qk e1 − tk2 ,
t∈Ck 0

so that tk = kr0 k2 Rk−1 (QH H


k e1 )1:k , and krk k2 = kr0 k2 (Qk e1 )k+1 .
The last equation shows that the residual norm krk k2 is available without forming the
approximation xk = x0 + Vk tk and the actual residual rk = b − Axk . Due to rounding
errors the quantity kr0 k2 (QH k e1 )k+1 , which is called the updated residual norm, may be
(significantly) different from the computed residual norm kb − Axk k2 . In practice one
should therefore not rely only on the size of the updated residual norm as a stopping
criterion.
The GMRES method of Saad and Schultz[14] is an implementation of the above using
the MGS variant of the Arnoldi algorithm. Note that the Arnoldi algorithm does not
explicitly require the matrix A; only a function that implements the map v 7→ Av needs to
be known. This can be a significant advantage in practical applications particularly when
A is sparse and the matrix-vector product A · v can be computed at low cost. On the other
hand, the Arnoldi algorithm uses a full recurrence for computing the orthonormal basis of
Kk (A, r0 ). Hence work and storage requirements in the GMRES method grow (linearly)
with the iteration step k. Even for sparse matrices A the Arnoldi basis vectors v1 , v2 , . . .
usually will not be sparse. This may lead to storage problems in large applications and
when many GMRES iteration steps need to be performed before a desired convergence
tolerance is reached.
When A is Hermitian, the upper Hessenberg matrix Hd,d is Hermitian and tridiagonal
(cf. the Lanczos decomposition in Corollary 1.10). This means that the full recurrence in
the Arnoldi algorithm reduces to a 3−term recurrence (since hij = 0 for |i − j| > 1):
hk+1,k vk+1 = Avk − hkk vk − hk−1,k vk−1 .
The resulting orthogonalization algorithm in the MGS variant is called the Lanczos al-
gorithm. The MINRES method of Paige and Saunders [12]. is an implementation of

58
the projection method characterized in (ii) of Theorem 4.9 for Hermitian and nonsingular
matrices, which is based on the Lanczos algorithm. It is mathematically equivalent to GM-
RES, but due to the Lanczos algorithm it uses 3−term instead of full recurrences. Hence
work and storage requirements in the MINRES method remain constant throughout the
iteration.
For HPD matries we can again use the Lanczos algorithm, which computes the decom-
position
AVk = Vk+1 Tk+1,k , k = 1, . . . , d − 1, AVd = Vd Td,d
in order to implement the projection method (4.5)-(4.6) with Sk = Ck = Kk (A, r0 ). Using
the orthogonality property rk ⊥ Kk (A, r0 ) (cf. (i) in Theorem 4.9) we get

xk = x0 + Vk tk ,
0 = VkH rk = VkH (b − Axk ) = VkH (r0 − AVk tk )
= kr0 k2 e1 − VkH AVk tk = kr0 k2 e1 − Tk tk .

Here Tk ∈ Ck×k is HPD (since A is) and its Cholesky decomposition exists; cf. Theorem 1.3.
Since Tk is tridiagonal, we have Tk = Lk LH k , where Lk is lower bidiagonal. A clever use
of this structure leads to simple formulas for computing tk = kr0 k2 Tk−1 e1 . The most well-
known variant is the Conjugate Gradient (CG) method of Hestenes and Stiefel [4].
We will now analyze the convergence properties of the algorithms introduced above.
We start with the CG method, which is well defined for HPD matrices A ∈ Cn×n . The CG
method is mathematically characterized by xk ∈ x0 + Kk (A, r0 ) and

kx − xk kA = min kx − zkA (4.10)


z∈x0 +Kk (A,r0 )

Note that r0 = A(x−x0 ) and hence for every z ∈ x0 +Kk (A, r0 ) there exist γ0 , . . . , γk−1 ∈ C
with
k−1
! k−1
X X
x − z = x − x0 + γj Aj r0 = (I − γj Aj+1 )(x − x0 ) = p(A)(x − x0 ),
j=0 j=0

where p(z) = 1 − k−1 j+1


P
j=0 γj z is a polynomial of degree k with p(0) = 1. If Pk (0) denotes
the set of all such ploynomials, (4.10) can be written as

kx − xk kA = min kp(A)(x − x0 )kA (4.11)


p∈Pk (0)

The HPD matrix A is unitarily diagonalizable with real positive eigenvalues, A = XΛX H
with X H X = I, Λ = diag(λ1 , . . . , λn ), 0 < λ1 ≤ . . . ≤ λn . We can thus define the square
root
1/2
A1/2 := XΛ1/2 X H , Λ1/2 := diag(λ1 , . . . , λn1/2 ),
which satisfies (A1/2 )2 = A. For every v ∈ Cn we thus get kvk2A = v H Av = (v H A1/2 )(A1/2 v) =
kA1/2 vk22 .

59
Using this result in (4.11) yields

kx − xk kA = min kp(A)(x − x0 )kA


p∈Pk (0)

= min kA1/2 p(A)(x − x0 )k2


p∈Pk (0)

= min kp(A)A1/2 (x − x0 )k2


p∈Pk (0)

≤ min kp(A)k2 kA1/2 (x − x0 )k2


p∈Pk (0)

= min kp(Λ)k2 kx − x0 kA
p∈Pk (0)

= min max |p(λi )|kx − x0 kA .


p∈Pk (0) 1≤i≤n

Theorem 4.10. The relative A-norm of the error in step k of the CG method satisfies
kx − xk kA
≤ min max |p(λi )|
kx − x0 kA p∈Pk (0) 1≤i≤n
≤ min max |p(λ)|
p∈Pk (0) λ∈[λ1 ,λn ]
√ k
κ−1
≤2 √ ,
κ+1
λn
where κ = λ1
is the condition number of A.
Proof. The first inequality was shown above. The second follows easily since the discrete
set of eigenvalues in the min-max problem is replaced by the inclusion set [λ1 , λn ]. The
third inequality can be shown using suitably shifted and normalized Chebyshev polynomials
which solve the min-max problem on [λ1 , λn ].
Main observations:
(1) The bounds in the theorem are worst-case bounds for the given matrix A, since they
are independent of the choice of x0 and the right hand side b.

(2) The last bound shows that the (worst-case) convergence will be fast when the con-
dition number of A is close to 1.

(3) The fact in (2) motivates the preconditioning of the system Ax = b: Find an easily
invertible matrix L and consider the (equivalent) system

(L−1 AL−H )(LH x) = L−1 b.

The matrix L−1 AL−H is HPD and the convergence bound for CG will involve the
condition number of this matrix. The goal of preconditioning in this context is to
find a matrix L with
κ(L−1 AL−H )  κ(A).

60
The GMRES algorithm is characterized by xk ∈ x0 + Kk (A, r0 ) and

krk k2 = kb − Axk k2 = min kb − Azk2 .


z∈x0 +Kk (A,r0 )

For a diagonalizable matrix A = XΛX −1 we obtain

krk k2 = min kr0 − Azk2 = min kp(A)r0 k2


z∈Kk (A,r0 ) p∈Pk (0)
−1
= min kXp(Λ)X r0 k2
p∈Pk (0)

≤ κ(X)kr0 k2 min kp(Λ)k2


p∈Pk (0)

= κ(X)kr0 k2 min max |p(λi )|,


p∈Pk (0) 1≤i≤n

and thus
krk k2
≤ κ(X) min max |p(λi )|. (4.12)
kr0 k2 p∈Pk (0) 1≤i≤n

The right-hand-side of (4.12) is a worst-case bound on the relative residual norm in step
k. It shows that GMRES converges quickly when the eigenvalues are in a single “cluster”
that is far away from zero and the eigenvectors are well conditioned.

61
Bibliography

[1] W. E. Arnoldi, The principle of minimized iteration in the solution of the matrix
eigenvalue problem, Quart. Appl. Math., 9 (1951), pp. 17–29.

[2] C. Eckart and G. Young, A principal axis transformation for non-hermitian


matrices, Bull. Amer. Math. Soc., 45 (1939), pp. 118–121.

[3] G. H. Golub, M. W. Mahoney, P. Drineas, and L.-H. Lim, Bridging the


gap between numerical linear algebra, theoretical computer science, and applications,
SIAM News, 39 (2006).

[4] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear
systems, J. Research Nat. Bur. Standards, 49 (1952), pp. 409–436.

[5] N. J. Higham, Accuracy and Stability of Numerical Algorithms, Society for Industrial
and Applied Mathematics (SIAM), Philadelphia, PA, second ed., 2002.

[6] R. A. Horn and C. R. Johnson, Matrix analysis, Cambridge University Press,


Cambridge, 1990. Corrected reprint of the 1985 original.

[7] A. S. Householder, Some numerical methods for solving systems of linear equa-
tions, Amer. Math. Monthly, 57 (1950), p. 453.

[8] A. S. Householder, A class of methods for inverting matrices, J. Soc. Ind. Appl.
Math., 6 (1958), pp. 189–195.

[9] W. Kahan, Gauss–Seidel methods of solving large systems of linear equations, PhD
thesis, University of Toronto, 1958.

[10] J. Liesen and Z. Strakoš, On optimal short recurrences for generating orthogonal
Krylov subspace bases, SIAM Rev., 50 (2008), pp. 485–503.

[11] A. M. Ostrowski, On the linear iteration procedures for symmetric matrices, Rend.
Mat. e Appl., 14 (1954), pp. 140–163.

[12] C. C. Paige and M. A. Saunders, Solution of sparse indefinite systems of linear


equations, SIAM J. Numer. Anal., 12 (1975), pp. 617–629.

62
[13] J. L. Rigal and J. Gaches, On the compatibility of a given solution with the data
of a linear system, Journal of the ACM (JACM), 14 (1967), pp. 543–548.

[14] Y. Saad and M. H. Schultz, GMRES: a generalized minimal residual algorithm


for solving nonsymmetric linear systems, SIAM J. Sci. Statist. Comput., 7 (1986),
pp. 856–869.

[15] H. Shapiro, A survey of canonical forms and invariants for unitary similarity, Linear
Algebra Appl., 147 (1991), pp. 101–167.

[16] G. W. Stewart, Review of Matrix Computations by Gene H. Golub and Charles F.


Van Loan, Linear Algebra Appl., 95 (1987), pp. 211–215.

[17] , On the early history of the singular value decomposition, SIAM Rev., 35 (1993),
pp. 551–566.

[18] , The decompositional approach to matrix computation, Computing in Science &


Engineering, 2 (2000), pp. 50–59.

[19] L. N. Trefethen and D. Bau, III, Numerical Linear Algebra, Society for Industrial
and Applied Mathematics (SIAM), Philadelphia, PA, 1997.

63

You might also like