Liesen NLA1 PDF
Liesen NLA1 PDF
“In fact, our subject is more than just vectors and matrices, for virtually every-
thing we do carries over to functions and operators. Numerical linear algebra
is really functional analysis, but with the emphasis always on practical algo-
rithmic ideas rather than mathematical technicalities.”
The two quotes above reflect two main characteristics of numerical linear algebra: As a
mathematical field it is closely related to functional analysis, and one of its major driving
forces is the practical requirement to solve linear algebraic problems of rapidly increasing
sizes. The quotes are taken from two excellent books, one modern and one classical, on
numerical linear algebra:
• N. J. Higham Accuracy and Stability of Numerical Algorithms, 2nd ed., SIAM, 2002,
Many thanks to Carlos Echeverrı́a Serur and Luis Garcı́a Ramos for typing my hand-
written notes in the Winter Semester 2014/2015. Their work forms the basis of the current
version. Thanks also to Davide Fantin, Alexander Hopp, Mathias Klare, Thorsten Lucke,
Ekkehard Schnoor, Olivier Sète, and Jan Zur for careful reading of previous versions and
providing corrections. Please send further corrections to me at liesen@[Link].
2
Contents
2 Perturbation Theory 20
2.1 Norms, errors, and conditioning . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Perturbation results for matrices and linear algebraic systems . . . . . . . 25
3
Chapter 0
Numerical Linear Algebra is concerned with the numerical solution of linear algebraic
problems involving matrices. Major examples are
In this lecture we will (mostly) consider complex matrices, or matrices over C, i.e.,
matrices of the form
a11 · · · a1m
A = [aij ] = ... ... .. ∈ Cn×m
.
an1 · · · anm
with aij ∈ C, i = 1, . . . , n, j = 1, . . . , m. If m = 1 we will usually write Cn (instead of
Cn×1 ). If n = m, we have a square matrix A.
The matrix In := [δij ] ∈ Cn×n is called the identity matrix. Here
(
1, i = j,
δij :=
0, i 6= j,
is the Kronecker delta. When the size is clear or irrelevant we write I. The matrix
0n×m := [0] ∈ Cn×m is called the zero matrix. Usually we just write 0.
4
If A = [aij ] ∈ Cn×n satisfies
diagonal,
i 6= j
aij = 0 for i>j then A is called upper triangular,
i<j
lower triangular.
eTj (A − B)ej = eTj Dej = djj ≥ 0, and eTj (B − A)ej = −eTj Dej = −djj ≥ 0,
5
so than djj = 0 for j = 1, . . . , n. Moreover, since D is Hermitian, we have dji = dij for all
i, j = 1, . . . , n. Thus,
giving Re(dij ) = 0 for all i, j = 1, . . . , n. A similar argument shows that also Im(dij ) = 0
for all i, j = 1, . . . , n, so that in fact D = A − B = 0, or A = B.
It remains to show (3). If A ≥ B and B ≥ C, then for each x ∈ Cn we have
and hence A ≥ C.
In the next lemma we collect some useful properties of HPSD matrices.
(2) det(A) ≥ 0.
(2) This follows from (1) since det(A) is equal to the product of the eigenvalues of A.
(3) Let y ∈ Cn \{0} and write z := Xy, then z 6= 0 since rank(X) = k, and y H (X H AX)y =
z H Az ≥ 0 since A is HPSD.
All assertions in Lemma 0.2 also hold for HPD matrices A ∈ Cn×n , when “≥” is replaced
by “>” in (1) and (2).
If for A ∈ Cn×n there exists a matrix B ∈ Cn×n with AB = BA = In , then A is called
nonsingular. Otherwise A is called singular. It is well known that A is nonsingular if and
only if det(A) 6= 0, which holds if and only if rank(A) = n.
6
(1) If A is nonsingular, then there exists only one matrix B ∈ Cn×n with AB = BA = In .
We call this matrix the inverse of A and denote it by A−1 .
Hence A ∈ Cn×n is unitary if and only if the columns a1 , . . . , an form an orthonormal basis
of Cn with respect to the inner product (·, ·). The equation AAH = In means the same
holds for the (transposed) rows of A.
If A ∈ Cn×m with n > m satisfies AH A = Im , then A has pairwise orthonormal columns
(with respect to (·, ·)), but A is not a unitary matrix. In this case P := AAH is a projection
(i.e., P 2 = P ) with rank(P ) = m.
If A ∈ Cn×n satisfies AH A = AAH then A is called normal. Note that Hermitian and
unitary matrices are normal.
7
Chapter 1
In the January/February 2000 issue of Computing in Science & Engineering, a joint pub-
lication of the American Institute of Physics and the IEEE Computer Society, a list of
the “Top Ten Algorithms of the Century” was published. Among the Top Ten Algorithms
(ordered by year, so there is no “No. 1 Algorithm”) is
A matrix decomposition is nothing but a factorization of the original matrix into “sim-
pler” factors. Householder started with a systematical analysis of methods for inverting
matrices (or solving linear algebraic systems) from the viewpoint of matrix decomposition
in 1950 [7]. In 1957 he wrote [8]:
“Most, if not all closed [i.e. direct] methods can be classified as methods of
factorizations and methods of modification [...] these methods of factorization
aim to express A as a product of two factors, each of which is readily inverted,
or, equivalently, to find matrices P and Q such that P A = Q and Q is easily
inverted.”
8
• A matrix decomposition, which is generally expensive to compute, can be reused to
solve new problems involving the original matrix.
• The decompositional approach often shows that apparently different algorithms are
actually computing the same object.
• The decompositional approach facilitates rounding error analysis.
• Many matrix decompositions can be updated, sometimes with great savings in com-
putation.
• By focusing on a few decompositions instead of a host of specific problems, software
developers have been able to produce highly effective matrix packages.
We will now discuss several important matrix decompositions from a mathematical point
of view, i.e., we will be mostly interested in their existence and uniqueness. In later chap-
ters we will derive algorithms for computing the decompositions, analyze their numerical
stability, and apply them in order to solve problems of numerical linear algebra.
Theorem 1.1 (LU decomposition). The following assertions are equivalent for every ma-
trix A ∈ Cn×n :
(1) There exist a unit lower triangular matrix L ∈ Cn×n , a nonsingular diagonal matrix
D ∈ Cn×n , and a unit upper triangular matrix U ∈ Cn×n , such that
A = LDU.
where A11 = A(1 : k, 1 : k) ∈ Ck×k . We see that A11 = L11 D11 U11 , and since L11 and U11
are unit lower and upper triangular, respectively, we obtain
det(A11 ) = det(L11 ) det(D11 ) det(U11 ) = det(D11 ) 6= 0.
(2) =⇒ (1): Induction on n. For n = 1 we can take L = [1], D = [a11 ], U = [1]. Now
suppose that the statement is true for all matrices up to order n − 1 for some n ≥ 2 and
let A ∈ Cn×n . With the nonsingular matrix A11 := A(1 : n − 1, 1 : n − 1) we can write
In−1 A−1
A11 A12 In−1 0 A11 0 11 A12 A11 0
A= = =: Ln Un ,
A21 A22 A21 A−1
11 1 0 s 0 1 0 s
9
where s := A22 − A21 A−1
11 A12 ∈ C is the Schur complement of A11 in A. From
we get s 6= 0. By the induction hypothesis, the matrix A11 has a factorization A11 =
Ln−1 Dn−1 Un−1 , where Ln−1 , Dn−1 and Un−1 have the required properties. Then
Ln−1 0 Dn−1 0 Un−1 0
A = Ln Un
0 1 0 s 0 1
| {z }| {z }| {z }
=:L =:D =:U
has an (obvious) LU decomposition. Allowing row exchanges is also important for the nu-
merical stability of numerical methods for computing an LU decomposition; see Chapter 3.
Proof. Let A = LDU be the uniquely determined factorization from Theorem 1.1. Since
A = AH we have A = LDU = (LDU )H = U H DH LH . Here U H and LH are unit lower
and upper triangular, respectively, and DH is diagonal and nonsingular. The uniqueness
of the factorization now implies that U H = L and D = DH , and hence in particular
D ∈ Rn×n .
10
Corollary 1.3 (Cholesky decomposition). If A ∈ Cn×n is HPD, then there exists a uniquely
determined lower triangular matrix L ∈ Cn×n with positive diagonal elements, such that
A = LLH .
1st Proof. If A is HPD, then by Corollary 1.2 there exists a uniquely determined factor-
ization A = LD e L eH , where L
e ∈ Cn×n is unit lower triangular and D = [dij ] ∈ Rn×n is
nonsingular. By (3) in Lemma 0.2 the matrix D = L e−1 AL
e−H is HPD and hence dii > 0
e 1/2 , where D1/2 := diag(d1/2
for i = 1, . . . , n. We set L := LD
1/2
11 , . . . , dnn ) ∈ R
n×n
, then
H
A = LL .
1/2
2nd Proof. Induction on n. If n = 1, we have A = [a11 ] with a11 > 0, and set L := [a11 ].
Suppose the statement is true for matrices up to order n − 1 for some n ≥ 2. Let A ∈
Cn×n be HPD, and let An−1 := A(1 : n − 1, 1 : n − 1) ∈ C(n−1)×(n−1) , which is HPD;
cf. (4) in Lemma 0.2. By the induction hypothesis, there exists a uniquely determined
lower triangular matrix Ln−1 ∈ C(n−1)×(n−1) with positive diagonal elements, such that
An−1 = Ln−1 LHn−1 . We thus can write
In−1 A−1
An−1 b In−1 0 A11 0 n−1 b
A= = ,
bH ann bH A−1
n−1 1 0 s 0 1
kck22 = bH L−H −1 H −1
n−1 Ln−1 b = b An−1 b < ann .
A = U RU H ,
11
Proof. Induction on n. If n = 1, set U = I1 , R = A. Suppose the statement is true for
matrices up to order n−1 for some n ≥ 2, and let A ∈ Cn×n . Suppose that λ is an eigenvalue
of A with corresponding unit norm eigenvector x, i.e., Ax = λx with kxk22 = xH x = 1. Let
Y ∈ Cn×(n−1) be any matrix such that X := [x, Y ] ∈ Cn×n is unitary. Then
H λ xH AY
X AX = ,
0 Y H AY
12
Inductively it follows that R must be diagonal.
On the other hand, if A = U DU H with U unitary and D diagonal, then
In = AH A = (U DH U H )(U DU H ) = U DH DU H
AH A = (U DH U H )(U DU H ) = In ,
The matrix uj uH
jis Hermitian and satisfies (uj uH 2 H
j ) = uj uj . Thus, a normal matrix can be
decomposed into the sum of n rank-one matrices, where each such matrix is an orthogonal
projection onto the subspace spanned by an eigenvector of A.
A general matrix A can also be decomposed into the sum of rank-one matrices, but in
general these matrices are not orthogonal projections onto eigenspaces of A.
Theorem 1.6 (Singular value decomposition, SVD). If A ∈ Cn×m has rank r, then there
exist unitary matrices U ∈ Cn×n , V ∈ Cm×m and a diagonal matrix Σ+ = diag(σ1 , . . . , σr )
with σ1 ≥ · · · ≥ σr > 0, such that
r
Σ+ 0 H
X
A=U V , or A= σj uj vjH . (1.1)
0 0
j=1
13
Proof. The matrix AH A ∈ Cm×m is HPSD, since xH AH Ax = kAxk22 ≥ 0 for all x ∈ Cm .
Hence AH A can be unitarily diagonalized with nonnegative real eigenvalues (cf. (1) in
Lemma 0.2 and (2) in Corollary 1.5). Denote the r = rank(A) = rank(AH A) positive
eigenvalues by σ12 ≥ · · · ≥ σr2 > 0 and Σ2+ := diag(σ12 , . . . , σr2 ), and let the unitary diago-
nalization be
2
H H Σ+ 0
V A AV = ,
0 0
for a unitary matrix V ∈ Cm×m . Then with V = [V1 , V2 ], where V1 ∈ Cm×r , we see that
V2H AH AV2 = 0, giving AV2 = 0. Define U1 := AV1 Σ−1
+ , then
U1H U1 = Σ−1 H H −1
+ V1 A AV1 Σ+ = Ir .
as required.
The numbers σ1 ≥ · · · ≥ σr > 0 in (1.1) are called the (nonzero) singular values of A,
and the columns of the unitary matrices U and V are called the left and right singular
vectors of A.
As we have seen in the proof, the singular values are the (positive) square roots of
the nonzero eigenvalues of AH A. Thus, the singular values of A are uniquely determined.
Similar to eigenvectors, the singular vectors are not uniquely determined. In particular,
for any φ1 , . . . , φr ∈ R we can write
r
X r
X
A= σj uj vjH = σj (eiφj uj )(eiφj vj )H ,
j=1 j=1
14
The decomposition in this result is the matrix analogue of the polar decomposition of
a nonzero complex number z = eiφ ρ = ρeiφ , where ρ = |z| > 0.
Theorem 1.8 (QR decomposition). Let A ∈ Cn×m with n ≥ m. Then there exist a
unitary matrix Q ∈ Cn×n and an upper triangular matrix R = [rij ] ∈ Cm×m with rii ≥ 0
for i = 1, . . . , m, such that
H R R
Q A= , or A = Q .
0 0
If rank(A) = m, then rii > 0, i = 1, . . . , m. Moreover, denoting Q = [Q1 , Q2 ] with
Q1 ∈ Cn×m , the matrices Q1 and R with rii > 0, i = 1, . . . , m, are uniquely determined.
Proof. Induction on m. If m = 1 we have A = [a] ∈ Cn . If a = 0, set Q := I and R := [0].
If a 6= 0, let Q ∈ Cn×n be a unitary matrix with first column a/kak2 . Then
H R
Q a= ,
0
where R = [kak2 ] has the required form.
Now let A have m > 1 columns, and write A = [a, A1 ], where A1 ∈ Cn×(m−1) . If a = 0,
set Q1 := In . If a 6= 0, let Q1 ∈ Cn×n be a unitary matrix with first column a/kak2 . Then
H kak2 bT
Q1 A =
0 C
for some bT ∈ C1×(m−1) and C ∈ C(n−1)×(m−1) . By the induction hypothesis, there exists a
R2
decomposition QH 2 C = , where Q2 ∈ C(n−1)×(n−1) is unitary and R2 ∈ C(m−1)×(m−1)
0
is upper triangular with nonnegative diagonal elements. Hence with the unitary matrix
1 0
Q := Q1 ,
0 Q2
we have
kak2 bT
1 0 R
QH A = H
Q1 A = 0 R2 =:
0 QH
2 0
0 0
in the required form.
If rank(A) = m, then rank(R) = m, which implies rii > 0 for i = 1, . . . , m. If
A = Q1 R = Q
e1 R,
e
e1 ∈ Cn×m have orthonormal columns and R1 , R
where Q1 , Q e1 ∈ Cm×m are nonsingular
upper triangular matrices with positive diagonal elements, then RH R = R
eH R,
e and hence
15
Suppose that A = [a1 , . . . , am ] ∈ Cn×m has rank m. Then the (classical) Gram-Schmidt
algorithm yields Q ∈ Cn×m with orthonormal columns and R = [rij ] ∈ Cm×m upper
triangular with rii > 0, i = 1, . . . , m, such that A = QR:
Set q1 = a1 /r11 , where r11 = ka1 k2
for j = 1, . . . , m −P1 do
qbj+1 = aj+1 − ji=1 ri,j+1 qi , where ri,j+1 = (aj+1 , qi )
qj+1 = qbj+1 /rj+1,j+1 , where rj+1,j+1 = kbqj+1 k2
end for
Let us briefly show that the algorithm indeed generates pairwise orthonormal vectors.
Suppose that for some j ∈ {1, . . . , m − 1} the vectors q1 , . . . , qj are pairwise orthonormal,
i.e., (qi , q` ) = δi` (Kronecker-δ). Then for each ` = 1, . . . , j we have
j j
X X
qj+1 , q` ) = aj+1 −
(b (aj+1 , qi )qi , q` = (aj+1 , q` ) − (aj+1 , qi )(qi , q` )
i=1 i=1
= (aj+1 , q` ) − (aj+1 , q` ) = 0.
v, Av, A2 v, . . .
there exists a smallest integer d ≥ 1 such that v, Av, . . . , Ad−1 v are linearly independent,
and v, Av, . . . , Ad v are linearly dependent. This integer d = d(A, v) is called the grade of
the vector v with respect to the matrix A.
It is easy to see that d = 1 holds if and only if v is an eigenvector of A. If
is the minimal polynomial of A, then MA (A) = 0 and hence MA (A)v = 0 for any vector
v ∈ Cn . Thus,
m−1
X
Am v = − αj Aj v,
j=0
16
which shows that v, Av, . . . , Am v are linearly dependent. Consequently, the grade of a
vector can be at most equal to the degree of the minimal polynomial of A (which is less
than or equal to n).
If A = Jn (0) is the Jordan block of size n × n and with eigenvalue 0, and ej is the jth
standard basis vector of Cn , then Aej = ej−1 for every j = 1, . . . , n, where we set e0 = 0.
Hence for each j = 1, . . . , n the vectors
ej , Aej , . . . , Aj−1 ej
are linearly independent, and
ej , Aej , . . . , Aj−1 ej , Aj ej
are linearly dependent (since Aj ej = 0), which shows that d(A, ej ) = j.
Using the same idea and the Jordan canonical form of a matrix A ∈ Cn,n , one can
show that for each j = 1, . . . , m (= degree of A’s minimal polynomial) there exists a vector
vj ∈ Cn with d(A, vj ) = j.
Theorem 1.9 (Arnoldi decomposition). Let A ∈ Cn×n and v ∈ Cn \ {0} be of grade d with
respect to A. Then there exists V ∈ Cn×d with orthonormal columns and an unreduced
upper Hessenberg matrix H = [hij ] ∈ Cd×d , i.e., hij = 0 for i > j + 1 and hi+1,i 6= 0,
i = 1, . . . , d − 1, such that
AV = V H.
Proof. Let W = [v, . . . , Ad−1 v] ∈ Cn×d P. d−1
By assumption, rank(W ) = d. We know that
A v ∈ span{v, . . . , A v}, i.e., A v = j=0 γj Aj v for some γ0 , . . . , γd−1 ∈ C.
d d−1 d
R
Let W = [V, Ve ] with V ∈ Cn×d and R ∈ Cd×d be the QR decomposition. Then
0
W = V R, and
0 γ0
. ..
1 .. .
d
AW = [Av, . . . , A v] = W . ..
.. 0 .
1 γd−1
leads to
0 γ0
.. ..
1
. .
AV R = V R ,
. .. 0 .
..
1 γd−1
and hence AV = V H, where V H V = Id , and
0 γ0
. ..
1 .. .
−1
H = R . .. R
.. 0 .
1 γd−1
17
is unreduced upper Hessenberg.
As we see in the proof of Theorem 1.9, the columns of the matrix V ∈ Cn×d form an
orthonormal basis of the dth Krylov subspace of A and v, which is defined by
Since
AKd (A, v) = span{Av, A2 v . . . , Ad v},
and Ad v is by construction a linear combination of v, Av, . . . , Ad−1 v, we see that
i.e., the d-dimensional subspace Kd (A, v) is invariant under A. Moreover, each eigenvalue
of the Hessenberg matrix H = V H AV ∈ Cd×d is an eigenvalue of A.
Since Av, . . . , Ad−1 v are linear independent, we have
A strict inclusion occurs in (1.3) if and only if dim(AKd (A, v)) = d − 1. As an example,
consider again A = Jn (0) and v = e1 . Then d = d(A, v) = 1 and
If A is nonsingular, we always have dim(AKd (A, v)) = d and hence equality in (1.3). In
order to see this, consider the equation
0 = γ1 Av + γ2 A2 v + · · · + γd Ad v.
Since A is nonsingular, we can multiply from the left with A−1 , which gives
0 = γ1 v + γ2 Av + · · · + γd Ad−1 v.
The linear independence of v, Av, . . . , Ad−1 v implies that γ1 , . . . , γd must be 0, and hence
Av, A2 v, . . . , Ad v are linearly independent.
An orthogonal basis of the Krylov subspace Kd (A, v) can be computed by appliying the
Gram-Schmidt algorithm to the matrix [v, Av, . . . , Ad−1 v], which by assumption has full
rank d. The algorithm then reads as follows:
Set v1 = v/r11 , where r11 = kvk2
for j = 1, . . . , d −P
1 do
vbj+1 = A v − ji=1 ri,j+1 vi , where ri,j+1 = (Aj v, vi )
j
18
Set v1 = v/kvk2
for j = 1, . . . , d −P1 do
vej+1 = Avj − ji=1 hij vi , where hij = (Avj , vi )
vj+1 = vej+1 /hj+1,j , where hj+1,j = ke
vj+1 k2
end for
The vectors v1 , . . . , vd generated by this algorithm still form an orthonormal basis of
Kd (A, v) (in exact arithmetic). In each step j = 1, . . . , d − 1 we have a relation of the form
j
X
Avj = hj+1,j vj+1 + hij vi .
i=1
Thus, if we rewrite the algorithm in matrix form, we obtain the Arnoldi decomposition
from Theorem 1.9, i.e.,
A[v1 , . . . , vd ] = [v1 , . . . , vd ]H,
where H = [hij ] is unreduced upper Hessenberg. The last column of H is determined
by expressing Avd as a linear combination of v1 , . . . , vd . Note that this column is not
explicitly computed in the above Arnoldi algorithm, since this algorithm terminates at the
step j = d − 1.
The Arnoldi decomposition has the following important special case.
Corollary 1.10 (Lanczos decomposition). If A ∈ Cn×n is Hermitian and v ∈ Cn \ {0} is
of grade d with respect to A, then there exists V ∈ Cn×d with orthonormal columns and
an unreduced Hermitian tridiagonal matrix H = [hij ] ∈ Cd×d , i.e., H = H H , hij = 0 for
|i − j| > 1, hi+1,i 6= 0 6= hi,i+1 , i = 1, . . . , d − 1, such that AV = V H.
Proof. Let AV = V H be the Arnoldi decomposition. Then H = V H AV = V H AH V = H H
shows that H is Hermitian. Since H is unreduced upper Hessenberg, this matrix is in fact
tridiagonal with hi+1,i 6= 0 6= hi,i+1 for i = 1, . . . , d − 1.
Note that if in the decomposition AV = V H the matrix H is tridiagonal, then a
comparison of the jth columns gives
Avj = hj+1,j vj+1 + hjj vj + hj−1,j vj−1 , j = 1, . . . , d,
where we set v0 = vd+1 = 0. Thus,
hj+1,j vj+1 = Avj − hjj vj − hj−1,j vj−1 ,
which means that the vector vj+1 satisfies a 3-term recurrence. In other words, if A is
Hermitian, an orthogonal basis of Kd (A, v) can be generated by a 3-term recurrence, while
for a general matrix A we require the (full) Arnoldi recurrence
j
X
hj+1,j vj+1 = Avj − hij vi .
i=1
The existence of short (3-term) recurrences for generating orthogonal bases of Krylov
subspaces has been intensively studied since the early 1980s; see [10] for a survey of some
results in this area.
19
Chapter 2
Perturbation Theory
In this chapter we will give an introduction into the theory of errors in numerical analysis
with a focus on numerical linear algebra problems, and into a field that is called “matrix
perturbation theory”.
20
(1) We have kAk∗ ≥ 0 for all A ∈ Cn×n , and kAk∗ = 0 holds if and only if kAzk = 0 for
all z ∈ Cn , which holds if and only if A = 0.
The norms k · k and k · k∗ are consistent, since for all x ∈ Cn \ {0} we have
kAxk kAzk
≤ max = kAk∗ ,
kxk z6=0 kzk
which is a norm on Cn :
(1) We have kxk = kxy H k∗ ≥ 0 for all x ∈ Cn . Moreover, kxy H k∗ = 0 holds if and only if
xy H = 0. Multiplying from the right by y 6= 0 gives x(y H y) = 0, and since y H y > 0,
we must have x = 0.
21
– the absolute forward error kb
y − ykY , or
kb
y −ykY
– the relative forward error kykY
.
We can also ask which input for the function f yields the output yb, i.e., for which
perturbation ∆x of x we have yb = f (x + ∆x). The quantity
and
k∆xkX
is called the relative backward error.
kxkX
This can be illustrated as follows:
f
x y = f (x)
x + ∆x yb = f (x + ∆x)
f
A function (or “problem”) is called well-conditioned at the input x, when small per-
turbations of x lead only to small changes in the resulting function values, i.e., a small
k∆xk implies a small kb
y − yk. Here the word “small” needs to be interpreted in the given
context. A function that is not well-conditioned at x is called ill-conditioned at x.
How can we determine whether a function is well-conditioned? For a motivation we
consider a twice continuously differentiable function f : R → R. For a given x ∈ R let
y = f (x) and suppose that yb = f (x + ∆x). Then by Taylor’s theorem
The quantity |xf 0 (x)|/|f (x)| measures, for small |∆x|, the relative change in the output for
a given relative change in the input. This quantity is called the relative condition number
of f at x. Using y = f (x) and cancelling |x| on the right hand side, we can rewrite (2.1)
for the absolute instead of the relative quantities, i.e.,
y − y| = |f 0 (x)||∆x| + O(|∆x|2 ).
|b
22
The absolute value of the derivative therefore can be considered the (absolute) condition
number of f at x, and the equation (2.1) can be read as
Note that the first factor in the definition of κf (x) is the (relative) forward error, and
the second factor is the reciprocal of the (relative) backward error. The value of κf (x) of
course depends on the choice of the norms in the spaces X and Y. If f is differentiable
at x, then
kJf (x)kkxkX
κf (x) = ,
kf (x)kY
where Jf := [∂fi /∂xj ] is the Jacobian of f , and the norm k · k is induced by the norms on
X and Y.
Example 2.4.
(1) For the function f : R → R with f (x) = αx for some nonzero α ∈ R and the norm
k · k = | · | (absolute value) on R we get
(2) For the function f : R+ → R with f (x) = log(x) the norm k · k = | · | (absolute value)
on R we get
x = 1 + 10−8 , ∆x = 10−10 .
23
An evaluation in MATLAB gives
log(x) = 9.999999889225291 × 10−9 ,
log(x + ∆x) = 1.009999989649433 × 10−8 ,
and hence
|∆x|
= 9.999999900000002 × 10−11 ,
|x|
κlog (x) = 1.000000011077471 × 108 ,
| log(x + ∆x) − log(x)|
= 0.010000000837678 × 100 ,
| log(x)|
Thus, due to the large condition number of f at x, a small relative perturbation of
the input (or a small relative backward error) leads to a large change in the output
(or a large relative forward error).
Example 2.5. For a given matrix A ∈ Cn×n we consider the function f : Cn → Cn with
f (x) = Ax. By k · k we denote a given norm on Cn as well as the induced matrix norm on
Cn×n . Then the relative condition number of f at x ∈ Cn \ {0} is
kA(x + ∆x) − Axk kxk
κf (x) = lim sup
δ→0 k∆xk≤δ kAxk k∆xk
kA∆xk kxk kxk
= lim sup = kAk .
δ→0 k∆xk≤δ kAxk k∆xk kAxk
If A is nonsingular, we can use that
kA−1 zk kyk kxk
kA−1 k = max = max ≥ for each x ∈ Cn \ {0},
z6=0 kzk y6 = 0 kAyk kAxk
which gives the bound
κf (x) ≤ kAk kA−1 k for each x ∈ Cn \ {0}.
e ∈ Cn \ {0} such that
Note that there exists some vector x
kyk ke
xk
max = .
y6=0 kAyk kAexk
x) = kAkkA−1 k.
For this vector we obtain the equality κf (e
Analogously, the relative condition number of the function g : Cn → Cn with g(x) =
A−1 x at x ∈ Cn \ {0} satisfies
kxk
κg (x) = kA−1 k ≤ kA−1 k kAk,
kA−1 xk
where we have used that kxk/kA−1 xk ≤ kAk. Again there exists some vector x
e ∈ Cn \ {0}
−1
x) = kAkkA k.
with κg (e
24
This example motivates the following definition.
Definition 2.6. If A ∈ Cn×n is nonsingular, and k · k is a norm on Cn×n , then
b−1 − A−1 k
kA kA−1 Ek κ(A) kEk
kAk
≤ ≤ . (2.4)
−1
kA k −1
1 − kA Ek 1 − κ(A) kEk
kAk
and hence
b−1 k
kA 1 1
≤ ≤ .
−1
kA k −1
1 − kA Ek 1 − κ(A) kEk
kAk
25
If we interpret taking the inverse as a function f : Cn×n → Cn×n , f (A) = A−1 , then
b−1 = f (A + E), and kEk is the backward error. Hence the theorem is another instance
A
of our rule of thumb (2.2).
which is nonsingular for each (real or complex) ε 6= 0, with its inverse given by
−1 1/ε −1/ε
Aε = .
−1/ε 1 + 1/ε
then yields
kEk2
= 4.999998749999999 × 10−8 ,
kAk2
κ2 (A) = kAk2 kA−1 k2 = 4.000002000000751 × 106 ,
kA−1 − Ab−1 k2
= 0.099999972500000,
b−1 k2
kA
which is an illustration of the bound (2.3). (Note that since the inverses are known explic-
itly, one only needs to compute norms of matrices in this example.)
We will next show how the condition number of a nonsingular matrix is related to the
“distance to singularity” of the matrix.
Lemma 2.9. If k · k is a consistent norm on Cn×n , then for each A ∈ Cn×n we have
kAk ≥ ρ(A),
Proof. We have shown above that there exists a norm k · k∗ on Cn so that k · k and k · k∗
are consistent. If Ax = λx, x 6= 0, then
26
Theorem 2.10. If A ∈ Cn×n is nonsingular and E ∈ Cn×n is such that A + E ∈ Cn×n is
singular, then for any consistent norm k · k on Cn×n we have
kEk 1
≥ .
kAk κ(A)
Proof. If A is nonsingular we can write A + E = A(I + A−1 E). Since A + E is singular, the
matrix I +A−1 E must be singular, and thus −1 must be an eigenvalue of A−1 E. Lemma 2.9
then gives
kEk
1 ≤ ρ(A−1 E) ≤ kA−1 Ek ≤ kA−1 kkAk ,
kAk
which yields the desired inequality.
This theorem shows that a nonsingular matrix must be perturbed by a matrix with
(relative) norm at least 1/κ(A) in order to make it singular. In short, “well-conditioned
matrices are far from singular”.
Example 2.11. Let A ∈ Cn×n be nonsingular, and let A = U ΣV H be an SVD in the
notation of Theorem 1.6. Then A−1 = V Σ−1 U H , and since the matrix 2-norm, which is
induced by the Euclidean norm on Cn , is unitarily invariant, we easily see that
and hence κ(A) = σ1 /σn . For the matrix E := −σn un vnH we have kEk2 = σn , and A + E
has rank n − 1 and thus is singular. Moreover,
kEk2 σn 1
= = ,
kAk2 σ1 κ(A)
and therefore E is a perturbation with minimal (relative) 2-norm such that the perturbed
matrix is singular.
In the next result the vector x
b should be interpreted as an approximate solution of the
given linear algebraic system.
Theorem 2.12 (Residual-based forward error bound). Let A ∈ Cn×n be nonsingular,
x ∈ Cn \ {0} and b = Ax. Then for consistent norms and every x
b ∈ Cn we have
kb
x − xk krk
≤ κ(A) , (2.5)
kxk kbk
where r := b − Ab
x is the residual and krk/kbk is the relative residual norm.
Proof. Using x = A−1 b and the definition of the residual we get
krk
x − xk = kA−1 (Ab
kb x − b)k ≤ kA−1 kkrk = κ(A) .
kAk
27
Moreover, kbk ≤ kAkkxk gives
1 kAk
≤ ,
kxk kbk
which implies the desired inequality.
Any essential observation to be made in (2.5) is that in case of an ill-conditioned matrix
a small residual norm does not guarantee that the forward error is small as well.
Example 2.13. For a numerical example we consider
1+ε 1 1 2+ε 0
Aε = , x= , b = Ax = , x
b= ;
1 1 1 2 2
For ε = 10−6 we have κ2 (A) ≈ 4 × 106 . If we then try to solve Aε x = b using y=inv(A)*b
in MATLAB (R2015b), we get the computed approximation
1.000000000232831 kx − yk2
y= with the relative forward error ≈ 2.33 × 10−10 .
0.999999999767169 kxk2
Since the machine precition (or unit roundoff ) is u ≈ 1.11×10−16 (see Chapter 3), we have
lost six significant digits in the computed solution. Similarly, if we compute [L,U]=lu(A)
and then y=U\(L\b), we obtain the computed approximation
1.000000000111022 kx − yk2
y= with the relative forward error ≈ 1.11 × 10−10 .
0.999999999888978 kxk2
Finally, with [Q,R]=qr(A) and y=R\(Q’*b) we obtain
1.000000000314019 kx − yk2
y= with the relative forward error ≈ 3.14 × 10−10 .
0.999999999685981 kxk2
This example suggests the following rule of thumb:
If κ(A) ≈ 10k , then expect a loss of k significant digits in a computed (i.e.,
approximate) solution of Ax = b.
The loss of significant digits (or large relative forward error) is a consequence of the ill-
conditioning of the problem, and hence it is independent of the numerical algorithm that
is used for computing the approximation1 . The best approach to deal with this situation
is to avoid ill-conditioning of the problem in the first place.
1
Of course, with a poor algorithm we will likely lose even more significant digits, while if A−1 is known
explicitly, we may be able to compute x = A−1 b very accurately despite a large κ(A).
28
In the notation of Theorem 2.12, suppose that x
b is the solution of the linear algebraic
system with a perturbed right hand side, i.e.,
Ab
x = b + ∆b.
kb
x − xk k∆bk
≤ κ(A) .
kxk kbk
Since only a perturbation of b is considered, k∆bk/kbk is the relative backward error, and
we recognize our rule of thumb (2.2).
We now ask about a perturbation of the matrix A such that an approximate solution
x
b solves the perturbed linear algebraic system.
xH /kb
Proof. For E = rb xk22 we get
bH x
x krk2
and kEk2 =
b
(A + E)b
x = Ab
x+r = b, .
xk22
kb kb
xk 2
x = b, then r = b − Ab
If ∆A is arbitrary with (A + ∆A)b x = b − (b − ∆Ab
x) = ∆Ab
x. Hence
krk2 ≤ k∆Ak2 kb xk2 , giving
k∆Ak2 krk2
≥ ,
kAk2 kAk2 kbxk 2
where the lower bound is attained for ∆A = E, since kEk2 = krk2 /kb
xk 2 .
Thus, E = rbxH /kbxk22 is a matrix with minimal (relative) backward error in the 2-norm
so that x
b solves the perturbed system (A + ∆A)y = b.
We next study the backward error when allowing perturbations in both A and b. Let
x 6= 0, b = Ax, r = b − Ab x, and consider any α ∈ C and y ∈ Cn with y H x b = 1.
H
Define ∆Aα := αry and ∆bα := (α − 1)r, then a simple computation shows that (A +
∆Aα )bx = b+∆bα , i.e., x
b exactly solves a whole family of perturbed linear algebraic systems
parametrized by α. For α = 1 we have ∆Aα := ry H and ∆bα = 0, and choosing y = x xk22
b/kb
(hence y H x
b = 1) we get the matrix E from the previous theorem.
In order to characterize minimum norm backward perturbations we need some addi-
tional theory.
29
Lemma 2.15. Let k · k be a norm on Cn and define
Proof. Clearly kxkD ≥ 0 with equality if and only x = 0. For λ ∈ C we have kλxkD =
maxkzk=1 k(λx)H zk = |λ|kxkD . If x1 , x2 ∈ Cn , then
|xH x|
kxkD H
2 = max |x z| ≥ = kxk2 ,
kzk2 =1 kxk2
kxkD
2 = max |(z, x)| ≤ max (kzk2 kxk2 ) = kxk2 ,
kzk2 =1 kzk2 =1
and thus kxk2 = kxkD 2 , where in the upper bound we have used the Cauchy-Schwarz
inequality. In other words, the dual norm of the Euclidean norm on Cn is the Euclidean
norm itself. More generally, for any 1 ≤ p ≤ ∞, the p-norm on Cn is defined by
n
n
X 1/p
k · kp : C → R with kxkp := |xi |p for all x = [x1 , . . . , xn ]T ∈ Cn ,
i=1
1 −1 1 1
and one can show that k · kD
p = k · kq , where q = (1 − p ) , so that p
+ q
= 1.
Basic properties of the dual norm are shown in the next result.
Proof. Both inequalities are obvious for y = 0. If y 6= 0, then for each x ∈ Cn we have
H y
max |xH z| = kxkD , and hence |xH y| ≤ kxkD kyk.
kyk ≤ kzk=1
x
The second inequality follows from the first by using |xH y| = |y H x|.
30
For the Euclidean norm k · k2 , both inequalities in this lemma read
|xH y| = |y H x| = |(x, y)| ≤ kxk2 kyk2 ,
which is nothing but the Cauchy-Schwarz inequality in Cn equipped with the Euclidean
inner product.
More generally, for the p- and q-norm on Cn , where p1 + 1q = 1, the second inequality in
the lemma reads
n
n
!1/p n
!1/q
X X X
|xH y| = |y H x| = |(x, y)| ≤ kxkp kykq , or xi y i ≤ |xi |p |yi |q ,
i=1 i=1 i=1
(2.6)
which is called the Hölder inequality.
If k · kDD denotes the dual norm of k · kD , i.e.,
kxkDD := max |xH z| for all x ∈ Cn ,
kzkD =1
With some more effort one can also show the reverse inequality, which gives the following
important theorem.
Theorem 2.17 (Duality Theorem). If k·k is a norm on Cn , k·kD is the dual norm of k·k,
and k · kDD is the dual norm of k · kD , then kxk = kxkDD for all x ∈ Cn , i.e., k · k = k · kDD .
Corollary 2.18. Let k · k be a norm on Cn , let k · kD be the dual norm and let x
b ∈ Cn \ {0}.
Then there exists a vector y ∈ Cn \ {0} with
1 = yH x xkkykD .
b = kb
Such a vector y is called a dual vector of x
b.
Proof. For the given vector x
b we know that
kb xkDD = max |b
xk = kb xH z|.
kzkD =1
z kD = 1. Set y := ze/kb
The maximum is attained for some vector ze with ke xk, then kykD =
D
kz̃k /kb
xk, giving
z kD = kb
1 = ke xkkykD .
Moreover, using ze = kb
xky we get
kb xH ze| = kb
xk = |b xH y| = kb
xk|b xk|y H x
b|,
so that |y H x
b| = 1. Without loss of generality we can assume that y H x
b = 1, since y can be
multiplied by a suitable constant eiθ .
31
b ∈ Cn \ {0} is given by y = x
For the Euclidean norm k · k2 a dual vector of x xk22 ,
b/kb
since then kyk2 = 1/kb
xk2 and
bH
x
1= b = kb
x xk2 kyk2 .
xk22
kb
Remark 2.19. Corollary 2.18 is a finite dimensional version of the following corollary of
the Hahn-Banach Theorem:
If (X , k · kX ) is a normed linear space, and (X ∗ , k · kX ∗ ) is the dual space with k`kX ∗ =
supkxkX ≤1 |`(x)|, then for each nonzero x0 ∈ X there exists a nonzero `0 ∈ X ∗ with `0 (x0 ) =
k`0 kX ∗ kx0 kX .
Theorem 2.20 (Rigal & Gaches [13]). Let A ∈ Cn×n , x ∈ Cn , b = Ax, and x b ∈ Cn \ {0}.
Let k · k be any norm on Cn as well as the induced matrix norm on Cn×n . Let E ∈ Cn×n
and f ∈ Cn be given, and suppose that y ∈ Cn is a dual vector of xb. Then the normwise
backward error of the approximate solution x
b of Ax = b is given by
x) := min {ε : (A + ∆A)b
ηE,f (b x = b + ∆b with k∆Ak ≤ εkEk and k∆bk ≤ εkf k}
krk
= ,
kEkkb xk + kf k
and the second equality is attained by the perturbations
kEkkbxk
∆Amin := ry H ,
kEkkb
xk + kf k
kf k
∆bmin := − r.
kEkkb
xk + kf k
Proof. We have
kEkkbxk
(A + ∆Amin )b
x = Ab
x+ r yH x
kEkkb
xk + kf k |{z}
b
=1
kEkkbxk kf k
= b+ Ab
x
kEkkbxk + kf k kEkkb
xk + kf k
= b + ∆bmin ,
i.e., x
b solves the system that is perturbed by ∆Amin and ∆bmin .
Next, if ∆A and ∆b are arbitrary with (A + ∆A)b x = b + ∆b and k∆Ak ≤ εkEk,
k∆bk ≤ εkf k, then r = b − Ab x = ∆Ab x − ∆b, and hence
krk ≤ k∆Akkb
xk + k∆bk ≤ ε(kEkkb
xk + kf k),
32
Finally, we show that the value εmin is attained by the perturbations ∆bmin , ∆Amin :
kf k
k∆bmin k = krk = εmin kf k,
kEkkb xk + kf k
k∆Amin zk kEkkb xk kry H zk
k∆Amin k = max = max
z6=0 kzk kEkkb xk + kf k z6=0 kzk
kEkkb xk |y H z|
= krk max = εmin kEk,
kEkkb xk + kf k z6=0 kzk
| {z }
xk
=1/kb
krk2 xH
rb
ηA,0 (b
x) = , ∆Amin = , ∆bmin = 0,
kEk2 kb
xk 2 xk22
kb
x) := min {ε : (A + ∆A)b
ηA,b (b x = b + ∆b with k∆Ak ≤ εkAk, k∆bk ≤ εkbk}
krk
=
kAkkb xk + kbk
33
A numerical method for solving Ax = b with A ∈ Cn×n nonsingular is called normwise
forward stable when the computed approximation x
b satisfies
kb
x − xk
= O(κ(A)u);
kxk
εkA−1 k
kb
x − xk kf k
≤ + kEk .
kxk 1 − εkA−1 kkEk kxk
In particular, if E = A and f = b, then
kb
x − xk 2εκ(A)
≤ .
kxk 1 − εκ(A)
kb
x − xk 2εκ(A)
≤ .
kxk 1 − εκ(A)
Since by assumption εκ(A) < 1, the last expression is on the order O(κ(A)u) if ε = u
34
Chapter 3
35
some δx ∈ R with |δx | ≤ u, such that
and hence the relative error in the approximation of x by the floating point number f l(x)
satisfies
|x − f l(x)|
≤ u.
|x|
The error made when working with f l(x) instead of the exact number x is called a rounding
error.
For example, in MATLAB1 the command eps shows the spacing between two floating
point numbers:
>> eps
ans =
2.220446049250313e-16
>> 1-(1/49)*49
ans =
1.110223024625157e-16
The number x = 49 is the smallest integer for which evaluating 1 − (1/x) · x in MATLAB
(and hence IEEE arithmetic) does not give exactly zero.
The (relative) rounding error in a single computation is bounded by the machine pre-
cision, which is very small. If we perform many computations, however, then the rounding
errors may “add up”, and ultimately lead to an inaccurate final result. It is therefore
important to understand the way in which numerical algorithms are affected by rounding
errors.
1
MATLAB constructs the floating point numbers using the IEEE Standard 754 from 2008. For details
see [Link]
36
A basic but nevertheless useful example, which occurs in many algorithms, is the com-
putation of the inner product of two real vectors, i.e.,
n
X
T
y x= xi y i , x, y ∈ Rn .
i=1
Proof. We have
n n
Y Y nu
(1 + δi ) ≤ (1 + u) = (1 + u)n ≤ 1 + ,
i=1 i=1
1 − nu
where the last inequality can be shown by induction on n under the assumption that
nu < 1.
37
Note that
nu
γn = = nu(1 + nu + (nu)2 + · · · ) = nu + O(n2 u2 ). (3.2)
1 − nu
Using this lemma we can write (3.1) as
If nu < 1 and |x| := [|x1 |, . . . , |xn |]T we therefore get the error estimate
n
X
T T
|y x − f l(y x)| ≤ γn |xi yi | = γn |y|T |x| = nu|y|T |x| + O(n2 u2 ). (3.3)
i=1
l11 l21 · · ·
l11 ln1
. .. .. ..
l21 . . . . .
T
A = LL = . . , (3.4)
.. .. . ..
. . . ln,n−1
ln1 · · · ln,n−1 lnn lnn
where L = [lij ] ∈ Rn×n with lii > 0, i = 1 . . . , n, is uniquely determined (cf. Theorem 1.3).
If we equate the columns in (3.4), we immediately obtain the following recursive algorithm
for computing the entries of L:
for j = 1, . . . , nP
do
j−1 2 1/2
ljj = (ajj − k=1 l )
1
Pj−1jk
lij = ljj (aij − k=1 lik ljk ) for i = j + 1, . . . , n
end for
If ljT denotes the jth row of L, then lj is the jth column of LT , and the two steps of
this algorithm can be rewritten as
j j
X X
2
ajj = ljk = ljT lj and aij = lik ljk = liT lj . (3.5)
k=1 k=1
This is, of course, not a recursive algorithm for computing the entries of L. But the
rewritten version clearly shows that (apart from the square root) every step of the recursive
38
algorithm consists of evaluating inner products. Thus, the rounding errors made in every
step can be estimated using (3.3).
Because of the recursive nature of the algorithm, its error analysis must consider the
entries of the computed Cholesky factor L b = [blij ] instead of the exact factor L = [lij ].
Thus, instead of (3.5) we must look at equations of the form
j j
X X
2
ajj = ljk
b ljT b
=b lj and aij = likb
b liT b
ljk = b lj . (3.6)
k=1 k=1
Based on (3.3) and skipping some technical details2 , we then obtain the following result.
Theorem 3.2. If L b is computed by the above algorithm applied to an SPD matrix A ∈
Rn×n , and nu < 1, then
A=L
bLbT + ∆A, where |∆A| ≤ γn+1 |L| bT |.
b |L
The matrix inequality in this theorem is meant entrywise. For any matrix norm k · k
the theorem gives the backward error bound
k|∆A|k ≤ γn+1 k|L||
b L bT |k,
which we will analyze more closely now. For 1 ≤ p ≤ ∞, let k · kp be the norm on Cn×n
induced by the p-norm on Cn , i.e.,
kAkp = max kAxkp .
kxkp =1
Then, in particular,
n
X n
X
1−1/p
kAk1 = max |aij | ≤ n kAkp and kAk∞ = max |aij | ≤ n1/p kAkp .
1≤j≤n 1≤i≤n
i=1 j=1
k|L||
b L bT |k2 = k|L|k
b 22 ≤ nkLk
b 22 = nkL bLbT k2 = nkA − ∆Ak2 ≤ n(kAk2 + k∆Ak2 )
≤ n(kAk2 + γn+1 k|L||
b L bT |k2 ),
2
As Higham writes, such details are “not hard to see after a little thought, but ... tedious to write
down” [5, p. 142].
39
from which we obtain
bT |k2 ≤ n
k|L||
b L kAk2 , (3.7)
1 − nγn+1
and hence, using (3.2),
bT |k2 ≤ nγn+1
kAk2 = n2 u + O(n3 u2 ) kAk2 .
k∆Ak2 ≤ k|∆A|k2 ≤ γn+1 k|L||
b L
1 − nγn+1
We thus have shown the following backward error result.
Corollary 3.3. The computed Cholesky factor L
b satisfies
A lower (or upper) triangular system can be solved using forward (or back) substitution.
The forward substitution algorithm used for solving Ly = b can be written as
j−1
1 X
yj = (bj − ljk yk ), j = 1, . . . , n.
ljj k=1
Evaluating the right hand side costs j multiplications P and j − 1 subtractions, and hence
the total cost of the forward substitution algorithm is nj=1 (2j − 1) = n2 .
In finite precision computations each operation is affected by rounding errors, and
hence the algorithm does not yield the exact solution y = [y1 , . . . , yn ]T but a computed
approximation yb = [b y1 , . . . , ybn ]T . Thus, for the rounding error analysis we must consider
the recurrence
j−1
1 X
ybj = (bj − ljk ybk ), j = 1, . . . , n,
ljj k=1
40
This version shows that the numerical stability analysis of the algorithm can be done as
for the Cholesky algorithm (3.6). Obviously, the same analysis applies to the backward
substitution, i.e., the solution of an upper triangular system, and this leads to the following
result.
Theorem 3.4. If the linear algebraic system T y = b with the lower (or upper) triangular
matrix T ∈ Rn×n and b ∈ Rn is solved by the forward (or back) substitution algorithm as
stated above, then the computed approximation yb satisfies
(T + ∆T )b
y = b, where |∆T | ≤ γn |T |.
Using the same analysis as above one can now show that the relative backward errors
in the forward and back substitution algorithms satisfy bounds of the form k∆T k/kT k ≤
p(n)u, where k · k is an appropriate matrix norm, and p(n) is a small polynomial in n.
Thus, both algorithms are normwise backward stable.
Consequently, when we solve Ax = b with an SPD matrix A ∈ Rn×n in finite precision
arithmetic by first computing A = L bLbT + ∆A, and then computing approximations to
the solutions of Ly
b = b and L bT x = y by forward and back substitution, we obtain an
approximation xb such that
k∆Ak
(A + ∆A)b
x = b, where ≤ p(n)u;
kAk
see, e.g., [5, Theorem 10.4 and equation (10.7)]. Thus, the norm of the residual satisfies
krk = kb − Ab
xk = k∆Ab
xk ≤ p(n)kAkkb
xku,
giving
krk kAkkbxk
≤ p(n)u < p(n)u, (3.8)
kAkkb
xk + kbk kAkkb
xk + kbk
and hence the method is normwise backward stable; cf. the discussion of Theorem 2.20.
If we write the SPD matrix A ∈ Rn×n as
a11 aT1
A= ,
a1 A1
we have eT1 Ae1 = a11 > 0 and we can perform the following basic factorization step:
" 1/2
# 1/2 1/2
a11 aT1 a11 0 1 0 a11 −a11 aT1
A= = −1/2 ,
a1 A1 a11 a1 In−1 0 S1 0 In−1
| {z } | {z }
=:L1 =:LT
1
a1 aT
where S1 := A1 − a11
1
∈ R(n−1)×(n−1) is called Schur complement of a11 in A.
41
1 0
The symmetric matrices A and are congurent and hence have the same inertia
0 S1
(i.e., number of positive, negative and zero eigenvalues). Since A is SPD, S1 must be
SPD as well. Hence we can perform the basic factorization step on S1 , which leads to a
factorization of the form
1 0 0
A = L1 L2 0 1 0 LT2 LT1 ,
0 0 S2
where S2 ∈ R(n−2)×(n−2) is another SPD Schur complement. After n steps we obtain the
Cholesky decomposition
Note that in the nth step we only take one square root and do not form a Schur
complement. Forming the Schur complement Sj in step j = 1, . . . , n − 1 requires the
following types of operations:
1
(1) α
v with v ∈ Rn−j : n − j multiplications.
Disregarding the square roots, the total cost for computing the Cholesky decomposition
using the factorization approach described above is
n−1
X n−1
X
[(n − j)(n − j + 1) + (n − j)] = (n − j)(n − j + 2)
j=1 j=1
n−1
X n−1
X n−1
X
2
= k(k + 2) = k +2 k
k=1 k=1 k=1
(n − 1)n(2(n − 1) + 1) (n − 1)n
= +2
6 2
1
= (n(n − 1)(2n − 1) + 6(n2 − n))
6
1 1
= (2n3 + 3n2 − 5n) ≈ n3 (for large n).
6 3
Many applications involve sparse matrices, i.e., matrices with a significant number of
zero entries. Zero entries need not be stored, and they do not take part in numerical
evaluations (multiplication, addition, subtraction). When A is sparse, we also would like
to have a sparse Cholesky factor L.
42
When aij = 0 but lij 6= 0, the element lij is called a fill-in element. When A is SPD,
then P T AP is SPD for any permutation matrix P . An important line of research in the
context of the (sparse) Cholesky decomposition is concerned with finding permutations P
so that the Cholesky factor of P T AP has the least possible number of nonzero entries. In
this context we speak of the sparse Cholesky decomposition.
A = LU
with a unit lower triangular matrix L ∈ Cn×n and a nonsingular upper triangular ma-
trix U ∈ Cn×n exists; see Theorem 1.1. Then x = A−1 b = U −1 (L−1 b), so that we can
again compute x (or rather an approximation x b) using the forward and back substitution
algorithms, which are normwise backward stable.
The LU decomposition can be computed by “Gaussian elimination”: At step j =
1, . . . , n − 1, multiples of the j-th row are subtracted from rows j + 1, . . . , n in order to
introduce zeros in the column j below the entry in position (j, j), and the result is the
upper triangular matrix U . Schematically:
× × × × × × × × × × × ×
× × × ×
→ 0 × × × → 0 × × ×
A = × × × × 0 × × × 0 0 × ×
× × × × 0 × × × 0 0 × ×
× × × ×
0 × × ×
→ 0 0 × × =U
0 0 0 ×
By the assumption on A, we are guaranteed that a11 6= 0 and that the leading principal
minors of S1 are nonsingular. Hence the process can be continued with the matrix S1 .
After n − 1 steps we obtain
Ln−1 · · · L2 L1 A = U or A = (L−1 −1 −1
1 L2 · · · Ln−1 )U =: LU,
43
where L1 , . . . , Ln−1 ∈ Cn×n and hence L = L−1 −1 −1
1 L2 · · · Ln−1 are unit lower triangular, and
U ∈ Cn×n is nonsingular and upper triangular.
Each matrix Lj is of the form
Ij−1
0j
Lj = 1 = In + eTj ,
lj
lj In−j
In [19, pp. 149–151], the simple form (3.9) of the inverse L−1
j and the simple form (3.10)
of the product of these inverses are called (the first) “two strokes of luck” of Gaussian
elimination.
The main cost in the algorithm described above is in forming the Schur complement
matrices Sj ∈ C(n−j)×(n−j) , j = 1, . . . , n − 1. For a nonsymmetric (or non-Hermitian)
matrix this is (approximately) twice as expensive as forming the symmetric (or Hermitian)
Schur complement in the algorithm for computing the Cholesky decomposition. The cost
for computing the LU decomposition (for large n) therefore is approximately 32 n3 .
Assuming that the decomposition A = LU exists, and that the above algorithm runs
to completion, we can perform a rounding error analysis similar to the analysis for the
Cholesky decomposition. This analysis can be based on writing the decomposition in the
(inner product) form
i
X i
X
aij = lik ukj , and aij = lik u
b bkj , (3.11)
k=1 k=1
and it results, analogously to Theorem 3.2, in a componentwise error bound of the form
A=L
bUb + ∆A, where |∆A| ≤ γn |L||
b U b |;
44
see, e.g., [5, Theorem 9.3]. However, on the contrary to the Cholesky decomposition, the
sizes of the entries in the factors |L|
b and |U
b | are not bounded by kAk, and we can not
derive a bound analogous to (3.7).
For example, the matrix
ε 1
A= , ε∈ / {0, 1},
1 1
has nonsingular leading principal minors, and the the first step of the algorithm gives
1 0 ε 1 ε 1
= ,
−ε−1 1 1 1 0 1 − ε−1
so that
ε 1 1 0 ε 1
= =: LU.
1 1 ε−1 1 0 1 − ε−1
For |ε| → 0 the largest entries in the factors L and U grow unboundedly, hence kLk
and kU k become arbitrarily large, while the largest entry in the matrix A is 1. The
potential numerical instabilities are illustrated by the following MATLAB computation
using ε = 10−16 :
The computed product LU is obviously quite far from the exact matrix A.
In order to control the sizes of the entries in the factors one can use pivoting 3 . We
assume that A is nonsingular. (It will turn out that the nonsingularity assumption on the
leading principal minors is not required when pivoting is used.) After j ≥ 0 steps of the
above algorithm we have
∗
Uj
(j)
ujj ∗ · · · ∗
Lj · · · L1 A = ,
.
. .
. .
.
0 . . .
(j)
unj ∗ ... ∗
where Uj ∈ Cj×j is upper triangular. (Here U0 is the empty matrix.) Since A is nonsingular,
(j) (j)
at least one of the entries ujj , . . . , unj must be nonzero. In the strategy of partial (or row)
(j)
pivoting, we select an entry ukj of maximum modulus among these entries. We then
3
According to the Merriam-Webster Dictionary, a pivot is “a person, thing, or factor having a major
or central role, function, or effect”. This real-life definition fits well to the mathematical meaning of the
pivot in Gaussian elimination, which is described in this paragraph.
45
exchange the rows j and k, and form the elimination matrix Lj using the submatrix with
the exchanged rows. When forming Lj we divide the first column of the corresponding
(j)
submatrix by the pivot ukj , which has the largest magnitude in that column. Consequently,
all entries in the matrix Lj , and hence all entries in L−1
j , are bounded in modulus by 1.
In matrix notation, the exchange of rows j and k, where 1 ≤ j ≤ k ≤ n, corresponds
to a left-multiplication by the permutation matrix
For any nonsingular matrix A ∈ Cn×n , the Gaussian elimination algorithm with partial
pivoting produces a factorization of the form
or A = P LU , where L = L−1 T −1
1 and P = P12 = P12 .
Note that since 1 < 2 ≤ k2 ≤ n, we have eT1 P2,k2 = eT1 , and therefore
01 01
P2,k2 L1 = P2,k2 + P2,k2 T
e1 = In + e eT1 P2,k2 ,
l1 l1
where e
l1 and l1 have the same entries, except for possibly two permuted ones (if 2 < k2 ).
More generally, it follows that
Theorem 3.5 (LU decomposition with partial pivoting). Each nonsingular matrix A ∈
Cn×n can be factorized A = P LU , where P ∈ Cn×n is a permutation matrix, L = [lij ] ∈
Cn×n is unit lower triangular with |lij | ≤ 1, and U ∈ Cn×n is nonsingular and upper
triangular.
46
Let us write P T A = [e
aij ] = LU , where L = [lij ] ∈ Cn×n is unit lower triangular with
|lij | ≤ 1, and U is upper triangular. Then from Theorem 3.5 we obtain
Thus, when using the partial pivoting strategy the entries of the upper triangular factor U
are bounded in terms of the entries of A, where the upper bound contains the (inconvenient)
constant 2i−1 . A closer analysis of this situation is based on the following definition.
Definition 3.6. The growth factor for the (nonsingular) matrix A in the LU factorization
algorithm with partial pivoting as explained above is given by 4
maxi,j |uij |
ρ(A) := .
maxi,j |aij |
One can now show that if the LU factorization algorithm with partial pivoting (and
even without partial pivoting!) runs to completion, the computed factors satisfy
k∆Ak
A = PbL
bUb + ∆A, where ≤ p(n)ρ(A)u,
kAk
and p(n) is some low-degree polynomial in n. Moreover, when subsequently using the
computed LU decomposition for solving the linear algebraic system Ax = b using the
normwise backward stable forward and back substitution algorithms, we obtain a computed
approximation x
b with
k∆Ak
(A + ∆A)b
x = b, where ≤ p(n)ρ(A)u;
kAk
see, e.g., [5, p. 165]. Analogously to (3.8) we now obtain
krk kAkkbxk
≤ p(n)ρ(A)u < p(n)ρ(A)u.
kAkkb
xk + kbk kAkkb
xk + kbk
Apart from the growth factor ρ(A), the backward error bounds for the LU factorization
algorithm with partial pivoting coincide with those for Hermitian positive definite matrices
and the Cholesky decomposition. We see from (3.12) that
ρ(A) ≤ 2n−1
4
The standard notation for the growth factor is unfortunately in conflict with the standard notation
for the spectral radius.
47
for all (nonsingular) matrices A ∈ Cn×n . There are in fact matrices for which this up-
per bound on the growth factor is attained. However, Gaussian elimination with partial
pivoting is “utterly stable in practice” in the sense that matrices A with large growth fac-
tors rarely occur. For discussions of this fact also from a historical point of view, see [19,
pp. 166–170] or [5, Section 9.4].
48
Chapter 4
xk+1 = M −1 N xk + M −1 b, k = 0, 1, 2, . . . , (4.1)
M −1 b = M −1 (M − N )x = x − M −1 N x,
49
we obtain
ek = x − xk = x − (M −1 N xk−1 + x − M −1 N x) = M −1 N ek−1 ,
and by induction
ek = (M −1 N )k e0 , k = 0, 1, 2, . . . . (4.2)
The matrix M −1 N is called the iteration matrix of (4.1). The iteration converges to the
exact solution x, when ek → 0 for k → ∞. For consistent norms k · k we have the error
bound
kek k = k(M −1 N )k e0 k ≤ kM −1 N kk ke0 k.
Thus, if there exists some consistent norm k · k with kM −1 N k < 1, then the iteration (4.1)
converges for any x0 .
More generally, from (4.2) we see that the iteration converges for any given x0 , if and
only if M −1 N → 0 for k → ∞. In order to analyze this situation we consider the Jordan
decomposition
M −1 N = XJX −1 , J = diag(Jd1 (λ1 ), . . . Jdm (λm )).
Then (M −1 N )k = XJ k X −1 , and (M −1 N )k → 0 holds if and only if
J k = diag((Jd1 (λ1 ))k , . . . (Jdm (λm ))k ) → 0.
The kth power of a Jordan block is given by
min{k,d}
k k
X k k−j
(Jd (λ)) = (λId + Jd (0)) = λ (Jd (0))j
j=0
j
min{k,d}
X 1 k!λk−j
= (Jd (0))j ,
j=0
j! (k − j)!
for every j = 0, 1 . . . , min{k, d}. It is clear that this holds when λ = 0. For λ 6= 0 and a
fixed j we divide two consecutive terms of this sequence and obtain, for k ≥ d,
(k + 1)!|λ|k+1−j (k − j)! k+1 k+1 1
k−j
= |λ| ≤ |λ| = |λ| d
.
(k + 1 − j)! k!|λ| k+1−j k+1−d 1 − k+1
The last term approaches 1 for k → ∞. Therefore |λ| < 1 is necessary and sufficient for
(Jd (λ))k → 0. In summary, we have shown the following result.
Theorem 4.1. The iteration (4.1) converges for any initial vector x0 , if and only if the
spectral radius of the iteration matrix satisfies ρ(M −1 N ) < 1. A sufficient condition for
convergence for any initial vector x0 is kM −1 N k < 1 for some consistent norm k · k.
50
For a simple example, suppose that
1
−1 2
α
M N= 1 .
0 2
A = L + D + U,
where L and U are the strictly lower and upper triangular parts. This yields the following
classical methods:
In both cases M −1 exists if and only if A = [aij ] has nonzero diagonal elements.
For the ∞-norm and the Jacobi method we then have
X |aij |
kRJ k∞ = max .
1≤i≤n
j6=i
|aii |
is called strictly (row) diagonally dominant. Thus, the Jacobi method converges for any
x0 when applied to such matrices A.
For the Gauss-Seidel method and a strictly (row) diagonally dominant matrix A we
consider the equation RG x = λx, or
Assuming, without loss of generality, that the entry of largest magnitude in the vector x
is x` = 1, we obtain
`−1 n Pn n
j=`+1 |a`j | |a`j |
X X X
|λ||a`` | ≤ |λ| |a`j | + |a`j | ⇒ |λ| ≤ ≤ < 1,
|a`` | − `−1 |a`` |
P
j=1 j=`+1 j=1 |a`j | j=`+1
51
Although forming RG is “more expensive” than forming RJ , we are not guaranteed
that the Gauss-Seidel method performs better than the Jacobi method. For example,
1 1 −1
A = 1 1 1
2 2 1
yields
0 −1 1 0 −2 2
RJ = −1 0 −1 , RG = 0 2 −3 .
−2 −2 0 0 0 2
Since RJ3 = 0, i.e., RJ is nilpotent, we have ρ(RJ ) = 0, while ρ(RG ) = 2.
If ρ(M −1 N ) is close to 1, then the convergence of (4.1) can be very slow. In order to
improve the speed of convergence we can introduce a (real) relaxation parameter ω > 0
and consider ωAx = ωb instead of Ax = b. We then split
ωA = ω(L + D + U ) = (D + ωL) + (ωU + (ω − 1)D) =: M − N,
which results in the iteration
xk+1 = RSOR (ω)xk + ωM −1 b, k = 0, 1, 2, . . . , (4.3)
where
RSOR (ω) := −(D + ωL)−1 (ωU + (ω − 1)D). (4.4)
In order to form RSOR (ω), we still require that A has nonzero diagonal elements. For ω = 1
this gives the Gauss-Seidel method, and for ω > 1 this method is called the Successive Over
Relaxation (SOR) method. For 0 < ω < 1 the resulting methods are called under-relaxation
methods.
Numerous publications, in particular from the 1950s and 1960s, are concerned with
choosing an “optimal” ω in the sense that ρ(RSOR (ω)) is minimal in different applications.
The following result of Kahan [9] shows that one can restrict the search of an optimal (real
and positive) ω to the interval (0, 2).
Theorem 4.2. The matrix RSOR (ω) in (4.4) satisfies ρ(RSOR (ω)) ≥ |1 − ω|, and hence
the method (4.3)–(4.4) can converge only if ω ∈ (0, 2).
Proof. Let λ1 (ω), . . . , λn (ω) be the eigenvalues of RSOR (ω). Then the determinant multi-
plication theorem yields
n
Y
λj (ω) = det(RSOR (ω)) = det(−(I + ωD−1 L)−1 D−1 ) det(D(ωD−1 U + (ω − 1)I))
j=1
52
The theorem above does not show that ω ∈ (0, 2) is sufficient for the convergence of the
method (4.3)–(4.4). This is sufficient, however, when A is HPD, as shown by the following
theorem, which is a special case of a result of Ostrowski [11]. In particular, the theorem
shows that the Gauss-Seidel method converges for HPD matrices.
Theorem 4.3. If A ∈ Cn×n is HPD, then ρ(RSOR (ω)) < 1 holds for each ω ∈ (0, 2).
More generally, we can consider a splitting A = M − N , a relaxation parameter ω > 0,
and write ωAx = ωb in the equivalent form
x = ((1 − ω)I + ωM −1 N )x + ωM −1 b.
This yields the iteration
xk+1 = R(ω)xk + ωM −1 b, k = 0, 1, 2, . . . ,
where
R(ω) := (1 − ω)I + ωM −1 N.
Now ek = R(ω)k e0 , so that the convergence is determined by ρ(R(ω)). The spectrum of
the iteration matrix is given by
Λ(R(ω)) = (1 − ω) + ωΛ(M −1 N ),
which can be used for determining an optimal ω when Λ(M −1 N ) is (approximately) known.
In all such iterations the convergence is asymptotically (for large k) linear, with the average
“reduction factor” per step given by kR(ω)k.
53
Lemma 4.4. Let Sk , Ck ∈ Cn×k represent bases of the k-dimensional subspaces Sk , Ck ⊆
Cn . Then the following statements are equivalent
Proof. If CkH ASk is nonsingular, then rank(ASk ) = k and CkH ASk z = 0 implies z = 0.
Hence Cn⊥ ∩ ASk = {0}, so that Ck = ASk ⊕ Ck⊥ .
On the other hand, if Cn = ASk ⊕ Ck⊥ , then ASk has dimension k. Let CkH ASk z = 0
for some z ∈ Ck . Then ASk z ∈ ASk ∩ Ck⊥ and hence ASk z = 0. Since ASk has rank k, we
have z = 0 and CkH ASk is nonsingular.
This lemma shows that the question whether tk in (4.7)–(4.8) is uniquely determined
depends only on A, Sk , Ck but not on the choice of bases for Sk , Ck .
and hence Pk projects onto ASk orthogonally to Ck . Equation (4.9) can be written as
r0 = Pk r0 + rk .
|{z} |{z}
∈ASk ∈Ck⊥
If ASk = Ck , then this decomposition is orthogonal since its two components are mutually
orthogonal. In this case Cn = Ck ⊕ Ck⊥ and we call the method an orthogonal projection
method. When ASk 6= Ck , we call the method an oblique projection method.
Note that in an orthogonal projection method we have

\|r_0\|_2^2 = \|P_k r_0\|_2^2 + \|r_k\|_2^2, or \|r_k\|_2^2 = \|r_0\|_2^2 − \|P_k r_0\|_2^2.

If ASk ⊆ ASk+1, then \|P_k r_0\|_2 ≤ \|P_{k+1} r_0\|_2 and hence \|r_{k+1}\|_2 ≤ \|r_k\|_2, i.e., the Euclidean
norm of the residual is monotonically decreasing.
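For concreteness, here is a NumPy sketch (ours) of a single step of such a projection method, assuming the usual formulation in which tk solves CkH ASk tk = CkH r0 and xk = x0 + Sk tk (cf. (4.7)–(4.8)); the tiny example at the end uses Ck = ASk, i.e., an orthogonal projection method.

    import numpy as np

    def projection_step(A, r0, Sk, Ck):
        """One step of the projection method: determine t_k with
        C_k^H (r0 - A S_k t_k) = 0, i.e. the new residual lies in C_k^perp."""
        M = Ck.conj().T @ (A @ Sk)                # C_k^H A S_k (nonsingular by Lemma 4.4)
        tk = np.linalg.solve(M, Ck.conj().T @ r0)
        rk = r0 - A @ (Sk @ tk)
        return tk, rk

    rng = np.random.default_rng(0)
    n, k = 6, 3
    A = rng.standard_normal((n, n))
    r0 = rng.standard_normal(n)
    Sk = rng.standard_normal((n, k))              # basis of the search space
    Ck = A @ Sk                                   # constraint space C_k = A S_k
    tk, rk = projection_step(A, r0, Sk, Ck)
    print(np.linalg.norm(Ck.conj().T @ rk))       # ~ 0, i.e. r_k is orthogonal to C_k
    print(np.linalg.norm(rk) <= np.linalg.norm(r0))  # True: residual norm decreases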
Theorem 4.6. In the notation established above, a projection method is well defined at
step k, if any of the following conditions hold:

(i) A is HPD and Ck = Sk,

(ii) A is nonsingular and Ck = ASk.
Proof. (i) If Ck = Sk , then for any bases Ck , Sk we have Ck = Sk Z for some nonsingular
Z ∈ Ck×k , and CkH ASk = Z H SkH ASk , which is nonsingular since SkH ASk is nonsingular
(even HPD).
(ii) Now we have Ck = ASk Z for some nonsingular Z ∈ Ck×k, and CkH ASk = Z H SkH AH ASk.
Since A is nonsingular, SkH AH ASk is HPD, and hence CkH ASk is nonsingular.
Having determined when the projection method (4.5)–(4.6) is well defined, we now study
when the method terminates (in exact arithmetic) with rk = 0.
Lemma 4.7. In the notation established above, let the projection method be well defined
at step k. If r0 ∈ Sk and ASk = Sk , then rk = 0.
S1 = span{r0 }, and S1 ⊂ S2 ⊂ S3 ⊂ . . .
where Kk (A, r0 ) is the kth Krylov subspace of A and r0 ; see (1.2). In the following result
we collect important properties of the Krylov subspaces.
(iii) If r0 is of grade d with respect to A, Sk = Kk (A, r0 ) in the projection method (4.5)–
(4.6), and the method is well defined at step d, then rd = 0.
Proof. (i) and (ii) were already shown in Chapter 1; see in particular the discussion of
(1.3). (iii) follows from (ii) and Lemma 4.7.
We can now use this lemma and Theorem 4.6 to obtain the following mathematical
characterization of several well defined Krylov subspace methods.
Theorem 4.9. Consider the projection method (4.5)–(4.6) for solving a linear algebraic
system Ax = b with initial approximation x0 and let r0 = b − Ax0 be of grade d ≥ 1 with
respect to A.
r_k ⊥ K_k(A, r_0), or x − x_k ⊥_A K_k(A, r_0),

\|x − x_k\|_A = \min_{z ∈ x_0 + K_k(A, r_0)} \|x − z\|_A.
In order to implement the Krylov subspace methods that are characterized in the above
theorem, we need bases of Sk = Kk (A, r0 ) and Ck , for k = 1, 2, . . . . The “canonical” basis
r0 , Ar0 , . . . , Ak−1 r0 of Kk (A, r0 ) should not be used in practical computations, since the
corresponding matrix usually is (very) ill conditioned: For simplicity, assume that A is
diagonalizable, A = X diag(λ1 , . . . , λn ) X −1 , with a single dominant eigenvalue, |λ1 | > |λj |
for j = 2, . . . , n, and suppose that r0 = X[α1 , . . . , αn ]T with α1 6= 0. Then
A^k r_0 = λ_1^k X \begin{bmatrix} α_1 \\ (λ_2/λ_1)^k α_2 \\ \vdots \\ (λ_n/λ_1)^k α_n \end{bmatrix}.
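The following sketch (ours, not from the notes) illustrates this numerically: the condition number of the Krylov matrix [r0, Ar0, . . . , A^{k−1} r0] grows rapidly with k, since its columns increasingly align with the direction of the dominant eigenvector.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 50
    # Diagonalizable A with a single dominant eigenvalue lambda_1 = 10.
    lam = np.concatenate(([10.0], rng.uniform(0.5, 2.0, n - 1)))
    X = rng.standard_normal((n, n))
    A = X @ np.diag(lam) @ np.linalg.inv(X)
    r0 = rng.standard_normal(n)

    K = r0.reshape(-1, 1)
    for k in range(1, 11):
        K = np.hstack([K, A @ K[:, -1:]])       # append A^k r0 as a new column
        print(k + 1, np.linalg.cond(K))         # condition number explodes with k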
v_k = A^{k−1} r_0 / \|A^{k−1} r_0\|, \quad k = 1, 2, . . .
x_k = x_0 + V_k t_k,
r_k = b − A x_k = r_0 − A V_k t_k = r_0 − V_{k+1} H_{k+1,k} t_k = V_{k+1} (\|r_0\|_2 e_1 − H_{k+1,k} t_k).
The equivalent optimality property is

\|r_k\|_2 = \min_{z ∈ x_0 + K_k(A, r_0)} \|b − Az\|_2 = \min_{t ∈ \mathbb{C}^k} \|V_{k+1} (\|r_0\|_2 e_1 − H_{k+1,k} t)\|_2,

and again the unique minimizer is t_k = H_{k+1,k}^{+} (\|r_0\|_2 e_1).
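A compact sketch (ours) of this formulation of GMRES: the Arnoldi relation AVk = Vk+1 Hk+1,k used above is computed with modified Gram–Schmidt, and the small least squares problem is solved with a generic solver rather than the Givens-based update described next; real arithmetic is assumed for brevity.

    import numpy as np

    def gmres_sketch(A, b, x0, m):
        """m steps of GMRES: x_k = x_0 + V_k t_k, where t_k minimizes
        || ||r_0||_2 e_1 - H_{k+1,k} t ||_2 (real arithmetic for brevity)."""
        n = len(b)
        r0 = b - A @ x0
        beta = np.linalg.norm(r0)
        V = np.zeros((n, m + 1)); H = np.zeros((m + 1, m))
        V[:, 0] = r0 / beta
        iterates = []
        for k in range(m):
            # Arnoldi step with modified Gram-Schmidt: A V_k = V_{k+1} H_{k+1,k}.
            w = A @ V[:, k]
            for j in range(k + 1):
                H[j, k] = V[:, j] @ w
                w = w - H[j, k] * V[:, j]
            H[k + 1, k] = np.linalg.norm(w)
            # Solve the (k+1) x k least squares problem for t_k.
            e1 = np.zeros(k + 2); e1[0] = beta
            t, *_ = np.linalg.lstsq(H[:k + 2, :k + 1], e1, rcond=None)
            iterates.append(x0 + V[:, :k + 1] @ t)
            if H[k + 1, k] == 0:                 # grade of r_0 reached: r_k = 0
                break
            V[:, k + 1] = w / H[k + 1, k]
        return iterates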
The QR decomposition of the unreduced upper Hessenberg matrix H_{k+1,k} ∈ C^{(k+1)×k} can
be computed using k Givens rotations, and hence the cost for this computation is O(k^2).
This gives

G_k \cdots G_1 H_{k+1,k} = \begin{bmatrix} R_k \\ 0 \end{bmatrix}, or H_{k+1,k} = Q_k \begin{bmatrix} R_k \\ 0 \end{bmatrix},

where R_k ∈ C^{k×k} is upper triangular and nonsingular, and Q_k = G_1^H \cdots G_k^H ∈ C^{(k+1)×(k+1)} is unitary.
Hence

\|r_k\|_2 = \min_{t ∈ \mathbb{C}^k} \big\| \|r_0\|_2 e_1 − H_{k+1,k} t \big\|_2 = \min_{t ∈ \mathbb{C}^k} \left\| \|r_0\|_2 Q_k^H e_1 − \begin{bmatrix} R_k \\ 0 \end{bmatrix} t \right\|_2,

and the minimizer t_k is obtained by back substitution with R_k applied to the first k entries of \|r_0\|_2 Q_k^H e_1.
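A sketch (ours) of this QR decomposition by Givens rotations; it assumes real arithmetic and a nonzero subdiagonal (unreduced Hessenberg), and it builds each rotation as a full matrix for clarity.

    import numpy as np

    def hessenberg_qr(H):
        """QR decomposition of an unreduced upper Hessenberg H of size (k+1) x k
        by k Givens rotations: G_k ... G_1 H = [R; 0], Q = G_1^T ... G_k^T."""
        m, k = H.shape                        # m = k + 1
        R = H.astype(float).copy()
        Q = np.eye(m)
        for j in range(k):
            a, b = R[j, j], R[j + 1, j]       # b != 0 since H is unreduced
            r = np.hypot(a, b)
            c, s = a / r, b / r
            G = np.eye(m)                     # rotation in the (j, j+1) plane;
            G[j, j], G[j + 1, j + 1] = c, c   # a practical code would update only
            G[j, j + 1], G[j + 1, j] = s, -s  # the two affected rows (O(k) work)
            R = G @ R                         # zeroes the subdiagonal entry (j+1, j)
            Q = Q @ G.T
        return Q, R                           # H = Q @ R; R[:k, :] is upper triangular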
The MINRES (minimal residual) method is an implementation of
the projection method characterized in (ii) of Theorem 4.9 for Hermitian and nonsingular
matrices, which is based on the Lanczos algorithm. It is mathematically equivalent to GMRES,
but due to the Lanczos algorithm it uses 3-term instead of full recurrences. Hence the
work and storage requirements of the MINRES method remain constant throughout the
iteration.
For HPD matrices we can again use the Lanczos algorithm, which computes the decomposition

A V_k = V_{k+1} T_{k+1,k}, \quad k = 1, . . . , d − 1, \qquad A V_d = V_d T_{d,d},

in order to implement the projection method (4.5)–(4.6) with Sk = Ck = Kk (A, r0 ). Using
the orthogonality property rk ⊥ Kk (A, r0 ) (cf. (i) in Theorem 4.9) we get

x_k = x_0 + V_k t_k,
0 = V_k^H r_k = V_k^H (b − A x_k) = V_k^H (r_0 − A V_k t_k) = \|r_0\|_2 e_1 − V_k^H A V_k t_k = \|r_0\|_2 e_1 − T_k t_k.

Here Tk ∈ Ck×k is HPD (since A is) and its Cholesky decomposition exists; cf. Theorem 1.3.
Since Tk is tridiagonal, we have T_k = L_k L_k^H, where L_k is lower bidiagonal. A clever use
of this structure leads to simple formulas for computing t_k = \|r_0\|_2 T_k^{−1} e_1. The most well-
known variant is the Conjugate Gradient (CG) method of Hestenes and Stiefel [4].
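For reference, a minimal sketch (ours) of the resulting method in the usual coupled two-term recurrence form of Hestenes and Stiefel [4]; real arithmetic and a simple relative residual stopping rule are assumed.

    import numpy as np

    def cg(A, b, x0, tol=1e-10, maxiter=None):
        """Conjugate Gradient method for a real symmetric positive definite A."""
        x = x0.astype(float).copy()
        r = b - A @ x                        # residual r_k = b - A x_k
        p = r.copy()                         # search direction
        rho = r @ r
        nb = np.linalg.norm(b)
        maxiter = len(b) if maxiter is None else maxiter
        for _ in range(maxiter):
            if np.sqrt(rho) <= tol * nb:
                break
            Ap = A @ p
            alpha = rho / (p @ Ap)           # step length along p
            x = x + alpha * p
            r = r - alpha * Ap
            rho_new = r @ r
            p = r + (rho_new / rho) * p      # next A-conjugate direction
            rho = rho_new
        return x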
We will now analyze the convergence properties of the algorithms introduced above.
We start with the CG method, which is well defined for HPD matrices A ∈ Cn×n. The CG
method is mathematically characterized by xk ∈ x0 + Kk (A, r0 ) and

\|x − x_k\|_A = \min_{z ∈ x_0 + K_k(A, r_0)} \|x − z\|_A.
Note that r0 = A(x − x0 ), and hence for every z ∈ x0 + Kk (A, r0 ) there exist γ0 , . . . , γk−1 ∈ C
with

x − z = x − \left( x_0 + \sum_{j=0}^{k−1} γ_j A^j r_0 \right) = \left( I − \sum_{j=0}^{k−1} γ_j A^{j+1} \right) (x − x_0) = p(A)(x − x_0),

where p is a polynomial of degree at most k with p(0) = 1. Denoting the set of all such polynomials by Pk (0), we obtain

\|x − x_k\|_A = \min_{p ∈ P_k(0)} \|p(A)(x − x_0)\|_A.    (4.11)
The HPD matrix A is unitarily diagonalizable with real positive eigenvalues, A = XΛX^H
with X^H X = I, Λ = diag(λ_1, . . . , λ_n), 0 < λ_1 ≤ . . . ≤ λ_n. We can thus define the square
root

A^{1/2} := X Λ^{1/2} X^H, \qquad Λ^{1/2} := diag(λ_1^{1/2}, . . . , λ_n^{1/2}),

which satisfies (A^{1/2})^2 = A. For every v ∈ Cn we thus get \|v\|_A^2 = v^H A v = (v^H A^{1/2})(A^{1/2} v) = \|A^{1/2} v\|_2^2.
Using this result in (4.11) yields

\|x − x_k\|_A = \min_{p ∈ P_k(0)} \|A^{1/2} p(A)(x − x_0)\|_2 = \min_{p ∈ P_k(0)} \|X p(Λ) X^H A^{1/2} (x − x_0)\|_2 ≤ \min_{p ∈ P_k(0)} \|p(Λ)\|_2 \, \|x − x_0\|_A,

and \|p(Λ)\|_2 = \max_{1 ≤ i ≤ n} |p(λ_i)|, which gives the first inequality of the following theorem.
Theorem 4.10. The relative A-norm of the error in step k of the CG method satisfies

\frac{\|x − x_k\|_A}{\|x − x_0\|_A} ≤ \min_{p ∈ P_k(0)} \max_{1 ≤ i ≤ n} |p(λ_i)| ≤ \min_{p ∈ P_k(0)} \max_{λ ∈ [λ_1, λ_n]} |p(λ)| ≤ 2 \left( \frac{\sqrt{κ} − 1}{\sqrt{κ} + 1} \right)^{k},

where κ = λ_n / λ_1 is the condition number of A.
Proof. The first inequality was shown above. The second follows easily since the discrete
set of eigenvalues in the min-max problem is replaced by the inclusion set [λ1 , λn ]. The
third inequality can be shown using suitably shifted and normalized Chebyshev polynomials
which solve the min-max problem on [λ1 , λn ].
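A small computation (ours) that evaluates the last bound of Theorem 4.10 and reports, for a few values of κ, how many steps it predicts until the relative A-norm of the error is guaranteed to drop below 10^{-8}.

    import numpy as np

    def cg_bound(kappa, k):
        """Last bound of Theorem 4.10: 2 * ((sqrt(kappa)-1)/(sqrt(kappa)+1))**k."""
        s = np.sqrt(kappa)
        return 2.0 * ((s - 1.0) / (s + 1.0)) ** k

    for kappa in (10, 100, 10000):
        k = 1
        while cg_bound(kappa, k) > 1e-8:
            k += 1
        print(f"kappa = {kappa:6d}: bound below 1e-8 after k = {k} steps")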
Main observations:
(1) The bounds in the theorem are worst-case bounds for the given matrix A, since they
are independent of the choice of x0 and the right hand side b.
(2) The last bound shows that the (worst-case) convergence will be fast when the con-
dition number of A is close to 1.
(3) The fact in (2) motivates the preconditioning of the system Ax = b: Find an easily
invertible matrix L and consider the (equivalent) system

(L^{−1} A L^{−H}) (L^H x) = L^{−1} b.

The matrix L^{−1} A L^{−H} is HPD and the convergence bound for CG will involve the
condition number of this matrix. The goal of preconditioning in this context is to
find a matrix L with

κ(L^{−1} A L^{−H}) ≪ κ(A);

see the sketch below.
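A sketch (ours) of this idea with the simplest possible choice, the diagonal matrix L = diag(a_11, . . . , a_nn)^{1/2} (Jacobi scaling); the test matrix is constructed so that its ill conditioning comes from badly scaled rows and columns, which is the situation this particular L can cure.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200
    # HPD matrix whose ill conditioning comes from badly scaled rows/columns:
    # A = D^{1/2} (I + 0.1 E) D^{1/2} with ||E||_2 <= 1 and D = diag(1, ..., 1e6).
    d = np.logspace(0, 6, n)
    B = rng.standard_normal((n, n)); E = (B + B.T) / 2
    E = E / np.linalg.norm(E, 2)
    A = np.sqrt(d)[:, None] * (np.eye(n) + 0.1 * E) * np.sqrt(d)[None, :]

    # Diagonal ("Jacobi") preconditioner L = diag(a_11, ..., a_nn)^{1/2}.
    Linv = np.diag(1.0 / np.sqrt(np.diag(A)))
    A_prec = Linv @ A @ Linv.T               # L^{-1} A L^{-H}, again HPD

    print("kappa(A)            = %.2e" % np.linalg.cond(A))
    print("kappa(L^-1 A L^-H)  = %.2e" % np.linalg.cond(A_prec))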
The GMRES algorithm is characterized by xk ∈ x0 + Kk (A, r0 ) and

\|r_k\|_2 = \min_{z ∈ x_0 + K_k(A, r_0)} \|b − Az\|_2 = \min_{p ∈ P_k(0)} \|p(A) r_0\|_2.

If A = X diag(λ_1, . . . , λ_n) X^{−1} is diagonalizable, then for every p ∈ P_k(0) we have
\|p(A) r_0\|_2 ≤ \|X\|_2 \|X^{−1}\|_2 \max_{1 ≤ i ≤ n} |p(λ_i)| \, \|r_0\|_2, and thus

\frac{\|r_k\|_2}{\|r_0\|_2} ≤ κ(X) \min_{p ∈ P_k(0)} \max_{1 ≤ i ≤ n} |p(λ_i)|.    (4.12)

The right-hand side of (4.12) is a worst-case bound on the relative residual norm in step
k. It shows that GMRES converges quickly when the eigenvalues are in a single “cluster”
that is far away from zero and the eigenvectors are well conditioned.
Bibliography
[1] W. E. Arnoldi, The principle of minimized iteration in the solution of the matrix
eigenvalue problem, Quart. Appl. Math., 9 (1951), pp. 17–29.
[4] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear
systems, J. Research Nat. Bur. Standards, 49 (1952), pp. 409–436.
[5] N. J. Higham, Accuracy and Stability of Numerical Algorithms, Society for Industrial
and Applied Mathematics (SIAM), Philadelphia, PA, second ed., 2002.
[7] A. S. Householder, Some numerical methods for solving systems of linear equa-
tions, Amer. Math. Monthly, 57 (1950), p. 453.
[8] A. S. Householder, A class of methods for inverting matrices, J. Soc. Ind. Appl.
Math., 6 (1958), pp. 189–195.
[9] W. Kahan, Gauss–Seidel methods of solving large systems of linear equations, PhD
thesis, University of Toronto, 1958.
[10] J. Liesen and Z. Strakoš, On optimal short recurrences for generating orthogonal
Krylov subspace bases, SIAM Rev., 50 (2008), pp. 485–503.
[11] A. M. Ostrowski, On the linear iteration procedures for symmetric matrices, Rend.
Mat. e Appl., 14 (1954), pp. 140–163.
[13] J. L. Rigal and J. Gaches, On the compatibility of a given solution with the data
of a linear system, J. ACM, 14 (1967), pp. 543–548.
[15] H. Shapiro, A survey of canonical forms and invariants for unitary similarity, Linear
Algebra Appl., 147 (1991), pp. 101–167.
[17] G. W. Stewart, On the early history of the singular value decomposition, SIAM Rev., 35 (1993),
pp. 551–566.
[19] L. N. Trefethen and D. Bau, III, Numerical Linear Algebra, Society for Industrial
and Applied Mathematics (SIAM), Philadelphia, PA, 1997.