
Two Floating Point Block LLL Reduction Algorithms - Thesis

This thesis develops and analyzes two floating point block LLL reduction algorithms: the left-to-right block LLL (LRBLLL) reduction algorithm and the alternating partition block LLL (APBLLL) reduction algorithm. It compares the performance of these algorithms to the original LLL reduction algorithm and the partial LLL reduction algorithm in terms of runtime, floating point operations (flops), and relative backward error. The simulation results show that the block LLL algorithms have faster runtimes than the partial LLL algorithm and much faster runtimes than the original LLL algorithm, though they may sometimes have higher flop counts or be less numerically stable. The thesis also discusses parallelizing the APBLLL algorithm.


Two Floating Point Block LLL Reduction Algorithms
Yancheng Xiao

Master of Science

School of Computer Science

McGill University
Montreal, Quebec
September 2012

A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Master of Science in Computer Science
© Yancheng Xiao 2012

DEDICATION

This document is dedicated to my beloved parents.


ACKNOWLEDGEMENTS

I have been indebted in my postgraduate study and research, especially in the preparation of this thesis, to my supervisor Prof. Xiao-Wen Chang of the School of Computer Science at McGill University, whose academic guidance and financial support, given with patience and kindness, have been invaluable to me. I am grateful to Prof. Clark Verbrugge for kindly lending his AMD high concurrency machine, which has been useful in testing the performance of our block LLL reduction algorithms. I would like to thank all my lab mates of the Scientific Computing Lab in the School of Computer Science, Mazen Al Borno, Stephen Breen, Xi Chen, Sevan Hanssian, Wen-Yang Ku, Wanru Lin, Milena Scaccia, David Titley-Peloquin, Jinming Wen and Xiaohu Xie, for the pleasant collaboration during my study and research. Thanks also to all my friends and my boyfriend Bin Zhu for their various help with my study and life in Montreal.


ABSTRACT

The Lenstra, Lenstra and Lovász (LLL) reduction is the most popular lattice
reduction and is a powerful tool for solving many complex problems in mathematics
and computer science. The blocking technique casts matrix algorithms in terms
of matrix-matrix operations to permit efficient reuse of data in the algorithms. In
this thesis, we use the blocking technique to develop two floating point block LLL
reduction algorithms, the left-to-right block LLL (LRBLLL) reduction algorithm
and the alternating partition block LLL (APBLLL) reduction algorithm, and give
the complexity analysis of these two algorithms. We compare these two block LLL
reduction algorithms with the original LLL reduction algorithm (in floating point
arithmetic) and the partial LLL (PLLL) reduction algorithm in the literature in
terms of CPU run time, flops and relative backward errors. The simulation results
show that the overall CPU run times of the two block LLL reduction algorithms are
shorter than that of the partial LLL reduction algorithm and much shorter than that
of the original LLL, even though the two block algorithms cost more flops than the
partial LLL reduction algorithm in some cases. The shortcoming of the two block
algorithms is that sometimes they may not be as numerically stable as the original
and partial LLL reduction algorithms. The parallelization of APBLLL is discussed.


ABRÉGÉ

The Lenstra, Lenstra and Lovász (LLL) reduction is the most popular lattice
reduction and is a powerful tool for solving many complex problems in mathematics
and computer science. The blocking technique recasts the algorithms in terms of
matrix-matrix operations to permit efficient reuse of data. In this thesis, we use the
blocking technique to develop two floating point block LLL reduction algorithms,
the left-to-right block LLL (LRBLLL) reduction algorithm and the alternating
partition block LLL (APBLLL) reduction algorithm, and give the complexity
analysis of these two algorithms. We compare these two block LLL reduction
algorithms with the original LLL reduction algorithm (in floating point arithmetic)
and the partial LLL (PLLL) reduction algorithm in the literature in terms of CPU
run time, flops and relative backward errors. The simulation results show that the
CPU run times of the two block LLL reduction algorithms are shorter than that of
the partial LLL reduction algorithm and much shorter than that of the original LLL
reduction, even though the two block algorithms cost more flops than the partial
LLL reduction algorithm in some cases. The drawback of these two block algorithms
is that they may sometimes not be as numerically stable as the original and partial
LLL reduction algorithms. The parallelization of APBLLL is discussed.

TABLE OF CONTENTS

DEDICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   ii
ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . .   iii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   iv
ABRÉGÉ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   v
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   viii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   ix
1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   1
   1.1  Lattice Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . .   1
   1.2  Contributions and Organization of the Thesis . . . . . . . . . . . . .   4
2  Introduction to LLL Reduction Algorithms . . . . . . . . . . . . . . . . . .   7
   2.1  LLL Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   7
   2.2  Original LLL Reduction Algorithm . . . . . . . . . . . . . . . . . . .   8
        2.2.1  Size-Reductions . . . . . . . . . . . . . . . . . . . . . . . . .   9
        2.2.2  Permutations . . . . . . . . . . . . . . . . . . . . . . . . . . .   9
        2.2.3  Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . .   13
   2.3  Partial LLL Reduction Algorithm . . . . . . . . . . . . . . . . . . . .   16
        2.3.1  Householder QR Factorization with Minimum Column Pivoting .   17
        2.3.2  Partial Size-Reduction and Givens Rotation . . . . . . . . . . .   19
3  Block LLL Reduction Algorithms . . . . . . . . . . . . . . . . . . . . . . .   23
   3.1  Subroutines of Block LLL Reduction Algorithms . . . . . . . . . . . .   24
        3.1.1  Block Householder QR Factorization with Minimum Column Pivoting   24
        3.1.2  Block Size-Reduction . . . . . . . . . . . . . . . . . . . . . . .   32
        3.1.3  Local Partial LLL Reduction . . . . . . . . . . . . . . . . . . .   35
        3.1.4  Block Partial Size-Reduction . . . . . . . . . . . . . . . . . . .   39
   3.2  Left-to-Right Block LLL Reduction Algorithm . . . . . . . . . . . . .   41
        3.2.1  Partition and Block Operation . . . . . . . . . . . . . . . . . .   41
        3.2.2  Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . .   45
   3.3  Alternating Partition Block LLL Reduction Algorithm . . . . . . . . .   48
        3.3.1  Partition and Block Operation . . . . . . . . . . . . . . . . . .   48
        3.3.2  Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . .   53
   3.4  Simulation Results and Comparison of Algorithms . . . . . . . . . . .   55
4  Parallelization of Block LLL Reduction . . . . . . . . . . . . . . . . . . . .   71
   4.1  Parallel Methods for LLL Reduction . . . . . . . . . . . . . . . . . .   71
   4.2  A Parallel Block LLL Reduction Algorithm . . . . . . . . . . . . . . .   72
        4.2.1  Parallel Diagonal Block Reduction and Block Updating . . . . .   73
        4.2.2  Parallel Block Size-Reduction . . . . . . . . . . . . . . . . . .   73
   4.3  Performance Evaluation of Parallel Algorithm . . . . . . . . . . . . .   76
5  Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . .   80
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   83

LIST OF TABLES
Table                                                                        page

3–1  Complexity analysis of LRBLLL reduction algorithm . . . . . . . . . .   48

3–2  Complexity analysis of APBLLL reduction algorithm . . . . . . . . . .   55

LIST OF FIGURES
Figure                                                                       page

1–1  A lattice in 2 dimensions . . . . . . . . . . . . . . . . . . . . . . . .

3–1  Partition 1 of matrix R . . . . . . . . . . . . . . . . . . . . . . . . .   49

3–2  Partition 2 of matrix R . . . . . . . . . . . . . . . . . . . . . . . . .   49

3–3  Performance comparison for Case 1, Intel . . . . . . . . . . . . . . . .   61

3–4  Performance comparison for Case 2, Intel . . . . . . . . . . . . . . . .   62

3–5  Performance comparison for Case 3, Intel . . . . . . . . . . . . . . . .   63

3–6  Performance comparison for Case 2 with dimension 200, Intel . . . . .   64

3–7  Box plots of run time (left) and relative backward error (right) for
     Case 1 (top), Case 2 (middle), Case 3 (bottom) with dimension
     200, Intel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   65

3–8  Performance comparison for Case 1, AMD . . . . . . . . . . . . . . .   66

3–9  Performance comparison for Case 2, AMD . . . . . . . . . . . . . . .   67

3–10 Performance comparison for Case 3, AMD . . . . . . . . . . . . . . .   68

3–11 Performance comparison for Case 2 with dimension 200, AMD . . . .   69

3–12 Box plots of run time (left) and relative backward error (right) for
     Case 1 (top), Case 2 (middle), Case 3 (bottom) with dimension
     200, AMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   70

4–1  Task allocation for three processors (P1, P2, P3) . . . . . . . . . . . .   74

4–2  Approximating Parallel Simulation . . . . . . . . . . . . . . . . . . .   79

CHAPTER 1
Introduction

1.1 Lattice Reduction
A set L in the real vector space R^m is referred to as a lattice if there exists a set of linearly independent vectors b_1, b_2, ..., b_n ∈ R^m such that
\[
L = \sum_{j=1}^{n} \mathbb{Z} b_j = \Big\{ \sum_{j=1}^{n} z_j b_j \;\Big|\; z_j \in \mathbb{Z},\ 1 \le j \le n \Big\}.
\]
The set {b_1, b_2, ..., b_n} is a basis of the lattice L. The dimension of the lattice is defined to be n. The matrix B = [b_1, b_2, ..., b_n] is referred to as the lattice basis matrix which generates L; the lattice is also written as L(B).
Geometrically, a lattice can be viewed as the set of intersection points of an infinite grid, as shown in Figure 1–1. The lines of the grid do not need to be orthogonal to each other. The same lattice may have different bases. For example, in Figure 1–1, {b_1, b_2} is a basis of the lattice, and {c_1, c_2} is also a basis.
Suppose that we have two basis matrices B and C. If they generate the same lattice, i.e., L(B) = L(C), we say that B and C are equivalent. Two basis matrices B, C ∈ R^{m×n} are equivalent if and only if there exists a unimodular matrix Z ∈ Z^{n×n} (i.e., an integer matrix with determinant det(Z) = ±1) such that C = BZ, see [25, p. 4].
Lattice basis reduction transforms a given lattice basis into a basis with short and nearly orthogonal basis vectors. There are several kinds of lattice

Figure 1–1: A lattice in 2 dimensions


reductions based on different criteria for the resulting basis, such as the Gaussian reduction [12, Chapter 6.1], the Minkowski reduction [26, 27], the Korkine and Zolotarev (KZ) reduction [21] and the Lenstra, Lenstra and Lovász (LLL) reduction [22].
Lattice reduction is a powerful tool for solving many complex problems in mathematics and computer science, especially problems dealing with integers, such as integer programming [1, 20], factoring polynomials with rational coefficients [22], integer factoring [34] and cryptography [15].
The LLL reduction is the most popular lattice reduction. The LLL reduction algorithm given in [22] and its variants have polynomial time complexity. It is widely used for applications such as factoring polynomials [22], subset sum problems [37], digital communications [23, 24, 28, 29, 39], shortest vector problems (SVP) [25] and closest vector problems (CVP), which are also referred to as integer least squares (ILS) problems [2, 4, 9, 10, 17].
Generally, we can classify the LLL reduction algorithms into three categories.
The first category includes exact integer arithmetic LLL reduction algorithms with
both input and output bases being integral. For example, the original LLL algorithm
given in [22] is in this category.
The second category includes the algorithms such as those in [30, 35, 36], which use not only integer arithmetic but also floating point arithmetic. The input and output bases in this category are also integral. The reason to use floating point arithmetic is that integer arithmetic is expensive. The algorithms use sufficiently long floating point numbers to approximate the intermediate results, so that the rounding errors do not lead to an output basis which is not exactly LLL reduced.
The applications of the first and second categories include factoring polynomials [22], subset sum problems [37] and public-key cryptanalysis [15].
The third category includes floating point algorithms with both input and output
bases being real. This category applies to cases where exact integer arithmetic is not
required and where a nearly LLL reduced basis is acceptable, such as ILS problems
which arise in GPS, e.g., [9, 10, 11, 17, 43], and in multi-input multi-output (MIMO)
communications, e.g., [24, 42]. So an algorithm in this category does not require strict
floating point error control like algorithms in the second category. An algorithm in
category three is much more efficient than those in categories one and two.

1.2 Contributions and Organization of the Thesis

The goal of this thesis is to propose efficient and reliable floating point algorithms for the LLL reduction with real basis matrices by using the blocking technique [14, Chapter 5]. The algorithms are based on the original LLL reduction algorithm [22] and the partial LLL (PLLL) reduction algorithm [13].
The computation speed of a matrix algorithm is determined not only by the number of floating point operations involved, but also by the amount of memory traffic, i.e., the movement of data between memory and registers. The level 3 basic linear algebra subprograms (BLAS) are designed to reduce this data movement. The matrix-matrix operations implemented in level 3 BLAS make efficient reuse of data residing in cache or local memory to avoid excessive data movement. The blocking technique casts the algorithms in terms of matrix-matrix operations to permit efficient reuse of data.
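As a toy illustration of this idea (our own sketch, not from the thesis; the function names are assumptions), the same update can be written as a sequence of rank-1 (level-2) operations or as one matrix-matrix (level-3) product. The two forms compute identical results, but the blocked form touches the data far fewer times:

```python
import numpy as np

def update_columnwise(R, W, Y):
    # A sequence of rank-1 (level-2 BLAS) updates: each pass over the
    # loop re-reads all of R from memory.
    R = R.copy()
    for j in range(W.shape[1]):
        R -= np.outer(W[:, j], Y[:, j])
    return R

def update_blocked(R, W, Y):
    # The same update cast as a single matrix-matrix (level-3 BLAS)
    # product R - W Y^T, which reuses data held in cache.
    return R - W @ Y.T

rng = np.random.default_rng(0)
R = rng.standard_normal((6, 6))
W = rng.standard_normal((6, 3))
Y = rng.standard_normal((6, 3))
```

Libraries dispatch the second form to an optimized `gemm` kernel, which is where the speed difference of blocked algorithms comes from.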
Two block LLL reduction algorithms utilizing this blocking technique are proposed in this thesis, together with their complexity analysis. Numerical simulations compare the performance of our block algorithms with the original LLL reduction algorithm and the PLLL reduction algorithm in terms of CPU time, flops and numerical stability. On average the block algorithms are computationally faster than PLLL and LLL, although their numerical stability in some cases may need improvement.
The parallelization of one of the two block LLL reduction algorithms is discussed in two parts: the parallelization of the block size-reduction and the parallelization of the diagonal block reduction. Complexity analysis shows that the parallelized size-reduction part can obtain a speedup of n_p in ideal cases, if n_p processors are used. The improvement of the parallelized diagonal block reduction part is hard to observe from the complexity analysis, since the complexity bound is too pessimistic. A simple test is designed to examine the performance of the parallelized diagonal block reduction part. The test result shows that the parallelized diagonal block reduction part can obtain a speedup of 4.8 with 5 processors in the best situations.
The rest of the thesis is organized as follows. In Chapter 2, we first give the
definition of the LLL reduction. Then a description of the original LLL reduction
algorithm in the matrix language is given, followed by its complexity analysis. In
the last section of this chapter, we introduce the partial LLL (PLLL) reduction
algorithm.
In Chapter 3, we first apply the blocking technique to the components of the
PLLL algorithm, leading to block subroutines. Then two block LLL algorithms are
proposed based on these block subroutines. We give the complexity analysis for the
block algorithms under the assumption of using exact arithmetic. Finally, simulation
results are presented, compared and discussed.
In Chapter 4, we first review the literature of parallel LLL algorithms. Then we
discuss the parallelization of one of our two block algorithms.
Chapter 5 gives conclusions and future work.
We now describe the notation to be used in the thesis. The sets of all real and integer m × n matrices are denoted by R^{m×n} and Z^{m×n}, respectively, and the sets of real and integer n-vectors are denoted by R^n and Z^n, respectively. Upper case letters are used to denote matrices and bold lower case letters are used to denote vectors. The identity matrix is denoted by I and its i-th column is denoted by e_i. MATLAB notation is used to denote a sub-matrix. Specifically, if A = (a_ij) ∈ R^{m×n}, then A(i, :) denotes the i-th row, A(:, j) denotes the j-th column, and A(i1:i2, j1:j2) denotes the sub-matrix formed by rows i1 to i2 and columns j1 to j2. For the (i, j) element of A, sometimes we use a_ij and sometimes we use A(i, j). For a block matrix A, A_ij denotes the (i, j) block. For a scalar z ∈ R, we use ⌊z⌉ to denote its nearest integer. If there is a tie, ⌊z⌉ denotes the one with smaller magnitude. det(A) is the determinant of A. Unless specified otherwise, ‖·‖ stands for the 2-norm, i.e., ‖a‖ = (aᵀa)^{1/2}, and ‖·‖_F stands for the Frobenius matrix norm, i.e., ‖A‖_F = (Σ_{i,j} a_ij²)^{1/2}.
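The tie-breaking rule for ⌊z⌉ differs from the round-half-even convention used by many languages. A minimal sketch of this rounding (our own illustration; the function name is an assumption):

```python
import math

def round_to_nearest(z):
    # Nearest integer to z; on a tie (fractional part exactly 0.5),
    # return the candidate with the smaller magnitude, matching the
    # thesis notation for ⌊z⌉ (Python's built-in round() instead
    # rounds half to even).
    f = math.floor(z)
    frac = z - f
    if frac < 0.5:
        return f
    if frac > 0.5:
        return f + 1
    return f if abs(f) < abs(f + 1) else f + 1
```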

CHAPTER 2
Introduction to LLL Reduction Algorithms

In this chapter we first give the definition of the Lenstra-Lenstra-Lovász (LLL) reduction. Then we introduce the original LLL reduction algorithm [22] and the partial LLL (PLLL) reduction algorithm [43], which are the bases of our new LLL reduction algorithms to be presented in later chapters.

2.1 LLL Reduction
The LLL reduction introduced in [22] can be described as a QRZ matrix factorization:
\[
B = Q \begin{bmatrix} R \\ 0 \end{bmatrix} Z^{-1} = Q_1 R Z^{-1},
\]
where B ∈ R^{m×n} is a given matrix with full column rank, Q = [Q_1, Q_2] ∈ R^{m×m} is orthogonal with Q_1 ∈ R^{m×n} and Q_2 ∈ R^{m×(m−n)}, Z ∈ Z^{n×n} is unimodular, and R ∈ R^{n×n} is upper triangular and satisfies two conditions:
\[
\left| \frac{r_{ij}}{r_{ii}} \right| \le \frac{1}{2}, \quad 1 \le i < j \le n, \tag{2.1}
\]
\[
\delta\, r_{i-1,i-1}^2 \le r_{ii}^2 + r_{i-1,i}^2, \quad 1 < i \le n, \tag{2.2}
\]
with the parameter δ ∈ (1/4, 1). The conditions Eq.(2.1) and Eq.(2.2) are named the size-reduction condition and the Lovász condition, respectively. The matrix BZ or the matrix R is said to be LLL reduced.
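For concreteness, the two conditions can be checked directly on a computed upper triangular R. The sketch below is our own illustration (the function name and tolerance are assumptions, not from the thesis):

```python
import numpy as np

def is_lll_reduced(R, delta=0.75, tol=1e-12):
    # Verify the size-reduction condition Eq.(2.1) and the Lovasz
    # condition Eq.(2.2) for an upper triangular matrix R.
    n = R.shape[1]
    for i in range(n):
        for j in range(i + 1, n):
            if abs(R[i, j]) > abs(R[i, i]) / 2 + tol:          # Eq.(2.1)
                return False
    for i in range(1, n):
        if delta * R[i-1, i-1]**2 > R[i, i]**2 + R[i-1, i]**2 + tol:  # Eq.(2.2)
            return False
    return True
```

A checker like this is handy in simulations for measuring how close a floating point result is to an exactly reduced basis.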

The LLL reduction algorithm in [22] is the most well known lattice basis reduction algorithm with polynomial time complexity; it was originally designed for factoring polynomials with rational coefficients using integer arithmetic operations. Later, the applications of the LLL reduction were widely extended to number theory (see, e.g., [34, 37]), cryptography (see, e.g., [15, 25]), integer programming (see, e.g., [1, 20]), digital communications (see, e.g., [24]), and GPS (see, e.g., [11, 17]). Some of these extended applications do not require an exactly LLL reduced integer basis, so floating point arithmetic is used to achieve better computational performance in such application areas. One example of the floating point LLL application is to compute a suboptimal solution (e.g., the Babai point [4]) or the optimal solution of an integer least squares (ILS) problem.
In the remainder of this chapter, the original LLL reduction algorithm and the PLLL reduction algorithm are introduced, and we assume they use floating point arithmetic.
2.2 Original LLL Reduction Algorithm


We will describe the original LLL reduction algorithm in the matrix language

(see [44, Algorithm 3.3.1] and [13, Algorithm 2.6.3]). The algorithm involves the
Gram-Schmidt orthogonalization (GSO), integer Gauss transformations (IGT), column permutations and orthogonal transformations. GSO is applied to find the QR
factors Q and R of the given matrix B. The column permutations and IGTs produce
the unimodular matrix Z.
In the original exact integer LLL reduction algorithm, a column scaled Q and a row scaled R with unit diagonal entries are computed by a variation of GSO to avoid square root computations. In the floating point LLL reduction algorithm in this thesis, the regular GSO is applied to B and gives the compact form of the QR factorization:
\[
B = Q_1 R,
\]
where Q_1 ∈ R^{m×n} has orthonormal columns, and R ∈ R^{n×n} is upper triangular.
After the GSO of B, integer Gauss transformations, column permutations and GSO are used to transform R into an LLL reduced basis. IGTs are used to perform size-reductions on the off-diagonal entries to achieve Eq.(2.1). The column permutations are used to order the columns to achieve Eq.(2.2). Since a column permutation destroys the upper triangular structure, GSO is used to recover the upper triangular structure.
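As an illustration of the GSO step (a standard textbook routine, not code from the thesis), classical Gram-Schmidt computes the compact QR factorization B = Q_1 R:

```python
import numpy as np

def gram_schmidt_qr(B):
    # Classical Gram-Schmidt orthogonalization: B = Q1 R, where Q1 has
    # orthonormal columns and R is upper triangular (B is assumed to
    # have full column rank).
    m, n = B.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = B[:, j].astype(float).copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ B[:, j]   # projection coefficient
            v -= R[i, j] * Q[:, i]        # remove component along q_i
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

B = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 2.0]])
Q1, R = gram_schmidt_qr(B)
```

In floating point practice the modified Gram-Schmidt variant or Householder transformations are preferred for numerical stability; the classical form is shown only because it matches the textbook definition.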
2.2.1 Size-Reductions

An integer matrix is called an IGT or an integer Gauss matrix if it has the following form:
\[
Z_{ij} = I_n - \zeta\, e_i e_j^T, \quad i \ne j,
\]
where ζ is an integer. Applying Z_{ij} to R from the right gives
\[
\bar{R} = R Z_{ij} = R - \zeta\, R e_i e_j^T.
\]
Thus R̄ is the same as R, except that r̄_{kj} = r_{kj} − ζ r_{ki} for k = 1, ..., i. By setting ζ = ⌊r_{ij}/r_{ii}⌉, the nearest integer to r_{ij}/r_{ii}, we ensure |r̄_{ij}| ≤ |r_{ii}|/2.
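A minimal sketch of one such transformation (our own illustration; the function name is an assumption). It subtracts ζ times column i from column j of R, and accumulates the same column operation in Z:

```python
import numpy as np

def apply_igt(R, Z, i, j):
    # Apply Z_ij = I - zeta * e_i e_j^T from the right: column j of R
    # loses zeta times column i. Only rows 1..i of R change, since R is
    # upper triangular; Z accumulates the unimodular transformation.
    zeta = np.rint(R[i, j] / R[i, i])
    R, Z = R.copy(), Z.copy()
    R[: i + 1, j] -= zeta * R[: i + 1, i]
    Z[:, j] -= zeta * Z[:, i]
    return R, Z

R0 = np.array([[2.0, 1.0, 3.2],
               [0.0, 1.5, 0.4],
               [0.0, 0.0, 1.0]])
R1, Z1 = apply_igt(R0, np.eye(3), 0, 2)
```

Note that `np.rint` rounds half to even at exact ties, whereas the thesis breaks ties toward the smaller magnitude; for illustration the difference is immaterial.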
2.2.2 Permutations

The column permutations are applied to achieve Eq.(2.2). Suppose that the Lovász condition is not satisfied for i = k; then a permutation matrix P_{k−1,k} is applied to interchange columns k−1 and k of R. After the permutation, the upper triangular structure of R is destroyed. An orthogonal transformation G_{k−1,k} using the GSO technique (see [22]) is performed to re-construct the upper triangular structure of R:
\[
\bar{R} = G_{k-1,k}\, R\, P_{k-1,k},
\]
where
\[
G_{k-1,k} = \begin{bmatrix} I_{k-2} & & \\ & G & \\ & & I_{n-k} \end{bmatrix}, \quad
G = \begin{bmatrix} c & s \\ -s & c \end{bmatrix}, \quad
c = \frac{r_{k-1,k}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}}, \quad
s = \frac{r_{kk}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}}.
\]
Only the columns k−1, k and the rows k−1, k of R are changed by this permutation and orthogonalization process. The diagonal and super-diagonal entries of R which are changed after the permutation and orthogonalization process become
\[
\bar{r}_{k-1,k-1} = \sqrt{r_{k-1,k}^2 + r_{kk}^2}, \quad
\bar{r}_{k-1,k} = \frac{r_{k-1,k-1}\, r_{k-1,k}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}}, \quad
\bar{r}_{k,k} = -\frac{r_{k-1,k-1}\, r_{kk}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}}.
\]
Thus, if δ r_{k−1,k−1}² > r_{kk}² + r_{k−1,k}² with δ ∈ (1/4, 1), then the above operations guarantee r̄_{k−1,k−1}² = r_{kk}² + r_{k−1,k}² < δ r_{k−1,k−1}².
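The permutation followed by the 2×2 orthogonal transformation can be sketched as follows (our own illustration in 0-based indexing; the function name is an assumption):

```python
import numpy as np

def swap_and_retriangularize(R, k):
    # Interchange columns k-1 and k (0-based), then apply the 2x2
    # transformation G from the left to rows k-1 and k so that R
    # becomes upper triangular again.
    R = R.copy()
    R[:, [k - 1, k]] = R[:, [k, k - 1]]
    a, b = R[k - 1, k - 1], R[k, k - 1]   # = old r_{k-1,k} and r_{k,k}
    d = np.hypot(a, b)
    G = np.array([[a / d, b / d], [-b / d, a / d]])
    R[k - 1 : k + 1, k - 1 :] = G @ R[k - 1 : k + 1, k - 1 :]
    return R

R0 = np.array([[3.0, 1.0], [0.0, 2.0]])
R1 = swap_and_retriangularize(R0, 1)
```

Because G is orthogonal, the Gram matrix of the permuted columns is unchanged, which is what makes the operation a valid basis transformation.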

Based on the above description of size-reductions and permutations, we now describe the procedure of the LLL reduction algorithm. The algorithm iterates a sequence of stages to satisfy the LLL reduced conditions, and it works on the columns of R from left to right. Define a column stage variable k which indicates that the first k−1 columns of R are LLL reduced at the current stage, i.e.,
\[
\left| \frac{r_{ij}}{r_{ii}} \right| \le \frac{1}{2}, \quad 1 \le i < j \le k-1, \tag{2.3}
\]
\[
\delta\, r_{i-1,i-1}^2 \le r_{ii}^2 + r_{i-1,i}^2, \quad 1 < i \le k-1. \tag{2.4}
\]
At the beginning, k is set to 2. During the reduction procedure, the value of k shifts between 2 and n+1 and changes by 1 in each step. At stage k, the algorithm first uses an integer Gauss transformation to reduce r_{k−1,k}. Then it checks whether it needs to permute the columns k−1 and k according to the Lovász condition. If δ r_{k−1,k−1}² > r_{kk}² + r_{k−1,k}², it performs the permutation, applies the corresponding orthogonal transformation, and moves back to stage k−1. Otherwise it reduces r_{i,k} (i = k−2, k−3, ..., 1) by IGTs and moves to the next stage k+1. When k reaches n+1, the conditions Eq.(2.1) and Eq.(2.2) are satisfied, the upper triangular matrix R is LLL reduced, and the algorithm stops. The algorithm is given as follows.
Algorithm 2.1. (LLL Reduction) Suppose B ∈ R^{m×n} has full column rank. This algorithm computes the LLL reduction B = Q_1 R Z^{−1}, where Q_1 has orthonormal columns, R is upper triangular and satisfies the LLL reduced criteria, and Z is unimodular.

function: [R, Z] = LLL(B)
1:  Apply GSO to obtain B = Q_1 R
2:  k := 2, Z := I_n
3:  while k ≤ n do
4:    if |r_{k−1,k} / r_{k−1,k−1}| > 1/2 then
5:      ζ := ⌊r_{k−1,k} / r_{k−1,k−1}⌉                 // Reduce r_{k−1,k}
6:      Z(1:n, k) := Z(1:n, k) − ζ Z(1:n, k−1)
7:      R(1:k−1, k) := R(1:k−1, k) − ζ R(1:k−1, k−1)
8:    end if
9:    if δ r_{k−1,k−1}² > r_{kk}² + r_{k−1,k}² then    // δ is a parameter chosen in (1/4, 1)
10:     Interchange columns Z(1:n, k) and Z(1:n, k−1)
11:     Interchange columns R(1:k, k) and R(1:k, k−1)
12:     Triangularize R: R := G_{k−1,k} R
13:     if k > 2 then
14:       k := k − 1
15:     end if
16:   else
17:     for i = k−2 : −1 : 1 do                        // Size-reduction
18:       ζ := ⌊r_{i,k} / r_{ii}⌉
19:       Z(1:n, k) := Z(1:n, k) − ζ Z(1:n, i)
20:       R(1:i, k) := R(1:i, k) − ζ R(1:i, i)
21:     end for
22:     k := k + 1
23:   end if
24: end while
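A compact floating point rendering of Algorithm 2.1 in 0-based indexing (our own sketch for illustration only: it operates on R and accumulates Z, uses round-half-even at exact ties, and omits the orthogonal factor):

```python
import numpy as np

def lll_reduce(R, delta=0.75):
    # Floating point LLL reduction of an upper triangular R with
    # nonzero diagonal; returns the reduced R and the unimodular Z.
    R = R.astype(float).copy()
    n = R.shape[1]
    Z = np.eye(n)
    k = 1                                    # 0-based counterpart of k = 2
    while k < n:
        zeta = np.rint(R[k - 1, k] / R[k - 1, k - 1])
        if zeta != 0:                        # reduce r_{k-1,k}
            R[:k, k] -= zeta * R[:k, k - 1]
            Z[:, k] -= zeta * Z[:, k - 1]
        if delta * R[k - 1, k - 1] ** 2 > R[k, k] ** 2 + R[k - 1, k] ** 2:
            # Lovasz condition fails: permute columns, re-triangularize
            R[:, [k - 1, k]] = R[:, [k, k - 1]]
            Z[:, [k - 1, k]] = Z[:, [k, k - 1]]
            a, b = R[k - 1, k - 1], R[k, k - 1]
            d = np.hypot(a, b)
            G = np.array([[a / d, b / d], [-b / d, a / d]])
            R[k - 1 : k + 1, k - 1 :] = G @ R[k - 1 : k + 1, k - 1 :]
            k = max(k - 1, 1)
        else:
            for i in range(k - 2, -1, -1):   # size-reduce the rest of column k
                zeta = np.rint(R[i, k] / R[i, i])
                if zeta != 0:
                    R[: i + 1, k] -= zeta * R[: i + 1, i]
                    Z[:, k] -= zeta * Z[:, i]
            k += 1
    return R, Z

R0 = np.array([[1.0, 2.7, 1.3],
               [0.0, 0.4, 2.1],
               [0.0, 0.0, 1.6]])
R2, Z = lll_reduce(R0)
```

At termination both Eq.(2.1) and Eq.(2.2) hold (up to rounding), and since only orthogonal transformations and unimodular column operations were applied, the Gram matrix of R0·Z equals that of the reduced R2.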

2.2.3 Complexity Analysis

Assume that the operations used in the algorithm are performed in exact arithmetic. The complexity of Algorithm 2.1 is measured by the number of arithmetic operations. Part of the results of the complexity analysis will be used in Chapter 3 and Chapter 4. The QR factorization by GSO takes O(mn²) arithmetic operations [16, Section 5.2]. Next, we analyze the complexity of the while loop in the LLL reduction algorithm. Adding the complexity of the QR factorization and the while loop together gives the complexity of the LLL reduction algorithm.
For the complexity of the while loop, we first determine the number of loop iterations and then count the number of arithmetic operations in each iteration.
Lemma 2.1 ([22]): Let β = max_j ‖b_j‖, and let α = min_{x ∈ Z^n \ {0}} ‖Bx‖ be the length of the shortest nonzero vector of the lattice L(B). The number of permutations involved in Algorithm 2.1 is bounded by O(n³ + n² log_{1/δ}(β/α)), and the algorithm converges.
Proof. We use the proof from [22] and [44, Chapter 3].
After the Gram-Schmidt QR factorization, we obtain the QR factors Q_1 and R in the QR factorization B = Q_1 R. Let R^{(p)} denote the upper triangular matrix R after the p-th permutation (R^{(0)} = R). Define the quantities w_i and Φ after the p-th permutation as
\[
w_i^{(p)} = \prod_{j=1}^{i} \big(r_{jj}^{(p)}\big)^2, \quad i = 1, 2, \ldots, n, \tag{2.5}
\]
and
\[
\Phi^{(p)} = \prod_{i=1}^{n} w_i^{(p)}. \tag{2.6}
\]
Suppose the p-th permutation is applied to columns q−1 and q of the matrix R^{(p−1)} and the orthogonal transformation by GSO is applied to keep the upper triangular structure as described in the algorithm. We obtain a matrix R^{(p)} with the following features:
\[
r_{jj}^{(p)} = r_{jj}^{(p-1)}, \quad j \ne q-1, q, \qquad
\big|r_{q-1,q-1}^{(p)}\, r_{qq}^{(p)}\big| = \big|r_{q-1,q-1}^{(p-1)}\, r_{qq}^{(p-1)}\big|.
\]
And by the permutation criterion (see line 9 of Algorithm 2.1) obtained from Eq.(2.2), we have
\[
\big(r_{q-1,q-1}^{(p)}\big)^2 < \delta \big(r_{q-1,q-1}^{(p-1)}\big)^2.
\]
Then from Eq.(2.5) we obtain
\[
w_i^{(p)} = w_i^{(p-1)}, \quad i \ne q-1, \qquad
w_{q-1}^{(p)} \big/ w_{q-1}^{(p-1)} < \delta.
\]
Substituting them into Eq.(2.6) gives
\[
\Phi^{(p)} < \delta\, \Phi^{(p-1)}, \tag{2.7}
\]
which means that one permutation operation decreases Φ at least by a factor of δ. Assume that the algorithm involves a total of p permutations before convergence. From Eq.(2.7) it follows that
\[
\Phi^{(p)} < \delta^p\, \Phi^{(0)},
\]
or equivalently
\[
p < \log_{1/\delta} \frac{\Phi^{(0)}}{\Phi^{(p)}}
  = \log_{1/\delta} \Phi^{(0)} - \log_{1/\delta} \Phi^{(p)}
  = \log_{1/\delta} \prod_{i=1}^{n} w_i^{(0)} - \log_{1/\delta} \prod_{i=1}^{n} w_i^{(p)}. \tag{2.8}
\]
Since β = max_j ‖b_j‖ and ‖b_j‖² ≥ (r_{jj}^{(0)})², we have (r_{jj}^{(0)})² ≤ β² (j = 1, 2, ..., n). Thus from Eq.(2.5),
\[
w_i^{(0)} \le \beta^{2i}. \tag{2.9}
\]
By Theorem I of [7, Chapter II],
\[
\min_{x \in \mathbb{Z}^n \setminus \{0\}} \|Bx\|^2 \le \Big(\frac{4}{3}\Big)^{(n-1)/2} \big(\det(B^T B)\big)^{1/n}. \tag{2.10}
\]
For any x ∈ Z^n, we can define x̄ = (Z^{(p)})^{−1} x, where Z^{(p)} denotes the unimodular matrix Z after the p-th permutation (Z^{(0)} = I_n). Define B̄^{(p)} = B Z^{(p)} = Q_1^{(p)} R^{(p)}. From Eq.(2.10) we have, for each i,
\[
\alpha^2 = \min_{x \in \mathbb{Z}^n \setminus \{0\}} \|Bx\|^2
         = \min_{\bar{x} \in \mathbb{Z}^n \setminus \{0\}} \|\bar{B}^{(p)} \bar{x}\|^2
         \le \min_{\bar{x}(1:i) \in \mathbb{Z}^i \setminus \{0\}} \|\bar{B}^{(p)}(:, 1\!:\!i)\, \bar{x}(1\!:\!i)\|^2
\]
\[
\le \Big(\frac{4}{3}\Big)^{(i-1)/2} \big|\det\big(\bar{B}^{(p)}(:, 1\!:\!i)^T \bar{B}^{(p)}(:, 1\!:\!i)\big)\big|^{1/i}
 = \Big(\frac{4}{3}\Big)^{(i-1)/2} \big|\det\big(R^{(p)}(:, 1\!:\!i)^T R^{(p)}(:, 1\!:\!i)\big)\big|^{1/i}
 = \Big(\frac{4}{3}\Big)^{(i-1)/2} \big(w_i^{(p)}\big)^{1/i}
\]
(see Eq.(2.5)). Then it follows that
\[
w_i^{(p)} \ge (3/4)^{i(i-1)/2}\, \alpha^{2i}. \tag{2.11}
\]
Substituting Eq.(2.9) and Eq.(2.11) into Eq.(2.8) gives
\[
p < \log_{1/\delta} \prod_{i=1}^{n} \beta^{2i} - \log_{1/\delta} \prod_{i=1}^{n} (3/4)^{i(i-1)/2} \alpha^{2i}
  = (n+1)\, n \log_{1/\delta} \frac{\beta}{\alpha} + \frac{1}{6}(n^3 - n) \log_{1/\delta}(4/3).
\]
So Algorithm 2.1 involves at most O(n³ + n² log_{1/δ}(β/α)) permutations and the algorithm converges. ∎
We should note that the bound on the number of permutations from the lemma holds for all LLL reduction algorithms that share the same permutation criterion with Algorithm 2.1.
In Algorithm 2.1, k is either increased or decreased by 1 in the while loop. Since each iteration in which k is decreased must involve a column permutation, there are p iterations in which k is decreased. The algorithm starts from k = 2 and ends when k = n + 1, so the number of iterations in which k is increased equals p + n − 1. Thus there are 2p + n − 1 iterations in total, which is bounded by O(n³ + n² log_{1/δ}(β/α)). Each iteration costs O(n²) arithmetic operations in the worst situation. So the whole algorithm takes at most O(mn² + n⁵ + n⁴ log_{1/δ}(β/α)) arithmetic operations.
2.3 Partial LLL Reduction Algorithm

Recently the so-called effective LLL (ELLL) reduction was proposed by Ling and Howgrave-Graham [23], and later the so-called partial LLL (PLLL) reduction algorithm was developed by Xie, Chang and Borno [43]. Both algorithms are more efficient

than Algorithm 2.1. The ELLL reduction algorithm is essentially identical to Algorithm 2.1 after lines 17–21, which reduce the off-diagonal entries of R except the super-diagonal ones, are removed. It has lower computational complexity than LLL, while it has the same effect on the performance of the Babai integer point as LLL. It is shown algebraically in [43] that the size-reduction condition of the LLL reduction has no effect on a typical sphere decoding (SD) search process for solving an integer least squares (ILS) problem; thus it has no effect on the performance of the Babai integer point, the first integer point found in the search process. The PLLL reduction was proposed to avoid the numerical stability problem with ELLL, and to avoid some unnecessary size-reductions involved in LLL and ELLL. Both PLLL and ELLL can compute LLL reduced bases by adding an extra size-reduction procedure at the end of the algorithms. The following part gives a description of the PLLL reduction.
2.3.1

Householder QR Factorization with Minimum Column Pivoting

The typical LLL algorithm first finds the QR factorization of the given matrix
B. In the original LLL algorithm, the Gram-Schmidt method is adopted for computing
the QR factorization. However, the Householder method without forming the
orthogonal factor Q, which costs (4/3)mn^2 flops, is more efficient than the Gram-Schmidt
method, which costs 2mn^2 flops [16]. The Householder method requires square root
operations, so it is not suitable for the exact integer LLL reduction. The floating
point LLL reduction, however, has no problem with computing a square root, so it can use
Householder transformations to compute the QR factorization.
The PLLL reduction uses the Householder QR factorization with minimum column
pivoting (QRMCP) instead of the classic Householder QR factorization. In
general, the number of permutations is a crucial factor in the cost of the whole LLL
reduction process. If one can make the upper triangular factor close to an LLL
reduced one in the QR factorization stage, the number of permutations in the later
stage is likely to decrease. The minimum column pivoting strategy is used to help
achieve the Lovasz condition, see [44, Section 4.1].
From Eq.(2.1) and Eq.(2.2), we can easily obtain

    (δ − 1/4) r_{i−1,i−1}^2 ≤ r_{ii}^2,    1 < i ≤ n,    δ ∈ (1/4, 1).    (2.12)

The Householder QR factorization upper-triangularizes the matrix B column by
column, with the column index i increasing from 1 to n. In order to make the
matrix R more likely to satisfy Eq.(2.12), the minimum column pivoting strategy
chooses a column permutation such that |r_ii| is the smallest in the i-th step. In the
i-th step of the QR factorization, the QRMCP finds the column in B(i : m, i : n) with
the minimum 2-norm, and interchanges the whole column with the i-th column of
B. After this the QRMCP eliminates the off-diagonal entries B(i + 1 : m, i) by a
Householder transformation H_i. By using the minimum column pivoting strategy,
the Householder QR factorization becomes




    BP = Q [ R ]  = [ Q1, Q2 ] [ R ]  = Q1 R,    (2.13)
            [ 0 ]              [ 0 ]

where P ∈ R^{n×n} is a permutation matrix, R ∈ R^{n×n} is upper triangular,
Q = [Q1, Q2] ∈ R^{m×m} is orthogonal with Q1 ∈ R^{m×n}, and Q^T = H_n H_{n−1} · · · H_1 is the product
of n Householder transformations.
The algorithm is given as follows.

Algorithm 2.2. (Householder QR Factorization with Minimum Column Pivoting)
Suppose B ∈ R^{m×n} has full column rank. This algorithm computes the QRMCP
factorization B = Q1 R P^T, where Q1 has orthonormal columns, R is upper triangular
and P is a permutation matrix.

function: [R, P] = QRMCP(B)
1:  P := I_n
2:  l_j := ||B(1 : m, j)||_2^2, j = 1 : n
3:  for i = 1 : n do
4:    q := arg min_{i≤j≤n} l_j
5:    if q > i then
6:      Interchange columns B(1 : m, i) and B(1 : m, q)
7:      Interchange columns P(1 : n, i) and P(1 : n, q)
8:    end if
9:    Compute the Householder transformation H_i which zeros B(i + 1 : m, i)
10:   B := H_i B
11:   l_j := l_j − B(i, j)^2, j = i + 1, i + 2, . . . , n
12: end for
13: R := B(1 : n, 1 : n)
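The steps of Algorithm 2.2 can be sketched in NumPy as follows. This is an illustrative sketch, not the thesis implementation; the function and variable names are ours, and Q is not formed, matching the algorithm.

```python
import numpy as np

def qrmcp(B):
    """Householder QR with minimum column pivoting (sketch of Algorithm 2.2):
    returns R (n x n upper triangular) and P with B @ P = Q1 @ R."""
    B = np.array(B, dtype=float)
    m, n = B.shape
    perm = np.arange(n)
    l = np.sum(B * B, axis=0)              # squared column norms
    for i in range(n):
        q = i + int(np.argmin(l[i:]))      # pivot: minimum remaining norm
        if q > i:
            B[:, [i, q]] = B[:, [q, i]]
            perm[[i, q]] = perm[[q, i]]
            l[[i, q]] = l[[q, i]]
        x = B[i:, i]
        # Householder vector chosen to avoid cancellation in x[0]
        alpha = -np.linalg.norm(x) if x[0] >= 0 else np.linalg.norm(x)
        u = x.copy()
        u[0] -= alpha
        tau = 2.0 / (u @ u)
        B[i:, i:] -= np.outer(u, tau * (u @ B[i:, i:]))
        l[i + 1:] -= B[i, i + 1:] ** 2     # downdate squared norms
    P = np.zeros((n, n))
    P[perm, np.arange(n)] = 1.0
    return np.triu(B[:n, :]), P
```

Note that, because of the pivoting rule, |r_11| equals the smallest 2-norm among the columns of B.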

2.3.2 Partial Size-Reduction and Givens Rotation

After the QRMCP, the PLLL reduction performs permutations, IGTs and Givens
rotations on R in an efficient and numerically stable way. At the k-th column of R,
PLLL checks whether columns k and k − 1 need to be permuted according to the
Lovasz condition Eq.(2.2). If the Lovasz condition holds, then the permutation does not
occur, no IGT is applied, and the algorithm moves to column k + 1. If the
Lovasz condition does not hold, r_{k−1,k} is reduced by an IGT; IGTs are also applied to
r_{k−2,k}, . . . , r_{1,k} for stability considerations. Then PLLL performs the permutation and
the Givens rotation, and moves back to the previous column.
In PLLL, Givens rotations are used to perform the triangularization after permutations,
instead of the GSO updates in line 12 of Algorithm 2.1. Define the Givens rotation matrix as

    G = [  c  s ]
        [ −s  c ],

where

    c = r_{k−1,k} / sqrt(r_{k−1,k}^2 + r_{kk}^2),    s = r_{kk} / sqrt(r_{k−1,k}^2 + r_{kk}^2),

which is used in the following transformation:

    [  c  s ] [ r_{k−1,k}  r_{k−1,k−1} ]   [ r̄_{k−1,k−1}  r̄_{k−1,k} ]
    [ −s  c ] [ r_{kk}     0           ] = [ 0             r̄_{kk}    ].
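The column swap followed by the 2×2 Givens update can be sketched as follows (NumPy, 0-based indexing; the helper name is ours):

```python
import numpy as np

def givens_triangularize(R, k):
    """Apply the 2x2 Givens rotation that re-triangularizes columns
    k-1 and k of R after they have been swapped (k is a 0-based index).
    Modifies R in place and returns G, so Q can be accumulated if needed."""
    a, b = R[k - 1, k - 1], R[k, k - 1]   # entries to rotate into (r, 0)
    r = np.hypot(a, b)
    c, s = a / r, b / r
    G = np.array([[c, s], [-s, c]])
    # Only rows k-1, k and columns k-1 onward are affected
    R[k - 1:k + 1, k - 1:] = G @ R[k - 1:k + 1, k - 1:]
    return G
```

After the call, the subdiagonal entry R[k, k−1] introduced by the swap is zero and G is orthogonal, so the lattice Gram matrix is unchanged.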
The PLLL algorithm is given as follows.
Algorithm 2.3. (PLLL Reduction) Suppose B ∈ R^{m×n} has full column rank. This
algorithm computes the PLLL reduction of B: B = Q1 R Z^{−1}, where Q1 has orthonormal
columns, R is upper triangular and Z is unimodular. It computes IGTs only
when a column permutation occurs.

function: [R, Z] = PLLL(B)
1:  Compute [R, P] = QRMCP(B)
2:  Set Z := P, k := 2
3:  while k ≤ n do
4:    ζ := ⌊r_{k−1,k}/r_{k−1,k−1}⌉
5:    ᾱ := r_{k−1,k} − ζ r_{k−1,k−1}    // δ is a parameter chosen in (1/4, 1)
6:    if δ r_{k−1,k−1}^2 > ᾱ^2 + r_{kk}^2 then
        // Size-reduce R(1 : k − 1, k)
7:      for l = k − 1 : −1 : 1 do
8:        ζ := ⌊r_{l,k}/r_{ll}⌉
9:        R(1 : l, k) := R(1 : l, k) − ζ R(1 : l, l)
10:       Z(1 : n, k) := Z(1 : n, k) − ζ Z(1 : n, l)
11:     end for
        // Column permutation and updating
12:     c := r_{k−1,k} / sqrt(r_{k−1,k}^2 + r_{kk}^2)
13:     s := r_{kk} / sqrt(r_{k−1,k}^2 + r_{kk}^2)
14:     G := [c, s; −s, c]
15:     Interchange columns Z(1 : n, k) and Z(1 : n, k − 1)
16:     Interchange columns R(1 : n, k) and R(1 : n, k − 1)
17:     R(k − 1 : k, k − 1 : n) := G R(k − 1 : k, k − 1 : n)
18:     if k > 2 then
19:       k := k − 1
20:     end if
21:   else
22:     k := k + 1
23:   end if
24: end while
Notice that the final matrix R obtained by the PLLL reduction algorithm is
not fully size-reduced, since the algorithm performs size-reductions only when they are
immediately followed by a permutation. However, we can easily add an extra size-reduction
procedure at the end of the PLLL reduction algorithm to transform R into an LLL
reduced matrix. We name the PLLL algorithm with an extra size-reduction procedure PLLL+.
The PLLL reduction algorithm uses the same permutation criterion as the LLL
reduction algorithm, so it has the same upper bound on permutations/loops as the
LLL reduction algorithm, which is O(n^3 + n^2 log_{1/δ}(α/β)).
For each loop, the PLLL reduction algorithm performs O(n^2) arithmetic operations
in the worst case. The Householder QR factorization costs O(mn^2) flops [16, Section 5.2]. So
the PLLL algorithm takes at most O(mn^2 + n^5 + n^4 log_{1/δ}(α/β)) arithmetic operations,
which is the same as the complexity bound of the LLL reduction algorithm. The
simulation results of PLLL in [43] show that it is faster and more stable than the
LLL reduction.
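The main loop of Algorithm 2.3 can be sketched compactly in NumPy, operating directly on an upper triangular R (the QRMCP step and the Q factor are omitted for brevity; the function name and the δ default are our choices):

```python
import numpy as np

def plll(R, delta=0.75):
    """Sketch of the PLLL loop (Algorithm 2.3) on an upper triangular R.
    Returns (R, Z) with Z unimodular, R upper triangular, and
    R_new = Q^T R_old Z for some orthogonal Q that is not formed."""
    R = np.array(R, dtype=float)
    n = R.shape[1]
    Z = np.eye(n, dtype=np.int64)
    k = 1                                        # 0-based, i.e. column 2
    while k < n:
        zeta = np.rint(R[k - 1, k] / R[k - 1, k - 1])
        alpha = R[k - 1, k] - zeta * R[k - 1, k - 1]
        if delta * R[k - 1, k - 1] ** 2 > alpha ** 2 + R[k, k] ** 2:
            # size-reduce column k against columns k-1, ..., 0
            for l in range(k - 1, -1, -1):
                zeta = np.rint(R[l, k] / R[l, l])
                R[: l + 1, k] -= zeta * R[: l + 1, l]
                Z[:, k] -= int(zeta) * Z[:, l]
            # swap columns and re-triangularize with a Givens rotation
            R[:, [k - 1, k]] = R[:, [k, k - 1]]
            Z[:, [k - 1, k]] = Z[:, [k, k - 1]]
            a, b = R[k - 1, k - 1], R[k, k - 1]
            r = np.hypot(a, b)
            G = np.array([[a / r, b / r], [-b / r, a / r]])
            R[k - 1:k + 1, k - 1:] = G @ R[k - 1:k + 1, k - 1:]
            k = max(k - 1, 1)                    # move back (but not past column 2)
        else:
            k += 1
    return R, Z
```

Since only orthogonal transformations and unimodular column operations are applied, the Gram matrix satisfies R_newᵀ R_new = Zᵀ R_oldᵀ R_old Z, which is a convenient correctness check.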


CHAPTER 3
Block LLL Reduction Algorithms
The blocking technique has been widely used to speed up conventional matrix
algorithms on today's high performance computers. The key to achieving high
performance on computers with a memory hierarchy is to recast the algorithms
in terms of matrix-vector and matrix-matrix operations to permit efficient reuse of
data residing in cache or local memory. The blocking technique partitions a
big matrix into small blocks, and performs matrix-matrix operations implemented
in level 3 basic linear algebra subprograms (BLAS) as much as possible [14]. The
matrix-matrix operations implemented in level 3 BLAS are more efficient than the
matrix-vector operations implemented in level 2 BLAS or the vector-vector operations
implemented in level 1 BLAS. The level 3 BLAS can maximally reduce the movement
of data between memory and registers, which can be as costly as the arithmetic
operations on the data in matrix algorithms.
In this chapter, we first explain how to apply the blocking technique to the
components of the partial LLL (PLLL) reduction algorithm. Then we propose two block
LLL reduction algorithms with different matrix partition strategies, and compare
their speed and stability with the original LLL reduction algorithm and the PLLL
reduction algorithm introduced in Chapter 2.


3.1 Subroutines of Block LLL Reduction Algorithms


In this section we describe a block QR factorization algorithm, a block
size-reduction algorithm named BSR, a variant of the PLLL reduction algorithm named
Local-PLLL, and a block partial size-reduction algorithm named BPSR. They will
be used as subroutines of the block LLL reduction algorithms. Local-PLLL is suited to
computing the PLLL reduction of blocks of the basis matrix. The block partial
size-reduction algorithm uses the efficient size-reduction strategy proposed in the PLLL
reduction algorithm.
3.1.1 Block Householder QR Factorization with Minimum Column Pivoting

In order to design a block Householder QR factorization by means of level 3
BLAS, Schreiber and Van Loan [38] proposed a storage-efficient WY representation
for the product of Householder transformations. Later Quintana-Orti, Sun and
Bischof [32] proposed a level 3 BLAS version of the QR factorization with maximum
column pivoting in order to get a rank-revealing factorization. Based on their work,
we give the block QR factorization algorithm with minimum column pivoting in this
section.
Given a real full column rank matrix B ∈ R^{m×n}, the Householder QR factorization
with minimum column pivoting gives

    BP = Q [ R ]  = [ Q1, Q2 ] [ R ]  = Q1 R,    (3.1)
            [ 0 ]              [ 0 ]

where Q = [Q1, Q2] ∈ R^{m×m} is orthogonal with Q1 ∈ R^{m×n}, R ∈ R^{n×n} is upper
triangular, and P ∈ Z^{n×n} is a permutation matrix. The orthogonal matrix Q is the product
of n Householder transformations:

    Q^T = H_n · · · H_2 H_1,    (3.2)

    H_i = I_m − τ_i u_i u_i^T,    i = 1, 2, . . . , n,    (3.3)

where τ_i = 2/(u_i^T u_i), u_i = [0; ū_i] ∈ R^m with ū_i ∈ R^{m−i+1} a Householder vector,
and H_i ∈ R^{m×m} is the Householder transformation matrix which zeros B(i + 1 : m, i).
The permutation matrix P is the product of n permutations:

    P = P_1 P_2 · · · P_n,

where P_i (i = 1, 2, . . . , n) is the permutation matrix which interchanges the i-th
column and another column in B(1 : m, i : n) such that the 2-norm of B(i : m, i) is
minimum.
In order to explain the block QR implementation, we define B^(i) as the value of
B after i Householder transformations and i permutations, i.e.,

    B^(i) = H_i · · · H_2 H_1 B P_1 P_2 · · · P_i,    (3.4)

with B^(0) = B. And we define B̃^(i) as B with only the i permutations applied, i.e.,

    B̃^(i) = B P_1 P_2 · · · P_i.    (3.5)

Here we want to point out that B^(i) is not formed in the i-th step of the block
algorithm; it is used only to explain the algorithm.


The storage-efficient WY representation [38] for the product of i Householder
transformations has the following format:

    H_i H_{i−1} · · · H_1 = ∏_{t=1}^{i} (I_m − τ_t u_t u_t^T) = I_m − Y_i T_i Y_i^T,    (3.6)

where

    Y_i = [u_1, u_2, . . . , u_i] ∈ R^{m×i}    (3.7)

is lower trapezoidal, and T_i ∈ R^{i×i} is lower triangular, given by the following recursion
formula:

    T_i = [ T_{i−1}  0   ]
          [ h_i^T    τ_i ],    h_i^T = −τ_i u_i^T Y_{i−1} T_{i−1} ∈ R^{1×(i−1)},

with the base case T_1 = τ_1.


Substituting Eq.(3.5) and Eq.(3.6) into Eq.(3.4), B^(i) can be expressed as

    B^(i) = (I_m − Y_i T_i Y_i^T) B̃^(i) = B̃^(i) − Y_i F_i^T,    (3.8)

where

    F_i^T = T_i Y_i^T B̃^(i) ∈ R^{i×n}.    (3.9)

It is easy to show that F_i^T can be computed by the recursion:

    F_1^T = τ_1 u_1^T B̃^(1),
    F_i^T = [ F_{i−1}^T P_i                                        ]
            [ τ_i u_i^T B̃^(i) − τ_i u_i^T Y_{i−1} F_{i−1}^T P_i ].    (3.10)
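The WY accumulation in Eq.(3.6), with the recursion for T_i, can be sketched and verified numerically as follows (NumPy; the helper name is ours, and the inputs are arbitrary Householder vectors u_t with τ_t = 2/(u_tᵀu_t)):

```python
import numpy as np

def wy_accumulate(us, taus):
    """Accumulate H_i ... H_1 = I - Y T Y^T via the storage-efficient
    WY representation (Eq. (3.6)); T is built with the recursion for T_i.
    us: list of Householder vectors u_t; taus: scalars tau_t."""
    Y = np.zeros((us[0].shape[0], 0))
    T = np.zeros((0, 0))
    for u, tau in zip(us, taus):
        if Y.shape[1] == 0:
            Y, T = u[:, None], np.array([[tau]])     # base case T_1 = tau_1
        else:
            h = -tau * (u @ Y) @ T                   # h_i^T = -tau_i u_i^T Y T
            T = np.block([[T, np.zeros((T.shape[0], 1))],
                          [h[None, :], np.array([[tau]])]])
            Y = np.column_stack([Y, u])
    return Y, T
```

Applying I − Y T Yᵀ then reproduces the explicit product of the reflectors, which is the property the block algorithm relies on.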

The block Householder QR factorization algorithm partitions the matrix B ∈
R^{m×n} into d blocks of size m × k (for simplicity we assume n = dk). The algorithm
deals with the blocks sequentially from left to right. Inside a block, k Householder
transformations are performed for the upper-triangularization, and are accumulated
into a single block transformation using the WY representation in Eq.(3.6).
Then the block transformation is applied to the other blocks of B by matrix-matrix
multiplication. Next we show how the block algorithm works.
In the first step, we first compute the squared column norms of B, denoted by l:

    l_j := ||B(1 : m, j)||_2^2,    j = 1, 2, . . . , n.

Utilizing l, a column of B with minimum 2-norm is permuted with the first column
by the permutation matrix P_1 (P_1 is not actually formed explicitly). Then we use
the Householder transformation H_1 to zero B(2 : m, 1). At this moment, unlike
Algorithm 2.2, we do not apply H_1 to the other columns of B. However, the first row of
B must be updated in order to downdate the squared column norms:

    l_j := l_j − B(1, j)^2,    j = 2, . . . , n,    (3.11)

which will be used in the next step for minimum column pivoting. In order to
update the first row, we form the following matrices (which are actually vectors)
using Eq.(3.6) and Eq.(3.10):

    Y_1 := u_1,    F_1^T(1, 2 : n) := τ_1 u_1^T B(1 : m, 2 : n).

Notice that B(1 : m, 2 : n) as stored in memory is equivalent to B̃^(1)(1 : m, 2 : n) given
in Eq.(3.10). From Eq.(3.8) the first row of B, except the first entry, is updated as
follows:

    B(1, 2 : n) := B(1, 2 : n) − Y_1(1, 1) F_1^T(1, 2 : n).

Then the squared column norms are downdated using Eq.(3.11). Thus at the end
of the first step, the first row and the first column have been updated, and the rest
of B will be updated later.
In the second step, utilizing the vector l of squared column norms, we apply
P_2 to permute the second column of B with a column, say column p, 2 ≤ p ≤ n, such
that the 2-norm of B(2 : m, 2) is minimum, and we permute the second column of
F_1^T with its p-th column (i.e., F_1^T := F_1^T P_2). Then from Eq.(3.8) the second column
B(2 : m, 2) is updated by the first Householder transformation H_1:

    B(2 : m, 2) := B(2 : m, 2) − Y_1(2 : m, 1) F_1^T(1, 2).

After this update, we apply the Householder transformation H_2 to zero B(3 : m, 2).
As in step 1, we do not use H_2 to update the remaining columns of B at this moment.
But we need to update the second row of B, because it will be used to compute the
2-norms of the columns of B(3 : m, 3 : n). In order to perform the update, Y_2 and F_2
are formed by accumulating H_2 into Y_1 and F_1 using Eq.(3.6) and Eq.(3.10):

    Y_2 := [Y_1, u_2],
    F_2^T(1 : 2, 3 : n) := [ F_1^T(1, 3 : n)                                              ]
                           [ τ_2 u_2^T B(1 : m, 3 : n) − τ_2 u_2^T Y_1 F_1^T(1, 3 : n) ].

Note that here F_1^T has been permuted by P_2. Then we update the second row of B
except the first two entries:

    B(2, 3 : n) := B(2, 3 : n) − Y_2(2, 1 : 2) F_2^T(1 : 2, 3 : n),

and downdate the squared column norms of B(3 : m, 3 : n):

    l_j := l_j − B(2, j)^2,    j = 3, . . . , n.

At the end of the second step, the first two rows and the first two columns have been
updated.
Now assume we are in the i-th step of transforming the first block of B into an
upper triangular matrix. The first (i − 1) columns of B have been triangularized and
the first (i − 1) rows have been updated, while the rest of the matrix B is waiting
to be updated. We first permute the i-th column with a column in B(1 : m, i : n)
such that the 2-norm of B(i : m, i) is minimum, and we permute the corresponding
columns of F_{i−1}^T (i.e., F_{i−1}^T := F_{i−1}^T P_i). Then we update the i-th column B(i : m, i)
by using the Householder transformations H_1, H_2, . . . , H_{i−1} as follows (see Eq.(3.8)):

    B(i : m, i) := B(i : m, i) − Y_{i−1}(i : m, 1 : i − 1) F_{i−1}^T(1 : i − 1, i).

Then the Householder transformation H_i is used to zero B(i + 1 : m, i), and is
accumulated into Y_i and F_i:

    Y_i := [Y_{i−1}, u_i],
    F_i^T(1 : i, i + 1 : n) := [ F_{i−1}^T(1 : i − 1, i + 1 : n)                                                          ]
                               [ τ_i u_i^T B(1 : m, i + 1 : n) − τ_i u_i^T Y_{i−1} F_{i−1}^T(1 : i − 1, i + 1 : n) ].

Then we update the i-th row B(i, i + 1 : n) and downdate the squared column norms:

    B(i, i + 1 : n) := B(i, i + 1 : n) − Y_i(i, 1 : i) F_i^T(1 : i, i + 1 : n),
    l_j := l_j − B(i, j)^2,    j = i + 1, . . . , n.

Now the first i columns and rows of B have been updated.


As shown above, the block algorithm updates one row and one column in
each step. At the end of the k-th step, we update the rest of B by using the
accumulated first k Householder transformations as follows:

    B(k + 1 : m, k + 1 : n) := B(k + 1 : m, k + 1 : n) − Y_k(k + 1 : m, 1 : k) F_k^T(1 : k, k + 1 : n).

At this point, the first k columns of B (i.e., the first block of B) have been
upper-triangularized, and the other columns of B have been updated. Then we can apply the
same procedure to triangularize the second block of B, and so on, until the final upper
triangular matrix is obtained.
The algorithm of block QR factorization with minimum column pivoting is given
as follows.


Algorithm 3.1. (Block Householder QR Factorization with Minimum Column
Pivoting) Suppose B ∈ R^{m×n} has full column rank, and k is the chosen block size,
which for simplicity is assumed to be a factor of n. This algorithm computes the QR
factorization Q1 R = BP, where Q1 has orthonormal columns and P is a permutation
matrix. Note that the matrix B is overwritten by R in the computation.

function: [R, P] = BQRMCP(B, k)
1:  P := I_n, m̄ := m, n̄ := n
2:  l_j := ||B(1 : m, j)||_2^2, j = 1 : n
3:  for j̄ = 1 : k : n do
4:    Y(1 : m̄, 1 : k) := 0, F(1 : n̄, 1 : k) := 0
5:    for j = 1 : k do
        // Permutation
6:      i := j̄ + j − 1, q := arg min_{i≤p≤n} l_p
7:      Interchange columns B(1 : m, i) and B(1 : m, q)
8:      Interchange columns P(1 : n, i) and P(1 : n, q)
9:      Interchange rows F(j, 1 : k) and F(q − j̄ + 1, 1 : k)
        // Update the i-th column
10:     B(i : m, i) := B(i : m, i) − Y(j : m̄, 1 : j − 1) F(j, 1 : j − 1)^T
11:     Zero B(i + 1 : m, i) by the Householder transformation H_i = I_m − τ_i u_i u_i^T
        // Accumulation of the block transformation
12:     Y(j : m̄, j) := u_i(i : m)
13:     F(j + 1 : n̄, j) := τ_i B(1 : m, i + 1 : n)^T u_i
14:     F(1 : n̄, j) := F(1 : n̄, j) − τ_i F(1 : n̄, 1 : j − 1) Y(j : m̄, 1 : j − 1)^T u_i(i : m)
        // Update the i-th row and downdate the norms
15:     B(i, i + 1 : n) := B(i, i + 1 : n) − Y(j, 1 : j) F(j + 1 : n̄, 1 : j)^T
16:     l(i + 1 : n) := l(i + 1 : n) − B(i, i + 1 : n) .∗ B(i, i + 1 : n)
17:   end for
      // Block transformation of the unprocessed part of the matrix
18:   B(i + 1 : m, i + 1 : n) := B(i + 1 : m, i + 1 : n) − Y(k + 1 : m̄, 1 : k) F(k + 1 : n̄, 1 : k)^T
19:   m̄ := m̄ − k, n̄ := n̄ − k
20: end for
21: R := B(1 : n, 1 : n)

Here we make a remark. In our implementation, we actually use an n-dimensional
vector to store the permutation matrix P, and we do not form P explicitly, for
efficiency.
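The permutation-vector representation mentioned in the remark can be sketched as follows (the helper name is ours); applying the permutation is then just a column gather, and the explicit matrix is only built if needed:

```python
import numpy as np

def perm_vector_to_matrix(p):
    """Expand a permutation index vector p into the explicit matrix P,
    so that B @ P equals the column gather B[:, p]."""
    n = len(p)
    P = np.zeros((n, n))
    P[p, np.arange(n)] = 1.0   # column j of P selects column p[j] of B
    return P
```

In practice one works with `B[:, p]` directly; the matrix form is useful mainly for checking identities such as BP = Q1 R.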
3.1.2 Block Size-Reduction

The idea of block size-reduction is to accumulate several IGTs into a block
update, so that the algorithm is rich in matrix-matrix operations. The size-reduction
of an upper triangular matrix R ∈ R^{n×n} can be described as

    U = RZ,

where Z ∈ Z^{n×n} is a unimodular matrix and U ∈ R^{n×n} is size-reduced, i.e.,
|u_{ij}| ≤ (1/2)|u_{ii}| (1 ≤ i < j ≤ n). The size-reduction algorithm repeatedly applies IGTs to R.
Z is the product of a sequence of IGTs, which have the form I_n − ζ e_i e_j^T (1 ≤ i <
j ≤ n), where ζ is an integer (see Section 2.2.1).


In a conventional size-reduction algorithm, the matrix R is reduced by applying
IGTs to all of its off-diagonal elements r_{ij}, where i = j − 1 : −1 : 1 and j = 2 : n. The
conventional size-reduction reduces the elements column by column, while the block
size-reduction reduces them block by block.
We partition R into d × d blocks of size k × k (assume n = dk for simplicity):

    [ U_11 · · · U_1d ]   [ R_11 · · · R_1d ] [ Z_11 · · · Z_1d ]
    [       ⋱    ⋮    ] = [       ⋱    ⋮    ] [       ⋱    ⋮    ]
    [            U_dd ]   [            R_dd ] [            Z_dd ].

The blocks of R are size-reduced in an order similar to the order used in the
conventional size-reduction: the blocks R_{ij} are reduced by IGTs in the order
i = j : −1 : 1 and j = 1 : d.
Let us use an example to illustrate the block size-reduction procedure. Assume
d = 2; the following block size-reduction is desired:

    [ U_11 U_12 ]   [ R_11 R_12 ] [ Z_11 Z_12 ]
    [      U_22 ] = [      R_22 ] [      Z_22 ].    (3.12)

The blocks R_11, R_22 and R_12 are reduced one by one in 3 steps.
In step 1, R_11 is reduced by applying IGTs to it as shown in Section 2.2.1, i.e.,
U_11 = R_11 Z̄_11, where U_11 is size-reduced and Z̄_11 is formed by these IGTs. Thus, after
step 1 the matrix R becomes

    [ U_11 R_12 ]   [ R_11 R_12 ] [ Z̄_11 0   ]
    [      R_22 ] = [      R_22 ] [      I_k ].    (3.13)

In step 2, R_22 is reduced by applying IGTs to it, i.e., U_22 = R_22 Z̄_22, where U_22 is
size-reduced and Z̄_22 is formed by these IGTs. Thus, the matrix R becomes

    [ U_11 R̄_12 ]   [ U_11 R_12 ] [ I_k  0    ]
    [      U_22  ] = [      R_22 ] [      Z̄_22 ],    (3.14)

where R̄_12 = R_12 Z̄_22. Notice that R̄_12 can be obtained by a matrix-matrix
operation.
In step 3, we perform size-reductions on R̄_12, which involve U_11, i.e.,

    [ U_11 U_12 ]   [ U_11 R̄_12 ] [ I_k  Z̄_12 ]
    [      U_22 ] = [      U_22  ] [      I_k  ],    (3.15)

where all the entries of U_12 are size-reduced. Therefore, from Eq.(3.13), Eq.(3.14)
and Eq.(3.15), Eq.(3.12) holds, where

    [ Z_11 Z_12 ]   [ Z̄_11 0   ] [ I_k  0    ] [ I_k  Z̄_12 ]
    [      Z_22 ] = [      I_k ] [      Z̄_22 ] [      I_k  ].

Notice that Z_12 = Z̄_11 Z̄_12 can be obtained by a matrix-matrix operation.
The block size-reduction algorithm is given as follows.
Algorithm 3.2. (Block Size-Reduction) Given an upper triangular matrix R ∈ R^{n×n}
and a block size k, this algorithm computes a size-reduced matrix U = RZ, where
U is upper triangular and Z is unimodular. In the computation, the matrix R is
overwritten by U. We use A_{i1:i2,j} to denote the sub-matrix formed by block rows i1
to i2 in the j-th block column of A.

function: [U, Z] = BSR(R, k)
1:  Z := I_n, d := n/k
2:  for j = 1 : d do
3:    for i = j : −1 : 1 do
4:      if i = j then
5:        Perform size-reductions on R_{ii}: R_{ii} := R_{ii} Z̄_{ii}
6:        Update R_{1:i−1,i} := R_{1:i−1,i} Z̄_{ii} and Z_{1:i,i} := Z_{1:i,i} Z̄_{ii}
7:      else
8:        Perform size-reductions on R_{ij}: R_{ij} := R_{ij} + R_{ii} Z̄_{ij}
9:        Update R_{1:i−1,j} := R_{1:i−1,j} + R_{1:i−1,i} Z̄_{ij}
10:       Update Z_{1:i,j} := Z_{1:i,j} + Z_{1:i,i} Z̄_{ij}
11:     end if
12:   end for
13: end for
14: U := R
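A NumPy sketch of Algorithm 3.2 follows. The IGTs inside a block touch only that block's rows; the rows above are updated afterwards by a deferred matrix-matrix product, which is the point of the blocking. Names and loop structure are our own:

```python
import numpy as np

def bsr(R, k):
    """Sketch of block size-reduction: returns (U, Z) with U = R @ Z
    size-reduced and Z unimodular. Assumes R is upper triangular with
    nonzero diagonal and that n is a multiple of the block size k."""
    R = np.array(R, dtype=float)
    n = R.shape[0]
    Z = np.eye(n, dtype=np.int64)
    for jb in range(n // k):                   # block columns, left to right
        J = slice(jb * k, (jb + 1) * k)
        for ib in range(jb, -1, -1):           # block rows, bottom to top
            I = slice(ib * k, (ib + 1) * k)
            diag = ib == jb
            Zb = np.eye(k, dtype=np.int64) if diag else np.zeros((k, k), np.int64)
            for jc in range(k):                # IGTs inside block (ib, jb);
                hi = jc if diag else k         # only block-row-I entries touched
                for ic in range(hi - 1, -1, -1):
                    i, j = ib * k + ic, jb * k + jc
                    zeta = int(np.rint(R[i, j] / R[i, i]))
                    if zeta:
                        R[I, j] -= zeta * R[I, i]
                        if diag:
                            Zb[:, jc] -= zeta * Zb[:, ic]
                        else:
                            Zb[ic, jc] -= zeta
            A = slice(0, ib * k)               # deferred matrix-matrix updates
            if diag:
                R[A, J] = R[A, J] @ Zb
                Z[:, J] = Z[:, J] @ Zb
            else:
                R[A, J] += R[A, I] @ Zb
                Z[:, J] += Z[:, I] @ Zb
    return R, Z
```

In a real implementation the deferred updates are the level 3 BLAS calls; the inner scalar loops are confined to one k × k block, which is the cache-friendly part.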

3.1.3 Local Partial LLL Reduction

We assume k in Section 3.1.2 is even. The upper triangular matrix R ∈ R^{n×n} is
partitioned into d′ × d′ blocks of size k′ × k′, where d′ = 2d and k′ = k/2:

    R = [ R_11 · · · R_{1d′}  ]
        [       ⋱    ⋮        ]  ∈ R^{n×n},    R_{ij} ∈ R^{k′×k′},    1 ≤ i ≤ j ≤ d′.
        [            R_{d′d′} ]

Define R_local ∈ R^{k×k} as a diagonal sub-matrix of R:

    R_local = [ R_{ii}  R_{i,i+1}   ]
              [ 0       R_{i+1,i+1} ],    1 ≤ i ≤ d′ − 1.

The Local-PLLL reduction computes the PLLL reduction of R_local:

    R_local = Q_l R_l Z_l^{−1},

where Q_l ∈ R^{k×k} is orthogonal, Z_l ∈ Z^{k×k} is unimodular and R_l ∈ R^{k×k} is PLLL
reduced.
The Local-PLLL reduction algorithm is a variant of the PLLL reduction algorithm
described in Section 2.3. Since the Local-PLLL reduction is applied to a
sub-matrix R_local instead of the whole matrix, four modifications are made to PLLL
in order to suit the structure of R_local.
First, R_local is already upper triangular. Thus, the initial QR factorization in
PLLL is not needed in Local-PLLL.
Second, in PLLL, IGTs, if required, are applied to all the entries in a column of R
for stability considerations. In Local-PLLL, R_local is part of the matrix R,
and the Local-PLLL reduction algorithm can only access the columns of R_local, which
are parts of the columns of R. For stability considerations, if some columns of R_local
are size-reduced by Local-PLLL, the other parts of these columns in R should also be
size-reduced. So the Local-PLLL subroutine stores the indexes of the columns which
are size-reduced by IGTs in a vector c, and returns c in order to reduce the other
parts of those columns later.
Third, as a subroutine of the block LLL reduction algorithms, the first half of
R_local may already be PLLL reduced before Local-PLLL is applied (see details in Section
3.2). If the first half of R_local is PLLL reduced, it is more efficient to start the
Local-PLLL reduction algorithm with column k′ + 1 of R_local instead of column 2 as in PLLL.
Thus, a parameter f states whether the first half of R_local is PLLL reduced.
Fourth, the Local-PLLL reduction algorithm must form the orthogonal factor
Q_l of the PLLL reduction of R_local in order to update other blocks of R, while the
PLLL reduction algorithm does not form the orthogonal factor Q, for efficiency.
The Local-PLLL reduction algorithm is given as follows.
Algorithm 3.3. (Local-PLLL Reduction) Given an upper triangular matrix R_local ∈
R^{k×k} and a scalar f, this algorithm computes the PLLL reduction R_local = Q_l R_l Z_l^{−1},
and returns a vector c ∈ R^k storing the indexes of size-reduced columns. If f = 0, the
first half of R_local is not PLLL reduced.

function: [Q_l, R_l, Z_l, c] = Local-PLLL(R_local, f)
1:  if f = 0 then
2:    i := 2
3:  else
4:    i := k/2 + 1
5:  end if
6:  Q_l := I_k, Z_l := I_k, c := 0
7:  while i ≤ k do
8:    ζ := ⌊r_{i−1,i}/r_{i−1,i−1}⌉
9:    ᾱ := r_{i−1,i} − ζ r_{i−1,i−1}    // δ is a parameter chosen in (1/4, 1)
10:   if δ r_{i−1,i−1}^2 > ᾱ^2 + r_{ii}^2 then
        // Size-reduce R_local(1 : i − 1, i) and store the information in c
11:     c_i := 1
12:     for l = i − 1 : −1 : 1 do
13:       ζ := ⌊r_{l,i}/r_{ll}⌉
14:       R_local(1 : l, i) := R_local(1 : l, i) − ζ R_local(1 : l, l)
15:       Z_l(1 : k, i) := Z_l(1 : k, i) − ζ Z_l(1 : k, l)
16:     end for
        // Column permutation and updating
17:     c̄ := r_{i−1,i} / sqrt(r_{i−1,i}^2 + r_{ii}^2)
18:     s̄ := r_{ii} / sqrt(r_{i−1,i}^2 + r_{ii}^2)
19:     G := [c̄, s̄; −s̄, c̄]
20:     Interchange columns R_local(1 : k, i) and R_local(1 : k, i − 1)
21:     Interchange columns Z_l(1 : k, i) and Z_l(1 : k, i − 1)
22:     Interchange the values of c_i and c_{i−1}
23:     R_local(i − 1 : i, i − 1 : k) := G R_local(i − 1 : i, i − 1 : k)
24:     Q_l(1 : k, i − 1 : i) := Q_l(1 : k, i − 1 : i) G^T
25:     if i > 2 then
26:       i := i − 1
27:     end if
28:   else
29:     i := i + 1
30:   end if
31: end while
32: R_l := R_local

3.1.4 Block Partial Size-Reduction

A block partial size-reduction algorithm is designed to coordinate with the
Local-PLLL reduction algorithm. In BSR (Algorithm 3.2), all off-diagonal entries of the
upper triangular matrix are checked for IGTs. However, this is not the case for the
PLLL reduction, where the off-diagonal entries are reduced only when necessary.
More specifically, if an IGT is applied to a super-diagonal entry of R, other
IGTs are applied to the off-diagonal entries in the same column in order to prevent
producing large numbers which may cause numerical stability problems. Thus, only
the entries in the columns which are affected by IGTs in Local-PLLL need to be
reduced. Local-PLLL stores the information about those columns in the vector c, so
the block partial size-reduction (BPSR) algorithm reduces only those marked
columns by IGTs.
Given an upper triangular matrix R ∈ R^{n×n} which consists of d × d blocks (here
we do not assume that each block has the same size):

    R = [ R_11 · · · R_1d ]
        [       ⋱    ⋮    ],    1 ≤ i ≤ j ≤ d.
        [            R_dd ]

It has the sub-matrices

    R̂ = [ R_11 · · · R_{1,i−1}   ]          Ř = [ R_{1i}    ]
        [       ⋱    ⋮           ],              [ R_{2i}    ]
        [            R_{i−1,i−1} ]               [ ⋮         ]
                                                 [ R_{i−1,i} ],    1 < i ≤ d,    (3.16)

where Ř has k columns, R_{ī,i} with 1 < ī < i has k rows, and R_{1,i} may have either k
or k/2 rows.
Given a vector c ∈ Z^k whose entries are either one or zero: for j = 1 : k, if
c_j = 1 we perform size-reductions on column j of Ř by applying IGTs to it, which
involve R̂; if c_j = 0 we do nothing. After this, part of the entries of Ř are size-reduced
according to c:

    Ř := Ř + R̂ Z̄,

where Z̄, which is formed by those IGTs, has the same dimensions and block partition
as Ř:

    Z̄ = [ Z̄_{1i}    ]
        [ Z̄_{2i}    ]
        [ ⋮         ]
        [ Z̄_{i−1,i} ],    (3.17)

and [ I, Z̄ ; 0, I ] is unimodular.
The BPSR algorithm is given as follows.
Algorithm 3.4. (Block Partial Size-Reduction) Given the two sub-matrices Ř and R̂ in
Eq.(3.16) and a vector c, this algorithm size-reduces the columns of Ř: Ř := Ř + R̂ Z̄,
where Z̄ has the block partition in Eq.(3.17). We use A_{i1:i2,j} to denote the sub-matrix
formed by block rows i1 to i2 in the j-th block column of A.

function: [Ř, Z̄] = BPSR(Ř, R̂, c)
1:  for ī = i − 1 : −1 : 1 do
      // Partial size-reduction of R_{ī,i} by Z̄_{ī,i}, involving R_{ī,ī}
2:    for j = 1 : k do
3:      if c_j = 1 then
4:        Size-reduce R_{ī,i}(:, j): R_{ī,i}(:, j) := R_{ī,i}(:, j) + R_{ī,ī} Z̄_{ī,i}(:, j)
5:      end if
6:    end for
7:    Update R_{1:ī−1,i} := R_{1:ī−1,i} + R_{1:ī−1,ī} Z̄_{ī,i}
8:  end for
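Algorithm 3.4 can be sketched as follows (NumPy; names are ours, and for simplicity all block rows are assumed to have size k, whereas the algorithm above also allows a k/2 top block):

```python
import numpy as np

def bpsr(Rup, Rhat, c, k):
    """Sketch of BPSR: size-reduce only the columns of Rup flagged in c,
    against the block upper triangular Rhat on the left. Returns the
    updated Rup and the integer Zbar with Rup_new = Rup + Rhat @ Zbar."""
    Rup = np.array(Rup, dtype=float)
    m, w = Rup.shape
    Zbar = np.zeros((m, w), dtype=np.int64)
    for ib in range(m // k - 1, -1, -1):       # block rows, bottom to top
        I = slice(ib * k, (ib + 1) * k)
        Zb = np.zeros((k, w), dtype=np.int64)
        for j in range(w):
            if not c[j]:
                continue                       # only columns flagged by Local-PLLL
            for ic in range(k - 1, -1, -1):
                i = ib * k + ic
                zeta = int(np.rint(Rup[i, j] / Rhat[i, i]))
                if zeta:
                    Rup[I, j] -= zeta * Rhat[I, i]
                    Zb[ic, j] -= zeta
        A = slice(0, ib * k)                   # deferred update of the rows above
        Rup[A, :] += Rhat[A, I] @ Zb
        Zbar[I, :] = Zb
    return Rup, Zbar
```

As in BSR, only one block row is touched by the scalar IGT loop; everything above it is handled by the deferred matrix-matrix product.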

3.2 Left-to-Right Block LLL Reduction Algorithm


In this section, we present a left-to-right block LLL (LRBLLL) reduction algo-

rithm utilizing the subroutines introduced in the previous section, i.e., the block QR
factorization (Algorithm 3.1), the block size-reduction (Algorithm 3.2), the LocalPLLL reduction algorithm (Algorithm 3.3) and the block partial size-reduction (Algorithm 3.4). The complexity analysis of LRBLLL is presented in the second part
of this section.
3.2.1 Partition and Block Operation

The left-to-right block LLL reduction algorithm combines the blocking technique
with the PLLL algorithm. It consists of the following seven steps.


Step 1. Compute the block QR factorization (Algorithm 3.1) of the full column
rank matrix B ∈ R^{m×n} with minimum column pivoting: BP = Q1 R.
Step 2. Partition the matrix R into d′ × d′ blocks of size k′ (here for simplicity,
we assume that n is a multiple of k′, i.e., n = d′k′, and that d′ is even; define k = 2k′
and d = d′/2):

    R = [ R_11 · · · R_{1d′}  ]
        [       ⋱    ⋮        ]  ∈ R^{n×n},    R_{ij} ∈ R^{k′×k′},    1 ≤ i ≤ j ≤ d′.
        [            R_{d′d′} ]

Initialize a block index i := 1.
Step 3. Compute the Local-PLLL reduction (Algorithm 3.3) of

    R_local = [ R_{ii}  R_{i,i+1}   ]
              [ 0       R_{i+1,i+1} ]:    R_local := Q_local^T R_local Z_local.

Step 4. Update the relevant blocks of R using block transformations:

    R_right := Q_local^T R_right,    R_up := R_up Z_local,

where

    R_right = [ R_{i,i+2}    R_{i,i+3}    · · ·  R_{i,d′}   ]
              [ R_{i+1,i+2}  R_{i+1,i+3}  · · ·  R_{i+1,d′} ],

    R_up = [ R_{1,i}    R_{1,i+1}   ]
           [ R_{2,i}    R_{2,i+1}   ]
           [ ⋮          ⋮           ]
           [ R_{i−1,i}  R_{i−1,i+1} ].

Step 5. Size-reduce R_up using the block partial size-reduction algorithm (Algorithm 3.4):

    R_up := R_up + [ R_11 · · · R_{1,i−1}   ]
                   [       ⋱    ⋮           ]  Z̄_update.
                   [            R_{i−1,i−1} ]

Step 6. Set ᾱ := r_{(i−1)k′,(i−1)k′+1} − ⌊r_{(i−1)k′,(i−1)k′+1}/r_{(i−1)k′,(i−1)k′}⌉ r_{(i−1)k′,(i−1)k′}.
Check whether the Lovasz condition δ r_{(i−1)k′,(i−1)k′}^2 ≤ ᾱ^2 + r_{(i−1)k′+1,(i−1)k′+1}^2 holds for the
first column of R_local and the column before it in R.
If i = 1 or the Lovasz condition holds, set i := i + 1.
Else, if i ≠ 1 and the Lovasz condition does not hold, set i := i − 1.
If i < d′, go to step 3; else, go to step 7.
Step 7. Apply the block size-reduction (Algorithm 3.2) to the whole matrix R, and stop
the algorithm.
In Section 3.1.3, we stated that the first k′ columns of R_local may be PLLL
reduced before Local-PLLL is applied in step 3. It is easy to check from the algorithm
that the first k′ columns of R_local are PLLL reduced in every call of Local-PLLL
in step 3 except the first.
The left-to-right block LLL reduction algorithm is given as follows.
Algorithm 3.5. (Left-to-Right Block LLL Reduction) Given a full column rank
matrix B ∈ R^{m×n} and an even block size k, this algorithm computes the LLL
factorization B = Q1 R Z^{−1}, where Q1 has orthonormal columns, R is upper triangular
and LLL reduced, and Z is unimodular. In the algorithm, we assume Z
is partitioned into blocks in the same way as R. We use A_{i1:i2,j1:j2} to denote the
sub-matrix formed by block rows i1 to i2 and block columns j1 to j2 of A.
function: [R, Z] = LRBLLL(B, k)
    // Compute the block QR factorization using Algorithm 3.1
1:  [R, Z] = BQRMCP(B, k)
2:  i := 1, k′ := k/2, d′ := 2n/k, f := 0
3:  while i < d′ do
      // PLLL reduction of the diagonal sub-matrix using Algorithm 3.3
4:    [Q̄, R_{i:i+1,i:i+1}, Z̄, c] = Local-PLLL(R_{i:i+1,i:i+1}, f)
5:    f := 1
6:    if Z̄ = I then
        // The diagonal block is unchanged. The algorithm moves ahead.
7:      i := i + 1
8:      Continue
9:    end if
      // Block updating
10:   Z_{1:d′,i:i+1} := Z_{1:d′,i:i+1} Z̄
11:   R_{1:i−1,i:i+1} := R_{1:i−1,i:i+1} Z̄
12:   R_{i:i+1,i+2:d′} := Q̄^T R_{i:i+1,i+2:d′}
      // Size-reduce the corresponding columns of R_{1:i−1,i:i+1} using Algorithm 3.4
13:   [R_{1:i−1,i:i+1}, Z̄] = BPSR(R_{1:i−1,i:i+1}, R_{1:i−1,1:i−1}, c)
14:   Z_{1:d′,i:i+1} := Z_{1:d′,i:i+1} + Z_{1:d′,1:i−1} Z̄
      // Check the Lovasz condition, then move forward or backward
15:   ζ := ⌊R((i−1)k′, (i−1)k′+1)/R((i−1)k′, (i−1)k′)⌉
16:   ᾱ := R((i−1)k′, (i−1)k′+1) − ζ R((i−1)k′, (i−1)k′)    // δ is a parameter chosen in (1/4, 1)
17:   if δ R((i−1)k′, (i−1)k′)^2 ≤ ᾱ^2 + R((i−1)k′+1, (i−1)k′+1)^2 or i = 1 then
18:     i := i + 1
19:   else
20:     i := i − 1
21:   end if
22: end while
    // Size-reduce R using Algorithm 3.2
23: [R, Z̄] = BSR(R, k)
24: Z := Z Z̄

Notice that if the Local-PLLL output Z̄ is an identity matrix, we do not apply the block
updating and BPSR to the relevant blocks, for efficiency. Also notice that if the matrix
dimension n is not a multiple of the block size k, the algorithm still works by simply
changing the block size of the last column blocks to fit the matrix dimension. At the
end of each while loop, the first ik′ columns of R are PLLL reduced. The while loop
breaks when i = d′. Then all n = d′k′ columns of R are PLLL reduced, and the
matrix R is size-reduced after the final size-reduction. Thus the LRBLLL algorithm
outputs a basis matrix which is LLL reduced.
3.2.2 Complexity Analysis

In the LRBLLL algorithm, the column permutation operations are executed in
the Local-PLLL subroutine. Since LRBLLL uses the same permutation criterion as
LLL (Algorithm 2.1), Lemma 2.1 can also be applied to LRBLLL. As in Section 2.2.3,
we define α = max_j ||b_j|| and β = min_{x∈Z^n\{0}} ||Bx||. Thus the LRBLLL algorithm
has at most O(n^3 + n^2 log_{1/δ}(α/β)) permutations, and the algorithm converges. During
the procedure of LRBLLL, the permutation operations are performed inside the
Local-PLLL subroutine. In the following part, we obtain an upper
bound on the number of calls of Local-PLLL.
In the while loop of LRBLLL, Local-PLLL reductions of diagonal sub-matrices
of R are called. At each loop, the PLLL reduction of a diagonal sub-matrix is performed,
and the diagonal sub-matrix to which the PLLL reduction will be applied
in the next loop is selected in the current loop. From step 3 of LRBLLL, the diagonal
sub-matrix R_local contains 2 diagonal blocks, R_{i,i} and R_{i+1,i+1}, and R_local may
move one diagonal block forward or backward at the end of each loop, according to
whether the Lovasz condition holds for columns (i − 1)k′ and (i − 1)k′ + 1; see step
6 of LRBLLL described in Section 3.2. The matrix R, which is divided into d′ × d′
blocks, has d′ diagonal blocks. In the first call of Local-PLLL, R_local contains the first
two diagonal blocks R_{1,1} and R_{2,2}, and the block index i equals 1, while in the
last call of Local-PLLL, R_local contains the last two diagonal blocks R_{d′−1,d′−1} and
R_{d′,d′}, and the block index i equals d′ − 1. Only d′ − 1 loops are needed for i to move
forward to i = d′ − 1 from i = 1 if there are no backward moves. Actually there may
be some backward moves, say s of them, and then the number of forward moves
increases by an extra s. Thus the total number of moves of R_local is 2s + d′ − 1, which
equals 2s + 2d − 1.


The remaining problem is to determine an upper bound on s, the number of times the block index i moves backward during the execution of LRBLLL. Assume that in some iteration other than the first, the Lovász condition does not hold for columns (i-1)k′ and (i-1)k′ + 1, so the algorithm moves one block back and the block index i is decreased by one. At the beginning of this iteration, however, the Lovász condition held for columns (i-1)k′ and (i-1)k′ + 1. Then the Local-PLLL subroutine of LRBLLL must have modified column (i-1)k′ + 1 of R. To modify column (i-1)k′ + 1, which is the first column of the current R_local, Local-PLLL must perform at least k′ permutations: since the subroutine Local-PLLL starts with column k′ + 1 of R_local (see Section 3.1.3), it takes at least k′ permutations to get back to the first column from column k′ + 1. Thus, whenever the block index i is decreased in an iteration, at least k′ permutations take place in Local-PLLL in that iteration. Let p be the total number of permutations performed by LRBLLL before convergence. Then s, i.e., the number of iterations in which i is decreased, is bounded above by p/k′, which equals (2d/n)p.
Then the cost of LRBLLL is given as follows. The QR factorization with minimum column pivoting takes O(mn^2) arithmetic operations [16, Section 5.2]. In Local-PLLL a permutation causes at most O(k^2) arithmetic operations for subsequent updating and size-reduction. In each loop iteration, after Local-PLLL is called, the block updating of R takes O(nk^2) operations, and the subroutine BPSR takes O(n^2 k) operations in the worst case. The block size-reduction subroutine at the end of the algorithm takes O(n^3) operations. From the above, there are p permutations and 2s + 2d - 1 loop iterations, so the cost of LRBLLL is

    C_LRBLLL = O(mn^2) + p · O(k^2) + (2s + 2d - 1) · O(n^2 k + nk^2) + O(n^3).

Notice that p is bounded above by O(n^3 + n^2 log_{1/δ}(β/γ)), so s is bounded above by O(dn^2 + dn log_{1/δ}(β/γ)). The total cost of LRBLLL is therefore bounded above by O(mn^2 + n^5 + n^4 log_{1/δ}(β/γ)). This bound is the same as the bounds of LLL and PLLL.
Table 3-1 lists the costs of the important processes and the total cost of LRBLLL.

Table 3-1: Complexity analysis of the LRBLLL reduction algorithm

    Processes                                  Bound
    Cost of QR factorization                   O(mn^2)
    Cost of one permutation in Local-PLLL      O(k^2)
    Cost of block updating in one loop         O(nk^2)
    Cost of size-reduction in one loop         O(n^2 k)
    Cost of final block size-reduction         O(n^3)
    Number of permutations p                   O(n^3 + n^2 log_{1/δ}(β/γ))
    Number of loops 2s + 2d - 1                O(dn^2 + dn log_{1/δ}(β/γ))
    Total cost of the algorithm                O(mn^2 + n^5 + n^4 log_{1/δ}(β/γ))
3.3 Alternating Partition Block LLL Reduction Algorithm

In this section we propose an alternating partition block LLL (APBLLL) reduction algorithm, which is easier to parallelize. The complexity analysis of APBLLL is also given.
3.3.1 Partition and Block Operation

The LRBLLL algorithm essentially mimics PLLL: it works on the matrix from left to right, and may move forward or backward during the procedure. In the new alternating partition block LLL reduction algorithm, we do not move forward and backward; we proceed in a different way.

Figure 3-1: Partition 1 of matrix R

Figure 3-2: Partition 2 of matrix R

We first perform BQRMCP on B ∈ R^{m×n} (see Algorithm 3.1):

    BP = Q_1 R,

where Q_1 ∈ R^{m×n} has orthonormal columns, R ∈ R^{n×n} is upper triangular and P ∈ Z^{n×n} is a permutation matrix.

Next we use an example to show how APBLLL works iteratively with the two alternating partitions shown in Figure 3-1 and Figure 3-2.

In the first iteration, R is partitioned into 4 × 4 blocks, each block of size k × k (see Figure 3-1). This partition is referred to as partition 1 for convenience. Then
we work on the blocks of partition 1. First we perform Local-PLLL (Algorithm 3.3) on R11, and then update R12, R13 and R14 by the Q̃ generated by Local-PLLL. Second, we perform Local-PLLL on R22, update R23 and R24 by the Q̃ generated by this Local-PLLL, and update R12 by the Z̃ also generated by this Local-PLLL; then BPSR (Algorithm 3.4) is applied to R12 to do partial size-reduction. Third, we perform Local-PLLL on R33, update R34 by the Q̃ generated by the current Local-PLLL, and update R13 and R23 by the Z̃ also generated by the current Local-PLLL; then BPSR is applied to the block column formed by R13 and R23. Fourth, we perform Local-PLLL on R44, update R14, R24 and R34 by the Z̃ generated by the current Local-PLLL, and apply BPSR to the block column formed by R14, R24 and R34. After this, the first iteration has finished, and all the diagonal blocks R11, R22, R33 and R44 are PLLL reduced.
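The update pattern for one diagonal block can be sketched as follows (a minimal NumPy sketch; `local_reduce` is a hypothetical stand-in for Local-PLLL, assumed to return Q̃, the reduced diagonal block Q̃^T R_ii Z̃, and the unimodular Z̃):

```python
import numpy as np

def apply_block_step(R, i, bnds, local_reduce):
    """One diagonal-block step of the iteration described above (a sketch,
    not the thesis code). bnds[i] = (start, end) column range of block i."""
    s, e = bnds[i]
    Qt, Rii, Zt = local_reduce(R[s:e, s:e])
    R[s:e, s:e] = Rii                 # reduced diagonal block
    R[:s, s:e] = R[:s, s:e] @ Zt      # update the block column above by Zt
    R[s:e, e:] = Qt.T @ R[s:e, e:]    # update the block row to the right by Qt^T
    return R
```

With a `local_reduce` that returns identities, R is left unchanged; in general the step applies an orthogonal transformation on the left of block row i and a unimodular transformation on the right of block column i.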
In the second iteration, we repartition R into 3 × 3 blocks (see Figure 3-2); the block sizes are indicated in the figure. This repartition is referred to as partition 2. We do exactly the same for the blocks of partition 2 as we do in the first iteration. After the second iteration, the diagonal blocks R11, R22 and R33 are PLLL reduced.

In the following iterations, the same process is performed with partition 1 and partition 2 applied alternately, until no permutation takes place in an iteration. At this point, it is easy to see that R is PLLL reduced. Then an extra block size-reduction (Algorithm 3.2) is applied to R. After this final size-reduction, R is LLL reduced and the algorithm ends.
The two alternating partitions of R for the general case are given as follows. Assume the block size is k and n = dk. Partition 1 partitions R into d × d blocks:

    R = [ R_11  ...  R_1d ]
        [       ...  ...  ]  ∈ R^{n×n},    R_ij ∈ R^{k×k},  1 ≤ i ≤ j ≤ d.
        [            R_dd ]

And partition 2 partitions R into (d-1) × (d-1) blocks:

    R = [ R_11  ...  R_{1,d-1}   ]
        [       ...  ...         ]  ∈ R^{n×n},
        [            R_{d-1,d-1} ]

where

    R_11 ∈ R^{1.5k×1.5k},      R_1v ∈ R^{1.5k×k},
    R_{1,d-1} ∈ R^{1.5k×1.5k}, R_{u,d-1} ∈ R^{k×1.5k},
    R_{d-1,d-1} ∈ R^{1.5k×1.5k}, R_uv ∈ R^{k×k},   1 < u ≤ v < d - 1.
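For concreteness, the diagonal block sizes of the two partitions can be computed as follows (a small sketch; it assumes k is even so that 1.5k is an integer and d ≥ 2, and the function names are ours):

```python
def partition1(d, k):
    # Partition 1: d diagonal blocks, each of size k
    return [k] * d

def partition2(d, k):
    # Partition 2: d-1 diagonal blocks; the first and last have size 1.5k,
    # the d-3 middle ones have size k, so the total is still n = d*k
    return [3 * k // 2] + [k] * (d - 3) + [3 * k // 2]
```

With d = 4 this reproduces the sizes shown in Figures 3-1 and 3-2: partition 1 gives four blocks of size k, and partition 2 gives blocks of sizes 1.5k, k, 1.5k.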

The alternating partition block LLL reduction algorithm is given as follows.

Algorithm 3.6. (Alternating Partition Block LLL Reduction) Given a full column rank matrix B ∈ R^{m×n} and a block size k (assume n is a multiple of k, i.e., n = dk), this algorithm computes the LLL reduction B = Q_1 R Z^{-1}, where Q_1 has orthonormal columns, R is upper triangular and LLL reduced, and Z is unimodular. In the algorithm, we assume Z is partitioned into blocks in the same way as R. We use A_{i1:i2, j1:j2} to denote the sub-matrix of A formed by block rows i1 to i2 and block columns j1 to j2.
function: [R, Z] = APBLLL(B, k)
// Compute the block QR factorization using Algorithm 3.1
1:  [R, Z] = BQRMCP(B, k)
2:  d := n/k, f := 0
3:  for i = 1 : d do
4:    change_i := 1, nextChange_i := 1
5:  end for
6:  while (1) do
7:    Partition R into blocks using Partition 1 or 2 alternately
8:    for i = 1 : d do (for Partition 2: i = 1 : d - 1; we assume partition 1 is used in the following description)
9:      if change_i ≠ 1 then
10:       continue
11:     end if
        // Apply Local-PLLL to the diagonal block using Algorithm 3.3
12:     [Q̃, R_ii, Z̃, r] = Local-PLLL(R_ii, f)
13:     if Z̃ = I then
          // The diagonal block is unchanged, and updates are not needed
14:       continue
15:     end if
        // Perform the corresponding updates
16:     nextChange_{max(1,i-1)} := 1, nextChange_i := 1
        // Block updating
17:     Z_{1:d,i} := Z_{1:d,i} Z̃
18:     R_{1:i-1,i} := R_{1:i-1,i} Z̃
19:     R_{i,i+1:d} := Q̃^T R_{i,i+1:d}
        // Size-reduce the corresponding columns of R_{1:i-1,i} using Algorithm 3.4
20:     [R_{1:i-1,i}, Z̃] = BPSR(R_{1:i-1,i}, R_{1:i-1,1:i-1}, r)
21:     Z_{1:d,i} := Z_{1:d,i} + Z_{1:d,1:i-1} Z̃
22:   end for
23:   if nextChange = 0 then
        // Break when no permutation applied
24:     break
25:   end if
26:   f := 1
27:   for i = 1 : d do
28:     change_i := nextChange_i, nextChange_i := 0
29:   end for
30: end while
// Size-reduce R using Algorithm 3.2
31: [R, Z̃] = BSR(R)
32: Z := Z Z̃

Notice that the two vectors change and nextChange are used to track whether the diagonal blocks are PLLL reduced in each iteration. If two diagonal blocks are unchanged in an iteration, then in the next iteration we do not apply Local-PLLL to the diagonal block whose diagonal entries come from those two unchanged diagonal blocks, since this diagonal block must also be PLLL reduced. Also notice that if the Local-PLLL output matrix Z̃ is an identity matrix, we do not apply block updating and BPSR to the relevant blocks, for efficiency.
3.3.2 Complexity Analysis

The APBLLL algorithm shares the same QR factorization and final size-reduction parts as LRBLLL. Thus the costs of these two parts are the same as in LRBLLL: O(mn^2) arithmetic operations for the QR factorization and O(n^3) arithmetic operations for the final size-reduction. The cost of the rest of APBLLL is divided into two parts: the cost inside the subroutine Local-PLLL and the cost outside it, i.e., the block updating and the block partial size-reductions. These two parts are calculated separately.

Since APBLLL uses the same permutation criterion as LLL (Algorithm 2.1), Lemma 2.1 also applies to APBLLL. Thus the total number of permutations p taking place in the Local-PLLL reductions is bounded above by O(n^3 + n^2 log_{1/δ}(β/γ)). In Local-PLLL a permutation causes at most O(k^2) arithmetic operations for subsequent updating and size-reductions. Thus, all the calls to the subroutine Local-PLLL cost O(n^3 k^2 + n^2 k^2 log_{1/δ}(β/γ)) arithmetic operations.
In APBLLL, the block updating and BPSR in lines 17-21 are performed only if the output matrix Z̃ of Local-PLLL is not the identity, i.e., some permutations took place during the execution of Local-PLLL. Because the total number of permutations is p, there are at most p calls to Local-PLLL that do not produce an identity Z̃, so in the worst case the block updating and BPSR are executed p times. Each execution of the block updating and BPSR causes at most O(n^2 k) arithmetic operations. Thus the total cost of block updating and BPSR is p · O(n^2 k) in the worst case.

From the above, the total cost of APBLLL is obtained by adding the costs of all the parts together:

    C_APBLLL = O(mn^2) + p · O(k^2) + p · O(n^2 k) + O(n^3) = O(mn^2 + n^5 k + n^4 k log_{1/δ}(β/γ)).

This bound is larger than the bounds of LRBLLL, PLLL and LLL. However, the simulation results show that APBLLL performs better than LLL and PLLL and similarly to LRBLLL. The simulation results and analysis of the two block LLL reduction algorithms are given in the next section.
Table 3-2 lists the costs of the important processes and the total cost of APBLLL.

Table 3-2: Complexity analysis of the APBLLL reduction algorithm

    Processes                                                       Bound
    Cost of QR factorization                                        O(mn^2)
    Cost of one permutation in Local-PLLL                           O(k^2)
    Cost of block updating and size-reduction for one diag. block   O(n^2 k)
    Cost of final block size-reduction                              O(n^3)
    Number of permutations p                                        O(n^3 + n^2 log_{1/δ}(β/γ))
    Total cost of the algorithm                                     O(mn^2 + n^5 k + n^4 k log_{1/δ}(β/γ))
3.4 Simulation Results and Comparison of Algorithms


The simulations are performed in MATLAB on two types of machines. One has MATLAB 7.12.0 on a 64-bit Ubuntu 11.10 system with 4 Intel Xeon W3530 2.8 GHz processors and 5 GB of memory. The other has MATLAB 7.13.0 on a 64-bit Red Hat 6.2 system with 64 AMD Opteron 2.2 GHz processors and 64 GB of memory. Our simulations use conventional MATLAB, not Parallel MATLAB. By default MATLAB uses the IEEE double precision model for floating point arithmetic; the unit round-off for double precision is about 10^{-16}. We compare four algorithms: the original LLL algorithm (Algorithm 2.1), the PLLL+ algorithm, the LRBLLL algorithm (Algorithm 3.5), and the APBLLL algorithm (Algorithm 3.6). The PLLL+ algorithm is the PLLL algorithm (Algorithm 2.3) with an extra size-reduction procedure to guarantee that the resulting matrix is size-reduced. All four algorithms produce LLL reduced matrices. We compare the CPU run time, the flops, and the relative backward errors

    ||B - Q_c R_c Z_c^{-1}||_F / ||B||_F

of the four algorithms, where Q_c is the computed orthogonal matrix, R_c is the computed LLL reduced matrix and Z_c^{-1} is the unimodular matrix formed from the inverses of the computed permutation matrices and IGTs. The run time is measured in two separate parts, the run time of the QR factorization and the run time of the rest of each algorithm (for simplicity, we call this part the reduction), in order to observe how the blocking technique performs in each part.
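A minimal NumPy sketch of this error measure (the function name is ours; in practice Z_c^{-1} would be accumulated exactly from the permutations and IGTs rather than obtained by a floating point solve):

```python
import numpy as np

def relative_backward_error(B, Qc, Rc, Zc):
    """||B - Qc*Rc*inv(Zc)||_F / ||B||_F, computing Rc*inv(Zc) via a
    linear solve instead of forming inv(Zc) explicitly."""
    E = B - Qc @ np.linalg.solve(Zc.T, Rc.T).T
    return np.linalg.norm(E) / np.linalg.norm(B)
```

For a plain QR factorization (Zc = I) this reduces to the usual QR backward error and should be of the order of the unit round-off.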
In the simulation, we test three cases of matrices B ∈ R^{n×n} with n = 100 : 50 : 1000. The square matrices B are generated as follows.

Case 1: B is generated by the MATLAB function randn: B = randn(n, n), i.e., each element follows the normal distribution N(0, 1).

Case 2: B = USV^T, where U and V are randomly generated orthogonal matrices and S is a diagonal matrix with

    S(i, i) = 10^{-4(i-1)/(n-1)},  i = 1, ..., n.

Case 3: B = USV^T, where U and V are randomly generated orthogonal matrices and S is a diagonal matrix with

    S(i, i) = 1000,  i = 1, ..., ⌊n/2⌉,
    S(i, i) = 0.1,   i = ⌊n/2⌉ + 1, ..., n.

Case 1 gives the most typical test matrices for numerical computations. Cases 2 and 3 are intended to show the reduction speed when the condition number is fixed at 10^4.
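The three generators can be sketched in NumPy as follows (assuming, as the stated condition number of 10^4 suggests, that the Case 2 exponent decreases from 0 to -4; function names are ours):

```python
import numpy as np

def random_orthogonal(n, rng):
    # QR of a Gaussian matrix yields a random orthogonal factor
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

def make_case(case, n, rng):
    if case == 1:
        return rng.standard_normal((n, n))        # Case 1: i.i.d. N(0, 1)
    U, V = random_orthogonal(n, rng), random_orthogonal(n, rng)
    if case == 2:
        # Case 2: singular values decay geometrically from 1 to 1e-4
        s = 10.0 ** (-4.0 * np.arange(n) / (n - 1))
    else:
        # Case 3: half the singular values 1000, half 0.1
        s = np.where(np.arange(n) < round(n / 2), 1000.0, 0.1)
    return U @ np.diag(s) @ V.T
```

Both Case 2 and Case 3 produce matrices with condition number 10^4, which can be verified with `np.linalg.cond`.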


Case 3 also shows that the block algorithms gain more efficiency in the reduction part when it takes a long time to run.

For each dimension of all cases, we randomly generate 20 different matrices for the test. We run only 20 simulations per setting because LLL is too time-consuming; however, the box plots show that the behaviors of the algorithms are stable, so 20 runs are enough for our simulation. For the block algorithms, the optimal block size may vary with the dimension of the matrix; in the simulation, a fixed block size of 32 is adopted for all dimensions for simplicity. In the average QR/reduction run time plots, the y-axis is the average run time (seconds) over the 20 matrices and the x-axis is the dimension. In the average flop plots, the y-axis is the average flops and the x-axis is the dimension. In the average relative backward error plots, the y-axis is the relative backward error and the x-axis is the dimension.
We also test matrices with various condition numbers and give the results in the condition number plots. In these plots, the y-axis is the average QR/reduction run time, the average flops or the average relative backward error over 20 matrices of dimension 200 in Case 2, and the x-axis is the matrix condition number, from 10^1 to 10^6. Box plots of the run times and relative backward errors for all three cases with dimension 200 are also drawn. In a box plot, the y-axis is either the algorithm run time or the relative backward error, and the x-axis lists the four algorithms, i.e., LLL, PLLL+, LRBLLL and APBLLL.
The simulation results obtained on the Intel processors are shown in Figures 3-3, 3-4 and 3-5 for the overall performance in the three cases, in Figure 3-6 for Case 2 with different condition numbers, and in Figure 3-7 for the box plots of all the cases. The results obtained on the AMD processors are shown in Figures 3-8, 3-9, 3-10, 3-11 and 3-12, respectively. For the overall performance of each case we give six plots. The two plots in the first row are the average run time of the QR factorization and the average reduction run time of LLL, respectively. LLL runs much longer than the other three algorithms, so we put it in individual plots in order to compare the other three algorithms easily. The two plots in the middle row are the average QR/reduction run times for PLLL+, LRBLLL and APBLLL. The two plots in the bottom row are the average flops and the average relative backward errors for LLL, PLLL+, LRBLLL and APBLLL. For Case 2 with different condition numbers, we also give six plots, ordered in the same way as the overall performance plots. For the box plot figure, we give six plots: the three plots in the left column are the run times for the three cases, and the three plots in the right column are the relative backward errors for the three cases.
From the simulation results, we can draw the following observations and conclusions.

1. Comparing the results between the two machines (Intel and AMD), we observe that the performance of the four algorithms is consistent across the two machines.

2. Comparing the run times of the different algorithms, we found that LLL is the slowest of the four. LRBLLL is as fast as APBLLL, and both are faster than PLLL+ in all three cases. So on average the computational CPU times of the four algorithms satisfy LRBLLL ≈ APBLLL < PLLL+ < LLL.


3. The block QR factorization with minimum column pivoting in LRBLLL and APBLLL is very efficient in all the cases compared with the non-blocked QR in LLL and PLLL+. The run time of the QR factorization dominates the run time of the whole algorithm for PLLL+, LRBLLL and APBLLL in Case 1, and for dimensions between 600 and 1000 in Case 2. In Case 3, the reduction part after QR dominates the total run time.

4. The reduction part of LLL is always much slower than the reduction parts of the other three algorithms, so we omit LLL from the following comparison of reduction run times. The reduction parts of LRBLLL and APBLLL are similar in most of the cases. They are slightly faster than the reduction part of PLLL+ in Case 1; the improvement from the block reduction part is not significant here, because the matrix R after BQRMCP is close to LLL reduced and the reduction part involves only a few permutations. In Case 2, the reduction run times form an unusual shape, with either LRBLLL and APBLLL or PLLL+ leading in different parts of the graph: the reduction run times of PLLL+, LRBLLL and APBLLL first decrease from dimension 100 to 200, then increase from dimension 200 to 600, and finally decrease again from dimension 600 to 1000. Starting from dimension 800, the reduction time of PLLL+ becomes less than the reduction times of LRBLLL and APBLLL. The reason for this shape is left for further study. In Case 3, LRBLLL and APBLLL are much faster than PLLL+, and APBLLL is slightly faster than LRBLLL. For each of the three algorithms, the run time of the reduction part dominates the run time of the whole algorithm in Case 3, so the improvement brought by the blocking technique in the block reduction part improves the efficiency of the whole algorithm significantly in this case.
5. From the flop plots of the four algorithms, we have the following observations. LRBLLL and APBLLL cost more flops than PLLL+ in Cases 1 and 2, and a similar number of flops in Case 3. This is because the matrix R after QR is close to LLL reduced in Cases 1 and 2, so the unimodular matrix Z is sparse; the block algorithms do not exploit this structure of Z and thus cost more flops than PLLL+. Nevertheless, as observed above, LRBLLL and APBLLL are faster than PLLL+ in terms of run time, unlike their flop counts, because LRBLLL and APBLLL, using the blocking technique, execute more flops per unit of CPU time than PLLL+ and LLL. LLL costs far more flops than the others.

6. Comparing the relative backward stability of the four algorithms, we found that PLLL+ performs best in all the cases, LLL ranks second, and LRBLLL and APBLLL are similar to each other, with LRBLLL slightly better than APBLLL for dimensions from 100 to 300 in Case 3. So on average the backward stability of the four algorithms satisfies PLLL+ > LLL > LRBLLL > APBLLL.

7. In Figure 3-6 and Figure 3-11, the tests on matrices with various condition numbers show that the QR time is not affected by the condition number of the matrices, while the reduction time, the flops and the relative backward errors of the four algorithms increase as the condition number of the matrix increases.

8. The box plots show that the behaviors of LLL, PLLL+, LRBLLL and APBLLL are stable across different simulation runs.


Figure 3-3: Performance comparison for Case 1, Intel (QR run time, reduction run time, flops, and relative backward error vs. dimension for LLL, PLLL+, LRBLLL and APBLLL)

Figure 3-4: Performance comparison for Case 2, Intel

Figure 3-5: Performance comparison for Case 3, Intel

Figure 3-6: Performance comparison for Case 2 with dimension 200, Intel (QR/reduction run time, flops, and relative backward error vs. condition number)

Figure 3-7: Box plots of run time (left) and relative backward error (right) for Case 1 (top), Case 2 (middle), Case 3 (bottom) with dimension 200, Intel

Figure 3-8: Performance comparison for Case 1, AMD

Figure 3-9: Performance comparison for Case 2, AMD

Figure 3-10: Performance comparison for Case 3, AMD

Figure 3-11: Performance comparison for Case 2 with dimension 200, AMD

Figure 3-12: Box plots of run time (left) and relative backward error (right) for Case 1 (top), Case 2 (middle), Case 3 (bottom) with dimension 200, AMD

CHAPTER 4
Parallelization of Block LLL Reduction
Recently, some work has been done on the parallelization of LLL reduction
algorithms. We would like to give a brief review of these efforts in the first section
of this chapter. Then we discuss how to parallelize the components of APBLLL
(Algorithm 3.6) introduced in the previous chapter. Finally, the performance of
parallelized APBLLL is investigated.
4.1 Parallel Methods for LLL Reduction


Villard [40] proposed the so-called all-swap LLL reduction for parallelization. The algorithm works iteratively in two alternating phases, an odd phase and an even phase, which make it possible to perform as many vector swaps as possible involving disjoint vector sets. The all-swap reduction algorithm used exact integer arithmetic when it was first proposed. Later, Heckler and Thiele [18, 19] proposed a variant of the all-swap reduction algorithm for mesh-connected processor arrays, which uses both floating point arithmetic and integer arithmetic to achieve higher computational efficiency. Wetzel [41] then proposed a blocked version of Villard's algorithm which reduces over blocks. Although Wetzel's algorithm also uses blocks, it does not employ the blocking technique, i.e., using matrix-matrix operations as much as possible, which is the basis of our block algorithms LRBLLL and APBLLL. Recently, Bartkewitz [6] modified Villard's algorithm and proposed an implementation based on graphics processing units (GPUs).

A quite efficient parallel implementation of the Schnorr-Euchner (SE) algorithm [37], which uses both floating point arithmetic and integer arithmetic, was introduced by Backes and Wetzel [5] for multi-processor, multi-core computer architectures. Their implementation is optimized for reducing high-dimensional lattice bases with big entries.
Ahmad, Amin, Li, Pollin, Van der Perre and Catthoor [3] proposed a scalable block-based parallel lattice reduction algorithm for software defined radio (SDR) baseband processors, based on the fixed complexity LLL [39] introduced by Vetter, Ponnampalam, Sandell, and Hoeher. The fixed complexity LLL uses only floating point arithmetic and is designed to suit the limited run time of real-time systems.
4.2 A Parallel Block LLL Reduction Algorithm

The two block LLL reduction algorithms given in Chapter 3 are serial algorithms. The APBLLL algorithm (Algorithm 3.6) is suitable for parallelization: in each iteration, it partitions the matrix R into independent blocks, and each diagonal block can be reduced independently and concurrently on different processors.

APBLLL mainly consists of three parts: the QR factorization, the Local-PLLL reduction of diagonal blocks with the corresponding block updating, and the size-reductions. The parallel QR factorization with column pivoting has been investigated in the literature (see [33]), so in this section we only show how to parallelize the other two parts with n_p processors. (In a block algorithm, the block size k is usually chosen and the matrix is partitioned into d × d blocks; here we assume d ≥ n_p, and if d < n_p we change the block size k to make d = n_p.)


4.2.1 Parallel Diagonal Block Reduction and Block Updating

The APBLLL algorithm partitions the basis matrix into blocks and then performs Local-PLLL reductions on the diagonal blocks. To parallelize this, we distribute the diagonal blocks among the processors according to their indices: diagonal block j (the j-th diagonal block counted from left to right) is allocated to processor ⌊(j - 1)/n_p⌋ + 1. For example, if d = 7 and n_p = 3, diagonal blocks 1, 2, 3 are allocated to processor 1, diagonal blocks 4, 5, 6 are allocated to processor 2, and diagonal block 7 is allocated to processor 3. After this allocation, each processor performs Local-PLLL on the diagonal blocks allocated to it, as well as the corresponding block updating. Using the above strategy, lines 7-22 of APBLLL (Algorithm 3.6) can be computed concurrently. The description of this part is given in lines 7-20 of the parallel APBLLL algorithm (Algorithm 4.1) in the next section.
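The allocation rule above can be checked with a two-line sketch (the function name is ours):

```python
def proc_of_block(j, n_p):
    # processor index (1-based) for diagonal block j, as described above
    return (j - 1) // n_p + 1

# d = 7 blocks, n_p = 3 processors: blocks 1-3 -> P1, 4-6 -> P2, 7 -> P3
print([proc_of_block(j, 3) for j in range(1, 8)])  # [1, 1, 1, 2, 2, 2, 3]
```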
4.2.2 Parallel Block Size-Reduction

In order to size-reduce all the off-diagonal blocks, we first need to distribute these blocks among the processors. Define mod_b(a) as the residue of a modulo b. The j-th off-diagonal block column is allocated to processor mod_{n_p}(j - 1). Each processor will thus manage at most ⌈d/n_p⌉ block columns. Since the block columns have different lengths, they cost different numbers of operations. For efficiency, we do not want processors to wait for other processors, and in fact they do not need to: if a processor finishes the size-reduction of a block column, say block column j, it starts to size-reduce the (j + n_p)-th block column without causing any conflict.


Figure 4-1: Related blocks of block R_ij (i = 2, j = 3)

Figure 4-2: Task allocation for three processors (P1, P2, P3)
An example is given here to show how the parallel size-reduction works as shown
in Figure 41. Assume d = 5 and np = 3, the off-diagonal blocks (i, j) where i < j
need to be size-reduced in parallel. In the first step, the 3 processors reduce the
off-diagonal blocks (1, 2), (2, 3), (3, 4) respectively. In the next step, the 3 processors
reduce the blocks (4, 5), (1, 3), (2, 4) respectively. Then the processor 2 idles, the
processor 1, 3 reduce blocks (3, 5), (1, 4) respectively. At this point the processor 3
also idles, processor 1 will then finish reducing blocks (2, 5), (1, 5). We can see that
no data conflict between different processor accrues during the processor.
Thus, the size-reduction in lines 20-21 and 31-32 of APBLLL can be computed
concurrently.
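The allocation rule and the within-column order described above can be simulated in a few lines. This sketch (the function name is ours, and the lockstep one-block-per-step timing is an illustrative assumption) reproduces the d = 5, np = 3 example:

```python
def size_reduction_schedule(d, n_proc):
    """Simulate the parallel size-reduction schedule (a sketch).

    Block column j (j = 2..d) is owned by processor mod_{n_proc}(j - 1),
    with a result of 0 mapped to processor n_proc. When a processor
    finishes a column, it moves on to its next owned column (j + n_proc).
    Within a column, the blocks (i, j), i < j, are reduced bottom-up.
    """
    # Columns owned by each (1-based) processor, in increasing j.
    owned = {p: [] for p in range(1, n_proc + 1)}
    for j in range(2, d + 1):
        r = (j - 1) % n_proc
        owned[r if r != 0 else n_proc].append(j)
    # Flatten each processor's work into an ordered task list of blocks (i, j).
    tasks = {p: [(i, j) for j in cols for i in range(j - 1, 0, -1)]
             for p, cols in owned.items()}
    # One block per processor per step; idle processors contribute nothing.
    steps, t = [], 0
    while any(t < len(ts) for ts in tasks.values()):
        steps.append([tasks[p][t] for p in sorted(tasks) if t < len(tasks[p])])
        t += 1
    return steps

# The d = 5, np = 3 example from the text:
# step 1: (1,2), (2,3), (3,4); step 2: (4,5), (1,3), (2,4);
# step 3: (3,5), (1,4); steps 4-5: (2,5) and (1,5), both on processor 1.
```

No two processors ever touch the same block column, which is why no locking is needed in this phase.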
During this size-reduction procedure, each processor size-reduces d²/(2np) blocks
on average and costs O(n³/np) arithmetic operations, while the sequential
size-reduction costs O(n³) arithmetic operations.
A parallel APBLLL (PAPBLLL) reduction algorithm based on the previous
discussion is given as follows.


Algorithm 4.1. (Parallel Repartition Block LLL Reduction) Given a full column
rank matrix B ∈ R^{m×n} and a block size k (assume n is a multiple of k), this
algorithm computes the LLL reduction B = Q₁RZ⁻¹ in parallel.

function: [R, Z] = PAPBLLL(B, k)
1:  Compute [R, Z] = QRMCP(B) in parallel
2:  d := n/k, f := 0
3:  for i = 1 : d do
4:      change_i := 1, nextChange_i := 1
5:  end for
6:  while (1) do
7:      Partition the matrix into blocks using Partition 1 or 2 alternately
8:      for i = 1 : d (for Partition 2: i = 1 : d − 1; we assume Partition 1 is used
        in the following description) do in parallel
9:          if change_i ≠ 1 then
10:             continue
11:         end if
12:         [Q̄, R_{ii}, Z̄, r] = Local-PLLL(R_{ii}, f)
13:         if Z̄ = I then
14:             continue
15:         end if
16:         nextChange_{max(1, i−1)} := 1, nextChange_i := 1
17:         Z_{1:d,i} := Z_{1:d,i} Z̄
18:         R_{1:i−1,i} := R_{1:i−1,i} Z̄
19:         R_{i,i+1:d} := Q̄ᵀ R_{i,i+1:d}
20:     end for
21:     for i = 2 : d (for Partition 2: i = 2 : d − 1) do in parallel
22:         [R_{1:i−1,i}, Z̄] = BPSR(R_{1:i−1,i}, R_{1:i−1,1:i−1}, r)
23:         Z_{:,i} := Z_{:,i} + Z_{:,1:i−1} Z̄
24:     end for
25:     if nextChange = 0 then
26:         break
27:     end if
28:     f := 1
29:     for i = 1 : d do
30:         change_i := nextChange_i, nextChange_i := 0
31:     end for
32: end while
33: Size-reduce all the blocks of R in parallel

4.3 Performance Evaluation of the Parallel Algorithm


From the famous Amdahl's law [14, Chapter 4], the performance of a parallel
algorithm can be measured by its speedup, which refers to how much faster a
parallel algorithm is than the corresponding sequential algorithm and can be
written as the ratio of the execution time of the sequential algorithm to the
execution time of the parallel algorithm:

    S = (Ts + Tp) / (Ts + Tp/np)

where Ts is the sequential portion of the execution time of the algorithm, during
which the algorithm must be executed sequentially, and Tp is the parallel portion
of the execution time of the algorithm, during which the algorithm can be
parallelized on a machine with np processors.

Denoting the parallel portion of the execution time over the total execution time
of the algorithm as the parallel fraction, i.e., fp = Tp/(Ts + Tp), the speedup
becomes:

    S = 1 / ((1 − fp) + fp/np)

It is easy to see that the larger the parallel portion of an algorithm is, the
larger the speedup gained from parallelization. Amdahl's law thus takes the ratio
of the serial and parallel portions of the execution time and the number of
processors into account.
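As a quick numerical check of the formula above (a minimal sketch; the function name is ours):

```python
def amdahl_speedup(fp, n_proc):
    """Speedup S = 1 / ((1 - fp) + fp / n_proc) predicted by Amdahl's law."""
    return 1.0 / ((1.0 - fp) + fp / n_proc)

# A fully parallel algorithm (fp = 1) gains the full factor n_proc, while even a
# small serial fraction caps the achievable speedup: with fp = 0.95, no number
# of processors can push S beyond 1 / (1 - 0.95) = 20.
```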
The parallel strategy given in the above section has a large parallel portion.
The operations that must be computed serially account for only a small part of
the whole algorithm, namely the variable initialization and the condition checking.
The cost of this serial portion is negligible compared with the parallel portion.
Thus, our PAPBLLL has high parallel efficiency.
From the parallel complexity analysis given in Section 4.2.2, the complexity of the
size-reduction is reduced by a factor of np with np processors in ideal cases. Since
the complexity analysis of APBLLL is too pessimistic, the improvement of the
parallelized diagonal block reduction and its corresponding block updating is hard
to observe from the complexity analysis. In order to investigate how the parallel
diagonal block reduction and block updating suggested in Section 4.2.1 work, a
small test is made in the remainder of this section.
To simplify the test, assume the number of processors np is equal to the number
of diagonal blocks d. The serial diagonal block reduction in APBLLL is used to
simulate the parallel diagonal block reduction in PAPBLLL.
Define t(i, j) as the run time of the diagonal block reduction and block updating
at the i-th block (or on the i-th processor) during the j-th iteration, where
1 ≤ i ≤ d, 1 ≤ j ≤ s, and s is the total number of iterations the algorithm needs.
The value max_i t(i, j) is the maximum run time of one diagonal block reduction and
its corresponding block updating during the j-th iteration; it bounds the parallel
diagonal block reduction and updating, since the other processors must wait for
processor i to finish before moving to the next iteration. Thus we can take the
computation time of the parallel diagonal block reduction and block updating to be
Σ_{j=1:s} max_i t(i, j), while the computation time of the serial diagonal block
reduction and block updating is Σ_{j=1:s, i=1:d} t(i, j). The speedup of the
diagonal block reduction can be obtained by comparing these two sums: the ratio

    ( Σ_{j=1:s, i=1:d} t(i, j) ) / ( Σ_{j=1:s} max_i t(i, j) )

shows the efficiency of the parallel computing.
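The two sums can be computed directly from a table of measured times t(i, j); the following sketch uses made-up timings, not measurements from the thesis:

```python
def simulated_speedup(t):
    """Estimate the parallel speedup of the diagonal block reduction.

    t[j][i] holds the measured serial time t(i, j) of block i in iteration j.
    The serial cost is the sum of all entries; the simulated parallel cost
    charges each iteration its slowest block, since the other processors
    must wait for it before starting the next iteration.
    """
    serial = sum(sum(row) for row in t)
    parallel = sum(max(row) for row in t)
    return serial / parallel

# Illustrative timings for d = 3 blocks over s = 2 iterations (made up):
times = [[0.4, 0.5, 0.3],   # iteration 1
         [0.2, 0.2, 0.6]]   # iteration 2
# serial = 2.2, parallel = 0.5 + 0.6 = 1.1, so the estimated speedup is 2.0
```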
The Intel machine introduced in Chapter 3 is used in the test. The average run
time of the serial/parallel diagonal block reduction and block updating is
measured. We use 5 processors to test random upper triangular matrices B ∈ R^{n×n}
with n = 100 : 50 : 1000. For each dimension, we randomly generate 50 different
matrices by the MATLAB function randn: B = triu(randn(n, n)), i.e., each element
follows the standard normal distribution. The test result is given in Figure 4-3,
where the y-axis is the average run time of the parallel or serial block reduction
and the x-axis is the dimension.

[Figure 4-3: Approximating parallel simulation. Average run time (SerialTime vs.
ParallelTime) against dimension n = 100 to 1000.]
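The same test matrices can be generated outside MATLAB; for instance, a NumPy equivalent of B = triu(randn(n, n)) (a sketch; the function name is ours):

```python
import numpy as np

def random_upper_triangular(n, seed=None):
    """Random upper triangular test matrix: the NumPy analogue of MATLAB's
    triu(randn(n, n)). Entries on and above the diagonal are N(0, 1);
    entries below the diagonal are zeroed out."""
    rng = np.random.default_rng(seed)
    return np.triu(rng.standard_normal((n, n)))

# e.g. the n = 100:50:1000 sweep with 50 matrices per dimension:
# for n in range(100, 1001, 50):
#     mats = [random_upper_triangular(n) for _ in range(50)]
```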
The results show that the parallel speedup of the diagonal block reduction and
updating part approaches 4 when 5 processors are used and the dimension increases
to 1000.
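Reading this measured speedup back through Amdahl's law gives a rough estimate of the parallel fraction of this part of the computation (a sketch; the inversion below simply solves the speedup formula of Section 4.3 for fp):

```python
def implied_parallel_fraction(speedup, n_proc):
    """Invert Amdahl's law S = 1 / ((1 - fp) + fp/n_proc) for fp."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_proc)

# A speedup of about 4 on 5 processors suggests roughly 94% of this part of
# the computation is parallelized: (1 - 1/4) / (1 - 1/5) = 0.9375.
```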
In the test, good parallel performance is achieved for the Local-PLLL of the
diagonal blocks and its corresponding updating. The parallel performance of the
full PAPBLLL remains to be investigated by implementing it on a real parallel
machine in the future.


CHAPTER 5
Conclusion and Future Work
The LLL reduction is the most popular lattice reduction and is a powerful tool
for solving many complex problems in mathematics and computer science such as
integer least square problems. The computation speed of a matrix algorithm is
determined not only by the number of floating point operations involved, but also by
the amount of memory traffic which is the movement of data between memory and
registers. The blocking technique casts matrix algorithms in terms of matrix-matrix
operations to permit efficient reuse of data.
In this thesis, two floating point block LLL reduction algorithms using the
blocking technique, named the left-to-right block LLL (LRBLLL) reduction algorithm
and the alternating partition block LLL (APBLLL) reduction algorithm, have been
proposed, and the parallelization of APBLLL has been discussed. The complexity of
LRBLLL is bounded above by the same complexity bound as that of LLL in the
literature. First, the ordinary floating point LLL reduction and its variant, the
partial LLL (PLLL) reduction, are introduced as fundamentals. Then the LRBLLL and
APBLLL reduction algorithms are proposed as efficient LLL reduction algorithms
utilizing the blocking technique. The performances of four algorithms, LLL, PLLL+,
LRBLLL and APBLLL, are compared. Later, the parallelization of the APBLLL
reduction is given with a complexity analysis where possible. A test is designed
to evaluate the performance of the parallel diagonal block reduction part of
parallel APBLLL on a serial machine. Based on the analysis and computational
results of this thesis, we can draw the following conclusions.
1. The block QR factorization with minimum column pivoting used in LRBLLL
and APBLLL is a few times faster than the non-block QR used in LLL and PLLL+
in our test cases (e.g., in Figure 3-5 the former is 4 times faster than the latter
when the dimension equals 1000). The reduction parts of LRBLLL and APBLLL are
generally faster than the reduction part of PLLL+ (e.g., in Figure 3-5 the former is
4 times faster than the latter when the dimension equals 1000), and are much faster
than the reduction part of LLL (e.g., in Figure 3-5 the former can be 100 times
faster than the latter when the dimension equals 1000). Comparing the overall CPU
run times, which are the sums of the QR and reduction run times, we found that both
of our block LLL algorithms, LRBLLL and APBLLL, are faster than PLLL+ and
much faster than the original LLL.
2. Although the block LLL reduction algorithms LRBLLL and APBLLL cost
similar or more flops than PLLL+, they are still faster than PLLL+ because the
blocking technique can perform more flops in the same CPU time, which shows the
benefit of the blocking technique. LLL costs many more flops.
3. By running these algorithms on the Intel and AMD machines, it is found
that the two block LLL algorithms behave similarly and both perform well on both
machines.
4. The performance test on the parallel diagonal block reduction part of PAPBLLL
shows that the speedup of this part approaches 4 with 5 processors. The
complexity analysis of the size-reduction part of PAPBLLL shows that a speedup
of np can be obtained with np processors in ideal cases.
In the future we would like to:
1. Understand the unusual shape of the curves of the reduction run time versus
matrix dimension in Case 2 (see Figure 3-4 and Figure 3-9).
2. Find the causes of the higher relative backward errors of the block algorithms
LRBLLL and APBLLL and a way to improve their numerical stability.
3. Implement parallel APBLLL (PAPBLLL) on a real parallel machine and
investigate its performance.


References
[1] K. Aardal and F. Eisenbrand. The LLL algorithm and integer programming. In
[31], pp. 293-314, 2009.
[2] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger. Closest point search in lattices.
IEEE Transactions on Information Theory, vol. 48, pp. 2201-2214, 2002.
[3] U. Ahmad, A. Amin, M. Li, S. Pollin, L. Van der Perre, and F. Catthoor. Scalable
block-based parallel lattice reduction algorithm for an SDR baseband processor.
IEEE International Conference on Communications (ICC), pp. 1-5, 2011.
[4] L. Babai. On Lovász' lattice reduction and the nearest lattice point. Symposium
on Theoretical Aspects of Computer Science (STACS), vol. 182, pp. 13-20, 1985.
[5] W. Backes and S. Wetzel. Parallel lattice basis reduction using a multi-threaded
Schnorr-Euchner LLL algorithm. Euro-Par 2009, Lecture Notes in Computer
Science (LNCS), vol. 5704, pp. 960-973, Springer, 2009.
[6] T. Bartkewitz. Improved lattice basis reduction algorithms and their efficient
implementation on parallel systems. Diploma thesis, Department of Electrical
Engineering and Information Sciences, Ruhr-University Bochum, 2009.
[7] J.W.S. Cassels. An Introduction to the Geometry of Numbers. Springer, Berlin,
Heidelberg, New York, 1971.
[8] X.-W. Chang, X. Yang, and T. Zhou. MLAMBDA: A modified LAMBDA method
for integer least-squares estimation. Journal of Geodesy, vol. 79, pp. 552-565,
2005.
[9] X.-W. Chang and G.H. Golub. Solving ellipsoid-constrained integer least squares
problems. SIAM Journal on Matrix Analysis and Applications archive, vol. 31,
no. 3, pp. 1071-1089, 2009.
[10] X.-W. Chang and Q. Han. Solving box-constrained integer least square problems. IEEE Transactions on Wireless Communications, vol. 7, no. 1, pp. 277-287,
2008.
[11] X.-W. Chang and T. Zhou. MILES: MATLAB package for solving mixed integer
least squares problem. GPS Solutions, vol. 11, pp. 289-294, 2007.
[12] I.V.L. Clarkson. Approximation of linear forms by lattice points with applications to signal processing. PhD thesis, The Australian National University,
1997.
[13] H. Cohen. A Course in Computational Algebraic Number Theory. Springer-Verlag, Berlin, Germany, 1993.
[14] J.J. Dongarra, I.S. Duff, D.C. Sorensen, and H.A. van der Vorst. Numerical Linear Algebra for High-Performance Computers. Society for Industrial and
Applied Mathematics, Philadelphia, PA, 1998.
[15] O. Goldreich, S. Goldwasser, and S. Halevi. Public-key cryptosystems from
lattice reduction problems. CRYPTO '97: Advances in Cryptology, vol. 1294,
pp. 112-131, 1997.
[16] G.H. Golub and C. Van Loan. Matrix Computations. The Johns Hopkins
University Press, Baltimore, Maryland, 3rd edition, 1996.
[17] A. Hassibi and S. Boyd. Integer parameter estimation in linear models with
applications to GPS. IEEE Transactions on Signal Processing, vol. 46, pp. 2938-2952, 1998.
[18] C. Heckler and L. Thiele. A parallel lattice basis reduction for mesh-connected
processor arrays and parallel complexity. In the Proceedings of Fifth IEEE Symposium on Parallel and Distributed Processing, pp. 400-407, 1993.
[19] C. Heckler and L. Thiele. Complexity analysis of a parallel lattice basis reduction
algorithm. SIAM J. Comput., vol. 27, no. 5, pp. 1295-1320, 1998.
[20] R. Kannan. Improved algorithms for integer programming and related lattice
problems. In the Proceedings of the 15th Annual ACM Symposium on Theory
of Computing (STOC), pp. 193-206, 1983.
[21] A. Korkine and G. Zolotareff. Sur les formes quadratiques. Mathematische
Annalen, vol. 6, pp. 366-389, 1873.
[22] A.K. Lenstra, H.W. Lenstra, and L. Lovász. Factoring polynomials with rational
coefficients. Mathematische Annalen, vol. 261, pp. 515-534, 1982.

[23] C. Ling and N. Howgrave-Graham. Effective LLL reduction for lattice decoding.
In the Proceedings of IEEE International Symposium on Information Theory, pp.
196-200, 2007.
[24] C. Ling, W.H. Mow, and N. Howgrave-Graham. Variants of the LLL algorithm
in digital communications: Complexity analysis and fixed-complexity implementation. IEEE Trans. Inf. Theory, submitted for publication. Available online:
http://arxiv.org/abs/1006.1661.
[25] D. Micciancio and S. Goldwasser. Complexity of Lattice Problems: A Cryptographic Perspective. Kluwer Academic publishers, Boston, 2002.
[26] H. Minkowski. Geometrie der Zahlen. Teubner, 1896.
[27] H. Minkowski. Diophantische Approximationen. Teubner, 1907.
[28] W.H. Mow. Universal lattice decoding: Principle and recent advances. Wireless
Communications and Mobile Computing, vol. 3, pp. 553-569, 2003.
[29] W.H. Mow. Universal lattice decoding: A review and some recent results. In
the Proceedings of IEEE International Conference on Communications, vol. 5,
pp. 2842-2846, 2004.
[30] P.Q. Nguyen and D. Stehlé. Floating-point LLL revisited. EUROCRYPT 2005,
Lecture Notes in Computer Science (LNCS) 3494, pp. 215-233, 2005.
[31] P.Q. Nguyen and B. Vallée (editors). The LLL Algorithm: Survey and Applications. Information Security and Cryptography, Springer, Berlin, 2009.
[32] G. Quintana-Ortí, X. Sun, and C.H. Bischof. A BLAS-3 version of the QR
factorization with column pivoting. SIAM Journal on Scientific Computing, vol.
19, pp. 1486-1494, 1998.
[33] G. Quintana-Ortí and E.S. Quintana-Ortí. Parallel codes for computing the
numerical rank. Linear Algebra and its Applications, vol. 275-276, pp. 451-470,
1998.
[34] C.P. Schnorr. Factoring integers and computing discrete logarithms via diophantine approximation. In Advances in Cryptology: EuroCrypt '91, vol. 547,
pp. 281-293, 1991.

[35] C.P. Schnorr. Fast LLL-type lattice reduction. Information and Computation
vol. 204, no. 1, pp. 1-25, 2006.
[36] C.P. Schnorr. Progress on LLL and lattice reduction. In [31], pp. 145-178, 2009.
[37] C.P. Schnorr and M. Euchner. Lattice basis reduction: Improved practical
algorithms and solving subset sum problems. Mathematical Programming, vol.
66, pp. 181-191, 1994.
[38] R. Schreiber and C. Van Loan. A storage efficient WY representation for products of Householder transformations. SIAM Journal on Scientific and Statistical
Computing, vol. 10, pp. 53-57, 1989.
[39] H. Vetter, V. Ponnampalam, M. Sandell, and P.A. Hoeher. Fixed complexity
LLL algorithm. IEEE Transactions on Signal Processing, vol. 57, pp. 1634-1637,
2009.
[40] G. Villard. Parallel lattice basis reduction. In the Proceedings of The International Symposium on Symbolic and Algebraic Computation, pp. 269-277, 1992.
[41] S. Wetzel. An efficient parallel block-reduction algorithm. Algorithmic Number
Theory Symposium, Lecture Notes in Computer Science (LNCS), vol. 1423, pp.
323-337, 1998.
[42] D. Wübben, D. Seethaler, J. Jaldén, and G. Matz. Lattice reduction: A survey
with applications in wireless communications. IEEE Signal Processing Magazine,
pp. 70-91, May 2011.
[43] X. Xie, X.-W. Chang, and M.A. Borno. Partial LLL reduction. In the Proceedings of IEEE GLOBECOM, 5 pages, 2011.
[44] T. Zhou. Modified LLL algorithms. Master's thesis, School of Computer Science,
McGill University, 2006.
