Two Floating Point LLL Reduction Algorithms - Thesis
Yancheng Xiao
Master of Science
McGill University
Montreal,Quebec
September 2012
DEDICATION
ACKNOWLEDGEMENTS
ABSTRACT
The Lenstra, Lenstra and Lovasz (LLL) reduction is the most popular lattice
reduction and is a powerful tool for solving many complex problems in mathematics
and computer science. The blocking technique casts matrix algorithms in terms
of matrix-matrix operations to permit efficient reuse of data in the algorithms. In
this thesis, we use the blocking technique to develop two floating point block LLL
reduction algorithms, the left-to-right block LLL (LRBLLL) reduction algorithm
and the alternating partition block LLL (APBLLL) reduction algorithm, and give
the complexity analysis of these two algorithms. We compare these two block LLL
reduction algorithms with the original LLL reduction algorithm (in floating point
arithmetic) and the partial LLL (PLLL) reduction algorithm in the literature in
terms of CPU run time, flops and relative backward errors. The simulation results
show that the two block LLL reduction algorithms are faster in overall CPU run
time than the partial LLL reduction algorithm and much faster than the original
LLL, even though the two block algorithms cost more flops than the partial LLL
reduction algorithm in some cases. The shortcoming of the two block algorithms is
that they may sometimes not be as numerically stable as the original and partial
LLL reduction algorithms. The parallelization of APBLLL is also discussed.
ABRÉGÉ
TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
ABSTRACT
ABRÉGÉ
1 Introduction
  1.1 Lattice Reduction
  1.2 Contributions and Organization of the Thesis
2 Introduction to LLL Reduction Algorithms
  2.1 LLL Reduction
  2.2 Original LLL Reduction Algorithm
    2.2.1 Size-Reductions
    2.2.2 Permutations
    2.2.3 Complexity Analysis
  2.3 Partial LLL Reduction Algorithm
    2.3.1 Householder QR Factorization with Minimum Column Pivoting
    2.3.2 Partial Size-Reduction and Givens Rotation Updating
3 Block LLL Reduction Algorithms
4 Parallelization
5 Conclusions and Future Work
References
LIST OF TABLES
LIST OF FIGURES

1–1 A lattice in 2-dimension
3–1 Partition 1 of matrix R
3–2 Partition 2 of matrix R
3–7 Box plots of run time (left) and relative backward error (right) for Case 1 (top), Case 2 (middle), Case 3 (bottom) with dimension 200, Intel
3–12 Box plots of run time (left) and relative backward error (right) for Case 1 (top), Case 2 (middle), Case 3 (bottom) with dimension 200, AMD
CHAPTER 1
Introduction
1.1 Lattice Reduction
A set L in the real vector space R^m is referred to as a lattice if there exists a set
of linearly independent vectors b_1, b_2, ..., b_n in R^m such that

    L = { x_1 b_1 + x_2 b_2 + ··· + x_n b_n : x_j ∈ Z, j = 1, 2, ..., n }.

The set {b_1, b_2, ..., b_n} is a basis of lattice L. The dimension of the lattice is defined
to be n. The matrix B = [b_1, b_2, ..., b_n] is referred to as the lattice basis matrix
which generates L; the lattice is also written as L(B).
Geometrically, a lattice can be viewed as a set of intersection points in an infinite
grid, as shown in Figure 1–1. The lines of the grid do not need to be orthogonal to
each other. The same lattice may have different bases. For example, in Figure 1–1,
{b_1, b_2} is a basis of the lattice, and {c_1, c_2} is also a basis.
Suppose that we have two basis matrices B and C. If they generate the same
lattice, i.e., L(B) = L(C), we say that B and C are equivalent. Two basis matrices
B, C ∈ R^{m×n} are equivalent if and only if there exists a unimodular matrix Z ∈ Z^{n×n}
(i.e., an integer matrix with determinant det(Z) = ±1) such that C = BZ, see [25,
p. 4].
Lattice basis reduction transforms a given lattice basis into a basis
with short and nearly orthogonal basis vectors. There are several kinds of lattice
reductions. One important application of lattice reduction is solving
closest vector problems (CVP), which are also referred to as integer least squares
(ILS) problems [2, 4, 9, 10, 17].
Generally, we can classify the LLL reduction algorithms into three categories.
The first category includes exact integer arithmetic LLL reduction algorithms with
both input and output bases being integral. For example, the original LLL algorithm
given in [22] is in this category.
The second category includes the algorithms such as those in [30, 35, 36], which
use not only integer arithmetic, but also floating point arithmetic. The input and
output bases in this category are also integral. The reason to use floating point
arithmetic is that the integer arithmetic is expensive. The algorithms use long enough
floating numbers to approximate the intermediate results, so that the rounding errors
do not lead to an output basis which is not exactly LLL reduced.
The applications of the first and second categories include factoring polynomials
[22], the subset sum problem [37] and public-key cryptanalysis [15].
The third category includes floating point algorithms with both input and output
bases being real. This category applies to cases where exact integer arithmetic is not
required and where a nearly LLL reduced basis is acceptable, such as ILS problems
which arise in GPS, e.g., [9, 10, 11, 17, 43], and in multi-input multi-output (MIMO)
communications, e.g., [24, 42]. So an algorithm in this category does not require strict
floating point error control like algorithms in the second category. An algorithm in
category three is much more efficient than those in categories one and two.
1.2 Contributions and Organization of the Thesis

In this thesis, we propose two block algorithms for the LLL reduction with real
basis matrices by using the blocking technique [14, Chapter 5]. The algorithms are
based on the original LLL reduction algorithm [22] and the partial LLL (PLLL)
reduction algorithm [43].
The computation speed of a matrix algorithm is determined not only by the
number of floating point operations involved, but also by the amount of memory
traffic, i.e., the movement of data between memory and registers. The level
3 basic linear algebra subprograms (BLAS) are designed to reduce these data
movements. The matrix-matrix operations implemented in level 3 BLAS make efficient
reuse of data residing in cache or local memory to avoid excessive data
movement. The blocking technique casts algorithms in terms of matrix-matrix
operations to permit efficient reuse of data.
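To make the blocking idea concrete, here is a minimal sketch (ours, not from the thesis) of a blocked matrix product C = AB: the innermost update is a small matrix-matrix product of the kind level 3 BLAS performs with high data reuse while the blocks sit in cache.

```python
import numpy as np

def blocked_matmul(A, B, k=64):
    """Blocked C = A @ B: partition into k x k blocks so each inner
    update C_ij += A_ip @ B_pj reuses three small blocks in fast memory."""
    m, K = A.shape
    K2, n = B.shape
    assert K == K2
    C = np.zeros((m, n))
    for i in range(0, m, k):
        for j in range(0, n, k):
            for p in range(0, K, k):
                # one block update: about 2k^3 flops on 3k^2 data
                C[i:i+k, j:j+k] += A[i:i+k, p:p+k] @ B[p:p+k, j:j+k]
    return C
```

In a compiled implementation the inner block product would be a single GEMM call; NumPy's `@` plays that role here.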
Two block LLL reduction algorithms utilizing this blocking technique are proposed in
this thesis, together with their complexity analysis. Numerical simulations compare
the performance of our block algorithms with the original LLL reduction algorithm
and the PLLL reduction algorithm in terms of CPU time, flops and numerical stability.
On average the block algorithms are computationally faster than PLLL
and LLL, although their numerical stability in some cases may need improvement.
The parallelization of one of the two block LLL reduction algorithms is discussed
in two parts: the parallelization of the block size-reduction and the parallelization
of the diagonal block reduction. Complexity analysis shows that the parallelized
size-reduction part can obtain a speedup of np in ideal cases if np processors are
used. The improvement from the parallelized diagonal block reduction part is hard to
observe from the complexity analysis, since the complexity bound is too pessimistic. A
simple test is designed to examine the performance of the parallelized diagonal block
reduction part. The test result shows that the parallelized diagonal block reduction
part can obtain a speedup of 4.8 with 5 processors in the best situations.
The rest of the thesis is organized as follows. In Chapter 2, we first give the
definition of the LLL reduction. Then a description of the original LLL reduction
algorithm in the matrix language is given, followed by its complexity analysis. In
the last section of this chapter, we introduce the partial LLL (PLLL) reduction
algorithm.
In Chapter 3, we first apply the blocking technique to the components of the
PLLL algorithm, leading to block subroutines. Then two block LLL algorithms are
proposed based on these block subroutines. We give the complexity analysis for the
block algorithms under the assumption of using exact arithmetic. Finally, simulation
results are presented, compared and discussed.
In Chapter 4, we first review the literature of parallel LLL algorithms. Then we
discuss the parallelization of one of our two block algorithms.
Chapter 5 gives conclusions and future work.
We now describe the notation to be used in the thesis. The sets of all real and
integer m × n matrices are denoted by R^{m×n} and Z^{m×n}, respectively, and the sets of
real and integer n-vectors are denoted by R^n and Z^n, respectively. Upper case letters
are used to denote matrices and bold lower case letters are used to denote vectors.
The identity matrix is denoted by I and its i-th column is denoted by e_i. MATLAB
notation is used to denote sub-vectors and sub-matrices of a matrix A. Unless specified
otherwise, ||·|| stands for the 2-norm, i.e., ||a|| = (a^T a)^{1/2}, and ||·||_F stands for
the Frobenius matrix norm, i.e., ||A||_F = (Σ_{i,j} a_{ij}^2)^{1/2}.
CHAPTER 2
Introduction to LLL Reduction Algorithms
In this chapter first we give the definition of the Lenstra-Lenstra-Lovasz (LLL)
reduction. Then we introduce the original LLL reduction algorithm [22] and the
partial LLL (PLLL) reduction algorithm [43], which will be the bases of our new
LLL reduction algorithms to be presented in later chapters.
2.1 LLL Reduction

The LLL reduction introduced in [22] can be described as a QRZ matrix factorization:

    B = Q [R; 0] Z^{-1} = Q_1 R Z^{-1},

where B ∈ R^{m×n} is a given matrix with full column rank, Q = [Q_1, Q_2] ∈ R^{m×m} is
orthogonal with Q_1 ∈ R^{m×n} and Q_2 ∈ R^{m×(m−n)}, Z ∈ Z^{n×n} is unimodular, and
R ∈ R^{n×n} is upper triangular and satisfies

    |r_{ij}| ≤ (1/2) |r_{ii}|,  1 ≤ i < j ≤ n,    (2.1)

    δ r_{i−1,i−1}^2 ≤ r_{ii}^2 + r_{i−1,i}^2,  1 < i ≤ n,    (2.2)

with the parameter δ ∈ (1/4, 1). The conditions Eq.(2.1) and Eq.(2.2) are named
the size-reduction condition and the Lovasz condition, respectively. The matrix BZ
or the matrix R is said to be LLL reduced.
The LLL reduction algorithm in [22] is the most well known lattice basis reduction algorithm with polynomial time complexity, which was originally designed for
factoring polynomials with rational coefficients using integer arithmetic operations.
Later, the applications of the LLL reduction have extended widely to number theory
(see, e.g., [34, 37]), cryptography (see, e.g., [15, 25]), integer programming (see, e.g.,
[1, 20]), digital communications (see, e.g., [24]), and GPS (see, e.g., [11, 17]). Some
of these applications do not require an exact integer LLL reduced basis, so
floating point arithmetic is used to achieve better computational performance in such
application areas. One example of a floating point LLL application is to compute
a suboptimal solution (e.g., the Babai point [4]) or the optimal solution of an integer
least squares (ILS) problem.
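The Babai point [4] mentioned above can be computed by rounding during back-substitution once the reduced upper triangular R is available. The following sketch (ours, not the thesis's code; the function name is our own) computes it for the ILS problem min over integer z of ||y − Rz||:

```python
import numpy as np

def babai_point(R, y):
    """Babai's rounding (nearest plane) point for min_z ||y - R z||,
    with R upper triangular, nonzero diagonal, and z integer."""
    n = R.shape[1]
    z = np.zeros(n, dtype=int)
    for k in range(n - 1, -1, -1):
        # residual after fixing z[k+1:], then round to the nearest integer
        c = (y[k] - R[k, k+1:] @ z[k+1:]) / R[k, k]
        z[k] = int(round(c))
    return z
```

The more reduced (short, near-orthogonal) the basis, the closer this suboptimal point tends to be to the optimal ILS solution.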
In the remainder of this chapter, the original LLL reduction algorithm and
the PLLL reduction algorithm are introduced; we assume they use floating point
arithmetic.
2.2 Original LLL Reduction Algorithm

This section describes the original LLL reduction algorithm in matrix language
(see [44, Algorithm 3.3.1] and [13, Algorithm 2.6.3]). The algorithm involves the
Gram-Schmidt orthogonalization (GSO), integer Gauss transformations (IGTs), column
permutations and orthogonal transformations. GSO is applied to find the QR
factors Q and R of the given matrix B. The column permutations and IGTs produce
the unimodular matrix Z.
In the original exact integer LLL reduction algorithm, a column scaled Q and
a row scaled R which has unit diagonal entries are computed by a variation of GSO
to avoid square root computations. In the floating point LLL reduction algorithm in
this thesis, the regular GSO is applied to B and gives the compact form of the QR
factorization:
B = Q1 R,
where Q1 Rmn has orthonormal columns, and R Rnn is upper triangular.
After the GSO of B, integer Gauss transformations, column permutations and
GSO are used to transform R to an LLL reduced basis. IGTs are used to perform
size-reduction on the off-diagonal entries to achieve Eq.(2.1). The column permutations
are used to order the columns to achieve Eq.(2.2). Since a column permutation
destroys the upper triangular structure, GSO is used to recover the upper triangular
structure.
2.2.1 Size-Reductions

An integer Gauss transformation (IGT) has the form

    Z_{ij} = I − ζ e_i e_j^T,  i ≠ j,

where ζ is an integer. Applying Z_{ij} to R from the right gives R Z_{ij} = R − ζ R e_i e_j^T,
which subtracts ζ times column i of R from column j and leaves the other columns
unchanged. Taking ζ = ⌊r_{ij}/r_{ii}⌉, the nearest integer to r_{ij}/r_{ii}, ensures
|r_{ij}| ≤ (1/2)|r_{ii}|, i.e., the size-reduction condition Eq.(2.1) holds for the (i, j) entry.
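One IGT step can be sketched numerically as follows (an illustration of the update described above, not the thesis's code; the function name is ours):

```python
import numpy as np

def igt_reduce(R, Z, i, j):
    """Apply the IGT Z_ij = I - zeta*e_i*e_j^T from the right: subtract
    zeta times column i from column j of R (and of the accumulated Z),
    with zeta = round(r_ij / r_ii), so that |r_ij| <= |r_ii| / 2 after."""
    zeta = int(round(R[i, j] / R[i, i]))
    if zeta != 0:
        R[:, j] -= zeta * R[:, i]
        Z[:, j] -= zeta * Z[:, i]
    return zeta
```

Since the same column operation is applied to the accumulated Z, the relation between the original and reduced matrices is maintained throughout.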
2.2.2 Permutations

The column permutations are applied to achieve Eq.(2.2). Suppose that the
Lovasz condition is not satisfied for i = k; then a permutation matrix P_{k−1,k} is
applied to interchange columns k−1 and k of R. The permutation destroys the upper
triangular structure of R, so the orthogonal transformation

    G_{k−1,k} = diag(I_{k−2}, G, I_{n−k}),   G = [c s; −s c],

is applied to R from the left to restore the upper triangular structure, where

    c = r_{k−1,k} / (r_{k−1,k}^2 + r_{kk}^2)^{1/2},   s = r_{kk} / (r_{k−1,k}^2 + r_{kk}^2)^{1/2}.

After the permutation and the transformation, the affected entries of R become

    r̄_{k−1,k−1} = (r_{k−1,k}^2 + r_{kk}^2)^{1/2},

    r̄_{k−1,k} = r_{k−1,k−1} r_{k−1,k} / (r_{k−1,k}^2 + r_{kk}^2)^{1/2},

    r̄_{k,k} = −r_{k−1,k−1} r_{kk} / (r_{k−1,k}^2 + r_{kk}^2)^{1/2}.

Thus, if δ r_{k−1,k−1}^2 > r_{kk}^2 + r_{k−1,k}^2 with δ ∈ (1/4, 1), then the above operations
guarantee δ r̄_{k−1,k−1}^2 < r̄_{kk}^2 + r̄_{k−1,k}^2, i.e., the Lovasz condition Eq.(2.2) holds
for i = k afterwards.
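The permutation-plus-rotation step can be checked numerically; the sketch below (ours, not from the thesis) swaps columns k−1 and k and restores triangularity with the 2 × 2 rotation G:

```python
import numpy as np

def permute_and_triangularize(R, k):
    """Swap columns k-1 and k of upper triangular R (0-based index k),
    then zero the created subdiagonal entry R[k, k-1] by applying
    G = [[c, s], [-s, c]] to rows k-1 and k."""
    R[:, [k - 1, k]] = R[:, [k, k - 1]]
    a, b = R[k - 1, k - 1], R[k, k - 1]   # = old r_{k-1,k} and r_{kk}
    nu = np.hypot(a, b)
    c, s = a / nu, b / nu
    G = np.array([[c, s], [-s, c]])
    R[k - 1 : k + 1, k - 1 :] = G @ R[k - 1 : k + 1, k - 1 :]
    R[k, k - 1] = 0.0                     # exactly zero by construction
    return R
```

The new (k−1, k−1) entry equals (r_{k−1,k}^2 + r_{kk}^2)^{1/2}, and the product of the diagonal magnitudes is preserved, as used in the complexity analysis below.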
In the algorithm, the index k indicates that the first k−1 columns of R are LLL
reduced at the current stage, i.e.,

    |r_{ij}| ≤ (1/2) |r_{ii}|,  1 ≤ i < j ≤ k−1,    (2.3)

    δ r_{i−1,i−1}^2 ≤ r_{ii}^2 + r_{i−1,i}^2,  1 < i ≤ k−1.    (2.4)

At the beginning, k is set to 2. During the reduction procedure, the value of k
shifts between 2 and n + 1 and changes by 1 in each step. At stage k, the algorithm
first uses an integer Gauss transformation to reduce r_{k−1,k}. Then it checks if it
needs to permute columns k−1 and k according to the Lovasz condition. If
δ r_{k−1,k−1}^2 > r_{kk}^2 + r_{k−1,k}^2, it performs the permutation, applies the corresponding
orthogonal transformation to recover the upper triangular structure, and decreases k
by 1 if k > 2; otherwise it size-reduces the remaining entries of column k and
increases k by 1.
Algorithm 2.1. (LLL Reduction) Suppose B ∈ R^{m×n} has full column rank. This
algorithm computes the LLL reduction of B: B = Q_1 R Z^{-1}, where Q_1 has orthonormal
columns, R is upper triangular and LLL reduced, and Z is unimodular.

function: [R, Z] = LLL(B)
1: Compute the QR factorization B = Q_1 R by GSO
2: k := 2, Z := I_n
3: while k ≤ n do
4:   if |r_{k−1,k}| / |r_{k−1,k−1}| > 1/2 then
5:     ζ := ⌊r_{k−1,k} / r_{k−1,k−1}⌉   // Reduce r_{k−1,k}
6:     R(1 : k−1, k) := R(1 : k−1, k) − ζ R(1 : k−1, k−1)
7:     Z(1 : n, k) := Z(1 : n, k) − ζ Z(1 : n, k−1)
8:   end if
9:   if δ r_{k−1,k−1}^2 > r_{kk}^2 + r_{k−1,k}^2 then   // δ is a parameter chosen in (1/4, 1)
10:    Interchange columns k−1 and k of R
11:    Interchange columns k−1 and k of Z
12:    Triangularize R: R := G_{k−1,k} R
13:    if k > 2 then
14:      k := k − 1
15:    end if
16:  else
17:    // Size-reduction
18:    for i = k−2 : −1 : 1 do
19:      ζ := ⌊r_{i,k} / r_{ii}⌉
20:      R(1 : i, k) := R(1 : i, k) − ζ R(1 : i, i), Z(1 : n, k) := Z(1 : n, k) − ζ Z(1 : n, i)
21:    end for
22:    k := k + 1
23:  end if
24: end while
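A compact floating point sketch of Algorithm 2.1 in NumPy is given below (an illustration, not the thesis's implementation; it uses `numpy.linalg.qr` for the initial factorization instead of GSO, and the function and variable names are ours):

```python
import numpy as np

def lll_reduce(B, delta=0.75):
    """Floating point LLL reduction sketch: returns (R, Z) with R upper
    triangular and LLL reduced, Z unimodular, and B Z = Q1 R."""
    m, n = B.shape
    R = np.linalg.qr(B, mode="r").copy()
    Z = np.eye(n, dtype=int)

    def reduce_entry(i, k):
        # IGT: subtract zeta * column i from column k of R and Z
        zeta = int(round(R[i, k] / R[i, i]))
        if zeta:
            R[: i + 1, k] -= zeta * R[: i + 1, i]
            Z[:, k] -= zeta * Z[:, i]

    k = 1  # 0-based; corresponds to k = 2 in the listing
    while k < n:
        reduce_entry(k - 1, k)
        if delta * R[k - 1, k - 1] ** 2 > R[k, k] ** 2 + R[k - 1, k] ** 2:
            # permute columns k-1 and k, then re-triangularize rows k-1, k
            R[:, [k - 1, k]] = R[:, [k, k - 1]]
            Z[:, [k - 1, k]] = Z[:, [k, k - 1]]
            a, b = R[k - 1, k - 1], R[k, k - 1]
            nu = np.hypot(a, b)
            G = np.array([[a, b], [-b, a]]) / nu
            R[k - 1 : k + 1, k - 1 :] = G @ R[k - 1 : k + 1, k - 1 :]
            R[k, k - 1] = 0.0
            k = max(k - 1, 1)
        else:
            for i in range(k - 2, -1, -1):
                reduce_entry(i, k)
            k += 1
    return R, Z
```

On exit, R satisfies the size-reduction and Lovasz conditions up to rounding errors, and the relation B Z = Q_1 R can be checked through the Gram matrix identity (BZ)^T(BZ) = R^T R.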
2.2.3
Complexity Analysis
Assume that the operations used in the algorithm are performed in exact arithmetic. The complexity of Algorithm 2.1 is measured by the number of arithmetic
operations. Part of the results of the complexity analysis will be used in Chapter 3
and Chapter 4. The QR factorization by GSO takes O(mn2 ) arithmetic operations
[16, Section 5.2]. Next, we give the analysis of the complexity of the while loop in
the LLL reduction algorithm. By adding the complexity of QR factorization and the
while loop together, we get the complexity of the LLL reduction algorithm.
For the complexity of the while loop, we would like to first determine the number
of loops and then count the number of arithmetic operations in each loop.
Lemma 2.1 ([22]): Let β = max_j ||b_j||, and let λ = min_{x ∈ Z^n \ {0}} ||Bx|| be the
length of a shortest nonzero vector of lattice L(B). The number of permutations involved
in Algorithm 2.1 is bounded by O(n^3 + n^2 log_{1/δ}(β/λ)) and the algorithm converges.

Proof. We use the proof from [22] and [44, Chapter 3].

After the Gram-Schmidt QR factorization, we obtain the QR factors Q_1 and R in
the QR factorization B = Q_1 R. Let R^{(p)} denote the upper triangular matrix R after
the p-th permutation (R^{(0)} = R). Define the quantities w_i and W after the p-th
permutation as

    w_i^{(p)} = Π_{j=1}^{i} (r_{jj}^{(p)})^2,   i = 1, 2, ..., n,    (2.5)

and

    W^{(p)} = Π_{i=1}^{n} w_i^{(p)}.    (2.6)

Suppose the p-th permutation is applied to columns q−1 and q of matrix R^{(p−1)}
and the orthogonal transformation by GSO is applied to keep the upper triangular
structure as described in the algorithm. We obtain the matrix R^{(p)} with the following
features:

    r_{jj}^{(p)} = r_{jj}^{(p−1)},  j ≠ q−1, q,

    |r_{q−1,q−1}^{(p)} r_{qq}^{(p)}| = |r_{q−1,q−1}^{(p−1)} r_{qq}^{(p−1)}|.

And by the permutation criterion (see line 9 of Algorithm 2.1) obtained from Eq.(2.2),
we have

    (r_{q−1,q−1}^{(p)})^2 < δ (r_{q−1,q−1}^{(p−1)})^2.

Then from Eq.(2.5) we obtain

    w_i^{(p)} = w_i^{(p−1)},  i ≠ q−1,   w_{q−1}^{(p)} < δ w_{q−1}^{(p−1)},    (2.7)

and therefore from Eq.(2.6)

    W^{(p)} < δ W^{(p−1)}.    (2.8)

Since W^{(p)} < δ^p W^{(0)}, the number of permutations p satisfies

    p ≤ log_{1/δ} W^{(0)} − log_{1/δ} W^{(p)} = log_{1/δ} Π_{i=1}^{n} w_i^{(0)} − log_{1/δ} Π_{i=1}^{n} w_i^{(p)}.

Since β = max_j ||b_j|| and ||b_j||^2 ≥ (r_{jj}^{(0)})^2, we have (r_{jj}^{(0)})^2 ≤ β^2 for j = 1, 2, ..., n. Thus
from Eq.(2.5)

    w_i^{(0)} ≤ β^{2i}.    (2.9)

For a lower bound on w_i^{(p)}, let B̃^{(p)} denote the basis matrix corresponding to R^{(p)}.
Then for i = 1, 2, ..., n,

    λ^2 = min_{x ∈ Z^n \ {0}} ||Bx||^2 = min_{x ∈ Z^n \ {0}} ||B̃^{(p)} x||^2
        ≤ min_{x(1:i) ∈ Z^i \ {0}} ||B̃^{(p)}(:, 1:i) x(1:i)||^2
        ≤ (4/3)^{(i−1)/2} |det(B̃^{(p)}(:, 1:i)^T B̃^{(p)}(:, 1:i))|^{1/i}    (2.10)
        = (4/3)^{(i−1)/2} |det(R^{(p)}(:, 1:i)^T R^{(p)}(:, 1:i))|^{1/i}
        = (4/3)^{(i−1)/2} (w_i^{(p)})^{1/i}   (see Eq.(2.5)).

Then it follows that

    w_i^{(p)} ≥ (3/4)^{i(i−1)/2} λ^{2i}.    (2.11)

Combining Eq.(2.9) and Eq.(2.11),

    p ≤ log_{1/δ} Π_{i=1}^{n} β^{2i} − log_{1/δ} Π_{i=1}^{n} (3/4)^{i(i−1)/2} λ^{2i}
      = (n + 1) n log_{1/δ}(β/λ) + (1/6)(n^3 − n) log_{1/δ}(4/3).

So Algorithm 2.1 involves at most O(n^3 + n^2 log_{1/δ}(β/λ)) permutations and the algorithm
converges.
We should note that the bound on the number of permutations from the lemma
applies to all kinds of LLL reduction algorithms, provided they share the same permutation
criterion as Algorithm 2.1.
In Algorithm 2.1, k is either increased or decreased by 1 in each iteration of the
while loop. Since each iteration in which k is decreased contains a column permutation,
there are p iterations in which k is decreased. The algorithm starts from k = 2 and
ends when k = n + 1, so the number of iterations in which k is increased equals
p + n − 1. Thus there are 2p + n − 1 iterations in total, which is bounded by
O(n^3 + n^2 log_{1/δ}(β/λ)). Each iteration costs O(n^2) arithmetic operations in the worst
case. So the whole algorithm takes at most O(mn^2 + n^5 + n^4 log_{1/δ}(β/λ)) arithmetic
operations.
2.3 Partial LLL Reduction Algorithm

The effective LLL (ELLL) reduction algorithm was proposed by Ling and
Howgrave-Graham [23], and later the so-called partial LLL (PLLL) reduction algorithm
was developed by Xie, Chang and Borno [43]. Both algorithms are more efficient
than Algorithm 2.1. The ELLL reduction algorithm is essentially identical to Algorithm 2.1
with lines 17-21, which reduce the off-diagonal entries of R except the
super-diagonal ones, removed. It has lower computational complexity than LLL,
while it has the same effect on the performance of the Babai integer point as LLL.
[43] shows algebraically that the size-reduction condition of the LLL reduction has
no effect on a typical sphere decoding (SD) search process for solving an integer least
squares (ILS) problem. Thus it has no effect on the performance of the Babai integer
point, the first integer point found in the search process. The PLLL reduction is proposed
to avoid the numerical stability problem with ELLL, and to avoid some unnecessary
size-reductions involved in LLL and ELLL. Both PLLL and ELLL can compute
LLL reduced bases by adding an extra size-reduction procedure at the end of the
algorithms. The following part gives a description of the PLLL reduction.
2.3.1 Householder QR Factorization with Minimum Column Pivoting
The typical LLL algorithm first finds the QR factorization of the given matrix
B. In the original LLL algorithm, the Gram-Schmidt method is adopted for computing
the QR factorization. However, the Householder method without forming the
orthogonal factor Q, which costs (4/3)mn^2 flops, is more efficient than the Gram-Schmidt
method, which costs 2mn^2 flops [16]. The Householder method requires square root
operations, so it is not suitable for the exact integer LLL reduction. The floating
point LLL reduction, however, has no problem with computing a square root, so it can use
Householder transformations to compute the QR factorization.

The PLLL reduction uses the Householder QR factorization with minimum column
pivoting (QRMCP) instead of the classic Householder QR factorization. In
general, the number of permutations is a crucial factor in the cost of the whole LLL
reduction process. If one can make the upper triangular factor close to an LLL reduced
one in the QR factorization stage, the number of permutations in the later
stage is likely to decrease. The minimum column pivoting strategy is used to help
achieve the Lovasz condition, see [44, Section 4.1].
From Eq.(2.1) and Eq.(2.2), we can easily obtain

    (δ − 1/4) r_{i−1,i−1}^2 ≤ r_{ii}^2,  1 < i ≤ n,  δ ∈ (1/4, 1),    (2.12)

which indicates that the magnitudes of the diagonal entries of an LLL reduced R
tend not to decrease quickly. The QRMCP computes the factorization

    B P = Q_1 R,    (2.13)

where P is a permutation matrix, R ∈ R^{n×n} is upper triangular, and Q_1 ∈ R^{m×n} has
orthonormal columns and is formed implicitly from the product
of n Householder transformations.
The algorithm is given as follows.
Algorithm 2.2. (Householder QR Factorization with Minimum Column Pivoting)
Suppose B ∈ R^{m×n} has full column rank. This algorithm computes the factorization
Q_1 R = BP, where Q_1 has orthonormal columns, R ∈ R^{n×n} is upper triangular and
P is a permutation matrix. The matrix B is overwritten in the computation.

function: [R, P] = QRMCP(B)
1: P := I_n
2: l_j := ||B(1 : m, j)||^2, j = 1 : n
3: for i = 1 : n do
4:   q := arg min_{i≤j≤n} l_j
5:   if q > i then
6:     Interchange columns i and q of B and of P
7:     Interchange l_i and l_q
8:   end if
9:   Compute the Householder transformation H_i to zero B(i+1 : m, i)
10:  B := H_i B
11:  l_j := l_j − B(i, j)^2, j = i+1, i+2, ..., n
12: end for
13: R := B(1 : n, 1 : n)
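A NumPy sketch of the algorithm above is given below (an illustration, not the thesis's implementation; it returns the pivot order as an index vector instead of forming P, and the names are ours):

```python
import numpy as np

def qrmcp(B):
    """Householder QR with minimum column pivoting (Algorithm 2.2 sketch).

    Returns (R, piv) with B[:, piv] = Q1 @ R for some Q1 with orthonormal
    columns; at each step the remaining column of minimum 2-norm is moved
    to the front. Q1 is not formed."""
    B = B.astype(float).copy()
    m, n = B.shape
    piv = np.arange(n)
    l = np.sum(B * B, axis=0)           # squared column norms
    for i in range(n):
        q = i + np.argmin(l[i:])        # minimum column pivoting
        if q > i:
            B[:, [i, q]] = B[:, [q, i]]
            piv[[i, q]] = piv[[q, i]]
            l[[i, q]] = l[[q, i]]
        # Householder vector u zeroing B[i+1:, i]
        x = B[i:, i].copy()
        alpha = -np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
        u = x.copy()
        u[0] -= alpha
        beta = 2.0 / (u @ u)
        B[i:, i:] -= beta * np.outer(u, u @ B[i:, i:])
        l[i + 1 :] -= B[i, i + 1 :] ** 2   # downdate squared norms
    return np.triu(B[:n, :]), piv
```

The factorization can be verified through the Gram matrix identity (BP)^T(BP) = R^T R, which is insensitive to the sign conventions of the Householder reflections.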
2.3.2 Partial Size-Reduction and Givens Rotation Updating

After the QRMCP, the PLLL reduction performs permutations, IGTs and Givens
rotations on R in an efficient and numerically stable way. In the k-th column of R,
PLLL checks if it needs to permute columns k and k−1 according to the Lovasz
condition Eq.(2.2). If the Lovasz condition holds, then the permutation will not
occur, no IGT will be applied, and the algorithm moves to column k + 1. If the
Lovasz condition does not hold, r_{k−1,k} is reduced by an IGT; IGTs are also applied to
r_{k−2,k}, ..., r_{1,k} for stability considerations. Then PLLL performs the permutation and
the Givens rotation, and moves back to the previous column.
In PLLL, Givens rotations instead of GSO are used to do the triangularization
after permutations (cf. line 12 of Algorithm 2.1). Define the Givens rotation matrix as

    G = [c s; −s c],

where

    c = r_{k−1,k} / (r_{k−1,k}^2 + r_{kk}^2)^{1/2},   s = r_{kk} / (r_{k−1,k}^2 + r_{kk}^2)^{1/2}.

After columns k−1 and k are permuted, applying G to rows k−1 and k of R zeros
the new (k, k−1) entry:

    G [r_{k−1,k}; r_{kk}] = [(r_{k−1,k}^2 + r_{kk}^2)^{1/2}; 0].

The PLLL algorithm is given as follows.
Algorithm 2.3. (PLLL Reduction) Suppose B ∈ R^{m×n} has full column rank. This
algorithm computes the PLLL reduction of B: B = Q_1 R Z^{-1}, where Q_1 has orthonormal
columns, R is upper triangular and Z is unimodular. It computes IGTs only
when a column permutation occurs.

function: [R, Z] = PLLL(B)
1: Compute [R, P] = QRMCP(B)
2: Set Z := P, k := 2
3: while k ≤ n do
4:   ζ := ⌊r_{k−1,k} / r_{k−1,k−1}⌉
5:   α := r_{k−1,k} − ζ r_{k−1,k−1}
6:   if δ r_{k−1,k−1}^2 > α^2 + r_{kk}^2 then   // δ is a parameter chosen in (1/4, 1)
7:     // Size-reduce R(1 : k−1, k)
8:     for l = k−1 : −1 : 1 do
9:       ζ := ⌊r_{l,k} / r_{ll}⌉
10:      R(1 : l, k) := R(1 : l, k) − ζ R(1 : l, l), Z(1 : n, k) := Z(1 : n, k) − ζ Z(1 : n, l)
11:    end for
12:    // Column permutation and updating
13:    c := r_{k−1,k} / (r_{k−1,k}^2 + r_{kk}^2)^{1/2}
14:    s := r_{kk} / (r_{k−1,k}^2 + r_{kk}^2)^{1/2}
15:    G := [c s; −s c]
16:    Interchange columns k−1 and k of R and of Z
17:    R(k−1 : k, k−1 : n) := G R(k−1 : k, k−1 : n)
18:    if k > 2 then
19:      k := k − 1
20:    end if
21:  else
22:    k := k + 1
23:  end if
24: end while
Notice that the final matrix R obtained by the PLLL reduction algorithm is
not fully size-reduced, since the algorithm performs size-reduction only when a
permutation follows immediately. However, we can easily add an extra size-reduction
procedure at the end of the PLLL reduction algorithm and transform R to an LLL
reduced matrix. We name the PLLL algorithm with an extra size-reduction procedure
PLLL+.

The PLLL reduction algorithm uses the same permutation criterion as the LLL
reduction algorithm, so it has the same upper bound on the number of permutations/loops
as the LLL reduction algorithm, which is O(n^3 + n^2 log_{1/δ}(β/λ)).

For each loop, the PLLL reduction algorithm costs O(n^2) arithmetic operations
in the worst case. The Householder QR costs O(mn^2) flops [16, Section 5.2]. So
the PLLL algorithm takes at most O(mn^2 + n^5 + n^4 log_{1/δ}(β/λ)) arithmetic operations,
which is the same as the complexity bound of the LLL reduction algorithm. The
simulation results of PLLL in [43] show that it is faster and more stable than the
LLL reduction.
CHAPTER 3
Block LLL Reduction Algorithms
The blocking technique has been widely used to speed up conventional matrix
algorithms on today's high performance computers. The key to achieving high
performance on computers with a memory hierarchy is to recast the algorithms
in terms of matrix-vector and matrix-matrix operations to permit efficient reuse of
data residing in cache or local memory. The blocking technique partitions a
big matrix into small blocks, and performs matrix-matrix operations implemented
in level 3 basic linear algebra subprograms (BLAS) as much as possible [14]. The
matrix-matrix operations implemented in level 3 BLAS are more efficient than the
matrix-vector operations implemented in level 2 BLAS or the vector-vector operations
implemented in level 1 BLAS. The level 3 BLAS can greatly reduce the movement
of data between memory and registers, which can be as costly as arithmetic
operations on the data in matrix algorithms.
In this chapter, we first explain how to apply the blocking technique to the components of the partial LLL (PLLL) reduction algorithm. Then we propose two block
LLL reduction algorithms with different matrix partition strategies, and compare
their speed and stability with the original LLL reduction algorithm and the PLLL
reduction algorithm introduced in Chapter 2.
3.1 Block Subroutines

In this section, we present a block size-reduction algorithm named BSR, a variant
of the PLLL reduction algorithm named Local-PLLL, and a block partial size-reduction
algorithm named BPSR. They will be used as subroutines of the block LLL reduction
algorithms. Local-PLLL is suited to computing the PLLL reduction of blocks of the
basis matrix. The block partial size-reduction algorithm uses the efficient size-reduction
strategy proposed in the PLLL reduction algorithm.
3.1.1 Block Householder QR Factorization with Minimum Column Pivoting

The block Householder QR factorization with minimum column pivoting (BQRMCP)
computes the same factorization as Algorithm 2.2:

    Q^T B P = [R; 0],    (3.1)

where B ∈ R^{m×n}, P is a permutation matrix, R ∈ R^{n×n} is upper triangular, and
Q ∈ R^{m×m} is orthogonal. The matrix Q^T is the product of n Householder transformations:

    Q^T = H_n ··· H_2 H_1,    (3.2)

    H_i = I_m − τ_i u_i u_i^T,  i = 1, 2, ..., n,    (3.3)

where τ_i = 2/(u_i^T u_i), u_i = [0; ū_i] ∈ R^m with ū_i ∈ R^{m−i+1} a Householder vector, and
H_i ∈ R^{m×m} is the Householder transformation matrix which zeros B(i+1 : m, i).
The permutation matrix P is the product of n permutations:

    P = P_1 P_2 ··· P_n,

where P_i (i = 1, 2, ..., n) is the permutation matrix which interchanges the i-th
column and another column in B(1 : m, i : n) such that the 2-norm of B(i : m, i) is
minimum.
In order to explain the block QR implementation, we define B^{(i)} as the value of
B after i Householder transformations and i permutations, i.e.,

    B^{(i)} = H_i ··· H_2 H_1 B P_1 P_2 ··· P_i.    (3.4)

Here we want to point out that B^{(i)} will not be formed in the i-th step of the block
algorithm; it is used only for explanation of the algorithm.
The product of the first i Householder transformations is written in the compact
WY form:

    H_i ··· H_2 H_1 = I_m − Y_i T_i Y_i^T,    (3.6)

where

    Y_i = [u_1, u_2, ..., u_i] ∈ R^{m×i},    (3.7)

    T_i = [T_{i−1} 0; h_i^T τ_i] ∈ R^{i×i},  with h_i^T = −τ_i u_i^T Y_{i−1} T_{i−1}.    (3.8)

Let B̃^{(i)} = B P_1 P_2 ··· P_i denote B after the first i permutations but without the
Householder transformations applied, and define

    F_i^T = T_i Y_i^T B̃^{(i)} ∈ R^{i×n}.    (3.9)

From Eq.(3.6), the fully updated matrix satisfies B^{(i)} = B̃^{(i)} − Y_i F_i^T, and F_i^T
can be accumulated recursively:

    F_1^T = τ_1 u_1^T B̃^{(1)},
    F_i^T = [ F_{i−1}^T P_i ; τ_i u_i^T B̃^{(i)} − τ_i u_i^T Y_{i−1} F_{i−1}^T P_i ].    (3.10)
In the first step of the block algorithm, the squared column norms are computed:

    l_j := ||B(1 : m, j)||^2,  j = 1, 2, ..., n.

Utilizing l, a column in B with minimum 2-norm is permuted with the first column
by the permutation matrix P_1 (actually P_1 is not formed explicitly). Then we use
the Householder transformation H_1 to zero B(2 : m, 1). At this moment, unlike
Algorithm 2.2, we do not apply H_1 to the other columns of B. However, the first row of
B must be updated in order to downdate the squared column norms:

    l_j := l_j − B(1, j)^2,  j = 2, ..., n,    (3.11)

which will be used in the next step for minimum column pivoting. In order to
update the first row, we form the following matrices (actually they are vectors)
using Eq.(3.6) and Eq.(3.10):

    Y_1 := u_1,   F_1^T := τ_1 u_1^T B(1 : m, 1 : n).
Notice that B(1 : m, 2 : n) stored in memory is equivalent to B̃^{(1)}(1 : m, 2 : n) given
in Eq.(3.10). From Eq.(3.6) and Eq.(3.9), the first row of B except the first entry is updated as
follows:

    B(1, 2 : n) := B(1, 2 : n) − Y_1(1, 1) F_1^T(1, 2 : n).

Then the squared column norms are downdated using Eq.(3.11). Thus at the end
of the first step, the first row and the first column have been updated, and the rest
of B will be updated later.
In the second step, utilizing the vector l of the squared column norms, we apply
P_2 to permute the second column of B with a column, say column p, 2 ≤ p ≤ n, such
that the 2-norm of B(2 : m, 2) is minimum, and we permute the second column of
F_1^T with its p-th column (i.e., F_1^T := F_1^T P_2). Then from Eq.(3.6) and Eq.(3.9) the second column
B(2 : m, 2) is updated by the first Householder transformation H_1:

    B(2 : m, 2) := B(2 : m, 2) − Y_1(2 : m, 1) F_1^T(1, 2).

After this update, we apply the Householder transformation H_2 to zero B(3 : m, 2).
As in step 1, we do not use H_2 to update the remaining columns of B at this moment.
But we need to update the second row of B, because it will be used to compute the
2-norms of the columns of B(3 : m, 3 : n). In order to perform the update, Y_2 and F_2
are formed by accumulating H_2 into Y_1 and F_1 using Eq.(3.6) and Eq.(3.10):

    Y_2 := [Y_1, u_2],
    F_2^T(1 : 2, 3 : n) := [ F_1^T(1, 3 : n) ; τ_2 u_2^T B(1 : m, 3 : n) − τ_2 u_2^T Y_1 F_1^T(1, 3 : n) ].
Note that here F_1^T has been permuted by P_2. Then we update the second row of B
except the first two entries:

    B(2, 3 : n) := B(2, 3 : n) − Y_2(2, 1 : 2) F_2^T(1 : 2, 3 : n),

and compute the squared column norms of B(3 : m, 3 : n):

    l_j := l_j − B(2, j)^2,  j = 3, ..., n.

At the end of the second step, the first two rows and the first two columns have been
updated.
Now assume we are in the i-th step of transforming the first block of B to an
upper triangular matrix. The first i−1 columns of B have been triangularized and
the first i−1 rows have been updated, while the rest of the matrix B is waiting
to be updated. We first permute the i-th column with a column in B(1 : m, i : n)
such that the 2-norm of B(i : m, i) is minimum, and we permute the corresponding
columns of F_{i−1}^T (i.e., F_{i−1}^T := F_{i−1}^T P_i). Then we update the i-th column
B(i : m, i), apply the Householder transformation H_i to zero B(i+1 : m, i), and
accumulate:

    Y_i := [Y_{i−1}, u_i],
    F_i^T(1 : i, i+1 : n) := [ F_{i−1}^T(1 : i−1, i+1 : n) ;
        τ_i u_i^T B(1 : m, i+1 : n) − τ_i u_i^T Y_{i−1} F_{i−1}^T(1 : i−1, i+1 : n) ].

Then we update the i-th row B(i, i+1 : n) and downdate the squared column norms:

    B(i, i+1 : n) := B(i, i+1 : n) − Y_i(i, 1 : i) F_i^T(1 : i, i+1 : n),

    l_j := l_j − B(i, j)^2,  j = i+1, ..., n.
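The compact WY recurrences Eq.(3.6)-(3.8) can be checked numerically. The sketch below (ours, not the thesis's code; for clarity it applies each H_i to the whole matrix rather than deferring updates as the block algorithm does) accumulates Y_i and T_i and lets one verify that I − Y T Y^T equals H_n ··· H_1:

```python
import numpy as np

def householder(x):
    """Return (tau, u) with (I - tau*u*u^T) x = alpha*e1."""
    u = x.astype(float).copy()
    alpha = -np.sign(u[0] if u[0] != 0 else 1.0) * np.linalg.norm(u)
    u[0] -= alpha
    return 2.0 / (u @ u), u

def compact_wy(B):
    """Accumulate H_i...H_1 = I - Y T Y^T column by column (Eq. 3.6-3.8)."""
    m, n = B.shape
    A = B.astype(float).copy()
    Y = np.zeros((m, 0))
    T = np.zeros((0, 0))
    for i in range(n):
        tau, ub = householder(A[i:, i])
        u = np.zeros(m)
        u[i:] = ub
        A -= tau * np.outer(u, u @ A)       # A := H_i A (unblocked, for clarity)
        # H_i (I - Y T Y^T) = I - [Y u] [[T, 0], [h^T, tau]] [Y u]^T
        h = -tau * (T.T @ (Y.T @ u))        # h^T = -tau u_i^T Y_{i-1} T_{i-1}
        T = np.block([[T, np.zeros((i, 1))],
                      [h[None, :], np.array([[tau]])]])
        Y = np.column_stack([Y, u])
    return A, Y, T
```

In the block algorithm, I − Y T Y^T is never formed explicitly; only its action via Y and F^T = T Y^T B̃ is applied, one block of rows and columns at a time.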
Algorithm 3.1. (Block Householder QR Factorization with Minimum Column Pivoting)
Suppose B ∈ R^{m×n} has full column rank, and k is the chosen block size, which
is a factor of n for simplification. This algorithm computes the QR factorization
Q_1 R = BP, where Q_1 has orthonormal columns and P is a permutation matrix. Note
that the matrix B is overwritten by R in the computation.

function: [R, P] = BQRMCP(B, k)
1: P := I_n, m̄ := m, n̄ := n
2: l_j := ||B(1 : m, j)||^2, j = 1 : n
3: for j = 1 : k : n do
4:   Y(1 : m̄, 1 : k) := 0, F(1 : n̄, 1 : k) := 0
5:   for j̄ = 1 : k do
6:     i := j + j̄ − 1, q := arg min_{i≤p≤n} l_p
7:     if q > i then
8:       Interchange columns i and q of B and of P, interchange l_i and l_q, and interchange the corresponding rows of F
9:     end if
10:    B(i : m, i) := B(i : m, i) − Y(j̄ : m̄, 1 : j̄−1) F(j̄, 1 : j̄−1)^T
11:    Compute the Householder transformation H_i (i.e., τ_i and u_i) to zero B(i+1 : m, i)
12:    Y(j̄ : m̄, j̄) := u_i(i : m)
13:    F(j̄+1 : n̄, j̄) := τ_i B(1 : m, i+1 : n)^T u_i
14:    F(1 : n̄, j̄) := F(1 : n̄, j̄) − τ_i F(1 : n̄, 1 : j̄−1) Y(j̄ : m̄, 1 : j̄−1)^T u_i(i : m)
15:    B(i, i+1 : n) := B(i, i+1 : n) − Y(j̄, 1 : j̄) F(j̄+1 : n̄, 1 : j̄)^T
16:    l_p := l_p − B(i, p)^2, p = i+1, ..., n
17:  end for
18:  B(j+k : m, j+k : n) := B(j+k : m, j+k : n) − Y(k+1 : m̄, 1 : k) F(k+1 : n̄, 1 : k)^T
19:  m̄ := m̄ − k, n̄ := n̄ − k
20: end for
21: R := B(1 : n, 1 : n)
3.1.2 Block Size-Reduction

The size-reduction computes U = RZ, where U is size-reduced and
Z is the product of a sequence of IGTs, which have the form I_n − ζ e_i e_j^T (1 ≤ i <
j ≤ n), where ζ is an integer (see Section 2.2.1).
In the block size-reduction, the upper triangular matrices U, R and Z are partitioned
into d × d blocks of size k (d = n/k):

    [U_11 ··· U_1d; ⋱ ⋮; U_dd] = [R_11 ··· R_1d; ⋱ ⋮; R_dd] [Z_11 ··· Z_1d; ⋱ ⋮; Z_dd].

The blocks of R are size-reduced in an order similar to the order used in the conventional
size-reduction. Thus the blocks R_ij are reduced by IGTs, in the order of
i = j : −1 : 1 and j = 1 : d.
Let us use an example to illustrate the block size-reduction procedure. Assume
d = 2; the following block size-reduction is desired:

    [U_11 U_12; 0 U_22] = [R_11 R_12; 0 R_22] [Z_11 Z_12; 0 Z_22].    (3.12)

The blocks R_11, R_22 and R_12 are reduced one by one in 3 steps.

In step 1, R_11 is reduced by applying IGTs to it as shown in Section 2.2.1, i.e.,
U_11 = R_11 Z_11, where U_11 is size-reduced and Z_11 is formed by these IGTs. Thus, after
step 1 the matrix R becomes

    [U_11 R_12; 0 R_22] = [R_11 R_12; 0 R_22] [Z_11 0; 0 I_k].    (3.13)

In step 2, R_22 is reduced by applying IGTs to it, i.e., U_22 = R_22 Z_22, where U_22 is
size-reduced and Z_22 is formed by these IGTs. Thus, the matrix R becomes

    [U_11 R_12 Z_22; 0 U_22] = [U_11 R_12; 0 R_22] [I_k 0; 0 Z_22].    (3.14)

In step 3, the off-diagonal block R̄_12 = R_12 Z_22 is reduced using U_11, i.e.,
U_12 = R̄_12 + U_11 Z̄_12 for an integer matrix Z̄_12 chosen so that

    [U_11 U_12; 0 U_22] = [U_11 R̄_12; 0 U_22] [I_k Z̄_12; 0 I_k],    (3.15)

where all the entries of U_12 are size-reduced. Therefore from Eq.(3.13), Eq.(3.14)
and Eq.(3.15), Eq.(3.12) holds, where

    Z_12 = Z_11 Z̄_12.

Notice that Z_12 = Z_11 Z̄_12 can be obtained by a matrix-matrix operation.
The block size-reduction algorithm is given as follows.
Algorithm 3.2. (Block Size-Reduction) Given an upper triangular matrix R ∈ R^{n×n} and a block size k, this algorithm computes a size-reduced matrix U = RZ, where U is upper triangular and Z is unimodular. In the computation, the matrix R is overwritten by U. We use A_{i1:i2, j} to denote the sub-matrix formed by block rows i1 to i2 in the j-th block column of A.
function: [U, Z] = BSR(R, k)
1:  Z := I_n, d := n/k
2:  for j = 1 : d do
3:    for i = j : −1 : 1 do
4:      if i = j then
5:–6:     (size-reduce the diagonal block R_{jj} by IGTs)
7:–9:     (otherwise, size-reduce the off-diagonal block R_{ij})
10:     end if
11:   end for
12:–13: (accumulate the IGTs of block column j into Z)
      end for
14: U := R
3.1.3 Local-PLLL Reduction

Consider the upper triangular matrix R partitioned into d′ × d′ blocks of size k′:

R = [ R11 ... R1d′ ; ... ; Rd′d′ ] ∈ R^{n×n},  Rij ∈ R^{k′×k′},  1 ≤ i ≤ j ≤ d′.
Rlocal = [ R_{i,i}  R_{i,i+1} ; 0  R_{i+1,i+1} ],  1 ≤ i ≤ d′ − 1.
The Local-PLLL reduction computes the PLLL reduction of Rlocal:

Rlocal = Ql Rl Zl^{−1},

where Ql ∈ R^{k×k} is orthogonal, Zl ∈ Z^{k×k} is unimodular and Rl ∈ R^{k×k} is PLLL reduced.
The Local-PLLL reduction algorithm is a variant of the PLLL reduction algorithm described in Section 2.3. Since the Local-PLLL reduction is applied to a sub-matrix Rlocal instead of the whole matrix, four modifications are made to PLLL in order to suit the structure of Rlocal.

First, Rlocal is already upper triangular, so the initial QR factorization in PLLL is not needed in Local-PLLL.

Second, in PLLL, when IGTs are required, they are applied to all the entries in a column of R for stability considerations. In Local-PLLL, Rlocal is part of the matrix R, and the Local-PLLL reduction algorithm can only access the columns of Rlocal, which are parts of the columns of R. For stability considerations, if some columns of Rlocal are size-reduced by Local-PLLL, the other parts of these columns in R should also be size-reduced. So the Local-PLLL subroutine records which columns are size-reduced by IGTs in a vector c and returns c, so that the other parts of those columns can be reduced later.
Third, as a subroutine of the block LLL reduction algorithms, before applying Local-PLLL the first half of Rlocal may already be PLLL reduced (see details in Section 3.2). If the first half of Rlocal is PLLL reduced, it is more efficient to start the Local-PLLL reduction with column k′ + 1 of Rlocal instead of column 2 as in PLLL. Thus, a parameter f is introduced to indicate whether the first half of Rlocal is PLLL reduced.

Fourth, the Local-PLLL reduction algorithm must form the orthogonal factor Ql of the PLLL reduction of Rlocal in order to update other blocks of R, while the PLLL reduction algorithm does not form the orthogonal factor Q, for efficiency.
The Local-PLLL reduction algorithm is given as follows.
Algorithm 3.3. (Local-PLLL Reduction) Given an upper triangular matrix Rlocal ∈ R^{k×k} and a scalar f, this algorithm computes the PLLL reduction Rlocal = Ql Rl Zl^{−1} and returns a vector c ∈ R^k storing the indexes of the size-reduced columns. If f = 0, the first half of Rlocal is not PLLL reduced.
function: [Ql, Rl, Zl, c] = Local-PLLL(Rlocal, f)
1:  if f = 0 then
2:    i := 2
3:  else
4:    i := k/2 + 1
5:  end if
6:  Ql := Ik, Zl := Ik, c := 0
7:  while i ≤ k do
8:    ζ := ⌊r_{i−1,i}/r_{i−1,i−1}⌉
9:    r̄ := r_{i−1,i} − ζ r_{i−1,i−1}
10:   if r̄² + r²_{i,i} < δ r²_{i−1,i−1} then      // δ is a parameter chosen in (1/4, 1)
11:     c_i := 1
12:     for l = i − 1 : −1 : 1 do
13:       ζ := ⌊r_{l,i}/r_{l,l}⌉
14:       Rlocal(1:l, i) := Rlocal(1:l, i) − ζ Rlocal(1:l, l)
15:       Zl(1:k, i) := Zl(1:k, i) − ζ Zl(1:k, l)
16:     end for
        // Column permutation and updating
17:     (interchange the entries c_{i−1} and c_i)
18:     c̄ := r_{i−1,i} / (r²_{i−1,i} + r²_{i,i})^{1/2},  s̄ := r_{i,i} / (r²_{i−1,i} + r²_{i,i})^{1/2}
19:–20: G := [ c̄  s̄ ; −s̄  c̄ ]
21:–22: (interchange columns i − 1 and i of Rlocal and Zl)
23:     Rlocal(i−1:i, i−1:k) := G Rlocal(i−1:i, i−1:k)
24:     Ql(1:k, i−1:i) := Ql(1:k, i−1:i) G^T
25:     if i > 2 then
26:       i := i − 1
27:–28: end if
      else
29:     i := i + 1
30:   end if
31: end while
32: Rl := Rlocal
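To make the control flow concrete, here is a hedged Python/NumPy sketch of the Local-PLLL loop for the f = 0 case, applied to a whole upper-triangular matrix; names and details are illustrative assumptions, not the thesis code. It maintains the invariant Q @ R = R0 @ Z throughout.

```python
import numpy as np

def local_plll(R0, delta=0.75):
    """Sketch of the Local-PLLL loop (f = 0): returns Q, R, Z, c with
    Q @ R == R0 @ Z, Z unimodular, and R satisfying the Lovasz condition."""
    k = R0.shape[0]
    R, Q, Z = R0.astype(float).copy(), np.eye(k), np.eye(k)
    c = np.zeros(k, dtype=int)
    i = 1
    while i < k:
        zeta = np.rint(R[i-1, i] / R[i-1, i-1])
        rbar = R[i-1, i] - zeta * R[i-1, i-1]
        if rbar**2 + R[i, i]**2 < delta * R[i-1, i-1]**2:
            c[i] = 1
            for l in range(i - 1, -1, -1):      # size-reduce column i
                zeta = np.rint(R[l, i] / R[l, l])
                R[:l+1, i] -= zeta * R[:l+1, l]
                Z[:, i] -= zeta * Z[:, l]
            # permute columns i-1 and i, then retriangularize with a Givens rotation
            R[:, [i-1, i]] = R[:, [i, i-1]]
            Z[:, [i-1, i]] = Z[:, [i, i-1]]
            d = np.hypot(R[i-1, i-1], R[i, i-1])
            cg, sg = R[i-1, i-1] / d, R[i, i-1] / d
            G = np.array([[cg, sg], [-sg, cg]])
            R[i-1:i+1, i-1:] = G @ R[i-1:i+1, i-1:]
            R[i, i-1] = 0.0
            Q[:, i-1:i+1] = Q[:, i-1:i+1] @ G.T
            i = max(i - 1, 1)
        else:
            i += 1
    return Q, R, Z, c
```

Every operation right-multiplies R0 and Z by the same unimodular matrix or left-multiplies R by a rotation absorbed into Q, which is why the invariant (and hence the backward-error check used later in the thesis) holds.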
3.1.4 Block Partial Size-Reduction
A block partial size-reduction algorithm is designed to coordinate with the Local-PLLL reduction algorithm. In BSR (Algorithm 3.2), all off-diagonal entries of the upper triangular matrix are checked for IGTs. This is not the case for the PLLL reduction, where the off-diagonal entries are reduced only when necessary. More specifically, if an IGT is applied to a super-diagonal entry of R, other IGTs are applied to the off-diagonal entries in the same column in order to prevent producing large numbers, which may cause numerical stability problems. Thus, only the entries in the columns affected by IGTs in Local-PLLL need to be reduced. Local-PLLL stores the information about those columns in c, so the block partial size-reduction (BPSR) algorithm can reduce only the marked columns by IGTs.
Given an upper triangular matrix R ∈ R^{n×n} which consists of d × d blocks (here we do not assume that each block has the same dimensions):

R = [ R11 ... R1d ; ... ; Rdd ] ∈ R^{n×n},  with blocks Rij, 1 ≤ i ≤ j ≤ d.
It has sub-matrices:

R̂ = [ R11 ... R1,i−1 ; ... ; R_{i−1,i−1} ],   R̄ = [ R_{1,i} ; R_{2,i} ; ... ; R_{i−1,i} ],   1 < i ≤ d,    (3.16)

where R̄ has k columns, R_{j,i} with 1 < j < i has k rows, and R_{1,i} may have either k or k/2 rows.
Given a vector c ∈ Z^k whose entries are either one or zero. For j = 1 : k, if c_j = 1 we perform size-reductions on column j of R̄ by applying IGTs to it, which involve R̂; if c_j = 0 we do nothing. After this, part of the entries of R̄ are size-reduced according to c:

R̄ := R̄ + R̂ Z̄,

where Z̄, which is formed by those IGTs, has the same dimension and block partition as R̄:

Z̄ = [ Z̄_{1,i} ; Z̄_{2,i} ; ... ; Z̄_{i−1,i} ],    (3.17)

and [ I  Z̄ ; 0  I ] is unimodular.
The BPSR algorithm is given as follows.

Algorithm 3.4. (Block Partial Size-Reduction) Given two sub-matrices R̄ and R̂ of R as in Eq. (3.16) and a vector c, this algorithm computes

R̄ := R̄ + R̂ Z̄,

where Z̄ has the block partition as in Eq. (3.17). We use A_{i1:i2, j} to denote the sub-matrix formed by block rows i1 to i2 in the j-th block column of A.

function: [R̄, Z̄] = BPSR(R̄, R̂, c)
1:  Z̄ := 0
2:  for l = i − 1 : −1 : 1 do
      // Partial size-reduction of block R̄_{l,i} by Z̄_{l,i}, involving R̂
3:    for j = 1 : k do
4:      if c_j = 1 then
5:        Size-reduce R̄_{l,i}(:, j):  R̄_{1:l,i}(:, j) := R̄_{1:l,i}(:, j) + R̂_{1:l,l} Z̄_{l,i}(:, j)
6:      end if
7:    end for
8:  end for
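A minimal model of the masked reduction (Python/NumPy, an illustrative assumption rather than the thesis code): only the columns flagged in c are reduced, and the accumulated Z̄ satisfies the update R̄ := R̄ + R̂ Z̄.

```python
import numpy as np

def bpsr(Rbar, Rhat, c):
    """Block partial size-reduction sketch: Rbar (m x kc) is a block column,
    Rhat (m x m) is the upper-triangular leading block; c flags the columns
    to reduce (only marked columns are touched)."""
    m, kc = Rbar.shape
    Zbar = np.zeros((m, kc))
    for j in range(kc):
        if c[j] != 1:
            continue
        for i in range(m - 1, -1, -1):      # bottom-up IGTs against diag(Rhat)
            zeta = np.rint(Rbar[i, j] / Rhat[i, i])
            Rbar[:, j] -= zeta * Rhat[:, i]
            Zbar[i, j] -= zeta
    return Rbar, Zbar
```

On return, Rbar equals its original value plus Rhat @ Zbar, every flagged column is size-reduced against the diagonal of Rhat, and unflagged columns are untouched.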
3.2

In the first part of this section, we present the left-to-right block LLL (LRBLLL) reduction algorithm, utilizing the subroutines introduced in the previous section, i.e., the block QR factorization (Algorithm 3.1), the block size-reduction (Algorithm 3.2), the Local-PLLL reduction algorithm (Algorithm 3.3) and the block partial size-reduction (Algorithm 3.4). The complexity analysis of LRBLLL is presented in the second part of this section.
3.2.1
The left-to-right block LLL reduction algorithm combines the blocking technique with the PLLL algorithm. It consists of 7 steps, as follows.
Step 1. Compute the block QR factorization (Algorithm 3.1) of the full column rank matrix B ∈ R^{m×n} with minimum column pivoting: BP = Q1 R.

Step 2. Partition the matrix R into d′ × d′ blocks with block size k′ (here for simplicity, we assume that n is a multiple of k′, i.e., n = d′k′, and that d′ is even, and define k = 2k′, d = d′/2):

R = [ R11 ... R1d′ ; ... ; Rd′d′ ] ∈ R^{n×n},  Rij ∈ R^{k′×k′},  1 ≤ i ≤ j ≤ d′.
where

Rlocal = [ R_{i,i}  R_{i,i+1} ; 0  R_{i+1,i+1} ],

Rright = [ R_{i,i+2}  R_{i,i+3}  ...  R_{i,d′} ; R_{i+1,i+2}  R_{i+1,i+3}  ...  R_{i+1,d′} ],

Rup = [ R_{1,i}  R_{1,i+1} ; R_{2,i}  R_{2,i+1} ; ... ; R_{i−1,i}  R_{i−1,i+1} ].
Step 5. Size-reduce Rup using the block partial size-reduction algorithm (Algorithm 3.4):

Rup := Rup + [ R11 ... R1,i−1 ; ... ; R_{i−1,i−1} ] Z̄update.
Step 6. Set r̄ := r_{(i−1)k′,(i−1)k′+1} − ⌊r_{(i−1)k′,(i−1)k′+1}/r_{(i−1)k′,(i−1)k′}⌉ r_{(i−1)k′,(i−1)k′}. Check whether the Lovász condition δ r²_{(i−1)k′,(i−1)k′} ≤ r̄² + r²_{(i−1)k′+1,(i−1)k′+1} holds for columns (i−1)k′ and (i−1)k′ + 1.
The matrix Z is partitioned into blocks in the same way as R. We use A_{i1:i2, j1:j2} to denote the sub-matrix formed by block rows i1 to i2 and block columns j1 to j2 of A.
function: [R, Z] = LRBLLL(B, k)
// Compute the block QR factorization using Algorithm 3.1
1:  [R, P] = BQRMCP(B, k), Z := P
2:  i := 1, k′ := k/2, d′ := 2n/k, f := 0
3:  while i < d′ do
      // PLLL reduction of Rii using Algorithm 3.3
4:    [Q̄, R_{i:i+1,i:i+1}, Z̄, c] = Local-PLLL(R_{i:i+1,i:i+1}, f)
5:    f := 1
6:    if Z̄ = I then
        // The diagonal block is unchanged. The algorithm moves ahead.
7:      i := i + 1
8:      continue
9:    end if
      // Block updating
10:   Z_{1:d′,i:i+1} := Z_{1:d′,i:i+1} Z̄
11:   R_{1:i−1,i:i+1} := R_{1:i−1,i:i+1} Z̄
12:   R_{i:i+1,i+2:d′} := Q̄^T R_{i:i+1,i+2:d′}
      // Size-reduce the corresponding columns of R_{1:i−1,i:i+1} using Algorithm 3.4
13:   [R_{1:i−1,i:i+1}, Z̄] = BPSR(R_{1:i−1,i:i+1}, R_{1:i−1,1:i−1}, c)
14:   (accumulate Z̄ into Z)
15:–16: (check the Lovász condition of Step 6; if it holds then)
17:     i := i + 1
18:   else
19:     i := i − 1
20:–21: end if
22: end while
    // Size-reduce R using Algorithm 3.2
23: [R, Z̄] = BSR(R, k)
24: Z := Z Z̄
Notice that if the Local-PLLL output Z̄ is an identity matrix, we do not apply the block updating and BPSR to the relevant blocks, for efficiency. Also notice that if the matrix dimension n is not a multiple of the block size k, the algorithm still works, by simply changing the block size of the last column blocks to fit the matrix dimension. At the end of each while loop, the first ik′ columns of R are PLLL reduced. The while loop terminates when i = d′; then all n = d′k′ columns of R are PLLL reduced, and the matrix R is size-reduced after the final size-reduction. Thus the LRBLLL algorithm outputs a basis matrix which is LLL reduced.
3.2.2
Complexity Analysis
Since LRBLLL uses the same permutation criterion as LLL (Algorithm 2.1), Lemma 2.1 can also be applied to LRBLLL. As in Section 2.2.3, we define β = max_j ||b_j|| and α = min_{x∈Z^n\{0}} ||Bx||. Thus the LRBLLL algorithm has at most O(n³ + n² log_{1/δ}(β/α)) permutations, and the algorithm converges. During the procedure of LRBLLL, the permutation operations are performed inside the Local-PLLL subroutine. In the following, we obtain an upper bound on the number of calls to Local-PLLL.
In its while loop, LRBLLL calls Local-PLLL on diagonal sub-matrices of R. In each loop, the PLLL reduction of one diagonal sub-matrix is performed, and the diagonal sub-matrix to be reduced in the next loop is selected. From step 3 of LRBLLL, the diagonal sub-matrix Rlocal contains 2 diagonal blocks, R_{i,i} and R_{i+1,i+1}. Rlocal may move one diagonal block forward or backward at the end of each loop, according to whether the Lovász condition holds for columns (i−1)k′ and (i−1)k′ + 1 (see step 6 of LRBLLL described in Section 3.2). The matrix R, divided into d′ × d′ blocks, has d′ diagonal blocks. In the first call of Local-PLLL, Rlocal contains the first two diagonal blocks R_{1,1} and R_{2,2}, and the block index i equals 1; in the last call of Local-PLLL, Rlocal contains the last two diagonal blocks R_{d′−1,d′−1} and R_{d′,d′}, and the block index i equals d′ − 1. Only d′ − 1 loops are needed for i to move forward from i = 1 to i = d′ − 1 if there are no backward moves. Actually there may be some backward moves, say s of them, and then the number of forward moves increases by an extra s. Thus the total number of moves of Rlocal is 2s + d′ − 1, which equals 2s + 2d − 1.
[Figure: the two alternating block partitions of R used by APBLLL, with diagonal blocks of sizes k and 1.5k]

BP = Q1 R, where P ∈ Z^{n×n} is a permutation matrix.
Next we use an example to show how APBLLL works iteratively with two alternating partitions. The first partition is

R = [ R11 ... R1d ; ... ; Rdd ] ∈ R^{n×n},  Rij ∈ R^{k×k},  1 ≤ i ≤ j ≤ d.

The second partition is

R = [ R11 ... R1,d−1 ; ... ; R_{d−1,d−1} ] ∈ R^{n×n},

where R11 ∈ R^{1.5k×1.5k}, R_{1,v} ∈ R^{1.5k×k}, R_{1,d−1} ∈ R^{1.5k×1.5k}, R_{u,d−1} ∈ R^{k×1.5k}, R_{d−1,d−1} ∈ R^{1.5k×1.5k}, and R_{u,v} ∈ R^{k×k} for 1 < u ≤ v < d − 1.
2:  d := n/k, f := 0
3:  for i = 1 : d do
4:    change_i := 1, nextChange_i := 1
5:  end for
6:  while (1) do
7:–10: (for each diagonal block i of the current partition do)
11:   if change_i ≠ 1 then
        continue
      end if
      // Apply Local-PLLL to all diagonal blocks using Algorithm 3.3
12:   [Q̄, Rii, Z̄, c] = Local-PLLL(Rii, f)
13:   if Z̄ = I then
        // The diagonal block is unchanged, and updates are not needed
14:–15: continue
      end if
      // Perform the corresponding updates
16:   nextChange_{max(1,i−1)} := 1, nextChange_i := 1
      // Block updating
17:   Z_{1:d,i} := Z_{1:d,i} Z̄
18:   R_{1:i−1,i} := R_{1:i−1,i} Z̄
19:   R_{i,i+1:d} := Q̄^T R_{i,i+1:d}
      // Size-reduce the corresponding columns of R_{1:i−1,i} using Algorithm 3.4
20:   [R_{1:i−1,i}, Z̄] = BPSR(R_{1:i−1,i}, R_{1:i−1,1:i−1}, c)
21:   (accumulate Z̄ into Z)
22: end for
23: if nextChange = 0 then
24:–25: break
    end if
26: f := 1
27: for i = 1 : d do
28:   change_i := nextChange_i, nextChange_i := 0
29: end for
30: end while
    // Size-reduce R using Algorithm 3.2
31: [R, Z̄] = BSR(R, k)
32: Z := Z Z̄
Notice that the two vectors change and nextChange are used to track whether the diagonal blocks are PLLL reduced in each iteration. If two diagonal blocks are unchanged in an iteration, then in the next iteration we do not apply Local-PLLL to the diagonal block whose diagonal entries come from these two unchanged diagonal blocks, since this diagonal block must also be PLLL reduced. Also notice that if the Local-PLLL output matrix Z̄ is an identity matrix, we do not apply the block updating and BPSR to the relevant blocks, for efficiency.
3.3.2
Complexity Analysis
The APBLLL algorithm shares the same QR and final size-reduction parts as LRBLLL, so the costs of these two parts are the same as in LRBLLL: O(mn²) arithmetic operations for the QR factorization and O(n³) arithmetic operations for the final size-reduction. The cost of the rest of APBLLL is divided into two parts: the cost of the subroutine Local-PLLL, and the cost outside the subroutine, i.e., the block updating and the block partial size-reductions.
These two parts are calculated separately.
Since APBLLL uses the same permutation criterion as LLL (Algorithm 2.1), Lemma 2.1 can also be applied to APBLLL. Thus the total number of permutations p taking place in the Local-PLLL reductions is bounded above by O(n³ + n² log_{1/δ}(β/α)). In Local-PLLL, a permutation causes at most O(k²) arithmetic operations for the subsequent updating and size-reductions. Thus, all the calls to the subroutine Local-PLLL cost O(n³k² + n²k² log_{1/δ}(β/α)) arithmetic operations.
In APBLLL, the block updating and BPSR (lines 17–21) are performed only if the output matrix Z̄ of Local-PLLL is not the identity, i.e., only if some permutations take place during the execution of Local-PLLL. Because the total number of permutations is p, there are at most p calls to Local-PLLL that do not produce an identity Z̄. So in the worst case the block updating and BPSR are executed p times. Each execution of the block updating and BPSR causes at most O(n²k) arithmetic operations. Thus the total cost of the block updating and BPSR is p · O(n²k) in the worst case.
From the above, the total cost of APBLLL is obtained by adding the costs of all the parts together:

C_APBLLL = O(mn²) + p · O(k²) + p · O(n²k) + O(n³) = O(mn² + n⁵k + n⁴k log_{1/δ}(β/α)).

This bound is larger than the bounds of LRBLLL, PLLL and LLL. However, the simulation results show that APBLLL performs better than LLL and PLLL and similarly to LRBLLL. The simulation results and analysis of the two block LLL reduction algorithms will be given in the next section.
Table 3–2 lists the costs of the important processes and the total cost of APBLLL.

Table 3–2: Complexity analysis of the APBLLL reduction algorithm

Process                                                            Bound
Cost of the QR factorization                                       O(mn²)
Cost of one permutation in Local-PLLL                              O(k²)
Cost of block updating and size-reduction for one diagonal block   O(n²k)
Cost of the final block size-reduction                             O(n³)
Number of permutations p                                           O(n³ + n² log_{1/δ}(β/α))
Total cost of the algorithm                                        O(mn² + n⁵k + n⁴k log_{1/δ}(β/α))
3.4

Our simulations are run on two machines. One has MATLAB 7.12.0 on a 64-bit Ubuntu 11.10 system with 4 Intel Xeon(R) CPU W3530 2.8 GHz processors and 5 GB memory. The other has MATLAB 7.13.0 on a 64-bit Red Hat 6.2 system with 64 AMD Opteron(TM) 2.2 GHz processors and 64 GB memory. Our simulations use conventional MATLAB, not Parallel MATLAB. MATLAB uses the IEEE double precision model for floating point arithmetic by default; the unit round-off for double precision is about 10⁻¹⁶. We compare four algorithms: the original LLL algorithm (Algorithm 2.1), the PLLL+ algorithm, the LRBLLL algorithm (Algorithm 3.5), and the APBLLL algorithm (Algorithm 3.6). The PLLL+ algorithm is the PLLL algorithm (Algorithm 2.3) with an extra size-reduction procedure to guarantee that the resulting matrix is size-reduced. All these
four algorithms produce LLL reduced matrices. We compare the CPU run time, the flops, and the relative backward errors

||B − Q_c R_c Z_c^{−1}||_F / ||B||_F,

where Q_c is the computed orthogonal matrix, R_c is the computed LLL reduced matrix, and Z_c^{−1} is the unimodular matrix formed by the inverses of the computed permutation matrix and the IGTs. The run time is measured in two separate parts, the run time for the QR factorization and the run time for the rest of each algorithm (for simplicity, we call this part the reduction), in order to observe how the blocking technique performs in each part.
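For reference, the error measure can be computed directly; the Python/NumPy sketch below is an assumption for illustration (variable names are ours, not the thesis's):

```python
import numpy as np

def relative_backward_error(B, Qc, Rc, Zc):
    """||B - Qc @ Rc @ inv(Zc)||_F / ||B||_F. Zc is unimodular, so its
    inverse is an integer matrix (computed here in floating point)."""
    E = B - Qc @ Rc @ np.linalg.inv(Zc)
    return np.linalg.norm(E) / np.linalg.norm(B)
```

For an exact factorization, e.g. a plain QR with Zc = I, the value sits at the level of the unit round-off.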
In the simulation, we test three cases of matrices B ∈ R^{n×n} with n = 100 : 50 : 1000. The square matrices B are generated as follows.

Case 1: B is generated by the MATLAB function randn: B = randn(n, n), i.e., each element follows the normal distribution N(0, 1).

Case 2: B = U S V^T, where U and V are randomly generated orthogonal matrices, and S is the diagonal matrix with

S(i, i) = 10^{−4(i−1)/(n−1)},  i = 1, ..., n.

Case 3: B = U S V^T, where U and V are randomly generated orthogonal matrices, and S is the diagonal matrix with

S(i, i) = 1000,  i = 1, ..., ⌊n/2⌉,
S(i, i) = 0.1,  i = ⌊n/2⌉ + 1, ..., n.
Case 1 gives the most typical test matrices for numerical computations. Cases 2 and 3 are intended to show the reduction speed when the condition number is fixed at 10⁴. Case 3 also shows that the block algorithms gain more efficiency in the reduction part when it takes a long time to run.
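The three test families can be generated as follows. This Python/NumPy sketch of the MATLAB setup described above makes one assumption: that the "randomly generated orthogonal matrices" U and V come from the QR factorization of Gaussian matrices.

```python
import numpy as np

def gen_case(case, n, seed=0):
    """Generate a test matrix for Case 1, 2 or 3."""
    rng = np.random.default_rng(seed)
    if case == 1:
        return rng.standard_normal((n, n))    # entries ~ N(0, 1)
    # random orthogonal U, V from the QR factorization of Gaussian matrices
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    if case == 2:
        s = 10.0 ** (-4.0 * np.arange(n) / (n - 1))   # 1 down to 1e-4
    else:
        s = np.where(np.arange(n) < round(n / 2), 1000.0, 0.1)
    return U @ (s[:, None] * V.T)             # U @ diag(s) @ V.T
```

Cases 2 and 3 then have condition number 10⁴ by construction.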
For each dimension in all cases, we randomly generate 20 different matrices for the test. We test only 20 simulation runs because LLL is too time consuming; however, the box plots show that the behaviors of the algorithms are stable, so 20 runs are enough for our simulation. For the block algorithms, the optimal block size may vary with the dimension of the matrix; in the simulation, a fixed block size of 32 is adopted for all dimensions for simplicity. In the average QR/reduction run time plots, the y-axis is the average run time (seconds) over the 20 matrices, and the x-axis is the dimension. In the average flops plots, the y-axis is the average flops, and the x-axis is the dimension. In the average relative backward error plots, the y-axis is the relative backward error, and the x-axis is the dimension.
In the simulation, we also test matrices with various condition numbers, and give the results in the various condition number plots. In these plots, the y-axis is the average QR/reduction run time, the average flops or the average relative backward errors over 20 matrices with dimension 200 in Case 2, and the x-axis is the matrix condition number, from 10¹ to 10⁶. Box plots of the run time and relative backward errors of all three cases with dimension 200 are also drawn. In the box plots, the y-axis is either the algorithm run time or the relative backward error, and the x-axis is the four algorithms, i.e., LLL, PLLL+, LRBLLL and APBLLL.
The simulation results given by the Intel processors are shown in Figure 3–3, Figure 3–4 and Figure 3–5 for the overall performance of the three cases, in Figure 3–6 for Case 2 with different condition numbers, and in Figure 3–7 for the box plots of all the cases. The results given by the AMD processors are shown in Figure 3–8, Figure 3–9, Figure 3–10, Figure 3–11 and Figure 3–12, respectively. For the overall performance
of each case, we give six plots. The two plots in the first row are the average run time of the QR factorization and the average reduction run time of LLL, respectively; LLL runs much longer than the other three algorithms, so we put it in individual plots in order to compare the other three algorithms easily. The two plots in the middle row are the average QR/reduction run times for PLLL+, LRBLLL, and APBLLL. The two plots in the bottom row are the average flops and the average relative backward errors for LLL, PLLL+, LRBLLL, and APBLLL. For Case 2 with different condition numbers, we also give six plots, ordered in the same way as the overall performance plots. For the box plot figures, we give six plots: the three plots in the left column are the algorithm run times of the three cases, and the three plots in the right column are the relative backward errors of the three cases.
From the simulation results, we can draw the following observations and conclusions.

1. Comparing the results between the two machines, with Intel or AMD processors, we observe that the performance of the four algorithms is consistent across the two machines.

2. Comparing the run times of the different algorithms, we find that LLL is the slowest among the four. LRBLLL is as fast as APBLLL, and both are faster than PLLL+ in all three cases. So on average the computational CPU times of the four algorithms have the order LRBLLL ≈ APBLLL < PLLL+ < LLL.
Figure 3–3: Performance comparison for Case 1, Intel
Figure 3–4: Performance comparison for Case 2, Intel
Figure 3–5: Performance comparison for Case 3, Intel
Figure 3–6: Performance comparison for Case 2 with dimension 200, Intel
Figure 3–7: Box plots of run time (left) and relative backward error (right) for Case 1 (top), Case 2 (middle), Case 3 (bottom) with dimension 200, Intel
Figure 3–8: Performance comparison for Case 1, AMD
Figure 3–9: Performance comparison for Case 2, AMD
Figure 3–10: Performance comparison for Case 3, AMD
Figure 3–11: Performance comparison for Case 2 with dimension 200, AMD
Figure 3–12: Box plots of run time (left) and relative backward error (right) for Case 1 (top), Case 2 (middle), Case 3 (bottom) with dimension 200, AMD
CHAPTER 4
Parallelization of Block LLL Reduction
Recently, some work has been done on the parallelization of LLL reduction
algorithms. We would like to give a brief review of these efforts in the first section
of this chapter. Then we discuss how to parallelize the components of APBLLL
(Algorithm 3.6) introduced in the previous chapter. Finally, the performance of
parallelized APBLLL is investigated.
4.1
4.2.1 Parallel Diagonal Block Reduction and Block Updating
The APBLLL algorithm partitions the basis matrix into blocks, and then performs the Local-PLLL reductions on the diagonal blocks. To parallelize this, we distribute the diagonal blocks among different processors according to the indexes of the diagonal blocks: diagonal block j (the j-th diagonal block counted from left to right) is allocated to processor ⌊(j − 1)/n_p⌋ + 1. For example, if d = 7 and n_p = 3, diagonal blocks 1, 2, 3 are allocated to processor 1, diagonal blocks 4, 5, 6 are allocated to processor 2, and diagonal block 7 is allocated to processor 3. After this allocation, each processor performs Local-PLLL on the diagonal blocks allocated to it, as well as the corresponding block updating. Using this strategy, lines 7–22 of APBLLL (Algorithm 3.6) can be computed concurrently. This part is described in lines 7–20 of the parallel APBLLL algorithm (Algorithm 4.1) given in the next section.
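The allocation rule can be stated in a couple of lines; the Python sketch below encodes the rule quoted above and reproduces the d = 7, n_p = 3 example:

```python
def diag_block_owner(j, n_p):
    """Processor (1-based) owning diagonal block j under the rule
    floor((j - 1) / n_p) + 1."""
    return (j - 1) // n_p + 1

owners = [diag_block_owner(j, 3) for j in range(1, 8)]   # d = 7, n_p = 3
# reproduces the text's example: [1, 1, 1, 2, 2, 2, 3]
```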
4.2.2 Parallel Size-Reduction
In order to size-reduce all the off-diagonal blocks, we first need to distribute these blocks among the processors. Define mod_b(a) as the remainder of a divided by b. The j-th off-diagonal block column is allocated to processor mod_{n_p}(j − 1). Each processor thus manages at most ⌈d/n_p⌉ block columns. Since the block columns have diverse lengths, they cost diverse numbers of operations. For efficiency, we do not want processors to wait for other processors, and in fact they do not need to: if a processor finishes the size-reduction of a block column, say block column j, it starts to size-reduce the (j + n_p)-th block column without causing any conflict.
Figure 4–1: Task allocation for three processors (P1, P2, P3)
An example showing how the parallel size-reduction works is given in Figure 4–1. Assume d = 5 and n_p = 3; the off-diagonal blocks (i, j) with i < j need to be size-reduced in parallel. In the first step, the 3 processors reduce the off-diagonal blocks (1, 2), (2, 3), (3, 4), respectively. In the next step, the 3 processors reduce the blocks (4, 5), (1, 3), (2, 4), respectively. Then processor 2 idles while processors 1 and 3 reduce blocks (3, 5) and (1, 4), respectively. At this point processor 3 also idles, and processor 1 then finishes by reducing blocks (2, 5) and (1, 5). We can see that no data conflict between different processors occurs during the process. Thus, the size-reductions in lines 20–21 and 31–32 of APBLLL can be computed concurrently. During this size-reduction procedure, each processor size-reduces d²/(2n_p) blocks on average, costing O(n³/n_p) arithmetic operations, while the sequential size-reduction costs O(n³) arithmetic operations.
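The schedule in the example can be reproduced by a small simulation. In the Python sketch below, the column-to-processor mapping and the bottom-up order within a column are read off the example above, so treat them as assumptions:

```python
def size_reduction_schedule(d, n_p):
    """Work list per processor: processor p takes block columns p+1,
    p+1+n_p, ... and reduces each column's off-diagonal blocks bottom-up."""
    work = {}
    for p in range(1, n_p + 1):
        work[p] = [(i, j)
                   for j in range(p + 1, d + 1, n_p)
                   for i in range(j - 1, 0, -1)]
    return work

work = size_reduction_schedule(5, 3)
# step by step this yields (1,2),(2,3),(3,4) then (4,5),(1,3),(2,4), matching the text
```

Since every block (i, j) appears in exactly one processor's list, no two processors ever touch the same block column, which is the "no conflict" property noted above.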
A parallel APBLLL (PAPBLLL) reduction algorithm based on the previous discussion is given as follows.

Algorithm 4.1. (Parallel Repartition Block LLL Reduction) Given a full column rank matrix B ∈ R^{m×n} and a block size k (assume n is a multiple of k), this algorithm computes the LLL reduction B = Q1 R Z^{−1} in parallel.
function: [R, Z] = PAPBLLL(B, k)
1:  (compute the block QR factorization using Algorithm 3.1)
2:  d := n/k, f := 0
3:  for i = 1 : d do
4:    change_i := 1, nextChange_i := 1
5:  end for
6:  while (1) do
7:–9:  (for each diagonal block i, in parallel, do)
10:   if change_i ≠ 1 then
11:     continue
      end if
12:   [Q̄, Rii, Z̄, c] = Local-PLLL(Rii, f)
13:   if Z̄ = I then
14:     continue
15:   end if
16:   nextChange_{max(1,i−1)} := 1, nextChange_i := 1
17:   Z_{1:d,i} := Z_{1:d,i} Z̄
18:   R_{1:i−1,i} := R_{1:i−1,i} Z̄
19:   R_{i,i+1:d} := Q̄^T R_{i,i+1:d}
20: end for
21:–22: (for each block column i, in parallel:) [R_{1:i−1,i}, Z̄] = BPSR(R_{1:i−1,i}, R_{1:i−1,1:i−1}, c)
23:–24: (accumulate Z̄ into Z) end for
25: if nextChange = 0 then
26:   break
27: end if
28: f := 1
29: for i = 1 : d do
30:   change_i := nextChange_i, nextChange_i := 0
31: end for
32: end while
33: (size-reduce R using Algorithm 3.2 in parallel and accumulate into Z)
4.3

The performance of a parallel algorithm can be measured by the speedup, which refers to how much faster a parallel algorithm is than a corresponding sequential algorithm, and can be written as the ratio of the execution time of the sequential algorithm over the execution time of the parallel algorithm:

S = (T_s + T_p) / (T_s + T_p / n_p),

where T_s is the sequential portion of the execution time of the algorithm, during which the algorithm must be executed sequentially, and T_p is the parallel portion of the execution time, during which the algorithm can be parallelized on a machine with n_p processors.

Denoting by f_p the parallel fraction, i.e., the parallel portion of the execution time over the total execution time of the algorithm, f_p = T_p / (T_s + T_p), the speedup becomes

S = 1 / ((1 − f_p) + f_p / n_p).
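This is Amdahl's law; as a quick numeric check (Python):

```python
def speedup(f_p, n_p):
    """Amdahl's law: S = 1 / ((1 - f_p) + f_p / n_p)."""
    return 1.0 / ((1.0 - f_p) + f_p / n_p)

# a fully sequential algorithm gains nothing; a fully parallel one scales by n_p
assert speedup(0.0, 8) == 1.0
assert speedup(1.0, 8) == 8.0
```

A half-parallel algorithm on 2 processors gives S = 1/(0.5 + 0.25) = 4/3, i.e. the sequential half caps the gain.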
observed from the complexity analysis. In order to investigate how the parallel diagonal block reduction and block updating suggested in Section 4.2.1 work, a small test is made in the remainder of this section. To simplify the test, assume that the number of processors n_p is equal to the number of diagonal blocks d. The serial diagonal block reduction in APBLLL is used to simulate the parallel diagonal block reduction in PAPBLLL.
Define t(i, j) as the run time of the diagonal block reduction and block updating at the i-th block (i.e., on the i-th processor) during the j-th iteration, where 1 ≤ i ≤ d, 1 ≤ j ≤ s, and s is the total number of iterations the algorithm needs. The value max_i t(i, j) is the maximum run time of one diagonal block reduction and its corresponding block updating during the j-th iteration, which is the bottleneck of the parallel diagonal block reduction and updating, since the other processors must wait for the slowest processor to finish before moving to the next iteration. Thus we can take the computation time of the parallel diagonal block reduction and block updating to be the sum of max_i t(i, j) over j = 1 : s, while the computation time of the serial diagonal block reduction and block updating is the sum of t(i, j) over 1 ≤ i ≤ d, 1 ≤ j ≤ s.
The speedup of the diagonal block reduction can be obtained by comparing these two sums, the parallel one Σ_{j=1:s} max_i t(i, j) and the serial one Σ_{j=1:s} Σ_{i=1:d} t(i, j). The ratio

( Σ_{j=1:s} Σ_{i=1:d} t(i, j) ) / ( Σ_{j=1:s} max_i t(i, j) )

shows the efficiency of the parallel computing.
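The two sums, and their ratio, are easy to compute from a table of measured times t(i, j) (Python/NumPy sketch):

```python
import numpy as np

def diag_phase_speedup(t):
    """t[i, j]: run time of diagonal block i in iteration j. The parallel model
    charges max_i t(i, j) per iteration; the serial model charges the full sum."""
    t = np.asarray(t, dtype=float)
    parallel = t.max(axis=0).sum()
    serial = t.sum()
    return serial / parallel

# two iterations, two blocks: serial 7 units vs. parallel 3 + 2 = 5 units
ratio = diag_phase_speedup([[1, 2], [3, 1]])
```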
The Intel machine introduced in Chapter 3 is used in the test. The average run time of the serial/parallel diagonal block reduction and block updating is measured. We use 5 processors to test random upper triangular matrices B ∈ R^{n×n}
CHAPTER 5
Conclusion and Future Work
The LLL reduction is the most popular lattice reduction and is a powerful tool
for solving many complex problems in mathematics and computer science such as
integer least squares problems. The computation speed of a matrix algorithm is
determined not only by the number of floating point operations involved, but also by
the amount of memory traffic which is the movement of data between memory and
registers. The blocking technique casts matrix algorithms in terms of matrix-matrix
operations to permit efficient reuse of data.
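The data-reuse idea behind the blocking technique can be sketched with a generic blocked matrix multiplication (a Python/NumPy illustration under my own naming, not code from the thesis):

```python
import numpy as np

def blocked_matmul(A, B, nb):
    """Compute C = A @ B by nb-by-nb blocks. Each loaded block is reused
    across a whole panel of C, which is the data-reuse idea behind blocking."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, nb):
        for j in range(0, n, nb):
            for k in range(0, n, nb):
                # one matrix-matrix (level-3 BLAS style) block operation
                C[i:i+nb, j:j+nb] += A[i:i+nb, k:k+nb] @ B[k:k+nb, j:j+nb]
    return C
```

Each nb-by-nb update is a matrix-matrix product, so most arithmetic is performed on data already resident in fast memory; the block LLL algorithms apply the same principle to the updating steps of the reduction.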
In this thesis, two floating point block LLL reduction algorithms, the left-to-right block LLL (LRBLLL) reduction algorithm and the alternating partition block LLL (APBLLL) reduction algorithm, have been proposed using the blocking technique, and the parallelization of APBLLL has been discussed. The complexity bound of LRBLLL is the same as the complexity bound of LLL in the literature. First, the ordinary floating point LLL reduction and its variant, the partial LLL (PLLL) reduction, were introduced as fundamentals. Then the LRBLLL reduction algorithm and the APBLLL reduction algorithm were proposed as efficient LLL reduction algorithms utilizing the blocking technique. The performances of the four algorithms LLL, PLLL+, LRBLLL and APBLLL were compared. Later the parallelization of the APBLLL reduction is given with possible
References
[1] K. Aardal and F. Eisenbrand. The LLL algorithm and integer programming. In
[31], pp. 293-314, 2009.
[2] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger. Closest point search in lattices.
IEEE Transactions on Information Theory, vol. 48, pp. 2201-2214, 2002.
[3] U. Ahmad, A. Amin, M. Li, S. Pollin, L. Van der Perre, and F. Catthoor. Scalable
block-based parallel lattice reduction algorithm for an SDR baseband processor.
IEEE International Conference on Communications (ICC), pp. 1-5, 2011.
[4] L. Babai. On Lovasz lattice reduction and the nearest lattice point. Symposium
on Theoretical Aspects of Computer Science (STACS), vol. 182, pp. 13-20, 1985.
[5] W. Backes and S. Wetzel. Parallel lattice basis reduction using a multi-threaded
Schnorr-Euchner LLL algorithm. Euro-Par 2009, Lecture Notes in Computer
Science (LNCS), vol. 5704, pp. 960-973, Springer, 2009.
[6] T. Bartkewitz. Improved lattice basis reduction algorithms and their efficient
implementation on parallel systems. Diploma thesis, Department of Electrical
Engineering and Information Sciences, Ruhr-University Bochum, 2009.
[7] J.W.S. Cassels. An Introduction to the Geometry of Numbers. Springer, Berlin,
Heidelberg, New York, 1971.
[8] X.-W. Chang, X. Yang, and T. Zhou. MLAMBDA: A modified LAMBDA method
for integer least-squares estimation. Journal of Geodesy, vol. 79, pp. 552-565,
2005.
[9] X.-W. Chang and G.H. Golub. Solving ellipsoid-constrained integer least squares
problems. SIAM Journal on Matrix Analysis and Applications, vol. 31,
no. 3, pp. 1071-1089, 2009.
[10] X.-W. Chang and Q. Han. Solving box-constrained integer least squares problems. IEEE Transactions on Wireless Communications, vol. 7, no. 1, pp. 277-287,
2008.
[11] X.-W. Chang and T. Zhou. MILES: MATLAB package for solving mixed integer
least squares problem. GPS Solutions, vol. 11, pp. 289-294, 2007.
[12] I.V.L. Clarkson. Approximation of linear forms by lattice points with applications to signal processing. PhD thesis, The Australian National University,
1997.
[13] H. Cohen. A Course in Computational Algebraic Number Theory. Springer-Verlag, Berlin, Germany, 1993.
[14] J.J. Dongarra, I.S. Duff, D.C. Sorensen, and H.A. van der Vorst. Numerical Linear Algebra for High-Performance Computers. Society for Industrial and
Applied Mathematics, Philadelphia, PA, 1998.
[15] O. Goldreich, S. Goldwasser, and S. Halevi. Public-key cryptosystems from
lattice reduction problems. CRYPTO 97 : Advances in Cryptology, vol. 1294,
pp. 112-131, 1997.
[16] G.H. Golub and C. Van Loan. Matrix Computations. The Johns Hopkins
University Press, Baltimore, Maryland, 3rd edition, 1996.
[17] A. Hassibi and S. Boyd. Integer parameter estimation in linear models with
applications to GPS. IEEE Transactions on Signal Processing, vol. 46, pp. 2938-2952, 1998.
[18] C. Heckler and L. Thiele. A parallel lattice basis reduction for mesh-connected
processor arrays and parallel complexity. In the Proceedings of Fifth IEEE Symposium on Parallel and Distributed Processing, pp. 400-407, 1993.
[19] C. Heckler and L. Thiele. Complexity analysis of a parallel lattice basis reduction
algorithm. SIAM J. Comput., vol. 27, no. 5, pp. 1295-1320, 1998.
[20] R. Kannan. Improved algorithms for integer programming and related lattice
problems. In the Proceedings of the 15th Annual ACM Symposium on Theory
of Computing (STOC), pp. 193-206, 1983.
[21] A. Korkine and G. Zolotareff. Sur les formes quadratiques. Mathematische
Annalen, vol. 6, pp. 366-389, 1873.
[22] A.K. Lenstra, H.W. Lenstra, and L. Lovasz. Factoring polynomials with rational
coefficients. Mathematische Annalen, vol. 261, pp. 515-534, 1982.
[23] C. Ling and N. Howgrave-Graham. Effective LLL reduction for lattice decoding.
In the Proceedings of IEEE International Symposium on Information Theory, pp.
196-200, 2007.
[24] C. Ling, W.H. Mow, and N. Howgrave-Graham. Variants of the LLL algorithm
in digital communications: Complexity analysis and fixed-complexity implementation. IEEE Transactions on Information Theory, submitted for publication. Available online:
http://arxiv.org/abs/1006.1661.
[25] D. Micciancio and S. Goldwasser. Complexity of Lattice Problems: A Cryptographic Perspective. Kluwer Academic Publishers, Boston, 2002.
[26] H. Minkowski. Geometrie der Zahlen. Teubner, 1896.
[27] H. Minkowski. Diophantische Approximationen. Teubner, 1907.
[28] W.H. Mow. Universal lattice decoding: Principle and recent advances. Wireless
Communications and Mobile Computing, vol. 3, pp. 553-569, 2003.
[29] W.H. Mow. Universal lattice decoding: A review and some recent results. In
the Proceedings of IEEE International Conference on Communications, vol. 5,
pp. 2842-2846, 2004.
[30] P.Q. Nguyen and D. Stehle. Floating-point LLL revisited. EUROCRYPT 2005,
Lecture Notes in Computer Science (LNCS) 3494, pp. 215-233, 2005.
[31] P.Q. Nguyen and B. Vallee (editors). The LLL Algorithm: Survey and Applications. Information Security and Cryptography, Springer, Berlin, 2009.
[32] G. Quintana-Orti, X. Sun, and C.H. Bischof. A BLAS-3 version of the QR
factorization with column pivoting. SIAM Journal on Scientific Computing, vol.
19, pp. 1486-1494, 1998.
[33] G. Quintana-Ortí and E.S. Quintana-Ortí. Parallel codes for computing the
numerical rank. Linear Algebra and its Applications, vol. 275-276, pp. 451-470,
1998.
[34] C.P. Schnorr. Factoring integers and computing discrete logarithms via diophantine approximation. In Advances in Cryptology: EuroCrypt 91, vol. 547,
pp. 281-293, 1991.
[35] C.P. Schnorr. Fast LLL-type lattice reduction. Information and Computation,
vol. 204, no. 1, pp. 1-25, 2006.
[36] C.P. Schnorr. Progress on LLL and lattice reduction. In [31], pp. 145-178, 2009.
[37] C.P. Schnorr and M. Euchner. Lattice basis reduction: Improved practical
algorithms and solving subset sum problems. Mathematical Programming, vol.
66, pp. 181-191, 1994.
[38] R. Schreiber and C. Van Loan. A storage efficient WY representation for products of Householder transformations. SIAM Journal on Scientific and Statistical
Computing, vol. 10, pp. 53-57, 1989.
[39] H. Vetter, V. Ponnampalam, M. Sandell, and P.A. Hoeher. Fixed complexity
LLL algorithm. IEEE Transactions on Signal Processing, vol. 57, pp. 1634-1637,
2009.
[40] G. Villard. Parallel lattice basis reduction. In the Proceedings of The International Symposium on Symbolic and Algebraic Computation, pp. 269-277, 1992.
[41] S. Wetzel. An efficient parallel block-reduction algorithm. Algorithmic Number
Theory Symposium, Lecture Notes in Computer Science (LNCS), vol. 1423, pp.
323-337, 1998.
[42] D. Wubben, D. Seethaler, J. Jalden, and G. Matz. Lattice reduction: A survey
with applications in wireless communications. IEEE Signal Processing Magazine,
pp. 70-91, May 2011.
[43] X. Xie, X.-W. Chang, and M.A. Borno. Partial LLL reduction. In the Proceedings of IEEE GLOBECOM, 5 pages, 2011.
[44] T. Zhou. Modified LLL algorithms. Master's thesis, School of Computer Science,
McGill University, 2006.