Lecture 03-Parallel Prefix
Parallel Prefix
4. Return [b_i].
Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004
An example using the vector [1, 2, 3, 4, 5, 6, 7, 8] is shown in Figure 3.1. Going up the tree, we simply compute the pairwise sums. Going down the tree, we use the updates according to points 2 and 3 above. For even positions, we use the value of the parent node (b_i). For odd positions, we add the value of the node left of the parent node (b_{i-1}) to the current value (a_i).
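The up-and-down-the-tree scheme just described can be sketched in Python. This is a serial simulation: the pairwise sums on the way up and the updates on the way down would each run in parallel at every level, and the sketch assumes the input length is a power of two.

```python
def prefix_sum(a, op=lambda x, y: x + y):
    """Inclusive prefix via the recursive pairwise scheme (len(a) a power of two)."""
    n = len(a)
    if n == 1:
        return list(a)
    # Up the tree: pairwise sums (done in parallel across pairs)
    pairs = [op(a[2 * i], a[2 * i + 1]) for i in range(n // 2)]
    b = prefix_sum(pairs, op)          # prefix sums of the pair totals
    # Down the tree: even (1-indexed) positions copy the parent's value b_i;
    # odd positions combine the node left of the parent, b_{i-1}, with a_i
    res = [a[0]] + [None] * (n - 1)
    for i in range(1, n):
        if i % 2 == 1:                 # even 1-indexed position
            res[i] = b[i // 2]
        else:                          # odd 1-indexed position
            res[i] = op(b[i // 2 - 1], a[i])
    return res
```

Running it on the example vector reproduces the prefix sums in Figure 3.1: `prefix_sum([1, 2, 3, 4, 5, 6, 7, 8])` gives `[1, 3, 6, 10, 15, 21, 28, 36]`.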
We can create variants of the algorithm by modifying the update formulas 2 and 3. For example, the exclusive prefix sum, in which position i receives the reduction of elements 1 through i-1 rather than 1 through i, is obtained this way.
Figure 3.2 illustrates this algorithm using the same input vector as before.
The total number of ⊕ operations performed by the Parallel Prefix algorithm is (ignoring a constant term of ±1):

\[
T_n = \underbrace{\frac{n}{2}}_{\text{I}} + \underbrace{T_{n/2}}_{\text{II}} + \underbrace{\frac{n}{2}}_{\text{III}}
    = n + T_{n/2}
    = 2n
\]
If there is a processor for each array element, then the number of parallel operations is:

\[
T_n = \underbrace{1}_{\text{I}} + \underbrace{T_{n/2}}_{\text{II}} + \underbrace{1}_{\text{III}}
    = 2 + T_{n/2}
    = 2 \lg n
\]
1. At each processor i, compute a local scan serially, for n/p consecutive elements, giving result [d_{i1}, d_{i2}, ..., d_{ik}]. Notice that this step vectorizes over processors.
In the limiting case of p ≪ n, the lg p message passes are an insignificant portion of the computational time, and the speedup is due solely to the availability of a number of processors, each doing the prefix operation serially.
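The per-processor scheme can be sketched as follows. Step 1 above is the local scan; the remaining steps (truncated in these notes) are sketched here in the standard way: scan the block totals to obtain each processor's offset, then combine the offset into the local results. This is a serial simulation assuming p divides n; in a real implementation the loop over processors runs in parallel and the offsets come from a lg p-step scan.

```python
def block_scan(x, p, op=lambda a, b: a + b):
    """Prefix of x using p 'processors': local scans, then per-block offsets."""
    n = len(x)
    k = n // p                                    # assume p divides n
    blocks = [x[i * k:(i + 1) * k] for i in range(p)]
    # Step 1: serial local scan on each processor (vectorizes over processors)
    local = []
    for b in blocks:
        scan, acc = [], None
        for v in b:
            acc = v if acc is None else op(acc, v)
            scan.append(acc)
        local.append(scan)
    # Remaining steps (sketched): each processor's offset is the reduction of
    # all earlier block totals; apply it to every local result
    out, offset = [], None
    for i in range(p):
        for v in local[i]:
            out.append(v if offset is None else op(offset, v))
        total = local[i][-1]
        offset = total if offset is None else op(offset, total)
    return out
```

For example, `block_scan([1, 2, 3, 4, 5, 6, 7, 8], 2)` gives `[1, 3, 6, 10, 15, 21, 28, 36]`, matching the single-processor scan.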
A = [1 2 3 4 5 6 7 8 9 10]
C = [1 0 0 0 1 0 1 1 0 1]
plus scan(A, C) = [1 3 6 10 5 11 7 8 17 10]
We now show how to reduce segmented scan to simple scan. We define an operator, $\oplus_2$, whose operand is a pair $\binom{x}{y}$. We denote this operand as an element of the 2-element representation of A and C, where x and y are corresponding elements from the vectors A and C. The operands of the example above are given as:

\[
\begin{pmatrix}1\\1\end{pmatrix}
\begin{pmatrix}2\\0\end{pmatrix}
\begin{pmatrix}3\\0\end{pmatrix}
\begin{pmatrix}4\\0\end{pmatrix}
\begin{pmatrix}5\\1\end{pmatrix}
\begin{pmatrix}6\\0\end{pmatrix}
\begin{pmatrix}7\\1\end{pmatrix}
\begin{pmatrix}8\\1\end{pmatrix}
\begin{pmatrix}9\\0\end{pmatrix}
\begin{pmatrix}10\\1\end{pmatrix}
\]
The operator $\oplus_2$ is defined as follows:

\[
\begin{pmatrix} x \\ 0 \end{pmatrix} \oplus_2 \begin{pmatrix} y \\ 0 \end{pmatrix} = \begin{pmatrix} x \oplus y \\ 0 \end{pmatrix},
\qquad
\begin{pmatrix} x \\ 0 \end{pmatrix} \oplus_2 \begin{pmatrix} y \\ 1 \end{pmatrix} = \begin{pmatrix} y \\ 1 \end{pmatrix},
\]
\[
\begin{pmatrix} x \\ 1 \end{pmatrix} \oplus_2 \begin{pmatrix} y \\ 0 \end{pmatrix} = \begin{pmatrix} x \oplus y \\ 1 \end{pmatrix},
\qquad
\begin{pmatrix} x \\ 1 \end{pmatrix} \oplus_2 \begin{pmatrix} y \\ 1 \end{pmatrix} = \begin{pmatrix} y \\ 1 \end{pmatrix}.
\]
L
As an exercise, we can show that the binary operator 2 defined above is associative and
exhibits the segmenting behavior we want: for each vector A and each boolean vector C, let AC
be the 2-element representation of A and C. For each binary associative operator ⊕, the result
L
of 2 scan(AC) gives a 2-element vector whose first row is equal to the vector computed by
segmented ⊕ scan(A, C). Therefore, we can apply the parallel scan algorithm to compute the
segmented scan.
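This reduction can be sketched in Python, with a serial inclusive scan standing in for the parallel scan and the segment flags represented as booleans; the operator below transcribes the four cases of $\oplus_2$.

```python
def op2(left, right, op=lambda x, y: x + y):
    """The pairwise operator: combines (value, flag) pairs per the four cases."""
    (x, fx), (y, fy) = left, right
    if fy:                       # right element starts a new segment
        return (y, True)
    return (op(x, y), fx)        # same segment: combine values, keep flag

def segmented_scan(A, C, op=lambda x, y: x + y):
    """Segmented scan of A with segment-start flags C, via a scan with op2."""
    pairs = list(zip(A, (c == 1 for c in C)))
    out, acc = [], None
    for p in pairs:              # serial stand-in for the parallel scan
        acc = p if acc is None else op2(acc, p, op)
        out.append(acc[0])       # first row of the 2-element result
    return out
```

On the example above, `segmented_scan([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [1, 0, 0, 0, 1, 0, 1, 1, 0, 1])` gives `[1, 3, 6, 10, 5, 11, 7, 8, 17, 10]`.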
Notice that the method of assigning each segment to a separate processor may result in load imbalance.
The c_i in this equation can be calculated by Leverrier's lemma, which relates the c_i to s_k = tr(A^k). The Csanky algorithm, then, is to calculate the A^i by parallel prefix, compute the trace of each A^i, calculate the c_i from Leverrier's lemma, and use these to generate A^{-1}.
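The pipeline can be sketched in pure Python for small matrices. This is an illustration, not a parallel implementation: the powers A^1, ..., A^n are computed serially here, though in Csanky's algorithm they come from a parallel prefix with matrix multiplication as the operator. The Leverrier recurrence is used in the commonly stated form k c_k = -(s_k + c_1 s_{k-1} + ... + c_{k-1} s_1), and the inverse follows from the Cayley–Hamilton theorem.

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(X):
    return sum(X[i][i] for i in range(len(X)))

def csanky_inverse(A):
    """Invert A via traces of powers, Leverrier's lemma, and Cayley-Hamilton."""
    n = len(A)
    # Powers A^1..A^n (a parallel prefix with matmul in Csanky's algorithm)
    powers = [A]
    for _ in range(n - 1):
        powers.append(matmul(powers[-1], A))
    s = [trace(P) for P in powers]              # s_k = tr(A^k), k = 1..n
    # Leverrier: k*c_k = -(s_k + c_1 s_{k-1} + ... + c_{k-1} s_1), c_0 = 1
    c = [1.0]
    for k in range(1, n + 1):
        acc = s[k - 1] + sum(c[j] * s[k - 1 - j] for j in range(1, k))
        c.append(-acc / k)
    # Cayley-Hamilton: A^{-1} = -(A^{n-1} + c_1 A^{n-2} + ... + c_{n-1} I) / c_n
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    mats = [I] + powers                          # mats[k] = A^k
    B = [[0.0] * n for _ in range(n)]
    for j in range(n):
        M = mats[n - 1 - j]
        for r in range(n):
            for q in range(n):
                B[r][q] += c[j] * M[r][q]
    return [[-B[r][q] / c[n] for q in range(n)] for r in range(n)]
```

For A = [[2, 1], [1, 3]] (determinant 5) this returns the exact inverse [[0.6, -0.2], [-0.2, 0.4]], but as the next paragraph notes, the repeated powering makes the method numerically unstable for general matrices.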
Figure 3.3: Babbage’s Difference Engine, reconstructed by the Science Museum of London
While the Csanky algorithm is useful in theory, it suffers a number of practical shortcomings. The most glaring problem is the repeated multiplication of the matrix A. Unless the coefficients of A are very close to 1, the terms of A^n are likely to increase towards infinity or decay to zero quite rapidly, making their storage as floating point values very difficult. Therefore, the algorithm is inherently unstable.
Charles Babbage is considered by many to be the founder of modern computing. In the 1820s he
pioneered the idea of mechanical computing with his design of a “Difference Engine,” the purpose
of which was to create highly accurate engineering tables.
A central concern in mechanical addition procedures is the idea of "carrying," for example, the overflow caused by adding two digits in decimal notation whose sum is greater than or equal to 10. Carrying, as is taught to elementary school children everywhere, is inherently serial, as the two numbers are added digit by digit starting from the least significant position.
However, the carrying problem can be treated in a parallel fashion by use of parallel prefix.
More specifically, consider:
c3 c2 c1 c0 Carry
a3 a2 a1 a0 First Integer
+ b3 b2 b1 b0 Second Integer
s4 s3 s2 s1 s0 Sum
By algebraic manipulation, one can create a transformation matrix for computing c_i from c_{i-1}:

\[
\begin{pmatrix} c_i \\ 1 \end{pmatrix}
=
\begin{pmatrix} a_i + b_i & a_i b_i \\ 0 & 1 \end{pmatrix}
\cdot
\begin{pmatrix} c_{i-1} \\ 1 \end{pmatrix}
\]
Thus, carry look-ahead can be performed by parallel prefix. Each c_i is computed by parallel prefix, and then the s_i are calculated in parallel.
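A small sketch of this scheme in Python, reading the "+" in the transformation matrix as OR and the product as AND (i.e., working over the Boolean semiring). The prefix of 2x2 matrix products is computed serially here, but since matrix multiplication is associative it could equally be done by the parallel prefix algorithm in lg n steps.

```python
def boolmat_mul(M, N):
    """2x2 matrix product over the Boolean (OR, AND) semiring."""
    return tuple(tuple(any(M[i][k] and N[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

def add_binary(a, b):
    """Add two equal-length bit lists (least significant bit first),
    computing all carries via a prefix of the 2x2 carry matrices."""
    n = len(a)
    # Matrix for position i, acting on the column (c_{i-1}, 1)^T
    mats = [((bool(a[i]) or bool(b[i]), bool(a[i]) and bool(b[i])),
             (False, True)) for i in range(n)]
    # Inclusive prefix of products M_i * ... * M_0 (parallelizable as a scan)
    carries, acc = [], ((True, False), (False, True))   # identity matrix
    for M in mats:
        acc = boolmat_mul(M, acc)
        carries.append(acc[0][1])        # c_i, using c_{-1} = 0
    # Sum bits in parallel once the carries are known:
    # s_i = a_i XOR b_i XOR c_{i-1}, plus a final carry-out bit
    s, c_in = [], False
    for i in range(n):
        s.append(int(bool(a[i]) ^ bool(b[i]) ^ c_in))
        c_in = carries[i]
    s.append(int(carries[-1]))
    return s
```

For example, adding 6 and 7 as bit lists, `add_binary([0, 1, 1, 0], [1, 1, 1, 0])` gives `[1, 0, 1, 1, 0]`, i.e. 13.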
[MPI_Scan] is much like MPI_Allreduce in that the values are formed by combining values contributed by each process and that each process receives a result. The difference is that the result returned by the process with rank r is the result of operating on the input elements on processes with rank 0, 1, ..., r.
Essentially, MPI_Scan operates locally on a vector and passes a result to each processor. If the defined operation of MPI_Scan is MPI_SUM, the result passed to each process is the partial sum including the numbers on the current process.
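These semantics can be illustrated by a tiny serial simulation (`mpi_scan_sum` is a hypothetical helper name; an actual program would call MPI_Scan with MPI_SUM, each rank contributing one value).

```python
def mpi_scan_sum(contributions):
    """Simulate MPI_Scan with MPI_SUM: rank r receives the sum of the
    contributions from ranks 0, 1, ..., r (inclusive)."""
    out, running = [], 0
    for x in contributions:     # index in the list plays the role of rank
        running += x
        out.append(running)
    return out
```

If ranks 0 through 3 contribute 1, 2, 3, 4 respectively, `mpi_scan_sum([1, 2, 3, 4])` returns `[1, 3, 6, 10]`: rank 3 receives the full sum, matching MPI_Allreduce, while lower ranks receive partial sums.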
MPI_Scan, upon further investigation, is not a true parallel prefix algorithm. It appears that the partial sum from each process is passed to the next process in a serial manner. That is, the message passing portion of MPI_Scan does not scale as lg p, but rather as simply p. However, as discussed in Section 3.2, the message passing time cost is so small in large systems that it can be neglected.