Lecture 03-Parallel Prefix
Parallel Prefix
4. Return [b_i].
Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004
An example using the vector [1, 2, 3, 4, 5, 6, 7, 8] is shown in Figure 3.1. Going up the tree, we simply compute the pairwise sums. Going down the tree, we use the updates according to points 2 and 3 above. For even positions, we use the value of the parent node (b_i). For odd positions, we add the value of the node left of the parent node (b_{i-1}) to the current value (a_i).
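The up-and-down-the-tree scheme just described can be sketched in Python. This is a serial simulation: the pairwise sums on the way up and the updates on the way down would each run in parallel at every level, and the sketch assumes the input length is a power of two.

```python
def prefix_sum(a, op=lambda x, y: x + y):
    """Inclusive prefix via the recursive pairwise scheme (len(a) a power of two)."""
    n = len(a)
    if n == 1:
        return list(a)
    # Up the tree: pairwise sums (done in parallel across pairs)
    pairs = [op(a[2 * i], a[2 * i + 1]) for i in range(n // 2)]
    b = prefix_sum(pairs, op)          # prefix sums of the pair totals
    # Down the tree: even (1-indexed) positions copy the parent's value b_i;
    # odd positions combine the node left of the parent, b_{i-1}, with a_i
    res = [a[0]] + [None] * (n - 1)
    for i in range(1, n):
        if i % 2 == 1:                 # even 1-indexed position
            res[i] = b[i // 2]
        else:                          # odd 1-indexed position
            res[i] = op(b[i // 2 - 1], a[i])
    return res
```

Running it on the example vector reproduces the prefix sums in Figure 3.1: `prefix_sum([1, 2, 3, 4, 5, 6, 7, 8])` gives `[1, 3, 6, 10, 15, 21, 28, 36]`.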
We can create variants of the algorithm by modifying the update formulas 2 and 3. For example, the exclusive prefix sum, in which position i receives the reduction of elements 1 through i-1 rather than 1 through i, is obtained this way.
Figure 3.2 illustrates this algorithm using the same input vector as before.
The total number of ⊕ operations performed by the Parallel Prefix algorithm is (ignoring a constant term of ±1):

\[
T_n = \underbrace{\frac{n}{2}}_{\text{I}} + \underbrace{T_{n/2}}_{\text{II}} + \underbrace{\frac{n}{2}}_{\text{III}}
    = n + T_{n/2}
    = 2n
\]
If there is a processor for each array element, then the number of parallel operations is:

\[
T_n = \underbrace{1}_{\text{I}} + \underbrace{T_{n/2}}_{\text{II}} + \underbrace{1}_{\text{III}}
    = 2 + T_{n/2}
    = 2 \lg n
\]
1. At each processor i, compute a local scan serially, for n/p consecutive elements, giving result [d_{i1}, d_{i2}, ..., d_{ik}]. Notice that this step vectorizes over processors.
In the limiting case of p ≪ n, the lg p message passes are an insignificant portion of the computational time, and the speedup is due solely to the availability of a number of processors, each doing the prefix operation serially.
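The per-processor scheme can be sketched as follows. Step 1 above is the local scan; the remaining steps (truncated in these notes) are sketched here in the standard way: scan the block totals to obtain each processor's offset, then combine the offset into the local results. This is a serial simulation assuming p divides n; in a real implementation the loop over processors runs in parallel and the offsets come from a lg p-step scan.

```python
def block_scan(x, p, op=lambda a, b: a + b):
    """Prefix of x using p 'processors': local scans, then per-block offsets."""
    n = len(x)
    k = n // p                                    # assume p divides n
    blocks = [x[i * k:(i + 1) * k] for i in range(p)]
    # Step 1: serial local scan on each processor (vectorizes over processors)
    local = []
    for b in blocks:
        scan, acc = [], None
        for v in b:
            acc = v if acc is None else op(acc, v)
            scan.append(acc)
        local.append(scan)
    # Remaining steps (sketched): each processor's offset is the reduction of
    # all earlier block totals; apply it to every local result
    out, offset = [], None
    for i in range(p):
        for v in local[i]:
            out.append(v if offset is None else op(offset, v))
        total = local[i][-1]
        offset = total if offset is None else op(offset, total)
    return out
```

For example, `block_scan([1, 2, 3, 4, 5, 6, 7, 8], 2)` gives `[1, 3, 6, 10, 15, 21, 28, 36]`, matching the single-processor scan.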
A = [1 2 3 4 5 6 7 8 9 10]
C = [1 0 0 0 1 0 1 1 0 1]
plus scan(A, C) = [1 3 6 10 5 11 7 8 17 10]
We now show how to reduce segmented scan to simple scan. We define an operator, $\oplus_2$, whose operand is a pair $\binom{x}{y}$. We denote this operand as an element of the 2-element representation of A and C, where x and y are corresponding elements from the vectors A and C. The operands of the example above are given as:

\[
\begin{pmatrix}1\\1\end{pmatrix}
\begin{pmatrix}2\\0\end{pmatrix}
\begin{pmatrix}3\\0\end{pmatrix}
\begin{pmatrix}4\\0\end{pmatrix}
\begin{pmatrix}5\\1\end{pmatrix}
\begin{pmatrix}6\\0\end{pmatrix}
\begin{pmatrix}7\\1\end{pmatrix}
\begin{pmatrix}8\\1\end{pmatrix}
\begin{pmatrix}9\\0\end{pmatrix}
\begin{pmatrix}10\\1\end{pmatrix}
\]
The operator $\oplus_2$ is defined as follows:

\[
\begin{pmatrix} x \\ 0 \end{pmatrix} \oplus_2 \begin{pmatrix} y \\ 0 \end{pmatrix} = \begin{pmatrix} x \oplus y \\ 0 \end{pmatrix},
\qquad
\begin{pmatrix} x \\ 0 \end{pmatrix} \oplus_2 \begin{pmatrix} y \\ 1 \end{pmatrix} = \begin{pmatrix} y \\ 1 \end{pmatrix},
\]
\[
\begin{pmatrix} x \\ 1 \end{pmatrix} \oplus_2 \begin{pmatrix} y \\ 0 \end{pmatrix} = \begin{pmatrix} x \oplus y \\ 1 \end{pmatrix},
\qquad
\begin{pmatrix} x \\ 1 \end{pmatrix} \oplus_2 \begin{pmatrix} y \\ 1 \end{pmatrix} = \begin{pmatrix} y \\ 1 \end{pmatrix}.
\]
L
As an exercise, we can show that the binary operator 2 defined above is associative and
exhibits the segmenting behavior we want: for each vector A and each boolean vector C, let AC
be the 2-element representation of A and C. For each binary associative operator ⊕, the result
L
of 2 scan(AC) gives a 2-element vector whose first row is equal to the vector computed by
segmented ⊕ scan(A, C). Therefore, we can apply the parallel scan algorithm to compute the
segmented scan.
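This reduction can be sketched in Python, with a serial inclusive scan standing in for the parallel scan and the segment flags represented as booleans; the operator below transcribes the four cases of $\oplus_2$.

```python
def op2(left, right, op=lambda x, y: x + y):
    """The pairwise operator: combines (value, flag) pairs per the four cases."""
    (x, fx), (y, fy) = left, right
    if fy:                       # right element starts a new segment
        return (y, True)
    return (op(x, y), fx)        # same segment: combine values, keep flag

def segmented_scan(A, C, op=lambda x, y: x + y):
    """Segmented scan of A with segment-start flags C, via a scan with op2."""
    pairs = list(zip(A, (c == 1 for c in C)))
    out, acc = [], None
    for p in pairs:              # serial stand-in for the parallel scan
        acc = p if acc is None else op2(acc, p, op)
        out.append(acc[0])       # first row of the 2-element result
    return out
```

On the example above, `segmented_scan([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [1, 0, 0, 0, 1, 0, 1, 1, 0, 1])` gives `[1, 3, 6, 10, 5, 11, 7, 8, 17, 10]`.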
Notice that the method of assigning each segment to a separate processor may result in load imbalance.
The c_i in this equation can be calculated by Leverrier's lemma, which relates the c_i to s_k = tr(A^k). The Csanky algorithm, then, is to calculate the A^i by parallel prefix, compute the trace of each A^i, calculate the c_i from Leverrier's lemma, and use these to generate A^{-1}.
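The pipeline can be sketched in pure Python for small matrices. This is an illustration, not a parallel implementation: the powers A^1, ..., A^n are computed serially here, though in Csanky's algorithm they come from a parallel prefix with matrix multiplication as the operator. The Leverrier recurrence is used in the commonly stated form k c_k = -(s_k + c_1 s_{k-1} + ... + c_{k-1} s_1), and the inverse follows from the Cayley–Hamilton theorem.

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(X):
    return sum(X[i][i] for i in range(len(X)))

def csanky_inverse(A):
    """Invert A via traces of powers, Leverrier's lemma, and Cayley-Hamilton."""
    n = len(A)
    # Powers A^1..A^n (a parallel prefix with matmul in Csanky's algorithm)
    powers = [A]
    for _ in range(n - 1):
        powers.append(matmul(powers[-1], A))
    s = [trace(P) for P in powers]              # s_k = tr(A^k), k = 1..n
    # Leverrier: k*c_k = -(s_k + c_1 s_{k-1} + ... + c_{k-1} s_1), c_0 = 1
    c = [1.0]
    for k in range(1, n + 1):
        acc = s[k - 1] + sum(c[j] * s[k - 1 - j] for j in range(1, k))
        c.append(-acc / k)
    # Cayley-Hamilton: A^{-1} = -(A^{n-1} + c_1 A^{n-2} + ... + c_{n-1} I) / c_n
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    mats = [I] + powers                          # mats[k] = A^k
    B = [[0.0] * n for _ in range(n)]
    for j in range(n):
        M = mats[n - 1 - j]
        for r in range(n):
            for q in range(n):
                B[r][q] += c[j] * M[r][q]
    return [[-B[r][q] / c[n] for q in range(n)] for r in range(n)]
```

For A = [[2, 1], [1, 3]] (determinant 5) this returns the exact inverse [[0.6, -0.2], [-0.2, 0.4]], but as the next paragraph notes, the repeated powering makes the method numerically unstable for general matrices.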
Figure 3.3: Babbage’s Difference Engine, reconstructed by the Science Museum of London
While the Csanky algorithm is useful in theory, it suffers a number of practical shortcomings. The most glaring problem is the repeated multiplication of the matrix A. Unless the coefficients of A are very close to 1, the terms of A^n are likely to increase towards infinity or decay to zero quite rapidly, making their storage as floating point values very difficult. Therefore, the algorithm is inherently unstable.
Charles Babbage is considered by many to be the founder of modern computing. In the 1820s he
pioneered the idea of mechanical computing with his design of a “Difference Engine,” the purpose
of which was to create highly accurate engineering tables.
A central concern in mechanical addition procedures is the idea of "carrying," for example, the overflow caused by adding two digits in decimal notation whose sum is greater than or equal to 10. Carrying, as is taught to elementary school children everywhere, is inherently serial, as the two numbers are added digit by digit starting from the least significant position.
However, the carrying problem can be treated in a parallel fashion by use of parallel prefix.
More specifically, consider:
c3 c2 c1 c0 Carry
a3 a2 a1 a0 First Integer
+ b3 b2 b1 b0 Second Integer
s4 s3 s2 s1 s0 Sum
By algebraic manipulation, one can create a transformation matrix for computing c_i from c_{i-1}:

\[
\begin{pmatrix} c_i \\ 1 \end{pmatrix}
=
\begin{pmatrix} a_i + b_i & a_i b_i \\ 0 & 1 \end{pmatrix}
\cdot
\begin{pmatrix} c_{i-1} \\ 1 \end{pmatrix}
\]
Thus, carry look-ahead can be performed by parallel prefix. Each c_i is computed by parallel prefix, and then the s_i are calculated in parallel.
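A small sketch of this scheme in Python, reading the "+" in the transformation matrix as OR and the product as AND (i.e., working over the Boolean semiring). The prefix of 2x2 matrix products is computed serially here, but since matrix multiplication is associative it could equally be done by the parallel prefix algorithm in lg n steps.

```python
def boolmat_mul(M, N):
    """2x2 matrix product over the Boolean (OR, AND) semiring."""
    return tuple(tuple(any(M[i][k] and N[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

def add_binary(a, b):
    """Add two equal-length bit lists (least significant bit first),
    computing all carries via a prefix of the 2x2 carry matrices."""
    n = len(a)
    # Matrix for position i, acting on the column (c_{i-1}, 1)^T
    mats = [((bool(a[i]) or bool(b[i]), bool(a[i]) and bool(b[i])),
             (False, True)) for i in range(n)]
    # Inclusive prefix of products M_i * ... * M_0 (parallelizable as a scan)
    carries, acc = [], ((True, False), (False, True))   # identity matrix
    for M in mats:
        acc = boolmat_mul(M, acc)
        carries.append(acc[0][1])        # c_i, using c_{-1} = 0
    # Sum bits in parallel once the carries are known:
    # s_i = a_i XOR b_i XOR c_{i-1}, plus a final carry-out bit
    s, c_in = [], False
    for i in range(n):
        s.append(int(bool(a[i]) ^ bool(b[i]) ^ c_in))
        c_in = carries[i]
    s.append(int(carries[-1]))
    return s
```

For example, adding 6 and 7 as bit lists, `add_binary([0, 1, 1, 0], [1, 1, 1, 0])` gives `[1, 0, 1, 1, 0]`, i.e. 13.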
[MPI_Scan] is much like MPI_Allreduce in that the values are formed by combining values contributed by each process and that each process receives a result. The difference is that the result returned by the process with rank r is the result of operating on the input elements on processes with rank 0, 1, ..., r.
Essentially, MPI_Scan operates locally on a vector and passes a result to each processor. If the defined operation of MPI_Scan is MPI_SUM, the result passed to each process is the partial sum including the numbers on the current process.
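These semantics can be illustrated by a tiny serial simulation (`mpi_scan_sum` is a hypothetical helper name; an actual program would call MPI_Scan with MPI_SUM, each rank contributing one value).

```python
def mpi_scan_sum(contributions):
    """Simulate MPI_Scan with MPI_SUM: rank r receives the sum of the
    contributions from ranks 0, 1, ..., r (inclusive)."""
    out, running = [], 0
    for x in contributions:     # index in the list plays the role of rank
        running += x
        out.append(running)
    return out
```

If ranks 0 through 3 contribute 1, 2, 3, 4 respectively, `mpi_scan_sum([1, 2, 3, 4])` returns `[1, 3, 6, 10]`: rank 3 receives the full sum, matching MPI_Allreduce, while lower ranks receive partial sums.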
MPI_Scan, upon further investigation, is not a true parallel prefix algorithm. It appears that the partial sum from each process is passed to the next process in a serial manner. That is, the message passing portion of MPI_Scan does not scale as lg p, but rather as simply p. However, as discussed in Section 3.2, the message passing time cost is so small in large systems that it can be neglected.