PRAM Algorithms
Siddhartha Chatterjee
Jan Prins
Spring 2002
Contents
1 The PRAM model of computation 1
3. A single program is executed in single-instruction stream, multiple-data stream (SIMD) fashion. Each instruction in the instruction stream is carried out by all processors simultaneously and requires unit time, regardless of the number of processors.
Copyright © Siddhartha Chatterjee, Jan Prins 1997–2002
4. Each processor has a private flag that controls whether it is active in the execution of an instruction. Inactive
processors do not participate in the execution of instructions, except for instructions that reset the flag.
The processor id can be used to distinguish processor behavior while executing the common program. For example,
each processor can use its processor id to form a distinct address in the shared memory from which to read a value.
A sequence of instructions can be conditionally executed by a subset of processors. The condition is evaluated by all
processors and is used to set the private flags. Only active processors carry out the instructions that follow. At the end
of the sequence the private flags are reset so that execution is resumed by all processors.
The operation of a synchronous PRAM can result in simultaneous access by multiple processors to the same
location in shared memory. There are several variants of our PRAM model, depending on whether such simultaneous
access is permitted (concurrent access) or prohibited (exclusive access). As accesses can be reads or writes, we have
the following four possibilities:
1. Exclusive Read Exclusive Write (EREW): This PRAM variant does not allow any kind of simultaneous access to a single memory location. All correct programs for such a PRAM must ensure that no two processors access a common memory location in the same time unit.
2. Concurrent Read Exclusive Write (CREW): This PRAM variant allows concurrent reads but not concurrent
writes to shared memory locations. All processors concurrently reading a common memory location obtain the
same value.
3. Exclusive Read Concurrent Write (ERCW): This PRAM variant allows concurrent writes but not concurrent
reads to shared memory locations. This variant is generally not considered independently, but is subsumed
within the next variant.
4. Concurrent Read Concurrent Write (CRCW): This PRAM variant allows both concurrent reads and concurrent writes to shared memory locations. There are several sub-variants within this variant, depending on how concurrent writes are resolved.
(a) Common CRCW: This model allows concurrent writes if and only if all the processors are attempting to
write the same value (which becomes the value stored).
(b) Arbitrary CRCW: In this model, a value arbitrarily chosen from the values written to the common memory location is stored.
(c) Priority CRCW: In this model, the value written by the processor with the minimum processor id writing
to the common memory location is stored.
(d) Combining CRCW: In this model, the value stored is a combination (usually by an associative and commutative operator such as + or max) of the values written.
The different models represent different constraints in algorithm design. They differ not in expressive power but
in complexity-theoretic terms. We will consider this issue further in Section 5.
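The four write-resolution policies are easy to state as code. The following sketch is our own illustration (the function names are not from the text); each function resolves one step's simultaneous writes to a single location, given the (processor id, value) pairs attempting the write.

```python
from functools import reduce

def resolve_common(writes):
    # Common CRCW: legal only if every processor writes the same value.
    values = {v for _, v in writes}
    assert len(values) == 1, "Common CRCW requires all writers to agree"
    return values.pop()

def resolve_arbitrary(writes):
    # Arbitrary CRCW: any single writer may win; here, the first listed.
    return writes[0][1]

def resolve_priority(writes):
    # Priority CRCW: the writer with the minimum processor id wins.
    return min(writes)[1]

def resolve_combining(writes, op=max):
    # Combining CRCW: store an associative, commutative combination (e.g. max).
    return reduce(op, (v for _, v in writes))

writes = [(3, 7), (1, 2), (2, 9)]     # (processor id, value) pairs
print(resolve_priority(writes))       # processor 1 wins -> 2
print(resolve_combining(writes))      # max(7, 2, 9) -> 9
```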
We study PRAM algorithms for several reasons.
1. There is a well-developed body of literature on the design of PRAM algorithms and the complexity of such
algorithms.
2. The PRAM model focuses exclusively on concurrency issues and explicitly ignores issues of synchronization
and communication. It thus serves as a baseline model of concurrency. In other words, if you can’t get a good
parallel algorithm on the PRAM model, you’re not going to get a good parallel algorithm in the real world.
3. The model is explicit: we have to specify the operations performed at each step, and the scheduling of operations
on processors.
4. It is a robust design paradigm. Many algorithms for other models (such as the network model) can be derived
directly from PRAM algorithms.
Digression 1 In the following, we will use the words vector and matrix to denote the usual linear-algebraic entities, and the word sequence for a linear list. We reserve the word array for the familiar concrete data structure that is used to implement all of these other kinds of abstract entities. Arrays can in general be multidimensional. The triplet notation s : e : t, with s ≤ e and t ≥ 1, denotes the set {s, s + t, s + 2t, …} ∩ {s, …, e}. If t = 1, we drop it from the notation. Thus, s : e = {s, s + 1, …, e}. □
Example 1 (Vector Sum) As our first example of a PRAM algorithm, let us compute C = A + B, where A, B, and C are vectors of length n stored as 1-dimensional arrays in shared memory. We describe a PRAM algorithm by giving the single program executed by all processors. The processor id will generally appear as a program variable that takes on a different value at each processor. So if p = n, the vector sum program simply consists of the statement C[i] := A[i] + B[i].
To permit the problem size n and the number of processors p to vary independently, we generalize the program as shown in Algorithm 1. Line 4 performs p simultaneous additions and writes p consecutive elements of the result into C. The for loop is used to apply this basic parallel step to successive sections of size p. The conditional in line 3 ensures that the final parallel step performs the correct number of operations, in case p does not divide n evenly.
Algorithm 1 (Vector sum)
Input: Vectors A and B of length n in shared memory, processor id i, number of processors p.
Output: The vector C = A + B in shared memory.

1 local integer j
2 for j = 1 to ⌈n/p⌉ do
3   if (j − 1)p + i ≤ n then
4     C[(j − 1)p + i] := A[(j − 1)p + i] + B[(j − 1)p + i]
5   endif
6 enddo
To simplify the presentation of PRAM programs, we assume that each processor has some local memory or, equivalently, some unique portion of the shared memory, in which processor-private variables such as i and j may be kept. We will typically assume that parameters such as n and p are in this memory as well. Under this assumption, all references to shared memory in Algorithm 1 are exclusive, and the algorithm requires only an EREW PRAM. Algorithm 1 requires on the order of ⌈n/p⌉ steps to execute, so the concurrent running time is

T(n, p) = O(⌈n/p⌉).   (1)

□
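The scheduling in Algorithm 1 can be made concrete with a short serial simulation. This sketch is ours, not part of the text: it runs the p virtual processors of each parallel step in an ordinary loop, using 1-based processor ids as in the notes.

```python
from math import ceil

def vector_sum(A, B, p):
    # Serial simulation of Algorithm 1: ceil(n/p) rounds, p virtual
    # processors per round, each forming a distinct address from its id i.
    n = len(A)
    C = [None] * n
    for j in range(1, ceil(n / p) + 1):   # serial for-loop over rounds
        for i in range(1, p + 1):         # "forall": p simultaneous additions
            k = (j - 1) * p + i           # distinct address per processor
            if k <= n:                    # guard for the final, partial round
                C[k - 1] = A[k - 1] + B[k - 1]
    return C

print(vector_sum([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], p=2))
# -> [11, 22, 33, 44, 55]
```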
Armed with this notation and definitions, let us examine our second parallel algorithm. We are given a sequence X of n elements of some type T in shared memory, and a binary associative operator ⊕. Associativity implies that (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c) for all elements a, b, and c. Examples of such an operator on primitive types include addition, multiplication, maximum, minimum, boolean AND, boolean OR, and string concatenation. More complex operators can be built for structured and recursive data types. We want to compute the quantity s = X[1] ⊕ X[2] ⊕ ⋯ ⊕ X[n], again in shared memory. This operation is also called reduction.
Algorithm 2 (Sequence reduction, WT description)
Input: Sequence X of n = 2^k elements of type T, binary associative operator ⊕.
Output: s = X[1] ⊕ X[2] ⊕ ⋯ ⊕ X[n].

REDUCE(sequence X of T, ⊕ : T × T → T) : T
1  n := |X|
2  forall i ∈ 1 : n do
3    B[i] := X[i]
4  enddo
5  for h = 1 to log n do
6    forall i ∈ 1 : n/2^h do
7      B[i] := B[2i − 1] ⊕ B[2i]
8    enddo
9  enddo
10 s := B[1]
11 return s
The WT program above is a high-level description of the algorithm: there are no references to processor ids. Also note that it contains both serial and concurrent operations. In particular, the final assignment in line 10 is to be performed by a single processor (since it is not contained in a forall construct), and the loop in line 5 is a serial for-loop. A couple of subtleties of this algorithm are worth emphasizing. First, in line 7, all instances of the expression on the right-hand side must be evaluated before any of the assignments are performed. Second, the additions are performed in a different order than in the sequential program. (Verify this.) Our assumption of the associativity of addition is critical to ensure the correctness of the result.
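The balanced-tree combining order of Algorithm 2 can be checked with a short serial simulation. This sketch is our own rendering; it assumes n is a power of two, as in the text, and evaluates each level's combinations from a snapshot of B, matching the forall semantics.

```python
def wt_reduce(X, op):
    # Serial simulation of Algorithm 2: level h combines pairs
    # B[2i-1] (+) B[2i] (1-based), halving the active prefix of B.
    n = len(X)                  # assumed to be a power of two
    B = list(X)
    h = 1
    while 2 ** h <= n:
        m = n // 2 ** h
        # forall i in 1 : n/2^h, all right-hand sides read before any write
        B = [op(B[2 * i], B[2 * i + 1]) for i in range(m)] + B[m:]
        h += 1
    return B[0]

print(wt_reduce([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))
# -> 36
```

Because the pairs at each level are combined "simultaneously", the result equals the sequential left-to-right reduction only when op is associative.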
Let us determine S(n) and W(n) for Algorithm 2. Both are determined inductively from the structure of the program. In the following, subscripts refer to lines in the program.

S_{1–4}(n) = O(1)                                W_{1–4}(n) = O(n)
S_{5–9}(n) = Σ_{h=1}^{log n} O(1) = O(log n)     W_{5–9}(n) = Σ_{h=1}^{log n} O(n/2^h) = O(n)
S_{10–11}(n) = O(1)                              W_{10–11}(n) = O(1)
S(n) = O(log n)                                  W(n) = O(n)
It is reassuring to see that the total amount of work done by the parallel algorithm is (asymptotically) the same as that
performed by an optimal sequential algorithm. The benefit of parallelism is the reduction in the number of steps.
We extend the PRAM classification for simultaneous memory references to the WT model. The algorithm above
specifies only exclusive read and write operations to the shared memory, and hence requires only an EREW execution
model.
Theorem 1 (Brent 1974) A WT algorithm with step complexity S(n) and work complexity W(n) can be simulated on a p-processor PRAM in no more than

⌊W(n)/p⌋ + S(n)

parallel steps.
Proof: For each time step i, 1 ≤ i ≤ S(n), let W_i(n) be the number of operations in that step. We simulate each step of the WT algorithm on a p-processor PRAM in ⌈W_i(n)/p⌉ parallel steps, by scheduling the W_i(n) operations on the p processors in groups of p operations at a time. The last group may not have p operations if p does not divide W_i(n) evenly. In this case, we schedule the remaining operations among the smallest-indexed processors. Given this simulation strategy, the time to simulate step i of the WT algorithm will be ⌈W_i(n)/p⌉, and the total time for a p-processor PRAM to simulate the algorithm is

Σ_{i=1}^{S(n)} ⌈W_i(n)/p⌉ ≤ Σ_{i=1}^{S(n)} (⌊W_i(n)/p⌋ + 1) ≤ ⌊W(n)/p⌋ + S(n).
There are a number of complications that our simple sketch of the simulation strategy does not address. For example, to preserve the semantics of the forall construct, we should generally not update any element of the left-hand side of a WT assignment until we have evaluated all the values of the right-hand side expression. This can be accomplished by the introduction of a temporary result that is subsequently copied into the left-hand side. □
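The counting argument in the proof can be checked numerically. This small sketch (ours, not from the text) takes the per-step operation counts W_i of a WT algorithm and computes the number of PRAM steps the grouping strategy uses, comparing it against the Brent bound.

```python
from math import ceil

def simulate_steps(step_op_counts, p):
    # Each WT step with W_i operations costs ceil(W_i / p) PRAM steps
    # when the operations are scheduled in groups of p.
    return sum(ceil(w / p) for w in step_op_counts)

ops = [8, 4, 2, 1]          # W_i for a 4-step WT algorithm (e.g. a reduction)
p = 3
W, S = sum(ops), len(ops)
t = simulate_steps(ops, p)
print(t, W // p + S)        # simulated steps vs. Brent bound -> 7 9
assert t <= W // p + S
```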
Let us revisit the sequence reduction example, and try to write the barebones PRAM algorithm for a p-processor PRAM, following the simulation strategy described. Each forall construct of Algorithm 2 is simulated using a sequential for loop with a body that applies up to p operations of the forall body at a time.
Algorithm 3 (Sequence reduction, PRAM description)
Input: Sequence X of n = 2^k elements of type T, binary associative operator ⊕, and processor id i.
Output: s = X[1] ⊕ X[2] ⊕ ⋯ ⊕ X[n].

PRAM-REDUCE(sequence X of T, ⊕ : T × T → T) : T
1  n := |X|
2  local integer h, j, k
3  for j = 1 to ⌈n/p⌉ do
4    if (j − 1)p + i ≤ n then
5      B[(j − 1)p + i] := X[(j − 1)p + i]
6    endif
7  enddo
8  for h = 1 to log n do
9    for j = 1 to ⌈n/(2^h p)⌉ do
10     k := (j − 1)p + i
11     if k ≤ n/2^h then
12       B[k] := B[2k − 1] ⊕ B[2k]
13     endif
14   enddo
15 enddo
16 if i = 1 then
17   s := B[1]
18 endif
19 return s
The concurrent running time of Algorithm 3 can be analyzed by counting the number of executions of the loop bodies:

T(n, p) = O(⌈n/p⌉) + Σ_{h=1}^{log n} O(⌈n/(2^h p)⌉) = O(n/p + log n).

This is the bound provided by Brent's theorem for the simulation of Algorithm 2 with a p-processor PRAM. To verify that the bound is tight, consider the summation above in the case that p ≥ n or the case that p is odd. With some minor
assumptions, the simulation preserves the shared-memory access model, so that, for example, an EREW algorithm in
the WT framework can be simulated using an EREW PRAM.
We define the speedup

SP(n, p) = T*(n) / T(n, p)   (2)

as the factor of improvement in the running time due to parallel execution. The best speedup we can hope to achieve (for a deterministic algorithm) is Θ(p) when using p processors. An asymptotically greater speedup would contradict the assumption that our sequential time complexity T*(n) was optimal, since a faster sequential algorithm could be constructed by sequential simulation of our PRAM algorithm.
Parallel algorithms in the WT framework are characterized by the single-parameter step and work complexity measures. The work complexity W(n) is the most critical measure. By Brent's Theorem, we can simulate a WT algorithm on a p-processor PRAM in time

T(n, p) = O(W(n)/p + S(n)).   (3)

If W(n) asymptotically dominates T*(n), then we can see that with a fixed number of processors p, increasing the problem size n decreases the speedup, i.e.

lim_{n→∞} SP(n, p) = 0.

Since scaling of p has hard limits in many real settings, we will want to construct parallel WT algorithms for which W(n) = O(T*(n)). Such algorithms are called work-efficient.
The second objective is to minimize the step complexity S(n). By Brent's Theorem, we can simulate a work-efficient WT algorithm on a p-processor PRAM in time

T(n, p) = O(T*(n)/p + S(n)),   (4)

so the speedup is

SP(n, p) = Ω(T*(n) / (T*(n)/p + S(n)))   (5)
         = Ω(p / (1 + p · S(n)/T*(n))),   (6)

which is Θ(p) provided p = O(T*(n)/S(n)). Thus, among two work-efficient parallel algorithms for a problem, the one with the smaller step complexity is more scalable in that it maintains optimal speedup over a larger range of processors.
sum, broadcast, and array compaction. We will look at the prefix sum case.
In the prefix sum problem (also called parallel prefix or scan), we are given an input sequence X of n elements of some type T, and a binary associative operator ⊕ : T × T → T. As output, we are to produce the sequence S of n elements, where for 1 ≤ i ≤ n, we require that S[i] = X[1] ⊕ ⋯ ⊕ X[i].
The sequential time complexity of the problem is clearly Θ(n): the lower bound follows trivially from the fact that n output elements have to be written, and the upper bound is established by the algorithm that computes S[i] as S[i − 1] ⊕ X[i]. Thus, our goal is to produce a parallel algorithm with work complexity O(n). We will do this using the balanced tree technique. Our WT algorithm will be different from previous ones in that it is recursive. As usual, we will assume that n = 2^k to simplify the presentation.
Algorithm 4 (Prefix sum)
Input: Sequence X of n = 2^k elements of type T, binary associative operator ⊕.
Output: Sequence S of n elements of type T, with S[i] = X[1] ⊕ ⋯ ⊕ X[i] for 1 ≤ i ≤ n.

sequence T SCAN(sequence X of T, ⊕ : T × T → T)
1  if n = 1 then
2    S[1] := X[1]
3    return S
4  endif
5  forall i ∈ 1 : n/2 do
6    Y[i] := X[2i − 1] ⊕ X[2i]
7  enddo
8  Z := SCAN(Y, ⊕)
9  forall i ∈ 1 : n do
10   if even(i) then
11     S[i] := Z[i/2]
12   elsif i = 1 then
13     S[1] := X[1]
14   else
15     S[i] := Z[(i − 1)/2] ⊕ X[i]
16   endif
17 enddo
18 return S
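A serial rendering of the recursion may help; this sketch is our own translation of SCAN into Python (0-based arrays internally, but the case analysis follows the 1-based index i of the text) and assumes the input length is a power of two.

```python
def scan(X, op):
    # Serial simulation of Algorithm 4 (recursive prefix sum).
    n = len(X)
    if n == 1:
        return [X[0]]
    Y = [op(X[2 * i], X[2 * i + 1]) for i in range(n // 2)]  # pairwise combine
    Z = scan(Y, op)                                          # recurse on n/2
    S = [None] * n
    for i in range(1, n + 1):        # 1-based index, as in the text
        if i % 2 == 0:
            S[i - 1] = Z[i // 2 - 1]
        elif i == 1:
            S[0] = X[0]
        else:
            S[i - 1] = op(Z[(i - 1) // 2 - 1], X[i - 1])
    return S

print(scan([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))
# -> [1, 3, 6, 10, 15, 21, 28, 36]
```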
Theorem 2 Algorithm 4 correctly computes the prefix sum of the sequence X with step complexity O(log n) and work complexity O(n).
Proof: The correctness of the algorithm is a simple induction on n. The base case is n = 1, which is correct by line 2. Now assume correctness for inputs of size n/2, and consider an input of size n. By the induction hypothesis, Z = SCAN(Y, ⊕) computed in line 8 is correct. Thus, Z[i] = Y[1] ⊕ ⋯ ⊕ Y[i] = X[1] ⊕ ⋯ ⊕ X[2i]. Now consider the three possibilities for i. If i is even (line 11), then S[i] = Z[i/2] = X[1] ⊕ ⋯ ⊕ X[i]. If i = 1 (line 13), then S[1] = X[1]. Finally, if i is odd (line 15), then S[i] = Z[(i − 1)/2] ⊕ X[i] = X[1] ⊕ ⋯ ⊕ X[i − 1] ⊕ X[i]. These three cases are exhaustive, thus establishing the correctness of the algorithm.
To establish the resource bounds, we note that the step and work complexities satisfy the following recurrences:

S(n) = S(n/2) + O(1)   (7)
W(n) = W(n/2) + O(n)   (8)

with S(1) = W(1) = O(1), whose solutions are S(n) = O(log n) and W(n) = O(n). □
each non-root vertex to the root vertex. Our example problem will be to find all the roots of a forest of directed trees, containing a total of n vertices (and at most n edges).
We will represent the forest using an array P (for "Parent") of n integers, such that P[i] = j if and only if (i, j) is an edge in the forest. We will use self-loops to recognize roots, i.e., a vertex i is a root if and only if P[i] = i. The desired output is an array R, such that R[i] is the root of the tree containing vertex i, for 1 ≤ i ≤ n. A sequential algorithm using depth-first search gives T*(n) = O(n).
Algorithm 5 (Finding roots in a forest)

1 forall i ∈ 1 : n do
2   R[i] := P[i]
3   while R[i] ≠ R[R[i]] do
4     R[i] := R[R[i]]
5   endwhile
6 enddo
Note again that in line 4 all instances of R[R[i]] are evaluated before any of the assignments to R[i] are performed. The pointer R[i] is the current "successor" of vertex i, and is initially its parent. At each step, the tree distance between vertices i and R[i] doubles as long as R[i] is not a root of the forest. Let h be the maximum height of a tree in the forest. Then the correctness of the algorithm can be established by induction on h. The algorithm runs on a CREW PRAM. All the writes are distinct, but more than a constant number of vertices may read values from a common vertex, as shown in Figure 2. To establish the step and work complexities of the algorithm, we note that the while-loop iterates at most ⌈log h⌉ times, and each iteration performs O(1) steps and O(n) work. Thus, S(n) = O(log h), and W(n) = O(n log h). These bounds are weak, but we cannot assert anything stronger without assuming more about the input data. The algorithm is not work-efficient unless h is constant. In particular, for a linked list, the algorithm takes O(log n) steps and O(n log n) work. An interesting exercise is to associate with each vertex i the distance to its successor measured along the path in the tree, and to modify the algorithm to correctly maintain this quantity. On termination, this distance will be the distance of vertex i from the root of its tree.
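The doubling behavior is easy to watch in a serial simulation. This sketch is ours and uses 0-based vertex numbers; each round evaluates all reads from a snapshot of R before any writes, matching the forall semantics.

```python
def find_roots(P):
    # Pointer jumping: P[i] is the parent of vertex i; roots have P[i] == i.
    n = len(P)
    R = list(P)
    while any(R[i] != R[R[i]] for i in range(n)):
        # One parallel round: the distance from i to R[i] doubles.
        R = [R[R[i]] for i in range(n)]
    return R

# A forest on vertices 0..5: a path 0->1->2->3 (root 3) and an edge 4->5 (root 5).
print(find_roots([1, 2, 3, 3, 5, 5]))
# -> [3, 3, 3, 3, 5, 5]
```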
The algorithm glosses over one important detail: how do we know when to stop iterating the while-loop? The first idea is to use a fixed iteration count, as follows. Since the height of the tallest tree has a trivial upper bound of n, we do not need to repeat the pointer jumping loop more than ⌈log n⌉ times.
forall i ∈ 1 : n do
  R[i] := P[i]
enddo
for j = 1 to ⌈log n⌉ do
  forall i ∈ 1 : n do
    R[i] := R[R[i]]
  enddo
enddo

This is correct but inefficient, since our forest might consist of many shallow and bushy trees. Its work complexity is O(n log n) instead of O(n log h), and its step complexity is O(log n) instead of O(log h). The second idea is an "honest" termination detection algorithm, as follows.
forall i ∈ 1 : n do
  R[i] := P[i]
enddo
repeat
  forall i ∈ 1 : n do
    R[i] := R[R[i]]
    D[i] := if R[i] ≠ R[R[i]] then 1 else 0 endif
  enddo
  c := REDUCE(D, +)
until c = 0

Figure 2: Three iterations of line 4 in Algorithm 5 on a forest with 13 vertices and two trees.
This approach has the desired work complexity of O(n log h), but its step complexity is O(log n log h), since we perform an O(log n)-step reduction in each of the O(log h) iterations.
In the design of parallel algorithms, minimizing work complexity is most important, hence we would probably favor the use of honest termination detection in Algorithm 5. However, the basic algorithm, even with this modification, is not fully work-efficient. The algorithm can be made work-efficient using the techniques presented in the next section; details may be found in 3.1 of JáJá.
c[j] := 0, for 1 ≤ j ≤ m
for i = 1 to n do
  c[V[i]] := c[V[i]] + 1
enddo

To create a parallel algorithm, we might construct the n × m matrix M where

M[i, j] = 1 if V[i] = j, and 0 otherwise,

in parallel. Now to find the number of occurrences of j in V, we simply sum column j of M, i.e. c[j] = Σ_i M[i, j]. The
complete algorithm is
forall (i, j) ∈ (1 : n) × (1 : m) do
  M[i, j] := 0
enddo
forall i ∈ 1 : n do
  M[i, V[i]] := 1
enddo
forall j ∈ 1 : m do
  c[j] := REDUCE(M[1 : n, j], +)
enddo
The step complexity of this algorithm is O(log n), as a result of the step complexity of the REDUCE operations. The work complexity of the algorithm is O(nm), as a result of the first and last forall constructs. The algorithm is not work-efficient because M is too large to initialize and too large to sum up with only O(n) work. However, a variant of the efficient sequential algorithm given earlier can create and sum successive rows of M in O(m) (sequential) steps and O(m) work. Using n/m parallel applications of this sequential algorithm, we can create an (n/m) × m matrix M′ in O(m) steps while performing a total of O(n) work. Subsequently we can compute the column sums of M′ with these same complexity bounds.
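The cascading idea can be sketched serially. The names V (the input sequence of values in 1..m), M (the matrix of partial counts), and m are taken from the discussion above; the group loop below is the part that would run as a parallel forall in the WT version.

```python
def cascaded_count(V, m):
    # Split V into n/m groups of size m; each group runs the sequential
    # counting loop on its own row of M, then column sums give the totals.
    n = len(V)                      # assumes m divides n; values in 1..m
    groups = n // m
    M = [[0] * m for _ in range(groups)]
    for k in range(groups):         # forall over groups in the WT version
        for i in range(m):          # sequential loop inside each group
            M[k][V[k * m + i] - 1] += 1
    # Column sums (a REDUCE per column in the WT version).
    return [sum(M[k][j] for k in range(groups)) for j in range(m)]

print(cascaded_count([1, 2, 2, 3, 3, 3, 1, 2, 3], 3))
# counts of the values 1, 2, 3 -> [2, 3, 4]
```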
Algorithm 6 (Occurrence counting, cascaded)

1  forall (k, j) ∈ (1 : n/m) × (1 : m) do
2    M′[k, j] := 0
3  enddo
4  forall k ∈ 1 : n/m do
5    for i = 1 to m do
6      v := V[(k − 1)m + i]
7      M′[k, v] := M′[k, v] + 1
8    enddo
9  enddo
10 forall j ∈ 1 : m do
11   c[j] := REDUCE(M′[1 : n/m, j], +)
12 enddo
The cascaded algorithm has S(n) = O(m + log n) and W(n) = O(n), and hence has been made work-efficient without an asymptotic increase in step complexity. The algorithm runs on the EREW PRAM model.
Algorithm 7 (Tree vertex depths)

1 local integer array A[1 : 2n]
2 forall v ∈ 1 : n do
3   A[L[v]] := 1
4   A[R[v]] := −1
5 enddo
6 B := EXCL-SCAN(A, +)
7 forall v ∈ 1 : n do
8   depth[v] := B[L[v]]
9 enddo
Algorithm 7 shows how to use this property to obtain the depth of each vertex in the tree. For this algorithm, the tree with n vertices is represented as two arrays L and R of length n in shared memory (for left parentheses and right parentheses, respectively). These arrays need to be created from an Euler tour that starts and ends at the root. L[v] is the earliest position in the tour in which vertex v is visited. R[v] is the latest position in the tour in which vertex v is visited. Figure 3 illustrates the operation of Algorithm 7. The algorithm runs on an EREW PRAM with step complexity O(log n) and work complexity O(n).

[Figure 3: the arrays L, R, A, and the resulting depths for an example tree.]
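Algorithm 7 can be sketched serially with an exclusive prefix sum. This rendering is ours: it uses 0-based tour positions and vertex numbers, with +1 at each left parenthesis and −1 at each right parenthesis.

```python
from itertools import accumulate

def depths(L, R):
    # L[v], R[v]: first and last positions of vertex v in the Euler tour.
    n = len(L)
    A = [0] * (2 * n)
    for v in range(n):
        A[L[v]] = 1      # left parenthesis: descend one level
        A[R[v]] = -1     # right parenthesis: ascend one level
    B = [0] + list(accumulate(A))[:-1]   # exclusive +-scan of A
    return [B[L[v]] for v in range(n)]   # depth read off at each left paren

# A 3-vertex tree: root 0 with children 1 and 2.
# Parenthesis sequence ( ( ) ( ) ) gives L = [0, 1, 3], R = [5, 2, 4].
print(depths([0, 1, 3], [5, 2, 4]))
# -> [0, 1, 1]
```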
We now describe how to construct an Euler tour from a pointer-based representation of a tree T. We assume that the edges of T are represented as a set of adjacency lists, one for each vertex, and assume further that the adjacency list L_u for vertex u is circularly linked, as shown in Figure 4. An element of the list L_u defines the edge (u, v). The symmetric edge (v, u) is found on list L_v. We assume that symmetric edges are linked by pointers in both directions (shown as dashed arrows in Figure 4). Thus there are a total of 2(n − 1) edges in the adjacency lists, and we assume that these elements are organized in an array E of size 2(n − 1).
If we consider the neighbors v_0, …, v_{d−1} of vertex u in the order in which they appear on L_u, where d is the degree of vertex u, and we define the successor function as succ((v_i, u)) = (u, v_{(i+1) mod d}), then we have defined a valid Euler tour of T. (To prove this we have to show that we create a single cycle rather than a set of edge-disjoint cycles. We establish this fact by induction on the number of vertices.)
The successor function can be evaluated in parallel for each edge in E by following the symmetric (dashed) pointer and then the adjacency list (solid) pointer. This requires O(1) steps with O(n) work complexity, which is work-efficient. Furthermore, since the two pointers followed for each element in E are unique, we can do this on an EREW PRAM.
To illustrate this strategy in a parallel setting, we consider the planar convex hull problem. We are given a set S = {p_1, …, p_n} of n points, where each point p_i is an ordered pair of coordinates (x_i, y_i). We further assume that the points are sorted by x-coordinate. (If not, this can be done as a preprocessing step with low enough complexity bounds.) We are asked to determine the convex hull CH(S), i.e., the smallest convex polygon containing all the points of S, by enumerating the vertices of this polygon in clockwise order. Figure 5 shows an instance of this problem.
The sequential complexity of this problem is Θ(n log n). Any of several well-known algorithms for this problem establishes the upper bound. A reduction from comparison-based sorting establishes the lower bound. See
Figure 4: Building the Euler tour representation of a tree from a pointer-based representation.
[Figure 5: an instance of the convex hull problem, showing the upper hull UH(S), the extreme points p_1 and p_n, and the halves S_1 and S_2.]
forall i do
  UH(S)[i] := if i ≤ q then UH(S_1)[i] else UH(S_2)[i − q + t − 1] endif
enddo

Here q and t denote the positions of the endpoints of the common tangent in UH(S_1) and UH(S_2). This algorithm requires a minimal model of a CREW PRAM. To analyze its complexity, we note that

S(n) = S(n/2) + O(log n)   (9)
W(n) = 2W(n/2) + O(n)   (10)

giving us S(n) = O(log² n) and W(n) = O(n log n).
The technique of symmetry breaking is used in PRAM algorithms to distinguish between identical-looking elements. This can be deterministic or probabilistic. We will study a randomized algorithm (known as the random mate algorithm) to determine the connected components of an undirected graph as an illustration of this technique. See 30.5 of CLR for an example of a deterministic symmetry breaking algorithm.
Let G = (V, E) be an undirected graph. We say that edge (u, v) hits vertices u and v. The degree of a vertex v is the number of edges that hit v. A path from u to v (denoted u ⇝ v) is a sequence of vertices u = v_1, …, v_k = v such that (v_i, v_{i+1}) ∈ E for 1 ≤ i < k. A connected subgraph is a subset C of V such that for all u, v ∈ C we have u ⇝ v. A connected component is a maximal connected subgraph. A supervertex is a directed rooted tree data structure used to represent a connected subgraph. We use the standard disjoint-set conventions of edges directed from children to parents and self-loops for roots to represent supervertices.
We can find connected components optimally in a sequential model using depth-first search. Thus, T*(|V|, |E|) = O(|V| + |E|). Our parallel algorithm will actually be similar to the algorithm in 22.1 of CLR. The idea behind the algorithm is to merge supervertices to get bigger supervertices. In the sequential case, we examine the edges in a predetermined order. For our parallel algorithm, we would like to examine multiple edges at each time step. We break symmetry by arbitrarily choosing the next supervertices to merge, by randomly assigning genders to supervertices. We call a graph edge (u, v) live if u and v belong to different supervertices, and we call a supervertex live if at least one live edge hits some vertex of the supervertex. While we still have live edges, we will merge supervertices of opposite gender connected by a live edge. This merging includes a path compression step. When we run out of live edges, we have the connected components.
Figure 6: Details of the merging step of Algorithm 8. Graph edges are undirected and shown as dashed lines. Super-
vertex edges are directed and are shown as solid lines.
Algorithm 8 (Random-mate connected components)

1  forall v ∈ V do
2    parent[v] := v
3  enddo
4  while there are live edges in G do
5    forall v ∈ V do
6      gender[v] := rand({M, F})
7    enddo
8    forall (u, v) ∈ E such that live((u, v)) do
9      if gender[parent[u]] = M and gender[parent[v]] = F then
10       parent[parent[u]] := parent[v]
11     endif
12     if gender[parent[v]] = M and gender[parent[u]] = F then
13       parent[parent[v]] := parent[u]
14     endif
15   enddo
16   forall v ∈ V do
17     parent[v] := parent[parent[v]]
18   enddo
19 endwhile
Figure 6 shows the details of the merging step of Algorithm 8. We establish the complexity of this algorithm by
proving a succession of lemmas about its behavior.
Lemma 1 After each iteration of the outer while-loop, each supervertex is a star (a tree of height zero or one).
Proof: The proof is by induction on the number of iterations executed. Before any iterations of the loop have been executed, each vertex is a supervertex with height zero by the initialization in line 2. Now assume that the claim holds after k iterations, and consider what happens in the (k + 1)st iteration. Refer to Figure 6. After the forall loop in line 8, the height of a supervertex can increase by one, so it is at most two. After the compression step in line 16, the height goes back to at most one. □
Lemma 2 Each iteration of the while-loop takes O(1) steps and O(|V| + |E|) work.
Proof: This is easy. The only nonobvious part is determining live edges, which can be done in O(1) steps and O(|E|) work. Since each supervertex is a star by Lemma 1, edge (u, v) is live if and only if parent[u] ≠ parent[v]. □
Lemma 3 The probability that at a given iteration a live supervertex is joined to another supervertex is at least 1/4.
Proof: A live supervertex has at least one live edge. The supervertex will get a new root if and only if its gender is M and it has a live edge to a supervertex whose gender is F. The probability of this is 1/2 · 1/2 = 1/4 for a single live edge. The probability is at least this, since the supervertex may have more than one live edge. □
Lemma 4 The probability that a vertex is a live root after c log_{4/3} |V| iterations of the while-loop is at most |V|^{−c}.
Proof: By Lemma 3, a supervertex that is live at iteration i remains live after iteration i with probability at most 3/4. Therefore the probability that it is live after c log_{4/3} |V| iterations is at most (3/4)^{c log_{4/3} |V|} = |V|^{−c}. □

Lemma 5 The expected number of live supervertices after c log_{4/3} |V| iterations of the while-loop is at most |V|^{1−c}.
Proof: Immediate from Lemma 4 and linearity of expectation, since there are at most |V| supervertices. □
Theorem 3 With probability at most |V|^{1−c}, the algorithm will not have terminated after c log_{4/3} |V| iterations.
Proof: Let q_i be the probability of having exactly i live supervertices after c log_{4/3} |V| iterations. By the definition of expectation, the expected number of live supervertices after these iterations is Σ_i i · q_i, and by Lemma 5, this is at most |V|^{1−c}. Since the i and q_i are all positive, Σ_{i ≥ 1} q_i ≤ Σ_i i · q_i ≤ |V|^{1−c}. Now, the algorithm terminates when the number of live supervertices is zero. Therefore, Σ_{i ≥ 1} q_i is the probability of still having work to do after c log_{4/3} |V| iterations. □
The random mate algorithm requires a CRCW PRAM model. Concurrent writes occur in the merging step (line 8), since different vertices can have a common parent.
The step complexity of the algorithm is O(log |V|) with high probability, as a consequence of Theorem 3. The work complexity is O((|V| + |E|) log |V|) by Theorem 3 and Lemma 2. Thus the random mate algorithm is not work-optimal.
A key factor in this algorithm is that paths in supervertices are short (in fact, O(1)). This allows the supervertices after merging to be converted back to stars in a single iteration of path compression in line 16. If we used some deterministic algorithm to break symmetry, we would not be able to guarantee short paths. We would have multiple supervertices and long paths within supervertices, and the step complexity of such an algorithm would be asymptotically larger. There is a deterministic algorithm due to Shiloach and Vishkin that avoids this problem by not doing complete path compression at each step. Instead, it maintains a complicated set of invariants that ensure that the supervertices left when the algorithm terminates truly represent the connected components.
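A serial sketch of Algorithm 8 (ours; 0-based vertices, with the serial edge loop standing in for the concurrent-write merging step) shows how the gender assignment breaks symmetry between supervertices.

```python
import random

def connected_components(n, edges):
    # parent[v] encodes the supervertex forest; roots have parent[v] == v.
    parent = list(range(n))
    # Valid star-based live test (Lemma 2): edge is live iff roots differ.
    live = lambda u, v: parent[u] != parent[v]
    while any(live(u, v) for u, v in edges):
        gender = [random.choice("MF") for _ in range(n)]
        for u, v in edges:                       # merging step (line 8)
            if live(u, v):
                if gender[parent[u]] == "M" and gender[parent[v]] == "F":
                    parent[parent[u]] = parent[v]
                elif gender[parent[v]] == "M" and gender[parent[u]] == "F":
                    parent[parent[v]] = parent[u]
        # Path compression (line 16), reads from a snapshot of parent.
        parent = [parent[parent[v]] for v in range(n)]
    return parent

p = connected_components(6, [(0, 1), (1, 2), (4, 5)])
print(p)   # one root label per component, e.g. vertices 0-2 share a label
```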
Enumerate The enumerate operation takes a Boolean vector and numbers its true elements, as follows.

sequence integer ENUMERATE(sequence boolean Flag)
1 forall i ∈ 1 : |Flag| do V[i] := if Flag[i] then 1 else 0 endif enddo
2 return SCAN(V, +)
Copy The copy (or distribute) operation copies an integer value across an array.

sequence integer COPY(integer v, integer n)
1 forall i ∈ 1 : n do R[i] := v enddo
2 return R
Pack The pack operation takes a vector A of values and a Boolean vector F of flags, and returns a vector containing only those elements of A whose corresponding flags are true.

sequence T PACK(sequence T A, sequence boolean F)
1 I := ENUMERATE(F)
2 forall i ∈ 1 : |F| do
3   if F[i] then R[I[i]] := A[i] endif
4 enddo
5 return R

For example, with F = [true, true, false, true, false, false, true], PACK returns the elements of A in positions 1, 2, 4, and 7, in that order.
Split The split operation takes a vector A of values and a Boolean vector F of flags, and returns a vector with the elements with false flags moved to the bottom and the elements with true flags moved to the top.

sequence T SPLIT(sequence T A, sequence boolean F)
1 Down := ENUMERATE(not(F))
2 Up := ENUMERATE(F)
3 forall i ∈ 1 : |F| do
4   Index[i] := if F[i] then Up[i] + Down[|F|] else Down[i] endif
5 enddo
6 forall i ∈ 1 : |F| do
7   R[Index[i]] := A[i]
8 enddo
9 return R
For example, with F = [true, true, false, true, false, false, true], the three false elements of A move to positions 1–3 and the four true elements move to positions 4–7, each group in its original order.
This parallel split can be used as the core routine in a parallel radix sort. Note that split is stable: this is critical.
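The split-based radix sort can be sketched serially. The code below is our own rendering: enumerate is a prefix count of the flags, split places falses before trues stably, and the radix sort applies split to one bit at a time, least-significant first.

```python
def enumerate_flags(F):
    # Inclusive prefix sum of the flags: numbers the true elements 1, 2, ...
    out, c = [], 0
    for f in F:
        c += bool(f)
        out.append(c)
    return out

def split(A, F):
    # Stable split: false-flagged elements first, then true-flagged ones.
    down = enumerate_flags([not f for f in F])
    up = enumerate_flags(F)
    nf = down[-1]                     # total number of false flags
    R = [None] * len(A)
    for i, a in enumerate(A):
        R[(down[i] if not F[i] else nf + up[i]) - 1] = a
    return R

def radix_sort(A, bits):
    # Stability of split makes the bit-by-bit passes a correct radix sort.
    for b in range(bits):
        A = split(A, [(x >> b) & 1 for x in A])
    return A

print(split("abcdefg", [True, True, False, True, False, False, True]))
# -> ['c', 'e', 'f', 'a', 'b', 'd', 'g']
print(radix_sort([5, 3, 7, 0, 2, 6], 3))
# -> [0, 2, 3, 5, 6, 7]
```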
Algorithm 9 (Parallel lexical analysis)
Input: A DFA D = (Q, Σ, δ, q_0, F), input sequence x of n symbols.
Output: Tokenization of x.

1. Replace each symbol x_i of the input sequence x with the array representation of the function δ(·, x_i) : Q → Q. Call this sequence y.
2. Perform a scan of y using function composition as the operator. This scan replaces symbol x_i of the original input sequence with a function f_i that represents the state-to-state transition function for the prefix x_1 … x_i of the sequence. Note that we need a CREW PRAM to execute this scan (why?).
3. Create the sequence z where z_i = f_i(q_0). That is, use the initial state of the DFA to index each of these arrays. Now we have replaced each symbol by the state the DFA would be in after consuming that symbol. The states that are in the set of final states demarcate token boundaries.
A sequential algorithm for lexical analysis on an input sequence of n symbols has complexity O(n). The parallel algorithm has step complexity O(log n) and work complexity O(n |Q|). Thus the parallel algorithm is faster but not work-efficient.
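The function-composition scan at the heart of Algorithm 9 can be sketched serially. The toy two-state DFA below is our own example, not from the text; each transition function is represented as a tuple indexed by state, and the scan here is serial (associativity of composition is what would permit the balanced-tree parallel evaluation).

```python
def compose(f, g):
    # Composition of transition functions: state -> g(f(state)),
    # i.e. apply f's symbol first, then g's.
    return tuple(g[s] for s in f)

def prefix_states(delta, q0, x):
    # Step 1: one tuple per symbol, the array form of delta(., x_i).
    funcs = [tuple(delta[q][c] for q in range(len(delta))) for c in x]
    # Step 2 + 3: inclusive scan by composition, then index with q0.
    out, acc = [], None
    for f in funcs:
        acc = f if acc is None else compose(acc, f)
        out.append(acc[q0])          # state after consuming the prefix
    return out

# Toy DFA over {a, b}: state 1 after reading 'a', state 0 after 'b'.
delta = {0: {'a': 1, 'b': 0}, 1: {'a': 1, 'b': 0}}
print(prefix_states(delta, 0, "aababb"))
# -> [1, 1, 0, 1, 0, 0]
```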
To show the power of concurrent writes, we reconsider the problem of finding the maximum element of a sequence A of n elements. We have seen one solution to this problem before using the binary tree technique. That resulted in a work-efficient algorithm for an EREW PRAM, with work complexity O(n) and step complexity O(log n). Can we produce a CRCW algorithm with lower step complexity? The answer is yes, as shown by the following algorithm.
1  local integer i, j
2  forall i ∈ 1 : n do
3    M[i] := 1
4  enddo
5  forall i ∈ 1 : n do
6    forall j ∈ 1 : n do
7      B[i, j] := (A[i] ≥ A[j])
8    enddo
9  enddo
10 forall i ∈ 1 : n do
11   forall j ∈ 1 : n do
12     if not B[i, j] then
13       M[i] := 0
14     endif
15   enddo
16 enddo
It is easy to verify that at the end of this computation M[i] = 1 if and only if A[i] is a maximum element. Analysis of this algorithm reveals that S(n) = O(1) but W(n) = O(n²). Thus the algorithm is very fast but far indeed from being work-efficient. However, we may cascade this algorithm with the sequential maximum reduction algorithm or the EREW PRAM maximum reduction algorithm to obtain a work-efficient CRCW PRAM algorithm with O(log log n) step complexity. This is optimal for the Common and Arbitrary CRCW models. Note that a trivial work-efficient algorithm with S(n) = O(1) exists for the maximum value problem in the Combining-CRCW model, which demonstrates the additional power of this model.
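A serial simulation of the constant-step algorithm makes the write pattern visible. The sketch below is ours: the repeated writes of 0 into M[i] in the inner loop stand in for the Common-CRCW concurrent writes (all writers agree on the value 0), and the two loop nests each correspond to one parallel step.

```python
def crcw_max(A):
    # O(1)-step CRCW maximum: n^2 pairwise comparisons, then each element
    # that lost any comparison has its flag cleared by a concurrent write.
    n = len(A)
    M = [1] * n
    for i in range(n):          # both loops are foralls in the PRAM version
        for j in range(n):
            if A[i] < A[j]:
                M[i] = 0        # concurrent write of a common value
    return next(A[i] for i in range(n) if M[i])

print(crcw_max([4, 9, 1, 9, 5]))   # -> 9
```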
The step complexity for maximum in CRCW models is suspicious, and points to the lack of realism in the
CRCW PRAM model. Nevertheless, a bit-serial maximum reduction algorithm based on ideas like the above but
Copyright c Siddhartha Chatterjee, Jan Prins 1997–2002 21
employing only single-bit concurrent writes (i.e. a wired “or” tree), has proved to be extremely fast and practical in a
number of SIMD machines. The CRCW PRAM model can easily be used to construct completely unrealistic parallel
algorithms, but it remains important because it has also led to some very practical algorithms.
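The bit-serial scheme can be sketched as follows; the details (function name, bit ordering loop) are illustrative assumptions. Scanning from the most significant bit down, all surviving candidates drive their current bit onto a shared OR line, which is exactly a single-bit concurrent write; any candidate whose bit is 0 while the line reads 1 drops out.

```python
def wired_or_max(values, bits):
    """Maximum via a wired-OR line, one round per bit,
    most significant bit first."""
    active = list(values)
    for b in range(bits - 1, -1, -1):
        # Single-bit concurrent write: the OR of bit b of all survivors.
        line = any((v >> b) & 1 for v in active)
        if line:
            # Candidates with a 0 in this bit cannot be maximal.
            active = [v for v in active if (v >> b) & 1]
    return active[0]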
Lemma 6 (Cole 1988) An n-processor EREW PRAM can sort n elements in O(log n) steps.
Based on this lemma, we can prove the following theorem.
Theorem 4 A p-processor EREW PRAM can simulate a p-processor Priority CRCW PRAM with Θ(log p) slowdown.
Proof: In fact, all we will show is a simulation that guarantees O(log p) slowdown. The Ω(log p) lower bound does hold, but establishing it is beyond the scope of this course. See Chapter 10 of JáJá if you are interested.
Assume that the CRCW PRAM has processors P_1 through P_p and m memory locations M_1 through M_m, and that the EREW PRAM has the same number of processors but 2p extra memory locations. We will show how to simulate on the EREW PRAM a CR or CW step in which processor P_i accesses memory location M_{j_i}. In our simulation, we will use an array A to store the pairs (j_i, i), and an array B to record the processor that finally got the right to access a memory location.
The following code describes the simulation.

1  forall i in 1..p do
2      Processor P_i writes the pair (j_i, i) into A[i]
3  enddo
4  Sort A first on the j's and then on the i's.
5  forall i in 1..p do
6      Processor P_1 reads A[1] = (j, k) and sets B[1] to 1. For i > 1, processor P_i reads
       A[i-1] = (j', k') and A[i] = (j, k) and sets B[i] to 1 if j ≠ j' and to 0 otherwise.
7  enddo
8  forall i in 1..p do
9      For CW, processor P_k writes its value to M_j if B[i] = 1.
10     For CR, processor P_k reads the value in M_j if B[i] = 1 and duplicates the value for the other
       processors accessing M_j in O(log p) time.
11 enddo
Line 1 takes O(1) steps. By Lemma 6, line 4 takes O(log p) steps. Line 5 again takes O(1) steps, and line 8 takes O(1) steps for a concurrent write access and O(log p) steps for a concurrent read access. Thus the simulation of the CRCW step runs in O(log p) EREW steps.
Figure 7 shows an example of this simulation.
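The concurrent-write case of the simulation can be sketched serially as follows; the function name and the dictionary standing in for shared memory are illustrative assumptions. Because the pairs are sorted on location first and processor number second, the first pair of each group of contenders belongs to the lowest-numbered contender, which is exactly the winner under the Priority write rule.

```python
def simulate_priority_cw(locs, vals, memory):
    """CRCW processor i wants to write vals[i] to location locs[i].
    Resolve the writes the way the EREW simulation does."""
    p = len(locs)
    A = sorted((locs[i], i) for i in range(p))   # line 4: sort the pairs
    B = [0] * p
    for k in range(p):                           # lines 5-7: B marks the
        if k == 0 or A[k][0] != A[k - 1][0]:     # first pair of each group
            B[k] = 1
    for k in range(p):                           # lines 8-11: only marked
        if B[k]:                                 # processors write
            j, i = A[k]
            memory[j] = vals[i]
    return memory
```

Each location contended for receives the value of its lowest-numbered writer, and no two writes in the final loop touch the same location, so the step is exclusive-write as required.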
[Figure 7 tabulates, for each of the five processors, the memory location accessed and the value written, before and after the sort.]
Figure 7: Illustration of the simulation of a concurrent write by a five-processor EREW PRAM. For each memory location contended for, one processor succeeds in writing its value into that location.
1. Current memory components resolve a sequential stream of references at some maximum rate. With increasing p, the memory system must deliver an increasing amount of data per unit time, which in turn implies that an increasing number of memory components (banks) must eventually be employed. To connect these processors and memory banks simultaneously requires a switching network that cannot have constant latency.
2. Current memory components permit at most a constant number of simultaneous reads, and, as we have seen in the previous two sections, this means that there is an Ω(log p) latency involved in the servicing of concurrent reads and writes using these memory components.
3. With increasing p, the physical volume of an actual machine must increase, so at least some memory locations must be at a greater distance from some processors. Thus the PRAM memory latency must scale as Ω(p^(1/3)).
Actually, a similar argument shows that we cannot even implement constant-time memory references in a simple RAM model as we increase memory size. This is why caches are pervasive and why the RAM model is not a completely accurate cost model for sequential computing. But for a PRAM, the presence of much more "stuff" to scale (processors, interconnection network, memory banks) greatly exacerbates the increase in latency with increasing p.
Nevertheless, a PRAM algorithm can be a valuable start for a practical parallel implementation. For example, any algorithm that runs efficiently in a p-processor PRAM model can be translated into an algorithm that runs efficiently on a (p/L)-processor machine with a latency-L memory system, a much more realistic machine than the PRAM. In the translated algorithm, each of the p/L processors simulates L PRAM processors. The memory latency is "hidden" because a processor has L units of useful and independent work to perform while waiting for a memory access to complete. Parallel vector processors or the recent Tera MTA (multi-threaded architecture) are examples of shared-memory parallel machines that are suitable candidates for this translation.
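The latency-hiding argument can be illustrated with a toy schedule; the function name and the stall-counting setup are illustrative assumptions, not from the notes. One physical processor cycles round-robin through V virtual PRAM processors, each issuing a latency-L memory request on its turn; when V = L, every request has completed by its issuer's next turn, so no turn ever stalls, whereas with too little slack (V < L) stalls appear.

```python
def count_stalls(V, L, steps):
    """Round-robin V virtual processors on one physical processor.
    Each virtual processor issues a latency-L memory request every
    turn; count the turns at which a request is still outstanding."""
    ready_at = [0] * V       # completion time of each pending request
    stalls, t = 0, 0
    for _ in range(steps):
        for v in range(V):
            if t < ready_at[v]:      # request not yet complete: stall
                stalls += 1
            ready_at[v] = t + L      # issue the next memory request
            t += 1                   # one unit of independent work
    return stalls
```

With V = L, virtual processor v runs at times v, v + L, v + 2L, ..., and a request issued at time t completes exactly at t + L, its next turn, which is the sense in which L units of independent work per access hide the latency completely.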
In other cases the PRAM algorithm can be arranged to reduce its synchronization and shared memory access
requirements to make it a better match to shared memory multiprocessors based on conventional processors with
caches. This is the topic of the next unit of this course.