Domain Decomposition
on Parallel Computers
William D. Gropp† and David E. Keyes‡
Research Report YALEU/DCS/RR-723
August 1989
YALE UNIVERSITY
DEPARTMENT OF COMPUTER SCIENCE
We consider the application of domain decomposition techniques to the solution of sparse linear
systems arising from implicit PDE discretizations on parallel computers. Representatives of two
popular MIMD architectures, message passing (the Intel iPSC/2-SX) and shared memory (the
Encore Multimax 320), are employed. We run the same numerical experiments on each, namely
stripwise and boxwise decompositions of the unit square, using up to 64 subdomains and containing
up to 64K degrees of freedom. We produce a tight-fitting complexity model for the former and
discuss the difficulty of doing so for the latter. We also evaluate which of three types of domain
decomposition preconditioners that have appeared in the literature of self-adjoint elliptic problems
are most efficient in different regions of machine-problem parameter space. Some form of global
sharing of information in the preconditioner is required for efficient overall parallel implementation
in the region of most practical interest (large problem sizes and large numbers of processors);
otherwise, an increasing iteration count inveighs against the gains of concurrency. Our results
on a per iteration basis also hold for sparse discrete systems arising from other types of partial
differential equations, but in the absence of a theory for the dependence of the convergence rate
upon the granularity of the decomposition, the overall results are only suggestive for more general
systems.
Approved for public release; distribution is unlimited.
† Department of Computer Science, Yale University, New Haven, CT 06520. The work of this
author was supported in part by the Office of Naval Research under contract N00014-86-K-0310
and the National Science Foundation under contract number DCR 8521451.
‡ Department of Mechanical Engineering, Yale University, New Haven, CT 06520. The work of
this author was supported in part by the National Science Foundation under contract number
EET-8717109.
1. Introduction
Domain decomposition techniques appear to be a natural way to distribute the solution of
large sparse linear systems across many parallel processors. In this paper we develop complexity
estimates for two types of decompositions and two parameterized types of "real" parallel computer
architectures, and validate those estimates on representative machines, with particular emphasis
on the case of large numbers of processors and large problems. We examine the tradeoffs between
various forms of preconditioning, as characterized by the efficiency of their parallel implementation.
Parallel computers may be divided into two broad classes: distributed memory and shared
memory. In a distributed memory parallel processor, each processor has its own memory and
no direct access to memory on any other processor. Such machines are usually termed "message
passing" computers since interprocessor communication is accomplished through the sending and
receiving of messages. In a shared memory parallel processor, each processor has direct, random access to the same memory space as every other processor. Interprocessor communication is
conducted directly in the shared memory. In practice, of course, most shared memory machines
have local memory, called the cache, and communication is through messages, called cache faults.
However, each type of parallel processor is optimized for a different interprocessor communication
pattern, and we consider the effects of these optimizations on domain decomposition.
Domain decomposition refers in a generic way to the replacement of a partial differential
equation problem defined over a global domain with a series of problems over subdomains which
collectively cover the original. Early domain decomposition techniques, whether iterative [12] or
direct [10], were based on exact reductions of the global problem to a set of lower-dimensional
problems on interfaces between subdomains by means of direct elimination of the degrees of freedom
interior to the subdomains. In this sense, domain decomposition is analogous to the finite element
procedure of static condensation. A more modern viewpoint has emerged [2, 13] in which the
interior unknowns are not directly eliminated, but retained in the outer iteration, preconditioned
by an appropriate fast approximate solver. In [8], we referred to such approaches as "partitioned
matrix methods", and it is such methods that we consider here, in combination with preconditioned conjugate
gradients as the outer iterative method. Partitioned matrix iteration was shown in [8] to be identical
to iteration on the reduced system when the subdomain solves are exact. However, it has the
advantage of being much more flexible for problems for which the only known exact solvers are
expensive. The important question as to how much approximate subdomain solves "penalize"
overall iterative convergence has begun to be addressed (see, in addition to the references above,
[1] and section 6 of [4]).
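To fix ideas, the following minimal sketch (not the authors' code) shows a preconditioned conjugate gradient outer iteration in which the preconditioner application is an opaque callback; in the partitioned matrix approach that callback performs the approximate subdomain solves (and possibly interface and cross-point solves). The routines apply_A and apply_M_inv and the vector length n are hypothetical placeholders.

    #include <stdlib.h>
    #include <math.h>

    /* Hypothetical problem-specific routines: y = A*x for the discretized
       operator, and z = M^{-1}*r for a domain decomposition preconditioner
       (block subdomain solves, possibly plus interface and cross-point solves). */
    void apply_A(int n, const double *x, double *y);
    void apply_M_inv(int n, const double *r, double *z);

    static double dot(int n, const double *a, const double *b) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

    /* Preconditioned conjugate gradients: solves A x = b, returns iteration count. */
    int pcg(int n, const double *b, double *x, double tol, int maxit) {
        double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
        double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);
        int it;

        apply_A(n, x, r);                       /* r = b - A*x */
        for (int i = 0; i < n; i++) r[i] = b[i] - r[i];
        apply_M_inv(n, r, z);                   /* z = M^{-1} r (subdomain solves) */
        for (int i = 0; i < n; i++) p[i] = z[i];
        double rz = dot(n, r, z), r0 = sqrt(dot(n, r, r));

        for (it = 0; it < maxit; it++) {
            if (sqrt(dot(n, r, r)) <= tol * r0) break;
            apply_A(n, p, q);                   /* matrix-vector product: neighbor data needed */
            double alpha = rz / dot(n, p, q);   /* dot product: global reduction */
            for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
            apply_M_inv(n, r, z);               /* preconditioner solve */
            double rz_new = dot(n, r, z);
            double beta = rz_new / rz;
            rz = rz_new;
            for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
        }
        free(r); free(z); free(p); free(q);
        return it;
    }

The three labeled operations (matrix-vector product, dot products, preconditioner solve) are exactly the cost centers examined in Section 2.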
An example of the simplest form of domain decomposition is shown in Figure la. A single
interface (3) divides a domain into subdomains 1 and 2. The matrix obtained upon finite difference or finite element discretization is partitioned according to the physical decomposition into
subdomains connected by a small (lower dimension) interface region and looks like
            ( A11     0      A13 )
    A_s  =  (  0     A22     A23 )
            ( A13^T  A23^T   A33 )

where A11 and A22 come from the interior of the subdomains, A33 from along the interface, and
A13 and A23 from the interactions between the subdomains and the interface. A detailed view of
this matrix for the 5-point operator is shown in Figure 2.
The generalization of Figure 1a to any number of strips is straightforward. For reasons that
will become clear, it is important also to consider a decomposition into boxes, as shown in Figure 1b.
The matrix for this decomposition looks like

            ( A_I     A_IB     0    )
    A_b  =  ( A_IB^T  A_B     A_BC  )
            (  0      A_BC^T  A_C   )

where

    A_I   =  block diagonal subdomain matrices
    A_IB  =  coupling to interiors from subdomain interfaces
    A_B   =  block diagonal subdomain interfaces
    A_BC  =  coupling to interfaces from cross points
    A_C   =  cross point system
This representation is based on a five-point difference template, so that the coupling matrix A_IC
is zero. More generally, nonzero coupling between the crosspoints and the corner points of the
subdomain interiors would need to be taken into account. A detailed view of this matrix for the
5-point operator is shown in Figure 3, for the case of a decomposition into two by two blocks. We
are interested in various preconditioners for both A_s and A_b, based on their efficacy and on their
parallel limitations.
The choice of preconditioner is critical in domain decomposition, as with any iterative method.
In the context of parallel computing, the main distinction is between preconditioners which are
purely local, those which involve neighbor communication, and those which involve global communication.
(We use the word "communication" here in a general sense; in a shared memory machine,
this refers to shared access to memory.) As example preconditioners we consider a block diagonal
matrix for the purely local case and a preconditioner based on FFT solves along the interfaces for
the neighbor communication case. As an example involving global communication, we consider the
Bramble et al. preconditioner [2], which requires the solution of a linear system for the cross points
(or vertices). This method involves only low bandwidth global communication (that is, the size of
the messages scales with the number of processors); we do not consider any method which uses
high bandwidth communication (where message size scales with the size of the problem).
We develop a complexity model for each type of parallel computer that is based on two major
contributions: floating point work and "shared memory access". This latter term measures the
cost of communicating information between processors. In a distributed memory system, this is
represented by a communication time. In a shared memory system, there are several contributions,
including cache size and bandwidth, and the number of simultaneous memory requests which may
be served.

Figure 1: Two forms of decompositions. A two domain strip decomposition is shown in (a); a many domain box decomposition with cross points (labeled "C") is shown in (b).

Figure 2: The partitioning of the matrix A_s for the 5-point Laplacian and two strips. The dashed lines separate subdomains; the solid lines separate the interior unknowns from the interface unknowns.
2. Comments on Parallelism Costs
In any parallel algorithm, there are a number of different costs to consider. The most obvious
of these are intrinsically serial computations. For example, the dot products in the conjugate
gradient method involve the reduction of values to a single sum; this takes at least log p time. More
subtle are costs from the implementation, both software and hardware.

Figure 3: The partitioning of the matrix A_b for the 5-point Laplacian and four boxes arranged in a two by two manner. The dashed lines separate subdomains; the solid lines separate the interior unknowns from the interface unknowns and the cross point unknown.

An example of the software
cost is the need to guarantee safe access to shared data; this is often handled with barriers or
more general critical regions. Sample hardware costs include bandwidth limits in shared resources
such as memory buses and startup and transfer speeds in communication links. Perhaps the most
subtle cost lies in algorithmic changes to "improve" parallelism; by choosing a poor algorithm
over another, less parallel algorithm, artificially good parallel efficiencies can be found. We call a
high parallel efficiency "artificial" if there exists an algorithm with lower parallel efficiency which
nevertheless executes in less wall-clock time on a given number of processors. Of the many examples
in the literature, one of the most dramatic is [5] on computing the forces in the n-body potential
problem; the naive algorithm is almost perfectly parallel but substantially slower than the
linear-in-n algorithm, which contains some reductions and hence some intrinsically serial computations.
We can identify the following costs in the domain decomposition algorithms that we are considering:
• Dot (inner) products. These involve a reduction, and hence at least log p time; in addition,
there may be some critical sections (depending on the implementation).
• Matrix-vector products. These involve shared data, and hence may introduce some constraints
on shared hardware resources.
• Preconditioner solves. These depend on the preconditioner chosen, and hence give us the most
freedom in trading off greater parallelism against superior algorithmic performance.
Note that the sharing of data is not random; most of the data sharing occurs between neighbors.
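As an illustration of the first item, the sketch below spells out the log p reduction behind each dot product as a recursive-doubling exchange across hypercube dimensions. It is written against MPI purely for concreteness (an anachronistic assumption; the machines considered here had their own native message primitives), and a production code would simply call MPI_Allreduce.

    #include <mpi.h>

    /* Global sum of a local partial dot product by recursive doubling.
       With p processes this takes ceil(log2 p) exchange steps, each of cost
       roughly s + r for a one-word message.  Assumes p is a power of two. */
    double allreduce_sum(double local, MPI_Comm comm) {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        double sum = local;
        for (int mask = 1; mask < p; mask <<= 1) {
            int partner = rank ^ mask;      /* neighbor across one hypercube dimension */
            double recv;
            MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                         &recv, 1, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            sum += recv;
        }
        return sum;                         /* every process ends with the global sum */
    }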
2.1. Message Passing and Shared Memory Models
Two methods for achieving parallelism in computer hardware for MIMD machines are message
passing and shared memory. In both of these, the software and hardware costs discussed above
show up in the cost to access shared data. Each of these methods is optimized for a different
domain, and these optimizations are reflected in the actual costs. In the following, to simplify the
notation, we will express all times in terms of the time to do a floating point operation. Further,
we will drop constant factors from our estimates.
In a message passing machine, each processor (called a node) has some local memory and a set of
communication links to some (usually not all) other processors. Each processor has access only
to its own local memory. Communication of shared data is handled (usually by the programmer)
by explicitly delivering data over the communication links. This takes time s + rn for n words,
where s is a startup time (latency) and r is the time to transfer a single word. This is good for
local or nearest neighbor communication. For more global communication (such as a dot product),
times depend on the interconnection network. For a hypercube, the global time is (s + rn) log p;
for a mesh, it is (s + rn)√p.
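For later reference, these costs can be collected into a tiny model; the sketch below evaluates them in flop-time units. The numerical values of s, r, and n in the example are assumptions chosen only for illustration, not measured machine parameters.

    #include <math.h>
    #include <stdio.h>

    /* Message passing cost model, in floating-point-operation time units.
       s = startup (latency), r = per-word transfer time, n = words, p = processes. */
    double t_msg(double s, double r, double n)             { return s + r * n; }
    double t_global_hypercube(double s, double r, double n, double p)
                                                            { return (s + r * n) * log2(p); }
    double t_global_mesh(double s, double r, double n, double p)
                                                            { return (s + r * n) * sqrt(p); }

    int main(void) {
        /* Illustrative (assumed) parameters: s = 500, r = 2 flop-times, n = 256 words. */
        double s = 500.0, r = 2.0, n = 256.0;
        for (double p = 2; p <= 64; p *= 2)
            printf("p=%2.0f  neighbor=%7.0f  hypercube=%8.0f  mesh=%8.0f\n",
                   p, t_msg(s, r, n), t_global_hypercube(s, r, n, p),
                   t_global_mesh(s, r, n, p));
        return 0;
    }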
In a shared memory machine, each processor may directly access a shared global memory.
Communication of shared data (access to it) is handled by simply reading the data. However, the
actual implementation of this introduces a number of limits. For example, if the memory is on a
common bus, then there is a limit to the number of processors that can simultaneously read from
the shared memory. One way to model this cost is as 1 + p/min(p, P) [7]. Here p is the number of
processors and P is the maximum number of processors that may use the resource at one time.
In addition, the access to the shared data must be controlled; this can add costs in the form of
barriers or critical sections, which contribute additional terms proportional to
log p.
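A corresponding one-line model of the shared resource cost, following the form quoted above from [7]; the limit P is whatever the bus or memory bank structure imposes and must be supplied by the user.

    /* Contention factor for a shared resource: p processors competing for a
       resource that can serve at most P of them at once (the 1 + p/min(p, P)
       model quoted above from [7]). */
    double contention(double p, double P) {
        double m = (p < P) ? p : P;     /* min(p, P) */
        return 1.0 + p / m;
    }

For p at most P the factor is a constant; once p exceeds P it grows linearly in p, which is the bus saturation effect that dominates the shared memory estimates below.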
3. Complexity Estimates
We can estimate the computational complexity for these two models for several forms of domain
decomposition. We note that these are rather rough estimates, good (because of their generality)
for identifying trends. Additional estimates of the complexity of parallel domain decomposition
may be found in [6].
3.1. Message Passing
In this case we can easily separate out the computation terms and the communication terms.
For each part of the algorithm, we will place a computation term above the related communication
term. In the formulas below, the constants in front of each term have been dropped for clarity.
In two dimensions, with n unknowns per side, we have for strips

    Ax multiply   +   dot products    +   subdomain solves    +   interface solves

    n^2/p         +   n^2/p + log p   +   (n^2/p) log(n/p)    +   n log n
    s + rn        +   (s + r) log p   +   s + rn              +   s + rn
and for boxes, we have

    Ax multiply   +   dot products    +   subdomain solves     +   interface solves    +   vertex solves

    n^2/p         +   n^2/p + log p   +   (n^2/p) log(n/√p)    +   (n/√p) log(n/√p)    +   p^(3/2)
    s + rn/√p     +   (s + r) log p   +   s + rn/√p            +   s + rn/√p           +   (s + r√p) log p
These costs are all per iteration. It is assumed that all neighbor-neighbor interactions can occur
simultaneously. In the box case, the vertex coupling equations are solved by first exchanging
all of the vertex data with all the processors. Then each processor computes the entire vertex
system. For the small systems considered, this is an efficient way to handle the vertex system. A
more cooperative parallel solution approach would incur more communication startups and could
actually take longer [11]. (A complexity comparison appears under #3 below.)
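A sketch of that strategy, with MPI standing in for the native iPSC/2 primitives (an assumption) and with the prefactored cross-point matrix hidden behind a hypothetical placeholder routine: each process contributes its locally owned cross-point right-hand-side entries, one all-gather replicates the full vector, and every process then back-solves the small system redundantly.

    #include <mpi.h>
    #include <stdlib.h>

    /* Hypothetical placeholder: back-solve with a prefactored copy of the
       cross-point matrix, held redundantly on every process. */
    void crosspoint_backsolve(int nv, const double *rhs, double *x);

    /* Gather the cross-point right-hand side from all processes and solve it
       redundantly on each one.  nv_local entries are owned locally and, for
       brevity, assumed equal on all processes. */
    void solve_crosspoints(int nv_local, const double *rhs_local,
                           double *x_global, MPI_Comm comm) {
        int p;
        MPI_Comm_size(comm, &p);
        int nv = nv_local * p;
        double *rhs = malloc(nv * sizeof *rhs);

        /* One all-gather: roughly s log p + r p words of communication on a hypercube. */
        MPI_Allgather(rhs_local, nv_local, MPI_DOUBLE,
                      rhs, nv_local, MPI_DOUBLE, comm);

        /* Each process now solves the entire (small) vertex system; no further
           communication is needed inside the preconditioner for this step. */
        crosspoint_backsolve(nv, rhs, x_global);
        free(rhs);
    }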
3.2. Shared Memory
In this case, a detailed formula depends on the specific design tradeoffs made in the hardware.
The formula here applies to bus-oriented shared memory machines; a different formula would
be needed for machines like the BBN Butterfly. These formulas are dominated by bandwidth
limitations (the min(p, P) terms) and barrier or synchronization costs (the log p terms).
For strips, we have

    (a + bp/min(p, P1)) n^2/p  +  2(n^2/p + log p)  +  (c + dp/min(p, P2)) (n^2/p) log(n/p)  +  n log n  +  log p.

A similar formula holds for boxes. Here, a, b, c, d, P1, and P2 are all constants that depend on
the particular hardware and implementation. The Pi give a limit on the number of processors that can
effectively share a hardware resource. The ratios a/b and c/d reflect the ratio of local work to use
of the shared resource (such as memory banks or memory bus).
3.3. Implications
In domain decomposition algorithms, we can trade iteration count against work and parallel
overhead. We will consider three representative tradeoffs:
1. No communication. This amounts to diagonal preconditioning. Call the number of iterations
I_decoupled.
2. Local communication only. The FFT-implementable "K^(1/2)" preconditionings such as those in
[2], where we can expect the iteration count to be proportional to p for strips and √p for boxes.
Call the number of iterations I_local.
The cost of the "K^(1/2)" and all more implicit forms of preconditioning includes an extra subdomain
solve to symmetrize the preconditioner and is roughly

    n log n + (n^2/p) log(n/p) + 2(s + rn)

for a strip decomposition and a message passing system. The preconditioning complexity
estimate is dominated by the subdomain solve terms, and it is thus roughly twice that of
case 1. The communication costs are the same as for the matrix-vector product. Thus the
local communication preconditioning is effective if

    2 I_local <= I_decoupled.
3. Global communication. In the case of box-wise decompositions, cross points occur at the
intersections of the interfaces. The cross points form a global linear system that is discussed
in [2]. The iteration count is approximately independent of n as p varies; call it I_global.
In this case the additional costs over and above the symmetrization stage mentioned in #2 are
those of communicating the entries in the cross-point problem and of solving it. For the case
of a message passing system, we can do this in p^2 + (s + r) log p time if each processor solves
the cross-point system, and in p^(3/2) + p(s + r√p) time if a parallel algorithm is used for the
solve, assuming a straightforward approach based on banded Gaussian elimination on a ring.
More sophisticated approaches for hypercubes, which are p + (s + r) log p, are known [3, 11].
Comparing this to the cost of the local computation of (n^2/p) log(n/√p) + (s + rn) log p, the
floating point work is negligible unless p is approximately equal to n or higher. Assuming then
that we use the local method, the additional cost of the global communication makes the
full cross-point method better whenever

    I_global (2 - e) <= I_local,
where e is the parallel efficiency, equal to 1 minus the ratio of communication time to computation time.
For the shared memory case, if barriers and the dot product reduction are the dominant
parallelism costs, the result is similar.
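The two crossover rules derived in items 2 and 3 can be restated as a pair of tests; the sketch below does only that, taking the iteration counts and the parallel efficiency e as inputs (for example, from pilot runs). It encodes the assumption made above that the local and global variants roughly double the per-iteration work of the decoupled case.

    /* Crossover tests from Section 3.3.  The I_* arguments are iteration counts
       for the three preconditioners; e is the parallel efficiency (1 minus the
       ratio of communication time to computation time). */
    int local_beats_decoupled(double I_local, double I_decoupled) {
        return 2.0 * I_local <= I_decoupled;   /* local costs roughly 2x per iteration */
    }
    int global_beats_local(double I_global, double I_local, double e) {
        return I_global * (2.0 - e) <= I_local;
    }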
4. Experiments
The standard test problem considered was
    ∇²u = g
where g = 32(x(1 - x) + y(1 - y)) on the unit square. Though rather ideal (chosen to minimize coefficient
storage space), this problem has structural (from the graph-theoretic viewpoint) and spectral (from
the operator-theoretic viewpoint) similarities to the generic self-adjoint elliptic problem and also
to the non-self-adjoint problem in the limit of sufficiently high mesh refinement. The first set of
experiments was conducted on an Encore Multimax 320 shared memory parallel computer with 18
processors; we used only 16, allowing the remaining 2 processors to handle various system functions.
The experiments on the Encore Multimax were done in double precision. The results are shown
in Tables 1 and 2. The Encore is a time-sharing machine, so these timings are accurate, and have
been reproduced, only to about 10%. The computations were performed with no other users on the
machine; however, various system programs (mailers, network daemons) used some resources. In
addition, even a programmer who fully understands the dependencies in a given problem cannot
force each process to run on a different processor; operating system logic intervenes.
The tables show the iteration count I, an estimate of the condition number K (estimated as in
[9]), the time in seconds T, and the relative speedup s. The relative speedup is defined as the ratio
of the time from the previous column to the time from the current column. The times do not
include initial setup. While this slightly distorts the total time, it does allow the time per iteration
to be determined by dividing the time by the iteration count.
The next set of experiments was run on a 64 node Intel iPSC/2-SX Hypercube, with 4
Megabytes of memory on each node and the SX floating point accelerator (a scalar floating point
accelerator). All runs were in single precision to allow a large problem to fit in this memory space.
The results are shown in Tables 3-6.
The programs in both of these cases were nearly the same. Only the code dealing with shared
data was changed to use either messages or shared memory. (The Encore implementation is not a
message passing implementation using the shared memory to simulate messages; it is a "natural"
shared memory code. In particular, the solution arrays are all shared. The Intel code is a "natural"
message passing code. The solution arrays are all local (private).)
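The distinction can be illustrated schematically. The fragments below show the one operation that differed between the two codes: making a neighboring strip's interface row available before the matrix-vector product. The shared memory variant reads the (shared) solution array in place after a barrier; the message passing variant exchanges the boundary row into private ghost storage. Both are reconstructions under assumed primitives (pthreads and MPI rather than the Encore and iPSC/2 libraries actually used) and an assumed row-major array layout.

    #include <mpi.h>
    #include <pthread.h>
    #include <stddef.h>

    /* "Natural" shared memory style: the solution lives in one shared array, so
       after a barrier guarantees the neighbor has finished updating its rows,
       the neighbor's interface row is simply used in place; nothing is copied. */
    const double *get_interface_shared(const double *u_shared, int ncols,
                                       int nbr_row, pthread_barrier_t *bar) {
        pthread_barrier_wait(bar);
        return u_shared + (size_t)nbr_row * ncols;
    }

    /* "Natural" message passing style: each process owns only its own strip; the
       local boundary row is sent to the neighbor and the neighbor's boundary row
       is received into private ghost storage, at a cost of roughly s + r*ncols. */
    void get_interface_msg(const double *my_boundary_row, int ncols, int nbr_rank,
                           double *ghost, MPI_Comm comm) {
        MPI_Sendrecv(my_boundary_row, ncols, MPI_DOUBLE, nbr_rank, 0,
                     ghost, ncols, MPI_DOUBLE, nbr_rank, 0,
                     comm, MPI_STATUS_IGNORE);
    }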
There are some differences between the iteration count results for the Encore and the Intel
implementations. These differences seem to be due to the difference in floating point arithmetic on
the two machines. Double precision was required on the Encore to get the fast Poisson solver we
use to work for h = 1/512; single precision was required on the Intel in order to fit the problems
in memory.
Table 1: Results for strips on the Encore Multimax 320. (Entries give the iteration count I, the condition number estimate K, the time T in seconds, and the relative speedup s, for h^-1 = 16 through 512 and p = 1 through 16.)
Table 2: Results for boxes on the Encore Multimax 320, using full vertex coupling. (Entries give the iteration count I, the condition number estimate K, the time T in seconds, and the relative speedup s.)
Table 3: Results for strips on the Intel Hypercube. (Entries give the iteration count I, the condition number estimate K, the time T in seconds, and the relative speedup s.)
Table 4: Results for boxes on the Intel Hypercube, using full vertex coupling. (Entries give the iteration count I, the condition number estimate K, the time T in seconds, and the relative speedup s.)
5. Comments
While the overall results for strips may seem poor, they actually represent very good speedups;
it is the increased iteration count for this type of iteration that suppresses the efficiency. We find,
in accordance with the theory quoted in [8], that the condition number rises asymptotically
quadratically in p, and hence that the number of preconditioned conjugate gradient
iterations rises linearly in p, defeating the advantage of parallelism.
Table 5: Results for boxes on the Intel Hypercube, without vertex coupling but with interface preconditioning (neighbor coupling). The "NA" entries could not be computed on our iPSC/2. (Entries give the iteration count I, the condition number estimate K, the time T in seconds, and the relative speedup s.)
Two of the full vertex coupling (global) results show superlinear relative speedup. This is
a real effect, which derives from the superlinear growth in the cost of solving a single subdomain
as a function of the number of points in the domain. This speedup is of course available to a
single processor domain decomposition algorithm. In fact, the slight relative speedups seen for the
strip decompositions are due almost entirely to this effect alone, since the number of iterations is
proportional to the number of processors.
5.1. Comparison with the Theory
In the case of the message passing results (Intel Hypercube), it is possible to fit the theoretical
complexity estimate to the measured times. Taking only the highest order terms in the latency
and the arithmetic suggests a fit for the strip decomposition of

    a1 n^2/p + a2 (n^2/p) log(n/p) + a3 + a4 log p.        (5.1)

We have ignored the r terms because s >> rn for given n on the Intel hypercube. A least squares
fit to the data yields a1 = 0.000131, a2 = 0.000090, a3 = 0.000027, and a4 = 0.0017. The coefficients
were computed by scaling the equations by the inverse of the time, making the fit a relative one.
The residual is 0.014. The fit is shown in Figure 4. With these values (or directly from the data),
the efficiency per iteration can be shown to be quite high for the larger problems.
Some care is necessary in interpreting the graphs in Figure 4. In particular, the relative
efficiency used in the figure is relative to the single subdomain case, for a single iteration. Because
the cost of solving on a subdomain falls faster than linearly with increasing numbers of subdomains,
this measure of efficiency will often be greater than one.
Figure 4: Data for iPSC/2 runs with strip decomposition. The fit is made using Equation (5.1). The y axis is the efficiency, defined as the time on a single processor divided by the number of processors times the time on that many processors.

An estimate of the parallel efficiency (computation time divided by total time) can be made from Equation (5.1):
    E = (a1 n^2/p + a2 (n^2/p) log(n/p)) / (a1 n^2/p + a2 (n^2/p) log(n/p) + a3 + a4 log p).        (5.2)
Figure 5 shows the behavior of E with respect to p for a sample value of n.
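For concreteness, the fitted coefficients just quoted can be substituted into Equation (5.2) directly; the short program below does so for n = 255, the value used in Figure 5. The logarithm base is taken as natural here, an assumption not fixed by the text.

    #include <math.h>
    #include <stdio.h>

    /* Estimated parallel efficiency per iteration for the strip decomposition,
       Equation (5.2), using the least squares coefficients quoted in the text. */
    int main(void) {
        const double a1 = 0.000131, a2 = 0.000090, a3 = 0.000027, a4 = 0.0017;
        const double n = 255.0;
        for (double p = 2.0; p <= 64.0; p *= 2.0) {
            double work     = a1 * n * n / p + a2 * (n * n / p) * log(n / p);
            double overhead = a3 + a4 * log(p);
            printf("p = %2.0f   E = %.3f\n", p, work / (work + overhead));
        }
        return 0;
    }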
For the box decompositions, the highest order terms in the complexity model are
    a1 n^2/p + a2 (n^2/p) log(n/√p) + a3 (√p - 1)^3 + a4 + a5 log p.        (5.3)
We have again ignored the r terms because of the high latency of the Intel hypercube. A least
squares fit to the data yields a1 = 0.00012, a2 = 0.000091, a3 = 0.000014, a4 = 0.0067, and
a5 = 0.0031. The coefficients were computed in the same way as for strips, and the residual is
0.032. A graph of the data and the fit is shown in Figure 6. The a3 term comes from the arithmetic
complexity of solving the global cross-point system on a single processor using an already factored
matrix. The efficiency per iteration is lower, because of the increased communication overhead,
but is still above 70% for the larger problems. This makes the strip decomposition superior for
moderate numbers of processors (where the iteration counts are similar).
    h^-1\p        1           4            16           64
    32     I      1           12           18
           K      1.00        29.1         49.2
           T      0.546       0.827        0.382
           s                  0.66         2.16
    64     I      1           17           25           32
           K      1.00        61.0         103.6        187.9
           T      2.62        5.45         1.81         0.741
           s                  0.48         3.01         2.44
    128    I      1           23           35           44
           K      1.00        124.9        212.6        397.6
           T      12.1        29.6         11.5         3.28
           s                  0.41         2.57         3.51
    256    I      1           27           48           62
           K      1.00        254.9        432.4        818.0
           T      112         182          71.4         20.5
           s                  0.62         2.55         3.48

Table 6: Results for boxes on the Intel Hypercube, using diagonal blocks only (no coupling).

Figure 5: Estimated parallel efficiency for the strips case on the iPSC/2, using Equation (5.2), n = 255, and the parameters in the text.

To get a better idea of the individual overheads, the program for the Intel Hypercube was
instrumented to provide data on the cost of each communication operation. This data showed
that the expense of the global communication used to form the vertex coupling system was always
smaller than the communication to form the dot-products, and smaller in all but one case than
the nearest-neighbor communication used in forming the matrix-vector product Ax. This result is
consistent with the known communication behavior of the Intel Hypercube.

Figure 6: Data for iPSC/2 runs with full vertex coupling. The fit is made using Equation (5.3). The y axis is the efficiency, defined as the time on a single processor divided by the number of processors times the time on that many processors.

The two dot products
done in each iteration involve a time 2(s + r) log p while the global exchange involves (roughly)
s log p + pr; since s > r, the global exchanges take less time. In the case of the nearest-neighbor
communications, the comparison is with 4(s + rn/√p). The larger amount of data moved in the
neighbor computation makes it often more expensive than the global communication for the vertex
coupling. Further, when n is small, we would expect the neighbor communication to be faster than
the global communication, and this is exactly what is observed. A different balance of s and r,
or significantly more processors, would of course change these results. For example, on a truly
massively parallel machine with thousands of processors, the neighbor communication would be
smaller than either the dot-product or global communication.
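These three terms are easy to tabulate; the snippet below does so for assumed values of s and r (chosen only so that s dominates r, not measured iPSC/2 constants), showing the reversal at large p described in the preceding sentences.

    #include <math.h>
    #include <stdio.h>

    /* Per-iteration communication terms for the box decomposition:
       two dot products, the global cross-point exchange, and the four
       nearest-neighbor edge exchanges.  Times are in flop-time units. */
    int main(void) {
        const double s = 500.0, r = 2.0;        /* assumed latency and per-word rate */
        const double n = 256.0;
        for (double p = 4; p <= 4096; p *= 4) {
            double dots     = 2.0 * (s + r) * log2(p);
            double global   = s * log2(p) + r * p;
            double neighbor = 4.0 * (s + r * n / sqrt(p));
            printf("p=%5.0f  dots=%8.0f  global=%8.0f  neighbor=%8.0f\n",
                   p, dots, global, neighbor);
        }
        return 0;
    }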
This leads to the results in Tables 5 and 6, where a simpler communication strategy has been
traded against larger iteration counts.
For the shared memory machine, the complexity estimates are harder to demonstrate. In part,
this is due to the design of the shared memory machines; the number of processors is deliberately
limited to roughly what the hardware (i.e., memory bus) can support. The dominant effect is
usually load balancing or the intrinsically serial parts of the computation (synchronization points
and dot products).
The results for the Intel Hypercube are summarized in Table 7, which gives the optimal choice
of domain decomposition algorithm (from among those considered) for various values of p and h for
the Intel Hypercube. As either the number of processors or the number of mesh points increases, the
global or full vertex coupling algorithm becomes more efficient, despite its additional communication
demands.
    p\h^-1     16           32          64          128         256
    4          Decoupled    Global      Local       Local       Global
    16                      Global      Global      Global      Global
    64                                  Global      Global      Global
Table 7: Optimal choice of algorithm for the given problem
and implementation on the Intel Hypercube, from the choices
of decoupled block diagonal, locally coupled interfaces, and
full vertex coupling preconditioners for the box decomposition.
These results emphasize the importance of considering a wide range of decompositions
and problem sizes. For example, the results for p = 4 can give a misleading picture of the usefulness
of both the fully decoupled and the locally coupled interface preconditioners. In fact, the "local"
preconditioning wins by an extremely narrow margin in the two cases in which it beats out "global".
Our results show that, even for relatively small problems, the asymptotically superior performance
(in iteration count) of the full vertex coupling preconditioners more than compensates for the
additional cost of the communication. This is true even for a machine such as the Intel iPSC/2
Hypercube, with relatively slow communication and global communication that is proportional to
log p. The older iPSC/1 Hypercube had an identical table, which is not surprising, given that the
ratio of communication speed to computation speed for the two machines is similar.
References
[1] C. Borgers, The Neumann-Dirichlet Domain Decomposition Method with Inexact Solvers on
the Subdomains, 1989. Numer. Math. (to appear).
[2] J. H. Bramble, J. E. Pasciak, and A. H. Schatz, The Construction of Preconditioners for
Elliptic Problems by Substructuring, I, Mathematics of Computation, 47 (1986),
pp. 103-134.
[3] T. F. Chan, Y. Saad, and M. H. Schultz, Solving Elliptic Partial Differential Equations on the
Hypercube Multiprocessor, Technical Report YALEU/DCS/RR-373, Yale University,
Department of Computer Science, March 1985.
[4] M. Dryja, A Finite Element - Capacitance Method for Elliptic Problems on Regions Partitioned
into Subregions, Numer. Math., 44 (1984), pp. 153-168.
[5] L. Greengard and W. D. Gropp, A Parallel Version of the Fast Multipole Method, Parallel
Processing for Scientific Computing, SIAM, 1989, pp. 213-222.
[6] W. D. Gropp and D. E. Keyes, Complexity of Parallel Implementation of Domain Decomposition
Techniques for Elliptic Partial Differential Equations, SIAM Journal on Scientific and
Statistical Computing, 9/2 (1988), pp. 312-326.
[7] H. F. Jordan, Interpreting Parallel Processor Performance Measurements, SIAM Journal on
Scientific and Statistical Computing, 8/2 (1987), pp. s220-s226.
[8] D. E. Keyes and W. D. Gropp, A Comparison of Domain Decomposition Techniques for Elliptic
Partial Differential Equations and their Parallel Implementation, SIAM Journal on
Scientific and Statistical Computing, 8/2 (1987), pp. s166-s202.
[9] D. P. O'Leary and O. Widlund, Capacitance Matrix Methods for the Helmholtz Equation on
General Three-dimensional Regions, Mathematics of Computation, 33 (1979), pp. 849-879.
[10] J. S. Przemieniecki, Matrix Structural Analysis of Substructures, AIAA J., 1 (1963), pp.
138-147.
[11] Y. Saad and M. Schultz, Parallel Direct Methods for Solving Banded Linear Systems, Technical
Report YALEU/DCS/RR-387, Yale University, Department of Computer Science,
August 1985.
[12] H. A. Schwarz, Gesammelte Mathematische Abhandlungen, Vol. 2, Springer, Berlin, 1890, pp.
133-143 (first published in 1870).
[13] O. B. Widlund, Iterative Substructuring Methods: the General Elliptic Case, Technical Report
260, Courant Institute of Mathematical Sciences, NYU, November 1986.