Domain Decomposition
on Parallel Computers
William D. Gropp† and David E. Keyes‡
Research Report YALEU/DCS/RR-723
August 1989
YALE UNIVERSITY
DEPARTMENT OF COMPUTER SCIENCE
We consider the application of domain decomposition techniques to the solution of sparse linear
systems arising from implicit PDE discretizations on parallel computers. Representatives of two
popular MIMD architectures, message passing (the Intel iPSC/2-SX) and shared memory (the
Encore Multimax 320), are employed. We run the same numerical experiments on each, namely
stripwise and boxwise decompositions of the unit square, using up to 64 subdomains and containing
up to 64K degrees of freedom. We produce a tight-fitting complexity model for the former and
discuss the difficulty of doing so for the latter. We also evaluate which of three types of domain
decomposition preconditioners that have appeared in the literature of self-adjoint elliptic problems
are most efficient in different regions of machine-problem parameter space. Some form of global
sharing of information in the preconditioner is required for efficient overall parallel implementation
in the region of most practical interest (large problem sizes and large numbers of processors);
otherwise, an increasing iteration count inveighs against the gains of concurrency. Our results
on a per iteration basis also hold for sparse discrete systems arising from other types of partial
differential equations, but in the absence of a theory for the dependence of the convergence rate
upon the granularity of the decomposition, the overall results are only suggestive for more general
systems.
Approved for public release; distribution is unlimited.
† Department of Computer Science, Yale University, New Haven, CT 06520. The work of this
author was supported in part by the Office of Naval Research under contract N00014-86-K-0310
and the National Science Foundation under contract number DCR 8521451.
‡ Department of Mechanical Engineering, Yale University, New Haven, CT 06520. The work of
this author was supported in part by the National Science Foundation under contract number
EET-8717109.
1. Introduction
Domain decomposition techniques appear to be a natural way to distribute the solution of
large sparse linear systems across many parallel processors. In this paper we develop complexity
estimates for two types of decompositions and two parameterized types of "real" parallel computer
architectures, and validate those estimates on representative machines, with particular emphasis
on the case of large numbers of processors and large problems. We examine the tradeoffs between
various forms of preconditioning, as characterized by the efficiency of their parallel implementation.
Parallel computers may be divided into two broad classes: distributed memory and shared
memory. In a distributed memory parallel processor, each processor has its own memory and
no direct access to memory on any other processor. Such machines are usually termed "message
passing" computers since interprocessor communication is accomplished through the sending and
receiving of messages. In a shared memory parallel processor, each processor has direct, random access to the same memory space as every other processor. Interprocessor communication is
conducted directly in the shared memory. In practice, of course, most shared memory machines
have local memory, called the cache, and communication is through messages, called cache faults.
However, each type of parallel processor is optimized for a different interprocessor communication
pattern, and we consider the effects of these optimizations on domain decomposition.
Domain decomposition refers in a generic way to the replacement of a partial differential
equation problem defined over a global domain with a series of problems over subdomains which
collectively cover the original. Early domain decomposition techniques, whether iterative [12] or
direct [10], were based on exact reductions of the global problem to a set of lower-dimensional
problems on interfaces between subdomains by means of direct elimination of the degrees of freedom
interior to the subdomains. In this sense, domain decomposition is analogous to the finite element
procedure of static condensation. A more modern viewpoint has emerged [2, 13] in which the
interior unknowns are not directly eliminated, but retained in the outer iteration, preconditioned
by an appropriate fast approximate solver. In [8], we referred to such approaches as "partitioned
matrix methods", and it is such methods that we consider here, in combination with preconditioned conjugate
gradients as the outer iterative method. Partitioned matrix iteration was shown in [8] to be identical
to iteration on the reduced system when the subdomain solves are exact. However, it has the
advantage of being much more flexible for problems for which the only known exact solvers are
expensive. The important question as to how much approximate subdomain solves "penalize"
overall iterative convergence has begun to be addressed (see, in addition to the references above,
[1] and section 6 of [4]).
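To fix ideas, the following minimal sketch (not the authors' code) shows a preconditioned conjugate gradient outer iteration in which the preconditioner application is an opaque callback; in the partitioned matrix approach that callback performs the approximate subdomain solves (and possibly interface and cross-point solves). The routines apply_A and apply_M_inv and the vector length n are hypothetical placeholders.

    #include <stdlib.h>
    #include <math.h>

    /* Hypothetical problem-specific routines: y = A*x for the discretized
       operator, and z = M^{-1}*r for a domain decomposition preconditioner
       (block subdomain solves, possibly plus interface and cross-point solves). */
    void apply_A(int n, const double *x, double *y);
    void apply_M_inv(int n, const double *r, double *z);

    static double dot(int n, const double *a, const double *b) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

    /* Preconditioned conjugate gradients: solves A x = b, returns iteration count. */
    int pcg(int n, const double *b, double *x, double tol, int maxit) {
        double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
        double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);
        int it;

        apply_A(n, x, r);                       /* r = b - A*x */
        for (int i = 0; i < n; i++) r[i] = b[i] - r[i];
        apply_M_inv(n, r, z);                   /* z = M^{-1} r (subdomain solves) */
        for (int i = 0; i < n; i++) p[i] = z[i];
        double rz = dot(n, r, z), r0 = sqrt(dot(n, r, r));

        for (it = 0; it < maxit; it++) {
            if (sqrt(dot(n, r, r)) <= tol * r0) break;
            apply_A(n, p, q);                   /* matrix-vector product: neighbor data needed */
            double alpha = rz / dot(n, p, q);   /* dot product: global reduction */
            for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
            apply_M_inv(n, r, z);               /* preconditioner solve */
            double rz_new = dot(n, r, z);
            double beta = rz_new / rz;
            rz = rz_new;
            for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
        }
        free(r); free(z); free(p); free(q);
        return it;
    }

The three labeled operations (matrix-vector product, dot products, preconditioner solve) are exactly the cost centers examined in Section 2.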
An example of the simplest form of domain decomposition is shown in Figure la. A single
interface (3) divides a domain into subdomains 1 and 2. The matrix obtained upon finite difference or finite element discretization is partitioned according to the physical decomposition into
subdomains connected by a small (lower dimension) interface region and looks like
            ( A11     0      A13 )
    A_s  =  (  0     A22     A23 )
            ( A13^T  A23^T   A33 )

where A11 and A22 come from the interior of the subdomains, A33 from along the interface, and
A13 and A23 from the interactions between the subdomains and the interface. A detailed view of
this matrix for the 5-point operator is shown in Figure 2.
The generalization of Figure 1a to any number of strips is straightforward. For reasons that
will become clear, it is important also to consider a decomposition into boxes, as shown in Figure 1b.
The matrix for this decomposition looks like

            ( A_I     A_IB     0    )
    A_b  =  ( A_IB^T  A_B     A_BC  )
            (  0      A_BC^T  A_C   )

where

    A_I   =  block diagonal subdomain matrices
    A_IB  =  coupling to interiors from subdomain interfaces
    A_B   =  block diagonal subdomain interfaces
    A_BC  =  coupling to interfaces from cross points
    A_C   =  cross point system
This representation is based on a five-point difference template, so that the coupling matrix A_IC
is zero. More generally, nonzero coupling between the crosspoints and the corner points of the
subdomain interiors would need to be taken into account. A detailed view of this matrix for the
5-point operator is shown in Figure 3, for the case of a decomposition into two by two blocks. We
are interested in various preconditioners for both A_s and A_b, based on their efficacy and on their
parallel limitations.
The choice of preconditioner is critical in domain decomposition, as with any iterative method.
In the context of parallel computing, the main distinction is between preconditioners which are
purely local, those which involve neighbor communication, and those which involve global communication.
(We use the word "communication" here in a general sense; in a shared memory machine,
this refers to shared access to memory.) As example preconditioners we consider a block diagonal
matrix for the purely local case and a preconditioner based on FFT solves along the interfaces for
the neighbor communication case. As an example involving global communication, we consider the
Bramble et al. preconditioner [2], which requires the solution of a linear system for the cross points
(or vertices). This method involves only low bandwidth global communication (that is, the size of
the messages scales with the number of processors); we do not consider any method which uses
high bandwidth communication (where message size scales with the size of the problem).
We develop a complexity model for each type of parallel computer that is based on two major
contributions: floating point work and "shared memory access". This latter term measures the
cost of communicating information between processors. In a distributed memory system, this is
represented by a communication time. In a shared memory system, there are several contributions,
including cache size and bandwidth, and the number of simultaneous memory requests which may
be served.

Figure 1: Two forms of decompositions. A two domain strip decomposition is shown in (a); a many domain box decomposition with cross points (labeled "C") is shown in (b).

Figure 2: The partitioning of the matrix A_s for the 5-point Laplacian and two strips. The dashed lines separate subdomains; the solid lines separate the interior unknowns from the interface unknowns.
2. Comments on Parallelism Costs
In any parallel algorithm, there are a number of different costs to consider. The most obvious
of these are intrinsically serial computations. For example, the dot products in the conjugate
gradient method involve the reduction of values to a single sum; this takes at least log p time. More
subtle are costs from the implementation, both software and hardware.

Figure 3: The partitioning of the matrix A_b for the 5-point Laplacian and four boxes arranged in a two by two manner. The dashed lines separate subdomains; the solid lines separate the interior unknowns from the interface unknowns and the cross point unknown.

An example of the software
cost is the need to guarantee safe access to shared data; this is often handled with barriers or
more general critical regions. Sample hardware costs include bandwidth limits in shared resources
such as memory buses and startup and transfer speeds in communication links. Perhaps the most
subtle cost lies in algorithmic changes to "improve" parallelism; by choosing a poor algorithm
over another, less parallel algorithm, artificially good parallel efficiencies can be found. We call a
high parallel efficiency "artificial" if there exists an algorithm with lower parallel efficiency which
nevertheless executes in less wall-clock time on a given number of processors. Of the many examples
in the literature, one of the most dramatic is [5] on computing the forces in the n-body potential
problem; the naive algorithm is almost perfectly parallel but substantially slower than the
linear-in-n algorithm, which contains some reductions and hence some intrinsically serial computations.
We can identify the following costs in the domain decomposition algorithms that we are considering:
• Dot (inner) products. These involve a reduction, and hence at least log p time; in addition,
there may be some critical sections (depending on the implementation).
• Matrix-vector products. These involve shared data, and hence may introduce some constraints
on shared hardware resources.
• Preconditioner solves. These depend on the preconditioner chosen, and hence give us the most
freedom in trading off greater parallelism against superior algorithmic performance.
Note that the sharing of data is not random; most of the data sharing occurs between neighbors.
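As an illustration of the first item, the sketch below spells out the log p reduction behind each dot product as a recursive-doubling exchange across hypercube dimensions. It is written against MPI purely for concreteness (an anachronistic assumption; the machines considered here had their own native message primitives), and a production code would simply call MPI_Allreduce.

    #include <mpi.h>

    /* Global sum of a local partial dot product by recursive doubling.
       With p processes this takes ceil(log2 p) exchange steps, each of cost
       roughly s + r for a one-word message.  Assumes p is a power of two. */
    double allreduce_sum(double local, MPI_Comm comm) {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        double sum = local;
        for (int mask = 1; mask < p; mask <<= 1) {
            int partner = rank ^ mask;      /* neighbor across one hypercube dimension */
            double recv;
            MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                         &recv, 1, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            sum += recv;
        }
        return sum;                         /* every process ends with the global sum */
    }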
2.1. Message Passing and Shared Memory Models
Two methods for achieving parallelism in computer hardware for MIMD machines are message
passing and shared memory. In both of these, the software and hardware costs discussed above
show up in the cost to access shared data. Each of these methods is optimized for a different
domain, and these optimizations are reflected in the actual costs. In the following, to simplify the
notation, we will express all times in terms of the time to do a floating point operation. Further,
we will drop constant factors from our estimates.
In a message passing machine, each processor (called a node) has some local memory and a set of
communication links to some (usually not all) other processors. Each processor has access only
to its own local memory. Communication of shared data is handled (usually by the programmer)
by explicitly delivering data over the communication links. This takes time s + rn for n words,
where s is a startup time (latency) and r is the time to transfer a single word. This is good for
local or nearest neighbor communication. For more global communication (such as a dot product),
times depend on the interconnection network. For a hypercube, the global time is (s + rn) log p;
for a mesh, it is (s + rn)√p.
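For later reference, these costs can be collected into a tiny model; the sketch below evaluates them in flop-time units. The numerical values of s, r, and n in the example are assumptions chosen only for illustration, not measured machine parameters.

    #include <math.h>
    #include <stdio.h>

    /* Message passing cost model, in floating-point-operation time units.
       s = startup (latency), r = per-word transfer time, n = words, p = processes. */
    double t_msg(double s, double r, double n)             { return s + r * n; }
    double t_global_hypercube(double s, double r, double n, double p)
                                                            { return (s + r * n) * log2(p); }
    double t_global_mesh(double s, double r, double n, double p)
                                                            { return (s + r * n) * sqrt(p); }

    int main(void) {
        /* Illustrative (assumed) parameters: s = 500, r = 2 flop-times, n = 256 words. */
        double s = 500.0, r = 2.0, n = 256.0;
        for (double p = 2; p <= 64; p *= 2)
            printf("p=%2.0f  neighbor=%7.0f  hypercube=%8.0f  mesh=%8.0f\n",
                   p, t_msg(s, r, n), t_global_hypercube(s, r, n, p),
                   t_global_mesh(s, r, n, p));
        return 0;
    }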
In a shared memory machine, each processor may directly access a shared global memory.
Communication of shared data (access to it) is handled by simply reading the data. However, the
actual implementation of this introduces a number of limits. For example, if the memory is on a
common bus, then there is a limit to the number of processors that can simultaneously read from
the shared memory. One way to model this cost is as 1 + p/min(p, P) [7]. Here p is the number of
processors and P is the maximum number of processors that may use the resource at one time.
In addition, the access to the shared data must be controlled; this can add costs in the form of
barriers or critical sections, which contribute additional terms proportional to
log p.
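A corresponding one-line model of the shared resource cost, following the form quoted above from [7]; the limit P is whatever the bus or memory bank structure imposes and must be supplied by the user.

    /* Contention factor for a shared resource: p processors competing for a
       resource that can serve at most P of them at once (the 1 + p/min(p, P)
       model quoted above from [7]). */
    double contention(double p, double P) {
        double m = (p < P) ? p : P;     /* min(p, P) */
        return 1.0 + p / m;
    }

For p at most P the factor is a constant; once p exceeds P it grows linearly in p, which is the bus saturation effect that dominates the shared memory estimates below.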
3. Complexity Estimates
We can estimate the computational complexity for these two models for several forms of domain
decomposition. We note that these are rather rough estimates, good (because of their generality)
for identifying trends. Additional estimates of the complexity of parallel domain decomposition
may be found in [6].
3.1. Message Passing
In this case we can easily separate out the computation terms and the communication terms.
For each part of the algorithm, we will place a computation term above the related communication
term. In the formulas below, the constants in front of each term have been dropped for clarity.
In two dimensions, with n unknowns per side, we have for strips

    Ax multiply   +   dot products    +   subdomain solves    +   interface solves

    n^2/p         +   n^2/p + log p   +   (n^2/p) log(n/p)    +   n log n
    s + rn        +   (s + r) log p   +   s + rn              +   s + rn
and for boxes, we have

    Ax multiply   +   dot products    +   subdomain solves     +   interface solves    +   vertex solves

    n^2/p         +   n^2/p + log p   +   (n^2/p) log(n/√p)    +   (n/√p) log(n/√p)    +   p^(3/2)
    s + rn/√p     +   (s + r) log p   +   s + rn/√p            +   s + rn/√p           +   (s + r√p) log p
These costs are all per iteration. It is assumed that all neighbor-neighbor interactions can occur
simultaneously. In the box case, the vertex coupling equations are solved by first exchanging
all of the vertex data with all the processors. Then each processor computes the entire vertex
system. For the small systems considered, this is an efficient way to handle the vertex system. A
more cooperative parallel solution approach would incur more communication startups and could
actually take longer [11]. (A complexity comparison appears under #3 below.)
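A sketch of that strategy, with MPI standing in for the native iPSC/2 primitives (an assumption) and with the prefactored cross-point matrix hidden behind a hypothetical placeholder routine: each process contributes its locally owned cross-point right-hand-side entries, one all-gather replicates the full vector, and every process then back-solves the small system redundantly.

    #include <mpi.h>
    #include <stdlib.h>

    /* Hypothetical placeholder: back-solve with a prefactored copy of the
       cross-point matrix, held redundantly on every process. */
    void crosspoint_backsolve(int nv, const double *rhs, double *x);

    /* Gather the cross-point right-hand side from all processes and solve it
       redundantly on each one.  nv_local entries are owned locally and, for
       brevity, assumed equal on all processes. */
    void solve_crosspoints(int nv_local, const double *rhs_local,
                           double *x_global, MPI_Comm comm) {
        int p;
        MPI_Comm_size(comm, &p);
        int nv = nv_local * p;
        double *rhs = malloc(nv * sizeof *rhs);

        /* One all-gather: roughly s log p + r p words of communication on a hypercube. */
        MPI_Allgather(rhs_local, nv_local, MPI_DOUBLE,
                      rhs, nv_local, MPI_DOUBLE, comm);

        /* Each process now solves the entire (small) vertex system; no further
           communication is needed inside the preconditioner for this step. */
        crosspoint_backsolve(nv, rhs, x_global);
        free(rhs);
    }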
3.2. Shared Memory
In this case, a detailed formula depends on the specific design tradeoffs made in the hardware.
The formula here applies to bus-oriented shared memory machines; a different formula would
be needed for machines like the BBN Butterfly. These formulas are dominated by bandwidth
limitations (the min(p, P) terms) and barrier or synchronization costs (the log p terms).
For strips, we have

    (a + bp/min(p, P1)) n^2/p  +  2(n^2/p + log p)  +  (c + dp/min(p, P2)) (n^2/p) log(n/p)  +  n log n  +  log p.

A similar formula holds for boxes. Here, a, b, c, d, P1, and P2 are all constants that depend on
the particular hardware and implementation. The Pi give a limit on the number of processors that can
effectively share a hardware resource. The ratios a/b and c/d reflect the ratio of local work to use
of the shared resource (such as memory banks or memory bus).
3.3. Implications
In domain decomposition algorithms, we can trade iteration count against work and parallel
overhead. We will consider three representative tradeoffs:
1. No communication. This amounts to diagonal preconditioning. Call the number of iterations
I_decoupled.
2. Local communication only. The FFT-implementable "K^(1/2)" preconditionings such as those in
[2], where we can expect the iteration count to be proportional to p for strips and √p for boxes.
Call the number of iterations I_local.
The cost of the "K^(1/2)" and all more implicit forms of preconditioning includes an extra subdomain
solve to symmetrize the preconditioner and is roughly

    n log n + (n^2/p) log(n/p) + 2(s + rn)

for a strip decomposition and a message passing system. The preconditioning complexity
estimate is dominated by the subdomain solve terms, and it is thus roughly twice that of
case 1. The communication costs are the same as for the matrix-vector product. Thus the
local communication preconditioning is effective if

    2 I_local <= I_decoupled.
3. Global communication. In the case of box-wise decompositions, cross points occur at the
intersections of the interfaces. The cross points form a global linear system that is discussed
in [2]. The iteration count is approximately independent of n as p varies; call it I_global.
In this case the additional costs over and above the symmetrization stage mentioned in #2 are
those of communicating the entries in the cross-point problem and of solving it. For the case
of a message passing system, we can do this in p^2 + (s + r) log p time if each processor solves
the cross-point system, and in p^(3/2) + p(s + r√p) time if a parallel algorithm is used for the
solve, assuming a straightforward approach based on banded Gaussian elimination on a ring.
More sophisticated approaches for hypercubes, which are p + (s + r) log p, are known [3, 11].
Comparing this to the cost of the local computation of (n^2/p) log(n/√p) + (s + rn) log p, the
floating point work is negligible unless p is approximately equal to n or higher. Assuming then
that we use the local method, the additional cost of the global communication makes the
full cross-point method better whenever

    I_global (2 - e) <= I_local,
where e is the parallel efficiency, equal to 1 minus the ratio of communication time to computation time.
For the shared memory case, if barriers and the dot product reduction are the dominant
parallelism costs, the result is similar.
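The two crossover rules derived in items 2 and 3 can be restated as a pair of tests; the sketch below does only that, taking the iteration counts and the parallel efficiency e as inputs (for example, from pilot runs). It encodes the assumption made above that the local and global variants roughly double the per-iteration work of the decoupled case.

    /* Crossover tests from Section 3.3.  The I_* arguments are iteration counts
       for the three preconditioners; e is the parallel efficiency (1 minus the
       ratio of communication time to computation time). */
    int local_beats_decoupled(double I_local, double I_decoupled) {
        return 2.0 * I_local <= I_decoupled;   /* local costs roughly 2x per iteration */
    }
    int global_beats_local(double I_global, double I_local, double e) {
        return I_global * (2.0 - e) <= I_local;
    }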
4. Experiments
The standard test problem considered was
    ∇²u = g
where g = 32(x(1 - x) + y(1 - y)) on the unit square. Though rather ideal (chosen to minimize coefficient
storage space), this problem has structural (from the graph-theoretic viewpoint) and spectral (from
the operator-theoretic viewpoint) similarities to the generic self-adjoint elliptic problem and also
to the non-self-adjoint problem in the limit of sufficiently high mesh refinement. The first set of
experiments was conducted on an Encore Multimax 320 shared memory parallel computer with 18
processors; we used only 16, allowing the remaining 2 processors to handle various system functions.
The experiments on the Encore Multimax were done in double precision. The results are shown
in Tables 1 and 2. The Encore is a time-sharing machine, so these timings are accurate, and have
been reproduced, only to about 10%. The computations were performed with no other users on the
machine; however, various system programs (mailers, network daemons) used some resources. In
addition, even a programmer who fully understands the dependencies in a given problem cannot
force each process to run on a different processor; operating system logic intervenes.
The tables show the iteration count I, an estimate of the condition number K (estimated as in
[9]), the time in seconds T, and the relative speedup s. The relative speedup is defined as the ratio
of the time from the previous column to the time from the current column. The times do not
include initial setup. While this slightly distorts the total time, it does allow the time per iteration
to be determined by dividing the time by the iteration count.
The next set of experiments was run on a 64 node Intel iPSC/2-SX Hypercube, with 4
Megabytes of memory on each node and the SX floating point accelerator (a scalar floating point
accelerator). All runs were in single precision to allow a large problem to fit in this memory space.
The results are shown in Tables 3-6.
The programs in both of these cases were nearly the same. Only the code dealing with shared
data was changed to use either messages or shared memory. (The Encore implementation is not a
message passing implementation using the shared memory to simulate messages; it is a "natural"
shared memory code. In particular, the solution arrays are all shared. The Intel code is a "natural"
message passing code. The solution arrays are all local (private).)
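The distinction can be illustrated schematically. The fragments below show the one operation that differed between the two codes: making a neighboring strip's interface row available before the matrix-vector product. The shared memory variant reads the (shared) solution array in place after a barrier; the message passing variant exchanges the boundary row into private ghost storage. Both are reconstructions under assumed primitives (pthreads and MPI rather than the Encore and iPSC/2 libraries actually used) and an assumed row-major array layout.

    #include <mpi.h>
    #include <pthread.h>
    #include <stddef.h>

    /* "Natural" shared memory style: the solution lives in one shared array, so
       after a barrier guarantees the neighbor has finished updating its rows,
       the neighbor's interface row is simply used in place; nothing is copied. */
    const double *get_interface_shared(const double *u_shared, int ncols,
                                       int nbr_row, pthread_barrier_t *bar) {
        pthread_barrier_wait(bar);
        return u_shared + (size_t)nbr_row * ncols;
    }

    /* "Natural" message passing style: each process owns only its own strip; the
       local boundary row is sent to the neighbor and the neighbor's boundary row
       is received into private ghost storage, at a cost of roughly s + r*ncols. */
    void get_interface_msg(const double *my_boundary_row, int ncols, int nbr_rank,
                           double *ghost, MPI_Comm comm) {
        MPI_Sendrecv(my_boundary_row, ncols, MPI_DOUBLE, nbr_rank, 0,
                     ghost, ncols, MPI_DOUBLE, nbr_rank, 0,
                     comm, MPI_STATUS_IGNORE);
    }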
There are some differences between the iteration count results for the Encore and the Intel
implementations. These differences seem to be due to the difference in floating point arithmetic on
the two machines. Double precision was required on the Encore to get the fast Poisson solver we
use to work for h = 1/512; single precision was required on the Intel in order to fit the problems
in memory.
Table 1: Results for strips on the Encore Multimax 320. (Entries give the iteration count I, the condition number estimate K, the time T in seconds, and the relative speedup s, for h^-1 = 16 through 512 and p = 1 through 16.)
Table 2: Results for boxes on the Encore Multimax 320, using full vertex coupling. (Entries give the iteration count I, the condition number estimate K, the time T in seconds, and the relative speedup s.)
Table 3: Results for strips on the Intel Hypercube. (Entries give the iteration count I, the condition number estimate K, the time T in seconds, and the relative speedup s.)
Table 4: Results for boxes on the Intel Hypercube, using full vertex coupling. (Entries give the iteration count I, the condition number estimate K, the time T in seconds, and the relative speedup s.)
5. Comments
While the overall results for strips may seem poor, they actually represent very good speedups;
it is the increased iteration count for this type of iteration that suppresses the efficiency. We find,
in accordance with the theory quoted in [8], that the condition number rises asymptotically
quadratically in p, and hence that the number of preconditioned conjugate gradient
iterations rises linearly in p, defeating the advantage of parallelism.
Table 5: Results for boxes on the Intel Hypercube, without vertex coupling but with interface preconditioning (neighbor coupling). The "NA" entries could not be computed on our iPSC/2. (Entries give the iteration count I, the condition number estimate K, the time T in seconds, and the relative speedup s.)
Two of the full vertex coupling (global) results show superlinear relative speedup. This is
a real effect, which derives from the superlinear growth in the cost of solving a single subdomain
as a function of the number of points in the domain. This speedup is of course available to a
single processor domain decomposition algorithm. In fact, the slight relative speedups seen for the
strip decompositions are due almost entirely to this effect alone, since the number of iterations is
proportional to the number of processors.
5.1. Comparison with the Theory
In the case of the message passing results (Intel Hypercube), it is possible to fit the theoretical
complexity estimate to the measured times. Taking only the highest order terms in the latency
and the arithmetic suggests a fit for the strip decomposition of

    a1 n^2/p + a2 (n^2/p) log(n/p) + a3 + a4 log p.        (5.1)

We have ignored the r terms because s >> rn for given n on the Intel hypercube. A least squares
fit to the data yields a1 = 0.000131, a2 = 0.000090, a3 = 0.000027, and a4 = 0.0017. The coefficients
were computed by scaling the equations by the inverse of the time, making the fit a relative one.
The residual is 0.014. The fit is shown in Figure 4. With these values (or directly from the data),
the efficiency per iteration can be shown to be quite high for the larger problems.
Some care is necessary in interpreting the graphs in Figure 4. In particular, the relative
efficiency used in the figure is relative to the single subdomain case, for a single iteration. Because
the cost of solving on a subdomain falls faster than linearly with increasing numbers of subdomains,
this measure of efficiency will often be greater than one.
Figure 4: Data for iPSC/2 runs with strip decomposition. The fit is made using Equation (5.1). The y axis is the efficiency, defined as the time on a single processor divided by the number of processors times the time on that many processors.

An estimate of the parallel efficiency (computation time divided by total time) can be made from Equation (5.1):
    E = (a1 n^2/p + a2 (n^2/p) log(n/p)) / (a1 n^2/p + a2 (n^2/p) log(n/p) + a3 + a4 log p).        (5.2)
Figure 5 shows the behavior of E with respect to p for a sample value of n.
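For concreteness, the fitted coefficients just quoted can be substituted into Equation (5.2) directly; the short program below does so for n = 255, the value used in Figure 5. The logarithm base is taken as natural here, an assumption not fixed by the text.

    #include <math.h>
    #include <stdio.h>

    /* Estimated parallel efficiency per iteration for the strip decomposition,
       Equation (5.2), using the least squares coefficients quoted in the text. */
    int main(void) {
        const double a1 = 0.000131, a2 = 0.000090, a3 = 0.000027, a4 = 0.0017;
        const double n = 255.0;
        for (double p = 2.0; p <= 64.0; p *= 2.0) {
            double work     = a1 * n * n / p + a2 * (n * n / p) * log(n / p);
            double overhead = a3 + a4 * log(p);
            printf("p = %2.0f   E = %.3f\n", p, work / (work + overhead));
        }
        return 0;
    }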
For the box decompositions, the highest order terms in the complexity model are
    a1 n^2/p + a2 (n^2/p) log(n/√p) + a3 (√p - 1)^3 + a4 + a5 log p.        (5.3)
We have again ignored the r terms because of the high latency of the Intel hypercube. A least
squares fit to the data yields a1 = 0.00012, a2 = 0.000091, a3 = 0.000014, a4 = 0.0067, and
a5 = 0.0031. The coefficients were computed in the same way as for strips, and the residual is
0.032. A graph of the data and the fit is shown in Figure 6. The a3 term comes from the arithmetic
complexity of solving the global cross-point system on a single processor using an already factored
matrix. The efficiency per iteration is lower, because of the increased communication overhead,
but is still above 70% for the larger problems. This makes the strip decomposition superior for
moderate numbers of processors (where the iteration counts are similar).
    h^-1\p        1           4            16           64
    32     I      1           12           18
           K      1.00        29.1         49.2
           T      0.546       0.827        0.382
           s                  0.66         2.16
    64     I      1           17           25           32
           K      1.00        61.0         103.6        187.9
           T      2.62        5.45         1.81         0.741
           s                  0.48         3.01         2.44
    128    I      1           23           35           44
           K      1.00        124.9        212.6        397.6
           T      12.1        29.6         11.5         3.28
           s                  0.41         2.57         3.51
    256    I      1           27           48           62
           K      1.00        254.9        432.4        818.0
           T      112         182          71.4         20.5
           s                  0.62         2.55         3.48

Table 6: Results for boxes on the Intel Hypercube, using diagonal blocks only (no coupling).

Figure 5: Estimated parallel efficiency for the strips case on the iPSC/2, using Equation (5.2), n = 255, and the parameters in the text.

To get a better idea of the individual overheads, the program for the Intel Hypercube was
instrumented to provide data on the cost of each communication operation. This data showed
that the expense of the global communication used to form the vertex coupling system was always
smaller than the communication to form the dot-products, and smaller in all but one case than
the nearest-neighbor communication used in forming the matrix-vector product Ax. This result is
consistent with the known communication behavior of the Intel Hypercube.

Figure 6: Data for iPSC/2 runs with full vertex coupling. The fit is made using Equation (5.3). The y axis is the efficiency, defined as the time on a single processor divided by the number of processors times the time on that many processors.

The two dot products
done in each iteration involve a time 2(s + r) log p while the global exchange involves (roughly)
s log p + pr; since s > r, the global exchanges take less time. In the case of the nearest-neighbor
communications, the comparison is with 4(s + rn/√p). The larger amount of data moved in the
neighbor computation makes it often more expensive than the global communication for the vertex
coupling. Further, when n is small, we would expect the neighbor communication to be faster than
the global communication, and this is exactly what is observed. A different balance of s and r,
or significantly more processors, would of course change these results. For example, on a truly
massively parallel machine with thousands of processors, the neighbor communication would be
smaller than either the dot-product or global communication.
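These three terms are easy to tabulate; the snippet below does so for assumed values of s and r (chosen only so that s dominates r, not measured iPSC/2 constants), showing the reversal at large p described in the preceding sentences.

    #include <math.h>
    #include <stdio.h>

    /* Per-iteration communication terms for the box decomposition:
       two dot products, the global cross-point exchange, and the four
       nearest-neighbor edge exchanges.  Times are in flop-time units. */
    int main(void) {
        const double s = 500.0, r = 2.0;        /* assumed latency and per-word rate */
        const double n = 256.0;
        for (double p = 4; p <= 4096; p *= 4) {
            double dots     = 2.0 * (s + r) * log2(p);
            double global   = s * log2(p) + r * p;
            double neighbor = 4.0 * (s + r * n / sqrt(p));
            printf("p=%5.0f  dots=%8.0f  global=%8.0f  neighbor=%8.0f\n",
                   p, dots, global, neighbor);
        }
        return 0;
    }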
This leads to the results in Tables 5 and 6, where a simpler communication strategy has been
traded against larger iteration counts.
For the shared memory machine, the complexity estimates are harder to demonstrate. In part,
this is due to the design of the shared memory machines; the number of processors is deliberately
limited to roughly what the hardware (i.e., memory bus) can support. The dominant effect is
usually load balancing or the intrinsically serial parts of the computation (synchronization points
and dot products).
The results for the Intel Hypercube are summarized in Table 7, which gives the optimal choice
of domain decomposition algorithm (from among those considered) for various values of p and h for
the Intel Hypercube. As either the number of processors or the number of mesh points increases, the
global or full vertex coupling algorithm becomes more efficient, despite its additional communication
demands.
    p\h^-1     16           32          64          128         256
    4          Decoupled    Global      Local       Local       Global
    16                      Global      Global      Global      Global
    64                                  Global      Global      Global
Table 7: Optimal choice of algorithm for the given problem
and implementation on the Intel Hypercube, from the choices
of decoupled block diagonal, locally coupled interfaces, and
full vertex coupling preconditioners for the box decomposition.
These results emphasize the importance of considering a wide range of decompositions
and problem sizes. For example, the results for p = 4 can give a misleading picture of the usefulness
of both the fully decoupled and the locally coupled interface preconditioners. In fact, the "local"
preconditioning wins by an extremely narrow margin in the two cases in which it beats out "global".
Our results show that, even for relatively small problems, the asymptotically superior performance
(in iteration count) of the full vertex coupling preconditioners more than compensates for the
additional cost of the communication. This is true even for a machine such as the Intel iPSC/2
Hypercube, with relatively slow communication and global communication that is proportional to
log p. The older iPSC/1 Hypercube had an identical table, which is not surprising, given that the
ratio of communication speed to computation speed for the two machines is similar.
References
[1] C. Borgers, The Neumann-Dirichlet Domain Decomposition Method with Inexact Solvers on
the Subdomains, 1989. Numer. Math. (to appear).
[2] J. H. Bramble, J. E. Pasciak, and A. H. Schatz, The Construction of Preconditioners for
Elliptic Problems by Substructuring, I, Mathematics of Computation, 47 (1986),
pp. 103-134.
[3] T. F. Chan, Y. Saad, and M. H. Schultz, Solving Elliptic Partial Differential Equations on the
Hypercube Multiprocessor, Technical Report YALEU/DCS/RR-373, Yale University,
Department of Computer Science, March 1985.
[4] M. Dryja, A Finite Element - Capacitance Method for Elliptic Problems on Regions Partitioned
into Subregions, Numer. Math., 44 (1984), pp. 153-168.
[5] L. Greengard and W. D. Gropp, A Parallel Version of the Fast Multipole Method, Parallel
Processing for Scientific Computing, SIAM, 1989, pp. 213-222.
[6] W. D. Gropp and D. E. Keyes, Complexity of Parallel Implementation of Domain Decomposition
Techniques for Elliptic Partial Differential Equations, SIAM Journal on Scientific and
Statistical Computing, 9/2 (1988), pp. 312-326.
[7] H. F. Jordan, Interpreting Parallel Processor Performance Measurements, SIAM Journal on
Scientific and Statistical Computing, 8/2 (1987), pp. s220-s226.
[8] D. E. Keyes and W. D. Gropp, A Comparison of Domain Decomposition Techniques for Elliptic
Partial Differential Equations and their Parallel Implementation, SIAM Journal on
Scientific and Statistical Computing, 8/2 (1987), pp. s166-s202.
[9] D. P. O'Leary and O. Widlund, Capacitance Matrix Methods for the Helmholtz Equation on
General Three-dimensional Regions, Mathematics of Computation, 33 (1979), pp. 849-879.
[10] J. S. Przemieniecki, Matrix Structural Analysis of Substructures, AIAA J., 1 (1963), pp.
138-147.
[11] Y. Saad and M. Schultz, Parallel Direct Methods for Solving Banded Linear Systems, Technical
Report YALEU/DCS/RR-387, Yale University, Department of Computer Science,
August 1985.
[12] H. A. Schwarz, Gesammelte Mathematische Abhandlungen, Vol. 2, Springer, Berlin, 1890, pp.
133-143 (first published in 1870).
[13] O. B. Widlund, Iterative Substructuring Methods: the General Elliptic Case, Technical Report
260, Courant Institute of Mathematical Sciences, NYU, November 1986.